it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
    Andre Oppermann 
    andre at freebsd.org
       
    Wed Aug 21 18:40:45 UTC 2013
    
    
  
On 14.08.2013 12:21, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
>> I think (check the driver code in question as I'm not sure) that if you
>> "ifconfig <if> lro" and the driver has hardware support or has been made
>> aware of our software implementation, it should DTRT.
>
> The "lower throughput than linux" that julian was seeing is either
> because of a slow (CPU-bound) sender or slow receiver. Given that
> the FreeBSD tx path is quite expensive (redoing route and arp lookups
> on every packet, etc.) I highly suspect the sender side is at fault.
>
> Then the problem remains that we should keep a copy of route and
> arp information in the socket instead of redoing the lookups on
> every single transmission, as they consume some 25% of the time of
> a sendto(), and probably even more when it comes to large tcp
> segments, sendfile() and the like.

It's the locking and ref-counting overhead in the routing table and
ARP table that causes a lot of cache thrashing and bus lock cycles.

The fix is rather simple.  The routing table gets protected by an rm_lock
(read-mostly lock) instead of a normal lock.  Individual routes no longer
have their own lock and are no longer ref-counted.  Holding pointers to
routes or into the routing table is prohibited.  Upon lookup the sought
information (ifp, ifaddr, nexthop) is copied out without retaining any
reference to the routing entry.  Ditto for the ARP table.  Because changes
to the routing and ARP tables are very infrequent compared to the number of
lookups performed on them, this exhibits very good cache behavior
across multiple cores and CPUs.  No shared routing table memory is
dirtied during lookup.

Approaches that do NOT work (well):

  - flow caching, where a separate entry is generated for every active
    connection containing direct pointers to the rtentry, arp entry and
    interface.  Besides the pointer validity and refcounting issues, it
    scales very poorly for a large number of "flows", exhibiting a large
    lookup overhead.  By contrast, the routing table (default and interface
    routes) and ARP table (a few hosts) stay at the same size and have a
    "constant" lookup time.
  - per-CPU copies of the routing and arp tables increase memory consumption
    and have synchronization issues on updates, especially with high core
    counts.
  - storing the rtentry and arp entry pointers in the inpcb has similar
    issues to the flow table approach, while also having to periodically
    check whether the route or arp entry changed.

The rm_lock is the fastest, cheapest and most SMP-scalable approach shown
so far.  I have patches against a roughly 12-month-old current lying around
if someone wants to brush them up and work out the final kinks.  The speedup
and reduction in overhead are significant.

-- 
Andre
    
    