it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

Barney Cordoba barney_cordoba at yahoo.com
Sun Aug 18 13:52:04 UTC 2013

________________________________
 From: Adrian Chadd <adrian at freebsd.org>
To: Barney Cordoba <barney_cordoba at yahoo.com> 
Cc: Luigi Rizzo <rizzo at iet.unipi.it>; Lawrence Stewart <lstewart at freebsd.org>; FreeBSD Net <net at freebsd.org> 
Sent: Saturday, August 17, 2013 11:59 AM
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
 


... we get perfectly good throughput without 400k ints a second on the ixgbe driver.

As in, I can easily saturate 2 x 10GE on ixgbe hardware with a handful of flows. That's not terribly difficult.

However, there's a few interesting problems that need addressing:

* There's lock contention between the transmit side from userland and the TCP timers, and the receive side with ACK processing. Under very high traffic load a lot of lock contention stalls things. We (the royal "we", I'm mostly just doing tooling at the moment) are working on that.
* There's lock contention on the ARP, routing table and PCB lookups. The latter will go away when we've finally implemented RSS for transmit and receive and then moved things over to using PCB groups on CPUs which have NIC driver threads bound to them.
* There's increasing cache thrashing from a larger workload, causing the expensive lookups to be even more expensive.
* All the list walks suck. We need to be batching things so we use CPU caches much more efficiently.

The idea of using TSO on the transmit side and generic LRO on the receive side is to make the per-packet overhead less. I think we can be much more efficient in general in packet processing, but that's a big task. :-) So, using at least TSO is a big benefit if purely to avoid decomposing things into smaller mbufs and contending on those locks in a very big way.
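To put rough numbers on the per-packet overhead argument, here is a back-of-the-envelope sketch. Every constant in it is an assumption for illustration, not a measurement from this thread: a 10 Gb/s link, a 1500-byte MTU, 38 bytes of per-frame Ethernet wire overhead, and a TSO send covering 44 MSS-sized segments (44 x 1460 = 64240 bytes, which fits under the 64 KB IP length limit). The function names are made up for the example.

```python
# Back-of-the-envelope arithmetic: how many times per second the stack
# has to push something down to the driver, with and without TSO.
# All constants are illustrative assumptions, not measurements.

LINK_BPS = 10_000_000_000   # assumed 10 Gb/s line rate
WIRE_OVERHEAD = 38          # bytes per frame: preamble+SFD 8, Ethernet header 14, FCS 4, inter-frame gap 12
MTU = 1500                  # assumed standard Ethernet MTU


def frames_per_second(link_bps=LINK_BPS, mtu=MTU):
    """Full-size frames per second on the wire at line rate."""
    bits_per_frame = (mtu + WIRE_OVERHEAD) * 8
    return link_bps / bits_per_frame


def stack_sends_per_second(segments_per_send, link_bps=LINK_BPS, mtu=MTU):
    """Sends per second the stack hands to the driver when each send
    is later split into `segments_per_send` frames by the NIC (TSO)."""
    return frames_per_second(link_bps, mtu) / segments_per_send


if __name__ == "__main__":
    no_tso = stack_sends_per_second(1)     # one stack traversal per frame
    with_tso = stack_sends_per_second(44)  # assumed 44 segments per TSO send
    print(f"without TSO: {no_tso:,.0f} sends/s")
    print(f"with TSO:    {with_tso:,.0f} sends/s")
```

Under these assumptions the frame rate on the wire is unchanged, but the number of trips through the stack (and through whatever locks sit on that path) drops by the segment count. It says nothing about the receive side or about bridged traffic, where the sender's TSO does not help.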

I'm working on PMC to make it easier to use to find these bottlenecks and make the code and data more efficient. Then, likely, I'll end up hacking on generic TSO/LRO, TX/RX RSS queue management and make the PCB group thing default on for SMP machines. I may even take a knife to some of the packet processing overhead.

-------------------------------

The ints/sec reference was based on Luigi's implication that turning off moderation was some sort of performance choice.

Again, you're talking "throughput" and not efficiency. I could fill a tx queue with 10Gb/s of traffic with yesteryear's CPUs. It's not an achievement. Being able to bridge real traffic at 10Gb/s with 2 cores is.

BC

