it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

Sat Aug 17 15:59:14 UTC 2013

... we get perfectly good throughput without 400k interrupts a second on the
ixgbe driver.

As in, I can easily saturate 2 x 10GE on ixgbe hardware with a handful of
flows. That's not terribly difficult.

However, there are a few interesting problems that need addressing:

* There's lock contention between the transmit side from userland and the
TCP timers, and the receive side with ACK processing. Under very high
traffic load, a lot of lock contention stalls things. We (the royal "we";
I'm mostly just doing tooling at the moment) are working on that.
* There's lock contention on the ARP, routing table and PCB lookups. The
latter will go away when we've finally implemented RSS for transmit and
receive and then moved things over to using PCB groups on CPUs which have
NIC driver threads bound to them.
* There's increasing cache thrashing from a larger workload, causing the
expensive lookups to be even more expensive.
* All the list walks suck. We need to be batching things so we use CPU
caches much more efficiently.
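
To make the PCB group idea a bit more concrete, here's a rough sketch of
the shape of it: hash the flow's 4-tuple, use the hash to pick a group, and
give each group its own table and its own lock, so flows that land on
different CPUs never touch the same lock. Everything named here
(PcbGroups, flow_hash, NUM_GROUPS) is made up for illustration and is not
the kernel's API. The hash is a stand-in too; real RSS hardware uses a
Toeplitz hash. Python is only used to show the structure, not to
demonstrate real parallelism.

```python
import threading

NUM_GROUPS = 4  # assumption: one group per CPU with a NIC queue thread bound to it


def flow_hash(src_ip, src_port, dst_ip, dst_port):
    """Stand-in flow hash (NOT Toeplitz); stable within one process."""
    return hash((src_ip, src_port, dst_ip, dst_port)) & 0xFFFFFFFF


class PcbGroup:
    """One connection table protected by its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.table = {}


class PcbGroups:
    """Hypothetical container that spreads connections across groups."""

    def __init__(self, num_groups=NUM_GROUPS):
        self.groups = [PcbGroup() for _ in range(num_groups)]

    def group_for(self, four_tuple):
        # The same flow always maps to the same group (and so the same CPU).
        return self.groups[flow_hash(*four_tuple) % len(self.groups)]

    def insert(self, four_tuple, pcb):
        group = self.group_for(four_tuple)
        with group.lock:  # only this group's lock is taken
            group.table[four_tuple] = pcb

    def lookup(self, four_tuple):
        group = self.group_for(four_tuple)
        with group.lock:
            return group.table.get(four_tuple)
```

The point of the design is that a lookup takes only the lock of the group
the flow hashes to, rather than one global lock shared by every CPU.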

The idea of using TSO on the transmit side and generic LRO on the receive
side is to reduce the per-packet overhead. I think we can be much more
efficient in general in packet processing, but that's a big task. :-) So,
using at least TSO is a big benefit, if only to avoid decomposing things
into smaller mbufs and contending on those locks in a very big way.
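
As a back-of-envelope illustration of why that matters, here's a small
sketch that counts trips through the transmit path at 10GE line rate, with
and without TSO. The constants are assumptions picked for the example (1500
byte MTU, IPv4 + TCP with timestamps, a TSO chunk of up to 65535 bytes
trimmed to whole segments), not measurements from any real system.

```python
# Rough arithmetic only: per-frame vs. per-TSO-chunk work at 10GE line rate.
# Every constant here is an assumption for illustration.

LINK_BPS = 10_000_000_000          # 10GE
MTU = 1500                         # standard Ethernet MTU, bytes
WIRE_OVERHEAD = 14 + 4 + 8 + 12    # Ethernet header + FCS + preamble/SFD + inter-frame gap
IP_TCP_HEADERS = 20 + 20 + 12      # IPv4 + TCP + timestamp option (assumed)
TSO_CHUNK = 65535                  # assumed upper bound on one TSO send, bytes


def frames_per_second(link_bps=LINK_BPS, mtu=MTU):
    """Full-size frames per second on the wire at line rate."""
    return link_bps / ((mtu + WIRE_OVERHEAD) * 8)


def frames_per_tso_chunk(chunk=TSO_CHUNK, mtu=MTU):
    """Whole MSS-sized segments the NIC cuts from one TSO chunk."""
    mss = mtu - IP_TCP_HEADERS
    return chunk // mss


def stack_sends_per_second(link_bps=LINK_BPS, mtu=MTU, chunk=TSO_CHUNK):
    """Trips through the transmit path per second when TSO is in use."""
    return frames_per_second(link_bps, mtu) / frames_per_tso_chunk(chunk, mtu)


if __name__ == "__main__":
    print(f"frames/sec without TSO:   {frames_per_second():,.0f}")
    print(f"frames per TSO chunk:     {frames_per_tso_chunk()}")
    print(f"stack sends/sec with TSO: {stack_sends_per_second():,.0f}")
```

Under those assumptions the stack makes a few tens of times fewer trips
through the output path for the same bytes on the wire, which means that
many fewer chances to contend on the locks above.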

I'm working on PMC to make it easier to use to find these bottlenecks and
make the code and data more efficient. Then, likely, I'll end up hacking on
generic TSO/LRO and TX/RX RSS queue management, and making the PCB group
thing on by default for SMP machines. I may even take a knife to some of
the packet processing overhead.

-adrian