Advice on a multithreaded netisr patch?

Sun Apr 5 06:54:20 PDT 2009

On Sun, 5 Apr 2009, Ivan Voras wrote:

>>> I thought this has something to do with NIC moderation (em) but can't 
>>> really explain it. The bad performance part (not the jump) is also visible 
>>> over the loopback interface.
>>
>> FYI, if you want high performance, you really want a card supporting 
>> multiple input queues -- igb, cxgb, mxge, etc.  if_em-only cards are 
>> fundamentally less scalable in an SMP environment because they require 
>> input or output to occur only from one CPU at a time.
>
> Makes sense, but on the other hand - I see people are routing at least 
> 250,000 packets per second per direction with these cards, so they probably 
> aren't the bottleneck (pro/1000 pt on pci-e).

The argument is not that they are slower (although they probably are a bit 
slower), but rather that they introduce serialization bottlenecks by requiring 
synchronization between CPUs in order to distribute the work.  Certainly some 
of the scalability issues in the stack are not a result of that, but a good 
number are.

Historically, we've had a number of bottlenecks in, say, the bulk data receive 
and send paths, such as:

- Initial receipt and processing of packets on a single CPU as a result of a
   single input queue from the hardware.  Addressed by using multiple input
   queue hardware with appropriately configured drivers (generally the default
   is to use multiple input queues in 7.x and 8.x for supporting hardware).

- Cache line contention on stats data structures in drivers and various levels
   of the network stack due to bouncing around exclusive ownership of the cache
   line.  ifnet introduces at least a few, but I think most of the interesting
   ones are at the IP and TCP layers for receipt.

- Global locks protecting connection lists, all rwlocks as of 7.1, but not
   necessarily always used read-only for packet processing.  For UDP we do a
   very good job at avoiding write locks, but for TCP in 7.x we still use a
   global write lock, if briefly, for every packet.  There's a change in 8.x to
   use a global read lock for most packets, especially steady state packets,
   but I didn't merge it for 7.2 because it's not well-benchmarked.  Assuming I
   get positive feedback from more people, I will merge it before 7.3.

- If the user application is multi-threaded and receiving from many threads at
   once, we see contention on the file descriptor table lock.  This was
   markedly improved by the file descriptor table locking rewrite in 7.0, but
   we're continuing to look for ways to mitigate this.  A lockless approach
   would be really nice...
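
To illustrate the stats point in the list above, here is a minimal user-space
sketch of per-thread counters, each padded to its own cache line, so that
updating a counter never takes exclusive ownership of a line another thread is
writing.  It is not kernel code: pthreads stand in for CPUs, the 64-byte line
size is an assumption, and every name in it (pcpu_stat, rx_worker, stat_total,
run_workers) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NTHREADS   4        /* stand-in for the number of CPUs */
#define CACHE_LINE 64       /* assumed cache line size in bytes */
#define NPACKETS   100000   /* simulated packets per thread */

/* One counter per thread; the alignment pads each slot to a full line. */
struct pcpu_stat {
    _Alignas(CACHE_LINE) uint64_t packets;
};

static struct pcpu_stat stats[NTHREADS];

/* Each worker updates only its own slot, so no line bounces between CPUs. */
static void *rx_worker(void *arg)
{
    struct pcpu_stat *st = arg;

    for (int i = 0; i < NPACKETS; i++)
        st->packets++;
    return NULL;
}

/* Readers pay the cost instead: they sum all slots on demand. */
static uint64_t stat_total(void)
{
    uint64_t sum = 0;

    for (int i = 0; i < NTHREADS; i++)
        sum += stats[i].packets;
    return sum;
}

/* Start the workers, wait for them, and return the aggregate count. */
static uint64_t run_workers(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        stats[i].packets = 0;
        int rc = pthread_create(&tid[i], NULL, rx_worker, &stats[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return stat_total();
}
```

The trade-off is that reading the statistic becomes a loop over all slots,
which is usually acceptable because stats are updated per packet but read
rarely.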

On the transmit path, the bottlenecks are similar but not identical:

- Neither 7.x nor 8.x supports multiple transmit queues as shipped; Kip has
   patches for both that add it for cxgb.  Maintaining ordering here, and
   ideally affinity to the appropriate associated input queue, is important.
   Until those patches are in the tree, and in any case for single-queue
   drivers, contention on the device driver send path and queues can be
   significant, especially for device drivers where the send and receive
   paths are protected by the same lock (bge!).

- Stats at various levels in the stack still dirty cache lines.

- We don't acquire, in the common case, any global connection list locks
   during transmit.

- Routing table locks may be an issue.  Kip has patches against 8.x to
   re-introduce inpcb route as well as link layer flow caching.  These are in
   my review queue currently...  In 8.x the global radix tree lock is a
   read-write lock and we use read-locking where possible, but in 7.x it's
   still a mutex.  This probably isn't an MFCable change.
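
The ordering and affinity requirement for multiple transmit queues can be
sketched as follows: choose the queue from a hash of the flow's addresses and
ports, so every packet of a flow lands on the same queue and stays in order.
This is only an illustration, not what the cxgb patches do; the struct, the
function names, and the use of FNV-1a as the hash are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key: the usual address/port 4-tuple. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Fold the low nbytes of v into h, one byte at a time (FNV-1a, 32-bit). */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;             /* FNV 32-bit prime */
    }
    return h;
}

static uint32_t flow_hash(const struct flow *f)
{
    uint32_t h = 2166136261u;       /* FNV 32-bit offset basis */

    h = fnv1a_mix(h, f->src_ip, 4);
    h = fnv1a_mix(h, f->dst_ip, 4);
    h = fnv1a_mix(h, f->src_port, 2);
    h = fnv1a_mix(h, f->dst_port, 2);
    return h;
}

/* Same flow always maps to the same queue, which preserves ordering. */
static unsigned select_txq(const struct flow *f, unsigned nqueues)
{
    assert(nqueues > 0);
    return flow_hash(f) % nqueues;
}
```

If the receive side were to pick its queue with the same hash and the same
queue count, a flow's input and output queues would share an index, which is
one simple way to get the affinity mentioned above.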

Another change coming in 8.x is increased use of read-mostly locks, rmlocks, 
which avoid writes to shared cache lines for read-acquire, but have a more 
expensive write-acquire.  We're already using this in a few spots, including 
for firewall registration, but need to use it in more.
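
The read-mostly trade-off can be shown with a user-space analogy: a
"big-reader" lock with one mutex per reader slot, each on its own cache line.
A reader locks only its own slot, so read-acquire does not write to a line
shared with other readers; a writer must take every slot, which is the more
expensive write-acquire.  This is a sketch of the idea only, not how rmlock(9)
is implemented, and all names here (br_lock, cfg_read, cfg_write) are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NREADERS   4    /* stand-in for the number of CPUs */
#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* One mutex per reader slot, each aligned to its own cache line. */
struct br_slot {
    _Alignas(CACHE_LINE) pthread_mutex_t mtx;
};

struct br_lock {
    struct br_slot slot[NREADERS];
};

static void br_init(struct br_lock *l)
{
    for (int i = 0; i < NREADERS; i++) {
        int rc = pthread_mutex_init(&l->slot[i].mtx, NULL);
        assert(rc == 0);
    }
}

/* Cheap path: a reader touches only its own slot. */
static void br_rlock(struct br_lock *l, int self)
{
    pthread_mutex_lock(&l->slot[self].mtx);
}

static void br_runlock(struct br_lock *l, int self)
{
    pthread_mutex_unlock(&l->slot[self].mtx);
}

/* Expensive path: a writer takes every slot, always in the same order. */
static void br_wlock(struct br_lock *l)
{
    for (int i = 0; i < NREADERS; i++)
        pthread_mutex_lock(&l->slot[i].mtx);
}

static void br_wunlock(struct br_lock *l)
{
    for (int i = NREADERS - 1; i >= 0; i--)
        pthread_mutex_unlock(&l->slot[i].mtx);
}

/* Example: a rarely changed setting read on every packet. */
static struct br_lock cfg_lock;
static int cfg_value;

static int cfg_read(int self)
{
    br_rlock(&cfg_lock, self);
    int v = cfg_value;
    br_runlock(&cfg_lock, self);
    return v;
}

static void cfg_write(int v)
{
    br_wlock(&cfg_lock);
    cfg_value = v;
    br_wunlock(&cfg_lock);
}
```

The design only pays off when writes are rare, as with firewall registration:
the write cost grows with the number of slots, while the read cost stays flat.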

With a fast CPU, adding more cores will not necessarily speed up processing, 
and may often slow it down, even if all bottlenecks are eliminated. 
Fundamentally, if you have the CPU capacity to do the work on one CPU, then 
moving the work to other CPUs is an overhead best avoided, especially if the 
device itself forces serialization by having a single input queue and a 
single output queue.  However, if we reasonably assume that core speeds will 
level off over time while CPU density increases, software work placement 
becomes more important.  And with multi-queue devices, avoiding writes to 
common cache lines from multiple CPUs is increasingly possible.

We have a 32-thread MIPS embedded eval board in the Netperf cluster now, which 
we'll begin using for 10Gbps testing fairly soon, I hope.  One of its 
properties is that individual threads are decidedly non-zippy compared to, 
say, a 10Gbps interface running at line-rate, so it will allow us to explore 
these issues more effectively than we could before.
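
For a sense of scale, the arithmetic behind "line-rate" can be sketched as
below.  The assumptions are mine, not measurements from that board:
minimum-sized 64-byte Ethernet frames, 20 bytes of per-frame wire overhead
(preamble plus inter-frame gap), and work that spreads perfectly across all
hardware threads.  The function names are hypothetical.

```c
#include <assert.h>

#define WIRE_OVERHEAD_BYTES 20  /* 8 preamble/SFD + 12 inter-frame gap */

/* Packets per second on a saturated link for a given frame size. */
static double line_rate_pps(double link_bps, int frame_bytes)
{
    assert(frame_bytes > 0);
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0);
}

/* Nanoseconds each thread may spend per packet if load is spread evenly. */
static double per_thread_budget_ns(double pps, int nthreads)
{
    assert(nthreads > 0);
    return 1e9 * nthreads / pps;
}
```

With these assumptions, line_rate_pps(10e9, 64) is about 14.88 million packets
per second, or 67.2 ns per packet on the wire, and per_thread_budget_ns() for
32 threads is about 2150 ns per packet per thread.  Any serialization between
threads eats directly into that budget.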

Robert N M Watson
Computer Laboratory
University of Cambridge