Advice on a multithreaded netisr patch?

Barney Cordoba barney_cordoba at yahoo.com
Sun Apr 5 10:25:44 PDT 2009

--- On Sun, 4/5/09, Robert Watson <rwatson at FreeBSD.org> wrote:

> From: Robert Watson <rwatson at FreeBSD.org>
> Subject: Re: Advice on a multithreaded netisr  patch?
> To: "Ivan Voras" <ivoras at freebsd.org>
> Cc: freebsd-net at freebsd.org
> Date: Sunday, April 5, 2009, 9:54 AM
> On Sun, 5 Apr 2009, Ivan Voras wrote:
> 
> >>> I thought this has something to deal with NIC moderation (em) but
> >>> can't really explain it. The bad performance part (not the jump) is
> >>> also visible over the loopback interface.
> >> 
> >> FYI, if you want high performance, you really want a card supporting
> >> multiple input queues -- igb, cxgb, mxge, etc.  if_em-only cards are
> >> fundamentally less scalable in an SMP environment because they
> >> require input or output to occur only from one CPU at a time.
> > 
> > Makes sense, but on the other hand - I see people are routing at least
> > 250,000 packets per seconds per direction with these cards, so they
> > probably aren't the bottleneck (pro/1000 pt on pci-e).
> 
> The argument is not that they are slower (although they probably are a
> bit slower), rather that they introduce serialization bottlenecks by
> requiring synchronization between CPUs in order to distribute the work.
> Certainly some of the scalability issues in the stack are not a result
> of that, but a good number are.
> 
> Historically, we've had a number of bottlenecks in, say, the bulk data
> receive and send paths, such as:
> 
> - Initial receipt and processing of packets on a single CPU as a result
>   of a single input queue from the hardware.  Addressed by using
>   multiple input queue hardware with appropriately configured drivers
>   (generally the default is to use multiple input queues in 7.x and 8.x
>   for supporting hardware).
> 
> - Cache line contention on stats data structures in drivers and various
>   levels of the network stack due to bouncing around exclusive
>   ownership of the cache line.  ifnet introduces at least a few, but I
>   think most of the interesting ones are at the IP and TCP layers for
>   receipt.
> 
> - Global locks protecting connection lists, all rwlocks as of 7.1, but
>   not necessarily always used read-only for packet processing.  For UDP
>   we do a very good job at avoiding write locks, but for TCP in 7.x we
>   still use a global write lock, if briefly, for every packet.  There's
>   a change in 8.x to use a global read lock for most packets, especially
>   steady state packets, but I didn't merge it for 7.2 because it's not
>   well-benchmarked.  Assuming I get positive feedback from more people,
>   I will merge them before 7.3.
> 
> - If the user application is multi-threaded and receiving from many
>   threads at once, we see contention on the file descriptor table lock.
>   This was markedly improved by the file descriptor table locking
>   rewrite in 7.0, but we're continuing to look for ways to mitigate
>   this.  A lockless approach would be really nice...
> 
> On the transmit path, the bottlenecks are similar but different:
> 
> - Neither 7.x nor 8.x supports multiple transmit queues as shipped; Kip
>   has patches for both that add it for cxgb.  Maintaining ordering
>   here, and ideally affinity to the appropriate associated input queue,
>   is important.  As the patches aren't in the tree yet, or for
>   single-queue drivers, contention on the device driver send path and
>   queues can be significant, especially for device drivers where the
>   send and receive path are protected by the same lock (bge!).
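
To illustrate that last point, here's a condensed sketch of a driver
whose send and receive paths share one lock next to one with per-queue
transmit locks keyed by a flow id; the structures, ring_enqueue(), and
the flowid hash are hypothetical, not lifted from any real driver:

#include <pthread.h>
#include <stdint.h>

#define NTXQ        4
#define RING_SLOTS  256

struct mbuf;                            /* stand-in for a packet */

struct ring {
    struct mbuf *slot[RING_SLOTS];
    unsigned int prod;
};

static void
ring_enqueue(struct ring *r, struct mbuf *m)
{
    r->slot[r->prod++ % RING_SLOTS] = m;    /* no overflow handling here */
}

struct mysc {
    /* Single-lock layout: one mutex serializes transmit AND receive. */
    pthread_mutex_t sc_lock;
    struct ring     txring;

    /* Multi-queue layout: each transmit ring has its own lock. */
    struct {
        pthread_mutex_t lock;
        struct ring     ring;
    } txq[NTXQ];
};

/* Every sender, and the receive path, fight over the one sc_lock. */
void
single_lock_start(struct mysc *sc, struct mbuf *m)
{
    pthread_mutex_lock(&sc->sc_lock);
    ring_enqueue(&sc->txring, m);
    pthread_mutex_unlock(&sc->sc_lock);
}

/*
 * The flowid (ideally matching the input queue the flow arrived on, to
 * keep ordering and affinity) picks a ring, so unrelated flows on
 * different CPUs don't contend with each other or with receive.
 */
void
multiqueue_start(struct mysc *sc, struct mbuf *m, uint32_t flowid)
{
    pthread_mutex_lock(&sc->txq[flowid % NTXQ].lock);
    ring_enqueue(&sc->txq[flowid % NTXQ].ring, m);
    pthread_mutex_unlock(&sc->txq[flowid % NTXQ].lock);
}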


I'm curious as to your assertion that hardware transmit queues are a
big win. You're really just loading a transmit ring well ahead of actual
transmission; there's no need to force a "start" for each packet queued.
You then have more overhead managing the multiple queues: more memory
used, more CPU cache needed, more interrupts (perhaps), and the overhead
of generating the flowid. It seems to me that offloading the transmit
workload to a kernel task would be a more effective approach than using
multiple transmit queues. All the source thread has to do is queue the
packet and get out.
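
Roughly what I'm picturing, sketched in userland terms with a pthread
standing in for a kernel task (the names are invented, hw_transmit() is
a placeholder, and overflow handling is left out):

/*
 * "Queue the packet and get out": senders only append to a software
 * queue and return; a single worker owns the hardware ring and does all
 * the actual transmit work.
 */
#include <pthread.h>
#include <stddef.h>

#define TXQ_SLOTS 1024

struct mbuf;                            /* stand-in for a packet */

struct txtask {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    struct mbuf    *q[TXQ_SLOTS];
    unsigned int    prod, cons;
};

struct txtask txtask0 = {               /* one per interface, say */
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .wake = PTHREAD_COND_INITIALIZER,
};

static void
hw_transmit(struct mbuf *m)
{
    (void)m;                            /* would load the descriptor ring */
}

/* Fast path for the sending thread: enqueue, poke the worker, return. */
void
tx_enqueue(struct txtask *t, struct mbuf *m)
{
    pthread_mutex_lock(&t->lock);
    t->q[t->prod++ % TXQ_SLOTS] = m;
    pthread_cond_signal(&t->wake);
    pthread_mutex_unlock(&t->lock);
}

/*
 * Worker: the only context that ever touches the hardware ring, so the
 * ring itself needs no lock; only the software queue does.
 */
void *
tx_worker(void *arg)
{
    struct txtask *t = arg;
    struct mbuf *m;

    for (;;) {
        pthread_mutex_lock(&t->lock);
        while (t->cons == t->prod)
            pthread_cond_wait(&t->wake, &t->lock);
        m = t->q[t->cons++ % TXQ_SLOTS];
        pthread_mutex_unlock(&t->lock);
        hw_transmit(m);
    }
    return (NULL);
}

Whether the extra handoff and wakeups end up cheaper than contending on
per-queue transmit locks is exactly the kind of thing that needs
benchmarking before anyone believes it either way.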

As an aside, why is Kip doing development on a Chelsio card rather
than a more mainstream product such as Intel or Broadcom that would
generate more widespread interest?

Barney