Advice on a multithreaded netisr patch?
rwatson at FreeBSD.org
Sun Apr 5 06:54:20 PDT 2009
On Sun, 5 Apr 2009, Ivan Voras wrote:
>>> I thought this has something to deal with NIC moderation (em) but can't
>>> really explain it. The bad performance part (not the jump) is also visible
>>> over the loopback interface.
>> FYI, if you want high performance, you really want a card supporting
>> multiple input queues -- igb, cxgb, mxge, etc. if_em-only cards are
>> fundamentally less scalable in an SMP environment because they require
>> input or output to occur only from one CPU at a time.
> Makes sense, but on the other hand - I see people are routing at least
> 250,000 packets per seconds per direction with these cards, so they probably
> aren't the bottleneck (pro/1000 pt on pci-e).
The argument is not that they are slower (although they probably are a bit
slower), rather that they introduce serialization bottlenecks by requiring
synchronization between CPUs in order to distribute the work. Certainly some
of the scalability issues in the stack are not a result of that, but a good
number are.
Historically, we've had a number of bottlenecks in, say, the bulk data receive
and send paths, such as:
- Initial receipt and processing of packets on a single CPU as a result of a
single input queue from the hardware. Addressed by using multiple input
queue hardware with appropriately configured drivers (generally the default
is to use multiple input queues in 7.x and 8.x for supporting hardware).
- Cache line contention on stats data structures in drivers and various levels
of the network stack due to bouncing around exclusive ownership of the cache
line. ifnet introduces at least a few, but I think most of the interesting
ones are at the IP and TCP layers for receipt.
- Global locks protecting connection lists, all rwlocks as of 7.1, but not
necessarily always used read-only for packet processing. For UDP we do a
very good job at avoiding write locks, but for TCP in 7.x we still use a
global write lock, if briefly, for every packet. There's a change in 8.x to
use a global read lock for most packets, especially steady state packets,
but I didn't merge it for 7.2 because it's not well-benchmarked. Assuming I
get positive feedback from more people, I will merge it before 7.3.
- If the user application is multi-threaded and receiving from many threads at
once, we see contention on the file descriptor table lock. This was
markedly improved by the file descriptor table locking rewrite in 7.0, but
we're continuing to look for ways to mitigate this. A lockless approach
would be really nice...
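To illustrate the stats cache line point above, here is a minimal userspace
sketch, not kernel code, of keeping one counter block per CPU so that no two
CPUs write to the same cache line. The 64-byte line size, the CPU count, and
all of the names (pcpu_stats, stats_input, stats_total_packets) are
assumptions made up for the example; the layout is the point, not the API.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define NCPU       4    /* hypothetical CPU count for the sketch */

/*
 * One counter block per CPU.  Aligning the first member to the cache line
 * size pads each array element out to a whole line, so an update from one
 * CPU never dirties the line another CPU is counting into.
 */
struct pcpu_stats {
    _Alignas(CACHE_LINE) uint64_t ipackets;
    uint64_t ibytes;
};

struct pcpu_stats stats[NCPU];

/* Called on the receive path; touches only the calling CPU's block. */
void stats_input(unsigned cpu, uint64_t len)
{
    assert(cpu < NCPU);
    stats[cpu].ipackets++;
    stats[cpu].ibytes += len;
}

/* Readers pay the cost instead: they sum across every per-CPU block. */
uint64_t stats_total_packets(void)
{
    uint64_t total = 0;

    for (unsigned i = 0; i < NCPU; i++)
        total += stats[i].ipackets;
    return total;
}
```

The trade-off is that reading a total becomes a walk over all the blocks,
which is usually acceptable because stats are read far less often than they
are updated.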
On the transmit path, the bottlenecks are similar but different:
- Neither 7.x nor 8.x supports multiple transmit queues as shipped; Kip has
patches for both that add it for cxgb. Maintaining ordering here, and
ideally affinity to the appropriate associated input queue, is important.
As the patches aren't in the tree yet, or for single-queue drivers,
contention on the device driver send path and queues can be significant,
especially for device drivers where the send and receive path are protected
by the same lock (bge!).
- Stats at various levels in the stack still dirty cache lines.
- We don't acquire, in the common case, any global connection list locks.
- Routing table locks may be an issue. Kip has patches against 8.x to
re-introduce inpcb route as well as link layer flow caching. These are in
my review queue currently... In 8.x the global radix tree lock is a
read-write lock and we use read-locking where possible, but in 7.x it's
still a mutex. This probably isn't an MFCable change.
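As a rough sketch of the ordering and affinity point above: if the queue is
chosen by hashing the connection's addresses and ports, every packet of a
given flow lands on the same queue, which preserves per-flow ordering. This
is only an illustration in userspace C. It uses an FNV-1a hash for
simplicity, which is an assumption and not what any particular driver or
NIC does, and the names (struct flow, flow_to_queue) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key: the usual address/port 4-tuple. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Mix one 32-bit value into a 32-bit FNV-1a hash, one byte at a time. */
uint32_t fnv1a_mix(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;             /* FNV-1a 32-bit prime */
    }
    return h;
}

/*
 * Map a flow to a queue index.  The result depends only on the flow key,
 * so all packets of one connection are sent on the same queue.
 */
unsigned flow_to_queue(const struct flow *f, unsigned nqueues)
{
    uint32_t h = 2166136261u;       /* FNV-1a 32-bit offset basis */

    assert(nqueues > 0);
    h = fnv1a_mix(h, f->src_ip);
    h = fnv1a_mix(h, f->dst_ip);
    h = fnv1a_mix(h, ((uint32_t)f->src_port << 16) | f->dst_port);
    return h % nqueues;
}
```

Affinity to the matching input queue would additionally require that the
transmit side and the receive hardware agree on the mapping, which this
sketch does not attempt.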
Another change coming in 8.x is increased use of read-mostly locks, rmlocks,
which avoid writes to shared cache lines for read-acquire, but have a more
expensive write-acquire. We're already using this in a few spots, including
for firewall registration, but need to use it in more.
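To make the read-mostly trade-off concrete, here is a toy userspace sketch
using C11 atomics. It is not the rmlock(9) implementation and all names
(rms_lock, rms_rlock, and so on) are invented for the example. It assumes
each reader owns a fixed slot, standing in for "the current CPU", and that
a slot is used by one thread at a time. Readers write only to their own
cache line; the writer has to visit every slot, which is where the more
expensive write-acquire comes from.

```c
#include <assert.h>
#include <stdatomic.h>

#define RMS_NSLOTS 4    /* hypothetical number of reader slots ("CPUs") */

/* Each reader slot sits on its own (assumed 64-byte) cache line. */
struct rms_slot {
    _Alignas(64) atomic_int active;
};

struct rms_lock {
    struct rms_slot readers[RMS_NSLOTS];
    _Alignas(64) atomic_int writer;
};

void rms_init(struct rms_lock *l)
{
    for (int i = 0; i < RMS_NSLOTS; i++)
        atomic_init(&l->readers[i].active, 0);
    atomic_init(&l->writer, 0);
}

/* Read-acquire: the only store goes to the reader's own slot. */
void rms_rlock(struct rms_lock *l, unsigned slot)
{
    assert(slot < RMS_NSLOTS);
    for (;;) {
        atomic_store(&l->readers[slot].active, 1);
        if (!atomic_load(&l->writer))
            return;                 /* no writer: read side is held */
        /* A writer is active or pending: back off and wait for it. */
        atomic_store(&l->readers[slot].active, 0);
        while (atomic_load(&l->writer))
            ;                       /* spin */
    }
}

void rms_runlock(struct rms_lock *l, unsigned slot)
{
    assert(slot < RMS_NSLOTS);
    atomic_store(&l->readers[slot].active, 0);
}

/* Write-acquire: claim the writer flag, then wait out every reader slot. */
void rms_wlock(struct rms_lock *l)
{
    int expected = 0;

    while (!atomic_compare_exchange_weak(&l->writer, &expected, 1))
        expected = 0;               /* another writer holds it; retry */
    for (int i = 0; i < RMS_NSLOTS; i++)
        while (atomic_load(&l->readers[i].active))
            ;                       /* spin until this reader drains */
}

void rms_wunlock(struct rms_lock *l)
{
    atomic_store(&l->writer, 0);
}
```

The sketch relies on the default sequentially consistent atomics so that a
reader and a writer racing to acquire cannot both miss each other's store.
It spins rather than sleeping and ignores fairness, priority, and
preemption, all of which a real kernel primitive has to handle.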
With a fast CPU, adding more cores will not necessarily speed up processing,
and might often slow it down, even if all bottlenecks are eliminated:
fundamentally, if you have the CPU capacity to do the work on one CPU, then
moving the work to other CPUs is an overhead best avoided, especially if the
device itself forces serialization by having a single input queue and a
single output queue. However, if we reasonably assume a capping of core speed
over time, and increasing CPU density, software work placement becomes more
important. And with multi-queue devices, avoiding writes to common cache
lines from multiple CPUs is increasingly possible.
We have a 32-thread MIPS embedded eval board in the Netperf cluster now, which
we'll begin using for 10gbps testing fairly soon, I hope. One of its
properties is that individual threads are decidedly non-zippy compared to,
say, a 10gbps interface running at line-rate, so it will allow us to explore
these issues more effectively than we could before.
Robert N M Watson
University of Cambridge
More information about the freebsd-net mailing list