Advice on a multithreaded netisr patch?
rwatson at FreeBSD.org
Sun Apr 5 15:18:00 PDT 2009
On Sun, 5 Apr 2009, Ivan Voras wrote:
>> The argument is not that they are slower (although they probably are a bit
>> slower), rather that they introduce serialization bottlenecks by requiring
>> synchronization between CPUs in order to distribute the work. Certainly
>> some of the scalability issues in the stack are not a result of that, but a
>> good number are.
> I'd like to understand more. If (in netisr) I have a mbuf with headers, is
> this data already transferred from the card or is it magically "not here
> yet"?
A lot depends on the details of the card and driver. The driver will take
cache misses on the descriptor ring entry, if it's not already in cache, and
the link layer will take a cache miss on the front of the ethernet frame in
the cluster pointed to by the mbuf header as part of its demux. What happens
next depends on your dispatch model and cache line size. Let's make a few
simplifying assumptions that are mostly true:
- The driver associates a single cluster with each receive ring entry, in
which each packet is stored, and the cluster is cacheline-aligned. No header
splitting is enabled.
- Standard ethernet encapsulation of IP is used, without additional VLAN
headers or other encapsulation, etc. There are no IP options.
- We don't need to validate any checksums because the hardware has done it for
us, so no need to take cache misses on data that doesn't matter until we
reach higher layers.
In the device driver/ithread code, we'll now proceed to take some cache
misses, assuming we're not particularly lucky:
(1) The descriptor ring entry
(2) The mbuf packet header
(3) The first cache line in the cluster
This is sufficient to figure out what protocol we're going to dispatch to, and
depending on dispatch model, we now either enqueue the packet for delivery to
a netisr, or we directly dispatch the handler for IP.
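To make those three touch points concrete, here is a minimal userland sketch
with mock types -- the struct names and fields are hypothetical stand-ins, not
the real driver structures -- where each commented miss corresponds to one of
(1)-(3) above:

    /* Hypothetical mock of the receive path's three cache-missing touches. */
    #include <stdint.h>

    struct rx_desc {                /* (1) descriptor ring entry (mock) */
        uint64_t paddr;
        uint16_t len;
        uint16_t status;
    };

    struct mbuf_hdr {               /* (2) mbuf packet header (mock) */
        void    *m_data;            /* points into the cluster */
        int      m_len;
    };

    static uint16_t
    rx_demux(struct rx_desc *desc, struct mbuf_hdr *m)
    {
        uint16_t ethertype;

        if (!(desc->status & 1))    /* miss (1): descriptor ring entry */
            return (0);
        m->m_len = desc->len;       /* miss (2): mbuf packet header */
        /* Miss (3): first cache line of the cluster, to read the
         * Ethernet type field at offset 12 of the 14-byte header. */
        ethertype = (uint16_t)(((uint8_t *)m->m_data)[12] << 8 |
            ((uint8_t *)m->m_data)[13]);
        return (ethertype);         /* e.g. 0x0800 for IPv4 */
    }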
If the packet is processed on the current CPU and we're direct dispatching, or
if we've dispatched to a netisr on the same CPU and we're quite lucky, the
mbuf packet header and front of the cluster will be in the cache.
However, what happens next depends on the cache fetch and line size. If
things happen in 32-byte cache lines or smaller, we cache miss on the end of
the IP header, because the last two bytes of the destination IP address start
at offset 32 into the cluster. If we have 64-byte fetching and line size,
things go better because both the full IP and TCP headers should be in that
first cache line.
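The offsets work out as follows; a small standalone sketch (standard headers
only, using the fixed offsets from the assumptions above) that computes them:

    /* Byte offsets into a cacheline-aligned cluster holding a plain
     * Ethernet/IPv4 frame (no VLAN tag, no IP options). */
    #include <stdio.h>

    int
    main(void)
    {
        const int ether_len  = 14;              /* struct ether_header */
        const int ip_dst_off = ether_len + 16;  /* ip_dst is at offset 16
                                                 * within struct ip */

        /* ip_dst occupies bytes 30-33: with 32-byte lines, bytes 32-33
         * fall into the second line; with 64-byte lines, the Ethernet,
         * IP, and TCP headers (14 + 20 + 20 = 54 bytes) all fit in the
         * first line. */
        printf("ip_dst: bytes %d-%d\n", ip_dst_off, ip_dst_off + 3);
        printf("ends in 32-byte line %d\n", (ip_dst_off + 3) / 32);
        return (0);
    }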
One big advantage to direct dispatch is that it maximizes the chances that we
don't blow out the low-level CPU caches between link-layer and IP-layer
processing, meaning that we might actually get through all the IP and TCP
headers without a cache miss on a 64-byte line size. If we netisr dispatch to
another CPU without a shared cache, or we netisr dispatch to the current CPU
but there's a scheduling delay, other packets queued first, etc., we'll take a
number of the same cache misses over again as things get pulled into the right
caches.
This presents a strong cache motivation to keep a packet "on" a CPU and even
in the same thread once you've started processing it. If you have to enqueue,
you take locks, take a context switch, deal with the fact that LRU on cache
lines isn't going to like your queue depth, and potentially pay a number of
additional cache misses on the same data. There are also some other good
reasons to use direct dispatch, such as avoiding doing work on packets that
will later be dropped if the netisr queue overflows.
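To illustrate those deferral costs, here is a userland analogue using pthreads
-- purely a sketch with hypothetical names, not the kernel's netisr code --
showing the lock on the producer side and the wasted work on overflow:

    /* Deferred-dispatch costs in miniature: the producer synchronizes
     * with other CPUs, and on overflow the cache misses already paid
     * for the packet are simply thrown away. */
    #include <pthread.h>
    #include <stdlib.h>

    struct pkt;                             /* opaque packet */

    struct pktq {
        pthread_mutex_t  q_lock;
        struct pkt      *q_items[256];
        int              q_len;
        unsigned long    q_drops;
    };

    static int
    pktq_enqueue(struct pktq *q, struct pkt *p)
    {
        int error = 0;

        pthread_mutex_lock(&q->q_lock);     /* cross-CPU synchronization */
        if (q->q_len == 256) {
            q->q_drops++;                   /* work already done is lost */
            error = -1;
        } else
            q->q_items[q->q_len++] = p;
        pthread_mutex_unlock(&q->q_lock);
        if (error)
            free(p);                        /* drop on overflow */
        return (error);
    }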
This is why we direct dispatch by default, and why this is quite a good
strategy for multiple input queue network cards, where it also buys us
parallelism across the input queues without synchronization between CPUs.
Note that if the flow RSS hash is in the same cache line as the rest of the
receive descriptor ring entry, you may be able to avoid the cache miss on the
cluster and simply redirect it to another CPU's netisr without ever reading
packet data, which avoids at least one and possibly two cache misses, but also
means that you have to run the link layer in the remote netisr, rather than
locally in the ithread.
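A hedged sketch of that flowid-only placement -- the descriptor layout below
is hypothetical, not any particular NIC's -- where no packet bytes are ever
read on the receiving CPU:

    #include <stdint.h>

    struct rx_desc_rss {
        uint32_t rss_hash;          /* same cache line as len/status */
        uint16_t len;
        uint16_t status;
    };

    static int
    rss_pick_cpu(const struct rx_desc_rss *desc, int ncpus)
    {
        /* The only miss is the descriptor entry, which the driver
         * touches anyway; the trade-off is that link-layer demux must
         * now run in the remote netisr rather than the local ithread. */
        return ((int)(desc->rss_hash % (uint32_t)ncpus));
    }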
> In the first case, the packet reception code path is not changed until it's
> queued on a thread, on which it's handled in the future (or is the influence
> of "other" data like timers and internal TCP reassembly buffers so large?).
> In the second case, why?
The good news about TCP reassembly is that we don't have to look at the data,
only mbuf headers and reassembly buffer entries, so with any luck we've
avoided actually taking a cache miss on the data. If things go well, we can
avoid looking at anything but mbuf and packet headers until the socket copies
out, but I'm not sure how well we do that in practice.
> As the card and the OS can already process many packets per second for
> something fairly complex as routing (http://www.tancsa.com/blast.html), and
> TCP chokes swi:net at 100% of a core, isn't this an indication that there's
> certainly more space for improvement even with a single-queue old-fashioned
> card?
Maybe. It depends on the relative costs of local processing vs redistributing
the work, which involves schedulers, IPIs, additional cache misses, lock
contention, and so on. This means there's a period where it can't possibly be
a win, and then at some point it's a win as long as the stack scales. This is
essentially the usual trade-off in using threads and parallelism: does the
benefit of multiple parallel execution units make up for the overheads of
synchronization and data migration?
There are some previous e-mail threads where people have observed that for
some workloads, switching to netisr wins over direct dispatch. For example,
if you have a number of cores and are doing firewall processing, offloading
work to the netisr from the input ithread may improve performance. However,
this appears not to be the common case for end-host workloads on the hardware
we mostly target, and this is increasingly true as multiple input queues come
into play, as the card itself will allow us to use multiple CPUs without any
interactions between the CPUs.
This isn't to say that work redistribution using a netisr-like scheme isn't a
good idea: in a world where CPU threads are weak compared to the wire
workflow, and there's cache locality across threads on the same core, or NUMA
is present, there may be a potential for a big win when available work
significantly exceeds what a single CPU thread/core can handle. In that case,
we want to place the work as close as possible to take advantage of shared
caches or the memory being local to the CPU thread/core doing the deferred
work.
FYI, the localhost case is a bit weird -- I think we have some scheduling
issues that are causing loopback netisr stuff to be pessimally scheduled.
Here are some suggestions for things to try and see if they help, though:
- Comment out all ifnet, IP, and TCP global statistics in your local stack --
especially look for things like tcpstat.whatever++.
- Use cpuset to pin ithreads, the netisr, and whatever else, to specific cores
so that they don't migrate, and if your system uses HTT, experiment with
pinning the ithread and the netisr on different threads on the same core, or
at least, different cores on the same die.
- Experiment with using just the source IP, the source + destination IP, and
both IPs plus TCP ports in your hash (see the sketch after this list).
- If your card supports RSS, pass the flowid up the stack in the mbuf packet
header flowid field, and use that instead of the hash for work placement.
- If you're doing pure PPS tests with UDP (or the like), and your test can
tolerate disordering, try hashing based on the mbuf header address or
something else that will distribute the work but not take a cache miss.
- If you have a flowid or the above disordered condition applies, try shifting
the link layer dispatch to the netisr, rather than doing the demux in the
ithread, as that will avoid cache misses in the ithread and do all the demux
in the netisr.
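As a concrete form of the hashing experiment above, here is a trivial
(hypothetical) hash over progressively larger parts of the flow tuple; a real
deployment would more likely use something like a Toeplitz hash, but for a
quick placement experiment this is enough:

    #include <stdint.h>

    /* Hash variants over the flow tuple; pick one, map the result onto a
     * netisr/CPU (e.g. hash % ncpus), and measure. */
    static inline uint32_t
    flow_hash(uint32_t src_ip, uint32_t dst_ip, uint16_t sport,
        uint16_t dport, int variant)
    {
        uint32_t h = src_ip;                    /* variant 0: source IP only */

        if (variant >= 1)
            h ^= dst_ip;                        /* variant 1: src + dst IP */
        if (variant >= 2)
            h ^= (uint32_t)sport << 16 | dport; /* variant 2: IPs + ports */
        return (h);
    }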
Robert N M Watson
University of Cambridge