Advice on a multithreaded netisr patch?
rwatson at FreeBSD.org
Sun Apr 5 15:18:00 PDT 2009
On Sun, 5 Apr 2009, Ivan Voras wrote:
>> The argument is not that they are slower (although they probably are a bit
>> slower), rather that they introduce serialization bottlenecks by requiring
>> synchronization between CPUs in order to distribute the work. Certainly
>> some of the scalability issues in the stack are not a result of that, but a
>> good number are.
> I'd like to understand more. If (in netisr) I have a mbuf with headers, is
> this data already transferred from the card or is it magically "not here
> yet"?
A lot depends on the details of the card and driver. The driver will take
cache misses on the descriptor ring entry, if it's not already in cache, and
the link layer will take a cache miss on the front of the ethernet frame in
the cluster pointed to by the mbuf header as part of its demux. What happens
next depends on your dispatch model and cache line size. Let's make a few
simplifying assumptions that are mostly true:
- The driver associates a single cluster with each receive ring entry, in
which each packet is stored, and the cluster is cacheline-aligned. No header
splitting is enabled.
- Standard ethernet encapsulation of IP is used, without additional VLAN
headers or other encapsulation, etc. There are no IP options.
- We don't need to validate any checksums because the hardware has done it for
us, so no need to take cache misses on data that doesn't matter until we
reach higher layers.
In the device driver/ithread code, we'll now proceed to take some cache
misses, assuming we're not particularly lucky:
(1) The descriptor ring entry
(2) The mbuf packet header
(3) The first cache line in the cluster
This is sufficient to figure out what protocol we're going to dispatch to, and
depending on dispatch model, we now either enqueue the packet for delivery to
a netisr, or we directly dispatch the handler for IP.
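To make those three touch points concrete, here is a minimal userland sketch
with mock types -- the struct names and fields are hypothetical stand-ins, not
the real driver structures -- where each commented miss corresponds to one of
(1)-(3) above:

    /* Hypothetical mock of the receive path's three cache-missing touches. */
    #include <stdint.h>

    struct rx_desc {                /* (1) descriptor ring entry (mock) */
        uint64_t paddr;
        uint16_t len;
        uint16_t status;
    };

    struct mbuf_hdr {               /* (2) mbuf packet header (mock) */
        void    *m_data;            /* points into the cluster */
        int      m_len;
    };

    static uint16_t
    rx_demux(struct rx_desc *desc, struct mbuf_hdr *m)
    {
        uint16_t ethertype;

        if (!(desc->status & 1))    /* miss (1): descriptor ring entry */
            return (0);
        m->m_len = desc->len;       /* miss (2): mbuf packet header */
        /* Miss (3): first cache line of the cluster, to read the
         * Ethernet type field at offset 12 of the 14-byte header. */
        ethertype = (uint16_t)(((uint8_t *)m->m_data)[12] << 8 |
            ((uint8_t *)m->m_data)[13]);
        return (ethertype);         /* e.g. 0x0800 for IPv4 */
    }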
If the packet is processed on the current CPU and we're direct dispatching, or
if we've dispatched to a netisr on the same CPU and we're quite lucky, the
mbuf packet header and front of the cluster will be in the cache.
However, what happens next depends on the cache fetch and line size. If
things happen in 32-byte cache lines or smaller, we cache miss on the end of
the IP header, because the last two bytes of the destination IP address start
at offset 32 into the cluster. If we have 64-byte fetching and line size,
things go better because both the full IP and TCP headers should be in that
first cache line.
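The offsets work out as follows; a small standalone sketch (standard headers
only, using the fixed offsets from the assumptions above) that computes them:

    /* Byte offsets into a cacheline-aligned cluster holding a plain
     * Ethernet/IPv4 frame (no VLAN tag, no IP options). */
    #include <stdio.h>

    int
    main(void)
    {
        const int ether_len  = 14;              /* struct ether_header */
        const int ip_dst_off = ether_len + 16;  /* ip_dst is at offset 16
                                                 * within struct ip */

        /* ip_dst occupies bytes 30-33: with 32-byte lines, bytes 32-33
         * fall into the second line; with 64-byte lines, the Ethernet,
         * IP, and TCP headers (14 + 20 + 20 = 54 bytes) all fit in the
         * first line. */
        printf("ip_dst: bytes %d-%d\n", ip_dst_off, ip_dst_off + 3);
        printf("ends in 32-byte line %d\n", (ip_dst_off + 3) / 32);
        return (0);
    }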
One big advantage to direct dispatch is that it maximizes the chances that we
don't blow out the low-level CPU caches between link-layer and IP-layer
processing, meaning that we might actually get through all the IP and TCP
headers without a cache miss on a 64-byte line size. If we netisr dispatch to
another CPU without a shared cache, or we netisr dispatch to the current CPU
but there's a scheduling delay, other packets queued first, etc., we'll take a
number of the same cache misses over again as things get pulled into the right
caches.
This presents a strong cache motivation to keep a packet "on" a CPU and even
in the same thread once you've started processing it. If you have to enqueue,
you take locks, take a context switch, deal with the fact that LRU on cache
lines isn't going to like your queue depth, and potentially pay a number of
additional cache misses on the same data. There are also some other good
reasons to use direct dispatch, such as avoiding doing work on packets that
will later be dropped if the netisr queue overflows.
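To illustrate those deferral costs, here is a userland analogue using pthreads
-- purely a sketch with hypothetical names, not the kernel's netisr code --
showing the lock on the producer side and the wasted work on overflow:

    /* Deferred-dispatch costs in miniature: the producer synchronizes
     * with other CPUs, and on overflow the cache misses already paid
     * for the packet are simply thrown away. */
    #include <pthread.h>
    #include <stdlib.h>

    struct pkt;                             /* opaque packet */

    struct pktq {
        pthread_mutex_t  q_lock;
        struct pkt      *q_items[256];
        int              q_len;
        unsigned long    q_drops;
    };

    static int
    pktq_enqueue(struct pktq *q, struct pkt *p)
    {
        int error = 0;

        pthread_mutex_lock(&q->q_lock);     /* cross-CPU synchronization */
        if (q->q_len == 256) {
            q->q_drops++;                   /* work already done is lost */
            error = -1;
        } else
            q->q_items[q->q_len++] = p;
        pthread_mutex_unlock(&q->q_lock);
        if (error)
            free(p);                        /* drop on overflow */
        return (error);
    }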
This is why we direct dispatch by default, and why this is quite a good
strategy for multiple input queue network cards, where it also buys us
parallelism across the input queues without synchronization between CPUs.
Note that if the flow RSS hash is in the same cache line as the rest of the
receive descriptor ring entry, you may be able to avoid the cache miss on the
cluster and simply redirect it to another CPU's netisr without ever reading
packet data, which avoids at least one and possibly two cache misses, but also
means that you have to run the link layer in the remote netisr, rather than
locally in the ithread.
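A hedged sketch of that flowid-only placement -- the descriptor layout below
is hypothetical, not any particular NIC's -- where no packet bytes are ever
read on the receiving CPU:

    #include <stdint.h>

    struct rx_desc_rss {
        uint32_t rss_hash;          /* same cache line as len/status */
        uint16_t len;
        uint16_t status;
    };

    static int
    rss_pick_cpu(const struct rx_desc_rss *desc, int ncpus)
    {
        /* The only miss is the descriptor entry, which the driver
         * touches anyway; the trade-off is that link-layer demux must
         * now run in the remote netisr rather than the local ithread. */
        return ((int)(desc->rss_hash % (uint32_t)ncpus));
    }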
> In the first case, the packet reception code path is not changed until it's
> queued on a thread, on which it's handled in the future (or is the influence
> of "other" data like timers and internal TCP reassembly buffers so large?).
> In the second case, why?
The good news about TCP reassembly is that we don't have to look at the data,
only mbuf headers and reassembly buffer entries, so with any luck we've
avoided actually taking a cache miss on the data. If things go well, we can
avoid looking at anything but mbuf and packet headers until the socket copies
out, but I'm not sure how well we do that in practice.
> As the card and the OS can already process many packets per second for
> something fairly complex as routing (http://www.tancsa.com/blast.html), and
> TCP chokes swi:net at 100% of a core, isn't this an indication that there's
> certainly more space for improvement even with a single-queue old-fashioned
> card?
Maybe. It depends on the relative costs of local processing vs redistributing
the work, which involves schedulers, IPIs, additional cache misses, lock
contention, and so on. This means there's a period where it can't possibly be
a win, and then at some point it's a win as long as the stack scales. This is
essentially the usual trade-off in using threads and parallelism: does the
benefit of multiple parallel execution units make up for the overheads of
synchronization and data migration?
There are some previous e-mail threads where people have observed that for
some workloads, switching to netisr wins over direct dispatch. For example,
if you have a number of cores and are doing firewall processing, offloading
work to the netisr from the input ithread may improve performance. However,
this appears not to be the common case for end-host workloads on the hardware
we mostly target, and this is increasingly true as multiple input queues come
into play, as the card itself will allow us to use multiple CPUs without any
interactions between the CPUs.
This isn't to say that work redistribution using a netisr-like scheme isn't a
good idea: in a world where CPU threads are weak compared to the wire
workflow, and there's cache locality across threads on the same core, or NUMA
is present, there may be a potential for a big win when available work
significantly exceeds what a single CPU thread/core can handle. In that case,
we want to place the work as close as possible to take advantage of shared
caches or the memory being local to the CPU thread/core doing the deferred
work.
FYI, the localhost case is a bit weird -- I think we have some scheduling
issues that are causing loopback netisr stuff to be pessimally scheduled.
Here are some suggestions for things to try and see if they help, though:
- Comment out all ifnet, IP, and TCP global statistics in your local stack --
especially look for things like tcpstat.whatever++.
- Use cpuset to pin ithreads, the netisr, and whatever else, to specific cores
so that they don't migrate, and if your system uses HTT, experiment with
pinning the ithread and the netisr on different threads on the same core, or
at least, different cores on the same die.
- Experiment with using just the source IP, the source + destination IP, and
both IPs plus TCP ports in your hash (see the sketch after this list).
- If your card supports RSS, pass the flowid up the stack in the mbuf packet
header flowid field, and use that instead of the hash for work placement.
- If you're doing pure PPS tests with UDP (or the like), and your test can
tolerate disordering, try hashing based on the mbuf header address or
something else that will distribute the work but not take a cache miss.
- If you have a flowid or the above disordered condition applies, try shifting
the link layer dispatch to the netisr, rather than doing the demux in the
ithread, as that will avoid cache misses in the ithread and do all the demux
in the netisr.
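As a concrete form of the hashing experiment above, here is a trivial
(hypothetical) hash over progressively larger parts of the flow tuple; a real
deployment would more likely use something like a Toeplitz hash, but for a
quick placement experiment this is enough:

    #include <stdint.h>

    /* Hash variants over the flow tuple; pick one, map the result onto a
     * netisr/CPU (e.g. hash % ncpus), and measure. */
    static inline uint32_t
    flow_hash(uint32_t src_ip, uint32_t dst_ip, uint16_t sport,
        uint16_t dport, int variant)
    {
        uint32_t h = src_ip;                    /* variant 0: source IP only */

        if (variant >= 1)
            h ^= dst_ip;                        /* variant 1: src + dst IP */
        if (variant >= 2)
            h ^= (uint32_t)sport << 16 | dport; /* variant 2: IPs + ports */
        return (h);
    }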
Robert N M Watson
University of Cambridge