Advice on a multithreaded netisr patch?

Mon Apr 6 03:37:21 PDT 2009

--- On Sun, 4/5/09, Robert Watson <rwatson at FreeBSD.org> wrote:

> From: Robert Watson <rwatson at FreeBSD.org>
> Subject: Re: Advice on a multithreaded netisr patch?
> To: "Ivan Voras" <ivoras at freebsd.org>
> Cc: freebsd-net at freebsd.org
> Date: Sunday, April 5, 2009, 6:17 PM
> On Sun, 5 Apr 2009, Ivan Voras wrote:
> 
> >> The argument is not that they are slower (although they probably are
> >> a bit slower), rather that they introduce serialization bottlenecks by
> >> requiring synchronization between CPUs in order to distribute the work.
> >> Certainly some of the scalability issues in the stack are not a result
> >> of that, but a good number are.
> > 
> > I'd like to understand more. If (in netisr) I have an mbuf with headers,
> > is this data already transferred from the card or is it magically "not
> > here yet"?
> 
> A lot depends on the details of the card and driver.  The
> driver will take cache misses on the descriptor ring entry,
> if it's not already in cache, and the link layer will
> take a cache miss on the front of the ethernet frame in the
> cluster pointed to by the mbuf header as part of its demux. 
> What happens next depends on your dispatch model and cache
> line size.  Let's make a few simplifying assumptions
> that are mostly true:
> 
> - The driver associates a single cluster with each receive ring entry for
>   each packet to be stored in, and the cluster is cacheline-aligned.  No
>   header splitting is enabled.
> 
> - Standard ethernet encapsulation of IP is used, without additional VLAN
>   headers or other encapsulation, etc.  There are no IP options.
> 
> - We don't need to validate any checksums because the hardware has done it
>   for us, so no need to take cache misses on data that doesn't matter until
>   we reach higher layers.
> 
> In the device driver/ithread code, we'll now proceed to
> take some cache misses, unless we're quite lucky:
> 
> (1) The descriptor ring entry
> (2) The mbuf packet header
> (3) The first cache line in the cluster
> 
> This is sufficient to figure out what protocol we're
> going to dispatch to, and depending on dispatch model, we
> now either enqueue the packet for delivery to a netisr, or
> we directly dispatch the handler for IP.
> 
> If the packet is processed on the current CPU and we're
> direct dispatching, or if we've dispatched to a netisr
> on the same CPU and we're quite lucky, the mbuf packet
> header and front of the cluster will be in the cache.
> 
> However, what happens next depends on the cache fetch and
> line size.  If things happen in 32-byte cache lines or
> smaller, we cache miss on the end of the IP header, because
> the last two bytes of the destination IP address start at
> offset 32 into the cluster.  If we have 64-byte fetching and
> line size, things go better because both the full IP and TCP
> headers should be in that first cache line.
> 
> One big advantage to direct dispatch is that it maximizes
> the chances that we don't blow out the low-level CPU
> caches between link-layer and IP-layer processing, meaning
> that we might actually get through all the IP and TCP
> headers without a cache miss on a 64-byte line size.  If we
> netisr dispatch to another CPU without a shared cache, or we
> netisr dispatch to the current CPU but there's a
> scheduling delay, other packets queued first, etc, we'll
> take a number of the same cache misses over again as things
> get pulled into the right cache.
> 
> This presents a strong cache motivation to keep a packet
> "on" a CPU and even in the same thread once
> you've started processing it.  If you have to enqueue,
> you take locks, take a context switch, deal with the fact
> that LRU on cache lines isn't going to like your queue
> depth, and potentially pay a number of additional cache
> misses on the same data.  There are also some other good
> reasons to use direct dispatch, such as avoiding doing work
> on packets that will later be dropped if the netisr queue
> overflows.
> 
> This is why we direct dispatch by default, and why this is
> quite a good strategy for multiple input queue network
> cards, where it also buys us parallelism.
> 
> Note that if the flow RSS hash is in the same cache line as
> the rest of the receive descriptor ring entry, you may be
> able to avoid the cache miss on the cluster and simply
> redirect it to another CPU's netisr without ever reading
> packet data, which avoids at least one and possibly two
> cache misses, but also means that you have to run the link
> layer in the remote netisr, rather than locally in the
> ithread.
> 
> > In the first case, the packet reception code path is not changed until
> > it's queued on a thread, on which it's handled in the future (or is the
> > influence of "other" data like timers and internal TCP reassembly
> > buffers so large?). In the second case, why?
> 
> The good news about TCP reassembly is that we don't
> have to look at the data, only mbuf headers and reassembly
> buffer entries, so with any luck we've avoided actually
> taking a cache miss on the data.  If things go well, we can
> avoid looking at anything but mbuf and packet headers until
> the socket copies out, but I'm not sure how well we do
> that in practice.
> 
> > As the card and the OS can already process many packets per second for
> > something as complex as routing (http://www.tancsa.com/blast.html),
> > and TCP chokes swi:net at 100% of a core, isn't this an indication that
> > there's certainly more space for improvement even with single-queue
> > old-fashioned NICs?
> 
> Maybe.  It depends on the relative costs of local
> processing vs redistributing the work, which involves
> schedulers, IPIs, additional cache misses, lock contention,
> and so on.  This means there's a period where it
> can't possibly be a win, and then at some point it's
> a win as long as the stack scales.  This is essentially the
> usual trade-off in using threads and parallelism: does the
> benefit of multiple parallel execution units make up for the
> overheads of synchronization and data migration?
> 
> There are some previous e-mail threads where people have
> observed that for some workloads, switching to netisr wins
> over direct dispatch.  For example, if you have a number of
> cores and are doing firewall processing, offloading work to
> the netisr from the input ithread may improve performance. 
> However, this appears not to be the common case for end-host
> workloads on the hardware we mostly target, and this is
> increasingly true as multiple input queues come into play,
> as the card itself will allow us to use multiple CPUs
> without any interactions between the CPUs.
> 
> This isn't to say that work redistribution using a
> netisr-like scheme isn't a good idea: in a world where
> CPU threads are weak compared to the wire workflow, and
> there's cache locality across threads on the same core,
> or NUMA is present, there may be a potential for a big win
> when available work significantly exceeds what a single CPU
> thread/core can handle.  In that case, we want to place the
> work as close as possible to take advantage of shared caches
> or the memory being local to the CPU thread/core doing the
> deferred work.
> 
> FYI, the localhost case is a bit weird -- I think we have
> some scheduling issues that are causing loopback netisr
> stuff to be pessimally scheduled. Here are some suggestions
> for things to try and see if they help, though:
> 
> - Comment out all ifnet, IP, and TCP global statistics in your local
>   stack -- especially look for things like tcpstat.whatever++;.
> 
> - Use cpuset to pin ithreads, the netisr, and whatever else, to specific
>   cores so that they don't migrate, and if your system uses HTT, experiment
>   with pinning the ithread and the netisr on different threads on the same
>   core, or at least, different cores on the same die.
> 
> - Experiment with using just the source IP, the source + destination IP,
>   and both IPs plus TCP ports in your hash.
> 
> - If your card supports RSS, pass the flowid up the stack in the mbuf
>   packet header flowid field, and use that instead of the hash for work
>   placement.
> 
> - If you're doing pure PPS tests with UDP (or the like), and your test can
>   tolerate disordering, try hashing based on the mbuf header address or
>   something else that will distribute the work but not take a cache miss.
> 
> - If you have a flowid or the above disordered condition applies, try
>   shifting the link layer dispatch to the netisr, rather than doing the
>   demux in the ithread, as that will avoid cache misses in the ithread and
>   do all the demux in the netisr.
> 
> Robert N M Watson
> Computer Laboratory
> University of Cambridge

Is there a way to give a kernel thread exclusive use of a core? I know you
can pin a kernel thread with sched_bind(), but is there a way to keep
other threads from using the core? On an 8 core system it almost seems
that the randomness of more cores is a negative in some situations.

Also, I've noticed that calling sched_bind() during bootup is a bad thing
in that it locks up the system. I'm not certain, but I suspect the
thread_lock is the culprit. Is there a clean way to determine that
it's safe to lock curthread and do a CPU bind?

Barney