Advice on a multithreaded netisr patch?

Sun Apr 5 06:21:23 PDT 2009

On Sun, 5 Apr 2009, Ivan Voras wrote:

> I'm developing an application that needs a high rate of small TCP 
> transactions on multi-core systems, and I'm hitting a limit where a kernel 
> task, usually swi:net (but it depends on the driver) hits 100% of a CPU at 
> some transactions/s rate and blocks further performance increase even though 
> other cores are 100% idle.

You can find a similar, if possibly more mature, implementation here:

   //depot/projects/rwatson/netisr2/...

I haven't updated it in about six months since I've been waiting for the 
RSS-based flowid support in HEAD to mature.  One of the fundamental problems 
with hashing packets to distribute work is that it involves taking cache 
misses on packet headers, not just once, but twice, which often is one of the 
largest costs in processing packets.  Most modern, interesting 
high-performance network cards can already take the hash in hardware, and you 
want to use that hash to place work where possible.
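
A minimal sketch of the work-placement idea above, in userland C rather than
kernel code. It prefers a hash supplied by the NIC and falls back to a software
hash over the header fields only when none is available. Everything here is
hypothetical and for illustration only: `struct pkt` is a simplified stand-in
(not the real `struct mbuf`), and `soft_hash` and `pick_worker` are invented
names, not functions from netisr2 or the patch under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified packet descriptor; NOT the real struct mbuf. */
struct pkt {
	int      has_flowid;  /* nonzero if the NIC supplied a flow hash */
	uint32_t flowid;      /* hardware-computed flow hash (e.g. via RSS) */
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
};

/*
 * Software fallback: FNV-1a over the 4-tuple.  In a real stack this is
 * the expensive path, because it has to read the packet headers.
 */
static uint32_t
soft_hash(const struct pkt *p)
{
	uint32_t words[3] = {
		p->src_ip,
		p->dst_ip,
		((uint32_t)p->src_port << 16) | p->dst_port
	};
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */

	for (int i = 0; i < 3; i++) {
		for (int b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xff;
			h *= 16777619u;	/* FNV 32-bit prime */
		}
	}
	return (h);
}

/* Map a packet to a worker index; the same flow always maps to the same worker. */
static unsigned
pick_worker(const struct pkt *p, unsigned nworkers)
{
	uint32_t h;

	assert(nworkers > 0);
	h = p->has_flowid ? p->flowid : soft_hash(p);
	return (h % nworkers);
}
```

With a hardware flowid present, `pick_worker` never touches the address or
port fields, which is the point of using the card's hash where possible.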

In 8.x, you shouldn't be experiencing high lock contention for the TCP receipt 
path when doing bulk transfers, as we use read locking for the tcbinfo lock in 
most cases.  In fact, you can get fairly decent scalability even in 7.x 
because the regular packet processing path for TCP uses mutual exclusion only 
briefly.  However, the current approach does dirty a lot of cache lines, 
especially locks and stats, and does not scale well (in 8.x, or at all in 7.x) 
if you have lots of short connections.  Also, be aware that if you're 
outputting to a single interface or queue, there's a *lot* of lock contention 
in the device driver.  Kip Macy has patches to support multiple output queues 
on cxgb, which should facilitate support for other drivers as well, and the 
plan is to get that into 8.0 too.

The patch above doesn't know about the mbuf packetheader flowid yet, but it's 
trivial to teach it about that.  I have plans to get back to the netisr2 code 
before we finalize 8.0, but have some other stuff in the queue first.  We're, 
briefly, in a period where input queue count is about the same density as CPU 
cores; it's not entirely clear, but we may soon be back in a situation where 
CPU core count exceeds queues, in which case doing software work placement 
will continue to be important.  Right now, as long as your high-performance 
card supports multiple input queues, we already do pretty effective work 
placement by virtue of RSS and multiple ithreads.

Robert N M Watson
Computer Laboratory
University of Cambridge

>
> So I've got an idea and tested it out, but it fails in an unexpected
> way. I'm not very familiar with the network code so I'm probably missing
> something obvious. The idea was to locate where the packet processing
> takes place and offload packets to several new kernel threads. I see
> this can happen in several places - netisr, ip_input and tcp_input, and
> I chose netisr because I thought maybe it would also help other uses
> (routing?). Here's a patch against CURRENT:
>
> http://people.freebsd.org/~ivoras/diffs/mpip.patch
>
> It's fairly simple - starts a configurable number of threads in
> start_netisr(), assigns circular queues to each, and modifies what I
> think are entry points for packets in the non-netisr.direct case. I also
> try to have TCP and UDP traffic from the same host+port processed by the
> same thread. It has some rough edges but I think this is enough to test
> the idea. I know that there are several people officially working in
> this area and I'm not an expert in it so think of it as a weekend hack
> for learning purposes :)
>
> These parameters are needed in loader.conf to test it:
>
> net.isr.direct=0
> net.isr.mtdispatch_n_threads=2
>
> I expected things like the contention in upper layers (TCP) leading to
> not improving performance one bit, but I can't explain what I'm getting
> here. While testing the application on a plain kernel, I get approx.
> 100,000 - 120,000 packets/s per direction (by looking at "netstat 1")
> and a similar number of transactions/s in the application. With the
> patch I get up to 250,000 packets/s in netstat (3 mtdispatch threads),
> but for some weird reason the actual number of transactions processed by
> the application drops to less than 1,000 at the beginning (for about 30
> seconds), then jumps to close to 100,000 transactions/s, with netstat
> also showing a drop to this number of packets. In the first phase, the new
> threads (netd0..3) are using CPU time almost 100%, in the second phase I
> can't see where the CPU time is going (using top).
>
> I thought this has something to do with NIC moderation (em) but can't
> really explain it. The bad performance part (not the jump) is also
> visible over the loopback interface.
>
> Any ideas?
>
>
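
The per-thread circular queues mentioned in the quoted message could look
roughly like the following. This is a minimal single-threaded sketch, not code
from mpip.patch: `struct ring`, `RING_SIZE`, `ring_enqueue` and `ring_dequeue`
are hypothetical names, and a real kernel version would also need locking or
atomic operations between the producer and the consumer thread.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical capacity; must be a power of two */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
    "RING_SIZE must be a power of two");

/* One queue per worker thread; slots hold opaque packet pointers. */
struct ring {
	void    *slots[RING_SIZE];
	unsigned head;	/* count of items dequeued so far */
	unsigned tail;	/* count of items enqueued so far */
};

/* Returns 0 on success, -1 if the ring is full (caller would drop). */
static int
ring_enqueue(struct ring *r, void *pkt)
{
	if (r->tail - r->head == RING_SIZE)
		return (-1);
	r->slots[r->tail % RING_SIZE] = pkt;
	r->tail++;
	return (0);
}

/* Returns the oldest queued packet, or NULL if the ring is empty. */
static void *
ring_dequeue(struct ring *r)
{
	void *pkt;

	if (r->head == r->tail)
		return (NULL);
	pkt = r->slots[r->head % RING_SIZE];
	r->head++;
	return (pkt);
}
```

Because `head` and `tail` are free-running unsigned counters and the capacity
is a power of two, `tail - head` stays correct across integer wraparound.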