Advice on a multithreaded netisr patch?

Robert Watson rwatson at FreeBSD.org
Mon Apr 6 11:52:17 PDT 2009


On Mon, 6 Apr 2009, Ivan Voras wrote:

>> I think we're talking slightly at cross purposes.  There are two
>> transfers of interest:
>>
>> (1) DMA of the packet data to main memory from the NIC
>> (2) Servicing of CPU cache misses to access data in main memory
>>
>> By the time you receive an interrupt, the DMA is complete, so once you
>
> OK, this was what was confusing me - for a moment I thought you meant it's 
> not so.

It's a polite lie that we will choose to believe for the purposes of 
simplification.  And it's probably true for all our drivers in practice right 
now.

>>     m = m_pullup(m, sizeof(*w));
>>     if (m == NULL)
>>         return;
>>     w = mtod(m, struct whatever *);
>>
>> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are 
>> contiguously stored so that the cast of w to m's data will point at a
>
> So, m_pullup() can resize / realloc() the mbuf? (not that it matters for 
> this purpose)

Yes -- if it can't meet the contiguity requirements using the current mbuf 
chain, it may reallocate and return a new head to the chain (hence m being 
reassigned).  If that reallocation fails, it may return NULL.  Once you've 
called m_pullup(), existing pointers into the chain's data will be invalid, so 
if you've already called mtod() on it, you need to call it again.
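
To make that concrete, here's a rough sketch of the idiom -- struct whatever 
and the function name are placeholders for illustration, not real kernel 
interfaces:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    /* Placeholder structure for illustration only. */
    struct whatever {
            uint32_t        w_type;
            uint32_t        w_len;
    };

    static void
    whatever_input(struct mbuf *m)
    {
            struct whatever *w;

            /*
             * m_pullup() may reallocate and return a new chain head, so any
             * pointer previously obtained with mtod() is stale after this.
             */
            m = m_pullup(m, sizeof(*w));
            if (m == NULL)
                    return;         /* m_pullup() freed the chain on failure */

            /* Re-derive the data pointer only after m_pullup() succeeds. */
            w = mtod(m, struct whatever *);
            printf("whatever: type %u len %u\n", w->w_type, w->w_len);
            m_freem(m);
    }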

>> - A TCP segment will need to be ACK'd, so if you're sending data in chunks
>>   in one direction, the ACKs will not be piggy-backed on existing data
>>   transfers, and instead be sent independently, hitting the network stack
>>   two more times.
>
> No combination of these can make an accounting difference between 1,000 and 
> 250,000 pps. I must be hitting something very bad here.

Yes, you definitely want to run tcpdump to see what's going on here.

>> - Remember that TCP works to expand its window, and then maintains the
>>   highest performance it can by bumping up against the top of available
>>   bandwidth continuously.  This involves detecting buffer limits by
>>   generating packets that can't be sent, adding to the packet count.  With
>>   loopback traffic, the drop point occurs when you exceed the size of the
>>   netisr's queue for IP, so you might try bumping that from the default to
>>   something much larger.
>
> My messages are approx. 100 +/- 10 bytes. No practical way they will even 
> span multiple mbufs. TCP_NODELAY is on.

Remember that TCP_NODELAY just disables Nagle, it doesn't disable delayed 
ACKs.
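
To be explicit about what TCP_NODELAY does and doesn't buy you, here's a 
minimal userland sketch (assuming 's' is an already-connected TCP socket):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /*
     * Disable Nagle on our sending side only; the peer's delayed ACKs are
     * governed separately (on FreeBSD, the net.inet.tcp.delayed_ack sysctl).
     */
    static int
    disable_nagle(int s)
    {
            int one = 1;

            return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one,
                sizeof(one)));
    }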

>> No.  x++ is massively slow if executed in parallel across many cores on a 
>> variable in a single cache line.  See my recent commit to kern_tc.c for an 
>> example: the updating of trivial statistics for the kernel time calls 
>> reduced 30m syscalls/second to 3m syscalls/second due to heavy contention 
>> on the cache line holding the statistic.  One of my goals for
>
> I don't get it: 
> http://svn.freebsd.org/viewvc/base/stable/7/sys/kern/kern_tc.c?r1=189891&r2=189890&pathrev=189891
>
> you replaced x++ with no-ops if TC_COUNTER is defined? Aren't the 
> timecounters actually needed somewhere?

These are statistics, not the time counters themselves.  Turning off the 
statistics led to an order-of-magnitude performance improvement by virtue of 
not thrashing cache lines.

>> 8.0 is to fix this problem for IP and TCP layers, and ideally also ifnet 
>> but we'll see.  We should be maintaining those stats per-CPU and then 
>> aggregating to report them to userspace.  This is what we already do for a 
>> number of system stats -- UMA and kernel malloc, syscall and trap counters, 
>> etc.
>
> How magic is this? Is it just a matter of declaring mystatarray[NCPU] and 
> updating mystat[current_cpu] or (probably), the spacing between array 
> elements should be magically fixed so two elements don't share a cache line?

The array needs to be appropriately spaced so that cache lines aren't 
potentially thrashed.  One way to do that is to tag elements with a 
cache-line-sized __aligned attribute.  Another way is to stick them on the 
tail of our existing per-CPU structure, which is what we do for things like 
trap counts, using PCPU_INC().  Notice that this is very slightly lazy and 
subject to a very narrow race if the current thread decides to migrate, but 
that happens only very infrequently in practice.
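
As a rough sketch of the first approach -- the names here are made up, and the 
real per-CPU stats work for 8.0 won't look exactly like this:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/pcpu.h>
    #include <sys/smp.h>

    /* Pad each counter out to a cache line so CPUs don't share lines. */
    struct mystat_pcpu {
            uint64_t        msp_packets;
    } __aligned(CACHE_LINE_SIZE);

    static struct mystat_pcpu mystats[MAXCPU];

    static __inline void
    mystat_inc(void)
    {
            /*
             * Slightly lazy: if the thread migrates between reading curcpu
             * and the increment we may bump another CPU's counter, but the
             * total stays correct and the window is very narrow.
             */
            mystats[curcpu].msp_packets++;
    }

    /* Aggregate across CPUs when reporting to userspace. */
    static uint64_t
    mystat_total(void)
    {
            uint64_t total;
            u_int i;

            total = 0;
            for (i = 0; i <= mp_maxid; i++)
                    total += mystats[i].msp_packets;
            return (total);
    }

The PCPU_INC() route instead adds a field to struct pcpu itself, which gets 
you the per-CPU spacing for free.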

>>> I'm using em hardware; I still think there's a possibility I'm fighting 
>>> the driver in some cases but this has priority #2.
>>
>> Have you tried LOCK_PROFILING?  It would quickly tell you if driver locks 
>> were a source of significant contention.  It works quite well...
>
> I don't think I'm fighting against locking artifacts, it looks more like 
> some kind of overly smart hardware thing, like interrupt moderation (but not 
> exactly interrupt moderation since the number of IRQs/s remains approx. the 
> same).

Ideally what you'll do next is run tcpdump on a machine not acting as part of 
the test, and see what's happening on the wire.

>> if_em doesn't support it, but if_igb does.  If this saves you a minimum of 
>> one and possibly two cache misses per packet, it could be a huge 
>> performance improvement.
>
> If I had the funds to upgrade hardware, I wouldn't be so interested in 
> solving it in software :)

Sure, but what I'm saying is: some problems are inherent to the hardware 
design of what you're using.  We can work around them, but at the end of the 
day, some parts of the problem just require new hardware.  Let's see how far 
we can get without that.

Robert N M Watson
Computer Laboratory
University of Cambridge

