Advice on a multithreaded netisr patch?

Mon Apr 6 04:59:12 PDT 2009

On Mon, 6 Apr 2009, Ivan Voras wrote:

>>> I'd like to understand more. If (in netisr) I have a mbuf with headers, is 
>>> this data already transferred from the card or is it magically "not here 
>>> yet"?
>>
>> A lot depends on the details of the card and driver.  The driver will take 
>> cache misses on the descriptor ring entry, if it's not already in cache, 
>> and the link layer will take a cache miss on the front of the ethernet 
>> frame in the cluster pointed to by the mbuf header as part of its demux. 
>> What happens next depends on your dispatch model and cache line size. 
>> Let's make a few simplifying assumptions that are mostly true:
>
> So, a mbuf can reference data not yet copied from the NIC hardware? I'm 
> specifically trying to understand what m_pullup() does.

I think we're talking slightly at cross purposes.  There are two transfers of 
interest:

(1) DMA of the packet data to main memory from the NIC
(2) Servicing of CPU cache misses to access data in main memory

By the time you receive an interrupt, the DMA is complete, so once you believe 
a packet referenced by the descriptor ring is done, you don't have to wait for 
DMA.  However, the packet data is in main memory rather than your CPU cache, 
so you'll need to take a cache miss in order to retrieve it.  You don't want 
to prefetch before you know the packet data is there, or you may prefetch 
stale data from the previous packet sent or received from the cluster.
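
As a rough sketch of that ordering (check for completion first, prefetch 
second), something like the following is what I have in mind.  The descriptor 
layout, the RX_DESC_DONE bit, and rx_ring_poll() are all made up for 
illustration; real descriptor formats are device-specific, and a real driver 
also needs the appropriate bus_dmamap_sync() calls, which are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_RING_SIZE	8
#define RX_DESC_DONE	0x1u	/* hypothetical "DMA complete" status bit */

/* Hypothetical receive descriptor; not any real NIC's layout. */
struct rx_desc {
	volatile uint32_t status;	/* set by the NIC once DMA is complete */
	uint16_t	len;		/* length of the received frame */
	uint8_t		*buf;		/* cluster the frame was DMA'd into */
};

/*
 * Walk the ring from *head and stop at the first descriptor the NIC has
 * not finished.  A buffer is only prefetched once its DONE bit is set, so
 * we never pull stale data from a previous packet into the cache.
 * Returns the number of completed descriptors consumed.
 */
static int
rx_ring_poll(struct rx_desc *ring, unsigned *head)
{
	int n = 0;

	assert(*head < RX_RING_SIZE);
	while (n < RX_RING_SIZE) {
		struct rx_desc *d = &ring[*head];

		if ((d->status & RX_DESC_DONE) == 0)
			break;			/* DMA not done: do not touch buf */
		__builtin_prefetch(d->buf);	/* GCC/Clang builtin: start the miss early */
		d->status = 0;			/* hand the descriptor back */
		*head = (*head + 1) % RX_RING_SIZE;
		n++;
	}
	return (n);
}
```

The prefetch only hides part of the miss latency; the cache miss on the 
descriptor itself is still paid when the status word is read.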

m_pullup() has to do with mbuf chain memory contiguity during packet 
processing.  The usual usage is something along the following lines:

 	struct whatever *w;

 	m = m_pullup(m, sizeof(*w));
 	if (m == NULL)
 		return;
 	w = mtod(m, struct whatever *);

m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are 
stored contiguously, so that the pointer w obtained from mtod() points at a 
complete structure we can use to interpret packet data.  In the common case in 
the receive path, m_pullup() should be a no-op, since almost all drivers 
receive data in a single cluster.

However, there are cases where the data is not contiguous, such as loopback 
traffic where unusual encapsulation is used: a call to M_PREPEND() inserts a 
new mbuf on the front of the chain, and a later m_defrag() can leave a 
higher-level header crossing an mbuf boundary, or the like.

This issue is almost entirely independent from things like the cache line miss 
issue, unless you hit the uncommon case of having to do work in m_pullup(), in 
which case life sucks.
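
To make the contiguity point concrete, here is a user-space toy model of the 
idea.  This is not the kernel code: struct toy_mbuf, TOY_MLEN, and 
toy_pullup() are invented for illustration, and unlike the real m_pullup() 
this sketch neither allocates a new mbuf nor frees the chain on failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MLEN 64	/* hypothetical per-buffer capacity */

/* Toy stand-in for an mbuf; not the kernel structure. */
struct toy_mbuf {
	struct toy_mbuf	*next;
	size_t		len;		/* valid bytes in data[] */
	unsigned char	data[TOY_MLEN];
};

/*
 * Make the first `len` bytes of the chain contiguous in the head buffer.
 * Returns 0 on success, -1 if len exceeds TOY_MLEN or the chain is too
 * short.
 */
static int
toy_pullup(struct toy_mbuf *m, size_t len)
{
	struct toy_mbuf *n;
	size_t total = 0;

	if (m->len >= len)
		return (0);		/* common case: already contiguous, no work */
	if (len > TOY_MLEN)
		return (-1);
	for (n = m; n != NULL; n = n->next)
		total += n->len;
	if (total < len)
		return (-1);

	/* Uncommon case: copy bytes forward from the following buffers. */
	n = m->next;
	while (m->len < len) {
		size_t take = len - m->len;

		assert(n != NULL);	/* guaranteed by the total check above */
		if (take > n->len)
			take = n->len;
		memcpy(m->data + m->len, n->data, take);
		memmove(n->data, n->data + take, n->len - take);
		m->len += take;
		n->len -= take;
		if (n->len == 0)
			n = n->next;
	}
	return (0);
}
```

The first test in toy_pullup() is the path you want to be on: the header is 
already in one buffer and no bytes move.  The copy loop is the "life sucks" 
path.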

It would be useful to use DTrace to profile a number of the potentially 
expensive m_foo() functions to make sure we're not hitting their slow paths in 
normal workloads, btw.

>>> As the card and the OS can already process many packets per second for
>>> something fairly complex as routing
>>> (http://www.tancsa.com/blast.html), and TCP chokes swi:net at 100% of
>>> a core, isn't this an indication there's certainly more space for
>>> improvement even with single-queue old-fashioned NICs?
>>
>> Maybe.  It depends on the relative costs of local processing vs
>> redistributing the work, which involves schedulers, IPIs, additional
>> cache misses, lock contention, and so on.  This means there's a period
>> where it can't possibly be a win, and then at some point it's a win as
>> long as the stack scales.  This is essentially the usual trade-off in
>> using threads and parallelism: does the benefit of multiple parallel
>> execution units make up for the overheads of synchronization and data
>> migration?
>
> Do you have any idea at all why I'm seeing the weird difference of netstat 
> packets per second (250,000) and my application's TCP performance (< 1,000 
> pps)? Summary: each packet is guaranteed to be a whole message causing a 
> transaction in the application - without the changes I see pps almost 
> identical to tps. Even if the source of netstat statistics somehow manages 
> to count packets multiple times (I don't see how that can happen), no 
> relation can describe differences this huge. It almost looks like something 
> in the upper layers is discarding packets (also not likely: TCP timeouts 
> would occur and the application wouldn't be able to push 250,000 pps) - but 
> what? Where to look?

Is this for the loopback workload?  If so, remember that there may be some 
other things going on:

- Every packet is processed at least two times: once when sent, and then again
   when it's received.

- A TCP segment will need to be ACK'd, so if you're sending data in chunks in
   one direction, the ACKs will not be piggy-backed on existing data transfers,
   and instead be sent independently, hitting the network stack two more times.

- Remember that TCP works to expand its window, and then maintains the highest
   performance it can by bumping up against the top of available bandwidth
   continuously.  This involves detecting buffer limits by generating packets
   that can't be sent, adding to the packet count.  With loopback traffic, the
   drop point occurs when you exceed the size of the netisr's queue for IP, so
   you might try bumping that from the default to something much larger.

And nothing beats using tcpdump -- have you tried tcpdumping the loopback to 
see what is actually being sent?  If not, that's always educational -- perhaps 
something weird is going on with delayed ACKs, etc.

> You mean for the general code? I purposely don't lock my statistics 
> variables because I'm not that interested in exact numbers (orders of 
> magnitude are relevant). As far as I understand, unlocked "x++" should be 
> trivially fast in this case?

No.  x++ is massively slow if executed in parallel across many cores on a 
variable in a single cache line.  See my recent commit to kern_tc.c for an 
example: the updating of trivial statistics for the kernel time calls reduced 
30M syscalls/second to 3M syscalls/second due to heavy contention on the cache 
line holding the statistic.  One of my goals for 8.0 is to fix this problem 
for IP and TCP layers, and ideally also ifnet but we'll see.  We should be 
maintaining those stats per-CPU and then aggregating to report them to 
userspace.  This is what we already do for a number of system stats -- UMA and 
kernel malloc, syscall and trap counters, etc.
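
A minimal user-space sketch of the per-CPU approach follows.  The names 
(struct pcpu_stat, stat_inc(), stat_read()), the CPU count, and the 64-byte 
cache line size are assumptions for illustration, not the kernel's actual 
interfaces.  In the kernel, the caller would also have to stay on its CPU 
while doing the increment.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU		4	/* hypothetical CPU count */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One counter per CPU, each aligned to (and filling) its own cache line. */
struct pcpu_counter {
	_Alignas(CACHE_LINE) uint64_t value;
};

struct pcpu_stat {
	struct pcpu_counter cpu[NCPU];
};

/*
 * Bump the counter belonging to `cpu`.  No lock and no atomic: each CPU
 * only ever writes its own cache line, so the lines are not bounced
 * between cores.
 */
static void
stat_inc(struct pcpu_stat *s, unsigned cpu)
{
	assert(cpu < NCPU);
	s->cpu[cpu].value++;
}

/*
 * Aggregate for reporting.  With concurrent writers the sum is only
 * approximate, which is fine for statistics.
 */
static uint64_t
stat_read(const struct pcpu_stat *s)
{
	uint64_t sum = 0;

	for (unsigned i = 0; i < NCPU; i++)
		sum += s->cpu[i].value;
	return (sum);
}
```

The cost moves from the hot path (every increment) to the cold path (the 
occasional read from userspace), which is the trade you want for statistics.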

>> - Use cpuset to pin ithreads, the netisr, and whatever else, to specific
>>   cores so that they don't migrate, and if your system uses HTT, experiment
>>   with pinning the ithread and the netisr on different threads on the same
>>   core, or at least, different cores on the same die.
>
> I'm using em hardware; I still think there's a possibility I'm fighting the 
> driver in some cases but this has priority #2.

Have you tried LOCK_PROFILING?  It would quickly tell you if driver locks were 
a source of significant contention.  It works quite well...

>> - If your card supports RSS, pass the flowid up the stack in the mbuf
>>   packet header flowid field, and use that instead of the hash for work
>>   placement.
>
> Don't know about em. Don't really want to touch it if I don't have to :)

if_em doesn't support it, but if_igb does.  If this saves you a minimum of one 
and possibly two cache misses per packet, it could be a huge performance 
improvement.
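
Roughly, the work placement decision could look like the sketch below.  The 
struct toy_pkt fields, toy_hash(), and pick_cpu() are hypothetical stand-ins 
rather than the real mbuf or netisr interfaces, and the hash is an arbitrary 
mixer chosen only to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packet summary; not the real mbuf packet header. */
struct toy_pkt {
	int		has_flowid;	/* did the NIC supply a flow ID? */
	uint32_t	flowid;		/* hardware-computed flow ID, if any */
	uint32_t	src_ip, dst_ip;
	uint16_t	src_port, dst_port;
};

/*
 * Software fallback.  In a real stack, computing this means reading the IP
 * and TCP headers, which is where the extra cache misses come from.
 */
static uint32_t
toy_hash(const struct toy_pkt *p)
{
	uint32_t h;

	h = p->src_ip ^ p->dst_ip ^
	    (((uint32_t)p->src_port << 16) | p->dst_port);
	h ^= h >> 16;
	h *= 0x45d9f3bu;
	h ^= h >> 16;
	return (h);
}

/* Prefer the hardware flow ID; only hash in software when there is none. */
static unsigned
pick_cpu(const struct toy_pkt *p, unsigned ncpu)
{
	uint32_t key;

	assert(ncpu > 0);
	key = p->has_flowid ? p->flowid : toy_hash(p);
	return (key % ncpu);
}
```

Either way, packets from the same flow land on the same CPU, which preserves 
ordering within a connection.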

Robert N M Watson
Computer Laboratory
University of Cambridge