Advice on a multithreaded netisr patch?
ivoras at freebsd.org
Mon Apr 6 05:35:57 PDT 2009
Robert Watson wrote:
> On Mon, 6 Apr 2009, Ivan Voras wrote:
>> So, a mbuf can reference data not yet copied from the NIC hardware?
>> I'm specifically trying to understand what m_pullup() does.
> I think we're talking slightly at cross purposes. There are two
> transfers of interest:
> (1) DMA of the packet data to main memory from the NIC
> (2) Servicing of CPU cache misses to access data in main memory
> By the time you receive an interrupt, the DMA is complete, so once you
OK, this was what was confusing me - for a moment I thought you meant
it's not so.
> believe a packet referenced by the descriptor ring is done, you don't
> have to wait for DMA. However, the packet data is in main memory rather
> than your CPU cache, so you'll need to take a cache miss in order to
> retrieve it. You don't want to prefetch before you know the packet data
> is there, or you may prefetch stale data from the previous packet sent
> or received from the cluster.
> m_pullup() has to do with mbuf chain memory contiguity during packet
> processing. The usual usage is something along the following lines:
> struct whatever *w;
> m = m_pullup(m, sizeof(*w));
> if (m == NULL)
>         return;
> w = mtod(m, struct whatever *);
> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data are
> contiguously stored so that the cast of w to m's data will point at a
So, m_pullup() can resize / realloc() the mbuf? (not that it matters for
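To make the contiguity point concrete, here is a minimal user-space sketch of the idea, not the kernel implementation. The names toy_buf, toy_buf_new, toy_chain_free and toy_pullup are hypothetical, and the sketch assumes the requested bytes always fit in the first buffer; the real m_pullup() handles more cases. What it does share with the real function is the contract shown in the quoted snippet: on success the first bytes are contiguous, and on failure the chain is freed and NULL is returned.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BUF_SIZE 64

/* Hypothetical stand-in for an mbuf; not the kernel's struct mbuf. */
struct toy_buf {
	struct toy_buf *next;	/* next buffer in the chain */
	size_t len;		/* valid bytes in data[] */
	unsigned char data[TOY_BUF_SIZE];
};

/* Allocate one buffer holding a copy of len bytes (len <= TOY_BUF_SIZE). */
static struct toy_buf *
toy_buf_new(const void *bytes, size_t len)
{
	struct toy_buf *b;

	assert(len <= TOY_BUF_SIZE);
	b = calloc(1, sizeof(*b));
	if (b != NULL) {
		memcpy(b->data, bytes, len);
		b->len = len;
	}
	return (b);
}

/* Free every buffer in the chain. */
static void
toy_chain_free(struct toy_buf *m)
{
	while (m != NULL) {
		struct toy_buf *n = m->next;

		free(m);
		m = n;
	}
}

/*
 * Make the first `want` bytes of the chain contiguous in the head buffer
 * by copying bytes forward from the buffers that follow it. Returns the
 * head on success; frees the chain and returns NULL if the chain is too
 * short or `want` does not fit in one buffer.
 */
static struct toy_buf *
toy_pullup(struct toy_buf *m, size_t want)
{
	assert(m != NULL);
	while (m->len < want && want <= TOY_BUF_SIZE && m->next != NULL) {
		struct toy_buf *n = m->next;
		size_t take = want - m->len;

		if (take > n->len)
			take = n->len;
		memcpy(m->data + m->len, n->data, take);
		m->len += take;
		memmove(n->data, n->data + take, n->len - take);
		n->len -= take;
		if (n->len == 0) {	/* drained: unlink and free it */
			m->next = n->next;
			free(n);
		}
	}
	if (m->len >= want)
		return (m);
	toy_chain_free(m);
	return (NULL);
}
```

With a chain holding "ab" followed by "cdef", toy_pullup(chain, 4) leaves "abcd" in the head buffer and "ef" in the second one, so a header struct of four bytes can then be read through a single pointer.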
> Is this for the loopback workload? If so, remember that there may be
> some other things going on:
Both loopback and physical.
> - Every packet is processed at least two times: once when sent, and then
> when it's received.
> - A TCP segment will need to be ACK'd, so if you're sending data in
> chunks in one direction, the ACKs will not be piggy-backed on existing
> data and instead be sent independently, hitting the network stack two more
No combination of these can make an accounting difference between 1,000
and 250,000 pps. I must be hitting something very bad here.
> - Remember that TCP works to expand its window, and then maintains the
> performance it can by bumping up against the top of available bandwidth
> continuously. This involves detecting buffer limits by generating packets
> that can't be sent, adding to the packet count. With loopback traffic, the
> drop point occurs when you exceed the size of the netisr's queue for IP, so
> you might try bumping that from the default to something much larger.
My messages are approx. 100 +/- 10 bytes. No practical way they will
even span multiple mbufs. TCP_NODELAY is on.
> No. x++ is massively slow if executed in parallel across many cores on
> a variable in a single cache line. See my recent commit to kern_tc.c
> for an example: the updating of trivial statistics for the kernel time
> calls reduced 30m syscalls/second to 3m syscalls/second due to heavy
> contention on the cache line holding the statistic. One of my goals for
I don't get it:
you replaced x++ with no-ops if TC_COUNTER is defined? Aren't the
timecounters actually needed somewhere?
> 8.0 is to fix this problem for IP and TCP layers, and ideally also ifnet
> but we'll see. We should be maintaining those stats per-CPU and then
> aggregating to report them to userspace. This is what we already do for
> a number of system stats -- UMA and kernel malloc, syscall and trap
> counters, etc.
How magic is this? Is it just a matter of declaring mystatarray[NCPU]
and updating mystatarray[current_cpu], or (probably) should the spacing
between array elements be magically fixed so two elements don't share a
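A minimal user-space sketch of the per-CPU layout being asked about, under stated assumptions: TOY_NCPU and TOY_CACHE_LINE are made-up constants (8 CPUs, 64-byte lines), the toy_* names are hypothetical, and how the caller obtains a stable CPU index is left out entirely. It only shows the two ingredients discussed here: one slot per CPU, each padded to its own cache line, and a sum taken when the value is reported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NCPU	8	/* assumed number of CPUs */
#define TOY_CACHE_LINE	64	/* assumed cache line size in bytes */

/*
 * One counter slot per CPU. Aligning the member to the cache line size
 * also rounds sizeof(struct toy_pcpu_counter) up to that size, so
 * adjacent array elements never share a line.
 */
struct toy_pcpu_counter {
	_Alignas(TOY_CACHE_LINE) uint64_t value;
};

static struct toy_pcpu_counter toy_stat[TOY_NCPU];

/* Update only the slot belonging to the CPU the caller is running on. */
static void
toy_stat_inc(unsigned cpu)
{
	assert(cpu < TOY_NCPU);
	toy_stat[cpu].value++;
}

/* Aggregate all slots when the statistic is reported. */
static uint64_t
toy_stat_read(void)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < TOY_NCPU; i++)
		sum += toy_stat[i].value;
	return (sum);
}
```

The design trade-off is memory for contention: each counter costs a full cache line per CPU, and a reader sees only an approximate snapshot while writers are active, but no two CPUs ever write to the same line.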
>>> - Use cpuset to pin ithreads, the netisr, and whatever else, to specific
>>> CPUs so that they don't migrate, and if your system uses HTT, experiment
>>> with pinning the ithread and the netisr on different threads on the same
>>> core, or at least, different cores on the same die.
>> I'm using em hardware; I still think there's a possibility I'm
>> fighting the driver in some cases but this has priority #2.
> Have you tried LOCK_PROFILING? It would quickly tell you if driver
> locks were a source of significant contention. It works quite well...
I don't think I'm fighting against locking artifacts, it looks more like
some kind of overly smart hardware thing, like interrupt moderation (but
not exactly interrupt moderation since the number of IRQs/s remains
approx. the same).
>>> - If your card supports RSS, pass the flowid up the stack in the mbuf
>>> header flowid field, and use that instead of the hash for work
>> Don't know about em. Don't really want to touch it if I don't have to :)
> if_em doesn't support it, but if_igb does. If this saves you a minimum
> of one and possibly two cache misses per packet, it could be a huge
> performance improvement.
If I had the funds to upgrade hardware, I wouldn't be so interested in
solving it in software :)
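For reference, the flowid suggestion quoted above can be sketched roughly as follows. This is an illustration only: struct toy_pkt, toy_hash and toy_pick_cpu are hypothetical names, not the mbuf header or any netisr interface, and FNV-1a is swapped in as a stand-in for whatever software hash the patch uses. The point is that a hardware-supplied flow ID lets the dispatcher choose a CPU without reading the packet headers at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet descriptor; not the kernel's mbuf packet header. */
struct toy_pkt {
	int has_flowid;			/* nonzero if the NIC supplied a flow ID */
	uint32_t flowid;		/* flow ID computed by the hardware */
	const unsigned char *hdr;	/* header bytes, used for the fallback */
	size_t hdr_len;
};

/* Software fallback: 32-bit FNV-1a over the header bytes. */
static uint32_t
toy_hash(const unsigned char *p, size_t n)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return (h);
}

/*
 * Pick the CPU that should process this packet. With a hardware flow ID
 * the header bytes are never read here, which is where the saved cache
 * misses would come from.
 */
static unsigned
toy_pick_cpu(const struct toy_pkt *p, unsigned ncpu)
{
	uint32_t h;

	assert(ncpu > 0);
	h = p->has_flowid ? p->flowid : toy_hash(p->hdr, p->hdr_len);
	return (h % ncpu);
}
```

Either path maps all packets of one flow to the same CPU, which keeps per-flow ordering; only the cost of computing the key differs.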