Advice on a multithreaded netisr patch?

Barney Cordoba barney_cordoba at yahoo.com
Tue Apr 7 05:12:06 PDT 2009

--- On Mon, 4/6/09, Robert Watson <rwatson at FreeBSD.org> wrote:

> From: Robert Watson <rwatson at FreeBSD.org>
> Subject: Re: Advice on a multithreaded netisr patch?
> To: "Ivan Voras" <ivoras at freebsd.org>
> Cc: freebsd-net at freebsd.org
> Date: Monday, April 6, 2009, 7:59 AM
> On Mon, 6 Apr 2009, Ivan Voras wrote:
> 
> >>> I'd like to understand more. If (in netisr) I have an mbuf with
> >>> headers, is this data already transferred from the card or is it
> >>> magically "not here yet"?
> >> 
> >> A lot depends on the details of the card and driver.  The driver
> >> will take cache misses on the descriptor ring entry, if it's not
> >> already in cache, and the link layer will take a cache miss on the
> >> front of the ethernet frame in the cluster pointed to by the mbuf
> >> header as part of its demux.  What happens next depends on your
> >> dispatch model and cache line size.  Let's make a few simplifying
> >> assumptions that are mostly true:
> > 
> > So, an mbuf can reference data not yet copied from the NIC hardware?
> > I'm specifically trying to understand what m_pullup() does.
> 
> I think we're talking slightly at cross purposes.  There are two
> transfers of interest:
> 
> (1) DMA of the packet data to main memory from the NIC
> (2) Servicing of CPU cache misses to access data in main memory
> 
> By the time you receive an interrupt, the DMA is complete, so once you
> believe a packet referenced by the descriptor ring is done, you don't
> have to wait for DMA.  However, the packet data is in main memory
> rather than your CPU cache, so you'll need to take a cache miss in
> order to retrieve it.  You don't want to prefetch before you know the
> packet data is there, or you may prefetch stale data from the previous
> packet sent or received from the cluster.
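> 
> In driver terms the shape is something like this (a sketch only: the
> descriptor field and flag names are made up in the style of em(4),
> not taken from any real driver):
> 
> 	/* rxd and m come from the driver's receive ring context. */
> 	if ((rxd->status & RXD_STAT_DD) == 0)
> 		return;		/* DMA not done; don't touch or prefetch */
> 
> 	/* The cluster contents are now valid; warming the cache is safe. */
> 	__builtin_prefetch(mtod(m, void *));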
> 
> m_pullup() has to do with mbuf chain memory contiguity during packet
> processing.  The usual usage is something along the following lines:
> 
> 	struct whatever *w;
> 
> 	/* Make the first sizeof(*w) bytes contiguous in one mbuf. */
> 	m = m_pullup(m, sizeof(*w));
> 	if (m == NULL)
> 		return;		/* m_pullup() frees the chain on failure */
> 	w = mtod(m, struct whatever *);	/* now safe to overlay the struct */
> 
> m_pullup() here ensures that the first sizeof(*w) bytes of mbuf data
> are stored contiguously, so that w, cast from m's data pointer, points
> at a complete structure we can use to interpret packet data.  In the
> common case in the receive path, m_pullup() should be a no-op, since
> almost all drivers receive data in a single cluster.
> 
> However, there are cases where that doesn't hold, such as loopback
> traffic where an unusual encapsulation leads to a call to M_PREPEND()
> that inserts a new mbuf at the front of the chain, which is later
> m_defrag()'d, leaving a higher-level header crossing an mbuf boundary,
> or the like.
> 
> This issue is almost entirely independent of things like the cache
> line miss issue, unless you hit the uncommon case of having to do work
> in m_pullup(), in which case life sucks.
> 
> It would be useful to use DTrace to profile the m_foo() functions that
> do real work, to make sure we're not hitting them in normal workloads,
> btw.
> 
> >>> As the card and the OS can already process many packets per second
> >>> for something fairly complex like routing
> >>> (http://www.tancsa.com/blast.html), and TCP chokes swi:net at 100%
> >>> of a core, isn't this an indication there's certainly more room
> >>> for improvement even with single-queue old-fashioned NICs?
> >> 
> >> Maybe.  It depends on the relative costs of local processing vs
> >> redistributing the work, which involves schedulers, IPIs, additional
> >> cache misses, lock contention, and so on.  This means there's a
> >> period where it can't possibly be a win, and then at some point it's
> >> a win as long as the stack scales.  This is essentially the usual
> >> trade-off in using threads and parallelism: does the benefit of
> >> multiple parallel execution units make up for the overheads of
> >> synchronization and data migration?
> > 
> > Do you have any idea at all why I'm seeing the weird difference
> > between netstat packets per second (250,000) and my application's
> > TCP performance (< 1,000 pps)?  Summary: each packet is guaranteed
> > to be a whole message causing a transaction in the application -
> > without the changes I see pps almost identical to tps.  Even if the
> > source of netstat statistics somehow manages to count packets
> > multiple times (I don't see how that can happen), no relation can
> > describe differences this huge.  It almost looks like something in
> > the upper layers is discarding packets (also not likely: TCP
> > timeouts would occur and the application wouldn't be able to push
> > 250,000 pps) - but what?  Where to look?
> 
> Is this for the loopback workload?  If so, remember that there may be
> some other things going on:
> 
> - Every packet is processed at least two times: once when sent, and
>   then again when it's received.
> 
> - A TCP segment will need to be ACK'd, so if you're sending data in
>   chunks in one direction, the ACKs will not be piggy-backed on
>   existing data transfers, and instead will be sent independently,
>   hitting the network stack two more times.
> 
> - Remember that TCP works to expand its window, and then maintains the
>   highest performance it can by bumping up against the top of
>   available bandwidth continuously.  This involves detecting buffer
>   limits by generating packets that can't be sent, adding to the
>   packet count.  With loopback traffic, the drop point occurs when you
>   exceed the size of the netisr's queue for IP, so you might try
>   bumping that from the default to something much larger -- see the
>   example just below.
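> 
>   For example (sysctl names as in 7.x; the right value depends on the
>   workload, and intr_queue_drops tells you whether you were dropping):
> 
> 	sysctl net.inet.ip.intr_queue_maxlen=4096
> 	sysctl net.inet.ip.intr_queue_drops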
> 
> And nothing beats using tcpdump -- have you tried tcpdumping the
> loopback to see what is actually being sent?  If not, that's always
> educational -- perhaps something weird is going on with delayed ACKs,
> etc.
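> 
> Something along these lines, for instance (substitute your
> application's port for 8080):
> 
> 	tcpdump -i lo0 -n -s 0 port 8080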
> 
> > You mean for the general code?  I purposely don't lock my statistics
> > variables because I'm not that interested in exact numbers (orders
> > of magnitude are relevant).  As far as I understand, unlocked "x++"
> > should be trivially fast in this case?
> 
> No.  x++ is massively slow if executed in parallel across many cores
> on a variable in a single cache line.  See my recent commit to
> kern_tc.c for an example: updating trivial statistics for the kernel
> time calls reduced 30m syscalls/second to 3m syscalls/second due to
> heavy contention on the cache line holding the statistic.  One of my
> goals for 8.0 is to fix this problem for the IP and TCP layers, and
> ideally also ifnet, but we'll see.  We should be maintaining those
> stats per-CPU and then aggregating them to report to userspace.  This
> is what we already do for a number of system stats -- UMA and kernel
> malloc, syscall and trap counters, etc.
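> 
> The shape of the idea, as a sketch (not the actual code -- MAXCPU,
> curcpu, and CACHE_LINE_SIZE are the kernel's; the rest is made up):
> 
> 	/* One counter slot per CPU, padded so no two share a cache line. */
> 	struct pcpu_counter {
> 		uint64_t	val;
> 		char		pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
> 	} pkt_stats[MAXCPU];
> 
> 	/* Fast path: each CPU touches only its own line -- no contention. */
> 	pkt_stats[curcpu].val++;
> 
> 	/* Slow path, e.g. a sysctl handler: sum across CPUs for userspace. */
> 	uint64_t total = 0;
> 	for (int i = 0; i < MAXCPU; i++)
> 		total += pkt_stats[i].val;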
> 
> >> - Use cpuset to pin ithreads, the netisr, and whatever else, to
> >>   specific cores so that they don't migrate, and if your system
> >>   uses HTT, experiment with pinning the ithread and the netisr on
> >>   different threads on the same core, or at least, different cores
> >>   on the same die.
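> >>
> >>   For example (thread IDs can be found with procstat -t; the CPU
> >>   numbers are just illustrative):
> >>
> >>	cpuset -l 0 -t <ithread tid>
> >>	cpuset -l 1 -t <netisr tid>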
> > 
> > I'm using em hardware; I still think there's a possibility I'm
> > fighting the driver in some cases but this has priority #2.
> 
> Have you tried LOCK_PROFILING?  It would quickly tell you if driver
> locks were a source of significant contention.  It works quite well...
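> 
> With the option compiled in, the usual dance is something like this
> (sysctl names per LOCK_PROFILING(9); check your release):
> 
> 	sysctl debug.lock.prof.enable=1
> 	# ... run the workload ...
> 	sysctl debug.lock.prof.enable=0
> 	sysctl debug.lock.prof.stats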

When I enabled LOCK_PROFILING, my separately built modules, such as
if_igb, stopped working. It seems that the ifnet structure or something
else changed with that option enabled. Is there a way to keep them in
sync without having to integrate everything into a specific kernel
build?

Barney
