Advice on a multithreaded netisr patch?
barney_cordoba at yahoo.com
Mon Apr 6 08:53:17 PDT 2009
--- On Mon, 4/6/09, Ivan Voras <ivoras at freebsd.org> wrote:
> From: Ivan Voras <ivoras at freebsd.org>
> Subject: Re: Advice on a multithreaded netisr patch?
> To: freebsd-net at freebsd.org
> Date: Monday, April 6, 2009, 8:35 AM
> Robert Watson wrote:
> > On Mon, 6 Apr 2009, Ivan Voras wrote:
> >> So, a mbuf can reference data not yet copied from the NIC
> >> hardware? I'm specifically trying to understand what m_pullup()
> >> does.
> > I think we're talking slightly at cross purposes. There are two
> > transfers of interest:
> > (1) DMA of the packet data to main memory from the NIC
> > (2) Servicing of CPU cache misses to access data in main memory
> > By the time you receive an interrupt, the DMA is complete, so once you
> OK, this was what was confusing me - for a moment I thought you meant
> it's not so.
> > believe a packet referenced by the descriptor ring is done, you
> > don't have to wait for DMA. However, the packet data is in main
> > memory rather than your CPU cache, so you'll need to take a cache
> > miss in order to retrieve it. You don't want to prefetch before you
> > know the packet data is there, or you may prefetch stale data from
> > the previous packet sent or received from the cluster.
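(To illustrate that ordering: an RX loop should only read or prefetch
a buffer once the NIC has set the descriptor's done bit. A rough
sketch - the descriptor layout and names here are made up, not if_em's:)

    #include <stdint.h>

    /* Hypothetical RX descriptor layout; real NICs differ. */
    struct rx_desc {
            volatile uint32_t status;       /* NIC sets RXD_DONE after DMA */
            uint32_t          length;
    };
    #define RXD_DONE 0x01

    static void
    process_packet(const char *data, uint32_t len)
    {
            (void)data; (void)len;          /* real work elsewhere */
    }

    void
    rx_poll(struct rx_desc *ring, char **bufs, int nslots)
    {
            for (int i = 0; i < nslots; i++) {
                    if ((ring[i].status & RXD_DONE) == 0)
                            break;  /* DMA not done: don't even prefetch
                                       bufs[i], it may hold stale data */
                    /* Safe to read; this is where the cache miss hits. */
                    process_packet(bufs[i], ring[i].length);
            }
    }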
> > m_pullup() has to do with mbuf chain memory contiguity during
> > packet processing. The usual usage is something along the following
> > lines:
> >     struct whatever *w;
> >
> >     m = m_pullup(m, sizeof(*w));
> >     if (m == NULL)
> >             return;
> >     w = mtod(m, struct whatever *);
> > m_pullup() here ensures that the first sizeof(*w) bytes of mbuf
> > data are contiguously stored so that the cast of w to m's data will
> > point at a
> So, m_pullup() can resize / realloc() the mbuf? (not that it matters
> for this purpose)
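(Yes - and note that on success m_pullup() may return a *different*
mbuf, and on failure it frees the chain, so the old pointer must never
be reused. A fuller sketch of the idiom; the header struct and function
name are invented:)

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /* Hypothetical protocol header, for illustration only. */
    struct myproto_hdr {
            uint16_t mp_type;
            uint16_t mp_len;
    };

    static void
    myproto_input(struct mbuf *m)
    {
            struct myproto_hdr *hdr;

            /*
             * Ensure the first sizeof(*hdr) bytes are contiguous.
             * m_pullup() may allocate a new mbuf; on failure it frees
             * the chain and returns NULL.
             */
            if (m->m_len < (int)sizeof(*hdr)) {
                    m = m_pullup(m, sizeof(*hdr));
                    if (m == NULL)
                            return;         /* chain already freed */
            }
            hdr = mtod(m, struct myproto_hdr *);
            /* ... dispatch on hdr->mp_type ... */
            m_freem(m);
    }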
> > Is this for the loopback workload? If so, remember that there may
> > be some other things going on:
> Both loopback and physical.
> > - Every packet is processed at least two times: once when sent, and
> >   then again when it's received.
> > - A TCP segment will need to be ACK'd, so if you're sending data in
> >   chunks in one direction, the ACKs will not be piggy-backed on
> >   existing data transfers, and instead be sent independently,
> >   hitting the network stack two more times.
> No combination of these can make an accounting difference between
> 1,000 and 250,000 pps. I must be hitting something very bad here.
> > - Remember that TCP works to expand its window, and then maintains
> >   the highest performance it can by bumping up against the top of
> >   available bandwidth continuously. This involves detecting buffer
> >   limits by generating packets that can't be sent, adding to the
> >   packet count. With loopback traffic, the drop point occurs when
> >   you exceed the size of the netisr's queue for IP, so you might try
> >   bumping that from the default to something much larger.
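(For reference, the knob in question should be the IP interrupt queue
sysctl - double-check the name and default on your version:)

    sysctl net.inet.ip.intr_queue_maxlen         # current queue depth
    sysctl net.inet.ip.intr_queue_drops          # nonzero means overflow
    sysctl net.inet.ip.intr_queue_maxlen=4096    # try something larger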
> My messages are approx. 100 +/- 10 bytes. No practical way they will
> even span multiple mbufs. TCP_NODELAY is on.
> > No. x++ is massively slow if executed in parallel across many cores
> > on a variable in a single cache line. See my recent commit to
> > kern_tc.c for an example: the updating of trivial statistics for the
> > kernel time calls reduced 30m syscalls/second to 3m syscalls/second
> > due to heavy contention on the cache line holding the statistic. One
> > of my goals for
> I don't get it: you replaced x++ with no-ops if TC_COUNTER is
> defined? Aren't the timecounters actually needed somewhere?
> > 8.0 is to fix this problem for IP and TCP layers, and ideally also
> > ifnet but we'll see. We should be maintaining those stats per-CPU
> > and then aggregating to report them to userspace. This is what we
> > already do for a number of system stats -- UMA and kernel malloc,
> > syscall and trap counters, etc.
> How magic is this? Is it just a matter of declaring and updating
> mystat[current_cpu], or (more probably) do the array elements need to
> be spaced so that two don't share a cache line?
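(It's mostly that simple, and yes, the padding matters: each CPU's
slot has to sit in its own cache line or the ping-ponging comes right
back. A minimal sketch, with an illustrative line size and CPU count:)

    #include <stdint.h>

    #define NCPU            16      /* illustrative; use the real count */
    #define CACHE_LINE_SIZE 64      /* common on amd64, not universal */

    /* One counter per CPU, padded to a full cache line. */
    struct percpu_counter {
            uint64_t        val;
            char            pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    static struct percpu_counter stats[NCPU];

    /* Hot path: each CPU writes only its own line -- no contention. */
    static inline void
    stat_inc(int cpu)
    {
            stats[cpu].val++;
    }

    /* Cold path (e.g. a sysctl handler): aggregate for userspace. */
    static uint64_t
    stat_read(void)
    {
            uint64_t sum = 0;

            for (int i = 0; i < NCPU; i++)
                    sum += stats[i].val;
            return (sum);
    }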
> >>> - Use cpuset to pin ithreads, the netisr, and whatever else, to
> >>>   specific cores so that they don't migrate, and if your system
> >>>   uses HTT, experiment with pinning the ithread and the netisr on
> >>>   different threads on the same core, or at least, different
> >>>   cores on the same die.
> >> I'm using em hardware; I still think there's a possibility I'm
> >> fighting the driver in some cases but this has priority #2.
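(For the pinning itself, cpuset(1) can do it from the shell with
cpuset -l <cpu> -t <tid>, or from code via cpuset_setaffinity(2),
roughly like this - the cpu number is arbitrary:)

    #include <sys/param.h>
    #include <sys/cpuset.h>

    #include <err.h>

    /* Pin the calling thread to a single CPU. */
    static void
    pin_to_cpu(int cpu)
    {
            cpuset_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            /* id -1 means "the calling thread" with CPU_WHICH_TID. */
            if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(set), &set) != 0)
                    err(1, "cpuset_setaffinity");
    }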
> > Have you tried LOCK_PROFILING? It would quickly tell you if driver
> > locks were a source of significant contention. It works quite
> > well...
> I don't think I'm fighting against locking artifacts; it looks more
> like some kind of overly smart hardware thing, like interrupt
> moderation (but not exactly interrupt moderation, since the number of
> IRQs/s is approx. the same).
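(If you do want to try it: from memory it's a kernel option plus a
couple of sysctls - see LOCK_PROFILING(9) for the authoritative names:)

    options LOCK_PROFILING              # in the kernel config, then:

    sysctl debug.lock.prof.enable=1     # start collecting
    # ... run the workload ...
    sysctl debug.lock.prof.stats        # per-acquisition-point totals
    sysctl debug.lock.prof.enable=0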
> >>> - If your card supports RSS, pass the flowid up the stack in the
> >>>   mbuf packet header flowid field, and use that instead of the
> >>>   hash for work placement.
> >> Don't know about em. Don't really want to touch it if I don't
> >> have to :)
> > if_em doesn't support it, but if_igb does. If this saves you a
> > minimum of one and possibly two cache misses per packet, it could be
> > a huge performance improvement.
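(For what it's worth, consuming the flowid would look something like
this in a dispatch path - the flag and field are as in 8-CURRENT's
mbuf header, and the queue selection is invented:)

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /* Pick a worker queue for an inbound packet. */
    static u_int
    pick_queue(struct mbuf *m, u_int nqueues,
        u_int (*soft_hash)(struct mbuf *))
    {
            u_int id;

            if (m->m_flags & M_FLOWID)
                    id = m->m_pkthdr.flowid;  /* RSS hash from the NIC */
            else
                    id = soft_hash(m);        /* software fallback */
            return (id % nqueues);
    }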
There is no advantage to using if_igb. While the cards support more
features, the FreeBSD driver barely functions, and there is no
multiqueue support. Don't waste your money on one.