Advice on a multithreaded netisr patch?
barney_cordoba at yahoo.com
Mon Apr 6 08:53:17 PDT 2009
--- On Mon, 4/6/09, Ivan Voras <ivoras at freebsd.org> wrote:
> From: Ivan Voras <ivoras at freebsd.org>
> Subject: Re: Advice on a multithreaded netisr patch?
> To: freebsd-net at freebsd.org
> Date: Monday, April 6, 2009, 8:35 AM
> Robert Watson wrote:
> > On Mon, 6 Apr 2009, Ivan Voras wrote:
> >> So, a mbuf can reference data not yet copied from the NIC
> >> hardware? I'm specifically trying to understand what m_pullup()
> >> does.
> > I think we're talking slightly at cross purposes. There are two
> > transfers of interest:
> > (1) DMA of the packet data to main memory from the NIC
> > (2) Servicing of CPU cache misses to access data in main memory
> > By the time you receive an interrupt, the DMA is complete, so once you
> OK, this was what was confusing me - for a moment I thought you meant
> it's not so.
> > believe a packet referenced by the descriptor ring is done, you
> > don't have to wait for DMA. However, the packet data is in main
> > memory rather than your CPU cache, so you'll need to take a cache
> > miss in order to retrieve it. You don't want to prefetch before you
> > know the packet data is there, or you may prefetch stale data from
> > the previous packet sent or received from the cluster.
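(To illustrate that ordering: an RX loop should only read or prefetch
a buffer once the NIC has set the descriptor's done bit. A rough
sketch - the descriptor layout and names here are made up, not if_em's:)

    #include <stdint.h>

    /* Hypothetical RX descriptor layout; real NICs differ. */
    struct rx_desc {
            volatile uint32_t status;       /* NIC sets RXD_DONE after DMA */
            uint32_t          length;
    };
    #define RXD_DONE 0x01

    static void
    process_packet(const char *data, uint32_t len)
    {
            (void)data; (void)len;          /* real work elsewhere */
    }

    void
    rx_poll(struct rx_desc *ring, char **bufs, int nslots)
    {
            for (int i = 0; i < nslots; i++) {
                    if ((ring[i].status & RXD_DONE) == 0)
                            break;  /* DMA not done: don't even prefetch
                                       bufs[i], it may hold stale data */
                    /* Safe to read; this is where the cache miss hits. */
                    process_packet(bufs[i], ring[i].length);
            }
    }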
> > m_pullup() has to do with mbuf chain memory contiguity during
> > packet processing. The usual usage is something along the following
> > lines:
> >     struct whatever *w;
> >
> >     m = m_pullup(m, sizeof(*w));
> >     if (m == NULL)
> >             return;
> >     w = mtod(m, struct whatever *);
> > m_pullup() here ensures that the first sizeof(*w) bytes of mbuf
> > data are contiguously stored so that the cast of w to m's data will
> > point at a
> So, m_pullup() can resize / realloc() the mbuf? (not that it matters
> for this purpose)
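(Yes - and note that on success m_pullup() may return a *different*
mbuf, and on failure it frees the chain, so the old pointer must never
be reused. A fuller sketch of the idiom; the header struct and function
name are invented:)

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /* Hypothetical protocol header, for illustration only. */
    struct myproto_hdr {
            uint16_t mp_type;
            uint16_t mp_len;
    };

    static void
    myproto_input(struct mbuf *m)
    {
            struct myproto_hdr *hdr;

            /*
             * Ensure the first sizeof(*hdr) bytes are contiguous.
             * m_pullup() may allocate a new mbuf; on failure it frees
             * the chain and returns NULL.
             */
            if (m->m_len < (int)sizeof(*hdr)) {
                    m = m_pullup(m, sizeof(*hdr));
                    if (m == NULL)
                            return;         /* chain already freed */
            }
            hdr = mtod(m, struct myproto_hdr *);
            /* ... dispatch on hdr->mp_type ... */
            m_freem(m);
    }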
> > Is this for the loopback workload? If so, remember that there may
> > be some other things going on:
> Both loopback and physical.
> > - Every packet is processed at least two times: once when sent, and
> >   then again when it's received.
> > - A TCP segment will need to be ACK'd, so if you're sending data in
> >   chunks in one direction, the ACKs will not be piggy-backed on
> >   existing data transfers, and instead be sent independently,
> >   hitting the network stack two more times.
> No combination of these can make an accounting difference between
> 1,000 and 250,000 pps. I must be hitting something very bad here.
> > - Remember that TCP works to expand its window, and then maintains
> >   the highest performance it can by bumping up against the top of
> >   available bandwidth continuously. This involves detecting buffer
> >   limits by generating packets that can't be sent, adding to the
> >   packet count. With loopback traffic, the drop point occurs when
> >   you exceed the size of the netisr's queue for IP, so you might try
> >   bumping that from the default to something much larger.
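(For reference, the knob in question should be the IP interrupt queue
sysctl - double-check the name and default on your version:)

    sysctl net.inet.ip.intr_queue_maxlen         # current queue depth
    sysctl net.inet.ip.intr_queue_drops          # nonzero means overflow
    sysctl net.inet.ip.intr_queue_maxlen=4096    # try something larger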
> My messages are approx. 100 +/- 10 bytes. No practical way they will
> even span multiple mbufs. TCP_NODELAY is on.
> > No. x++ is massively slow if executed in parallel across many cores
> > on a variable in a single cache line. See my recent commit to
> > kern_tc.c for an example: the updating of trivial statistics for the
> > kernel time calls reduced 30m syscalls/second to 3m syscalls/second
> > due to heavy contention on the cache line holding the statistic. One
> > of my goals for
> I don't get it: you replaced x++ with no-ops if TC_COUNTER is
> defined? Aren't the timecounters actually needed somewhere?
> > 8.0 is to fix this problem for IP and TCP layers, and ideally also
> > ifnet but we'll see. We should be maintaining those stats per-CPU
> > and then aggregating to report them to userspace. This is what we
> > already do for a number of system stats -- UMA and kernel malloc,
> > syscall and trap counters, etc.
> How magic is this? Is it just a matter of declaring and updating
> mystat[current_cpu], or (more probably) do the array elements need to
> be spaced so that two don't share a cache line?
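(It's mostly that simple, and yes, the padding matters: each CPU's
slot has to sit in its own cache line or the ping-ponging comes right
back. A minimal sketch, with an illustrative line size and CPU count:)

    #include <stdint.h>

    #define NCPU            16      /* illustrative; use the real count */
    #define CACHE_LINE_SIZE 64      /* common on amd64, not universal */

    /* One counter per CPU, padded to a full cache line. */
    struct percpu_counter {
            uint64_t        val;
            char            pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    static struct percpu_counter stats[NCPU];

    /* Hot path: each CPU writes only its own line -- no contention. */
    static inline void
    stat_inc(int cpu)
    {
            stats[cpu].val++;
    }

    /* Cold path (e.g. a sysctl handler): aggregate for userspace. */
    static uint64_t
    stat_read(void)
    {
            uint64_t sum = 0;

            for (int i = 0; i < NCPU; i++)
                    sum += stats[i].val;
            return (sum);
    }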
> >>> - Use cpuset to pin ithreads, the netisr, and whatever else, to
> >>>   specific cores so that they don't migrate, and if your system
> >>>   uses HTT, experiment with pinning the ithread and the netisr on
> >>>   different threads on the same core, or at least, different
> >>>   cores on the same die.
> >> I'm using em hardware; I still think there's a possibility I'm
> >> fighting the driver in some cases but this has priority #2.
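(For the pinning itself, cpuset(1) can do it from the shell with
cpuset -l <cpu> -t <tid>, or from code via cpuset_setaffinity(2),
roughly like this - the cpu number is arbitrary:)

    #include <sys/param.h>
    #include <sys/cpuset.h>

    #include <err.h>

    /* Pin the calling thread to a single CPU. */
    static void
    pin_to_cpu(int cpu)
    {
            cpuset_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            /* id -1 means "the calling thread" with CPU_WHICH_TID. */
            if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(set), &set) != 0)
                    err(1, "cpuset_setaffinity");
    }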
> > Have you tried LOCK_PROFILING? It would quickly tell you if driver
> > locks were a source of significant contention. It works quite
> > well...
> I don't think I'm fighting against locking artifacts; it looks more
> like some kind of overly smart hardware thing, like interrupt
> moderation (but not exactly interrupt moderation, since the number of
> IRQs/s is approx. the same).
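(If you do want to try it: from memory it's a kernel option plus a
couple of sysctls - see LOCK_PROFILING(9) for the authoritative names:)

    options LOCK_PROFILING              # in the kernel config, then:

    sysctl debug.lock.prof.enable=1     # start collecting
    # ... run the workload ...
    sysctl debug.lock.prof.stats        # per-acquisition-point totals
    sysctl debug.lock.prof.enable=0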
> >>> - If your card supports RSS, pass the flowid up the stack in the
> >>>   mbuf packet header flowid field, and use that instead of the
> >>>   hash for work placement.
> >> Don't know about em. Don't really want to touch it if I don't
> >> have to :)
> > if_em doesn't support it, but if_igb does. If this saves you a
> > minimum of one and possibly two cache misses per packet, it could be
> > a huge performance improvement.
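(For what it's worth, consuming the flowid would look something like
this in a dispatch path - the flag and field are as in 8-CURRENT's
mbuf header, and the queue selection is invented:)

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /* Pick a worker queue for an inbound packet. */
    static u_int
    pick_queue(struct mbuf *m, u_int nqueues,
        u_int (*soft_hash)(struct mbuf *))
    {
            u_int id;

            if (m->m_flags & M_FLOWID)
                    id = m->m_pkthdr.flowid;  /* RSS hash from the NIC */
            else
                    id = soft_hash(m);        /* software fallback */
            return (id % nqueues);
    }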
There is no advantage to using if_igb. While the cards support more
features, the FreeBSD driver barely functions, and there is no
multiqueue support. Don't waste your money on one.