igb(4) at peak in big purple

Barney Cordoba barney_cordoba at yahoo.com
Tue May 1 18:13:10 UTC 2012



--- On Fri, 4/27/12, Juli Mallett <jmallett at FreeBSD.org> wrote:

> From: Juli Mallett <jmallett at FreeBSD.org>
> Subject: Re: igb(4) at peak in big purple
> To: "Sean Bruno" <seanbru at yahoo-inc.com>
> Cc: "freebsd-net at freebsd.org" <freebsd-net at freebsd.org>
> Date: Friday, April 27, 2012, 4:00 PM
> On Fri, Apr 27, 2012 at 12:29, Sean Bruno <seanbru at yahoo-inc.com> wrote:
> > On Thu, 2012-04-26 at 11:13 -0700, Juli Mallett wrote:
> >> Queue splitting in Intel cards is done using a hash of protocol
> >> headers, so this is expected behavior.  This also helps with TCP
> >> and UDP performance, in terms of keeping packets for the same
> >> protocol control block on the same core, but for other
> >> applications it's not ideal.  If your application does not require
> >> that kind of locality, there are things that can be done in the
> >> driver to make it easier to balance packets between all queues
> >> about-evenly.
> >
> > Oh? :-)
> >
> > What should I be looking at to balance more evenly?
> 
> Dirty hacks are involved :)  I've sent some code to Luigi that I
> think would make sense in netmap (since for many tasks one's going to
> do with netmap, you want to use as many cores as possible, and maybe
> don't care about locality so much) but it could be useful in
> conjunction with the network stack, too, for tasks that don't need a
> lot of locality.
> 
> Basically this is the deal: the Intel NICs hash various header
> fields.  Then, some bits from that hash are used to index a table.
> That table indicates what queue the received packet should go to.
> Ideally you'd want to use some sort of counter to index that table
> and get round-robin queue usage if you wanted to evenly saturate all
> cores.  Unfortunately there doesn't seem to be a way to do that.
> 
> What you can do, though, is regularly update the table that is
> indexed by hash.  Very frequently, in fact; it's a pretty fast
> operation.  So what I've done, for example, is to go through and
> rotate all of the entries every N packets, where N is something like
> the number of receive descriptors per queue divided by the number of
> queues.  So bucket 0 goes to queue 0 and bucket 1 goes to queue 1 at
> first.  Then a few hundred packets are received, and the table is
> reprogrammed, so now bucket 0 goes to queue 1 and bucket 1 goes to
> queue 0.
> 
> I can provide code to do this, but I don't want to post it publicly
> (unless it is actually going to become an option for netmap) for fear
> that people will use it in scenarios where it's harmful and then
> complain.  It's potentially one more painful variation for the Intel
> drivers that Intel can't support, and that just makes everyone
> miserable.
> 
> Thanks,
> Juli.
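
To make that concrete, below is a minimal user-space model of the rotation
described above, not Juli's actual patch; the table size, queue count, and
descriptor count are only illustrative. In the real driver the updated table
would be written back into the NIC's redirection registers instead of living
in a plain array.

#include <stdint.h>
#include <stdio.h>

#define RETA_ENTRIES    128   /* indirection table entries */
#define NUM_QUEUES      4     /* receive queues in use */
#define RX_DESC_PER_Q   1024  /* receive descriptors per queue */

/* Reprogram roughly every "descriptors per queue / queues" packets. */
#define ROTATE_INTERVAL (RX_DESC_PER_Q / NUM_QUEUES)

static uint8_t reta[RETA_ENTRIES];

/* Initial mapping: bucket i -> queue (i % NUM_QUEUES). */
static void
reta_init(void)
{
    for (int i = 0; i < RETA_ENTRIES; i++)
        reta[i] = i % NUM_QUEUES;
}

/*
 * Rotate every bucket to the next queue.  In the driver this loop
 * would be followed by writes of the packed table back into the
 * hardware's redirection registers; here we only touch the array.
 */
static void
reta_rotate(void)
{
    for (int i = 0; i < RETA_ENTRIES; i++)
        reta[i] = (reta[i] + 1) % NUM_QUEUES;
}

int
main(void)
{
    unsigned long rx_packets = 0;

    reta_init();
    printf("before: bucket 0 -> queue %d, bucket 1 -> queue %d\n",
        reta[0], reta[1]);

    /* Simulate one burst of received packets. */
    for (int pkt = 0; pkt < ROTATE_INTERVAL; pkt++) {
        if (++rx_packets % ROTATE_INTERVAL == 0)
            reta_rotate();
    }

    printf("after %lu packets: bucket 0 -> queue %d, bucket 1 -> queue %d\n",
        rx_packets, reta[0], reta[1]);
    return (0);
}

Every hash bucket eventually visits every queue, so a handful of heavy flows
can no longer pin one core; the cost is the reordering discussed in the reply
below.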

That seems like a pretty naive approach. First, you want all of the packets
in the same flows/connections to use the same channels, otherwise you'll be
sending a lot of stuff out of sequence. You want to balance your flows, yes,
but not balance based on packets, unless all of your traffic is ICMP. You
also want to balance bits, not packets; sending fifty 60-byte packets to
queue 1 and fifty 1500-byte packets to queue 2 isn't balancing. They'll be
wildly out of order as well.
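
For contrast, here is a sketch of the kind of balancing argued for above; all
names and sizes are made up for illustration. A flow keeps whatever queue it
was first assigned, so packets within a connection stay in order, and a new
flow goes to the queue that has carried the fewest bytes so far, not the
fewest packets.

#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES      4
#define FLOW_TABLE_SIZE 4096

struct flow_entry {
    uint64_t key;    /* flow hash; 0 means the slot is unused */
    int      queue;  /* queue this flow is pinned to */
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];
static uint64_t queue_bytes[NUM_QUEUES];   /* running byte counters */

/* Cheap stand-in for the NIC's hash of the flow's addresses and ports. */
static uint64_t
flow_hash(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
{
    uint64_t h = ((uint64_t)saddr << 32) ^ daddr ^
        ((uint64_t)sport << 16) ^ dport;

    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return (h ? h : 1);     /* reserve 0 for "unused" */
}

/*
 * Existing flows keep their queue (no reordering); a new flow goes to
 * the queue with the fewest bytes, so fifty 1500-byte packets weigh
 * more than fifty 60-byte ones.  Hash collisions simply restart a slot.
 */
static int
pick_queue(uint64_t key, uint32_t pktlen)
{
    struct flow_entry *fe = &flow_table[key % FLOW_TABLE_SIZE];

    if (fe->key != key) {
        int best = 0;
        for (int q = 1; q < NUM_QUEUES; q++)
            if (queue_bytes[q] < queue_bytes[best])
                best = q;
        fe->key = key;
        fe->queue = best;
    }
    queue_bytes[fe->queue] += pktlen;
    return (fe->queue);
}

int
main(void)
{
    /* One flow of small packets, one of MTU-sized packets. */
    uint64_t small = flow_hash(0x0a000001, 0x0a000002, 1234, 80);
    uint64_t bulk  = flow_hash(0x0a000003, 0x0a000002, 5678, 80);

    for (int i = 0; i < 50; i++) {
        pick_queue(small, 60);
        pick_queue(bulk, 1500);
    }
    for (int q = 0; q < NUM_QUEUES; q++)
        printf("queue %d: %llu bytes\n", q,
            (unsigned long long)queue_bytes[q]);
    return (0);
}

A real implementation would also have to age out idle flows and handle hash
collisions less crudely, but the ordering and byte-weighting properties are
the point here.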

Also, using as many cores as possible isn't necessarily what you want to
do, depending on your architecture. If you have 8 cores on 2 cpus, then you
probably want to do all of your networking on four cores on one cpu. There's
a big price to pay for shuffling memory between the caches of separate cpus,
so splitting transactions that use the same memory space is counterproductive.
More queues mean more locks, and in the end, lock contention is your biggest
enemy, not cpu cycles.
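
In a userland packet-processing app (netmap-style, say) that placement can be
expressed with FreeBSD's cpuset machinery. A minimal sketch, assuming cores
0-3 sit on one package (kern.sched.topology_spec reports the real layout) and
one worker thread per receive queue:

#include <sys/param.h>
#include <sys/cpuset.h>

#include <err.h>
#include <pthread.h>
#include <pthread_np.h>

#define NET_QUEUES 4    /* one worker per receive queue */

static void *
rx_worker(void *arg)
{
    (void)arg;
    /* the per-queue receive loop would live here */
    return (NULL);
}

/* Pin one worker to one of the cores reserved for networking (0-3). */
static void
pin_to_net_core(pthread_t td, int queue)
{
    cpuset_t set;
    int error;

    CPU_ZERO(&set);
    CPU_SET(queue % NET_QUEUES, &set);

    error = pthread_setaffinity_np(td, sizeof(set), &set);
    if (error != 0)
        errc(1, error, "pthread_setaffinity_np");
}

int
main(void)
{
    pthread_t td[NET_QUEUES];
    int error;

    for (int q = 0; q < NET_QUEUES; q++) {
        error = pthread_create(&td[q], NULL, rx_worker, NULL);
        if (error != 0)
            errc(1, error, "pthread_create");
        pin_to_net_core(td[q], q);
    }
    for (int q = 0; q < NET_QUEUES; q++)
        pthread_join(td[q], NULL);
    return (0);
}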

Splitting packets that use the same memory and code space among cpus isn't
a very good idea; a better approach, assuming you can micromanage, is to
allocate X cores (as many as you need for your peaks) to networking, and use
the other cores for user space to minimize the interruptions.
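
The other half of that split, keeping the application off the networking
cores, is a few lines with cpuset_setaffinity(2). A sketch assuming the
networking side owns cores 0-3 of an 8-core box:

#include <sys/param.h>
#include <sys/cpuset.h>

#include <err.h>
#include <unistd.h>

/*
 * Restrict the current process to cores 4-7 and then exec the real
 * application, leaving cores 0-3 to the networking threads and
 * interrupts.
 */
int
main(int argc, char **argv)
{
    cpuset_t set;

    if (argc < 2)
        errx(1, "usage: %s command [args ...]", argv[0]);

    CPU_ZERO(&set);
    for (int cpu = 4; cpu <= 7; cpu++)
        CPU_SET(cpu, &set);

    if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
        sizeof(set), &set) == -1)
        err(1, "cpuset_setaffinity");

    execvp(argv[1], argv + 1);
    err(1, "execvp");
}

The same partition can be set up from the shell with cpuset(1); the point is
that the split is explicit instead of letting the scheduler bounce everything
across both packages.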

BC

