pf performance?

Barney Cordoba barney_cordoba at yahoo.com
Sat Apr 27 03:01:53 UTC 2013


--- On Fri, 4/26/13, Erich Weiler <weiler at soe.ucsc.edu> wrote:

> From: Erich Weiler <weiler at soe.ucsc.edu>
> Subject: Re: pf performance?
> To: "Andre Oppermann" <andre at freebsd.org>
> Cc: "Paul Tatarsky" <paul at soe.ucsc.edu>, freebsd-net at freebsd.org
> Date: Friday, April 26, 2013, 12:04 PM
> >> But the work pf does would show up in 'system' on top right?  So
> >> if I see all my CPUs tied up 100% in 'interrupts' and very little
> >> 'system', would it be a reasonable assumption to think that if I
> >> got more CPU cores to handle the interrupts, eventually I would
> >> see 'system' load increase as the interrupt load was handled
> >> faster?  And thus increase my bandwidth?
> >
> > Having the work of pf show up in 'interrupts' or 'system' depends
> > on the network driver and how it handles sending packets up the
> > stack.  In most cases drivers deliver packets from interrupt
> > context.
> 
> Ah, I see.  Definitely appears for me in interrupts then.  I've got
> the mxge driver doing the work here.  So, given that I can spread
> out the interrupts to every core (like, pin an interrupt queue to
> each core), I can have all my cores work on the processing.  But
> seeing as the pf bit is still serialized, I'm not sure I understand
> how it is serialized when many CPUs are handling interrupts, and
> hence doing the work of pf as well?  Wouldn't that indicate that the
> work of pf is being handled by many cores, since many cores are
> handling the interrupts?
> 

You're thinking exactly backwards. You're creating lock contention by
having a bunch of receive processes feed into a single-threaded pf.

Think of it like a six-lane highway that has five lanes closed a mile up
the road. The result isn't that you go the same speed as a one-lane
highway; what you have is a parking lot. The only thing you're doing by
spreading the interrupts is using up more cycles on more cores.

What you *should* be doing, if you can engineer it, is using a single
path through the pf filter. You could have four queues feed one thread
that dequeues packets and runs them through the filter. The problem with
that is that the pf process IS the bottleneck, in that it's slower than
the receive processes, so you'd be better off using the other cores for
userland work. You could use cpuset to make sure that no userland
process runs on the interrupt core, and dedicate one CPU to packet
filtering. One modern CPU can easily handle a gig of traffic; there's
no need to spread the load in most cases.
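
Something like this, as a rough sketch (the CPU numbers, pid and daemon
path are placeholders for whatever your box actually runs):

    # start a userland job restricted to CPUs 1-7, leaving CPU 0
    # free for the interrupt/pf work
    cpuset -l 1-7 /usr/local/bin/some_daemon

    # or restrict an already-running process the same way
    cpuset -l 1-7 -p 1234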

BC


> Or are you saying that pf *is* being handled by many cores,
> but just in a serialized nature?  Like, packet 1 is
> handled by core 0, then packet 2 is handled by core 1,
> packet 3 is handled by core 4, etc?  Such that even
> though multiple cores are handling it, they are just doing
> so serially and not in parallel?  And if so, maybe it
> still helps to have many cores?
> 
> Thanks for all the excellent info!
> 
> >> In other words, until I see like 100% system usage in one core, I
> >> would have room to grow?
> >
> > You have room to grow if 'idle' is more than 0% and the interrupts
> > of the network cards are running on different cores.  If one core
> > gets all the interrupts, a second idle core doesn't get the chance
> > to help out.  IIRC the interrupt allocation to cores is done at
> > interrupt registration time or driver attach time.  It can be
> > re-distributed at run time on most architectures but I'm not sure
> > we have an easily accessible API for that.
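
For what it's worth, cpuset(1) can already move an interrupt at run
time with its -x flag, so the knob exists even if nothing applies it
automatically. The IRQ and CPU numbers here are placeholders; vmstat -i
shows the real mxge vectors:

    # rebind IRQ 257 onto CPU 0 at run time
    cpuset -l 0 -x 257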



