[fbsd] Re: [fbsd] Network performance in a dual CPU system

Vlad GALU vladgalu at gmail.com
Thu Apr 27 15:06:22 UTC 2006


On 4/27/06, Robert Watson <rwatson at freebsd.org> wrote:
>
> On Thu, 27 Apr 2006, Jeremie Le Hen wrote:
>
> >> I missed the original thread, but in answer to the question: if you set
> >> net.isr.direct=1, then FreeBSD 6.x will run the netisr code in the ithread
> >> of the network device driver.  This will allow the IP forwarding and
> >> related paths to run in two threads instead of one, potentially allowing
> >> greater parallelism.  Of course, you also potentially contend more locks,
> >> and you may increase the time it takes for the ithread to respond to new
> >> interrupts, etc., so it's not quite cut and dried, but with a workload
> >> like the one shown above, it might make quite a difference.
> >
> > Actually you already replied in the original thread, explaining mostly
> > the same thing.
>
> :-)
>
> > BTW, what I understand is that net.isr.direct=1 prevents multiplexing all
> > packets onto the netisr thread and instead makes the ithread do the job.
> > In this case, what happens to the netisr thread?  Does it still have some
> > work to do, or is it removed?
>
> Yes -- basically, what this setting does is turn a deferred dispatch of the
> protocol level processing into a direct function invocation.  So instead of
> inserting the new IP packet into an IP processing queue from the ethernet code
> and waking up the netisr which calls the IP input routine, we directly call
> the IP input routine.  This has a number of potentially positive effects:
>
> - Avoid the queue/dequeue operation
> - Avoid a context switch
> - Allow greater parallelism since protocol layer processing is not limited to
>    the netisr thread
>
> It also has some downsides:
>
> - Perform more work in the ithread -- since any given thread is limited to a
>    single CPU's worth of processing resources, if the link layer and protocol
>    layer processing add up to more than one CPU, you slow them down
> - Increase the time it takes to pull packets out of the card -- we process
>    each packet to completion rather than pulling them out in sets and batching
>    them.  This pushes drops under overload into the card instead of the IP queue,
>    which has some benefits and some costs.
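>
> In rough, illustrative C (made-up names and a toy queue, nothing like the
> actual mbuf/netisr code), the two dispatch models look something like this:
>
>   #include <stdio.h>
>
>   struct pkt { int id; };
>
>   /* Stand-in for the protocol-layer input routine (e.g. IP input). */
>   static void ip_input_sketch(struct pkt *p)
>   {
>       printf("IP processing of packet %d\n", p->id);
>   }
>
>   /* Deferred dispatch: the ithread enqueues, and a separate netisr thread
>    * dequeues and runs the input routine later (a context switch apart). */
>   #define QLEN 8
>   static struct pkt *ip_queue[QLEN];
>   static int qhead, qtail;
>
>   static void deferred_dispatch(struct pkt *p)
>   {
>       ip_queue[qtail++ % QLEN] = p;    /* enqueue, then wake the netisr */
>   }
>
>   static void netisr_drain(void)       /* runs in the netisr thread */
>   {
>       while (qhead != qtail)
>           ip_input_sketch(ip_queue[qhead++ % QLEN]);
>   }
>
>   /* Direct dispatch (net.isr.direct=1): the ithread calls the input
>    * routine immediately -- no queue, no context switch. */
>   static void direct_dispatch(struct pkt *p)
>   {
>       ip_input_sketch(p);
>   }
>
>   int main(void)
>   {
>       struct pkt a = { 1 }, b = { 2 };
>       deferred_dispatch(&a);    /* queued now... */
>       netisr_drain();           /* ...processed when the netisr runs */
>       direct_dispatch(&b);      /* processed immediately in the caller */
>       return 0;
>   }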
>
> The netisr is still there, and will still be used for certain sorts of things.
> In particular, we use the netisr when doing arbitrary decapsulation, as this
> places an upper bound on thread stack use.  For example, with an IP in IP in
> IP in IP tunneled packet, always using direct dispatch would potentially give
> you a deeply nested stack.  By looping the packet back into the queue and
> picking it up from the top level of the netisr dispatch, we avoid nesting the
> stacks, which could lead to stack overflow.  We don't context switch in that
> loop, so avoid context switch costs.  We also use the netisr for loopback
> network traffic.  So, in short, the netisr is still there, it just has reduced
> work scheduled in it.
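>
> A toy sketch of that re-queueing idea (illustrative names only, not the real
> decapsulation code): rather than recursing into the input path for each
> layer, the inner packet goes back on the queue and is picked up again at the
> top level, so the stack depth stays constant:
>
>   #include <stdio.h>
>
>   /* Illustrative only: a packet with some number of encapsulation layers. */
>   struct pkt { int layers; };
>
>   #define QLEN 16
>   static struct pkt *queue[QLEN];
>   static int qh, qt;
>
>   static void enqueue_for_netisr(struct pkt *p) { queue[qt++ % QLEN] = p; }
>
>   /* One top-level pass: strip one layer, or deliver the packet. */
>   static void handle_packet(struct pkt *p)
>   {
>       if (p->layers > 0) {
>           p->layers--;              /* "decapsulate" one IP-in-IP layer */
>           /* Recursing here would add a stack frame per layer; re-queueing
>            * keeps the stack depth bounded however deep the tunnel is. */
>           enqueue_for_netisr(p);
>           return;
>       }
>       printf("delivered\n");
>   }
>
>   int main(void)
>   {
>       struct pkt p = { 4 };         /* IP in IP in IP in IP */
>       enqueue_for_netisr(&p);
>       while (qh != qt)              /* the netisr loop: top-level dispatch */
>           handle_packet(queue[qh++ % QLEN]);
>       return 0;
>   }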
>
> Another potential model for increasing parallelism in the input path is to
> have multiple netisr threads -- this raises an interesting question relating
> to ordering.  Right now, we use source ordering -- that is, we order packets
> in the network subsystem essentially in the order they come from a particular
> source.  So we guarantee that if four packets come in em0, they get processed
> in the order they are received from em0.  They may arbitrarily interlace with
> packets coming from other interfaces, such as em1, lo0, etc.  The reason for
> the strong source ordering is that some protocols, TCP in particular, respond
> really badly to misordering, which they detect as a loss and force retransmit
> for.  If we introduce multiple netisrs naively by simply having the different
> threads working from the same IP input queue, then we can potentially pull
> packets from the same source into different workers, and process them at
> different rates, resulting in misordering being introduced.  While we'd
> process packets with greater parallelism, and hence possibly faster, we'd
> toast the end-to-end protocol properties and make everyone really unhappy.
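>
> As a toy illustration (plain user-space C, not kernel code): two workers
> draining one shared queue at different speeds hand back packets from a
> single source out of order:
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       int queue[] = { 1, 2, 3, 4 };   /* arrival order from one source */
>       int cost[]  = { 3, 1 };         /* worker 0 is three times slower */
>       int free_at[2] = { 0, 0 };
>       int next = 0;
>
>       /* Each worker grabs the next packet as soon as it becomes idle. */
>       for (int t = 0; next < 4; t++)
>           for (int w = 0; w < 2; w++)
>               if (free_at[w] <= t && next < 4) {
>                   int p = queue[next++];
>                   free_at[w] = t + cost[w];
>                   printf("worker %d finishes packet %d at time %d\n",
>                          w, p, free_at[w]);
>               }
>       /* Packet 1 completes at time 3, after packets 2 and 3 -- the stream
>        * from this one source has been reordered. */
>       return 0;
>   }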
>
> There are a few common ways people have addressed this -- it's actually very
> similar to the link parallelism problem.  For example, using bonded ethernet
> links, packets are assigned to a particular link based on a hash of their
> source address, so that individual streams from the same source remain in
> order with respect to themselves.  An obvious approach would be to assign
> particular ifnets to particular netisrs, since that would maintain our current
> source ordering assumptions, but allow the ithreads and netisrs to float to
> different CPUs.  A catch in this approach is load balancing: if two ifnets are
> assigned to the same netisr, then they can't run in parallel.  This line of
> thought can, and does, continue. :-)  The direct dispatch model maintains
> source ordering in a manner similar to having a per-source netisr, which works
> pretty well, and also avoids context switches.  The main downside is reducing
> parallelism between the ithread and the netisr, which for some configurations
> can be a big deal (i.e., if ithread uses 60% cpu, and netisr uses 60% cpu,
> you've limited them both to 50% cpu by combining them in a single thread).
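>
> A minimal sketch of that hashing idea (illustrative user-space C, not the
> actual code): pick the worker from a hash of the flow's source, so packets
> from one source always land on the same worker and stay ordered relative to
> each other, while different sources can spread across workers:
>
>   #include <stdio.h>
>   #include <stdint.h>
>
>   #define NWORKERS 4
>
>   /* Choose a worker from the source address (one could also mix in ports,
>    * the destination, or the inbound interface). */
>   static unsigned pick_worker(uint32_t src_addr)
>   {
>       return (src_addr * 2654435761u) % NWORKERS;  /* cheap multiplicative hash */
>   }
>
>   int main(void)
>   {
>       uint32_t srcs[] = { 0x0a000001, 0x0a000002, 0x0a000001, 0xc0a80001 };
>       for (int i = 0; i < 4; i++)
>           printf("packet from %#x -> worker %u\n",
>                  (unsigned)srcs[i], pick_worker(srcs[i]));
>       return 0;
>   }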
>

   Any words regarding polling, given that the ithreads would be
masked in this case?

--
If it's there, and you can see it, it's real.
If it's not there, and you can see it, it's virtual.
If it's there, and you can't see it, it's transparent.
If it's not there, and you can't see it, you erased it.

