Changes in the network interface queueing handoff model

Sun Jul 30 18:36:18 UTC 2006

Robert Watson wrote:
> 
> One of the ideas that I, Scott Long, and a few others have been 
> bouncing around for some time is a restructuring of the network 
> interface packet transmission API to reduce the number of locking 
> operations and allow network device drivers increased control of the 
> queueing behavior.  Right now, it works something like the following:
> 
> - When a network protocol wants to transmit, it calls the ifnet's link layer
>   output routine via ifp->if_output() with the ifnet pointer, packet,
>   destination address information, and route information.
> 
> - The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
>   the packet as necessary, performs a link layer address translation (such as
>   ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(), which
>   accepts the ifnet pointer and packet.
> 
> - The ifnet layer enqueues the packet in the ifnet send queue (ifp->if_snd),
>   and then looks at the driver's IFF_DRV_OACTIVE flag to determine if it needs
>   to "start" output by the driver.  If the driver is already active, it
>   doesn't; otherwise, it does.
> 
> - The driver dequeues the packet from ifp->if_snd, performs any driver
>   encapsulation and wrapping, and notifies the hardware.  In modern hardware,
>   this consists of hooking the data of the packet up to the descriptor ring
>   and notifying the hardware to pick it up via DMA.  In older hardware, the
>   driver would perform a series of I/O operations to send the entire packet
>   directly to the card via a system bus.
> 
> Why change this?  A few reasons:
> 
> - The ifnet layer send queue is becoming decreasingly useful over time.  Most
>   modern hardware has a significant number of slots in its transmit descriptor
>   ring, tuned for the performance of the hardware, etc, which is the effective
>   transmit queue in practice.  The additional queue depth doesn't increase
>   throughput substantially (if at all) but does consume memory.
> 
> - On extremely fast hardware (with respect to CPU speed), the queue remains
>   essentially empty, so we pay the cost of enqueueing and dequeuing a packet
>   from an empty queue.
> 
> - The ifnet send queue is a separately locked object from the device driver,
>   meaning that for a single enqueue/dequeue pair, we pay an extra four lock
>   operations (two for insert, two for remove) per packet.
> 
> - For synthetic link layer drivers, such as if_vlan, which have no need for
>   queueing at all, the cost of queueing is eliminated.
> 
> - IFF_DRV_OACTIVE is no longer inspected by the link layer, only by the
>   driver, which helps eliminate a latent race condition involving use of the
>   flag.
> 
> The proposed change is simple: right now one or more enqueue operations 
> occur, then a call to ifp->if_start() is made to notify the driver that 
> it may need to do something (if the OACTIVE flag isn't set).  In the new 
> world order, the driver is directly passed the mbuf, and may then choose 
> to queue it or otherwise handle it as it sees fit.  The immediate 
> practical benefit is clear: if the queueing at the ifnet layer is 
> unnecessary, it is entirely avoided, skipping enqueue, dequeue, and four 
> mutex operations.  This applies immediately for VLAN processing, but 
> also means that for modern gigabit cards, the hardware queue (which will 
> be used anyway) is the only queue necessary.
> 
> There are a few downsides, of course:
> 
> - For older hardware without its own queueing, the queue is still required --
>   not only that, but we've now introduced an unconditional function pointer
>   invocation, which has a more significant relative cost on older hardware
>   than it has on more recent CPUs.
> 
> - If drivers still require or use a queue, they must now synchronize access to
>   the queue.  The obvious choices are to use the ifq lock (and restore the
>   above four lock operations), or to use the driver mutex (and risk higher
>   contention).  Right now, if the driver is busy (driver mutex held) then an
>   enqueue is still possible, but with this change and a single mutex
>   protecting the send queue and driver, that is no longer possible.
> 
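
To make the lock accounting above concrete, here is a rough user-space
sketch (not kernel code).  Every name in it (struct nic, counted_lock,
handoff_queued, handoff_direct) is made up for illustration, the "lock"
only counts operations, and the start routine is simplified to move a
single packet where a real one would loop:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real kernel types. */
struct pkt { struct pkt *next; };

/* A fake lock that only counts lock and unlock calls. */
struct counted_lock { unsigned ops; };
static void lk_lock(struct counted_lock *l)   { l->ops++; }
static void lk_unlock(struct counted_lock *l) { l->ops++; }

struct sndq {
    struct counted_lock lock;   /* models the separate ifq mutex */
    struct pkt *head, *tail;
};

struct nic {
    struct sndq snd;            /* models ifp->if_snd */
    int oactive;                /* models IFF_DRV_OACTIVE */
    unsigned ring_used;         /* packets handed to the descriptor ring */
};

static void sndq_enqueue(struct sndq *q, struct pkt *p)
{
    lk_lock(&q->lock);
    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    lk_unlock(&q->lock);
}

static struct pkt *sndq_dequeue(struct sndq *q)
{
    struct pkt *p;

    lk_lock(&q->lock);
    p = q->head;
    if (p != NULL) {
        q->head = p->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    lk_unlock(&q->lock);
    return p;
}

/* Current model: enqueue on the shared queue, then kick the driver. */
static void handoff_queued(struct nic *n, struct pkt *p)
{
    assert(p != NULL);
    sndq_enqueue(&n->snd, p);
    if (!n->oactive) {
        /* Simplified start routine: move one packet to the ring. */
        if (sndq_dequeue(&n->snd) != NULL)
            n->ring_used++;
    }
}

/* Proposed model: the driver is handed the packet directly. */
static void handoff_direct(struct nic *n, struct pkt *p)
{
    assert(p != NULL);
    n->ring_used++;             /* no trip through the send queue */
}
```

In this model one packet through handoff_queued() costs four operations
on the queue lock, and one packet through handoff_direct() costs none.
The driver's own mutex is not modeled in either path.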

You're headed in the direction of Linux, where the handoff goes through a 
packet scheduling function before it hits the driver.  This is 
equivalent to altq which, as Max pointed out, you didn't mention in this 
note.  But it would be very good to move altq out of the compile-time 
macros with this.

I have a fair amount of experience with the Linux model and it works ok.
The main complication I've seen is that things get more involved when a
driver needs to process multiple queues of packets.  This is seen in
802.11 drivers where there are two queues, one for data frames and one for
management frames.  With the current scheme you have two separate queues
and the start method handles prioritization by polling the mgt queue
before the data queue.  If instead the packet is passed to the start method
then it needs to be tagged in some way so that it's prioritized properly.
Otherwise you end up with multiple start methods, one per type of
packet.  I suspect this will be ok but the end result will be that we'll
need to add a priority field to mbufs (unless we pass it as an arg to
the start method).
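
A minimal sketch of the tagging idea, as plain user-space C.  The
prio field, the PRIO_* values, and the wlan_* names are all invented
for the example; this is not how any existing 802.11 driver or the
mbuf structure is laid out:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical priority tag carried by each packet. */
enum pkt_prio { PRIO_DATA = 0, PRIO_MGT = 1 };

struct pkt {
    struct pkt *next;
    enum pkt_prio prio;
    int id;
};

struct pktq { struct pkt *head, *tail; };

struct wlan_softc {
    struct pktq mgtq;    /* management frames */
    struct pktq dataq;   /* data frames */
};

static void q_put(struct pktq *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
}

static struct pkt *q_get(struct pktq *q)
{
    struct pkt *p = q->head;

    if (p != NULL) {
        q->head = p->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return p;
}

/* Single start method: the tag on the packet selects the queue. */
static void wlan_startpkt(struct wlan_softc *sc, struct pkt *p)
{
    assert(p != NULL);
    q_put(p->prio == PRIO_MGT ? &sc->mgtq : &sc->dataq, p);
}

/* Next frame for the hardware: poll the mgt queue before the data queue. */
static struct pkt *wlan_next_tx(struct wlan_softc *sc)
{
    struct pkt *p = q_get(&sc->mgtq);

    return p != NULL ? p : q_get(&sc->dataq);
}
```

With the tag in the packet there is one entry point, and a management
frame handed in after a data frame still goes out first.  Passing the
priority as an extra argument instead would only change the signature
of wlan_startpkt().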

All this is certainly doable but I think just replacing one mechanism 
with the other (as you specified) is insufficient.

 > Attached is a patch that maintains the current if_start, but adds
 > if_startmbuf.  If a device driver implements if_startmbuf and the global
 > sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in
 > the driver will be used.  Otherwise, if_start is used.  I have modified
 > the if_em driver to implement if_startmbuf also.  If there is no packet
 > backlog in the if_snd queue, it directly places the packet in the
 > transmit descriptor ring. If there is a backlog, it uses the if_snd
 > queue protected by driver mutex, rather than a separate ifq mutex.
 >
 > In some basic local micro-benchmarks, I saw a 5% improvement in UDP
 > 0-byte payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7%
 > performance improvement in the bulk serving of 1k files over HTTP.
 > These are only micro-benchmarks, and reflect a configuration in which
 > the CPU is unable to keep up with the output rate of the 1gbps ethernet
 > card in the device, so reductions in host CPU usage are immediately
 > visible in increased output as the CPU is able to better keep up with
 > the network hardware.  Other configurations are also of interest,
 > especially ones in which the network device is unable to
 > keep up with the CPU, resulting in more queueing.
 >
 > Conceptual review as well as benchmarking, etc, would be most welcome.
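
For reference, the direct-or-backlog behaviour described for if_em
could be modeled roughly as below.  This is a user-space sketch and
not the patch itself: em_like, RING_SLOTS, and both function names are
hypothetical, the driver mutex is assumed to be held by the caller and
is not modeled, and falling back to the backlog when the ring is full
is my assumption about what the driver has to do:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4   /* hypothetical transmit ring size */

struct pkt { struct pkt *next; };

struct em_like {
    unsigned ring_used;            /* occupied descriptor ring slots */
    struct pkt *q_head, *q_tail;   /* backlog, models if_snd */
    unsigned q_len;
};

/* Caller is assumed to hold the driver mutex (not modeled here). */
static void em_like_startmbuf(struct em_like *sc, struct pkt *p)
{
    assert(p != NULL);
    if (sc->q_len == 0 && sc->ring_used < RING_SLOTS) {
        sc->ring_used++;           /* fast path: straight to the ring */
        return;
    }
    /* Backlog exists (or ring is full): queue to preserve ordering. */
    p->next = NULL;
    if (sc->q_tail != NULL)
        sc->q_tail->next = p;
    else
        sc->q_head = p;
    sc->q_tail = p;
    sc->q_len++;
}

/* Transmit completion: free slots, then refill from the backlog. */
static void em_like_txdone(struct em_like *sc, unsigned completed)
{
    assert(completed <= sc->ring_used);
    sc->ring_used -= completed;
    while (sc->q_len > 0 && sc->ring_used < RING_SLOTS) {
        struct pkt *p = sc->q_head;

        sc->q_head = p->next;
        if (sc->q_head == NULL)
            sc->q_tail = NULL;
        sc->q_len--;
        sc->ring_used++;
    }
}
```

The point of checking the backlog first is ordering: once anything is
queued, new packets must go behind it even if a ring slot has opened up.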

Why is the startmbuf knob global and not per-interface?  Seems like you 
want to convert drivers one at a time?
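
Something along these lines is what I have in mind, sketched in
user-space C.  The per-interface startmbuf_enabled field is
hypothetical (it is not in the patch), and the queued counter just
stands in for the existing enqueue on if_snd plus if_start():

```c
#include <assert.h>
#include <stddef.h>

struct pkt { int id; };
struct iface;
typedef int (*startmbuf_fn)(struct iface *, struct pkt *);

struct iface {
    startmbuf_fn if_startmbuf;   /* NULL if the driver is not converted */
    int startmbuf_enabled;       /* hypothetical per-interface knob */
    unsigned direct, queued;     /* counters for this example only */
};

/* Models the global net.startmbuf_enabled sysctl from the patch. */
static int global_startmbuf_enabled = 1;

/* Example converted driver: just counts direct handoffs. */
static int demo_startmbuf(struct iface *ifp, struct pkt *p)
{
    (void)p;
    ifp->direct++;
    return 0;
}

static int iface_handoff(struct iface *ifp, struct pkt *p)
{
    assert(ifp != NULL && p != NULL);
    if (ifp->if_startmbuf != NULL && global_startmbuf_enabled &&
        ifp->startmbuf_enabled)
        return ifp->if_startmbuf(ifp, p);
    ifp->queued++;   /* stands in for enqueue on if_snd + if_start() */
    return 0;
}
```

The global switch still works as a kill switch, but a converted driver
can be turned on or off per interface without touching the others.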

FWIW the original model was driven by the expectation that you could 
raise the spl so the tx path was entirely synchronized from above.  With 
the SMPng work we're synchronizing transfer through each control layer.
If the driver softc lock (or similar) were exposed to upper layers we
could possibly return to the "lock the tx path" model we had before and
eliminate all the locking your changes target.  But that would be a big 
layering violation and would add significant contention in the SMP case.

I think the key observation is that most network hardware today takes 
packets directly from private queues so the fast path needs to push 
things down to those queues w/ minimal overhead.  This includes devices 
that implement QoS in h/w w/ multiple queues.

	Sam