svn commit: r299210 - in head/sys/dev/cxgbe: . tom
Slawa Olhovchenkov
slw at zxy.spb.ru
Sat May 7 14:09:25 UTC 2016
On Fri, May 06, 2016 at 05:52:15PM -0700, John Baldwin wrote:
> On Saturday, May 07, 2016 12:33:35 AM John Baldwin wrote:
> > Author: jhb
> > Date: Sat May 7 00:33:35 2016
> > New Revision: 299210
> > URL: https://svnweb.freebsd.org/changeset/base/299210
> >
> > Log:
> > Use DDP to implement zerocopy TCP receive with aio_read().
> >
> > Chelsio's TCP offload engine supports direct DMA of received TCP payload
> > into wired user buffers. This feature is known as Direct-Data Placement.
> > However, to scale well the adapter needs to prepare buffers for DDP
> > before data arrives. aio_read() is more amenable to this requirement than
> > read() as applications often call read() only after data is available in
> > the socket buffer.
> >
> > When DDP is enabled, TOE sockets use the recently added pru_aio_queue
> > protocol hook to claim aio_read(2) requests instead of letting them use
> > the default AIO socket logic. The DDP feature supports scheduling DMA
> > to two buffers at a time so that the second buffer is ready for use
> > after the first buffer is filled. The aio/DDP code optimizes the case
> > of an application ping-ponging between two buffers (similar to the
> > zero-copy bpf(4) code) by keeping the two most recently used AIO buffers
> > wired. If a buffer is reused, the aio/DDP code is able to reuse the
> > vm_page_t array as well as page pod mappings (a kind of MMU mapping the
> > Chelsio NIC uses to describe user buffers). The generation of the
> > vmspace of the calling process is used in conjunction with the user
> > buffer's address and length to determine if a user buffer matches a
> > previously used buffer. If an application queues a buffer for AIO that
> > does not match a previously used buffer then the least recently used
> > buffer is unwired before the new buffer is wired. This ensures that no
> > more than two user buffers per socket are ever wired.
> >
> > Note that this feature is best suited to applications sending a steady
> > stream of data vs short bursts of traffic.
> >
> > Discussed with: np
> > Relnotes: yes
> > Sponsored by: Chelsio Communications
>
> The primary tool I used for evaluating performance was netperf's TCP stream
> test. It is a best case for this (constant stream of traffic), but that is
> also the intended use case for this feature.
>
> Using 2 64K buffers in a ping-pong via aio_read() to receive a 40Gbps stream
> used about two full CPUs (~190% CPU usage) on a single-package
> Intel E5-1620 v3 @ 3.50GHz with the stock TCP stack. Enabling TOE brings the
> usage down to about 110% CPU. With DDP, the usage is around 30% of a single
> CPU. With two 1MB buffers the stock and TOE numbers are about the same,
> but the DDP usage is about 5% of a single CPU.
>
> Note that these numbers are with aio_read(). read() fares a bit better (180%
> for stock and 70% for TOE). Before the AIO rework, trying to use aio_read()
> with two buffers in a ping-pong used twice as much CPU as bare read(), but
> aio_read() in general is now fairly comparable to read() at least in terms of
> CPU overhead.
Can this improvement also be used by the NFS client and the like?