svn commit: r299210 - in head/sys/dev/cxgbe: . tom
John Baldwin
jhb at freebsd.org
Sat May 7 01:15:44 UTC 2016
On Saturday, May 07, 2016 12:33:35 AM John Baldwin wrote:
> Author: jhb
> Date: Sat May 7 00:33:35 2016
> New Revision: 299210
> URL: https://svnweb.freebsd.org/changeset/base/299210
>
> Log:
> Use DDP to implement zerocopy TCP receive with aio_read().
>
> Chelsio's TCP offload engine supports direct DMA of received TCP payload
> into wired user buffers. This feature is known as Direct-Data Placement.
> However, to scale well the adapter needs to prepare buffers for DDP
> before data arrives. aio_read() is more amenable to this requirement than
> read() as applications often call read() only after data is available in
> the socket buffer.
>
> When DDP is enabled, TOE sockets use the recently added pru_aio_queue
> protocol hook to claim aio_read(2) requests instead of letting them use
> the default AIO socket logic. The DDP feature supports scheduling DMA
> to two buffers at a time so that the second buffer is ready for use
> after the first buffer is filled. The aio/DDP code optimizes the case
> of an application ping-ponging between two buffers (similar to the
> zero-copy bpf(4) code) by keeping the two most recently used AIO buffers
> wired. If a buffer is reused, the aio/DDP code is able to reuse the
> vm_page_t array as well as page pod mappings (a kind of MMU mapping the
> Chelsio NIC uses to describe user buffers). The generation of the
> vmspace of the calling process is used in conjunction with the user
> buffer's address and length to determine if a user buffer matches a
> previously used buffer. If an application queues a buffer for AIO that
> does not match a previously used buffer then the least recently used
> buffer is unwired before the new buffer is wired. This ensures that no
> more than two user buffers per socket are ever wired.
>
> Note that this feature is best suited to applications sending a steady
> stream of data vs short bursts of traffic.
>
> Discussed with: np
> Relnotes: yes
> Sponsored by: Chelsio Communications
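To make the buffer-matching rule in the log concrete, here is a
hypothetical sketch; the names and structures below are illustrative
assumptions, not the driver's actual state in tom. A cached buffer (and
its page pod mapping) is reused only when the vmspace generation, user
address, and length all match:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Hypothetical cache entry for a wired AIO buffer; the real
	 * driver state differs.
	 */
	struct ddp_cached_buf {
		uint64_t vm_gen;   /* vmspace generation at wiring time */
		uintptr_t uaddr;   /* user buffer start address */
		size_t len;        /* user buffer length */
		/* ... wired vm_page_t array and page pod mapping ... */
	};

	/*
	 * A queued aio_read() can reuse a cached buffer only if all
	 * three fields match; otherwise the LRU entry is unwired and
	 * replaced by the new buffer.
	 */
	static bool
	ddp_buf_matches(const struct ddp_cached_buf *db, uint64_t vm_gen,
	    uintptr_t uaddr, size_t len)
	{
		return (db->vm_gen == vm_gen && db->uaddr == uaddr &&
		    db->len == len);
	}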
The primary tool I used for evaluating performance was netperf's TCP stream
test. It is a best case for this (constant stream of traffic), but that is
also the intended use case for this feature.
Using 2 64K buffers in a ping-pong via aio_read() to receive a 40Gbps stream
used about two full CPUs (~190% CPU usage) on a single-package
Intel E5-1620 v3 @ 3.50GHz with the stock TCP stack. Enabling TOE brings the
usage down to about 110% CPU. With DDP, the usage is around 30% of a single
CPU. With two 1MB buffers the stock and TOE numbers are about the same,
but the DDP usage is about 5% of a single CPU.
Note that these numbers are with aio_read(); read() fares a bit better (180%
for stock and 70% for TOE). Before the AIO rework, using aio_read() with two
buffers in a ping-pong cost twice as much CPU as plain read(), but aio_read()
is now fairly comparable to read(), at least in terms of CPU overhead.
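For reference, the receive side of the test boils down to something like
the following minimal sketch, assuming a connected TCP socket s. It uses
the FreeBSD-specific aio_waitcomplete(2); error handling is abbreviated
and this is not the actual netperf code:

	#include <sys/types.h>
	#include <aio.h>
	#include <err.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>

	#define BUFSIZE	(64 * 1024)

	static void
	queue_read(int s, struct aiocb *cb, void *buf)
	{
		memset(cb, 0, sizeof(*cb));
		cb->aio_fildes = s;
		cb->aio_buf = buf;
		cb->aio_nbytes = BUFSIZE;
		if (aio_read(cb) != 0)
			err(1, "aio_read");
	}

	static void
	receive_loop(int s)
	{
		struct aiocb cb[2], *done;
		void *buf[2];
		ssize_t n;
		int i;

		/* Queue both buffers up front so DMA can be scheduled
		   before data arrives. */
		for (i = 0; i < 2; i++) {
			if ((buf[i] = malloc(BUFSIZE)) == NULL)
				err(1, "malloc");
			queue_read(s, &cb[i], buf[i]);
		}

		for (;;) {
			/* Wait for whichever request completes first. */
			if ((n = aio_waitcomplete(&done, NULL)) == -1)
				err(1, "aio_waitcomplete");
			if (n == 0)
				break;		/* EOF */
			/* ... consume the n bytes in done->aio_buf ... */

			/* Requeue the same buffer so its wiring can be
			   reused. */
			queue_read(s, done,
			    (void *)(uintptr_t)done->aio_buf);
		}
	}

Because the same two buffers are requeued verbatim (same address, length,
and vmspace), the driver can keep them wired and reuse the page pod
mappings on every iteration.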
--
John Baldwin