HEADS UP: zerocopy bpf commits impending

Mon Mar 17 11:45:53 PDT 2008

On Mon, 17 Mar 2008, Julian Elischer wrote:

>> Per previous posts, interested parties can find the slides on the design 
>> from the BSDCan 2007 developer summit here:
>>
>> http://www.watson.org/~robert/freebsd/2007bsdcan/20070517-devsummit-zerocopybpf.pdf
>
> with the video of the talk at:
>
> http://www.freebsd.org/~julian/BSDCan-2007/rwatson_bpf.mov

The primary design change since that time is that we've eliminated the 
ioctl-driven monitoring and ACKing of shared memory buffers from userspace. 
All shared memory consumers must use the shared memory ACK model, and our 
libpcap changes do that.  This removes redundancy (and complexity) from the 
set of ioctls we've added.  I've attached the (new) text from bpf.4 below, 
which I think captures the changes best.

Robert N M Watson
Computer Laboratory
University of Cambridge

BUFFER MODES
      bpf devices deliver packet data to the application via memory buffers
      provided by the application.  The buffer mode is set using the
      BIOCSETBUFMODE ioctl, and read using the BIOCGETBUFMODE ioctl.

    Buffered read mode
      By default, bpf devices operate in the BPF_BUFMODE_BUFFER mode, in which
      packet data is copied explicitly from the kernel to user memory using the
      read(2) system call.  The user process will declare a fixed buffer size
      that will be used both for sizing internal buffers and for all read(2)
      operations on the file.  This size is queried using the BIOCGBLEN ioctl,
      and is set using the BIOCSBLEN ioctl.  Note that an individual packet
      larger than the buffer size is necessarily truncated.

    Zero-copy buffer mode
      bpf devices may also operate in the BPF_BUFMODE_ZEROCOPY mode, in which
      packet data is written directly into user memory buffers by the kernel,
      avoiding both system call and copying overhead.  Buffers are of fixed
      (and equal) size, page-aligned, and an even multiple of the page size.
      The maximum zero-copy buffer size is returned by the BIOCGETZMAX ioctl.
      Note that an individual packet larger than the buffer size is necessarily
      truncated.
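
      As a rough illustration of the sizing rules above, the following sketch
      rounds a requested length to a whole number of pages (reading "even
      multiple" as an exact multiple, which is an assumption), clamps it to a
      caller-supplied maximum, and allocates page-aligned, zeroed memory.  The
      names zbuf_round_len and zbuf_alloc are hypothetical, and zmax merely
      stands in for the value BIOCGETZMAX would report; no bpf ioctl is issued.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Round want up to a whole number of pages without exceeding zmax.
 * zmax is a stand-in for the limit reported by BIOCGETZMAX (assumption:
 * the caller has already queried it).  Returns 0 if not even one page fits.
 */
static size_t
zbuf_round_len(size_t want, size_t zmax, size_t pagesize)
{
	size_t down;

	assert(pagesize != 0);
	if (want > zmax)
		want = zmax;
	down = (want / pagesize) * pagesize;
	if (down == want)
		return (down);		/* already a page multiple */
	if (zmax - down < pagesize)
		return (down);		/* rounding up would exceed zmax */
	return (down + pagesize);
}

/*
 * Allocate one page-aligned buffer of len bytes and zero all of it, which
 * also zeroes the header (and its padding) at the start of the buffer.
 * Returns NULL on failure or if len is 0.
 */
static void *
zbuf_alloc(size_t len, size_t pagesize)
{
	void *p;

	if (len == 0 || posix_memalign(&p, pagesize, len) != 0)
		return (NULL);
	assert(((uintptr_t)p % pagesize) == 0);
	memset(p, 0, len);
	return (p);
}
```

      A caller would take pagesize from sysconf(_SC_PAGESIZE) and call
      zbuf_alloc twice, once per buffer to be registered.  Using
      posix_memalign rather than mmap(2) is a convenience of this sketch, not
      something the text prescribes.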

      The user process registers two memory buffers using the BIOCSETZBUF
      ioctl, which accepts a struct bpf_zbuf pointer as an argument:

      struct bpf_zbuf {
              void *bz_bufa;
              void *bz_bufb;
              size_t bz_buflen;
      };

      bz_bufa is a pointer to the userspace address of the first buffer that
      will be filled, and bz_bufb is a pointer to the second buffer.  bpf will
      then cycle between the two buffers starting with bz_bufa.

      Each buffer begins with a fixed-length header to hold synchronization
      and data length information for the buffer:

      struct bpf_zbuf_header {
              volatile u_int  bzh_kernel_gen; /* Kernel generation number. */
              volatile u_int  bzh_kernel_len; /* Length of data in the buffer. */
              volatile u_int  bzh_user_gen;   /* User generation number. */
              /* ...padding for future use... */
      };

      The header structure of each buffer, including all padding, should be
      zeroed before it is passed to the ioctl.  Remaining space in the buffer
      will be used by the kernel to store packet data, laid out in the same
      format as with buffered read mode.

      The kernel and the user process follow a simple acknowledgement protocol
      via the buffer header to synchronize access to the buffer: when the
      header generation numbers, bzh_kernel_gen and bzh_user_gen, hold the same
      value, the kernel owns the buffer, and when they differ, userspace owns
      the buffer.

      While the kernel owns the buffer, the contents are unstable and may
      change asynchronously; while the user process owns the buffer, its
      contents are stable and will not be changed until the buffer has been
      acknowledged.

      Initializing the buffer headers to all 0's before registering the buffer
      has the effect of assigning initial ownership of both buffers to the
      kernel.  The kernel signals that a buffer has been assigned to userspace
      by modifying bzh_kernel_gen, and userspace acknowledges the buffer and
      returns it to the kernel by setting the value of bzh_user_gen to the
      value of bzh_kernel_gen.

      In order to avoid caching and memory re-ordering effects, the user
      process must use atomic operations and memory barriers when checking for
      and acknowledging buffers:

      #include <machine/atomic.h>

      /*
       * Return ownership of a buffer to the kernel for reuse.
       */
      static void
      buffer_acknowledge(struct bpf_zbuf_header *bzh)
      {

              atomic_store_rel_int(&bzh->bzh_user_gen, bzh->bzh_kernel_gen);
      }

      /*
       * Check whether a buffer has been assigned to userspace by the kernel.
       * Return true if userspace owns the buffer, and false otherwise.
       */
      static int
      buffer_check(struct bpf_zbuf_header *bzh)
      {

              return (bzh->bzh_user_gen !=
                  atomic_load_acq_int(&bzh->bzh_kernel_gen));
      }
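
      To trace the handshake without a bpf device, the same protocol can be
      modelled portably with C11 <stdatomic.h> in place of <machine/atomic.h>.
      This is only a sketch: struct zhdr_model and the model_* functions are
      hypothetical names, and having the "kernel" side increment the
      generation is an assumption, since the text only says the kernel
      modifies bzh_kernel_gen.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct bpf_zbuf_header; not the kernel's type. */
struct zhdr_model {
	atomic_uint kernel_gen;		/* models bzh_kernel_gen */
	atomic_uint kernel_len;		/* models bzh_kernel_len */
	atomic_uint user_gen;		/* models bzh_user_gen */
};

/* All-zero header: generations match, so the kernel owns the buffer. */
static void
model_init(struct zhdr_model *h)
{
	atomic_init(&h->kernel_gen, 0);
	atomic_init(&h->kernel_len, 0);
	atomic_init(&h->user_gen, 0);
}

/* Userspace owns the buffer exactly when the generation numbers differ. */
static bool
model_user_owns(struct zhdr_model *h)
{
	return (atomic_load_explicit(&h->user_gen, memory_order_relaxed) !=
	    atomic_load_explicit(&h->kernel_gen, memory_order_acquire));
}

/*
 * Model of the kernel handing a filled buffer to userspace: publish the
 * data length first, then change the generation with release semantics.
 * Incrementing by one is an assumption of this model.
 */
static void
model_kernel_assign(struct zhdr_model *h, unsigned int len)
{
	assert(!model_user_owns(h));	/* kernel must own it to fill it */
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
}

/* Acknowledge: copy the kernel generation into the user generation. */
static void
model_user_ack(struct zhdr_model *h)
{
	unsigned int gen;

	gen = atomic_load_explicit(&h->kernel_gen, memory_order_acquire);
	atomic_store_explicit(&h->user_gen, gen, memory_order_release);
}
```

      Starting from model_init, a call to model_kernel_assign makes
      model_user_owns return true, and model_user_ack makes it return false
      again, matching the ownership rule stated above.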

      The user process may force the assignment of the next buffer, if any data
      is pending, to userspace using the BIOCROTZBUF ioctl.  This allows the
      user process to retrieve data in a partially filled buffer before the
      buffer is full, such as following a timeout; the process must check for
      buffer ownership using the header generation numbers, as the buffer will
      not be assigned if no data was present.

      As in the buffered read mode, kqueue(2), poll(2), and select(2) may be
      used to sleep awaiting the availability of a completed buffer.  They will
      return a readable file descriptor when ownership of the next buffer is
      assigned to userspace.

      In the current implementation, the kernel will assign ownership of at
      most one buffer at a time to the user process.  The user process must
      acknowledge the current buffer in order to be notified that the next
      buffer is ready for processing.  Programs should not rely on this as an
      invariant, as it may change in future versions.
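
      One way to avoid depending on the one-buffer-at-a-time behaviour is to
      keep draining buffers, in fill order, for as long as the next one is
      owned by userspace.  The sketch below does this against a portable model
      of the header built on C11 <stdatomic.h>; struct zhdr_model,
      model_drain and the other names are hypothetical, and the "kernel" side
      is simulated rather than driven by a real bpf device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_zbuf_header; not the kernel's type. */
struct zhdr_model {
	atomic_uint kernel_gen;
	atomic_uint kernel_len;
	atomic_uint user_gen;
};

typedef void (*zbuf_handler)(unsigned int len, void *arg);

/* Example handler: accumulate the data lengths seen so far. */
static void
sum_lengths(unsigned int len, void *arg)
{
	*(unsigned int *)arg += len;
}

/* Zero a model header, which leaves the buffer owned by the kernel. */
static void
model_init(struct zhdr_model *h)
{
	atomic_init(&h->kernel_gen, 0);
	atomic_init(&h->kernel_len, 0);
	atomic_init(&h->user_gen, 0);
}

/* Simulated kernel side: publish len, then bump the generation. */
static void
model_hand_over(struct zhdr_model *h, unsigned int len)
{
	atomic_store_explicit(&h->kernel_len, len, memory_order_relaxed);
	atomic_fetch_add_explicit(&h->kernel_gen, 1, memory_order_release);
}

/*
 * Process every buffer currently owned by userspace, starting with the one
 * expected next (*next is 0 for the first buffer, 1 for the second), and
 * acknowledge each only after fn has consumed it.  Returns the number of
 * buffers processed, which may be 0, 1, or more.
 */
static int
model_drain(struct zhdr_model *hdrs[2], unsigned int *next,
    zbuf_handler fn, void *arg)
{
	struct zhdr_model *h;
	unsigned int gen, len;
	int n = 0;

	assert(hdrs != NULL && next != NULL && fn != NULL);
	for (;;) {
		h = hdrs[*next & 1];
		gen = atomic_load_explicit(&h->kernel_gen,
		    memory_order_acquire);
		if (atomic_load_explicit(&h->user_gen,
		    memory_order_relaxed) == gen)
			break;		/* kernel still owns the next buffer */
		len = atomic_load_explicit(&h->kernel_len,
		    memory_order_relaxed);
		fn(len, arg);
		atomic_store_explicit(&h->user_gen, gen,
		    memory_order_release);
		*next ^= 1;		/* buffers are filled alternately */
		n++;
	}
	return (n);
}
```

      In a real consumer this loop would run after each wakeup from
      kqueue(2), poll(2), or select(2).  It behaves the same whether the
      kernel has handed over one buffer or both.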