svn commit: r260368 - in head: share/man/man4 sys/dev/e1000 sys/dev/ixgbe sys/dev/netmap sys/net tools/tools/netmap

Luigi Rizzo luigi at FreeBSD.org
Mon Jan 6 12:53:16 UTC 2014


Author: luigi
Date: Mon Jan  6 12:53:15 2014
New Revision: 260368
URL: http://svnweb.freebsd.org/changeset/base/260368

Log:
  It is 2014 and we have a new version of netmap.
  Most relevant features:
  
  - netmap emulation on any NIC, even those without native netmap support.
  
    On the ixgbe we have measured about 4Mpps/core/queue in this mode,
    which is still a lot more than with sockets/bpf.
  
  - seamless interconnection of VALE switch, NICs and host stack.
  
    If you disable accelerations on your NIC (say em0)
  
          ifconfig em0 -txcsum -rxcsum
  
    you can use the VALE switch to connect the NIC and the host stack:
  
          vale-ctl -h valeXX:em0
  
    allowing the NIC to be shared with other netmap clients.
  
  - THE USER API HAS SLIGHTLY CHANGED (head/cur/tail pointers
    instead of pointers/count as before). This was unavoidable to support,
    in the future, multiple threads operating on the same rings.
    Netmap clients require very small source code changes to compile again.
    On the plus side, the new API should be easier to understand
    and the internals are a lot simpler.
  
  The manual page has been updated extensively to reflect the current
  features and give some examples.
  
  This is the result of work of several people including Giuseppe Lettieri,
  Vincenzo Maffione, Michio Honda and myself, and has been financially
  supported by EU projects CHANGE and OPENLAB, from NetApp University
  Research Fund, NEC, and of course the Universita` di Pisa.

Modified:
  head/share/man/man4/netmap.4
  head/sys/dev/e1000/if_em.c
  head/sys/dev/e1000/if_igb.c
  head/sys/dev/e1000/if_lem.c
  head/sys/dev/ixgbe/ixgbe.c
  head/sys/dev/netmap/if_em_netmap.h
  head/sys/dev/netmap/if_igb_netmap.h
  head/sys/dev/netmap/if_lem_netmap.h
  head/sys/dev/netmap/if_re_netmap.h
  head/sys/dev/netmap/ixgbe_netmap.h
  head/sys/dev/netmap/netmap.c
  head/sys/dev/netmap/netmap_freebsd.c
  head/sys/dev/netmap/netmap_generic.c
  head/sys/dev/netmap/netmap_kern.h
  head/sys/dev/netmap/netmap_mbq.c
  head/sys/dev/netmap/netmap_mbq.h
  head/sys/dev/netmap/netmap_mem2.c
  head/sys/dev/netmap/netmap_mem2.h
  head/sys/dev/netmap/netmap_vale.c
  head/sys/net/netmap.h
  head/sys/net/netmap_user.h
  head/tools/tools/netmap/bridge.c
  head/tools/tools/netmap/nm_util.c
  head/tools/tools/netmap/nm_util.h
  head/tools/tools/netmap/pcap.c
  head/tools/tools/netmap/pkt-gen.c
  head/tools/tools/netmap/vale-ctl.c

Modified: head/share/man/man4/netmap.4
==============================================================================
--- head/share/man/man4/netmap.4	Mon Jan  6 12:40:46 2014	(r260367)
+++ head/share/man/man4/netmap.4	Mon Jan  6 12:53:15 2014	(r260368)
@@ -1,4 +1,4 @@
-.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
+.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
 .\" All rights reserved.
 .\"
 .\" Redistribution and use in source and binary forms, with or without
@@ -27,434 +27,546 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd October 18, 2013
+.Dd January 4, 2014
 .Dt NETMAP 4
 .Os
 .Sh NAME
 .Nm netmap
 .Nd a framework for fast packet I/O
+.br
+.Nm VALE
+.Nd a fast VirtuAl Local Ethernet using the netmap API
 .Sh SYNOPSIS
 .Cd device netmap
 .Sh DESCRIPTION
 .Nm
 is a framework for extremely fast and efficient packet I/O
-(reaching 14.88 Mpps with a single core at less than 1 GHz)
 for both userspace and kernel clients.
-Userspace clients can use the netmap API
-to send and receive raw packets through physical interfaces
-or ports of the
-.Xr VALE 4
-switch.
-.Pp
-.Nm VALE
-is a very fast (reaching 20 Mpps per port)
-and modular software switch,
-implemented within the kernel, which can interconnect
-virtual ports, physical devices, and the native host stack.
-.Pp
-.Nm
-uses a memory mapped region to share packet buffers,
-descriptors and queues with the kernel.
-Simple
-.Pa ioctl()s
-are used to bind interfaces/ports to file descriptors and
-implement non-blocking I/O, whereas blocking I/O uses
-.Pa select()/poll() .
+It runs on FreeBSD and Linux,
+and includes
+.Nm VALE ,
+a very fast and modular in-kernel software switch/dataplane.
+.Pp
 .Nm
-can exploit the parallelism in multiqueue devices and
-multicore systems.
+and
+.Nm VALE
+are one order of magnitude faster than sockets, bpf or
+native switches based on
+.Xr tun/tap 4 ,
+reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
+and 20 Mpps per core for VALE ports.
+.Pp
+Userspace clients can dynamically switch NICs into
+.Nm
+mode and send and receive raw packets through
+memory mapped buffers.
+A selectable file descriptor supports
+synchronization and blocking I/O.
+.Pp
+Similarly,
+.Nm VALE
+can dynamically create switch instances and ports,
+providing high speed packet I/O between processes,
+virtual machines, NICs and the host stack.
 .Pp
-For the best performance,
+For best performance,
 .Nm
 requires explicit support in device drivers;
-a generic emulation layer is available to implement the
+however, the
 .Nm
-API on top of unmodified device drivers,
+API can be emulated on top of unmodified device drivers,
 at the price of reduced performance
-(but still better than what can be achieved with
-sockets or BPF/pcap).
+(but still better than sockets or BPF/pcap).
 .Pp
-For a list of devices with native
+In the rest of this (long) manual page we document
+various aspects of the
 .Nm
-support, see the end of this manual page.
-.Sh OPERATION - THE NETMAP API
+and
+.Nm VALE
+architecture, features and usage.
+.Pp
+.Sh ARCHITECTURE
 .Nm
-clients must first
-.Pa open("/dev/netmap") ,
-and then issue an
-.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
-to bind the file descriptor to a specific interface or port.
+supports raw packet I/O through a
+.Em port ,
+which can be connected to a physical interface
+.Em ( NIC ) ,
+to the host stack,
+or to a
+.Nm VALE
+switch.
+Ports use preallocated circular queues of buffers
+.Em ( rings )
+residing in an mmapped region.
+There is one ring for each transmit/receive queue of a
+NIC or virtual port.
+An additional ring pair connects to the host stack.
+.Pp
+After binding a file descriptor to a port, a
+.Nm
+client can send or receive packets in batches through
+the rings, and possibly implement zero-copy forwarding
+between ports.
+.Pp
+All NICs operating in
+.Nm
+mode use the same memory region,
+accessible to all processes that own
+.Nm /dev/netmap
+file descriptors bound to NICs.
+.Nm VALE
+ports instead use separate memory regions.
+.Pp
+.Sh ENTERING AND EXITING NETMAP MODE
+Ports and rings are created and controlled through a file descriptor,
+created by opening a special device
+.Dl fd = open("/dev/netmap");
+and then bound to a specific port with an
+.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
+.Pp
 .Nm
 has multiple modes of operation controlled by the
-content of the
-.Pa struct nmreq
-passed to the
-.Pa ioctl() .
-In particular, the
-.Em nr_name
-field specifies whether the client operates on a physical network
-interface or on a port of a
-.Nm VALE
-switch, as indicated below. Additional fields in the
-.Pa struct nmreq
-control the details of operation.
+.Vt struct nmreq
+argument.
+.Va arg.nr_name
+specifies the port name, as follows:
 .Bl -tag -width XXXX
-.It Dv Interface name (e.g. 'em0', 'eth1', ... )
-The data path of the interface is disconnected from the host stack.
-Depending on additional arguments,
-the file descriptor is bound to the NIC (one or all queues),
-or to the host stack.
+.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
+the data path of the NIC is disconnected from the host stack,
+and the file descriptor is bound to the NIC (one or all queues),
+or to the host stack;
 .It Dv valeXXX:YYY (arbitrary XXX and YYY)
-The file descriptor is bound to port YYY of a VALE switch called XXX,
-where XXX and YYY are arbitrary alphanumeric strings.
+the file descriptor is bound to port YYY of a VALE switch called XXX,
+both dynamically created if necessary.
 The string cannot exceed IFNAMSIZ characters, and YYY cannot
-matching the name of any existing interface.
-.Pp
-The switch and the port are created if not existing.
-.It Dv valeXXX:ifname (ifname is an existing interface)
-Flags in the argument control whether the physical interface
-(and optionally the corrisponding host stack endpoint)
-are connected or disconnected from the VALE switch named XXX.
-.Pp
-In this case the
-.Pa ioctl()
-is used only for configuring the VALE switch, typically through the
-.Nm vale-ctl
-command.
-The file descriptor cannot be used for I/O, and should be
-.Pa close()d
-after issuing the
-.Pa ioctl().
+be the name of any existing OS network interface.
 .El
 .Pp
-The binding can be removed (and the interface returns to
-regular operation, or the virtual port destroyed) with a
-.Pa close()
-on the file descriptor.
-.Pp
-The processes owning the file descriptor can then
-.Pa mmap()
-the memory region that contains pre-allocated
-buffers, descriptors and queues, and use them to
-read/write raw packets.
+On return,
+.Va arg
+indicates the size of the shared memory region,
+and the number, size and location of all the
+.Nm
+data structures, which can be accessed by mmapping the memory
+.Dl char *mem = mmap(0, arg.nr_memsize, fd);
+.Pp
 Non blocking I/O is done with special
-.Pa ioctl()'s ,
-whereas the file descriptor can be passed to
-.Pa select()/poll()
-to be notified about incoming packet or available transmit buffers.
-.Ss DATA STRUCTURES
-The data structures in the mmapped memory are described below
-(see
-.Xr sys/net/netmap.h
-for reference).
-All physical devices operating in
+.Xr ioctl 2 ;
+.Xr select 2
+and
+.Xr poll 2
+on the file descriptor permit blocking I/O.
+.Xr epoll 2
+and
+.Xr kqueue 2
+are not supported on
 .Nm
-mode use the same memory region,
-shared by the kernel and all processes who own
-.Pa /dev/netmap
-descriptors bound to those devices
-(NOTE: visibility may be restricted in future implementations).
-Virtual ports instead use separate memory regions,
-shared only with the kernel.
-.Pp
-All references between the shared data structure
-are relative (offsets or indexes). Some macros help converting
-them into actual pointers.
+file descriptors.
+.Pp
+While a NIC is in
+.Nm
+mode, the OS will still believe the interface is up and running.
+OS-generated packets for that NIC end up in a
+.Nm
+ring, and another ring is used to send packets into the OS network stack.
+A
+.Xr close 2
+on the file descriptor removes the binding,
+and returns the NIC to normal mode (reconnecting the data path
+to the host stack), or destroys the virtual port.
+.Pp
+.Sh DATA STRUCTURES
+The data structures in the mmapped memory region are detailed in
+.Xr sys/net/netmap.h ,
+which is the ultimate reference for the
+.Nm
+API. The main structures and fields are indicated below:
 .Bl -tag -width XXX
 .It Dv struct netmap_if (one per interface)
-indicates the number of rings supported by an interface, their
-sizes, and the offsets of the
-.Pa netmap_rings
-associated to the interface.
-.Pp
-.Pa struct netmap_if
-is at offset
-.Pa nr_offset
-in the shared memory region is indicated by the
-field in the structure returned by the
-.Pa NIOCREGIF
-(see below).
 .Bd -literal
 struct netmap_if {
-    char          ni_name[IFNAMSIZ]; /* name of the interface.    */
-    const u_int   ni_version;        /* API version               */
-    const u_int   ni_rx_rings;       /* number of rx ring pairs   */
-    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
-    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
+    ...
+    const uint32_t   ni_flags;          /* properties     */
+    ...
+    const uint32_t   ni_tx_rings;       /* NIC tx rings   */
+    const uint32_t   ni_rx_rings;       /* NIC rx rings   */
+    const uint32_t   ni_extra_tx_rings; /* extra tx rings */
+    const uint32_t   ni_extra_rx_rings; /* extra rx rings */
+    ...
 };
 .Ed
+.Pp
+Indicates the number of available rings
+.Pa ( struct netmap_rings )
+and their position in the mmapped region.
+The number of tx and rx rings
+.Pa ( ni_tx_rings , ni_rx_rings )
+normally depends on the hardware.
+NICs also have an extra tx/rx ring pair connected to the host stack.
+.Em NIOCREGIF
+can request additional tx/rx rings,
+to be used between multiple processes/threads
+accessing the same
+.Nm
+port.
 .It Dv struct netmap_ring (one per ring)
-Contains the positions in the transmit and receive rings to
-synchronize the kernel and the application,
-and an array of
-.Pa slots
-describing the buffers.
-'reserved' is used in receive rings to tell the kernel the
-number of slots after 'cur' that are still in usr
-indicates how many slots starting from 'cur'
-the
-.Pp
-Each physical interface has one
-.Pa netmap_ring
-for each hardware transmit and receive ring,
-plus one extra transmit and one receive structure
-that connect to the host stack.
 .Bd -literal
 struct netmap_ring {
-    const ssize_t  buf_ofs;   /* see details */
-    const uint32_t num_slots; /* number of slots in the ring */
-    uint32_t       avail;     /* number of usable slots      */
-    uint32_t       cur;       /* 'current' read/write index  */
-    uint32_t       reserved;  /* not refilled before current */
-
-    const uint16_t nr_buf_size;
-    uint16_t       flags;
-#define NR_TIMESTAMP 0x0002   /* set timestamp on *sync()    */
-#define NR_FORWARD   0x0004   /* enable NS_FORWARD for ring  */
-#define NR_RX_TSTMP  0x0008   /* set rx timestamp in slots   */
-    struct timeval ts;
-    struct netmap_slot slot[0]; /* array of slots            */
+    ...
+    const uint32_t num_slots;   /* slots in each ring            */
+    const uint32_t nr_buf_size; /* size of each buffer           */
+    ...
+    uint32_t       head;        /* (u) first buf owned by user   */
+    uint32_t       cur;         /* (u) wakeup position           */
+    const uint32_t tail;        /* (k) first buf owned by kernel */
+    ...
+    uint32_t       flags;
+    struct timeval ts;          /* (k) time of last rxsync()      */
+    ...
+    struct netmap_slot slot[0]; /* array of slots                 */
 }
 .Ed
 .Pp
-In transmit rings, after a system call 'cur' indicates
-the first slot that can be used for transmissions,
-and 'avail' reports how many of them are available.
-Before the next netmap-related system call on the file
-descriptor, the application should fill buffers and
-slots with data, and update 'cur' and 'avail'
-accordingly, as shown in the figure below:
+Implements transmit and receive rings, with read/write
+pointers, metadata and an array of
+.Pa slots
+describing the buffers.
+.Pp
+.It Dv struct netmap_slot (one per buffer)
 .Bd -literal
-
-              cur
-               |----- avail ---|   (after syscall)
-               v
-     TX  [*****aaaaaaaaaaaaaaaaa**]
-     TX  [*****TTTTTaaaaaaaaaaaa**]
-                    ^
-                    |-- avail --|   (before syscall)
-                   cur
+struct netmap_slot {
+    uint32_t buf_idx;           /* buffer index                 */
+    uint16_t len;               /* packet length                */
+    uint16_t flags;             /* buf changed, etc.            */
+    uint64_t ptr;               /* address for indirect buffers */
+};
 .Ed
-In receive rings, after a system call 'cur' indicates
-the first slot that contains a valid packet,
-and 'avail' reports how many of them are available.
-Before the next netmap-related system call on the file
-descriptor, the application can process buffers and
-release them to the kernel updating
-'cur' and 'avail' accordingly, as shown in the figure below.
-Receive rings have an additional field called 'reserved'
-to indicate how many buffers before 'cur' are still
-under processing and cannot be released.
+.Pp
+Describes a packet buffer, which normally is identified by
+an index and resides in the mmapped region.
+.It Dv packet buffers
+Fixed size (normally 2 KB) packet buffers allocated by the kernel.
+.El
+.Pp
+The offset of the
+.Pa struct netmap_if
+in the mmapped region is indicated by the
+.Pa nr_offset
+field in the structure returned by
+.Pa NIOCREGIF .
+From there, all other objects are reachable through
+relative references (offsets or indexes).
+Macros and functions in <net/netmap_user.h>
+help converting them into actual pointers:
+.Pp
+.Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
+.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
+.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
+.Pp
+.Dl char *buf = NETMAP_BUF(ring, buffer_index);
+.Sh RINGS, BUFFERS AND DATA I/O
+.Va Rings
+are circular queues of packets with three indexes/pointers
+.Va ( head , cur , tail ) ;
+one slot is always kept empty.
+The ring size
+.Va ( num_slots )
+should not be assumed to be a power of two.
+.br
+(NOTE: older versions of netmap used a pointer/count format (cur/avail)
+to indicate the content of a ring).
+.Pp
+.Va head
+is the first slot available to userspace;
+.br
+.Va cur
+is the wakeup point:
+select/poll will unblock when
+.Va tail
+passes
+.Va cur ;
+.br
+.Va tail
+is the first slot reserved to the kernel.
+.Pp
+Slot indexes MUST only move forward;
+for convenience, the function
+.Dl nm_ring_next(ring, index)
+returns the next index modulo the ring size.
+.Pp
+.Va head
+and
+.Va cur
+are only modified by the user program;
+.Va tail
+is only modified by the kernel.
+The kernel only reads/writes the
+.Vt struct netmap_ring
+slots and buffers
+during the execution of a netmap-related system call.
+The only exceptions are slots (and buffers) in the range
+.Va tail\  . . . head-1 ,
+that are explicitly assigned to the kernel.
+.Pp
+.Ss TRANSMIT RINGS
+On transmit rings, after a
+.Nm
+system call, slots in the range
+.Va head\  . . . tail-1
+are available for transmission.
+User code should fill the slots sequentially
+and advance
+.Va head
+and
+.Va cur
+past slots ready to transmit.
+.Va cur
+may be moved further ahead if the user code needs
+more slots before further transmissions (see
+.Sx SCATTER GATHER I/O ) .
+.Pp
+At the next NIOCTXSYNC/select()/poll(),
+slots up to
+.Va head-1
+are pushed to the port, and
+.Va tail
+may advance if further slots have become available.
+Below is an example of the evolution of a TX ring:
+.Pp
 .Bd -literal
-                 cur
-            |-res-|-- avail --|   (after syscall)
-                  v
-     RX  [**rrrrrrRRRRRRRRRRRR******]
-     RX  [**...........rrrrRRR******]
-                       |res|--|<avail (before syscall)
-                           ^
-                          cur
+    after the syscall, slots between cur and tail are (a)vailable
+              head=cur   tail
+               |          |
+               v          v
+     TX  [.....aaaaaaaaaaa.............]
+
+    user creates new packets to (T)ransmit
+                head=cur tail
+                    |     |
+                    v     v
+     TX  [.....TTTTTaaaaaa.............]
 
+    NIOCTXSYNC/poll()/select() sends packets and reports new slots
+                head=cur      tail
+                    |          |
+                    v          v
+     TX  [..........aaaaaaaaaaa........]
 .Ed
-.It Dv struct netmap_slot (one per packet)
-contains the metadata for a packet:
+.Pp
+select() and poll() will block if there is no space in the ring, i.e.
+.Dl ring->cur == ring->tail
+and return when new slots have become available.
+.Pp
+High speed applications may want to amortize the cost of system calls
+by preparing as many packets as possible before issuing them.
+.Pp
+A transmit ring with pending transmissions has
+.Dl ring->head != ring->tail + 1 (modulo the ring size).
+The function
+.Va int nm_tx_pending(ring)
+implements this test.
+.Pp
+.Ss RECEIVE RINGS
+On receive rings, after a
+.Nm
+system call, the slots in the range
+.Va head\& . . . tail-1
+contain received packets.
+User code should process them and advance
+.Va head
+and
+.Va cur
+past slots it wants to return to the kernel.
+.Va cur
+may be moved further ahead if the user code wants to
+wait for more packets
+without returning all the previous slots to the kernel.
+.Pp
+At the next NIOCRXSYNC/select()/poll(),
+slots up to
+.Va head-1
+are returned to the kernel for further receives, and
+.Va tail
+may advance to report new incoming packets.
+.br
+Below is an example of the evolution of an RX ring:
 .Bd -literal
-struct netmap_slot {
-    uint32_t buf_idx; /* buffer index */
-    uint16_t len;   /* packet length */
-    uint16_t flags; /* buf changed, etc. */
-#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
-#define NS_REPORT       0x0002  /* tell hw to report results
-                                 * e.g. by generating an interrupt
-                                 */
-#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
-                                 * (host stack or device)
-                                 */
-#define NS_NO_LEARN     0x0008
-#define NS_INDIRECT     0x0010
-#define NS_MOREFRAG     0x0020
-#define NS_PORT_SHIFT   8
-#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
-#define NS_RFRAGS(_slot)        ( ((_slot)->flags >> 8) & 0xff)
-    uint64_t ptr;   /* buffer address (indirect buffers) */
-};
+    after the syscall, there are some (h)eld and some (R)eceived slots
+           head  cur     tail
+            |     |       |
+            v     v       v
+     RX  [..hhhhhhRRRRRRRR..........]
+
+    user advances head and cur, releasing some slots and holding others
+               head cur  tail
+                 |  |     |
+                 v  v     v
+     RX  [..*****hhhRRRRRR...........]
+
+    NIOCRXSYNC/poll()/select() recovers slots and reports new packets
+               head cur        tail
+                 |  |           |
+                 v  v           v
+     RX  [.......hhhRRRRRRRRRRRR....]
 .Ed
-The flags control how the the buffer associated to the slot
-should be managed.
-.It Dv packet buffers
-are normally fixed size (2 Kbyte) buffers allocated by the kernel
-that contain packet data. Buffers addresses are computed through
-macros.
-.El
-.Bl -tag -width XXX
-Some macros support the access to objects in the shared memory
-region. In particular,
-.It NETMAP_TXRING(nifp, i)
-.It NETMAP_RXRING(nifp, i)
-return the address of the i-th transmit and receive ring,
-respectively, whereas
-.It NETMAP_BUF(ring, buf_idx)
-returns the address of the buffer with index buf_idx
-(which can be part of any ring for the given interface).
-.El
 .Pp
-Normally, buffers are associated to slots when interfaces are bound,
-and one packet is fully contained in a single buffer.
-Clients can however modify the mapping using the
-following flags:
-.Ss FLAGS
+.Sh SLOTS AND PACKET BUFFERS
+Normally, packets should be stored in the netmap-allocated buffers
+assigned to slots when ports are bound to a file descriptor.
+One packet is fully contained in a single buffer.
+.Pp
+The following flags affect slot and buffer processing:
 .Bl -tag -width XXX
 .It NS_BUF_CHANGED
-indicates that the buf_idx in the slot has changed.
-This can be useful if the client wants to implement
-some form of zero-copy forwarding (e.g. by passing buffers
-from an input interface to an output interface), or
-needs to process packets out of order.
+it MUST be used when the buf_idx in the slot is changed.
+This can be used to implement
+zero-copy forwarding, see
+.Sx ZERO-COPY FORWARDING .
 .Pp
-The flag MUST be used whenever the buffer index is changed.
 .It NS_REPORT
-indicates that we want to be woken up when this buffer
-has been transmitted. This reduces performance but insures
-a prompt notification when a buffer has been sent.
+reports when this buffer has been transmitted.
 Normally,
 .Nm
 notifies transmit completions in batches, hence signals
-can be delayed indefinitely. However, we need such notifications
-before closing a descriptor.
+can be delayed indefinitely. This flag helps detect
+when packets have been sent and a file descriptor can be closed.
 .It NS_FORWARD
-When the device is open in 'transparent' mode,
-the client can mark slots in receive rings with this flag.
-For all marked slots, marked packets are forwarded to
-the other endpoint at the next system call, thus restoring
-(in a selective way) the connection between the NIC and the
-host stack.
+When a ring is in 'transparent' mode (see
+.Sx TRANSPARENT MODE ) ,
+packets marked with this flag are forwarded to the other endpoint
+at the next system call, thus restoring (in a selective way)
+the connection between a NIC and the host stack.
 .It NS_NO_LEARN
 tells the forwarding code that the SRC MAC address for this
-packet should not be used in the learning bridge
+packet must not be used in the learning bridge code.
 .It NS_INDIRECT
-indicates that the packet's payload is not in the netmap
-supplied buffer, but in a user-supplied buffer whose
-user virtual address is in the 'ptr' field of the slot.
+indicates that the packet's payload is in a user-supplied buffer,
+whose user virtual address is in the 'ptr' field of the slot.
 The size can reach 65535 bytes.
-.Em This is only supported on the transmit ring of virtual ports
+.br
+This is only supported on the transmit ring of
+.Nm VALE
+ports, and it helps reduce data copies in the interconnection
+of virtual machines.
 .It NS_MOREFRAG
 indicates that the packet continues with subsequent buffers;
 the last buffer in a packet must have the flag clear.
+.El
+.Sh SCATTER GATHER I/O
+Packets can span multiple slots if the
+.Va NS_MOREFRAG
+flag is set in all but the last slot.
 The maximum length of a chain is 64 buffers.
-.Em This is only supported on virtual ports
-.It NS_RFRAGS(slot)
-on receive rings, returns the number of remaining buffers
-in a packet, including this one.
-Slots with a value greater than 1 also have NS_MOREFRAG set.
-The length refers to the individual buffer, there is no
-field for the total length.
+This is normally used with
+.Nm VALE
+ports when connecting virtual machines, as they generate large
+TSO segments that are not split unless they reach a physical device.
 .Pp
-On transmit rings, if NS_DST is set, it is passed to the lookup
-function, which can use it e.g. as the index of the destination
-port instead of doing an address lookup.
-.El
+NOTE: The length field always refers to the individual
+fragment; there is no field holding the total length of a packet.
+.Pp
+On receive rings the macro
+.Va NS_RFRAGS(slot)
+indicates the remaining number of slots for this packet,
+including the current one.
+Slots with a value greater than 1 also have NS_MOREFRAG set.
 .Sh IOCTLS
 .Nm
-supports some ioctl() to synchronize the state of the rings
-between the kernel and the user processes, plus some
-to query and configure the interface.
-The former do not require any argument, whereas the latter
-use a
-.Pa struct nmreq
-defined as follows:
+uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
+for non-blocking I/O. They take no argument.
+Two more ioctls (NIOCGINFO, NIOCREGIF) are used
+to query and configure ports, with the following argument:
 .Bd -literal
 struct nmreq {
-        char      nr_name[IFNAMSIZ];
-        uint32_t  nr_version;     /* API version */
-#define NETMAP_API      4         /* current version */
-        uint32_t  nr_offset;      /* nifp offset in the shared region */
-        uint32_t  nr_memsize;     /* size of the shared region */
-        uint32_t  nr_tx_slots;    /* slots in tx rings */
-        uint32_t  nr_rx_slots;    /* slots in rx rings */
-        uint16_t  nr_tx_rings;    /* number of tx rings */
-        uint16_t  nr_rx_rings;    /* number of tx rings */
-        uint16_t  nr_ringid;      /* ring(s) we care about */
-#define NETMAP_HW_RING  0x4000    /* low bits indicate one hw ring */
-#define NETMAP_SW_RING  0x2000    /* we process the sw ring */
-#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
-#define NETMAP_RING_MASK 0xfff    /* the actual ring number */
-        uint16_t        nr_cmd;
-#define NETMAP_BDG_ATTACH       1       /* attach the NIC */
-#define NETMAP_BDG_DETACH       2       /* detach the NIC */
-#define NETMAP_BDG_LOOKUP_REG   3       /* register lookup function */
-#define NETMAP_BDG_LIST         4       /* get bridge's info */
-	uint16_t	nr_arg1;
-	uint16_t	nr_arg2;
-        uint32_t        spare2[3];
+    char      nr_name[IFNAMSIZ]; /* (i) port name                  */
+    uint32_t  nr_version;        /* (i) API version                */
+    uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
+    uint32_t  nr_memsize;        /* (o) size of the mmap region    */
+    uint32_t  nr_tx_slots;       /* (o) slots in tx rings          */
+    uint32_t  nr_rx_slots;       /* (o) slots in rx rings          */
+    uint16_t  nr_tx_rings;       /* (o) number of tx rings         */
+    uint16_t  nr_rx_rings;       /* (o) number of rx rings         */
+    uint16_t  nr_ringid;         /* (i) ring(s) we care about      */
+    uint16_t  nr_cmd;            /* (i) special command            */
+    uint16_t  nr_arg1;           /* (i) extra arguments            */
+    uint16_t  nr_arg2;           /* (i) extra arguments            */
+    ...
 };
-
 .Ed
-A device descriptor obtained through
+.Pp
+A file descriptor obtained through
 .Pa /dev/netmap
-also supports the ioctl supported by network devices.
+also supports the ioctl supported by network devices, see
+.Xr netintro 4 .
 .Pp
-The netmap-specific
-.Xr ioctl 2
-command codes below are defined in
-.In net/netmap.h
-and are:
 .Bl -tag -width XXXX
 .It Dv NIOCGINFO
-returns EINVAL if the named device does not support netmap.
+returns EINVAL if the named port does not support netmap.
 Otherwise, it returns 0 and (advisory) information
-about the interface.
+about the port.
 Note that all the information below can change before the
 interface is actually put in netmap mode.
 .Pp
-.Pa nr_memsize
-indicates the size of the netmap
-memory region. Physical devices all share the same memory region,
-whereas VALE ports may have independent regions for each port.
-These sizes can be set through system-wise sysctl variables.
-.Pa nr_tx_slots, nr_rx_slots
+.Bl -tag -width XX
+.It Pa nr_memsize
+indicates the size of the
+.Nm
+memory region. NICs in
+.Nm
+mode all share the same memory region,
+whereas
+.Nm VALE
+ports have independent regions for each port.
+.It Pa nr_tx_slots , nr_rx_slots
 indicate the size of transmit and receive rings.
-.Pa nr_tx_rings, nr_rx_rings
+.It Pa nr_tx_rings , nr_rx_rings
 indicate the number of transmit
 and receive rings.
 Both ring number and sizes may be configured at runtime
 using interface-specific functions (e.g.
-.Pa sysctl
-or
-.Pa ethtool .
+.Xr ethtool
+).
+.El
 .It Dv NIOCREGIF
-puts the interface named in nr_name into netmap mode, disconnecting
-it from the host stack, and/or defines which rings are controlled
-through this file descriptor.
+binds the port named in
+.Va nr_name
+to the file descriptor. For a physical device this also switches it into
+.Nm
+mode, disconnecting
+it from the host stack.
+Multiple file descriptors can be bound to the same port,
+with proper synchronization left to the user.
+.Pp
 On return, it gives the same info as NIOCGINFO, and nr_ringid
 indicates the identity of the rings controlled through the file
 descriptor.
 .Pp
-Possible values for nr_ringid are
+.Va nr_ringid
+selects which rings are controlled through this file descriptor.
+Possible values are:
 .Bl -tag -width XXXXX
 .It 0
-default, all hardware rings
+(default) all hardware rings
 .It NETMAP_SW_RING
-the ``host rings'' connecting to the host stack
-.It NETMAP_HW_RING + i
-the i-th hardware ring
+the ``host rings'', connecting to the host stack.
+.It NETMAP_HW_RING | i
+the i-th hardware ring.
 .El
+.Pp
 By default, a
-.Nm poll
+.Xr poll 2
 or
-.Nm select
+.Xr select 2
 call pushes out any pending packets on the transmit ring, even if
 no write events are specified.
 The feature can be disabled by or-ing
-.Nm NETMAP_NO_TX_SYNC
-to nr_ringid.
-But normally you should keep this feature unless you are using
-separate file descriptors for the send and receive rings, because
-otherwise packets are pushed out only if NETMAP_TXSYNC is called,
-or the send queue is full.
-.Pp
-.Pa NIOCREGIF
-can be used multiple times to change the association of a
-file descriptor to a ring pair, always within the same device.
+.Va NETMAP_NO_TX_SYNC
+to the value written to
+.Va nr_ringid .
+When this flag is set,
+packets are transmitted only when
+.Va ioctl(NIOCTXSYNC)
+or select()/poll() are called with a write event (POLLOUT/wfdset),
+or when the transmit ring is full.
 .Pp
 When registering a virtual interface that is dynamically created to a
 .Xr vale 4
@@ -467,6 +579,164 @@ number of slots available for transmissi
 tells the hardware of consumed packets, and asks for newly available
 packets.
 .El
+.Sh SELECT AND POLL
+.Xr select 2
+and
+.Xr poll 2
+on a
+.Nm
+file descriptor process rings as indicated in
+.Sx TRANSMIT RINGS
+and
+.Sx RECEIVE RINGS
+when write (POLLOUT) and read (POLLIN) events are requested.
+.Pp
+Both block if no slots are available in the ring
+.Pq Va ring->cur == ring->tail .
+.Pp
+Packets in transmit rings are normally pushed out even without
+requesting write events. Passing the NETMAP_NO_TX_SYNC flag to
+.Em NIOCREGIF
+disables this feature.
+.Sh LIBRARIES
+The
+.Nm
+API is meant to be used directly, both because of its simplicity and
+for efficient integration with applications.
+.Pp
+For convenience, the
+.Va <net/netmap_user.h>
+header provides a few macros and functions to ease creating
+a file descriptor and doing I/O with a
+.Nm
+port. These are loosely modeled after the
+.Xr pcap 3
+API, to ease porting of libpcap-based applications to
+.Nm .
+To use these extra functions, programs should
+.Dl #define NETMAP_WITH_LIBS
+before
+.Dl #include <net/netmap_user.h>
+.Pp
+The following functions are available:
+.Bl -tag -width XXXXX
+.It Va  struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
+similar to
+.Xr pcap_open ,
+binds a file descriptor to a port.
+.Bl -tag -width XX
+.It Va ifname
+is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
+.Nm VALE
+port.
+.It Va flags
+can be set to
+.Va NETMAP_SW_RING
+to bind to the host ring pair,
+or to
+.Va NETMAP_HW_RING
+to bind to a specific hardware ring.
+With
+.Va NETMAP_HW_RING ,
+.Va ring_name
+is interpreted as a string or an integer indicating the ring to use.
+.It Va ring_flags
+is copied directly into the ring flags, to specify additional parameters
+such as NR_TIMESTAMP or NR_FORWARD.
+.El
+.It Va int nm_close(struct nm_desc_t *d)
+closes the file descriptor, unmaps memory, frees resources.
+.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
+similar to pcap_inject(), pushes a packet to a ring; returns the size
+of the packet if successful, or 0 on error.
+.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
+similar to pcap_dispatch(), applies a callback to incoming packets.
+.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
+similar to pcap_next(), fetches the next packet.
+.El
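A minimal receive loop built from these helpers might look as follows.
This is only a sketch of the API documented above: it assumes the
nm_desc_t structure has an fd member and that nm_hdr_t carries a len
field (both plausible but not shown in this page), uses "netmap:em0" as
an example port name, and needs <net/netmap_user.h> plus a
netmap-capable kernel to build and run:

```c
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>

int
main(void)
{
	/* "netmap:em0" is only an example port name; adjust to your NIC. */
	struct nm_desc_t *d = nm_open("netmap:em0", NULL, 0, 0);
	int i;

	if (d == NULL)
		return (1);
	for (i = 0; i < 10; i++) {
		struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
		struct nm_hdr_t h;
		u_char *buf;

		if (poll(&pfd, 1, 1000) <= 0)
			continue;	/* timeout or error */
		/* drain all packets currently in the RX rings */
		while ((buf = nm_nextpkt(d, &h)) != NULL)
			printf("got %u-byte packet\n", (unsigned)h.len);
	}
	return (nm_close(d));
}
```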
+.Sh SUPPORTED DEVICES
+.Nm
+natively supports the following devices:
+.Pp
+On FreeBSD:
+.Xr em 4 ,
+.Xr igb 4 ,
+.Xr ixgbe 4 ,
+.Xr lem 4 ,
+.Xr re 4 .
+.Pp
+On Linux:
+.Xr e1000 4 ,
+.Xr e1000e 4 ,
+.Xr igb 4 ,
+.Xr ixgbe 4 ,
+.Xr mlx4 4 ,
+.Xr forcedeth 4 ,
+.Xr r8169 4 .
+.Pp
+NICs without native support can still be used in
+.Nm
+mode through emulation. Performance is inferior to native
+.Nm
+mode but still significantly higher than that of sockets, and approaches
+that of in-kernel solutions such as Linux's
+.Xr pktgen .
+.Pp
+Emulation is also available for devices with native netmap support,
+which can be used for testing or performance comparison.
+The sysctl variable
+.Va dev.netmap.admode
+globally controls how netmap mode is implemented.
+.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
+Some aspects of the operation of
+.Nm
+are controlled through sysctl variables on FreeBSD
+.Em ( dev.netmap.* )
+and module parameters on Linux
+.Em ( /sys/module/netmap_lin/parameters/* ) :

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***

