Intel 10Gb

Alexander Sack pisymbol at gmail.com
Fri May 14 14:01:17 UTC 2010


On Tue, May 11, 2010 at 9:51 AM, Andrew Gallatin <gallatin at cs.duke.edu> wrote:
> Murat Balaban [murat at enderunix.org] wrote:
>>
>> Much of the FreeBSD networking stack has been made parallel in order to
>> cope with high packet rates at 10 Gig/sec operation.
>>
>> I've seen good numbers (near 10 Gig) in my TCP/UDP send/receive
>> tests with the latest Intel driver.
>>
>> As far as BPF is concerned, the above statement does not hold true,
>> since some work still needs to be done here in terms of BPF locking
>> and parallelism.  My tests show that there is high lock contention
>> around the "bpf interface lock", resulting in input errors at high
>> packet rates and with many bpf devices.
>
> If you're interested in 10GbE packet sniffing at line rate on the
> cheap, have a look at the Myri10GE "sniffer" interface.  This is a
> special software package that takes a normal mxge(4) NIC, and replaces
> the driver/firmware with a "myri_snf" driver/firmware which is
> optimized for packet sniffing.
>
> Using this driver/firmware combo, we can receive minimum-sized packets
> at line rate (14.8Mpps) to userspace.  You can even access this using a
> libpcap interface.  The trick is that the fast paths are OS-bypass,
> and don't suffer from OS overheads, like lock contention.  See
> http://www.myri.com/scs/SNF/doc/index.html for details.

But your timestamps will be atrocious at 10G speeds.  Myricom doesn't
timestamp packets, AFAIK.  If you want reliable timestamps, you need to
look at vendors like Endace, Napatech, etc.

We do a lot of packet capture and work on bpf(4) all the time.  My
biggest concern for reliable 10G packet capture is timestamps.  The
call to microtime() up in catchpacket() is not going to cut it (it
barely cuts it at GigE line rate).
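
To put a number on it: at 14.8Mpps you only have about 67ns of budget
per packet, total.  A crude userland sketch like the one below (my own
stand-in, not the kernel code) shows how much of that budget a single
per-packet timestamp call can eat on a given box:

    /*
     * Crude userland sketch (not kernel code): measure the cost of a
     * per-packet timestamp call and compare it to the ~67ns per-packet
     * budget at 10GbE line rate with minimum-sized frames (14.8Mpps).
     */
    #include <stdio.h>
    #include <time.h>

    #define NPKTS   10000000UL              /* pretend packets */

    int
    main(void)
    {
            struct timespec start, end, ts;
            unsigned long i;
            double ns;

            clock_gettime(CLOCK_MONOTONIC, &start);
            for (i = 0; i < NPKTS; i++)
                    clock_gettime(CLOCK_REALTIME, &ts);  /* stand-in for microtime() */
            clock_gettime(CLOCK_MONOTONIC, &end);

            ns = (end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec);
            printf("%.1f ns per timestamp call; budget at 14.8Mpps is ~67 ns\n",
                ns / NPKTS);
            return (0);
    }

And that is before you even get to the lock contention Murat was
seeing.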

I'd be interested in doing the multi-queue bpf(4) work myself (perhaps
I should ask? I don't know whether non-Summer-of-Code folks are allowed
to take it on).  I believe the goal is not so much raw throughput as
cache affinity: it would be nice if the listener application (libpcap)
could bind itself to the same core on which the driver's queue is
receiving packets, so that everything from capture to post-processing
runs out of a warm cache (theoretically).  I think that's the idea.
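
The userland half of that binding is already doable today with
cpuset(2); something along these lines, where the core number is just
a placeholder because there is no interface yet for asking which core
services a given queue:

    /*
     * Sketch: pin the capture thread to the core that services the NIC
     * queue, so the bpf buffers and the pcap hot path stay in a warm
     * cache.  The core number is a placeholder; today you would have
     * to dig it out of the driver's interrupt bindings by hand.
     */
    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <err.h>

    static void
    bind_to_core(int core)
    {
            cpuset_t mask;

            CPU_ZERO(&mask);
            CPU_SET(core, &mask);
            if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(mask), &mask) != 0)
                    err(1, "cpuset_setaffinity");
    }

    int
    main(void)
    {
            bind_to_core(2);        /* placeholder: core handling the queue's irq */
            /* ... open the pcap/bpf handle and run the capture loop here ... */
            return (0);
    }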

It would also let multiple applications subscribe to different queues
that are doing some form of load balancing.  Again, Intel's 82599
chipset supports flow-based queues (although the size of its flow
table is limited).

Note: zero-copy bpf(4) is your friend in all of these use cases at 10G
speeds!  :)
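
For anyone who hasn't played with it, the setup done by hand looks
roughly like the sketch below; as I understand it, libpcap on 8.x will
do the equivalent for you when the kernel supports BPF_BUFMODE_ZBUF
(the interface name is a placeholder):

    /*
     * Rough sketch of setting up zero-copy bpf(4) by hand.  Interface
     * name is a placeholder; buffer handling after setup is omitted.
     */
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/bpf.h>
    #include <err.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            struct bpf_zbuf zb;
            struct ifreq ifr;
            u_int mode = BPF_BUFMODE_ZBUF;
            size_t zmax;
            int fd;

            if ((fd = open("/dev/bpf", O_RDWR)) == -1)
                    err(1, "open(/dev/bpf)");
            if (ioctl(fd, BIOCSETBUFMODE, &mode) == -1)
                    err(1, "BIOCSETBUFMODE");
            if (ioctl(fd, BIOCGETZMAX, &zmax) == -1)
                    err(1, "BIOCGETZMAX");

            /* Two shared buffers that the kernel and userland flip between. */
            memset(&zb, 0, sizeof(zb));
            zb.bz_buflen = zmax;
            zb.bz_bufa = mmap(NULL, zmax, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_SHARED, -1, 0);
            zb.bz_bufb = mmap(NULL, zmax, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_SHARED, -1, 0);
            if (zb.bz_bufa == MAP_FAILED || zb.bz_bufb == MAP_FAILED)
                    err(1, "mmap");
            if (ioctl(fd, BIOCSETZBUF, &zb) == -1)
                    err(1, "BIOCSETZBUF");

            memset(&ifr, 0, sizeof(ifr));
            strlcpy(ifr.ifr_name, "ix0", sizeof(ifr.ifr_name));  /* placeholder */
            if (ioctl(fd, BIOCSETIF, &ifr) == -1)
                    err(1, "BIOCSETIF");

            /*
             * From here you consume packets by watching the bpf_zbuf_header
             * generation counts in each buffer (bumping bzh_user_gen to hand
             * a buffer back) instead of calling read(2).
             */
            return (0);
    }

The win is that the kernel fills buffers you already have mapped, so
the extra copy out to userland on every read(2) goes away.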

-aps

PS: I'm not sure, but I believe Intel also supports writing packets
directly into cache (DCA), yet I thought the 82599 driver actually does
a prefetch anyway, which had me confused about why that helps.

