Does FreeBSD have sendmmsg or recvmmsg system calls?

Fri Jan 8 17:02:38 UTC 2016

On 8 January 2016 at 03:02, Bruce Evans <brde at optusnet.com.au> wrote:
> On Fri, 8 Jan 2016, Adrian Chadd wrote:
>
>> On 7 January 2016 at 23:58, Mark Delany <c2h at romeo.emu.st> wrote:
>>>
>>> On 08Jan16, Bruce Evans allegedly wrote:
>>>>
>>>> If the NIC can't reach line rate
>>>
>>>
>>>> Network stack overheads are also enormous.
>>>
>>>
>>> Bruce makes some excellent points.
>>>
>>> I challenge anyone to get line rate UDP out of FBSD (or Linux) for a
>>> 1G NIC, let alone a 10G NIC, listening to a single port. It was exactly
>>> my frustration with UDP performance that led me down the path of
>>> *mmsg() and netmap.
>>>
>>> Frankly this is an opportunity for FBSD as UDP performance appears to
>>> be a neglected area.
>>
>>
>> I'm there, on 16 threads.
>>
>> I'd rather we do it on two or three, as a lot of time is wasted in
>> producer/consumer locking. But yeah, 500k tx/rx should be doable per
>> CPU with only locking changes.

.. and I did mean "kernel producer/consumer locking changes."

>
> Line rate for 1 Gbps is about 1500 kpps (small packets).
>
> With I218V2 (em), I see enormous lock contention above 3 or 4 (user)
> threads, and 8 are slightly slower than 1.  1 doesn't saturate the NIC,
> and 2 is optimal.
>
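
For reference, the "about 1500 kpps" figure can be reproduced with a little arithmetic. The sketch below assumes minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame wire overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function name is just for illustration.

```c
#include <stdint.h>

/*
 * Bytes on the wire per frame beyond the Ethernet frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte
 * inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/*
 * Theoretical maximum frames per second for a link of link_bps bits
 * per second carrying frames of frame_bytes bytes (header, payload and
 * FCS included). Illustrative helper, not taken from any kernel source.
 */
uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_per_frame;

	bits_per_frame = ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
	return (link_bps / bits_per_frame);
}
```

With these assumptions, line_rate_pps(1000000000, 64) gives 1488095, i.e. roughly 1.49 Mpps for 1 Gbps, and ten times that for 10 Gbps.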

The RSS support in -HEAD lets you get away with parallelising UDP
streams very nicely.

The framework is pretty simple (!):

* drivers ask the RSS code for the RSS config and RSS hash to use, and
configure the hardware appropriately;
* the netisr input paths check the existence of the RSS hash and will
calculate it in software if required;
* v4/v6 reassembly is done (at the IP level, /not/ at the protocol
level) and if it needs a new RSS hash / netisr reinjection, that'll
happen;
* the PCB lookup code for listen sockets now allows one listen socket
per RSS bucket, since the RSS / PCBGROUPS code had already extended the
PCB to have one PCB table per RSS bucket (as well as a global one).
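
To make the "calculate it in software" step concrete, here is a minimal sketch of a Toeplitz hash, which is the hash function RSS hardware generally uses. This is a generic textbook version, not the kernel's code. The key is the widely published Microsoft verification key, and toeplitz_hash(), rss_bucket() and example_rss_key are names made up for this example. The bucket selection (masking off the low bits of the hash) is likewise an assumption about how a hash maps to a bucket.

```c
#include <stddef.h>
#include <stdint.h>

/* Commonly published 40-byte RSS verification key (example only). */
static const uint8_t example_rss_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/*
 * Toeplitz hash: for every set bit in the input (most significant bit
 * first), XOR in the 32-bit window of the key that starts at the same
 * bit offset. Requires keylen >= 4; key bits past the end are treated
 * as zero.
 */
uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0;
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	size_t next_key_bit = 32;	/* index of the next key bit to shift in */

	for (size_t i = 0; i < datalen; i++) {
		for (int b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				hash ^= window;
			/* Slide the key window left by one bit. */
			window <<= 1;
			if (next_key_bit / 8 < keylen &&
			    (key[next_key_bit / 8] & (0x80u >> (next_key_bit % 8))))
				window |= 1u;
			next_key_bit++;
		}
	}
	return (hash);
}

/* Assumed bucket selection: keep the low 'bits' bits (bits < 32). */
uint32_t
rss_bucket(uint32_t hash, unsigned int bits)
{
	return (hash & ((1u << bits) - 1u));
}
```

For a UDP/IPv4 packet the input would typically be the source address, destination address, source port and destination port concatenated in network byte order, though which fields are hashed depends on the RSS configuration.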

So:

* userland code queries RSS for the CPU and RSS bucket setup;
* you then create one listen socket per RSS bucket, bind it to the
local thread (if you want) and tell it "you're in RSS bucket X";
* .. and then in the UDP case for local-bound sockets, the
transmit/receive path does not require modifying the global PCB state,
so the locking is kept per-RSS bucket, and scales linearly with the
number of CPUs you have (until you hit the NIC queue limits.)

https://github.com/erikarn/freebsd-rss/

and:

http://adrianchadd.blogspot.com/2014/06/hacking-on-receive-side-scaling-rss-on.html
http://adrianchadd.blogspot.com/2014/07/application-awareness-of-receive-side.html
http://adrianchadd.blogspot.com/2014/08/receive-side-scaling-figuring-out-how.html
http://adrianchadd.blogspot.com/2014/09/receive-side-scaling-testing-udp.html
http://adrianchadd.blogspot.com/2014/10/more-rss-udp-tests-this-time-on-dell.html

-adrian