Make kernel aware of NIC queues

Alexander V. Chernikov melifaro at FreeBSD.org
Wed Feb 6 14:21:08 UTC 2013


Hello list!

Today more and more NICs are capable of splitting traffic into different 
RX/TX rings, permitting the OS to dispatch this traffic on different CPU 
cores. However, some problems arise from multi-NIC (or even single 
multi-port NIC) configurations:

Typical (OS) questions are:
* How many queues should we allocate per port?
* How should we mark packets received on a given queue?
* What traffic pattern is the NIC used for: should we bind queues to CPU 
cores and, if so, to which ones?

Currently, there is some AI implemented in the Intel drivers, like:
* use the maximum number of available queues if the CPU has a large number of cores
* bind every queue to a CPU core sequentially.
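
For illustration, here is a minimal sketch of that sequential-binding 
heuristic, loosely modeled on what the igb MSI-X setup does; the 
example_queue structure and function name are invented for the example:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/smp.h>

/* Illustrative per-queue state; real drivers keep this in their softc. */
struct example_queue {
        struct resource *res;   /* MSI-X interrupt resource for this queue */
};

static void
example_bind_queues_sequentially(device_t dev, struct example_queue *que,
    int nqueues)
{
        int i, cpu;

        cpu = 0;
        for (i = 0; i < nqueues; i++) {
                /* Pin this queue's interrupt to the next core in order. */
                bus_bind_intr(dev, que[i].res, cpu);
                cpu = (cpu + 1) % mp_ncpus;     /* wrap if queues > cores */
        }
}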

Problems with this (and probably any) AI are:
* Which NICs (ports) will _actually_ be used?
E.g.:
I have an 8-core system with a dual-port Intel 82576 NIC (which is capable 
of using 8 RX queues per port).
If only one port is used, I can allocate 8 (or 7) queues and bind them to 
the given cores, which is generally good for forwarding traffic.
For 2-port setups it is probably better to set up 4 queues per port 
to make sure ithreads from different ports do not interfere with each other.
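
To make that arithmetic explicit, here is a trivial sketch of the split 
such an allocator could apply (the helper and its name are hypothetical):

/*
 * Divide the available cores evenly across the ports that will actually
 * carry traffic, so ithreads from different ports do not share cores.
 */
static int
example_queues_per_port(int ncpus, int active_ports, int hw_max_queues)
{
        int q;

        q = ncpus / active_ports;       /* 8 cores, 2 ports -> 4 queues each */
        if (q > hw_max_queues)
                q = hw_max_queues;      /* hardware limit, e.g. 8 on 82576 */
        if (q < 1)
                q = 1;
        return (q);
}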

* How exactly should we mark packets?
Some traffic flows are not hashed properly by the NIC (mostly non-IP/IPv6 
traffic; PPPoE and various tunnels are good examples), so the driver 
receives all such packets on q0 and marks them with flowid 0, which can be 
inconvenient in some situations. It would be better if we could instruct 
the driver not to mark such packets with any id at all, permitting the OS 
to recalculate the hash via the (probably more powerful) netisr hash function.
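
A rough sketch (not actual driver code) of the RX-path behaviour this 
would allow; the function and its arguments are invented for illustration:

#include <sys/param.h>
#include <sys/mbuf.h>

static void
example_rx_set_flowid(struct mbuf *m, uint32_t rss_hash, int hash_valid)
{
        if (hash_valid) {
                /* The NIC produced a usable RSS hash: mark the packet. */
                m->m_pkthdr.flowid = rss_hash;
                m->m_flags |= M_FLOWID;
        }
        /*
         * Otherwise leave M_FLOWID unset instead of forcing flowid 0,
         * so netisr can compute a software hash for the packet later.
         */
}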

* Traffic flow inside the OS / flowid marking
Smarter flowid marking may be needed in some cases: for example, if we are 
using lagg with 2 NICs for traffic forwarding, this can result in increased 
contention on the transmit side.
From the previous example:
port 0 has q0-q3 bound to cores 0-3
port 1 has q0-q3 bound to cores 4-7

Flow ids are the same as the core numbers.

lagg uses (flowid % number_of_ports) to pick the egress port, and the 
driver uses (flowid % number_of_queues) to pick the TX queue, which leads 
to TX contention:
flowid 0: (0 % 2) = port0, (0 % 4) = queue0
flowid 1: (1 % 2) = port1, (1 % 4) = queue1
flowid 2: (2 % 2) = port0, (2 % 4) = queue2
flowid 3: (3 % 2) = port1, (3 % 4) = queue3
flowid 4: (4 % 2) = port0, (4 % 4) = queue0
flowid 5: (5 % 2) = port1, (5 % 4) = queue1
flowid 6: (6 % 2) = port0, (6 % 4) = queue2
flowid 7: (7 % 2) = port1, (7 % 4) = queue3

Flow IDs 0 and 4, 1 and 5, 2 and 6, 3 and 7 use the same TX queues on 
the same egress NICs.

This can be minimized by using a configuration with GCD(queues, ports) = 1 
(3 queues per port should do the trick in this case), but that leads to 
suboptimal CPU usage.
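
The aliasing is easy to see with a stand-alone toy program that just walks 
the same modulo arithmetic as above:

#include <stdio.h>

/*
 * lagg picks the egress port as (flowid % nports); the egress NIC picks
 * the TX queue as (flowid % nqueues).  Print the mapping for flow ids 0-7.
 */
static void
show_mapping(int nports, int nqueues)
{
        int flowid;

        printf("%d ports, %d TX queues per port:\n", nports, nqueues);
        for (flowid = 0; flowid < 8; flowid++)
                printf("  flowid %d -> port %d, txq %d\n",
                    flowid, flowid % nports, flowid % nqueues);
}

int
main(void)
{
        show_mapping(2, 4);     /* only 4 of the 8 (port, txq) pairs get used */
        show_mapping(2, 3);     /* gcd(2, 3) = 1: all 6 pairs get used */
        return (0);
}

With 4 queues the port and queue indices always have the same parity, so 
half of the (port, txq) pairs are never used and flows double up; with 3 
queues every pair is reachable, which is exactly the GCD observation above.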

We internally use a patched igb/ix driver which permits setting flow ids 
manually (and I have heard that other people are using hacks to 
enable/disable setting M_FLOWID).

I propose implementing a common API to permit drivers to:
* read the user-supplied number of queues and other queue options;
* notify the kernel of each RX/TX queue being created/destroyed;
* bind queues to cores via the given API;
* export data to userland (for example, via sysctl) to permit users to:
a) quickly see the current configuration
b) change CPU bindings on the fly
c) change flowid handling on the fly (with the possibility to 1) use the 
NIC-supplied hash, 2) use a manually supplied value, or 3) disable setting 
M_FLOWID)
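
Purely as a strawman, here is a hypothetical shape for such an API; none 
of these structures or functions exist today, and the names and signatures 
are invented for discussion only:

/* Hypothetical queue descriptor a driver would register with the stack. */
struct if_queue_info {
        int     qi_index;       /* queue number within the interface */
        int     qi_rx;          /* 1 = RX queue, 0 = TX queue */
        int     qi_cpu;         /* CPU the queue is bound to, or -1 */
        int     qi_flowid_mode; /* NIC hash / fixed value / no M_FLOWID */
};

struct ifnet;

/* Driver side: announce queues and read back user-supplied settings. */
int     if_queue_register(struct ifnet *ifp, struct if_queue_info *qi);
int     if_queue_unregister(struct ifnet *ifp, int qindex);
int     if_queue_get_config(struct ifnet *ifp, int qindex,
            struct if_queue_info *qi);

/* Stack side: exported to userland via sysctl and changeable on the fly. */
int     if_queue_bind_cpu(struct ifnet *ifp, int qindex, int cpu);
int     if_queue_set_flowid_mode(struct ifnet *ifp, int qindex, int mode);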

Having a common interface will make network stack tuning easier for users 
and puts us one step closer to a (probably userland) AI which can auto-tune 
the system according to a template ("router", "webserver") and the rc.conf 
configuration (lagg presence, etc.).


What do you guys think?



