Send path scaling problem

From: Wei Hu via freebsd-net <>
Date: Fri, 09 Jul 2021 09:16:59 UTC

I am working on a driver for a new SR-IOV NIC for FreeBSD VMs running on Hyper-V. The driver code is largely complete. Performance testing shows some scaling problems on the send path, on which I am seeking advice.

The NIC is 100 Gbps. I am running iperf2 as a client, generating TCP traffic from a 15-vCPU FreeBSD guest, so it has 15 tx queues and 15 rx queues. With just one iperf send stream (-P1), it reaches over 30 Gbps, which is quite good. With two send streams (-P2), it reaches 43 Gbps. The more streams I use, the smaller the gain from each additional stream. The best performance is around 65 Gbps with six send streams (-P6). Beyond that, I see little further scaling with more send streams, even though the VM still has more vCPUs and tx queues available.

I have observed a few things during the tests, and I would appreciate it if anyone could provide more insight.

1. With a higher number of send streams (>6), some streams are more likely to terminate with a Broken Pipe error before the full test time ends. For example, in a 30-second test with 10 send streams, one to four streams may terminate within just a few seconds with Broken Pipe errors. I have never seen this problem when running the same test on a Linux guest against the same test server.

2. The driver selects the tx queue based on the mbuf's m_pkthdr.flowid field. I can see that each stream gets a different 4-byte flowid value. However, multiple flowids are still very likely to collide on the same tx queue if we just use an algorithm like "flowid % number_of_tx_queues" to pick the tx queue. Any suggestions on how to avoid this?

3. The tx ring size is 256. I allocate a 1024-entry buf ring for each tx queue to queue up send requests. Under heavy tx load I have seen the tx queue stopped until more completions are done; however, I have never seen any drbr queue errors. Does this number look good, or does it need further optimization?

4. On the tx completion path, a task thread is scheduled for each tx queue when a completion interrupt is received. This thread is not bound to any CPU, so it can run on any CPU. Would it be useful to bind it to a specific CPU? I tried this but saw little difference.
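
To make the concern in point 2 concrete, here is a minimal userland sketch, not the actual driver code. It shows the "flowid % number_of_tx_queues" mapping, a helper that counts how many distinct tx queues a set of flowids lands on, and the probability that a given number of flows all get different queues if flowids were spread uniformly at random. All function names (txq_by_modulo, distinct_queues, prob_no_collision) are hypothetical and only for illustration, and the uniform-random assumption is an idealization, not a claim about how the stack generates flowids.

```c
#include <stddef.h>
#include <stdint.h>

/* The mapping described in point 2: queue = flowid % number_of_tx_queues. */
static inline uint32_t
txq_by_modulo(uint32_t flowid, uint32_t ntxq)
{
	return (flowid % ntxq);
}

/*
 * Count how many distinct tx queues the given flowids map to.
 * Assumes ntxq <= 64 so that a 64-bit bitmap is enough.
 */
static inline unsigned
distinct_queues(const uint32_t *flowids, size_t nflows, uint32_t ntxq)
{
	uint64_t used = 0;
	unsigned count = 0;
	size_t i;

	for (i = 0; i < nflows; i++)
		used |= (uint64_t)1 << txq_by_modulo(flowids[i], ntxq);
	for (; used != 0; used &= used - 1)	/* clear lowest set bit */
		count++;
	return (count);
}

/*
 * Probability that nflows flows all land on different queues, assuming
 * each flow picks one of ntxq queues uniformly and independently
 * (an idealized model of a well-behaved hash).
 */
static inline double
prob_no_collision(unsigned nflows, unsigned ntxq)
{
	double p = 1.0;
	unsigned i;

	if (nflows > ntxq)
		return (0.0);
	for (i = 0; i < nflows; i++)
		p *= (double)(ntxq - i) / (double)ntxq;
	return (p);
}
```

Under that idealized model, prob_no_collision(6, 15) is about 0.32 and prob_no_collision(10, 15) is about 0.02, so with 15 tx queues some sharing of a queue is the expected outcome for 6 to 10 streams regardless of which stateless hash is applied to the flowid.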

Any other ideas are also very welcome.