Poor high-PPS performance of the 10G ixgbe(9) NIC/driver in FreeBSD 10.1

Alexander V. Chernikov melifaro at ipfw.ru
Wed Aug 12 10:59:43 UTC 2015


12.08.2015, 02:28, "Maxim Sobolev" <sobomax at FreeBSD.org>:
> Olivier, keep in mind that we are not "kernel forwarding" packets, but "app
> forwarding", i.e. the packet goes the full way
> net->kernel->recvfrom->app->sendto->kernel->net, which is why we have much
> lower PPS limits and which is why I think we are actually benefiting from
> the extra queues. A single-thread sendto() loop is CPU-bound at about
> 220K PPS, and while running the test I am observing that outbound traffic
> from one thread is mapped onto a specific queue (well, a pair of queues on
> two separate adapters, due to lagg load balancing). The peak
> performance of that test is at 7 threads, which I believe corresponds to
> the number of queues. We have plenty of CPU cores in the box (24) with
> HTT/SMT disabled, and each queue is bound to its own CPU. This leaves us
> with at least 8 CPUs fully capable of running our app. If you look at the
> CPU utilization, we are at about 10% when the issue hits.

In any case, it would be great if you could provide some profiling info, since there could be
plenty of problematic places, starting from TX ring contention to locks inside UDP or even
the (in)famous random entropy harvester.
E.g. something like "pmcstat -TS instructions -w1" might be sufficient to determine the reason.
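
For example (a minimal sketch; this assumes hwpmc(4) is not already loaded, and the
exact sampling event name may differ depending on the CPU):

  kldload hwpmc
  pmcstat -TS instructions -w1

That should give a live, top(1)-like view of which functions the CPU time is being
spent in while the test is running.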
>
> ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15> port
> 0x6020-0x603f mem 0xc7c00000-0xc7dfffff,0xc7e04000-0xc7e07fff irq 40 at
> device 0.0 on pci3
> ix0: Using MSIX interrupts with 9 vectors
> ix0: Bound queue 0 to cpu 0
> ix0: Bound queue 1 to cpu 1
> ix0: Bound queue 2 to cpu 2
> ix0: Bound queue 3 to cpu 3
> ix0: Bound queue 4 to cpu 4
> ix0: Bound queue 5 to cpu 5
> ix0: Bound queue 6 to cpu 6
> ix0: Bound queue 7 to cpu 7
> ix0: Ethernet address: 0c:c4:7a:5e:be:64
> ix0: PCI Express Bus: Speed 5.0GT/s Width x8
> 001.000008 [2705] netmap_attach success for ix0 tx 8/4096 rx
> 8/4096 queues/slots
> ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15> port
> 0x6000-0x601f mem 0xc7a00000-0xc7bfffff,0xc7e00000-0xc7e03fff irq 44 at
> device 0.1 on pci3
> ix1: Using MSIX interrupts with 9 vectors
> ix1: Bound queue 0 to cpu 8
> ix1: Bound queue 1 to cpu 9
> ix1: Bound queue 2 to cpu 10
> ix1: Bound queue 3 to cpu 11
> ix1: Bound queue 4 to cpu 12
> ix1: Bound queue 5 to cpu 13
> ix1: Bound queue 6 to cpu 14
> ix1: Bound queue 7 to cpu 15
> ix1: Ethernet address: 0c:c4:7a:5e:be:65
> ix1: PCI Express Bus: Speed 5.0GT/s Width x8
> 001.000009 [2705] netmap_attach success for ix1 tx 8/4096 rx
> 8/4096 queues/slots
>
> On Tue, Aug 11, 2015 at 4:14 PM, Olivier Cochard-Labbé <olivier at cochard.me>
> wrote:
>
>>  On Tue, Aug 11, 2015 at 11:18 PM, Maxim Sobolev <sobomax at freebsd.org>
>>  wrote:
>>
>>>  Hi folks,
>>>
>>  Hi,
>>
>>
>>>  We've been trying to migrate some of our high-PPS systems to new hardware
>>>  that has four X540-AT2 10G NICs, and observed that interrupt time goes
>>>  through the roof after we cross around 200K PPS in and 200K out (two ports
>>>  in LACP). The previous hardware was stable up to about 350K PPS in and
>>>  350K out. I believe the old one was equipped with the I350 and had an
>>>  identical LACP configuration. The new box also has a better CPU with more
>>>  cores (24 vs. 16 before). The CPUs are 2 x E5-2690 v3.
>>
>>  200K PPS, and even 350K PPS, are very low values indeed.
>>  On an Intel Xeon L5630 (4 cores only) with one X540-AT2
>>  (i.e. two 10 Gigabit ports) I've reached about 1.8Mpps (fastforwarding
>>  enabled) [1].
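>>  (Fastforwarding here presumably means the net.inet.ip.fastforwarding sysctl,
>>  e.g. "sysctl net.inet.ip.fastforwarding=1", or the same setting in
>>  /etc/sysctl.conf.)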
>>  But my setup didn't use lagg(4): can you disable the lagg configuration and
>>  re-measure your performance without lagg?
>>
>>  Do you let the Intel NIC driver use 8 queues per port too?
>>  In my use case (forwarding smallest-size UDP packets), I obtained better
>>  behaviour by limiting the NIC queues to 4 (hw.ix.num_queues or
>>  hw.ixgbe.num_queues, I don't remember which) when my system had 8 cores,
>>  and this with both Gigabit Intel [2] and Chelsio [3] NICs.
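>>
>>  For example, as a boot-time tunable in /boot/loader.conf (a sketch; check
>>  ix(4) on your FreeBSD version for the exact tunable name):
>>
>>  hw.ix.num_queues=4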
>>
>>  Don't forget to disable TSO and LRO too.
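>>  E.g. something like "ifconfig ix0 -tso -lro" (and the same for ix1), or the
>>  equivalent flags in your ifconfig_ix* lines in rc.conf.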
>>
>>  Regards,
>>
>>  Olivier
>>
>>  [1]
>>  http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_10-gigabit_intel_x540-at2#graphs
>>  [2]
>>  http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_superserver_5018a-ftn4#graph1
>>  [3]
>>  http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr#reducing_nic_queues
>
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"

