Poor high-PPS performance of the 10G ixgbe(9) NIC/driver in FreeBSD 10.1

Adrian Chadd adrian.chadd at gmail.com
Thu Aug 13 06:47:10 UTC 2015


Hi,

Try this:

* I'd disable AIM and hard-set the interrupt rate to something sensible;
* I'd edit sys/conf/files and sys/dev/ixgbe/Makefile on 10.1 and
remove the '-DIXGBE_FDIR' bit that enables flow director - the
software setup for flow director is buggy, and it causes things to get
wildly unhappy. (A rough sketch of both steps follows below.)
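
Something along these lines (untested here; the dev.ix.*/hw.ix.* names
are the ones sobomax@ lists further down in the thread, and the grep is
just to locate the flag before rebuilding the kernel or ixgbe module):

# 1) Turn off AIM and hard-set the per-queue interrupt rate at runtime.
sysctl dev.ix.0.enable_aim=0
sysctl dev.ix.1.enable_aim=0
for q in 0 1 2 3 4 5; do
    sysctl dev.ix.0.queue${q}.interrupt_rate=20000
    sysctl dev.ix.1.queue${q}.interrupt_rate=20000
done

# Persist across reboots: dev.* knobs via /etc/sysctl.conf; hw.ix.*
# look like boot-time tunables, so they go in /boot/loader.conf.
echo 'dev.ix.0.enable_aim=0' >> /etc/sysctl.conf
echo 'dev.ix.1.enable_aim=0' >> /etc/sysctl.conf
echo 'hw.ix.max_interrupt_rate=20000' >> /boot/loader.conf

# 2) Locate the -DIXGBE_FDIR compile flag before removing it and
#    rebuilding (kernel and/or ixgbe module).
grep -rn 'IXGBE_FDIR' /usr/src/sys/conf/files /usr/src/sys/dev/ixgbe \
    /usr/src/sys/modules/ixgbe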



-adrian


On 12 August 2015 at 17:47, Maxim Sobolev <sobomax at freebsd.org> wrote:
> Here we go (ix2 and ix3 are not used):
>
> ix0 at pci0:3:0:0: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01
> hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = 'Ethernet Controller 10-Gigabit X540-AT2'
>     class      = network
>     subclass   = ethernet
> ix1 at pci0:3:0:1: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01
> hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = 'Ethernet Controller 10-Gigabit X540-AT2'
>     class      = network
>     subclass   = ethernet
> ix2 at pci0:4:0:0: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01
> hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = 'Ethernet Controller 10-Gigabit X540-AT2'
>     class      = network
>     subclass   = ethernet
> ix3 at pci0:4:0:1: class=0x020000 card=0x152815d9 chip=0x15288086 rev=0x01
> hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = 'Ethernet Controller 10-Gigabit X540-AT2'
>     class      = network
>     subclass   = ethernet
>
>
> On Wed, Aug 12, 2015 at 8:23 AM, Adrian Chadd <adrian.chadd at gmail.com>
> wrote:
>>
>> Right, and for the ixgbe hardware?
>>
>>
>>
>> -a
>>
>>
>> On 12 August 2015 at 08:05, Maxim Sobolev <sobomax at freebsd.org> wrote:
>> > igb0 at pci0:7:0:0:        class=0x020000 card=0x153315d9 chip=0x15338086
>> > rev=0x03 hdr=0x00
>> >     vendor     = 'Intel Corporation'
>> >     device     = 'I210 Gigabit Network Connection'
>> >     class      = network
>> >     subclass   = ethernet
>> > igb1 at pci0:8:0:0:        class=0x020000 card=0x153315d9 chip=0x15338086
>> > rev=0x03 hdr=0x00
>> >     vendor     = 'Intel Corporation'
>> >     device     = 'I210 Gigabit Network Connection'
>> >     class      = network
>> >     subclass   = ethernet
>> >
>> >
>> > On Wed, Aug 12, 2015 at 8:03 AM, Maxim Sobolev <sobomax at sippysoft.com>
>> > wrote:
>> >
>> >> Ok, so my current settings are:
>> >>
>> >> hw.ix.max_interrupt_rate: 20000
>> >> dev.ix.0.queue0.interrupt_rate: 20000
>> >> dev.ix.0.queue1.interrupt_rate: 20000
>> >> dev.ix.0.queue2.interrupt_rate: 20000
>> >> dev.ix.0.queue3.interrupt_rate: 20000
>> >> dev.ix.0.queue4.interrupt_rate: 20000
>> >> dev.ix.0.queue5.interrupt_rate: 20000
>> >> dev.ix.1.queue0.interrupt_rate: 20000
>> >> dev.ix.1.queue1.interrupt_rate: 20000
>> >> dev.ix.1.queue2.interrupt_rate: 20000
>> >> dev.ix.1.queue3.interrupt_rate: 20000
>> >> dev.ix.1.queue4.interrupt_rate: 20000
>> >> dev.ix.1.queue5.interrupt_rate: 20000
>> >> dev.ix.0.enable_aim: 0
>> >> dev.ix.1.enable_aim: 0
>> >> dev.ix.2.enable_aim: 0
>> >> dev.ix.3.enable_aim: 0
>> >> hw.ix.num_queues: 6
>> >>
>> >> We also happen to have an I210-based system with only 4 hardware
>> >> queues; it would be interesting to see how it stacks up.
>> >>
>> >> On Wed, Aug 12, 2015 at 5:23 AM, Luigi Rizzo <rizzo at iet.unipi.it>
>> >> wrote:
>> >>
>> >>> As I was telling Maxim, you should disable AIM because it only
>> >>> matches the max interrupt rate to the average packet size, which is
>> >>> the last thing you want.
>> >>>
>> >>> Setting the interrupt rate with sysctl (one per queue) gives you
>> >>> precise control over the max rate (and, hence, the extra latency).
>> >>> 20k interrupts/s give you 50us of latency, and the 2k slots in the
>> >>> queue are still enough to absorb a burst of min-sized frames hitting
>> >>> a single queue (the OS will start dropping long before that level,
>> >>> but that's another story).
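
Spelling out the arithmetic behind those figures (a back-of-the-envelope
check only, assuming the worst case of 64-byte frames at 10 GbE line
rate, roughly 14.88 Mpps, all landing on one queue):

    \Delta t_{\max} \approx \frac{1}{20\,000\ \mathrm{s^{-1}}} = 50\ \mu\mathrm{s},
    \qquad
    14.88\ \mathrm{Mpps} \times 50\ \mu\mathrm{s} \approx 744\ \mathrm{frames} < 2048\ \mathrm{ring\ slots},

so even a full interrupt interval's worth of a line-rate burst still
fits in the ring.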
>> >>>
>> >>> Cheers
>> >>> Luigi
>> >>>
>> >>> On Wednesday, August 12, 2015, Babak Farrokhi <farrokhi at freebsd.org>
>> >>> wrote:
>> >>>
>> >>>> I ran into the same problem with almost the same hardware (Intel
>> >>>> X520) on 10-STABLE. HT/SMT is disabled and cards are configured
>> >>>> with 8 queues, with the same sysctl tunings as sobomax@ did. I am
>> >>>> not using lagg, no FLOWTABLE.
>> >>>>
>> >>>> I experimented with pmcstat (RESOURCE_STALLS) a while ago, and you
>> >>>> can see the results here [1] [2], including the pmc output,
>> >>>> callchains, a flamegraph and gprof output.
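
For anyone who wants to reproduce that kind of capture, a minimal
sketch (the event name and file paths are only illustrative; check
pmccontrol -L for the RESOURCE_STALLS variant your CPU actually
exposes):

kldload hwpmc                                                  # if not already loaded
pmcstat -S RESOURCE_STALLS.ANY -O /tmp/samples.pmc sleep 30    # sample while the box is under load
pmcstat -R /tmp/samples.pmc -G /tmp/callchains.txt             # resolve samples into callchains

A flamegraph can then be produced from the callchain output with e.g.
the usual FlameGraph scripts (stackcollapse-pmc.pl + flamegraph.pl).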
>> >>>>
>> >>>> I am experiencing a huge number of interrupts under a 200 kpps load:
>> >>>>
>> >>>> # sysctl dev.ix | grep interrupt_rate
>> >>>> dev.ix.1.queue7.interrupt_rate: 125000
>> >>>> dev.ix.1.queue6.interrupt_rate: 6329
>> >>>> dev.ix.1.queue5.interrupt_rate: 500000
>> >>>> dev.ix.1.queue4.interrupt_rate: 100000
>> >>>> dev.ix.1.queue3.interrupt_rate: 50000
>> >>>> dev.ix.1.queue2.interrupt_rate: 500000
>> >>>> dev.ix.1.queue1.interrupt_rate: 500000
>> >>>> dev.ix.1.queue0.interrupt_rate: 100000
>> >>>> dev.ix.0.queue7.interrupt_rate: 500000
>> >>>> dev.ix.0.queue6.interrupt_rate: 6097
>> >>>> dev.ix.0.queue5.interrupt_rate: 10204
>> >>>> dev.ix.0.queue4.interrupt_rate: 5208
>> >>>> dev.ix.0.queue3.interrupt_rate: 5208
>> >>>> dev.ix.0.queue2.interrupt_rate: 71428
>> >>>> dev.ix.0.queue1.interrupt_rate: 5494
>> >>>> dev.ix.0.queue0.interrupt_rate: 6250
>> >>>>
>> >>>> [1] http://farrokhi.net/~farrokhi/pmc/6/
>> >>>> [2] http://farrokhi.net/~farrokhi/pmc/7/
>> >>>>
>> >>>> Regards,
>> >>>> Babak
>> >>>>
>> >>>>
>> >>>> Alexander V. Chernikov wrote:
>> >>>> > 12.08.2015, 02:28, "Maxim Sobolev" <sobomax at FreeBSD.org>:
>> >>>> >> Olivier, keep in mind that we are not "kernel forwarding" packets,
>> >>>> >> but "app forwarding", i.e. the packet goes the full way
>> >>>> >> net->kernel->recvfrom->app->sendto->kernel->net, which is why we
>> >>>> >> have much lower PPS limits and which is why I think we are
>> >>>> >> actually benefiting from the extra queues. Single-thread sendto()
>> >>>> >> in a loop is CPU-bound at about 220K PPS, and while running the
>> >>>> >> test I am observing that outbound traffic from one thread is
>> >>>> >> mapped into a specific queue (well, a pair of queues on two
>> >>>> >> separate adaptors, due to lagg load balancing action). And the
>> >>>> >> peak performance of that test is at 7 threads, which I believe
>> >>>> >> corresponds to the number of queues. We have plenty of CPU cores
>> >>>> >> in the box (24) with HTT/SMT disabled and one CPU is mapped to a
>> >>>> >> specific queue. This leaves us with at least 8 CPUs fully capable
>> >>>> >> of running our app. If you look at the CPU utilization, we are at
>> >>>> >> about 10% when the issue hits.
>> >>>> >
>> >>>> > In any case, it would be great if you could provide some profiling
>> >>>> > info, since there could be plenty of problematic places, ranging
>> >>>> > from TX ring contention to some locks inside udp or even the
>> >>>> > (in)famous random entropy harvester..
>> >>>> > e.g. something like pmcstat -TS instructions -w1 might be
>> >>>> > sufficient to determine the reason
>> >>>> >> ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version -
>> >>>> >> 2.5.15> port 0x6020-0x603f mem
>> >>>> >> 0xc7c00000-0xc7dfffff,0xc7e04000-0xc7e07fff irq 40 at device 0.0
>> >>>> >> on pci3
>> >>>> >> ix0: Using MSIX interrupts with 9 vectors
>> >>>> >> ix0: Bound queue 0 to cpu 0
>> >>>> >> ix0: Bound queue 1 to cpu 1
>> >>>> >> ix0: Bound queue 2 to cpu 2
>> >>>> >> ix0: Bound queue 3 to cpu 3
>> >>>> >> ix0: Bound queue 4 to cpu 4
>> >>>> >> ix0: Bound queue 5 to cpu 5
>> >>>> >> ix0: Bound queue 6 to cpu 6
>> >>>> >> ix0: Bound queue 7 to cpu 7
>> >>>> >> ix0: Ethernet address: 0c:c4:7a:5e:be:64
>> >>>> >> ix0: PCI Express Bus: Speed 5.0GT/s Width x8
>> >>>> >> 001.000008 [2705] netmap_attach success for ix0 tx 8/4096 rx
>> >>>> >> 8/4096 queues/slots
>> >>>> >> ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version -
>> >>>> >> 2.5.15> port 0x6000-0x601f mem
>> >>>> >> 0xc7a00000-0xc7bfffff,0xc7e00000-0xc7e03fff irq 44 at device 0.1
>> >>>> >> on pci3
>> >>>> >> ix1: Using MSIX interrupts with 9 vectors
>> >>>> >> ix1: Bound queue 0 to cpu 8
>> >>>> >> ix1: Bound queue 1 to cpu 9
>> >>>> >> ix1: Bound queue 2 to cpu 10
>> >>>> >> ix1: Bound queue 3 to cpu 11
>> >>>> >> ix1: Bound queue 4 to cpu 12
>> >>>> >> ix1: Bound queue 5 to cpu 13
>> >>>> >> ix1: Bound queue 6 to cpu 14
>> >>>> >> ix1: Bound queue 7 to cpu 15
>> >>>> >> ix1: Ethernet address: 0c:c4:7a:5e:be:65
>> >>>> >> ix1: PCI Express Bus: Speed 5.0GT/s Width x8
>> >>>> >> 001.000009 [2705] netmap_attach success for ix1 tx 8/4096 rx
>> >>>> >> 8/4096 queues/slots
>> >>>> >>
>> >>>> >> On Tue, Aug 11, 2015 at 4:14 PM, Olivier Cochard-Labbé
>> >>>> >> <olivier at cochard.me> wrote:
>> >>>> >>
>> >>>> >>>  On Tue, Aug 11, 2015 at 11:18 PM, Maxim Sobolev
>> >>>> >>>  <sobomax at freebsd.org> wrote:
>> >>>> >>>
>> >>>> >>>>  Hi folks,
>> >>>> >>>>
>> >>>> >>>  Hi,
>> >>>> >>>
>> >>>> >>>
>> >>>> >>>>  We've been trying to migrate some of our high-PPS systems to
>> >>>> >>>>  new hardware that has four X540-AT2 10G NICs and observed that
>> >>>> >>>>  interrupt time goes through the roof after we cross around
>> >>>> >>>>  200K PPS in and 200K out (two ports in LACP). The previous
>> >>>> >>>>  hardware was stable up to about 350K PPS in and 350K out. I
>> >>>> >>>>  believe the old one was equipped with the I350 and had the
>> >>>> >>>>  identical LACP configuration. The new box also has a better
>> >>>> >>>>  CPU with more cores (i.e. 24 cores vs. 16 cores before). The
>> >>>> >>>>  CPU itself is 2 x E5-2690 v3.
>> >>>> >>>  200K PPS, and even 350K PPS, are very low values indeed.
>> >>>> >>>  On an Intel Xeon L5630 (4 cores only) with one X540-AT2 (then 2
>> >>>> >>>  10-Gigabit ports) I've reached about 1.8 Mpps (fastforwarding
>> >>>> >>>  enabled) [1]. But my setup didn't use lagg(4): can you disable
>> >>>> >>>  the lagg configuration and re-measure your performance without
>> >>>> >>>  lagg?
>> >>>> >>>
>> >>>> >>>  Do you let the Intel NIC driver use 8 queues per port too?
>> >>>> >>>  In my use case (forwarding the smallest UDP packet size), I
>> >>>> >>>  obtained better behaviour by limiting the NIC queues to 4
>> >>>> >>>  (hw.ix.num_queues or hw.ixgbe.num_queues, I don't remember)
>> >>>> >>>  even though my system had 8 cores. And this with a Gigabit
>> >>>> >>>  Intel [2] or a Chelsio NIC [3].
>> >>>> >>>
>> >>>> >>>  Don't forget to disable TSO and LRO too.
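
A quick sketch of those last two suggestions (the tunable name is the
one quoted above; 4 queues is just the value from Olivier's tests):

ifconfig ix0 -tso -lro                           # disable TSO and LRO on each port
ifconfig ix1 -tso -lro
echo 'hw.ix.num_queues=4' >> /boot/loader.conf   # read at boot time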
>> >>>> >>>
>> >>>> >>>  Regards,
>> >>>> >>>
>> >>>> >>>  Olivier
>> >>>> >>>
>> >>>> >>>  [1] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_10-gigabit_intel_x540-at2#graphs
>> >>>> >>>  [2] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_superserver_5018a-ftn4#graph1
>> >>>> >>>  [3] http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr#reducing_nic_queues
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>>
>> >>> -----------------------------------------+-------------------------------
>> >>>  Prof. Luigi RIZZO, rizzo at iet.unipi.it  . Dip. di Ing.
>> >>> dell'Informazione
>> >>>  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>> >>>  TEL      +39-050-2217533               . via Diotisalvi 2
>> >>>  Mobile   +39-338-6809875               . 56122 PISA (Italy)
>> >>>
>> >>> -----------------------------------------+-------------------------------
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Maksym Sobolyev
>> >> Sippy Software, Inc.
>> >> Internet Telephony (VoIP) Experts
>> >> Tel (Canada): +1-778-783-0474
>> >> Tel (Toll-Free): +1-855-747-7779
>> >> Fax: +1-866-857-6942
>> >> Web: http://www.sippysoft.com
>> >> MSN: sales at sippysoft.com
>> >> Skype: SippySoft
>> >>
>>
>

