Disappointing packets-per-second performance results on a Dell PE R530

Ben RUBSON ben.rubson at gmail.com
Tue Feb 28 06:35:38 UTC 2017


Hi,

Have you tried disabling NUMA in your BIOS settings?
I had performance issues on a 2-CPU (24-core) server and was not able to run a 40G NIC at its maximum throughput.
We investigated a lot; disabling NUMA in the BIOS turned out to be the solution, as NUMA is not yet fully supported (as of stable/11).
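
If you want to double-check whether the kernel even sees several NUMA
domains, the sysctl below should tell you (just a sketch; vm.ndomains may
simply be absent on kernels built without NUMA support):

    # number of memory domains the kernel detected; more than 1 means NUMA is in play
    sysctl vm.ndomains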

Ben



> On 28 Feb 2017, at 03:13, Caraballo-vega, Jordan A. (GSFC-6062)[COMPUTER SCIENCE CORP] <jordancaraballo87 at gmail.com> wrote:
> 
> As a summary, we have a Dell R530 with a Chelsio T580 card running -CURRENT.
> 
> In an attempt to reduce the time the system was taking to look for the
> CPUs, we changed the BIOS settings so that only 8 cores are visible, and
> tested both the cxl* and vcxl* Chelsio interfaces. The numbers are still
> far lower than what we expected:
> 
> cxl interface
> 
> root at router1:~ # netstat -w1 -h
>            input        (Total)           output
>   packets  errs idrops      bytes    packets  errs      bytes colls
>      4.1M     0  3.4M       2.1G       725k     0       383M     0
>      3.7M     0  3.1M       1.9G       636k     0       336M     0
>      3.9M     0  3.2M       2.0G       684k     0       362M     0
>      4.0M     0  3.3M       2.1G       702k     0       371M     0
>      3.8M     0  3.2M       2.0G       658k     0       348M     0
>      3.9M     0  3.2M       2.0G       658k     0       348M     0
>      3.9M     0  3.2M       2.0G       721k     0       381M     0
>      3.3M     0  2.6M       1.7G       681k     0       360M     0
>      3.2M     0  2.5M       1.7G       666k     0       352M     0
>      2.6M     0  2.0M       1.4G       620k     0       328M     0
>      2.8M     0  2.1M       1.4G       615k     0       325M     0
>      3.2M     0  2.6M       1.7G       612k     0       323M     0
>      3.3M     0  2.7M       1.7G       664k     0       351M     0
> 
> 
> vcxl interface
>   input        (Total)           output
>   packets  errs idrops      bytes    packets  errs      bytes colls drops
>      590k  7.5k     0       314M       590k     0       314M     0     0
>      526k  6.6k     0       280M       526k     0       280M     0     0
>      588k  7.1k     0       313M       588k     0       313M     0     0
>      532k  6.6k     0       283M       532k     0       283M     0     0
>      578k  7.2k     0       307M       578k     0       307M     0     0
>      565k  7.0k     0       300M       565k     0       300M     0     0
>      558k  7.0k     0       297M       558k     0       297M     0     0
>      533k  6.7k     0       284M       533k     0       284M     0     0
>      588k  7.3k     0       313M       588k     0       313M     0     0
>      553k  6.9k     0       295M       554k     0       295M     0     0
>      527k  6.7k     0       281M       527k     0       281M     0     0
>      585k  7.4k     0       311M       585k     0       311M     0     0
> 
> The related pmcstat results are:
> 
> root at router1:~/PMC_Stats/Feb22 #  pmcstat -R sample.out -G - | head
> @ CPU_CLK_UNHALTED_CORE [2091 samples]
> 
> 15.35%  [321]      lock_delay @ /boot/kernel/kernel
> 94.70%  [304]       _mtx_lock_spin_cookie
>  100.0%  [304]        __mtx_lock_spin_flags
>   57.89%  [176]         pmclog_loop @ /boot/kernel/hwpmc.ko
>    100.0%  [176]          fork_exit @ /boot/kernel/kernel
>   41.12%  [125]         pmclog_reserve @ /boot/kernel/hwpmc.ko
>    100.0%  [125]          pmclog_process_callchain
>     100.0%  [125]           pmc_process_samples
> 
> root at router1:~/PMC_Stats/Feb22 # pmcstat -R sample0.out -G - | head
> @ CPU_CLK_UNHALTED_CORE [480 samples]
> 
> 37.29%  [179]      acpi_cpu_idle_mwait @ /boot/kernel/kernel
> 100.0%  [179]       acpi_cpu_idle
>  100.0%  [179]        cpu_idle_acpi
>   100.0%  [179]         cpu_idle
>    100.0%  [179]          sched_idletd
>     100.0%  [179]           fork_exit
> 
> 12.92%  [62]       cpu_idle @ /boot/kernel/kernel
> 
> When we tried to run pmcstat with the vcxl interfaces enabled, the system
> simply became unresponsive.
> 
> Based on previous results with CentOS 7 (over 3M pps), we can assume that
> it is not the hardware. However, we are still looking for the reason we
> are getting these numbers.
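> 
> Since most of the input traffic shows up as idrops on the cxl interface,
> we have also started poking at the drop counters. A rough sketch of what
> we run (the exact sysctl names under the dev.cxl.0 tree are from memory
> and may differ by driver version):
> 
>    # per-port statistics for the first Chelsio port, filtered for drops
>    sysctl dev.cxl.0 | grep -i drop
>    # protocol-level counters for comparison
>    netstat -s -p ip | grep -i drop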
> 
> Any feedback or suggestions would be highly appreciated.
> 
> - Jordan
> 
> On 2/9/17 11:34 AM, Navdeep Parhar wrote:
>> The vcxl interfaces should work under -CURRENT or 11-STABLE.  Let me know
>> if you run into any trouble when trying to use netmap with the cxgbe driver.
>> 
>> Regards,
>> Navdeep
>> 
>> On Thu, Feb 09, 2017 at 10:29:08AM -0500, John Jasen wrote:
>>> It's not the hardware.
>>> 
>>> Jordan booted up CentOS on the box and, untuned, it was able to obtain
>>> over 3 Mpps.
>>> 
>>> He has some pmcstat output from freebsd-current, but basically, it
>>> appears the system spends most of its time looking for a CPU to service
>>> the interrupts and keeps landing on one or two of them, as opposed to
>>> any of the other 16 cores on the physical silicon.
>>> 
>>> We also tried swapping out the T5 card for a Mellanox, tried different
>>> PCIe slots, and adjusted cpuset for the low and the high CPUs; no matter
>>> what we try, the results have been bad.
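>>> 
>>> For the record, the sort of thing we tried for interrupt pinning looks
>>> like the sketch below (the irq numbers are placeholders; the real ones
>>> come from vmstat -i):
>>> 
>>>    # bind a single NIC queue interrupt to one CPU with cpuset(1)
>>>    cpuset -l 2 -x 268
>>>    # or let a queue interrupt float across the first package
>>>    cpuset -l 0-17 -x 269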
>>> 
>>> Our network test environment is under reconstruction at the moment, but
>>> our plans afterwards are to:
>>> 
>>> a) test netmap-fwd again (does enabling vcxl work under -CURRENT?)
>>> 
>>> b) test without netmap-fwd, and with reduced cores/physical cpus (BIOS
>>> setting)
>>> 
>>> c) potentially, test with netmap-fwd and reduced core count.
>>> 
>>> Any other ideas out there?
>>> 
>>> Thanks!
>>> 
>>> On 02/05/2017 12:55 PM, Navdeep Parhar wrote:
>>>> I've been following the email thread on  freebsd-net on this.  The
>>>> numbers you're getting are well below what the hardware is capable of.
>>>> 
>>>> Have you tried netmap-fwd or something that bypasses the kernel?  That
>>>> will be a very quick way to make sure that the hardware is doing ok.
>>>> 
>>>> In case you try netmap:
>>>> cxgbe has virtual interfaces now and those are used for netmap (instead
>>>> of the main interface).  Add this line to /boot/loader.conf and you'll
>>>> see a 'vcxl' interface for every cxl interface.
>>>> hw.cxgbe.num_vis=2
>>>> It has its own MAC address and can be used like any other interface,
>>>> except it has native netmap support too.  You can run netmap-fwd between
>>>> these vcxl ports.
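>>>> 
>>>> A rough sketch of a test run follows; I'm assuming netmap-fwd takes the
>>>> two ports to forward between as arguments, so check its usage output for
>>>> the exact syntax:
>>>> 
>>>>    # after rebooting with hw.cxgbe.num_vis=2 in /boot/loader.conf
>>>>    ifconfig vcxl0 up
>>>>    ifconfig vcxl1 up
>>>>    netmap-fwd vcxl0 vcxl1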
>>>> 
>>>> Regards,
>>>> Navdeep
>>>> 
>>>> On Tue, Jan 31, 2017 at 01:57:37PM -0400, Jordan Caraballo wrote:
>>>>>   Navdeep, Troy,
>>>>> 
>>>>>   I am forwarding you this email to see if we can get feedback from both of
>>>>>   you. I talked with Troy in November about this R530 system and the use of
>>>>>   a 40G Chelsio T-580-CR card. So far, we have not seen results above
>>>>>   1.4 million pps or so.
>>>>> 
>>>>>   Any help would be appreciated.
>>>>> 
>>>>>   - Jordan
>>>>> 
>>>>>   -------- Forwarded Message --------
>>>>> 
>>>>>   Subject: Re: Disappointing packets-per-second performance results on a
>>>>>            Dell PE R530
>>>>>      Date: Tue, 31 Jan 2017 13:53:15 -0400
>>>>>      From: Jordan Caraballo <jordancaraballo87 at gmail.com>
>>>>>        To: Slawa Olhovchenkov <slw at zxy.spb.ru>
>>>>>        CC: freebsd-net at freebsd.org
>>>>> 
>>>>>   These are the most recent stats. No progress so far. The system is running
>>>>>   -CURRENT at the moment.
>>>>> 
>>>>>   Any help or feedback would be appreciated.
>>>>>   Hardware Configuration:
>>>>>   Dell PowerEdge R530 with two Intel(R) Xeon(R) E5-2695 CPUs, 18 cores per
>>>>>   CPU. Equipped with a dual-port Chelsio T-580-CR in a PCIe x8 slot.
>>>>> 
>>>>>   BIOS tweaks:
>>>>>   Hyperthreading (or Logical Processors) is turned off.
>>>>> 
>>>>>   loader.conf:
>>>>>   # Chelsio Modules
>>>>>   t4fw_cfg_load="YES"
>>>>>   t5fw_cfg_load="YES"
>>>>>   if_cxgbe_load="YES"
>>>>> 
>>>>>   rc.conf:
>>>>>   # Gateway Configuration
>>>>>   ifconfig_cxl0="inet 172.16.1.1/24"
>>>>>   ifconfig_cxl1="inet 172.16.2.1/24"
>>>>>   gateway_enable="YES"
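>>>>> 
>>>>>   We have also considered experimenting with the cxgbe(4) queue tunables
>>>>>   in loader.conf; the knob names below are from memory, so double-check
>>>>>   them against cxgbe(4) before relying on them:
>>>>> 
>>>>>   # number of rx/tx queues per port (defaults depend on core count)
>>>>>   hw.cxgbe.nrxq10g=8
>>>>>   hw.cxgbe.ntxq10g=8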
>>>>> 
>>>>>   Last Results:
>>>>>              input        (Total)           output
>>>>>      packets  errs idrops      bytes    packets  errs      bytes colls drops
>>>>>         2.7M     0   2.0M       1.4G       696k     0       368M     0     0
>>>>>         2.7M     0   2.0M       1.4G       686k     0       363M     0     0
>>>>>         2.6M     0   2.0M       1.4G       668k     0       353M     0     0
>>>>>         2.7M     0   2.0M       1.4G       661k     0       350M     0     0
>>>>>         2.8M     0   2.1M       1.5G       697k     0       369M     0     0
>>>>>         2.8M     0   2.1M       1.4G       684k     0       361M     0     0
>>>>>         2.7M     0   2.1M       1.4G       674k     0       356M     0     0
>>>>> 
>>>>>   root at router1:~ # vmstat -i
>>>>> 
>>>>>   interrupt                    total    rate
>>>>>   irq9: acpi0                     73       0
>>>>>   irq18: ehci0 ehci1         1155973       3
>>>>>   cpu0:timer                 3551157      10
>>>>>   cpu29:timer                9303048      27
>>>>>   cpu9:timer                71693455     207
>>>>>   cpu16:timer                9798380      28
>>>>>   cpu18:timer                9287094      27
>>>>>   cpu26:timer                9342495      27
>>>>>   cpu20:timer                9145888      26
>>>>>   cpu8:timer                 9791228      28
>>>>>   cpu22:timer                9288116      27
>>>>>   cpu35:timer                9376578      27
>>>>>   cpu30:timer                9396294      27
>>>>>   cpu23:timer                9248760      27
>>>>>   cpu10:timer                9756455      28
>>>>>   cpu25:timer                9300202      27
>>>>>   cpu27:timer                9227291      27
>>>>>   cpu14:timer               10083548      29
>>>>>   cpu28:timer                9325684      27
>>>>>   cpu11:timer                9906405      29
>>>>>   cpu34:timer                9419170      27
>>>>>   cpu31:timer                9392089      27
>>>>>   cpu33:timer                9350540      27
>>>>>   cpu15:timer                9804551      28
>>>>>   cpu32:timer                9413182      27
>>>>>   cpu19:timer                9231505      27
>>>>>   cpu12:timer                9813506      28
>>>>>   cpu13:timer               10872130      31
>>>>>   cpu4:timer                 9920237      29
>>>>>   cpu2:timer                 9786498      28
>>>>>   cpu3:timer                 9896011      29
>>>>>   cpu5:timer                 9890207      29
>>>>>   cpu6:timer                 9737869      28
>>>>>   cpu7:timer                 9790119      28
>>>>>   cpu1:timer                 9847913      28
>>>>>   cpu21:timer                9192561      27
>>>>>   cpu24:timer                9300259      27
>>>>>   cpu17:timer                9786186      28
>>>>>   irq264: mfi0                151818       0
>>>>>   irq266: bge0                 30466       0
>>>>>   irq272: t5nex0:evt               4       0
>>>>>   Total                    402604945    1161
>>>>>   top -PHS
>>>>>   last pid: 18557; load averages: 2.58, 1.90, 0.95 up 4+00:39:54 18:30:46
>>>>>   231 processes: 40 running, 126 sleeping, 65 waiting
>>>>>   CPU 0: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 1: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 2: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 3: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 4: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 5: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 6: 0.0% user, 0.0% nice, 0.4% system, 0.0% interrupt, 99.6% idle
>>>>>   CPU 7: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 8: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 9: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 10: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 11: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 12: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 13: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 14: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 15: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 16: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 17: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 18: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 19: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 20: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 21: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 22: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 23: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 24: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 25: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 26: 0.0% user, 0.0% nice, 0.0% system, 59.6% interrupt, 40.4% idle
>>>>>   CPU 27: 0.0% user, 0.0% nice, 0.0% system, 96.3% interrupt, 3.7% idle
>>>>>   CPU 28: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 29: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 30: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 31: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 32: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 33: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   CPU 34: 0.0% user, 0.0% nice, 0.0% system, 100% interrupt, 0.0% idle
>>>>>   CPU 35: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
>>>>>   Mem: 15M Active, 224M Inact, 1544M Wired, 393M Buf, 29G Free
>>>>>   Swap: 3881M Total, 3881M Free
>>>>> 
>>>>>   pmcstat -R sample.out -G - | head
>>>>>   @ CPU_CLK_UNHALTED_CORE [159 samples]
>>>>> 
>>>>>   39.62%  [63]       acpi_cpu_idle_mwait @ /boot/kernel/kernel
>>>>>    100.0%  [63]        acpi_cpu_idle
>>>>>     100.0%  [63]         cpu_idle_acpi
>>>>>      100.0%  [63]          cpu_idle
>>>>>       100.0%  [63]           sched_idletd
>>>>>        100.0%  [63]            fork_exit
>>>>> 
>>>>>   17.61%  [28]       cpu_idle @ /boot/kernel/kernel
>>>>> 
>>>>>   root at router1:~ # pmcstat -R sample0.out -G - | head
>>>>>   @ CPU_CLK_UNHALTED_CORE [750 samples]
>>>>> 
>>>>>   31.60%  [237]      acpi_cpu_idle_mwait @ /boot/kernel/kernel
>>>>>    100.0%  [237]       acpi_cpu_idle
>>>>>     100.0%  [237]        cpu_idle_acpi
>>>>>      100.0%  [237]         cpu_idle
>>>>>       100.0%  [237]          sched_idletd
>>>>>        100.0%  [237]           fork_exit
>>>>> 
>>>>>   10.67%  [80]       cpu_idle @ /boot/kernel/kernel
>>>>> 
>>>>>   On 03/01/17 13:46, Slawa Olhovchenkov wrote:
>>>>> 
>>>>> On Tue, Jan 03, 2017 at 12:35:42PM -0400, Jordan Caraballo wrote:
>>>>> 
>>>>> 
>>>>> We recently tested a Dell R530 with a Chelsio T580 card under FreeBSD 10.3, 11.0, -STABLE and -CURRENT, and CentOS 7.
>>>>> 
>>>>> Based on our research, including netmap-fwd and the routing improvements project (https://wiki.freebsd.org/ProjectsRoutingProposal),
>>>>> we hoped for packets-per-second (pps) rates in the 5+ million range, or even higher.
>>>>> 
>>>>> Based on prior testing (http://marc.info/?t=140604252400002&r=1&w=2), we expected 3-4 million pps to be easily obtainable.
>>>>> 
>>>>> Unfortunately, our current results top out at no more than 1.5M pps (64-byte packets) with FreeBSD, and,
>>>>> surprisingly, around 3.2M pps (128-byte packets) with CentOS 7, and we are at a loss as to why.
>>>>> 
>>>>> Server Description:
>>>>> Dell PowerEdge R530 with two Intel(R) Xeon(R) E5-2695 CPUs, 18 cores per
>>>>> CPU. Equipped with a dual-port Chelsio T-580-CR in a PCIe x8 slot.
>>>>> 
>>>>> ** Could this be a lack-of-support issue related to the R530's hardware? **
>>>>> 
>>>>> Any help appreciated!
>>>>> 
>>>>> What is the hardware configuration?
>>>>> What are the BIOS settings?
>>>>> What is in loader.conf/sysctl.conf?
>>>>> What does `vmstat -i` show?
>>>>> What does `top -PHS` show?
>>>>> And the output of:
>>>>> ====
>>>>> pmcstat -S CPU_CLK_UNHALTED_CORE -l 10 -O sample.out
>>>>> pmcstat -R sample.out -G out.txt
>>>>> pmcstat -c 0 -S CPU_CLK_UNHALTED_CORE -l 10 -O sample0.out
>>>>> pmcstat -R sample0.out -G out0.txt
>>>>> ====
> 