[Bug 258623] [routing] performance - 2 numa domains vs single numa domain

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 20 Sep 2021 10:24:43 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=258623

            Bug ID: 258623
           Summary: [routing] performance - 2 numa domains vs single numa
                    domain
           Product: Base System
           Version: 13.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: konrad.kreciwilk@korbank.pl
                CC: net@FreeBSD.org

Server: Dell R630, 2x CPU E5-2667 v4 (2 NUMA domains), 64 GB RAM
NIC: 2x T62100-SO-CR, each connected to a separate NUMA domain


* 2 numa domain test

I use chelsio_affinity to assign the IRQs to the correct CPUs.

cfg:

ifconfig_cc0="up"
ifconfig_cc1="up"
ifconfig_cc2="up"
ifconfig_cc3="up"


#LAGG LACP
ifconfig_lagg0="laggproto lacp laggport cc0 laggport cc2 -wol -vlanhwtso -tso -lro -hwrxtstmp -txtls use_flowid use_numa up"

ifconfig_vlan2020="vlan 2020 vlandev lagg0"
ifconfig_vlan2002="vlan 2002 vlandev lagg0"


+--------+         +--------+      +---------+
|        +---------+        +------+         |
| Router |  lagg0  | switch |      |  gen    |
|        +---------+        +------+         |
+--------+         +--------+      +---------+


I can achieve around 14Mpps without drops. Above this level, drops appear on the
ccX/lagg0 interfaces. It looks like the CPUs still have some free resources:

# netstat -i -I lagg0 1
            input          lagg0           output
   packets  errs idrops      bytes    packets  errs      bytes colls
  15939431     0 555822 2246265134   15381955     0 2167675870     0
  16600413     0 612946 2339414686   15978803     0 2253137798     0
  15259699     0 575481 2150765886   14693013     0 2070319352     0
  15935269     0 512558 2245569909   15382551     0 2167518240     0
  16159627     0 616404 2277463695   15563046     0 2195364136     0
  14841125     0 322695 1605926868   14540305     0 1562096456     0
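
A rough way to quantify the drops in the sample above is to divide idrops by
the input packet counter for each interval. This is only a sketch: the rows
are copied from the netstat output, and whether the "packets" column already
includes the dropped packets is an assumption not confirmed here, so the
ratio is approximate (it comes out at a few percent of the input).

```python
# (input packets, input drops) per 1-second interval, copied from the
# lagg0 netstat sample above.
samples = [
    (15939431, 555822),
    (16600413, 612946),
    (15259699, 575481),
    (15935269, 512558),
    (16159627, 616404),
    (14841125, 322695),
]


def drop_ratio(packets, idrops):
    """Input drops relative to the input packet counter for one interval."""
    return idrops / packets


ratios = [drop_ratio(p, d) for p, d in samples]
for (p, d), r in zip(samples, ratios):
    print(f"{p / 1e6:6.2f} Mpps in, {d / 1e3:6.1f} k drops, {r:.2%}")

# Ratio over the whole sample window.
total_packets = sum(p for p, _ in samples)
total_drops = sum(d for _, d in samples)
overall = total_drops / total_packets
print(f"overall: {overall:.2%}")
```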

# top -PSH
last pid:  9745;  load averages:  6.46,  2.02,  0.76    up 0+00:02:06  20:25:17
580 threads:   25 running, 471 sleeping, 84 waiting
CPU 0:   0.0% user,  0.0% nice,  0.0% system, 59.2% interrupt, 40.8% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 57.7% interrupt, 42.3% idle
CPU 2:   0.0% user,  0.0% nice,  0.0% system, 57.7% interrupt, 42.3% idle
CPU 3:   0.0% user,  0.0% nice,  0.0% system, 60.6% interrupt, 39.4% idle
CPU 4:   0.0% user,  0.0% nice,  0.0% system, 56.3% interrupt, 43.7% idle
CPU 5:   0.0% user,  0.0% nice,  0.0% system, 62.0% interrupt, 38.0% idle
CPU 6:   0.0% user,  0.0% nice,  0.0% system, 59.2% interrupt, 40.8% idle
CPU 7:   0.0% user,  0.0% nice,  0.0% system, 53.5% interrupt, 46.5% idle
CPU 8:   0.0% user,  0.0% nice,  1.4% system, 62.0% interrupt, 36.6% idle
CPU 9:   0.0% user,  0.0% nice,  0.0% system, 67.6% interrupt, 32.4% idle
CPU 10:  0.0% user,  0.0% nice,  0.0% system, 69.0% interrupt, 31.0% idle
CPU 11:  0.0% user,  0.0% nice,  0.0% system, 66.2% interrupt, 33.8% idle
CPU 12:  0.0% user,  0.0% nice,  0.0% system, 63.4% interrupt, 36.6% idle
CPU 13:  0.0% user,  0.0% nice,  0.0% system, 62.0% interrupt, 38.0% idle
CPU 14:  0.0% user,  0.0% nice,  0.0% system, 63.4% interrupt, 36.6% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system, 63.4% interrupt, 36.6% idle
Mem: 536M Active, 29M Inact, 1528M Wired, 60G Free
ARC: 114M Total, 22M MFU, 88M MRU, 693K Header, 3231K Other
     30M Compressed, 92M Uncompressed, 3.12:1 Ratio
Swap: 32G Total, 32G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   12 root        -92    -     0B  1472K CPU8     8   0:37  60.50% intr{irq152: t6nex1:0a0}
   12 root        -92    -     0B  1472K CPU10   10   0:36  60.42% intr{irq154: t6nex1:0a2}
   12 root        -92    -     0B  1472K CPU11   11   0:36  60.27% intr{irq155: t6nex1:0a3}
   12 root        -92    -     0B  1472K CPU14   14   0:36  60.26% intr{irq158: t6nex1:0a6}
   12 root        -92    -     0B  1472K CPU12   12   0:36  60.24% intr{irq156: t6nex1:0a4}
   12 root        -92    -     0B  1472K CPU9     9   0:36  60.15% intr{irq153: t6nex1:0a1}
   12 root        -92    -     0B  1472K CPU13   13   0:36  59.88% intr{irq157: t6nex1:0a5}
   12 root        -92    -     0B  1472K CPU15   15   0:36  59.41% intr{irq159: t6nex1:0a7}
   12 root        -92    -     0B  1472K WAIT     0   0:37  58.49% intr{irq98: t6nex0:0a0}
   12 root        -92    -     0B  1472K WAIT     1   0:37  57.89% intr{irq99: t6nex0:0a1}
   12 root        -92    -     0B  1472K WAIT     4   0:37  57.39% intr{irq102: t6nex0:0a4}
   12 root        -92    -     0B  1472K WAIT     5   0:36  57.35% intr{irq103: t6nex0:0a5}
   12 root        -92    -     0B  1472K WAIT     3   0:36  57.32% intr{irq101: t6nex0:0a3}
   12 root        -92    -     0B  1472K WAIT     6   0:36  57.12% intr{irq104: t6nex0:0a6}
   12 root        -92    -     0B  1472K WAIT     2   0:36  56.98% intr{irq100: t6nex0:0a2}
   12 root        -92    -     0B  1472K WAIT     7   0:36  56.85% intr{irq105: t6nex0:0a7}


# pcm-numa.x

Time elapsed: 1064 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
   0   1.31       4195 M     3203 M      3382 K                48 K
   1   1.32       4211 M     3199 M      3241 K                27 K
   2   1.33       4238 M     3196 M      3146 K                48 K
   3   1.33       4238 M     3197 M      3143 K                26 K
   4   1.32       4228 M     3197 M      3241 K                47 K
   5   1.33       4243 M     3198 M      3046 K                29 K
   6   1.33       4247 M     3195 M      3169 K                47 K
   7   1.33       4264 M     3196 M      3180 K                20 K
   8   1.29       4159 M     3224 M      2948 K                77 K
   9   1.29       4172 M     3224 M      2865 K                92 K
  10   1.29       4199 M     3247 M      3263 K                76 K
  11   1.30       4237 M     3259 M      2892 K                91 K
  12   1.30       4261 M     3274 M      3069 K                73 K
  13   1.30       4231 M     3246 M      2959 K               104 K
  14   1.30       4291 M     3291 M      3353 K                74 K
  15   1.31       4221 M     3227 M      3008 K                85 K
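
To put the remote DRAM column in proportion, the share of remote accesses can
be summed per NUMA domain. The sketch below uses the values from the table
above (in thousands) and assumes that cores 0-7 belong to one domain and
cores 8-15 to the other. That split matches the IRQ placement shown by top,
but it is an assumption rather than something pcm-numa reports here.

```python
# (core, local DRAM accesses in K, remote DRAM accesses in K),
# copied from the pcm-numa.x table above.
rows = [
    (0, 3382, 48), (1, 3241, 27), (2, 3146, 48), (3, 3143, 26),
    (4, 3241, 47), (5, 3046, 29), (6, 3169, 47), (7, 3180, 20),
    (8, 2948, 77), (9, 2865, 92), (10, 3263, 76), (11, 2892, 91),
    (12, 3069, 73), (13, 2959, 104), (14, 3353, 74), (15, 3008, 85),
]

# Assumption: 8 consecutive cores per NUMA domain (0-7 and 8-15).
CORES_PER_DOMAIN = 8


def remote_share(domain_rows):
    """Fraction of DRAM accesses that went to the remote domain."""
    local = sum(r[1] for r in domain_rows)
    remote = sum(r[2] for r in domain_rows)
    return remote / (local + remote)


# Group the per-core rows by their assumed domain.
domains = {}
for row in rows:
    domains.setdefault(row[0] // CORES_PER_DOMAIN, []).append(row)

shares = {d: remote_share(rs) for d, rs in sorted(domains.items())}
for d, s in shares.items():
    print(f"domain {d}: {s:.2%} of DRAM accesses are remote")
```

With these numbers the remote share stays in the low single-digit percent
range for both domains, although it is visibly higher on cores 8-15.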


pmcstat -S cpu_clk_unhalted.thread flamegraph - https://files.fm/u/enhy23ffr

--------------------

* single domain test

In this scenario I create the vlans directly on cc0, so only one NUMA domain is used:

ifconfig_vlan2020="vlan 2020 vlandev cc0"
ifconfig_vlan2002="vlan 2002 vlandev cc0"



+--------+         +--------+      +---------+
|        +---------+        +------+         |
| Router |   cc0   | switch |      |  gen    |
|        |         |        +------+         |
+--------+         +--------+      +---------+


Using cc0 I can achieve 16Mpps without drops:

# netstat -i -I cc0 1
            input            cc0           output
   packets  errs idrops      bytes    packets  errs      bytes colls
  15934346     0     0 2245565269   15933728     0 2245477291     0
  15927621     0     0 2244617740   15928235     0 2244704202     0
  15934688     0     0 2245613662   15934213     0 2245546449     0
  15931155     0     0 2245115588   15931208     0 2245120654     0
  15926995     0     0 2244529583   15927391     0 2244585093     0
  15931114     0     0 2245109534   15931145     0 2245115823     0

# top -PSH
last pid:  9976;  load averages:  6.57,  2.51,  1.00    up 0+00:03:23  20:16:17
579 threads:   25 running, 470 sleeping, 84 waiting
CPU 0:   0.0% user,  0.0% nice,  0.0% system, 95.4% interrupt,  4.6% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 95.4% interrupt,  4.6% idle
CPU 2:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 3:   0.0% user,  0.0% nice,  0.0% system, 93.9% interrupt,  6.1% idle
CPU 4:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 5:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 6:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 7:   0.0% user,  0.0% nice,  0.0% system, 93.1% interrupt,  6.9% idle
CPU 8:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 9:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 10:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 11:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 12:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 13:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 14:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 537M Active, 30M Inact, 1260M Wired, 60G Free
ARC: 115M Total, 22M MFU, 89M MRU, 695K Header, 3260K Other
     30M Compressed, 93M Uncompressed, 3.10:1 Ratio
Swap: 32G Total, 32G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   12 root        -92    -     0B  1472K CPU3     3   1:50  94.86% intr{irq101: t6nex0:0a3}
   12 root        -92    -     0B  1472K CPU1     1   1:49  94.68% intr{irq99: t6nex0:0a1}
   12 root        -92    -     0B  1472K CPU5     5   1:49  94.40% intr{irq103: t6nex0:0a5}
   12 root        -92    -     0B  1472K CPU7     7   1:49  94.18% intr{irq105: t6nex0:0a7}
   12 root        -92    -     0B  1472K CPU0     0   1:49  94.13% intr{irq98: t6nex0:0a0}
   12 root        -92    -     0B  1472K CPU6     6   1:49  94.11% intr{irq104: t6nex0:0a6}
   12 root        -92    -     0B  1472K CPU4     4   1:49  93.81% intr{irq102: t6nex0:0a4}
   12 root        -92    -     0B  1472K CPU2     2   1:48  93.56% intr{irq100: t6nex0:0a2}


# pcm-numa.x

Time elapsed: 1002 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
   0   1.93       6513 M     3374 M      4179 K                34 K
   1   1.93       6516 M     3374 M      4153 K              3655
   2   1.94       6518 M     3352 M      4122 K                33 K
   3   1.94       6516 M     3367 M      4118 K              8574
   4   1.94       6517 M     3361 M      4142 K                37 K
   5   1.93       6516 M     3376 M      4147 K                10 K
   6   1.93       6515 M     3371 M      4154 K                39 K
   7   1.94       6514 M     3360 M      4173 K                12 K
   8   0.24       1833 K     7596 K      1805                1378
   9   0.20        728 K     3726 K       467                 502
  10   0.11        312 K     2779 K       227                 234
  11   0.14        486 K     3407 K       291                 361
  12   0.12        357 K     2956 K       183                 132
  13   0.07        195 K     2664 K        46                 119
  14   0.13        381 K     3047 K       455                 212
  15   0.23        765 K     3310 K       325                 346
---------------------------------------------------------------------


pmcstat -S cpu_clk_unhalted.thread flamegraph - https://files.fm/u/3njfz2r3g


* Summary

I know that lagg adds a certain amount of overhead, but based on my testing a
single card performs better than two cards in lagg0.
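
One way to compare the two runs on equal terms is forwarded packets per
"fully busy" core, using the interrupt percentages from top as a proxy for
forwarding work. The sketch below is only an estimate: the numbers are copied
from the top and netstat samples in this report, and it assumes those samples
describe the same load, which the report does not state.

```python
# Interrupt time per CPU (percent), copied from the two top snapshots.
lagg_intr = [59.2, 57.7, 57.7, 60.6, 56.3, 62.0, 59.2, 53.5,
             62.0, 67.6, 69.0, 66.2, 63.4, 62.0, 63.4, 63.4]
single_intr = [95.4, 95.4, 94.7, 93.9, 94.7, 94.7, 94.7, 93.1]

# Output packets per second, copied from the two netstat samples.
lagg_out = [15381955, 15978803, 14693013, 15382551, 15563046, 14540305]
single_out = [15933728, 15928235, 15934213, 15931208, 15927391, 15931145]


def mpps_per_busy_core(out_pps, intr_percent):
    """Mean forwarded Mpps divided by the number of core-equivalents
    spent in interrupt context (sum of percentages / 100)."""
    mean_mpps = sum(out_pps) / len(out_pps) / 1e6
    busy_cores = sum(intr_percent) / 100.0
    return mean_mpps / busy_cores


lagg_eff = mpps_per_busy_core(lagg_out, lagg_intr)
single_eff = mpps_per_busy_core(single_out, single_intr)

print(f"lagg0 (2 domains): {lagg_eff:.2f} Mpps per busy core")
print(f"cc0   (1 domain):  {single_eff:.2f} Mpps per busy core")
print(f"ratio lagg/single: {lagg_eff / single_eff:.2f}")
```

Under these assumptions the lagg0 setup forwards roughly 1.5 Mpps per busy
core against roughly 2.1 Mpps for cc0 alone, so the cost per packet is
noticeably higher in the two-domain configuration.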
