svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

Bruce Evans brde at optusnet.com.au
Mon Dec 17 19:08:50 UTC 2018


On Mon, 17 Dec 2018, Andrew Gallatin wrote:

> On 12/5/18 9:20 AM, Slava Shwartsman wrote:
>> Author: slavash
>> Date: Wed Dec  5 14:20:57 2018
>> New Revision: 341578
>> URL: https://svnweb.freebsd.org/changeset/base/341578
>> 
>> Log:
>>    mlx5en: Remove the DRBR and associated logic in the transmit path.
>>       The hardware queues are deep enough currently and using the DRBR
>>    and associated callbacks only leads to more task switching in the TX
>>    path.  There is also a race setting the queue_state which can lead to
>>    hung TX rings.
>
> The point of DRBR in the tx path is not simply to provide a software
> ring for queuing excess packets.  Rather it provides a mechanism to
> avoid lock contention by shoving a packet into the software ring, where
> it will later be found & processed, rather than blocking the caller on
> an mtx lock.  I'm concerned you may have introduced a performance
> regression for use cases where you have N:1 or N:M lock contention,
> where many threads on different cores are contending for the same tx
> queue.  The state of the art for this is no longer DRBR, but mp_ring,
> as used by both cxgbe and iflib.
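
For readers who haven't seen it, the lock-avoidance pattern being
described is the classic drbr if_transmit idiom.  A minimal sketch
(not mlx5en's actual code; the foo_* names are hypothetical, while
drbr_enqueue() and mtx_trylock() are the real kernel interfaces):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/buf_ring.h>
#include <net/if.h>
#include <net/if_var.h>

struct foo_txq {
	struct mtx	 mtx;
	struct buf_ring	*br;
};

static void	foo_txq_drain(struct foo_txq *);  /* dequeue br, fill hw ring */

static int
foo_transmit(struct ifnet *ifp, struct mbuf *m)
{
	struct foo_txq *txq = if_getsoftc(ifp);	/* 1 queue for simplicity */
	int error;

	/* Never block the caller: park the packet in the software ring. */
	error = drbr_enqueue(ifp, txq->br, m);
	if (error != 0)
		return (error);

	/*
	 * Drain the ring only if the queue lock is free.  A contending
	 * sender returns immediately; the current lock holder will find
	 * and process the parked packet later.
	 */
	if (mtx_trylock(&txq->mtx)) {
		foo_txq_drain(txq);
		mtx_unlock(&txq->mtx);
	}
	return (0);
}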

iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.  On fast machines, it attempts to do 1 context switch per
(small) tx packet and can't keep up.  On slow machines it has a chance of
handling multiple packets per context switch, but since the machine is too
slow it can't keep up and saturates at a slightly different point.  Results
for netblast $lanhost 5001 5 10 (5-byte payload for 10 seconds) on an I218V
on a Haswell (4 cores x 2 threads @ 4.08GHz) running i386:

Old results with no iflib and no EM_MULTIQUEUE except as indicated:

FBSD-10     UP    1377+0
FBSD-11     UP    1326+0
FBSD-11     SMP-1 1484+0
FBSD-11     SMP-8 1395+0
FBSD-12mod  SMP-1 1386+0
FBSD-12mod  SMP-8 1422+0
FBSD-12mod  SMP-1 1270+0   # use iflib (lose 8% performance)
FBSD-12mod  SMP-8 1279+0   # use iflib (lose 10% performance using more CPU)

1377+0 means 1377 kpps sent and 0 kpps errors, etc.  SMP-8 means use all 8
CPUs.  SMP-1 means restrict netblast to 1 CPU different from the taskqueue
CPUs using cpuset.

New results:

FBSD-11     SMP-8 1440+0   # no iflib, no EM_MULTIQUEUE
FBSD-11     SMP-8 1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)
FBSD-cur    SMP-8  533+0   # use iflib, use i386 with 4G KVA

iflib only decimates performance relative to the FreeBSD-11 version
with no EM_MULTIQUEUE, but EM_MULTIQUEUE gives better queueing using
more CPUs.  This gives the extra 10-20% of performance needed to
saturate the NIC and 1Gbps ethernet.  The FreeBSD-current version is
not directly comparable since using 4G KVA on i386 reduces performance
by about a factor of 2.5 for all loads with mostly small i/o's (for
128K disk i/o's the reduction is only 10-20%).  i386 ran at about the
same speed as amd64 when it had 1GB KVA, but I don't have any saved
results for amd64 to compare with precisely.  This is all with
security-related things like ibrs unavailable or turned off.

All versions use normal Intel interrupt moderation which gives an interrupt
rate of 8k/sec.

Old versions of em use a "fast" interrupt handler and a slow switch
to a taskqueue.  This gives a context switch rate of about 16k/sec.
In the SMP case, netblast normally runs on another CPU and I think it
fills h/w tx queue(s) synchronously, and the taskqueue only does minor
cleanups.  Old em also has a ping latency about 10% lower than with
iflib (73 usec instead of 80 usec after setting em.x.itr to 0 and
other tuning to kill interrupt moderation, and similar for a bge NIC
on the other end).  The synchronous queue filling probably improves
latency, but it is hard to see how it makes a difference of more than
1 usec.  73 is already too high.  An old PRO1000 Intel NIC has a latency
of only 50 usec on the same network.  The network switch accounts for
about 20 usec of this.
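
Roughly, the "fast" handler scheme looks like this (a sketch, not em's
actual code; the foo_* helpers are hypothetical, while the interrupt
filter and taskqueue(9) interfaces are real):

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/taskqueue.h>

struct foo_softc {
	struct taskqueue	*tq;
	struct task		 rxtx_task;
};

static void	foo_disable_intr(struct foo_softc *);	/* hypothetical */
static void	foo_enable_intr(struct foo_softc *);	/* hypothetical */
static void	foo_handle_rxtx(struct foo_softc *);	/* hypothetical */

/* Registered via bus_setup_intr() as the filter (fast) handler. */
static int
foo_intr_filter(void *arg)
{
	struct foo_softc *sc = arg;

	/* Fast interrupt context: mask the NIC and defer the real work. */
	foo_disable_intr(sc);
	taskqueue_enqueue(sc->tq, &sc->rxtx_task);
	return (FILTER_HANDLED);
}

/* The slow switch lands here, in a taskqueue thread. */
static void
foo_task(void *arg, int pending __unused)
{
	struct foo_softc *sc = arg;

	foo_handle_rxtx(sc);	/* clean tx ring, refill rx, etc. */
	foo_enable_intr(sc);
}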

iflib uses taskqueues more.  netblast normally runs on another CPU and
I think it only fills s/w tx queue(s) synchronously, and wakes up the
taskqueues for every packet.  The CPUs are almost fast enough to keep
up, and the system does about 1M context switches per second for this
(in versions other than i386 with 4G KVA).  That is slightly more than
2 packets per switch to get the speed of 1279 kpps.  netblast uses 100%
of 1 CPU, but the taskqueues don't saturate their CPUs, although they
would have to in order to do even more context switches.  They still
use a lot of CPU (about 50% of 1 CPU more than in old em).  These
context switches lose by doing the opposite of interrupt moderation.

I can "fix" the extra context switches and restore some of the lost
performance and most of the lost CPU by running netblast on the same
CPU as the main taskqueue (and using my normal configuration of no
PREEMPTION and no IPI_PREEMPTION) or by breaking the scheduler to never
preempt to a higher priority thread.  Non-broken schedulers preempt
idle threads to run higher priority threads even without PREEMPTION.
PREEMPTION gives this preemption for non-idle threads too.  So my
"fix" stops the taskqueue being preempted to on every packet.
netblast gets preempted eventually and waits for the taskqueue, but
it still manages to send more packets using less CPU.
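
(Concretely, the "fix" is just pinning with cpuset(1), e.g. something
like "cpuset -l 6 netblast $lanhost 5001 5 10" when the main taskqueue
thread lives on CPU6 as it does in my setup.)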

My "fix" doesn't quite give UP behaviour.  PREEMPTION is necessary with
UP, and the "fix" depends on not having it.  I haven't tested this.
Scheduling makes little difference for old em since the taskqueue only
runs for tx interrupts and then does very little.  tx interrupts are
very unimportant for this benchmark on old em and bge.  My bge tuning
delays them for up to 1 second if possible when tuning for throughput
over latency.
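
(In kernel config terms, "no PREEMPTION" just means building without
GENERIC's "options PREEMPTION" line; IPI_PREEMPTION is, as far as I
know, not enabled in GENERIC to begin with.)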

The relative effect of this "fix" is shown for the PRO1000 NIC by:

FBSD-cur  SMP-1 293+773    # iflib, i386 with 4G KVA, cpuset to taskq CPU
FBSD-cur  SMP-1 like SMP-8 # iflib, i386 with 4G KVA, cpuset to non-taskq CPU
FBSD-cur  SMP-8 279+525    # iflib, i386 with 4G KVA

This NIC seemed to saturate at 280 kpps on all systems, but the "fix"
lets it reach 293 kpps and leaves enough CPU to spare to generate and
drop 248 kpps.  The dropped packet count is a good test of the combination
of CPU to spare and efficiency of dropping packets.  Old versions of
FreeBSD and em have much more CPU to spare and drop packets more efficiently
by peeking at the ifq high in the network stack.  They can generate and
drop about 2000 kpps on this NIC, but the best iflib version can only
do this for about 1000 kpps.
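
That early-drop path is roughly the following (a condensed sketch; the
function is hypothetical, but IFQ_HANDOFF() is the real macro):

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

/* Hypothetical condensation of the pre-drbr output path. */
static int
old_style_handoff(struct ifnet *ifp, struct mbuf *m)
{
	int error;

	/*
	 * IFQ_HANDOFF() checks the if_snd queue length first: on a full
	 * queue it frees m and sets ENOBUFS without taking any driver
	 * lock or touching the hardware, which is why dropping here is
	 * so cheap for a saturated NIC.
	 */
	IFQ_HANDOFF(ifp, m, error);
	return (error);
}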

The Haswell CPU has 4 cores x 2 threads, and sharing CPUs is about 67%
slower for each CPU of an HTT pair.  The main taskq is on CPU6 and the
other taskq is on CPU7.  Running netblast on CPU6 gives the speedup.
Running netblast on CPU7 gives HTT contention, but this makes little
difference.  On the PRO1000, where the NIC saturates first so that the
taskq's don't run so often, their CPU usages are about 35% for CPU6 and
1% for CPU7 when netblast is run on CPU0.  So there is only about 35%
HTT and netblast contention when netblast is run on CPU7.

> For well behaved workloads (like Netflix's), I don't anticipate
> this being a performance issue.  However, I worry that this will impact
> other workloads and that you should consider running some testing of
> N:1 contention.   Eg, 128 netperfs running in parallel with only
> a few nic tx rings.

For the I218V before iflib, 2 netblasts got closer to saturating the NIC,
but 8 netblasts were slower than 1.  Checking this now with the PRO1000,
the total kpps counts (all with about 280 kpps actually sent) are:

1 netblast:   537
2 netblasts:  949 (high variance from now on; this is one of the higher samples)
3 netblasts: 1123
4 netblasts: 1206
5 netblasts: 1129
6 netblasts: 1094
7 netblasts: 1080
8 netblasts: 1016

So the contention is much the same as before for the dropping-packets part
after the NIC saturates.  Maybe it is all in the network stack.  There is
a lot of slowness there too, so a 4GHz CPU is needed to almost keep up with
the network for small packets sent by any 1Gbps NIC.

Multi-queue NICs obviously need something like taskqueues to avoid contention
with multiple senders, but for the taskqueues to be fast you have to have
enough CPUs to dedicate 1 CPU per queue and not waste time and latency
context-switching this CPU to the idle thread.  According to lmbench, the
context switch latency on the test system is between 1.1 and 1.8 usec for
all cases between 2proc/0K and 16proc/64K.  Context switches to and from
the idle thread are much faster, and they need to be to reach 1M/sec (at
1.1-1.8 usec each, 1M switches/sec would eat more than a whole CPU).
Watching context switches more carefully using top -m io shows that for
1 netblast to the PRO1000 they are:

259k/sec for if_io_tqg_6 (names are too verbose and are truncated by top)
259k/sec for idle: cpu<truncated> on same CPU as above
7.9k/sec for if_io_tqg_7
7.9k/sec for idle: cpu<truncated> on same CPU as above

These are much less than 1M/sec because i386 with 4G KVA is several times
slower than i386 with 1G KVA.
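
(The lmbench figures are from its lat_ctx benchmark; something like
"lat_ctx -s 0 2" and "lat_ctx -s 64 16" should reproduce the 2proc/0K
and 16proc/64K cases, assuming the stock lmbench tool.)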

I mostly use the PRO1000 because its ping latency with the best
configuration is 50 usec instead of 80 usec, and only the latency
matters for nfs use.

Bruce

