svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

Bruce Evans brde at optusnet.com.au
Tue Dec 18 18:42:05 UTC 2018


On Wed, 19 Dec 2018, Bruce Evans wrote:

> On Mon, 17 Dec 2018, Andrew Gallatin wrote:
>
>> On 12/17/18 2:08 PM, Bruce Evans wrote:
> ...
>>> iflib uses queuing techniques to significantly pessimize em NICs with 1
>>> hardware queue.  On fast machines, it attempts to do 1 context switch per
> ...
>> This can happen even w/o contention when "abdicate" is enabled in mp
>> ring.  I complained about this as well, and the default was changed in
>> mp ring to not always "abdicate" (e.g., switch to the tq to handle the
>> packet).  Abdication substantially pessimizes uncontended Netflix-style
>> web workloads, but it generally helps small packet forwarding.
>> 
>> It is interesting that you see the opposite.  I should try benchmarking
>> with just a single ring.
>
> Hmm, I didn't remember "abdicated" and never knew about the sysctl for it
> (the sysctl is newer), but I noticed the slowdown from near the first
> commit for it (r323954) and already used the following workaround for it:
> ...
> This essentially just adds back the previous code with a flag to check
> both versions.  Hopefully the sysctl can do the same thing.

It doesn't.  Setting tx_abdicate to 1 gives even more context switches
(almost twice as many: 800k/sec instead of 400k/sec on i386 pessimized by
INVARIANTS, WITNESS, !WITNESS_SKIPSPIN, 4G KVA and more; without the
pessimizations it does 1M/sec instead of 400k/sec).  The behaviour is easy
to understand by watching top -SH -m io with netblast bound to the same
CPU as the main tgq.  Then netblast does involuntary context switches at
the same rate that the tgq does voluntary context switches, and tx_abdicate=1
doubles this rate.  netblast only switches at the quantum rate (11 per second)
when not bound (I think it does null switches and it is a bug to count these
as switches, but even null switches cost too much).
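
For reference, here is my mental model of the two paths, reduced to a
sketch from sys/net/iflib.c and sys/net/mp_ring.c (simplified, not the
literal code; select_txq() and tx_abdicate_on() are paraphrased
placeholders for the real queue selection and sysctl read):

/*
 * With abdicate, every enqueue marks the ring abdicated and wakes the
 * per-queue group taskqueue, so each packet costs a handoff to the
 * tgq; without abdicate, an enqueuer that finds the ring idle drains
 * it in its own context and no switch is needed.
 */
static int
iflib_if_transmit_sketch(if_t ifp, struct mbuf *m)
{
        iflib_txq_t txq = select_txq(ifp, m);   /* paraphrased */
        bool abdicate = tx_abdicate_on(ifp);    /* paraphrased sysctl read */
        int err;

        err = ifmp_ring_enqueue(txq->ift_br, (void **)&m, 1,
            TX_BATCH_SIZE, abdicate);
        if (abdicate)
                GROUPTASK_ENQUEUE(&txq->ift_task);  /* handoff: 1 csw */
        /* else: ifmp_ring_enqueue() drained an idle ring in this thread */
        return (err);
}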

This is also without my usual default of !PREEMPTION && !IPI_PREEMPTION.
Binding netblast to the same CPU as the tgq only stops the excessive
context switches when !PREEMPTION.  My hack might depend on this too.
Unfortunately, the hack is not in the same kernels as the sysctl, and I
already have too many combinations to test.
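
Where the sysctl exists, at least the abdicate dimension can be flipped
at run time instead of building more kernels.  A minimal userland
sketch using sysctlbyname(3); the OID path is an assumption for an
em(4) unit under iflib (check sysctl -a for the real one):

#include <sys/types.h>
#include <sys/sysctl.h>

/* Flip the abdicate policy for one NIC from a test harness. */
static int
set_tx_abdicate(int on)
{
        /* assumed OID; adjust for the actual driver and unit */
        return (sysctlbyname("dev.em.0.iflib.tx_abdicate",
            NULL, NULL, &on, sizeof(on)));
}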

Another test with only 4G KVA (no INVARIANTS, etc., no PREEMPTION):
tx_abdicate=0: tgq switch rate  997-1017k/sec (16k/sec if netblast bound)
tx_abdicate=1: tgq switch rate 1300-1350k/sec (16k/sec if netblast bound)

Another test on amd64 to escape i386 4G KVA pessimizations:
tx_abdicate=0: tgq switch rate 1110-1220k/sec (16k/sec if netblast bound)
tx_abdicate=1: tgq switch rate 1360-1430k/sec (16k/sec if netblast bound)

When netblast is bound to the tgq's CPU, the tgq actually runs on another
CPU.  Apparently, the binding is weak, or this is a bugfeature in my
scheduler.

When tx_abdicate=1, the switch rate is close to the packet rate.  Since the
NIC can't keep up, most packets are dropped.  On amd64 with tx_abdicate=1,
the packet rates are:

netblast bound:   313kpps sent, 1604kpps dropped
netblast unbound: 253kpps sent, 1153kpps dropped

253kpps sent is bad.  This indicates large latencies (not due to !PREEMPTION
or scheduler bugs AFAIK).  Most tests with netblast unbound seemed to saturate
the NIC at 280kpps (but the tests with netblast bound show that the NIC can
go a little faster).  Even an old 2GHz CPU can reach 280kpps.

This shows another problem with taskqueues: it takes context switches just
to decide to drop packets.  Previous versions of iflib were much slower at
dropping packets; some had drop rates closer to the low send rate than to
the 1604kpps achieved above.  FreeBSD-5 running on a single CPU 3 times
slower can drop packets at 2124kpps, mainly by dropping them in ip_output()
after peeking at the software ifqs to see that there is no space.
IFF_MONITOR gives better tests of the syscall overhead.
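
The old path had roughly this shape (a from-memory sketch of the
classic if_snd handoff, not the literal FreeBSD-5 code): the sender
peeks at the software queue and drops in-line, with no taskqueue and
no context switch anywhere near the drop:

        IF_LOCK(&ifp->if_snd);
        if (_IF_QFULL(&ifp->if_snd)) {
                _IF_DROP(&ifp->if_snd);         /* just a counter bump */
                IF_UNLOCK(&ifp->if_snd);
                m_freem(m);
                return (ENOBUFS);               /* sender sees it at once */
        }
        _IF_ENQUEUE(&ifp->if_snd, m);
        IF_UNLOCK(&ifp->if_snd);
        (*ifp->if_start)(ifp);                  /* kick the NIC, same thread */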

Another test with amd64 and I218V instead of PRO1000:

netblast bound, !abdicate:   1243kpps sent,   0kpps dropped  (16k/sec csw)
netblast unbound, !abdicate: 1236kpps sent,   0kpps dropped  (16k/sec csw)
netblast bound, abdicate:    1485kpps sent, 243kpps dropped  (16k/sec csw)
netblast unbound, abdicate:  1407kpps sent, 1.7kpps dropped  (850k/sec csw)

There is an i386 dependency after all!  On amd64, !abdicate prevents the
excessive context switches; on i386 it does not.  Unfortunately, it also
reduces kpps by almost 20% and leaves no spare CPU for dropping packets.

The best case (netblast bound, abdicate) is competitive with FreeBSD-11
on i386 with EM_MULTIQUEUE; the above result repeated:

netblast bound, abdicate:    1485kpps sent, 243kpps dropped  (16k/sec csw)

previous best result:

FBSD-11     SMP-8 1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)

(this is without PREEMPTION* and without binding netblast).

The above for -current also has the lowest possible CPU use (100% of 1 CPU
for all threads, while netblast unbound takes 100% of 1 CPU for netblast and
60% of another CPU for the tgq).  I think the FBSD-11 case takes 100% of 1
CPU for netblast unbound, a tiny % of another CPU for the taskqueue, and
a tiny unaccounted % of various CPUs for the fast interrupt handler.  The
fast interrupt handler is not accounted for in all cases.  Since interrupt
moderation gives a rate of 8 kHz, the interrupt handler doesn't take very
long, but if it does a single PCI read then that might take 1 usec, so 8 kHz
costs about 1% of a CPU (8000/sec * 1 usec = 0.8%) for that alone.

Why would tx_abdicate=0 give extra context switches for i386 but not
for amd64?  More interestingly, what does it do wrong to lose 20% in
kpps sent and more in kpps dropped?

Another test with PREEMPTION*:

netblast bound, !abdicate:   same as above
netblast unbound, !abdicate: same as above
netblast bound, abdicate:     578kpps sent, 0kpps dropped (1160k/sec csw)
netblast unbound, abdicate:  1106kpps sent, 0kpps dropped  (850k/sec csw)

That is, abdicate with PREEMPTION, which makes it work as fully intended,
destroys performance for the netblast bound case, where without PREEMPTION
it fixes most performance problems; for the netblast unbound case it
only reduces performance by 30%.  It uses the same amount of CPU as
!PREEMPTION.

Another test with PREEMPTION* and SCHED_ULE instead of SCHED_4BSD
(PREEMPTION* works a little differently in different schedulers.  IIRC,
IPI_PREEMPTION is useless and is mostly ignored in 4BSD and entirely
ignored in ULE, and PREEMPTION loses a little more preemption in ULE
than in 4BSD):

netblast bound, !abdicate:   same as above
netblast unbound, !abdicate: same as above
netblast bound, abdicate:    same as above (very bad)
netblast unbound, abdicate:  1485kpps sent, 0kpps dropped  (850k/sec csw)

In the 2 !abdicate cases, the CPU use is below 1% for the tgq.  The very
bad case is independent of the scheduler.  This is probably inherent
(some sort of contention on the common bound CPU when preemption is
done technically correctly).  I expected this case to be much slower
before I tried it.  But ULE doesn't have the 30% loss for PREEMPTION
&& netblast unbound && abdicate.

The case where ULE is better has large latency for 4BSD.  It can be mostly
fixed by binding netblast to any set of CPUs not containing the tgq's CPU
or the one in the same HTT pair as it.  The speed is then only 85kpps
slower than with ULE.  Otherwise, even my version of 4BSD sees no reason
not to run netblast on one of the CPUs on the same core as the tgq.
(My 4BSD changes affinity more often, since in other benchmarks it is bad
to wait until the preferred CPU is available; binding of taskqueues and
ithreads too often steals preferred CPUs and causes this migration.)  When
netblast decides to run on the same CPU as the tgq, it contends with the
tgq.  When it decides to run on the HTT pair, both CPUs run 33% slower.
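
Binding away from the tgq can be done with cpuset(1) (something like
cpuset -l 0-1,4-7 netblast ...), or programmatically.  A minimal
sketch; the CPU numbers are assumptions for this machine's topology:

#include <sys/param.h>
#include <sys/cpuset.h>

/*
 * Run the calling process on any CPU except the tgq's (assumed to be
 * CPU 2 here) and its HTT sibling (assumed to be CPU 3).
 */
static int
bind_away_from_tgq(void)
{
        cpuset_t mask;

        CPU_FILL(&mask);                /* start from all CPUs */
        CPU_CLR(2, &mask);              /* the tgq's CPU (assumed) */
        CPU_CLR(3, &mask);              /* its HTT sibling (assumed) */
        return (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID,
            -1, sizeof(mask), &mask));
}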

Anyway, I don't want the ~1M/sec context switches given by abdicate.
Context switching is especially bad for 4BSD.  It still uses sched_lock
for everything (it appears to use thread_lock(), but for 4BSD that just
takes sched_lock).  The slowness from this is remarkably small in
most cases.
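
That is (simplified sketch, not the literal code): thread_lock() spins
on whatever td->td_lock points at; ULE points it at a per-runqueue
lock, while 4BSD points every thread at the single global sched_lock,
so all CPUs contend on one spin mutex for every scheduling operation,
including each context switch.

static void
thread_lock_sketch(struct thread *td)
{
        /* 4BSD: td->td_lock == &sched_lock for every thread */
        mtx_lock_spin(td->td_lock);
}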

Bruce

