svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

Bruce Evans brde at optusnet.com.au
Wed Dec 19 22:17:26 UTC 2018


On Wed, 19 Dec 2018, Bruce Evans wrote:

> On Wed, 19 Dec 2018, Bruce Evans wrote:
>
>> On Mon, 17 Dec 2018, Andrew Gallatin wrote:
>> 
>>> On 12/17/18 2:08 PM, Bruce Evans wrote:
>> * ...
>>>> iflib uses queuing techniques to significantly pessimize em NICs with 1
>>>> hardware queue.  On fast machines, it attempts to do 1 context switch 
>>>> per
>> ...
>>> This can happen even w/o contention when "abdicate" is enabled in mp
>>> ring. I complained about this as well, and the default was changed in
>>> mp ring to not always "abdicate" (eg, switch to the tq to handle the
>>> packet). Abdication substantially pessimizes Netflix style web uncontended 
>>> workloads, but it generally helps small packet forwarding.
>>> 
>>> It is interesting that you see the opposite.  I should try benchmarking
>>> with just a single ring.
>> 
>> Hmm, I didn't remember "abdicated" and never knew about the sysctl for it
>> (the sysctl is newer), but I noticed the slowdown from near the first
>> commit for it (r323954) and already used the following workaround for it:
>> ...
>> This essentially just adds back the previous code with a flag to check
>> both versions.  Hopefully the sysctl can do the same thing.
>
> It doesn't.  Setting tx_abdicate to 1 gives even more context switches
> (almost twice as many, 800k/sec instead of 400k/sec) on i386 pessimized
> by INVARIANTS, WITNESS, !WITNESS_SKIPSPIN, 4G KVA and more.  Without
> ...

I now understand most of the slownesses and variations in benchmarks.

Short summary:

After arcane tuning, including a sysctl only available in my version
of SCHED_4BSD, on amd64 iflib in -current runs as fast as old em
with EM_MULTIQUEUE and no other tuning in FreeBSD-11; i386 also needs
a CPU almost 3 times faster to compensate for the overhead of having
4G KVA (but no other security pessimizations in either).

Long summary:

iflib with tx_abdicate=0 runs a bit like old em without EM_MULTIQUEUE,
provided the NIC is I218V and not PRO1000 and/or the CPU is too slow
to saturate the NIC and/or the network.  iflib is just 10% slower.
Neither does it do excessive context switches to tgq with I218V (context
switches seem to be limited to not much more than 2 per h/w interrupt,
and h/w interrupts are normally moderated to 8kHz).  However, iflib
does excessive context switches for PRO1000.  I don't know if this is
for hardware reasons or just for dropping packets.

iflib with tx_abdicate=1 runs a bit like old em with EM_MULTIQUEUE.  Due
to general slowness, even a 4GHz i7 has difficulty saturating 1Gbps ethernet
with small packets.  tx_abdicate=1 allows it to saturate by using tgq more.
This causes lots of context switches and otherwise uses lots of CPU (60%
of a 4GHz i7 for iflib).  Old em with EM_MULTIQUEUE gives identical kpps
and saturation and dropped packets for spare cycles on the CPU producing
the packets, but I think it does fewer context switches and uses less CPU
for tgq.  This is mostly for the I218V.

I got apparently-qualitatively-different results on i386 because I mostly
tested i386 with the PRO1000 where there are excessive context switches
on both amd64 and i386 with tx_abdicate=0.  tx_abdicate=1 gives even more
excessive context switches (about twice as many) for the PRO1000.

I got apparently-qualitatively-different results for some old benchmarks
because I used an old version of FreeBSD (r332488) for many of them, and
also had version problems within this version.  iflib in this version
forces tx_abdicate=1.  I noticed the extra context switches from this
long ago, and had an option which defaulted to using older iflib code
which seemed to work better.  But I misedited the non-default case of
this and had the double drainage check bug that was added in -current
in r366560 and fixed in -current in r341824.  This gave excessive extra
context switches, so the commit that added abdication (r323954) seemed
to be even slower than it was.

The fastest case by a significant amount (saturation on I218V using
1.6 times less CPU) is with netblast bound to the same CPU as tgq,
PREEMPTION* not configured, my scheduler modification that reduces
preemption even further (this modification selected using a sysctl),
and tx_abdicate=1.  Then the scheduler modification delays most switches
to tgq, and tx_abdicate=1 apparently allows such context switches when
they are useful (I think netblast fills a queue and then tx_abdicate=1
gives a context switch immediately, but tx_abdicate=0 doesn't give a
context switch soon enough).  But without the scheduler modification,
this is the slowest case (tx_abdicate=1 forces context switches to tgq
after every packet, and since netblast is bound to the same CPU, it
can't run).  In both cases, only 1 CPU is used, but the context switches
reduce throughput by about a factor of 2.
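For reference, a setup along these lines can be reproduced with
commands like the following (a command fragment only; the device name,
CPU id, target address and netblast arguments are all examples, and
the tx_abdicate sysctl sits under the per-device iflib node):

```shell
# enable abdication on the em0 tx ring (path is an example)
sysctl dev.em.0.iflib.tx_abdicate=1

# bind netblast to the same CPU as the tx taskqueue thread
# (netblast args: target-ip port payload-size duration)
cpuset -l 2 netblast 192.168.1.2 5001 18 10
```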

It is less clear why throughput counting dropped packets is lower for
netblast not bound and tx_abdicate=0.  Then tgq apparently doesn't run
promptly enough to saturate the net, but netblast has its own CPU so
it doesn't stop when tgq runs so it should be able to produce even more
packets (many more dropped ones) than in the fastest case.  This might be
caused by lock contention.

> When netblast is bound to the tgq's CPU, the tgq actually runs on another
> CPU.  Apparently, the binding is weak or this is a bugfeature in my
> scheduler.

It is a feature that I have forgotten about.  It was originally a bug, but
I happened to notice that it reduced the context switches for iflib long
ago, so made it a feature.  I didn't notice then that it also improved
throughput significantly.  I made it the default for !PREEMPTION only, then
forgot that it was stronger than !PREEMPTION.

The feature is to not reschedule on all CPUs when a thread becomes runnable.
Only reschedule on the current CPU.  Also, run an idle CPU if there is one.
This is not suitable for general use since it results in low priority threads
staying running until the end of their quantum or voluntary context switch
instead of running a higher priority thread.

Plain !PREEMPTION does much the same thing for threads at a user priority,
but preempts from user priority to kernel priority.

> ...
> Another test with amd64 and I218V instead of PRO1000:
>
> netblast bound, !abdicate:   1243kpps sent,   0kpps dropped  (16k/sec csw)
> netblast unbound, !abdicate: 1236kpps sent,   0kpps dropped  (16k/sec csw)
> netblast bound, abdicate:    1485kpps sent, 243kpps dropped  (16k/sec csw)
> netblast unbound, abdicate:  1407kpps sent, 1.7kpps dropped (850k/sec csw)

All working correctly, except the throughput is a bit low with !abdicate
and abdicate takes too much CPU.  This must be with my anti-preemption for
the low csw in the 3rd case.

> There is an i386 dependency after all!  !abdicate works on amd64 but not
> on i386 to prevent the excessive context switches.  Unfortunately, it also
> reduces kpps by almost 20% and leaves no spare CPU for dropping packets.

The dependency was actually on the NIC.

> Why would tx_abdicate=0 give extra context switches for i386 but not
> for amd64?  More interestingly, what does it do wrong to lose 20% in
> kpps sent and more in kpps dropped?

Apparently, some dependency on the NIC.  kpps is lost because less CPU is
used and 1 4GHz CPU can't keep up.  (I use CC and CFLAGS optimized for
debugging.  This costs about 10% in sys time.)

> Another test with PREEMPTION*:
>
> netblast bound, !abdicate:   same as above
> netblast unbound, !abdicate: same as above
> netblast bound, abdicate:     578kpps sent, 0kpps dropped (1160k/sec csw)
> netblast unbound, abdicate:  1106kpps sent, 0kpps dropped  (850k/sec csw)
>
> That is, abdicate with PREEMPTION to make it work as fully intended
> destroys performance for the netblast bound case where it fixes most
> performance problems without PREEMPTION; for the netblast unbound case it
> only reduces performance by 30%.  It uses the same amount of CPU as
> !PREEMPTION.

This is because PREEMPTION makes the netblast unbound case run into tgq
always.  Unquoted benchmarks show that the netblast unbound case is slower
than usual because mis-scheduling makes netblast run into tgq sometimes.

The regression in average ping latency is actually from 73 usec to 84
usec.  Anti-preemption might give latencies of 100 msec, but actually
makes no difference for ping latency.  However, I now remember that
the maximum ping latency is often 2 msec.  That is too high.  This is
with itr=0, which is the main part of turning off interrupt moderation.
Even the default itr of 125 usec only increases worst-case latencies
by 125 usec.

Bruce
