svn commit: r344027 - in stable/12/sys: dev/vmware/vmxnet3 modules/vmware/vmxnet3 net

Thu Feb 14 04:32:36 UTC 2019

On Wed, 13 Feb 2019, Marius Strobl wrote:

> As for the iflib(4) status in head, I'm aware of two remaining
> user-visible regressions I ran myself into when trying to use
> em(4) in production.

I am aware of a few more:
- tx throughput loss for minimal packets of about 10% on my low end/1
   queue NICs (I218-V, older I2*, and 82541PI).  This hasn't changed
   much in the 2+ years since em(4) was converted to iflib, except
   some versions were another 10-20% slower and some of the slowness
   can be recovered using the tx_abdicate sysctl
- average ping latency loss of about 13% on I218-V.  This has only been
   there for 6-12 months.  Of course this is with tuning for latency
   by turning off interrupt moderation as much as possible
- errors on rx are recovered from badly in [l]em_isc_rxd_pkt_get() by
   incrementing the dropped packet count and returning EBADMSG.  This
   leaves the hardware queues in a bad state which is recovered from
   after a long time by resetting.  Many more packets are dropped, but
   the dropped packet count is only incremented by 1.  The pre-iflib
   driver handled this by dropping just 1 packet and continuing.  This
   is now hard to do, since iflib wants to build a list of packets and
   seems to have no way of handling bad packets in the list.  I use the
   quick fix of printing a message and putting the bad packet in the
   list.  I have only seen this problem on 82541PI.  I haven't checked
   that the EBADMSG return is still mishandled by resetting.
- the NIC is not stopped for media changes.  This causes the same
   lockups as not stopping it for resume, but is less often a problem
   since you usually don't change the media for an active NIC.

> 1) TX UDP performance is abysmal even when
> using multiple queues and, thus, MSI-X. In a quick test with
> netperf I see ~690 Mbits/s with 9216 bytes and 282 Mbits/s with
> 42080 bytes on a Xeon E3-1245V2 and 82574 with GigE connection
> (stable/11 e1000 drivers forward-ported to 12+ achieve 957 Mbit/s
> in both cases). 2) TX TCP performance is abysmal when using MSI
> or INTx (that's likely also PR 235031).
> I have an upcoming iflib(4) fix for 2) but don't have an idea
> what's causing 1) so far. I've identified two bugs in iflib(4)
> that likely have a minimal (probably more so with ixl(4), though)
> impact on UDP performance but don't explain the huge drop.

I don't see bad performance for large packets (except for the 82541PI --
it is PCI and can't get near saturating the network at any size).

Other problems: I mostly use i386, and its performance is now abysmal
due to its slow syscalls.  Its slowdowns also make comparison with old
benchmark results more difficult.  Typical numbers for netblast tests
for I218-V on i386 on Haswell i7-4790K 4.08GHz are:

1500  kpps (line rate) for   tuned FreeBSD-11    using 1.5 CPUs
1400+ kpps             for untuned FreeBSD-11    using 1   CPU
1400- kpps             for -current-before-iflib using 1   CPU
1300- kpps             for -current-after-iflib  using 1.5 CPUs

The tuning for FreeBSD-11 is just EM_MULTIQUEUE.  The NIC has only 1 queue,
but using another CPU to manage the queue seems to work right.  For iflib,
the corresponding tuning seems to be to set the tx_abdicate sysctl to 1.
This doesn't work so well.  It causes iflib to mostly waste CPU by trying
to do 2 context switches per packet (mostly from an idle thread to an
iflib thread).  The Haswell CPU can only do about 1 context switch per
microsecond, so the context switches are worse than useless for achieving
packet rates above 1000 kpps.  In old versions of iflib, tx_abdicate is
not a sysctl and is always enabled.  This is why iflib takes an extra
0.5 CPUs in the above benchmark.
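
As a rough back-of-the-envelope check of the context switch argument above
(a sketch only, not a measurement: the function name is made up here, and it
assumes that the per-packet switches are serialized on one CPU and that
nothing else costs any time):

```python
def pps_ceiling(switches_per_sec, switches_per_packet):
    """Upper bound on packets/sec if context switches were the only cost."""
    return switches_per_sec / switches_per_packet

# About 1 context switch per microsecond on the Haswell, 2 switches per packet.
ceiling = pps_ceiling(1_000_000, 2)
print(f"{ceiling / 1000:.0f} kpps")   # 500 kpps
```

Under these assumptions the switches alone cap the rate at about half of
1000 kpps, before any useful work is done.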

Then for -current after both iflib and 4+4 address space changes:

  533  kpps             worst ever observed in -current (config unknown)
  800  kpps             typical result before pae_mode changes

Then for -current now (after iflib, 4+4 and pae changes)

  500  kpps             pae_mode=1 (default) tx_abdicate=0 (default) 1   CPU
  780  kpps             pae_mode=0           tx_abdicate=0 (default) 1   CPU
  591  kpps             pae_mode=0           tx_abdicate=1           1.5 CPUs

On amd64, the speed of syscalls hasn't changed much, so it still gets
about 1200 kpps in untuned configurations, and tx_abdicate works better so
it can almost reach line rate using a bit more CPU than tuned FreeBSD-11.

The extra context switches can also be avoided by not using SMP or by
binding the netblast thread to the same CPU as the main iflib thread.  This
only helps when tx_abdicate=1:

  975  kpps             pae_mode=0           tx_abdicate=1 cpuset -l5 1   CPU

I.e., cpusetting improves the speed from 591 to 975 kpps!  I now seem to
remember that amd64 needed that too to get near line rate.  The context
switch counts for some cases are:

- tx_abdicate=1, no cpuset: 1100+ k/sec (1 to and 1 from iflib thread per pkt)
- tx_abdicate=0, no cpuset:    8 k/sec (this is from the corrected itr=125)
- tx_abdicate=1, cpuset:       6 k/sec

The iflib thread does one switch to and 1 switch from per packet, so the
packet rate is half of its switch rate.  But the switch rate of 1M shown
by systat -v is wrong.  It apparently doesn't include context switches for
the cpu-idle threads.  top -m io shows these.  Context switches to and
from the idle thread are cheaper than most, especially for i386 with 4+4
and pae, but they are still heavyweight so should be counted normally.
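
The relation between the two rates can be checked against the numbers above.
This is only a sketch: the helper name is invented, and it assumes exactly 2
switches per packet (1 to and 1 from the iflib thread) with nothing else
switching:

```python
def expected_switch_rate_k(kpps, switches_per_packet=2):
    """Context switches per second (in thousands) implied by a packet rate."""
    return kpps * switches_per_packet

# 591 kpps was the rate for pae_mode=0, tx_abdicate=1, no cpuset.
rate = expected_switch_rate_k(591)
print(rate)   # 1182
```

That is consistent with the 1100+ k/sec counted for tx_abdicate=1 without
cpuset.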

Binding of iflib threads to CPUs is another problem.  It gets in the
way of the scheduler choosing the best CPU dynamically, so is only
obviously right if you have CPUs to spare.  The 4BSD scheduler handles
bound CPUs especially badly.  This is fixed in my version, but I forgot
to enable the fix for these tests, and anyway, the fix and scheduling in
general only make much difference on moderately loaded systems.  (For
the light load of running only netblast and some daemons, there is CPU
to spare.  For heavy loads when there is no CPU to spare, the scheduler
can't do much.  My fixes with the Haswell 4x2 CPU topology reduce to
trying to use only 1 CPU out of each HTT pair.  So when iflib binds to
CPU 6, if CPU 6 is running another thread, this thread has to be kicked
off CPU 6 and should not be moved to CPU 7 while iflib is running on
CPU 6.  Even when there is another inactive HTT pair, moving it is slow.)

iflib has some ifdefs for SCHED_ULE only.  I doubt that static
scheduling like it does can work well.  It seems to do the opposite
of what is right -- preferring threads on the same core makes these
threads run slower when they run concurrently, by competing for
resources.  Competing CPUs in an HTT pair on Haswell each run at about
2/3 of full speed (each CPU runs about 1/3 slower, so the speed of 2 CPUs
is at best 4/3 times as much as 1 CPU).  Anyway, iflib obviously doesn't
understand scheduling, since its manual scheduling runs 975/591 times
slower than my manual scheduling, without even any HTT contention or
kicking netblast or another user thread off iflib's CPU.  The slowness
is just from kicking an idle thread off iflib's CPU.
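
The two ratios in this paragraph can be worked out as follows.  This is a
sketch of the arithmetic only; the function name is hypothetical and the
model (both HTT siblings slowed by the same fraction) is a simplification:

```python
from fractions import Fraction

def htt_pair_throughput(per_cpu_slowdown):
    """Combined speed of 2 HTT siblings, in units of 1 uncontended CPU."""
    return 2 * (1 - per_cpu_slowdown)

pair = htt_pair_throughput(Fraction(1, 3))
print(pair)                    # 4/3

# Packet rate with cpuset binding relative to iflib's own placement.
ratio = Fraction(975, 591)
print(f"{float(ratio):.2f}")   # 1.65
```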

If there are really CPUs to spare, then the iflib thread should not
yield to even the idle thread.  Then it would work a bit like
DEVICE_POLLING.  I don't like polling, and would want to do this with
something like halt waiting for an interrupt or better monitor waiting
for a network event.  cpu_idle() already does suitable things.
DEVICE_POLLING somehow reduced ping latency by a lot (from 60+ usec to
30 usec) on older systems and NICs, at the cost of a lot of power for
spinning in idle and not actually helping if the system is not idle.
I don't see how it can do this.  The interrupt latency with interrupt
moderation turned off should be only about 1 usec.

Summary: using unobvious tuning and small fixes, I can get iflib'ed
em to work almost as well as FreeBSD-11 em.

Bruce