svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

Bruce Evans brde at optusnet.com.au
Tue Dec 18 14:40:54 UTC 2018


On Mon, 17 Dec 2018, Andrew Gallatin wrote:

> On 12/17/18 2:08 PM, Bruce Evans wrote:
>> On Mon, 17 Dec 2018, Andrew Gallatin wrote:
>> 
>>> On 12/5/18 9:20 AM, Slava Shwartsman wrote:
>>>> Author: slavash
>>>> Date: Wed Dec  5 14:20:57 2018
>>>> New Revision: 341578
>>>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>>> 
>>>> Log:
>>>>    mlx5en: Remove the DRBR and associated logic in the transmit path.
>>>>    The hardware queues are deep enough currently and using the DRBR
>>>>    and associated callbacks only leads to more task switching in the
>>>>    TX path.  There is also a race setting the queue_state which can
>>>>    lead to hung TX rings.
>>> 
>>> The point of DRBR in the tx path is not simply to provide a software ring 
>>> for queuing excess packets.  Rather it provides a mechanism to
>>> avoid lock contention by shoving a packet into the software ring, where
>>> it will later be found & processed, rather than blocking the caller on
>>> a mtx lock.   I'm concerned you may have introduced a performance
>>> regression for use cases where you have N:1  or N:M lock contention where 
>>> many threads on different cores are contending for the same tx queue.  
>>> The state of the art for this is no longer DRBR, but mp_ring,
>>> as used by both cxgbe and iflib.
>> 
>> iflib uses queuing techniques to significantly pessimize em NICs with 1
>> hardware queue.  On fast machines, it attempts to do 1 context switch per

Bruce Evans didn't write that.  Some mail program converted 2-space sentence
breaks to \xc2\xa0.

> This can happen even w/o contention when "abdicate" is enabled in mp
> ring. I complained about this as well, and the default was changed in
> mp ring to not always "abdicate" (i.e., switch to the tq to handle the
> packet). Abdication substantially pessimizes uncontended Netflix-style
> web workloads, but it generally helps small-packet forwarding.
>
> It is interesting that you see the opposite.  I should try benchmarking
> with just a single ring.

Hmm, I didn't remember "abdicated" and never knew about the sysctl for it
(the sysctl is newer), but I noticed the slowdown from near the first
commit for it (r323954) and already used the following workaround for it:

XX Index: iflib.c
XX ===================================================================
XX --- iflib.c	(revision 332488)
XX +++ iflib.c	(working copy)
XX @@ -1,3 +1,5 @@
XX +int bde_oldnet = 1;
XX +
XX  /*-
XX   * Copyright (c) 2014-2018, Matthew Macy <mmacy at mattmacy.io>
XX   * All rights reserved.
XX @@ -3650,9 +3652,16 @@
XX  		IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id);
XX  		return;
XX  	}
XX +if (bde_oldnet) {
XX  	if (txq->ift_db_pending)
XX  		ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE);
XX +	else
XX +		ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX +} else {
XX +	if (txq->ift_db_pending)
XX +		ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE);
XX  	ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX +}
XX  	if (ctx->ifc_flags & IFC_LEGACY)
XX  		IFDI_INTR_ENABLE(ctx);
XX  	else {
XX @@ -3862,8 +3872,11 @@
XX  	DBG_COUNTER_INC(tx_seen);
XX  	err = ifmp_ring_enqueue(txq->ift_br, (void **)&m, 1, TX_BATCH_SIZE);
XX 
XX +if (!bde_oldnet)
XX  	GROUPTASK_ENQUEUE(&txq->ift_task);
XX  	if (err) {
XX +if (bde_oldnet)
XX +		GROUPTASK_ENQUEUE(&txq->ift_task);
XX  		/* support forthcoming later */
XX  #ifdef DRIVER_BACKPRESSURE
XX  		txq->ift_closed = TRUE;
XX @@ -3870,6 +3883,9 @@
XX  #endif
XX  		ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX  		m_freem(m);
XX +	} else if (TXQ_AVAIL(txq) < (txq->ift_size >> 1)) {
XX +if (bde_oldnet)
XX +		GROUPTASK_ENQUEUE(&txq->ift_task);
XX  	}
XX 
XX  	return (err);
XX Index: mp_ring.c
XX ===================================================================
XX --- mp_ring.c	(revision 332488)
XX +++ mp_ring.c	(working copy)
XX @@ -1,3 +1,11 @@
XX +#include "opt_pci.h"
XX +
XX +#ifdef DEV_PCI
XX +extern int bde_oldnet;
XX +#else
XX +#define	bde_oldnet	0
XX +#endif
XX +
XX  /*-
XX   * Copyright (c) 2014 Chelsio Communications, Inc.
XX   * All rights reserved.
XX @@ -454,12 +462,25 @@
XX  	do {
XX  		os.state = ns.state = r->state;
XX  		ns.pidx_tail = pidx_stop;
XX +if (bde_oldnet)
XX +		ns.flags = BUSY;
XX +else {
XX  		if (os.flags == IDLE)
XX  			ns.flags = ABDICATED;
XX +}
XX  	} while (atomic_cmpset_rel_64(&r->state, os.state, ns.state) == 0);
XX  	critical_exit();
XX  	counter_u64_add(r->enqueues, n);
XX 
XX +if (bde_oldnet) {
XX +	/*
XX +	 * Turn into a consumer if some other thread isn't active as a consumer
XX +	 * already.
XX +	 */
XX +	if (os.flags != BUSY)
XX +		drain_ring_lockless(r, ns, os.flags, budget);
XX +}
XX +
XX  	return (0);
XX  }
XX  #endif
XX @@ -470,10 +491,15 @@
XX  	union ring_state os, ns;
XX 
XX  	os.state = r->state;
XX +if (bde_oldnet) {
XX +	if (os.flags != STALLED || os.pidx_head != os.pidx_tail || r->can_drain(r) == 0)
XX +		return;
XX +} else {
XX  	if ((os.flags != STALLED && os.flags != ABDICATED) ||	// Only continue in STALLED and ABDICATED
XX  	    os.pidx_head != os.pidx_tail ||			// Require work to be available
XX  	    (os.flags != ABDICATED && r->can_drain(r) == 0))	// Can either drain, or everyone left
XX  		return;
XX +}
XX 
XX  	MPASS(os.cidx != os.pidx_tail);	/* implied by STALLED */
XX  	ns.state = os.state;

This essentially just adds back the previous code with a flag to check
both versions.  Hopefully the sysctl can do the same thing.

The log message for r323954 claims an improvement of 30% for forwarding
small packets, but I saw only lots more context switches, more CPU use
to do them, and a small unimprovement for send(2)ing packets.  My hack
reduces the context switches but doesn't change the rate much.

I don't understand forwarding but think it is usually entirely in the
kernel and doesn't do as much blasting as netblast (just bursts of
received packets with rate and burst size limiting by interrupt
moderation, and with no syscall slowness between the packets?).  This
is a very different load.

Ping (-fq) latency increased a while ago, but I forget when, and my hack
doesn't affect it much.  The increase was much larger for an I218V than for a
PRO1000 (fairly low end).  This might be due to timing problems increased
by non-synchronous transmitting.

> ...
> I think part of the problem with iflib is just its sheer size, but
> some of that is hard to avoid due to it being an intermediate shim
> layer & having to translate packets.

Hmm, the only thing I like about it is that it seems to be less sheer
than a vendor driver :-) (1.5MB for e1000 and only 175K for iflib.c).

> One thing I think could help is to do the conversion from mbufs
> to iflib's packet info data structure at entry into iflib, and not
> at xmit time.  This the stuff that parses
> the mbufs, does the virt to phys, and basically will take cache
> misses if it runs on a different CPU, but seems less likely to
> take cache misses if run just after ether_output() (which has
> likely already taken misses).   I've been trying to find the
> time to make this change for a while.  It would be interesting
> to see if it helps your workload too.

I forgot about cache misses.  Separate threads must make them worse,
especially for user threads and with spectres breaking caching.  I
think L2+ caches are shared across CPUs for x86, but some spectre
cases flush the TLB which is not quite as bad as flushing caches.

In my old work on bge, -current (6-8 at the time) seemed to be slower
than my version in ~5.2 with more than half of the difference due to
1 or 2 more cache misses in the send[to]() path.  I never located these
exactly, and got nowhere with prefetch instructions on old Athlons.
send() takes about 1600 nsec on an original Athlon64 ~2GHz, and though
the main memory latency was much lower than now (42.7 nsec according
to lmbench), adding a couple of cache misses to 1600 nsec means that
the CPU saturates before the NIC+PCI33.

CPU affinity doesn't have much effect on the results of my benchmarks.
I think threads should never be bound to CPUs like iflib does, unless
the CPUs can be dedicated to a single thread.  Schedulers usually do a
good job of picking the best CPU to run on (I fixed my version of 4BSD
to do this slightly better than ULE for HTT pairs).  Binding is home
made scheduling which cannot adapt to changing loads, and it prevents
the scheduler from picking the best CPU in some cases.

One case is if previous activity resulted in all CPUs being non-idle.
Then if CPU0 and CPU1 become idle but no others do, threads on the other
CPUs shouldn't be switched to CPU0 or CPU1 -- even when CPU0 and CPU1
are a HTT pair, so that one of the threads now on CPU[2-3] would run
faster after migrating to CPU0 -- because it just takes too long to
migrate.  Then if a bound thread needs to run on CPU6, it either has to
wait or the thread on CPU6 must be migrated to CPU0 or CPU1.  The thread
should be unbound so that it can run immediately on any available CPU
(CPU0 or CPU1 here).

Bruce

