ixl 40G bad performance?

Tue Oct 20 14:51:12 UTC 2015

On Tue, 20 Oct 2015, Eggert, Lars wrote:

> Hi,
>
> On 2015-10-20, at 10:24, Ian Smith <smithi at nimnet.asn.au> wrote:
>> Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.
>
> Done.
>
> On 2015-10-19, at 17:55, Luigi Rizzo <rizzo at iet.unipi.it> wrote:
>> On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars <lars at netapp.com> wrote:
>>> The only other sysctls in ixl(4) that look relevant are:
>>>
>>>     hw.ixl.rx_itr
>>>             The RX interrupt rate value, set to 8K by default.
>>>
>>>     hw.ixl.tx_itr
>>>             The TX interrupt rate value, set to 4K by default.
>>>
>>
>> yes those. raise to 20-50k and see what you get in
>> terms of ping latency.
>
> While ixl(4) talks about 8K and 4K, the defaults actually seem to be:
>
> hw.ixl.tx_itr: 122
> hw.ixl.rx_itr: 62

ixl seems to have a different set of itr sysctl bugs than em.  In em,
122 for the itr means 125 initially, but it is documented (only by
sysctl -d, not by the man page) as having units usecs/4.  The units
are actually usecs*4 except initially, and these units take effect if
you write the initial value back -- writing back 122 changes the active
period from 125 to 488.  122 instead of 125 is the result of confusion
between powers of 2 and powers of 10.

The first obvious bug in ixl is that the above sysctls are read-only
global tunables (not documented as sysctls of course), but you can
write them using per-device sysctls (dev.ixl.[0-N].*itr?).  Writing
them for 1 device clobbers the globals and probably the settings for
all ixl devices.

sysctl -d doesn't say anything useful about ixl's itrs.  It misdocuments
the units for all of them as being rates.  Actually, the units for 2
of them are boolean and the units for the other 2 are periods.  ixl(4)
uses better wording for the booleans but even worse wording for the
periods ("rate value").  em uses better wording for its itr sysctl but
em(4) has no documentation for any sysctl or its itr tunable.  igb is
more like em than ixl here.

122 seems to be the result of mis-scaling 125, and 62 from correctly
scaling 62.5, but these numbers are also off by a factor of 2.  Either
there is a scaling bug or the undocumented units are usecs/2 where
em's documented units are usecs/4.  In em, the default itr rate is
8 kHz (power of 10), but in ixl it is unclear if 4K and 8K are actually
4000 and 8000, since they are scaled more in hardware (IXL_ITR_4K is
hard-coded as 122; the scale is linear but there aren't enough bits
to preserve linearity; it is unclear if the hard-coded values are
defined by the hardware or are the result of precomputing the values
(using hard-coded 0x7A (122) where em uses 1000000 / SCALE (1000000
being user-friendly microseconds and SCALE a hardware clock frequency)))).

I think 122 really does mean a period that approximates the period for
a frequency of 4 kHz.  The period for this frequency is 250 usecs,
and 122 is 250 with units of usec*2, with an approximate error of
3 units.  Or 122 is the period for the documented frequency of 4K
(binary power of 2 with undocumented units which I assume are Hz),
with the weird usec*2 units and a tiny error.  Similarly for 62 and
8K, except there is a rounding error of almost 1.

> Doubling those values *increases* flood ping latency to ~200 usec (from ~116 usec).

Since they are periods and not frequencies, doubling them should double
the latency.  Since their units are weird and undocumented, it is hard to
predict what the latency actually is.  But I predict that if the units are
usecs*2, then the unscaled values give average latencies from interrupt
moderation.  This gives 122 + 62 = 184 plus maybe another 20 for other
delays.  Since the observed average latency is less than half that, the
units seem to be usecs*1 and it is the documented frequencies that are off
by a power of 2.

> Halving them to 62/31 decreases flood ping latency to ~50 usec, but still doesn't increase iperf throughput (still 2.8 Gb/s). Going to 31/16 further drops latency to 24 usec, with no change in throughput.

For em and lem, I use itr = 0 or 1 when optimizing for latency.  This
reduces the latency to 50 for lem but only to 73 for em (where the
connection goes through a slow switch to not so slow bge).  24 seems
quite good, and the lowest I have seen for 1 Gbps is 26, but this
requires kludges like a direct connection and polling, and I would
hope for 40 times lower at 40 Gbps.

> (Looking at the "interrupt Moderation parameters" #defines in sys/dev/ixl/ixl.h it seems that ixl likes to have its irq rates specified with some weird divider scheme.)
>
> With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec. Unfortunately, throughput is then also down to about 2 Gb/s.

Lowering (improving) latency always lowers (unimproves) throughput by
increasing load.  itr = 8 kHz is reasonable for 1 Gbps (it gives higher
latency than I like), but scaling that to 40 Gbps gives itr = 320 kHz
and it is impossible to scale up the speed of a single CPU to reasonably
keep up with that.

Fix for em:

X diff -u2 if_em.c~ if_em.c
X --- if_em.c~	2015-09-28 06:29:35.000000000 +0000
X +++ if_em.c	2015-10-18 18:49:36.876699000 +0000
X @@ -609,8 +609,8 @@
X  	    em_tx_abs_int_delay_dflt);
X  	em_add_int_delay_sysctl(adapter, "itr",
X -	    "interrupt delay limit in usecs/4",
X +	    "interrupt delay limit in usecs",
X  	    &adapter->tx_itr,
X  	    E1000_REGISTER(hw, E1000_ITR),
X -	    DEFAULT_ITR);
X +	    1000000 / MAX_INTS_PER_SEC);
X 
X  	/* Sysctl for limiting the amount of work done in the taskqueue */

"delay limit" is fairly good wording.  Other parameters tend to give long
delays, but itr limits the longest delay due to interrupt moderation to
whatever the itr represents.

Bruce