svn commit: r327559 - in head: . sys/net

Fri Jan 5 23:30:06 UTC 2018

> On Jan 5, 2018, at 11:20 AM, Eugene Grosbein <eugen at grosbein.net> wrote:
> 
> CC'ng scottl@ as author of the change in question.
> 
> 06.01.2018 0:39, Matt Joras wrote:
> 
>>>> For what it's worth, this was the conclusion I came to, and at Isilon
>>>> we've made the same change being discussed here. For the case of
>>>> drivers that end up using a queue index for the flowid, you end up
>>>> with pathological behavior on the lagg; the flowid ends up getting
>>>> right shifted by (default) 16. So in the case of e.g. two bxe(4)
>>>> interfaces with 4 queues, you always end up choosing the interface in
>>>> the lagg at index 0.
> 
> Then why does if_lagg shifts 16 bits by default? Is seems senseless.
> This was introduced with r260070 by scottl:
> 

At the time, we were using cxgbe interfaces which inserted a reasonable RSS
hash into the flowid field.  The shift was done to expose different bits to the
index/modulo scheme used by the LACP module vs the cxgbe module.  In
hindsight, I should not have set a default value of 16, I should have left it at
zero so that default behavior would be preserved.

>> Multi-queue NIC drivers and multi-port lagg tend to use the same lower
>> bits of the flowid as each other, resulting in a poor distribution of
>> packets among queues in certain cases.  Work around this by adding a
>> set of sysctls for controlling a bit-shift on the flowid when doing
>> multi-port aggrigation in lagg and lacp.  By default, lagg/lacp will
>> now use bits 16 and higher instead of 0 and higher.
>> 
>> Reviewed by:    max
>> Obtained from:  Netflix
>> MFC after:      3 days
> 
> This commit message does not point to an example of NIC driver that would set
> non-zero bits 16 and higher for flowid so that shift result would be non-zero.
> Do we really have such a driver?
> 

Yes.

> Anyway, this lagg's default seems to be very driver-centric.
> 
> For example, Intel driver family also do not use such high bits for flowid
> just like mentioned bxe(4).
> 

scottl at moe:~/svn/head/sys/dev % grep -r iri_flowid *
bnxt/bnxt_txrx.c:		ri->iri_flowid = le32toh(rcp->rss_hash);
bnxt/bnxt_txrx.c:		ri->iri_flowid = le32toh(tpas->low.rss_hash);
e1000/em_txrx.c:	ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);
e1000/igb_txrx.c:	ri->iri_flowid =
ixgbe/ix_txrx.c:	ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);

The number of drivers that set m_pkhhdr.flowid directly to an RSS hash looks
to be:

cxgb
cxgbe
mlx4
mlx5
qlnx
qlxgbe
qlxge
vmxnet3

Maybe the hardware doesn’t do a great job with generating a useful RSS hash,
but that’s tangential to this conversation.

> We should change flowid_shift default to 0 for if_lagg(4), shouldn't we?
> 

In the short term, yes.  What I see is that it’s too expensive to do a hash calculation
on every TX packet in LACP (for anything resembling line rate), and flowid is unreliable
when a connection is initiated without an RX packet triggering it.  I don’t understand
the consequences of the TX initiation problem well enough to offer a solution.  For the
problem of flowid being used inconsistently by drivers (i.e. not populating all 32 bits
or using a weak RSS), that’s really a driver problem.

What I’d recommend is that the LACP code check for M_HASHTYPE_NONE and
M_HASHTYPE_OPAQUE and calculate a new hash if either are set (effectively
ignoring the flowid).  It’s then up to the drivers to set the correct hash type that
corresponds with what they’re putting into the flowid field.  An opaque type would
mean that the value is useful to the driver but should not be considered useful
anywhere else.  You’ll get more correct and less surprising behavior from that, at
the penalty of every TX packet requiring a hash calculation for hardware/drivers
that are crummy.

Scott