svn commit: r327559 - in head: . sys/net

Sat Jan 6 00:39:31 UTC 2018

On 05/01/2018 23:30, Scott Long wrote:
>
>> On Jan 5, 2018, at 11:20 AM, Eugene Grosbein <eugen at grosbein.net> wrote:
>>
>> CC'ng scottl@ as author of the change in question.
>>
>> 06.01.2018 0:39, Matt Joras wrote:
>>
>>>>> For what it's worth, this was the conclusion I came to, and at Isilon
>>>>> we've made the same change being discussed here. For the case of
>>>>> drivers that end up using a queue index for the flowid, you end up
>>>>> with pathological behavior on the lagg; the flowid ends up getting
>>>>> right shifted by (default) 16. So in the case of e.g. two bxe(4)
>>>>> interfaces with 4 queues, you always end up choosing the interface in
>>>>> the lagg at index 0.
>> Then why does if_lagg shifts 16 bits by default? Is seems senseless.
>> This was introduced with r260070 by scottl:
>>
> At the time, we were using cxgbe interfaces which inserted a reasonable RSS
> hash into the flowid field.  The shift was done to expose different bits to the
> index/modulo scheme used by the LACP module vs the cxgbe module.  In
> hindsight, I should not have set a default value of 16, I should have left it at
> zero so that default behavior would be preserved.
>
>>> Multi-queue NIC drivers and multi-port lagg tend to use the same lower
>>> bits of the flowid as each other, resulting in a poor distribution of
>>> packets among queues in certain cases.  Work around this by adding a
>>> set of sysctls for controlling a bit-shift on the flowid when doing
>>> multi-port aggrigation in lagg and lacp.  By default, lagg/lacp will
>>> now use bits 16 and higher instead of 0 and higher.
>>>
>>> Reviewed by:    max
>>> Obtained from:  Netflix
>>> MFC after:      3 days
>> This commit message does not point to an example of NIC driver that would set
>> non-zero bits 16 and higher for flowid so that shift result would be non-zero.
>> Do we really have such a driver?
>>
> Yes.
>
>> Anyway, this lagg's default seems to be very driver-centric.
>>
>> For example, Intel driver family also do not use such high bits for flowid
>> just like mentioned bxe(4).
>>
> scottl at moe:~/svn/head/sys/dev % grep -r iri_flowid *
> bnxt/bnxt_txrx.c:		ri->iri_flowid = le32toh(rcp->rss_hash);
> bnxt/bnxt_txrx.c:		ri->iri_flowid = le32toh(tpas->low.rss_hash);
> e1000/em_txrx.c:	ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);
> e1000/igb_txrx.c:	ri->iri_flowid =
> ixgbe/ix_txrx.c:	ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);
>
> The number of drivers that set m_pkhhdr.flowid directly to an RSS hash looks
> to be:
>
> cxgb
> cxgbe
> mlx4
> mlx5
> qlnx
> qlxgbe
> qlxge
> vmxnet3
>
> Maybe the hardware doesn’t do a great job with generating a useful RSS hash,
> but that’s tangential to this conversation.
>
>> We should change flowid_shift default to 0 for if_lagg(4), shouldn't we?
>>
> In the short term, yes.  What I see is that it’s too expensive to do a hash calculation
> on every TX packet in LACP (for anything resembling line rate), and flowid is unreliable
> when a connection is initiated without an RX packet triggering it.  I don’t understand
> the consequences of the TX initiation problem well enough to offer a solution.  For the
> problem of flowid being used inconsistently by drivers (i.e. not populating all 32 bits
> or using a weak RSS), that’s really a driver problem.
>
> What I’d recommend is that the LACP code check for M_HASHTYPE_NONE and
> M_HASHTYPE_OPAQUE and calculate a new hash if either are set (effectively
> ignoring the flowid).  It’s then up to the drivers to set the correct hash type that
> corresponds with what they’re putting into the flowid field.  An opaque type would
> mean that the value is useful to the driver but should not be considered useful
> anywhere else.  You’ll get more correct and less surprising behavior from that, at
> the penalty of every TX packet requiring a hash calculation for hardware/drivers
> that are crummy.

Mixing the hash methods causes problems with out of order packets even 
just for the first packet, and using a hash which is not what's 
configured by lagghash is confusing at best so that could be fixed to 
say "flowid" if that's whats going to happen or perhaps update it to the 
hash type that flowid represents if that's possible.

LACP already checks for M_HASHTYPE_NONE if use_flowid is enabled and 
manually calculates a hash, which is what leads the the first packet 
port selection inconsistency.

It's not clear what all the implications of the first packet port 
inconsistency is, it will likely be dependent a large number of factors 
including network hardware, layout and dest host + config., but when 
we've seen this in the 3 and 4 packet of a stream it leads to the 
destination sending RST, dropping the stream on the floor for 2% of all 
streams on our test box with 2 x ixgbe interfaces.

Questions:

 1. Is it possible to determine the hash method used by the HW and use
    that for all first packets?
 2. Is it possible to significantly improve the performance the CPU hashing?
 3. Is it possible to tell from the mbuf that its part of a connection
    instigated from the current host?

     Regards
     Steve