svn commit: r335967 - head/sys/dev/mxge

Sat Jul 7 14:09:03 UTC 2018

On 07/05/18 17:14, Rick Macklem wrote:
> Andrew Gallatin wrote:
> On 7/4/18 9:20 PM, Rodney W. Grimes wrote:
> [stuff snipped]
>>>
>>> It is using a magic constant twice, where one has a
>>> derived value that is dependent on the value of the other.
>>> That is bad and error prone and does not document that
>>> one depends on the other.  Please fix this.  Or at least
>>> make 65536 a #define so that it only needs changed one
>>> place and clearly shows the interdependence of these
>>> values.
>>
>> To me, 65536 is one of the few cases where the magic number is
>> more meaningful than a name.  But fine, if you feel that
>> strongly about it, I'll change it for you.
> 
> Btw, in general, if_hw_tsomax and if_hw_tsomaxsegsize are not
> related or the same value. It just happens that they both appear
> to be related to 64K in this case. (I believe this is fairly common,
> since the original Microsoft "standard" used 64K as a limit, since
> it was stored in 16bits.)

Yes; exactly.

> if_hw_tsomax is the maximum size of the entire TSO segment,
> including MAC level headers (commonly 64K, due to Mircosoft...
> but could be larger if the hardware guys chose to do so).

Given that we do TSO like Linux, and not like MS (meaning
we express the size of the pre-segmented packet using the
a 16-bit value in the IPv4/IPv6 header), supporting more
than 64K is not possible in FreeBSD, so I'm basically
saying "nerf this constraint".

MS windows does it better / different; they express the
size of the pre-segmented packet in packet metadata,
leaving ip->ip_len = 0.  This is better, since
then the pseudo hdr checksum in the template header can be
re-used (with the len added) for every segment by the NIC.
If you've ever seen a driver set ip->ip_len = 0, and re-calc
the pseudo-hdr checksum, that's why.   This is also why
MS LSOv2 can support TSO of packets larger than 64K, since they're
not constrained by the 16-bit value in the IP{4,6} header.
The value of TSO larger than 64K is questionable at best though.
Without pacing, you'd just get more packets dropped when
talking across the internet..

> if_hw_tsomaxsegsize is the maximum size of contiguous memory
> that a "chunk" of the TSO segment can be stored in for handling by
> the driver's transmit side. Since higher

And this is what I object to.  TCP should not care about
this.  Drivers should use busdma, or otherwise be capable of
chopping large contig regions down to chunks that they can
handle.   If a driver can really only handle 2K, then it should
be having busdma give it an s/g list that is 2x as long, not having
TCP call m_dupcl() 2x as often on page-sized data generated by
sendfile (or more on non-x86 with larger pages).

> level code such as NFS (and iSCSI, I think?) uses MCLBYTE clusters,
> anything 2K or higher normally works the same.  Not sure about
> sosend(), but I think it also copies the data into MCLBYTE clusters?
> This would change if someday jumbo mbuf clusters become the norm.
> (I tried changing the NFS code to use jumbo clusters, but it would
>   result in fragmentation of the memory used for mbuf cluster allocation,
>   so I never committed it.)

At least for sendfile(), vm pages are wrapped up and attached to
mbufs, so you have 4K (and potentially much more on non-x86).
Doesn't NFS do something similar when sending data, or do you copy
into clusters?

I have changes which I have not upstreamed yet which enhance mbufs to
carry TLS metadata & vector of physical addresses (which I call
unmapped mbufs) for sendfile and kernel TLS.  As part of that,
sosend (for kTLS) can allocate many pages and attach them to one mbuf.
The idea (for kTLS) is that you can keep an entire TLS record (with
framing information) in a single unmapped mbuf, which saves a
huge amount of CPU which would be lost to cache misses doing
pointer-chasing of really long mbuf chains (TLS hdrs and trailers
are generally 13 and 16 bytes).  The goal was to regain CPU
during Netflix's transition to https streaming.  However, it
is unintentionally quite helpful on i386, since it reduces
overhead from having to map/unmap sf_bufs. FWIW, these mbufs
have been in production at Netflix for over a year, and carry
a large fraction of the worlds internet traffic :)

> rick
> ps: And I'll admit I don't find 65536 very magic;-)
> 

:)

Drew