svn commit: r335967 - head/sys/dev/mxge

Rick Macklem rmacklem at uoguelph.ca
Sat Jul 7 20:28:40 UTC 2018


Andrew Gallatin wrote:
>Given that we do TSO like Linux, and not like MS (meaning
>we express the size of the pre-segmented packet using a
>16-bit value in the IPv4/IPv6 header), supporting more
>than 64K is not possible in FreeBSD, so I'm basically
>saying "nerf this constraint".
Well, my understanding was that the total length of the TSO
segment is in the first header mbuf of the chain handed to
the net driver.
I thought the 16-bit IP header length field was normally filled in with the
full length because certain drivers/hardware expected that.
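
(To make sure I'm describing that correctly: roughly where the information
sits by the time the packet reaches a driver, as a sketch from memory and
not any particular driver's code:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <net/ethernet.h>
#include <netinet/in.h>
#include <netinet/ip.h>

static void
tso_params_example(struct mbuf *m)
{
        struct ip *ip;
        int total;
        u_short mss;

        if ((m->m_pkthdr.csum_flags & CSUM_TSO) == 0)
                return;                 /* not a TSO packet */

        /* Total length of the pre-segmented packet, from the pkthdr. */
        total = m->m_pkthdr.len;

        /* Segment size (MSS) the hardware should chop the payload into. */
        mss = m->m_pkthdr.tso_segsz;

        /*
         * The 16-bit ip_len field; many NICs expect it to hold the full
         * pre-segmented length, which is what caps TSO near 64K.  (Assumes
         * the Ethernet + IP headers are contiguous in the first mbuf, which
         * a real driver has to check or arrange with m_pullup().)
         */
        ip = mtodo(m, sizeof(struct ether_header));
        printf("TSO: %d bytes total, mss %u, ip_len %u\n",
            total, mss, ntohs(ip->ip_len));
}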

>MS Windows does it better / differently; they express the
>size of the pre-segmented packet in packet metadata,
>leaving ip->ip_len = 0.  This is better, since
>then the pseudo hdr checksum in the template header can be
>re-used (with the len added) for every segment by the NIC.
>If you've ever seen a driver set ip->ip_len = 0, and re-calc
>the pseudo-hdr checksum, that's why.   This is also why
>MS LSOv2 can support TSO of packets larger than 64K, since they're
>not constrained by the 16-bit value in the IP{4,6} header.
>The value of TSO larger than 64K is questionable at best though.
>Without pacing, you'd just get more packets dropped when
>talking across the internet..
I think some drivers already do TSO segments greater than 64K.
(It has been a while, but I recall grepping for a case where if_hw_tsomax was
set to a large value and did find one. I think it was a "vm" fake-hardware
driver.)
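
(For reference, the knobs I was thinking of are the ones a driver can set at
attach time; the numbers below are made up for illustration, not from any
real hardware:)

#include <sys/param.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/ip.h>                 /* IP_MAXPACKET */

static void
example_set_tso_limits(struct ifnet *ifp)
{
        /* Largest pre-segmented packet the stack may hand the driver. */
        ifp->if_hw_tsomax = IP_MAXPACKET;

        /* Most scatter/gather entries one such packet may map to. */
        ifp->if_hw_tsomaxsegcount = 32;

        /* Largest single contiguous chunk per scatter/gather entry. */
        ifp->if_hw_tsomaxsegsize = 65536;
}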

I suspect the challenge is more about finding out what the hardware actually
expects the IP header length field to be set to. If MS uses a setting of 0, I'd guess
most newer hardware can handle that?
Beyond that, this is way out of my area of expertise;-)
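
(The trick Drew describes looks roughly like the following in the drivers
I've peeked at; again, only a sketch, since this isn't code I maintain:)

#include <sys/param.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <machine/in_cksum.h>

static void
tso_fixup_example(struct ip *ip, struct tcphdr *th)
{
        /* Zero the length so the NIC can fill in each segment's length. */
        ip->ip_len = 0;
        ip->ip_sum = 0;

        /*
         * Pseudo-header checksum computed without the length, so the same
         * value (plus the per-segment length) can be reused by the NIC for
         * every segment it generates.
         */
        th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr,
            htons(IPPROTO_TCP));
}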

>> if_hw_tsomaxsegsize is the maximum size of contiguous memory
>> that a "chunk" of the TSO segment can be stored in for handling by
>> the driver's transmit side. Since higher

>And this is what I object to.  TCP should not care about
>this.  Drivers should use busdma, or otherwise be capable of
>chopping large contig regions down to chunks that they can
>handle.   If a driver can really only handle 2K, then it should
>be having busdma give it an s/g list that is 2x as long, not having
>TCP call m_dupcl() 2x as often on page-sized data generated by
>sendfile (or more on non-x86 with larger pages).
>
>> level code such as NFS (and iSCSI, I think?) uses MCLBYTE clusters,
>> anything 2K or higher normally works the same.  Not sure about
>> sosend(), but I think it also copies the data into MCLBYTE clusters?
>> This would change if someday jumbo mbuf clusters become the norm.
>> (I tried changing the NFS code to use jumbo clusters, but it would
>>   result in fragmentation of the memory used for mbuf cluster allocation,
>>   so I never committed it.)
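
(Re the busdma comment above: my understanding of the transmit-side pattern
Drew means, from skimming a couple of drivers, is roughly the sketch below.
Driver internals aren't my area, so treat it as illustration only:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/bus.h>
#include <machine/bus.h>

#define EX_MAX_TX_SEGS  32              /* whatever the ring can take */

static int
example_load_tx_mbuf(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **mp)
{
        bus_dma_segment_t segs[EX_MAX_TX_SEGS];
        struct mbuf *m;
        int error, nsegs;

        /*
         * Let busdma carve the chain up into however many s/g entries it
         * needs (up to the tag's nsegments), regardless of how the data
         * happens to be laid out in memory.
         */
        error = bus_dmamap_load_mbuf_sg(tag, map, *mp, segs, &nsegs,
            BUS_DMA_NOWAIT);
        if (error == EFBIG) {
                /*
                 * Too many segments for the ring: squeeze the chain down
                 * and retry, rather than having TCP hand us smaller pieces.
                 */
                m = m_collapse(*mp, M_NOWAIT, EX_MAX_TX_SEGS);
                if (m == NULL)
                        return (ENOBUFS);
                *mp = m;
                error = bus_dmamap_load_mbuf_sg(tag, map, *mp, segs,
                    &nsegs, BUS_DMA_NOWAIT);
        }
        /* On success, segs[0..nsegs-1] describe the whole packet. */
        return (error);
}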
>
>At least for sendfile(), vm pages are wrapped up and attached to
>mbufs, so you have 4K (and potentially much more on non-x86).
>Doesn't NFS do something similar when sending data, or do you copy
>into clusters?
Most NFS RPC messages are small and fit into a regular mbuf. I'd have to look
at the code to see when/if it uses an mbuf cluster for those. (It has changed
a few times over the years.)
For Read replies, it uses a chain of mbuf clusters. I suspect that it could
do what sendfile does for UFS. Part of the problem is that NFS clients can do
byte-aligned reads of any size, so going through the buffer cache is useful
sometimes. For write requests, odd-sized writes that are byte-aligned can often
happen when a loader does its thing.
For ZFS, I have no idea. I'm not a ZFS guy.
For write requests, the server gets whatever the TCP layer passes up,
which is normally a chain of mbufs.
(For the client, substitute Read/Write, since the writes are copied out of the
 buffer cache and the Read replies come up from TCP.)
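
(Very roughly, the server's Read path does something like the sketch below
today; it's a simplification, not the actual code, so the details may be off:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/uio.h>
#include <sys/vnode.h>

/*
 * Simplified sketch: read "len" bytes at "off" into a chain of regular
 * (MCLBYTES) clusters by pointing a uio's iovecs at the cluster data areas.
 * Error handling and the real accounting are omitted; len is assumed to be
 * greater than 0 and no more than 64K.
 */
static int
example_read_into_clusters(struct vnode *vp, off_t off, int len,
    struct ucred *cred, struct thread *td, struct mbuf **mpp)
{
        struct mbuf *m, *m2;
        struct uio io;
        struct iovec iv[howmany(65536, MCLBYTES)];
        int i, left;

        *mpp = m = m_getcl(M_WAITOK, MT_DATA, 0);
        for (i = 0, left = len; left > 0; i++, left -= MCLBYTES) {
                m->m_len = min(left, MCLBYTES);
                iv[i].iov_base = mtod(m, caddr_t);
                iv[i].iov_len = m->m_len;
                if (left > MCLBYTES) {
                        m2 = m_getcl(M_WAITOK, MT_DATA, 0);
                        m->m_next = m2;
                        m = m2;
                }
        }
        io.uio_iov = iv;
        io.uio_iovcnt = i;
        io.uio_offset = off;
        io.uio_resid = len;
        io.uio_segflg = UIO_SYSSPACE;
        io.uio_rw = UIO_READ;
        io.uio_td = td;
        return (VOP_READ(vp, &io, IO_NODELOCKED, cred));
}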

>I have changes which I have not upstreamed yet which enhance mbufs to
>carry TLS metadata & vector of physical addresses (which I call
>unmapped mbufs) for sendfile and kernel TLS.  As part of that,
>sosend (for kTLS) can allocate many pages and attach them to one mbuf.
>The idea (for kTLS) is that you can keep an entire TLS record (with
>framing information) in a single unmapped mbuf, which saves a
>huge amount of CPU which would be lost to cache misses doing
>pointer-chasing of really long mbuf chains (TLS hdrs and trailers
>are generally 13 and 16 bytes).  The goal was to regain CPU
>during Netflix's transition to https streaming.  However, it
>is unintentionally quite helpful on i386, since it reduces
>overhead from having to map/unmap sf_bufs. FWIW, these mbufs
>have been in production at Netflix for over a year, and carry
>a large fraction of the world's internet traffic :)
These could probably be useful for the NFS server doing read replies, since
it does a VOP_READ() with a "uio" that refers to buffers (which happen to be
mbuf cluster data areas right now).
For the other cases, I'd have to look at it more closely.
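
Just to check that I follow the idea, my mental model of one of these
unmapped mbufs is something like the struct below; the names and sizes are
completely made up by me and are not Drew's actual layout:

#include <sys/param.h>
#include <vm/vm.h>

/*
 * Invented illustration only: external storage for one mbuf that carries
 * a TLS record as a small header, a vector of physical pages and a small
 * trailer, instead of a long chain of mapped clusters.
 */
struct example_unmapped_ext {
        uint8_t         hdr[16];        /* TLS/framing header (~13 bytes) */
        uint8_t         trail[16];      /* TLS trailer (~16 bytes) */
        int             npgs;           /* number of attached pages */
        vm_paddr_t      pa[8];          /* physical addresses of the pages */
        int             first_off;      /* data offset into the first page */
        int             last_len;       /* valid bytes in the last page */
};

If that's roughly right, a single mbuf could describe a whole TLS record
(or an NFS read reply) without the long m_next chain.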

They do sound interesting, rick
[stuff snipped]

