Re: Increasing TCP TSO size support

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Sat, 03 Feb 2024 02:05:08 UTC
On Fri, Feb 2, 2024 at 4:48 PM Drew Gallatin <gallatin@freebsd.org> wrote:
>
>
>
> On Fri, Feb 2, 2024, at 6:13 PM, Rick Macklem wrote:
>
>  A factor here is the if_hw_tsomaxsegcount limit. For example, a 1Mbyte NFS write request
> or read reply will result in a 514-element mbuf chain. Each element (mostly 2K mbuf clusters)
> is a non-contiguous data segment. (I suspect most NICs do not handle this many segments well,
> if at all.)
>
>
> Excellent point
>
>
> The NFS code does know how to use M_EXTPG mbufs (for NFS over TLS, via kTLS), but I do not
> know what it would take to make these work for non-kTLS TSO.
>
>
>
> Sendfile already uses M_EXTPG mbufs... When I was initially doing M_EXTPG stuff for kTLS, I added support for using M_EXTPG mbufs in sendfile regardless of whether or not kTLS was in use.  That reduced CPU use marginally on 64-bit platforms (due to reducing socket buffer lengths, and hence reducing pointer chasing), and quite a bit more on 32-bit platforms (due to also not needing to map memory into the kernel map, and to reducing pointer chasing even more, since more pages fit into an M_EXTPG mbuf when a paddr_t is 32 bits).
>
>
> I do not know how the TSO loop in tcp_output handles M_EXTPG mbufs.
> Does it assume each M_EXTPG mbuf is one contiguous data segment?
>
>
> No, it's fully aware of how to handle M_EXTPG mbufs.  Look at tcp_m_copy().  We added code in the segment counting part of that function to count the hdr/trailer parts of an M_EXTPG mbuf, and to deal with the start/end page being misaligned.
>
> I do see that ip_output() will call mb_unmapped_to_ext() when the NIC does not have IFCAP_MEXTPG set.
> (If IFCAP_MEXTPG is set, do the pages need to be contiguous so that it can become
> a single contiguous data segment for TSO or ???)
>
>
> No, it just means that a NIC driver has been verified not to call mtod() on an M_EXTPG mbuf and deref the resulting data pointer (which would make it go "boom").
>
> But the page size is only 4K on most platforms.  So while an M_EXTPG mbuf can hold 5 pages (..from memory, too lazy to do the math right now) and reduces socket buffer mbuf chain lengths by a factor of 10 or so (2k vs 20k per mbuf), the S/G list that a NIC will need to consume would likely decrease only by a factor of 2.  And even then only if the busdma code to map mbufs for DMA is not coalescing adjacent mbufs.  I know busdma does some coalescing, but I can't recall if it coalesces physically adjacent mbufs.

I'm guessing the factor of 2 comes from the fact that each page is a
contiguous segment?
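
For what it's worth, the arithmetic behind those two factors can be
sketched in userland C. This is only an illustration: the SKETCH_*
constants and sketch_units() are hypothetical names, not anything from
the FreeBSD headers, and the 5 pages per M_EXTPG mbuf is just the figure
quoted above. Header mbufs are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants, not taken from any FreeBSD header. */
#define SKETCH_CLUSTER_SIZE	2048u	/* 2K mbuf cluster */
#define SKETCH_PAGE_SIZE	4096u	/* 4K page */
#define SKETCH_PAGES_PER_MBUF	5u	/* pages per M_EXTPG mbuf, as quoted */

/* Number of fixed-size units needed to hold len bytes, rounding up. */
static size_t
sketch_units(size_t len, size_t unit)
{
	assert(unit > 0);
	return ((len + unit - 1) / unit);
}
```

For 1 Mbyte of data, sketch_units() gives 512 clusters, 256 pages and
52 M_EXTPG mbufs (at 20K each). So the mbuf chain gets about 10 times
shorter, but if every 4K page is still its own segment, the segment
count only halves compared to 2K clusters.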

The NFS code could easily use 5 contiguous pages, so maybe it would be
worthwhile to try to make some NIC drivers capable of handling contiguous
pages as one segment for TSO output? (That means tcp_output() would need
to know this case was possible. Maybe a new if_hw_tsoXX that covers the
max number of segments if pages are contiguous?)

However, given your previous post, it might not matter much, since the
larger TSO segment might not make much difference?
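
To make the "contiguous pages as one segment" idea concrete, here is a
rough userland sketch of counting segments when physically adjacent
pages are merged. sketch_count_segs() is a hypothetical helper, and this
is not what busdma or tcp_output() actually does; it assumes page-aligned
physical addresses and 4K pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed 4K page size */

/*
 * Count the segments needed for an array of page-aligned physical
 * addresses, merging each run of physically adjacent pages into one
 * segment.  Hypothetical helper for illustration only.
 */
static size_t
sketch_count_segs(const uint64_t *pa, size_t npages)
{
	size_t i, nsegs;

	if (npages == 0)
		return (0);
	assert(pa != NULL);
	nsegs = 1;
	for (i = 1; i < npages; i++) {
		/* A new segment starts wherever the next page is not adjacent. */
		if (pa[i] != pa[i - 1] + SKETCH_PAGE_SIZE)
			nsegs++;
	}
	return (nsegs);
}
```

With 5 physically contiguous pages this counts 1 segment instead of 5,
which is the case a new per-interface limit would need to describe.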

>
> If TSO and the code beneath it (NIC and maybe mb_unmapped_to_ext() being called) were to
> all work ok for M_EXTPG mbufs, it would be easy to enable that for NFS (non-TLS case).
>
>
>
> It does.  You should enable it for at least TCP.
Good work!!

I will try it someday relatively soon. Even if it only reduces the use
of mbuf clusters, that sounds like it would be worthwhile.

rick
>
> Drew