9k jumbo clusters

Rick Macklem rmacklem at uoguelph.ca
Sun Jul 29 21:38:24 UTC 2018


Adrian Chadd wrote:
>John-Mark Gurney wrote:
[stuff snipped]
>>
>> Drivers need to be fixed to use 4k pages instead of clusters.  I really hope
>> no one is using a card that can't do 4k pages, or if they are, then they
>> should get a real card that can do scatter/gather on 4k pages for jumbo
>> frames..
>
>Yeah but it's 2018 and your server has like minimum a dozen million 4k
>pages.
>
>So if you're doing stuff like lots of network packet kerchunking why not
>have specialised allocator paths that can do things like "hey, always give
>me 64k physical contig pages for storage/mbufs because you know what?
>they're going to be allocated/freed together always."
>
>There was always a race between bus bandwidth, memory bandwidth and
>bus/memory latencies. I'm not currently on the disk/packet pushing side of
>things, but the last couple of times I was, it was at different points in that
>4d space and almost every single time there was a benefit from having a
>couple of specialised allocators so you didn't have to try and manage a few
>dozen million 4k pages based on your changing workload.
>
>I enjoy the 4k page size management stuff for my 128MB routers. Your 128G
>server has a lot of 4k pages. It's a bit silly.
Here's my NFS guy perspective.
I do think 9K mbuf clusters should go away. I'll note that I once coded NFS to use
4K mbuf clusters for the big RPCs (write requests and read replies), and on a small
machine I actually got the mbuf cluster pool fragmented to the point where it stopped
working. So it is possible (although not likely) to fragment even a 2K/4K mix.

For me, send and receive are two very different cases:
- For sending a large NFS RPC (let's say a reply to a 64K read), the NFS code will
  generate a list of 33 2K mbuf clusters. If the net interface doesn't do TSO, this
  is probably fine, since tcp_output() will end up breaking this up into a bunch of
  TCP segments using the list of mbuf clusters, with TCP/IP headers added for
  each segment, etc...
  - If the net interface does TSO, this long list goes down to the net driver and uses
    34-35 ring entries to send it (typically at least one extra segment is added for
    the MAC header). If the driver isn't buggy and the net chip supports lots of
    transmit ring entries, this works ok, but...
- If there were a 64K supercluster, the NFS code could easily use that for the 64K
  of data and the TSO-enabled net interface would use 2 transmit ring entries
  (one for the MAC/TCP/NFS header and one for the 64K of data). If the net interface
  can't handle a TSO segment over 65535 bytes, it will end up getting 2 TSO segments
  from tcp_output(), but that is still a lot fewer than 35.
I don't know enough about net hardware to know when/if this will help performance,
but it seems that it might, at least for some chipsets.
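To make the arithmetic above concrete, here is a rough sketch (not the actual NFS
code; fill_reply_2k() is a made-up name and error handling is omitted) of copying a
64K read reply into a chain of 2K (MCLBYTES) clusters, which is what produces the
roughly 33-entry mbuf chain:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/*
 * Sketch only, not the actual NFS code: copy a read reply into a chain
 * of 2K (MCLBYTES) clusters, the shape the current send path produces.
 * Error handling and RPC header construction are omitted.
 */
static struct mbuf *
fill_reply_2k(const char *data, size_t len)
{
	struct mbuf *top, *m;
	size_t off, chunk;

	/* First mbuf is reserved for the RPC header (left empty here). */
	top = m_getcl(M_WAITOK, MT_DATA, M_PKTHDR);
	top->m_len = 0;
	m = top;

	/* One 2K cluster per chunk of data. */
	for (off = 0; off < len; off += chunk) {
		chunk = MIN(len - off, (size_t)MCLBYTES);
		m->m_next = m_getcl(M_WAITOK, MT_DATA, 0);
		m = m->m_next;
		memcpy(mtod(m, char *), data + off, chunk);
		m->m_len = (int)chunk;
	}
	top->m_pkthdr.len = (int)len;

	/* For len = 64K: 64K / 2K = 32 data clusters + 1 header mbuf = 33. */
	return (top);
}

With a 64K supercluster, the loop collapses to a single allocation and one copy,
which is where the 2 transmit ring entries come from.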

For receive, it seems that a 64K mbuf cluster is overkill for jumbo packets, but as
others have noted, they won't be allocated for long unless packets arrive out of
order, at least for NFS. (Other applications might not read the socket for a while
to get the data, so the clusters might sit in the socket receive queue for a while.)

I chose 64K, since that is what most net interfaces can handle for TSO these days.
(If that will soon be larger, this should be even larger too, but all of them should
be the same size to avoid fragmentation.) For the NFS send case, it wouldn't even
need to be a very large pool, since the clusters get freed as soon as the net
interface transmits the TSO segment.

For NFS, it could easily call mget_supercl() and then fall back on the current code
using 2K mbuf clusters if mget_supercl() failed, so a small pool would be fine for
the NFS send side.
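A minimal sketch of that fallback, assuming a hypothetical mget_supercl() that takes
m_getcl()-style arguments (no such allocator exists today; the argument list is just
a guess) and reusing the made-up fill_reply_2k() helper from the sketch above. The
MAC/TCP/NFS header mbuf would still be prepended separately, as today:

/*
 * Sketch of the fallback described above.  mget_supercl() is the
 * hypothetical 64K supercluster allocator; fill_reply_2k() is the
 * made-up 2K-cluster helper from the earlier sketch.
 */
static struct mbuf *
nfs_fill_reply_data(const char *data, size_t len)
{
	struct mbuf *m;

	m = mget_supercl(M_NOWAIT, MT_DATA, 0);	/* hypothetical */
	if (m != NULL) {
		/* All the data lands in a single cluster/ring entry. */
		memcpy(mtod(m, char *), data, len);
		m->m_len = (int)len;
		return (m);
	}

	/* Pool empty (or not present): use the existing 2K-cluster chain. */
	return (fill_reply_2k(data, len));
}

With M_NOWAIT, a miss on a small pool is cheap and the existing 2K-cluster path
simply takes over, so the pool never has to be big.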

I'd like to see a pool for 64K or larger mbuf clusters for the send side.
For the receive side, I'll let others figure out the best solution (4K or larger
for jumbo clusters). I do think anything larger than 4K needs a separate allocation
pool to avoid fragmentation.
(I don't know, but I'd guess iSCSI could use them as well?)

rick


