Limits on jumbo mbuf cluster allocation

Jack Vogel jfvogel at gmail.com
Fri Mar 8 08:31:19 UTC 2013


On Thu, Mar 7, 2013 at 11:54 PM, Andre Oppermann <andre at freebsd.org> wrote:

> On 08.03.2013 08:10, Garrett Wollman wrote:
>
>> I have a machine (actually six of them) with an Intel dual-10G NIC on
>> the motherboard.  Two of them (so far) are connected to a network
>> using jumbo frames, with an MTU a little under 9k, so the ixgbe driver
>> allocates 32,000 9k clusters for its receive rings.  I have noticed,
>> on the machine that is an active NFS server, that it can get into a
>> state where allocating more 9k clusters fails (as reflected in the
>> mbuf failure counters) at a utilization far lower than the configured
>> limits -- in fact, quite close to the number allocated by the driver
>> for its rx ring.  Eventually, network traffic grinds completely to a
>> halt, and if one of the interfaces is administratively downed, it
>> cannot be brought back up again.  There's generally plenty of physical
>> memory free (at least two or three GB).
>>
>
> You have an amd64 kernel running HEAD or 9.x?
>
>
>  There are no console messages generated to indicate what is going on,
>> and overall UMA usage doesn't look extreme.  I'm guessing that this is
>> a result of kernel memory fragmentation, although I'm a little bit
>> unclear as to how this actually comes about.  I am assuming that this
>> hardware has only limited scatter-gather capability and can't receive
>> a single packet into multiple buffers of a smaller size, which would
>> reduce the requirement for two-and-a-quarter consecutive pages of KVA
>> for each packet.  In actual usage, most of our clients aren't on a
>> jumbo network, so most of the time, all the packets will fit into a
>> normal 2k cluster, and we've never observed this issue when the
>> *server* is on a non-jumbo network.
>>
>> Does anyone have suggestions for dealing with this issue?  Will
>> increasing the amount of KVA (to, say, twice physical memory) help
>> things?  It seems to me like a bug that these large packets don't have
>> their own submap to ensure that allocation is always possible when
>> sufficient physical pages are available.
>>
>
> Jumbo pages come directly from the kernel_map, which on amd64 is 512GB,
> so KVA shouldn't be a problem.  Your problem indeed appears to come from
> physical memory fragmentation.  There is a buddy memory
> allocator at work but I fear it runs into serious trouble when it has
> to allocate a large number of objects spanning more than 2 contiguous
> pages.  Also since you're doing NFS serving almost all memory will be
> in use for file caching.
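
For illustration, a quick userland calculation of how many physically
contiguous 4 KB pages the allocator has to find for each cluster size.
The sizes are hard-coded to match MCLBYTES, MJUMPAGESIZE (on amd64),
MJUM9BYTES and MJUM16BYTES rather than pulled from kernel headers.

    #include <stdio.h>

    /*
     * A 2k or 4k cluster fits in a single page; a 9k cluster needs three
     * physically contiguous pages, which is what becomes hard to find
     * once memory is fragmented.
     */
    int
    main(void)
    {
            const unsigned page = 4096;
            const unsigned sizes[] = { 2048, 4096, 9 * 1024, 16 * 1024 };
            unsigned i;

            for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                    printf("%5u-byte cluster -> %u contiguous page(s)\n",
                        sizes[i], (sizes[i] + page - 1) / page);
            return (0);
    }
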
>
> Running a NIC with jumbo frames enabled gives some interesting trade-
> offs.  Unfortunately most NICs can't have multiple DMA buffer sizes
> on the same receive queue and pick the best size for the incoming frame.
> That means they need to use the largest jumbo mbuf for all receive
> traffic, even a tiny 40-byte ACK.  The send side is not constrained in
> such a way and tries to use PAGE_SIZE clusters for socket buffers
> whenever it can.
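
To put a number on that, a quick userland calculation of how much of
each cluster size sits idle when a minimal 40-byte ACK arrives; the
cluster sizes are hard-coded to match the 2k, 4k and 9k zones.

    #include <stdio.h>

    /* Bytes left unused per receive buffer when a 40-byte frame lands. */
    int
    main(void)
    {
            const unsigned frame = 40;
            const unsigned sizes[] = { 2048, 4096, 9 * 1024 };
            unsigned i;

            for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                    printf("%5u-byte cluster: %4u bytes unused (%.1f%%)\n",
                        sizes[i], sizes[i] - frame,
                        100.0 * (sizes[i] - frame) / sizes[i]);
            return (0);
    }
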
>
> Many, but not all, NICs are able to split a received jumbo frame into
> multiple smaller DMA segments forming an mbuf chain.  The ixgbe hardware
> is capable of doing this, and the driver supports it, but it doesn't
> actively make use of it.
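
Roughly what receiving into a chain of page-sized clusters looks like
at the mbuf level.  This is a conceptual sketch, not ixgbe code: the
helper name is invented, and a real driver posts one buffer per RX
descriptor and links them at rxeof time when it sees the end-of-packet
bit.  m_getjcl() and MJUMPAGESIZE are the stock kernel interfaces.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    /*
     * Build a chain of PAGE_SIZE clusters covering one frame of up to
     * 'len' bytes instead of a single 9k cluster, so no allocation ever
     * needs more than one page of contiguous memory.
     */
    static struct mbuf *
    rx_alloc_chain(int len)
    {
            struct mbuf *top = NULL, *m, *prev = NULL;
            int left;

            for (left = len; left > 0; left -= MJUMPAGESIZE) {
                    m = m_getjcl(M_NOWAIT, MT_DATA,
                        top == NULL ? M_PKTHDR : 0, MJUMPAGESIZE);
                    if (m == NULL) {
                            m_freem(top);   /* m_freem(NULL) is a no-op */
                            return (NULL);
                    }
                    m->m_len = MIN(left, MJUMPAGESIZE);
                    if (top == NULL)
                            top = m;
                    else
                            prev->m_next = m;
                    prev = m;
            }
            if (top != NULL)
                    top->m_pkthdr.len = len;
            return (top);
    }
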
>
> Another issue with many drivers is their inability to deal with mbuf
> allocation failure for their receive DMA ring.  They try to fill it
> up to the maximal ring size and balk on failure.  Rings have become
> very big and usually are a power of two.  The driver could function
> with a partially filled RX ring too, maybe with some performance
> impact when it gets really low.  On every rxeof it tries to refill
> the ring, so when resources become available again it'd balance out.
> NICs with multiple receive queues/rings make this problem even more
> acute.
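
A sketch of such a tolerant refill loop.  The ring structure and
function name are invented for illustration and do not correspond to
the actual ixgbe code.

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /*
     * Fill as many empty RX slots as allocation allows, remember how many
     * are still missing, and try again on the next rxeof pass so the ring
     * recovers once clusters free up.
     */
    struct rx_ring {
            struct mbuf     *slots[4096];
            int              prod;          /* next slot to fill */
            int              nfree;         /* slots without a buffer */
    };

    static void
    rx_refill(struct rx_ring *rxr)
    {
            struct mbuf *m;

            while (rxr->nfree > 0) {
                    m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUMPAGESIZE);
                    if (m == NULL)
                            break;          /* partial fill; retry later */
                    m->m_len = m->m_pkthdr.len = MJUMPAGESIZE;
                    rxr->slots[rxr->prod] = m;
                    rxr->prod = (rxr->prod + 1) % 4096;
                    rxr->nfree--;
            }
            /* DMA mapping and updating the hardware tail pointer omitted. */
    }
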
>
> A theoretical fix would be to dedicate an entire super page of 1GB
> or so exclusively to the jumbo frame UMA zone as backing memory.  That
> memory is gone for all other uses though, even if not actually used.
> Allocating the superpage and determining its size would have to be
> done manually by setting loader variables.  I don't see a reasonable
> way to do this with autotuning because it requires advance knowledge
> of the usage patterns.
>
> IMHO the right fix is to strongly discourage use of jumbo clusters
> larger than PAGE_SIZE when the hardware is capable of splitting the
> frame into multiple clusters.  The allocation constraint then is only
> available memory and no longer contiguous pages.  Also the waste
> factor for small frames is much lower.  The performance impact is
> minimal to non-existent.  In addition, drivers shouldn't break down
> when the RX ring can't be filled to the max.
>
> I recently got yelled at for suggesting the removal of jumbo clusters
> larger than PAGE_SIZE.  However, your case proves that such jumbo frames
> are indeed their own can of worms and should really only be used for
> NICs that have to do jumbo *and* are incapable of RX scatter DMA.
>
>
I am not strongly opposed to trying the 4k mbuf pool for all larger sizes.
Garrett, maybe you could try that on your system and see if it helps; I
could envision making this a tunable at some point.
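
For what it's worth, a sketch of what the selection logic behind such a
tunable might look like.  The knob name, the helper and the exact MTU
math are hypothetical; only MCLBYTES, MJUMPAGESIZE and MJUM9BYTES are
the real cluster-size constants.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/ethernet.h>

    /*
     * Hypothetical knob, not actual ixgbe code: when set, the RX ring
     * never uses anything larger than a PAGE_SIZE cluster, even with a
     * 9k MTU, relying on the hardware to split frames across descriptors.
     * It could be exposed with TUNABLE_INT()/SYSCTL_INT().
     */
    static int ix_use_4k_clusters = 1;

    static int
    ix_rx_cluster_size(int mtu)
    {
            if (mtu + ETHER_HDR_LEN <= MCLBYTES)    /* standard MTU: 2k */
                    return (MCLBYTES);
            if (ix_use_4k_clusters)                 /* jumbo, capped at 4k */
                    return (MJUMPAGESIZE);
            return (mtu + ETHER_HDR_LEN <= MJUMPAGESIZE ?
                MJUMPAGESIZE : MJUM9BYTES);
    }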

Thanks for the input, Andre.

Jack

