Limits on jumbo mbuf cluster allocation

Andre Oppermann andre at freebsd.org
Fri Mar 8 07:54:23 UTC 2013


On 08.03.2013 08:10, Garrett Wollman wrote:
> I have a machine (actually six of them) with an Intel dual-10G NIC on
> the motherboard.  Two of them (so far) are connected to a network
> using jumbo frames, with an MTU a little under 9k, so the ixgbe driver
> allocates 32,000 9k clusters for its receive rings.  I have noticed,
> on the machine that is an active NFS server, that it can get into a
> state where allocating more 9k clusters fails (as reflected in the
> mbuf failure counters) at a utilization far lower than the configured
> limits -- in fact, quite close to the number allocated by the driver
> for its rx ring.  Eventually, network traffic grinds completely to a
> halt, and if one of the interfaces is administratively downed, it
> cannot be brought back up again.  There's generally plenty of physical
> memory free (at least two or three GB).

Are you running an amd64 kernel on HEAD or 9.x?

> There are no console messages generated to indicate what is going on,
> and overall UMA usage doesn't look extreme.  I'm guessing that this is
> a result of kernel memory fragmentation, although I'm a little bit
> unclear as to how this actually comes about.  I am assuming that this
> hardware has only limited scatter-gather capability and can't receive
> a single packet into multiple buffers of a smaller size, which would
> reduce the requirement for two-and-a-quarter consecutive pages of KVA
> for each packet.  In actual usage, most of our clients aren't on a
> jumbo network, so most of the time, all the packets will fit into a
> normal 2k cluster, and we've never observed this issue when the
> *server* is on a non-jumbo network.
>
> Does anyone have suggestions for dealing with this issue?  Will
> increasing the amount of KVA (to, say, twice physical memory) help
> things?  It seems to me like a bug that these large packets don't have
> their own submap to ensure that allocation is always possible when
> sufficient physical pages are available.

Jumbo pages come directly from the kernel_map, which on amd64 is 512GB,
so KVA shouldn't be a problem.  Your problem indeed appears to come from
physical memory fragmentation.  There is a buddy allocator at work at
the physical memory level, but I fear it runs into serious trouble when
it has to find a large number of runs of three physically contiguous
pages, one run per 9k cluster.  Also, since you're doing NFS serving,
almost all memory will be in use for file caching, which keeps physical
memory churning and fragmented.
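
To put rough numbers on the contiguity demand of just the RX rings you
describe (plain userland arithmetic; 4096 and 9216 are PAGE_SIZE and
MJUM9BYTES on amd64, 32,000 is the figure from your mail):

#include <stdio.h>

int
main(void)
{
        const unsigned page = 4096;             /* amd64 PAGE_SIZE */
        const unsigned mjum9 = 9 * 1024;        /* MJUM9BYTES */
        const unsigned nclusters = 32000;       /* the ixgbe RX rings */
        const unsigned pages_per = (mjum9 + page - 1) / page;

        printf("pages per 9k cluster: %u, physically contiguous\n",
            pages_per);
        printf("contiguous runs needed: %u (%u MB tied up in them)\n",
            nclusters, nclusters * pages_per * page / (1024 * 1024));
        return (0);
}

Each of those 32,000 allocations has to find three free pages that
happen to be physically adjacent.  Once the page cache has churned
memory for a while, the buddy allocator can no longer produce such runs
even though plenty of individual pages are free, which matches the
symptoms you see.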

Running a NIC with jumbo frames enabled brings some interesting trade-
offs.  Unfortunately most NICs can't mix multiple DMA buffer sizes on
the same receive queue and pick the best size for each incoming frame.
That means they have to use the largest jumbo cluster for all receive
traffic, even for a tiny 40 byte ACK.  The send side is not constrained
in that way and uses PAGE_SIZE clusters for socket buffers whenever it
can.
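
The waste factor is easy to quantify; again just arithmetic, with 2048,
4096 and 9216 being MCLBYTES, MJUMPAGESIZE and MJUM9BYTES on amd64:

#include <stdio.h>

int
main(void)
{
        const unsigned sizes[] = { 2048, 4096, 9216 };
        const unsigned ack = 40;        /* a bare TCP ACK */
        int i;

        for (i = 0; i < 3; i++)
                printf("%4u byte cluster: %.2f%% of it used by the ACK\n",
                    sizes[i], 100.0 * ack / sizes[i]);
        return (0);
}

A 9k receive buffer is more than 99.5% dead weight for the common
small-packet case.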

Many, but not all, NICs are able to split a received jumbo frame into
multiple smaller DMA segments, forming an mbuf chain.  The ixgbe
hardware is capable of doing this, and the driver has the support code
for it, but it doesn't actively make use of it.
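
For illustration, an RX-scatter refill roughly amounts to handing the
ring chains of PAGE_SIZE clusters instead of single 9k clusters.  A
sketch only, not the actual ixgbe code; jumbo_rx_chain() is a made-up
helper and error handling is trimmed:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

static struct mbuf *
jumbo_rx_chain(int frame_len)
{
        struct mbuf *top, *m;
        int left = frame_len;

        top = m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUMPAGESIZE);
        if (top == NULL)
                return (NULL);
        while ((left -= MJUMPAGESIZE) > 0) {
                m->m_next = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
                if (m->m_next == NULL) {
                        m_freem(top);
                        return (NULL);
                }
                m = m->m_next;
        }
        /*
         * Each segment's buffer address would be programmed into a
         * consecutive RX descriptor; the hardware scatters one frame
         * across them and rxeof fixes up m_len and m_pkthdr.len.
         */
        return (top);
}

None of the allocations above needs more than one page of physically
contiguous memory, so fragmentation stops being an issue.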

Another issue with many drivers is their inability to deal with mbuf
allocation failures for their receive DMA ring.  They try to fill it up
to the full ring size and balk on failure.  Rings have become very big
and are usually a power of two in size.  A driver could function with a
partially filled RX ring as well, perhaps with some performance impact
when it gets really low.  Since the ring is refilled on every rxeof, it
would fill back up by itself once resources become available again.
NICs with multiple receive queues/rings make this problem even more
acute.
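
In pseudo-driver terms the change is small.  Something along these
lines, with refill_rx_ring(), rx_ring_enqueue() and the ring fields all
made up rather than taken from a real driver:

struct rx_ring {                        /* hypothetical, trimmed */
        int nfree;                      /* empty descriptor slots */
        int mbuf_sz;                    /* MCLBYTES, MJUMPAGESIZE, ... */
};

static void rx_ring_enqueue(struct rx_ring *, struct mbuf *); /* stub */

/*
 * Failure-tolerant RX refill: stop at the first allocation failure
 * and keep going with a partial ring instead of insisting on a full
 * one; the deficit is retried on the next rxeof.
 */
static void
refill_rx_ring(struct rx_ring *rxr)
{
        struct mbuf *m;

        while (rxr->nfree > 0) {
                m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rxr->mbuf_sz);
                if (m == NULL)
                        break;          /* run with what we have */
                rx_ring_enqueue(rxr, m);
                rxr->nfree--;
        }
        /*
         * If nfree is still positive here, the next rxeof (or a
         * timer) simply calls this again once the allocator has
         * recovered.
         */
}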

A theoretical fix would be to dedicate an entire superpage of 1GB or so
exclusively to the jumbo frame UMA zone as backing memory.  That memory
would then be gone for all other uses, though, even when it is not
actually needed.  Allocating the superpage and choosing its size would
have to be done manually through loader tunables; I don't see a
reasonable way to autotune this because it requires advance knowledge
of the usage patterns.
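
For completeness, such a dedicated pool would roughly look like a
custom UMA backing allocator carving out of a region reserved at boot.
A sketch from memory and untested: the pool variables and their sizing
from a loader tunable are hypothetical; only the uma_zone_set_allocf()
hook is the mechanism the stock jumbo zones already use:

#include <sys/param.h>
#include <sys/systm.h>
#include <vm/uma.h>

static char *jumbo_pool;                /* reserved contiguous region */
static size_t jumbo_pool_size;          /* sized via a loader tunable */
static size_t jumbo_pool_off;

static void *
jumbo_pool_alloc(uma_zone_t zone, int bytes, u_int8_t *pflag, int wait)
{
        void *p;

        *pflag = UMA_SLAB_KERNEL;       /* backing is in the kernel map */
        if (jumbo_pool_off + bytes > jumbo_pool_size)
                return (NULL);          /* pool exhausted, not fragmented */
        p = jumbo_pool + jumbo_pool_off;
        jumbo_pool_off += bytes;        /* bump allocator, no free path here */
        return (p);
}

/* after zone creation: uma_zone_set_allocf(zone_jumbo9, jumbo_pool_alloc); */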

IMHO the right fix is to strongly discourage the use of jumbo clusters
larger than PAGE_SIZE whenever the hardware is capable of splitting a
frame across multiple clusters.  The allocation constraint then becomes
available memory only, not contiguous pages, and the waste factor for
small frames is much lower.  The performance impact is minimal to
non-existent.  In addition, drivers shouldn't break down when the RX
ring can't be filled to the max.

I recently got yelled at for suggesting the removal of jumbo clusters
larger than PAGE_SIZE.  However, your case proves that such jumbo
clusters are indeed their own can of worms and should only ever be used
for NICs that have to do jumbo frames *and* are incapable of RX scatter
DMA.

-- 
Andre


