NFS threads getting stuck in vmem_bt_alloc() at "btalloc"? (mbuf allocation)

Thu Feb 13 02:02:16 UTC 2014

I wrote:
> I've been doing some testing using pagesize clusters (4K) for NFS
> instead of mclbytes (2K) on a single core i386.
> Sometimes I get threads stuck sleeping on "btalloc", which seems
> to happen in vmem_bt_alloc().
>
> The comment in vmem_bt_alloc() basically says:
>   out of address space or lost a fill race
> Since this is persistent, I suspect it is the first case?
>
> So, does anyone know what is going on here or what I should look
> at to try and resolve this?
>
> Btw, when I am testing, I don't see the pagesize cluster allocation
> exceed 400, so it doesn't seem to be a leak or excessive allocation.
>
> Thanks in advance for any help, rick

I originally posted this to freebsd-hackers@, but since it seems to
be related to mbuf allocation, I thought it might be better here.

When I posted this, I knew nothing about uma or the current mbuf allocation
mechanisms. Now, I know a little bit and the story is getting interesting...

Currently, NFS does:
    MGET(..M_WAITOK);
    MCLGET(..M_NOWAIT);
when it wants an mbuf cluster. It was done this way long ago, because mbuf
clusters could become exhausted and this allowed NFS to limp along, using
long lists of regular mbufs for the data (NFS RPC messages).

Now, it seems that this does the following (MCLGET() is just m_clget(), which
is an inline function in sys/mbuf.h):
    MGET(..M_WAITOK)   - always returns an mbuf
    m_clget(..M_NOWAIT)
    - calls uma_zalloc_arg(zone_clust, .., M_NOWAIT)
    - if this fails, it then
        calls zone_drain(zone_pack);
        calls uma_zalloc_arg(zone_clust, .., M_NOWAIT) again
As such, it will zone_drain(zone_pack) when cluster allocations become difficult
(including when a uma zone allocation for a boundary tag can't succeed without
 waiting). I suspect this usually fixes the problem and the second attempt
succeeds. However, even if the second attempt fails, NFS still has an mbuf and
doesn't get stuck in "btalloc".
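
A minimal userspace sketch of the try, drain, retry sequence described above.
It is only a model: struct zone_model and the model_*() functions are
hypothetical stand-ins invented for illustration, not the real UMA or mbuf
code, and it assumes that draining zone_pack hands its cached clusters back
to zone_clust.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; these are not real UMA structures. */
struct zone_model {
    int clust_free;   /* clusters zone_clust can hand out without sleeping */
    int pack_cached;  /* mbuf+cluster pairs sitting in zone_pack's cache */
    int drains;       /* number of zone_drain(zone_pack) calls so far */
};

/* Stand-in for a cluster allocation with M_NOWAIT: it never sleeps. */
bool model_clust_alloc_nowait(struct zone_model *z)
{
    assert(z != NULL);
    if (z->clust_free == 0)
        return false;
    z->clust_free--;
    return true;
}

/* Stand-in for zone_drain(zone_pack): cached packets give back clusters. */
void model_drain_pack(struct zone_model *z)
{
    assert(z != NULL);
    z->clust_free += z->pack_cached;
    z->pack_cached = 0;
    z->drains++;
}

/* The sequence described for m_clget(..M_NOWAIT): try, drain, retry once. */
bool model_clget_nowait(struct zone_model *z)
{
    if (model_clust_alloc_nowait(z))
        return true;
    model_drain_pack(z);
    return model_clust_alloc_nowait(z);
}
```

In this model the caller never blocks: model_clget_nowait() either returns a
cluster or returns false after a single drain, which corresponds to NFS
carrying on with the plain mbuf it already holds.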

When I was doing recent testing to see how pagesize clusters would work, I
switched to m_getjcl(..M_WAITOK..), which can get stuck in "btalloc" if an
attempt to allocate a boundary tag fails, due to lack of kernel address space.
I test on i386, but it still isn't obvious to me how I exhausted kernel address space.
One thing I notice is that zone_pack is set to the same limit as the mbuf zone
at 168765. However, unlike the mbuf zone, I think that many of the entries in
zone_pack will have a cluster associated with them. I am thinking that the limit
for zone_pack is on the high side, since zone_clust is limited to 26368 on my
i386 and maybe this is how kernel address space gets exhausted?

In summary, to play it safe, I think that if NFS is going to use pagesize
clusters, it needs to:
- call m_getjcl(..M_NOWAIT..)   /* M_NOWAIT, so it can't sleep on "btalloc" */
- if this fails (returns NULL) then
  - call MGET(..M_WAITOK..)
  - call MCLGET(.. M_NOWAIT..)
That way, I don't think the NFS threads can get stuck sleeping on "btalloc" and
calls to zone_drain(zone_pack) will happen when allocation gets constrained.
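
The fallback order above can be sketched as a small decision function. This
is a userspace model under stated assumptions, not kernel code: the enum,
struct nfs_alloc_model and nfs_model_get_buf() are hypothetical names, and
the two flags simply say whether the corresponding M_NOWAIT allocation would
succeed at that moment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the caller ends up holding (hypothetical names). */
enum nfs_buf_kind {
    NFS_BUF_PAGESIZE_CLUSTER,  /* m_getjcl(..M_NOWAIT..) succeeded */
    NFS_BUF_2K_CLUSTER,        /* MGET(..M_WAITOK..) + MCLGET(..M_NOWAIT..) */
    NFS_BUF_MBUF_ONLY          /* MGET(..M_WAITOK..) only, no cluster */
};

/* Hypothetical snapshot of allocator state at the time of the call. */
struct nfs_alloc_model {
    bool jcl_nowait_ok;     /* would the pagesize cluster request succeed? */
    bool mclget_nowait_ok;  /* would the 2K cluster request succeed? */
};

/*
 * Follows the proposed order. No step asks for a cluster with M_WAITOK,
 * so no step in this model can sleep waiting for a boundary tag.
 */
enum nfs_buf_kind nfs_model_get_buf(const struct nfs_alloc_model *a)
{
    assert(a != NULL);
    if (a->jcl_nowait_ok)
        return NFS_BUF_PAGESIZE_CLUSTER;
    /* The M_WAITOK mbuf allocation is treated as always succeeding. */
    if (a->mclget_nowait_ok)
        return NFS_BUF_2K_CLUSTER;
    return NFS_BUF_MBUF_ONLY;
}
```

The point of the ordering is that the only M_WAITOK request is for a plain
mbuf, so the worst outcome is a smaller buffer rather than a stuck thread.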

This means that, in the worst case, the mbuf list for a read reply could be
- length (64K or whatever) / MLEN
mbufs long (rounded up), since allocation of clusters isn't guaranteed.
(Garrett, I think you have to make your iovec that big if you are going to
 use a fixed size allocation instead of the current code, which malloc()s
 enough for the list.)
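
The worst-case list length is just a rounded-up division. A minimal sketch,
where worst_case_mbuf_count() is a hypothetical helper and mlen is passed in
rather than assumed, since MLEN depends on the platform's mbuf layout:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of plain mbufs needed to hold reply_len bytes when no cluster
 * allocation succeeds and each mbuf carries at most mlen data bytes.
 */
size_t worst_case_mbuf_count(size_t reply_len, size_t mlen)
{
    assert(mlen > 0);
    /* Ceiling division written to avoid overflow in reply_len + mlen. */
    return reply_len / mlen + (reply_len % mlen != 0);
}
```

A fixed-size iovec would need at least this many entries for the largest
reply length in use.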

It seems to me that m_getcl()/m_getjcl() should also do a zone_drain(zone_pack)
when an M_NOWAIT allocation fails, but that is just a suggestion.

What do others think of the above? rick