NFS-exported ZFS instability

Rick Macklem rmacklem at uoguelph.ca
Mon Feb 4 00:48:44 UTC 2013


Andriy Gapon wrote:
> on 30/01/2013 00:44 Andriy Gapon said the following:
> > on 29/01/2013 23:44 Hiroki Sato said the following:
> >>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
> >>   http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
> >
> [snip]
> > See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and
> > tid 100639 (nfsd in kmem_back).
> >
> 
> I decided to write a few more words about this issue.
> 
> I think that the root cause of the problem is that ZFS ARC code
> performs memory
> allocations with M_WAITOK while holding some ARC lock(s).
> 
> If a thread runs into such an allocation when the system is very low
> on memory (even for a very short period of time), then the thread is
> going to block (sleep, to be more exact) in VM_WAIT until a certain
> amount of memory is freed; to be precise, until v_free_count +
> v_cache_count goes above v_free_min.
> And quoting from the report:
> db> show page
> cnt.v_free_count: 8842
> cnt.v_cache_count: 0
> cnt.v_inactive_count: 0
> cnt.v_active_count: 169
> cnt.v_wire_count: 6081952
> cnt.v_free_reserved: 7981
> cnt.v_free_min: 38435
> cnt.v_free_target: 161721
> cnt.v_cache_min: 161721
> cnt.v_inactive_target: 242581
> 
> In this case tid 100639 is the thread:
> Tracing command nfsd pid 961 tid 100639 td 0xfffffe0027038920
> sched_switch() at sched_switch+0x17a/frame 0xffffff86ca5c9c80
> mi_switch() at mi_switch+0x1f8/frame 0xffffff86ca5c9cd0
> sleepq_switch() at sleepq_switch+0x123/frame 0xffffff86ca5c9d00
> sleepq_wait() at sleepq_wait+0x4d/frame 0xffffff86ca5c9d30
> _sleep() at _sleep+0x3d4/frame 0xffffff86ca5c9dc0
> kmem_back() at kmem_back+0x1a3/frame 0xffffff86ca5c9e50
> kmem_malloc() at kmem_malloc+0x1f8/frame 0xffffff86ca5c9ea0
> uma_large_malloc() at uma_large_malloc+0x4a/frame 0xffffff86ca5c9ee0
> malloc() at malloc+0x14d/frame 0xffffff86ca5c9f20
> arc_get_data_buf() at arc_get_data_buf+0x1f4/frame 0xffffff86ca5c9f60
> arc_read_nolock() at arc_read_nolock+0x208/frame 0xffffff86ca5ca010
> arc_read() at arc_read+0x93/frame 0xffffff86ca5ca090
> dbuf_read() at dbuf_read+0x452/frame 0xffffff86ca5ca150
> dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x16a/frame 0xffffff86ca5ca1e0
> dmu_buf_hold_array() at dmu_buf_hold_array+0x67/frame 0xffffff86ca5ca240
> dmu_read_uio() at dmu_read_uio+0x3f/frame 0xffffff86ca5ca2a0
> zfs_freebsd_read() at zfs_freebsd_read+0x3e9/frame 0xffffff86ca5ca3b0
> nfsvno_read() at nfsvno_read+0x2db/frame 0xffffff86ca5ca490
> nfsrvd_read() at nfsrvd_read+0x3ff/frame 0xffffff86ca5ca710
> nfsrvd_dorpc() at nfsrvd_dorpc+0xc9/frame 0xffffff86ca5ca910
> nfssvc_program() at nfssvc_program+0x5da/frame 0xffffff86ca5caaa0
> svc_run_internal() at svc_run_internal+0x5fb/frame 0xffffff86ca5cabd0
> svc_thread_start() at svc_thread_start+0xb/frame 0xffffff86ca5cabe0
> 
> Sleeping in VM_WAIT while holding the ARC lock(s) means that other ARC
> operations may get blocked. And pretty much all ZFS I/O goes through
> the ARC.
> So that's why we see all those stuck nfsd threads.
> 
> Another factor greatly contributing to the problem is that currently
> the page daemon blocks (sleeps) in arc_lowmem (a vm_lowmem hook),
> waiting for the ARC reclaim thread to make a pass. This happens
> before the page daemon makes its own pageout pass.
> 
> But because tid 100639 holds the ARC lock(s), the ARC reclaim thread
> is blocked and cannot make any forward progress. Thus the page
> daemon also gets blocked, and so it cannot free up any pages.
> 
> 
> So, this situation is not a true deadlock. E.g. it is theoretically
> possible
> that some other threads would free some memory at their own will and
> the
> condition would clear up. But in practice this is highly unlikely.
> 
> Some possible resolutions that I can think of.
> 
> The best one is probably doing ARC memory allocations without holding
> any locks.
> 
> Also, maybe we should make a rule that no vm_lowmem hooks should
> sleep. That
> is, arc_lowmem should signal the ARC reclaim thread to do some work,
> but should
> not wait on it.
> 
> Perhaps we could also provide a mechanism to mark certain memory
> allocations as "special" and use that mechanism for ARC allocations,
> so that VM_WAIT unblocks sooner: in this case we had 8842 free pages
> (~35MB), but thread 100639 was not woken up.
> 
> I think that ideally we should do something about all three
> directions, but even one of them might turn out to be sufficient.
> As I've said, the first one seems to be the most promising, but it
> would require
> some tricky programming (flags and retries?) to move memory
> allocations out of
> locked sections.

For the NFSv4 stuff, I pre-allocate any structures that I might need
using malloc(..M_WAITOK) before going into the locked region. If I
don't need them, I just free() them at the end. (I assign the
allocation to "newp" and set "newp" to NULL if it gets used. If
"newp" != NULL at the end, then free(newp..);)

This avoids all the "go back and retry after doing an allocation"
complexity. (One of the big names, maybe Dijkstra, had a name for
this approach, but I can't remember.;-)

It won't work for cases where the locked region needs K allocations,
where K varies and has no fixed upper bound (malloc() in a loop with
no fixed number of iterations).

Thought I'd mention it, just in case the technique would be useful
in this case. (I have no idea what this code looks like;-)

Good luck with it, rick

> --
> Andriy Gapon
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to
> "freebsd-stable-unsubscribe at freebsd.org"

