NFS deadlock on 9.2-Beta1

Thu Aug 22 01:08:19 UTC 2013

Kostik wrote:
> On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
> > J David wrote:
> > > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> > > > Have you been able to pass the debugging info on to Kostik?
> > > >
> > > > It would be really nice to get this fixed for FreeBSD9.2.
> > > 
> > > You're probably not talking to me, but headway here is slow.  At our
> > > location, we have been continuing to test releng/9.2 extensively, but
> > > with r250907 reverted.  Since reverting it solves the issue, and since
> > > there haven't been any further changes to releng/9.2 that might also
> > > resolve this issue, re-applying r250907 is perceived here as un-fixing
> > > a problem.  Enthusiasm for doing so is correspondingly low, even if
> > > the purpose is to gather debugging info. :(
> > > 
> > > However, we finally got clearance to test releng/9.2 r254540 with
> > > r250907 included and with DDB on five nodes.  The problem cropped
> > > up in about an hour.  Two threads in one process deadlocked, which
> > > was perfect.  Got it into DDB, but the stack trace was scrolling off,
> > > so there was no way to copy it by hand.  Also, the machine's disk is
> > > smaller than physical RAM, so no dump file. :(
> > > 
> > > Here's what is available so far:
> > > 
> > > db> show proc 33362
> > > Process 33362 (httpd) at 0xcd225b50:
> > >  state: NORMAL
> > >  uid: 25000 gids: 25000
> > >  parent: pid 25104 at 0xc95f92d4
> > >  ABI: FreeBSD ELF32
> > >  arguments: /usr/local/libexec/httpd
> > >  threads: 3
> > > 100405 D newnfs 0xc9b875e4 httpd
> > > 
> > Ok, so this one is waiting for an NFS vnode lock.
> > 
> > > 100393 D pgrbwt 0xc43a30c0 httpd
> > > 
> > This one is sleeping in vm_page_grab() { which I suspect has
> > been called from kern_sendfile() with a shared vnode lock held,
> > from what I saw on the previous debug info }.
> > 
> > > 100755 S uwait 0xc84b7c80 httpd
> > > 
> > > 
> > > Not much to go on. :(  Maybe these five can be configured with serial
> > > consoles.
> > > 
> > > So, inquiries are continuing, but the answer to "does this still
> > > happen on 9.2-RC2?" is definitely yes.
> > > 
> > Since r250027 moves a vn_lock() to before the vm_page_grab() call in
> > kern_sendfile(), I suspect that is the cause of the deadlock.
> > (r250027 is one of the 3 commits MFC'd by r250907.)
> > 
> > I don't know if it would be safe to VOP_UNLOCK() the vnode after
> > VOP_GETATTR() and then put the vn_lock() call that comes after
> > vm_page_grab() back in, or whether r250027 should be reverted (getting
> > rid of the VOP_GETATTR() and going back to using the size in the vm
> > stuff).
> > 
> > Hopefully Kostik will know what is best to do with it now, rick
> 
> I already described what to do with this.  I need the debugging
> information to see what is going on.  Without the data, it only
> wastes the time of everybody involved.
> 
Sorry, I didn't make what I was asking clear. I was referring specifically
to stopping the hang from occurring in the soon-to-be-released 9.2.

I think you indirectly answered the question, in that you don't know
of a fix for the hangs without more debugging information. This
implies that reverting r250907 is the main option to resolve this
for the 9.2 release (unless more debugging info arrives very soon),
since that is the only fix that has been confirmed to work.
Does this sound reasonable?

> Some technical notes.  The sendfile() uses shared lock for the duration
> of vnode i/o, so any thread which is sleeping on the vnode lock cannot
> be in the sendfile path, at least for UFS and NFS which do support true
> shared locks.
> 
> The right lock order is vnode lock -> page busy wait. From this PoV,
> the ordering in the sendfile is correct. Rick, are you aware of any
> situation where the VOP_READ in nfs client could drop vnode lock
> and then re-acquire it? I was not able to find this from the code
> inspection. But, if such situation exists, it would be problematic in 9.
> 
I am not aware of a case where nfs_read() drops/re-acquires the vnode
lock.

However, readaheads started by nfs_read() may still be in progress when
it returns, and therefore after the vnode lock is dropped.

vfs_busy_pages() will have been called on the page(s) that readahead
is in progress on (I think that means the shared busy bit will be set,
if I understood vfs_busy_pages()). When the readahead is completed,
bufdone() is called, so I don't understand why the page wouldn't become
unbusied (waking up the thread sleeping on "pgrbwt").
I can't see why not being able to acquire the vnode lock would affect
this, but my hunch is that it somehow does have this effect, since that
is the only way I can see that r250907 would cause the hangs.

> Last note.  HEAD dropped pre-busying pages in the sendfile() syscall.
> As I understand, this is because Attilio's new busy implementation
> cannot support both busy and sbusy states simultaneously, and
> vfs_busy_pages()/vfs_drain_busy_pages() actually created such a
> situation. I think that because the sbusy is removed from the
> sendfile(), and the vm object lock is dropped, there is no sense to
> require vm_page_grab() to wait for the busy state to clear.  It is done
> by buffer cache or filesystem code later. See the patch at the end.
> 
Wouldn't a readahead in progress have the page sbusied (via vfs_busy_pages())
and wouldn't vm_page_grab() need to wait until that readahead is done, so
that the page has valid data in it?

> Still, I do not know what happens in the supposedly reported deadlock.
> 
Neither do I. I just suspect that holding the shared vnode lock while
sleeping in vm_page_grab() somehow stops the page from being unbusied.

rick

> diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c
> index 4797444..b974f53 100644
> --- a/sys/kern/uipc_syscalls.c
> +++ b/sys/kern/uipc_syscalls.c
> @@ -2230,7 +2230,8 @@ retry_space:
>  			pindex = OFF_TO_IDX(off);
>  			VM_OBJECT_WLOCK(obj);
>  			pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
> -			    VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
> +			    VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL |
> +			    VM_ALLOC_WIRED | VM_ALLOC_RETRY);
>  
>  			/*
>  			 * Check if page is valid for what we need,
>