NFS deadlock on 9.2-Beta1

Konstantin Belousov kostikbel at gmail.com
Thu Aug 22 09:20:54 UTC 2013


On Wed, Aug 21, 2013 at 09:08:10PM -0400, Rick Macklem wrote:
> Kostik wrote:
> > On Tue, Aug 20, 2013 at 06:18:16PM -0400, Rick Macklem wrote:
> > > J David wrote:
> > > > On Thu, Aug 15, 2013 at 5:39 PM, Rick Macklem
> > > > <rmacklem at uoguelph.ca>
> > > > wrote:
> > > > > Have you been able to pass the debugging info on to Kostik?
> > > > >
> > > > > It would be really nice to get this fixed for FreeBSD9.2.
> > > > 
> > > > You're probably not talking to me, but headway here is slow.  At our
> > > > location, we have been continuing to test releng/9.2 extensively, but
> > > > with r250907 reverted.  Since reverting it solves the issue, and since
> > > > there haven't been any further changes to releng/9.2 that might also
> > > > resolve this issue, re-applying r250907 is perceived here as un-fixing
> > > > a problem.  Enthusiasm for doing so is correspondingly low, even if
> > > > the purpose is to gather debugging info. :(
> > > > 
> > > > However, we finally got clearance to test releng/9.2 r254540 with
> > > > r250907 included and with DDB enabled on five nodes.  The problem
> > > > cropped up in about an hour.  Two threads in one process deadlocked,
> > > > which was perfect.  We got it into DDB and saw that the stack trace
> > > > was scrolling off, so there was no way to copy it by hand.  Also, the
> > > > machine's disk is smaller than physical RAM, so no dump file. :(
> > > > 
> > > > Here's what is available so far:
> > > > 
> > > > db> show proc 33362
> > > > 
> > > > Process 33362 (httpd) at 0xcd225b50:
> > > > 
> > > >  state: NORMAL
> > > > 
> > > >  uid: 25000 gids: 25000
> > > > 
> > > >  parent: pid 25104 at 0xc95f92d4
> > > > 
> > > >  ABI: FreeBSD ELF32
> > > > 
> > > >  arguments: /usr/local/libexec/httpd
> > > > 
> > > >  threads: 3
> > > > 
> > > > 100405 D newnfs 0xc9b875e4 httpd
> > > > 
> > > Ok, so this one is waiting for an NFS vnode lock.
> > > 
> > > > 100393 D pgrbwt 0xc43a30c0 httpd
> > > > 
> > > This one is sleeping in vm_page_grab() { which I suspect has
> > > been called from kern_sendfile() with a shared vnode lock held,
> > > from what I saw on the previous debug info }.
> > > 
> > > > 100755 S uwait 0xc84b7c80 httpd
> > > > 
> > > > 
> > > > Not much to go on. :(  Maybe these five can be configured with serial
> > > > consoles.
> > > > 
> > > > So, inquiries are continuing, but the answer to "does this still
> > > > happen on 9.2-RC2?" is definitely yes.
> > > > 
> > > Since r250027 moves a vn_lock() to before the vm_page_grab() call in
> > > kern_sendfile(), I suspect that is the cause of the deadlock.  (r250027
> > > is one of the 3 commits MFC'd by r250907.)
> > >
> > > I don't know if it would be safe to VOP_UNLOCK() the vnode after
> > > VOP_GETATTR() and then put the vn_lock() call that comes after
> > > vm_page_grab() back in, or whether r250027 should be reverted (getting
> > > rid of the VOP_GETATTR() and going back to using the size from the vm
> > > object).
> > > 
> > > Hopefully Kostik will know what is best to do with it now, rick
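
To make the ordering concrete, the part of kern_sendfile() in question
looks roughly like the sketch below after r250027.  This is a condensed
paraphrase for illustration only, not the exact releng/9.2 source (in
particular, the object-lock macro names differ between stable/9 and
head); the pre-r250027 placement of vn_lock() is noted in the comments:

	/* Per-chunk loop of kern_sendfile(), condensed sketch. */
	error = vn_lock(vp, LK_SHARED | LK_RETRY);	/* moved up by r250027 */
	if (error == 0)
		error = VOP_GETATTR(vp, &va, td->td_ucred); /* file size, instead of the vm object size */

	VM_OBJECT_LOCK(obj);
	pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
	    VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
	VM_OBJECT_UNLOCK(obj);

	/*
	 * Before r250027 the shared vnode lock was taken only after the
	 * grab, so this thread could not be sleeping on a page busy state
	 * ("pgrbwt") while already holding the vnode lock.
	 */
	/* ... validate/read the page and send it ... */
	VOP_UNLOCK(vp, 0);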
> > 
> > I already described what to do with this.  I need the debugging
> > information to see what is going on.  Without the data, everyone
> > involved is just wasting their time.
> > 
> Sorry, I didn't make what I was asking clear. I was referring specifically
> to stopping the hang from occurring in the soon-to-be-released 9.2.
> 
> I think you indirectly answered the question, in that you don't know
> of a fix for the hangs without more debugging information. This
> implies that reverting r250907 is the main option to resolve this
> for the 9.2 release (unless more debugging info arrives very soon),
> since that is the only fix that has been confirmed to work.
> Does this sound reasonable?
I do not object to reverting it for 9.2.  Please go ahead.

On the other hand, I do not want to revert it in stable/9, at least
until the cause is understood.

> 
> > Some technical notes.  sendfile() uses a shared lock for the duration
> > of the vnode i/o, so any thread which is sleeping on the vnode lock
> > cannot be in the sendfile path, at least for UFS and NFS, which do
> > support true shared locks.
> > 
> > The right lock order is vnode lock -> page busy wait.  From this PoV,
> > the ordering in sendfile is correct.  Rick, are you aware of any
> > situation where VOP_READ in the NFS client could drop the vnode lock
> > and then re-acquire it?  I was not able to find one by code inspection.
> > But if such a situation exists, it would be problematic in 9.
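
Spelled out, that order means a thread may sleep waiting for a page busy
state only after it already holds the vnode lock, never the other way
around.  The sketch below is purely illustrative (hypothetical thread
bodies, not taken from any real code path) and only shows the two shapes:

	/* Correct order, as in the sendfile() and VOP_READ() paths: */
	vn_lock(vp, LK_SHARED | LK_RETRY);		/* 1. vnode lock        */
	VM_OBJECT_LOCK(obj);
	pg = vm_page_grab(obj, pindex, VM_ALLOC_NORMAL |
	    VM_ALLOC_NOBUSY | VM_ALLOC_RETRY);		/* 2. may sleep on busy */
	VM_OBJECT_UNLOCK(obj);
	VOP_UNLOCK(vp, 0);

	/* A deadlocking inversion would require some thread to do: */
	vm_page_io_start(pg);				/* hold the page sbusy ...          */
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);		/* ... then sleep on the vnode lock */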
> > 
> I am not aware of a case where nfs_read() drops/re-acquires the vnode
> lock.
> 
> However, readaheads may still be in progress when nfs_read() returns,
> and therefore can still be in progress after the vnode lock is dropped.
>
> vfs_busy_pages() will have been called on the page(s) that the readahead
> is in progress on (I think that means the shared busy bit will be set,
> if I understood vfs_busy_pages() correctly).  When the readahead
> completes, bufdone() is called, so I don't understand why the page
> wouldn't become unbusied (waking up the thread sleeping on "pgrbwt").
Exactly, this is the part which I do not understand either.
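
For reference, the expected life cycle of a readahead page is roughly the
following; the sketch is reconstructed from memory of the stable/9 NFS
client and buffer cache code, so treat the exact calls as approximate:

	/* Issue side, in the NFS client read-ahead path (nfs_clbio.c): */
	rabp = nfs_getcacheblk(vp, rablkno, biosize, td);
	if (rabp != NULL && (rabp->b_flags & (B_CACHE | B_DELWRI)) == 0) {
		rabp->b_flags |= B_ASYNC;
		rabp->b_iocmd = BIO_READ;
		vfs_busy_pages(rabp, 0);		/* sbusy the buffer's pages */
		(void)ncl_asyncio(nmp, rabp, cred, td);	/* hand off to an nfsiod    */
	}

	/*
	 * Completion side: when the RPC finishes, bufdone() is called on
	 * the buffer and its B_VMIO completion path ends up calling
	 * vm_page_io_finish() on each of the buffer's pages.  Once the
	 * sbusy count drops to zero, a thread sleeping on that page (for
	 * example in vm_page_grab(), wmesg "pgrbwt") should be woken up.
	 */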

> I can't see why not being able to acquire the vnode lock would affect
> this, but my hunch is that it somehow does have this effect, since that
> is the only way I can see that r250907 would cause the hangs.
> 
> > Last note.  HEAD dropped the pre-busying of pages in the sendfile()
> > syscall.  As I understand it, this is because Attilio's new busy
> > implementation cannot support both the busy and sbusy states
> > simultaneously, and vfs_busy_pages()/vfs_drain_busy_pages() actually
> > created such a situation.  I think that because the sbusy is removed
> > from sendfile(), and the vm object lock is dropped, there is no sense
> > in requiring vm_page_grab() to wait for the busy state to clear.  That
> > is done by the buffer cache or filesystem code later.  See the patch
> > at the end.
> > 
> Wouldn't a readahead in progress have the page sbusied (via vfs_busy_pages())
> and wouldn't vm_page_grab() need to wait until that readahead is done, so
> that the page has valid data in it?
The vm object lock is dropped immediately after the grab, and another
thread might busy the page in the meantime.  In essence, it is the duty
of the filesystem (through the vn_rdwr->VOP_READ call) to ensure that the
page has valid content after VOP_READ returns, and the filesystem should
handle the busy state as well.
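
In other words, when the grabbed page is not fully valid, sendfile() just
asks the filesystem to fill it, roughly as in the condensed sketch below
(flags abbreviated, error handling omitted):

	/* pg was grabbed above with VM_ALLOC_NOBUSY; if it is not valid: */
	if (pg->valid != VM_PAGE_BITS_ALL) {
		VM_OBJECT_UNLOCK(obj);
		/*
		 * Read through the buffer cache with UIO_NOCOPY; it is
		 * VOP_READ(), underneath vn_rdwr(), that busies the page,
		 * fills it and marks it valid before returning.
		 */
		error = vn_rdwr(UIO_READ, vp, NULL, MAXBSIZE,
		    trunc_page(off), UIO_NOCOPY, IO_NODELOCKED | IO_VMIO,
		    td->td_ucred, NOCRED, &resid, td);
	}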

What happens for the filesystems which use the buffer cache is that the
vfs_busy_pages() call on the buffer first waits for the busy state to
drain, and then busies the pages.  So it is already handled correctly
inside VOP_READ().
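
Condensed, the sequence inside vfs_busy_pages() for a B_VMIO buffer is
roughly the following (reconstructed from memory of sys/kern/vfs_bio.c,
so details may differ):

	VM_OBJECT_LOCK(obj);
	vfs_drain_busy_pages(bp);	/* sleep until no page of bp is exclusively busy */
	for (i = 0; i < bp->b_npages; i++)
		vm_page_io_start(bp->b_pages[i]);	/* then sbusy each page for the I/O */
	VM_OBJECT_UNLOCK(obj);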

> 
> > Still, I do not know what happens in the reported deadlock.
> > 
> Neither do I. I just suspect that holding the shared vnode lock
> while sleeping in vm_page_grab() somehow stops the page from being
> unbusied?
> 
> rick
> 
> > diff --git a/sys/kern/uipc_syscalls.c b/sys/kern/uipc_syscalls.c
> > index 4797444..b974f53 100644
> > --- a/sys/kern/uipc_syscalls.c
> > +++ b/sys/kern/uipc_syscalls.c
> > @@ -2230,7 +2230,8 @@ retry_space:
> >  			pindex = OFF_TO_IDX(off);
> >  			VM_OBJECT_WLOCK(obj);
> >  			pg = vm_page_grab(obj, pindex, VM_ALLOC_NOBUSY |
> > -			    VM_ALLOC_NORMAL | VM_ALLOC_WIRED | VM_ALLOC_RETRY);
> > +			    VM_ALLOC_IGN_SBUSY | VM_ALLOC_NORMAL |
> > +			    VM_ALLOC_WIRED | VM_ALLOC_RETRY);
> >  
> >  			/*
> >  			 * Check if page is valid for what we need,
> > 