NFS deadlock on 9.2-Beta1

Sat Jul 27 20:58:26 UTC 2013

On Sat, Jul 27, 2013 at 04:20:49PM -0400, Rick Macklem wrote:
> Michael Tratz wrote:
> > 
> > On Jul 24, 2013, at 5:25 PM, Rick Macklem <rmacklem at uoguelph.ca>
> > wrote:
> > 
> > > Michael Tratz wrote:
> > >> Two machines (NFS Server: running ZFS / Client: disk-less), both
> > >> are
> > >> running FreeBSD r253506. The NFS client starts to deadlock
> > >> processes
> > >> within a few hours. It usually gets worse from there on. The
> > >> processes stay in "D" state. I haven't been able to reproduce it
> > >> when I want it to happen. I only have to wait a few hours until
> > >> the
> > >> deadlocks occur when traffic to the client machine starts to pick
> > >> up. The only way to fix the deadlocks is to reboot the client.
> > >> Even
> > >> an ls to the path which is deadlocked, will deadlock ls itself.
> > >> It's
> > >> totally random what part of the file system gets deadlocked. The
> > >> NFS
> > >> server itself has no problem at all to access the files/path when
> > >> something is deadlocked on the client.
> > >> 
> > >> Last night I decided to put an older kernel on the system r252025
> > >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks
> > >> on
> > >> the client machine (it should have deadlocked by now). FreeBSD is
> > >> working hard like it always does. :-) There are a few changes to
> > >> the
> > >> NFS code from the revision which seems to work until Beta1. I
> > >> haven't tried to narrow it down if one of those commits are
> > >> causing
> > >> the problem. Maybe someone has an idea what could be wrong and I
> > >> can
> > >> test a patch or if it's something else, because I'm not a kernel
> > >> expert. :-)
> > >> 
> > > Well, the only NFS client change committed between r252025 and
> > > r253506
> > > is r253124. It fixes a file corruption problem caused by a previous
> > > commit that delayed the vnode_pager_setsize() call until after the
> > > nfs node mutex lock was unlocked.
> > > 
> > > If you can test with only r253124 reverted to see if that gets rid
> > > of
> > > the hangs, it would be useful, although from the procstats, I doubt
> > > it.
> > > 
> > >> I have run several procstat -kk on the processes including the ls
> > >> which deadlocked. You can see them here:
> > >> 
> > >> http://pastebin.com/1RPnFT6r
> > > 
> > > All the processes you show seem to be stuck waiting for a vnode
> > > lock
> > > or in __utmx_op_wait. (I`m not sure what the latter means.)
> > > 
> > > What is missing is what processes are holding the vnode locks and
> > > what they are stuck on.
> > > 
> > > A starting point might be ``ps axhl``, to see what all the threads
> > > are doing (particularily the WCHAN for them all). If you can drop
> > > into
> > > the debugger when the NFS mounts are hung and do a ```show
> > > alllocks``
> > > that could help. See:
> > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > 
> > > I`ll admit I`d be surprised if r253124 caused this, but who knows.
> > > 
> > > If there have been changes to your network device driver between
> > > r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
> > > waiting for a reply while holding a vnode lock, that would do it.)
> > > 
> > > Good luck with it and maybe someone else can think of a commit
> > > between r252025 and r253506 that could cause vnode locking or
> > > network
> > > problems.
> > > 
> > > rick
> > > 
> > >> 
> > >> I have tried to mount the file system with and without nolockd. It
> > >> didn't make a difference. Other than that it is mounted with:
> > >> 
> > >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> > >> 
> > >> Let me know if you need me to do something else or if some other
> > >> output is required. I would have to go back to the problem kernel
> > >> and wait until the deadlock occurs to get that information.
> > >> 
> > 
> > Thanks Rick and Steven for your quick replies.
> > 
> > I spoke too soon regarding r252025 fixing the problem. The same issue
> > started to show up after about 1 day and a few hours of uptime.
> > 
> > "ps axhl" shows all those stuck processes in newnfs
> > 
> > I recompiled the GENERIC kernel for Beta1 with the debugging options:
> > 
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > 
> > ps and debugging output:
> > 
> > http://pastebin.com/1v482Dfw
> > 
> > (I only listed processes matching newnfs, if you need the whole list,
> > please let me know)
> > 
> Is your "show alllocks" complete? If not, a complete list of locks
> would definitely help. As for "ps axhl", a complete list of processes/threads
> might be useful, but not as much, I think.
> 
> > The first PID showing up having that problem is 14001. Certainly the
> > "show alllocks" command shows interesting information for that PID.
> > I looked through the commit history for those files mentioned in the
> > output to see if there is something obvious to me. But I don't know.
> > :-)
> > I hope that information helps you to dig deeper into the issue what
> > might be causing those deadlocks.
> > 
> Well, pid 14001 is interesting in that it holds both the sleep lock
> acquired by sblock() and an NFS vnode lock, but is waiting for another
> NFS vnode lock, if I read the pastebin stuff correctly.
> 
> I suspect that this process is somewhere in kern_sendfile(), since that
> seems to be where sblock() gets called before vn_lock().
14001 is multithreaded, two its threads own shared vnode lock, and
third thread sleeps for a lock.

> 
> It's just a "shot in the dark", but if you can revert
> r250907 (dated May 22), it might be worth a try. It adds a bunch of
> stuff to kern_sendfile(), including vn_lock() calls and the date sounds
> about right.
The output provided is not useful.

I need the data listed in the referenced 'debugging deadlocks' page.

> 
> rick
> ps: I've added kib@ to the cc list, in case he can help with this. He
>     appears to have been involved in the commits MFC'd by r250907.
> 
> > I did include the pciconf -lv, because you mentioned network device
> > drivers. It's Intel igb. The same hardware is running a kernel from
> > January 19th, 2013 also as an NFS client. That machine is rock
> > solid. No problems at all.
> > 
> Ok, so it sounds like we are looking for a post-January 19 commit.
> 
> > I also went to r251611. That's before r251641 (The NFS FHA changes).
> The FHA changes are server related and my understanding is that the
> problem is showing up in an NFS client.
> 
> > Same problem. Here is another debugging output from that kernel:
> > 
> > http://pastebin.com/ryv8BYc4
> > 
> > If I should test something else or provide some other output, please
> > let me know.
> > 
> > Again thank you!
> > 
> > Michael
> > 
> > 
> > 
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 834 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20130727/7f415ccc/attachment.sig>