NFS deadlock on 9.2-Beta1

Sat Jul 27 06:54:38 UTC 2013

> 
> On Jul 24, 2013, at 5:25 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> > Michael Tratz wrote:
> >> Two machines (NFS Server: running ZFS / Client: disk-less), both are
> >> running FreeBSD r253506. The NFS client starts to deadlock processes
> >> within a few hours. It usually gets worse from there on. The
> >> processes stay in "D" state. I haven't been able to reproduce it
> >> when I want it to happen. I only have to wait a few hours until the
> >> deadlocks occur when traffic to the client machine starts to pick
> >> up. The only way to fix the deadlocks is to reboot the client. Even
> >> an ls to the path which is deadlocked, will deadlock ls itself. It's
> >> totally random what part of the file system gets deadlocked. The NFS
> >> server itself has no problem at all to access the files/path when
> >> something is deadlocked on the client.
> >> 
> >> Last night I decided to put an older kernel on the system r252025
> >> (June 20th). The NFS server stayed untouched. So far 0 deadlocks on
> >> the client machine (it should have deadlocked by now). FreeBSD is
> >> working hard like it always does. :-) There are a few changes to the
> >> NFS code from the revision which seems to work until Beta1. I
> >> haven't tried to narrow it down if one of those commits is causing
> >> the problem. Maybe someone has an idea what could be wrong and I can
> >> test a patch or if it's something else, because I'm not a kernel
> >> expert. :-)
> >> 
> > Well, the only NFS client change committed between r252025 and r253506
> > is r253124. It fixes a file corruption problem caused by a previous
> > commit that delayed the vnode_pager_setsize() call until after the
> > nfs node mutex lock was unlocked.
> > 
> > If you can test with only r253124 reverted to see if that gets rid of
> > the hangs, it would be useful, although from the procstats, I doubt it.
> > 
> >> I have run several procstat -kk on the processes including the ls
> >> which deadlocked. You can see them here:
> >> 
> >> http://pastebin.com/1RPnFT6r
> > 
> > All the processes you show seem to be stuck waiting for a vnode lock
> > or in __umtx_op_wait. (I'm not sure what the latter means.)
> > 
> > What is missing is what processes are holding the vnode locks and
> > what they are stuck on.
> > 
> > A starting point might be "ps axhl", to see what all the threads
> > are doing (particularly the WCHAN for them all). If you can drop into
> > the debugger when the NFS mounts are hung and do a "show alllocks"
> > that could help. See:
> > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > 
> > I'll admit I'd be surprised if r253124 caused this, but who knows.
> > 
> > If there have been changes to your network device driver between
> > r252025 and r253506, I'd try reverting those. (If an RPC gets stuck
> > waiting for a reply while holding a vnode lock, that would do it.)
> > 
> > Good luck with it and maybe someone else can think of a commit
> > between r252025 and r253506 that could cause vnode locking or network
> > problems.
> > 
> > rick
> > 
> >> 
> >> I have tried to mount the file system with and without nolockd. It
> >> didn't make a difference. Other than that it is mounted with:
> >> 
> >> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
> >> 
> >> Let me know if you need me to do something else or if some other
> >> output is required. I would have to go back to the problem kernel
> >> and wait until the deadlock occurs to get that information.
> >> 
> 
> Thanks Rick and Steven for your quick replies.
> 
> I spoke too soon regarding r252025 fixing the problem. The same issue started to show up after about 1 day and a few hours of uptime.
> 
> "ps axhl" shows all those stuck processes in newnfs.
> 
> I recompiled the GENERIC kernel for Beta1 with the debugging options:
> 
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> 
> ps and debugging output:
> 
> http://pastebin.com/1v482Dfw
> 
> (I only listed processes matching newnfs, if you need the whole list, please let me know)
> 
> The first PID showing up having that problem is 14001. Certainly the "show alllocks" command shows interesting information for that PID.
> I looked through the commit history for the files mentioned in the output to see if anything looked obviously wrong to me, but I can't tell. :-)
> I hope that information helps you dig deeper into what might be causing those deadlocks.
> 
> I did include the pciconf -lv, because you mentioned network device drivers. It's Intel igb. The same hardware is running a kernel from January 19th, 2013 also as an NFS client. That machine is rock solid. No problems at all.
> 
> I also went to r251611. That's before r251641 (The NFS FHA changes). Same problem. Here is another debugging output from that kernel:
> 
> http://pastebin.com/ryv8BYc4
> 
> If I should test something else or provide some other output, please let me know.
> 
> Again thank you!
> 
> Michael

Just a quick 'me too'. It usually happens on our FTP server, and it's been
happening for a long time. It's diskless, and it happens randomly, so it's
difficult to reproduce. We have many other diskless servers running quite
smoothly.

danny
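
For anyone else collecting the same data: the step of picking the threads
stuck in newnfs out of a saved ps listing can be scripted. What follows is
only a minimal sketch, not something from the thread. The function name
stuck_threads and the sample listing (PIDs, commands, sizes) are made up for
illustration, and it assumes the listing has a header row containing PID,
STAT and either MWCHAN or WCHAN, as the long format of ps(1) prints. Note
that on FreeBSD it is "-H" that shows individual threads, while "-h" repeats
the header, so the sketch skips repeated header rows.

```python
# Hypothetical sample in the shape of FreeBSD "ps axl" output.
# All values below are invented for illustration only.
SAMPLE = """\
UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT TT     TIME COMMAND
  0   101     0   0  20  0  6276  592 wait   ILs   -  0:00.04 /sbin/init --
 80   102   101   0  20  0 98765 4321 newnfs D     -  0:01.23 httpd -k start
UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT TT     TIME COMMAND
  0   103   101   0  20  0 14444 1888 newnfs D+    0  0:00.00 ls /data
"""


def stuck_threads(ps_output, wchan="newnfs"):
    """Return (pid, state, command) for rows whose wait channel equals wchan."""
    header = None
    rows = []
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A header row names the columns; it may repeat when ps is run with -h.
        if "PID" in fields and "STAT" in fields and (
            "MWCHAN" in fields or "WCHAN" in fields
        ):
            header = fields
            continue
        if header is None:
            continue  # ignore anything before the first header row
        # Split into as many fields as there are columns, so that a COMMAND
        # containing spaces stays in one piece as the last field.
        parts = line.split(None, len(header) - 1)
        if len(parts) < len(header):
            continue  # malformed or truncated row
        chan_col = header.index("MWCHAN" if "MWCHAN" in header else "WCHAN")
        if parts[chan_col] == wchan:
            pid = int(parts[header.index("PID")])
            state = parts[header.index("STAT")]
            rows.append((pid, state, parts[-1]))
    return rows


if __name__ == "__main__":
    for pid, state, command in stuck_threads(SAMPLE):
        print(pid, state, command)
```

To use it on a real machine, save the ps output to a file while the mounts
are hung and pass the file's contents to stuck_threads(). The resulting PIDs
are the ones worth feeding to "procstat -kk" and comparing against the
"show alllocks" output.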