NFS deadlock on 9.2-Beta1

Sat Jul 27 22:13:09 UTC 2013

On Jul 27, 2013, at 1:20 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Michael Tratz wrote:
>> 
>> On Jul 24, 2013, at 5:25 PM, Rick Macklem <rmacklem at uoguelph.ca>
>> wrote:
>> 
>>> Michael Tratz wrote:
>>>> Two machines (NFS Server: running ZFS / Client: disk-less), both
>>>> are
>>>> running FreeBSD r253506. The NFS client starts to deadlock
>>>> processes
>>>> within a few hours. It usually gets worse from there on. The
>>>> processes stay in "D" state. I haven't been able to reproduce it
>>>> when I want it to happen. I only have to wait a few hours until
>>>> the
>>>> deadlocks occur when traffic to the client machine starts to pick
>>>> up. The only way to fix the deadlocks is to reboot the client.
>>>> Even
>>>> an ls to the path which is deadlocked, will deadlock ls itself.
>>>> It's
>>>> totally random what part of the file system gets deadlocked. The
>>>> NFS
>>>> server itself has no problem at all to access the files/path when
>>>> something is deadlocked on the client.
>>>> 
>>>> Last night I decided to put an older kernel on the system r252025
>>>> (June 20th). The NFS server stayed untouched. So far 0 deadlocks
>>>> on
>>>> the client machine (it should have deadlocked by now). FreeBSD is
>>>> working hard like it always does. :-) There are a few changes to
>>>> the
>>>> NFS code from the revision which seems to work until Beta1. I
>>>> haven't tried to narrow it down if one of those commits are
>>>> causing
>>>> the problem. Maybe someone has an idea what could be wrong and I
>>>> can
>>>> test a patch or if it's something else, because I'm not a kernel
>>>> expert. :-)
>>>> 
>>> Well, the only NFS client change committed between r252025 and
>>> r253506
>>> is r253124. It fixes a file corruption problem caused by a previous
>>> commit that delayed the vnode_pager_setsize() call until after the
>>> nfs node mutex lock was unlocked.
>>> 
>>> If you can test with only r253124 reverted to see if that gets rid
>>> of
>>> the hangs, it would be useful, although from the procstats, I doubt
>>> it.
>>> 
>>>> I have run several procstat -kk on the processes including the ls
>>>> which deadlocked. You can see them here:
>>>> 
>>>> http://pastebin.com/1RPnFT6r
>>> 
>>> All the processes you show seem to be stuck waiting for a vnode
>>> lock
>>> or in __utmx_op_wait. (I`m not sure what the latter means.)
>>> 
>>> What is missing is what processes are holding the vnode locks and
>>> what they are stuck on.
>>> 
>>> A starting point might be ``ps axhl``, to see what all the threads
>>> are doing (particularily the WCHAN for them all). If you can drop
>>> into
>>> the debugger when the NFS mounts are hung and do a ```show
>>> alllocks``
>>> that could help. See:
>>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>> 
>>> I`ll admit I`d be surprised if r253124 caused this, but who knows.
>>> 
>>> If there have been changes to your network device driver between
>>> r252025 and r253506, I`d try reverting those. (If an RPC gets stuck
>>> waiting for a reply while holding a vnode lock, that would do it.)
>>> 
>>> Good luck with it and maybe someone else can think of a commit
>>> between r252025 and r253506 that could cause vnode locking or
>>> network
>>> problems.
>>> 
>>> rick
>>> 
>>>> 
>>>> I have tried to mount the file system with and without nolockd. It
>>>> didn't make a difference. Other than that it is mounted with:
>>>> 
>>>> rw,nfsv3,tcp,noatime,rsize=32768,wsize=32768
>>>> 
>>>> Let me know if you need me to do something else or if some other
>>>> output is required. I would have to go back to the problem kernel
>>>> and wait until the deadlock occurs to get that information.
>>>> 
>> 
>> Thanks Rick and Steven for your quick replies.
>> 
>> I spoke too soon regarding r252025 fixing the problem. The same issue
>> started to show up after about 1 day and a few hours of uptime.
>> 
>> "ps axhl" shows all those stuck processes in newnfs
>> 
>> I recompiled the GENERIC kernel for Beta1 with the debugging options:
>> 
>> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>> 
>> ps and debugging output:
>> 
>> http://pastebin.com/1v482Dfw
>> 
>> (I only listed processes matching newnfs, if you need the whole list,
>> please let me know)
>> 
> Is your "show alllocks" complete? If not, a complete list of locks
> would definitely help. As for "ps axhl", a complete list of processes/threads
> might be useful, but not as much, I think.

Yes that was the entire output for show alllocks.  

> 
>> The first PID showing up having that problem is 14001. Certainly the
>> "show alllocks" command shows interesting information for that PID.
>> I looked through the commit history for those files mentioned in the
>> output to see if there is something obvious to me. But I don't know.
>> :-)
>> I hope that information helps you to dig deeper into the issue what
>> might be causing those deadlocks.
>> 
> Well, pid 14001 is interesting in that it holds both the sleep lock
> acquired by sblock() and an NFS vnode lock, but is waiting for another
> NFS vnode lock, if I read the pastebin stuff correctly.
> 
> I suspect that this process is somewhere in kern_sendfile(), since that
> seems to be where sblock() gets called before vn_lock().
> 
> It's just a "shot in the dark", but if you can revert
> r250907 (dated May 22), it might be worth a try. It adds a bunch of
> stuff to kern_sendfile(), including vn_lock() calls and the date sounds
> about right.
> 
> rick
> ps: I've added kib@ to the cc list, in case he can help with this. He
>    appears to have been involved in the commits MFC'd by r250907.

Thanks for adding kib. He already responded and it looks like he needs more data.

I'll give your suggestion a try, too. I already was thinking I'm going to build some kernels from the revision I know is working fine till we hit one which starts that issue.
Then it will be easier to narrow it down what commit might be causing the problem.

> 
>> I did include the pciconf -lv, because you mentioned network device
>> drivers. It's Intel igb. The same hardware is running a kernel from
>> January 19th, 2013 also as an NFS client. That machine is rock
>> solid. No problems at all.
>> 
> Ok, so it sounds like we are looking for a post-January 19 commit.

I did more testing in the meantime. I added another machine. Using kernel r253698. Same thing after a few hours deadlocks.
I did capture a show alllocks of this deadlock. But it's probably not going to be useful for kib.

http://pastebin.com/MnpVZN85

The machine which was the original problem child has been running without a hitch with that Jan 19 kernel (r245665) for almost 2 days. 

I accidentally found out if a NFS server is freshly rebooted, the deadlock usually happens within an hour or two. That allows me now to pretty much reproduce the problem quickly.

I'll test more and I'll also see what data kib needs and provide that to him.

> 
>> I also went to r251611. That's before r251641 (The NFS FHA changes).
> The FHA changes are server related and my understanding is that the
> problem is showing up in an NFS client.

That's correct. Only the NFS client is having a problem. The NFS server remained unchanged during all the testing and is running r253506.
I was just stumbling around in the dark and trying out different things. :-)

> 
>> Same problem. Here is another debugging output from that kernel:
>> 
>> http://pastebin.com/ryv8BYc4
>> 
>> If I should test something else or provide some other output, please
>> let me know.
>> 
>> Again thank you!
>> 
>> Michael
>> 
>> 
>> 
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"