NFS deadlock on 9.2-Beta1

Rick Macklem rmacklem at uoguelph.ca
Mon Jul 29 23:37:15 UTC 2013


Michael Tratz wrote:
> 
> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
> <kostikbel at gmail.com> wrote:
> 
> > On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
> >> Let's assume the pid which started the deadlock is 14001 (it will
> >> be a different pid when we get the results, because the machine
> >> has been restarted)
> >> 
> >> I type:
> >> 
> >> show proc 14001
> >> 
> >> I get the thread numbers from that output and type:
> >> 
> >> show thread xxxxx
> >> 
> >> for each one.
> >> 
> >> And a trace for each thread with the command?
> >> 
> >> tr xxxx
> >> 
> >> Anything else I should try to get or do? Or is that not at all
> >> the data you are looking for?
> >> 
> > > Yes, everything else which is listed in the 'debugging deadlocks' page
> > > must be provided, otherwise the deadlock cannot be tracked.
> > 
> > > The investigator should be able to see the whole deadlock chain (loop)
> > > to make any useful advance.
> 
> Ok, I have made some excellent progress in debugging the NFS
> deadlock.
> 
> Rick! You are a genius. :-) You found the right commit: r250907 (dated
> May 22) is definitely the problem.
> 
Nowhere close, take my word for it;-) (At least you put a smiley after it.)
(I've never actually even been employed as a software developer, but that's off topic.)

I just got lucky (basically there wasn't any other commit that seemed like it might cause this).

But, the good news is that it is partially isolated. Hopefully the debugging stuff
you get for Kostik will allow him (I suspect he is a genius) to solve the problem.
(If I were going to take another "shot in the dark", I'd guess it's r250027 moving
 the vn_lock() call. Maybe calling vm_page_grab() with the shared vnode lock held?)

I've added re@ to the cc list, since I think this might be a show stopper for 9.2?

Thanks for reporting this and all your help with tracking it down, rick

> Here is how I did the testing: One machine received a kernel before
> r250907, the second machine received a kernel after r250907. Sure
> enough within a few hours the machine with r250907 went into the
> usual deadlock state. The machine without that commit kept on
> working fine. Then I went back to the latest revision (r253726), but
> left r250907 out. The machines have been running happily and rock
> solid without any deadlocks. I have expanded the testing to 3
> machines now and no reports of any issues.
> 
> I guess now Konstantin has to figure out why that commit is causing
> the deadlock. Lovely! :-) I will get that information as soon as
> possible. I'm a little behind with normal work load, but I expect to
> have the data by Tuesday evening or Wednesday.
> 
> Thanks again!!
> 
> Michael
> 
> 


More information about the freebsd-stable mailing list