NFS Locking Issue

Wed Jul 5 09:09:27 UTC 2006

On Wed, 5 Jul 2006, Danny Braniss wrote:

> In my case our main servers are NetApp, and the problems are more related to 
> am-utils running into some race condition (need more time to debug this :-) 
> the other problem is related to throughput: FreeBSD is slower than Linux, 
> and while NFS over TCP is faster than over UDP on FreeBSD, on Linux it's the 
> same. So it seems some tuning is needed.
>
> our main problem now is samba/rpc.lockd, we are stuck with a server running 
> FreeBSD 5.4 which crashes, and we can't upgrade to 6.1 because lockd doesn't 
> work.
>
> So, if someone is willing to look into the lockd issue, we would like to 
> help.

The most significant problem in working with rpc.lockd is creating 
easy-to-reproduce test cases, not least because they can potentially involve 
multiple clients.  If you can help to produce simple test cases that reproduce 
the bugs you're seeing, that would be invaluable.

I'm aware of two general classes of problems with rpc.lockd.  First, 
architectural issues, some derived from architectural problems in the NLM 
protocol: for example, the assumption that there is a clean mapping of 
process lock owners to locks, which falls down because locks are properties of 
file descriptors that can be inherited.  Second, implementation 
bugs/misfeatures, such as the kernel not knowing how to cancel lock requests, 
and so being unable to implement interruptible waits on locks in the 
distributed case.
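
To make the first class of problem concrete, here is a minimal sketch of a 
lock outliving the process that acquired it.  It assumes flock(2)-style 
semantics on a local filesystem; the function name and the pipe used to park 
the child are my own illustration, and behavior over NFS will depend on the 
client and on rpc.lockd, so treat it as a demonstration of the local 
semantics only.

```python
import fcntl
import os


def lock_survives_in_child(path):
    """Return True if a flock() taken by the parent is still held after the
    parent closes its descriptor, because a forked child inherited it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)

    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: holds the inherited descriptor and waits for EOF on the pipe.
        os.close(w)
        os.read(r, 1)
        os._exit(0)

    os.close(r)
    os.close(fd)  # the "owner" process no longer has the file open

    # A fresh open of the same file gets an independent descriptor.
    probe = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        still_locked = False
    except OSError:
        # Lock request refused: the child's inherited descriptor holds it.
        still_locked = True
    finally:
        os.close(probe)

    os.close(w)  # EOF lets the child exit
    os.waitpid(pid, 0)
    return still_locked
```

If this returns True, the lock is being held by a process that never asked 
for it, which is exactly the case a per-process owner mapping cannot 
describe.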

Reducing complex failure modes to easily reproduced test cases is also tricky, 
though.  It requires careful analysis, often with ktrace and tcpdump/ethereal 
to work out what's going on, and not a little luck to perform the reduction of 
a large trace down to a simple test scenario.  The first step is to try and 
figure out what, if any, specific workload results in a problem.  For example, 
can you trigger it using work on just one client against a server, without 
client<->client interactions?  This makes tracking and reproduction a lot 
easier, as multi-client test cases are really tricky!  Once you've established 
whether it can be reproduced with a single client, you have to track down the 
behavior that triggers it -- normally, this is done by attempting to narrow 
down the specific program or sequence of events that causes the bug to 
trigger, removing things one at a time to see what causes the problem to 
disappear.  This is made more difficult because lock managers are sensitive 
to timing: removing a high-load item from the list, even if it isn't the 
source of the problem, might cause the bug to trigger less frequently.
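
As a starting point for the single-client case, something along these lines 
may be useful.  This is only a sketch: the function name, the iteration 
count and the mount path in the comment are assumptions of mine, and it 
exercises nothing more than repeated exclusive whole-file locks via 
fcntl-style locking.  Run it against a file on the NFS mount and look for 
iterations that take far longer than the rest, or that never return.

```python
import fcntl
import os
import time


def lock_cycle(path, iterations=100):
    """Acquire and release an exclusive lock repeatedly on one descriptor.

    Returns the time in seconds taken by each acquire/release pair.
    """
    latencies = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for _ in range(iterations):
            start = time.monotonic()
            fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
            fcntl.lockf(fd, fcntl.LOCK_UN)
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
    return latencies


# Hypothetical usage, with /mnt/nfs standing in for your NFS mount point:
#   times = lock_cycle("/mnt/nfs/locktest", iterations=1000)
#   print(max(times), sum(times) / len(times))
```

Because the call blocks, a hang here while tcpdump is capturing on the 
client gives you a small trace to work from, rather than one buried in the 
traffic of a full workload.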

Robert N M Watson
Computer Laboratory
University of Cambridge