NFS locking: lockf freezes (rpc.lockd problem?)

Sun Aug 27 19:17:40 UTC 2006

On Sun, 27 Aug 2006, Kostik Belousov wrote:
> Make sure that rpc.statd is running.
Yep.  It took me a while to figure that one out, but the first lockf test 
failed without it.
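
Besides checking that the daemons are running, it may be worth checking
that they are registered with the portmapper on both hosts.  The sketch
below filters `rpcinfo -p` output down to the lock manager (nlockmgr,
i.e. rpc.lockd) and the status monitor (status, i.e. rpc.statd).  The
helper name lock_services is made up for illustration, and the filter
assumes the usual five-column layout with the service name last.

```shell
#!/bin/sh
# Keep only the rpc.lockd (nlockmgr) and rpc.statd (status) rows from
# `rpcinfo -p` output read on stdin.  Assumes the service name is the
# fifth column, as in the usual "program vers proto port service" layout.
lock_services() {
    awk '$5 == "nlockmgr" || $5 == "status"'
}

# Usage, with the hostnames from this thread:
#   rpcinfo -p saturn | lock_services
#   rpcinfo -p venus  | lock_services
```

If either service is missing from one side's listing, that side is
probably the place to look first.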

> For debugging purposes, tcpdump of the corresponding communications 
> would be quite useful. Besides this, output of ps auxww | grep 'rpc\.' 
> may be interesting.

Um.  How interesting would tcpdump be?  I'm prepared to do the work, but 
as I've never used the tool, it may take me some effort and time to figure 
out the right commands.  Yes: `man tcpdump | wc -l` == 1543.  Fancy 
giving me a sample command to try?
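
One plausible starting point is sketched below.  The interface name
(em0) and the capture file path are assumptions, not values from this
thread, and the filter is by host rather than by port because the
lockd/statd ports are assigned through the portmapper and are not fixed.
The script only prints the command so it can be checked before being run
as root.

```shell
#!/bin/sh
# Sketch only: IFACE and OUT are placeholders, not values from this thread.
IFACE=${IFACE:-em0}              # assumed network interface on venus
OUT=${OUT:-/tmp/nfslock.pcap}    # assumed capture file
SERVER=${SERVER:-saturn}         # NFS server named in this thread

# -n    do not resolve addresses to names
# -i    interface to listen on
# -s 0  capture whole packets instead of truncating them
# -w    write raw packets to a file for later inspection
capture_cmd() {
    echo "tcpdump -n -i $IFACE -s 0 -w $OUT host $SERVER"
}

capture_cmd    # prints the command; run it as root while reproducing the hang
```

The saved file can be read back later with `tcpdump -n -r` followed by
the file name.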

As for the other test, let's have a look.  Here we are before the test 
(NFS server, 4.11, is saturn, test machine, 6.1, is venus):

saturn$ ps auxww | grep rpc\\.
root    48917  0.0  0.1   980  640  ??  Is    7:56am   0:00.01 rpc.lockd
root      115  0.0  0.1 263096  536  ??  Is   18Aug06   0:00.00 rpc.statd

venus# ps auxww | grep rpc\\.
root     510  0.0  0.9 263460  1008  ??  Ss    6:05PM   0:00.01 /usr/sbin/rpc.statd
root     515  0.0  1.0  1416  1120  ??  Is    6:05PM   0:00.02 /usr/sbin/rpc.lockd
daemon   520  0.0  1.0  1420  1124  ??  I     6:05PM   0:00.00 /usr/sbin/rpc.lockd

That's interesting.  Don't know how significant the differences are... 
Ok, let's run the test:

venus# cd /usr/src; make installworld DESTDIR=/mnt

Well, how odd: as soon as I start the test, process 515 on venus goes 
away.  Now to wait for it to fail... (doesn't take too long):

saturn$ ps auxww | grep rpc\\.
root    48917  0.0  0.1   980  640  ??  Is    7:56am   0:00.01 rpc.lockd
root      115  0.0  0.1 263096  536  ??  Is   18Aug06   0:00.00 rpc.statd

venus# ps auxww | grep rpc\\.
root     510  0.0  0.9 263460   992  ??  Ss    6:05PM   0:00.01 /usr/sbin/rpc.statd
daemon   520  0.0  1.0  1440  1152  ??  S     6:05PM   0:00.01 /usr/sbin/rpc.lockd
venus# ps auxww | grep lockf
...
root    7034  0.0  0.5  1172   528  v0  D+    6:51PM   0:00.01 lockf -k /mnt/usr/...

(I've truncated the lockf call: the detail of the install call it's making 
is hardly relevant!)

Note that now any call to lockf on this server will fail...  Hmm.  What 
about a different mount point?  Bet I can't unmount ...

venus# umount /mnt
umount: unmount of /mnt failed: Device busy
venus# umount -f /mnt
venus# mount saturn:/tmp /mnt
venus# lockf /mnt/test ls
(Hangs)

Now this is interesting: the file saturn:/tmp/test exists!  And it appears 
to be owned by uid=4294967294 (-2?)!  How very odd.  If I reboot venus and 
try just a single lockf:

venus# lockf /mnt/test stat -f%u /mnt/test
0

As one might expect, indeed.  A hint as to who's got stuck (saturn, I'm 
sure), but beside the point, I guess.
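
On the odd owner seen earlier: 4294967294 is indeed -2 when the uid is
read as a signed 32-bit number (2^32 - 2).  A small sketch of the
arithmetic follows; the helper name as_signed32 is made up, and it
assumes a shell with 64-bit arithmetic.  As far as I know, -2 is the
credential FreeBSD's exports(5) documents for remote root when no
-maproot is given, which would fit the file having been created on
saturn's side.

```shell
#!/bin/sh
# Reinterpret an unsigned 32-bit value as a signed 32-bit one.
# Assumes the shell does arithmetic in at least 64 bits.
as_signed32() {
    if [ "$1" -ge 2147483648 ]; then
        echo $(( $1 - 4294967296 ))   # subtract 2^32 for values >= 2^31
    else
        echo "$1"
    fi
}

as_signed32 4294967294    # prints -2
```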

Note also that the `umount -f /mnt` *didn't* release the lockf, and also 
note that /tmp/test is still there (on saturn) after a reboot of venus.

In conclusion: I agree with Greg Byshenk that the NFS server is bound to 
be the one at fault, BUT, is this "freeze until reboot" behaviour really 
what we want?  I remain astonished (and irritated) that `kill -9` doesn't 
work!