rpc.lockd brokenness (2)

Wed Mar 8 22:45:33 UTC 2006

On Wed, Mar 08, 2006 at 02:01:24PM +0000, Miguel Lopes Santos Ramos wrote:

> > I wonder if something else is going wrong and it's not rpc.lockd at
> > all.
> 
> Oh, it's a locking problem alright. But perhaps not in rpc.lockd...

OK, I think I understand what is going on now...sort of.

> > It looks like this wasn't made using -s 0 - sorry if I wasn't
> > explicit.
> 
> You must give all details to rookies...

Sorry.

> I've changed things a bit, but perhaps there's a test now which is more easily
> reproducible on other systems.
> 
> The following tcpdumps were obtained by booting in single-user mode on the
> diskless machine and doing the following sequence for initialization:
>         # mount -u /
>         # /etc/rc.d/netif start
>         # /etc/rc.d/rpcbind start
>         # /etc/rc.d/nfsclient start
>         # /etc/rc.d/nfslocking start
> 
> And then, with /var/run/cron.pid removed,
>         # /etc/rc.d/cron start
>         Starting cron.
>         # /etc/rc.d/cron stop
>         # /etc/rc.d/nfslocking stop
>         # /etc/rc.d/nfsclient stop
>         # /etc/rc.d/rpcbind stop
>         # reboot
>         see http://mega.ist.utl.pt/~mlsr/nfs-nofile.bin
>         Everything seemed to be ok, but /var/run/cron.pid was left locked on
>         the server.

This is intentional.  It's how pidfile_*() tests whether the process
is still running.  The intention is that if someone tries to open the
pidfile again while the first process is still running, the lock
acquisition will fail and we'll know the other process is still alive,
and therefore avoid starting a second instance.
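
To make the pidfile check concrete, here is a rough sketch of the idea
in Python.  It is not the pidfile_*() code itself (that is a C API);
the function name try_lock_pidfile and the use of flock() are
assumptions made purely for illustration.  The point is only that a
non-blocking exclusive lock that fails means "someone else still holds
the pidfile".

```python
import fcntl
import os
import tempfile


def try_lock_pidfile(path):
    """Return an fd holding an exclusive lock on path, or None if already locked.

    Hypothetical helper: it imitates the pidfile idea, not the real API.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking: fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # another open file already holds the lock
    # We own the lock: record our pid and keep the fd open while running.
    os.ftruncate(fd, 0)
    os.write(fd, (str(os.getpid()) + "\n").encode())
    return fd


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.pid")
    first = try_lock_pidfile(path)   # first "instance" gets the lock
    second = try_lock_pidfile(path)  # second attempt is refused
    print("first locked:", first is not None)
    print("second refused:", second is None)
```

The lock lives exactly as long as the descriptor stays open, which is
why a lock held on the pidfile is read as "the process is alive".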

Your main problem seems to be that you're mounting the same /var via
NFS from multiple client machines.  This is a bad idea to begin with,
because /var is expected to be private to each machine (even if
locking worked as expected, you could not start cron on more than one
machine, because the second instance would fail to lock the pidfile
as above).  Even if you solved this, there would be other similar
problems.

In fact, the diskless boot infrastructure in /etc will set up and use
an md (memory disk) /var for this purpose.

There is a (known) lockd bug here though, which you isolated:

> With /var/run/cron.pid still locked, on the first client, single-user, same
> initialization sequence
>         # lockf -k -t 1 /var/run/cron.pid echo ok
>         Hangs... always.

which is that lock requests through rpc.lockd cannot be cancelled, so
they hang until the operation succeeds or fails.  In this case lockf
makes a blocking lock request and expects to cancel it with a signal
when the timer expires, but rpc.lockd doesn't know how to back out a
lock request, so lockf hangs until something else unlocks the file on
the server, or forever if nothing does.
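
The "blocking request plus timer" pattern looks roughly like the
following Python sketch.  It is an approximation for illustration, not
the lockf source; lock_with_timeout and LockTimeout are made-up names.
On a local filesystem the alarm interrupts the blocked flock() and the
call gives up.  The bug described above is that over rpc.lockd the
pending request cannot be backed out this way.

```python
import fcntl
import os
import signal
import tempfile


class LockTimeout(Exception):
    """Raised from the SIGALRM handler to abort a blocked lock request."""


def _on_alarm(signum, frame):
    raise LockTimeout()


def lock_with_timeout(fd, seconds):
    """Blocking flock() on fd that gives up after `seconds`.

    Hypothetical helper; must be called from the main thread because it
    installs a signal handler.  Returns True if the lock was acquired.
    """
    old = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until granted or interrupted
        return True
    except LockTimeout:
        fcntl.flock(fd, fcntl.LOCK_UN)  # make sure we hold nothing
        return False
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending timer
        signal.signal(signal.SIGALRM, old)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.lock")
    holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    waiter = os.open(path, os.O_RDWR)
    fcntl.flock(holder, fcntl.LOCK_EX)
    print("acquired while held:", lock_with_timeout(waiter, 0.2))
    fcntl.flock(holder, fcntl.LOCK_UN)
    print("acquired after release:", lock_with_timeout(waiter, 0.2))
```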

Kris