rpc.lockd brokenness (2)

Thu Mar 9 00:26:50 UTC 2006

> From: Kris Kennaway <kris at obsecurity.org>
> Subject: Re: rpc.lockd brokenness (2)
>
> This is intentional.  It's how pidfile_*() tests whether the process
> is still running.  The intention is that if someone tries to open the
> pidfile again while the first process is still running, the lock
> acquisition will fail and we'll know the other process is still alive,
> and therefore avoid starting a second instance.

No, no, you misunderstood me. The pidfile is left locked after cron has
stopped running (with /etc/rc.d/cron stop). This behaviour must be wrong.

> Your main problem seems to be that you're mounting the same /var via
> NFS from multiple client machines.  This is basically a bad idea to
> begin with because /var expects to be private to each machine (even if
> locking worked as expected, you'd not be able to start cron on more
> than one machine because it would fail as above).  Even if you solved
> this there would be other similar problems.

No, it's the whole filesystem tree for a single client; no one else uses
those files. Hanging a third machine was an accident: I was testing whether
cron.pid was still locked and thought I had a window open on the
server...

My single problem is locking. Actually, it worked well before I upgraded
this system to 6-STABLE. It's just for one laptop whose disk I don't want
to partition.

> In fact the diskless boot infrastructure in /etc will set up and use a
> md /var for this purpose.

Actually, they don't advise using an md /var, only /etc. Anyway, I don't use
that, because it's my only diskless machine. I have a single NFS-mounted /
and an md /tmp. Nothing is shared with anyone else, not even /usr,
because it's my only amd64.

> There is a (known) lockd bug here though, which you isolated:
>

So, this really is bin/80389?
If so, I can tell Jun Kuriyama that his patch didn't change it.

> > With /var/run/cron.pid still locked, on the first client, single-user, same
> > initialization sequence
> >         # lockf -k -t 1 /var/run/cron.pid echo ok
> >         Hangs... always.
>
> which is that lock requests through rpc.lockd cannot be cancelled, so
> they'll hang until the operation succeeds or fails.  In this case
> lockf does a blocking lock request and expects to cancel it with a
> signal after the timer expires, but rpc.lockd doesn't know how to back
> out lock requests so it just hangs forever or until something else
> unlocks the file on the server.
>
> Kris
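
The pattern Kris describes (issue a blocking lock request, then cancel it
with a timer signal) can be sketched on a local filesystem as below. This is
only an illustration, not the code of lockf(1): it assumes flock() plus an
interval timer, whereas lockf may take the lock inside open(), and the names
lock_with_timeout, LockTimeout and demo.lock are made up for the example.
On a local file the signal interrupts the wait; the point of the thread is
that through rpc.lockd the equivalent cancellation does not happen.

```python
import fcntl
import os
import signal
import tempfile


class LockTimeout(Exception):
    """Raised from the signal handler to abandon a blocking lock request."""


def _on_alarm(signum, frame):
    raise LockTimeout()


def lock_with_timeout(fd, seconds):
    """Block in flock() for at most `seconds`; return True if the lock was taken."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)           # blocking lock request
        return True
    except LockTimeout:
        return False                             # the timer signal cancelled the wait
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def demo():
    # Hypothetical lock file in a scratch directory (assumed to be local, not NFS).
    path = os.path.join(tempfile.mkdtemp(), "demo.lock")
    holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(holder, fcntl.LOCK_EX)           # someone already holds the lock
    waiter = os.open(path, os.O_RDWR)            # separate open, so the locks conflict
    timed_out = not lock_with_timeout(waiter, 0.2)
    os.close(holder)                             # holder goes away, lock is dropped
    acquired = lock_with_timeout(waiter, 0.2)
    os.close(waiter)
    return timed_out, acquired
```

Calling demo() should first time out while the lock is held and then succeed
once the holder has closed its descriptor.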

I am a bit disappointed. This problem didn't cause me any trouble before
I went to 6-STABLE; now I must either disable cron or disable locking (which
I can't).
And I'm still not completely convinced. That problem, if I understand correctly,
existed before January...

There are three things...
- cron.pid shouldn't be locked after cron terminated. (this interaction was
fully saved as http://mega.ist.utl.pt/~mlsr/nfs-nofile.bin)
- cron shouldn't hang on startup just because the file is locked, since
pidfile_open opens it with O_NONBLOCK (unlike lockf).
- cron shouldn't hang in such a way that it is not killable... (and shouldn't
the open system call in lockf also be interruptible?)
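
To make the first two expectations concrete, here is a minimal sketch of the
behaviour I would expect from a pidfile lock on a local filesystem. It is not
the code of pidfile.c: it approximates the non-blocking open with flock() and
LOCK_NB, and try_lock and demo.pid are hypothetical names. A second attempt
is refused at once while the lock is held, and the lock disappears as soon
as the holder closes the file (or exits).

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Open `path` and try a non-blocking exclusive lock; return the fd or None."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_NONBLOCK, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)      # lock is held elsewhere: fail immediately, do not wait
        return None
    return fd


def demo():
    # Hypothetical pidfile in a scratch directory (assumed to be local, not NFS).
    path = os.path.join(tempfile.mkdtemp(), "demo.pid")
    holder = try_lock(path)           # stands in for the running daemon
    while_running = try_lock(path)    # a second instance should be refused
    os.close(holder)                  # daemon stops; closing the fd drops the lock
    after_exit = try_lock(path)       # the lock should now be free again
    results = (holder is not None, while_running is None, after_exit is not None)
    if after_exit is not None:
        os.close(after_exit)
    return results
```

What I see over NFS differs on both counts: the file stays locked after cron
has stopped, and the second attempt hangs instead of failing.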

So, I'm led to believe that beyond that issue with rpc.lockd, which,
I understand, is an unresolved problem, there is now another problem,
perhaps with pidfile.c...

Thank you for all your time on this issue. I'm still going to try to chase
it, although I only have the knowledge to find it if it is in pidfile.c or
in cron. I understand too little of the interaction between the kernel and
the rest of NFS to chase it if it is somewhere else.

Miguel