rpc.lockd brokenness (2)

Thu Mar 9 02:07:40 UTC 2006

> From: Kris Kennaway <kris at obsecurity.org>
> Subject: Re: rpc.lockd brokenness (2)
>
[...]
> OK, I misunderstood.  The rc.d script will signal cron to kill it,
> which should be closing the file descriptors and causing rpc.lockd to
> release the lock.  Perhaps this part is broken.  OK, I tested this
> with daemon -p, and it indeed seems to be broken:
>
> haessal# daemon -p pid_file sleep 100000
> haessal# kill -KILL `cat pid_file`
> haessal# ps -p `cat pid_file`
>   PID  TT  STAT      TIME COMMAND
> haessal# lockf -t 0 pid_file echo Yay
> lockf: pid_file: already locked

Well, your test is quite terse. Perhaps that result is to be expected
with SIGKILL, but the same thing happens with SIGTERM.

On the other hand, what happens there is not so strange: neither
pidfile.c nor daemon.c has any signal handling, so that should probably
be expected. Perhaps a lock simply cannot be released just because the
process that owns it died; that may be one of the limitations of
distributed services...

But cron should call pidfile_remove in its signal handlers, and it
should have a signal handler for SIGTERM for this purpose. I must look at
the pre-pidfile cron.

[...]
> > - cron shouldn't hang on startup just because the file is locked, since
> > pidfile_open opens it with O_NONBLOCK (unlike lockf).
>
> I haven't been able to reproduce this, e.g. lockf -t 0 does O_NONBLOCK
> locking and works correctly when the file is already locked.  Perhaps
> it's another locked file (not the pidfile) that was also leaked in the
> same way, and is being opened without O_NONBLOCK.
>
> > - cron shouldn't hang in such a way that it is not killable... (and should
> > not also the open system call in lockf be interruptible?)
>
> This is the bug (really: missing feature) that I described in my
> previous mail.

Shouldn't even a lock that is opened without O_NONBLOCK be interruptible by
a signal?

I don't understand why or how these things are unkillable. They made a
system call, so they're supposed to be inside the kernel; how can
rpc.lockd, a user process, keep them there...

Another thing: I have a question that maybe you can answer. I'm having
trouble getting rid of the lock on cron.pid, and, in the end, that's why
I can't boot normally. The lock persists even though the file is not
"physically" locked on the server. I've tried stopping nfslocking on both
sides and removing /var/db/statd.status on both. Is there any other
persistent storage for this?

Miguel