rpc.lockd brokenness (2)

Wed Mar 8 14:01:29 UTC 2006

> From: Kris Kennaway <kris at obsecurity.org>
> Subject: Re: rpc.lockd brokenness (2)
>
> I wonder if something else is going wrong and it's not rpc.lockd at
> all.

Oh, it's a locking problem alright. But perhaps not in rpc.lockd...

> It looks like this wasn't made using -s 0 - sorry if I wasn't
> explicit.

You must give all details to rookies...

I've changed things a bit, but perhaps there's a test now which is more easily
reproducible on other systems.

The following tcpdumps were obtained by booting in single-user mode on the
diskless machine and running the following initialization sequence:
        # mount -u /
        # /etc/rc.d/netif start
        # /etc/rc.d/rpcbind start
        # /etc/rc.d/nfsclient start
        # /etc/rc.d/nfslocking start

And then, with /var/run/cron.pid removed,
        # /etc/rc.d/cron start
        Starting cron.
        # /etc/rc.d/cron stop
        # /etc/rc.d/nfslocking stop
        # /etc/rc.d/nfsclient stop
        # /etc/rc.d/rpcbind stop
        # reboot
        see http://mega.ist.utl.pt/~mlsr/nfs-nofile.bin
        Everything seemed to be ok, but /var/run/cron.pid was left locked on
        the server.

Then, with /var/run/cron.pid still locked,
        # /etc/rc.d/cron start
        ... cron already running (pid=111).. something like that, which is ok
        # /etc/rc.d/cron stop
        # reboot
        see http://mega.ist.utl.pt/~mlsr/nfs-lockedpass.bin
        The result of this test is ok, but when booting multiuser, cron still
        hangs instead of saying it's already running. When I checked, by
        accident on a third machine, whether /var/run/cron.pid was still
        locked, with
        # lockf -k -t 1 .../var/run/cron.pid echo ok
        lockf hung on that third machine in spite of the -t 1 parameter, and
        it remained unkillable.

With /var/run/cron.pid still locked, on the first client, in single-user mode,
after the same initialization sequence:
        # lockf -k -t 1 /var/run/cron.pid echo ok
        Hangs... always.
        see http://mega.ist.utl.pt/~mlsr/nfs-lockfhang.bin
        (this tcpdump is quite big, perhaps it included loading the kernel)
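
For comparison, here is a minimal Python sketch of what a one-second lock
timeout is expected to do on a filesystem where locking works. It is not how
lockf(1) is implemented; the helper names (lock_with_timeout, demo) and the
use of flock(2) through Python's fcntl module are assumptions made for
illustration. It polls with non-blocking attempts and gives up at the deadline.

```python
import fcntl
import os
import tempfile
import time


def lock_with_timeout(path, timeout, poll=0.05):
    """Try to take an exclusive flock on path.

    Returns the open file descriptor on success, or None once `timeout`
    seconds have passed without getting the lock.
    """
    fd = os.open(path, os.O_RDONLY | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt: fails at once if someone else holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                return None
            time.sleep(poll)
        except OSError:
            # Any other locking error: do not leak the descriptor.
            os.close(fd)
            raise


def demo():
    """Hold the lock on one descriptor and time out on a second one."""
    with tempfile.NamedTemporaryFile() as tmp:
        holder = lock_with_timeout(tmp.name, 1.0)
        blocked = lock_with_timeout(tmp.name, 0.3)  # lock is held elsewhere
        os.close(holder)                            # closing drops the lock
        retry = lock_with_timeout(tmp.name, 1.0)
        got_retry = retry is not None
        if got_retry:
            os.close(retry)
        return holder is not None, blocked is None, got_retry


if __name__ == "__main__":
    print(demo())
```

On a local filesystem the second attempt should come back empty-handed after
the timeout instead of hanging. If the lock request itself blocks inside the
kernel, as it seems to in the hang above, a user-space deadline like this one
would not help either.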

Now, given this, since the hang also occurs with lockf, I tried another test,
on a different machine (the one that's called dual). The tcpdump was done
on the server: tcpdump -s 0 -w nfs-other.bin host dual and udp port nfs
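
A quick way to confirm that a capture really was taken with -s 0 is to read
the snapshot length stored in the file's global header. The sketch below
assumes the classic pcap file format (24-byte header, not pcapng); the function
names are made up for illustration.

```python
import struct

# Magic number as it appears on disk -> struct byte-order prefix.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def pcap_snaplen(header):
    """Return the snaplen field from a 24-byte classic pcap global header."""
    if len(header) < 24:
        raise ValueError("pcap global header is 24 bytes; got fewer")
    order = PCAP_MAGICS.get(header[:4])
    if order is None:
        raise ValueError("not a classic pcap file")
    # Fields after the magic: version major, version minor, timezone offset,
    # timestamp accuracy, snaplen, link-layer type.
    _major, _minor, _zone, _sigfigs, snaplen, _linktype = struct.unpack(
        order + "HHiIII", header[4:24]
    )
    return snaplen


def file_snaplen(path):
    """Read the snaplen of the capture file at `path`."""
    with open(path, "rb") as f:
        return pcap_snaplen(f.read(24))


if __name__ == "__main__":
    # Synthetic header standing in for a real capture file.
    fake = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    print(pcap_snaplen(fake))
```

Calling file_snaplen("nfs-other.bin") on a downloaded capture would report the
limit it was recorded with. A value well under 100 bytes would suggest the
packets were truncated, while a large value is what -s 0 should produce.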

Now, with two vts on the client, in the first, this sequence:
        # mkdir test
        # mount compaq:/x1 test
        # touch test/lock-file ; lockf -k -t 1 test/lock-file sh
        #

On the second vt,
        # lockf -k -t 1 test/lock-file echo ok
        It hung. Tried ^C. Still hung.

On the first vt,
        # exit

On the second vt, lockf had returned to the prompt.

The tcpdump is on http://mega.ist.utl.pt/~mlsr/nfs-other.bin
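
For anyone without lockf(1) at hand, the two-vt test can be approximated with
two processes. This is only a sketch under assumptions: it uses flock(2) via
Python's fcntl module instead of the lockf utility, and the function names are
hypothetical. The child plays the first vt (takes the lock and holds it until
told to exit), the parent plays the second vt.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """One non-blocking exclusive lock attempt; return the fd or None."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)
        return None


def two_process_test(path):
    """Return (refused_while_held, granted_after_exit) for `path`."""
    ready_r, ready_w = os.pipe()  # child -> parent: "I hold the lock"
    done_r, done_w = os.pipe()    # parent -> child: "you may exit"
    pid = os.fork()
    if pid == 0:
        # Child: the "first vt". Never return into the parent's code path.
        status = 1
        try:
            os.close(ready_r)
            os.close(done_w)
            fd = os.open(path, os.O_RDONLY)
            fcntl.flock(fd, fcntl.LOCK_EX)
            os.write(ready_w, b"L")
            os.read(done_r, 1)    # wait for the parent's signal
            status = 0
        finally:
            os._exit(status)      # exiting releases the lock
    # Parent: the "second vt".
    os.close(ready_w)
    os.close(done_r)
    os.read(ready_r, 1)           # child holds the lock (or has died)
    held = try_lock(path)
    refused_while_held = held is None
    if held is not None:
        os.close(held)
    os.write(done_w, b"X")        # equivalent of typing "exit" on the first vt
    os.waitpid(pid, 0)
    after = try_lock(path)
    granted_after_exit = after is not None
    if after is not None:
        os.close(after)
    os.close(ready_r)
    os.close(done_w)
    return refused_while_held, granted_after_exit


def demo():
    """Run the test against a temporary local file."""
    with tempfile.NamedTemporaryFile() as tmp:
        return two_process_test(tmp.name)


if __name__ == "__main__":
    print(demo())
```

On a filesystem with working locks, the attempt made while the child holds the
lock should be refused immediately and the attempt after the child exits
should succeed. Pointing two_process_test at a file on the NFS mount (for
example test/lock-file) would show whether the same holds there.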

The output of uname -a on the client (dual) is:
FreeBSD dual 6.1-PRERELEASE FreeBSD 6.1-PRERELEASE #0: Tue Mar  7 18:03:35 WET 2006     root@dual:/usr/obj/usr/src/sys/DUAL  i386

and on the server (compaq) is:
FreeBSD compaq 6.1-PRERELEASE FreeBSD 6.1-PRERELEASE #3: Tue Feb 14 13:04:11 WET 2006     root@dual:/usr/obj/usr/src/sys/COMPAQ  i386

Please also try what I did: two vts on a client, both trying to lock the same
file on the server with lockf. The description of my problem is becoming
increasingly similar to what is in pr bin/80something.

Greetings,

Miguel