rpc.lockd resource starvation

Thu Jan 15 15:38:16 PST 2004

On Thu, 15 Jan 2004, Dan Nelson wrote:

> I think you just told me why my two busiest NFS servers had to be
> rebooted a few months ago (one with 440 days of uptime :( ).  Does the
> mount fail with "mount: Can't assign requested address"?  If so, it also
> happens on 4.x servers.  Currently, they have 214 and 109 open reserved
> ports (after 102 and 73 days uptime, respectively), and I'm betting
> there are no more than 5 files actually locked on either system.  I
> wonder if it's just not closing sockets when it's done with them? 

There are a number of "known bugs/features" in rpc.lockd, but I have to
say that this one is new to me.  The issues I know about are:

(1) There appear to be problems relating to rpc.lockd and/or rpc.statd
    following client reboots.  I've experienced problems between a Solaris
    file server and a FreeBSD NFSv3 client using locking wherein a client
    crash/reboot doesn't release the locks.  It could be our rpc.statd
    simply doesn't work...?

(2) There is a known problem involving aborted lock requests -- currently,
    PCATCH is disabled in the kernel tsleep() in the client, because
    there's no way to signal to the userspace rpc.lockd that a lock
    "wasn't wanted after all".  If you add PCATCH back, every time you
    abort a lock request with a signal you leak a lock.  The
    kernel/userspace protocol needs to be expanded a bit so that the abort
    can be sent to userspace, and userspace then needs to know what to do
    about it.

(3) There seems to be a general fault-tolerance issue when rpc.lockd
    gets back a lock acknowledgement for a lock it didn't request (or no
    longer wants).  For safety, it should really release the lock, which
    would mask (1) and sometimes (2).

(4) There seem to be some issues with waking up processes waiting on lock
    requests when the lock arrives.  I sent an e-mail about this a while
    back, and should dig it up along with my lock testing scenarios and
    document this better. 

(5) I think there's also a problem with leaking locks when an application
    requests the lock using O_NONBLOCK; the request is sent out, but
    bad things happen if the lock is granted.

(6) I believe there was also some problem relating to a series of
    processes waiting for the same lock on the same client, and not all of
    them eventually getting the lock.
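
To make (2) and (3) a bit more concrete, here is a toy model of the
bookkeeping involved.  It is only a sketch under stated assumptions, not
how rpc.lockd is actually structured: the LockClient class, its method
names, and the integer "cookie" used to match grants to requests are all
hypothetical.  The point it illustrates is that a grant arriving for a
request nobody is waiting on should be answered with a release rather
than silently kept.

```python
class LockClient:
    """Toy model of client-side lock bookkeeping (hypothetical names)."""

    def __init__(self):
        self.next_cookie = 0
        self.pending = {}    # cookie -> file name, for requests still wanted
        self.held = set()    # file names the client believes it holds
        self.released = []   # unlock messages "sent" back to the server

    def request(self, name):
        """Record an outstanding lock request and return its cookie."""
        self.next_cookie += 1
        self.pending[self.next_cookie] = name
        return self.next_cookie

    def abort(self, cookie):
        """The waiting process was interrupted; forget the request."""
        self.pending.pop(cookie, None)

    def on_grant(self, cookie, name):
        """Handle a lock acknowledgement from the server."""
        if cookie in self.pending:
            del self.pending[cookie]
            self.held.add(name)
            return "granted"
        # Nobody is waiting for this lock: release it instead of leaking it.
        self.released.append(name)
        return "released"


client = LockClient()
wanted = client.request("/export/a")
abandoned = client.request("/export/b")
client.abort(abandoned)                          # signal arrived while sleeping
print(client.on_grant(wanted, "/export/a"))      # still wanted
print(client.on_grant(abandoned, "/export/b"))   # no longer wanted
```

In this model the aborted request does not leak, because the late grant
is turned into a release.  Without the fallback branch in on_grant(),
the server would consider "/export/b" locked while no process on the
client holds it.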

I'll dig through my past e-mail and see if I can find the details.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org      Senior Research Scientist, McAfee Research