rpc.lockd resource starvation
Robert Watson
rwatson at freebsd.org
Thu Jan 15 15:38:16 PST 2004
On Thu, 15 Jan 2004, Dan Nelson wrote:
> I think you just told me why my two busiest NFS servers had to be
> rebooted a few months ago (one with 440 days of uptime :( ). Does the
> mount fail with "mount: Can't assign requested address"? If so, it also
> happens on 4.x servers. Currently, they have 214 and 109 open reserved
> ports (after 102 and 73 days uptime, respectively), and I'm betting
> there are no more than 5 files actually locked on either system. I
> wonder if it's just not closing sockets when it's done with them?
There are a number of "known bugs/features" in rpc.lockd, but I have to
say that this one is new to me. The issues I know about are:
(1) There appear to be problems relating to rpc.lockd and/or rpc.statd
following client reboots. I've experienced problems between a Solaris
file server and a FreeBSD NFSv3 client using locking wherein a client
crash/reboot doesn't release the locks. It could be our rpc.statd
simply doesn't work...?
(2) There is a known problem involving aborted lock requests -- currently,
PCATCH is disabled in the kernel tsleep() in the client, because
there's no way to signal to the userspace rpc.lockd that a lock
"wasn't wanted after all". If you add PCATCH back, every time you
abort a lock request with a signal you leak a lock. The
kernel/userspace protocol needs to be expanded a bit so that the abort
can be sent to userspace, and userspace then needs to know what to do
about it.
(3) There seems to be a general fault tolerance issue in
situations where rpc.lockd gets back a lock acknowledgement for a lock
it didn't request. For safety, it should really release the lock,
which would mask (1) and sometimes (2).
(4) There seem to be some issues with waking up processes waiting on lock
requests when the lock arrives. I sent an e-mail about this a while
back, and should dig it up along with my lock testing scenarios and
document this better.
(5) I think there's also a problem with leaking locks when an application
requests the lock using O_NONBLOCK; the request is sent out, but
bad things happen if the lock is granted.
(6) I believe there was also some problem relating to a series of
processes waiting for the same lock on the same client, and not all of
them eventually getting the lock.
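
The safety behaviour suggested in (3) can be sketched roughly as follows. This is only an illustration of the idea, not rpc.lockd code: the class `LockClient`, the `cookie` identifiers and the `send_unlock` callback are all hypothetical names made up for the example.

```python
# Hypothetical sketch: none of these names come from rpc.lockd itself.

class LockClient:
    def __init__(self, send_unlock):
        self.pending = set()            # cookies of lock requests still wanted
        self.held = set()               # cookies of locks granted and kept
        self.send_unlock = send_unlock  # callback that releases a lock on the server

    def request(self, cookie):
        # Record that somebody is waiting for this lock.
        self.pending.add(cookie)

    def abort(self, cookie):
        # The caller gave up (e.g. interrupted by a signal) before the grant arrived.
        self.pending.discard(cookie)

    def on_grant(self, cookie):
        # Called when the server acknowledges a lock.
        if cookie in self.pending:
            self.pending.remove(cookie)
            self.held.add(cookie)
            return True
        # Nobody is waiting for this lock: release it rather than leak it.
        self.send_unlock(cookie)
        return False
```

With something along these lines, a grant that arrives after an aborted request (as in (2)) or after a client restart (as in (1)) would be handed straight back to the server instead of staying locked.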
I'll dig through my past e-mail and see if I can't dig up the details.
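
As a starting point for a lock test along the lines of (5) and (6), here is a minimal sketch. It assumes that "non-blocking" means an fcntl-style request with `LOCK_NB`, which may not be exactly the case described above, and the function names are invented for the example. To exercise rpc.lockd, the path would have to be on an NFS mount; on a local file system it only shows the expected behaviour.

```python
import fcntl
import os
import tempfile


def try_lock_nonblocking(path):
    """Return True if an exclusive lock was obtained without blocking."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # The lock is held by another process (EAGAIN or EACCES).
        return False
    finally:
        os.close(fd)  # closing the descriptor drops any lock this process took


def contended_lock_test(path):
    """Hold a lock in the parent, then try a non-blocking lock in a child."""
    fd = os.open(path, os.O_RDWR)
    fcntl.lockf(fd, fcntl.LOCK_EX)      # parent takes the lock
    pid = os.fork()
    if pid == 0:
        # Child: fcntl locks are not inherited, so this request must contend.
        got = try_lock_nonblocking(path)
        os._exit(0 if got else 1)
    _, status = os.waitpid(pid, 0)
    child_got_lock = os.WEXITSTATUS(status) == 0
    os.close(fd)                        # parent releases the lock
    return child_got_lock


if __name__ == "__main__":
    tmp_fd, tmp_path = tempfile.mkstemp()
    os.close(tmp_fd)
    print("child got lock while parent held it:", contended_lock_test(tmp_path))
    print("lock available afterwards:", try_lock_nonblocking(tmp_path))
    os.unlink(tmp_path)
```

If the second check fails after the parent has released the lock, that would point at a leaked lock of the kind described above.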
Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org   Senior Research Scientist, McAfee Research