NFS Locking Issue
Kostik Belousov
kostikbel at gmail.com
Wed Jul 5 12:21:21 UTC 2006
On Wed, Jul 05, 2006 at 02:38:22PM +0300, Kostik Belousov wrote:
> On Wed, Jul 05, 2006 at 10:09:24AM +0100, Robert Watson wrote:
> > The most significant problem working with rpc.lockd is creating easy to
> > reproduce test cases. Not least because they can potentially involve
> > multiple clients. If you can help to produce simple test cases to
> > reproduce the bugs you're seeing, that would be invaluable.
> >
> ........
> >
> > Reducing complex failure modes to easily reproduced test cases is tricky
> > also, though. It requires careful analysis, often with ktrace and
> > tcpdump/ethereal to work out what's going on, and not a little luck to
> > perform the reduction of a large trace down to a simple test scenario. The
> > first step is to try and figure out what, if any, specific workload results
> > in a problem. For example, can you trigger it using work on just one
> > client against a server, without client<->client interactions? This makes
> > tracking and reproduction a lot easier, as multi-client test cases are
> > really tricky! Once you've established whether it can be reproduced with a
> > single client, you have to track down the behavior that triggers it --
> > normally, this is done by attempting to narrow down the specific program or
> > sequence of events that causes the bug to trigger, removing things one at a
> > time to see what causes the problem to disappear. This is made more
> > difficult as lock managers are sensitive to timing, so removing a high load
> > item from the list, even if it isn't the source of the problem, might cause
> > it to trigger less frequently.
>
> I made a patch for rpc.lockd that could somewhat ease obtaining
> debug information. The patch is available at
> http://people.freebsd.org/~kib/rpc.lockd-debug.patch
>
> No functional changes. The patch only adds dumping of the currently held
> locks (as perceived by lockd) on receipt of SIGUSR1. You need to specify
> debug level 2 or 3 to obtain the dump.
>
> Also, both lockd processes now put identification information
> in the proctitle (srv and kern). SIGUSR1 should be sent to the srv process.
Hmm, after looking at the dump and reading some code, I have noted
the following:
1. The NLM lock request contains the field caller_name. It is filled in by
(let's call it) the kernel rpc.lockd from the result of gethostname(3).
2. The server rpc.lockd uses this caller_name to send a host monitoring
request to rpc.statd (see send_granted).
The request is made with clnt_call, which is a blocking RPC call.
3. rpc.statd calls getaddrinfo on caller_name to determine the address of
the host to monitor.
If the getaddrinfo in step 3 waits for the resolver, then the locking
process on your client machine will be stuck in the "lockd" state.
Could people experiencing the rpc.lockd mystery at least report whether
the _server_ machine successfully resolves the hostnames of the clients, as
reported by hostname? And, if yes, to which IP protocol family?
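
For anyone who wants to collect this quickly, here is a rough sketch of a
check that could be run on the server. It is not part of rpc.lockd or
rpc.statd; the function names (families_from_addrinfo, check_client_name)
are made up for illustration. It assumes Python is installed on the server
and that you pass it the names exactly as printed by hostname on each
client. It only calls getaddrinfo through the system resolver and reports
the address families and how long the lookup took, so a slow lookup hints
at the resolver wait described above.

```python
import socket
import sys
import time


def families_from_addrinfo(results):
    """Return the sorted, unique address family labels found in a list of
    getaddrinfo() result tuples."""
    names = set()
    for family, _socktype, _proto, _canonname, _sockaddr in results:
        if family == socket.AF_INET:
            names.add("IPv4")
        elif family == socket.AF_INET6:
            names.add("IPv6")
        else:
            names.add("other")
    return sorted(names)


def check_client_name(name):
    """Resolve `name` the way a getaddrinfo() caller would and return
    (resolved, families, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        results = socket.getaddrinfo(name, None)
    except socket.gaierror as exc:
        return (False, [], time.monotonic() - start, str(exc))
    return (True, families_from_addrinfo(results), time.monotonic() - start, "")


if __name__ == "__main__":
    # Arguments: client hostnames as reported by hostname on each client.
    for name in sys.argv[1:]:
        ok, families, elapsed, error = check_client_name(name)
        if ok:
            print("%s: resolves to %s in %.2fs" % (name, ", ".join(families), elapsed))
        else:
            print("%s: does NOT resolve after %.2fs (%s)" % (name, elapsed, error))
```

Save it under any name you like on the server and run it with the client
hostnames as arguments. A lookup that fails, or that takes many seconds
before answering, would be worth reporting along with the families printed.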