9-STABLE -> NFS -> NetAPP:

Mon Feb 11 01:43:22 UTC 2013

Just reset the server, so any further details will have to wait until 'next time' … but I just did a csup and am rebuilding … the following three files were modified since the last build:

grep nfs /tmp/output
 Edit src/sys/fs/nfs/nfs_commonsubs.c
 Edit src/sys/fs/nfsclient/nfs_clrpcops.c
 Edit src/sys/fs/nfsserver/nfs_nfsdserv.c

On 2013-02-10, at 4:56 PM, Marc Fournier <scrappy at hub.org> wrote:

> 
> On 2013-02-10, at 4:31 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
>> Marc Fournier wrote:
>>> Hi John …
>>> 
>>> Does this help?
>>> 
>>> root@io:~ # ps auxl | grep du
>>> root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
>>> root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
>>> root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
>>> root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd
>> It is probably too late, but all the lines (without the | grep du) would be
>> more useful. I also include the "H" flag, so it lists threads as well as
>> processes. The above just says the "du" command is waiting for a vnode lock.
>> The interesting process/thread is the one that is holding a vnode lock
>> while waiting for something else.
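
A minimal sketch of how a full `ps auxlH` listing could be scanned for the threads Rick describes. It groups PIDs by wait channel so that anything not sleeping on the vnode lock (`newnfs` in the excerpt above) stands out. It assumes the wait channel (MWCHAN) is the last column, as in the `ps auxl` excerpt; `group_by_wchan` and `not_waiting_on` are hypothetical helper names, and the sample lines are the ones quoted above.

```python
from collections import defaultdict

# Sample lines taken from the `ps auxl | grep du` excerpt in this thread.
SAMPLE = [
    "root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs",
    "root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs",
    "root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs",
    "root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd",
]


def group_by_wchan(ps_lines):
    """Map each wait channel (assumed to be the last column) to a list of PIDs."""
    groups = defaultdict(list)
    for line in ps_lines:
        fields = line.split()
        # Skip the header row and anything too short to be a full `ps auxl` row.
        if len(fields) < 12 or not fields[1].isdigit():
            continue
        groups[fields[-1]].append(int(fields[1]))
    return dict(groups)


def not_waiting_on(groups, wchan="newnfs"):
    """Return the groups whose wait channel differs from `wchan`."""
    return {name: pids for name, pids in groups.items() if name != wchan}
```

Run against the complete listing rather than the grep'd one, the second helper narrows the output to the candidates worth inspecting by hand; it does not by itself identify which thread holds the lock.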
> 
> As requested, 'ps auxlH' attached …
> 
> 
> <ps.out.bz2>
> 
>> 
>> Are you still getting the:
>> nfs_getpages: error 13
>> vm_fault: pager read error, pid 11355 (https)
> 
> Fairly quiet:
> 
> <Screen Shot 2013-02-10 at 4.43.55 PM.png>
> 
> And that is it since last reboot ~20 days ago … 
> 
>> 
>> messages logged?
>> 
>> With John's recent patch, the error# would no longer be 13 if it was
>> caused by the "intr" flag resulting in a Read RPC terminating with EINTR.
>> If you are still getting the above with "error 13", it suggests that
>> the server is replying EACCES for the Read RPC.
>> I suggested before that you check to make sure that the executable had
>> read access for everyone on the file server. Since I didn't hear back,
>> I'll assume this is the case.
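
A small sketch of the two checks mentioned above: decoding the logged error number, and testing whether a file is readable by everyone. It assumes the usual Unix errno numbering (13 is EACCES and 4 is EINTR on FreeBSD, among others); `describe_errno` and `world_readable` are hypothetical helper names, and the permission check only reflects the mode bits seen where the script runs, not any server-side export rules.

```python
import errno
import os
import stat


def describe_errno(code):
    """Return the symbolic name and message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def world_readable(path):
    """True if the 'other' read bit is set on the file's mode."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)
```

For example, `describe_errno(13)` starts with `EACCES`, which is the code Rick associates with the server rejecting the Read RPC.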
> 
> Don't understand this question … I have 34 VPSs running off of this server right now … that 'du process' runs against each of those VPSs every night, and this problem started happening on Friday night's run … ~18 days into uptime … so the same process has run repeatedly, with no issues, 18 times before it hung on Friday … also, the hang, once 'triggered', only seems to recur against the same directory … the same directory doesn't necessarily trigger it, but once it starts, it appears to do it for the same directory … I'm not sure if I've ever seen it happening to two different directories at the same time …
> 
> Also, please note that the du command is run from the physical server, as root …
> 
>> rick
>> ps: If it is still up and hasn't been rebooted, you could:
>>   sysctl debug.kdb.break_to_debugger=1
>>   - then type <ctrl><alt><esc> at the console and do the following
>>     from the debugger
>>   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>>   How well this works depends on what options your kernel was built with.
> 
> My remote console on that one doesn't work very well … I can view, but I can't type …
> 
>