nfsd server cache flooded, try to increase nfsrc_floodlevel

Rick Macklem rmacklem at uoguelph.ca
Wed Jul 20 13:29:48 UTC 2011


Clinton Adams wrote:
> On Wed, Jul 20, 2011 at 1:09 AM, Rick Macklem <rmacklem at uoguelph.ca>
> wrote:
> > Please try the patch, which is at:
> >   http://people.freebsd.org/~rmacklem/noopen.patch
> > (This patch is against the file in -current, so patch may not like
> > it, but
> >  it should be easy to do by hand, if patch fails.)
> >
> > Again, good luck with it and please let me know how it goes, rick
> >
> 
> Thanks for your help with this, trying the patches now. Tests with one
> client look good so far; values for OpenOwner and CacheSize are more
> in line. We'll test with more clients later today. We were hitting the
> nfsrc_floodlevel with just three clients before, all using nfs4-mounted
> home and other directories. Clients are running Ubuntu 10.04.2
> LTS. Usage has been general desktop work, nothing unusual that we
> could see.
> 
> Relevant snippet of /proc/mounts on client (rsize,wsize are being
> automatically negotiated, not specified in the automount options):
> pez.votesmart.org:/public /export/public nfs4
> rw,relatime,vers=4,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5,clientaddr=192.168.255.112,minorversion=0,addr=192.168.255.25
> 0 0
> pez.votesmart.org:/home/clinton /home/clinton nfs4
> rw,relatime,vers=4,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5,clientaddr=192.168.255.112,minorversion=0,addr=192.168.255.25
> 0 0
> 
> nfsstat -e -s, with patches, after some stress testing:
> Server Info:
>   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
>     95334         1     28004        50    297125         2         0         0
>    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
>         0         0         0         0         0      1242         0      1444
>     Mknod    Fsstat    Fsinfo  PathConf    Commit   LookupP   SetClId SetClIdCf
>         0         0         0         0         2         0         4         4
>      Open  OpenAttr OpenDwnGr  OpenCfrm DelePurge   DeleRet     GetFH      Lock
>    176735         0         0     21175         0         0     49171         0
>     LockT     LockU     Close    Verify   NVerify     PutFH  PutPubFH PutRootFH
>         0         0     21184         0         0    549853         0        17
>     Renew RestoreFH    SaveFH   Secinfo RelLckOwn  V4Create
>         0     21186    176735         0         0         0
> Server:
> Retfailed    Faults   Clients
>         0         0         1
> OpenOwner     Opens LockOwner     Locks    Delegs
>       291         2         0         0         0
> Server Cache Stats:
>    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
>         0         0         0    549969       291      2827
> 
Yes, these stats look reasonable.
(and sorry if the mail system I use munged the whitespace)

> nfsstat -e -s, prior to patches, general usage:
> 
> Server Info:
>   Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
>   2813477     62661    382636      1419    837492   2115422         0     33976
>    Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
>     31164      1310         0         0         0     15678        10    307236
>     Mknod    Fsstat    Fsinfo  PathConf    Commit   LookupP   SetClId SetClIdCf
>         0         0         2         1    144550         0        43        43
>      Open  OpenAttr OpenDwnGr  OpenCfrm DelePurge   DeleRet     GetFH      Lock
>   1462595         0       595     11267         0         0    550761    280674
>     LockT     LockU     Close    Verify   NVerify     PutFH  PutPubFH PutRootFH
>       155    212299    286615         0         0   6651006         0      1234
>     Renew RestoreFH    SaveFH   Secinfo RelLckOwn  V4Create
>    256784    320761   1495805         0         0       738
> Server:
> Retfailed    Faults   Clients
>         0         0         3
> OpenOwner     Opens LockOwner     Locks    Delegs
>         6       178      8012         2         0
> Server Cache Stats:
>    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
>         0         0        96   6876610      8084     13429
> 
Hmm. LockOwners have the same property as OpenOwners in that the
server is required to hold onto the last reply in the cache until
the Open/Lock Owner is released. Unfortunately, a server can't
release a LockOwner until either the client issues a ReleaseLockOwner
operation to tell the server that it will no longer use the LockOwner
or the open is closed.

These stats suggest that the client tried to do byte-range locking
over 8000 times with different LockOwners (I don't know why the Linux
client decided to use a different LockOwner each time) for file(s) that
were still open. (When I test using the Fedora 15 client, I do see
ReleaseLockOwner operations, but usually just before a close. I don't
know how recently that was added to the Linux client. ReleaseLockOwner
was added just before the RFC was published to try to deal with a
situation where the client uses a lot of LockOwners that the server must
hold onto until the file is closed.)

If this is legitimate, all that can be done is to increase
NFSRVCACHE_FLOODLEVEL and hope that you can find a value large enough
that the clients don't bump into it without exhausting mbufs. (I'd
increase "kern.ipc.nmbclusters" to something larger than what you
set NFSRVCACHE_FLOODLEVEL to.)
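Since the flood level is a compile-time constant, raising it means
rebuilding the kernel. A rough sketch of the procedure (the source path
and the numbers shown are illustrative, not recommendations):

```shell
# Illustrative only: find where the flood level is defined in your tree,
# edit the #define to a larger value, then rebuild/reinstall the kernel.
grep -rn NFSRVCACHE_FLOODLEVEL /usr/src/sys/fs/nfs/

# Keep mbuf clusters comfortably above whatever flood level you choose:
sysctl kern.ipc.nmbclusters            # check the current value
sysctl kern.ipc.nmbclusters=131072     # example value; size to your RAM
```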

However, I suspect the 8084 LockOwners is a result of some other
problem. Fingers and toes crossed that it was a side effect of the
cache SMP bugs fixed by cache.patch. (noopen.patch won't help for
this case, because it appears to be lockowners and not openowners
that are holding the cached entries, but it won't do any harm, either.)

If you see very large LockOwner counts again, with the patched
kernel, all I can suggest is doing a packet capture and emailing
it to me. Running "tcpdump -s 0 -w xxx" on the server, for a short
enough time that "xxx" isn't huge, might catch some issue (like the
client retrying a lock over and over and over again). A packet capture
might also show whether the Ubuntu client is doing ReleaseLockOwner
operations. (Btw, you can look at the trace using wireshark, which
knows about NFSv4.)
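A concrete capture session might look like the following (filtering on
the standard NFS port, 2049, to trim unrelated traffic; the file path is
just an example):

```shell
# Capture full packets (-s 0) on the server; keep the run short so the
# capture file stays a manageable size.
tcpdump -s 0 -w /tmp/nfs4.pcap port 2049

# Inspect with wireshark, or tshark on a headless box; the NFSv4
# dissector labels each compound operation, so repeated LOCK retries
# or RELEASE_LOCKOWNER operations stand out in the listing.
tshark -r /tmp/nfs4.pcap -Y nfs
```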

In summary, it'll be interesting to see how this goes, rick
ps: Sorry about the long-winded reply, but this is NFSv4 after all :-)



More information about the freebsd-fs mailing list