Limits on jumbo mbuf cluster allocation

Rick Macklem rmacklem at uoguelph.ca
Wed Mar 13 03:48:14 UTC 2013


Garrett Wollman wrote:
> <<On Mon, 11 Mar 2013 21:25:45 -0400 (EDT), Rick Macklem
> <rmacklem at uoguelph.ca> said:
> 
> > To be honest, I'd consider seeing a lot of non-empty receive queues
> > for TCP connections to the NFS server to be an indication that it is
> > near/at its load limit. (Sure, if you do netstat a lot, you will
> > occasionally
> > see a non-empty queue here or there, but I would not expect to see a
> > lot
> > of them non-empty a lot of the time.) If that is the case, then the
> > question becomes "what is the bottleneck?". Below I suggest getting
> > rid
> > of the DRC in case it is the bottleneck for your server.
> 
> The problem is not the DRC in "normal" operation, but the DRC when it
> gets into the livelocked state. I think we've talked about a number
> of solutions to the livelock problem, but I haven't managed to
> implement or test these ideas yet. I have a duplicate server up now,
> so I hope to do some testing this week.
> 
> In normal operation, the server is mostly idle, and the nfsd threads
> that aren't themselves idle are sleeping deep in ZFS waiting for
> something to happen on disk. When the arrival rate exceeds the rate
> at which requests are cleared from the DRC, *all* of the nfsd threads
> will spin, either waiting for the DRC mutex or walking the DRC finding
> that there is nothing that can be released yet. *That* is the
> livelock condition -- the spinning that takes over all nfsd threads is
> what causes the receive buffers to build up, and the large queues then
> maintain the livelocked condition -- and that is why it clears
> *immediately* when the DRC size is increased. (It's possible to
> reproduce this condition on a loaded server by simply reducing the
> tcphighwater to less than the current size.) Unfortunately, I'm at
> the NFSRC_FLOODSIZE limit right now (64k), so there is no room for
> further increases until I recompile the kernel. It's probably a bug
> that the sysctl definition in drc3.patch doesn't check the new value
> against this limit.
> 
> Note that I'm currently running 64 nfsd threads on a 12-core
> (24-thread) system. In the livelocked condition, as you would expect,
> the system goes to 100% CPU utilization and the load average peaks out
> at 64, while goodput goes to nearly nil.
> 
Ok, I think I finally understand the livelock you are referring to.
Basically, once the cache is at the tcphighwater mark the trim passes
don't succeed in freeing many entries, so every nfsd thread tries to
trim the cache for every RPC, and that slows the server right down.

I suspect it is the cached entries from dismounted clients that are
filling up the cache (you did mention clients using amd at some point
in the discussion, which implies frequent mounts/dismounts).
I'm guessing that the tcp cache timeout needs to be made a lot smaller
for your case.

> > For either A or B, I'd suggest that you disable the DRC for TCP
> > connections
> > (email if you need a patch for that), which will have a couple of
> > effects:
> 
> I would like to see your patch, since it's more likely to be correct
> than one I might dream up.
> 
> The alternative solution is twofold: first, nfsrv_trimcache() needs to
> do something to ensure forward progress, even when that means dropping
> something that hasn't timed out yet, and second, the server code needs
> to ensure that nfsrv_trimcache() is only executing on one thread at a
> time. An easy way to do the first part would be to maintain an LRU
> queue for TCP in addition to the UDP LRU, and just blow away the first
> N (>NCPU) entries on the queue if, after checking all the TCP replies,
> the DRC is still larger than the limit. The second part is just an
> atomic_cmpset_int().
> 
I've attached a patch that has assorted changes. I didn't use an LRU list,
since that results in a single mutex to contend on, but I added a second
pass to the nfsrc_trimcache() function that frees old entries. (Approximate
LRU, using a histogram of timeout values to select a timeout value that
frees enough of the oldest ones.)
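The histogram selection could be sketched roughly like this; the bucket
granularity and all the names here are made up for illustration and are
not the patch's actual code:

```c
/* Illustrative approximate-LRU selection: bucket cache entries by age
 * and pick the youngest age threshold that would free at least
 * `needed` entries.  Higher bucket index == older entry. */
#define NBUCKETS 10	/* e.g. one bucket per minute of age */

static int
pick_age_cutoff(const int *hist, int needed)
{
	int b, freed = 0;

	/* Walk from the oldest bucket toward the newest, accumulating
	 * how many entries would be freed at each threshold. */
	for (b = NBUCKETS - 1; b >= 0; b--) {
		freed += hist[b];
		if (freed >= needed)
			return (b);	/* free everything at least this old */
	}
	return (0);			/* not enough entries; free them all */
}
```

Since each pass only reads per-entry timestamps, there is no single LRU
list mutex for all the threads to contend on.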

Basically, this patch:
- allows setting of the tcp timeout via vfs.nfsd.tcpcachetimeo
  (I'd suggest you go down to a few minutes instead of 12hrs)
- allows TCP caching to be disabled by setting vfs.nfsd.cachetcp=0
- does the two things you describe above to try to avoid the livelock,
  although not quite using an LRU list
- increases the hash table size to 500 (still a compile time setting)
  (feel free to make it even bigger)
- sets nfsrc_floodlevel to at least nfsrc_tcphighwater, so you can
  grow vfs.nfsd.tcphighwater as big as you dare
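With the patch applied, the tunables above could be set with something
like the following (the values are illustrative; pick ones that suit
your load):

```shell
# Shorten the TCP cache timeout from the 12hr default to 5 minutes
sysctl vfs.nfsd.tcpcachetimeo=300
# Or disable DRC caching for TCP connections entirely
sysctl vfs.nfsd.cachetcp=0
# Grow the high water mark; nfsrc_floodlevel is set to at least this
sysctl vfs.nfsd.tcphighwater=100000
```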

The patch includes a lot of drc2.patch and drc3.patch, so don't try to
apply it to an already-patched kernel. Hopefully it will apply cleanly
to vanilla sources.

The patch has been minimally tested.

If you'd rather not apply the patch, you can change NFSRVCACHE_TCPTIMEOUT
and set the variable nfsrc_tcpidempotent to 0 to get a couple of
the changes. (You'll have to recompile the kernel for these changes to
take effect.)

Good luck with it, rick

> -GAWollman
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drc4.patch
Type: text/x-patch
Size: 18459 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-net/attachments/20130312/4e349300/attachment.bin>
