Limits on jumbo mbuf cluster allocation

Rick Macklem rmacklem at uoguelph.ca
Sat Mar 9 16:50:31 UTC 2013


Garrett Wollman wrote:
> <<On Fri, 8 Mar 2013 19:47:13 -0500 (EST), Rick Macklem
> <rmacklem at uoguelph.ca> said:
> 
> > If reducing the size to 4K doesn't fix the problem, you might want
> > to
> > consider shrinking the tunable vfs.nfsd.tcphighwater and suffering
> > the increased CPU overhead (and some increased mutex contention) of
> > calling nfsrv_trimcache() more frequently.
> 
> Can't do that -- the system becomes intolerably slow when it gets into
> that state, and seems to get stuck that way, such that the only way to
> restore performance is to increase the size of the "cache".
> (Essentially all of the nfsd service threads end up spinning most of
> the time, load average goes to N, and goodput goes to nearly nil.) It
> does seem like a lot of effort for an extreme edge case that, in
> practical terms, never happens.
> 
So, it sounds like you've found a reasonable setting. Yes, if it is too
small, it will keep trimming over and over and over again...

I suspect this indicates that it isn't mutex contention, since the
threads would block waiting for the mutex in that case, I think?
(Bumping up NFSRVCACHE_HASHSIZE can't hurt if/when you get the chance.)
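
In case it saves you a minute when you do: NFSRVCACHE_HASHSIZE is a
compile-time constant, so bumping it means editing the header and
rebuilding the nfsd code. The path and default value below are from
memory, so check your own tree:

    /* sys/fs/nfs/nfsrvcache.h -- location and default are from memory */
    #define NFSRVCACHE_HASHSIZE 20    /* a few hundred just spreads the DRC
                                       * entries over more hash chains, at
                                       * the cost of a little memory */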

> > (I'm assuming that you are using drc2.patch + drc3.patch.
> 
> I believe that's what I have. If my kernel coding skills were less
> rusty, I'd fix it to have a separate cache-trimming thread.
> 
I've thought about this. My concern is that a separate thread might
not keep up with the trimming demand. If that occurred, the cache would
grow very large, with side effects like running out of mbuf clusters.

By having the nfsd threads do it, they slow down, which provides feedback
to the clients (slower RPC replies -> fewer requests generated -> less to
cache). (I think you are probably familiar with the generic concept that
 a system needs feedback to remain stable. An M/M/1 queue with open
 arrivals and no feedback to slow the arrival rate explodes when the
 arrival rate approaches the service rate, and so on...)
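
For the record, the textbook M/M/1 numbers make the point concrete: with
arrival rate \lambda and service rate \mu,

    \rho = \lambda / \mu, \qquad L = \frac{\rho}{1 - \rho}

so the mean number of requests in the system, L, grows without bound as
\lambda approaches \mu. The nfsd threads slowing down is the feedback
that keeps \lambda from getting there.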

As such, I'm not convinced a separate thread is a good idea. I think
that simply allowing sysadmins to disable the DRC for TCP may make
sense. Although I prefer reliability over performance, I can
see the argument that TCP transport for RPC is "good enough" for
some environments. (Basically, if a site has a high degree of
confidence in its network fabric, such that network-partitioning
type failures are pretty much non-existent and the NFS server isn't
getting overloaded to the point of very slow RPC replies, I can
see TCP retransmits as being sufficient?)
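
Roughly the shape I have in mind, just as a sketch -- the sysctl name and
the exact hook point are invented here for illustration and don't exist
in the tree:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    SYSCTL_DECL(_vfs_nfsd);

    /* Hypothetical knob: non-zero bypasses the DRC for RPCs over TCP. */
    static int nfsrc_tcpnocache = 0;
    SYSCTL_INT(_vfs_nfsd, OID_AUTO, tcpnocache, CTLFLAG_RW,
        &nfsrc_tcpnocache, 0, "Bypass the NFS server DRC for TCP");

    /*
     * Then, near the top of the TCP branch of the DRC lookup, skip the
     * cache entirely and let the RPC execute (if memory serves, RC_DOIT
     * is the "just execute it" return value):
     */
    if (nfsrc_tcpnocache != 0)
        return (RC_DOIT);

The win is that none of the LRU/trim work happens for TCP at all; the
cost is that a retried non-idempotent RPC after a TCP reconnect gets
re-executed instead of replayed, which is exactly the reliability
trade-off above.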

> One other weird thing that I've noticed is that netstat(1) reports the
> send and receive queues on NFS connections as being far higher than I
> have the limits configured. Does NFS do something to override this?
> 
> -GAWollman
> 
The NFS server does soreserve(so, sb_max_adj, sb_max_adj); I can't
recall exactly why it is that way, except that it needs to be large
enough to handle the largest RPC request a client might generate.

I should take another look at this, in case sb_max_adj is now
too large?
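
For reference, the call is in the server-side socket setup and, from
memory rather than copied from the tree, looks roughly like this:

    /*
     * Both limits are set to sb_max_adj, which is kern.ipc.maxsockbuf
     * scaled down for mbuf bookkeeping overhead, so netstat reports
     * queue limits far above the usual tcp sendspace/recvspace defaults
     * on these connections.
     */
    error = soreserve(so, sb_max_adj, sb_max_adj);
    if (error != 0)
        goto out;    /* schematic; the real error handling differs */

So the large queue limits netstat shows come from that soreserve() call
rather than from the usual socket buffer defaults.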

rick
