Limits on jumbo mbuf cluster allocation

Rick Macklem rmacklem at uoguelph.ca
Sun Mar 10 03:48:04 UTC 2013


Garrett Wollman wrote:
> <<On Sat, 9 Mar 2013 11:50:30 -0500 (EST), Rick Macklem
> <rmacklem at uoguelph.ca> said:
> 
> > I suspect this indicates that it isn't mutex contention, since the
> > threads would block waiting for the mutex for that case, I think?
> 
> No, because our mutexes are adaptive, so each thread spins for a while
> before blocking. With the current implementation, all of them end up
> doing this in pretty close to lock-step.
> 
> > (Bumping up NFSRVCACHE_HASHSIZE can't hurt if/when you get the
> > chance.)
> 
> I already have it set to 129 (up from 20); I could see putting it up
> to, say, 1023. It would be nice to have a sysctl for maximum chain
> length to see how bad it's getting (and if the hash function is
> actually effective).
> 
Yep, I'd bump it up to 1000 or so for a server the size you've built.
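Exporting that statistic wouldn't be hard, either. Just as a sketch of
what the trim code would compute (the table and entry names below are
made up for illustration, not the actual DRC structures), it could walk
the buckets, remember the longest chain and export the counters via
read-only sysctls:

#include <stdio.h>

#define HASHSIZE 1023                   /* proposed larger table size */

struct entry {
    struct entry *next;                 /* hash chain linkage */
};

/*
 * Walk every bucket and report the longest and average chain length,
 * i.e. the numbers a "max chain length" sysctl would export.
 */
static void
report_chains(struct entry *table[HASHSIZE])
{
    struct entry *e;
    int i, len, maxlen = 0, total = 0;

    for (i = 0; i < HASHSIZE; i++) {
        len = 0;
        for (e = table[i]; e != NULL; e = e->next)
            len++;
        total += len;
        if (len > maxlen)
            maxlen = len;
    }
    printf("entries=%d maxchain=%d avgchain=%.2f\n",
        total, maxlen, (double)total / HASHSIZE);
}

int
main(void)
{
    static struct entry *table[HASHSIZE];   /* empty table, just a demo */

    report_chains(table);
    return (0);
}

In the kernel the printf() would just become a couple of counters hung
off vfs.nfsd (or wherever) with the SYSCTL_INT() macro, so you could
watch them while the 500 VMs are pounding on the server.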

> > I've thought about this. My concern is that the separate thread
> > might
> > not keep up with the trimming demand. If that occurred, the cache
> > would
> > grow veryyy laarrggge, with effects like running out of mbuf
> > clusters.
> 
> At a minimum, once one nfsd thread is committed to doing the cache
> trim, a flag should be set to discourage other threads from trying to
> do it. Having them all spinning their wheels punishes the clients
> much too much.
> 
Yes, this is a good idea, as I mentioned in another reply.
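Something like the following is what I'd have in mind. A minimal sketch
using C11 atomics (the flag and function names are placeholders; the
real thing would use an atomic(9) op or a flag checked under the DRC
mutex):

#include <stdatomic.h>

static atomic_flag drc_trim_in_progress = ATOMIC_FLAG_INIT;

/*
 * Called by an nfsd thread when the cache looks too big.  Only the
 * first thread to get here does the trim; the others go straight back
 * to servicing RPCs instead of all piling up on the cache.
 */
static void
maybe_trim_cache(void (*trim)(void))
{
    if (atomic_flag_test_and_set(&drc_trim_in_progress))
        return;                 /* someone else is already trimming */
    trim();
    atomic_flag_clear(&drc_trim_in_progress);
}

That still leaves the question of whether a single thread can keep up
with the trimming demand, but at least the other threads keep doing
useful work in the meantime.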

> > By having the nfsd threads do it, they slow down, which provides
> > feedback
> > to the clients (slower RPC replies->generate fewer requests->less to
> > cache).
> > (I think you are probably familiar with the generic concept that a
> > system
> >  needs feedback to remain stable. An M/M/1 queue with open arrivals
> >  and
> >  no feedback to slow the arrival rate explodes when the arrival rate
> >  approaches the service rate, etc and so on...)
> 
> Unfortunately, the feedback channel that I have is: one user starts
> 500 virtual machines accessing a filesystem on the server -> other
> users of this server see their goodput go to zero -> everyone sends in
> angry trouble tickets -> I increase the DRC size manually. It would
> be nice if, by the time I next want to take a vacation, I have this
> figured out.
> 
I probably shouldn't say this, but back when I was a sysadmin for a
college, my response to complaints about a slow NFS server was "Tell
the boss to spend big bucks on a Netapp." ;-)

Well, it would be easy to come up with a patch that disables the DRC
for TCP. If you'd like such a patch, just email me. So long as your
network fabric is solid, it isn't that big a risk to run that way.
If 500 virtual machines start pounding on the NFS server, I'd be
surprised if other parts of the server don't "hit the wall", but
disabling the DRC would at least let you find that out.
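The patch would pretty much be an early return in the server's cache
lookup. A rough sketch (the tunable, the flag argument and the return
values below are illustrative only, not the actual nfsd cache
interface):

/* Hypothetical tunable: 0 = don't use the DRC for TCP connections. */
static int nfsrc_tcpcache = 1;

enum cache_result { CACHE_DOIT, CACHE_REPLY, CACHE_DROPIT };

/*
 * Sketch of the check at the top of the cache lookup: when caching is
 * disabled for TCP, skip the lookup/insert entirely and tell the
 * caller to just execute the RPC.
 */
static enum cache_result
cache_lookup_sketch(int is_tcp)
{
    if (is_tcp && nfsrc_tcpcache == 0)
        return (CACHE_DOIT);    /* no caching, just do the RPC */
    /* ... the normal DRC lookup/insert would go here ... */
    return (CACHE_DOIT);
}

The risk, of course, is that a retried non-idempotent RPC (after a
dropped TCP connection, for example) gets executed twice, which is
exactly the case the DRC exists to protect against. That's why I say
it's only a small risk if the network fabric is solid.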

It would be nice if there were a way to guarantee that clients get a fair
slice of the server pie, but I don't know of a way to do that. As I noted
in another reply, a client may use multiple IP addresses for its requests.
Also, since traffic from clients tends to be very bursty, putting a limit
on a client's traffic when there isn't a lot of load from other clients
doesn't make sense, I think? Then there is the question of how the NFS
server knows that the system is nearing its load limit, so that it should
start limiting clients that send a lot of RPC requests.
All the NFS server does is translate the RPC requests into VFS/VOP ops,
so I don't see how it will know that the underlying file systems are
nearing their load limit, as one example.

When I ran (much smaller) NFS servers in production, I usually saw the
disks hit their I/O ops limit.

> I'm OK with throwing memory at the problem -- these servers have 96 GB
> and can hold up to 144 GB -- so long as I can find a tuning that
> provides stability and consistent, reasonable performance for the
> users.
> 
> > The nfs server does soreserve(so, sb_max_adj, sb_max_adj); I can't
> > recall exactly why it is that way, except that it needs to be large
> > enough to handle the largest RPC request a client might generate.
> 
> > I should take another look at this, in case sb_max_adj is now
> > too large?
> 
> It probably shouldn't be larger than the
> net.inet.tcp.{send,recv}buf_max, and the read and write sizes that are
> negotiated should be chosen so that a whole RPC can fit in that
> space. If that's too hard for whatever reason, nfsd should at least
> log a message saying "hey, your socket buffer limits are too small,
> I'm going to ignore them".
> 
As I mentioned in another reply, *buf_max is 2Mbytes these days. I think
I agree that 2Mbytes is larger than you need for your server, given your
LAN environment.

The problem is, I can't think of how an NFS server would know that a new
client connection is on a LAN and not on a long-fat WAN link. The
latter may need to put 2Mbytes on the wire to fill the pipe. My TCP
is *very rusty*, but I think that an sb_hiwat of 256Kbytes is going to
cap the send window so that neither end can have 2Mbytes of
unacknowledged data segments in flight to fill the pipe?
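To put rough numbers on it: the socket buffer has to cover the
bandwidth-delay product of the path. For example, a 1Gbps WAN link with
a 16msec round trip needs about 125Mbytes/sec * 0.016sec = 2Mbytes of
data in flight to stay full, which is presumably where the 2Mbyte
*buf_max default comes from. With sb_hiwat clamped at 256Kbytes, that
same path tops out around 256Kbytes / 0.016sec = 16Mbytes/sec, or
roughly 128Mbits/sec, no matter how fat the pipe is. On a LAN with
sub-millisecond RTTs, 256Kbytes is usually plenty.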

Also, the intent is to apply the "feedback" in cases where the server
is overloaded, and I think doing so "late" might be sufficient, if not
ideal. The server has "agreed" to do the RPC once it has allowed it
to be received into the TCP receive queue for the connection.
How quickly the reply goes out will always vary dramatically,
based on server load and the type of RPC. (A big write will normally
take many times as long as a Getattr does.)

rick

> -GAWollman
> 
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"

