Limits on jumbo mbuf cluster allocation

Rick Macklem rmacklem at uoguelph.ca
Tue Mar 12 01:25:46 UTC 2013


Garrett Wollman wrote:
> In article <513DB550.5010004 at freebsd.org>, andre at freebsd.org writes:
> 
> >Garrett's problem is receive side specific and NFS can't do much
> >about it.
> >Unless, of course, NFS is holding on to received mbufs for a longer
> >time.
The NFS server only holds onto received mbufs until it has performed the
requested RPC. Of course, if the server hits its load limit, there will
then be a backlog of RPC requests --> the received mbufs for these
requests will be held for a longer time.

To be honest, I'd consider seeing a lot of non-empty receive queues
for TCP connections to the NFS server to be an indication that it is
near/at its load limit. (Sure, if you run netstat often, you will occasionally
see a non-empty queue here or there, but I would not expect to see many of
them non-empty much of the time.) If that is the case, then the
question becomes "what is the bottleneck?". Below I suggest getting rid
of the DRC in case it is the bottleneck for your server.

> 
> Well, I have two problems: one is running out of mbufs (caused, we
> think, by ixgbe requiring 9k clusters when it doesn't actually need
> them), and one is livelock. Allowing potentially hundreds of clients
> to queue 2 MB of requests before TCP pushes back on them helps to
> sustain the livelock once it gets started, and of course those packets
> will be of the 9k jumbo variety, which makes the first problem worse
> as well.
> 
The problem for the receive side is "how small should you make the receive buffer?".
Suppose we have the following situation:
- only one client is active and it is flushing writes for a large file
  written into that client's buffer cache.
  --> If you set the receive size so that it is just big enough for one
      write, then the client will end up doing:
      - send one write, wait a long while for the NFS_OK reply
      - send the next write, wait a long while for the NFS_OK reply
      and so on
  --> the write back will take a long time, even though no other client
      is generating load on the server.
  --> the user for this client won't be happy

If you make the receive side large enough to handle several Write requests,
then the above works much faster, however...
- the receive size is now large enough to accept a great many other RPC requests
  (a Write request is 64Kbytes+, whereas Read requests are typically
   less than 100 bytes)

Even if you set the receive size to the minimum that will handle one Write
request, that still allows the client to queue something like 650 Read requests
(64Kbytes+ divided by roughly 100 bytes per Read).
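
To put a number like that in concrete terms, here is a tiny userland sketch.
The in-kernel RPC code sizes its socket buffers itself, so this is only meant
to show the knob being talked about and the arithmetic; the 400 bytes of
RPC/TCP overhead per request is just a guess on my part:

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

int
main(void)
{
    /* Assumed sizes: 64K of Write data plus a guessed per-request overhead. */
    const int write_req = 64 * 1024 + 400;  /* one Write request */
    const int read_req = 100;               /* typical small Read request */
    int s, rcvbuf = write_req;

    s = socket(AF_INET, SOCK_STREAM, 0);
    if (s >= 0 &&
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) == 0)
        printf("a %d byte receive buffer still holds ~%d Read requests\n",
            rcvbuf, rcvbuf / read_req);
    return (0);
}

(65936 / 100 is about 659, which is where the "something like 650" above
comes from.)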

Since NFS clients wait for replies to the RPC requests they send, they will
only queue a limited number of requests before they stop sending and wait for
some replies. This does delay the "feedback" somewhat, but I'd argue that buffering of
requests in the server's receive queue helps when clients generate bursts of
requests on a server that is well below its load limit.
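
If it helps, that feedback can be thought of as a window of outstanding RPCs,
something like the toy model below (the functions are stubs and the numbers
are made up; the real client code is nothing like this):

#include <stdio.h>

static void
send_rpc(int n)
{
    printf("send request %d\n", n);
}

static void
wait_for_reply(void)
{
    /* The real client blocks here until the server sends a reply. */
}

int
main(void)
{
    const int nreq = 8, max_inflight = 3;   /* made-up numbers */
    int sent = 0, replied = 0;

    while (replied < nreq) {
        /* Queue requests until the window of outstanding RPCs is full. */
        while (sent < nreq && sent - replied < max_inflight)
            send_rpc(sent++);
        /* Stop sending until a reply arrives and frees a window slot. */
        wait_for_reply();
        replied++;
    }
    return (0);
}

The point being that a large server receive queue only delays this feedback,
it does not remove it.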

Now, I'm not sure I understand what you mean by "livelock":
A - Do you mean that the server becomes unresponsive and is generating almost
    no RPC replies, with all the clients reporting
    "NFS server not responding"?
or
B - Do you mean that the server keeps responding to RPCs at a steady rate,
    but that rate is slower than what the clients (and their users) would
    like to see?
If it is B, I'd just consider that as hitting the server's load limit.

For either A or B, I'd suggest that you disable the DRC for TCP connections
(email if you need a patch for that), which will have a couple of effects:
1 - It will keep the DRC from defining the server's load limit. (If the
    DRC is the server's bottleneck, this will raise the server's load
    limit to whatever the next bottleneck is.)
2 - If the mbuf clusters held by the DRC are somehow contributing to the
    mbuf cluster allocation problem for the receive side of the network
    interface, this would alleviate that. (I'm not saying it fixes the
    problem, but it might allow the server to avoid it until the driver
    guys come up with a good solution for it.)
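
Just to illustrate the idea (the names below are made up and this is not the
actual nfsd code or the patch), "disable the DRC for TCP" amounts to bypassing
the cache lookup, so no cached replies (and none of the mbufs backing them)
are held for TCP connections:

#include <stdbool.h>

enum xport { XPORT_UDP, XPORT_TCP };
enum drc_result { DRC_MISS, DRC_HIT, DRC_BYPASS };

/* Stands in for the real duplicate request cache lookup. */
static enum drc_result
drc_lookup(unsigned int xid)
{
    (void)xid;
    return (DRC_MISS);
}

/* Called before executing an RPC. */
static enum drc_result
drc_check(enum xport xp, unsigned int xid, bool tcp_drc_disabled)
{
    if (xp == XPORT_TCP && tcp_drc_disabled)
        return (DRC_BYPASS);    /* just execute the RPC, cache nothing */
    return (drc_lookup(xid));
}

int
main(void)
{
    return (drc_check(XPORT_TCP, 1, true) == DRC_BYPASS ? 0 : 1);
}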

rick

> -GAWollman
> 

