Limits on jumbo mbuf cluster allocation

Andre Oppermann andre at freebsd.org
Mon Mar 11 10:43:31 UTC 2013


On 11.03.2013 00:46, Rick Macklem wrote:
> Andre Oppermann wrote:
>> On 10.03.2013 03:22, Rick Macklem wrote:
>>> Garrett Wollman wrote:
>>>> Also, it occurs to me that this strategy is subject to livelock. To
>>>> put backpressure on the clients, it is far better to get them to
>>>> stop
>>>> sending (by advertising a small receive window) than to accept
>>>> their
>>>> traffic but queue it for a long time. By the time the NFS code gets
>>>> an RPC, the system has already invested so much into it that it
>>>> should
>>>> be processed as quickly as possible, and this strategy essentially
>>>> guarantees[1] that, once those 2 MB socket buffers start to fill
>>>> up,
>>>> they
>>>> will stay filled, sending latency through the roof. If nfsd didn't
>>>> override the usual socket-buffer sizing mechanisms, then sysadmins
>>>> could limit the buffers to ensure a stable response time.
>>>>
>>>> The bandwidth-delay product in our network is somewhere between
>>>> 12.5
>>>> kB and 125 kB, depending on how the client is connected and what
>>>> sort
>>>> of latency they experience. The usual theory would suggest that
>>>> socket buffers should be no more than twice that -- i.e., about 256
>>>> kB.
>>>>
>>> Well, the code that uses sb_max_adj wasn't written by me (I just
>>> cloned
>>> it for the new server). In the author's defence, I believe SB_MAX
>>> was 256K when
>>> it was written. It was 256K in 2011. I think sb_max_adj was used
>>> because
>>> soreserve() fails for a larger value and the code doesn't check for
>>> such a failure.
>>> (Yea, it should be fixed so that it checks for a failure return from
>>> soreserve().
>>>    I did so for the client some time ago. ;-)
>>
>> We have had TCP sockbuf size autotuning for some time now, so explicitly
>> setting the size shouldn't be necessary anymore.
>>
> Ok. Is it possible for the size to drop below the size of the largest RPC?
> (Currently a little over 64K and hopefully a little over 128K soon.)

Auto-tuning only goes up.  The start values are 32k for TX and 64k for RX.

> I'm thinking of the restriction in sosend_generic() where it won't allow a
> request greater than sb_hiwat to be added to the send queue. (It is passed
> in as an mbuf list via the "top" argument, which makes "atomic" true, I think?)

IIRC we actually don't enforce the limit when you pass in an mbuf chain
through "top".  I'd have to check again what the current situation is.

Officially a send on a socket is only atomic when the socket is marked
as atomic.  Because of the non-enforced limit in the "top" case the append
becomes atomic in practice, though the API does not guarantee it.  We have
to analyze the socket/NFS interaction further to make sure no assumptions
are violated.

> The soreserve() calls were done in the old days to make sure sb_hiwat was
> big enough that sosend() wouldn't return EMSGSIZE.
> (I'll take a look at the code and try to see if/when sb_hiwat gets autotuned.)

Having a socket buffer large enough for the largest RPC doesn't guarantee
anything for TCP as there may still be unacknowledged data sitting in the
send buffer.  This also counts against the limit.

The soreserve() call probably comes from UDP where the socket buffer must
be as large as the largest packet you're trying to send (64K is the limit).
UDP sockets do not buffer on send but go straight down to the interface
queue.

>>> Just grep for sb_max_adj. You'll see it sets a variable called
>>> "siz".
>>> Make "siz" whatever you want (256K sounds like a good guess). Just
>>> make
>>> sure it isn't > sb_max_adj.
>>>
>>> The I/O sizes are limited to MAXBSIZE, which is currently 64Kb,
>>> although
>>> I'd like to increase that to 128Kb someday soon. (As you note below,
>>> the
>>> largest RPC is slightly bigger than that.)
>>>
>>> Btw, net.inet.tcp.{send/recv}buf_max are both 2Mbytes, just like
>>> sb_max,
>>> so those don't seem useful in this case?
>>
>> These are just the limits for auto-tuning.
>>
>>> I'm no TCP guy, so suggestions w.r.t. how big soreserve() should be
>>> set
>>> are welcome.
>>
>> I'd have to look more at the NFS code to see what exactly is going on
>> and what the most likely settings are going to be. Won't promise any
>> ETA though.
>>
> Basically an RPC request/reply is an mbuf list where its size can be
> up to MAXBSIZE + a hundred bytes or so. (64Kb+ --> 128Kb+ soon)

OK.

> These need to be queued for sending without getting EMSGSIZE back.

Without this "bug" in the send path there isn't any guarantee.  We really
have to look into that NFS/socket interaction to get this right.

> Then, if the mount is for a high bandwidth WAN, it would be nice if
> the send window allows several of these to be "in flight" (not yet
> acknowledged) so that the "bit pipe" can be kept full (use the
> available bandwidth). These could be read-aheads/write-behinds or
> requests for other processes/threads in the client.
> For example:
> - with a 128Kbyte MAXBSIZE and a read-ahead of 15, it would be possible
>    to have 128 * 1024 * 16 bytes on the wire, if the TCP window allows
>    that. (This would fill a 1Gbps network with a 20msec rtt, if I got
>    my rusty math correct. It is rtt and not the time for a packet to
>    go in one direction, since the RPC replies need to get back to the
>    client before it will do any more reads.) This sounds like the
>    upper bound of the current setup, given the 2Mbyte setting for
>    net.inet.tcp.sendbuf_max, I think?
>    (Yes, I know most use NFS over a LAN, but it would be nice if it
>     can work well enough over a WAN to be useful.)
> - for a fast LAN, obviously the rtt is much lower, so the limit can
>    be a lot lower. However, I'm not sure that there is an advantage
>    w.r.t. NFS for this. So long as the client sees that it can't send
>    more RPCs once several are queued for the server, it won't cause a
>    "congestion collapse" for the server.

Thanks for the explanation.

>    I think the large window might make Garrett's case worse, since it seems that
>    he is running out of mbuf clusters (actually the ability to allocate
>    more jumbo ones), but that seems to be a fairly specific resource
>    issue for his case caused in part by the fact the network interface
>    is using so many of them.
>    In other words, I'm not sure a generic NFS fix would make sense for
>    this specific case.

The send side doesn't use jumbo mbufs unless you explicitly allocate them
while creating the packet.

Garrett's problem is receive-side specific and NFS can't do much about it.
Unless, of course, NFS is holding on to received mbufs for a longer time.

-- 
Andre


