Scenario to make recv(MSG_WAITALL) stuck

Kostik Belousov kostikbel at gmail.com
Tue Jun 14 09:55:08 UTC 2011


On Mon, Jun 13, 2011 at 07:19:40PM +0300, Mikolaj Golub wrote:
> Hi,
> 
> Below is a scenario how to make recv(2) with MSG_WAITALL flag get stuck.
> 
> (See http://people.freebsd.org/~trociny/test_MSG_WAITALL.4.c for the test code).
> 
> Let's the size of the receive buffer is SOBUF_SIZE (e.g. 10000 bytes).
> 
> On sender side do 2 send() requests:
> 
> 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);
> 
> 2) data of size equal to SOBUF_SIZE.
> 
> After this on receiver side do 2 recv() requests with MSG_WAITALL flag set:
> 
> 1) recv() data of SOBUF_SIZE / 10 size;
> 
> 2) recv() data of SOBUF_SIZE size;
> 
> The second recv() will last for very long time. In tcpdump one can observe
> that the window is permanently stuck at 0 and pending data is only sent via
> TCP window probes (so one byte every few seconds).
> 
> 18:09:14.784698 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [S], seq 1907676797, win 65535, options [mss 16344,nop,wscale 3,sackOK,TS val 22207 ecr 0], length 0
> 18:09:14.784729 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [S.], seq 2298857585, ack 1907676798, win 10000, options [mss 16344,nop,wscale 3,sackOK,TS val 2718467987 ecr 22207], length 0
> 18:09:14.784749 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 0
> 18:09:14.785168 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [P.], seq 1:1001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 1000
> 18:09:14.785264 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 1001:10001, ack 1, win 8960, options [nop,nop,TS val 22207 ecr 2718467987], length 9000
> 18:09:14.785280 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718467987 ecr 22207], length 0
> 18:09:19.784440 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 22707 ecr 2718467987], length 1
> 18:09:19.784480 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10001, win 0, options [nop,nop,TS val 2718468487 ecr 22707], length 0
> 18:09:24.784439 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10001:10002, ack 1, win 8960, options [nop,nop,TS val 23207 ecr 2718468487], length 1
> 18:09:24.784472 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10002, win 0, options [nop,nop,TS val 2718468987 ecr 23207], length 0
> 18:09:29.784437 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10002:10003, ack 1, win 8960, options [nop,nop,TS val 23707 ecr 2718468987], length 1
> 18:09:29.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10003, win 0, options [nop,nop,TS val 2718469487 ecr 23707], length 0
> 18:09:34.784444 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10003:10004, ack 1, win 8960, options [nop,nop,TS val 24207 ecr 2718469487], length 1
> 18:09:34.784486 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10004, win 0, options [nop,nop,TS val 2718469987 ecr 24207], length 0
> 18:09:39.784443 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10004:10005, ack 1, win 8960, options [nop,nop,TS val 24707 ecr 2718469987], length 1
> 18:09:39.784478 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10005, win 0, options [nop,nop,TS val 2718470487 ecr 24707], length 0
> 18:09:44.784442 IP 127.0.0.1.53378 > 127.0.0.1.23481: Flags [.], seq 10005:10006, ack 1, win 8960, options [nop,nop,TS val 25207 ecr 2718470487], length 1
> 18:09:44.784477 IP 127.0.0.1.23481 > 127.0.0.1.53378: Flags [.], ack 10006, win 0, options [nop,nop,TS val 2718470987 ecr 25207], length 0
> ...
> 
> I first noticed this issue with HAST and suspect other people observed it with
> HAST too.
> 
> Below is explanation what is going on.
> 
> We totaly filled the receiver buffer with one SOBUF_SIZE/10 size request and
> partial SOBUF_SIZE request. When the first request was processed we got
> SOBUF_SIZE/10 free space. It was just enogh to recive the rest of bytes for
> the second request, and the reciving thread went in soreceive_generic->sbwait
> here:
> 
>         /*
>          * If we have less data than requested, block awaiting more (subject
>          * to any timeout) if:
>          *   1. the current count is less than the low water mark, or
>          *   2. MSG_WAITALL is set, and it is possible to do the entire
>          *      receive operation at once if we block (resid <= hiwat).
>          *   3. MSG_DONTWAIT is not set
>          * If MSG_WAITALL is set but resid is larger than the receive buffer,
>          * we have to do the receive in sections, and thus risk returning a
>          * short count if a timeout or signal occurs after we start.
>          */
>         if (m == NULL || (((flags & MSG_DONTWAIT) == 0 &&
>             so->so_rcv.sb_cc < uio->uio_resid) &&
>             (so->so_rcv.sb_cc < so->so_rcv.sb_lowat ||
>             ((flags & MSG_WAITALL) && uio->uio_resid <= so->so_rcv.sb_hiwat)) &&
>             m->m_nextpkt == NULL && (pr->pr_flags & PR_ATOMIC) == 0)) {
>                  ...
>                  error = sbwait(&so->so_rcv);
> 
> recvbuf is almost full but has enough space to satisfy MSG_WAITALL request
> without draining data to user buffer, and soreceive waits for data. But the
> window was closed when the buffer was filled and to avoid silly window
> syndrome it opens only when available space is larger than sb_hiwat/4 or
> maxseg:
> 
> tcp_output():
> 
>         /*
>          * Calculate receive window.  Don't shrink window,
>          * but avoid silly window syndrome.
>          */
>         if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
>             recwin < (long)tp->t_maxseg)
>                 recwin = 0;
> 
> so it is stuck and pending data is only sent via TCP window probes.
> 
> It looks like the fix could be to remove this condition to block if
> MSG_WAITALL is set and it is possible to do the entire receive operation at
> once, like in the patch:
> 
> http://people.freebsd.org/~trociny/uipc_socket.c.soreceive_generic.MSG_DONTWAIT.patch
> 
> This works for me but I am not sure this is a correct solution.
> 
> Note, the issue is not reproduced with soreceive_stream.
> 
I do not understand what then happens for the recvfrom(2) call ?
Would it get some error, or 0 as return and no data, or something else ?

Also, what is the MT_CONTROL chunk about ?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-net/attachments/20110614/9af048df/attachment.pgp


More information about the freebsd-net mailing list