tcp failing to recover from a packet loss under 8.2-RELEASE?

Lawrence Stewart lstewart at freebsd.org
Thu Aug 4 14:18:02 UTC 2011


Hi Steven, Andre and Slawa,

Firstly, sorry for the delay diving into this. I just returned from some 
travel.

Comments inline...

On 08/02/11 21:35, Steven Hartland wrote:
>
> ----- Original Message ----- From: "Steven Hartland"
>> ----- Original Message ----- From: "Andre Oppermann"
>> ...
>>>> I believe this is tcps_rcvmemdrop in tcp_reass.c to which there's the
>>>> following comment:-
>>>>
>>>> * XXXLAS: Using sbspace(so->so_rcv) instead of so->so_rcv.sb_hiwat
>>>> * should work but causes packets to be dropped when they shouldn't.

This is not related to the issues you were experiencing.

>>>> I notice this code is relatively new, so I'm wondering if this may be
>>>> something to do with what we're seeing, possibly still dropping packets
>>>> it shouldn't?
>>>>
>>>> @Lawrence apologies for the direct mail, but I believe you were the
>>>> original author of this particular change, so wondered if you may be
>>>> able to shed any light on this?

Thanks for bringing me in directly, I haven't been keeping up with the 
mailing lists at all recently.

>>> You could be onto something here. Please try this patch:
>>> http://people.freebsd.org/~andre/tcp_reass.c-logdebug-20110802.diff
>>>
>>> You can enable the log output with
>>> sysctl net.inet.tcp.log_debug=1
>>> and report the log output (comes at LOG_DEBUG level).
>>
>> Thanks for the response Andre, I've applied the patch and I'm seeing
>> lots of the following during the test, which is:-
>> 1. scp from local host (10.10.1.30) -> tcptest (10.10.1.20) receiver,
>> which gets ~ 64MB/s
>> 2. scp from remote host (10.10.1.10) -> tcptest (10.10.1.20) receiver,
>> which gets ~ 10MB/s (line has packet loss)
>>
>> Aug 2 11:08:50 tcptest kernel: TCP: [10.10.1.30]:60811 to
>> [10.10.1.20]:22 tcpflags 0x10<ACK>; tcp_reass: global zone limit
>> reached, segment dropped
>> Aug 2 11:08:50 tcptest kernel: TCP: [10.10.1.30]:60811 to
>> [10.10.1.20]:22 tcpflags 0x10<ACK>; tcp_reass: global zone limit
>> reached, segment dropped
>
> Hmm, based on this are we seeing something similar to the following?
> http://www.freebsd.org/cgi/query-pr.cgi?pr=155407

Slawa (CC'd) is the author of PR 155407, and Steven, your problem is the 
same. I grabbed the PR some time back but haven't found the time to sit 
down and respond in detail.

> Other potentially useful info:-
>
> vmstat -z | head -1 ; vmstat -z | grep -i tcp
> ITEM SIZE LIMIT USED FREE REQUESTS FAILURES
> tcp_inpcb: 336, 25608, 115, 556, 707, 0
> tcpcb: 880, 25600, 115, 405, 707, 0
> tcptw: 72, 5150, 0, 600, 188, 0
> tcpreass: 40, 1680, 106, 1574, 185926, 4414
>
> sysctl net.inet.tcp.reass
> net.inet.tcp.reass.overflows: 0
> net.inet.tcp.reass.cursegments: 106
> net.inet.tcp.reass.maxsegments: 1680
>
> netstat -s -f inet -p tcp | grep "discarded due"
> 4414 discarded due to memory problems
>
> sysctl kern.ipc.nmbclusters
> kern.ipc.nmbclusters: 25600
>
> The default value of nmbclusters on the target machine explains
> the value of net.inet.tcp.reass.maxsegments which defaults to
> nmbclusters / 16
>
> Setting net.inet.tcp.reass.maxsegments=8148 and rerunning the
> tests appears to result in a solid 14MB/s; it's still running a
> full soak test but looking very promising :)

This is exactly the tuning required to drive high-BDP links 
successfully. The unfortunate problem with my reassembly change is that 
by removing the global count of reassembly segments and relying on the 
UMA zone to enforce the memory limit, we no longer guarantee room for 
the segment that would fill the hole in the sequence space (particularly 
when a single flow has a BDP larger than the maximum size of the 
reassembly queue, which is the case for you and Slawa).

This is bad, as Andre explained in his message, because we can stall 
connections. I hadn't considered the idea of allocating on the stack as 
Andre suggests in his patch, which I believe is an appropriate solution 
to the stalling problem, assuming the function never returns with the 
stack-allocated tqe still in the reassembly queue. My longer term goal 
is discussed below.
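Roughly, as I read Andre's suggestion, the fallback would take this 
shape (my own sketch, not his actual diff):

    struct tseg_qent tqs;           /* temporary entry on the stack */

    te = uma_zalloc(V_tcp_reass_zone, M_NOWAIT);
    if (te == NULL) {
            if (th->th_seq != tp->rcv_nxt) {
                    /* Not the missing segment; still have to drop it. */
                    TCPSTAT_INC(tcps_rcvmemdrop);
                    m_freem(m);
                    return (0);
            }
            /*
             * The segment fills the hole at rcv_nxt, so use the stack
             * entry just long enough to append the data to the socket
             * buffer.  It must never be left linked into the
             * reassembly queue on return.
             */
            te = &tqs;
    }

which keeps the connection moving even when the zone is exhausted.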

> So I suppose the question is should maxsegments be larger by
> default due to the recent changes e.g.
> - V_tcp_reass_maxseg = nmbclusters / 16;
> + V_tcp_reass_maxseg = nmbclusters / 8;
>
> or is the correct fix something more involved?

I'm not sure that bumping the default is appropriate - we have always 
expected users to tune their network stack to perform well in "unusual" 
scenarios, and a large-BDP fibre path is still in the "unusual" 
category.

The real fix, which is somewhere down on my todo list, is to make all 
these memory constraints elastic and respond to VM pressure, negating 
the need for a hard limit at all. This would solve many, if not most, of 
our current TCP tuning problems in one fell swoop and greatly reduce the 
need for manual intervention in situations that currently land in the 
"needs manual tuning" basket.
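To give a flavour of what I have in mind (purely hypothetical - nothing 
like this exists in the tree today, and the handler signature here is 
from memory):

    static void
    tcp_reass_lowmem(void *arg __unused, int flags __unused)
    {
            /*
             * uma_reclaim() hands cached free items from all UMA zones
             * back to the VM; a per-zone drain would be nicer, but
             * this shows the shape of the idea.
             */
            uma_reclaim();
    }

    /* ...registered from the reassembly zone's init path: */
    EVENTHANDLER_REGISTER(vm_lowmem, tcp_reass_lowmem, NULL,
        EVENTHANDLER_PRI_ANY);

i.e. leave the zone effectively uncapped and only push back when the VM 
actually signals memory pressure.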

Andre and Steven, I'm a bit too sleepy to properly review your combined 
proposed changes right now and will follow up in the next few days instead.

Cheers,
Lawrence

