Help:: Listen queue overflow killing servers

David Christensen dpchrist at holgerdanske.com
Fri Jul 26 18:56:18 UTC 2019


On 7/26/19 9:57 AM, Paul Macdonald via freebsd-questions wrote:
> 
> On 26/07/2019 17:11, David Christensen wrote:
>> On 7/26/19 4:58 AM, Paul Macdonald via freebsd-questions wrote:
>>> Over the past few months I've seen several boxes (4 or 5) become
>>> unresponsive as a result of a Listen queue overflow state.
>>
>>> All are on ZFS and are standard Apache/PHP/MySQL servers with
>>> nothing too exotic.
>>
>>> /var/log/messages typically shows:
>>>
>>>      kernel: sonewconn: pcb 0xfffff813395e3d58: Listen queue 
>>> overflow: 193 already in queue awaiting acceptance (83 occurrences)
>>>
>>> netstat -Lan shows:
>>>
>>> tcp4  193/0/128        x.x.x.x.443
>>> tcp4  193/0/128        x.x.x.x.80
>>
>>
>> What Apache/PHP/MySQL applications?  Did you write them?  If not,
>> who did?  Is everything up to date?  Have you filed bug reports?
>>
>>
>> Do the applications have logging or debugging capabilities?  Have you 
>> enabled them?  What do they say?  Where is the blockage? Deadlock?
>>
>>
> These were on servers with multiple vhosts, often running WordPress,
> but in one instance not (that one ran custom software we wrote
> in-house, but that's been in production for 19 years without this
> issue!)
> 
> I suspect it's too low-level for application-level debugging.
> 
> All I know so far is:
> 
>     - Servers become unresponsive, with Listen queue overflow
> messages in /var/log/messages.
> 
>     - Unable to quit jails or even shut down; tcpdrop doesn't work
> (everything is in CLOSE_WAIT).
> 
>     - On the occasion today (and I can't be 100% sure, but I suspect
> always), all the Apache processes were in disk wait state, and this
> was on a big new box with a very tiny site (on NVMe).
> 
>     - All servers are on FreeBSD 12 with ZFS, and Apache runs within
> an ezjail.
> 
>     - Multiple load patterns, but 2 out of the 5-ish incidents don't
> make much sense, as there would have been very little load.
> 
>     - Not reproducible: I have sieged a couple of the affected boxes
> with no effect (and logs on a couple of boxes show no interesting
> traffic, just normal):
> 
>         siege -c 255 -r 2
> 
>       (pretty stressful)
> 
>       (The target server does show something in the netstat queues,
> 0-100/512, but Apache stays out of disk wait; the siege is
> (un)successful in that the target copes fine.)
> 
>     - Run multiple times, no problem; I have now generated about
> 100,000 more lines in the Apache log than I saw after the server
> went down today (6,600 hits to a 16C/32T + 128GB + NVMe machine took
> it down earlier).
> 
>     - I've just hit it with 255 concurrent users over a period of 20
> minutes, and it doesn't blink.
> 
>     - So it doesn't look like it's load... (and that would have
> shown up in the logs anyway).
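
In the "netstat -Lan" output, the three numbers are the current queue
length, the incomplete connection queue, and the backlog limit, so
193/0/128 means 193 completed connections are sitting in the accept
queue against a limit of 128: Apache has stopped calling accept().
The CLOSE_WAIT states point the same way; the clients gave up and
closed, but the wedged processes never close() their ends.  Raising
the limits will not unwedge Apache, but as a stopgap it buys some
headroom.  A sketch for FreeBSD 12 (the values are illustrative, not
tuned recommendations):

    # kernel cap on listen queue backlogs
    sysctl kern.ipc.soacceptqueue
    sysctl kern.ipc.soacceptqueue=1024

    # in httpd.conf, then restart Apache
    ListenBacklog 1024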


Is this server in production?  If so, it would be prudent to migrate 
services and data to another computer while you troubleshoot.


I would turn on debugging and crank up logging everywhere -- kernel, 
ZFS, Apache, MySQL, PHP, WP, app code, etc.  Make sure you have a big 
and fast device/virtual device for the logs and debug dumps.
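
When one wedges again, the first thing I would capture is where the
stuck processes are blocked.  A sketch, assuming Apache worker
processes (the PID below is a placeholder):

    # process state and wait channel of each Apache process
    ps -axo pid,state,wchan,command | grep '[h]ttpd'

    # kernel stack of one stuck process (run as root)
    procstat -kk 12345

If they are all sleeping in a ZFS or disk I/O wait channel, that
shifts suspicion from the network stack to storage.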


Are the stress tests hitting the server with "good" traffic?  Can you 
send "bad" traffic?
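
For "bad" traffic, one cheap probe is to hold many connections open
with incomplete requests and watch whether the accept queue climbs.
A minimal sketch (the connection count and timeout are arbitrary;
x.x.x.x is the target as in your netstat output):

    # open 300 connections, send a partial request, hold each for 60s
    for i in $(seq 1 300); do
        printf 'GET / HTTP/1.1\r\nHost: test\r\n' | nc -w 60 x.x.x.x 80 &
    done
    wait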


Do you have test suites for any of the components?  If so, run them.  As 
you troubleshoot, write new test scripts.
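
Since the failures are sporadic, a watcher script that snapshots the
listen queues and process states would at least leave forensic data
the next time a box hangs.  A sketch (log path and interval are
arbitrary):

    #!/bin/sh
    # snapshot listen queues and Apache process states once a minute
    while :; do
        date
        netstat -Lan | grep -E '\.(80|443)$'
        ps -axo pid,state,wchan,command | grep '[h]ttpd'
        sleep 60
    done >> /var/log/queue-watch.log 2>&1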


Can you capture real traffic and replay it -- preferably traffic that 
elicits the bug(s)?
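
tcpdump can capture the traffic for offline analysis (the interface
and file name are placeholders):

    tcpdump -i em0 -s 0 -w /var/tmp/web.pcap 'port 80 or port 443'

One caveat on replay: blindly replaying the pcap (e.g., with
net/tcpreplay) will not complete new TCP handshakes against a live
server, so for HTTP it is usually more practical to extract the
requests from the capture and re-issue them with a script or siege.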


David

