Help:: Listen queue overflow killing servers

Fri Jul 26 20:24:02 UTC 2019

On 7/26/19 12:34 PM, Paul Macdonald via freebsd-questions wrote:
> 
> On 26/07/2019 19:56, David Christensen wrote:
>> On 7/26/19 9:57 AM, Paul Macdonald via freebsd-questions wrote:
>>>
>>> On 26/07/2019 17:11, David Christensen wrote:
>>>> On 7/26/19 4:58 AM, Paul Macdonald via freebsd-questions wrote:
>>>>> Over the past few months I've seen several boxes (4 or 5) become
>>>>> unresponsive as a result of a listen queue overflow state.
>>>>
>>> So it doesn't look like it's load (and that would have shown up in
>>> the logs anyway).
>>
>>
>> Is this server in production?  If so, it would be prudent to migrate 
>> services and data to another computer while you troubleshoot.
>>
>>
> This has happened on 5 production boxes over the past few months, all
> with different hardware and load profiles.

Which tracks and versions of FreeBSD?

Are you running stock FreeBSD?  Packages?  Ports?  Custom?

Do you have automation to detect the symptom(s) and alert you?
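
If not, something along these lines might be a starting point.  It is
only a rough sketch: it assumes the output of `netstat -Lan` has lines
shaped like "tcp4  0/0/128  *.80" (protocol, qlen/incqlen/maxqlen, local
address), and the function name and the 80% threshold are my own
placeholders, not anything FreeBSD provides.

```python
import re

# Assumed line shape from `netstat -Lan` on FreeBSD:
#   tcp4  0/0/128        *.80
# i.e. protocol, qlen/incqlen/maxqlen, local address.
LINE_RE = re.compile(r"^(\S+)\s+(\d+)/(\d+)/(\d+)\s+(\S+)")


def full_listen_queues(netstat_output, threshold=0.8):
    """Return (proto, address, qlen, maxqlen) for each listening socket
    whose queue is at least `threshold` of its maximum length."""
    hits = []
    for line in netstat_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # header line or a format this sketch does not know
        proto, qlen, _incqlen, maxqlen, addr = match.groups()
        qlen, maxqlen = int(qlen), int(maxqlen)
        if maxqlen > 0 and qlen >= threshold * maxqlen:
            hits.append((proto, addr, qlen, maxqlen))
    return hits
```

Run from cron against the captured `netstat -Lan` text, any non-empty
result could be mailed or logged before the box stops answering.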

>> I would turn on debugging and crank up logging everywhere -- kernel,
>> ZFS, Apache, MySQL, PHP, WP, app code, etc.  Make sure you have a big
>> and fast device/virtual device for the logs and debug dumps.
>>
>>
> That's a big job; we run 110+ servers.  I'd like to find something
> more specific.

Pick a representative sample (say, 10%) and crank up debug/logging.  As
you get clues, you can scale back depth and increase the sample size.
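
Picking the sample can be as simple as the sketch below.  The host names
are made up and `pick_sample` is a hypothetical helper; a fixed seed
just keeps the selection repeatable between runs.

```python
import random


def pick_sample(hosts, fraction=0.10, seed=None):
    """Pick a random subset of hosts (at least one) for extra logging."""
    hosts = list(hosts)
    if not hosts:
        return []
    count = min(len(hosts), max(1, round(len(hosts) * fraction)))
    rng = random.Random(seed)  # seeded so the same sample can be re-derived
    return sorted(rng.sample(hosts, count))


# Hypothetical host names; substitute the real inventory.
hosts = [f"web{n:03d}" for n in range(1, 111)]
debug_hosts = pick_sample(hosts, fraction=0.10, seed=2019)
print(len(debug_hosts), "hosts selected for debug logging")
```

Raising `fraction` later widens the sample once the logging has been
trimmed to what is actually useful.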

>> Are the stress tests hitting the server with "good" traffic?  Can you 
>> send "bad" traffic?
>>
>>
> no idea how to send bad traffic!

Metasploit comes to mind.

>> Do you have test suites for any of the components?  If so, run them. 
>> As you troubleshoot, write new test scripts.
>>
> Components are not comparable across boxes, and one box that went down
> has only our custom code (which has worked for a decade).

Did the other failing machines have your code?

>> Can you capture real traffic and replay it -- preferably traffic that 
>> elicits the bug(s)?
>>
> The issue doesn't seem to be that reproducible.  I'll check, but I
> think only 1 of the boxes has gone down more than once with the same
> issue.
>
> (I can't capture traffic on all boxes.)

Again, perhaps start with a sample.

> I wish it was more reproducible; I'd downgrade that server to 11.4
> in a heartbeat (I suspect it's 12.0 related).

I prefer to use the most mature and supported "production" release of 
whatever FOSS I use -- BSD, Linux, whatever.  Newer stuff usually has 
more "gremlins".

Similarly, I prefer "vendor official" binary software packages.  I have 
destabilized plenty of machines with unofficial packages and/or source 
distributions.

> (I have seen historic reports of similar issues on IMAP boxes, which
> obviously do have large queues anyway.)
>
> Weirdly, our IMAP boxes have been fine, and they have 10k connections
> all the time.
>
> I siege-tested the box that went down earlier today (16C/32T, 128GB
> RAM, 1TB NVMe) and it didn't break a sweat after 300,000 connections.
>
> I am at a bit of a loss.

Which tracks/versions of FreeBSD are you running?  Is there any
correlation between FreeBSD track/version and the bug(s)?  Can you run
11.2-RELEASE?  Are you running, or can you run, official FreeBSD binary
packages?  Do you put your code into a FreeBSD package?  Do you use
configuration management?
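
To check for a version correlation, a simple tally may be enough.  This
is a sketch only: the host names and versions below are placeholders I
made up, not your data, and `failure_rate_by_version` is a hypothetical
helper.

```python
from collections import Counter


def failure_rate_by_version(inventory, failed_hosts):
    """Map each FreeBSD version to (failed, total, failed / total)."""
    totals = Counter(inventory.values())
    failed = Counter(inventory[h] for h in failed_hosts if h in inventory)
    return {v: (failed[v], n, failed[v] / n) for v, n in totals.items()}


# Placeholder data, not real: host -> version string as reported by
# `freebsd-version` on that host.
inventory = {
    "hostA": "12.0-RELEASE",
    "hostB": "12.0-RELEASE",
    "hostC": "11.2-RELEASE",
    "hostD": "11.2-RELEASE",
}
failed_hosts = ["hostA"]

rates = failure_rate_by_version(inventory, failed_hosts)
for version, (bad, total, rate) in sorted(rates.items()):
    print(f"{version}: {bad}/{total} failed ({rate:.0%})")
```

With only 5 failures out of 110+ machines the numbers will be noisy, so
treat any pattern as a hint rather than proof.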

David