Help:: Listen queue overflow killing servers
dpchrist at holgerdanske.com
Fri Jul 26 20:24:02 UTC 2019
On 7/26/19 12:34 PM, Paul Macdonald via freebsd-questions wrote:
> On 26/07/2019 19:56, David Christensen wrote:
>> On 7/26/19 9:57 AM, Paul Macdonald via freebsd-questions wrote:
>>> On 26/07/2019 17:11, David Christensen wrote:
>>>> On 7/26/19 4:58 AM, Paul Macdonald via freebsd-questions wrote:
>>>>> Over the past few months I've seen several boxes (4 or 5) become
>>>>> unresponsive as a result of a Listen queue overflow state.
>>> so it doesn't look like it's load (and that would
>>> have shown up in the logs anyway)
>> Is this server in production? If so, it would be prudent to migrate
>> services and data to another computer while you troubleshoot.
> this has happened on 5 production boxes over the past few months, all
> with different hardware and load profiles.
Which tracks and versions of FreeBSD?
Are you running stock FreeBSD? Packages? Ports? Custom?
Do you have automation to detect the symptom(s) and alert you?
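As a starting point for that kind of automation, here is a minimal sketch in Python. It assumes the check is fed the output of `netstat -Lan`, which on FreeBSD lists each listening socket with its queue sizes as qlen/incqlen/maxqlen. The function names and the 80% threshold are hypothetical choices for illustration, not something taken from the affected servers.

```python
import re
import subprocess

# Matches data lines such as "tcp4  0/0/128   *.80" from `netstat -Lan`.
# Header lines do not match and are skipped.
LISTEN_LINE = re.compile(r"^(\S+)\s+(\d+)/(\d+)/(\d+)\s+(\S+)")


def find_full_queues(netstat_output, threshold=0.8):
    """Return (proto, address, qlen, maxqlen) for each listen queue whose
    count of unaccepted connections is at or above threshold * maxqlen."""
    hits = []
    for line in netstat_output.splitlines():
        match = LISTEN_LINE.match(line.strip())
        if not match:
            continue
        proto, qlen, _incqlen, maxqlen, addr = match.groups()
        qlen, maxqlen = int(qlen), int(maxqlen)
        if maxqlen > 0 and qlen >= threshold * maxqlen:
            hits.append((proto, addr, qlen, maxqlen))
    return hits


def check_local_host(threshold=0.8):
    """Run netstat on the local machine and return the queues near overflow.
    Assumes a FreeBSD netstat that supports the -L option."""
    result = subprocess.run(
        ["netstat", "-Lan"], capture_output=True, text=True, check=True
    )
    return find_full_queues(result.stdout, threshold)
```

Calling `check_local_host()` from cron every minute and mailing any non-empty result would be one way to get an alert before a box stops responding.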
>> I would turn on debugging and crank up logging everywhere -- kernel,
>> ZFS, Apache, MySQL, PHP, WP, app code, etc. Make sure you have a big
>> and fast device/virtual device for the logs and debug dumps.
> that's a big job; we run 110+ servers. I'd like to find something more
Pick a representative sample (say, 10%) and crank up debug/logging. As
you get clues, you can scale back depth and increase the sample size.
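To make the sampling concrete, here is a minimal Python sketch that draws a random 10% of a host list. The host names and the `pick_sample` helper are hypothetical; a purely random draw is only one option, and you may prefer to pick by hand so that each hardware and load profile is covered.

```python
import random


def pick_sample(hosts, fraction=0.10, seed=None):
    """Return a sorted random sample of the hosts, at least one host.
    Pass a seed to get the same sample on every run."""
    rng = random.Random(seed)
    count = max(1, round(len(hosts) * fraction))
    return sorted(rng.sample(hosts, count))
```

With 110 hosts and the default fraction this selects 11 machines; raising `fraction` later widens the sample as the logging depth is scaled back.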
>> Are the stress tests hitting the server with "good" traffic? Can you
>> send "bad" traffic?
> no idea how to send bad traffic!
Metasploit comes to mind.
>> Do you have test suites for any of the components? If so, run them.
>> As you troubleshoot, write new test scripts.
> components are not comparable across boxes, and one box that went down
> has only our custom code (which has worked for a decade)
Did the other failing machines have your code?
>> Can you capture real traffic and replay it -- preferably traffic that
>> elicits the bug(s)?
> the issue doesn't seem to be that reproducible. I'll check, but I think
> only 1 of the boxes has gone down more than once with the same issue
> (I can't capture traffic on all boxes)
Again, perhaps start with a sample.
> I wish it was more reproducible; I'd downgrade that server to 11.2
> in a heartbeat (I suspect it's 12.0-related)
I prefer to use the most mature and supported "production" release of
whatever FOSS I use -- BSD, Linux, whatever. Newer stuff usually has
Similarly, I prefer "vendor official" binary software packages. I have
destabilized plenty of machines with unofficial packages and/or source
> (I have seen historic reports of similar issues on IMAP boxes, which
> have large queues anyway, obviously)
> Weirdly, our IMAP boxes have been fine, and they have 10k connections all
> the time.
> I siege-tested the box that went down earlier today (16C/32T, 128GB
> RAM, 1TB NVMe) and it didn't break a sweat after 300,000 connections.
> I am at a bit of a loss.
Which tracks/versions of FreeBSD are you running? Is there any
correlation between FreeBSD track/version and the bug(s)? Can you run
11.2-RELEASE? Are you running/can you run official FreeBSD binary
packages? Do you put your code into a FreeBSD package? Do you use