Kernel panics in tcp_twclose

Thu Sep 24 08:04:54 UTC 2015

> On 24 Sep 2015, at 09:57, Julien Charbon <jch at freebsd.org> wrote:
> 
> 
> Hi -net,
> 
> On 24/09/15 09:03, Julien Charbon wrote:
>> On 24/09/15 08:55, Palle Girgensohn wrote:
>>>> On 24 Sep 2015, at 07:51, Palle Girgensohn
>>>> <girgen at pingpong.net> wrote:
>>>>> On 24 Sep 2015, at 00:05, Palle Girgensohn
>>>>> <girgen at pingpong.net> wrote:
>>>>>> On 23 Sep 2015, at 20:32, Julien Charbon <jch at freebsd.org> wrote:
>>>>>> On 23/09/15 20:26, Palle Girgensohn wrote:
>>>>> Kernels and userland are updated to 10.2-p3 with the patch
>>>>> removing the suspicious KASSERT.
>>>>> dtrace is running continuously, redirecting its output to a log file.
>>> Just had a crash. Unfortunately, the kernel was stuck at the db>
>>> prompt, and the remote keyboard was unresponsive (HP ILO, not
>>> impressed). So I had to reset the power and never got a core dump...
>>> 
>>> panic: tcp_tw_2msl_stop: inp should not be released here
>>> cpuid = 0
>>> KDB: stack backtrace:
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe175acd16a0
>>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe175acd1750
>>> vpanic() at vpanic+0x126/frame 0xfffffe175acd1790
>>> kassert_panic() at kassert_panic+0x139/frame 0xfffffe175acd1800
>>> tcp_twclose() at tcp_twclose+0x2cb/frame 0xfffffe175acd1850
>>> tcp_tw_2msl_scan() at tcp_tw_2msl_scan+0x13b/frame 0xfffffe175acd1890
>>> tcp_slowtimo() at tcp_slowtimo+0x68/frame 0xfffffe175acd18c0
>>> pfslowtimo() at pfslowtimo+0x54/frame 0xfffffe175acd18f0
>>> softclock_call_cc() at softclock_call_cc+0x193/frame 0xfffffe175acd19d0
>>> softclock() at softclock+0x47/frame 0xfffffe175acd19f0
>>> intr_event_execute_handlers() at intr_event_execute_handlers+0x93/frame 0xfffffe175acd1a30
>>> ithread_loop() at ithread_loop+0xa6/frame 0xfffffe175acd1a70
>>> fork_exit() at fork_exit+0x84/frame 0xfffffe175acd1ab0
>>> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe175acd1ab0
>>> --- trap 0, rip = 0, rsp = 0xfffffe175acd1b70, rbp = 0 ---
>>> KDB: enter: panic
>>> [ thread pid 12 tid 100043 ]
>>> Stopped at      kdb_enter+0x3e: movq    $0,kdb_why
>>> db>
>> 
>> Thanks a lot for this backtrace.  This is what I expected: when
>> tcp_close() is called in the INP_TIMEWAIT case, in_pcbfree() can be
>> called one extra time, which leads to:
>> 
>> tcp_tw_2msl_stop: inp should not be released here
>> 
>> Let me try to come up with a tentative fix for this case.
> 
> See my tentative patch for this case, attached.  It is only a first
> attempt, as I am still waiting for feedback from -net on what the rule
> should be here.
> 
> By the way:
> 
> - I see nothing specific to VIMAGE here
> 

We only see the problem with VIMAGE kernels, and we see it on all VIMAGE kernels that have a reasonable amount of load. For us, it started in August. It could be due to more load after the quiet summer (our system's usage is somewhat seasonal), or it could be due to some package update in userland that triggered the bug. We cannot find anything that would clearly explain why it started right now.

> - Is anyone aware of tcp_close() (or tcp_drop()) calls modified/introduced
> recently in 10.2 that could explain why this issue appears only now?

We started by rolling kernels back as far as releng/10.1 from January, so the problem (OK, it might not have the same cause, but at least it has the same crash pattern) was definitely already there in 10.1 in January.

Palle
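
For readers following the quoted analysis, the "one extra in_pcbfree()" explanation can be illustrated with a toy reference-counting model. This is only a sketch under stated assumptions, not FreeBSD code: it assumes the control block holds one base reference plus one reference taken by the time-wait list, and that the time-wait stop routine asserts it is never the one dropping the last reference. All names below (ToyInpcb, ToyPanic, toy_pcbfree, toy_2msl_start, toy_2msl_stop, toy_twclose) are hypothetical and merely echo the kernel function names mentioned in the thread.

```python
class ToyPanic(Exception):
    """Stands in for a KASSERT-triggered kernel panic in this toy model."""


class ToyInpcb:
    """Hypothetical reference-counted control block (not the real struct inpcb)."""

    def __init__(self):
        self.refcount = 1   # assumed base reference held by the owner
        self.freed = False

    def ref(self):
        if self.freed:
            raise RuntimeError("reference taken on a freed object")
        self.refcount += 1

    def rele(self):
        """Drop one reference; return True if that was the last one."""
        if self.freed:
            raise RuntimeError("reference dropped on a freed object")
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True
            return True
        return False


def toy_pcbfree(inp):
    """Drop the base reference (toy analogue of freeing the pcb)."""
    return inp.rele()


def toy_2msl_start(inp):
    """The time-wait list takes its own reference (an assumption of this model)."""
    inp.ref()


def toy_2msl_stop(inp):
    """Drop the list's reference; it must never be the last one."""
    if inp.rele():
        raise ToyPanic("toy_2msl_stop: inp should not be released here")


def toy_twclose(inp):
    """Leave time-wait: drop the list reference, then the base reference."""
    toy_2msl_stop(inp)
    toy_pcbfree(inp)


def run(extra_free):
    """Run one scenario; extra_free simulates the suspected extra free call."""
    inp = ToyInpcb()
    toy_2msl_start(inp)
    if extra_free:
        toy_pcbfree(inp)   # the suspected extra call, made before the timer fires
    try:
        toy_twclose(inp)
    except ToyPanic as exc:
        return "panic: " + str(exc)
    return "closed cleanly"


if __name__ == "__main__":
    print(run(extra_free=False))
    print(run(extra_free=True))
```

In this model the normal path drops the list reference first and the base reference last, so the assertion holds. With the extra free, the base reference is already gone when the timer path runs, the list reference becomes the last one, and the assertion fires. Whether the real kernel reaches this state the same way is exactly the open question in the thread.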