NICs locking up, "*tcp_sc_h"

Fri Mar 13 03:10:24 PDT 2009

On Fri, 13 Mar 2009, Nick Withers wrote:

> Sorry for the original double-post, by the way, not quite sure how that 
> happened...
>
> I can reproduce this problem relatively easily, by the way (every 3 days, on 
> average). I meant to say this before, too, but it seems to happen a lot more 
> often on the fxp than on rl.
>
> I'm sorry to ask what is probably a very simple question, but is there 
> somewhere I should look to get clues on debugging from a manually generated 
> dump? I tried "panic" after manually invoking the kernel debugger but proved 
> highly inept at getting from the dump the same information "ps" / "where" 
> gave me within the debugger live.

If this is, in fact, a TCP input lock leak of some sort, then most likely some 
particular property of a host your system talks to, or a network it runs over, 
triggers this (presumably) unusual edge case -- perhaps a firewall that mucks 
with TCP in a funny way, etc.  Of course, it might be something completely 
different -- the fact that everything is blocked on *tcp_sc_h and *tcp simply 
means that something holding TCP locks hasn't released them, and this could 
happen for a number of reasons.
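
To make "something holding TCP locks hasn't released them" concrete, here is a 
minimal userland sketch of how such a leak typically arises: an error path 
that returns without unlocking.  It is an analogy built on POSIX threads, not 
FreeBSD kernel code; demo_lock, handle_segment() and lock_is_leaked() are 
hypothetical names invented for the illustration, and nothing here is claimed 
to be the actual bug.

```c
#include <pthread.h>

/* Hypothetical stand-in for a TCP lock; NOT the kernel's real lock. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical input handler.  The common path locks and unlocks
 * correctly; the rarely taken "unusual" path returns early and
 * leaves the lock held.
 */
static int
handle_segment(int unusual)
{
	pthread_mutex_lock(&demo_lock);
	if (unusual)
		return (-1);	/* BUG: early return leaks demo_lock */
	pthread_mutex_unlock(&demo_lock);
	return (0);
}

/*
 * Returns 1 if demo_lock is still held, 0 otherwise.  trylock fails
 * immediately instead of blocking, so the check itself cannot hang.
 */
static int
lock_is_leaked(void)
{
	if (pthread_mutex_trylock(&demo_lock) == 0) {
		pthread_mutex_unlock(&demo_lock);
		return (0);
	}
	return (1);
}
```

Calling handle_segment(0) any number of times leaves the lock free, which is 
why a leak like this can go unnoticed for days.  After a single 
handle_segment(1), lock_is_leaked() reports the lock as held, and every later 
caller that tries to take it would block, much as the threads in the traces 
are all waiting in turnstile_wait().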

Once you've acquired a crashdump, you can run crashinfo(8), which will produce 
a summary of useful debugging information.  There are some things that are a 
bit easier to do in the run-time debugger, such as lock analysis, as the 
run-time debugger is more up-close and personal with in-kernel data 
structures; other things are easier in kgdb, which has complete source code 
and C type access.  I find kgdb works pretty well for everything but "show 
me what locks are held".  Many of our system monitoring tools, including ps 
and portions of netstat, can actually be run on crashdumps to report the state 
of the system at the time it crashed -- take a look at the -M and -N command 
line arguments, which respectively allow you to point those tools at the 
crashdump and at a kernel with debugging symbols (typically kernel.debug or 
kernel.symbols) matching the kernel that was booted at the time of the crash.

Robert N M Watson
Computer Laboratory
University of Cambridge

>
> Ta for your help!
>
>> Robert N M Watson
>> Computer Laboratory
>> University of Cambridge
>>
>>
>>>
>>> Tracing PID 31 tid 100030 td 0xffffff00012016e0
>>> sched_switch() at sched_switch+0xf1
>>> mi_switch() at mi_switch+0x18f
>>> turnstile_wait() at turnstile_wait+0x1cf
>>> _mtx_lock_sleep() at _mtx_lock_sleep+0x76
>>> syncache_lookup() at syncache_lookup+0x176
>>> syncache_expand() at syncache_expand+0x38
>>> tcp_input() at tcp_input+0xa7d
>>> ip_input() at ip_input+0xa8
>>> ether_demux() at ether_demux+0x1b9
>>> ether_input() at ether_input+0x1bb
>>> fxp_intr() at fxp_intr+0x233
>>> ithread_loop() at ithread_loop+0x17f
>>> fork_exit() at fork_exit+0x11f
>>> fork_trampoline() at fork_trampoline+0xe
>>> ____
>>>
>>> A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
>>> gave the somewhat similar:
>>> ____
>>>
>>> sched_switch() at sched_switch+0xf1
>>> mi_switch() at mi_switch+0x18f
>>> turnstile_wait() at turnstile_wait+0x1cf
>>> _rw_rlock() at _rw_rlock+0x8c
>>> ipfw_chk() at ipfw_chk+0x3ab2
>>> ipfw_check_out() at ipfw_check_out+0xb1
>>> pfil_run_hooks() at pfil_run_hooks+0x9c
>>> ip_output() at ip_output+0x367
>>> syncache_respond() at syncache_respond+0x2fd
>>> syncache_timer() at syncache_timer+0x15a
>>> (...)
>>> ____
>>>
>>> In this particular case, the fxp0 card is in a lagg with rl0, but this
>>> problem can be triggered with either card on its own...
>>>
>>> The scheduler is SCHED_ULE.
>>>
>>> I'm not too sure how to give more useful information than this, I'm
>>> afraid. It's a custom kernel, too... Do I need to supply information on
>>> what code actually exists at the relevant addresses (I'm not at all
>>> clued in on how to do this... Sorry!)? Should I chuck WITNESS,
>>> INVARIANTS et al. in?
>>>
>>> I *think* every time this has been triggered there's been a "python2.5"
>>> process in the "*tcp" state. This machine runs net-p2p/deluge and
>>> generally has at least 100 TCP connections on the go at any given time.
>>>
>>> Can anyone give me a clue as to what I might do to track this down?
>>> Appreciate any pointers.
>>> --
>>> Nick Withers
>>> email: nick at nickwithers.com
>>> Web: http://www.nickwithers.com
>>> Mobile: +61 414 397 446
>>>
> -- 
> Nick Withers
> email: nick at nickwithers.com
> Web: http://www.nickwithers.com
> Mobile: +61 414 397 446
>