CURRENT freezes on Latitude D520
Robert Watson
rwatson at FreeBSD.org
Sun Dec 10 05:08:38 PST 2006
On Sun, 10 Dec 2006, Tai-hwa Liang wrote:
>> which get a bit more to the heart of most problems. debug.mpsafenet=1
>> really exists for the purposes of supporting components which are not
>> sufficiently locked to allow the stack to run MPSAFE, rather than as a
>> means of disabling direct dispatch and preemption, which speak to different
>> types of problems. The main reason that I haven't removed the administrator
>> tunable to date is that I suspect it will be quite helpful when KAME IPSEC
>> locking happens, but since that appears not to have happened yet,
>> debug.mpsafenet as an option is likely causing more harm than good by being
>> available as a stand-in sysctl masking other problems, causing people to
>> not get to the point of properly identifying the actual cause (device
>> driver bugs, etc).
>
> Can the aforementioned tricks (1/2/3) be applied to RELENG_6 as well?
WITNESS is available in RELENG_6, and should be used in combination with
INVARIANTS, DDB, KDB, and BREAK_TO_DEBUGGER to debug deadlocks.
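As a rough sketch, that combination would look something like the following
in a kernel configuration file. The option names are the ones listed above;
INVARIANT_SUPPORT is not mentioned above and is included on the assumption
that INVARIANTS depends on it, and the comments are only illustrative:

```
options 	KDB			# kernel debugger framework
options 	DDB			# interactive debugger backend
options 	BREAK_TO_DEBUGGER	# a break on the console enters the debugger
options 	INVARIANTS		# extra run-time consistency checks
options 	INVARIANT_SUPPORT	# support code assumed to be needed by INVARIANTS
options 	WITNESS			# lock order verification
```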
In RELENG_6, net.isr.direct is not enabled by default, so unless you've
enabled it yourself (or are using IP fast forwarding, which is functionally
similar), that won't apply.
In RELENG_6, PREEMPTION is in GENERIC and hence enabled by default, and it can
be disabled by removing it from your kernel configuration. I'd like it if we
could add a run-time sysctl to disable preemption even if PREEMPTION is
compiled in, as it would make it easier to explore its stability and
performance impact. However, this is also just a debugging step to see if
that quiesces the problem, and not a fix for the actual bug.
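For illustration, removing it from a copy of GENERIC amounts to commenting out
a single line and rebuilding the kernel. The trailing comment here is only
illustrative and may differ from what your config file contains:

```
#options 	PREEMPTION		# commented out to disable kernel thread preemption
```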
Right now, we're discussing removing the manual debug.mpsafenet configuration
flag from 7.x, and not 6.x. I fully recognize the importance of having it in
place as a workaround for bugs in production, although it concerns me greatly
that we're not getting these problems debugged and fixed, and instead masking
them. Architectural changes are on the way that will require these bugs to be
fixed properly, not just masked.
> We are running RELENG_6 on our production server (postfix, squid, pf
> firewall/NAT, FAST_IPSEC VPN, ...), a dual Athlon MP board with
> three NICs (two fxp cards and one onboard xl, connected to three different
> networks).
>
> I haven't tried WITNESS yet; however, I'm quite sure that net.isr.direct=0
> and that there is no PREEMPTION in the current kernel. The problem is that,
> with debug.mpsafenet=1, we always run into a hard freeze without getting any
> kdb> prompt on the console.
>
> Whilst turning debug.mpsafenet off only masks the real problem, I'm still
> wondering whether there is any less damaging way to track such a problem down
> in a _production_ environment.
It sounds like you need to follow the instructions for kernel debugging.
Depending on your tolerance for performance loss, downtime, etc., a good
starting point is to configure the kernel with INVARIANTS and WITNESS.
WITNESS is particularly important, if you can tolerate the performance hit, as
it warns of potential deadlocks, not just actual deadlocks. Also, compile the
kernel with KDB, DDB, and BREAK_TO_DEBUGGER, and use a serial or FireWire
console. If the hang occurs, see if you can get into the debugger, in which
case the logged output from DDB for the following commands would be very
useful:
show pcpu
show allpcpu
trace
alltrace
ps
show locks
show alllocks
show lockedvnods
show uma
show malloc
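A minimal sketch of the serial console side, assuming the first serial port
and the stock loader (adjust to your hardware):

```
# /boot/loader.conf
console="comconsole"
```

With that in place, the DDB session can be logged from a second machine
attached with a null-modem cable, for example by running a terminal program
such as cu(1) under script(1).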
Please open a PR that describes your configuration, includes your kernel
config (since it seems quite customized), any loader.conf settings, a detailed
description of the problem, and the DDB output. I'd be quite interested in knowing,
once the machine is in a hung state, whether the numlock light goes on and off
when you hit the numlock key on the keyboard.
Robert N M Watson
Computer Laboratory
University of Cambridge