[zfs][hardware] Reproducible kernel panic in 8.0-STABLE

Wed Feb 3 11:23:19 UTC 2010

Andriy Gapon wrote:
> on 02/02/2010 19:38 Julian Elischer said the following:
>> Andriy Gapon wrote:
>>> on 02/02/2010 15:32 Stephane LAPIE said the following:
>>>> I have a case of kernel panic that can be consistently reproduced, and
>>>> which I guess is related to the hardware I'm using (Marvell controllers,
>>>> check my pciconf -lv output below).
>>>>
>>>> The kernel panic message is always the following:
>>>>
>>>> Sleeping thread (tid 100021, pid 0) owns a non-sleepable lock
>>> I probably won't be able to help you, but to kickstart debugging, could
>>> you please run 'procstat -t 0' and determine what kernel thread has
>>> tid 100021 on your system?
>> or in the kernel debugger after the panic, do: bt
> 
> I think that in this case the stack trace may not help, because I think
> this panic happens after the taskqueue thread is done with its tasks and
> is parked waiting.
> 
>> you DO have options kdb and ddb right?  (I never leave home without them)

I rebuilt the kernel with the debugger options and got the following
output after pulling out one disk:

Sleeping thread (tid 100024, pid 0) owns a non-sleepable lock
sched_switch() at sched_switch+0xf8
mi_switch() at mi_switch+0x16f
sleepq_timedwait() at sleepq_timedwait+0x42
_cv_timedwait() at _cv_timedwait+0x129
_sema_timedwait() at _sema_timedwait+0x55
ata_queue_request() at ata_queue_request+0x526
ata_controlcmd() at ata_controlcmd+0xa1
ata_setmode() at ata_setmode+0xdc
ad_init() at ad_init+0x27
ad_reinit() at ad_reinit+0x48
ata_reinit() at ata_reinit+0x268
ata_conn_event() at ata_conn_event+0x49
taskqueue_run() at taskqueue_run+0x93
taskqueue_thread_loop() at taskqueue_thread_loop+0x46
fork_exit() at fork_exit+0x118
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff80000aad30, rbp = 0 ---
panic: sleeping thread
cpuid = 2
KDB: enter: panic
[thread pid 12 tid 100008 ]
Stopped at      kdb_enter+0x3d: movq    $0,0x4943d0(%rip)

I don't think the output below is really relevant, though.

db> bt
Tracing pid 12 tid 100008 td 0xffffff000187e000
kdb_enter() at kdb_enter+0x3d
panic() at panic+0x17b
turnstile_adjust() at turnstile_adjust
turnstile_wait() at turnstile_wait+0x1aa
_mtx_lock_sleep() at _mtx_lock_sleep+0xb0
softclock() at softclock+0x2a9
intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
ithread_loop() at ithread_loop+0x8e
fork_exit() at fork_exit+0x118
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff800005ad30, rbp = 0 ---

If there is anything else I can run to gather more information, any
hints are welcome, though this does seem to point to a problem in the
controller's event handling, as I initially suspected.

I am also very suspicious of that controller because it tends to drop
two disks at exactly the same time, and unfortunately they belong to the
same raidz1 group. The BIOS cannot properly reset the port or redetect
the disks afterwards, so I have to do a cold boot. The disks themselves
could be damaged, but SMART shows nothing unusual in Reallocated Sectors
or similar attributes. I am seriously considering moving some of these
disks to the AHCI controller on my motherboard, and in the meantime I
will at least fall back to my spares.

Thanks for your time,
-- 
Stephane LAPIE, EPITA SRS, Promo 2005
"Even when they have digital readouts, I can't understand them."
--MegaTokyo
