NFS-exported ZFS instability
Hiroki Sato
hrs at FreeBSD.org
Tue Jan 29 21:49:38 UTC 2013
Hiroki Sato <hrs at freebsd.org> wrote
in <20130104.023244.472910818423317661.hrs at allbsd.org>:
hr> Konstantin Belousov <kostikbel at gmail.com> wrote
hr> in <20130102174044.GB82219 at kib.kiev.ua>:
hr>
hr> ko> > I might take a closer look this evening and see if I can spot anything
hr> ko> > in the log, rick
hr> ko> > ps: I hope Alan and Kostik don't mind being added to the cc list.
hr> ko>
hr> ko> What I see in the log is a lock cascade rooted in thread 100838,
hr> ko> which owns the system map mutex. I believe this prevents malloc(9)
hr> ko> from making progress in other threads, which e.g. own the ZFS vnode
hr> ko> locks. As a result, the whole system wedged.
hr> ko>
hr> ko> Looking back at the thread 100838, we can see that it executes
hr> ko> smp_tlb_shootdown(). It is impossible to tell from the static dump
hr> ko> whether the appearance of smp_tlb_shootdown() in the backtrace is
hr> ko> transient, or the thread is spinning there, waiting for other CPUs to
hr> ko> acknowledge the request. But, since the system wedged, most likely
hr> ko> smp_tlb_shootdown() spins.
hr> ko>
hr> ko> Taking this hypothesis, the situation can occur, most likely, due to
hr> ko> some other core running with the interrupts disabled. Inspection of the
hr> ko> backtraces of the processes running on all cores does not show any which
hr> ko> could legitimately own a spinlock or otherwise run with the interrupts
hr> ko> disabled.
hr> ko>
hr> ko> One thing you could try to do is to enable WITNESS for the spinlocks,
hr> ko> to try to catch the leaked spinlock. I very much doubt that this is
hr> ko> the case.
hr> ko>
hr> ko> Another thing to try is to switch the CPU idle method to something
hr> ko> else. Look at the machdep.idle* sysctls. It could be some CPU erratum
hr> ko> which blocks wakeup due to the interrupt in some conditions in C1?
hr>
hr> Thank you. It can take 1-2 weeks to reproduce this, so I have set
hr> debug.witness.skipspin=0, kept machdep.idle at acpi, and will see
hr> how it goes for a while. I will report again if I get another
hr> freeze.
Hmm, I was able to reproduce the same freeze with
debug.witness.skipspin=0, too. The DDB and crash dump outputs are here:
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
The value of machdep.idle was acpi. I have seen this symptom on two
boxes with the following CPUs, so I am guessing it is not specific to
a CPU model:
CPU: Intel(R) Pentium(R) D CPU 3.40GHz (3391.52-MHz K8-class CPU)
CPU: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (2666.82-MHz K8-class CPU)
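
For anyone following along, here is a minimal sketch of how the settings discussed in this thread could be inspected on a given box. The helper name show_idle is hypothetical and only for illustration. The commented lines at the end rest on assumptions: that debug.witness.skipspin is a boot-time tunable set from /boot/loader.conf on a kernel built with WITNESS, and that "hlt" is among the methods listed by machdep.idle_available. Neither is confirmed by the outputs above, so check both on the system in question.

```sh
#!/bin/sh
# Hypothetical helper: print the idle-method and WITNESS spinlock settings,
# or note that an OID is missing (e.g. kernel without WITNESS, or non-FreeBSD).
show_idle() {
    for oid in machdep.idle machdep.idle_available debug.witness.skipspin; do
        if value=$(sysctl -n "$oid" 2>/dev/null); then
            printf '%s=%s\n' "$oid" "$value"
        else
            printf '%s: not available on this system\n' "$oid"
        fi
    done
}

show_idle

# Assumption: to try another idle method at runtime, pick one of the values
# reported by machdep.idle_available, for example:
#   sysctl machdep.idle=hlt
# Assumption: to enable WITNESS checks for spin locks, add this line to
# /boot/loader.conf and reboot:
#   debug.witness.skipspin=0
```

The script only reads values, so it is safe to run as an unprivileged user. Changing machdep.idle requires root.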
-- Hiroki