I have a DDB session open to a crashed ZFS server
Dennis Glatting
dg at pki2.com
Tue Oct 16 18:48:23 UTC 2012
On Tue, 16 Oct 2012, Andriy Gapon wrote:
> on 16/10/2012 19:15 John Baldwin said the following:
>> On Tuesday, October 16, 2012 11:16:37 am Dennis Glatting wrote:
>>> On Tue, 2012-10-16 at 08:44 -0400, John Baldwin wrote:
>>>> On Monday, October 15, 2012 12:03:39 pm Dennis Glatting wrote:
>>>>> FreeBSD/amd64 (mc) (ttyu0)
>>>>>
>>>>> login: NMI ... going to debugger
>>>>> [ thread pid 11 tid 100003 ]
>>>>
>>>> You got an NMI, not a crash. What happens if you just continue ('c' command)
>>>> from DDB?
>>>>
>>>
>>> I hit the NMI button because of the "crash," which is a misword, to get
>>> into DDB.
>>
>> Ah, I would suggest "hung" or "deadlocked" next time. It certainly seems like
>> a deadlock since all CPUs are idle. Some helpful commands here might be
>> 'show sleepchain' and 'show lockchain'.
>>
>> Pick a "stuck" process (like find) and run:
>>
>> 'show sleepchain <pid>'
>>
>> In your case though it seems both 'find' and the various 'pbzip2' threads
>> are stuck on a condition variable, so there isn't an easy way to identify
>> an "owner" that is supposed to awaken these threads. It could be a case
>> of a missed wakeup perhaps, but you'll need to get someone more familiar
>> with ZFS to identify where these threads should normally be awakened.
>>
>
> I would also re-iterate a suggestion that I made to Nikolay earlier:
> http://article.gmane.org/gmane.os.freebsd.devel.file-systems/15981
>
> BTW, in that case it turned out to be a genuine deadlock in ZFS ARC handling of
> lowmem.
> procstat -kk -a is a great help for analyzing such situations.
>
Without restarting the server, and from memory, I believe the ARC on this
server is 32GB. The L2ARC is a 50-60GB SSD. The ZIL is a 16GB partitioned
SSD, but my non-ZIL systems have the same problem. Main memory is 128GB.

I can run procstat to a serial console and scarf the output. What interval
would be helpful? Five seconds? Remember that when the system hangs, no
commands will run, so the data will be pre-hang.

BTW, it takes 4-24 hours to hang under load.

Also, are you suggesting I apply the patch in the URL and run again? I
have been following your other posts, but the patches you posted did not
apply cleanly, so I removed them from my rev.
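
For the periodic procstat capture mentioned above, a minimal sketch of what
such a logger might look like. The function names (snapshot, capture), the
five-second default, and the 30-second per-command timeout are illustrative
assumptions, not anything agreed on in this thread. Each snapshot is written
under a UTC timestamp and flushed immediately, so the last complete snapshot
before a hang survives on the receiving end.

```python
import subprocess
import time
from datetime import datetime, timezone


def snapshot(cmd, out):
    """Run cmd once and write its output to out under a UTC timestamp header."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    try:
        # The 30 s timeout is an assumed safety margin, not a tuned value.
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        body = result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        body = f"command failed: {exc}\n"
    out.write(f"=== {stamp} ===\n{body}")
    # Flush right away so nothing is left in a buffer when the box hangs.
    out.flush()


def capture(cmd, out, interval=5.0, count=None):
    """Take snapshots every `interval` seconds; run forever if count is None."""
    taken = 0
    while count is None or taken < count:
        snapshot(cmd, out)
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
    return taken
```

Something like capture(["procstat", "-kk", "-a"], sys.stdout, interval=5.0),
with stdout directed at whatever is recording the serial console, would
produce the pre-hang data. Whether five seconds is the right interval is
still the open question.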