This disk failure should not panic a system, but just disconnect the disk from ZFS

Willem Jan Withagen wjw at digiware.nl
Mon Jun 22 12:30:35 UTC 2015


On 22/06/2015 04:31, Quartz wrote:
>>> You have a raidz2, which means THREE disks need to go down before the
>>> pool is unwritable. The problem is most likely your controller or
>>> power supply, not your disks.
>>>
>> Never make such assumptions...
>>
>> I have worked in a professional environment where 9 of 12 disks failed
>> within 24 hours of each other....
> 
> Right... but if that was his problem there should be some logs of the
> other drives going down first, and typically ZFS would correctly mark
> the pool as degraded (at least, it would in my testing). The fact that
> ZFS didn't get a chance to log anything and the pool came back up
> healthy leads me to believe the controller went south, taking several
> disks with it all at once and totally borking all IO. (Either that or
> what Tom Curry mentioned about the Arc issue, which I wasn't previously
> aware of).
> 
> Of course, if it issue isn't repeatable then who knows....

I do not think it was a full-out failure, but just one transaction that
got hit by an alpha particle...

Well, remember that the hung-diagnostics (deadman) timeout is 1000 sec.
In the time span before the panic, nothing else was logged about
disks/controllers/etc. not functioning...

Only in the few seconds before the panic did ctl/iSCSI and the network
interface start complaining that there was a memory shortage, and the
network interface started dropping packets....

But all that was logged really nicely in syslog. So I think that in the
1000 sec it took for the deadman switch to trigger, the zpool just
functioned as expected.... And the hardware somewhere lost one
transaction.
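For anyone wanting to look at that deadman behaviour on their own box: the
timeout and the panic-on-hung-I/O behaviour are exposed as sysctls in
FreeBSD's ZFS port. The names below are what I believe they are called
(`vfs.zfs.deadman_*`); defaults may differ between releases, so check your
own system before relying on this sketch.

```shell
# Inspect the ZFS deadman ("hung I/O") tunables on FreeBSD.
# Assumed sysctl names from the FreeBSD ZFS port; verify on your release.
sysctl vfs.zfs.deadman_synctime_ms   # an I/O is considered hung after this many ms
sysctl vfs.zfs.deadman_enabled       # whether the deadman fires at all

# If you would rather have a hung transaction logged than have it panic
# the machine, the deadman can be disabled (trade-off: a truly wedged
# controller then hangs the pool silently instead of rebooting the box):
sysctl vfs.zfs.deadman_enabled=0
```

Whether disabling it is wise is debatable; the panic at least produces a
crash dump to work from, which is exactly what we have here.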

So I'll be crossing my fingers, and we'll see when/what/where the next
crash is going to occur. And work from there....

--WjW




More information about the freebsd-fs mailing list