This disk failure should not panic a system, but just disconnect the disk from ZFS

Willem Jan Withagen wjw at digiware.nl
Sun Jun 21 14:01:28 UTC 2015


On 20/06/2015 18:11, Daryl Richards wrote:
> Check the failmode setting on your pool. From man zpool:
> 
>        failmode=wait | continue | panic
> 
>            Controls the system behavior in the event of catastrophic
> pool failure.  This  condition  is  typically  a
>            result  of  a  loss of connectivity to the underlying storage
> device(s) or a failure of all devices within
>            the pool. The behavior of such an event is determined as
> follows:
> 
>            wait        Blocks all I/O access until the device
> connectivity is recovered and the errors  are  cleared.
>                        This is the default behavior.
> 
>            continue    Returns  EIO  to  any  new write I/O requests but
> allows reads to any of the remaining healthy
>                        devices. Any write requests that have yet to be
> committed to disk would be blocked.
> 
>            panic       Prints out a message to the console and generates
> a system crash dump.

'mmm

I did not know about this setting. Nice one, but alas my current settings are:
zfsboot  failmode         wait                           default
zfsraid  failmode         wait                           default
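
That is the output of 'zpool get failmode'. Per the zpool(8) text quoted
above it should be possible to switch it, e.g. to 'continue', with
something like this (untested on my side, pool names are mine):

----
# show the current failmode on both pools
zpool get failmode zfsboot zfsraid

# switch the raid pool from 'wait' to 'continue'
zpool set failmode=continue zfsraid
----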

So either the setting is not working, or something else is up.
Is 'wait' only meant to wait for a limited time, and then panic anyway?
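
The panic message below talks about I/O that 'appears to be hung', which
smells like a separate hung-I/O watchdog with its own timeout, independent
of failmode. Something like this should show whether such a tunable exists
here; I'm guessing at the names, they may differ per FreeBSD/ZFS version:

----
# look for a hung-I/O ("deadman") watchdog among the ZFS sysctls
sysctl vfs.zfs | grep -i deadman

# if vfs.zfs.deadman_enabled exists, setting it to 0 supposedly turns
# off the panic-on-hung-I/O behaviour (untested here)
----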

But then I still wonder why, even in the 'continue' case, ZFS ends up in a
state where the filesystem is not able to carry on with its standard
functioning (read and write) and disconnect the disk?

All failmode settings seem to result in a seriously handicapped system...
On a raidz2 system I would perhaps have expected this to occur only once a
second disk also goes into thin air??
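
What I would have expected is roughly the manual equivalent: the pool drops
the bad disk, runs DEGRADED, and the operator cleans up afterwards. As a
sketch (assuming da0 really is the failing member; commands per zpool(8),
not what actually happened here):

----
# check which pool/vdev is unhappy
zpool status -x

# take the suspect disk out of service; the pool goes DEGRADED
zpool offline zfsraid da0

# after swapping the hardware, resilver onto the replacement
zpool replace zfsraid da0
----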

The other question is: the man page talks about
'Controls the system behavior in the event of catastrophic pool failure'.
Is a single hung disk a 'catastrophic pool failure'?

Still very puzzled.

--WjW

> 
> 
> On 2015-06-20 10:19 AM, Willem Jan Withagen wrote:
>> Hi,
>>
>> Found my system rebooted this morning:
>>
>> Jun 20 05:28:33 zfs kernel: sonewconn: pcb 0xfffff8011b6da498: Listen
>> queue overflow: 8 already in queue awaiting acceptance (48 occurrences)
>> Jun 20 05:28:33 zfs kernel: panic: I/O to pool 'zfsraid' appears to be
>> hung on vdev guid 18180224580327100979 at '/dev/da0'.
>> Jun 20 05:28:33 zfs kernel: cpuid = 0
>> Jun 20 05:28:33 zfs kernel: Uptime: 8d9h7m9s
>> Jun 20 05:28:33 zfs kernel: Dumping 6445 out of 8174
>> MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
>>
>> Which leads me to believe that /dev/da0 went out on vacation, getting
>> ZFS into trouble.... But the array is:
>> ----
>> NAME               SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP
>> zfsraid           32.5T  13.3T  19.2T         -     7%    41%  1.00x
>> ONLINE  -
>>    raidz2          16.2T  6.67T  9.58T         -     8%    41%
>>      da0               -      -      -         -      -      -
>>      da1               -      -      -         -      -      -
>>      da2               -      -      -         -      -      -
>>      da3               -      -      -         -      -      -
>>      da4               -      -      -         -      -      -
>>      da5               -      -      -         -      -      -
>>    raidz2          16.2T  6.67T  9.58T         -     7%    41%
>>      da6               -      -      -         -      -      -
>>      da7               -      -      -         -      -      -
>>      ada4              -      -      -         -      -      -
>>      ada5              -      -      -         -      -      -
>>      ada6              -      -      -         -      -      -
>>      ada7              -      -      -         -      -      -
>>    mirror           504M  1.73M   502M         -    39%     0%
>>      gpt/log0          -      -      -         -      -      -
>>      gpt/log1          -      -      -         -      -      -
>> cache                 -      -      -      -      -      -
>>    gpt/raidcache0   109G  1.34G   107G         -     0%     1%
>>    gpt/raidcache1   109G   787M   108G         -     0%     0%
>> ----
>>
>> And thus I would have expected that ZFS would disconnect /dev/da0 and
>> then switch to a DEGRADED state and continue, letting the operator fix
>> the broken disk.
>> Instead it chooses to panic, which is not a nice thing to do. :)
>>
>> Or do I have too high hopes of ZFS?
>>
>> The next question to answer is why this WD RED on:
>>
>> arcmsr0 at pci0:7:14:0:    class=0x010400 card=0x112017d3 chip=0x112017d3
>> rev=0x00 hdr=0x00
>>      vendor     = 'Areca Technology Corp.'
>>      device     = 'ARC-1120 8-Port PCI-X to SATA RAID Controller'
>>      class      = mass storage
>>      subclass   = RAID
>>
>> got hung, and why nothing about it shows up in SMART....


