This disk failure should not panic a system, but just disconnect the disk from ZFS
Tom Curry
thomasrcurry at gmail.com
Mon Jun 22 00:57:22 UTC 2015
Yes, currently I am not using the patch from that PR. But I have lowered
the ARC max size; I am confident that if I left it at the default I would
see panics again.
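
(In case it helps anyone else, this is roughly how I capped it -- a minimal
sketch; the 24G figure is just what suits my 32G box, so treat the value as
an illustration:)

    # /boot/loader.conf -- cap the ZFS ARC so the rest of the kernel keeps headroom
    vfs.zfs.arc_max="24G"

The tunable is picked up at boot; as far as I know, recent stable/10 also
lets you adjust it on a running system via sysctl vfs.zfs.arc_max=<bytes>.
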
On Sun, Jun 21, 2015 at 7:45 PM, Willem Jan Withagen <wjw at digiware.nl> wrote:
> On 22/06/2015 01:34, Tom Curry wrote:
> > I asked because I recently had similar trouble. Lots of kernel panics:
> > sometimes they were just like yours, sometimes they were general
> > protection faults. But they would always occur when my nightly backups
> > took place, where VMs on iSCSI zvol LUNs were read and then written over
> > SMB to another pool on the same machine over 10GbE.
> >
> > I nearly went out of my mind trying to figure out what was going on.
> > I'll spare you the gory details, but I stumbled across this PR
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594 and as I read
>
> So this is "the Karl Denninger ZFS patch"....
> I tried to follow the discussion at the time, keeping it in the back
> of my head.....
> I concluded that the ideas were sort of accepted, but that a different
> solution was implemented?
>
> > through it, little light bulbs started coming on. Luckily it was easy
> > for me to reproduce the problem, so I kicked off the backups and watched
> > the system memory. Wired would grow, ARC would shrink, and then the
> > system would start swapping. If I stopped the I/O right then it would
> > recover after a while. But if I let it go it would always panic, and
> > half the time it would be the same message as yours. So I applied the
> > patch from that PR, rebooted, and kicked off the backup. No more panic.
> > Recently I rebuilt a vanilla kernel from stable/10 but explicitly set
> > vfs.zfs.arc_max to 24G (I have 32G), ran my torture tests, and it is
> > stable.
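> >
> > (Roughly how I watched it, in case anyone wants to reproduce -- just a
> > little loop over stock sysctls, shown purely as an illustration:)
> >
> >     # sample ARC size vs. wired/free page counts every 10s while the backup runs
> >     while true; do
> >         date
> >         sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
> >         sysctl vm.stats.vm.v_wire_count vm.stats.vm.v_free_count
> >         sleep 10
> >     done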
>
> So you've (almost) answered my question, but English is not my native
> language, hence my question just to be certain: you did not add the patch
> to your recently built stable/10 kernel...
>
> > So I don't want to send you on a wild goose chase, but it's entirely
> > possible this problem you are having is not hardware related at all, but
> > is a memory starvation issue related to the ARC under periods of heavy
> > activity.
>
> Well, rsync will do that for you... And for a few months now I've also
> exported some iSCSI zvols as remote disks to some Windows stations.
>
> Your suggestions are highly appreciated. Especially since I do not have
> spare PCI-X parts... (If the current hardware blows up, I'm getting
> modern new stuff.) So other than checking some cabling and the like,
> there is very little I could swap.
>
> Thanx,
> --WjW
>
> > > On Sun, Jun 21, 2015 at 6:43 PM, Willem Jan Withagen <wjw at digiware.nl> wrote:
> >
> > On 21/06/2015 21:50, Tom Curry wrote:
> > > Was there by chance a lot of disk activity going on when this occurred?
> >
> > Define 'a lot'??
> > But very likely, since the system is also a backup location for several
> > external services which back up through rsync. And they can generate
> > quite some traffic. Add to that the fact that it also serves an NVR with
> > a ZVOL through iSCSI...
> >
> > --WjW
> >
> > >
> > > On Sun, Jun 21, 2015 at 10:00 AM, Willem Jan Withagen <wjw at digiware.nl> wrote:
> > >
> > > On 20/06/2015 18:11, Daryl Richards wrote:
> > > > Check the failmode setting on your pool. From man zpool:
> > > >
> > > >   failmode=wait | continue | panic
> > > >
> > > >       Controls the system behavior in the event of catastrophic
> > > >       pool failure. This condition is typically a result of a
> > > >       loss of connectivity to the underlying storage device(s)
> > > >       or a failure of all devices within the pool. The behavior
> > > >       of such an event is determined as follows:
> > > >
> > > >       wait      Blocks all I/O access until the device
> > > >                 connectivity is recovered and the errors are
> > > >                 cleared. This is the default behavior.
> > > >
> > > >       continue  Returns EIO to any new write I/O requests but
> > > >                 allows reads to any of the remaining healthy
> > > >                 devices. Any write requests that have yet to be
> > > >                 committed to disk would be blocked.
> > > >
> > > >       panic     Prints out a message to the console and
> > > >                 generates a system crash dump.
> > >
> > > 'mmm
> > >
> > > I did not know about this setting. Nice one, but alas my current
> > > settings are:
> > > zfsboot  failmode  wait  default
> > > zfsraid  failmode  wait  default
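> > >
> > > (That looks like the output of a plain zpool get; for completeness,
> > > switching a pool to another mode would just be a zpool set -- shown
> > > here as illustration only, not as a recommendation:)
> > >
> > >     zpool get failmode zfsboot zfsraid
> > >     zpool set failmode=continue zfsraid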
> > >
> > > So either the setting is not working, or something else is up?
> > > Is waiting only meant to wait for a limited time, and then panic
> > > anyway?
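> > >
> > > (Guessing out loud, purely as illustration: the "appears to be hung"
> > > panic looks like the ZFS deadman timer, which fires when an I/O has
> > > been outstanding too long, independent of the failmode property. The
> > > related knobs on stable/10 can at least be inspected:)
> > >
> > >     # 1 = panic when an outstanding I/O is considered hung; timeout in ms
> > >     sysctl vfs.zfs.deadman_enabled
> > >     sysctl vfs.zfs.deadman_synctime_ms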
> > >
> > > But then I still wonder why, even in the 'continue' case, ZFS
> > > ends up in a state where the filesystem is not able to continue
> > > its standard functioning (read and write) and disconnects the
> > > disk???
> > >
> > > All failmode settings result in a seriously handicapped system...
> > > On a raidz2 system I would perhaps have expected this to occur when
> > > the second disk vanishes into thin air??
> > >
> > > The other question is: the man page talks about
> > > 'Controls the system behavior in the event of catastrophic pool failure'.
> > > Is a hung disk a 'catastrophic pool failure'?
> > >
> > > Still very puzzled?
> > >
> > > --WjW
> > >
> > > >
> > > >
> > > > On 2015-06-20 10:19 AM, Willem Jan Withagen wrote:
> > > >> Hi,
> > > >>
> > > >> Found my system rebooted this morning:
> > > >>
> > > >> Jun 20 05:28:33 zfs kernel: sonewconn: pcb 0xfffff8011b6da498:
> > > >>   Listen queue overflow: 8 already in queue awaiting acceptance
> > > >>   (48 occurrences)
> > > >> Jun 20 05:28:33 zfs kernel: panic: I/O to pool 'zfsraid' appears
> > > >>   to be hung on vdev guid 18180224580327100979 at '/dev/da0'.
> > > >> Jun 20 05:28:33 zfs kernel: cpuid = 0
> > > >> Jun 20 05:28:33 zfs kernel: Uptime: 8d9h7m9s
> > > >> Jun 20 05:28:33 zfs kernel: Dumping 6445 out of 8174
> > > >>   MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
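> > > >>
> > > >> (To double-check that this guid really maps to da0, reading the vdev
> > > >> label straight off the disk should work -- just an illustration:)
> > > >>
> > > >>     zdb -l /dev/da0 | grep -w guid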
> > > >>
> > > >> Which leads me to believe that /dev/da0 went out on vacation,
> > > >> leaving ZFS in trouble.... But the array is:
> > > >> ----
> > > >> NAME              SIZE  ALLOC   FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
> > > >> zfsraid          32.5T  13.3T  19.2T         -    7%  41%  1.00x  ONLINE  -
> > > >>   raidz2         16.2T  6.67T  9.58T         -    8%  41%
> > > >>     da0              -      -      -         -     -    -
> > > >>     da1              -      -      -         -     -    -
> > > >>     da2              -      -      -         -     -    -
> > > >>     da3              -      -      -         -     -    -
> > > >>     da4              -      -      -         -     -    -
> > > >>     da5              -      -      -         -     -    -
> > > >>   raidz2         16.2T  6.67T  9.58T         -    7%  41%
> > > >>     da6              -      -      -         -     -    -
> > > >>     da7              -      -      -         -     -    -
> > > >>     ada4             -      -      -         -     -    -
> > > >>     ada5             -      -      -         -     -    -
> > > >>     ada6             -      -      -         -     -    -
> > > >>     ada7             -      -      -         -     -    -
> > > >>   mirror          504M  1.73M   502M         -   39%   0%
> > > >>     gpt/log0         -      -      -         -     -    -
> > > >>     gpt/log1         -      -      -         -     -    -
> > > >> cache                -      -      -         -     -    -
> > > >>   gpt/raidcache0  109G  1.34G   107G         -    0%   1%
> > > >>   gpt/raidcache1  109G   787M   108G         -    0%   0%
> > > >> ----
> > > >>
> > > >> And thus I would have expected that ZFS would disconnect /dev/da0,
> > > >> then switch to DEGRADED state and continue, letting the operator
> > > >> fix the broken disk.
> > > >> Instead it chooses to panic, which is not a nice thing to do. :)
> > > >>
> > > >> Or do I have too high hopes of ZFS?
> > > >>
> > > >> Next question to answer is why this WD RED on:
> > > >>
> > > >> arcmsr0@pci0:7:14:0: class=0x010400 card=0x112017d3 chip=0x112017d3
> > > >>     rev=0x00 hdr=0x00
> > > >>     vendor   = 'Areca Technology Corp.'
> > > >>     device   = 'ARC-1120 8-Port PCI-X to SATA RAID Controller'
> > > >>     class    = mass storage
> > > >>     subclass = RAID
> > > >>
> > > >> got hung, and nothing for this shows in SMART....
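> > > >>
> > > >> (For the record, SMART behind the Areca is read along these lines --
> > > >> the ,1 is only an example, it depends on which port the disk sits on:)
> > > >>
> > > >>     smartctl -a -d areca,1 /dev/arcmsr0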
> > >