This disk failure should not panic a system, but just disconnect the disk from ZFS
Willem Jan Withagen
wjw at digiware.nl
Mon Jun 22 01:05:32 UTC 2015
On 22/06/2015 01:34, Tom Curry wrote:
> I asked because recently I had similar trouble. Lots of kernel panics,
> sometimes they were just like yours, sometimes they were general
> protection faults. But they would always occur when my nightly backups
> took place, during which VMs on iSCSI zvol LUNs were read and then
> written over SMB to another pool on the same machine over 10GbE.
>
> I nearly went out of my mind trying to figure out what was going on,
> I'll spare you the gory details, but I stumbled across this PR
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594 and as I read
So this is "the Karl Denninger ZFS patch"....
I tried to follow the discussion at the time, keeping it in the back
of my head.....
I concluded that the ideas were sort of accepted, but a different
solution was implemented?
> through it, little light bulbs started coming on. Luckily it was easy
> for me to reproduce the problem so I kicked off the backups and watched
> the system memory. Wired would grow, ARC would shrink, and then the
> system would start swapping. If I stopped the IO right then it would
> recover after a while. But if I let it go it would always panic, and
> half the time it would be the same message as yours. So I applied the
> patch from that PR, rebooted, and kicked off the backup. No more panic.
> Recently I rebuilt a vanilla kernel from stable/10 but explicitly set
> vfs.zfs.arc_max to 24G (I have 32G) and ran my torture tests and it is
> stable.
So you've (almost) answered my question, but English is not my native
language, hence my question for certainty: you did not add the patch
to your recently built stable/10 kernel...
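
For anyone reading along: watching the ARC against wired memory during
such a load, and capping the ARC the way Tom describes, would presumably
be done along these lines (sysctl and loader.conf names as found on stock
FreeBSD; the 24 GiB figure is only Tom's example, not a recommendation):
----
# Watch the ARC and wired memory while the backup/iSCSI load runs.
# arcstats.size is in bytes, v_wire_count is in pages.
sysctl kstat.zfs.misc.arcstats.size
sysctl vm.stats.vm.v_wire_count

# Cap the ARC at boot via /boot/loader.conf (value in bytes; 24 GiB here):
# vfs.zfs.arc_max="25769803776"
----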
> So I don't want to send you on a wild goose chase, but it's entirely
> possible this problem you are having is not hardware related at all, but
> is a memory starvation issue related to the ARC under periods of heavy
> activity.
Well, rsync will do that for you... And for a few months now I've also
exported some iSCSI zvols as remote disks to some Windows stations.
Your suggestions are highly appreciated, especially since I do not have
spare PCI-X parts... (If the current hardware blows up, I'm getting
modern new stuff.) So other than checking some cabling and the like,
there is very little I could swap.
Thanx,
--WjW
> On Sun, Jun 21, 2015 at 6:43 PM, Willem Jan Withagen <wjw at digiware.nl> wrote:
>
> On 21/06/2015 21:50, Tom Curry wrote:
> > Was there by chance a lot of disk activity going on when this occurred?
>
> Define 'a lot'??
> But very likely, since the system is also a backup location for several
> external services which back up through rsync. And they can generate
> quite some traffic. Next to the fact that it also serves an NVR with a
> ZVOL through iSCSI...
>
> --WjW
>
> >
> > On Sun, Jun 21, 2015 at 10:00 AM, Willem Jan Withagen <wjw at digiware.nl> wrote:
> >
> > On 20/06/2015 18:11, Daryl Richards wrote:
> > > Check the failmode setting on your pool. From man zpool:
> > >
> > >   failmode=wait | continue | panic
> > >
> > >       Controls the system behavior in the event of catastrophic pool
> > >       failure. This condition is typically a result of a loss of
> > >       connectivity to the underlying storage device(s) or a failure
> > >       of all devices within the pool. The behavior of such an event
> > >       is determined as follows:
> > >
> > >       wait      Blocks all I/O access until the device connectivity
> > >                 is recovered and the errors are cleared. This is the
> > >                 default behavior.
> > >
> > >       continue  Returns EIO to any new write I/O requests but allows
> > >                 reads to any of the remaining healthy devices. Any
> > >                 write requests that have yet to be committed to disk
> > >                 would be blocked.
> > >
> > >       panic     Prints out a message to the console and generates a
> > >                 system crash dump.
> >
> > Hmmm,
> >
> > Did not know about this setting. Nice one, but alas my current
> > setting is:
> >     zfsboot  failmode  wait  default
> >     zfsraid  failmode  wait  default
> >
> > So either the setting is not working, or something else is up?
> > Is waiting only meant to wait a limited time, and then panic anyway?
> >
> > But then I still wonder why, even in the 'continue' case, the ZFS
> > system ends up in a state where the filesystem is not able to continue
> > its standard functioning (read and write) and disconnects the disk???
> >
> > All failmode settings result in a seriously handicapped system...
> > On a raidz2 system I would perhaps have expected this to occur when
> > the second disk goes into thin space??
> >
> > The other question is: the man page talks about 'Controls the system
> > behavior in the event of catastrophic pool failure'.
> > And is a hung disk a 'catastrophic pool failure'?
> >
> > Still very puzzled?
> >
> > --WjW
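
Side note, mostly to collect the relevant knobs in one place (property and
sysctl names as I believe them to be on stock FreeBSD 10, so verify
locally): failmode is a pool property handled with zpool, while the
"I/O to pool ... appears to be hung" panic seems to come from the separate
ZFS 'deadman' check on hung vdev I/O, which fires regardless of failmode.
----
# Show the current failmode for both pools:
zpool get failmode zfsboot zfsraid

# Switch a pool to 'continue' (EIO on new writes instead of blocking):
# zpool set failmode=continue zfsraid

# The deadman I/O-hang check and its timeout (milliseconds); setting
# deadman_enabled=0 disables the hang-panic entirely, at your own risk.
# Names may vary slightly per release:
sysctl vfs.zfs.deadman_enabled
sysctl vfs.zfs.deadman_synctime_ms
----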
> >
> > >
> > >
> > > On 2015-06-20 10:19 AM, Willem Jan Withagen wrote:
> > >> Hi,
> > >>
> > >> Found my system rebooted this morning:
> > >>
> > >> Jun 20 05:28:33 zfs kernel: sonewconn: pcb 0xfffff8011b6da498:
> > >>     Listen queue overflow: 8 already in queue awaiting acceptance
> > >>     (48 occurrences)
> > >> Jun 20 05:28:33 zfs kernel: panic: I/O to pool 'zfsraid' appears
> > >>     to be hung on vdev guid 18180224580327100979 at '/dev/da0'.
> > >> Jun 20 05:28:33 zfs kernel: cpuid = 0
> > >> Jun 20 05:28:33 zfs kernel: Uptime: 8d9h7m9s
> > >> Jun 20 05:28:33 zfs kernel: Dumping 6445 out of 8174
> > >>     MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
> > >>
> > >> Which leads me to believe that /dev/da0 went out on vacation,
> > >> leaving ZFS in trouble.... But the array is:
> > >> ----
> > >> NAME              SIZE  ALLOC   FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
> > >> zfsraid          32.5T  13.3T  19.2T         -    7%  41%  1.00x  ONLINE  -
> > >>   raidz2         16.2T  6.67T  9.58T         -    8%  41%
> > >>     da0              -      -      -         -     -    -
> > >>     da1              -      -      -         -     -    -
> > >>     da2              -      -      -         -     -    -
> > >>     da3              -      -      -         -     -    -
> > >>     da4              -      -      -         -     -    -
> > >>     da5              -      -      -         -     -    -
> > >>   raidz2         16.2T  6.67T  9.58T         -    7%  41%
> > >>     da6              -      -      -         -     -    -
> > >>     da7              -      -      -         -     -    -
> > >>     ada4             -      -      -         -     -    -
> > >>     ada5             -      -      -         -     -    -
> > >>     ada6             -      -      -         -     -    -
> > >>     ada7             -      -      -         -     -    -
> > >>   mirror          504M  1.73M   502M         -   39%   0%
> > >>     gpt/log0         -      -      -         -     -    -
> > >>     gpt/log1         -      -      -         -     -    -
> > >> cache                -      -      -         -     -    -
> > >>   gpt/raidcache0  109G  1.34G   107G         -    0%   1%
> > >>   gpt/raidcache1  109G   787M   108G         -    0%   0%
> > >> ----
> > >>
> > >> And thus I would have expected that ZFS would disconnect /dev/da0,
> > >> switch to the DEGRADED state and continue, letting the operator
> > >> fix the broken disk.
> > >> Instead it chooses to panic, which is not a nice thing to do. :)
> > >>
> > >> Or do I have too high hopes of ZFS?
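
For comparison, the 'letting the operator fix the broken disk' path
referred to above would presumably look something like this (pool and
device names taken from the listing above; a sketch of the usual workflow
only, nothing that was actually run here):
----
# Confirm the pool is DEGRADED and see which vdev faulted:
zpool status -x zfsraid

# Take the suspect disk offline, then replace it in place once it has
# been swapped or recabled:
# zpool offline zfsraid da0
# zpool replace zfsraid da0
----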
> > >>
> > >> Next question to answer is why this WD RED on:
> > >>
> > >> arcmsr0 at pci0:7:14:0: class=0x010400 card=0x112017d3
> > >>     chip=0x112017d3 rev=0x00 hdr=0x00
> > >>     vendor    = 'Areca Technology Corp.'
> > >>     device    = 'ARC-1120 8-Port PCI-X to SATA RAID Controller'
> > >>     class     = mass storage
> > >>     subclass  = RAID
> > >> got hung, and nothing for this shows in SMART....
> >
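
A hedged note on the SMART remark: with the disks behind an Areca HBA,
a plain "smartctl -a /dev/daX" often shows little, and smartmontools
usually has to be pointed at the controller itself. Something along
these lines, where the ',1' slot number is only a placeholder that has
to match the physical port the WD RED sits in:
----
# Query SMART data for a drive behind an Areca controller (slots are
# 1..8 on an ARC-1120); adjust the ",1" to the actual slot.
smartctl -a -d areca,1 /dev/arcmsr0
----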