ZFS w/failing drives - any equivalent of Solaris FMA?
Karl Pielorz
kpielorz_lst at tdx.co.uk
Fri Sep 12 19:45:55 UTC 2008
--On 12 September 2008 09:04 -0700 Jeremy Chadwick <koitsu at FreeBSD.org>
wrote:
> I know ATA will notice a detached channel, because I myself have done
> it: administratively, that is -- atacontrol detach ataX. But the only
> time that can happen "automatically" is if the actual controller does
> so itself, or if FreeBSD is told to do it administratively.
I think the problem at the moment is that ZFS "doesn't care" - it's
deliberately kept remote from things like drivers and drives - and right now
there's no 'middle layer', or any way for at least the ATA drivers to tell
ZFS that a drive 'has failed'. (For starters, you've got the problem of
defining "what's a failed drive" - presumably a drive operating outside a set
of limits? - The first limit probably being 'is it still attached?' :)
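At the moment about the only way to answer that last question is to check by
hand - e.g. something along these lines, which just lists what's still
showing up on each ATA channel (nothing ZFS-specific about it, just what I'd
type myself):

  # atacontrol list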
There was a thread about this recently on the OpenSolaris ZFS forum, where it
was discussed at length...
> I am also very curious to know the exact brand/model of 8-port SATA
> controller from Supermicro you are using, *especially* if it uses ata(4)
> rather than CAM and da(4).
The controllers ID as:
Marvell 88SX6081 SATA300 controller
They're Supermicro 8-port PCI-X SATA controllers, 'AOC-SAT2-MV8' - and they
definitely show up as 'adX'.
> Such Supermicro controllers were recently
> discussed on freebsd-stable (or was it -hardware?), and no one was able
> to come to a concise decision as to whether or not they were decent or
> even remotely trusted. Supermicro provides a few different SATA HBAs.
Well, I've tested these cards for a number of months now, and they seem
fine here - at least with the WD drives I'm currently running (not saying
they're 'perfect' - but for my setup, I've not seen any issues). I didn't
notice any 'bad behaviour' when testing them under UFS, and when running
under ZFS they've picked up no checksum errors (or console messages) for
the duration the box has been running.
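For anyone wanting to check the same thing: the checksum errors show up as
the per-device CKSUM counters in 'zpool status', so it's just a case of
running something like the following (pool name is only an example) and
making sure nothing's ticked up:

  # zpool status -v tank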
> I can see the usefulness in Solaris's FMA thing. My big concern is
> whether or not FMA actually pulls the disk off the channel, or if it
> just leaves the disk/channel connected and simply informs kernel pieces
> not to use it. If it pulls the disk off the channel, I have serious
> qualms with it.
I don't think it pulls it - I think it looks at its policies and does
whatever they say, which by default would seem to be the equivalent of 'zpool
offline dev' (which, again, doesn't pull the disk off any buses - it just
tells ZFS not to send I/O to that device).
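For reference, doing the same thing by hand on FreeBSD would look something
like this (the pool and device names below are made up for the example):

  # zpool offline tank ad8
    (pool carries on in a degraded state, no further I/O goes to ad8)
  # zpool online tank ad8
    (brings the disk back in, and ZFS resilvers whatever it missed)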
I'll have to do a test using da / CAM-driven disks (or ask someone who worked
on the port ;) - but I'd guess that, unless something has been added to CAM
to tell ZFS to offline the disk, it'll behave the same - i.e. ZFS will
continue to issue I/O requests to disks as it needs to - since, at least in
OpenSolaris, it's deemed *not* to be ZFS's job to detect failed disks, or to
do anything about them other than what it's told.
ZFS under FreeBSD still works despite this (and works wonderfully well) - it
just means that if any of your drives 'go out to lunch' - unless they fail in
such a way that I/O requests are returned immediately as 'failed' (i.e. I
guess if the device node has gone) - ZFS will keep issuing I/O requests to
the failed drives (and potentially pausing while it waits on them), because
it doesn't know, doesn't care - and hasn't been told to do otherwise.
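In practice, the workaround if a drive wedges like that would presumably be
to kick it out yourself - something along the lines of the following (the
channel, pool and device names are just examples, and I've not yet had to do
this in anger):

  # atacontrol detach ata4
  # zpool offline tank ad8

...so any outstanding I/O gets errored back, and ZFS stops queuing new
requests for that disk.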
-Kp