ZFS w/failing drives - any equivalent of Solaris FMA?

Freddie Cash fjwcash at gmail.com
Fri Sep 12 15:59:09 UTC 2008


On September 12, 2008 02:45 am Karl Pielorz wrote:
> Recently, a ZFS pool on my FreeBSD box started showing lots of errors
> on one drive in a mirrored pair.
>
> The pool consists of around 14 drives (as 7 mirrored pairs), hung off
> a couple of SuperMicro 8-port SATA controllers (one drive of each pair
> is on each controller).
>
> One of the drives started picking up a lot of errors (by the end of
> things it was returning errors pretty much for any reads/writes issued)
> - and taking ages to complete the I/Os.
>
> However, ZFS kept trying to use the drive - e.g. as I attached another
> drive to the remaining 'good' drive in the mirrored pair, ZFS was still
> trying to read data off the failed drive (and remaining good one) in
> order to complete its resilver to the newly attached drive.

For the one time I've had a drive fail, and the three times I've replaced 
drives with larger ones, the process used was:

  zpool offline <pool> <old device>
  <remove old device>
  <insert new device>
  zpool replace <pool> <old device> <new device>
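
For example, with a hypothetical pool named 'tank', a failing disk ad4, 
and its replacement ad6 (names made up for illustration), that would be:

  zpool offline tank ad4       # stop ZFS from issuing I/O to the failing disk
  <physically swap the drive>
  zpool replace tank ad4 ad6   # resilver onto the new disk
  zpool status tank            # watch the resilver progress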

For one machine, I had to shut it down after the offline, as it didn't 
have hot-swappable drive bays.  The other machine did everything while 
online and running.

IOW, the old device never had a chance to interfere with anything.  Same 
process we've used with hardware RAID setups in the past.

> Is there anything similar to this on FreeBSD yet? - i.e. Does/can
> anything on the system tell ZFS "this drive is experiencing failures"
> rather than ZFS just seeing lots of timed-out I/O 'errors'? (as appears
> to be the case).

Beyond the periodic script that checks for things like this and sends 
root an e-mail, I haven't seen anything.
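
(On recent FreeBSD that should be the daily ZFS status check, enabled via 
daily_status_zfs_enable in periodic.conf, if memory serves.)  If you'd 
rather roll your own, a minimal sketch run from cron would do much the 
same thing.  It only assumes that "zpool status -x" prints "all pools are 
healthy" when nothing is wrong, and that mail(1) can deliver to root:

  #!/bin/sh
  # Mail root whenever `zpool status -x` reports an unhealthy pool.
  status=$(zpool status -x)
  if [ "$status" != "all pools are healthy" ]; then
      echo "$status" | mail -s "ZFS pool problem on $(hostname)" root
  fi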

-- 
Freddie Cash
fjwcash at gmail.com

