ZFS w/failing drives - any equivalent of Solaris FMA?

Karl Pielorz kpielorz_lst at tdx.co.uk
Fri Sep 12 09:59:04 UTC 2008


Hi,

Recently, a ZFS pool on my FreeBSD box started showing lots of errors on 
one drive in a mirrored pair.

The pool consists of 14 drives (7 mirrored pairs), hung off a couple of 
SuperMicro 8-port SATA controllers (one drive of each pair on each 
controller).
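
(For reference, a layout like this would have been built with something 
along these lines - the device names here are made up, not my actual ones:

    # one vdev per mirrored pair, one disk from each controller
    zpool create tank \
        mirror da0 da7  mirror da1 da8  mirror da2 da9 \
        mirror da3 da10 mirror da4 da11 mirror da5 da12 \
        mirror da6 da13
)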

One of the drives started picking up a lot of errors (by the end it was 
returning errors for pretty much any read or write issued) - and taking 
ages to complete the I/Os.

However, ZFS kept trying to use the drive - e.g. when I attached another 
drive to the remaining 'good' drive in the mirrored pair, ZFS was still 
trying to read data off the failed drive (as well as the remaining good 
one) in order to complete its re-silver to the newly attached drive.
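
(Using the same made-up device names as above - da1 being the failing disk 
and da8 its good partner - that attach was just:

    # attach a new disk (da14) alongside the surviving half of the mirror;
    # ZFS re-silvers onto it from there
    zpool attach tank da8 da14
)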

Having posted on the OpenSolaris ZFS list, it appears that under Solaris 
there's an 'FMA engine' which communicates drive failures and the like to 
ZFS - advising ZFS when a drive should be marked as 'failed'.

Is there anything similar to this on FreeBSD yet? i.e. does/can anything 
on the system tell ZFS "this drive is experiencing failures", rather than 
ZFS just seeing lots of timed-out I/O 'errors' (as appears to be the case)?
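
(The closest thing I know of right now is doing it by hand once you notice 
the errors, e.g.:

    # manually take the suspect disk out of service,
    # so ZFS stops issuing I/O to it
    zpool offline tank da1

but that still relies on a human spotting the problem first.)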

In the end, the failing drive was timing out literally every I/O - I did 
recover the situation by detaching it from the pool (which hung the machine 
- probably caused by ZFS having to update the metadata on all drives, 
including the failed one). A reboot brought the pool back, minus the 
'failed' drive, so enough of the 'detach' must have completed.
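
(Again with the made-up names, that was simply:

    # remove the failing half of the mirror from the pool
    zpool detach tank da1

    # then, after the reboot, check what the pool looks like
    zpool status -v tank
)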

The newly attached drive completed the re-silver in half an hour (as 
opposed to an estimated 755 hours and climbing with the other drive still 
in the pool, limping along).
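
(The progress and time-remaining figures are the ones reported by:

    # shows re-silver progress and an estimated time to completion
    zpool status tank
)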

-Kp


