Detecting failing drives - ZFS carries on regardless

From: Frank Leonhardt <freebsd-doc_at_fjl.co.uk>
Date: Mon, 17 Feb 2025 20:52:11 UTC
I've been investigating what the current ZFS on FreeBSD 14.2 does with 
failing drives. It's a bit worrying.

ZFS doesn’t "fault" a drive until it's taken offline by the OS. So if 
you've got a flaky drive you have to wait for FreeBSD to disconnect it, 
and then ZFS will notice. At least that's how I understand it.

I used to test ZFS by pulling drives, but now I have a collection of 
flaky drives (data centre discards) that are unreliable, and it turns 
out that ZFS will wait a very long time for a SAS drive to complete an 
operation. If the operation still fails after retries, FreeBSD logs a CAM 
error but ZFS still doesn't fault the drive. You can have a SAS drive 
rattling and groaning away, but FreeBSD patiently waits for it to 
complete, whether by relocating the block or by retrying repeatedly, and 
ZFS is none the wiser. Or maybe ZFS is relocating the block after the CAM 
error. Either way, ZFS says the drive is "ONLINE" and carries on using it. 
Yikes!

************

So my question is this: Is there a way of telling FreeBSD to fail a 
drive at the first sign of trouble? Or better yet, if it's had more than 
one operation take more than ten seconds in the last hour?

************
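
To pin down the second rule, here is a minimal Python sketch of the 
policy only: "more than one operation slower than ten seconds within the 
last hour". It does not hook into FreeBSD, CAM or ZFS. The class name 
SlowOpMonitor and its parameters are made up for illustration, and where 
the per-operation timings would come from is left open.

```python
from collections import deque


class SlowOpMonitor:
    """Sliding-window counter of slow operations for one drive.

    Hypothetical helper: the names and defaults are illustrative and are
    not part of FreeBSD, CAM or ZFS.
    """

    def __init__(self, slow_threshold_s=10.0, window_s=3600.0, max_slow_ops=1):
        self.slow_threshold_s = slow_threshold_s  # "more than ten seconds"
        self.window_s = window_s                  # "in the last hour"
        self.max_slow_ops = max_slow_ops          # "more than one operation"
        self._slow_times = deque()                # completion times of slow ops

    def record(self, finished_at, duration_s):
        """Record one completed operation.

        finished_at is a timestamp in seconds and must not go backwards
        between calls. Returns True when the drive breaks the policy.
        """
        if duration_s > self.slow_threshold_s:
            self._slow_times.append(finished_at)
        # Forget slow operations that have aged out of the window.
        while self._slow_times and finished_at - self._slow_times[0] >= self.window_s:
            self._slow_times.popleft()
        return len(self._slow_times) > self.max_slow_ops


if __name__ == "__main__":
    monitor = SlowOpMonitor()
    # (completion time in seconds, duration in seconds), invented sample data
    samples = [(0.0, 0.02), (100.0, 12.0), (200.0, 15.0)]
    for finished_at, duration_s in samples:
        if monitor.record(finished_at, duration_s):
            print(f"policy tripped at t={finished_at}")
```

What to do when the policy trips is a separate decision. Taking the 
device out of service by hand with "zpool offline" would be one option, 
but that is my assumption rather than anything FreeBSD does today.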

If anyone else is interested in sharing research please get in touch.

Incidentally, smartmontools doesn't show failing drives unless an operation 
actually fails. I've found nothing using camcontrol. If you use a 
stethoscope on the drive (one of my favourite tricks) it's obvious it's 
not happy but FreeBSD won't offline it until it catches fire. In fact I 
suspect it would need to explode before it noticed.

Thanks, Frank.