Detecting failing drives - ZFS carries on regardless
- Reply: Frank Leonhardt : "Re: Detecting failing drives - ZFS carries on regardless"
Date: Mon, 17 Feb 2025 20:52:11 UTC
I've been investigating what the current ZFS on 14.2 does with failing drives. It's a bit worrying. ZFS doesn't "fault" a drive until it's taken offline by the OS. So if you've got a flaky drive you have to wait for FreeBSD to disconnect it, and only then will ZFS notice. At least that's how I understand it.

I used to test ZFS by pulling drives, but now I have a collection of flaky drives (data centre discards) that are unreliable, and it turns out that ZFS will wait a very long time for a SAS drive to complete an operation. If the operation still fails after retries, FreeBSD logs a CAM error but ZFS still doesn't fail the drive. You can have a SAS drive rattling and groaning away, but FreeBSD patiently waits for it to complete, whether by relocating the block or by retrying multiple times, and ZFS is none the wiser. Or maybe ZFS is relocating the block after the CAM error. Either way, ZFS says the drive is "ONLINE" and carries on using it. Yikes!

************

So my question is this: Is there a way of telling FreeBSD to fail a drive at the first sign of trouble? Or better yet, if it's had more than one operation take more than ten seconds in the last hour?

************

If anyone else is interested in sharing research please get in touch.

Incidentally, smartmontools doesn't show failing drives unless an operation actually fails. I've found nothing using camcontrol. If you use a stethoscope on the drive (one of my favourite tricks) it's obvious it's not happy, but FreeBSD won't offline it until it catches fire. In fact I suspect it would need to explode before it noticed.

Thanks, Frank.
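
For illustration only, the policy asked about above (fail a drive once more than one operation has taken longer than ten seconds within the last hour) could be sketched as a small sliding-window check. This is a minimal sketch under stated assumptions, not something FreeBSD or ZFS provides: the `SlowOpPolicy` class and `offline_command` helper are hypothetical names, the pool `tank` and device `da3` are placeholders, and how per-operation latencies would be collected is left open. The only real command referenced is `zpool offline`.

```python
from collections import deque


class SlowOpPolicy:
    """Hypothetical policy: flag a drive once too many slow operations
    have been seen inside a sliding time window."""

    def __init__(self, slow_secs=10.0, window_secs=3600.0, max_slow_ops=1):
        self.slow_secs = slow_secs        # an operation slower than this counts as "slow"
        self.window_secs = window_secs    # how far back to look (one hour)
        self.max_slow_ops = max_slow_ops  # tolerate this many slow ops in the window
        self._slow = deque()              # timestamps of recent slow operations

    def record(self, now, latency):
        """Record one completed operation.

        `now` and `latency` are in seconds. Returns True when the number of
        slow operations in the window exceeds `max_slow_ops`.
        """
        if latency > self.slow_secs:
            self._slow.append(now)
        # Forget slow operations that have aged out of the window.
        while self._slow and now - self._slow[0] > self.window_secs:
            self._slow.popleft()
        return len(self._slow) > self.max_slow_ops


def offline_command(pool, device):
    """Build (but do not run) the command that would take the device offline."""
    return ["zpool", "offline", pool, device]


if __name__ == "__main__":
    policy = SlowOpPolicy()
    # Assumed sample data: (timestamp, latency) pairs in seconds.
    for when, took in [(0, 0.01), (100, 12.0), (200, 0.5), (900, 15.0)]:
        if policy.record(when, took):
            print("would run:", " ".join(offline_command("tank", "da3")))
```

The sketch only builds the command rather than running it, since automatically offlining a member of a pool with no remaining redundancy would be worse than leaving a slow drive in place.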