Re: Detecting failing drives - ZFS carries on regardless

From: Frank Leonhardt <freebsd-doc_at_fjl.co.uk>
Date: Wed, 19 Feb 2025 00:17:40 UTC
On 18/02/2025 17:16, Alejandro Imass wrote:
> On Tue, Feb 18, 2025 at 11:41 AM Carl Johnson <carlj@peak.org> wrote:
>
>     Frank Leonhardt <freebsd-doc@fjl.co.uk> writes:
>
>     > As an example, the following is NOT ENOUGH for ZFS to fail a drive,
>     > because FreeBSD doesn't offline it. Would you want a drive like this
>     > in a production environment?
>     >
>     > Feb 17 16:20:48 zfs2 kernel: (da0:mps0:0:14:0): WRITE(10). CDB:
>     2a 00
>     > 02 c1 38 00 00 08 00 00
>     > Feb 17 16:20:48 zfs2 kernel: (da0:mps0:0:14:0): CAM status: SCSI
>     > Status Error
>     > ...
>
>     I haven't used it, but you might want to look into zfsd and see if it
>     will help.  It is supposed to watch for errors and handle them.
>
>
> Great thread.
>
> I am currently implementing a huge RAIDZ2 NAS and wondering a lot 
> about this...
>
> Does SMART monitoring help in these scenarios?
> Or does ZFS checksumming and scrubbing detect data corruption before 
> SMART does?
> In that case zfsd is the better choice ? or should you still monitor 
> and warn on SMART counters ?
> or both?

My understanding of zfsd (which may be wrong) is that it basically 
handles activating a hot spare if one of the drives in a vdev fails. It 
also brings stuff online automatically if you insert a disk. It's only 
really useful if you want to use automated hot spares.
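For anyone who does want to try it, the setup is only a couple of lines. 
A rough sketch (the pool and device names here - "tank", "da8" - are 
obviously placeholders):

    # Enable and start the ZFS fault management daemon
    sysrc zfsd_enable="YES"
    service zfsd start

    # Give the pool a hot spare for zfsd to pull in when a drive fails
    zpool add tank spare da8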

This is all good stuff, but I prefer to do things manually. For hot 
spares, if possible I leave a spare drive or two in an enclosure that 
I can bring into a vdev if I lose a drive. I'd rather make that call 
taking all the circumstances into account, not have a daemon do it. If 
you've a spare drive for a RAIDZ2, why not make it a RAIDZ3? There's a 
slight performance hit, but a much reduced performance hit if a drive 
fails, because you're not resilvering the new drive.
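To make that concrete, with seven disks the two layouts look like this 
(sketch; pool and device names made up):

    # Six disks as RAIDZ2 plus a hot spare that sits idle until needed
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5 spare da6

    # The same seven disks as RAIDZ3 - one extra disk of parity, and
    # no resilver to sit through when the first drive dies
    zpool create tank raidz3 da0 da1 da2 da3 da4 da5 da6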

Now here I may be wrong, but I believe zfsd listens on devctl for device 
errors - the same stuff I'm seeing (or not seeing) in error messages 
from a flaky drive. I don't think it would see any more than I'm seeing 
on the console. I know the drive is flaky because of the sound it makes 
and the fact it takes forever to complete a read/write, but it still 
doesn't log an error.
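If you want to see what zfsd would be seeing, you can eavesdrop on the 
same event stream yourself. Something like this should do it, assuming 
the default devd socket location:

    # devd re-broadcasts kernel devctl events on a local socket;
    # this just prints them as they arrive
    nc -U /var/run/devd.pipe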

I'm investigating these drives. They're ex-data centre at the far end of 
the bell curve, and I suspect there's something in the mode page that 
makes them carry on retrying and relocating blocks in the background 
without ever returning a sense error to the OS if it can possibly avoid 
it. I don't think these drives have ever reported an error, they've just 
ground the system to a halt by retrying and recovering bad blocks for 
minutes at a time. I just get fed up after an hour waiting for it to 
finish booting.
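My current suspect is the read-write error recovery mode page (page 
0x01 - the AWRE/ARRE bits, retry counts and recovery time limit). 
camcontrol will show it, and -e will even let you edit it, though I'd 
call that experimental on someone else's cast-offs:

    # Dump mode page 0x01 (read-write error recovery) for da0
    camcontrol modepage da0 -m 1

    # Same page opened in $EDITOR so the retry/recovery fields can be
    # changed - use with care
    camcontrol modepage da0 -m 1 -e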

Unfortunately SCSI has got a lot more complex since I last wrote a 
driver, and the Seagate manual is a stonking 400-page pot boiler.

SMART was a system for providing some kind of failure prediction for IDE 
drives, as the OS had no clue what was happening once it was no longer 
controlling an ST506 interface directly. (IDE is now known as ATA or 
SATA.) Smartmontools was a means of interrogating the SMART data, but 
it's been updated to talk to SCSI drives too. You get very different 
stuff back between ATA and SCSI.

Looking at these particular flaky SCSI drives, I'm not seeing a lot of 
useful information returned by smartctl. The ones doing multiple retries 
return nothing different from the ones that don't. You can also get this 
information using camcontrol - possibly more.
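For reference, the sort of thing I'm looking at (device name is just an 
example):

    # SMART/health output - for a SCSI drive this is mostly the error
    # counter logs plus a grown defect list count
    smartctl -a /dev/da0

    # The grown defect list itself, via CAM
    camcontrol defects da0 -f phys -G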

Part of smartmontools is a daemon (smartd) that can email you if one of 
the drives it's monitoring is unhappy. I've not used this myself.
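If anyone wants to try it, it's roughly this (untested by me; the 
address is a placeholder):

    # Monitor every drive smartd can find and mail when something trips
    echo 'DEVICESCAN -a -m admin@example.com' > /usr/local/etc/smartd.conf

    # Enable and start the daemon
    sysrc smartd_enable="YES"
    service smartd start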

As to ZFS checksums and scrubbing... your question assumes the SMART 
data (or SCSI defect data) will tell you a drive is failing!

ZFS will work around a hardware failure, and if the OS takes a drive 
offline it will fail it in the vdev. It won't predict failures and, as 
I've discovered, won't even detect a struggling drive until the OS 
decides it's a dud. At least not out of the box - I'm working on that.

A ZFS scrub simply reads all the stuff in the pool. If it finds an 
unreadable block it will deal with it by reconstructing it or whatever 
it has to. What it won't do is tell you that the next unused block it's 
going to write to is a dud, because it won't have ever tried to read it. 
A scrub on a nearly empty drive is really quick - it doesn't do a 
surface scan, only reads the allocated blocks to make sure they're all 
reachable.
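Kicking one off and keeping an eye on it is just:

    # Start a scrub of the pool and see how it's getting on
    zpool scrub tank
    zpool status -v tank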

Now all of the above is my current understanding based on stuff I've 
read over the years and experience. Things may have changed, and I may 
have the wrong idea about things. I'd like to hear other opinions.

Incidentally, the way I monitor problems is using a script that does a 
zpool status and checks for "DEGRADED", and greps /var/log/messages for 
"CAM status: SCSI Status Error" and similar. If it finds anything it 
drops me an email. Crude, but I'd say simple and effective. I also check 
for overheating drives using smartctl. I'm old school. Fight me!
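
For what it's worth, a stripped-down sketch of that sort of script (pool 
name, grep patterns and address are whatever suits you):

    #!/bin/sh
    # Crude health check: email if the pool is degraded or the kernel
    # has logged CAM/SCSI errors. Add grep patterns to taste. It will
    # keep nagging about old log entries until they rotate - crude,
    # as advertised.

    POOL=tank
    MAILTO=root

    STATUS=$(zpool status "$POOL")
    ERRS=$(grep "CAM status: SCSI Status Error" /var/log/messages)

    if echo "$STATUS" | grep -q DEGRADED || [ -n "$ERRS" ]; then
        {
            echo "$STATUS"
            echo ""
            echo "$ERRS"
        } | mail -s "Disk trouble on $(hostname)" "$MAILTO"
    fi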