Re: Detecting failing drives - ZFS carries on regardless
Date: Wed, 19 Feb 2025 00:17:40 UTC
On 18/02/2025 17:16, Alejandro Imass wrote:
> On Tue, Feb 18, 2025 at 11:41 AM Carl Johnson <carlj@peak.org> wrote:
>
> > Frank Leonhardt <freebsd-doc@fjl.co.uk> writes:
> >
> > > As an example, the following is NOT ENOUGH for ZFS to fail a drive,
> > > because FreeBSD doesn't offline it. Would you want a drive like this
> > > in a production environment?
> > >
> > > Feb 17 16:20:48 zfs2 kernel: (da0:mps0:0:14:0): WRITE(10). CDB: 2a 00
> > > 02 c1 38 00 00 08 00 00
> > > Feb 17 16:20:48 zfs2 kernel: (da0:mps0:0:14:0): CAM status: SCSI
> > > Status Error
> > > ...
> >
> > I haven't used it, but you might want to look into zfsd and see if it
> > will help. It is supposed to watch for errors and handle them.
>
> Great thread.
>
> I am currently implementing a huge RAIDZ2 NAS and wondering a lot
> about this...
>
> Does SMART monitoring help in these scenarios?
> Or does ZFS checksumming and scrubbing detect data corruption before
> SMART does?
> In that case is zfsd the better choice? Or should you still monitor
> and warn on SMART counters? Or both?

My understanding of zfsd (which may be wrong) is that it basically handles activating a hot spare if one of the drives in a vdev fails. It also brings stuff online automatically if you insert a disk. It's only really useful if you want to use automated hot spares.

This is all good stuff, but I prefer to do things manually. For hot spares, if possible I leave a spare drive or two in an enclosure that I can bring into a vdev if I lose a drive. I'd rather make that call taking all the circumstances into account, not have a daemon do it. If you've got a spare drive for a RAIDZ2, why not make it a RAIDZ3? There's a slight performance hit, but a much reduced performance hit if a drive fails, because you're not resilvering onto the new drive.

Now here I may be wrong, but I believe zfsd listens on devctl for device errors - the same stuff I'm seeing (or not seeing) in error messages from a flaky drive. I don't think it would see any more than I'm seeing on the console. I know the drive is flaky because of the sound it makes and the fact it takes forever to complete a read/write, but it still doesn't log an error.

I'm investigating these drives. They're ex-data centre, at the far end of the bell curve, and I suspect there's something in the mode page that makes them carry on retrying and relocating blocks in the background without ever returning a sense error to the OS if they can possibly avoid it. I don't think these drives have ever erred; they've just ground the system to a halt by retrying and recovering bad blocks for minutes at a time. I just get fed up after an hour waiting for it to finish booting. Unfortunately SCSI has got a lot more complex since I last wrote a driver, and the Seagate manual is a stonking 400-page pot boiler.

SMART was a system for providing some kind of failure prediction for IDE drives, because the OS had no clue what was happening once it wasn't controlling an ST506 directly. (IDE is now known as ATA or SATA.) smartmontools was a means of interrogating the SMART data, but it's been updated to talk to SCSI drives too. You get very different stuff back from ATA and SCSI. Looking at these particular flaky SCSI drives, I'm not seeing a lot of useful information returned by smartctl. There's nothing returned by the ones doing multiple retries that's different from the ones that don't. You can also get this information using camcontrol - possibly more. Part of smartmontools is a daemon (smartd) that can email you if one of the drives it's monitoring is unhappy. I've not used this myself.
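For anyone who wants to poke at a drive the same way, these are roughly the commands I mean. da0 is just an example device, and option details vary between versions, so check the man pages rather than taking my word for it:

# health/error data smartmontools can extract (use -x for the lot)
smartctl -a /dev/da0

# the grown defect list - blocks the drive has remapped since manufacture
camcontrol defects da0 -f block -G

# the Read-Write Error Recovery mode page (page 1), which is where the
# retry behaviour lives; add -e to edit it, at your own risk
camcontrol modepage da0 -m 1

If you do want smartd nagging you by email, I gather it's configured in /usr/local/etc/smartd.conf (the ports location - again, I haven't used it myself) with a line something like:

/dev/da0 -a -m you@example.org   # monitor the drive and email on trouble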
As to ZFS checksums and scrubbing... your question assumes the SMART data (or SCSI defect data) will tell you a drive is failing! ZFS will work around a hardware failure, and if the OS takes a drive offline it will fail it in the vdev. It won't predict failures and, as I've discovered, it won't even detect a struggling drive until the OS decides it's a dud. At least not out of the box - I'm working on this.

A ZFS scrub simply reads all the stuff in the pool. If it finds an unreadable block it will deal with it by reconstructing it, or whatever it has to do. What it won't do is tell you that the next unused block it's going to write to is a dud, because it will never have tried to read it. A scrub on a nearly empty drive is really quick - it doesn't do a surface scan, it only reads the allocated blocks to make sure they're all readable.

Now all of the above is my current understanding, based on stuff I've read over the years and on experience. Things may have changed, and I may have the wrong idea about things. I'd like to hear other opinions.

Incidentally, the way I monitor problems is with a script that does a zpool status and checks for "DEGRADED", and greps /var/log/messages for "CAM status: SCSI Status Error" and similar. If it finds anything it drops me an email. Crude, but I'd say simple and effective. I also check for overheating drives using smartctl; a rough sketch of the idea is below. I'm old school. Fight me!
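Since I mentioned the script, something along these lines is the general idea. Treat it as a sketch rather than exactly what I run - the drive list, mail address, temperature threshold and temp-file handling are all made up for the example, and the temperature parsing assumes SCSI-style smartctl output:

#!/bin/sh
# Rough sketch of the checking script described above. Names and thresholds
# are illustrative - adjust for your own pools, drives and mail setup.

MAILTO="root"                   # where to send the complaint
DRIVES="da0 da1 da2"            # drives to temperature-check (example list)
MAXTEMP=45                      # degrees C before we complain
REPORT=$(mktemp -t drivecheck)

# Unhealthy vdevs in any pool go into the report
for pool in $(zpool list -H -o name); do
    zpool status "$pool" | grep -E 'DEGRADED|FAULTED|UNAVAIL' >> "$REPORT"
done

# CAM errors logged by the kernel (this naively re-reports old ones each run)
grep 'CAM status: SCSI Status Error' /var/log/messages >> "$REPORT"

# Overheating check. SCSI drives report "Current Drive Temperature: NN C";
# ATA drives print the temperature differently, so adjust the awk to suit.
for d in $DRIVES; do
    temp=$(smartctl -a /dev/$d | awk '/Current Drive Temperature/ {print $(NF-1)}')
    if [ -n "$temp" ] && [ "$temp" -gt "$MAXTEMP" ]; then
        echo "/dev/$d is running at ${temp} C" >> "$REPORT"
    fi
done

# Only send mail if there's something to say
if [ -s "$REPORT" ]; then
    mail -s "$(hostname): drive/pool trouble" "$MAILTO" < "$REPORT"
fi
rm -f "$REPORT"

Run it from cron every few minutes and it stays out of the way until something is actually wrong.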