zfs mirror pool online but drives have read errors
Date: Sat, 26 Mar 2022 16:45:57 UTC
Hi all,
English is not my native language, so sorry about any errors.
I'm experiencing something which I don't fully understand; maybe someone
here can offer some insight.
I have a ZFS mirror of two Samsung 980 Pro 2TB NVMe drives. According to
zfs the pool is ONLINE, but it repaired 54M on the last scrub, and another
scrub I ran today again needed repairs (only 128K this time).
  pool: zextra
 state: ONLINE
  scan: scrub repaired 54M in 0 days 00:41:42 with 0 errors on Thu Mar 24 09:44:02 2022
config:

        NAME        STATE     READ WRITE CKSUM
        zextra      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvd2    ONLINE       0     0     0
            nvd3    ONLINE       0     0     0

errors: No known data errors
In dmesg I have messages like this:
nvme2: UNRECOVERED READ ERROR (02/81) sqid:3 cid:80 cdw0:0
nvme2: READ sqid:8 cid:119 nsid:1 lba:3831589512 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:8 cid:119 cdw0:0
nvme2: READ sqid:2 cid:123 nsid:1 lba:186822304 len:256
nvme2: UNRECOVERED READ ERROR (02/81) sqid:2 cid:123 cdw0:0
nvme2: READ sqid:5 cid:97 nsid:1 lba:186822560 len:256
Also for the other drive:
nvme3: READ sqid:7 cid:84 nsid:1 lba:1543829024 len:256
nvme3: UNRECOVERED READ ERROR (02/81) sqid:7 cid:84 cdw0:0
smartctl does see the errors (but still says "SMART overall-health
self-assessment test result: PASSED"):
Media and Data Integrity Errors: 190
Error Information Log Entries: 190
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc         LBA  NSID  VS
  0        190     1  0x006e  0xc502  0x000  3649951416     1   -
  1        189     6  0x0067  0xc502  0x000  2909882960     1   -
And for the other drive:
Media and Data Integrity Errors: 284
Error Information Log Entries: 284
Is the following thinking somewhat correct?
- ZFS doesn't remove the drives because there are no write errors, and I've
been lucky so far in that the read errors were repairable from the other
half of the mirror (see the commands sketched after this list).
- Both drives are unreliable; if it were a hardware problem (both sit on a
PCIe card, not the motherboard) or a software problem elsewhere, smartctl
would not find these errors in the drives' own logs.
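I suppose zpool itself can confirm the repairs; something like this is what
I would look at (assuming OpenZFS 2.x on FreeBSD 13, where zpool events
exists):

  zpool status -v zextra   # per-vdev READ/WRITE/CKSUM counters and any damaged files
  zpool events -v          # event log, including checksum/io errors ZFS saw and handled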
I'll replace one drive and see if the errors go away for that drive; if
this works I'll replace the other one as well. I have this same setup on
another machine, and that one is error free.
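Roughly what I plan to do for the swap (the new device name is just a
placeholder; it depends on what the new drive shows up as):

  zpool offline zextra nvd2        # take the suspect drive out of the mirror
  (physically swap the NVMe drive)
  zpool replace zextra nvd2 nvd4   # nvd4 = placeholder for the new device
  zpool status zextra              # watch the resilver complete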
Could more expensive SSDs have made a difference here? According to
smartctl I've now written 50TB, and these drives should be good for
1200TBW, so only about 4% of the rated endurance has been used.
I back up the drives by making a snapshot and then using "zfs send >
imgfile" to a hard drive. What would have happened here if more and more
read errors had occurred?
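For reference, the backup is roughly this (dataset and destination paths
simplified here):

  zfs snapshot zextra@backup
  zfs send zextra@backup > /backup/zextra.img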
I may change this to a separate imgfile for even and odd days, or even one
for every day of the week if I have enough room for that.
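Something like this sketch is what I have in mind for the per-day images
(snapshot and file names are made up):

  #!/bin/sh
  # keep one full image per day of the week; week-old ones get overwritten
  DAY=$(date +%a)                                # Mon, Tue, ...
  zfs destroy zextra@backup-"$DAY" 2>/dev/null   # drop last week's snapshot, if any
  zfs snapshot zextra@backup-"$DAY"
  zfs send zextra@backup-"$DAY" > /backup/zextra-"$DAY".img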
Thanks for any input.
Bram