Possible zpool online, resilvering issue

Wed Aug 10 18:56:21 UTC 2016

On 2016-08-04 07:22, Ultima wrote:
> Hello,
> 
> I recently had some issues with a PSU and ran several scrubs on a pool of
> around 35T. Random drives would drop and require a zpool online, which found
> checksum errors (as expected). However, after all the scrubs I ran, I think
> I may have found a bug in the zpool online resilvering process.
> 
> 24 disks total, 4 vdevs raidz2 (6 drives each).
> 
> Before this next part... I had a backup PSU, however it was also going bad
> and was waiting on an RMA. The current one seemed to be dying but ran fine
> with fewer drives. So I decided I would run the server short 4 drives.
> 
> Started by offlining 4 drives (or they were already removed from the PSU)
> from different vdevs, then ran a scrub to verify everything. Many checksum
> errors were present on some of the drives, but this was expected due to the
> faulty PSU. Then offlined 4 different drives, onlined the other 4, and
> scrubbed once again. After the resilver, again, many checksum errors on
> these drives, as expected.
> 
> After the scrub completed, I decided to offline 4 different drives, then
> online the ones that had been out of the pool for a while. During the
> resilver, checksum errors were once again found. I was surprised given the
> recent scrub, so I decided to run another scrub, and it found even more
> checksum errors on these recently onlined drives. I didn't think much about
> it, however after the replacement PSU arrived, I onlined all the drives that
> were out of the pool and again, the resilver had checksum errors, as did
> another scrub, which found more checksum errors.
> 
> Is this issue known? Is it common for a scrub to be required after onlining
> a disk that was out of pool for some time?
> 
> The drives are ST4000NM0033, and until recently have never had a single
> checksum error in their lifetime (at least with zfs).
> FreeBSD S1 12.0-CURRENT FreeBSD 12.0-CURRENT #19 r303224: Sat Jul 23
> 10:41:12 EDT 2016
> root@S1:/usr/src/head/obj/usr/src/head/src/sys/MYKERNEL-NODEBUG
>  amd64
> 
> 
> Sorry for the wall of text, but I hope this helps in tracking down this
> possible bug.
> 

Perhaps one or more of the drives is running out of spare sectors for
reallocation? I once had a case where smartctl showed no issues but a zfs
scrub showed a defect. Some weeks later smartctl was showing some reallocated
sectors, and one week after that the HD was out of spare sectors.

Have you already tested every single HD for SMART issues?
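
A minimal sketch of such a per-drive check, in Python, assuming smartmontools
is installed and the drives report the usual ATA attribute table from
"smartctl -A". The function names, the list of watched attributes and the
device paths are my own assumptions for illustration, and as the case above
shows, a clean result does not prove a drive is healthy.

```python
import subprocess

# Attributes that usually point at failing media. This selection is an
# assumption; not every drive reports all of them.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` ATA output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10
        # columns; the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def suspicious(attrs):
    """Return only the watched attributes with a non-zero raw value."""
    return {k: v for k, v in attrs.items() if k in WATCHED and v > 0}


def check_device(dev):
    """Run smartctl on one device (needs root) and return suspicious attrs."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    return suspicious(parse_smart_attributes(out))
```

Calling check_device() once per disk (for example "/dev/ada0" or "/dev/da0",
depending on how the drives are attached) and repeating that over a few weeks
would show whether the reallocated sector count is creeping up on any drive.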

-- 
olli