Constant minor ZFS corruption
Mike Tancsa
mike at sentex.net
Wed Mar 9 14:04:12 UTC 2011
On 3/9/2011 7:41 AM, Stephen McKay wrote:
> On Tuesday, 8th March 2011, Chris Forgeron wrote:
>
>> Have you made sure it's not always the same drives with the checksum
>> errors? It may take a few days to know for sure..
>
> Of the 12 disks, only 1 has been error-free. I've been doing this for
> about 10 days now and there is no pattern that I can see in the errors.
>
We went through something similar on our offsite/DR backup server just
last week. I don't have as many disks as you, but here is the pool:
0(offsite)# zpool status
  pool: tank1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#
After adding a larger case for future expansion, we found the next day
that we were seeing all sorts of random errors, like
Mar 3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48
status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2281852580
Mar 3 06:11:59 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying (1
retry left) LBA=2292675553
Mar 3 06:11:59 offsite kernel: ad1: FAILURE - WRITE_DMA48
status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2292675553
Mar 3 06:23:54 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying (1
retry left) LBA=2292734035
Mar 3 06:23:54 offsite kernel: ad1: FAILURE - WRITE_DMA48
status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2292734035
and
Mar 4 08:56:15 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801e2000 serr 00000000
Mar 4 09:18:33 offsite kernel: siisch1: Timeout on slot 26
Mar 4 09:18:33 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801b2000 serr 00000000
Mar 4 09:21:09 offsite kernel: siisch1: Timeout on slot 26
Mar 4 09:21:09 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801d2000 serr 00000000
Mar 4 09:22:44 offsite kernel: siisch1: Timeout on slot 26
Mar 4 09:22:44 offsite kernel: siisch1: siis_timeout is 00040000 ss
04000000 rs 04000000 es 00000000 sts 801d2000 serr 00000000
Mar 4 09:23:16 offsite kernel: siisch1: Timeout on slot 30
Mar 4 09:23:16 offsite kernel: siisch1: siis_timeout is 00040000 ss
40000000 rs 40000000 es 00000000 sts 801a2000 serr 00000000
These showed up on multiple disks and on multiple controllers. I have
disks off the motherboard and off 2 port multipliers (PMPs) on a SiI3124
controller.
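
For anyone chasing a similar problem, tallying the kernel messages per
device or channel is a quick way to see whether the errors cluster on one
disk or spread across controllers. What follows is only a minimal sketch in
Python, not something we ran at the time. It assumes each log entry sits on
a single line (as in /var/log/messages, not wrapped as in this mail), and
the function name tally_errors is made up for illustration.

```python
import re
from collections import Counter

# Captures the device/channel name and the event keyword that follow
# "kernel:" in entries such as:
#   ... kernel: ad1: FAILURE - WRITE_DMA48 ...
#   ... kernel: siisch1: Timeout on slot 26
# The "siis_timeout is ..." detail lines are deliberately not counted.
EVENT = re.compile(r"kernel: (\w+): (FAILURE|TIMEOUT|Timeout)\b")


def tally_errors(lines):
    """Count FAILURE/TIMEOUT events per (device, kind) pair."""
    counts = Counter()
    for line in lines:
        match = EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2).upper())] += 1
    return counts


# A few entries from the logs above, rejoined onto single lines.
sample = [
    "Mar 3 05:34:47 offsite kernel: ad1: FAILURE - WRITE_DMA48 "
    "status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=2281852580",
    "Mar 3 06:11:59 offsite kernel: ad1: TIMEOUT - WRITE_DMA48 retrying "
    "(1 retry left) LBA=2292675553",
    "Mar 4 09:18:33 offsite kernel: siisch1: Timeout on slot 26",
    "Mar 4 09:18:33 offsite kernel: siisch1: siis_timeout is 00040000 ss "
    "04000000 rs 04000000 es 00000000 sts 801b2000 serr 00000000",
]

for (device, kind), count in sorted(tally_errors(sample).items()):
    print(device, kind, count)
```

To run it against a real log, pass an open file instead of the sample
list, e.g. tally_errors(open("/var/log/messages")). Errors spread over
several devices and both controllers, as we saw, point toward something
shared, such as power or cabling, more than toward a single bad disk.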
We narrowed it down to 2 problems: a failing/marginal power supply and
bad SATA cables. After changing the power supply, we still had a few
disk errors, even though smartctl reported no errors on any of the
disks. We then changed the SATA cables, and those errors went away too.
After almost 5 days of uptime, no problems at all now. Not one error.
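
To keep an eye on whether errors creep back, one option is to scan the
"zpool status" output for any device whose READ, WRITE or CKSUM counter is
nonzero. This is a rough sketch only: the function names are hypothetical,
and it assumes the counters are plain integers laid out as in the output
earlier in this mail (abbreviated counters are skipped).

```python
import subprocess


def nonzero_counters(status_text):
    """Return {name: (read, write, cksum)} for rows with a nonzero counter."""
    bad = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True   # device rows start after this header
            continue
        if fields and fields[0] == "errors:":
            in_config = False  # the config table has ended
        if not in_config or len(fields) < 5:
            continue
        name, counters = fields[0], fields[2:5]
        # Only plain integer counters are handled in this sketch.
        if all(c.isdigit() for c in counters):
            read, write, cksum = (int(c) for c in counters)
            if read or write or cksum:
                bad[name] = (read, write, cksum)
    return bad


def check_pool(pool):
    """Run 'zpool status <pool>' and report rows with nonzero counters."""
    result = subprocess.run(
        ["zpool", "status", pool],
        capture_output=True, text=True, check=True,
    )
    return nonzero_counters(result.stdout)
```

Calling check_pool("tank1") on the box above should return an empty dict
while the pool stays clean. Counters persist until they are cleared with
"zpool clear", so a scrub followed by this check gives a reasonable
before/after comparison when swapping hardware.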
---Mike
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike at sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/