From: milky india
Date: Sat, 14 Jan 2023 19:42:48 +0400
Subject: Re: ZFS checksum error on 2 disks of mirror
To: freebsd@vanderzwan.org
Cc: freebsd-fs
List-Id: Filesystems
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

> Scrub is finding no errors so I think the pool and data should be healthy.

Yes, that's what I assumed as well, only to discover later that it wasn't OK.

> Scrubbing all pools roughly every 4 weeks so I'll notice if that changes.

I would probably do it sooner, and run a couple of scrubs across a couple of reboots, just to be doubly sure. I hope nothing bad comes of it and you have your peace of mind later.
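Roughly what I'd run around each of those reboots (just a sketch; "backuppool" is simply the pool name from your logs, adjust as needed):

  # start a scrub of the suspect pool
  zpool scrub backuppool

  # when it finishes, check the per-device READ/WRITE/CKSUM counters
  # and the "errors:" list at the bottom of the output
  zpool status -v backuppool

  # quick check in between: prints "all pools are healthy"
  # unless some pool has a problem
  zpool status -x

If the CKSUM columns stay at 0 and no files get listed over a couple of those scrub/reboot cycles, that would make me a lot more comfortable.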
PS: Sorry if it feels like I'm insisting, but I had a bad experience with this bug.

On Sat, Jan 14, 2023, 19:36 <freebsd@vanderzwan.org> wrote:

> Hi
>
> On 14 Jan 2023, at 16:29, milky india <milkyindia@gmail.com> wrote:
>
> > No panics on my system, it just kept running. And there is no way that I know of to reproduce it.
>
> Yes, not being able to reproduce issues is a huge problem.
> When the scrub was producing the error, do you remember the exact error message, or do you have it recorded?
>
> Scrub did not give any errors. zpool status -v showed one file with an error, but that was also gone after the scrub.
> So no evidence of any error remains except for what was logged in /var/log/messages.
>
> In this case it was a metadata-level corruption error that led to https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/, which seemed like a dead end, or in your case at least a reminder to ensure things are backed up in case the issue arises later.
>
> Scrub is finding no errors so I think the pool and data should be healthy.
>
> Scrubbing all pools roughly every 4 weeks so I'll notice if that changes.
>
> Paul
>
> Ultimately if it's zfs
> On Sat, Jan 14, 2023, 19:13 <freebsd@vanderzwan.org> wrote:
>
>> On 14 Jan 2023, at 15:57, milky india <milkyindia@gmail.com> wrote:
>>
>> > Output of zpool status -v gives no read/write/cksum errors but lists one file with an error.
>>
>> I had faced a similar issue: when I tried to delete the file the error still persisted, although I only realised that after a few shutdown cycles.
>>
>> For me, after a scrub there was no more mention of a file with an error, so I assume the error was transient.
>>
>> > After running a scrub on the pool all seems to be well, no more files with errors.
>>
>> Please monitor whether the error shows up again sometime soon. While I don't know what the issue is, ZFS error no. 97 seems like a serious bug.
>>
>> Definitely keeping a close look for this.
>>
>> Is this a similar issue to the one this PR is open for?
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268333
>>
>> No panics on my system, it just kept running. And there is no way that I know of to reproduce it.
>>
>> At the moment I suspect it was the power grid issue we had the night that error was logged.
>> A large part of the city where I live had an outage after a fire in a substation.
>> I only had a dip of about 1 s when it happened, but this server did need a reboot as it was unresponsive.
>>
>> The time of the error roughly matches the time they started restoring power to the affected parts of the city.
>> Maybe that created another event on the grid.
>>
>> The server is not behind a UPS, as the power grid is usually very reliable here in the Netherlands.
>>
>> Paul
>>
>> On Fri, Jan 13, 2023, 19:35 <freebsd@vanderzwan.org> wrote:
>>
>>> Hi,
>>> I noticed zpool status gave an error for one of my pools.
>>> Looking back in the logs I found this:
>>>
>>> Dec 24 00:58:39 freebsd ZFS[40537]: pool I/O failure, zpool=backuppool error=97
>>> Dec 24 00:58:39 freebsd ZFS[40541]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=1634427084800 size=53248
>>> Dec 24 00:58:39 freebsd ZFS[40545]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=1634427084800 size=53248
>>>
>>> These are 2 WD Red Plus 8TB drives (same age, same firmware, attached to the same controller).
>>>
>>> Looking back in the logs I found this occurred earlier without me noticing:
>>>
>>> Aug  8 03:17:56 freebsd ZFS[12328]: pool I/O failure, zpool=backuppool error=97
>>> Aug  8 03:17:56 freebsd ZFS[12332]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>> Aug  8 03:17:56 freebsd ZFS[12336]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>> Aug  8 13:37:26 freebsd ZFS[22317]: pool I/O failure, zpool=backuppool error=97
>>> Aug  8 13:37:26 freebsd ZFS[22321]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>> Aug  8 13:37:26 freebsd ZFS[22325]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>> Aug  8 15:37:44 freebsd ZFS[24704]: pool I/O failure, zpool=backuppool error=97
>>> Aug  8 15:37:44 freebsd ZFS[24708]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJL4JYGp2 offset=4056214130688 size=131072
>>> Aug  8 15:37:44 freebsd ZFS[24712]: checksum mismatch, zpool=backuppool path=/dev/gpt/VGJKNA9Gp2 offset=4056214130688 size=131072
>>>
>>> Output of zpool status -v gives no read/write/cksum errors but lists one file with an error.
>>>
>>> After running a scrub on the pool all seems to be well, no more files with errors.
>>>
>>> The system is homebuilt, with an ASRock Rack C2550 board and 16 GB of ECC RAM.
>>> Any idea how I could get checksum errors on the identical block of 2 disks in a mirror?
>>>
>>> Regards,
>>> Paul
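PPS: Until (hopefully never) it shows up again, a cheap way to keep an eye on it, assuming the events keep landing in /var/log/messages like the excerpts above, is something along these lines:

  # pull out any ZFS complaints logged since the last log rotation
  grep -E 'ZFS\[[0-9]+\]: (pool I/O failure|checksum mismatch)' /var/log/messages

  # or ask ZFS itself for its recent event log (kept in memory, so it
  # only covers events since the last boot)
  zpool events -v backuppool

Just a rough sketch, of course; adjust the pool name and pattern to taste.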