From: Frank Leonhardt <freebsd-doc@fjl.co.uk>
To: questions@freebsd.org
Date: Sat, 5 Apr 2025 11:40:16 +0100
Subject: Re: Sudden zpool checksums errors
Message-ID: <0fbd2584-6e07-40bf-b0e0-8d9198db100b@fjl.co.uk>
In-Reply-To: <6aeb488d-b3c3-4393-80ca-0b89c1ebc446@netfence.it>

On 04/04/2025 16:42, Andrea Venturoli wrote:
> Hello.
>
> I've got a box with two zpools:
> _ 1 mirror on 2 SSDs;
> _ 1 raidz1 on 12 HDDs.
>
> Suddenly one daily run showed the following:
>>   pool: backup
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>     attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>     using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
>>   scan: scrub repaired 3.18M in 16:53:16 with 0 errors on Tue Apr  1 20:16:55 2025
>> config:
>>
>>     NAME        STATE     READ WRITE CKSUM
>>     backup      ONLINE       0     0     0
>>       raidz1-0  ONLINE       0     0     0
>>         da4     ONLINE       0     0     0
>>         da10    ONLINE       0     0     0
>>         da5     ONLINE       0     0    57
>>         da2     ONLINE       0     0     0
>>         da8     ONLINE       0     0    25
>>         da0     ONLINE       0     0     0
>>         da1     ONLINE       0     0    49
>>         da12    ONLINE       0     0     8
>>         da6     ONLINE       0     0     6
>>         da11    ONLINE       0     0     0
>>         da9     ONLINE       0     0    56
>>         da13    ONLINE       0     0    73
>>
>> errors: No known data errors
>

Assuming you've checked the logs etc. as you say, I'd be suspicious of the HBA and cabling, and presumably a SAS expander. But IME it's well worth testing the drives: just dd them to /dev/null and see if anything squawks (a sketch of what I mean is in the P.S.). There's nothing stopping you doing this on a live ZFS pool, although maybe do them one at a time if the array is busy :-) Given the nature of SCSI, you may find the only indication that a drive isn't 100% is an unusually slow read rate.

I agree it would be a coincidence if 50% of the drives were flaky, but it does happen, or it might be that they're all on one flaky HBA connecting half of them. I can't help being drawn to the fact that it's exactly half that are throwing errors. Anyway, checking the drives out by reading them is minimal effort before diving into more esoteric reasons.

ZFS isn't as good as people think at detecting failing drives until they're actually on fire (see my posts passim on this matter).

Regards, Frank.
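
P.S. In case it's useful, here's a rough sketch of the sort of read test I mean. The device names are lifted from your zpool output, and bs=1m is only a reasonable starting point, so adjust to taste:

  # Read every member disk end to end, one at a time, throwing the data away.
  # A struggling drive will either log errors to the console or finish with
  # a suspiciously low transfer rate in dd's summary line.
  for d in da0 da1 da2 da4 da5 da6 da8 da9 da10 da11 da12 da13; do
      echo "=== $d ==="
      dd if=/dev/$d of=/dev/null bs=1m
  done
  # Afterwards, check the kernel log for anything the drives reported:
  dmesg | grep -i error

Hit Ctrl-T while a dd is running if you want a progress report. It's entirely safe to do with the pool online, though on a 12-drive raidz it will take a good while, so run it when the box is quiet if you can.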