From: Frank Leonhardt <freebsd-doc@fjl.co.uk>
To: questions@freebsd.org
Date: Sat, 5 Apr 2025 11:40:16 +0100
Subject: Re: Sudden zpool checksums errors
Message-ID: <0fbd2584-6e07-40bf-b0e0-8d9198db100b@fjl.co.uk>
In-Reply-To: <6aeb488d-b3c3-4393-80ca-0b89c1ebc446@netfence.it>

On 04/04/2025 16:42, Andrea Venturoli wrote:
> Hello.
>
> I've got a box with two zpools:
> _ 1 mirror on 2 SSDs;
> _ 1 raidz1 on 12 HDDs.
>
> Suddenly one daily run showed the following:
>>   pool: backup
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>     attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>     using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
>>   scan: scrub repaired 3.18M in 16:53:16 with 0 errors on Tue Apr  1 20:16:55 2025
>> config:
>>
>>     NAME        STATE     READ WRITE CKSUM
>>     backup      ONLINE       0     0     0
>>       raidz1-0  ONLINE       0     0     0
>>         da4     ONLINE       0     0     0
>>         da10    ONLINE       0     0     0
>>         da5     ONLINE       0     0    57
>>         da2     ONLINE       0     0     0
>>         da8     ONLINE       0     0    25
>>         da0     ONLINE       0     0     0
>>         da1     ONLINE       0     0    49
>>         da12    ONLINE       0     0     8
>>         da6     ONLINE       0     0     6
>>         da11    ONLINE       0     0     0
>>         da9     ONLINE       0     0    56
>>         da13    ONLINE       0     0    73
>>
>> errors: No known data errors
>

Assuming you've checked the logs etc. as you say, I'd be suspicious of the HBA and cabling, and presumably a SAS expander. But IME it's well worth testing the drives: just dd them to /dev/null and see if anything squawks (a sketch of what I mean is in the P.S.). There's nothing stopping you doing this on a live ZFS pool, although maybe do them one at a time if the array is busy :-) Given the nature of SCSI, you may find the only indication that a drive isn't 100% is an unusually slow read rate.

I agree it would be a coincidence if 50% of the drives were flaky, but it does happen, or it might be that they're all on one flaky HBA connecting half of them. I can't help being drawn to the fact that it's exactly half that are throwing errors. Anyway, checking the drives out by reading them is minimal effort before diving into more esoteric reasons.

ZFS isn't as good as people think at detecting failing drives until they're actually on fire (see my posts passim on this matter).

Regards, Frank.
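
P.S. In case it's useful, here's a rough sketch of the sort of read test I mean. The device names are lifted from your zpool output, and bs=1m is only a reasonable starting point, so adjust to taste:

  # Read every member disk end to end, one at a time, throwing the data away.
  # A struggling drive will either log errors to the console or finish with
  # a suspiciously low transfer rate in dd's summary line.
  for d in da0 da1 da2 da4 da5 da6 da8 da9 da10 da11 da12 da13; do
      echo "=== $d ==="
      dd if=/dev/$d of=/dev/null bs=1m
  done
  # Afterwards, check the kernel log for anything the drives reported:
  dmesg | grep -i error

Hit Ctrl-T while a dd is running if you want a progress report. It's entirely safe to do with the pool online, though on a 12-drive raidz it will take a good while, so run it when the box is quiet if you can.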