From nobody Tue May 17 23:38:53 2022
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 224496] mpr and mps drivers seems to have issues with large seagate drives
Date: Tue, 17 May 2022 23:38:53 +0000
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 11.1-STABLE
X-Bugzilla-Severity: Affects Some People
X-Bugzilla-Status: New
X-Bugzilla-Who: contact@jerratt.com
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224496

JerRatt IT changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |contact@jerratt.com

--- Comment #54 from JerRatt IT ---
I'm reporting either the same or a similar issue. Here are my findings; please let me know if my plan sounds correct.

Setup:
TrueNAS SCALE 22.02.0.1
AMD Threadripper 1920X
ASRock X399 Taichi
128GB (8x16GB) Crucial CT8G4WFD824A Unbuffered ECC
AVAGO/LSI 9400-8i SAS3408 12Gbps HBA adapter
Supermicro BPN-SAS3-743A 8-port SAS3/SAS2/SATA 12Gbps backplane
8 x Seagate Exos X18 18TB HDD ST18000NM004J SAS 12Gbps 512e/4Kn
2 x Crucial 120GB SSD
2 x Crucial 1TB SSD
2 x Western Digital 960GB NVMe
Supermicro 4U case w/2000W redundant power supply

The system is connected to a large APC data-center battery system and power conditioner, in an HVAC-controlled area. All hard drives have the newest firmware and are formatted with 4K sectors, both logical and native. The controller has the newest firmware, both regular and legacy ROMs, and is flashed to SATA/SAS-only mode (dropping the NVMe multi/tri-mode option that the new 9400-series cards support).

Running any kind of heavy I/O against the 18TB drives that are connected to the BPN-SAS3-743A backplane and through to the LSI 9400-8i HBA eventually results in the drives resetting. This happens even without the drives assigned to any ZFS pool, and it happens whether I run the commands from the shell within the GUI or from the shell itself. It affects all of the drives, which are split across two separate SFF8643 cables going to a backplane with two separate SFF8643 ports.

To trigger it, I can either run badblocks on each drive (using: badblocks -c 1024 -w -s -v -e 1 -b 65536 /dev/sdX) or simply start a SMART extended/long test.

Eventually, all or nearly all drives reset and even spin down (according to the shell logs). Sometimes they reset in batches while others continue chugging along. This has made it impossible to complete any SMART extended test. badblocks fails out, reporting too many bad blocks on multiple drives at nearly the exact same moment, yet consecutive badblocks scans won't report bad blocks in the same areas. The SMART test just shows "aborted, drive reset?" as the result.
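
For reference, here is a minimal sketch of the per-drive SMART loop described above (illustrative only: it assumes smartmontools' smartctl is available, and /dev/sda through /dev/sdd are placeholder names for the actual Exos device nodes, not the exact invocation used here):

#!/bin/sh
# Placeholder device list; substitute the real /dev/sdX names.
DRIVES="/dev/sda /dev/sdb /dev/sdc /dev/sdd"

for d in $DRIVES; do
    echo "=== $d ==="
    # Confirm firmware/revision and logical/physical block sizes.
    smartctl -i "$d" | grep -Ei 'firmware|revision|block size|sector size'
    # Kick off a drive-internal extended (long) self-test.
    smartctl -t long "$d"
done

# Hours later, check whether each test completed or was aborted by a reset:
for d in $DRIVES; do
    echo "=== $d ==="
    smartctl -l selftest "$d"
done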

My plan is to replace the HBA with an older LSI 9305-16i, replace the two SFF8643-to-SFF8643 cables going from the HBA to the backplane just for good measure, install two different SFF8643-to-SFF8482 cables that bypass the backplane entirely, move four of the existing Seagate 18TB drives onto those backplane-bypassing connections, and add four new WD Ultrastar DC HC550 (WUH721818AL5204) drives into the mix (some using the backplane, some not). That should reveal whether this is a compatibility/bug issue with all large drives, or only certain large drives, on an LSI controller, the mpr driver, and/or this backplane.

If none of that works, or it doesn't eliminate all the potential points of failure, I'm left with nothing but subpar workarounds, such as just using the onboard SATA ports, disabling NCQ in the LSI controller (a host-side sketch of that follows the log excerpt below), or setting up an L2ARC cache (or I might try a metadata cache to see whether that circumvents the issue as well).

Condensed logs when one drive errors out:

sd 0:0:0:0: device_unblock and setting to running, handle(0x000d)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
~ ~ ~ ~
sd 0:0:0:0: Power-on or device reset occurred .......ready
sd 0:0:6:0: device_block, handle(0x000f)
sd 0:0:9:0: device_block, handle(0x0012)
sd 0:0:10:0: device_block, handle(0x0014)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
sd 0:0:9:0: device_unblock and setting to running, handle(0x0012)
sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
sd 0:0:10:0: device_unblock and setting to running, handle(0x0014)
sd 0:0:9:0: Power-on or device reset occurred
sd 0:0:6:0: Power-on or device reset occurred
sd 0:0:10:0: Power-on or device reset occurred
scsi_io_completion_action: 5 callbacks suppressed
sd 0:0:10:0: [sdd] tag#5532 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5532 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5532 Add. Sense: Logical unit not ready, additional power granted
sd 0:0:10:0: [sdd] tag#5532 CDB: Write(16) 8a 00 00 00 00 00 5c 75 7a 12 00 00 01 40 00 00
print_req_error: 5 callbacks suppressed
blk_update_request: I/O error, dev sdd, sector 12409622672 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
sd 0:0:10:0: [sdd] tag#5533 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
sd 0:0:10:0: [sdd] tag#5533 Sense Key : Not Ready [current] [descriptor]
sd 0:0:10:0: [sdd] tag#5533 Add. Sense: Logical unit not ready, additional power use not yet granted
sd 0:0:10:0: [sdd] tag#5533 CDB: Write(16) 8a 00 00 00 00 00 5c 75 76 52 00 00 01 40 00 00
blk_update_request: I/O error, dev sdd, sector 12409614992 op 0x1:(WRITE) flags 0xc800 phys_seg 1 prio class 0
~ ~ ~ ~
sd 0:0:10:0: [sdd] Spinning up disk...
.
sd 0:0:3:0: device_block, handle(0x0013)
mpt3sas_cm0: log_info(0x3112010c): originator(PL), code(0x12), sub_code(0x010c)
.
sd 0:0:3:0: device_unblock and setting to running, handle(0x0013)
.
sd 0:0:3:0: Power-on or device reset occurred .................ready
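
If it does come down to the NCQ workaround, here is a minimal host-side sketch (an assumption on my part, not something tested here: on a Linux/TrueNAS SCALE host, dropping a disk's SCSI queue depth to 1 through sysfs effectively disables command queueing for that device, as an alternative to changing the setting in the controller firmware; device names are placeholders and the change does not persist across reboots):

#!/bin/sh
# Serialize I/O per drive by reducing the SCSI queue depth to 1.
for d in sda sdb sdc sdd; do
    echo "queue depth for $d was: $(cat /sys/block/$d/device/queue_depth)"
    echo 1 > /sys/block/$d/device/queue_depth
done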