From nobody Sat Feb 08 21:43:11 2025 X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Yr48J4RSvz5nFhS; Sat, 08 Feb 2025 21:43:12 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "mxrelay.nyi.freebsd.org", Issuer "R11" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4Yr48J1WnNz3Nc6; Sat, 08 Feb 2025 21:43:12 +0000 (UTC) (envelope-from git@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1739050992; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=yxfKOD4pG531sKBCxCS1GN4xAPnSrm+TkhBe+FWLlCI=; b=uDNSZEe5zao6aR3Jhf83rPAVY6jSO9LjAe4oqo6BO4P86LT5A6Xburq31svaGSxrRQ2zLI hcjeqhKtTP377Zf+7tpQp1BtQ0LsGTwKhyP7/ZyMykLKCDNxngfuNRmYXtTgWp+rToz80z Pji0BXgAM2DFb8iU24O80k3AyeMt8ivBRDiSevcRW7VgTgtQ8PRbhFNbr5PqrYziQvazPm UmwLg+E7Nn4+BdNE2jzq/89/Mjhm21jgCPcgN9w59pbToQHcsnYe29nKFGJ6kpZKyLBD8e q8VgwYztvHpt2SidnfKrITbXjo+g/yu36OSEACKOx5Ha5GGd3p55wuhAtSeFPg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1739050992; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=yxfKOD4pG531sKBCxCS1GN4xAPnSrm+TkhBe+FWLlCI=; b=u/yOX2gxtFHwq4CEo9/SrJqm/yVp0Z7eDR6dgJZtRdGqqlBUSbY7GFop+CW+Uc11udTMSq J+HKbvJst7lbbn1mei+V9LxZHZGQBD5J0GNJ3ZFGKx//ABQnSZHQypWGAtO9yUhZ644AG7 G4CXyv9ATAJAifouuflbMfKbK3O1ofZbLx6c7xZ/YTQxcUCsuY8dyB/ZxBHrHRGRItvSg8 ALHXNAZfYPySxlHfU4tY5JB3Hvr4eeIRtCZdfLrcysahB4ozDBAXE8Eu2wdl9kDVtWFVbJ hWlG/YKaqFKwXr1SSCThzW4iFjDEtzvIqgJQIfvUQWX05ipy4SqsBo2tpMdlmg== ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1739050992; a=rsa-sha256; cv=none; b=Fp1DLaTcPfxsTX/xTFMeHzAHRHK0wyU1tU7PQTXmIUVO2d1pvfzydqGOK0H8FZrfrUzp3T FiefcM+Uu/4TFS8tSa5Sz+Radcw91Q3sJAutGU2E3xYoXhjVCKCAZlWNiYo0hAcgv50H1V QYw4Ie6JrpZWr7ke0zyM8YR+MQRxDxYLVjM4C5m+8OwuLKUql4tC76twoEN0twCR1ho7O+ vCVxb125mTjFtQwnUwiIcv3LcMvVyr/ojW0Nm4dmIt4+zLzzJN+m0AQ1+Eu5CD6kpE13JC iFK879i1sDWhJIJe8k3fIVpBDi/cpUjVlMig9t9TuuJTUqo5E45X9Hf/sJct9A== ARC-Authentication-Results: i=1; mx1.freebsd.org; none Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 4Yr48J16wLzCtY; Sat, 08 Feb 2025 21:43:12 +0000 (UTC) (envelope-from git@FreeBSD.org) Received: from gitrepo.freebsd.org ([127.0.1.44]) by gitrepo.freebsd.org (8.18.1/8.18.1) with ESMTP id 518LhBX9097576; Sat, 8 Feb 2025 21:43:11 GMT (envelope-from git@gitrepo.freebsd.org) Received: (from git@localhost) by gitrepo.freebsd.org (8.18.1/8.18.1/Submit) id 518LhBoU097572; Sat, 8 Feb 2025 21:43:11 GMT (envelope-from git) Date: Sat, 8 Feb 2025 21:43:11 GMT Message-Id: <202502082143.518LhBoU097572@gitrepo.freebsd.org> To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org From: Warner Losh Subject: git: f8de2be7d920 - main - cam/da: Call cam_periph_invalidate on ENXIO in dadone List-Id: Commit messages for all branches of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all List-Help: List-Post: List-Subscribe: List-Unsubscribe: X-BeenThere: dev-commits-src-all@freebsd.org Sender: owner-dev-commits-src-all@FreeBSD.org MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Git-Committer: imp X-Git-Repository: src X-Git-Refname: refs/heads/main X-Git-Reftype: branch X-Git-Commit: f8de2be7d920d4e8d9a60804819282dc89f4881a Auto-Submitted: auto-generated The branch main has been updated by imp: URL: https://cgit.FreeBSD.org/src/commit/?id=f8de2be7d920d4e8d9a60804819282dc89f4881a commit f8de2be7d920d4e8d9a60804819282dc89f4881a Author: Warner Losh AuthorDate: 2025-02-08 21:31:14 +0000 Commit: Warner Losh CommitDate: 2025-02-08 21:31:14 +0000 cam/da: Call cam_periph_invalidate on ENXIO in dadone Use cam_periph_invalidate() instead of just setting the PACK_INVALID flag in the da softc. It's a more appropriate and bigger hammer for this case. PACK_INVALID is set as part of that, so remove the now-redundant setting. This also has the side effect of short-circuiting errors for other I/O still in the drive which is just about to fail (sometimes with different error codes than what triggered this ENXIO). The prior practice of just setting the PACK_INVALID flag, however, was too ephemeral to be effective.. Since daopen would clear PACK_INVALID after a successful open, we'd have to rediscover the error (which takes tens of seconds) for every different geom tasting the drive. These two factors lead to a watchdog before we could get through all the devices if we had multiple failed drives with this syndrome. By invalidating the periph, we fail fast enough to reboot enough to start petting the watchdog. If we disable the watchdog, the tasting eventually completes, but takes over an hour which is too long. As it is, it takes an extra minute per failed drive, which is tolerable. When the PACK_INVALID flag is already set, just flush remaining I/Os with ENXIO. This bit will be set either when we've called cam_periph_invalidate() before (so we've just waiting for the I/Os to complete) or more typically when we've seen an ASC 0x3a, which is the catch all for 'drive is otherwise OK, we're just missing the media to get data from'. In the latter case, we do not want to invalidate the periph since we allow recovery from this with a trip through daopen(). While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for failing the entire drive, I've opted not to use that. That flag will also causes all attached drivers, like pass, to detach, which is undesireable. By not adding that flag, but just invalidating the da periph driver, we prevent I/Os, but still allow collection of logs from the device. We can also simplify the logic w/o bloating the change, so do that too. Finally, this has been tested on all the removeable/non-removeable disks I could find, cd players, combo cd/da memory sticks, etc. I've removed the media while doing I/O on several of them. With these changes, we handle things corretly in all the cases I tested (except partially inserted media, which fails chaotically the same as before). The numbre of devices out there is, however, huge. mav@ raised concerns about what happens when we have asc/ascq 28/0. I see that on boot for one of my cards (that's not autoquirked) and as preditected in the review, we retry that transaction and we get proper behavior. To be fair, though, I only ever saw it at startup where it was a transient. I couldn't get some of my energy saving disks to ever throw that ASC/ASCQ, even after they spun down, so I've not tested that case. Sponsored by: Netflix Discussed with: mav@ Differential Revision: https://reviews.freebsd.org/D48689 --- sys/cam/scsi/scsi_da.c | 59 +++++++++++++++++++++++++++++++------------------- 1 file changed, 37 insertions(+), 22 deletions(-) diff --git a/sys/cam/scsi/scsi_da.c b/sys/cam/scsi/scsi_da.c index 44dc21d1bc2f..1fd6d4919c61 100644 --- a/sys/cam/scsi/scsi_da.c +++ b/sys/cam/scsi/scsi_da.c @@ -1805,7 +1805,10 @@ daopen(struct disk *dp) /* * Only 'validate' the pack if the media size is non-zero and the - * underlying peripheral isn't invalid (the only error != 0 path). + * underlying peripheral isn't invalid (the only error != 0 path). Once + * the periph is marked invalid, we only get here on lost races with its + * teardown, so keeping the pack invalid also keeps more I/O from + * starting. */ if (error == 0 && softc->params.sectors != 0) softc->flags &= ~DA_FLAG_PACK_INVALID; @@ -4609,33 +4612,45 @@ dadone(struct cam_periph *periph, union ccb *done_ccb) */ bp = (struct bio *)done_ccb->ccb_h.ccb_bp; if (error != 0) { - int queued_error; + bool pack_invalid = + (softc->flags & DA_FLAG_PACK_INVALID) != 0; - /* - * return all queued I/O with EIO, so that - * the client can retry these I/Os in the - * proper order should it attempt to recover. - */ - queued_error = EIO; - - if (error == ENXIO - && (softc->flags & DA_FLAG_PACK_INVALID)== 0) { + if (error == ENXIO && !pack_invalid) { /* - * Catastrophic error. Mark our pack as - * invalid. + * ENXIO flags ASC/ASCQ codes for either media + * missing, or the drive being extremely + * unhealthy. Invalidate peripheral on this + * catestrophic error when the pack is valid + * since we set the pack invalid bit only for + * the few ASC/ASCQ codes indicating missing + * media. The invalidation will flush any + * queued I/O and short-circuit retries for + * other I/O. We only invalidate the da device + * so the passX device remains for recovery and + * diagnostics. * - * XXX See if this is really a media - * XXX change first? + * While we do also set the pack invalid bit + * after invalidating the peripheral, the + * pending I/O will have been flushed then with + * no new I/O starting, so this 'edge' case + * doesn't matter. */ xpt_print(periph->path, "Invalidating pack\n"); - softc->flags |= DA_FLAG_PACK_INVALID; -#ifdef CAM_IO_STATS - softc->invalidations++; -#endif - queued_error = ENXIO; + cam_periph_invalidate(periph); + } else { + /* + * Return all queued I/O with EIO, so that the + * client can retry these I/Os in the proper + * order should it attempt to recover. When the + * pack is invalid, fail all I/O with ENXIO + * since we can't assume when the media returns + * it's the same media and we force a trip + * through daclose / daopen and the client won't + * retry. + */ + cam_iosched_flush(softc->cam_iosched, NULL, + pack_invalid ? ENXIO : EIO); } - cam_iosched_flush(softc->cam_iosched, NULL, - queued_error); if (bp != NULL) { bp->bio_error = error; bp->bio_resid = bp->bio_bcount;