git: f8de2be7d920 - main - cam/da: Call cam_periph_invalidate on ENXIO in dadone
Date: Sat, 08 Feb 2025 21:43:11 UTC
The branch main has been updated by imp:
URL: https://cgit.FreeBSD.org/src/commit/?id=f8de2be7d920d4e8d9a60804819282dc89f4881a
commit f8de2be7d920d4e8d9a60804819282dc89f4881a
Author: Warner Losh <imp@FreeBSD.org>
AuthorDate: 2025-02-08 21:31:14 +0000
Commit: Warner Losh <imp@FreeBSD.org>
CommitDate: 2025-02-08 21:31:14 +0000
cam/da: Call cam_periph_invalidate on ENXIO in dadone
Use cam_periph_invalidate() instead of just setting the PACK_INVALID
flag in the da softc. It's a more appropriate and bigger hammer for this
case. PACK_INVALID is set as part of that, so remove the now-redundant
setting. This also has the side effect of short-circuiting errors for
other I/O still in the drive which is just about to fail (sometimes with
different error codes than what triggered this ENXIO).
The prior practice of just setting the PACK_INVALID flag, however, was
too ephemeral to be effective. Since daopen would clear PACK_INVALID
after a successful open, we'd have to rediscover the error (which takes
tens of seconds) for every different geom tasting the drive. These two
factors led to a watchdog timeout before we could get through all the
devices if we had multiple failed drives with this syndrome. By
invalidating the periph, we fail fast enough to boot far enough to start
petting the watchdog. If we disable the watchdog, the tasting eventually
completes, but takes over an hour, which is too long. As it is, it takes
an extra minute per failed drive, which is tolerable.
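The difference between the two approaches can be illustrated with a small
userland model. This is only a sketch, not kernel code: struct toy_disk and
the toy_* functions are hypothetical names, and the model assumes that an
open of an invalidated periph is refused with ENXIO, while an open of a
periph that only has its pack-invalid bit set succeeds and clears the bit.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the da softc/periph state; all names are hypothetical. */
struct toy_disk {
	bool pack_invalid;	/* models DA_FLAG_PACK_INVALID */
	bool periph_invalid;	/* models an invalidated periph */
	unsigned long sectors;	/* models softc->params.sectors */
	int probes;		/* count of slow error rediscoveries */
};

/* Models an open: refused once the periph is invalid (an assumption). */
static int
toy_open(struct toy_disk *d)
{
	if (d->periph_invalid)
		return (ENXIO);
	/* A successful open with non-zero media size revalidates the pack. */
	if (d->sectors != 0)
		d->pack_invalid = false;
	return (0);
}

/* Models one I/O to a failed drive; it always ends in ENXIO. */
static int
toy_io(struct toy_disk *d, bool invalidate_periph)
{
	if (d->pack_invalid || d->periph_invalid)
		return (ENXIO);		/* fast path: already known bad */
	d->probes++;			/* slow path: drive must fail again */
	d->pack_invalid = true;
	if (invalidate_periph)
		d->periph_invalid = true;
	return (ENXIO);
}

/* Models several consumers tasting the drive, one open plus one read each. */
static int
toy_taste(struct toy_disk *d, int consumers, bool invalidate_periph)
{
	for (int i = 0; i < consumers; i++) {
		if (toy_open(d) != 0)
			continue;	/* fast failure, no I/O issued */
		(void)toy_io(d, invalidate_periph);
	}
	return (d->probes);
}
```

In this model, tasting with only the flag pays the slow rediscovery once per
consumer, while tasting with invalidation pays it once per drive.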
When the PACK_INVALID flag is already set, just flush remaining I/Os
with ENXIO. This bit will be set either when we've called
cam_periph_invalidate() before (so we're just waiting for the I/Os to
complete) or more typically when we've seen an ASC 0x3a, which is the
catch-all for 'drive is otherwise OK, we're just missing the media to
get data from'. In the latter case, we do not want to invalidate the
periph since we allow recovery from this with a trip through daopen().
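The resulting decision in the dadone error path can be summarized with a
small standalone sketch. It is a model of the logic described above, not the
kernel code itself: toy_dadone_error(), struct toy_decision and the TOY_*
constants are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* What the error path does with the periph and its queued I/O. */
enum toy_action {
	TOY_INVALIDATE_PERIPH,	/* models calling cam_periph_invalidate() */
	TOY_FLUSH_QUEUE		/* models calling cam_iosched_flush() */
};

struct toy_decision {
	enum toy_action action;
	int flush_error;	/* meaningful only for TOY_FLUSH_QUEUE */
};

/*
 * Models the choice made when an I/O completes with a non-zero error.
 * 'pack_invalid' models DA_FLAG_PACK_INVALID sampled before the decision.
 */
static struct toy_decision
toy_dadone_error(int error, bool pack_invalid)
{
	struct toy_decision d = { TOY_FLUSH_QUEUE, EIO };

	if (error == ENXIO && !pack_invalid) {
		/* First ENXIO on a valid pack: invalidate the periph. */
		d.action = TOY_INVALIDATE_PERIPH;
		d.flush_error = 0;
	} else if (pack_invalid) {
		/* Pack already invalid: fail the queue with ENXIO. */
		d.flush_error = ENXIO;
	}
	/* Otherwise keep the default: flush with EIO so clients may retry. */
	return (d);
}
```

Only the first ENXIO seen on a valid pack triggers invalidation; every later
error on an invalid pack drains the queue with ENXIO, and ordinary errors on
a valid pack still flush with EIO.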
While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for
failing the entire drive, I've opted not to use that. That flag also
causes all attached drivers, like pass, to detach, which is
undesirable. By not adding that flag, but just invalidating the da
periph driver, we prevent I/Os, but still allow collection of logs from
the device.
We can also simplify the logic w/o bloating the change, so do that too.
Finally, this has been tested on all the removable/non-removable disks
I could find, cd players, combo cd/da memory sticks, etc. I've removed
the media while doing I/O on several of them. With these changes, we
handle things correctly in all the cases I tested (except partially
inserted media, which fails chaotically the same as before). The number
of devices out there is, however, huge.
mav@ raised concerns about what happens when we have asc/ascq 28/0. I
see that on boot for one of my cards (that's not autoquirked) and as
predicted in the review, we retry that transaction and we get proper
behavior. To be fair, though, I only ever saw it at startup where it was
a transient. I couldn't get some of my energy saving disks to ever throw
that ASC/ASCQ, even after they spun down, so I've not tested that case.
Sponsored by: Netflix
Discussed with: mav@
Differential Revision: https://reviews.freebsd.org/D48689
---
sys/cam/scsi/scsi_da.c | 59 +++++++++++++++++++++++++++++++-------------------
1 file changed, 37 insertions(+), 22 deletions(-)
diff --git a/sys/cam/scsi/scsi_da.c b/sys/cam/scsi/scsi_da.c
index 44dc21d1bc2f..1fd6d4919c61 100644
--- a/sys/cam/scsi/scsi_da.c
+++ b/sys/cam/scsi/scsi_da.c
@@ -1805,7 +1805,10 @@ daopen(struct disk *dp)
/*
* Only 'validate' the pack if the media size is non-zero and the
- * underlying peripheral isn't invalid (the only error != 0 path).
+ * underlying peripheral isn't invalid (the only error != 0 path). Once
+ * the periph is marked invalid, we only get here on lost races with its
+ * teardown, so keeping the pack invalid also keeps more I/O from
+ * starting.
*/
if (error == 0 && softc->params.sectors != 0)
softc->flags &= ~DA_FLAG_PACK_INVALID;
@@ -4609,33 +4612,45 @@ dadone(struct cam_periph *periph, union ccb *done_ccb)
*/
bp = (struct bio *)done_ccb->ccb_h.ccb_bp;
if (error != 0) {
- int queued_error;
+ bool pack_invalid =
+ (softc->flags & DA_FLAG_PACK_INVALID) != 0;
- /*
- * return all queued I/O with EIO, so that
- * the client can retry these I/Os in the
- * proper order should it attempt to recover.
- */
- queued_error = EIO;
-
- if (error == ENXIO
- && (softc->flags & DA_FLAG_PACK_INVALID)== 0) {
+ if (error == ENXIO && !pack_invalid) {
/*
- * Catastrophic error. Mark our pack as
- * invalid.
+ * ENXIO flags ASC/ASCQ codes for either media
+ * missing, or the drive being extremely
+ * unhealthy. Invalidate peripheral on this
+ * catastrophic error when the pack is valid
+ * since we set the pack invalid bit only for
+ * the few ASC/ASCQ codes indicating missing
+ * media. The invalidation will flush any
+ * queued I/O and short-circuit retries for
+ * other I/O. We only invalidate the da device
+ * so the passX device remains for recovery and
+ * diagnostics.
*
- * XXX See if this is really a media
- * XXX change first?
+ * While we do also set the pack invalid bit
+ * after invalidating the peripheral, the
+ * pending I/O will have been flushed then with
+ * no new I/O starting, so this 'edge' case
+ * doesn't matter.
*/
xpt_print(periph->path, "Invalidating pack\n");
- softc->flags |= DA_FLAG_PACK_INVALID;
-#ifdef CAM_IO_STATS
- softc->invalidations++;
-#endif
- queued_error = ENXIO;
+ cam_periph_invalidate(periph);
+ } else {
+ /*
+ * Return all queued I/O with EIO, so that the
+ * client can retry these I/Os in the proper
+ * order should it attempt to recover. When the
+ * pack is invalid, fail all I/O with ENXIO
+ * since we can't assume when the media returns
+ * it's the same media and we force a trip
+ * through daclose / daopen and the client won't
+ * retry.
+ */
+ cam_iosched_flush(softc->cam_iosched, NULL,
+ pack_invalid ? ENXIO : EIO);
}
- cam_iosched_flush(softc->cam_iosched, NULL,
- queued_error);
if (bp != NULL) {
bp->bio_error = error;
bp->bio_resid = bp->bio_bcount;