[Bug 229745] ahcich: CAM status: Command timeout

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 22 Jul 2022 15:13:28 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229745

--- Comment #62 from Warner Losh <imp@FreeBSD.org> ---
This bug should be closed. There are too many different symptoms co-located
in this bug that are likely unrelated. There's clearly some bad hardware here.
There are clearly some issues with ahci itself that appear to my eye to be
'quirky' resets of the bridge (though w/o traces it will be hard to know).
Things that fail to reset on reboot are different from transient errors, which
in turn are different from WRITE errors with codes that aren't timeouts. It's
hard to sort out all the issues here. A number of other bugs should be filed
to take its place, for real bugs that are reproducible (because otherwise we
won't know whether any changes fix the problem).

By and large, if a drive hangs, it is to blame. If a drive throws write errors,
it's always the drive (though the CRC errors might be cabling issues).

It makes sense that reducing the write load makes the problem 'disappear': a
heavy write load puts a much higher instantaneous load on the drive than it
would otherwise see. Drives that have marginal data can cope with retries for
a few writes, but not with retries on lots of writes all at once. The latter
can overwhelm some drives' firmware.

It's also possible that error recovery could be better in ahci, since we do
recovery work when we get a timeout. However, those improvements can be hard
to roll out and test, because testing needs real hardware that's basically
good but sometimes misbehaves, and most operations retire or discard such
hardware.
