Adaptec 3210S, 4.9-STABLE, corruption when disk fails
gemini at geminix.org
Tue Mar 1 08:58:42 GMT 2005
Uwe Doering wrote:
> Don Bowman wrote:
>> I have merged asr.c from RELENG_4 to get this fix:
>> "Fix a mis-merge in the MFC of rev. 1.64 in rev. 188.8.131.52; the following
>> change wasn't included:
>> - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP
>> in case of a CHECK CONDITION."
>> since I guess it's conceivable this could cause my problem.
> I have to admit that I didn't think of this right away, even though I
> was kind of involved.
> Did you merge 184.108.40.206 as well? This actually should have been one MFC
> but it was done in two steps due to an oversight. Please let us know
> whether the fix makes any difference in your case. Its author made it
> for CD burners and wasn't sure whether it has any effect on other
> devices, like da(4).
Memory's coming back piecemeal. ;-) There's another thing you could
try. The 'asr' driver's original timeout is 360 seconds, because its
author knew that this type of controller can be busy for quite some
time. FreeBSD's SCSI driver, however, sets it to its default of 60
seconds, which can be way too short.
What happens when the controller is busy trying to deal with a failed
disk is that the 'asr' driver sends a bus reset to the controller as a
whole, due to the short timeout. You should be able to see this clash
in the controller's event log. My feeling is that this collision of
events may have ill effects, like the data corruption you've observed.
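To make the timing argument concrete, here is a toy model. It is hypothetical and is not code from asr(4) or CAM. A command issued while the controller is busy for busy_secs only completes after that period. If the per-command timeout expires first, the driver escalates to a bus reset of the whole controller. The function name command_outcome and the 120-second busy period used below are assumptions for illustration only.

```c
/* Hypothetical model of the timeout race; not taken from asr(4) or CAM. */
enum outcome { COMPLETED, BUS_RESET };

/*
 * busy_secs:    how long the controller is tied up (e.g. handling a
 *               failed disk) before it answers the pending command.
 * timeout_secs: the per-command timeout set by the SCSI layer.
 * If the controller stays busy longer than the timeout, the timeout
 * fires first and the driver resets the controller as a whole.
 */
static enum outcome
command_outcome(unsigned busy_secs, unsigned timeout_secs)
{
    return (busy_secs > timeout_secs) ? BUS_RESET : COMPLETED;
}
```

With an assumed busy period of 120 seconds, command_outcome(120, 60) yields BUS_RESET, while command_outcome(120, 360) yields COMPLETED, which is the effect of restoring the original 360-second timeout.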
On our machines we've set the SCSI timeout and thereby also the 'asr'
driver's timeout back to the original 360 seconds, in order to leave the
controller alone while it is busy. There is a 'sysctl' variable for this:
Maybe that's the actual fix for your problem.
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
gemini at geminix.org | http://www.escapebox.net