Adaptec 3210S, 4.9-STABLE, corruption when disk fails

Tue Mar 1 08:58:42 GMT 2005

Uwe Doering wrote:
> Don Bowman wrote:
>>
>> I have merged asr.c from RELENG_4 to get this fix:
>>
>> "Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following
>> change wasn't included:
>> - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP
>>   in case of a CHECK CONDITION."
>>
>> since I guess it's conceivable this could cause my problem.
> 
> I have to admit that I didn't think of this right away, even though I 
> was kind of involved.
> 
> Did you merge 1.3.2.3 as well?  This actually should have been one MFC 
> but it was done in two steps due to an oversight.  Please let us know 
> whether the fix makes any difference in your case.  Its author made it 
> for CD burners and wasn't sure whether it has any effect on other 
> devices, like da(4).

Memory's coming back piecemeal. ;-)  There's another thing you could
try.  The 'asr' driver's original timeout is 360 seconds, because its
author knew that this type of controller can be busy for quite some
time.  FreeBSD's SCSI driver, however, overrides this with its own
default of 60 seconds, which can be far too short.

When the controller is busy dealing with a failed disk, the short
timeout expires and the 'asr' driver sends a bus reset to the
controller as a whole.  You should be able to see this clash in the
controller's event log.  My feeling is that this collision of events
may have ill effects, like the data corruption you've observed.

On our machines we've set the SCSI timeout, and thereby also the 'asr'
driver's timeout, back to the original 360 seconds in order to leave
the controller alone while it is busy.  There is a 'sysctl' variable
for this:

   kern.cam.da.default_timeout=360

Maybe that's the actual fix for your problem.
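
For anyone wanting to keep the setting across reboots, here is a minimal
sketch of one way to do it.  The helper name persist_sysctl is
hypothetical (it is not part of FreeBSD), and the sketch assumes the
usual /etc/sysctl.conf location and root privileges.  It only appends
the line when the variable is not already set in the file, so an
existing entry is never silently overridden.

```shell
#!/bin/sh
# persist_sysctl KEY VALUE [FILE]
# Hypothetical helper: append "KEY=VALUE" to a sysctl.conf-style file
# unless KEY already has an entry there.  FILE defaults to
# /etc/sysctl.conf (an assumption about where the system reads it).
persist_sysctl() {
    key=$1
    value=$2
    conf=${3:-/etc/sysctl.conf}

    # Leave an existing entry alone; the admin should review it by hand.
    if grep -q "^${key}=" "$conf" 2>/dev/null; then
        echo "${key} already set in ${conf}; not changing it" >&2
        return 0
    fi

    printf '%s=%s\n' "$key" "$value" >> "$conf"
}

# Intended use on the affected machine, as root (left commented out
# here because it changes system state):
#   sysctl kern.cam.da.default_timeout          # show the current value
#   sysctl kern.cam.da.default_timeout=360      # change the running system
#   persist_sysctl kern.cam.da.default_timeout 360
```

The sysctl command changes only the running kernel; the entry in the
configuration file is what restores the value at the next boot.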

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini at geminix.org  |  http://www.escapebox.net