kern/135408: Adaptec 5405 RAID (aac) controller hanging under high load +suggested fix

Tue Jun 9 09:10:04 UTC 2009

>Number:         135408
>Category:       kern
>Synopsis:       Adaptec 5405 RAID (aac) controller hanging under high load +suggested fix
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jun 09 09:10:03 UTC 2009
>Closed-Date:
>Last-Modified:
>Originator:     Michael Gmelin
>Release:        7.2-RELEASE amd64, 7.1-RELEASE amd64, 7.2 RC2 amd64, 8-CURRENT amd64
>Organization:
/bin/done digital solutions GmbH
>Environment:
FreeBSD srv02 7.2-RELEASE FreeBSD 7.2-RELEASE #0: Thu May 14 00:07:23 UTC 2009     root@srv02:/usr/src/sys/amd64/compile/GENERIC  amd64
FreeBSD srv04 8.0-CURRENT FreeBSD 8.0-CURRENT #0: Tue Jun  2 18:01:36 UTC 2009     root@srv04:/usr/src/sys/amd64/compile/GENERIC  amd64

>Description:
The affected machines sometimes lose the Adaptec 5405 RAID controller under high load, stating:

aac0: COMMAND 0xffffffffxxxxxxxxx TIMEOUT AFTER nn SECONDS
(x changing, n getting higher)

This goes on up to a point where it states "WARNING! Controller is no longer running! code=xxxxx".

At that point the machine cannot recover and needs a hard reset.

This is reproducible on all the FreeBSD versions stated above (the driver hasn't changed in quite a while). The code triggering this is in aac.c (lines 2375ff).

I suspect this is a problem related to command timeouts (see "Fix to the problem" below). There have been various threads about similar issues on aacraid based controllers in the past, but none of them was conclusive, so I hope this can be resolved asap :)
>How-To-Repeat:
Put high load on the I/O subsystem, e.g. by doing
make -j16 buildworld
(depending on your hardware, dual quad core xeon here) or run
a benchmark (e.g. /usr/ports/benchmarks/blogbench).

The time required to crash the machine varies; it seems to be easier to crash directly after a reboot (but that's highly speculative).
>Fix:
I've been wondering if this is somehow related to the following article in the Adaptec knowledge base:

http://ask.adaptec.com/scripts/adaptec_tic.cfg/php.exe/enduser/std_adp.php?p_faqid=15357&p_created=1225366599&p_sid=NqNtKZrj&p_accessibility=0&p_redirect=&p_lva=&p_sp=cF9zcmNoPSZwX3NvcnRfYnk9JnBfZ3JpZHNvcnQ9JnBfcm93X2NudD0yNjk3LDI2OTcmcF9wcm9kcz0mcF9jYXRzPSZwX3B2PSZwX2N2PSZwX3NlYXJjaF90eXBlPWFuc3dlcnMuc2VhcmNoX25sJnBfcGFnZT0x&p_li=&p_topview=1

It states:
"AACRAID based controllers have an underlying timeout/recovery cycle that is 35 seconds long. 

The default in some SCSI subsystems was 60 seconds in the past, but is now standardized at 30 seconds which results in an interference pattern between the controller and the L*n*x SCSI subsystem."

(I copied and pasted the entire article at the end of this post.)

Since sys/dev/aac/aacvar.h sets AAC_CMD_TIMEOUT to 30 seconds, I've been wondering if this is somehow related (there are also a timeout for immediate commands and an interval for the periodic timeout check; I'm not sure how they're used in aac.c and was too lazy to check).

The bottom line is that Adaptec states that their AACRAID based controllers may sometimes need >35 seconds to process a command under normal operating conditions if the controller is going through an "error correction cycle on the SAS/SATA bus", and they recommend increasing the timeouts on the OS side to 45 seconds.

(This might also explain why the time required to reproduce the problem varies heavily.)
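
The suspected interference can be illustrated with a minimal C sketch. It is not the driver's code: the helper command_timed_out() and the names CONTROLLER_RECOVERY_CYCLE and SUGGESTED_CMD_TIMEOUT are hypothetical, and only AAC_CMD_TIMEOUT (30 seconds, per aacvar.h as cited above) and the 35/45 second figures from the Adaptec article come from this report. It assumes the timeout check is a plain comparison of elapsed time against the limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Value this report cites from sys/dev/aac/aacvar.h, in seconds. */
#define AAC_CMD_TIMEOUT            30

/* Figures from the quoted Adaptec article, in seconds.
 * Both macro names are hypothetical, chosen for this sketch only. */
#define CONTROLLER_RECOVERY_CYCLE  35
#define SUGGESTED_CMD_TIMEOUT      45

/*
 * Hypothetical helper (not taken from aac.c): returns true if a command
 * that has been outstanding for elapsed_s seconds would be reported as
 * timed out under a limit of timeout_s seconds.
 */
static bool
command_timed_out(int elapsed_s, int timeout_s)
{
	assert(elapsed_s >= 0 && timeout_s > 0);
	return (elapsed_s > timeout_s);
}
```

Under these assumptions, command_timed_out(CONTROLLER_RECOVERY_CYCLE, AAC_CMD_TIMEOUT) is true, so a command that is merely waiting out one recovery cycle would be flagged, while command_timed_out(CONTROLLER_RECOVERY_CYCLE, SUGGESTED_CMD_TIMEOUT) is false. Whether raising the limit actually prevents the hang is untested here.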

----------- Entire knowledge base article (see link above) ---------------
AACRAID based controllers have an underlying timeout/recovery cycle that is 35 seconds long. 

The default in some SCSI subsystems was 60 seconds in the past, but is now standardized at 30 seconds which results in an interference pattern between the controller and the L*n*x SCSI subsystem. 

The alternate workaround is for the user to adjust the timeout in SYSFS if it is shorter than 35 seconds. 

Changing the timeout values for a L*n*x block device can be done via SYSFS. For example, if /dev/sdc , /dev/sdd and /dev/sde are the device LUNs on a given Linux host, then the following commands need to be issued: 
echo 45 > /sys/block/sdc/device/timeout 
echo 45 > /sys/block/sdd/device/timeout 
echo 45 > /sys/block/sde/device/timeout 
In this example the timeout is 45 seconds which should be enough. 

Note: Any AACRAID based controller is going through an error correction cycle on the SAS/SATA bus that is delaying the completion of I/O beyond the L*n*x default timeout set for the device, this may be a hardware issue or a problem with the default timeout value as outlined above. If changing the timeout value doesn't solve the problem then please follow the steps we recommend to trouble shoot "Host adapter reset request. SCSI hang ?" messages: 
Check for any updated firmware for the motherboard, controller, targets and enclosure on the respective manufacturer's web sites.
Check per-device queue depth in SYSFS to make sure it is reasonable.
Engage disk drive manufacturer's technical support department to check through compatibility or drive class issues.
Engage enclosure manufacturer's technical support department to check through compatibility issues.

>Release-Note:
>Audit-Trail:
>Unformatted: