Adaptec 5405 (aac0) hanging on high load

Michael freebsdusb at bindone.de
Tue Jun 9 09:04:18 UTC 2009


(I filed this as a PR through the website as well, but I'm still waiting
for confirmation and assignment of a PR number.)

Hi folks,

I've got issues with an Adaptec 5405 sometimes hanging under high load.
It logs "COMMAND xxx TIMEOUT AFTER yyy SECONDS" multiple times and then
"Controller is no longer running". I see this on 7.1-RELEASE, 7.2-RC2,
7.2-RELEASE and 8-CURRENT.

This can be provoked by high load, such as a highly parallel make
buildworld or various benchmarks (e.g. /usr/ports/benchmarks/blogbench).
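
To get a rough idea of how often this happens on a given box, the kernel
messages can be counted from the log. This is only a sketch: the function
name count_aac_timeouts is made up, /var/log/messages is an assumed log
location, and the patterns are taken from the messages as quoted in this
post, so they may need adjusting to the exact wording the driver prints.

```sh
#!/bin/sh
# Count aac timeout and "controller dead" messages in a log file.
# Hypothetical helper; the default log path is an assumption.
count_aac_timeouts() {
    log="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 if none); "|| true"
    # keeps a zero-match exit status from aborting scripts run with set -e.
    timeouts=$(grep -c 'COMMAND .* TIMEOUT AFTER .* SECONDS' "$log" || true)
    dead=$(grep -c 'Controller is no longer running' "$log" || true)
    echo "timeouts=$timeouts controller_dead=$dead"
}
```

Running count_aac_timeouts before and after a buildworld or blogbench run
should show whether the load actually triggered new timeouts.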

I've been wondering whether this is related to the following article in
the Adaptec knowledge base:

http://ask.adaptec.com/scripts/adaptec_tic.cfg/php.exe/enduser/std_adp.php?p_faqid=15357&p_created=1225366599&p_sid=NqNtKZrj&p_accessibility=0&p_redirect=&p_lva=&p_sp=cF9zcmNoPSZwX3NvcnRfYnk9JnBfZ3JpZHNvcnQ9JnBfcm93X2NudD0yNjk3LDI2OTcmcF9wcm9kcz0mcF9jYXRzPSZwX3B2PSZwX2N2PSZwX3NlYXJjaF90eXBlPWFuc3dlcnMuc2VhcmNoX25sJnBfcGFnZT0x&p_li=&p_topview=1

It states:
"AACRAID based controllers have an underlying timeout/recovery cycle
that is 35 seconds long.

The default in some SCSI subsystems was 60 seconds in the past, but is
now standardized at 30 seconds which results in an interference pattern
between the controller and the Linux SCSI subsystem."

(I've copied and pasted the entire article at the end of this post.)

Since sys/dev/aac/aacvar.h sets AAC_CMD_TIMEOUT to 30 seconds, I've been
wondering if this is somehow related (there is also a timeout for
immediate commands and an interval for the periodic timeout check; I'm
not sure how they're used in aac.c and was too lazy to check).
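
A quick way to see the timeout-related constants side by side is to grep
the header. A minimal sketch, assuming the source tree lives under
/usr/src; show_aac_timeouts is just a name picked for illustration, and
the pattern assumes the other constants also carry TIMEOUT or INTERVAL in
their names:

```sh
#!/bin/sh
# List timeout-related #defines in the aac driver header.
# The default path assumes the FreeBSD sources are installed in /usr/src.
show_aac_timeouts() {
    hdr="${1:-/usr/src/sys/dev/aac/aacvar.h}"
    # -n prints line numbers, -E enables the (A|B) alternation.
    grep -nE '^#define[[:space:]]+AAC_[A-Z_]*(TIMEOUT|INTERVAL)' "$hdr"
}
```

Any value at or below the 35 seconds mentioned in the article would be a
candidate for the interference it describes.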

The bottom line is that Adaptec states that their AACRAID based
controllers may sometimes need >35 seconds to process a command under
normal operational circumstances, if the controller is going through an
"error correction cycle on the SAS/SATA bus".

cheers
Michael


-- Complete Adaptec knowledge base entry --
AACRAID based controllers have an underlying timeout/recovery cycle that
is 35 seconds long.

The default in some SCSI subsystems was 60 seconds in the past, but is
now standardized at 30 seconds which results in an interference pattern
between the controller and the Linux SCSI subsystem.

The alternate workaround is for the user to adjust the timeout in SYSFS
if it is shorter than 35 seconds.

Changing the timeout values for a Linux block device can be done via
SYSFS. For example, if /dev/sdc , /dev/sdd and /dev/sde are the device
LUNs on a given Linux host, then the following commands need to be issued:
echo 45 > /sys/block/sdc/device/timeout
echo 45 > /sys/block/sdd/device/timeout
echo 45 > /sys/block/sde/device/timeout
In this example the timeout is 45 seconds which should be enough.

Note: If any AACRAID based controller is going through an error correction
cycle on the SAS/SATA bus that is delaying the completion of I/O beyond
the Linux default timeout set for the device, this may be a hardware
issue or a problem with the default timeout value as outlined above. If
changing the timeout value doesn't solve the problem then please follow
the steps we recommend to troubleshoot "Host adapter reset request.
SCSI hang ?" messages:
- Check for any updated firmware for the motherboard, controller, targets
  and enclosure on the respective manufacturer's web sites.
- Check per-device queue depth in SYSFS to make sure it is reasonable.
- Engage disk drive manufacturer's technical support department to check
  through compatibility or drive class issues.
- Engage enclosure manufacturer's technical support department to check
  through compatibility issues.
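
As a side note that is not part of the Adaptec entry: on a Linux box, the
sysfs commands it lists could be wrapped in a small loop. A rough sketch
only; the function name set_scsi_timeout and the SYSFS_ROOT override
(handy for trying it against a scratch directory) are my own inventions,
and none of this applies to FreeBSD.

```sh
#!/bin/sh
# Write a new SCSI command timeout (in seconds) for each named block device.
# Hypothetical helper; SYSFS_ROOT defaults to /sys and exists only so the
# function can be exercised against a test directory.
set_scsi_timeout() {
    timeout="$1"
    shift
    for dev in "$@"; do
        f="${SYSFS_ROOT:-/sys}/block/$dev/device/timeout"
        if [ -w "$f" ]; then
            echo "$timeout" > "$f"
        else
            # Skip devices that do not exist or that we may not write to.
            echo "skipping $dev: $f not writable" >&2
        fi
    done
}
```

Called as root with "set_scsi_timeout 45 sdc sdd sde", this should have the
same effect as the three echo commands in the article.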

