misc/148502: arcmsr driver offlines all devices on a SAS port if any device fails

Rich Ercolani rercola at acm.jhu.edu
Sun Jul 11 21:20:10 UTC 2010


>Number:         148502
>Category:       misc
>Synopsis:       arcmsr driver offlines all devices on a SAS port if any device fails
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Jul 11 21:20:09 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Rich Ercolani
>Release:        8.1-RC2 r209893
>Organization:
JHU ACM
>Environment:
FreeBSD manticore.acm.jhu.edu 8.1-RC2 FreeBSD 8.1-RC2 #2 r209893: Sun Jul 11 03:26:50 EDT 2010     root at manticore.acm.jhu.edu:/usr/obj/usr/local/ncvs/src/sys/DTRACE  amd64

>Description:
When the arcmsr driver is driving an Areca ARC-1280ML, whenever any of the four SATA devices plugged into a SAS port through an SFF-8087 -> 4xSATA cable fails, the other three devices on that port are also absent from /dev in FreeBSD.

The areca-cli utility demonstrates that this is not a problem with the card itself - it will happily interrogate the other drives on the port and list the correct drive as "Failed" - but the failed device, as well as every other device on that SAS "port", simply does not exist in /dev.

[This behavior may also occur when a device fails on a running system, rather than only when the system boots with a failed device - I can't tell, because the machine seems to have eventually hung whenever a device failed, despite the arcmsr driver reporting the error in dmesg and ZFS reporting the error as corrected (or not, depending on the pool).]

[This could also be a problem at a different layer than arcmsr, I suppose - something not noticing the difference between 4 drives on the lanes of the SAS connection versus 1 drive using multiple lanes, and toggling the port...]
>How-To-Repeat:
1) Plug more than one device into a single SFF-8087->4xSATA fanout cable on an Areca ARC-1280ML.
2) Have one of the devices report enough SMART errors for the 1280ML to mark the device as "failed".
3) Watch the other devices on that cable fail to show up on every subsequent boot, and possibly disappear while the system is running.
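
One way to confirm step 3 is to compare the device names expected on the fanout cable against what actually exists under /dev. The following is a minimal sketch, not part of the original report: the function name missing_nodes and the names da0 through da3 are hypothetical placeholders and should be replaced with whatever names the disks normally receive on the affected system.

```python
import os


def missing_nodes(expected, dev_dir="/dev"):
    """Return the expected device names that have no node under dev_dir."""
    return [
        name for name in expected
        if not os.path.exists(os.path.join(dev_dir, name))
    ]


if __name__ == "__main__":
    # Hypothetical names for the four disks sharing one fanout cable.
    expected = ["da0", "da1", "da2", "da3"]
    for name in missing_nodes(expected):
        print(f"missing: /dev/{name}")
```

If only the failed disk were dropped, a single name would be reported; with the behavior described here, every disk on the affected port would be listed.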
>Fix:
Removing the failed device is presumably a workaround, but a poor one: to avoid a critical outage when a single disk fails, you could connect only one drive per SAS port, which in practice limits the card to 6 drives.

>Release-Note:
>Audit-Trail:
>Unformatted:

