Can MPS discard a misbehaving disk?
dustinwenz at xtechllc.com
Tue Apr 24 18:04:38 UTC 2012
I am having trouble with MPS becoming unresponsive in certain disk failure conditions. So far, I've experienced this with 3TB Hitachi disks (0S03208) and 3TB Seagate Barracuda disks (ST3000DM001, firmware CC9D) while using the MPS driver with an LSI SAS2116 controller on FreeBSD 8.2-STABLE.
In these particular instances, the disks are part of a zpool of mirrors. When a disk fails, I generally see a message like "kernel: (da5:mps0:0:5:0): SCSI command timeout on device handle 0x0017 SMID 148", followed by an indefinite number of "mps0: (0:5:0) terminated ioc 804b scsi 0 state c xfer 65536" messages.
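To get a sense of how many of these messages pile up per device, something like the following rough sketch could tally them. It is only an illustration: the regular expressions are modeled on the two messages quoted above, the `summarize` helper is hypothetical, and the exact log format may differ between driver versions.

```python
import re
from collections import Counter

# Patterns modeled on the two kernel messages quoted above (assumed format).
TIMEOUT_RE = re.compile(r"\((da\d+):mps\d+:\d+:\d+:\d+\): SCSI command timeout")
TERMINATED_RE = re.compile(r"mps\d+: \((\d+:\d+:\d+)\) terminated ioc")


def summarize(lines):
    """Count timeout messages per da device and 'terminated' messages per bus:target:lun."""
    timeouts = Counter()
    terminated = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = TERMINATED_RE.search(line)
        if m:
            terminated[m.group(1)] += 1
    return timeouts, terminated


if __name__ == "__main__":
    # Sample lines copied from the messages quoted above.
    sample = [
        "kernel: (da5:mps0:0:5:0): SCSI command timeout on device handle 0x0017 SMID 148",
        "mps0: (0:5:0) terminated ioc 804b scsi 0 state c xfer 65536",
        "mps0: (0:5:0) terminated ioc 804b scsi 0 state c xfer 65536",
    ]
    timeouts, terminated = summarize(sample)
    print(dict(timeouts))
    print(dict(terminated))
```

In practice the lines would come from the system log (for example, by iterating over an open file) rather than from a hard-coded list.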
What I want to happen in this case is for the disk to simply go offline in the zpool so that the pool can continue functioning. However, the pool status still shows the disk as online. Any attempt to disable the disk (such as with zpool offline, remove, or detach) will hang and never complete, as will attempting a rescan with camcontrol. Of course, any attempt to access data in the pool will hang as well.
Rebooting the system in this state is also bad; when the disk is first discovered, it will begin a cycle of mps scsi errors during startup that never seem to stop. The only way to recover, at least that I know of, is to physically remove the disk from the chassis. Once I do that, the system continues running perfectly.
Basically, my question is this: how can I get MPS to ignore a failed disk and never attempt to access it again? I don't care whether it does so automatically or whether I need to perform some administrative operation to drop the device reference. I've seen a number of people on the list having problems that appear similar to this, but those seem to have more to do with firmware or compatibility issues. In my case, these disks are definitely dead... they no longer work in any other system, and they often make sad clicking noises.
I suppose this is also something that ZFS could do, independent of the driver. If a device is unresponsive, shouldn't ZFS take it offline on its own?