ahcich timeouts, only with ahci, not with ataahci

Harald Schmalzbauer h.schmalzbauer at omnilan.de
Tue Feb 23 17:44:43 UTC 2010


Alexander Motin wrote on 23.02.2010 18:35 (localtime):
...
>> One question for my understanding: if the drive doesn't complete a
>> command, regardless of whether it's due to a firmware bug, a disk
>> surface error or whatever, is there no way for the driver to terminate
>> the request and take the drive offline after some time? This would be
>> very important behaviour for me. It doesn't make sense to build RAIDZ
>> storage when a failing drive hangs the complete machine, even if the
>> system partitions are on a completely different SSD.
> 
> That's what timeouts are used for. When a timeout is detected, the
> driver resets the device and reports an error to the upper layer. After
> receiving the error, CAM reinitializes the device. If the device is
> completely dead, reinitialization will fail and the device will be
> dropped immediately. If the device is still alive, the reinit succeeds
> and CAM will retry the command. If all retries fail, the error is
> reported to the GEOM layer and then possibly to the file system. I have
> no idea how RAIDZ behaves in such a case. Maybe after a few such errors
> it should drop that device out of the array.
> 
> A timeout is the worst possible case for any device, as it takes too
> much time and doesn't give any recovery information. The half-dead case
> is the worst possible kind of timeout. It is difficult to say which way
> is better: drop the last drive from a degraded array and lose all info,
> or retry forever. There is probably no right answer.
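
To check my reading of the flow described above, here is a toy model of
it in Python. It is only a sketch, not the actual ahci(4)/CAM code: the
function name, the two callbacks and the retry count of 3 are
assumptions made up for illustration.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED = "command completed"
    DEVICE_DROPPED = "device dropped"
    ERROR_TO_GEOM = "error reported to GEOM"


def handle_command(command_times_out, reinit_succeeds, max_retries=3):
    """Toy model of the timeout path (hypothetical names and retry count).

    command_times_out(attempt) -> bool: True if this attempt hits the timeout.
    reinit_succeeds(attempt)   -> bool: True if the device comes back after reset.
    """
    # One initial attempt plus max_retries retries.
    for attempt in range(max_retries + 1):
        if not command_times_out(attempt):
            return Outcome.COMPLETED
        # Timeout: the driver resets the device and reports the error upward,
        # then the upper layer tries to reinitialize the device.
        if not reinit_succeeds(attempt):
            # Completely dead device: it is dropped immediately.
            return Outcome.DEVICE_DROPPED
        # Device is still alive: fall through and retry the command.
    # Every attempt timed out although the device kept coming back.
    return Outcome.ERROR_TO_GEOM
```

The half-dead drive is the last case: every reinit succeeds, every
command times out, and the caller waits through all attempts before the
error finally moves up a layer.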

I see. Thanks a lot for the clarification.
Before getting the machine on site I did some ZFS tests, like removing
one disk while a cvs checkout was running.
As far as I remember, ZFS didn't show the removed drive as offline, but
there was no hang. The pool was degraded, and after reinserting the
drive and rebooting I could resilver the pool. I couldn't manage to get
it consistent without rebooting, but I accepted that since I would have
to go on site to change the drive anyway.
I'll restore the default vfs.zfs.txg.timeout=30, so the hang can be
reproduced easily, and see whether I can 'camcontrol stop' the drive.
Do you think I can get useful information from that test?

Thanks,

-Harry
