out of HDD space - zfs degraded

Mon Oct 4 14:04:00 UTC 2010

Alexander Leidinger wrote:
> On Sat, 02 Oct 2010 22:25:18 -0400 Steve Polyack <korvus at comcast.net>
> wrote:
> 
>> I think it's worth it to think about TLER (or the absence of it) here - 
>> http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery .  Your 
>> consumer / SATA Hitachi drives likely do not put a limit on the time
>> the drive may block on a command while handling internal errors.  If
>> we consider that gpt/gisk06-live encountered some kind of error and
>> had to relocate a significant number of blocks or perform some other
>> error recovery, then it very well may have timed out long enough for
>> siis(4) to drop the device.  I have no idea what the timeouts are set
>> to in the siis(4) driver, nor does anything in your SMART report
>> stick out to me (though I'm certainly no expert with SMART data, and
>> my understanding is that many drive manufacturers report the various
>> parameters in different ways).

Command timeouts are usually defined by the ada(4) peripheral driver and
the ATA transport layer of CAM. Most of these timeouts are set to 30
seconds. The only time value defined by siis(4) itself is the hard reset
time, currently 15 seconds.

Since the drive did not reappear after `camcontrol reset/rescan ...` was
run a significant time later, and instead required a power cycle, I
doubt that any timeout value could have helped.

It is also theoretically possible that the controller firmware got
stuck, not the drive. If the problem repeats, it would be interesting to
power cycle only the affected drive.

> IIRC mav@ (CCed) made a commit regarding this to -current in the not so
> distant past. I do not know about the MFC status of this, or if it may
> have helped or not in this situation.

My last commit to siis(4), made 2 weeks ago (and merged recently), fixed
a specific bug in timeout handling that led to a system crash. I don't
see similar symptoms here.

If there were any messages before "Oct  2 00:50:53 kraken kernel:
(ada0:siisch0:0:0:0): lost device", they could give some hints about the
original problem. Messages after it could be just a consequence.
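
A minimal sketch of one way to pull out the messages preceding the
"lost device" line from a saved log, assuming Python is available. The
function name `lines_before_marker` and the default of 20 context lines
are hypothetical choices for illustration only:

```python
from collections import deque


def lines_before_marker(lines, marker="lost device", context=20):
    """Return up to `context` lines preceding the first line containing `marker`.

    `lines` is any iterable of log lines (for example an open file).
    An empty list is returned if the marker never appears.
    """
    recent = deque(maxlen=context)  # keeps only the last `context` lines seen
    for line in lines:
        if marker in line:
            return list(recent)
        recent.append(line.rstrip("\n"))
    return []


# Hypothetical usage, assuming the kernel messages were logged to
# /var/log/messages:
#
#     with open("/var/log/messages", errors="replace") as log:
#         for entry in lines_before_marker(log):
#             print(entry)
```

Only the first occurrence of the marker is considered, since later
"lost device" lines may already be a consequence of the original
problem.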

Enabling verbose kernel messages (for example, by booting with
`boot -v`) could give more information about what happened there.

-- 
Alexander Motin