out of HDD space - zfs degraded

Jeremy Chadwick freebsd at jdc.parodius.com
Mon Oct 4 14:59:27 UTC 2010


On Mon, Oct 04, 2010 at 05:03:47PM +0300, Alexander Motin wrote:
> Alexander Leidinger wrote:
> > On Sat, 02 Oct 2010 22:25:18 -0400 Steve Polyack <korvus at comcast.net>
> > wrote:
> > 
> >> I think it's worth thinking about TLER (or the absence of it) here - 
> >> http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery .  Your 
> >> consumer / SATA Hitachi drives likely do not put a limit on the time
> >> the drive may block on a command while handling internal errors.  If
> >> we consider that gpt/gisk06-live encountered some kind of error and
> >> had to relocate a significant number of blocks or perform some other
> >> error recovery, then it very well may have timed out long enough for
> >> siis(4) to drop the device.  I have no idea what the timeouts are set
> >> to in the siis(4) driver, nor does anything in your SMART report
> >> stick out to me (though I'm certainly no expert with SMART data, and
> >> my understanding is that many drive manufacturers report the various
> >> parameters in different ways).
> 
> Timeouts for commands are usually defined by the ada(4) peripheral
> driver and the ATA transport layer of CAM. Most timeouts are set to 30
> seconds. The only time value defined by siis(4) is the hard reset time,
> currently 15 seconds.
> 
> Since the drive did not reappear after `camcontrol reset/rescan ...`
> was run a significant time later, and instead required a power cycle,
> I doubt that any timeout value could have helped.
>
> It is also theoretically possible that the controller firmware was
> stuck, not the drive. It would be interesting to power cycle that
> specific drive if the problem repeats.

FWIW, I agree with mav@.  I also wanted to talk a bit about TLER, since
it wouldn't have helped in this situation.

TLER wouldn't have helped because the drive (either mechanically or the
firmware) went catatonic and required a power-cycle.  TLER would have
caused siis(4) to see a proper ATA error code for the read/write it
submitted to the drive, with a timeout that is (ideally) shorter than
that of the siis(4) or ada(4) layers.  That ATA command would have
failed, and the OS would almost certainly have continued to submit more
requests, each resulting in an error response, etc...
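
To make the timing argument concrete, here is a minimal sketch in Python.
It is a toy model, not FreeBSD code: the function name and all parameters
are hypothetical, the 30-second default simply mirrors the command timeout
quoted above, and the 7-second TLER limit used in the examples is an
assumed value for illustration only.

```python
def command_outcome(recovery_time, tler_limit=None,
                    driver_timeout=30.0, catatonic=False):
    """Toy model of a single ATA command that runs into a media error.

    recovery_time  -- seconds the drive needs for internal error recovery
    tler_limit     -- seconds after which the drive gives up and returns
                      an error; None means no limit (no TLER)
    driver_timeout -- seconds after which the OS gives up on the command
    catatonic      -- True if the drive has stopped answering entirely

    Returns (who reported the outcome, seconds until it was reported).
    """
    # A catatonic drive never answers, so its TLER setting is irrelevant:
    # only the driver-side timeout can end the command.
    if catatonic:
        return ("driver timeout", driver_timeout)

    # A responsive drive answers either when recovery finishes or when
    # the TLER limit expires, whichever comes first.
    if tler_limit is not None and recovery_time > tler_limit:
        status, answered_at = "ATA error from drive", tler_limit
    else:
        status, answered_at = "success", recovery_time

    # If the drive takes longer than the driver is willing to wait,
    # the driver times the command out first.
    if answered_at >= driver_timeout:
        return ("driver timeout", driver_timeout)
    return (status, answered_at)


# Long recovery, no TLER: the driver gives up first.
print(command_outcome(45.0))
# Long recovery, assumed 7 s TLER limit: the drive reports an error first.
print(command_outcome(45.0, tler_limit=7.0))
# Catatonic drive: TLER makes no difference.
print(command_outcome(45.0, tler_limit=7.0, catatonic=True))
```

The last call is the case discussed in this thread: once the drive stops
answering, the only thing left is the driver's own timeout.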

Depending on how the drivers were written, this situation could cause
the storage driver to get stuck in an infinite loop trying to read or
write to a device that's catatonic.  I've seen this happen on Solaris
with, again, those Fujitsu disks (the drives return an error status in
response to the CDB, the OS/driver says "that's nice" and continues to
submit commands to the drive because it's still responding).
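
The loop described above can be illustrated with another small Python
sketch.  This is not how ada(4), siis(4) or any Solaris driver is actually
written; `submit_with_error_budget`, `send_command` and
`max_consecutive_errors` are hypothetical names, and the budget of 5 is an
arbitrary assumption.  It only shows the difference between retrying
forever and giving up on a device that keeps answering with errors.

```python
def submit_with_error_budget(send_command, max_consecutive_errors=5):
    """Retry a command, but give up after a fixed number of failures.

    send_command is a callable returning True on success and False when
    the drive answers with an error status.  Without the error budget,
    a drive that always answers with an error would keep this loop
    spinning forever.
    """
    errors = 0
    while errors < max_consecutive_errors:
        if send_command():
            return "ok"
        errors += 1
    # The drive still responds, but never usefully: stop talking to it.
    return "device dropped"


# A drive that answers every command with an error status.
print(submit_with_error_budget(lambda: False))

# A drive that fails twice and then recovers.
answers = iter([False, False, True])
print(submit_with_error_budget(lambda: next(answers)))
```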

What I'm trying to say: what ada(4) and/or siis(4) did was the Correct
Thing(tm) in my opinion.

This is one of the reasons I don't blindly enable TLER on WDC Black
disks that I have.  Someone would need to quite honestly test the
behaviour of TLER with FreeBSD (testing both disk catatonic state as
well as transient error + recovery) to see how things behave.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



More information about the freebsd-stable mailing list