A failed drive causes system to hang

Sun Apr 14 21:24:42 UTC 2013

On Sun, Apr 14, 2013 at 10:35:33PM +0200, Radio m?odych bandytw wrote:
> On 14/04/2013 21:52, Jeremy Chadwick wrote:
> > {snipping lots for brevity}
> > 
> > On Sun, Apr 14, 2013 at 12:28:30PM -0700, Jeremy Chadwick wrote:
> >> On Sun, Apr 14, 2013 at 12:34:46PM +0200, Radio m?odych bandytw wrote:
> >>> Sorry. I thought just the error was important. So here you are:
> >>> dmesg.boot:
> >>> http://pastebin.com/LFXPusMX
> >>
> >> Thank you.  Please read everything I have written below before doing
> >> anything.
> >>
> >> Based on this output, we can see the following:
> >>
> >> * AHCI is actively in use, and is a slowly-becoming-infamous ATI IXP700
> >>   controller:
> >>
> >>   ahci0: <ATI IXP700 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xf9fffc00-0xf9ffffff irq 19 at device 17.0 on pci0
> >>
> >> * The system has 3 disks attached to this controller:
> >>
> >>   ada0 at ahcich0 bus 0 scbus2 target 0 lun 0
> >>   ada0: <WDC WD15EARS-22MVWB0 51.0AB51> ATA-8 SATA 2.x device
> >>   ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> >>   ada0: Command Queueing enabled
> >>   ada0: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
> >>   ada1 at ata0 bus 0 scbus6 target 0 lun 0
> >>   ada1: <WDC WD15EARS-22MVWB0 51.0AB51> ATA-8 SATA 2.x device
> >>   ada1: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
> >>   ada1: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
> >>   ada2 at ata0 bus 0 scbus6 target 1 lun 0
> >>   ada2: <ST3640323AS SD13> ATA-8 SATA 2.x device
> >>   ada2: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
> >>   ada2: 610480MB (1250263728 512 byte sectors: 16H 63S/T 16383C)
> >>
> >> Let's talk about ada0 and ada1 first.
> > 
> > Hold up a minute -- I just noticed some key information here (see what
> > happens with big conflated threads?), and it sheds some light on my
> > concerns with AHCI vs. classic ata(4):
> > 
> > ada0 -- attached to ahcich0
> > ada1 -- attached to ata0 (presumably a "master" drive)
> > ada2 -- attached to ata0 (presumably a "slave" drive)
> > 
> > This is extremely confusing, because ata0 is a classic ATA controller (I
> > can even tell from the classic ISA I/O port ranges):
> > 
> > atapci1: <ATI IXP700/800 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
> > ata0: <ATA channel> at channel 0 on atapci1
> > ata1: <ATA channel> at channel 1 on atapci1
> > 
> > Yet the WD15EARS and ST3640323AS drives are physically SATA drives.
> > 
> > Are you using SATA-to-IDE adapters on these two drives?
> No.
> > 
> > If not, this seems to indicate the motherboard and/or SATA controller
> > is actually only binding 1 disk to AHCI, while the others are bound to
> > the same controller operating in (possibly) "SATA Enhanced" mode.
> > 
> > This would be the first I've ever seen of this (a controller operating
> > in both modes simultaneously), but I have a lot more experience with
> > Intel SATA controllers than I do AMD.
> > 
> > I don't know why a system would do this, unless all of this can be
> > controlled via the BIOS somehow.  What a mess.
> > 
> I looked into BIOS and it can be controlled. 6 ports are divided into 2
> triples and I can switch mode of each triple independently. One drive is
> connected to one and two to the other.
> Looks like there's a bug because both triples are set to ATA.
> I left them like that for now.

What exact motherboard model is this?  I'd like to review the manual.

> Anyway, I got the hang again, so I can provide dmesg. I was not at the
> computer when it happened, so there's only the last screen though...
> pastebin.com/bjYtzPgs

Thank you.  Sadly the log snippet doesn't have timestamps but this is
what transpired:

The log snippet you showed indicates the following:

* An NCQ-based write CDB (WRITE_FPDMA_QUEUED) was issued to the
  ada0 drive attached to channel ahcich0 of controller ahci0, and
  the disk or controller did not respond within 30 seconds (I'm
  assuming PC-BSD did not change kern.cam.ada.default_timeout from
  the default of 30 seconds)

* The same request was resubmit to the controller (CAM will try
  submission of a CDB up to 5 times (i.e. 4 retries), which is
  controlled with kern.cam.ada.retry_count).

* The AHCI controller (rather the specific channel of the AHCI
  controller) also reported that the underlying disk/device was
  not responding (re: "Timeout on slot X port X").  I see no
  SERR condition.

* An ATA_IDENTIFY CDB was issued to the ada0 drive attached to
  channel ahcich0 of controller ahci0, and this also timed out
  after 30 seconds.  My gut feeling is that this system is
  running smartd(8); it's possible the kernel itself could submit
  the CDB to the drive, but in this condition/state I don't know
  why it'd do that.

* Rinse lather repeat.

To me, at first glance, this looks like the ada0 disk is going
catatonic.  The controller itself seems to be responding fine, just that
the disk attached to ahcich0 is locking up hard.  I see no sign of an
AHCI reset ("AHCI reset..." message) either.

So why does your system "hang" (meaning why can't you log in, why do
applications stop working, etc.) when this happens?  Simple:

You're using ZFS for your root filesystem, as shown here:

Trying to mount root from zfs:tank1/ROOT/default []...

Your ZFS pool called tank1 consists of a raidz1 pool of 3 devices
(more specifically partitions): ada0, ada1, and something that is
missing.  Recap:

http://pastebin.com/D3Av7x9X

The pool is already degraded, and as you know, raidz1 can only suffer up
to loss of one vdev (in this case a disk) before ZFS will begin behaving
based on what the pool's "failmode" property is.

In effect, when this happens, you're down to only 1 disk: ada1, and
that's not sufficient.  So ZFS does exactly what it should (with
failmode=wait, the default): it waits indefinitely, hoping that things
recover.

Because this is your root filesystem, as well as tons of other
filesystems (including /usr, /var, /var/log, and so on):

http://pastebin.com/4sT37VqZ

...any I/O submit to filesystems part of pool tank1 will indefinitely
block/wait until things recover.  Except they don't recover (and that
isn't the fault of ZFS).

I imagine if you let the system sit for roughly 5*30 seconds (see above
for how I calculated that), you would eventually see a message on the
console that looks something like this:

(ada0:ahcich0:0:0:0): lost device
(ada0:ahcich0:0:0:0): removing device entry

So, the crux of your problem is:

1. Your disks fall off the bus for reasons unknown at this time -- I'm
still waiting on smartctl -x output for each of your disks.  Your disks
themselves may actually be okay (I need to review the output to
determine that) and the issue may be with SATA cabling, a faulty PSU, or
a completely broken SATA controller or motherboard (bad traces, etc.).
I am not going to help diagnose those problems, because the only
reliable method is to start replacing each part, piece by piece, and see
if the issue goes away.

2. Your array is already degraded/broken yet you don't care to fix it.
If the array was in decent shape and ada0 fell off the bus, things would
still work because ada1 and ada2 would be functioning (re: raidz1).

If you were using UFS instead of ZFS for your root filesystem, you would
still have the same issue, just that the system would kernel panic.  You
can induce that behaviour with ZFS as well using failmode=panic.

There isn't much more for me to say.  Everything is behaving how it's
designed, from what I can tell.

When you lose your root filesystem, you really can't expect the system
to be in some "magical usable state".

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |