A failed drive causes system to hang

Mon Apr 15 06:24:44 UTC 2013

On 14/04/2013 23:24, Jeremy Chadwick wrote:
> On Sun, Apr 14, 2013 at 10:35:33PM +0200, Radio m?odych bandytw wrote:
>> On 14/04/2013 21:52, Jeremy Chadwick wrote:
>>> {snipping lots for brevity}
>>>
>>> On Sun, Apr 14, 2013 at 12:28:30PM -0700, Jeremy Chadwick wrote:
>>>> On Sun, Apr 14, 2013 at 12:34:46PM +0200, Radio m?odych bandytw wrote:
>>>>> Sorry. I thought just the error was important. So here you are:
>>>>> dmesg.boot:
>>>>> http://pastebin.com/LFXPusMX
>>>>
>>>> Thank you.  Please read everything I have written below before doing
>>>> anything.
>>>>
>>>> Based on this output, we can see the following:
>>>>
>>>> * AHCI is actively in use, and is a slowly-becoming-infamous ATI IXP700
>>>>   controller:
>>>>
>>>>   ahci0: <ATI IXP700 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xf9fffc00-0xf9ffffff irq 19 at device 17.0 on pci0
>>>>
>>>> * The system has 3 disks attached to this controller:
>>>>
>>>>   ada0 at ahcich0 bus 0 scbus2 target 0 lun 0
>>>>   ada0: <WDC WD15EARS-22MVWB0 51.0AB51> ATA-8 SATA 2.x device
>>>>   ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
>>>>   ada0: Command Queueing enabled
>>>>   ada0: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
>>>>   ada1 at ata0 bus 0 scbus6 target 0 lun 0
>>>>   ada1: <WDC WD15EARS-22MVWB0 51.0AB51> ATA-8 SATA 2.x device
>>>>   ada1: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
>>>>   ada1: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
>>>>   ada2 at ata0 bus 0 scbus6 target 1 lun 0
>>>>   ada2: <ST3640323AS SD13> ATA-8 SATA 2.x device
>>>>   ada2: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
>>>>   ada2: 610480MB (1250263728 512 byte sectors: 16H 63S/T 16383C)
>>>>
>>>> Let's talk about ada0 and ada1 first.
>>>
>>> Hold up a minute -- I just noticed some key information here (see what
>>> happens with big conflated threads?), and it sheds some light on my
>>> concerns with AHCI vs. classic ata(4):
>>>
>>> ada0 -- attached to ahcich0
>>> ada1 -- attached to ata0 (presumably a "master" drive)
>>> ada2 -- attached to ata0 (presumably a "slave" drive)
>>>
>>> This is extremely confusing, because ata0 is a classic ATA controller (I
>>> can even tell from the classic ISA I/O port ranges):
>>>
>>> atapci1: <ATI IXP700/800 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
>>> ata0: <ATA channel> at channel 0 on atapci1
>>> ata1: <ATA channel> at channel 1 on atapci1
>>>
>>> Yet the WD15EARS and ST3640323AS drives are physically SATA drives.
>>>
>>> Are you using SATA-to-IDE adapters on these two drives?
>> No.
>>>
>>> If not, this seems to indicate the motherboard and/or SATA controller
>>> is actually only binding 1 disk to AHCI, while the others are bound to
>>> the same controller operating in (possibly) "SATA Enhanced" mode.
>>>
>>> This would be the first I've ever seen of this (a controller operating
>>> in both modes simultaneously), but I have a lot more experience with
>>> Intel SATA controllers than I do AMD.
>>>
>>> I don't know why a system would do this, unless all of this can be
>>> controlled via the BIOS somehow.  What a mess.
>>>
>> I looked into BIOS and it can be controlled. 6 ports are divided into 2
>> triples and I can switch mode of each triple independently. One drive is
>> connected to one and two to the other.
>> Looks like there's a bug because both triples are set to ATA.
>> I left them like that for now.
> 
> What exact motherboard model is this?  I'd like to review the manual.
> 
>> Anyway, I got the hang again, so I can provide dmesg. I was not at the
>> computer when it happened, so there's only the last screen though...
>> pastebin.com/bjYtzPgs
> 
> Thank you.  Sadly the log snippet doesn't have timestamps but this is
> what transpired:
> 
> The log snippet you showed indicates the following:
> 
> * An NCQ-based write CDB (WRITE_FPDMA_QUEUED) was issued to the
>   ada0 drive attached to channel ahcich0 of controller ahci0, and
>   the disk or controller did not respond within 30 seconds (I'm
>   assuming PC-BSD did not change kern.cam.ada.default_timeout from
>   the default of 30 seconds)
Checked: default.
> 
> * The same request was resubmit to the controller (CAM will try
>   submission of a CDB up to 5 times (i.e. 4 retries), which is
>   controlled with kern.cam.ada.retry_count).
> 
Checked: default.
> * The AHCI controller (rather the specific channel of the AHCI
>   controller) also reported that the underlying disk/device was
>   not responding (re: "Timeout on slot X port X").  I see no
>   SERR condition.
> 
> * An ATA_IDENTIFY CDB was issued to the ada0 drive attached to
>   channel ahcich0 of controller ahci0, and this also timed out
>   after 30 seconds.  My gut feeling is that this system is
>   running smartd(8); it's possible the kernel itself could submit
>   the CDB to the drive, but in this condition/state I don't know
>   why it'd do that.
Nope, smartd doesn't run.
> 
> * Rinse lather repeat.
> 
> To me, at first glance, this looks like the ada0 disk is going
> catatonic.  The controller itself seems to be responding fine, just that
> the disk attached to ahcich0 is locking up hard.  I see no sign of an
> AHCI reset ("AHCI reset..." message) either.
> 
> So why does your system "hang" (meaning why can't you log in, why do
> applications stop working, etc.) when this happens?  Simple:
> 
> You're using ZFS for your root filesystem, as shown here:
> 
> Trying to mount root from zfs:tank1/ROOT/default []...
> 
> Your ZFS pool called tank1 consists of a raidz1 pool of 3 devices
> (more specifically partitions): ada0, ada1, and something that is
> missing.  Recap:
> 
> http://pastebin.com/D3Av7x9X
> 
> The pool is already degraded, and as you know, raidz1 can only suffer up
> to loss of one vdev (in this case a disk) before ZFS will begin behaving
> based on what the pool's "failmode" property is.
> 
> In effect, when this happens, you're down to only 1 disk: ada1, and
> that's not sufficient.  So ZFS does exactly what it should (with
> failmode=wait, the default): it waits indefinitely, hoping that things
> recover.
> 
> Because this is your root filesystem, as well as tons of other
> filesystems (including /usr, /var, /var/log, and so on):
> 
> http://pastebin.com/4sT37VqZ
> 
> ...any I/O submit to filesystems part of pool tank1 will indefinitely
> block/wait until things recover.  Except they don't recover (and that
> isn't the fault of ZFS).
> 
> I imagine if you let the system sit for roughly 5*30 seconds (see above
> for how I calculated that), you would eventually see a message on the
> console that looks something like this:
> 
> (ada0:ahcich0:0:0:0): lost device
> (ada0:ahcich0:0:0:0): removing device entry
I don't think so. I'm nearly sure it took me longer than that to write
the errors down alone. And when I discovered the system lockup it was in
such state already. The next time I can take precise time measurements.
> 
> So, the crux of your problem is:
> 
> 1. Your disks fall off the bus for reasons unknown at this time -- I'm
> still waiting on smartctl -x output for each of your disks.  Your disks
> themselves may actually be okay (I need to review the output to
> determine that) and the issue may be with SATA cabling, a faulty PSU, or
> a completely broken SATA controller or motherboard (bad traces, etc.).
> I am not going to help diagnose those problems, because the only
> reliable method is to start replacing each part, piece by piece, and see
> if the issue goes away.
> 
> 2. Your array is already degraded/broken yet you don't care to fix it.
> If the array was in decent shape and ada0 fell off the bus, things would
> still work because ada1 and ada2 would be functioning (re: raidz1).
> 
> If you were using UFS instead of ZFS for your root filesystem, you would
> still have the same issue, just that the system would kernel panic.  You
> can induce that behaviour with ZFS as well using failmode=panic.
> 
> There isn't much more for me to say.  Everything is behaving how it's
> designed, from what I can tell.
> 
> When you lose your root filesystem, you really can't expect the system
> to be in some "magical usable state".
> 
The disk is out of the array because I didn't put it back after RMA.
I RMA'd it because it used to cause precisely this kind of lockups on a
non-degraded array. And I've seen it in the installer running from an
entirely different device too.
I guess that after reproducing the issue and taking time measurements, I
should put the RMA'd drive back. I expect the problem to keep happening.
-- 
Twoje radio