A failed drive causes system to hang

Jeremy Chadwick jdc at koitsu.org
Sun Apr 14 19:28:31 UTC 2013


On Sun, Apr 14, 2013 at 12:34:46PM +0200, Radio młodych bandytów wrote:
> On 13/04/2013 02:07, Jeremy Chadwick wrote:
> >On Sat, Apr 13, 2013 at 12:33:10AM +0200, Radio młodych bandytów wrote:
> >>On 13/04/2013 00:03, Jeremy Chadwick wrote:
> >>>On Fri, Apr 12, 2013 at 11:52:31PM +0200, Radio młodych bandytów wrote:
> >>>>On 11/04/2013 23:24, Jeremy Chadwick wrote:
> >>>>>On Thu, Apr 11, 2013 at 10:47:32PM +0200, Radio młodych bandytów wrote:
> >>>>>>Seeing a ZFS thread, I decided to write about a similar problem that
> >>>>>>I experience.
> >>>>>>I have a failing drive in my array. I need to RMA it, but don't have
> >>>>>>the time, and it fails rarely enough to be yet another annoyance.
> >>>>>>The failure is simple: it fails to respond.
> >>>>>>When it happens, the only thing I found I can do is switch consoles.
> >>>>>>Any command fails, login fails, apps hang.
> >>>>>>
> >>>>>>On the 1st console I see a series of messages like:
> >>>>>>
> >>>>>>(ada0:ahcich0:0:0:0): CAM status: Command timeout
> >>>>>>(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> >>>>>>(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED
> >>>>>>
> >>>>>>I use RAIDZ1 and I'd expect that no single failure would cause the
> >>>>>>system to fail...
> >>>>>
> >>>>>You need to provide full output from "dmesg", and you need to define
> >>>>>what the word "fails" means (re: "any command fails", "login fails").
> >>>>Fails = hangs. When trying to log in, I can type my user name, but
> >>>>after I press enter the password prompt never appears.
> >>>>As to dmesg, tough luck. I have 2 photos on my phone and their
> >>>>transcripts are all I can give until the problem reappears (which
> >>>>should take up to 2 weeks). Photos are blurry and in many cases I'm
> >>>>not sure what exactly is there.
> >>>>
> >>>>Screen1:
> >>>>(ada0:ahcich0:0:0:0): FLUSHCACHE40. ACB: (ea?) 00 00 00 00 (cut?)
> >>>>(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-qu (cut)
> >>>>(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> >>>>(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 05 d3(cut)
> >>>>00
> >>>>(ada0:ahcich0:0:0:0): CAM status: Command timeout
> >>>>(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> >>>>(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 03 7b(cut)
> >>>>00
> >>>>(ada0:ahcich0:0:0:0): CAM status: Unconditionally Re-qu (cut)
> >>>>(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> >>>>(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 03 d0(cut)
> >>>>00
> >>>>(ada0:ahcich0:0:0:0): CAM status: Command timeout
> >>>>(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> >>>>
> >>>>
> >>>>Screen 2:
> >>>>ahcich0: Timeout on slot 29 port 0
> >>>>ahcich0: (unreadable, lots of numbers, some text)
> >>>>(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: (cc?) 00 (cut)
> >>>>(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
> >>>>(aprobe0:ahcich0:0:0:0): Error (5?), Retry was blocked
> >>>>ahcich0: Timeout on slot 29 port 0
> >>>>ahcich0: (unreadable, lots of numbers, some text)
> >>>>(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: (cc?) 00 (cut)
> >>>>(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
> >>>>(aprobe0:ahcich0:0:0:0): Error (5?), Retry was blocked
> >>>>ahcich0: Timeout on slot 30 port 0
> >>>>ahcich0: (unreadable, lots of numbers, some text)
> >>>>(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 01 (cut)
> >>>>(ada0:ahcich0:0:0:0): CAM status: Command timeout
> >>>>(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> >>>>(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 01 (cut)
> >>>>
> >>>>Both are from the same event. In general, messages:
> >>>>
> >>>>(ada0:ahcich0:0:0:0): CAM status: Command timeout
> >>>>(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
> >>>>(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED.
> >>>>
> >>>>are the most common.
> >>>>
> >>>>I've waited for more than 1/2 hour once and the system didn't return
> >>>>to a working state, the messages kept flowing and pretty much
> >>>>nothing was working. Interestingly, I remember that it happened
> >>>>to me even when I was using an installer (the PC-BSD one), before the
> >>>>actual installation began, so the disk stored no program data. And I
> >>>>*think* there was no ZFS yet anyway.
> >>>>
> >>>>>
> >>>>>I've already demonstrated that loss of a disk in raidz1 (or even 2 disks
> >>>>>in raidz2) does not cause "the system to fail" on stable/9.  However,
> >>>>>if you lose enough members or vdevs to cause catastrophic failure, there
> >>>>>may be anomalies depending on how your system is set up:
> >>>>>
> >>>>>http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html
> >>>>>
> >>>>>If the pool has failmode=wait, any I/O to that pool will block (wait)
> >>>>>indefinitely.  This is the default.
> >>>>>
> >>>>>If the pool has failmode=continue, existing write I/O operations will
> >>>>>fail with EIO (I/O error) (and hopefully applications/daemons will
> >>>>>handle that gracefully -- if not, that's their fault) but any subsequent
> >>>>>I/O (read or write) to that pool will block (wait) indefinitely.
> >>>>>
> >>>>>If the pool has failmode=panic, the kernel will immediately panic.
> >>>>>
> >>>>>If the CAM layer is what's wedged, that may be a different issue (and
> >>>>>not related to ZFS).  I would suggest running stable/9 as many
> >>>>>improvements in this regard have been committed recently (some related
> >>>>>to CAM, others related to ZFS and its new "deadman" watcher).
> >>>>
> >>>>Yeah, because of the installer failure, I don't think it's related to ZFS.
> >>>>Even if it is, for now I won't set any ZFS properties, in the hope
> >>>>that it repeats and I can get better data.
> >>>>>
> >>>>>Bottom line: terse output of the problem does not help.  Be verbose,
> >>>>>provide all output (commands you type, everything!), as well as any
> >>>>>physical actions you take.
> >>>>>
> >>>>Yep. In fact having little data was what made me hesitate to write
> >>>>about it; since I did already, I'll do my best to get more info,
> >>>>though for now I can only wait for a repetition.
> >>>>
> >>>>
> >>>>On 12/04/2013 00:08, Quartz wrote:
> >>>>>>Seeing a ZFS thread, I decided to write about a similar problem that I
> >>>>>>experience.
> >>>>>
> >>>>>I'm assuming you're referring to my "Failed pool causes system to hang"
> >>>>>thread. I wonder if there's some common issue with zfs where it locks up
> >>>>>if it can't write to disks how it wants to.
> >>>>>
> >>>>>I'm not sure how similar your problem is to mine. What's your pool setup
> >>>>>look like? Redundancy options? Are you booting from a pool? I'd be
> >>>>>interested to know if you can just yank the cable to the drive and see
> >>>>>if the system recovers.
> >>>>>
> >>>>>You seem to be worse off than me; I can still log in and run at least a
> >>>>>couple of commands. I'm booting from a straight UFS drive though.
> >>>>>
> >>>>>______________________________________
> >>>>>it has a certain smooth-brained appeal
> >>>>>
> >>>>Like I said, I don't think it's ZFS-specific, but just in case...:
> >>>>RAIDZ1, root on ZFS. I should reduce the severity of a potential pool
> >>>>loss before pulling any cables, so no tests for now.
> >>>
> >>>Key points:
> >>>
> >>>1. We now know why "commands hang" and anything I/O-related blocks
> >>>(waits) for you: because your root filesystem is ZFS.  If the ZFS layer
> >>>is waiting on CAM, and CAM is waiting on your hardware, then those I/O
> >>>requests are going to block indefinitely.  So now you know the answer to
> >>>why that happens.
> >>>
> >>>2. I agree that the problem is not likely in ZFS, but rather either with
> >>>CAM, the AHCI implementation used, or hardware (either disk or storage
> >>>controller).
> >>>
> >>>3. Your lack of "dmesg" is going to make this virtually impossible to
> >>>solve.  We really, ***really*** need that.  I cannot stress this enough.
> >>>This will tell us a lot of information about your system.  We're also
> >>>going to need to see "zpool status" output, as well as "zpool get all"
> >>>and "zfs get all".  "pciconf -lvbc" would also be useful.
> >>>
> >>>There are some known "gotchas" with certain models of hard disks or AHCI
> >>>controllers (which is responsible is unknown at this time), but I don't
> >>>want to start jumping to conclusions until full details can be provided
> >>>first.
> >>>
> >>>I would recommend formatting a USB flash drive as FAT/FAT32, booting
> >>>into single-user mode, then mounting the USB flash drive and issuing
> >>>the above commands + writing the output to files on the flash drive,
> >>>then provide those here.
> >>>
> >>>We really need this information.
> >>>
> >>>4. Please involve the PC-BSD folks in this discussion.  They need to be
> >>>made aware of issues like this so they (and iXSystems, potentially) can
> >>>investigate from their side.
> >>>
> >>OK, thanks for the info.
> >>Since dmesg is so important, I'd say the best thing is to wait for
> >>the problem to happen again. When it does, I'll restart the thread
> >>with all the information you requested here, and with a PC-BSD
> >>cross-post.
> >>
> >>However, I just got a different hang a while ago. This time it
> >>was temporary: I switched to console0 after ~10 seconds and there
> >>were 2 errors. Nothing appeared for ~1 minute, so I switched back
> >>and the system was OK. It was a different drive, one I hadn't seen
> >>problems with before. And I think the earlier errors mentioned ahci;
> >>here it's ata.
> >>
> >>dmesg:
> >>
> >>fuse4bsd: version 0.3.9-pre1, FUSE ABI 7.19
> >>(ada1:ata0:0:0:0): READ_DMA48. ACB: 25 00 82 46 b8 40 25 00 00 00 01 00
> >>(ada1:ata0:0:0:0): CAM status: Command timeout
> >>(ada1:ata0:0:0:0): Retrying command
> >>vboxdrv: fAsync=0 offMin=0x53d offMax=0x52b9
> >>linux: pid 17170 (npviewer.bin): syscall pipe2 not implemented
> >>(ada1:ata0:0:0:0): READ_DMA48. ACB: 25 00 87 1a c7 40 1a 00 00 00 01 00
> >>(ada1:ata0:0:0:0): CAM status: Command timeout
> >>(ada1:ata0:0:0:0): Retrying command
> >>
> >>{another 150KBytes of data snipped}
> >
> >The above output indicates that there was a timeout when trying to issue
> >a 48-bit DMA request to the disk.  The disk did not respond to the
> >request within 30 seconds.
> >
> >If you were using AHCI, we'd be able to see if the AHCI layer was
> >reporting signalling problems or other anomalies that could explain the
> >behaviour.  With ATA, such is significantly limited.  It's worse if
> >you're hiding/not showing us the entire information.
> >
> >The classic FreeBSD ATA driver does not provide command queueing (NCQ),
> >while AHCI via CAM does.  The difference is that command queueing causes
> >xxx_FPDMA_QUEUED CDBs to be issued to the disk.
> >
> >I'm going to repeat myself -- for the last time: CAN YOU PLEASE JUST
> >PROVIDE "DMESG" FROM THE SYSTEM?  Like after a fresh reboot?  If you're
> >able to provide all of the above, I don't know why you can't provide
> >dmesg.  It is the most important information that there is.  I am sick
> >and tired of stressing this point.
> Sorry. I thought just the error was important. So here you are:
> dmesg.boot:
> http://pastebin.com/LFXPusMX

Thank you.  Please read everything I have written below before doing
anything.

Based on this output, we can see the following:

* AHCI is actively in use, and is a slowly-becoming-infamous ATI IXP700
  controller:

  ahci0: <ATI IXP700 AHCI SATA controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xf9fffc00-0xf9ffffff irq 19 at device 17.0 on pci0

* The system has 3 disks attached to this controller:

  ada0 at ahcich0 bus 0 scbus2 target 0 lun 0
  ada0: <WDC WD15EARS-22MVWB0 51.0AB51> ATA-8 SATA 2.x device
  ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
  ada0: Command Queueing enabled
  ada0: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
  ada1 at ata0 bus 0 scbus6 target 0 lun 0
  ada1: <WDC WD15EARS-22MVWB0 51.0AB51> ATA-8 SATA 2.x device
  ada1: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
  ada1: 1430799MB (2930277168 512 byte sectors: 16H 63S/T 16383C)
  ada2 at ata0 bus 0 scbus6 target 1 lun 0
  ada2: <ST3640323AS SD13> ATA-8 SATA 2.x device
  ada2: 150.000MB/s transfers (SATA, UDMA6, PIO 8192bytes)
  ada2: 610480MB (1250263728 512 byte sectors: 16H 63S/T 16383C)

Let's talk about ada0 and ada1 first.

The WD15EARS drives have two features that can cause problems during I/O
requests:

- They are "energy efficient" drives ("Green", "GreenPower", or -GP).
  These drives are known to repeatedly and very aggressively park their
  heads, which can cause horrible performance and other I/O anomalies,
  particularly timeouts during I/O operations (reads or writes).

- Their physical sector size is 4096 bytes, but like other "Advanced
  Format" drives they logically advertise 512-byte sectors to retain
  compatibility.  Partitions which
  do not align themselves to 4096-byte boundaries will result in abysmally
  degraded performance, particularly during writes.  The drive
  internally may also have issues trying to deal with this situation
  after prolonged use (in a non-aligned state).

  When using 4KByte sector drives with ZFS, you have to "prep" them
  using gnop(8) first.  Ivan Voras describes this procedure here:

  http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html

  When using 4KByte sector drives with UFS/UFS2, the easiest method is
  to use the GPT partitioning scheme (rather than classic MBR) to ensure
  alignment.  This can be done manually through gpart(8).  Warren
  Block's article on this is recommended (for both SSDs and MHDDs),
  though for MHDDs you can change the alignment size to "4k" rather than
  "1m":

  http://www.wonkity.com/~wblock/docs/html/ssd.html

- ada1 is only negotiating SATA150 speeds, even though this is a
  SATA300-capable drive (compare it to ada0).  The WD15EARS drives have
  a jumper that can limit the PHY speed to SATA150 speeds.  Please
  shut the system down and remove the disk (ada1) and physically
  examine it to see if that jumper is installed.  If it is, please
  remove it.  Jumper location:
  http://wdc.custhelp.com/app/answers/detail/search/1/a_id/1679#jumper
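For readers following along, the procedure from the two articles above can be
sketched roughly as follows.  The device and pool names here are placeholders,
not taken from this system, and this is only illustrative -- zpool create is
destructive, so do not run it against disks holding data:

```shell
# Sketch of the gnop(8) procedure for building a ZFS pool with 4K
# alignment (ashift=12), per Ivan Voras' article.  Device/pool names
# are examples only -- adapt to your own layout.

# Create a transparent nop provider that advertises 4096-byte sectors:
gnop create -S 4096 /dev/ada0

# Create the pool on the .nop device so ZFS chooses ashift=12:
zpool create mypool ada0.nop

# Export, destroy the nop device, and re-import; the ashift persists:
zpool export mypool
gnop destroy /dev/ada0.nop
zpool import mypool

# For UFS on GPT, 4K-aligned partitions can be created with gpart(8):
gpart create -s gpt ada1
gpart add -t freebsd-ufs -a 4k ada1
```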

Now on ada2...

The ST3640323AS is one of Seagate's infamous Barracuda 7200.11 drives,
which are known for:

- Infamous firmware bugs, the most serious of which is the drive becoming
  permanently catatonic (this has been covered by the media at great
  length).

- Being "energy efficient", which means excessively parking its heads
  (same issue as the WD "Green" drives, but with no way to disable
  or inhibit the behaviour).

The firmware on your ST3640323AS is version SD13; the latest firmware
is SD1B (8 versions newer).  I would strongly suggest upgrading this
drive ASAP.

Thankfully the ST3640323AS is a true 512-byte sector drive.

Back to the rest of the specifics...

* ZFS is in use for the root filesystem and possibly others:

Trying to mount root from zfs:tank1/ROOT/default []...

* The system is amd64 and has 4GB RAM; ZFS prefetch is therefore
  forcefully disabled:

real memory  = 4294967296 (4096 MB)
avail memory = 4073431040 (3884 MB)
ZFS NOTICE: Prefetch is disabled by default if less than 4GB of RAM is present;
            to enable, add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf.
ZFS filesystem version 5
ZFS storage pool version 28

...now, all that said, much more output is needed.

I would like to see output from the following commands:

- gpart show ada0
- gpart show ada1
- gpart show ada2
- zfs get all (keep reading for why I'm asking for this again)

And also this command, run per every pool you have on the system:

- zdb -C {poolname} | grep ashift
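(For readers: ashift is the base-2 log of the pool's internal alignment, so
on 4K-sector drives like these you want to see ashift=12; ashift=9 means the
pool was created with 512-byte alignment.  A hypothetical example, with the
pool name as a placeholder:)

```shell
# Check the recorded alignment of a pool named "mypool":
#   ashift: 9  -> 2^9  = 512-byte alignment (wrong for 4K-sector drives)
#   ashift: 12 -> 2^12 = 4096-byte alignment (correct for the WD15EARS)
zdb -C mypool | grep ashift
```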

(For readers) I do not need "zpool status" because I've seen it here:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/017011.html
http://pastebin.com/D3Av7x9X

(Also for readers) I do not need "zfs get all" or "zpool get all"
because the individual has made them available here:

zfs get all   -- http://pastebin.com/4sT37VqZ
zpool get all -- http://pastebin.com/HZJTJPa2

One thing I do see is that at some point you did enable compression on
your tank1 filesystem (I can tell because "compressratio" is 1.03x
rather than 1.00x), but may have disabled it later.  I'm not going to
get into a debate about this, but my advice is to not use compression or
dedup (either feature) on FreeBSD ZFS.

Finally, as I asked in another post in this thread, I would like you to
provide output from the following command (once per disk):

- smartctl -x /dev/adaX
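As a small sketch, all three disks can be captured in one pass (smartctl
comes from the sysutils/smartmontools port; the output filenames here are
arbitrary):

```shell
# Collect extended SMART/device statistics from each disk into its own
# file, suitable for pasting or attaching to the list:
for d in ada0 ada1 ada2; do
    smartctl -x /dev/$d > smartctl-$d.txt
done
```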

> >Furthermore, please stop changing ATA vs. AHCI interface drivers.
> >The more you change/screw around with, the less likely people are going
> >to help.  CHANGE NOTHING ON THE SYSTEM.  Leave it how it is.  Do not
> >fiddle with things or start flipping switches/changing settings/etc. to
> >"try and relieve the problem".  You're asking other people for help,
> >which means you need to be patient and follow what we ask.
> I haven't changed one bit myself. It may have been a change of
> defaults in PC-BSD. I just asked them about it.
> Or maybe different drives use different drivers.

If AHCI is enabled in your system BIOS, FreeBSD 9.x will use AHCI with
CAM.  If AHCI is not enabled in your system BIOS, FreeBSD 9.x will use
classic ata(4) with CAM.  In both cases disks will show up as /dev/adaX,
but whether one is controlled with ahcichXX or ataX depends on AHCI
capability.
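One quick way to confirm which driver a given disk is attached through, as a
sketch:

```shell
# "camcontrol devlist -v" shows which controller driver each SCSI bus
# sits on, e.g. "scbus2 on ahcich0 bus 0" (AHCI via CAM) versus
# "scbus6 on ata0 bus 0" (classic ata(4)):
camcontrol devlist -v

# The attach messages in the boot log tell the same story:
grep -iE 'ahci|atapci' /var/run/dmesg.boot
```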

Your initial post showed this:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/016996.html

(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Error 5, Periph was invalidated
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED

Which shows AHCI in use.

But then later, a different post said this:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-April/017011.html

(ada1:ata0:0:0:0): READ_DMA48. ACB: 25 00 87 1a c7 40 1a 00 00 00 01 00
(ada1:ata0:0:0:0): CAM status: Command timeout
(ada1:ata0:0:0:0): Retrying command

Which shows AHCI **not** in use.

FreeBSD **does not** use "different drivers per drive".  No OS does
this: period.  This is not how storage subsystems work, nor how they
have ever worked.

If you ever see the system suddenly reporting "ataX" (read: I said
"atax" not "adaX"), and you are **ABSOLUTELY CERTAIN** AHCI mode is
enabled in your BIOS, then to me that means your motherboard or SATA
controller is behaving very erratically/wrong.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

