A failed drive causes system to hang

Sat Apr 13 15:41:32 UTC 2013

On Sat, Apr 13, 2013 at 04:31:06AM -0400, Quartz wrote:
> >If the ZFS layer
> >is waiting on CAM, and CAM is waiting on your hardware, then those I/O
> >requests are going to block indefinitely.
> 
> >2. I agree that the problem is not likely in ZFS, but rather either with
> >CAM, the AHCI implementation used, or hardware (either disk or storage
> >controller).
> 
> Question:
> 
> How (or does) this relate to the hang that I'm seeing with my
> system?

It doesn't relate in any way, shape, or form.

This is what happens when end-users try to "correlate" their issues
with one another's without actually taking the time to fully read the
thread and follow along actively.  This has now happened *twice* in
this thread (once with user Lawrence K. Chen, and now again with
radiomlodychbandytow at o2.pl).

This sort of behaviour has happened with FreeBSD, particularly with
regard to storage/filesystems/etc., for as long as I can remember.

I am not going to get into a discussion on how to solve such social
dilemmas; the procedure is to use send-pr and wait for someone
in-the-know to respond asking for relevant information.  The FreeBSD
Handbook goes over how to file a PR and what to put in it.

http://www.freebsd.org/send-pr.html
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/problem-reports/article.html

> You mentioned cam issues when talking to me earlier, but
> less decisively than your comment here. What's the difference?

Your issue: "on my raidz2 pool, when I lose more than 2 disks, I/O to
the pool stalls indefinitely, but I can still use the system barring
ZFS-related things; I don't know how to get the system back into a
usable state from this situation".  That's based on these two
statements:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016822.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016847.html

radiomlodychbandytow at o2.pl's issue: "I'm seeing ATA-level errors from
one or more of my disks, can someone help?"

Lawrence K. Chen's issue: "I had a crash/issue and then the system hung
for a very long time at the mountroot phase".

Given the information known at this time, ALL THREE of these issues are
unrelated to one another.

As I've said elsewhere: it is very important that every single issue
reported be handled individually/separately.  I was given this advice
by a FreeBSD kernel developer some years ago and it's excellent.  It
might seem logical to try and correlate such things, but a lot of the
time the correlation turns out to be wrong and is a great waste of
everyone's time.  So Just Don't Do It(tm).

> >We're also
> >going to need to see "zpool status" output, as well as "zpool get all"
> >and "zfs get all".  "pciconf -lvbc" would also be useful.
> 
> You never asked for these when talking to me, but I can provide any
> of it if you want to look at it.

At this point in the conversation, WRT your issue, there's no indication
that it would help, but you've already given dmesg output:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016840.html

Otherwise, all you've provided so far is a general explanation.  You
have still not provided concise step-by-step information as I've asked.
I've gone so far as to give you an example of what to provide:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html

I will again point to the 2nd-to-last paragraph of my above-referenced
mail.

Another example of troubleshooting and how to do it: here's the effort
I went through over the course of some months to track down a bug in
CAM:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016324.html

READ: I'm not saying your issue is with CAM (it may be, but it may not
be -- there isn't enough information right now to determine that).  I'm
giving you an example of the troubleshooting/debugging effort that has
to go into things for issues of this nature.  You can even see from my
quoted material in that link that I spent many hours doing step-by-step
QA only to find I messed up in the process and had to start over the
following day.  It happens.

Once concise details are given, along with (highly preferable!) a
step-by-step way to reproduce the issue 100% of the time (including all
commands, all output seen, all physical actions taken, etc.), then the
kernel folks tend to get involved.
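
As a rough illustration of what "all commands, all output seen" could
look like in practice, here is a minimal sketch that logs each
diagnostic command mentioned in this thread together with its output.
The function name collect_report and the output path are hypothetical,
chosen only for this example; the commands themselves are the ones
already named above.  It is not a substitute for the step-by-step
reproduction notes, only a way to capture the command output
consistently.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): run each diagnostic command
# and log both the command line and its output, so the report shows
# exactly what was run and what came back.

collect_report() {
    out="$1"
    : > "$out"    # start with an empty report file
    for cmd in "zpool status" "zpool get all" "zfs get all" \
               "pciconf -lvbc" "dmesg"; do
        printf '=== %s ===\n' "$cmd" >> "$out"
        # Only run the command if its binary exists on this system.
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            # $cmd is intentionally unquoted so it splits into
            # command name plus arguments.
            $cmd >> "$out" 2>&1 || printf '(exit status %s)\n' "$?" >> "$out"
        else
            printf '(command not found on this system)\n' >> "$out"
        fi
    done
}
```

Running "collect_report /tmp/report.txt" once before and once after
each physical action (pulling a disk, for example) would give two files
that can be attached to a PR alongside the written steps.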

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |