A failed drive causes system to hang
Jeremy Chadwick
jdc at koitsu.org
Sat Apr 13 15:41:32 UTC 2013
On Sat, Apr 13, 2013 at 04:31:06AM -0400, Quartz wrote:
> >If the ZFS layer
> >is waiting on CAM, and CAM is waiting on your hardware, then those I/O
> >requests are going to block indefinitely.
>
> >2. I agree that the problem is not likely in ZFS, but rather either with
> >CAM, the AHCI implementation used, or hardware (either disk or storage
> >controller).
>
> Question:
>
> How (or does) this relate to the hang that I'm seeing with my
> system?
It doesn't relate in any way, shape, or form.
This is what happens when end-users try to "correlate" their issues
with one another's without taking the time to fully read the thread and
follow along actively. This has now happened *twice* with this thread
(once from user Lawrence K. Chen, and now another from
radiomlodychbandytow at o2.pl).
This sort of behaviour has been common with FreeBSD, particularly with
regard to storage/filesystems/etc., for as long as I can remember.
I am not going to get into a discussion on how to solve such social
dilemmas; the procedure is to use send-pr and wait for someone
in-the-know to respond asking for relevant information. The FreeBSD
Handbook goes over how to file a PR and what to put in it.
http://www.freebsd.org/send-pr.html
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/problem-reports/article.html
> You mentioned cam issues when talking to me earlier, but
> less decisively than your comment here. What's the difference?
Your issue: "on my raidz2 pool, when I lose more than 2 disks, I/O to
the pool stalls indefinitely, but I can still use the system barring
ZFS-related things; I don't know how to get the system back into a
usable state from this situation". That's based on these two
statements:
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016822.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016847.html
radiomlodychbandytow at o2.pl's issue: "I'm seeing ATA-level errors from
one or more of my disks, can someone help?"
Lawrence K. Chen's issue: "I had a crash/issue and then the system hung
for a very long time at the mountroot phase".
Given the information known at this time, ALL THREE of these issues are
unrelated to one another.
As I've said elsewhere: it is very important that every reported issue
be handled individually/separately. I was given this advice by a
FreeBSD kernel developer some years ago and it's excellent. It might
seem logical to try to correlate such things, but a lot of the time
the correlation turns out to be wrong and wastes everyone's time. So
Just Don't Do It(tm).
> >We're also
> >going to need to see "zpool status" output, as well as "zpool get all"
> >and "zfs get all". "pciconf -lvbc" would also be useful.
>
> You never asked for these when talking to me, but I can provide any
> of it if you want to look at it.
At this point in the conversation, WRT your issue, there's no indication
that it would help, but you've already given dmesg output:
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016840.html
Otherwise, all you've provided so far is a general explanation. You have
still not provided concise step-by-step information as I've asked.
I've gone so far as to give you an example of what to provide:
http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016814.html
I will again point to the 2nd-to-last paragraph of my above-referenced
mail.
Another example of troubleshooting and how to do it: here's the effort I
went through over the course of some months to track down a bug in CAM:
http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016324.html
READ: I'm not saying your issue is with CAM (it may be, but it may not
be -- there isn't enough information right now to determine that). I'm
giving you an example of the troubleshooting/debugging effort that has
to go into things for issues of this nature. You can even see from my
quoted material in that link that I spent many hours doing step-by-step
QA only to find I messed up in the process and had to start over the
following day. It happens.
Once concise details are given, along with (highly preferable!) a
step-by-step way to reproduce the issue 100% of the time (including all
commands, all output seen, all physical actions taken, etc.), the
kernel folks tend to get involved.
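As a rough illustration of "all commands, all output seen", here is a
minimal sh sketch that records each diagnostic command mentioned in this
thread next to its output in a single report file. The report path and
the capture helper are hypothetical names chosen for this example, not
part of any FreeBSD tool, and some ZFS versions may want a pool name
appended to "zpool get all".

```shell
#!/bin/sh
# Sketch only: OUT and capture() are made-up names for illustration.
OUT="${OUT:-/tmp/storage-report.txt}"

capture() {
    # Record the exact command line first, then its stdout and stderr.
    printf '### %s\n' "$*" >> "$OUT"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$OUT" 2>&1
    else
        # Note a missing tool instead of silently skipping it.
        printf '(command not found: %s)\n' "$1" >> "$OUT"
    fi
    printf '\n' >> "$OUT"
}

# Start with an empty report, then run the commands named in the thread.
: > "$OUT"
capture dmesg
capture zpool status
capture zpool get all
capture zfs get all
capture pciconf -lvbc
```

Running it once before and once after each physical action (for
example, pulling a disk), with a different OUT each time, gives a
command-by-command record that can be attached to a PR. If I/O to the
pool is already stalled, the ZFS commands themselves may block, which
is worth noting in the report too.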
--
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |