disk "flipped" - a known problem?

Tue Jan 22 00:40:32 UTC 2013

On Mon, Jan 21, 2013 at 10:43:55PM -0000, Steven Hartland wrote:
> ----- Original Message ----- From: "Jeremy Chadwick"
> <jdc at koitsu.org>
> To: <freebsd-fs at freebsd.org>
> Cc: <mav at freebsd.org>; <avg at freebsd.org>
> Sent: Monday, January 21, 2013 10:16 PM
> Subject: Re: disk "flipped" - a known problem?
> 
> 
> >(Please keep me CC'd as I am not subscribed)
> >
> >WRT this:
> >
> >http://lists.freebsd.org/pipermail/freebsd-fs/2013-January/016197.html
> >
> >I can reproduce the first problem 100% of the time on my home system
> >here.  I can provide hardware specs if needed, but the important part is
> >that I'm using RELENG_9 / r245697, the controller is an ICH9R in AHCI
> >mode (and does not share an IRQ), hot-swap bays are in use, and I'm
> >using ahci.ko.
> >
> >I also want to make this clear to Andriy: I'm not saying "there's a
> >problem with your disk".  In my case, I KNOW there's a problem with the
> >disk (that's the entire point to my tests! :-) ).
> >
> >In my case the disk is a WD Raptor (150GB, circa 2006) that has a very
> >badly-designed firmware that goes completely catatonic when encountering
> >certain sector-level conditions.  That's not the problem though -- the
> >problem is with FreeBSD apparently getting confused as to the internal
> >state of its devices after a device falls off the bus and comes back.
> >Explanation:
> >
> >1. System powered off; disk is attached; system powered on, shows up as
> >ada5.  Can communicate with device in every way (the way I tend to test
> >simple I/O is to use "smartctl -a /dev/ada5").  This disk has no
> >filesystems or other "stuff" on it -- it's just a raw disk, so I believe
> >the g_wither_washer oddity does not apply in this situation.
> >
> >2. "dd if=/dev/zero of=/dev/ada5 bs=64k"
> >
> >3. Drive hits a bad sector which it cannot remap/deal with.  Drive
> >firmware design flaw results in drive becoming 100% stuck trying to
> >re-read the sector and work out internal decisions to do remapping or
> >not.  Drive audibly clicking during this time (not actuator arm being
> >reset to track 0 noise; some other mechanical issue).  Due to firmware
> >issue, drive remains in this state indefinitely.
> >
> >4. FreeBSD CAM reports repeated WRITE_FPDMA_QUEUED (i.e. writes using NCQ)
> >errors every 30 seconds (kern.cam.ada.default_timeout), for a total of 5
> >times (kern.cam.ada.retry_count+1).
> >
> >5. FreeBSD spits out messages similar to the ones you see: retries
> >exhausted, cam_periph_alloc error, and devfs claims device removal.
> >
> >6. Drive is still catatonic of course.  Only way to reset the drive is
> >to power-cycle it.  Drive removed from hot-swap bay, let sit for 20
> >seconds, then is reinserted.
> >
> >7. FreeBSD sees the disk reappear, shows up much like it did during #1,
> >except...
> >
> >8. "smartctl -a /dev/ada5" claims no such device or unknown device type
> >(I forget which).  "ls -l /dev/ada5" shows an entry.  "camcontrol
> >devlist" shows the disk on the bus, yet I/O does not work.  If I
> >remember right, re-attempting the dd command returns some error (I
> >forget which).
> >
> >9. "camcontrol rescan all" stalls for quite some time when trying to
> >communicate with entry 5, but eventually does return (I think with some
> >error).  "camcontrol reset all" works without a hitch.  "camcontrol
> >devlist" during this time shows the same disk on ada5 (which to me means
> >ATA IDENTIFY, i.e. vendor strings, etc. are reobtained somehow, meaning
> >I/O works at some level).
> >
> >10. System otherwise works fine, but the only way to bring back
> >usability of ada5 is to reboot ("shutdown -r now").
> >
> >To me, this looks like FreeBSD at some layer within the kernel (or some
> >driver (I don't know which)) is internally confused about the true state
> >of things.
> >
> >Alexander, do you have any ideas?
> >
> >I can enable CAM debugging (I do use options CAMDEBUG so I can toggle
> >this with camcontrol) as well as take notes and do a full step-by-step
> >diagnosis (along with relevant kernel output seen during each phase) if
> >that would help you.  And I can test patches but not against -CURRENT
> >(will be a cold day in hell before I run that, sorry).
> >
> >Let me know, time permitting.  :-)
> 
> Do you have a controller which is not ATA-based that you can test this
> on, e.g. mps?  This may help identify whether the issue is ATA-specific
> or more generic.

I do not.  Well, that's not entirely true -- I have an Adaptec 2410SA
lying around here somewhere, but xxx(4), as I understand it, has been
neglected for quite some time and I stopped using that controller a few
months after I got it simply because it sucked.  :P  It's SiI-3112-based
with a bunch of hullabaloo on it.

The best I could do is try to pick up an inexpensive 3124-based siis(4)
controller and try that.

I understand your logic here -- you're trying to narrow down if the
issue is within CAM(4) or not.

The above used to work.  That is to say, I could literally yank a disk
out of my hot-swap bay, insert a new one, and FreeBSD would do the right
thing.  Possibly the issue is with the same disk being re-inserted?  Not
sure.

> ... If you have the messages log for above scenario that also might
> help to track down the problem.

I can do that but will need some time (not a lot, just dedicated linear
time.  :-) ).

I also need to know what kind of output folks are wanting -- I know you
want kernel output, as well as whatever physical action was last taken,
but output from userland (e.g. "camcontrol devlist") seems relevant too,
so I need some advice on what would be useful.
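
One way to keep userland output and the physical action together in a
single file is sketched below.  This is only an illustration under
stated assumptions, not something taken from the thread: the snapshot
function name and the LOG path are hypothetical, and the commands passed
to it would simply be the ones already mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (name and log path are assumptions for illustration).
# Appends a UTC-timestamped, labelled record of one command's output,
# plus its exit status, to a single log file.
LOG="${LOG:-/tmp/disk-repro.log}"

snapshot() {
    label="$1"; shift
    {
        # Header: when it ran, what physical step preceded it, the command.
        printf '=== %s | %s | %s\n' \
            "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$label" "$*"
        # Run the command, capturing stderr too; remember its exit status.
        rc=0
        "$@" 2>&1 || rc=$?
        printf '=== exit status: %s\n' "$rc"
    } >> "$LOG"
}
```

Each phase of the test could then be recorded with the physical action
as the label, for example:

    snapshot "step 1: after boot" camcontrol devlist
    snapshot "step 8: after reinsert" smartctl -a /dev/ada5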

"camcontrol debug" might be helpful but I'd need to know what printfs
are wanted (see man page).  camcontrol(8) implies that simply doing
"camcontrol debug -x -y -z ahcich5" would be sufficient to see
everything within ahci(4) on port 5 as well as ada5.  Yes/no?

Want to make this clear too: the issue I see is not specific to just
ada5 on my system, i.e. it is not a problem with the physical port or
somesuch.  It's reproducible regardless of port number -- it just so
happens that I actively use ports 0-4 on my system and leave port 5
available for drive testing/forensics.

> It does, as you say, sound like something isn't being cleaned up properly
> which might be confirmed by adding a printf just inside cam_periph_alloc's
>    if ((periph = cam_periph_find(path, name)) != NULL) {

I'd rather wait on this; I never feel comfortable poking about in kernel
innards, especially something like CAM.  Remember: the system I'm
testing with is actually used, so I don't want to risk impacting other
("good") CAM transactions with a change.  I just don't have the
knowledge or familiarity with those pieces to be poking around in there
with comfort.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |