7.2-RELEASE-p4, IO errors & RAID1 failure

Fri Jun 18 17:42:10 UTC 2010

On Fri, Jun 18, 2010 at 04:47:11PM +0100, Matthew Lear wrote:
> Hello Jeremy,
> Thanks very much for the feedback.
> 
> [snip]
> > Could you please provide the full output from "smartctl -a /dev/ad0"
> > here?  Your drive may be completely fine and you may not have to swap it
> > at all; hard to say.
> 
> Sure. See below:
> {snip}

Your SMART statistics look completely OK.  There's nothing there that
indicates there were any write failures or otherwise.  I'll explain near
the end of the Email how to test a range of LBAs "just in case".

I'll take a moment to point out that the error previously seen was a
timeout during a write transaction (WRITE_DMA48).  Recap:

> > > ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
> > > ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
> > > ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode

The status codes shown (status=51 and error=10) are hexadecimal.  I'm
pointing this out because they aren't preceded by '0x' or '$' and it
clarifies my next point:

NID_NOT_FOUND (bit 4 set in the ATA error field) is referred to as IDNF
per ATA6-ACS specification and onward, so I'll refer to it as that.
(I've always wondered why FreeBSD calls this NID_NOT_FOUND; IDNF stands
for ID Not Found, so what's with the extra "N"?  I've always felt this
is a typo...)
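
To make the hexadecimal point concrete, here's a rough sketch that
decodes the two fields into bit names.  The names READY, DSC, ERROR and
NID_NOT_FOUND come from the kernel messages quoted above; every other
name in the tables is my assumption based on the classic ATA register
layout and may not match a given spec revision or driver.  The helper
names (decode, parse_hex_field) are made up for illustration.

```python
# Bit names: READY, DSC, ERROR and NID_NOT_FOUND are taken from the
# kernel messages quoted above.  All other names are assumptions based
# on the classic ATA register layout and may differ between revisions.
STATUS_BITS = {
    7: "BUSY", 6: "READY", 5: "DEVICE_FAULT", 4: "DSC",
    3: "DRQ", 2: "CORRECTABLE", 1: "INDEX", 0: "ERROR",
}
ERROR_BITS = {
    7: "ICRC", 6: "UNCORRECTABLE", 5: "MEDIA_CHANGED", 4: "NID_NOT_FOUND",
    3: "MEDIA_CHANGE_REQUEST", 2: "ABORTED", 1: "NO_MEDIA", 0: "ILLEGAL_LENGTH",
}


def parse_hex_field(text):
    """Parse a field such as '51' or '10' as hexadecimal (no 0x prefix)."""
    return int(text, 16)


def decode(value, names):
    """Return the names of the bits set in an 8-bit value, highest bit first."""
    return [names[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


print(decode(parse_hex_field("51"), STATUS_BITS))
print(decode(parse_hex_field("10"), ERROR_BITS))
```

Reading error=10 as decimal instead would set bits 3 and 1 rather than
bit 4, which is why the base matters here.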

Using the ATA8-ACS specification working draft (2007/05/21), since it's
more recent, we see the following:

  Section 6.2 - Error field
  Section 6.2.4 - ID Not Found (IDNF) bit

  Error bit 4. The IDNF bit shall be set to one if a user-accessible
  address was not found. The IDNF bit shall be set to one if an
  address outside of the range of user-accessible addresses is
  requested when command aborted is not returned (see 4.11.3 and
  6.2.1).

  Section 4.11 - Host Protected Area (HPA) feature set
  Section 4.11.3 - 28-bit and 48-bit HPA commands

  Any read or write command to an address above the maximum address
  specified by the SET MAX ADDRESS or SET MAX ADDRESS EXT command shall
  cause command completion with the IDNF bit set to one and ERR set to
  one, or command aborted.

There's no definition of what "address" means in 6.2.4, but the most
logical (pun intended) guess is an LBA.  This error is returned by the
disk (i.e. not a controller-induced error).  I've mentioned this problem
in the past:

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

I've always read IDNF to mean "OS requested access (read or write) to an
LBA which is out of bounds", where "out of bounds" means "not between 0
and <last LBA>".  How exactly is that possible?  Alexander, do you have
any familiarity with this error code per ATA spec?

Matthew, can you provide output from "atacontrol cap ad0"?  Thanks.
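
Once the drive's total LBA count is known, the "out of bounds" reading
of IDNF can be checked with simple arithmetic.  A minimal sketch,
assuming 512-byte logical sectors; the total sector count used below is
a placeholder and is NOT taken from Matthew's drive, and the function
names are hypothetical:

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def lba_in_range(lba, total_sectors):
    """True if lba lies between 0 and the last user-accessible LBA."""
    return 0 <= lba < total_sectors


def lba_to_byte_offset(lba, sector_bytes=SECTOR_BYTES):
    """Byte offset on the disk at which the given LBA starts."""
    return lba * sector_bytes


suspect_lba = 395032335          # LBA from the kernel message
placeholder_total = 488397168    # hypothetical sector count, not from the real drive

print(lba_in_range(suspect_lba, placeholder_total))
print(lba_to_byte_offset(suspect_lba))
```

If the reported LBA turns out to be below the drive's real sector
count, then the request was in bounds and a plain "address too high"
explanation for the IDNF doesn't hold.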

Now regarding the LBA tests -- "smartctl -t select,start-end" will do
the trick.  start should be a starting LBA, end should be an ending LBA.
The OS claims that LBA 395032335 is what was requested to be accessed
when the failure happened, so I would recommend picking start/end ranges
around that area.  Remember that the reported LBA is a single sector
on a disk holding hundreds of millions of them, so it's wise to pick a
very large range of LBAs.  I would recommend this in your case:

smartctl -t select,390000000-410000000 /dev/ad0

I would highly recommend doing this with the disk not doing any I/O,
though it won't hurt it (it'll just delay the scan).  "smartctl -a" will
show the state of things in the "SMART Selective self-test log" at the
bottom, or somewhere else within the output (depends on the drive).
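
If you'd rather compute the range than eyeball it, here's a small
sketch that builds the smartctl arguments for a window centred on a
suspect LBA, using the "select,start-end" form described above.  The
function name, the margin parameter and the optional last_lba clamp are
my own inventions for illustration; the script only prints the command
and doesn't run it.

```python
def selective_test_args(suspect_lba, margin, last_lba=None, device="/dev/ad0"):
    """Build smartctl arguments for a selective self-test around suspect_lba.

    margin is the number of LBAs to cover on each side.  The window is
    clamped at LBA 0 and, if last_lba is given, at the end of the disk.
    """
    start = max(0, suspect_lba - margin)
    end = suspect_lba + margin
    if last_lba is not None:
        end = min(end, last_lba)
    return ["smartctl", "-t", f"select,{start}-{end}", device]


# 10 million LBAs either side of the LBA from the kernel message.
print(" ".join(selective_test_args(395032335, 10_000_000)))
```

Centring the window on the reported LBA is just one reasonable choice;
rounder start/end values work equally well.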

This should, in my opinion, rule out whether or not there's a bad block
or something along those lines within said range.  Given what I believe
IDNF represents, I would say your scan will probably come back clean.
Also remember that the scan performed here is a *disk-level scan*; the
disk firmware itself is doing it (the OS isn't involved).  This helps
rule out any sort of "weird" issues that the OS may be reporting ("hey
man, LBA 8943943983492893428932489324 is bad!"  "Yeah sure it is").

> The two devices in the array are on channels 0 and 1. There is indeed a
> second drive on channel 0 (160G). As I said above, I use that as an
> additional back up device but it's not part of the array.

Okay, so executing "atacontrol detach ata0" will cause you to lose both
ad0 and ad1.  If you can live with that, then cool.

> > What motherboard is this?  Can you change the setting to either
> > "Native", "Enhanced", or (even better) "AHCI"?  I've seen some systems
> > where the Serial ATA option in the BIOS has an "Auto" option, which does
> > totally bizarre things at times.
> 
> I think this has been covered in subsequent postings. I could try it but
> as you say below, I'd like to resolve the disk issue first.
> ...
> > The atacontrol man page covers your situation:
> > ...
> I don't think this is the case for me since ad0 and ad2 are on seperate
> ata channels.
> ...
> Indeed but my hw doesn't have hot-swap capability (at the moment!).

That's the problem -- we're not sure if this really is a disk issue.
It's been reported before, and others have reported solving it by
increasing ATA timeout values, etc...  But the fact of the matter is,
that error code is being returned by the device.

Speaking generally about disk replacements on your system -- when I say
generally, I do mean generally and *not* in regards to the specific
situation reported:

Since there's no AHCI in use, we should just assume that a power-down of
the system is the safest way to go about a disk replacement.  Follow
that procedure in the future and you should be fine.  If you ever get a
hot-swap backplane, you absolutely should use AHCI; hot-swap, especially
on an Intel controller (FreeBSD is tested pretty thoroughly on Intel
ICHxx and ESBx controllers), will work fine in that case.

If you do go the AHCI route, and eventually upgrade to RELENG_8 down the
line, I highly recommend you load kernel module ahci.ko (instead of the
default/historic ataahci.ko).  This will get you NCQ support amongst
other things.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |