RAID 1 / disk error / Offline uncorrectable sectors

Mon Jun 16 17:40:46 UTC 2008

Bill Moran wrote:
 > Zbigniew Szalbot wrote:
 > > [...]
 > > Jun 14 01:13:38 relay kernel: ad12: FAILURE - READ_DMA48 
 > > status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=374468863
 > [...]
 > 
 > Replace the hard drive.  Every modern hard drive keeps extra space available
 > to "remap" bad sectors.  This happens magically behind the scenes without
 > you ever knowing about it.  Once you've hit "uncorrectable" errors, it means
 > your re-mappable sectors are used up, and that means the drive is on its
 > last legs.

That's not completely true.

When a disk drive encounters a bad sector during a read
operation, it will remember the bad sector address, but
it is unable to transparently remap the sector because it
doesn't know the correct contents of the sector.  So it
has to report the unrecoverable error to the OS, even if
there's still plenty of space for remapping sectors.

Upon the next write operation to a sector marked as bad,
the drive will finally remap it and write the data to a
spare location.
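
To make the sequence above concrete, here is a toy model in
Python.  It is only a sketch of the behaviour described in this
mail, not how any real firmware works; the class ToyDrive and
all of its fields are made up for illustration.

```python
class ToyDrive:
    """Toy model: failed read marks a sector, next write remaps it."""

    def __init__(self, spare_sectors):
        self.spares_left = spare_sectors
        self.damaged = set()   # physically unreadable sectors
        self.pending = set()   # sectors the drive has noted as bad
        self.remapped = {}     # LBA -> data stored in a spare location
        self.data = {}         # LBA -> data at the original location

    def read(self, lba):
        if lba in self.remapped:
            return self.remapped[lba]
        if lba in self.damaged:
            # The contents are unknown, so the drive cannot remap
            # transparently.  It notes the address and reports the error.
            self.pending.add(lba)
            raise IOError(f"UNCORRECTABLE LBA={lba}")
        return self.data.get(lba, b"\0")

    def write(self, lba, payload):
        if lba in self.remapped:
            self.remapped[lba] = payload
        elif lba in self.pending:
            # The new contents are known now, so the sector can be remapped.
            if self.spares_left == 0:
                raise IOError("no spare sectors left")
            self.spares_left -= 1
            self.pending.discard(lba)
            self.damaged.discard(lba)
            self.remapped[lba] = payload
        else:
            self.data[lba] = payload


drive = ToyDrive(spare_sectors=4)
drive.damaged.add(374468863)          # LBA taken from the kernel message
try:
    drive.read(374468863)             # fails, sector becomes "pending"
except IOError as err:
    print(err)
drive.write(374468863, b"zeros")      # the write triggers the remap
print(drive.read(374468863), drive.spares_left)
```

Note that the read fails even though spare sectors are still
available; only the later write consumes one of them.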

Therefore, getting "uncorrectable errors" does *not* mean
that the drive has used up its spare sectors.  You only
need to overwrite the bad sectors (e.g. with dd(1)) so the
drive gets a chance to remap them.
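
As a rough sketch of what such an overwrite could look like,
the following Python snippet builds a dd(1) command line for
the LBA from the kernel message.  It only prints the command
and does not run it.  The assumptions here are mine: that the
reported LBA counts 512-byte sectors from the start of the
disk, and that the disk is reachable as /dev/ad12 (derived
from "ad12" in the log).  The helper name overwrite_command is
made up.  Be aware that running the resulting command destroys
whatever data was stored in that sector.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors


def overwrite_command(device, lba, sector_size=SECTOR_SIZE):
    """Return (byte offset, dd command) for zeroing one sector."""
    byte_offset = lba * sector_size
    # bs = one sector, seek = skip that many output blocks, count = 1 block
    cmd = (f"dd if=/dev/zero of={device} "
           f"bs={sector_size} count=1 seek={lba}")
    return byte_offset, cmd


offset, cmd = overwrite_command("/dev/ad12", 374468863)
print("byte offset:", offset)
print(cmd)
```

Check the device name and the sector size against your own
system before using anything like this.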

Of course, it might still be a good idea to replace the
drive anyway.  It depends on the cause of the bad sectors
(mechanical or electrical).

If you had a head crash (caused by mechanical impact or
a media manufacturing error or whatever), it is possible
that it caused debris within the drive which will cause
further bad blocks.  This can lead to a snowball effect
that can really exhaust all spare sectors quickly.

On the other hand, if the bad sectors were caused by
a voltage spike, a power failure or similar, chances are
that the drive is fine and you can continue to use it
after making sure that the bad sectors are remapped
(by overwriting them, see above).

Finally, there is also the possibility that the problem
is caused by a bug in the drive's firmware.  If that's
the case, I would be inclined to replace the drive with
a different brand.  However, I guess all drives have
bugs ...  the question is whether they affect you.
Another question is whether it's possible at all to
find out what caused the problem in the first place.

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

"What is this talk of 'release'?  We do not make software 'releases'.
Our software 'escapes', leaving a bloody trail of designers and quality
assurance people in its wake."