ATA failure with 4.6.2 & 250GB drive?

Tue Oct 14 09:02:19 PDT 2003

> Date: Tue, 14 Oct 2003 09:55:54 +0100
> From: Scott Mitchell <scott+freebsd at fishballoon.org>
> Sender: owner-freebsd-stable at freebsd.org
> 
> On Mon, Oct 13, 2003 at 10:09:10AM +0100, Scott Mitchell wrote:
> > Hi all,
> > 
> > Just installed a Maxtor 250GB PATA drive in one of our servers, to be used
> > as a backup staging area.  This was actually a replacement for an identical
> > drive that appeared to have died after a month of service.
> > 
> > Anyway, 2 days after this drive was installed I start seeing this in the
> > daily logs:
> > 
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) trying PIO mode
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
> > ...
> 
> OK, swapped out the cable (from an 80- to 40-wire one, as it happened,
> although that should make no difference on a UDMA33 controller).  Same
> errors appeared again while the backups were running.
> 
> Some more information on how this drive is being used - we're dumping two
> vinum RAID5 volumes onto it, one local and one remote, writing to the
> backup disk over NFS.  Both dumps kick off at 0300, with the remote one
> finishing at 0305 last night.  The first ATA error appeared in the logs at
> 0325, while the local backup was still running.  The last error was logged
> at 0355, but the backup itself didn't finish until nearly 0500.
> 
> Anyone have any more ideas on how to diagnose this?  It does occur to me
> that the daily periodic run also kicks off at 0301 but that is usually all
> done before 0330.

It's a real drive problem, but possibly not a terminal one. (I had the
same issue on one of my drives a few months ago and it's fine now.)

The system is hitting a group of bad blocks that the drive has not
yet remapped. Every read of that area fails with an unrecoverable
error, and that is what you see reported. Since the drive can't read
"good" data from those blocks, it does not relocate them; it just
leaves them in place and reports an error every time something tries
to read them.
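
For what it's worth, the status=59 error=40 pair in those log lines
can be decoded by hand. The sketch below assumes the kernel prints
both registers in hex and uses the standard ATA bit names; the helper
name decode_register is made up for illustration.

```python
# Bit names for the ATA status and error registers, highest bit first.
# Assumption: the log prints both values in hexadecimal.
STATUS_BITS = {
    0x80: "BSY",    # device busy
    0x40: "DRDY",   # device ready
    0x20: "DF",     # device fault
    0x10: "DSC",    # seek complete
    0x08: "DRQ",    # data request
    0x04: "CORR",   # corrected data
    0x02: "IDX",    # index
    0x01: "ERR",    # error; details are in the error register
}

ERROR_BITS = {
    0x80: "ICRC/BBK",  # interface CRC error / bad block mark
    0x40: "UNC",       # uncorrectable data error
    0x20: "MC",        # media changed
    0x10: "IDNF",      # sector ID not found
    0x08: "MCR",       # media change requested
    0x04: "ABRT",      # command aborted
    0x02: "NM/TK0NF",  # no media / track 0 not found
    0x01: "AMNF",      # address mark not found
}


def decode_register(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


if __name__ == "__main__":
    status = int("59", 16)  # from "status=59"
    error = int("40", 16)   # from "error=40"
    print("status:", decode_register(status, STATUS_BITS))
    print("error: ", decode_register(error, ERROR_BITS))
```

With those values the status register has ERR set and the error
register has only UNC set, which is consistent with the drive failing
to read the sector rather than with a cabling or CRC problem.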

First, any files containing data stored in these blocks are probably
toast. Or, at least garbled. Sorry.

The fix/workaround is to move the file(s) involved so that the damaged
blocks are freed; the drive can then relocate them to spare space the
next time they are written. You can try to figure out just which
file(s) use those blocks. There might even be a reasonable way to do
this...I just don't know what it is.
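
One crude way to narrow it down, offered only as a sketch: read every
file on the affected filesystem from start to finish and note which
ones fail with an I/O error. This does not map block numbers to
files; it just exercises each file and records failures. The function
name find_unreadable_files and the /backup mount point in the usage
note are hypothetical.

```python
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Return a list of (path, reason) for regular files under root
    that cannot be read in full."""
    damaged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only plain files hold data blocks of their own; skip
            # symlinks, devices, sockets and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Read and discard the data; we only care whether
                    # the read succeeds.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                damaged.append((path, exc.strerror))
    return damaged


if __name__ == "__main__":
    import sys

    for path, reason in find_unreadable_files(sys.argv[1]):
        print(f"{path}: {reason}")
```

Run it with the mount point of the suspect filesystem as its only
argument, e.g. /backup. Note that os.walk will descend into other
filesystems mounted below that point, and files unreadable for other
reasons (such as permissions) will show up in the list too.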

Another "fix" is to simply copy the drive onto another and then copy it
back. dd(1) will do the trick, as will dump/restore. (I'd suggest
dump/restore to copy the data out and dd to copy it back if the disks
have identical geometries.) Once the data is restored to the original
disk, the bad blocks will have been remapped by the drive and will
no longer trouble you.

Modern disks are pretty smart at error recovery, but some failures are
too sudden for the drive to handle without losing data.
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman at es.net			Phone: +1 510 486-8634