fsdb&smartctl&/var/log/messages

Tue Jul 13 03:10:24 UTC 2010

(Re-adding the mailing list to the CC list)

On Tue, Jul 13, 2010 at 05:15:32AM +0400, Dmitry Lunts wrote:
> OK. See below. The output is too long, so General SMART values are skipped.
>
> [...] 
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -       0
> 187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -       1297
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -       3
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -       3
> [...] 
> ATA Error Count: 1287 (device log contains only the most recent five errors)
> [...] 

And here lies your problem.

You have 3 LBAs on your drive which experienced errors during their
lifetime and couldn't be automatically corrected.  They're labelled as
"pending" until a write to those LBAs is attempted (and there's no
guarantee that will work either; more on that later).

Attribute 187 is one I haven't seen before (I don't use Seagate drives),
but it indicates the number of read or write transactions to the disk
itself which *could not* be auto-corrected with hardware ECC.  It's a
counter, so it's very possible continuous access to the bad LBAs could
be responsible for the counter being so high.

Now what's interesting is that your SMART self-test log indicates you
actually have 4 bad LBAs: 4007996, 102121619, 110518042, and 195230321:

> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed: read failure       90%      7391         4007996
> # 2  Extended offline    Completed: read failure       90%      7376         195230321
> # 3  Extended offline    Completed: read failure       90%      7369         4007996
> # 4  Extended offline    Completed: read failure       90%      7346         4007996
> # 6  Extended offline    Completed: read failure       90%      7329         110518042
> # 7  Selective offline   Completed: read failure       90%      7302         102121619
> # 8  Extended offline    Completed: read failure       90%      7301         102121619
> # 9  Extended offline    Completed: read failure       90%      7297         102121619
> #10  Selective offline   Completed: read failure       90%      6817         195230321
> #11  Selective offline   Completed: read failure       90%      6817         195230321
> #12  Extended offline    Completed: read failure       50%      6817         195230321
> #15  Extended offline    Completed: read failure       50%      5035         195230321
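
If it helps, here is a rough sketch for pulling the distinct LBAs out of
that log rather than reading them off by eye.  The function name
unique_bad_lbas is made up for illustration, and it assumes the log
layout shown above (entries start with "#", LBA in the last column).

```shell
# Print each distinct LBA_of_first_error from a smartctl self-test log
# read on stdin, one per line, in ascending numeric order.
unique_bad_lbas() {
    # Log entries start with "#" and carry the LBA in the last field.
    # Entries without an error show "-" there, so keep numeric values only.
    awk '/^#/ && $NF ~ /^[0-9]+$/ { print $NF }' | sort -n -u
}
```

Something like "smartctl -l selftest /dev/ad6 | unique_bad_lbas" should
then list each failing LBA once.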

First things first: I hope you have backups.  I realise you're trying to
work out what files got damaged, but the easiest way to do that is to
attempt to read the files -- try using rsync or cpdup on all the
filesystems (write the data to /dev/null) and look for I/O errors.
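
A minimal sketch of that idea using only find and cat, in case rsync or
cpdup aren't handy.  The function name find_unreadable_files is my own,
and it assumes a failed read makes cat exit non-zero, which is what I'd
expect on an I/O error.

```shell
# Read every regular file under the given directory, discarding the data,
# and print the path of each file that could not be read completely.
# -xdev keeps find from crossing into other mounted filesystems.
find_unreadable_files() {
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            cat -- "$f" >/dev/null 2>&1 || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

Run it once per filesystem, e.g. "find_unreadable_files /usr", and
cross-check anything it prints against /var/log/messages.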

At this point my recommendation to you is simple: replace/RMA the disk.
Really.  You have I/O errors across three completely non-sequential
areas of the disk (maybe dust?).  If you don't replace the drive, you're
going to end up dealing with this again in the future.  I hope you've
been doing backups.  :-)

You can (and should) also run Seagate's SeaTools for DOS utility on the
drive -- do an extended/long/thorough test (which will test all the
sectors).  This is a vendor-specific test which often does things at a
much lower level than even SMART.  I'm willing to bet the test fails, or
at least will give you an indication of what you already know.  It may
also let you remap the LBAs (I know WD's utility can do this).

That said, here be dragons.  I'm not responsible for what happens after
you try this, and I haven't done this in a very VERY long time.

Have you tried writing zeros over the LBAs where the bad blocks are
located?  This often will get the drive to attempt a remap.  E.g.:

dd if=/dev/zero of=/dev/ad6 bs=512 count=1 seek={whatever}
sync

Be sure to note the of= parameter there refers to the entire drive and
not a slice.
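
To avoid typos in the seek values, a small sketch that only prints the
dd commands so you can review them before running anything.  The name
print_zero_cmds is hypothetical, and it assumes 512-byte logical sectors
so that seek (counted in units of bs) equals the LBA.

```shell
# Print (do not run) one dd command per LBA, each overwriting a single
# 512-byte sector on the whole-disk device given as the first argument.
print_zero_cmds() {
    dev=$1
    shift
    for lba in "$@"; do
        printf 'dd if=/dev/zero of=%s bs=512 count=1 seek=%s\n' "$dev" "$lba"
    done
}
```

For the LBAs in your self-test log that would be
"print_zero_cmds /dev/ad6 4007996 102121619 110518042 195230321".
Each write destroys whatever was in that sector, so check the list twice.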

If it does work, both Attributes 197 and 198 should drop to 0.  Be sure
to run "smartctl -t offline /dev/ad6" too, since some Offline attributes
don't always get updated.
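
A quick way to watch just those two attributes, sketched under the
assumption that the attribute table has the ten-column layout quoted
above (RAW_VALUE in column 10).  The name pending_counts is made up.

```shell
# From a smartctl attribute table on stdin, print the ID and raw value
# of Attribute 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable).
pending_counts() {
    awk '$1 == 197 || $1 == 198 { print $1, $10 }'
}
```

Usage would be "smartctl -A /dev/ad6 | pending_counts", before and after
the writes, to see whether the raw values actually changed.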

Also, your calculation formula earlier contains "-63" which I believe is
due to the offset of the slices.  Except in your bsdlabel output, the
"c" partition actually starts at 0, not 63.  Are you sure this formula
is correct?
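
To make the question concrete, a tiny sketch of the subtraction as I
understand it.  This is an assumption on my part about what the formula
does (disk LBA minus the starting sector of whatever holds the
filesystem), and lba_to_relative is a made-up name.

```shell
# Convert an absolute disk LBA to a sector number relative to a given
# starting offset (both counted in 512-byte sectors).
lba_to_relative() {
    echo $(( $1 - $2 ))
}
```

With an offset of 63, "lba_to_relative 4007996 63" gives 4007933; with
an offset of 0 it stays 4007996.  Those are different sectors, so which
offset applies decides which file you end up blaming.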

Let me know what becomes of all this, I'm highly interested.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |