bad sector in gmirror HDD

Sat Aug 20 17:34:45 UTC 2011

On Aug 19, 2011, at 11:24 PM, Jeremy Chadwick wrote:

> On Fri, Aug 19, 2011 at 09:39:17PM -0400, Dan Langille wrote:
>> 
>> On Aug 19, 2011, at 7:21 PM, Jeremy Chadwick wrote:
>> 
>>> On Fri, Aug 19, 2011 at 04:50:01PM -0400, Dan Langille wrote:
>>>> System in question: FreeBSD 8.2-STABLE #3: Thu Mar  3 04:52:04 GMT 2011
>>>> 
>>>> After a recent power failure, I'm seeing this in my logs:
>>>> 
>>>> Aug 19 20:36:34 bast smartd[1575]: Device: /dev/ad2, 2 Currently unreadable (pending) sectors
>>> 
>>> I doubt this is related to a power failure.
>>> 
>>>> Searching on that error message, I was led to believe that identifying the bad sector and
>>>> running dd to read it would cause the HDD to reallocate that bad block.
>>>> 
>>>> http://smartmontools.sourceforge.net/badblockhowto.html
>>> 
>>> This is incorrect (meaning you've misunderstood what's written there).
>>> 
>>> Unreadable LBAs can be a result of the LBA being actually bad (as in
>>> uncorrectable), or the LBA being marked "suspect".  In either case the
>>> LBA will return an I/O error when read.
>>> 
>>> If the LBAs are marked "suspect", the drive will perform re-analysis of
>>> the LBA (to determine if the LBA can be read and the data re-mapped, or
>>> if it cannot then the LBA is marked uncorrectable) when you **write** to
>>> the LBA.
>>> 
>>> The above smartd output doesn't tell me much.  Providing actual SMART
>>> attribute data (smartctl -a) for the drive would help.  The brand of the
>>> drive, the firmware version, and the model all matter -- every drive
>>> behaves a little differently.
>> 
>> Information such as this?  http://beta.freebsddiary.org/smart-fixing-bad-sector.php
> 
> Yes, perfect.  Thank you.  First things first: upgrade smartmontools to
> 5.41.  Your attributes will be the same after you do this (the drive is
> already in smartmontools' internal drive DB), but I often have to remind
> people that they really need to keep smartmontools updated as often as
> possible.  The changes between versions are vast; this is especially
> important for people with SSDs (I'm responsible for submitting some
> recent improvements for Intel 320 and 510 SSDs).

Done.

> Anyway, the drive (albeit an old PATA Maxtor) appears to have three
> anomalies:
> 
> 1) One confirmed reallocated LBA (SMART attribute 5)
> 
> 2) One "suspect" LBA (SMART attribute 197)
> 
> 3) A very high temperature of 51C (SMART attribute 194).  If this drive
> is in an enclosure or in a system with no fans this would be
> understandable, otherwise this is a bit high.  My home workstation which
> has only one case fan has a drive with more platters than your Maxtor,
> and it idles at ~38C.  Possibly this drive has been undergoing constant
> I/O recently (which does greatly increase drive temperature)?  Not sure.
> I'm not going to focus too much on this one.

This is an older system.  I suspect insufficient ventilation.  I'll look at getting
a new case fan, if not some HDD fans.

> The SMART error log also indicates an LBA failure at the 26000 hour mark
> (which is 16 hours prior to when you did smartctl -a /dev/ad2).  Whether
> that LBA is the remapped one or the suspect one is unknown.  The LBA was
> 5566440.
> 
> The SMART tests you did didn't really amount to anything; no surprise.
> short and long tests usually do not test the surface of the disk.  There
> are some drives which do it on a long test, but as I said before,
> everything varies from drive to drive.
> 
> Furthermore, on this model of drive, you cannot do surface scans via
> SMART.  Bummer.  That's indicated in the "Offline data collection
> capabilities" section at the top, where it reads:
> 
> 	No Selective Self-test supported.
> 
> So you'll have to use the dd method.  This takes longer than if surface
> scanning was supported by the drive, but is acceptable.  I'll get to how
> to go about that in a moment.

FWIW, I've done a dd read of the entire suspect disk already.  Just two errors.
From the URL mentioned above:

[root@bast:~] # dd of=/dev/null if=/dev/ad2 bs=1m conv=noerror
dd: /dev/ad2: Input/output error
2717+0 records in
2717+0 records out
2848980992 bytes transferred in 127.128503 secs (22410246 bytes/sec)
dd: /dev/ad2: Input/output error
38170+1 records in
38170+1 records out
40025063424 bytes transferred in 1544.671423 secs (25911701 bytes/sec)
[root@bast:~] #

That seems to indicate two problems.  Are those the values I should be using 
with dd?
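
As a rough cross-check on those numbers, the record count dd printed at the first error can be turned into an LBA range. This is only a sketch: it assumes bs=1m means 1048576 bytes and that the drive uses 512-byte sectors, and the function name lba_range is my own.

```shell
#!/bin/sh
# Convert a dd record count (full records read before the error) into the
# range of 512-byte LBAs covered by the block that failed.
# Assumptions: bs=1m is 1048576 bytes; the drive uses 512-byte sectors.
lba_range() {
    records=$1
    sectors_per_block=$((1048576 / 512))
    first=$((records * sectors_per_block))
    last=$((first + sectors_per_block - 1))
    echo "$first $last"
}

# First error above: "2717+0 records in"
lba_range 2717
```

For the first error this prints 5564416 5566463, a range that contains the LBA 5566440 from the SMART error log.  I have not applied it to the second error, since with conv=noerror the record counts after a failed read may no longer line up with the disk offset.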

I did some more precise testing:

# time dd of=/dev/null if=/dev/ad2 bs=512 iseek=5566440
dd: /dev/ad2: Input/output error
9+0 records in
9+0 records out
4608 bytes transferred in 5.368668 secs (858 bytes/sec)

real	0m5.429s
user	0m0.000s
sys	0m0.010s

NOTE: that's 9 blocks later than the LBA reported by smartctl

The above generated this in /var/log/messages:

Aug 20 17:29:25 bast kernel: ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449
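
The 9-block difference can be checked mechanically: the iseek offset plus the number of 512-byte records dd read cleanly should equal the LBA the kernel complained about.  A small sketch, assuming the log line keeps the "LBA=<number>" form shown above (the helper name kernel_lba is mine):

```shell
#!/bin/sh
# Extract the number following "LBA=" from a kernel message.
kernel_lba() {
    printf '%s\n' "$1" | sed -n 's/.*LBA=\([0-9][0-9]*\).*/\1/p'
}

line='ad2: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=5566449'
start=5566440      # iseek value passed to dd
good_records=9     # "9+0 records in" before the error

failed=$(kernel_lba "$line")
echo "kernel LBA: $failed, start + records: $((start + good_records))"
```

Both numbers come out as 5566449, so the unreadable sector is 5566449 rather than the 5566440 recorded in the SMART error log.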

> [stuff snipped]

> That said:
> 
> http://jdc.parodius.com/freebsd/bad_block_scan
> 
> If you run this on your ad2 drive, I'm hoping what you'll find are two
> LBAs which can't be read -- one will be the remapped LBA and one will be
> the "suspect" LBA.  If you only get one LBA error then that's fine too,
> and will be the "suspect" LBA.

> Once you have the LBA(s), you can submit writes to them to get the drive
> to re-analyse them (assuming they're "suspect"):
> 
> dd if=/dev/zero of=/dev/XXX bs=512 count=1 seek=NNNNN
> 
> Where XXX is the device and NNNNN is the LBA number.
> 
> If this works properly, the dd command should sit there for a little bit
> (as the drive does its re-analysis magic) and then should complete.

ad2 is part of a gmirror with ad0.   Does this change things?

I haven't tried the dd yet.
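
When I do try it, I would probably wrap the write so that it only prints the command unless told otherwise.  A minimal sketch; the function name rewrite_lba and the DRY_RUN variable are my own inventions, and it says nothing about whether writing to a gmirror member directly is safe:

```shell
#!/bin/sh
# Print (and only optionally run) the single-sector write suggested above.
# DRY_RUN defaults to 1, so nothing is written unless DRY_RUN=0 is set.
# WARNING: a real run overwrites 512 bytes at that LBA with zeros.
rewrite_lba() {
    dev=$1
    lba=$2
    # Refuse anything that is not a plain non-negative integer.
    case $lba in
        ''|*[!0-9]*) echo "LBA must be a non-negative integer" >&2; return 1 ;;
    esac
    cmd="dd if=/dev/zero of=$dev bs=512 count=1 seek=$lba"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

rewrite_lba /dev/ad2 5566449
```

With the default dry run this just echoes the dd command line for review.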

> 
> You'll want to check SMART stats after that; you should see
> Current_Pending_Sector drop to 0.  If Offline_Uncorrectable incremented
> then the LBA could not be re-read/remapped.

Current_Pending_Sector did not drop to 0.  It incremented:

197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2

[was 1]

>  If Reallocated_Sector_Ct
> incremented then you now have a total of 2 LBAs which are remapped.

Reallocated_Sector_Ct did increment:

$ diff smarctl.1 smarctl.3 | grep Reallocated_Sector_Ct
<   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       1
>   5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2

Full output of smartctl has been appended to http://beta.freebsddiary.org/smart-fixing-bad-sector.php
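
To compare these counters between runs without eyeballing a diff, something like the following could pull out just the raw values.  It is a sketch that assumes the ten-column attribute layout shown above, with the raw value in column 10; the function name raw_value is mine.

```shell
#!/bin/sh
# Print the raw value (column 10) of a named attribute from
# smartctl-style attribute lines read on stdin.
raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Sample lines copied from the output quoted in this message.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   020    Pre-fail  Always       -       2
197 Current_Pending_Sector  0x0032   100   100   020    Old_age   Always       -       2'

printf '%s\n' "$sample" | raw_value Reallocated_Sector_Ct
printf '%s\n' "$sample" | raw_value Current_Pending_Sector
```

Against the live drive the same function should work as "smartctl -A /dev/ad2 | raw_value Current_Pending_Sector", provided the attribute table has the same layout.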

> In
> the case of remapping, you get to deal with the UFS/FFS thing above.
> To get the stats to update in this situation you *might* (but probably
> not) have to run "smartctl -t offline /dev/XXX".

I didn't try that...

> 
> You might also be wondering "that dd command writes 512 bytes of zero to
> that LBA; what about the old data that was there, in the case that the
> drive remaps the LBA?"  This is a great question, and one I've never
> actually taken the time to answer because at this present time I have
> absolutely *no* bad disks in my possession.  I'm under the impression
> that the write does in fact write zeros if the LBA is remapped, but that
> might not be true at all.  I've been waiting to test this for quite some
> time and document it/write about it.
> 
> I still suggest you replace the drive, although given its age I doubt
> you'll be able to find a suitable replacement.  I tend to keep disks
> like this around for testing/experimental purposes and not for actual
> use.

I have several unused 80GB HDD I can place into this system.  I think that's
what I'll wind up doing.  But I'd like to follow this process through and get it documented
for future reference.

-- 
Dan Langille - http://langille.org