SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error
O. Hartmann
ohartmann at walstatt.org
Tue Dec 12 22:19:21 UTC 2017
On Tue, 12 Dec 2017 10:52:27 -0800 (PST)
"Rodney W. Grimes" <freebsd-rwg at pdx.rh.CN85.dnsmgr.net> wrote:
Thank you for answering so fast!
> > Hello,
> >
> > running CURRENT (recent r326769), I noticed that smartd sends some console
> > messages when booting the box:
> >
> > [...]
> > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> > Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> > [...]
> >
> > Checking the drive's SMART log with smartctl (it is one of four 3TB disk drives), I
> > get this information:
> >
> > [... smartctl -x /dev/ada6 ...]
> > Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
> > When the command that caused the error occurred, the device was active or idle.
> >
> > After command completion occurred, registers were:
> > ER -- ST COUNT LBA_48 LH LM LL DV DC
> > -- -- -- == -- == == == -- -- -- -- --
> > 40 -- 51 00 00 00 00 c2 7a 72 98 40 00 Error: UNC at LBA = 0xc27a7298 = 3262804632
> >
> > Commands leading to the command that caused the error were:
> > CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> > -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> > 60 00 b0 00 88 00 00 c2 7a 73 20 40 08 23:38:12.195 READ FPDMA QUEUED
> > 60 00 b0 00 80 00 00 c2 7a 72 70 40 08 23:38:12.195 READ FPDMA QUEUED
> > 2f 00 00 00 01 00 00 00 00 00 10 40 08 23:38:12.195 READ LOG EXT
> > 60 00 b0 00 70 00 00 c2 7a 73 20 40 08 23:38:09.343 READ FPDMA QUEUED
> > 60 00 b0 00 68 00 00 c2 7a 72 70 40 08 23:38:09.343 READ FPDMA QUEUED
> > [...]
> >
> > and
> >
> > [...]
> > SMART Attributes Data Structure revision number: 16
> > Vendor Specific SMART Attributes with Thresholds:
> > ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> > 1 Raw_Read_Error_Rate POSR-K 200 200 051 - 64
> > 3 Spin_Up_Time POS--K 178 170 021 - 6075
> > 4 Start_Stop_Count -O--CK 098 098 000 - 2406
> > 5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
> > 7 Seek_Error_Rate -OSR-K 200 200 000 - 0
> > 9 Power_On_Hours -O--CK 066 066 000 - 25339
> > 10 Spin_Retry_Count -O--CK 100 100 000 - 0
> > 11 Calibration_Retry_Count -O--CK 100 100 000 - 0
> > 12 Power_Cycle_Count -O--CK 098 098 000 - 2404
> > 192 Power-Off_Retract_Count -O--CK 200 200 000 - 154
> > 193 Load_Cycle_Count -O--CK 001 001 000 - 2055746
> > 194 Temperature_Celsius -O---K 122 109 000 - 28
> > 196 Reallocated_Event_Count -O--CK 200 200 000 - 0
> > 197 Current_Pending_Sector -O--CK 200 200 000 - 1
> > 198 Offline_Uncorrectable ----CK 200 200 000 - 1
> > 199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
> > 200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 5
> > ||||||_ K auto-keep
> > |||||__ C event count
> > ||||___ R error rate
> > |||____ S speed/performance
> > ||_____ O updated online
> > |______ P prefailure warning
> >
> > [...]
>
> The data up to this point tells us that you have 1 bad sector
> on a 3TB drive. That is actually an expected event: given the data
> error rate of these drives, you are going to see one now
> and again.
>
> Given you have 1 single event I would not suspect that this drive
> is dying, but it would be prudent to prepare for that possibility.
Hello.
Well, I simply copied the one event that had been logged so far.
As you (and I) can see, it is error #42. After I posted here, the box was rebooted
because the estimated time of the "repair" process on the pool suddenly went up, and now I am at
error #47. Interestingly, a different block is damaged this time, and the SMART attribute fields
currently show this:
[...]
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 69
3 Spin_Up_Time POS--K 178 170 021 - 6075
4 Start_Stop_Count -O--CK 098 098 000 - 2406
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 066 066 000 - 25343
10 Spin_Retry_Count -O--CK 100 100 000 - 0
11 Calibration_Retry_Count -O--CK 100 100 000 - 0
12 Power_Cycle_Count -O--CK 098 098 000 - 2404
192 Power-Off_Retract_Count -O--CK 200 200 000 - 154
193 Load_Cycle_Count -O--CK 001 001 000 - 2055746
194 Temperature_Celsius -O---K 122 109 000 - 28
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 200 200 000 - 1
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 5
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
[...]
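When comparing such tables between reboots, the raw values of attributes 5, 196, 197 and 198 are the ones that say whether sectors are pending or have been reallocated. A minimal sketch of a filter for them, assuming the brief eight-column layout shown above; the function name smart_watch is made up for illustration:

```shell
# Print ID, name and raw value of the reallocation-related attributes.
# Assumes the brief layout: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE.
smart_watch() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $1, $2, $8 }'
}
```

It could be fed with something like `smartctl -A -f brief /dev/ada6 | smart_watch`, and the output saved per boot to see what changes.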
197 Current_Pending_Sector decreased to zero so far, but with every reboot, the error
count seems to increase:
[...]
Error 47 [22] occurred at disk power-on lifetime: 25343 hours (1055 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 00 00 00 c2 19 d9 88 40 00 Error: UNC at LBA = 0xc219d988 = 3256473992
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
60 00 b0 00 d0 00 00 c2 19 da 28 40 08 1d+07:12:34.336 READ FPDMA QUEUED
60 00 b0 00 c8 00 00 c2 19 d9 78 40 08 1d+07:12:34.336 READ FPDMA QUEUED
2f 00 00 00 01 00 00 00 00 00 10 40 08 1d+07:12:34.336 READ LOG EXT
60 00 b0 00 b8 00 00 c2 19 da 28 40 08 1d+07:12:31.484 READ FPDMA QUEUED
60 00 b0 00 b0 00 00 c2 19 d9 78 40 08 1d+07:12:31.483 READ FPDMA QUEUED
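To double-check the two addresses: the LBA is simply the hex digits from the LBA_48 columns read as one number, and with 512-byte logical sectors the byte offset is LBA * 512. A small sketch in plain sh; the function names are only illustrative, and the 512-byte sector size is an assumption matching the bs=512 used in the dd command further down:

```shell
# Convert the hex LBA from a smartctl error entry to decimal.
lba_from_hex() {
    echo "$((0x$1))"
}

# Byte offset of an LBA, assuming 512-byte logical sectors.
lba_to_bytes() {
    echo "$(($1 * 512))"
}

lba_from_hex c27a7298   # error 42, prints 3262804632
lba_from_hex c219d988   # error 47, prints 3256473992
```

The two LBAs are 6330640 sectors apart, so these are two separate bad spots rather than one sector reported twice.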
I think this is what a dying HDD looks like, isn't it?
I'd say broken cabling would produce different errors, wouldn't it?
The Western Digital Green series HDD is a useful fellow when used as a single
drive. I think there might be an issue with pairing 4 HDDs, 3 of them "GREEN", in a RAIDZ
and having them physically sit next to each other. Maybe it is time to replace them one by one ...
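For a one-by-one replacement with no free SATA port, the usual sequence is to take the disk offline, swap it physically, and then run zpool replace with only the old device name so that ZFS resilvers onto the new disk in the same slot. A sketch that only prints the commands and runs nothing; the pool name tank is a placeholder (the real pool name is not given in this thread), and whether the new disk shows up as ada6 again is an assumption to verify first:

```shell
# Print (not run) the commands for an in-place disk swap.
# $1 = pool name (placeholder), $2 = device name as seen by the pool.
replace_plan() {
    pool=$1
    dev=$2
    echo "zpool offline $pool $dev"
    echo "# power down, swap the physical disk, boot"
    echo "zpool replace $pool $dev"
    echo "zpool status $pool"
}

replace_plan tank ada6
```

A larger replacement disk should be accepted in this scheme; the extra capacity stays unused as long as the other members of the vdev are 3 TB.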
>
>
> >
> > The ZFS pool is RAIDZ1, made up of 3 WD Green 3TB HDDs and one WD RED 3 TB HDD. The
> > failure occurred on one of the WD Green 3 TB HDDs.
> Ok, so the data is redundantly protected. This helps a lot.
>
> > The pool is marked as "resilvered" - I scrub on a regular basis and the
> > "resilvering" message has now appeared for the second time in a row. Searching the net
> > for SMART attribute 197 errors (in my case the count is one), the recommendation, in
> > combination with the problems that occurred, is that I should replace the disk.
>
> It is probably putting the RAIDZ in that state as the scrub is finding a block
> it can not read.
>
> >
> > Well, here comes the problem. The box is built from "electronic waste" made by
> > ASRock - it is a Socket 1150/IvyBridge board, which got its last firmware/BIOS update
> > in 2013, and since then UEFI booting FreeBSD from a HDD hasn't been possible (just to
> > indicate that I'm aware of having issues with crap, but that is some other issue
> > right now). The board's SATA connectors are all populated.
> >
> > So: Due to the lack of adequate backup space I can only selectively backup portions,
> > most of the space is occupied by scientific modelling data, which I had worked on. So
> > backup exists! In one way or the other. My concern is how to replace the faulty HDD!
> > Most HowTo's indicate a replacement disk being prepared and then "replaced" via ZFS's
> > replace command. This isn't applicable here.
> >
> > Question: is it possible to simply pull the faulty disk (implies I know exactly which
> > one to pull!) and then prepare and add the replacement HDD and let the system do its
> > job resilvering the pool?
>
> That may work, but I think I have a simpler solution.
>
> >
> > Next question is: I'm about to replace the 3 TB HDD with a more recent and modern 4 TB
> > HDD (WD RED 4TB). I'm aware of the fact that I can only use 3 TB as the other disks
> > are 3 TB, but I'd like to know whether FreeBSD's ZFS is capable of handling it?
>
> Someone else?
>
> >
> > This is the first time I have issues with ZFS and a faulty drive, so if some of my
> > questions sound naive, please forgive me.
>
> One thing to try is to see if we can get the drive to fix itself. First order
> of business: can you take this server out of service? If so, I would
> simply try to do a
> repeat 100 dd if=/dev/whicheverhdisbad of=/dev/null bs=512 count=1 conv=noerror,sync iseek=3262804632
>
> That tries to read that block 100 times. If it is successful even once,
> smart should remap the block and you are all done.
Given that this erroneous block is like a moving target, is this solution
still the favorite one? I'll try, but I already have the replacement 4 TB HDD at hand.
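Since `repeat` is a csh/tcsh builtin, the same read retry under sh, usable for both reported LBAs, could be sketched as below. The function name and the retry count are illustrative, and bs=512 count=1 is an assumption so that each attempt touches only the one sector:

```shell
# Try to read one 512-byte sector up to $3 times; stop at the first success.
# $1 = device, $2 = LBA, $3 = number of attempts.
read_sector_retry() {
    i=0
    while [ "$i" -lt "$3" ]; do
        if dd if="$1" of=/dev/null bs=512 count=1 skip="$2" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

Usage would be along the lines of `read_sector_retry /dev/ada6 3262804632 100` followed by the same call with 3256473992; a zero exit status means at least one read went through.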
>
> If that fails we can try to zero the block. There is a risk here, but raidz should just
> handle this as a data corruption of a block. This could possibly lead to data loss,
> so USE AT YOUR OWN RISK ASSESSMENT.
> dd if=/dev/zero of=/dev/whateverdrivehasissues bs=512 count=1 oseek=3262804632
It would then be oseek=3256473992, too.
>
> That should forcibly overwrite the bad block with 0's. The smart firmware
> will see this in the pending list, write the data, and read it back. If successful,
> it removes it from the pending list; if that fails, it reallocates the block, writes
> the 0's to the reallocation and adds 1 to the remapped block count.
>
> You might google for "how to fix a pending reallocation"
>
> > Thanks in advance,
> > Oliver
> > --
> > O. Hartmann
>
Kind regards,
Oliver
--
O. Hartmann
I object to the use or transmission of my data for
advertising purposes or for market or opinion research (§ 28 Abs. 4 BDSG).