smartctl
Ken Moffat
zarniwhoop at ntlworld.com
Sat Mar 28 06:15:09 UTC 2015
On Fri, Mar 27, 2015 at 09:05:29PM -0800, CK wrote:
> Regarding the unexpected loss of files from the filesystem under various
> loads, is the appended 'smartctl' data sufficient to make the determination
> that the loss of files while the operating system is in use could be due to
> the condition of the drive?
>
Drives fail. Sometimes smartctl reports problems _if_ you run the
tests, other times they fail suddenly. The drive is old (only
40GB), so although the hours are only 12540 (about 520 days) I suspect
it might have been running "round the clock". Apparently it is a
5400rpm PATA drive - I used to use a pair of 5400rpm drives for RAID1
on a previous server, but I think I bought those 6 or more years ago,
and even then they were 320GB. So old age seems a possible answer.
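
A minimal sketch of that arithmetic, taking the Power_On_Hours raw value
from the quoted report and assuming the drive ran continuously (the
function name is only illustrative):

```python
# Convert the Power_On_Hours raw value from the quoted smartctl report
# into days of continuous running. The "continuous" part is an assumption;
# the counter itself says nothing about how the hours were spread out.
POWER_ON_HOURS = 12540  # attribute 9 in the quoted output


def hours_to_days(hours: int) -> float:
    """Return the number of 24-hour days equivalent to `hours`."""
    return hours / 24


days = hours_to_days(POWER_ON_HOURS)
print(f"{POWER_ON_HOURS} hours is about {days:.1f} days "
      f"({days / 365:.2f} years) of round-the-clock use")
```

So the drive has a little over 500 days of spinning time, which is not
much for a drive of this age unless it spent most of its life powered off.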
> I didn't think so at first, because:
>
> 1) I would expect a FreeBSD error to the effect of "unable to read/write
> /dev/ada0" or "block checksum does not match block data".
>
> 2) I would expect that all data read from or written to a drive is verified to be
> correct by FreeBSD with checksums, and that it is guaranteed to be correct
> if there are no serious and fatal errors reported by the operating system.
I cannot comment on that (outside of VMs I am a Linux user), but if the
drive's write cache is enabled then technically all bets are off - most
modern drives enable it to improve throughput.
Files can also be lost through filesystem errors, or through an
unfortunate use of 'rm -rf'.
>
> But I may be wrong in these assumptions. Anybody know for sure? I have never
> seen FreeBSD report any filesystem r/w errors. My past experience has only
> taught me that when a drive begins to make very bad noises, this generally
> accompanies obvious and serious problems; and that a drive fails when the
> mechanical parts fail, but not due to wear on heads/platters or other things
> that may cause failures that are not detected/reported by the operating
> system.
>
My experience is limited (starting with two or three machines,
mostly with one drive each, through to the current day where I have
4 desktop machines with one drive each, and a machine used as a server
with 3 drives). But recently I seem to have to replace at least one
drive every year. The last one was "just in case" because the SMART
checks were often reporting unreadable sectors - not permanent errors,
and it was in RAID-1 so ok while the other one still worked. I've
discarded others because they became too slow or too antiquated
(IDE, SATAv1).
But I would seriously suggest that if you have installed
smartmontools then you ought to run some of the tests - on a server
I tend to run long tests daily, at a time when I hope it is quiet,
but on desktops less frequently. For a laptop I probably only run
them when I think about it and know it will be on mains power.
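
As a rough sketch of what scheduling such tests might look like, the
helper below only builds the smartctl command lines and leaves running
them to the caller. The device name /dev/ada0 is taken from the quoted
message; the function names are hypothetical, and actually starting a
test needs root and an installed smartmontools.

```python
import subprocess
from typing import List

# Test types passed to `smartctl -t`. The quoted report says this drive
# supports short, extended ("long") and conveyance tests, but not
# selective ones, so those three are the only ones allowed here.
TEST_TYPES = ("short", "long", "conveyance")


def selftest_command(device: str, test_type: str = "short") -> List[str]:
    """Build the command that asks the drive to start a self-test."""
    if test_type not in TEST_TYPES:
        raise ValueError(f"unsupported test type: {test_type!r}")
    return ["smartctl", "-t", test_type, device]


def selftest_log_command(device: str) -> List[str]:
    """Build the command that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def start_selftest(device: str, test_type: str = "short") -> int:
    """Run smartctl to start a test and return its exit status.

    The test itself runs inside the drive; smartctl returns at once,
    and the result shows up later in the self-test log.
    """
    return subprocess.run(selftest_command(device, test_type)).returncode
```

A nightly cron job on a server could call start_selftest("/dev/ada0",
"long") and a later job could run the log command to read the result.
For this drive the report suggests allowing about 42 minutes for the
extended test.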
> I can't see how the loss of files could occur without FreeBSD noticing it and
> reporting on it. Does FreeBSD just trust drives to do everything correctly
> at all times?
>
> --
>
> smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.2-RELEASE i386] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Caviar WDxxxAB
> Device Model: WDC WD400AB-22CDB0
> Serial Number: WD-WMA9T1222658
> Firmware Version: 22.04A22
> User Capacity: 40,020,664,320 bytes [40.0 GB]
> Sector Size: 512 bytes logical/physical
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ATA/ATAPI-5 (minor revision not indicated)
> Local Time is: Fri Mar 27 20:35:32 2015 AKDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: ( 2376) seconds.
> Offline data collection
> capabilities: (0x3b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> No Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> No General Purpose Logging support.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 42) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   102   099   021    Pre-fail  Always       -       3975
>   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       58
>   5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
I've had recent drives which started to give problems (particularly,
unreadable sectors) around the time the Reallocated Sector Count
became non-zero.
>   7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12540
>  10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
> 196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
>
I would try running some self-tests.
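
Between test runs, the attribute table can be checked mechanically. The
following is only a sketch: it parses rows in the layout shown in the
quoted report and lists the sector-related attributes whose raw value is
non-zero. Which IDs to watch (5, 196, 197, 198) is an assumption made for
illustration, not a rule from smartmontools, and the sample rows are
copied from the report above.

```python
from typing import Dict, List, Optional, Tuple

# A few rows copied from the quoted `smartctl` attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
"""

# Assumed watch list: reallocated, pending and uncorrectable sector counts.
WATCHED_IDS = {5, 196, 197, 198}


def parse_attributes(text: str) -> Dict[int, Tuple[str, Optional[int]]]:
    """Map attribute ID to (name, raw value) for each table row."""
    rows = {}
    for line in text.splitlines():
        # Ten columns; the last (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header line or anything that is not an attribute row
        raw = fields[9].split()[0]
        rows[int(fields[0])] = (fields[1], int(raw) if raw.isdigit() else None)
    return rows


def nonzero_watched(rows) -> List[Tuple[int, str, int]]:
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    return [(attr_id, name, raw)
            for attr_id, (name, raw) in sorted(rows.items())
            if attr_id in WATCHED_IDS and raw]


flagged = nonzero_watched(parse_attributes(SAMPLE))
for attr_id, name, raw in flagged:
    print(f"{attr_id:3d} {name}: raw value {raw}")
```

On the quoted data this picks out the two reallocation counters, which
matches the one reallocated sector already noted. A non-zero count is a
reason to keep watching the drive and run the self-tests, not proof that
it caused the lost files.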
>
> Selective Self-tests/Logging not supported
>
ĸen
--
Nanny Ogg usually went to bed early. After all, she was an old lady.
Sometimes she went to bed as early as 6 a.m.