smartctl

Sat Mar 28 06:15:09 UTC 2015

On Fri, Mar 27, 2015 at 09:05:29PM -0800, CK wrote:
> Regarding the unexpected loss of files from the filesystem under various
> loads, is the appended 'smartctl' data sufficient to make the determination
> that the loss of files while the operating system is in use could be due to
> the condition of the drive?
> 
Drives fail.  Sometimes smartctl reports problems _if_ you run the
tests; other times they fail suddenly.  The drive is old (only
40GB), so although the hours are only 12540 (about 520 days) I suspect
it might have been running "round the clock".  Apparently it is a
5400rpm PATA drive - I used to use a pair of 5400rpm drives for RAID1
on a previous server, but I think I bought those 6 or more years ago,
and even then they were 320GB.  So old age seems a possible answer.
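
To put numbers on that, here is a minimal Python sketch (the function
name is my own, purely for illustration).  It assumes the raw value of
attribute 9 really is in hours, which is how smartctl presents it for
this drive but is not true of every vendor, and it uses the raw values
of attributes 9 and 12 from the report quoted further down.

```python
def summarize_usage(power_on_hours, power_cycles):
    """Return (days powered on, average hours per power cycle)."""
    days = power_on_hours / 24
    hours_per_cycle = power_on_hours / power_cycles
    return days, hours_per_cycle


# Raw values of Power_On_Hours (9) and Power_Cycle_Count (12)
# taken from the quoted smartctl report.
days, per_cycle = summarize_usage(12540, 57)
print(f"{days:.1f} days powered on, about {per_cycle:.0f} hours per power cycle")
```

An average of roughly nine days per power-up is at least consistent
with the drive having been left running for long stretches.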

> I didn't think so at first, because:
> 
> 1)  I would expect a FreeBSD error to the effect of "unable to read/write
>     /dev/ada0" or "block checksum does not match block data".
> 
> 2)  I would expect that all data read/written to from a drive is verified to be
>     correct by FreeBSD with checksums, and that it is guaranteed to be correct
>     if there are no serious and fatal errors reported by the operating system.
I cannot comment on that (except in VMs, I'm a Linux user), but if the
drive's write cache is enabled then technically all bets are off - most
modern drives enable it to improve throughput.

Files can also go missing through filesystem errors, or an unfortunate
use of 'rm -rf'.
> 
> But I may be wrong in these assumptions.  Anybody know for sure? I have never
> seen FreeBSD report any filesystem r/w errors. My past experience has only
> taught me that when a drive begins to make very bad noises, this generally
> accompanies obvious and serious problems; and that a drive fails when the
> mechanical parts fail, but not due to wear on heads/platters or other things
> that may cause failures that are not detected/reported by the operating
> system.
> 

My experience is limited (starting with two or three machines,
mostly with one drive each, through to the current day where I have
4 desktop machines with one drive each, and a machine used as a server
with 3 drives).  But recently I seem to have to replace at least one
drive every year, although the last one was "just in case" because
the SMART checks were often reporting unreadable sectors (not
permanent errors, and it was in RAID-1 so ok while the other one
still worked), and I've discarded others because they became too
slow or too antiquated (IDE, SATAv1).

But I would seriously suggest that if you have installed
smartmontools then you ought to run some of the tests - on a server
I tend to run long tests daily, at a time when I hope it is quiet,
but on desktops less frequently.  For a laptop I probably only run
them when I think about it and know it will be on mains power.
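
As a rough illustration of what I mean by running the tests, here is a
small Python sketch that builds the smartctl command lines.  The
'-t short|long|conveyance' and '-l selftest' options are real smartctl
options; the helper names are my own, and /dev/ada0 is only assumed
from the device name in your message.  Starting a test needs root.

```python
import subprocess

# Test types this drive reports as supported (no selective self-test).
VALID_TESTS = ("short", "long", "conveyance")


def selftest_command(device, kind="long"):
    """Build the smartctl command line that starts a self-test."""
    if kind not in VALID_TESTS:
        raise ValueError(f"unsupported self-test type: {kind!r}")
    return ["smartctl", "-t", kind, device]


def selftest_log_command(device):
    """Build the command line that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def start_selftest(device, kind="long"):
    """Start a self-test; requires root and smartmontools installed."""
    return subprocess.run(selftest_command(device, kind), check=False)


# Hypothetical usage, e.g. from a nightly cron job:
#   start_selftest("/dev/ada0")            # kick off the long test
#   ... later, once the test has had time to finish ...
#   subprocess.run(selftest_log_command("/dev/ada0"))
```

The test runs inside the drive, so the command returns immediately and
the result only shows up in the self-test log afterwards.  smartd, from
the same package, can also schedule tests, which is probably tidier
than a home-made script.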

> I can't see how the loss of files could occur without FreeBSD noticing it and
> reporting on it.  Does FreeBSD just trust drives to do everything correctly
> at all times?
> 
> --
> 
> smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.2-RELEASE i386] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Caviar WDxxxAB
> Device Model:     WDC WD400AB-22CDB0
> Serial Number:    WD-WMA9T1222658
> Firmware Version: 22.04A22
> User Capacity:    40,020,664,320 bytes [40.0 GB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA/ATAPI-5 (minor revision not indicated)
> Local Time is:    Fri Mar 27 20:35:32 2015 AKDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x84)	Offline data collection activity
> 					was suspended by an interrupting command from host.
> 					Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0)	The previous self-test routine completed
> 					without error or no self-test has ever
> 					been run.
> Total time to complete Offline
> data collection: 		( 2376) seconds.
> Offline data collection
> capabilities: 			 (0x3b) SMART execute Offline immediate.
> 					Auto Offline data collection on/off support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					No Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					No General Purpose Logging support.
> Short self-test routine
> recommended polling time: 	 (   2) minutes.
> Extended self-test routine
> recommended polling time: 	 (  42) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (   5) minutes.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   102   099   021    Pre-fail  Always       -       3975
>   4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       58
>   5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
I've had recent drives which started to give problems (particularly,
unreadable sectors) around the time the Reallocated Sector Count
became non-zero.
>   7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12540
>  10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
> 196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> 
I would try running some self-tests.
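
Coming back to the attribute table: if you want to keep an eye on the
reallocation counters between tests, something like this Python sketch
would pick them out of saved 'smartctl -A' output.  Which attributes to
watch is my own choice, not anything official, and it assumes the raw
value is a plain integer in the last column, as it is in your report.

```python
WATCHED = ("Reallocated_Sector_Ct", "Reallocated_Event_Count",
           "Current_Pending_Sector", "Offline_Uncorrectable")


def nonzero_watched(report):
    """Return {attribute name: raw value} for watched attributes that are non-zero."""
    found = {}
    for line in report.splitlines():
        # Tolerate mail-quoted lines by dropping a leading "> ".
        fields = line.lstrip("> ").split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) != 0:
                found[fields[1]] = int(raw)
    return found


# Four rows copied from the quoted report.
SAMPLE = """\
>   5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
> 196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
"""
print(nonzero_watched(SAMPLE))
```

On your report this flags only the two reallocation counters, each
with a raw value of 1.  A single reallocated sector is not proof of
anything, but I would watch whether the number grows.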
> 
> Selective Self-tests/Logging not supported
> 
> _______________________________________________
> freebsd-questions at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to "freebsd-questions-unsubscribe at freebsd.org"
ĸen
-- 
Nanny Ogg usually went to bed early. After all, she was an old lady.
Sometimes she went to bed as early as 6 a.m.