zfs process hang on pool access

Andriy Gapon avg at FreeBSD.org
Wed Jul 27 13:45:42 UTC 2011


on 27/07/2011 15:06 Steven Hartland said the following:
> I've checked the raw disk and all seems fine there, so it does look like it's
> some sort of ZFS livelock.
> 
> I'm trying to keep the machine available in case someone needs more information,
> but it's a production machine so I'm going to have to reboot it in the next
> few hours.
> 
> Disk tests:-
> 
> dd if=/dev/da1 of=/dev/null bs=10m
> 5724+1 records in
> 5724+1 records out
> 60022480896 bytes transferred in 430.479894 secs (139431555 bytes/sec)
> 
> 
> smartctl -a /dev/da1

Is this the only disk associated with the troubled pool?

> smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.2-RELEASE amd64] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> === START OF INFORMATION SECTION ===
> Model Family:     SandForce Driven SSDs
> Device Model:     Corsair CSSD-F60GB2
> Serial Number:    10446509320009990024
> Firmware Version: 1.1
> User Capacity:    60,022,480,896 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 6
> Local Time is:    Wed Jul 27 11:27:30 2011 UTC
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x00) Offline data collection activity
>                                        was never started.
>                                        Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                        without error or no self-test has ever
>                                        been run.
> Total time to complete Offline data collection:                 (   0) seconds.
> Offline data collection
> capabilities:                    (0x7f) SMART execute Offline immediate.
>                                        Auto Offline data collection on/off support.
>                                        Abort Offline collection upon new
>                                        command.
>                                        Offline surface scan supported.
>                                        Self-test supported.
>                                        Conveyance Self-test supported.
>                                        Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                        power-saving mode.
>                                        Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                        General Purpose Logging supported.
> Short self-test routine recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        (  48) minutes.
> Conveyance self-test routine
> recommended polling time:        (   2) minutes.
> SCT capabilities:              (0x003d) SCT Status supported.
>                                        SCT Error Recovery Control supported.
>                                        SCT Feature Control supported.
>                                        SCT Data Table supported.
> 
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   119   100   050    Pre-fail  Always       -       0/238293224
>   5 Retired_Block_Count     0x0033   097   097   003    Pre-fail  Always       -       256
>   9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       5513h+00m+39.450s
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
> 171 Program_Fail_Count      0x0000   000   000   000    Old_age   Offline      -       0
> 172 Erase_Fail_Count        0x0000   000   000   000    Old_age   Offline      -       0
> 174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       0
> 177 Wear_Range_Delta        0x0000   000   000   ---    Old_age   Offline      -       1
> 181 Program_Fail_Count      0x0000   000   000   000    Old_age   Offline      -       0
> 182 Erase_Fail_Count        0x0000   000   000   000    Old_age   Offline      -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 194 Temperature_Celsius     0x0022   022   026   000    Old_age   Always       -       22 (Min/Max 0/26)
> 195 ECC_Uncorr_Error_Count  0x001c   119   100   000    Old_age   Offline      -       0/238293224
> 196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
> 231 SSD_Life_Left           0x0013   057   057   010    Pre-fail  Always       -       0
> 233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       152704
> 234 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       90688
> 241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       90688
> 242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       3584
> 
> Error SMART Error Log Read failed: Input/output error
> Smartctl: SMART Error Log Read Failed
> Error SMART Error Self-Test Log Read failed: Input/output error
> Smartctl: SMART Self Test Log Read Failed
> SMART Selective self-test log data structure revision number 1
> SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>    1        0        0  Not_testing
>    2        0        0  Not_testing
>    3        0        0  Not_testing
>    4        0        0  Not_testing
>    5        0        0  Not_testing
> Selective self-test flags (0x0):
>  After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.

-- 
Andriy Gapon

