out of HDD space - zfs degraded

Sun Oct 3 02:04:35 UTC 2010

On 10/2/2010 7:50 PM, Jeremy Chadwick wrote:
> On Sat, Oct 02, 2010 at 07:23:16PM -0400, Dan Langille wrote:
>> On 10/2/2010 6:36 PM, Jeremy Chadwick wrote:
>>> On Sat, Oct 02, 2010 at 06:09:25PM -0400, Dan Langille wrote:
>>>> On 10/2/2010 10:19 AM, Jeremy Chadwick wrote:
>>>>> On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote:
>>>>>> Overnight I was running a zfs send | zfs receive (both within the
>>>>>> same system / zpool).  The system ran out of space, a drive went off
>>>>>> line, and the system is degraded.
>>>>>>
>>>>>> This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18
>>>>>> 23:43:48 EDT 2010.
>>>>>>
>>>>>> The following logs are also available at
>>>>>> http://www.langille.org/tmp/zfs-space.txt<- no line wrapping
>>>>>>
>>>>>> This is what was running:
>>>>>>
>>>>>> # time zfs send storage/bacula at transfer | mbuffer | zfs receive
>>>>>> storage/compressed/bacula-mbuffer
>>>>>> in @  0.0 kB/s, out @  0.0 kB/s, 3670 GB total, buffer 100%
>>>>>> fullcannot receive new filesystem stream: out of space
>>>>>> mbuffer: error: outputThread: error writing to<stdout>    at offset
>>>>>> 0x395917c4000: Broken pipe
>>>>>>
>>>>>> summary: 3670 GByte in 10 h 40 min 97.8 MB/s
>>>>>> mbuffer: warning: error during output to<stdout>: Broken pipe
>>>>>> warning: cannot send 'storage/bacula at transfer': Broken pipe
>>>>>>
>>>>>> real    640m48.423s
>>>>>> user    8m52.660s
>>>>>> sys     211m40.862s
>>>>>>
>>>>>>
>>>>>> Looking in the logs, I see this:
>>>>>>
>>>>>> Oct  2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device
>>>>>> Oct  2 00:50:54 kraken kernel: siisch0: Timeout on slot 30
>>>>>> Oct  2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>>>> Oct  2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT
>>>>>> Oct  2 00:50:55 kraken kernel: siisch0: Timeout on slot 30
>>>>>> Oct  2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>>>> Oct  2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT
>>>>>> Oct  2 00:50:56 kraken kernel: siisch0: Timeout on slot 30
>>>>>> Oct  2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>>>> Oct  2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT
>>>>>> Oct  2 00:50:57 kraken kernel: siisch0: Timeout on slot 30
>>>>>> Oct  2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>>>> Oct  2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT
>>>>>> Oct  2 00:50:58 kraken kernel: siisch0: Timeout on slot 30
>>>>>> Oct  2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>>>> Oct  2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT
>>>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>>>> path=/dev/gpt/disk06-live offset=270336 size=8192 error=6
>>>>>>
>>>>>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize
>>>>>> cache failed
>>>>>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry
>>>>>>
>>>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>>>> path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6
>>>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>>>> path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6
>>>>>>
>>>>>> $ zpool status
>>>>>>    pool: storage
>>>>>>   state: DEGRADED
>>>>>>   scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go
>>>>>> config:
>>>>>>
>>>>>>          NAME                 STATE     READ WRITE CKSUM
>>>>>>          storage              DEGRADED     0     0     0
>>>>>>            raidz2             DEGRADED     0     0     0
>>>>>>              gpt/disk01-live  ONLINE       0     0     0
>>>>>>              gpt/disk02-live  ONLINE       0     0     0
>>>>>>              gpt/disk03-live  ONLINE       0     0     0
>>>>>>              gpt/disk04-live  ONLINE       0     0     0
>>>>>>              gpt/disk05-live  ONLINE       0     0     0
>>>>>>              gpt/disk06-live  REMOVED      0     0     0
>>>>>>              gpt/disk07-live  ONLINE       0     0     0
>>>>>>
>>>>>> $ zfs list
>>>>>> NAME                        USED  AVAIL  REFER  MOUNTPOINT
>>>>>> storage                    6.97T  1.91T  1.75G  /storage
>>>>>> storage/bacula             4.72T  1.91T  4.29T  /storage/bacula
>>>>>> storage/compressed         2.25T  1.91T  46.9K  /storage/compressed
>>>>>> storage/compressed/bacula  2.25T  1.91T  42.7K  /storage/compressed/bacula
>>>>>> storage/pgsql              5.50G  1.91T  5.50G  /storage/pgsql
>>>>>>
>>>>>> $ sudo camcontrol devlist
>>>>>> Password:
>>>>>> <Hitachi HDS722020ALA330 JKAOA28A>     at scbus2 target 0 lun 0 (pass1,ada1)
>>>>>> <Hitachi HDS722020ALA330 JKAOA28A>     at scbus3 target 0 lun 0 (pass2,ada2)
>>>>>> <Hitachi HDS722020ALA330 JKAOA28A>     at scbus4 target 0 lun 0 (pass3,ada3)
>>>>>> <Hitachi HDS722020ALA330 JKAOA28A>     at scbus5 target 0 lun 0 (pass4,ada4)
>>>>>> <Hitachi HDS722020ALA330 JKAOA28A>     at scbus6 target 0 lun 0 (pass5,ada5)
>>>>>> <Hitachi HDS722020ALA330 JKAOA28A>     at scbus7 target 0 lun 0 (pass6,ada6)
>>>>>> <ST380815AS 4.AAB>                    at scbus8 target 0 lun 0 (pass7,ada7)
>>>>>> <TSSTcorp CDDVDW SH-S223C SB01>       at scbus9 target 0 lun 0 (cd0,pass8)
>>>>>> <WDC WD1600AAJS-75M0A0 02.03E02>      at scbus10 target 0 lun 0 (pass9,ada8)
>>>>>>
>>>>>> I'm not yet sure if the drive is fully dead or not.  This is not a
>>>>>> hot-swap box.
>>>>>
>>>>> It looks to me like the disk labelled gpt/disk06-live literally stopped
>>>>> responding to commands.  The errors you see are coming from the OS and
>>>>> the siis(4) controller, and both indicate the actual hard disk isn't
>>>>> responding to the ATA command READ LOG EXT.  error=6 means Device not
>>>>> configured.
>>>>>
>>>>> I can't see how/why running out of space would cause this.  It looks
>>>>> more like that you had a hardware issue of some sort happen during the
>>>>> course of the operations you were running.  It may not have happened
>>>>> until now because you hadn't utilised writes to that area of the disk
>>>>> (could have bad sectors there, or physical media/platter problems).
>>>>>
>>>>> Please provide smartctl -a output for the drive that's gpt/disk06-live,
>>>>> which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't
>>>>> it?  Sigh...).  Please put the results up on the web somewhere, not
>>>>> copy-pasted, otherwise I have to do a bunch of manual work with regarsd
>>>>> to line wrapping/etc...  I'll provide an analysis of SMART stats for
>>>>> you, to see if anything crazy happened to the disk itself.
>>>>
>>>> It is ada0, I'm sure, based on the 'lost device' mentioned in
>>>> /var/log/messages above.
>>>>
>>>> I'm getting nowhere.  /dev/ada0 does not exist so there is nothing
>>>> for smartctl to work on.
>>>>
>>>> [...]
>>>>
>>>> $ ls -l /dev/ada0*
>>>> ls: /dev/ada0*: No such file or directory
>>>
>>> Okay, so gpt/disk06-live is /dev/ada0.  (I won't ask why the label is
>>> called "disk06", but whatever.  :-) )
>>>
>>>> I am tempted to reboot or do a camontrol scan.
>>>
>>> DO NOT REBOOT.You can try the following -- I'm not sure whether to
>>> use scbus0 or scbus1 as the argument however, since I don't know what
>>> scbusX number ada0 was attached to previously.  "dmesg" from when the
>>> machine booted would show this.
>>>
>>> camcontrol reset scbusX
>>> camcontrol rescan scbusX
>>
>> I see this in /var/run/dmesg.boot:
>>
>> ada0 at siisch0 bus 0 scbus0 target 0 lun 0
>> ada0:<Hitachi HDS722020ALA330 JKAOA28A>  ATA-8 SATA 2.x device
>> ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
>> ada0: Command Queueing enabled
>>
>> $ sudo camcontrol reset scbus0
>> Password:
>> Reset of bus 0 was successful
>>
>> $ sudo camcontrol rescan scbus0
>> Re-scan of bus 0 was successful
>>
>>> If the disk comes back, please smartctl -a it.
>>
>> I didn't come back:
>>
>> $ ls /dev/ada*
>> /dev/ada1   /dev/ada2p1 /dev/ada4   /dev/ada5p1 /dev/ada7
>> /dev/ada1p1 /dev/ada3   /dev/ada4p1 /dev/ada6   /dev/ada8
>> /dev/ada2   /dev/ada3p1 /dev/ada5   /dev/ada6p1
>>
>> FYI, there's nothing new in /var/log/messages as a results of those
>> commands.
>
> Then I would recommend power-cycling (not rebooting or pressing of the
> reset button) the machine.  There's a good chance the ada0 disk has
> fallen off the bus and needs a full power-cycle, since a LUN scan didn't
> result in its reappearance.
>
> I see this kind of problem on a weekly basis at my workplace, in 3
> different datacenters, with Fujitsu SCSI-3 disks.  A system reboot
> doesn't make the disk reappear on on the bus, nor does a reset (pressing
> of the reset button).  Only a full power-cycle works.  And when I say
> weekly, I'm not exaggerating.
>
> I realise your disks are Hitachi not Fujitsu, and are SATA not SCSI, but
> it really doesn't matter -- there are cases where the drive firmware is
> wedged so hard that a physical power-cycle is required.
>
> If a power-cycle works, smartctl -a /dev/ada0 the disk and save the
> SMART stats somewhere.  If the same disk fails in this way again, I
> strongly recommend advance RMA'ing it (to ensure you get a completely
> different disk).
>
> Good luck!

thanks.

After a 'shutdown -p now', it was about 20 minutes before I went and 
powered it up (I was on minecraft).  The box came back with the missing HDD:

$ zpool status storage
   pool: storage
  state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
         attempt was made to correct the error.  Applications are 
unaffected.
action: Determine if the device needs to be replaced, and clear the errors
         using 'zpool clear' or replace the device with 'zpool replace'.
    see: http://www.sun.com/msg/ZFS-8000-9P
  scrub: none requested
config:

         NAME                 STATE     READ WRITE CKSUM
         storage              ONLINE       0     0     0
           raidz2             ONLINE       0     0     0
             gpt/disk01-live  ONLINE       0     0     0
             gpt/disk02-live  ONLINE       0     0     0
             gpt/disk03-live  ONLINE       0     0     0
             gpt/disk04-live  ONLINE       0     0     0
             gpt/disk05-live  ONLINE       0     0     0
             gpt/disk06-live  ONLINE       0     0    12
             gpt/disk07-live  ONLINE       0     0     0

smartctl output attached and and http://www.langille.org/tmp/ada0.txt

-- 
Dan Langille - http://langille.org/
-------------- next part --------------
$ sudo smartctl -a /dev/ada0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 7K2000
Device Model:     Hitachi HDS722020ALA330
Serial Number:    JK1131YAHLJWLV
Firmware Version: JKAOA28A
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Oct  2 21:59:03 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (22771) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   180   180   024    Pre-fail  Always       -       441 (Average 361)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       38
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1671
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       28
194 Temperature_Celsius     0x0002   181   181   000    Old_age   Always       -       33 (Lifetime Min/Max 25/42)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.