out of HDD space - zfs degraded

Dan Langille dan at langille.org
Sat Oct 2 23:23:21 UTC 2010


On 10/2/2010 6:36 PM, Jeremy Chadwick wrote:
> On Sat, Oct 02, 2010 at 06:09:25PM -0400, Dan Langille wrote:
>> On 10/2/2010 10:19 AM, Jeremy Chadwick wrote:
>>> On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote:
>>>> Overnight I was running a zfs send | zfs receive (both within the
>>>> same system / zpool).  The system ran out of space, a drive went off
>>>> line, and the system is degraded.
>>>>
>>>> This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18
>>>> 23:43:48 EDT 2010.
>>>>
>>>> The following logs are also available at
>>>> http://www.langille.org/tmp/zfs-space.txt <- no line wrapping
>>>>
>>>> This is what was running:
>>>>
>>>> # time zfs send storage/bacula@transfer | mbuffer | zfs receive
>>>> storage/compressed/bacula-mbuffer
>>>> in @  0.0 kB/s, out @  0.0 kB/s, 3670 GB total, buffer 100% full
>>>> cannot receive new filesystem stream: out of space
>>>> mbuffer: error: outputThread: error writing to <stdout> at offset
>>>> 0x395917c4000: Broken pipe
>>>>
>>>> summary: 3670 GByte in 10 h 40 min 97.8 MB/s
>>>> mbuffer: warning: error during output to <stdout>: Broken pipe
>>>> warning: cannot send 'storage/bacula@transfer': Broken pipe
>>>>
>>>> real    640m48.423s
>>>> user    8m52.660s
>>>> sys     211m40.862s
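>>>>
>>>> As a rough pre-flight check (just a sketch using standard zfs
>>>> properties), comparing the snapshot's referenced size against the
>>>> pool's available space would have flagged this before the 10-hour run:
>>>>
>>>> # zfs get -H -o value referenced storage/bacula@transfer
>>>> # zfs get -H -o value available storage
>>>>
>>>> If the first number is anywhere near the second, the receive will run
>>>> out of room -- especially here, where the copy lands in the same pool,
>>>> so the data exists twice until the original dataset is destroyed.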
>>>>
>>>>
>>>> Looking in the logs, I see this:
>>>>
>>>> Oct  2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device
>>>> Oct  2 00:50:54 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:55 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:56 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:57 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:58 kraken kernel: siisch0: Timeout on slot 30
>>>> Oct  2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss
>>>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>>>> Oct  2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT
>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>> path=/dev/gpt/disk06-live offset=270336 size=8192 error=6
>>>>
>>>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize
>>>> cache failed
>>>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry
>>>>
>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>> path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6
>>>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>>>> path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6
>>>>
>>>> $ zpool status
>>>>    pool: storage
>>>>   state: DEGRADED
>>>>   scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go
>>>> config:
>>>>
>>>>          NAME                 STATE     READ WRITE CKSUM
>>>>          storage              DEGRADED     0     0     0
>>>>            raidz2             DEGRADED     0     0     0
>>>>              gpt/disk01-live  ONLINE       0     0     0
>>>>              gpt/disk02-live  ONLINE       0     0     0
>>>>              gpt/disk03-live  ONLINE       0     0     0
>>>>              gpt/disk04-live  ONLINE       0     0     0
>>>>              gpt/disk05-live  ONLINE       0     0     0
>>>>              gpt/disk06-live  REMOVED      0     0     0
>>>>              gpt/disk07-live  ONLINE       0     0     0
>>>>
>>>> $ zfs list
>>>> NAME                        USED  AVAIL  REFER  MOUNTPOINT
>>>> storage                    6.97T  1.91T  1.75G  /storage
>>>> storage/bacula             4.72T  1.91T  4.29T  /storage/bacula
>>>> storage/compressed         2.25T  1.91T  46.9K  /storage/compressed
>>>> storage/compressed/bacula  2.25T  1.91T  42.7K  /storage/compressed/bacula
>>>> storage/pgsql              5.50G  1.91T  5.50G  /storage/pgsql
>>>>
>>>> $ sudo camcontrol devlist
>>>> Password:
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus2 target 0 lun 0 (pass1,ada1)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus3 target 0 lun 0 (pass2,ada2)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus4 target 0 lun 0 (pass3,ada3)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus5 target 0 lun 0 (pass4,ada4)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus6 target 0 lun 0 (pass5,ada5)
>>>> <Hitachi HDS722020ALA330 JKAOA28A>    at scbus7 target 0 lun 0 (pass6,ada6)
>>>> <ST380815AS 4.AAB>                   at scbus8 target 0 lun 0 (pass7,ada7)
>>>> <TSSTcorp CDDVDW SH-S223C SB01>      at scbus9 target 0 lun 0 (cd0,pass8)
>>>> <WDC WD1600AAJS-75M0A0 02.03E02>     at scbus10 target 0 lun 0 (pass9,ada8)
>>>>
>>>> I'm not yet sure if the drive is fully dead or not.  This is not a
>>>> hot-swap box.
>>>
>>> It looks to me like the disk labelled gpt/disk06-live literally stopped
>>> responding to commands.  The errors you see are coming from the OS and
>>> the siis(4) controller, and both indicate the actual hard disk isn't
>>> responding to the ATA command READ LOG EXT.  error=6 means Device not
>>> configured.
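>>>
>>> (If you want to double-check that mapping, errno 6 is ENXIO, and
>>> FreeBSD's errno.h carries exactly that wording:
>>>
>>> $ grep 'Device not configured' /usr/include/sys/errno.h
>>> #define ENXIO           6               /* Device not configured */
>>>
>>> i.e. the ZFS "error=6" lines just mean the kernel no longer has a
>>> configured device to hand the I/O to.)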
>>>
>>> I can't see how/why running out of space would cause this.  It looks
>>> more like you had a hardware issue of some sort during the course of
>>> the operations you were running.  It may not have shown up until now
>>> because you hadn't written to that area of the disk (there could be
>>> bad sectors there, or physical media/platter problems).
>>>
>>> Please provide smartctl -a output for the drive that's gpt/disk06-live,
>>> which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't
>>> it?  Sigh...).  Please put the results up on the web somewhere, not
>>> copy-pasted, otherwise I have to do a bunch of manual work with regard
>>> to line wrapping/etc...  I'll provide an analysis of SMART stats for
>>> you, to see if anything crazy happened to the disk itself.
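>>>
>>> Sketch of the correlation step, assuming smartmontools is installed
>>> and the label is still attached: glabel(8) shows which provider
>>> carries the label, and smartctl can then be pointed at the whole disk
>>> (substitute whichever adaX device the label maps to):
>>>
>>> $ glabel status | grep disk06
>>> $ smartctl -a /dev/adaX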
>>
>> It is ada0, I'm sure, based on the 'lost device' mentioned in
>> /var/log/messages above.
>>
>> I'm getting nowhere.  /dev/ada0 does not exist so there is nothing
>> for smartctl to work on.
>>
>> [...]
>>
>> $ ls -l /dev/ada0*
>> ls: /dev/ada0*: No such file or directory
>
> Okay, so gpt/disk06-live is /dev/ada0.  (I won't ask why the label is
> called "disk06", but whatever.  :-) )
>
>> I am tempted to reboot or do a camcontrol scan.
>
> DO NOT REBOOT.  You can try the following -- I'm not sure whether to
> use scbus0 or scbus1 as the argument however, since I don't know what
> scbusX number ada0 was attached to previously.  "dmesg" from when the
> machine booted would show this.
>
> camcontrol reset scbusX
> camcontrol rescan scbusX

I see this in /var/run/dmesg.boot:

ada0 at siisch0 bus 0 scbus0 target 0 lun 0
ada0: <Hitachi HDS722020ALA330 JKAOA28A> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
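
(For reference, I pulled those lines out with a plain grep -- assuming
the boot messages are still intact in /var/run/dmesg.boot:

$ grep '^ada0' /var/run/dmesg.boot

which is what ties ada0 to siisch0/scbus0.)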

$ sudo camcontrol reset scbus0
Password:
Reset of bus 0 was successful

$ sudo camcontrol rescan scbus0
Re-scan of bus 0 was successful

> If the disk comes back, please smartctl -a it.

It didn't come back:

$ ls /dev/ada*
/dev/ada1   /dev/ada2p1 /dev/ada4   /dev/ada5p1 /dev/ada7
/dev/ada1p1 /dev/ada3   /dev/ada4p1 /dev/ada6   /dev/ada8
/dev/ada2   /dev/ada3p1 /dev/ada5   /dev/ada6p1

FYI, there's nothing new in /var/log/messages as a result of those
commands.
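
Next step, I suppose, is to check whether CAM still sees anything on
that bus at all -- a sketch, since I haven't run it yet:

$ sudo camcontrol devlist -v

If scbus0 comes back with nothing attached to it, the drive itself (or
its cable/port) looks like the problem, not just a missing device node.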


-- 
Dan Langille - http://langille.org/

