out of HDD space - zfs degraded
Jeremy Chadwick
freebsd at jdc.parodius.com
Sat Oct 2 23:50:26 UTC 2010
On Sat, Oct 02, 2010 at 07:23:16PM -0400, Dan Langille wrote:
> On 10/2/2010 6:36 PM, Jeremy Chadwick wrote:
> >On Sat, Oct 02, 2010 at 06:09:25PM -0400, Dan Langille wrote:
> >>On 10/2/2010 10:19 AM, Jeremy Chadwick wrote:
> >>>On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote:
> >>>>Overnight I was running a zfs send | zfs receive (both within the
> >>>>same system / zpool). The system ran out of space, a drive went off
> >>>>line, and the system is degraded.
> >>>>
> >>>>This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18
> >>>>23:43:48 EDT 2010.
> >>>>
> >>>>The following logs are also available at
> >>>>http://www.langille.org/tmp/zfs-space.txt<- no line wrapping
> >>>>
> >>>>This is what was running:
> >>>>
> >>>># time zfs send storage/bacula at transfer | mbuffer | zfs receive
> >>>>storage/compressed/bacula-mbuffer
> >>>>in @ 0.0 kB/s, out @ 0.0 kB/s, 3670 GB total, buffer 100%
> >>>>fullcannot receive new filesystem stream: out of space
> >>>>mbuffer: error: outputThread: error writing to<stdout> at offset
> >>>>0x395917c4000: Broken pipe
> >>>>
> >>>>summary: 3670 GByte in 10 h 40 min 97.8 MB/s
> >>>>mbuffer: warning: error during output to<stdout>: Broken pipe
> >>>>warning: cannot send 'storage/bacula at transfer': Broken pipe
> >>>>
> >>>>real 640m48.423s
> >>>>user 8m52.660s
> >>>>sys 211m40.862s
> >>>>
> >>>>
> >>>>Looking in the logs, I see this:
> >>>>
> >>>>Oct 2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device
> >>>>Oct 2 00:50:54 kraken kernel: siisch0: Timeout on slot 30
> >>>>Oct 2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss
> >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
> >>>>Oct 2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT
> >>>>Oct 2 00:50:55 kraken kernel: siisch0: Timeout on slot 30
> >>>>Oct 2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss
> >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
> >>>>Oct 2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT
> >>>>Oct 2 00:50:56 kraken kernel: siisch0: Timeout on slot 30
> >>>>Oct 2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss
> >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
> >>>>Oct 2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT
> >>>>Oct 2 00:50:57 kraken kernel: siisch0: Timeout on slot 30
> >>>>Oct 2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss
> >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
> >>>>Oct 2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT
> >>>>Oct 2 00:50:58 kraken kernel: siisch0: Timeout on slot 30
> >>>>Oct 2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss
> >>>>40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
> >>>>Oct 2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT
> >>>>Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
> >>>>path=/dev/gpt/disk06-live offset=270336 size=8192 error=6
> >>>>
> >>>>Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize
> >>>>cache failed
> >>>>Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry
> >>>>
> >>>>Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
> >>>>path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6
> >>>>Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
> >>>>path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6
> >>>>
> >>>>$ zpool status
> >>>> pool: storage
> >>>> state: DEGRADED
> >>>> scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go
> >>>>config:
> >>>>
> >>>> NAME STATE READ WRITE CKSUM
> >>>> storage DEGRADED 0 0 0
> >>>> raidz2 DEGRADED 0 0 0
> >>>> gpt/disk01-live ONLINE 0 0 0
> >>>> gpt/disk02-live ONLINE 0 0 0
> >>>> gpt/disk03-live ONLINE 0 0 0
> >>>> gpt/disk04-live ONLINE 0 0 0
> >>>> gpt/disk05-live ONLINE 0 0 0
> >>>> gpt/disk06-live REMOVED 0 0 0
> >>>> gpt/disk07-live ONLINE 0 0 0
> >>>>
> >>>>$ zfs list
> >>>>NAME USED AVAIL REFER MOUNTPOINT
> >>>>storage 6.97T 1.91T 1.75G /storage
> >>>>storage/bacula 4.72T 1.91T 4.29T /storage/bacula
> >>>>storage/compressed 2.25T 1.91T 46.9K /storage/compressed
> >>>>storage/compressed/bacula 2.25T 1.91T 42.7K /storage/compressed/bacula
> >>>>storage/pgsql 5.50G 1.91T 5.50G /storage/pgsql
> >>>>
> >>>>$ sudo camcontrol devlist
> >>>>Password:
> >>>><Hitachi HDS722020ALA330 JKAOA28A> at scbus2 target 0 lun 0 (pass1,ada1)
> >>>><Hitachi HDS722020ALA330 JKAOA28A> at scbus3 target 0 lun 0 (pass2,ada2)
> >>>><Hitachi HDS722020ALA330 JKAOA28A> at scbus4 target 0 lun 0 (pass3,ada3)
> >>>><Hitachi HDS722020ALA330 JKAOA28A> at scbus5 target 0 lun 0 (pass4,ada4)
> >>>><Hitachi HDS722020ALA330 JKAOA28A> at scbus6 target 0 lun 0 (pass5,ada5)
> >>>><Hitachi HDS722020ALA330 JKAOA28A> at scbus7 target 0 lun 0 (pass6,ada6)
> >>>><ST380815AS 4.AAB> at scbus8 target 0 lun 0 (pass7,ada7)
> >>>><TSSTcorp CDDVDW SH-S223C SB01> at scbus9 target 0 lun 0 (cd0,pass8)
> >>>><WDC WD1600AAJS-75M0A0 02.03E02> at scbus10 target 0 lun 0 (pass9,ada8)
> >>>>
> >>>>I'm not yet sure if the drive is fully dead or not. This is not a
> >>>>hot-swap box.
> >>>
> >>>It looks to me like the disk labelled gpt/disk06-live literally stopped
> >>>responding to commands. The errors you see are coming from the OS and
> >>>the siis(4) controller, and both indicate the actual hard disk isn't
> >>>responding to the ATA command READ LOG EXT. error=6 means Device not
> >>>configured.
> >>>
> >>>I can't see how/why running out of space would cause this. It looks
> >>>more like that you had a hardware issue of some sort happen during the
> >>>course of the operations you were running. It may not have happened
> >>>until now because you hadn't utilised writes to that area of the disk
> >>>(could have bad sectors there, or physical media/platter problems).
> >>>
> >>>Please provide smartctl -a output for the drive that's gpt/disk06-live,
> >>>which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't
> >>>it? Sigh...). Please put the results up on the web somewhere, not
> >>>copy-pasted, otherwise I have to do a bunch of manual work with regarsd
> >>>to line wrapping/etc... I'll provide an analysis of SMART stats for
> >>>you, to see if anything crazy happened to the disk itself.
> >>
> >>It is ada0, I'm sure, based on the 'lost device' mentioned in
> >>/var/log/messages above.
> >>
> >>I'm getting nowhere. /dev/ada0 does not exist so there is nothing
> >>for smartctl to work on.
> >>
> >>[...]
> >>
> >>$ ls -l /dev/ada0*
> >>ls: /dev/ada0*: No such file or directory
> >
> >Okay, so gpt/disk06-live is /dev/ada0. (I won't ask why the label is
> >called "disk06", but whatever. :-) )
> >
> >>I am tempted to reboot or do a camontrol scan.
> >
> >DO NOT REBOOT.You can try the following -- I'm not sure whether to
> >use scbus0 or scbus1 as the argument however, since I don't know what
> >scbusX number ada0 was attached to previously. "dmesg" from when the
> >machine booted would show this.
> >
> >camcontrol reset scbusX
> >camcontrol rescan scbusX
>
> I see this in /var/run/dmesg.boot:
>
> ada0 at siisch0 bus 0 scbus0 target 0 lun 0
> ada0: <Hitachi HDS722020ALA330 JKAOA28A> ATA-8 SATA 2.x device
> ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada0: Command Queueing enabled
>
> $ sudo camcontrol reset scbus0
> Password:
> Reset of bus 0 was successful
>
> $ sudo camcontrol rescan scbus0
> Re-scan of bus 0 was successful
>
> >If the disk comes back, please smartctl -a it.
>
> I didn't come back:
>
> $ ls /dev/ada*
> /dev/ada1 /dev/ada2p1 /dev/ada4 /dev/ada5p1 /dev/ada7
> /dev/ada1p1 /dev/ada3 /dev/ada4p1 /dev/ada6 /dev/ada8
> /dev/ada2 /dev/ada3p1 /dev/ada5 /dev/ada6p1
>
> FYI, there's nothing new in /var/log/messages as a results of those
> commands.
Then I would recommend power-cycling (not rebooting or pressing of the
reset button) the machine. There's a good chance the ada0 disk has
fallen off the bus and needs a full power-cycle, since a LUN scan didn't
result in its reappearance.
I see this kind of problem on a weekly basis at my workplace, in 3
different datacenters, with Fujitsu SCSI-3 disks. A system reboot
doesn't make the disk reappear on on the bus, nor does a reset (pressing
of the reset button). Only a full power-cycle works. And when I say
weekly, I'm not exaggerating.
I realise your disks are Hitachi not Fujitsu, and are SATA not SCSI, but
it really doesn't matter -- there are cases where the drive firmware is
wedged so hard that a physical power-cycle is required.
If a power-cycle works, smartctl -a /dev/ada0 the disk and save the
SMART stats somewhere. If the same disk fails in this way again, I
strongly recommend advance RMA'ing it (to ensure you get a completely
different disk).
Good luck!
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
More information about the freebsd-stable
mailing list