Unplugging disk under ZFS yield panic

Wed Jan 11 20:40:43 UTC 2012

On Wed, Jan 11, 2012 at 09:07:08PM +0100, Fabian Keil wrote:
> Gergely CZUCZY <phoemix at harmless.hu> wrote:
> 
> > I'd like to ask whether it is normal behaviour that, after unplugging a
> > disk under a ZFS system, the first write causes a kernel panic.
> 
> Sounds familiar. I currently have two PRs open for
> reproducible kernel panics after a vdev gets lost:
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/162010
> http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/162036
> 
> Note that the pool layouts are different, though.

Is this problem truly ZFS-specific?  I'd been tracking this problem for
years, and was told it was fixed:

http://wiki.freebsd.org/BugBusting/Commonly_reported_issues

* Panic occurs when a mounted device (USB, SATA, local image file,
  etc.) is removed

  Workaround: Be sure to umount all filesystems before removing the
  physical device
  Partial fix: Committed to CURRENT (8.0) on/prior to 2008/02/21

  There is ongoing work to fully fix this problem, ETA 2009/02

OP, please provide a kernel backtrace.

Otherwise, if needed, I can go yank one of the two mirrored disks out of
my FreeBSD box at home to try and reproduce the problem.
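
For anyone gathering the kernel backtrace requested above, here is a
minimal sketch of the usual crash dump setup. It assumes the swap device
is large enough to hold the dump and that /var/crash has free space;
neither is confirmed for the OP's system, and the values shown are the
common ones, not ones taken from this thread.

```shell
# /etc/rc.conf fragment (rc.conf uses plain sh variable syntax).
# Assumption: swap is big enough to hold a kernel dump.
dumpdev="AUTO"        # dump to the first suitable swap device on panic
dumpdir="/var/crash"  # where savecore(8) stores vmcore.N on the next boot
```

After the next panic and reboot, savecore(8) should leave a vmcore.N
file in /var/crash. Running "kgdb /boot/kernel/kernel /var/crash/vmcore.0"
and typing "bt" at the prompt should then print the backtrace, provided
the kernel on disk is the one that panicked.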

  pool: data
 state: ONLINE
 scan: scrub repaired 0 in 1h17m with 0 errors on Thu Dec 29 12:05:05 2011
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
        cache
          ada4      ONLINE       0     0     0

ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <WDC WD1002FAEX-00Z3A0 05.01D05> ATA-8 SATA 3.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <WDC WD1002FAEX-00Z3A0 05.01D05> ATA-8 SATA 3.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)

ahci0: <Intel ICH9 AHCI SATA controller> port 0x1c50-0x1c57,0x1c44-0x1c47,0x1c48-0x1c4f,0x1c40-0x1c43,0x18e0-0x18ff mem 0xdc000800-0xdc000fff irq 17 at device 31.2 on pci0
ahci0: [ITHREAD]
ahci0: AHCI v1.20 with 6 3Gbps ports, Port Multiplier supported
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich1: [ITHREAD]
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich3: [ITHREAD]

> > The hardware is a Supermicro X8DTH-i/6/iF/6F board with 2x LSI 2008
> > Fusion-MPT SAS-2 controllers, using the mps(4) driver. The disks are
> > accessed over gmultipath, and the multipath'd devices are added to
> > ZFS mirrors:
> > DB
> >  mirror-0
> >   multipath/DB01
> >   multipath/DB02
> >  mirror-1
> >   multipath/DB03
> >   multipath/DB04
> >  logs
> >   mirror/host1p5
> >  cache
> >   multipath/SSD03p1
> >  spares
> >   multipath/DB05
> > 
> > System is 9.0-RELEASE
> > 
> > I've unplugged DB03 and on the first write we got a kernel panic.
> > Is this normal behaviour, or are we missing something here?
> 
> Without a back trace or at least the panic reason one can only
> speculate what's going on, but I think it's rather unlikely
> that the panic is the intended behaviour and not a bug.
> 
> Maybe you can gather some additional information and file a PR?
> 
> > On a device removal we'd expect it to move to the spare disk, or
> > to use the available redundant disks.
> 
> I agree that this behaviour would be preferable to a panic.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |