ZFS: how to replace a dead disk?

Fri May 28 17:11:30 UTC 2010

On Fri, May 28, 2010 at 11:34:23AM -0500, James R. Van Artsdalen wrote:
> Jeremy Chadwick wrote:
> > On Fri, May 28, 2010 at 08:36:38AM -0500, James R. Van Artsdalen wrote:
> >> What's the right way to replace a dead disk under ZFS?
> >>
> >>         replacing   DEGRADED     0     0     0
> >>           ada1/old  UNAVAIL      0  256K     0  cannot open
> >>           ada1      ONLINE       0     0     0  1.47T resilvered
> >> ---
> >>
> >> It says "replacing" and that the device, vdev and pool are degraded, yet
> >> the "resilver" finished hours ago.  I cannot detach the ada1/old entry.
> >>
> >> Is there some other command I should have used to remove the dead ada1
> >> device?
> >
> > What version of FreeBSD?  Please provide uname -a output and not "8.0"
> > or something equally as terse.
> >
> > Some clarification: you didn't remove the device, you simply told ZFS to
> > assume that the device had been replaced.
> >
> > What did you do (both physically and software/command-line-wise) *prior*
> > to issuing "zpool replace jwrc ada1"?
> >
> Sorry: my original note contained version information but that isn't in
> your reply?
> 
> FreeBSD cyclone 9.0-CURRENT FreeBSD 9.0-CURRENT #2 r206111: Fri Apr  2
> 13:47:20 CDT 2010    
> root at cyclone.housenet.jrv:/usr/obj/usr/src/sys/GENERIC  amd64

The only line sent to the list was: "This kernel 206111, roughly April
1, on amd64".  Here's verification:

http://lists.freebsd.org/pipermail/freebsd-fs/2010-May/008592.html

So now we know you're running HEAD.

> The original disk is no longer usable by FreeBSD in any way: it returned
> a stream of errors & noise on its port in a way that left the system
> unable to boot.  I physically replaced that disk with a new disk before
> attempting the "zpool replace"
> 
> No actions were taken prior to replacing the disk.  I went to the site
> to see why the server was unresponsive, saw that one drive was
> problematic by watching the activity LEDs, physically replaced that
> disk, booted, and logically replaced that disk with "zpool replace jwrc
> ada1"

I think the procedure you executed might be the problem.  The steps I've
used in the past, 100% reliably, with ata(4) and AHCI on an ICH7 and
ICH9 are:

 1. zpool offline pool adX
 2. atacontrol list (to find the ataX device number)
 3. atacontrol detach ataX
 4. dmesg (verify the detach worked)
 5. Physically remove the disk (must be in a hot-swap enclosure)
 6. Physically insert the new disk
 7. atacontrol attach ataX
 8. dmesg (to determine what the adX drive number is; on my systems
    the adX drive number remains static/does not change)
 9. zpool online pool adX
10. zpool replace pool adX
11. zpool status  (watch until finished)
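
For reference, the sequence above can be sketched as a dry-run shell
function. This is only an illustration, not something I have run against
your pool: plan_replace is a hypothetical helper name, and the pool, adX
and ataX values are placeholders you must supply. It prints the commands
rather than executing them, so they can be reviewed and then typed one
at a time.

```shell
#!/bin/sh
# Dry run only: prints the replacement sequence, executes nothing.
# plan_replace is a hypothetical helper; its arguments are placeholders:
#   $1 = pool name, $2 = adX disk name, $3 = ataX channel name
plan_replace() {
    pool=$1 disk=$2 channel=$3
    cat <<EOF
zpool offline $pool $disk
atacontrol list
atacontrol detach $channel
dmesg | tail
# physically swap the disk now (hot-swap enclosure required)
atacontrol attach $channel
dmesg | tail
zpool online $pool $disk
zpool replace $pool $disk
zpool status $pool
EOF
}
```

For example, "plan_replace jwrc ad1 ata1" prints the ten lines for a
pool named jwrc; the ad1/ata1 pairing there is assumed, so check it
against the atacontrol list output first.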

This follows the Solaris ZFS Administration Guide, except that
atacontrol(8) is used instead of cfgadm(1M).  See Example 11-1:

http://docs.sun.com/app/docs/doc/819-5461/gbbzy?l=en&a=view

The same procedure should ideally be followed with ahci.ko + CAM, using
camcontrol devlist/eject, camcontrol rescan (may not be needed, but use
devlist to verify that the kernel noticed the removal/addition), and
camcontrol load.
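
A rough dry-run sketch of the CAM variant follows. The ordering is an
assumption that mirrors the ata(4) steps, I have not verified it, and
plan_replace_cam is a hypothetical helper name. The pool and disk
arguments are placeholders (with ahci.ko the disk would typically show
up as adaX). As before, it only prints commands and runs nothing.

```shell
#!/bin/sh
# Dry run only: prints a possible CAM-based sequence, executes nothing.
# plan_replace_cam is a hypothetical helper; the step order is assumed.
#   $1 = pool name, $2 = disk device name (placeholder, e.g. adaX)
plan_replace_cam() {
    pool=$1 disk=$2
    cat <<EOF
zpool offline $pool $disk
camcontrol devlist
camcontrol eject $disk
camcontrol devlist
# physically swap the disk now (hot-swap enclosure required)
camcontrol rescan all
camcontrol load $disk
camcontrol devlist
zpool online $pool $disk
zpool replace $pool $disk
zpool status $pool
EOF
}
```

The repeated devlist calls are there to confirm that the kernel noticed
the removal and the addition; the rescan may turn out to be unnecessary.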

If you'd like me to verify and demonstrate this on FreeBSD (RELENG_8
only, however -- I don't run CURRENT) I can do so.  I can also do the
same thing with ahci.ko + CAM.  Just let me know.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |