Shooting yourself in the foot with ZFS is quite easy

Dmitry Marakasov amdmi3 at amdmi3.ru
Mon Sep 24 18:23:50 PDT 2007


Hi!

I've been playing with ZFS in qemu, and I think I've found a flaw in
the logic which can lead to a shoot-yourself-in-the-foot situation that
could easily be avoided.

First of all, I constructed a raidz array:

---
# mdconfig -a -tswap -s64m
md0
# mdconfig -a -tswap -s64m
md1
# mdconfig -a -tswap -s64m
md2
# zpool create pool raidz md{0,1,2}
---

Next, I brought one of the devices offline and overwrote part of it.
Let's imagine I needed some space in an emergency.

---
# zpool offline pool md0
Bringing device md0 offline
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            md0     OFFLINE      0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
# dd if=/dev/zero of=/dev/md0 bs=1m count=1
1+0 records in
1+0 records out
1048576 bytes transferred in 0.084011 secs (12481402 bytes/sec)
---

Now, how do I put md0 back into the pool?
`zpool online pool md0' seems reasonable, and the pool will repair
itself on a scrub, but I'm paranoid and I want the data on md0 to be
recreated completely.
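(For completeness, that non-paranoid path would presumably be just the
following; I haven't included its output here:)

---
# zpool online pool md0
# zpool scrub pool
---

But trying to replace the offline device with itself fails: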

---
# zpool replace pool md0
cannot replace md0 with md0: md0 is busy
# zpool replace -f pool md0
cannot replace md0 with md0: md0 is busy
---

It seems to be looking at the on-disk data (leftover ZFS metadata) and
concluding that the device is still part of the pool, because if I erase
the whole device with dd, it treats md0 as a new disk and replaces it
without problems:

---
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 10.154127 secs (6609023 bytes/sec)
# zpool replace pool md0
# zpool status
...
        NAME           STATE     READ WRITE CKSUM
        pool           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            replacing  DEGRADED     0     0     0
              md0/old  OFFLINE      0     0     0
              md0      ONLINE       0     0     0
            md1        ONLINE       0     0     0
            md2        ONLINE       0     0     0
...
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
---

This behaviour is, I think, undesirable: one should be able to replace
an offline device with itself at any time.
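In other words, I would expect the following sequence (the same steps as
above) to just work:

---
# zpool offline pool md0
# dd if=/dev/zero of=/dev/md0 bs=1m count=1
# zpool replace pool md0
---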

What is worse:

---
# zpool offline pool md0
Bringing device md0 offline
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 8.076568 secs (8309082 bytes/sec)
# zpool online pool md0
Bringing device md0 online
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:21:49 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool replace pool md0
invalid vdev specification
use '-f' to override the following errors:
md0 is in use (r1w1e1)
# zpool replace -f pool md0
invalid vdev specification
the following errors must be manually repaired:
md0 is in use (r1w1e1)
# zpool scrub pool
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:22:22 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool offline md0
missing device name
usage:
        offline [-t] <pool> <device> ...
# zpool offline pool md0
cannot offline md0: no valid replicas
# mdconfig -du0
mdconfig: ioctl(/dev/mdctl): Device busy
---

This is very confusing: md0 is UNAVAIL, but the config table says the
pool is ONLINE (not DEGRADED!), while the status text says it is running
in a degraded state. And I can neither take the device offline nor
replace it with itself (though replacing it with an identical md3
worked, as sketched below).
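(For reference, the replacement that did work was along these lines;
output omitted:)

---
# mdconfig -a -tswap -s64m
md3
# zpool replace pool md0 md3
---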

In my opinion such a situation should not be possible. First of all,
zpool's behaviour with one of the disks in the UNAVAIL state looks like
a clear bug (the array is shown as ONLINE, the unavailable device cannot
be taken offline, etc.). Also, ZFS should not trust any on-disk contents
after a disk is brought online; the best solution would be to completely
recreate the ZFS data structures on the disk in that case. This would
fix all of the following:

1) `zpool replace <pool> <disk currently offline>' will no longer claim
that the offline disk is busy.
2) There will be no need to wipe the whole disk with dd just to recreate
it (in the meantime, a lighter-weight workaround is sketched after this
list).
3) `zpool online <pool> <disk currently offline with erased contents>'
will no longer lead to the UNAVAIL state.
4) It would also close other potential problems with the current
behaviour: for example, what happens if I replace a disk in a raidz with
another disk that was previously used in a different raidz?
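Regarding the workaround for (2): if I understand the on-disk format
correctly, ZFS keeps four 256K vdev labels, two at the front and two at
the back of the device, so clearing just those areas should be enough to
make the disk look new. For the 64M md devices used here that would be
something like (untested):

---
# dd if=/dev/zero of=/dev/md0 bs=256k count=2            # two labels at the front
# dd if=/dev/zero of=/dev/md0 bs=256k seek=254 count=2   # two labels at the back
---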

As I understand it, `zpool offline'/`zpool online' on a disk currently
leads to it being resilvered anyway?

-- 
Best regards,
  Dmitry Marakasov               mailto:amdmi3 at amdmi3.ru

