ZFS raidz device replacement problem
Barry Pederson
bp at barryp.org
Sun Apr 15 15:39:09 UTC 2007
Pawel Jakub Dawidek wrote:
> On Fri, Apr 13, 2007 at 11:06:16PM -0500, Barry Pederson wrote:
>> I've been playing with ZFS (awesome stuff, thanks PJD) and noticed
>> something funny when replacing a device under a raidz pool. It seems
>> that even though ZFS says resilvering is complete, you still need to
>> manually do a "zpool scrub" to really get the pool into a good state.
>
> How do you tell it's not in a good state?
I took this last output showing cksum errors on the replaced device to
mean that the device wasn't really synced up with the rest of the raidz
members - it didn't have the data on it that ZFS expected it to.
>> # zpool status mypool
>>   pool: mypool
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>         attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-9P
>>  scrub: scrub completed with 0 errors on Fri Apr 13 22:43:46 2007
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         mypool      ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             md0     ONLINE       0     0     0
>>             md1     ONLINE       0     0     0
>>             md3     ONLINE       0     0     5
>
> If you are referring to this CKSUM count not being 0, this was a bug in
> ZFS itself; it was already fixed in OpenSolaris, and the fix was merged
> to FreeBSD.
OK, I'll update and try again.
But what got me going on this was that if you *don't* do the scrub after
the replace, but instead do something destructive to one of the
non-replaced devices, like:

    dd if=/dev/random bs=1m count=64 oseek=1 conv=notrunc of=/tmp/foo1

and *then* do a scrub followed by "zpool status -v mypool", you get this
really alarming message about data corruption:
---------------------
zpool status -v mypool
  pool: mypool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 184 errors on Sun Apr 15 09:55:43 2007
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0   761
          raidz1    ONLINE       0     0   761
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0   521
            md3     ONLINE       0     0    25

errors: Permanent errors have been detected in the following files:

        mypool:<0x100>
        mypool:<0x104>
        mypool:<0x105>
        .
        .
---------------------
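
In case it helps to see the whole thing in one place, the sequence I'm
talking about is roughly the following (the file sizes, md unit numbers,
test data, and which device gets replaced are just examples, not exactly
what I ran):

---------------------
# file-backed memory disks to use as pool members
truncate -s 256m /tmp/foo0 /tmp/foo1 /tmp/foo2 /tmp/foo3
mdconfig -a -t vnode -f /tmp/foo0 -u 0
mdconfig -a -t vnode -f /tmp/foo1 -u 1
mdconfig -a -t vnode -f /tmp/foo2 -u 2

# raidz pool with some real data in it
zpool create mypool raidz md0 md1 md2
cp -R /usr/share/doc /mypool/

# replace one member and let the resilver finish
mdconfig -a -t vnode -f /tmp/foo3 -u 3
zpool replace mypool md2 md3
zpool status mypool        # reports the resilver as completed

# *without* scrubbing first, damage the backing file of a non-replaced member
dd if=/dev/random bs=1m count=64 oseek=1 conv=notrunc of=/tmp/foo1

# now scrub and look at the result
zpool scrub mypool
zpool status -v mypool
---------------------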
I suppose you have to actually have some files in /mypool for this to
show up.
(Are there supposed to be real filenames in the message instead of
things like "<0x100>"?)
Now that I look at this more closely and run "diff -r" between the files
I copied into the pool and the originals, I don't see any differences - so
ZFS's claim to have detected actual corrupted files seems wrong there too.
("zpool clear mypool" doesn't make that "data corruption" warning go
away; it only resets the cksum counts.)
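
Concretely, what I'm checking afterwards is along these lines (the
test-data path is just from the example sequence above):

---------------------
# compare the data in the pool against the originals
diff -r /usr/share/doc /mypool/doc     # reports no differences

# clearing only resets the counters, not the "Permanent errors" list
zpool clear mypool
zpool status -v mypool
---------------------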
I'll keep poking at it.
Barry