immense delayed write to file system (ZFS and UFS2), performance issues

Tue Jan 26 13:57:24 UTC 2010

On Tue, 19 Jan 2010 03:24:49 -0800 Jeremy Chadwick
<freebsd at jdc.parodius.com> wrote about Re: immense delayed write to file
system (ZFS and UFS2), performance issues:

JC> So which drive models above are experiencing a continual increase in
JC> SMART attribute 193 (Load Cycle Count)?  My guess is that some of the
JC> WD Caviar Green models, and possibly all of the RE2-GP and RE4-GP
JC> models are experiencing this problem.

Just to add some more info:
I contacted WD support about the problem with the RE4 drives and today
received a firmware update by email that is supposed to fix it. I have
not tried it yet, though.

I am still busy replacing the RE2 disks with updated drives. While doing
so I came across a very strange thing with ZFS. I started with the
following pool layout:

mclane# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
        spares
          ad14      AVAIL   

errors: No known data errors

All disks still have the firmware bug, so I want to replace them with
disks that I have already fixed. I put in an updated drive as ad18 and
wanted to replace ad12 to get the drive with the broken firmware out:

mclane# zpool replace tank /dev/ad12 /dev/ad18 
mclane# zpool status
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.01% done, 52h51m to go
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          raidz1       ONLINE       0     0     0
            ad8        ONLINE       0     0     0  7.21M resilvered
            ad10       ONLINE       0     0     0  7.22M resilvered
            replacing  ONLINE       0     0     0
              ad12     ONLINE       0     0     0
              ad18     ONLINE       0     0     0  10.7M resilvered
        spares
          ad14         AVAIL   

errors: No known data errors

However, something must have gone wrong during the resilvering process and
it now looks like this:

mclane# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 2h39m with 0 errors on Tue Jan 26 14:00:00 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            ad8        ONLINE       0     0     0  975M resilvered
            ad10       ONLINE       0     0   142  974M resilvered
            replacing  DEGRADED     0 7.25M     0
              ad12     ONLINE       0     0     0
              ad18     REMOVED      0     1     0  79.4M resilvered
        spares
          ad14         AVAIL   

errors: No known data errors

What is going on here? ad18 evidently detached during the resilver.
/var/log/messages only shows:

Jan 26 11:23:33 mclane kernel: ad18: FAILURE - device detached

Additionally, ad10 evidently produced checksum errors. What do I do about
the degraded replace operation? Can I terminate it somehow and maybe
replace ad10 first? Any other hints?
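
To pick the affected devices out of a status listing like the one above, a
small filter can help. This is only a sketch: zpool_problems is a
hypothetical helper name, not a ZFS command, and it assumes the
NAME/STATE/READ/WRITE/CKSUM column layout shown in the listings above.

```sh
#!/bin/sh
# Hypothetical helper (not part of ZFS): read `zpool status` output on
# stdin and print every vdev line whose state is neither ONLINE nor AVAIL,
# or whose READ/WRITE/CKSUM counters are not all "0".
zpool_problems() {
    awk '
        # The device table starts after the NAME/STATE header line.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # The trailing "errors:" summary line ends the table.
        $1 == "errors:" { in_table = 0 }
        # Skip blank lines and the bare "spares" heading (fewer than 2 fields).
        in_table && NF >= 2 {
            bad_state = ($2 != "ONLINE" && $2 != "AVAIL")
            # Spare entries carry no counters, so only check when present.
            bad_count = (NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0"))
            if (bad_state || bad_count) {
                if (NF >= 5) print $1, $2, $3, $4, $5
                else         print $1, $2
            }
        }
    '
}
```

On a live system it would be used as "zpool status tank | zpool_problems".
It only summarises what the listing already says and does not change the
pool in any way.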

cu
  Gerrit