ZFS weird device tasting loop since MFC
Kip Macy
kmacy at freebsd.org
Fri Jun 5 11:28:40 UTC 2009
Must be a weird geom interaction. I don't see this with raw disk. I'll
look at it eventually but UMA and performance are further up in the
queue.
-Kip
On Fri, Jun 5, 2009 at 1:44 AM, Ulrich Spörlein <uqs at spoerlein.net> wrote:
> On Tue, 02.06.2009 at 11:24:08 +0200, Ulrich Spörlein wrote:
>> On Tue, 02.06.2009 at 11:16:10 +0200, Ulrich Spörlein wrote:
>> > Hi all,
>> >
>> > so I went ahead and updated my ~7.2 file server to the new ZFS goodness,
>> > and before running any further tests, I already discovered something
>> > weird and annoying.
>> >
>> > I'm using a mirror on GELI where one disk is usually *not* attached, as
>> > a poor man's backup. (I had to go that route because send/recv of
>> > snapshots frequently deadlocked the system, whereas scrubbing the mirror
>> > did not.)
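>> >
>> > For reference, the rotation is roughly the following sequence (a sketch
>> > only; the geli passphrase/keyfile handling is whatever was configured at
>> > 'geli init' time, and the device names are the ones used in this pool):
>> >
>> >   # plug in the backup disk, decrypt it, and let the mirror catch up
>> >   geli attach /dev/da0          # creates /dev/da0.eli
>> >   zpool online tank da0.eli     # ZFS resilvers the missing vdev
>> >
>> >   # once 'zpool status' shows the resilver is done, detach it again
>> >   zpool offline tank da0.eli
>> >   geli detach da0.eli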
>> >
>> > root@coyote:~# zpool status
>> >   pool: tank
>> >  state: DEGRADED
>> > status: The pool is formatted using an older on-disk format.  The pool can
>> >         still be used, but some features are unavailable.
>> > action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
>> >         pool will no longer be accessible on older software versions.
>> >  scrub: none requested
>> > config:
>> >
>> >         NAME                      STATE     READ WRITE CKSUM
>> >         tank                      DEGRADED     0     0     0
>> >           mirror                  DEGRADED     0     0     0
>> >             ad4.eli               ONLINE       0     0     0
>> >             12333765091756463941  REMOVED      0     0     0  was /dev/da0.eli
>> >
>> > errors: No known data errors
>> >
>> > When the pool is imported, there is constant "tasting" of all devices in
>> > the system; it also keeps the floppy drive spinning, which is really
>> > annoying. It did not do this with the old ZFS; are there any remedies?
>> >
>> > gstat(8) is displaying the following every other second, together with a
>> > spinning fd0 drive.
>> >
>> > dT: 1.010s  w: 1.000s  filter: ^...$
>> >  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>> >     0      0      0      0    0.0      0      0    0.0    0.0| fd0
>> >     0      8      8   1014    0.1      0      0    0.0    0.1| md0
>> >     0     32     32   4055    9.2      0      0    0.0   29.2| ad0
>> >     0     77     10   1267    7.1     63   1125    2.3   31.8| ad4
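>> >
>> > (For the record, that output presumably comes from gstat with a regex
>> > filter on the provider names, i.e. something like
>> >
>> >   gstat -f '^...$'
>> >
>> > which limits the display to the three-character device names.)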
>> >
>> > There is no activity going on (md0, in particular, only backs /tmp), yet
>> > something constantly issues reads to every device. I will now insert the
>> > second drive and see if ZFS shuts up then ...
>>
>> It did, but it also did not start resilvering the second disk:
>>
>> root@coyote:~# zpool status
>>   pool: tank
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>         attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-9P
>>  scrub: none requested
>> config:
>>
>>         NAME         STATE     READ WRITE CKSUM
>>         tank         ONLINE       0     0     0
>>           mirror     ONLINE       0     0     0
>>             ad4.eli  ONLINE       0     0     0
>>             da0.eli  ONLINE       0     0    16
>>
>> errors: No known data errors
>>
>> Will now run the scrub and report back in 6-9h.
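>>
>> (For reference, that is simply:
>>
>>   zpool scrub tank
>>
>> and progress can be watched with 'zpool status tank', which shows the
>> percentage done and an estimated time remaining while the scrub runs.)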
>
> Another data point: while the floppy tasting has stopped now that the mirror
> sees all its devices again, there is another problem here:
>
> root@coyote:/# zpool online tank da0.eli
> root@coyote:/# zpool status
>   pool: tank
>  state: ONLINE
>  scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         tank         ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             ad4.eli  ONLINE       0     0     0  684K resilvered
>             da0.eli  ONLINE       0     0     0  2.20M resilvered
>
> errors: No known data errors
> root@coyote:/# zpool offline tank da0.eli
> root@coyote:/# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>  scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         tank         DEGRADED     0     0     0
>           mirror     DEGRADED     0     0     0
>             ad4.eli  ONLINE       0     0     0  684K resilvered
>             da0.eli  OFFLINE      0     0     0  2.20M resilvered
>
> errors: No known data errors
> root@coyote:/# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         tank         DEGRADED     0     0     0
>           mirror     DEGRADED     0     0     0
>             ad4.eli  ONLINE       0     0     0  684K resilvered
>             da0.eli  OFFLINE      0   339     0  2.20M resilvered
>
> errors: No known data errors
> root@coyote:/# zpool status
>   pool: tank
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>  scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         tank         DEGRADED     0     0     0
>           mirror     DEGRADED     0     0     0
>             ad4.eli  ONLINE       0     0     0  684K resilvered
>             da0.eli  OFFLINE      0     0     0  2.20M resilvered
>
> errors: No known data errors
>
>
> So I ran 'zpool status' three times after the offline, and the second run
> reports write errors on the OFFLINE device (WTF?). When running zpool status
> in a loop, these errors constantly show up and then vanish again.
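>
> (The loop was just something along these lines; the grep is only there to
> pick out the da0.eli line:
>
>   while true; do
>       zpool status tank | grep da0.eli   # watch the OFFLINE vdev's counters
>       sleep 1
>   done
>
> and every few iterations the WRITE column for da0.eli briefly shows a
> non-zero count before dropping back to 0.)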
>
> I also get constant write requests to the remaining device, even though no
> applications are accessing it. What the hell is ZFS trying to do here?
>
> root@coyote:/# zpool iostat 1
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> tank         883G  48.4G      8    246  56.8K  1.53M
> tank         883G  48.4G      8    249  55.9K  1.55M
> tank         883G  48.4G      8    250  55.0K  1.54M
> tank         883G  48.4G      8    252  54.1K  1.56M
> tank         883G  48.4G      8    254  53.3K  1.57M
> tank         883G  48.4G      8    253  52.5K  1.56M
> tank         883G  48.4G      7    255  51.7K  1.57M
> ^C
>
> Again, WTF? Can someone please enlighten me here?
>
> Cheers,
> Ulrich Spörlein
> --
> http://www.dubistterrorist.de/
>
--
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.
Edmund Burke