Resolving errors with ZVOLs

Wiktor Niesiobedzki bsd at vink.pl
Mon Sep 4 17:12:54 UTC 2017


Hi,

I can follow up on my own issue - the same problem has just happened on the
second ZVOL that I created:
# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 5h27m with 0 errors on Sat Sep  2 15:30:59 2017
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0    14
          mirror-0         ONLINE       0     0    28
            gpt/tank1.eli  ONLINE       0     0    28
            gpt/tank2.eli  ONLINE       0     0    28

errors: Permanent errors have been detected in the following files:

        tank/docker-big:<0x1>
        <0x5095>:<0x1>


I suspect that these errors might be related to my recent upgrade to 11.1.
Until 19 August I was running 11.0, and I am considering rolling back to
11.0 right now.
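
In case it matters, this is roughly how I would do the rollback (just a
sketch, assuming the 11.0 -> 11.1 upgrade was installed with
freebsd-update(8) and that the pre-upgrade files are still available for
rollback - I have not verified that yet):
# freebsd-update rollback   # assumes freebsd-update still has the old files
# shutdown -r now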

For reference:
# zfs get all tank/docker-big
NAME             PROPERTY               VALUE                  SOURCE
tank/docker-big  type                   volume                 -
tank/docker-big  creation               Sat Sep  2 10:09 2017  -
tank/docker-big  used                   100G                   -
tank/docker-big  available              747G                   -
tank/docker-big  referenced             10.5G                  -
tank/docker-big  compressratio          4.58x                  -
tank/docker-big  reservation            none                   default
tank/docker-big  volsize                100G                   local
tank/docker-big  volblocksize           128K                   -
tank/docker-big  checksum               skein                  inherited from tank
tank/docker-big  compression            lz4                    inherited from tank
tank/docker-big  readonly               off                    default
tank/docker-big  copies                 1                      default
tank/docker-big  refreservation         100G                   local
tank/docker-big  primarycache           all                    default
tank/docker-big  secondarycache         all                    default
tank/docker-big  usedbysnapshots        0                      -
tank/docker-big  usedbydataset          10.5G                  -
tank/docker-big  usedbychildren         0                      -
tank/docker-big  usedbyrefreservation   89.7G                  -
tank/docker-big  logbias                latency                default
tank/docker-big  dedup                  off                    default
tank/docker-big  mlslabel                                      -
tank/docker-big  sync                   standard               default
tank/docker-big  refcompressratio       4.58x                  -
tank/docker-big  written                10.5G                  -
tank/docker-big  logicalused            47.8G                  -
tank/docker-big  logicalreferenced      47.8G                  -
tank/docker-big  volmode                dev                    local
tank/docker-big  snapshot_limit         none                   default
tank/docker-big  snapshot_count         none                   default
tank/docker-big  redundant_metadata     all                    default
tank/docker-big  com.sun:auto-snapshot  false                  local

Any ideas on what I should try before rolling back?


Cheers,

Wiktor

2017-09-02 19:17 GMT+02:00 Wiktor Niesiobedzki <bsd at vink.pl>:

> Hi,
>
> I have recently encountered errors on my ZFS pool on 11.1-R:
> $ uname -a
> FreeBSD kadlubek 11.1-RELEASE-p1 FreeBSD 11.1-RELEASE-p1 #0: Wed Aug  9
> 11:55:48 UTC 2017     root at amd64-builder.daemonology
> .net:/usr/obj/usr/src/sys/GENERIC  amd64
>
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: scrub repaired 0 in 5h27m with 0 errors on Sat Sep  2 15:30:59 2017
> config:
>
>         NAME               STATE     READ WRITE CKSUM
>         tank               ONLINE       0     0    98
>           mirror-0         ONLINE       0     0   196
>             gpt/tank1.eli  ONLINE       0     0   196
>             gpt/tank2.eli  ONLINE       0     0   196
>
> errors: Permanent errors have been detected in the following files:
>
>         dkr-test:<0x1>
>
> dkr-test is a ZVOL that I use within bhyve, and indeed, within bhyve I
> have noticed I/O errors on this volume. This ZVOL did not have any
> snapshots.
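>
> For context, the ZVOL is handed to the guest as a virtio-blk disk; the
> relevant part of the bhyve invocation looks roughly like this (the slot
> number and VM name are illustrative, not copied from my scripts):
> # bhyve ... -s 4,virtio-blk,/dev/zvol/tank/dkr-test ... dkr-test   # slot is arbitrary
> (with volmode=dev the volume is exposed only as a raw /dev/zvol device,
> without GEOM layers on the host)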
>
> Following the advice given in the action field, I tried to remove the
> ZVOL so that I could restore it:
> # zfs destroy tank/dkr-test
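>
> By "restore" I mean recreating the volume and copying the data back from
> a backup image, roughly like this (the size and the backup path below
> are placeholders):
> # zfs create -V 100G -o volmode=dev tank/dkr-test   # size is a placeholder
> # dd if=/backup/dkr-test.img of=/dev/zvol/tank/dkr-test bs=1M   # path is a placeholder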
>
> But errors are still reported in zpool status:
> errors: Permanent errors have been detected in the following files:
>
>         <0x5095>:<0x1>
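>
> As far as I understand, a <0xNNNN>:<0xM> entry refers to a dataset that
> no longer exists, and it should only disappear once the persistent error
> log has been rotated by scrubs. So my (untested) plan is:
> # zpool clear tank
> # zpool scrub tank      # possibly twice; the log covers the last scrubs
> # zpool status -v tank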
>
> I can't find any reference to this dataset in zdb:
>  # zdb -d tank | grep 5095
>  # zdb -d tank | grep 20629
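> # 0x5095 is 20629 in decimal -- I checked both forms, neither matches anything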
>
>
> I also tried getting statistics about the metadata in this pool:
> # zdb -b tank
>
> Traversing all blocks to verify nothing leaked ...
>
> loading space map for vdev 0 of 1, metaslab 159 of 174 ...
>         No leaks (block sum matches space maps exactly)
>
>         bp count:        24426601
>         ganged count:           0
>         bp logical:    1983127334912      avg:  81187
>         bp physical:   1817897247232      avg:  74422     compression:  1.09
>         bp allocated:  1820446928896      avg:  74527     compression:  1.09
>         bp deduped:             0    ref>1:      0   deduplication:   1.00
>         SPA allocated: 1820446928896     used: 60.90%
>
>         additional, non-pointer bps of type 0:      57981
>         Dittoed blocks on same vdev: 296490
>
> And then zdb got stuck using 100% CPU.
>
> And now to my questions:
> 1. Do I interpret this correctly, that the situation is probably due to
> an error during a write, and that both copies of the block ended up with
> checksums that do not match their data? And if it is a hardware problem,
> is it probably something other than the disks? (No, I don't use ECC RAM.)
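>
> To rule the disks themselves in or out I also plan to look at SMART
> (a sketch -- it assumes smartmontools is installed and that the providers
> under the .eli layers are ada0/ada1, which is a guess):
> # smartctl -a /dev/ada0 | egrep -i 'realloc|pending|uncorrect|crc'   # device names are guesses
> # smartctl -a /dev/ada1 | egrep -i 'realloc|pending|uncorrect|crc'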
>
> 2. Is there any way to remove the offending dataset and clear the
> errors from the pool?
>
> 3. Is my metadata OK, or should I restore the entire pool from backup?
>
> 4. I also tried running zdb -bc tank, but this resulted in a kernel
> panic. I might try to get the stack trace once I get physical access to
> the machine next week. Also, checksum verification slows the process
> down from 1000MB/s to less than 1MB/s. Is this expected?
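>
> To have a usable stack trace next time, I intend to enable crash dumps
> first (a sketch -- it assumes there is a swap device big enough to hold
> the dump):
> # sysrc dumpdev="AUTO"     # dump to the configured swap device
> # service dumpon start
> and then pull the backtrace out of /var/crash with kgdb after
> savecore(8) has picked the dump up on reboot.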
>
> 5. When I work with zdb (as above), should I try to limit writes to the
> pool (e.g. by unmounting the datasets)?
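>
> What I had in mind was to export the pool and point zdb at it while
> nothing can write to it (a sketch -- as far as I understand, -e makes
> zdb open a pool that is not currently imported):
> # zpool export tank
> # zdb -e -b tank
> # zpool import tank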
>
> Cheers,
>
> Wiktor Niesiobedzki
>
> PS. Sorry for the previous partial message.
>
>

