[Bug 223085] ZFS Resilver not completing - stuck at 99%

bugzilla-noreply at freebsd.org
Wed Oct 18 10:48:08 UTC 2017


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223085

            Bug ID: 223085
           Summary: ZFS Resilver not completing - stuck at 99%
           Product: Base System
           Version: 10.2-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs at FreeBSD.org
          Reporter: paul at vsl-net.com

I have a number of FreeBSD systems with large (30TB) ZFS pools.

Over time I have had several disks fail, and I have seen resilvers either never
complete or reach 99% within a week and then take a further month to finish.

I have been seeking advice in the forums.

https://forums.freebsd.org/threads/61643/#post-355088

A system that had a disk replaced some time ago is in this state:

  pool: s11d34
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Sep 14 15:08:15 2017
        49.4T scanned out of 49.8T at 17.7M/s, 6h13m to go
        4.93T resilvered, 99.24% done
config:

        NAME                             STATE     READ WRITE CKSUM
        s11d34                           DEGRADED     0     0     0
          raidz2-0                       ONLINE       0     0     0
            multipath/J11F18-1EJB8KUJ    ONLINE       0     0     0
            multipath/J11R01-1EJ2XT4F    ONLINE       0     0     0
            multipath/J11R02-1EHZE2GF    ONLINE       0     0     0
            multipath/J11R03-1EJ2XTMF    ONLINE       0     0     0
            multipath/J11R04-1EJ3NK4J    ONLINE       0     0     0
          raidz2-1                       DEGRADED     0     0     0
            multipath/J11R05-1EJ2Z8AF    ONLINE       0     0     0
            multipath/J11R06-1EJ2Z8NF    ONLINE       0     0     0
            replacing-2                  OFFLINE      0     0     0
              7444569586532474759        OFFLINE      0     0     0  was /dev/multipath/J11R07-1EJ03GXJ
              multipath/J11F23-1EJ3AJBJ  ONLINE       0     0     0  (resilvering)
            multipath/J11R08-1EJ3A0HJ    ONLINE       0     0     0
            multipath/J11R09-1EJ32UPJ    ONLINE       0     0     0

It reached 99.24% within a week but has been stuck there since. At the reported
17.7M/s, the remaining ~0.4T should take roughly six hours (consistent with the
"6h13m to go" estimate above), yet weeks have passed.

I have stopped ALL access to the pool and run zpool iostat; there is still
activity, although it is low (e.g. 1.2M read, 1.78M write), so the resilver does
appear to be doing something.
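
For reference, this is the kind of command I used to watch the activity
(per-vdev statistics at five-second intervals; the exact flags and interval are
illustrative):

root at freebsd04:~ # zpool iostat -v s11d34 5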

The disks (6TB or 8TB HGST SAS) are attached via an LSI 9207-8e HBA, which is
connected to an LSI 6160 SAS switch, which in turn connects to a Supermicro JBOD.

Each HBA has two connectors, each connected to a different SAS switch.

The system therefore sees each disk twice, as expected. I use gmultipath to
label the disks in Active/Passive mode and then use the multipath names during
zpool create, e.g.:

root at freebsd04:~ # gmultipath status
                      Name   Status  Components
 multipath/J11R00-1EJ2XR5F  OPTIMAL  da0 (ACTIVE)
                                     da11 (PASSIVE)
 multipath/J11R01-1EJ2XT4F  OPTIMAL  da1 (ACTIVE)
                                     da12 (PASSIVE)
 multipath/J11R02-1EHZE2GF  OPTIMAL  da2 (ACTIVE)
                                     da13 (PASSIVE)

zpool create -f store43 raidz2 multipath/J11R00-1EJ2XR5F
    multipath/J11R01-1EJ2XT4F etc.
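
For completeness, a sketch of how the labels were created in the first place
(gmultipath defaults to Active/Passive mode, and the name/device pairs here are
taken from the status output above):

root at freebsd04:~ # gmultipath label J11R00-1EJ2XR5F da0 da11
root at freebsd04:~ # gmultipath label J11R01-1EJ2XT4F da1 da12
root at freebsd04:~ # gmultipath label J11R02-1EHZE2GF da2 da13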

Any advice on whether this is a bug or something wrong with my setup would be
appreciated.

Thanks

Paul

