kern/169319: zfs resilver can't complete
dirk.meyer at dinoex.sub.org
Fri Jun 22 07:50:03 UTC 2012
>Number: 169319
>Category: kern
>Synopsis: zfs resilver can't complete
>Confidential: no
>Severity: critical
>Priority: medium
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Fri Jun 22 07:50:02 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator: Dirk Meyer
>Release: FreeBSD 9.0-RELEASE
>Organization:
privat
>Environment:
FreeBSD 9.0-RELEASE-p1 GENERIC and GENERIC + DTRACE
ZFS raidz
>Description:
tank was created with 3 disks.
One disk was replaced and the resilver completed.
tank was extended with 3 new disks.
One disk was replaced and the resilver does not complete;
it keeps restarting roughly every 3 minutes.
No entry in dmesg or syslog.
smartctl -a on all disks reports no problems.
This problem has been reported multiple times:
http://forums.freebsd.org/showthread.php?t=27878
http://lists.freebsd.org/pipermail/freebsd-fs/2008-December/005532.html
http://kerneltrap.org/mailarchive/freebsd-fs/2008/12/27/4498564
http://forums.freenas.org/showthread.php?5008-ZFS-raidz1-resilver-restarts-after-1-min
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/42740
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/49144
zpool status
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Jun 17 22:17:19 2012
52,6M scanned out of 3,38T at 1,64M/s, 599h21m to go
7,27M resilvered, 0,00% done
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
raidz1-0 ONLINE 0 0 0
ada5p1 ONLINE 0 0 0
ada4p1 ONLINE 0 0 0
ada3p1 ONLINE 0 0 0
raidz1-1 DEGRADED 0 0 0
ada1p1 ONLINE 0 0 0
ada0p1 ONLINE 0 0 0
replacing-2 UNAVAIL 0 0 0
16718347834481227056 UNAVAIL 0 0 0 was /dev/ada6p1/old
ada6p1 ONLINE 0 0 0 (resilvering)
errors: No known data errors
>How-To-Repeat:
$ zpool history -i
2011-10-30.06:33:44 zpool create tank raidz1 ada1p1 ada2p1 ada3p1
(disk fail)
2011-11-29.09:38:28 zpool replace tank ada2p1 ada2p1
(reordering of devices here)
2012-01-27.11:21:50 zpool add tank raidz1 ada0p1 ada1p1 ada3p1
(disk fail)
2012-06-16.11:44:50 zpool scrub tank
2012-06-17.22:17:32 zpool replace tank ada6p1 ada6p1
2012-06-17.22:55:26 [internal pool scrub done txg:3885584] complete=0
2012-06-17.22:55:26 [internal pool scrub txg:3885584] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:07:09 [internal pool scrub done txg:3885689] complete=0
2012-06-17.23:07:09 [internal pool scrub txg:3885689] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:11:29 [internal pool scrub done txg:3885725] complete=0
2012-06-17.23:11:29 [internal pool scrub txg:3885725] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:14:16 [internal pool scrub done txg:3885748] complete=0
2012-06-17.23:14:16 [internal pool scrub txg:3885748] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:56:03 [internal pool scrub done txg:3886098] complete=0
[...]
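Each aborted pass leaves an "[internal pool scrub done ...] complete=0" entry in the history. A minimal sketch for counting them in a saved log (the sample lines below are copied from the history above; on a live system, pipe "zpool history -i tank" in instead of the heredoc):

```shell
# Count aborted scrub/resilver passes: each restart leaves a
# "[internal pool scrub done ...] complete=0" line in zpool history -i.
aborted=$(grep -c 'pool scrub done.*complete=0' <<'EOF'
2012-06-17.22:55:26 [internal pool scrub done txg:3885584] complete=0
2012-06-17.23:07:09 [internal pool scrub done txg:3885689] complete=0
2012-06-17.23:11:29 [internal pool scrub done txg:3885725] complete=0
EOF
)
echo "aborted passes: $aborted"
```

A steadily growing count with no intervening "complete=1" is what distinguishes this bug from a slow but progressing resilver.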
>Fix:
No fix or workaround known.
1) Detaching the faulty disk did not help:
[...]
2012-06-19.20:58:11 [internal pool scrub done txg:3910984] complete=0
2012-06-19.20:58:11 [internal pool scrub txg:3910984] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:11:03 [internal pool scrub done txg:3911106] complete=0
2012-06-19.21:11:03 [internal pool scrub txg:3911106] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:13:26 [internal pool scrub done txg:3911128] complete=0
2012-06-19.21:13:26 [internal pool scrub txg:3911128] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:13:48 zpool offline tank 16718347834481227056
2012-06-19.21:16:15 [internal vdev detach txg:3911155] vdev=/dev/ada6p1/old
2012-06-19.21:16:15 zpool detach tank 16718347834481227056
2012-06-19.21:19:17 [internal pool scrub done txg:3911183] complete=0
2012-06-19.21:19:17 [internal pool scrub txg:3911183] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:20:18 [internal pool scrub done txg:3911192] complete=0
2012-06-19.21:20:18 [internal pool scrub txg:3911192] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:30:33 [internal pool scrub done txg:3911291] complete=0
2012-06-19.21:30:33 [internal pool scrub txg:3911291] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:34:32 [internal pool scrub done txg:3911328] complete=0
2012-06-19.21:34:32 [internal pool scrub txg:3911328] func=2 mintxg=3 maxtxg=3885254
[...]
2) Removing the disk and scrubbing the drive did not help.
The scrub itself completed without error, see "complete=1":
[...]
2012-06-20.06:46:38 [internal pool scrub done txg:3916369] complete=0
2012-06-20.06:46:38 [internal pool scrub txg:3916369] func=2 mintxg=3 maxtxg=3885254
2012-06-20.06:52:29 [internal pool scrub done txg:3916425] complete=0
2012-06-20.06:52:29 [internal pool scrub txg:3916425] func=2 mintxg=3 maxtxg=3885254
2012-06-20.06:57:04 zpool offline tank ada6p1
2012-06-20.07:11:35 zpool online tank 4320346087595370573
2012-06-20.09:03:19 [internal pool scrub done txg:3917680] complete=1
2012-06-20.17:36:49 zpool online tank ada6p1
2012-06-20.18:57:32 [internal pool scrub txg:3924817] func=2 mintxg=3 maxtxg=3924817
2012-06-20.18:57:43 [internal vdev attach txg:3924819] replace vdev=/dev/ada6p1 for vdev=/dev/ada6p1/old
2012-06-20.18:57:43 zpool replace tank ada6p1 ada6p1
2012-06-20.18:58:50 [internal pool scrub done txg:3924829] complete=0
2012-06-20.18:58:50 [internal pool scrub txg:3924829] func=2 mintxg=3 maxtxg=3924817
2012-06-20.19:00:12 [internal pool scrub done txg:3924843] complete=0
2012-06-20.19:00:13 [internal pool scrub txg:3924843] func=2 mintxg=3 maxtxg=3924817
2012-06-20.19:01:38 zpool offline tank ada6p1
2012-06-20.21:13:30 [internal pool scrub done txg:3926107] complete=1
2012-06-20.21:20:44 [internal pool scrub txg:3926198] func=2 mintxg=1616210 maxtxg=3926195
2012-06-20.21:20:57 zpool online tank ada6p1
2012-06-20.21:24:49 [internal pool scrub done txg:3926239] complete=0
2012-06-20.21:24:50 [internal pool scrub txg:3926239] func=2 mintxg=1616210 maxtxg=3926195
2012-06-20.21:27:49 [internal pool scrub done txg:3926266] complete=0
2012-06-20.21:27:49 [internal pool scrub txg:3926266] func=2 mintxg=1616210 maxtxg=3926195
2012-06-20.21:29:59 [internal pool scrub done txg:3926285] complete=0
2012-06-20.21:29:59 [internal pool scrub txg:3926285] func=2 mintxg=1616210 maxtxg=3926195
[...]
3) Cleaning the drive and replacing it again did not help:
[...]
[...]
4) Debugging with a DTrace-enabled kernel:
dtrace -q -n '::zfs_dbgmsg:entry{printf("%s;%d\n", stringof(arg0), arg1)}'
[...]
spa=%s async request task=%u;-2198978129920
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931639
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;368
visited %llu blocks in %llums;9349
restarting resilver txg=%llu;3931640
internal %s pool:%s txg:%llu %s;-2123208961
spa=%s async request task=%u;-2198978129920
restarting scan func=%u txg=%llu;2
internal %s pool:%s txg:%llu %s;-2123209268
doing scan sync txg %llu; ddt bm=%llu/%llu/%llu/%llx;3931640
scanned %llu ddt entries with class_max = %u; pausing=%u;0
visited %llu blocks in %llums;2386
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931641
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931642
visited %llu blocks in %llums;1548
visited %llu blocks in %llums;1223
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931643
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;18
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;21
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;234
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;558593
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931644
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;355513
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931645
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;233292
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931646
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931647
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;183353
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;208309
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931648
spa=%s async request task=%u;-2198978129920
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;47458
spa=%s async request task=%u;-2198978129920
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931649
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;368
visited %llu blocks in %llums;21488
restarting resilver txg=%llu;3931650
internal %s pool:%s txg:%llu %s;-2123208961
spa=%s async request task=%u;-2198978129920
restarting scan func=%u txg=%llu;2
[...]
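The "restarting resilver txg=..." lines above make the restart cadence measurable. A rough sketch for extracting the txg gap between consecutive restarts (sample lines copied from the trace output above; a live capture would be piped in instead of the heredoc):

```shell
# Print the txg gap between consecutive "restarting resilver" events
# in the dtrace output; the txg value follows the ';' separator.
gaps=$(awk -F';' '/restarting resilver/ {
    if (prev != "") print "restart after " $2 - prev " txgs"
    prev = $2
}' <<'EOF'
restarting resilver txg=%llu;3931640
restarting resilver txg=%llu;3931650
EOF
)
echo "$gaps"
```

With the restarts arriving roughly every 3 minutes of wall-clock time, this gives a quick way to correlate the trace with the aborted scrub-done entries in zpool history -i.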
>Release-Note:
>Audit-Trail:
>Unformatted: