[Bug 235419] zpool scrub progress does not change for hours, heavy disk activity still present
bugzilla-noreply at freebsd.org
Sat Feb 2 06:37:20 UTC 2019
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235419
Bug ID: 235419
Summary: zpool scrub progress does not change for hours, heavy
disk activity still present
Product: Base System
Version: 11.2-STABLE
Hardware: Any
OS: Any
Status: New
Severity: Affects Some People
Priority: ---
Component: kern
Assignee: bugs at FreeBSD.org
Reporter: bobf at mrp3.com
Frequently, on one of my computers running 11-STABLE, a 'zpool scrub' will
continue for hours while progress does not increase. The scrub is still
'active' and there is a LOT of disk activity, causing stuttering of application
response as you would expect. This does not always happen, but happens more
often than not. The previous scrub completed without any such 'hangs' 2 weeks
ago, with no changes to the configuration since.
This system uses a 'zfs everywhere' configuration, i.e. all partitions are zfs.
A second computer that has UFS+J partitions for userland and kernel does not
appear to exhibit this particular problem.
uname output:
FreeBSD hack.SFT.local 11.2-STABLE FreeBSD 11.2-STABLE #1 r339273: Tue Oct 9
21:10:39 PDT 2018 root at hack.SFT.local:/usr/obj/usr/src/sys/GENERIC amd64
This system had been running for 80+ days.
At first, I discovered that the scrub had 'hung' at around 74% complete. After
pausing the scrub for a while, and also terminating firefox and thunderbird,
the scrub restarted and continued. I restarted firefox and thunderbird, and
allowed everything to continue. The scrub then 'hung' again at about 84%, and
terminating applications (including Xorg) did not seem to help.
With the scrub paused I performed a reboot, and the scrub restarted on boot
[causing the boot process to be excruciatingly slow]. I restarted most of the
applications that had been running before, while the scrub continued to run.
Now zpool status shows that the scrub has completed with no errors.
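For reference, the pause and resume steps above used the standard zpool(8)
subcommands, and a genuine stall can be distinguished from merely slow progress
by polling the percentage that `zpool status` reports. The `scrub_pct` helper
and the sample output below are illustrative only (not captured from the
affected system); a minimal sketch:

```shell
# Pause / resume as done above (zpool(8)):
#   zpool scrub -p zroot   # pause the running scrub
#   zpool scrub zroot      # resume a paused scrub

# scrub_pct is a hypothetical helper that extracts the "NN.NN% done"
# figure from `zpool status` output; polling it twice, a few minutes
# apart, shows whether the scrub is actually advancing.
scrub_pct() {
  awk '/% done/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) { sub(/%/, "", $i); print $i } }'
}

# Sample text standing in for `zpool status zroot` output:
sample='  scan: scrub in progress since Sat Feb  2 00:10:11 2019
        1.02T scanned out of 1.38T at 98.1M/s, 1h04m to go
        0 repaired, 74.02% done'

printf '%s\n' "$sample" | scrub_pct   # prints 74.02
```

On a live system, `zpool status zroot | scrub_pct` would replace the sample
pipeline; an unchanged value across polls, combined with heavy disk activity,
matches the hang described above.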
Here are some additional pieces of information that might help:
> mount
zroot/ROOT/default on / (zfs, NFS exported, local, noatime, nfsv4acls)
devfs on /dev (devfs, local, multilabel)
zroot/d-drive on /d-drive (zfs, NFS exported, local, noatime, nfsv4acls)
zroot/e-drive on /e-drive (zfs, NFS exported, local, noatime, nfsv4acls)
zroot/tmp on /tmp (zfs, local, noatime, nosuid, nfsv4acls)
zroot/usr/home on /usr/home (zfs, NFS exported, local, noatime, nfsv4acls)
zroot/usr/ports on /usr/ports (zfs, NFS exported, local, noatime, nosuid,
nfsv4acls)
zroot/usr/src on /usr/src (zfs, NFS exported, local, noatime, nfsv4acls)
zroot/var/audit on /var/audit (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/crash on /var/crash (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/log on /var/log (zfs, local, noatime, noexec, nosuid, nfsv4acls)
zroot/var/mail on /var/mail (zfs, local, nfsv4acls)
zroot/var/tmp on /var/tmp (zfs, local, noatime, nosuid, nfsv4acls)
zroot on /zroot (zfs, local, noatime, nfsv4acls)
> kldstat
Id Refs Address Size Name
1 44 0xffffffff80200000 206b5d0 kernel
2 1 0xffffffff8226d000 393200 zfs.ko
3 2 0xffffffff82601000 a380 opensolaris.ko
4 1 0xffffffff82821000 4090 cuse.ko
5 1 0xffffffff82826000 6e40 uftdi.ko
6 1 0xffffffff8282d000 3c58 ucom.ko
7 3 0xffffffff82831000 50c70 vboxdrv.ko
8 2 0xffffffff82882000 2ad0 vboxnetflt.ko
9 2 0xffffffff82885000 9a20 netgraph.ko
10 1 0xffffffff8288f000 14b8 ng_ether.ko
11 1 0xffffffff82891000 3f70 vboxnetadp.ko
12 2 0xffffffff82895000 37528 linux.ko
13 2 0xffffffff828cd000 2d28 linux_common.ko
14 1 0xffffffff828d0000 31e80 linux64.ko
15 1 0xffffffff82902000 c60 coretemp.ko
16 1 0xffffffff82903000 965128 nvidia.ko
There were no log messages regarding the zpool scrub that I could find.
Port versions for packages that install kernel modules:
nvidia-driver-340-340.106
virtualbox-ose-5.1.18
virtualbox-ose-kmod-5.1.22
linux-c7-7.3.1611_1
This problem has been happening since mid last year, around the time the
-STABLE source moved to 11.2 and I updated kernel+world on this computer. The
zpool has also been upgraded. It is worth noting that this computer ran 11.0
for a long time without incident; the problem may have been present in 11.1.
Related: there is an apparent random-crash bug in the NVidia module that I have
been trying to track down; it causes occasional page fault crashes. Sometimes I
see swap space in use when there does not seem to be any reason for it, and I
believe this NVidia bug is part of that (the crash coming from accessing freed
or random memory addresses, with swap space allocated as a consequence?).
Whether this NVidia driver bug is responsible for the zfs problem, I do not
know, but the driver is installed only on this particular computer, and only
this computer seems to exhibit the problem, so it is worth mentioning.