[Bug 225337] z_teardown_inactive_lock held inordinately long
bugzilla-noreply at freebsd.org
Sat Jan 20 07:03:57 UTC 2018
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225337
Bug ID: 225337
Summary: z_teardown_inactive_lock held inordinately long
Product: Base System
Version: 11.1-RELEASE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: freebsd-bugs at FreeBSD.org
Reporter: wollman at FreeBSD.org
On one of our large NFS servers, it seems that some process holds
zfsvfs->z_teardown_inactive_lock far too long -- on the order of ten minutes or
more -- causing all filesystem activity to hang. The exact same configuration
and activity patterns did not have such a hang under 10.3. I believe, from web
searches, that this lock is implicated in zfs dataset rollback and consequently
zfs recv -F, but the hang only seems to take place when we have both pull
replication (zfs recv) *and* active (through-the-filesystem) backups running at
the same time, which usually only happens late at night. There are no console
messages or other indications of faults in the underlying storage system. The
system as a whole becomes completely unusable, our monitoring system raises
alarms, but it doesn't actually crash, and whatever it was eventually does
complete without visible errors.
I'm temporarily disabling the replication job to see if that truly is the
smoking gun. Or rather, I'm going to do that once I get access to the
filesystem again.
Examples, taken from my ssh session over the past hour (these are all waiting
for the same shell script to *begin executing*):
load: 0.82 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 7.42r 0.00u 0.00s 0% 3624k
load: 0.71 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 23.00r 0.00u 0.00s 0% 3624k
load: 0.59 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 38.85r 0.00u 0.00s 0% 3624k
load: 1.02 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 88.32r 0.00u 0.00s 0% 3624k
load: 0.81 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 149.97r 0.00u 0.00s 0% 3624k
load: 0.76 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 181.17r 0.00u 0.00s 0% 3624k
load: 1.51 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 243.76r 0.00u 0.00s 0% 3624k
load: 0.96 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 282.39r 0.00u 0.00s 0% 3624k
load: 1.50 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 333.94r 0.00u 0.00s 0% 3624k
load: 0.93 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 392.77r 0.00u 0.00s 0% 3624k
load: 0.84 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 457.04r 0.00u 0.00s 0% 3624k
load: 0.85 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 526.06r 0.00u 0.00s 0% 3624k
load: 0.40 cmd: bash 56646 [zfsvfs->z_teardown_inactive_lock] 588.82r 0.00u 0.00s 0% 3624k
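(The status lines above are SIGINFO output, i.e. Ctrl-T in the tty. For anyone reproducing this, kernel stack traces are usually the quickest way to see who actually holds the lock; a minimal sketch using stock FreeBSD tools -- the PID 56646 is the stuck bash from the output above, and the grep filter is only illustrative:

```shell
# Re-trigger the status line for the stuck shell (same as pressing Ctrl-T):
kill -INFO 56646

# Dump the kernel stack of the blocked process; the frames should show
# where it sleeps on z_teardown_inactive_lock:
procstat -kk 56646

# Dump kernel stacks of all threads, to look for the thread that *holds*
# the lock -- e.g. one sitting inside a zfs recv / rollback code path:
procstat -kk -a | grep -B1 zfs
```

These commands only observe state; they are safe to run while the hang is in progress.)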
My suspicion is that the primary vector is zfs recv on a dataset that is
currently being backed up, but why this causes all other filesystem activity to
become blocked is a bit unclear to me. (Race to the root? I think the backup
software uses openat(2) and shouldn't cause that sort of problem, but maybe
random NFS clients can.)
--
You are receiving this mail because:
You are the assignee for the bug.