[Bug 243814] ZFS deadlock when adding cache partition
bugzilla-noreply at freebsd.org
Sun Feb 2 18:05:53 UTC 2020
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243814
Bug ID: 243814
Summary: ZFS deadlock when adding cache partition
Product: Base System
Version: 12.1-RELEASE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: bugs at FreeBSD.org
Reporter: jfc at mit.edu
My system hung in ZFS, probably a deadlock. I doubt this is reproducible, but there have been scattered other reports, so I'll add more data.
The machine is an HPE ProLiant with an AMD CPU, 96 GB of RAM, a UFS root on NVMe, a ZFS mirrored pool on two spinning disks, and a ZFS raidz2 pool on five spinning disks encrypted with full-disk geli. Both pools have failmode=continue. The mirrored pool was idle; the raidz2 pool caused the hang.
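For anyone wanting to check the same thing, the failmode property can be read and set per pool like this ("private" is my pool name; substitute yours):

# zpool get failmode private
# zpool set failmode=continue private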
The zpool add ... cache ... command was hung waiting on tx->tx_sync_done_cv. Meanwhile, a zpool iostat was hung waiting on spa_namespace_lock. shutdown -r failed to reboot, probably because of the same deadlock. After the 90-second timeout in rc.shutdown, init attempted to go to single-user mode but never started a shell. I had to power-cycle the system.
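If this happens again I'll try to grab kernel stack traces of the stuck threads as well; something like this should show where each thread is blocked (79137 and 50732 are the two hung zpool processes in the ^T output below):

# procstat -kk 79137
# procstat -kk 50732

This time the wait channels printed by ^T are all I have.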
Now, back to the setup that led up to this.
I was moving a lot of data into the pool and from filesystem to filesystem within the pool. I noticed that the transfers tended to stop for minutes at a time, and that ZFS administrative actions and the sync command also took several minutes, probably while many gigabytes of dirty buffers were written out. I thought adding a cache device might help, so I created a partition on the NVMe drive and ran

# zpool add private cache nvd0p4

("private" is the name of my pool.)
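As an aside on those multi-minute stalls: if they really were dirty data backing up, the limits involved should be visible with sysctl. On 12.1 I believe the relevant knobs are these, though the names may differ between ZFS versions:

# sysctl vfs.zfs.dirty_data_max
# sysctl vfs.zfs.dirty_data_max_percent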
That zpool add is where everything went wrong. Instead of a 1-10 minute wait for buffers to drain, I/O to the pool was totally hung.
Control-T on the zpool add reported:
load: 3.47 cmd: zpool 79137 [scl->scl_cv] 1.09r 0.00u 0.00s 0% 4284k
load: 3.47 cmd: zpool 79137 [scl->scl_cv] 2.32r 0.00u 0.00s 0% 4284k
load: 3.19 cmd: zpool 79137 [scl->scl_cv] 6.86r 0.00u 0.00s 0% 4284k
load: 1.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 500.21r 0.00u 0.00s 0% 4284k
load: 0.96 cmd: zpool 79137 [tx->tx_sync_done_cv] 566.71r 0.00u 0.00s 0% 4284k
load: 0.75 cmd: zpool 79137 [tx->tx_sync_done_cv] 577.43r 0.00u 0.00s 0% 4284k
load: 0.61 cmd: zpool 79137 [tx->tx_sync_done_cv] 611.66r 0.00u 0.00s 0% 4284k
load: 0.14 cmd: zpool 79137 [tx->tx_sync_done_cv] 736.87r 0.00u 0.00s 0% 4284k
load: 0.05 cmd: zpool 79137 [tx->tx_sync_done_cv] 792.23r 0.00u 0.00s 0% 4284k
load: 0.18 cmd: zpool 79137 [tx->tx_sync_done_cv] 997.17r 0.00u 0.00s 0% 4284k
load: 0.37 cmd: zpool 79137 [tx->tx_sync_done_cv] 1198.80r 0.00u 0.00s 0% 4284k
load: 0.49 cmd: zpool 79137 [tx->tx_sync_done_cv] 1339.23r 0.00u 0.00s 0% 4284k
Meanwhile, in another shell, ^T to the hung zpool iostat reported:
load: 0.24 cmd: zpool 50732 [spa_namespace_lock] 179170.71r 0.01u 0.08s 0% 1068k
/var/log/messages has these lines that seem relevant. I had created the partition intended for cache and then resized it before using it; that seems to be what produces the error 6 condition. I've seen it elsewhere but without any side effects.
kernel: g_access(958): provider gptid/af44cd24-36ef-11ea-a744-48df37a69238 has error 6 set
syslogd: last message repeated 2 times
kernel: g_dev_taste: make_dev_p() failed (gp->name=gptid/af44cd24-36ef-11ea-a744-48df37a69238, error=17)
ZFS[79140]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716
ZFS[79141]: vdev state changed, pool_guid=$706653905921838876 vdev_guid=$2416291949121178716
The partition UUID belongs to the cache partition, and the pool GUID belongs to the pool I was using. (There is also a second pool that was mounted but idle.) But I have failmode=continue, so an I/O error should not hang the system.
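For completeness, the cache partition had been created and then resized with gpart before the zpool add. The partition type and sizes below are only illustrative (index 4 matches nvd0p4), but the shape was roughly:

# gpart add -t freebsd-zfs -i 4 -s 64G nvd0
# gpart resize -i 4 -s 96G nvd0

That resize-before-use step is what seems to correlate with the error 6 messages above.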
--
You are receiving this mail because:
You are the assignee for the bug.