[Bug 280216] UFS deadly hangs while removing snapshot
Date: Wed, 10 Jul 2024 11:31:48 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280216
Bug ID: 280216
Summary: UFS deadly hangs while removing snapshot
Product: Base System
Version: Unspecified
Hardware: amd64
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: bugs@FreeBSD.org
Reporter: ant_mail@inbox.ru
I have an unfortunate situation with a production server that forces me to
give up my weekends.
The server hangs on some Friday nights and has to be brought back to life by a
physical power cycle. This began in autumn '23.
It appears as a filesystem hang: the server responds to ping, but every I/O
operation hangs.
I'm running 12-STABLE, and there may be some relation to commits made
during July-October '23.
It was hard to investigate because this is a production server and the total
number of incidents is only about 7-8. Here is what I have found.
I'm using the 'snapshot' utility (package freebsd-snapshot) to make periodic
snapshots. It contains the following lines of code:
logger -p daemon.notice \
"snapshot: removing $fs_dir/.snap/$fs_tag.$"
system rm -f $fs_dir/.snap/$fs_tag.$i
The last messages logged by the system are:
Jun 28 22:10:06 serv root[52374]: snapshot: rotating snapshots
Jun 28 22:10:06 serv root[52375]: snapshot: rm /data/office/.snap/weekly.3
Jun 29 09:47:28 serv syslogd: kernel boot file is /boot/kernel/kernel
Jun 29 09:47:28 serv kernel: ---<<BOOT>>---
There is no evidence that the system completed any successful UFS reads or
writes after 'rm' was started.
After the power cycle, fsck found errors on some partitions, but the
problematic partition (/data/office) had no errors. And there is no problem
removing the snapshot manually (with rm /data/office/.snap/weekly.3).
There are other UFS partitions on this server that take UFS snapshots the same
way, but they never hang.
UFS parameters of /data/office:
tunefs: POSIX.1e ACLs: (-a) enabled
tunefs: NFSv4 ACLs: (-N) disabled
tunefs: MAC multilabel: (-l) disabled
tunefs: soft updates: (-n) disabled
tunefs: soft update journaling: (-j) disabled
tunefs: gjournal: (-J) enabled
tunefs: trim: (-t) disabled
tunefs: maximum blocks per file in a cylinder group: (-e) 4096
tunefs: average file size: (-f) 512000
tunefs: average number of files in a directory: (-s) 64
tunefs: minimum percentage of free space: (-m) 12%
tunefs: space to hold for metadata blocks: (-k) 6408
tunefs: optimization preference: (-o) time
What was tried:
creating a new, larger partition, running newfs on it, then dumping and
restoring the data to the new partition. After a couple of months the server
hung again.
I suppose the problem arises when a snapshot gets large. This would explain
why it hangs only on some Fridays: removing the oldest snapshot means removing
the largest snapshot, and when its size exceeds some threshold, the removal
hangs.
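One way to check this hypothesis would be to log the snapshot's disk usage
just before it is removed, so that the last log line before a hang records the
size. A minimal sketch, assuming a hypothetical helper named
snapshot_removal_msg (it is not part of freebsd-snapshot) and that du -k on
the snapshot file gives a usable measure of its allocated space:

```sh
#!/bin/sh
# Hypothetical helper (not part of freebsd-snapshot): build a log message
# that records how much disk space a snapshot file occupies.
snapshot_removal_msg() {
    snap="$1"
    # du -k reports allocated space in KB. The apparent file length of a
    # snapshot may not reflect its real disk usage, so allocated blocks
    # are reported instead.
    kb=$(du -k "$snap" | awk '{print $1}')
    printf 'snapshot: removing %s (%s KB allocated)\n' "$snap" "$kb"
}

# Assumed usage inside the rotation script, just before the rm:
#   logger -p daemon.notice "$(snapshot_removal_msg "$fs_dir/.snap/$fs_tag.$i")"
```

The helper only prints the message, so it can be tried on any file without
touching syslog or removing anything.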
Currently I have the following snapshot sizes:
/data/office/ ufs 464GB 40.0%  44GB 3.8% weekly.2 2024-06-07T22:11
/data/office/ ufs 464GB 40.0%  22GB 1.9% weekly.1 2024-06-14T22:10
/data/office/ ufs 464GB 40.0%  18GB 1.5% weekly.0 2024-06-21T22:11
/data/office/ ufs 464GB 40.0%   9GB 0.8% daily.2  2024-07-08T00:03
/data/office/ ufs 464GB 40.0% 741MB 0.1% daily.1  2024-07-09T00:03
/data/office/ ufs 464GB 40.0% 784MB 0.1% hourly.1 2024-07-09T16:01
/data/office/ ufs 464GB 40.0% 594MB 0.0% daily.0  2024-07-10T00:03
/data/office/ ufs 464GB 40.0% 590MB 0.0% hourly.0 2024-07-10T12:01
Any help is greatly appreciated.