System deadlock when using mksnap_ffs
Jeremy Chadwick
koitsu at FreeBSD.org
Thu Nov 13 01:15:52 PST 2008
On Wed, Nov 12, 2008 at 10:05:21PM -0800, Jeremy Chadwick wrote:
> On Wed, Nov 12, 2008 at 09:02:50PM -0800, David Wolfskill wrote:
> > On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> > > ...
> > > > > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > > > > I've been playing around with snapshots lately but I've got a problem on
> > > > > > one of my servers running 7-STABLE amd64:
> > > > > >
> > > > > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 tdb at paladin:/usr/obj/usr/src/sys/PALADIN amd64
> > > > > >
> > > > > > I run the mksnap_ffs command to take the snapshot and some time later
> > > > > > the system completely freezes up:
> > > > > >
> > > > > > paladin# cd /u2/.snap/
> > > > > > paladin# mksnap_ffs /u2 test.1
> > > > > >
> > > > > > It only happens on this one filesystem, though, which might be to do
> > > > > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > > > > also backed by a hardware RAID system, although a smaller filesystem on
> > > > > > the same RAID has no issues.
> > > ...
> > > Then in my book, the patch didn't fix anything. :-) The system is
> > > still "deadlocking"; snapshot generation **should not** wedge the system
> > > hard like this.
> > >
> > > Also, during my own testing, I am always able to use Ctrl-T to get
> > > SIGINFO from the running process (mksnap_ffs). That behaviour does not
> > > change for me.
> > >
> > > The rest of the below information is good -- but I'm confused about
> > > something: is there anyone out there who can use mksnap_ffs on a
> > > filesystem (/usr is a good test source) and NOT experience this
> > > deadlocking problem?
> >
> > I hadn't ever tried until I saw your message. Granted, I'm using a
> > smaller file system (I doubt that I have a total of as much as 2 TB in
> > all my machines combined), and I'm running i386, vs. amd64. But it ran
> > just fine. I wasn't able to test SIGINFO; it finished before I had a
> > chance. (I ran it under time(1); wall clock time was 0.91 sec.)
> >
> > > Literally *every* FreeBSD box I have root access
> > > to suffers from this problem, so I'm a little baffled why we end-users
> > > need to keep providing debugging output when it should be easy as pie
> > > for a developer to do "dump -0 -L -a -f /path/fs.dump /usr" and watch
> > > their system wedge.
> >
> > Well, I routinely use dump/restore pipelines to copy file systems
> > around; never had a problem with it.
> >
> > > ...
> >
> > For reference:
> >
> > freebeast(7.1-P)[9] uname -a
> > FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: Wed Nov 12 05:16:20 PST 2008 root at freebeast.catwhisker.org:/common/S3/obj/usr/src/sys/FREEBEAST i386
> > freebeast(7.1-P)[10] ls -la
> > total 4
> > drwxrwxr-x 2 root operator 512 Nov 12 20:53 .
> > drwxr-xr-x 14 root wheel 512 Jan 22 2008 ..
> > freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
> > 0.91 real 0.00 user 0.05 sys
> > 976 maximum resident set size
> > 3 average shared memory size
> > 627 average unshared data size
> > 109 average unshared stack size
> > 104 page reclaims
> > 0 page faults
> > 0 swaps
> > 1 block input operations
> > 230 block output operations
> > 0 messages sent
> > 0 messages received
> > 0 signals received
> > 101 voluntary context switches
> > 34 involuntary context switches
> > freebeast(7.1-P)[12] ls -la
> > total 1460
> > drwxrwxr-x 2 root operator 512 Nov 12 20:54 .
> > drwxr-xr-x 14 root wheel 512 Jan 22 2008 ..
> > -r--r----- 1 root operator 2410791056 Nov 12 20:54 test.1
> > freebeast(7.1-P)[13]
>
> David, thanks for chiming in. This is exactly what I was
> fearing/worried about.
>
> It would be greatly beneficial if we could figure out what triggers the
> slowdown for a lot of us, since for others (proof above) mksnap_ffs
> behaves as expected.
>
> Since I'm able to reproduce this pretty much everywhere, here's
> information:
>
> # df -ki /usr
> Filesystem 1024-blocks Used Avail Capacity iused ifree %iused Mounted on
> /dev/ad4s1f 163815904 3835274 146875358 3% 254864 20941934 1% /usr
>
> # cd /usr/.snap
> # /usr/bin/time -l mksnap_ffs /usr test.1
>
> <after about 20 seconds, hitting Ctrl-T>
>
> load: 1.90 cmd: mksnap_ffs 11719 [wdrain] 0.00u 0.07s 0% 1092k
> 23.25 real 0.00 user 0.00 sys
>
> 135.98 real 0.00 user 0.62 sys
> 1092 maximum resident set size
> 4 average shared memory size
> 1081 average unshared data size
> 135 average unshared stack size
> 101 page reclaims
> 0 page faults
> 0 swaps
> 895 block input operations
> 13444 block output operations
> 0 messages sent
> 0 messages received
> 0 signals received
> 6433 voluntary context switches
> 197 involuntary context switches
> # ls -l test.1
> -r--r----- 1 root operator 173203463240 Nov 12 21:42 test.1
>
> David's filesystem is about 2GB, while mine is about 160GB (per the df
> output above). His snap takes under 1 second, yet mine takes over 2 minutes.
>
> Possibly the large deviation is explained by the amount of space used on
> the filesystem or the number of inodes in use?
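To put the two runs side by side, here is a rough sketch that converts the snapshot file sizes from the ls output and compares the wall clock times from time(1). The variable and function names are only illustrative; the numbers are copied from the output quoted above, and nothing here explains the cause of the slowdown.

```shell
#!/bin/sh
# Figures copied from the ls -l and time(1) output quoted above.
# The variable names are made up for this sketch.
david_bytes=2410791056
jeremy_bytes=173203463240
david_secs=0.91
jeremy_secs=135.98

# Convert a byte count to GiB, one decimal place.
to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Ratio of two numbers, one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "David:  $(to_gib "$david_bytes") GiB snapshot file, ${david_secs}s"
echo "Jeremy: $(to_gib "$jeremy_bytes") GiB snapshot file, ${jeremy_secs}s"
echo "size ratio: $(ratio "$jeremy_bytes" "$david_bytes")x"
echo "time ratio: $(ratio "$jeremy_secs" "$david_secs")x"
```

If the time ratio comes out noticeably larger than the size ratio, filesystem size alone probably does not account for the difference.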
I also want to add that snapshot removal (e.g. rm test.1) is slow as
well (the rm process is also in wdrain); it takes about 20 seconds for
the above test.1 snapshot. Maybe long durations during deletion are
justified though; I don't know.
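For anyone who wants to record the wait channel over time instead of hitting Ctrl-T by hand, something like the following might do. It is a minimal sketch: watch_wchan is a made-up helper name, it assumes ps(1) accepts "-o wchan=" and "-p", and it assumes you are allowed to signal the process (same user or root).

```shell
#!/bin/sh
# watch_wchan PID [INTERVAL]
# Hypothetical helper: print a timestamp and the wait channel of PID
# every INTERVAL seconds (default 1) until the process exits.
watch_wchan() {
    pid=$1
    interval=${2:-1}
    # kill -0 sends no signal; it only checks that the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$(ps -o wchan= -p "$pid")"
        sleep "$interval"
    done
}
```

Usage would be along the lines of "rm test.1 & watch_wchan $!" from inside the .snap directory; a long run of wdrain lines would match what Ctrl-T shows.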
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |