System deadlock when using mksnap_ffs

Jeremy Chadwick koitsu at FreeBSD.org
Wed Nov 12 22:05:24 PST 2008


On Wed, Nov 12, 2008 at 09:02:50PM -0800, David Wolfskill wrote:
> On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> > ...
> > > > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > > > I've been playing around with snapshots lately but I've got a problem on
> > > > > one of my servers running 7-STABLE amd64:
> > > > > 
> > > > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 tdb at paladin:/usr/obj/usr/src/sys/PALADIN  amd64
> > > > > 
> > > > > I run the mksnap_ffs command to take the snapshot and some time later
> > > > > the system completely freezes up:
> > > > > 
> > > > > paladin# cd /u2/.snap/
> > > > > paladin# mksnap_ffs /u2 test.1
> > > > > 
> > > > > It only happens on this one filesystem, though, which might be to do
> > > > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > > > also backed by a hardware RAID system, although a smaller filesystem on
> > > > > the same RAID has no issues.
> > ...
> > Then in my book, the patch didn't fix anything.  :-)  The system is
> > still "deadlocking"; snapshot generation **should not** wedge the system
> > hard like this.
> > 
> > Also, during my own testing, I am always able to use Ctrl-T to get
> > SIGINFO from the running process (mksnap_ffs).  That behaviour does not
> > change for me.
> > 
> > The rest of the below information is good -- but I'm confused about
> > something: is there anyone out there who can use mksnap_ffs on a
> > filesystem (/usr is a good test source) and NOT experience this
> > deadlocking problem?
> 
> I hadn't ever tried until I saw your message.  Granted, I'm using a
> smaller file system (I doubt that I have a total of as much as 2 TB in
> all my machines combined), and I'm running i386, vs. amd64.  But it ran
> just fine.  I wasn't able to test SIGINFO; it finished before I had a
> chance.  (I ran it under time(1); wall clock time was 0.91 sec.)
> 
> > Literally *every* FreeBSD box I have root access
> > to suffers from this problem, so I'm a little baffled why we end-users
> > need to keep providing debugging output when it should be easy as pie
> > for a developer to do "dump -0 -L -a -f /path/fs.dump /usr" and watch
> > their system wedge.
> 
> Well, I routinely use dump/restore pipelines to copy file systems
> around; never had a problem with it.
> 
> > ...
> 
> For reference:
> 
> freebeast(7.1-P)[9] uname -a
> FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: Wed Nov 12 05:16:20 PST 2008     root at freebeast.catwhisker.org:/common/S3/obj/usr/src/sys/FREEBEAST  i386
> freebeast(7.1-P)[10] ls -la
> total 4
> drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
> drwxr-xr-x  14 root  wheel     512 Jan 22  2008 ..
> freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
>         0.91 real         0.00 user         0.05 sys
>        976  maximum resident set size
>          3  average shared memory size
>        627  average unshared data size
>        109  average unshared stack size
>        104  page reclaims
>          0  page faults
>          0  swaps
>          1  block input operations
>        230  block output operations
>          0  messages sent
>          0  messages received
>          0  signals received
>        101  voluntary context switches
>         34  involuntary context switches
> freebeast(7.1-P)[12] ls -la
> total 1460
> drwxrwxr-x   2 root  operator         512 Nov 12 20:54 .
> drwxr-xr-x  14 root  wheel            512 Jan 22  2008 ..
> -r--r-----   1 root  operator  2410791056 Nov 12 20:54 test.1
> freebeast(7.1-P)[13] 

David, thanks for chiming in.  This is exactly what I was worried
about.

It would be greatly beneficial if we could figure out what triggers the
slowdown for a lot of us, since for others (proof above) mksnap_ffs
behaves as expected.

Since I'm able to reproduce this pretty much everywhere, here's the
relevant information:

# df -ki /usr
Filesystem  1024-blocks    Used     Avail Capacity iused    ifree %iused  Mounted on
/dev/ad4s1f   163815904 3835274 146875358     3%  254864 20941934    1%   /usr

# cd /usr/.snap
# /usr/bin/time -l mksnap_ffs /usr test.1

<after about 20 seconds, hitting Ctrl-T>

load: 1.90  cmd: mksnap_ffs 11719 [wdrain] 0.00u 0.07s 0% 1092k
       23.25 real         0.00 user         0.00 sys

      135.98 real         0.00 user         0.62 sys
      1092  maximum resident set size
         4  average shared memory size
      1081  average unshared data size
       135  average unshared stack size
       101  page reclaims
         0  page faults
         0  swaps
       895  block input operations
     13444  block output operations
         0  messages sent
         0  messages received
         0  signals received
      6433  voluntary context switches
       197  involuntary context switches
# ls -l test.1
-r--r-----  1 root  operator  173203463240 Nov 12 21:42 test.1

David's filesystem is about 2GB, while mine is about 160GB (going by the
size of each test.1).  His snap takes under 1 second, yet mine takes
over 2 minutes.

Possibly the large deviation is explained by the amount of space used on
the filesystem or the number of inodes in use?
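
For scale, here is a rough sketch of how the two runs compare, using only
the byte counts from the ls output and the wall-clock times from time(1)
quoted above.  The variable names and the report function are just for
illustration, and it assumes the apparent size of test.1 roughly tracks
the size of the filesystem:

```shell
#!/bin/sh
# Snapshot file sizes in bytes, copied from the two "ls" listings above.
small_bytes=2410791056      # test.1 on /S2/usr
large_bytes=173203463240    # test.1 on /usr
# Wall-clock ("real") times in seconds, as reported by time(1).
small_secs=0.91
large_secs=135.98

# Hypothetical helper: print both sizes in GiB and the two ratios.
report() {
    awk -v sb="$small_bytes" -v lb="$large_bytes" \
        -v st="$small_secs" -v lt="$large_secs" 'BEGIN {
        gib = 1024 * 1024 * 1024
        printf "small: %.1f GiB in %.2f s\n", sb / gib, st
        printf "large: %.1f GiB in %.2f s\n", lb / gib, lt
        printf "size ratio: %.0fx, time ratio: %.0fx\n", lb / sb, lt / st
    }'
}

report
```

By these numbers the snapshot file is roughly 72 times larger but took
roughly 149 times longer to create, so size alone may not account for
the whole difference.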

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



More information about the freebsd-stable mailing list