Is some combination of gmirror, md file systems, snapshots and,
maybe, quotas considered harmful?
Scott Lambert
lambert at lambertfam.org
Sun Mar 22 02:31:57 PDT 2009
On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
> I have a previously stable machine, other than a one-time panic in
> soft-updates which I could never reproduce, running RELENG_7 from July
> 23, 2008.
>
> Starting update: Wed Jul 23 01:29:47 CDT 2008
> Finished update: Wed Jul 23 01:31:13 CDT 2008
>
> I had the userquota option in the fstab for /home, but I did not yet
> have anything in /etc/rc.conf to enable quotas. I have been running an
> unmodified GENERIC kernel config.
>
> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>
> It runs a few jails, using ezjail. Two of them are image-based jails,
> 1GB and 2GB. There is also one non-image file jail. The jails live in
> /home/ezjails.
>
> I added another image-based jail, with a 3GB image, on March 12th.
>
> I added this machine to our AMANDA setup on March 13, 2009.
>
> Things seemed to be okay until the 19th. On the 19th, during the dump
> of /home, things gradually started to hang. Nagios paged me about
> services not responding.
>
> I did not find any explanation for it. The disks were idle according to
> systat -vm. I was able to grep the log files on /var for a while, and
> then I could no longer do anything with it.
>
> I eventually had to go to the office and power cycle it. I tried C-A-D
> first, but shutdown timed out after 30 seconds.
>
> Just to make sure it wasn't something that had since been fixed, I
> updated to RELENG_7 as of Mar 19th.
>
> Starting update: Thu Mar 19 03:40:41 CDT 2009
> Finished update: Thu Mar 19 03:48:45 CDT 2009
>
> I rebooted to the new kernel and installed the world just after midnight
> on the 20th. I started getting paged by Nagios again at 3:40am.
>
> I noticed that mksnap_ffs was running on /home, CPU time used: 0:00.77,
> as things began to circle the drain. That was about 30 minutes after
> the dump attempt had been started by AMANDA. There were many processes
> waiting in state D. This time I did a reboot -n -q and the box rebooted
> but was still fscking when I got to the office.
>
> # ls -l /home/.snap
> -r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot
>
> # df /home
> Filesystem Size Used Avail Capacity Mounted on
> /dev/mirror/gm0s1g 106G 11G 86G 11% /home
>
> I removed userquota from the fstab entry for /home and rebooted, just
> to be sure. The last dangerous combination I remember for snapshots
> involved quotas. Am I even in the danger zone for quotas without having
> them compiled into the kernel?
>
> It looks like removing the .snap directory should be enough to prevent
> any future snapshots during the backup process. Does that sound like a
> reasonable workaround? It would at least remove one variable from the
> troubleshooting process.
>
> Any other suggestions?
>
> Thank you for any help you may be able to provide,
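The quoted report mentions many processes waiting in state D while
mksnap_ffs was running. A minimal sketch of how that list could be
captured before the box locks up completely is shown here. The helper
name dstate is hypothetical, and the column order (pid, state, wchan,
command) is an assumption that must match the ps invocation used.

```shell
#!/bin/sh
# Keep the header line plus any process whose state column starts with "D"
# (uninterruptible wait, usually on disk I/O or a file system lock).
# Expects columns in the order: pid, state, wchan, command.
dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical usage on the affected host:
#   ps -axo pid,state,wchan,command | dstate
```

The wchan column is the useful part: it names what each stuck process is
sleeping on, which may help tell a snapshot-related lock from plain slow
disk I/O.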
It hung again tonight. I was unable to get in to look at anything, so
I just pushed the power button. It gave me the same "shutdown timed
out after 30 seconds" message.
So, I tuned the /home fs to disable soft-updates. I also removed the
.snap directory.
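For anyone wanting to try the same two changes, a rough sketch follows.
It is not necessarily the exact command sequence used here. It assumes
the mount point and device from the earlier df output, and that nothing
(jails included) is holding /home open, since tunefs needs the file
system unmounted or mounted read-only. The run and workaround helpers
and the APPLY variable are hypothetical names; by default the script
only prints what it would do.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set APPLY=1 in the environment to really run the commands (as root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

FS=/home                   # mount point from the report
DEV=/dev/mirror/gm0s1g     # gmirror device from the report

workaround() {
    run umount "$FS"                # tunefs cannot change an active fs
    run tunefs -n disable "$DEV"    # -n disable turns soft updates off
    run mount "$FS"
    run rm -rf "$FS/.snap"          # removes dump's snapshot directory
}

# Hypothetical usage:
#   workaround            # prints the four commands
#   APPLY=1 workaround    # executes them
```

With .snap gone, dump -L should, as far as I can tell from dump(8),
issue a warning and fall back to dumping the live file system rather
than calling mksnap_ffs.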
I would appreciate any suggestions...
--
Scott Lambert KC5MLE Unix SysAdmin
lambert at lambertfam.org
More information about the freebsd-stable
mailing list