Is some combination of gmirror, md file systems, snapshots and,
maybe, quotas considered harmful?
Kris Kennaway
kris at FreeBSD.org
Sun Mar 22 11:03:55 PDT 2009
Scott Lambert wrote:
> On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
>> I have a previously stable machine, other than a one time panic in
>> soft-updates which I could never reproduce, running RELENG_7 from July
>> 23, 2008.
>>
>> Starting update: Wed Jul 23 01:29:47 CDT 2008
>> Finished update: Wed Jul 23 01:31:13 CDT 2008
>>
>> I had the userquota option in the fstab for /home, but I did not yet
>> have anything in /etc/rc.conf to enable them. I have been running an
>> unmodified GENERIC kernel config.
>>
>> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>>
>> It runs a few jails, using ezjail. Two of them are image-based jails,
>> 1GB and 2GB. There is also one jail that is not image-based. The jails
>> live in /home/ezjails.
>>
>> I added another image-based jail, with a 3GB image, on March 12th.
>>
>> I added this machine to our AMANDA setup on March 13, 2009.
>>
>> Things seemed to be okay until the 19th. On the 19th, during the dump
>> of /home, things gradually started to hang. Nagios paged me about
>> services not responding.
>>
>> I did not find any explanation for it. The disks were idle according to
>> systat -vm. I was able to grep the log files on /var for a while, and
>> then I could no longer do anything with it.
>>
>> I eventually had to go to the office and power cycle it. I tried C-A-D
>> first, but shutdown timed out after 30 seconds.
>>
>> Just to make sure it wasn't something that had since been fixed, I
>> updated to RELENG_7 as of Mar 19th.
>>
>> Starting update: Thu Mar 19 03:40:41 CDT 2009
>> Finished update: Thu Mar 19 03:48:45 CDT 2009
>>
>> I rebooted to the new kernel and installed the world just after midnight
>> on the 20th. I started getting paged by Nagios again at 3:40am.
>>
>> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
>> as things began to circle the drain. That was about 30 minutes after
>> the dump attempt had been started by AMANDA. There were many processes
>> waiting in state D. This time I did a reboot -n -q and the box rebooted
>> but was still fscking when I got to the office.
>>
>> # ls -l /home/.snap
>> -r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot
>>
>> # df /home
>> Filesystem Size Used Avail Capacity Mounted on
>> /dev/mirror/gm0s1g 106G 11G 86G 11% /home
>>
>> I removed userquota from the fstab entry for /home and rebooted, just
>> to be sure. The last dangerous combination I remember for snapshots
>> involved quotas. Am I even in the danger zone for quotas without
>> having them compiled into the kernel?
>>
>> It looks like removing the .snap directory should be enough to prevent
>> any future snapshots during the backup process. Does that sound like a
>> reasonable workaround? It would at least remove one variable from the
>> troubleshooting process.
>>
>> Any other suggestions?
>>
>> Thank you for any help you may be able to provide,
>
> Did it to me again tonight. I was unable to get in to look at anything.
> Just pushed the power button. It did give me the same "shutdown timed
> out after 30 seconds."
>
> So, I tuned the /home fs to disable soft-updates. I also removed the
> .snap directory.
>
> I would appreciate any suggestions...
>
http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html
Kris
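
The quoted report mentions many processes waiting in state D while
mksnap_ffs was running. Before the box becomes unreachable, it may help
to record which processes are blocked and on which wait channel, so
there is something concrete to attach to a follow-up. The sketch below
is only an illustration and is not taken from the handbook page linked
above; blocked_procs is a hypothetical helper name, and it assumes ps
output with a header line followed by "pid state wchan command" columns.

```shell
#!/bin/sh
# Hypothetical helper: keep only the rows whose state column (field 2)
# starts with "D", i.e. processes in uninterruptible wait.
# Assumes a header line first, then "pid state wchan command" columns.
blocked_procs() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Usage on a live system (assumed invocation, adjust columns as needed):
#   ps -axo pid,state,wchan,command | blocked_procs
```

Running this from cron every minute during the backup window and
appending the output to a file outside /home would be one way to
capture the state just before the hang, assuming the logging file
system itself stays writable.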