Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?

Sun Mar 22 11:03:55 PDT 2009

Scott Lambert wrote:
> On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
>> I have a previously stable machine, other than a one-time panic in
>> soft-updates which I could never reproduce, running RELENG_7 from July
>> 23, 2008.
>>
>> Starting update: Wed Jul 23 01:29:47 CDT 2008
>> Finished update: Wed Jul 23 01:31:13 CDT 2008
>>
>> I had the userquota option in the fstab for /home, but I did not yet
>> have anything in /etc/rc.conf to enable them.  I have been running an
>> unmodified GENERIC kernel config.
>>
>> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>>
>> It runs a few jails, using ezjail.  Two of them are image-based jails,
>> 1GB and 2GB.  There is also one non-image-file jail.  The jails live in
>> /home/ezjails.
>>
>> I added another image-based jail, with a 3GB image, on March 12th.
>>
>> I added this machine to our AMANDA setup on March 13, 2009.  
>>
>> Things seemed to be okay until the 19th.  On the 19th, during the dump
>> of /home, things gradually started to hang.  Nagios paged me about
>> services not responding.  
>>
>> I did not find any explanation for it.  The disks were idle according to
>> systat -vm.  I was able to grep the log files on /var for a while, and
>> then I could no longer do anything with it.
>>
>> I eventually had to go to the office and power cycle it.  I tried C-A-D
>> first, but shutdown timed out after 30 seconds.
>>
>> Just to make sure it wasn't something that had since been fixed, I
>> updated to RELENG_7 as of Mar 19th.
>>
>> Starting update: Thu Mar 19 03:40:41 CDT 2009
>> Finished update: Thu Mar 19 03:48:45 CDT 2009
>>
>> I rebooted to the new kernel and installed the world just after midnight
>> on the 20th.  I started getting paged by Nagios again at 3:40am.  
>>
>> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
>> as things began to circle the drain.  That was about 30 minutes after
>> the dump attempt had been started by AMANDA.  There were many processes
>> waiting in state D.  This time I did a reboot -n -q and the box rebooted
>> but was still fscking when I got to the office.
>>
>> # ls -l /home/.snap
>> -r--------   1 root  operator  117285093376 Mar 20 03:18 dump_snapshot
>>
>> # df /home
>> Filesystem            Size    Used   Avail Capacity  Mounted on
>> /dev/mirror/gm0s1g    106G     11G     86G    11%    /home
>>
>> I removed userquota from the fstab entry for /home and rebooted, just
>> to be sure.  The last dangerous combination I remember for snapshots
>> involved quotas.  Am I even in the danger zone for quotas without
>> having them compiled into the kernel?
>>
>> It looks like removing the .snap directory should be enough to prevent
>> any future snapshots during the backup process.  Does that sound like a
>> reasonable workaround?  It would at least remove one variable from the
>> troubleshooting process.
>>
>> Any other suggestions?
>>
>> Thank you for any help you may be able to provide,
> 
> Did it to me again tonight.  I was unable to get in to look at anything.
> Just pushed the power button.  It did give me the same "shutdown timed
> out after 30 seconds."
> 
> So, I tuned the /home fs to disable soft-updates.  I also removed the
> .snap directory.
> 
> I would appreciate any suggestions...
>  

http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html

Kris