UFS Crash and directories now missing
Alejandro Imass
ait at p2ee.org
Thu May 3 17:14:53 UTC 2012
On Thu, May 3, 2012 at 9:35 AM, Robert Bonomi <bonomi at mail.r-bonomi.com> wrote:
>
> Alejandro Imass <ait at p2ee.org> wrote:
>
> [ megasnip ]
>
>> > Things to investigate :
>> > - When was the last time this box was rebooted normally? Did it go fine?
>>
>> After I moved the jails to the right place I archived the jails with
>> ezjail-admin and rebooted the server several times, and everything
>> worked as expected.
>
> Rephrasing -- when was the last time _before_the_problem_was_discovered_
> that the machine was re-booted?
>
The jails moved on Friday the 27th, so the last reboot before that was
Apr 4, and the one before that was Feb 29:
Feb 29 10:18:46 nune reboot: rebooted by aimass
Apr 4 19:45:03 nune reboot: rebooted by aimass
Apr 27 19:47:06 nune reboot: rebooted by aimass
Apr 28 02:03:57 nune reboot: rebooted by aimass
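For anyone following along, the "last reboot before the problem" can be picked out of log lines like these mechanically. This is only a minimal sketch: it assumes the year is 2012 (syslog does not record it, so it is taken from the message date), and the helper names reboot_times and last_reboot_before are made up for illustration.

```python
from datetime import datetime

# Reboot lines as quoted in the message; syslog omits the year,
# so 2012 is an assumption taken from the message date.
LOG = """\
Feb 29 10:18:46 nune reboot: rebooted by aimass
Apr 4 19:45:03 nune reboot: rebooted by aimass
Apr 27 19:47:06 nune reboot: rebooted by aimass
Apr 28 02:03:57 nune reboot: rebooted by aimass
"""


def reboot_times(text, year=2012):
    """Return the sorted timestamps of all 'rebooted by' lines (hypothetical helper)."""
    times = []
    for line in text.splitlines():
        if "reboot: rebooted by" not in line:
            continue
        month, day, clock = line.split()[:3]
        stamp = f"{year} {month} {day} {clock}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return sorted(times)


def last_reboot_before(times, cutoff):
    """Return the latest timestamp strictly before cutoff, or None."""
    earlier = [t for t in times if t < cutoff]
    return max(earlier) if earlier else None


times = reboot_times(LOG)
moved = datetime(2012, 4, 27)  # start of the day the jails moved
print(last_reboot_before(times, moved))
```

With the four lines above, this selects the Apr 4 reboot, which matches the answer given in the reply.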
>> > Were the jails created at this time ?
>>
>> No. Most of these jails have been operational for over a year on this
>> server without any incidents.
>
> Clarifying the question -- were the jails created at the time of the last
> _prior_ reboot? i.e., had the machine been re-booted successfully _after_
> the jails were installed, or was this the _first_ such reboot?
>
No, not at all. Most of these jails were created last year, but here is
the detail. cmm_php52_1 is the problematic jail with the MySQL, you
will see a recent date in the config file because I recently added
some cpuset as a band-aid to limit the jail's ability to bring down
the whole system, leaving at least a couple of CPUs free to be able to
ssh and shut it down. There is, however, a new jail, corcaribe_php53, which
was the reason we rebooted the server on Apr 4th, just to make sure
that everything would come up OK after a reboot.
-rw-r--r-- 1 root wheel 917 Feb 16 2011 cat58base
-rw-r--r-- 1 root wheel 917 Apr 29 2011 cm_idvida
-rw-r--r-- 1 root wheel 937 Apr 3 2011 cm_website
-rw-r--r-- 1 root wheel 960 May 2 09:48 cmm_php52_1
-rw-r--r-- 1 root wheel 1037 Apr 4 20:00 corcaribe_php53
-rw-r--r-- 1 root wheel 950 Feb 16 2011 http_proxy
-rw-r--r-- 1 root wheel 917 Aug 3 2011 mcs_cat58
-rw-r--r-- 1 root wheel 917 Feb 10 2011 php52base
-rw-r--r-- 1 root wheel 917 Feb 12 2011 php53base
-rw-r--r-- 1 root wheel 877 Dec 27 20:33 pyugmao
-rw-r--r-- 1 root wheel 877 Mar 21 22:03 testbed
-rw-r--r-- 1 root wheel 1017 May 13 2011 yabarana_cat58
-rw-r--r-- 1 root wheel 1017 Feb 13 2011 yabarana_php52
-rw-r--r-- 1 root wheel 1017 Feb 13 2011 yabarana_php53
> It appears you misunderstood the 'at this time' reference -- it did not
> mean 'at the time of the incident', but 'at the time of the last prior
> reboot'. If English is not your primary language, it is an understandable
> misread.
>
>> As I told you earlier, this server has been running for over a year
>> and we have rebooted many times.
>
> I don't believe you ever mentioned that particular point (multiple
> successful reboots after installation) before. Repeating a prior
> question, _how_long_ before the problem showed up was the most recent
> re-boot? (Doesn't have to be exact -- an 'order of magnitude' estimate
> [a day, a week, a month, several months] is sufficient.)
>
Apr 4th
>> If there are such problems they exist
>> by using the EzJail commands and I find this unlikely.
>
> What you 'find unlikely' is irrelevant. The entire situation is 'unlikely',
> yet it happened. So one -has- to look at unlikely things. <wry grin>
>
funny
>> here is the mount output if that's of any help:
>
> [ first disk, and 'fdescfs', and 'procfs' references removed, for clarity ]
>
>> /dev/ad6s1.journal on /usr/jails (ufs, asynchronous, local, gjournal)
>> /usr/jails/basejail on /usr/jails/yabarana-php53/basejail (nullfs,
[...]
>
> Yes, that is a good start at useful detail. It is, presumably, _after_
> the problem, and _after_ you had restored things to their proper places.
>
Yes.
> Is it safe to assume that you do -not- have such a 'mount' output from
> some time 'before' the problem? ( There's no rational reason why you
> -would- have such, but _if_ it existed, and there were any differences
> between 'then' and 'now', it could be very informative.)
>
No, but from what I remember it was mostly the same. If needed, I can pull
similar mount output from other server(s) where we run similar
set-ups and that have never failed.
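If a saved 'before' copy of the mount output did exist (or one from a comparable server is used as a stand-in), the then/now comparison could be done roughly like this. It is only a sketch: the function names are made up, the first sample line is the one quoted above, and the nullfs line uses a hypothetical jail name ("example") with hypothetical options, because the real line is truncated in the quote.

```python
def parse_mount(text):
    """Map mount point -> (device, options) from `mount` output lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if " on " not in line or not line.endswith(")"):
            continue  # skip blank or wrapped/truncated lines
        device, rest = line.split(" on ", 1)
        mountpoint, opts = rest.rsplit(" (", 1)
        options = tuple(o.strip() for o in opts[:-1].split(","))
        table[mountpoint] = (device, options)
    return table


def diff_mounts(before, after):
    """Report mount points that were removed, added, or changed."""
    b, a = parse_mount(before), parse_mount(after)
    return {
        "removed": sorted(set(b) - set(a)),
        "added": sorted(set(a) - set(b)),
        "changed": sorted(m for m in set(a) & set(b) if a[m] != b[m]),
    }


# First line is from the quoted output; the nullfs line is hypothetical.
BEFORE = """\
/dev/ad6s1.journal on /usr/jails (ufs, asynchronous, local, gjournal)
/usr/jails/basejail on /usr/jails/example/basejail (nullfs, local, read-only)
"""
AFTER = """\
/dev/ad6s1.journal on /usr/jails (ufs, asynchronous, local, gjournal)
"""

result = diff_mounts(BEFORE, AFTER)
print(result)
```

Keying on the mount point rather than the device makes a nullfs mount that went missing show up as "removed" even though its source directory still exists.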
> Another critical piece of information is what directories -- by full path
> name -- disappeared from 'where they were', and where -- by full path name,
> again -- did you find them, and _with_what_names_? If everything was
> moved from the same source point to the same destination, it's not necessary
> to itemize each one, but the details of _one_ 'typical' migration are needed.
> It is also significant if there was 'anything else' in the 'where they
> belonged' directory that was -not- moved. *OR* if there was anything else
> (something other than the '/' of a jail) there, that was _also_ moved.
>
I took a screenshot because I somehow suspected no one would believe
me. I don't know if I can attach it here, but if not I can send it to
you privately.
> "Narrative" descriptions, as previously provided, and while clear to someone
> familiar with the machine in question, are not sufficiently precise to allow
> an 'outsider' to follow the events without 'logically' replicating the setup,
> and then guessing at the meaning of any shorthands employed.
>
OK. I can provide almost any information required.
>
>
> One comment: for 'defensive' purposes it would be useful to break ad6 up
> into two slices, putting 'basejail' in its own slice. Then, for production
> use, that slice can be mounted RO, and with the 'system immutable' flag
> set on everything in that filesystem.
>
Yes. From one of your posts that became somewhat clear to me: Having
all the jails on a single 150GB slice seems like a bad idea.
Thanks! Let me know if I can provide anything else to help determine
the root cause.
--
Alejandro Imass
More information about the freebsd-questions mailing list