5.3 in diskless cluster: irregular reboots at 14:09 hr. ?!?!
Colin J. Raven
colin at kenmore.kozy-kabin.nl
Wed Dec 29 07:47:43 PST 2004
On Dec 29, Rob launched this into the bitstream:
> Colin J. Raven wrote:
>> On Dec 29, Rob launched this into the bitstream:
>>> I'm running 5.3-Stable on all PC's.
>>> I have a master/router with 7 diskless slaves. One of the
>>> slaves shows irregular reboots, without a trace, not even
>>> a shutdown message in the logs.
>>> Until now I have the following sudden reboots of one particular
>>> slave happen:
>>> Nov. 16 14:09:41
>>> Nov. 30 14:09:23
>>> Dec. 28 14:09:34
>>> Each is exactly at the same time; this is rather peculiar, isn't it?
>>> Any idea what's going on here, or how to trace this problem?
>> What *else* is happening at (or immediately before) 14:09 on this machine??
>> For example is something rather intense occurring immediately beforehand?
>> I'm thinking power supply failure when it gets loaded beyond a certain
>> point...so, pursuant to that is there maybe a big log grep happening
>> beforehand, or some other event that stresses components, thus consuming
>> more power?
> Thank you Colin.
> What would be a good command to run, to find out how stressful the
> PC is right before the reboot? Is 'top' good enough? Or is there
> something better? 'ps auxw' for example?
That's a good question. I suspect there may be a wide spectrum of
opinions on that one.
My own instinct would be to pipe the output of ps
-whatever-switches-you-like to a file, *then* squirt top output into the
same file (appended, naturally). Also, just to be obsessive about it,
tail -[some number] /var/log/messages into the same file and have cron
send it to you at some external address. One day probably wouldn't show
you anything, but an accumulation of data *might* help you get to grips
with the conditions that immediately precede the witching hour.
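The capture routine described above might be sketched as a small shell function. This is only a sketch under assumptions: the function name `snapshot`, the `TOP_CMD` and `LOG_FILE` variables, the 40-line cap on top output and the 50-line tail are all made up for illustration, and `top -b -d 1` is the FreeBSD batch-mode form (other systems use different flags).

```shell
#!/bin/sh
# Sketch only: names, paths and line counts below are illustrative assumptions.

# Batch-mode top, one display (FreeBSD flags); override for other systems.
TOP_CMD=${TOP_CMD:-"top -b -d 1"}
# Log file to sample; /var/log/messages is the one named in the thread.
LOG_FILE=${LOG_FILE:-/var/log/messages}

# snapshot FILE: append one timestamped ps/top/log sample to FILE.
snapshot() {
    {
        echo "=== snapshot taken $(date) ==="
        echo "--- ps auxw ---"
        ps auxw 2>&1
        echo "--- top ---"
        # head bounds the output in case top keeps printing displays.
        $TOP_CMD 2>&1 | head -n 40
        echo "--- last 50 lines of $LOG_FILE ---"
        tail -n 50 "$LOG_FILE" 2>&1
    } >> "$1"
}
```

Saved in a script that ends with a call such as `snapshot /var/tmp/preboot.log` (a hypothetical path), this could be started from a crontab line like `8 14 * * * /bin/sh /root/snapshot.sh`. cron mails whatever a job prints to the address set in `MAILTO`, so having the script `cat` the file at the end is one possible way to get the data off the box, provided outgoing mail works from the slave.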
> Since I don't know on what date it happens a next time, I will start
> a cron job each day at 14:08 to check how stressful the PC is. It will
> output the result of the job to disk.
Yes, for sure, a daily cron job is clearly required here. Opinions
vary, but FWIW, I wouldn't read the job output on the local disk. If
this is serious enough you may want to read it outside of the cluster
environment, as said above.
>> It has that funny "I'll bet the PSU is on the way out" feeling to it,
>> but actually proving that can be tedious.
> I may also swap UPS between two slaves and see if the reboots are
> related to a shaky UPS. I don't want to replace the PSU yet :(.
Can't hurt, but think for a quick moment; if the box PSU is going down
and the UPS is also shaky, then you potentially have two problems and
not one. I'd (personally) take the step-by-step methodical approach.
First examine the box environment for some time until you can see what
immediately precedes the apparently spontaneous reboot, then focus on
external issues like the UPS. Eliminate one factor at a time, even if
you have innumerable items on your own inner list of possible suspects.
Keep us posted please. There have been a couple of instances of this
behaviour posted to the list recently, and it would be interesting (as
well as instructive) to learn in what proportion of cases the PSU is
ultimately proven to be the cause. I'd doubt the OS itself is at fault
in almost all cases. I mean, ffs it's FreeBSD.
Regards & HTH,