5.3 in diskless cluster: irregular reboots at 14:09 hr. ?!?!

Tue Jan 18 03:31:52 PST 2005

Colin J. Raven wrote:
> On Dec 29, Rob launched this into the bitstream:
> 
>> Colin J. Raven wrote:
>>> On Dec 29, Rob launched this into the bitstream:
>>> 
>>>> 
>>>> I'm running 5.3-Stable on all PCs.
>>>> 
>>>> I have a master/router with 7 diskless slaves. One of the
>>>> slaves shows irregular reboots, without a trace, not even
>>>> a shutdown message in the logs.
>>>> 
>>>> So far, the following sudden reboots of one particular slave
>>>> have happened:
>>>>        Nov. 16 14:09:41
>>>>        Nov. 30 14:09:23
>>>>        Dec. 28 14:09:34
>>>> 
>>>> Each is exactly at the same time; this is rather peculiar, isn't it?
>>>> 
>>>> Any idea what's going on here, or how to trace this problem?
>>> 
>>> 
>>> What *else* is happening at (or immediately before) 14:09 on this machine?? 
>>> For example is something rather intense occurring immediately beforehand? 
>>> I'm thinking power supply failure when it gets loaded beyond a certain 
>>> point...so, pursuant to that is there maybe a big log grep happening 
>>> beforehand, or some other event that stresses components, thus consuming 
>>> more power?
>>
>> Thank you Colin.
>>
>> What would be a good command to run, to find out how stressful the
>> PC is right before the reboot? Is 'top' good enough? Or is there
>> something better? 'ps auxw' for example?
> 
> That's a good question. I suspect there may be a wide spectrum of 
> opinions on that one.
> 
> My own instinct would be to pipe the output of ps 
> -whatever-switches-you-like to a file, *then* append top output to the 
> same file. Just to be obsessive about it, also tail -[some number] 
> /var/log/messages into the same file and have cron send it to you at 
> some external address. One day probably wouldn't show you anything, but 
> an accumulation of data *might* help you get to grips with conditions 
> that immediately precede the witching hour of 14:09.
> 
>> Since I don't know on what date it will happen next, I will start
>> a cron job each day at 14:08 to check how stressed the PC is. It will
>> write the result of the job to disk.
> 
> Yes, a daily cron job is clearly required here. Opinions vary, but 
> FWIW, I wouldn't rely on reading the job output from the local disk. If 
> this is serious enough you may want to read it outside of the cluster 
> environment, as said above.
> 
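
The ps/top/tail snapshot and the daily 14:08 cron job discussed above
could be sketched roughly as follows. This is only a minimal sketch, not
what was actually run on the cluster: the function name snapshot, the
file names and the 50-line tail are made up for illustration, and the
default "top -b" assumes a top whose batch mode prints one display and
then exits (which is how I understand FreeBSD's top to behave). Set
TOP_CMD if your top needs different flags.

```shell
#!/bin/sh
# Hypothetical helper: all names and paths here are illustrative only.

# Append one timestamped snapshot of process and log state to a file.
#   $1 = output file (ideally somewhere outside the suspect machine)
#   $2 = system log to sample (defaults to /var/log/messages)
snapshot() {
    out=$1
    log=${2:-/var/log/messages}
    {
        echo "=== snapshot $(date) ==="
        echo "--- ps auxw ---"
        ps auxw
        echo "--- top ---"
        # Assumption: batch-mode top prints a single display and exits.
        # Override with TOP_CMD if your top behaves differently.
        ${TOP_CMD:-top -b}
        echo "--- tail of $log ---"
        tail -n 50 "$log"
    } >> "$out" 2>&1
}
```

A wrapper script that calls, say, "snapshot /some/shared/dir/pre-reboot.log"
could then be started from the slave's crontab one minute before the
witching hour (the script path and address below are placeholders):

    MAILTO=you@example.org
    8 14 * * * /path/to/pre-reboot-snapshot.sh

cron only mails what the job prints, so the wrapper would also have to
print the fresh snapshot if you want it delivered to the external address.
Appending rather than overwriting keeps the history, so that several
days can be compared once the next reboot has happened.
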
>>> It has that funny "I'll bet the PSU is on the way out" feeling to it,
>>> but actually proving that can be tedious.
>>
>> I may also swap the UPSes between two slaves and see if the reboots
>> are related to a shaky UPS. I don't want to replace the PSU yet :(.
> 
> Can't hurt, but think for a quick moment; if the box PSU is going down 
> and the UPS is also shaky, then you potentially have two problems and 
> not one. I'd (personally) take the step-by-step methodical approach. 
> First examine the box environment for some time until you can see what 
> immediately precedes the apparently spontaneous reboot, then focus on 
> external issues like the UPS. Eliminate one factor at a time, even if 
> you have innumerable items on your own inner list of possible suspects.
> 
> Keep us posted please. There have been a couple of instances of this 
> behaviour posted to the list recently, it would be interesting (as well 
> as instructive) to understand the proportionate number of cases in which 
> the PSU is ultimately proven to be the cause. I'd doubt the OS itself in 
> almost all cases. I mean, ffs it's FreeBSD.

The troublesome PC is part of a cluster of 8 PCs, which are protected by
four UPSes, with two PCs powered by each UPS. Only one PC showed
unexpected reboots, without leaving a trace of a shutdown.

Last December I removed its UPS and connected the two PCs directly
to mains power (accepting the risk of no protection against
sudden power cuts).

Since that action, no unexpected reboots have happened.

Unless further follow-ups come in, this is my conclusion:
a badly functioning UPS.

I don't know why the time of 14:09 was so critical. Maybe some big switch
somewhere in the chemistry building causes a ripple in the building's
power supply. Just a wild guess.

Best regards,
Rob.