Problem restarts

Bill Moran wmoran at potentialtech.com
Mon Jun 28 07:06:42 PDT 2004


Mark Terribile <materribile at yahoo.com> wrote:
> Hi,
> 
> I'm having a problem with spontaneous restarts.  This isn't a new problem,
> but I've done the obvious things and the problem hasn't gone away.  I
> was thinking of asking on -hackers, but I'm trying here first.
> 
> The system is a 4.8 with a mix of patches and port upgrades of various
> ages.  I'm planning to rebuild the whole thing, bringing it up to date,
> but I'm hoping to be able to wait for a 5.x in STABLE; I don't want to do
> this twice, since I expect I'll have to dump and restore everything.
> 
> The hardware is a 2.6 GHz P4 with 2 GByte of GEIL dual-channel memory.
> (The problem existed on the previous, somewhat slower, memory as well.)
> The box contains the processor and motherboard (Gigabyte GA-SINXP1394),
> two floppy drives, CD and CD/W drives, an HP DAT, three IBM/Hitachi
> 36G/10K SCSI drives, and one 120G IDE.  The SCSI card is by Adaptec; the
> video card is a low-end NVidia, and I'm running their video driver.  The
> PS is an Antec True380, which should be enough for the box, with something
> to spare.  There are several extra, large fans, of which more later.
> 
> The system, monitor, printer, and cable modem are all powered through an
> APC BACK-UPS 450, about 18 months old.  It's shown in the last week that
> it can keep things up for more than an hour.
> 
> The symptom is a restart that leaves no indication of how it happened.
> 
>   Recently, the system shut down (completely, and at the power supply)
>   instead of restarting.  In that case, the last deliberate shutdown
>   was a `shutdown -h now'; it appears that in every other case, the last
>   deliberate shutdown was a `-r now'.  (Question: does the machine
>   architecture have settings for reset-resume vs. reset-halt, settings
>   that might be remembered when a later action occurs?)  It has
>   subsequently shut down with an immediate restart.
> 
> There are no failure indications in the /var/log/messages, nor reported
> by dmesg.  (The console scrolls by very quickly.)  The message sequence
> over the restart typically looks like this:
> 
> =======================================================================
> Jun  7 18:39:09 moleend /kernel: arp: 24.228.64.1 moved from 00:05:00:e7:17:44 to 00:05:00:e7:17:57 on em0
> Jun  7 18:39:09 moleend /kernel: arp: 24.228.64.1 moved from 00:05:00:e7:17:57 to 00:05:00:e7:17:44 on em0
> Jun  7 18:59:06 moleend dhclient: New Network Number: 24.228.64.0
> Jun  7 18:59:06 moleend dhclient: New Broadcast Address: 255.255.255.255
> Jun  7 22:47:33 moleend /kernel: Copyright (c) 1992-2003 The FreeBSD Project.
> Jun  7 22:47:33 moleend /kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> ========================================================================
> 
> The restart most often occurs AFTER X has been shut down (and often
> restarted) but sometimes when X has not been run.  It most often occurs
> when the system is under heavy CPU load, but sometimes when the load
> has been light.
> 
> I thought at one time it might be a thermal problem and undertook to
> fix that.  (I am still working to get more cooling air over the disks.)
> Right now, I have 120 mm fans rated at 130-135 CFM (Panaflow and JMC)
> pushing air in and out of the box, and pressurizing a duct feeding the
> CPU cooler, which is now cool to the touch.  The memory modules are cool
> to the touch.  While the disks need a proper plenum to route more air
> over them, I no longer believe that there is a thermal problem.  The
> vid card's fan-blown heatsink is warm (not hot) to the touch; the
> northbridge's fan-blown heatsink is warm (not hot) to the touch.
> 
> (Some people commute to white-collar jobs in heavy pickups; I drive a
> small server as my PC.  No chrome pipes.)
> 
> So: what should I do next?  Should I set the system up to go to the
> kernel debugger on panic, or even start it via the kernel debugger?
> (Where is the full documentation?)  Should I shell out for an even
> bigger power supply?  Is there another log that I should examine?
> A restart wire that I should check?  A power bus I should scope?
> (I'll have to borrow a scope somewhere.)  Is it time for an exorcist?

I would look at the hardware, but test it methodically rather than
swapping parts at random.  Run programs like memtest86 and cpuburn for
extended periods to see whether either one triggers the reboot.  Restarts
that show no consistent trigger and leave nothing in the logs suggest a
hardware problem.
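
Since the restarts leave nothing in /var/log/messages, it can help to
record a heartbeat while the stress tests run, so that after a restart
you at least know when the box died and how loaded it was.  What follows
is only a minimal sketch in Python, not a standard tool: the function
names, the record format, and the 10-second interval are all my own
assumptions, and any log path you pass in is up to you.

```python
import os
import time


def heartbeat_line(now, load):
    """Format one record: local timestamp plus 1/5/15-minute load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load=%.2f,%.2f,%.2f\n" % ((stamp,) + tuple(load))


def write_heartbeat(path, now=None, load=None):
    """Append one record and force it to disk so it survives a sudden reset."""
    now = time.time() if now is None else now
    load = os.getloadavg() if load is None else load
    with open(path, "a") as fh:
        fh.write(heartbeat_line(now, load))
        fh.flush()
        os.fsync(fh.fileno())


def last_heartbeat(path):
    """Return the last non-empty record, i.e. the last moment known alive."""
    last = None
    with open(path) as fh:
        for line in fh:
            if line.strip():
                last = line.rstrip("\n")
    return last


def run(path, interval=10):
    """Write a heartbeat every `interval` seconds until the process is killed."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

Start run() with a log path of your choosing before kicking off memtest86
or cpuburn-style load from within the OS, and call last_heartbeat() on the
same file after the next unexplained restart.  The fsync on every record
is deliberate: without it, the final lines before a hard reset may never
reach the disk.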

You may want to hire the exorcist ... hardware problems can be a PITA
to track down.

-- 
Bill Moran
Potential Technologies
http://www.potentialtech.com


More information about the freebsd-questions mailing list