FreeBSD machines dying, we've tried so much...

fenix at ramb.com.ua
Wed Jun 22 16:11:06 GMT 2005


Hello, Matt.

>>The vast majority of panics are hardware-related.  It is rare nowadays
>>for a usermode program to make the system panic.  In particular you said
>>the problem happens more under load.  That really points even more to a
>>hardware problem - bad CPU cache ram, bad ram, scsi termination, that
>>sort of thing.
>>
>>Ted
>>  
>>

> This is kind of going to be a blanket post to all the recent suggestions
> to me.  I appreciate suggestions :)   Ted, sorry, my other posts had
> dmesg and hardware specs, etc. I just couldn't remember the subject line
> of that thread. I'll be more descriptive here.

> We have two different servers crashing.  Both are SMP, but on different
> hardware.  We have five FreeBSD servers in total, and only two are
> affected.  That is why I do not believe this is a hardware problem.

> In any case, the machines are in a cold room where the temperature is
> constantly maintained.  20 other servers in there are perfectly stable,
> with no problems.

> This particular machine that crashed last night while running portsdb
> -uU is a Supermicro machine, with hyperthreading disabled in the BIOS,
> dual 3.06 GHz CPUs, and 4 GB of memory.  We ran memtest on orion (the
> machine that crashed last night) a week or so ago, and it found 70,000
> ECC errors.  Those were fixed and that machine has been stable until
> last night.  I've now disabled SMP support; we'll see if that keeps it
> stable or not.  portsdb -uU ran without problems after I disabled SMP.

> As far as uranus, the other box (we keep a planet scheme for a certain
> set of servers), we ran memtest86 and found no errors at all.  That box
> crashed about two days ago but has been stable since.  It has not lasted
> more than a week without doing a kernel trap and freezing.

> It seems that both these servers have this problem.  Out of the five
> FreeBSD servers we have, these two are the ones with the highest load.
> Maybe a higher load on the other three servers would cause the same
> problem.  I would agree with you that this is a hardware problem, but
> seeing it on more than one server, with two different architectures and
> our highest load, makes me reconsider.

> If this is truly a bug in FreeBSD 5.4-RELEASE, maybe this is something
> that has been fixed in -STABLE?  I will compile a debug kernel today and
> try to provide a trace of the problem.  I'll do it on whichever server
> crashes next.

I had the same situation with two different heavily loaded servers (both
SMP, with 8 GB of RAM and HT enabled) running 5.4-RELEASE. After disabling
HT and updating the OS to 5.4-STABLE via cvsup, everything has been working
fine without any problems; the last reboot was 28 days ago.





-- 
Best regards,
Sergey S. Ropchan


