FreeBSD Machines dying, we've tried so much....
Chad Leigh -- Shire.Net LLC
chad at shire.net
Wed Jun 22 16:12:09 GMT 2005
On Jun 22, 2005, at 9:59 AM, Matt Juszczak wrote:
>
>
>> The vast majority of panics are hardware-related. It is rare nowadays
>> for a usermode program to make the system panic. In particular you
>> said the problem happens more under load. That really points even more
>> to a hardware problem - bad CPU cache RAM, bad RAM, SCSI termination,
>> that sort of thing.
>>
>> Ted
>>
>>
>
> This is kind of going to be a blanket post to all the recent
> suggestions to me. I appreciate suggestions :) Ted, sorry, my
> other posts had dmesg and hardware specs, etc. I just couldn't
> remember the subject line of that thread. I'll be more descriptive
> here.
>
> We have two different servers crashing. Both are SMP, but on
> different hardware. We have five FreeBSD servers in total, and
> only two are affected. That is why I do not believe this is a
> hardware problem.
>
> In any case, the machines are in a cold room where the temperature
> is constantly maintained. 20 other servers in there are perfectly
> stable, with no probs.
>
> This particular machine that crashed last night while running
> portsdb -uU is a Super Micro machine, with hyperthreading disabled
> in the BIOS, dual 3.06 GHz CPUs, and 4 GB of memory. We ran memtest
> on orion (the machine that crashed last night) a week or so
> ago, and it found 70,000 ECC errors. Those were fixed and that
> machine had been stable until last night. I've now disabled SMP
> support; we'll see if that keeps it stable or not. portsdb -uU ran
> without problems after I disabled SMP.
>
> As for uranus, the other box (we keep a planet scheme for a
> certain set of servers), we ran memtest86 and found no errors at
> all. That box crashed about two days ago but has been stable
> since. It has never lasted more than a week without hitting a kernel
> trap and freezing.
>
> It seems that both these servers have this problem. Out of the
> five FreeBSD servers we have, these two are the ones with the
> highest load. Maybe a higher load on the other three servers would
> cause the same problem. I'd agree with you that this is a hardware
> problem, but seeing it on more than one server, with two different
> architectures and our highest load, makes me reconsider.
>
> If this is truly a bug in FreeBSD 5.4-RELEASE, maybe this is
> something that has been fixed in -stable? I will compile a debug
> kernel today and try to provide a trace of the problem. I'll do it
> on whichever server crashes next.
What do they have in common? Disk controller? Network controller?
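One rough way to check, assuming you have saved the dmesg output from each box: pull the driver names out of the probe lines and intersect them. This is only a sketch. The sample text below is made up for illustration (it is not the real dmesg from orion or uranus), and the regex assumes the usual "name + unit number: <description>" probe-line layout.

```python
import re

# Matches probe lines such as "em0: <Some NIC> port ... on pci2"
# and captures the driver name without its unit number.
PROBE_LINE = re.compile(r"^([a-z_]+)\d+: <", re.MULTILINE)

def drivers(dmesg_text):
    """Return the set of driver names seen in a dmesg dump."""
    return set(PROBE_LINE.findall(dmesg_text))

# Hypothetical excerpts, for illustration only.
box_a = """\
cpu0: <ACPI CPU> on acpi0
em0: <Example gigabit NIC> port 0x3000-0x303f on pci2
ahd0: <Example SCSI adapter> port 0x4000-0x40ff on pci3
"""
box_b = """\
cpu0: <ACPI CPU> on acpi0
em0: <Example gigabit NIC> port 0x2000-0x203f on pci1
twe0: <Example RAID controller> port 0x5000-0x500f on pci4
"""

# Drivers present on both machines are the first candidates to look at.
common = sorted(drivers(box_a) & drivers(box_b))
print(common)
```

Anything that shows up on both crashing boxes but not on the three stable ones would be worth a closer look.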
Chad
---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net