random hangs/reboots with Dell servers

Chuck Swiger cswiger at mac.com
Thu Apr 19 17:43:19 UTC 2007

On Apr 19, 2007, at 3:54 AM, Dimitris Zilaskos wrote:
> Over the last 3 years we have installed FreeBSD 5.x and 6.x, with
> the currently deployed version being 6.1, on a variety of Dell rack
> mounted systems.
> The Dell systems used so far are PowerEdge 1750, 2950 (both SCSI),
> and SC1425 (SATA). All of them are dual CPU Xeon systems.

I've got a large number of Dell PowerEdge 1750, 1850, 2900, and 2950
systems deployed in various production environments, whereas some other
clients are using HP ProLiant 360/370 boxen.  Both seem to be rock
solid under either 5.4/5.5 or 6.1/6.2.  I've even got a pair of
firewall boxes running nothing but NAT and sshd, which are at 600+
days of uptime:

FreeBSD 5.4-STABLE (FW) #0: Tue Jul 12 11:10:14 EDT 2005

Welcome to FreeBSD!
12:24PM  up 636 days, 19:26, 3 users, load averages: 0.25, 0.14, 0.04

(Machines running more services get OS or service related updates
more frequently, typically every month to every 3 months, but I
don't like to make changes to a running machine unless I expect the
change to make an improvement which justifies the disruption.  For a
non-SMP firewall, where updating means losing external network
connectivity, nothing in 6.x is worth the cost of the update as yet,
IMHO.)

> All these systems serve as mail/web servers, with 2 to 15 jails.
> Installation has always proceeded normally without problems.  
> However, after a few months of operation, all of these systems,  
> purchased at different moments during the last 3 years, will begin  
> rebooting randomly or freezing completely.
> These reboots/freezes will at first occur once per 6 months, then
> gradually move to once per month, and normally stabilize
> around once per week, but in the case of the 1750 system it
> once even happened twice in a day.
> Load does not seem to matter, since even after shutting down all
> services on the servers, random reboots still occurred.

Sounds like something hardware-related, such as a power-supply problem,
if the interval between failures is gradually getting shorter and is
not correlated with load at all.

> So far we have tried various tricks dug up from the archives, like
> disabling ACPI and HT, but nothing changed.
> We have migrated some systems that had these issues to a RHEL
> compatible OS, and they run rock solid under heavy load.

Hmm.  Well, you might have to wait a few weeks or months to be
able to get a reasonable comparison of longer-term stability, but this
at least implies that something like cooling or a failed fan isn't
a likely cause.

> Right now I have enabled kernel crash dumps and I am waiting for
> the next crash. But I understand a lot of people use FreeBSD with
> Dell servers, and I would like to hear how to tackle the
> situation we are facing.

Try to get a crash dump.  Also, you might find that reviewing the BIOS
options and disabling everything which is not needed, hopefully
including USB, helps.
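
For reference, a minimal sketch of the rc.conf settings typically used to capture a kernel crash dump on a 6.x system.  It assumes a swap partition at least as large as RAM; the explicit device name mentioned in the comment is hypothetical and would need to match the local disk layout.

```
# /etc/rc.conf
dumpdev="AUTO"         # dump to the configured swap device on panic
                       # (assumption: if AUTO is not supported on your
                       # release, name the swap device explicitly,
                       # e.g. the hypothetical /dev/da0s1b)
dumpdir="/var/crash"   # where savecore(8) writes vmcore.N at next boot
```

After the next panic and reboot, the dump should show up under /var/crash, where it can be examined with kgdb against a kernel built with debugging symbols.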


