More Server Crash Saga
Grant Peel
gpeel at thenetnow.com
Fri Mar 17 18:59:07 UTC 2006
Hi Derek,
I got this data using ipmitool from the server's BMC just after a crash this afternoon (about 3 minutes after the reboot).
I will be heading to the NOC this afternoon to copy the hard drive to another machine that I have been using for about a year and a half.
Anyway, here is the sensor data:
Temp | 38 degrees C | ok
Temp | 50 degrees C | ok
Ambient Temp | 30 degrees C | ok
Planar Temp | 35 degrees C | ok
Riser Temp | 34 degrees C | ok
Temp | 40 degrees C | ok
Temp | 40 degrees C | ok
CMOS Battery | 3.15 Volts | ok
ROMB Battery | Not Readable | ns
VCORE | 0x01 | ok
VCORE | Not Readable | ns
PROC VTT | 0x01 | ok
1.5V PG | 0x01 | ok
1.8V PG | 0x01 | ok
3.3V PG | 0x01 | ok
5V PG | 0x01 | ok
5V Riser PG | 0x01 | ok
Riser PG | 0x01 | ok
PFault Fail Safe | Not Readable | ns
Presence | 0x01 | ok
Presence | 0x02 | ok
Presence | 0x01 | ok
Presence | 0x02 | ok
ROMB Presence | 0x02 | ok
FAN 1A RPM | 9600 RPM | ok
FAN 1B RPM | 6900 RPM | ok
FAN 2A RPM | 9900 RPM | ok
FAN 2B RPM | 6825 RPM | ok
FAN 3A RPM | 9825 RPM | ok
FAN 3B RPM | 6825 RPM | ok
FAN 4A RPM | 10200 RPM | ok
FAN 4B RPM | 6675 RPM | ok
Status | 0x80 | ok
Status | Not Readable | ns
Status | 0x01 | ok
Status | Not Readable | ns
VRM | 0x01 | ok
VRM | 0x01 | ok
OS Watchdog | 0x00 | ok
SEL | Not Readable | ns
Intrusion | 0x00 | ok
PS Redundancy | Not Readable | ns
Fan Redundancy | 0x01 | ok
SCSI Connector A | Not Readable | ns
Drive | 0xc0 | ok
ECC Corr Err | 0xc0 | ok
ECC Uncorr Err | Not Readable | ns
I/O Channel Chk | 0xc0 | ok
PCI Parity Err | 0xc0 | ok
PCI System Err | 0xc0 | ok
SBE Log Disabled | Not Readable | ns
Logging Disabled | Not Readable | ns
Unknown | Not Readable | ns
PROC Protocol | Not Readable | ns
PROC Bus PERR | Not Readable | ns
PROC Init Err | Not Readable | ns
PROC Machine Chk | Not Readable | ns
Memory Spared | Not Readable | ns
Memory Mirrored | 0x01 | ok
Memory RAID | Not Readable | ns
Memory Added | 0x01 | ok
Memory Removed | 0x01 | ok
PCIE Fatal Err | 0x01 | ok
Chipset Err | 0x01 | ok
Err Reg Pointer | 0x01 | ok
root on s1#
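For anyone who wants to scan a dump like the one above without reading every line, here is a minimal Python sketch. It assumes the three-column "name | value | status" layout shown in this message; the function names are made up for illustration, and the sample is only a handful of lines copied from the output above.

```python
def parse_sdr(text):
    """Split 'name | value | status' lines into (name, value, status) tuples."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:          # ignore prompts, blank lines, etc.
            rows.append(tuple(parts))
    return rows


def temperatures(rows):
    """Return (name, degrees_c) for every sensor reporting 'NN degrees C'."""
    temps = []
    for name, value, _status in rows:
        if value.endswith("degrees C"):
            temps.append((name, float(value.split()[0])))
    return temps


def unreadable(rows):
    """Return the names of sensors whose status column is 'ns'."""
    return [name for name, _value, status in rows if status == "ns"]


# A few lines copied from the sensor listing above.
SAMPLE = """\
Temp | 38 degrees C | ok
Temp | 50 degrees C | ok
Ambient Temp | 30 degrees C | ok
ROMB Battery | Not Readable | ns
FAN 1A RPM | 9600 RPM | ok
"""

rows = parse_sdr(SAMPLE)
hottest = max(temperatures(rows), key=lambda t: t[1])
print("hottest sensor:", hottest)
print("unreadable sensors:", unreadable(rows))
```

Several sensors share the same name (there are four plain "Temp" entries), so the sketch keeps a list of rows rather than a dictionary keyed by name.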
----- Original Message -----
From: Derek Ragona
To: Grant Peel ; freebsd-questions at freebsd.org
Sent: Thursday, March 16, 2006 5:45 PM
Subject: Re: More Server Crash Saga
Grant,
That is a 1U rack-mount server, which makes it prone to heat problems, particularly under load. You might want to check the ambient and internal temperature sensors as well.
That server uses an Intel chipset (and probably an Intel motherboard), which should allow "out-of-band" monitoring. You should see what you can use to monitor the system and find out what it is reporting prior to a lockup.
It may be time to just call Dell and have them send a replacement motherboard or an entire unit.
-Derek
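One way to capture what the system reports before a lockup is to append timestamped sensor snapshots to a log at a fixed interval. The following Python sketch is only an illustration: the function names, the log path in the usage note, and the 60-second interval are assumptions, and it simply runs whatever command it is given.

```python
import os
import subprocess
import time
from datetime import datetime, timezone


def snapshot(cmd):
    """Run cmd and return its stdout, or an error note if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"ERROR running {cmd!r}: {exc}\n"
    return result.stdout


def log_once(cmd, path):
    """Append one timestamped snapshot to the log and force it to disk."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a") as log:
        log.write(f"=== {stamp} ===\n")
        log.write(snapshot(cmd))
        log.flush()
        os.fsync(log.fileno())   # so the last entry survives a hard lockup


def monitor(cmd, path, interval=60, count=None):
    """Log a snapshot every `interval` seconds; run forever if count is None."""
    taken = 0
    while count is None or taken < count:
        log_once(cmd, path)
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
```

A call such as monitor(["ipmitool", "sdr"], "/var/tmp/bmc-sensors.log") would then leave the last readings before a crash at the end of the file. Run on the server itself, the log simply stops at the lockup, which at least marks the time it happened.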
At 03:47 PM 3/16/2006, Grant Peel wrote:
Hi all,
Still getting crashes today ... FreeBSD 6.0 on a PE 1850.
Does the output of vmstat -i over five seconds show a problem? An interrupt storm?
I have been searching, trying to find out what the 'rate' column means and what it should be.
interrupt                    total    rate
irq0: clk                  3277223     999
irq5: em1                     8877       2
irq6: ehci0 atapci0             85       0
irq7: mpt0 uhci2             56401      17
irq8: rtc                   419429     127
irq11: em0 uhci0             85684      26
irq13: npx0                      1       0
irq14: ata0                     48       0
Total                      3847748    1173
root on s1# vmstat -i
interrupt                    total    rate
irq0: clk                  3278793     999
irq5: em1                     8883       2
irq6: ehci0 atapci0             85       0
irq7: mpt0 uhci2             56408      17
irq8: rtc                   419630     127
irq11: em0 uhci0             85752      26
irq13: npx0                      1       0
irq14: ata0                     48       0
Total                      3849600    1174
root on s1# vmstat -i
interrupt                    total    rate
irq0: clk                  3280691     999
irq5: em1                     8889       2
irq6: ehci0 atapci0             85       0
irq7: mpt0 uhci2             56408      17
irq8: rtc                   419873     127
irq11: em0 uhci0             85843      26
irq13: npx0                      1       0
irq14: ata0                     48       0
Total                      3851838    1173
root on s1# vmstat -i
interrupt                    total    rate
irq0: clk                  3282850     999
irq5: em1                     8891       2
irq6: ehci0 atapci0             85       0
irq7: mpt0 uhci2             56408      17
irq8: rtc                   420149     127
irq11: em0 uhci0             86153      26
irq13: npx0                      1       0
irq14: ata0                     48       0
Total                      3854585    1174
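On the 'rate' question: in vmstat -i the rate column is the total divided by the seconds since boot, so it is an average over the whole uptime and a storm that started recently could be hidden in it. Comparing the totals of two snapshots gives the recent rate instead. The Python sketch below does that for the first and last snapshots above. It assumes the clk interrupt ticks 1000 times per second (the 999 on irq0 suggests it, but kern.hz should be checked), and the function names are illustrative only.

```python
def parse_vmstat_i(text):
    """Map each interrupt source name to its 'total' column."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Data lines end in two integers (total, rate); skip header and prompts.
        if len(fields) < 3 or not (fields[-1].isdigit() and fields[-2].isdigit()):
            continue
        counts[" ".join(fields[:-2])] = int(fields[-2])
    return counts


def deltas(before, after):
    """Interrupts taken by each source between the two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def interval_rates(before, after, clock="irq0: clk", hz=1000):
    """Estimate elapsed seconds from the clock interrupt, then per-second rates.

    hz=1000 is an assumption about the clock interrupt frequency.
    """
    d = deltas(before, after)
    elapsed = d[clock] / hz
    return elapsed, {name: n / elapsed for name, n in d.items()}


# Lines copied from the first and last snapshots above.
FIRST = """\
interrupt total rate
irq0: clk 3277223 999
irq5: em1 8877 2
irq7: mpt0 uhci2 56401 17
irq8: rtc 419429 127
irq11: em0 uhci0 85684 26
"""

LAST = """\
interrupt total rate
irq0: clk 3282850 999
irq5: em1 8891 2
irq7: mpt0 uhci2 56408 17
irq8: rtc 420149 127
irq11: em0 uhci0 86153 26
"""

elapsed, rates = interval_rates(parse_vmstat_i(FIRST), parse_vmstat_i(LAST))
print(f"elapsed: about {elapsed:.1f} s")
for name, per_second in sorted(rates.items()):
    print(f"{name:20s} {per_second:8.1f}/s")
```

Between those two snapshots the clk total advances by 5627, which under the 1000 Hz assumption is roughly the five seconds mentioned above, and em0 advances by 469 over the same span.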