More Server Crash Saga

Grant Peel gpeel at thenetnow.com
Fri Mar 17 18:59:07 UTC 2006


Hi Derek,

I pulled this data from the server's BMC with ipmitool just after a crash this afternoon (about 3 minutes after the reboot).

I will be heading to the NOC this afternoon to copy the hard drive to another machine that I have been using for about a year and a half.

Anyway, here is the sensor data:

Temp             | 38 degrees C      | ok
Temp             | 50 degrees C      | ok
Ambient Temp     | 30 degrees C      | ok
Planar Temp      | 35 degrees C      | ok
Riser Temp       | 34 degrees C      | ok
Temp             | 40 degrees C      | ok
Temp             | 40 degrees C      | ok
CMOS Battery     | 3.15 Volts        | ok
ROMB Battery     | Not Readable      | ns
VCORE            | 0x01              | ok
VCORE            | Not Readable      | ns
PROC VTT         | 0x01              | ok
1.5V PG          | 0x01              | ok
1.8V PG          | 0x01              | ok
3.3V PG          | 0x01              | ok
5V PG            | 0x01              | ok
5V Riser PG      | 0x01              | ok
Riser PG         | 0x01              | ok
PFault Fail Safe | Not Readable      | ns
Presence         | 0x01              | ok
Presence         | 0x02              | ok
Presence         | 0x01              | ok
Presence         | 0x02              | ok
ROMB Presence    | 0x02              | ok
FAN 1A RPM       | 9600 RPM          | ok
FAN 1B RPM       | 6900 RPM          | ok
FAN 2A RPM       | 9900 RPM          | ok
FAN 2B RPM       | 6825 RPM          | ok
FAN 3A RPM       | 9825 RPM          | ok
FAN 3B RPM       | 6825 RPM          | ok
FAN 4A RPM       | 10200 RPM         | ok
FAN 4B RPM       | 6675 RPM          | ok
Status           | 0x80              | ok
Status           | Not Readable      | ns
Status           | 0x01              | ok
Status           | Not Readable      | ns
VRM              | 0x01              | ok
VRM              | 0x01              | ok
OS Watchdog      | 0x00              | ok
SEL              | Not Readable      | ns
Intrusion        | 0x00              | ok
PS Redundancy    | Not Readable      | ns
Fan Redundancy   | 0x01              | ok
SCSI Connector A | Not Readable      | ns
Drive            | 0xc0              | ok
ECC Corr Err     | 0xc0              | ok
ECC Uncorr Err   | Not Readable      | ns
I/O Channel Chk  | 0xc0              | ok
PCI Parity Err   | 0xc0              | ok
PCI System Err   | 0xc0              | ok
SBE Log Disabled | Not Readable      | ns
Logging Disabled | Not Readable      | ns
Unknown          | Not Readable      | ns
PROC Protocol    | Not Readable      | ns
PROC Bus PERR    | Not Readable      | ns
PROC Init Err    | Not Readable      | ns
PROC Machine Chk | Not Readable      | ns
Memory Spared    | Not Readable      | ns
Memory Mirrored  | 0x01              | ok
Memory RAID      | Not Readable      | ns
Memory Added     | 0x01              | ok
Memory Removed   | 0x01              | ok
PCIE Fatal Err   | 0x01              | ok
Chipset Err      | 0x01              | ok
Err Reg Pointer  | 0x01              | ok
root on s1#
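
For anyone who wants to scan a dump like the one above without reading every row, here is a minimal sketch. It assumes the three-column `name | reading | status` layout shown in the output, and it treats `ok` and `ns` as the only statuses that need no attention, which is an assumption based on this dump alone. The function names and the shortened `SAMPLE` text are made up for illustration.

```python
# Hypothetical helper: parse 'name | reading | status' sensor lines.
SAMPLE = """\
Temp             | 38 degrees C      | ok
Temp             | 50 degrees C      | ok
Ambient Temp     | 30 degrees C      | ok
CMOS Battery     | 3.15 Volts        | ok
ROMB Battery     | Not Readable      | ns
FAN 1A RPM       | 9600 RPM          | ok
"""


def parse_sensors(text):
    """Split each 'name | reading | status' line into a 3-tuple."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:  # skip prompts, blank lines, anything else
            rows.append(tuple(parts))
    return rows


def temperatures(rows):
    """Return (name, degrees_c) for readings such as '38 degrees C'."""
    temps = []
    for name, reading, _status in rows:
        if reading.endswith("degrees C"):
            temps.append((name, float(reading.split()[0])))
    return temps


def needs_attention(rows):
    """Return rows whose status is neither 'ok' nor 'ns' (assumed harmless)."""
    return [row for row in rows if row[2] not in ("ok", "ns")]


rows = parse_sensors(SAMPLE)
hottest = max(temperatures(rows), key=lambda item: item[1])
print("hottest sensor:", hottest)
print("needs attention:", needs_attention(rows))
```

Several sensors share the name `Temp`, so the rows are kept as a list rather than a dictionary keyed by name.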
  ----- Original Message ----- 
  From: Derek Ragona 
  To: Grant Peel ; freebsd-questions at freebsd.org 
  Sent: Thursday, March 16, 2006 5:45 PM
  Subject: Re: More Server Crash Saga


  Grant,

  That is a 1U rack-mount server, which makes it prone to heat problems, particularly under load.  You might want to check the ambient temperature and the internal heat sensors as well.

  That server uses an Intel chipset (and probably an Intel motherboard), which should allow "out-of-band" monitoring.  You should see what you can use to monitor the system and find out what it is reporting prior to a lockup.
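
One way to capture what the system reports before a lockup is to log the sensor readings at a fixed interval and look at the last entries after the next crash. The following is only a sketch under assumptions: it runs `ipmitool sdr` locally, and the function names, log path, and 60-second interval are hypothetical choices, not anything from this thread.

```python
import os
import subprocess
import time


def read_sensors(command=("ipmitool", "sdr")):
    """Run the sensor command and return its text output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def append_snapshot(path, text, timestamp):
    """Append each non-empty line of text to path, prefixed with timestamp."""
    with open(path, "a") as log:
        for line in text.splitlines():
            if line.strip():
                log.write(f"{timestamp} {line.rstrip()}\n")
        # Push the data to disk so a sudden lockup loses as little as possible.
        log.flush()
        os.fsync(log.fileno())


def log_forever(path, interval_seconds=60):
    """Take a timestamped snapshot every interval_seconds until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        append_snapshot(path, read_sensors(), stamp)
        time.sleep(interval_seconds)
```

Calling `log_forever("/var/log/sensors.log")` (path is an example) would start the loop. Logging from a second machine over the network would survive a crash better than logging on the box itself, but that setup is not shown here.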

  It may be time to just call Dell and have them send a replacement MB or an entire unit.

          -Derek


  At 03:47 PM 3/16/2006, Grant Peel wrote:

    Hi all,

    Still getting crashes today (FreeBSD 6.0, PE 1850).

    Does the output of vmstat -i over five seconds show a problem? An interrupt storm?

    I have been searching, trying to find out what the 'rate' column means and what it should be.

    interrupt                          total       rate
    irq0: clk                        3277223        999
    irq5: em1                           8877          2
    irq6: ehci0 atapci0                   85          0
    irq7: mpt0 uhci2                   56401         17
    irq8: rtc                         419429        127
    irq11: em0 uhci0                   85684         26
    irq13: npx0                            1          0
    irq14: ata0                           48          0
    Total                            3847748       1173
    root on s1# vmstat -i
    interrupt                          total       rate
    irq0: clk                        3278793        999
    irq5: em1                           8883          2
    irq6: ehci0 atapci0                   85          0
    irq7: mpt0 uhci2                   56408         17
    irq8: rtc                         419630        127
    irq11: em0 uhci0                   85752         26
    irq13: npx0                            1          0
    irq14: ata0                           48          0
    Total                            3849600       1174
    root on s1# vmstat -i
    interrupt                          total       rate
    irq0: clk                        3280691        999
    irq5: em1                           8889          2
    irq6: ehci0 atapci0                   85          0
    irq7: mpt0 uhci2                   56408         17
    irq8: rtc                         419873        127
    irq11: em0 uhci0                   85843         26
    irq13: npx0                            1          0
    irq14: ata0                           48          0
    Total                            3851838       1173
    root on s1# vmstat -i
    interrupt                          total       rate
    irq0: clk                        3282850        999
    irq5: em1                           8891          2
    irq6: ehci0 atapci0                   85          0
    irq7: mpt0 uhci2                   56408         17
    irq8: rtc                         420149        127
    irq11: em0 uhci0                   86153         26
    irq13: npx0                            1          0
    irq14: ata0                           48          0
    Total                            3854585       1174 
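
On the 'rate' question: as far as I know, FreeBSD's `vmstat -i` computes rate as the total count divided by uptime, so it is an average since boot and barely moves between runs, as the four samples above show. A storm would show up more clearly in the difference between two samples. The sketch below works that out for the first two samples, using the clk counter as a stopwatch. It assumes clk ticks steadily at the 999 per second shown in the rate column, and the function names are hypothetical.

```python
# First two samples from the output above (rows that never changed are omitted).
FIRST = """\
irq0: clk                        3277223        999
irq5: em1                           8877          2
irq7: mpt0 uhci2                   56401         17
irq8: rtc                         419429        127
irq11: em0 uhci0                   85684         26
"""

SECOND = """\
irq0: clk                        3278793        999
irq5: em1                           8883          2
irq7: mpt0 uhci2                   56408         17
irq8: rtc                         419630        127
irq11: em0 uhci0                   85752         26
"""


def parse_vmstat_i(text):
    """Map each interrupt source name to its total count."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like 'irq7: mpt0 uhci2 56401 17'; skip header and Total.
        if len(fields) >= 4 and fields[0].startswith("irq"):
            totals[" ".join(fields[:-2])] = int(fields[-2])
    return totals


def interval_rates(before, after, clock="irq0: clk", clock_hz=999):
    """Interrupts per second between two samples, timed by the clock counter.

    clock_hz=999 is an assumption taken from the rate column above.
    """
    elapsed = (after[clock] - before[clock]) / clock_hz
    if elapsed <= 0:
        raise ValueError("samples are identical or out of order")
    return {
        name: (after[name] - before[name]) / elapsed
        for name in after
        if name in before
    }


rates = interval_rates(parse_vmstat_i(FIRST), parse_vmstat_i(SECOND))
for name, per_second in sorted(rates.items()):
    print(f"{name:20s} {per_second:8.1f}/s")
```

In these samples no source comes close to the clock rate over the interval, which does not look like an interrupt storm to me, though two samples taken well before a crash cannot rule one out.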
