Dell PE4600 RAID5 server failing

Derek Ragona derek at computinginnovations.com
Wed Nov 14 13:38:06 PST 2007


At 12:12 PM 11/14/2007, Tamouh H. wrote:
> >
> > Derek Ragona wrote:
> > > At 09:00 AM 11/14/2007, Barnaby Scott wrote:
> > >> I suspect I already know the answer to this, which is that the
> > >> trouble I am having is nothing to do with the OS at all, but I have
> > >> to ask, because I am otherwise up against a total brick wall!
> > >>
> > >> I bought a second-hand Dell Poweredge 4600 and installed FreeBSD 6.2
> > >> earlier this year. I had it set up with RAID5 using its PERC3/DC
> > >> controller, with 7 x 73GB disks (+ 1 hot spare). So far so good, and
> > >> it worked faultlessly as a Samba server for several months.
> > >>
> > >> At the beginning of October, it went down, reporting a mismatch
> > >> between the configuration on the NVRAM and the disks. With help from
> > >> Dell support, I managed to recreate the RAID array and it worked
> > >> again for a month.
> > >>
> > >> In early November it happened again, and has kept happening since. At
> > >> one point it appeared that the backplane was faulty, so I replaced
> > >> that, but I cannot keep the server up for more than a day or so
> > >> without this 'mismatch' problem.
> > >>
> > >> What about diagnostics on the hardware you may ask? I have run all
> > >> the diagnostic tools that Dell can supply - several times - and the
> > >> server declares itself to be totally fault-free.
> > >>
> > >> My specific questions therefore:
> > >>
> > >> Is there any way at all that FreeBSD could be involved with this
> > >> problem? (I did notice for example that the Dell PERC3/DC controller
> > >> was not in the list of supported hardware - but then again, why did
> > >> it work for several months?)
> > >>
> > >> Can I use FreeBSD to tell me anything about the fault that Dell's
> > >> diagnostic tools haven't found?
> > >>
> > >> (I do hope someone might be able to help - Dell are trying to get me
> > >> to switch to a 'supported' OS!)
> > >>
> > >>
> > >> Thanks
> > >>
> > >> Barnaby Scott
> > >
> > > It doesn't sound like an OS issue, as you set up the RAID outside the
> > > OS.  It may be a bad drive or drives.  Most RAID drives have RAID
> > > configuration information written to them, and if this becomes
> > > unreadable you will have RAID faults.
> > >
> > > Another likely culprit is heat.  Overheating drives often fail.  Are
> > > you sure the temperatures in the drive enclosure are OK?
> > >
> > > If you can, run diagnostics on the drives; this usually requires
> > > taking the drives out of the RAID array, though.
> > >
> > >         -Derek
> > >
> >
> > Thanks for replying - as I said, this is a long shot trying
> > to see if there is any OS involvement.
> >
> > The drives are fine - I have used two different tools to
> > analyse them while the computer is booted from a live CD and
> > the RAID configuration cleared on the controller. Besides,
> > you would expect one drive to fail at a time, and if this
> > happened, the hot spare would surely be pressed into service.
> > Nothing like this has happened though - the controller is
> > reporting several drives (not always the same ones) failed
> > simultaneously, but when the array is re-created from the
> > disks, everything works fine. Problem is, it goes down again
> > a day or so later.
> >
> > As for heat, there is nothing being reported there and the
> > fans that cool that area are working.
> >
> > Any other ideas gratefully received!
> >
> > Barnaby Scott
>
>This is very unlikely to be OS related. But here are a few pointers:
>
>1) Check the make/model of the drives. Certain SCSI drive models had a
>firmware glitch a while ago that made them disconnect from a RAID. I had
>personal experience with these (Seagate U320).
>
>2) What happened in October? Did anything change hardware-, software- or
>power-wise?
>
>3) For the NVRAM and disk mismatch, I'd say check the controller: is the
>backup battery present but weak?
>
>4) Unlikely to be the source, but run a test on your physical RAM using 
>MEMTEST86+ and check the power supply is sufficient and working properly.
>
>

I've had some RAID drives disconnect and go missing; the fault cleared and
the array was rebuilt after a full power-off reboot.  I believe this was due
to power issues in my area.  Specifically, my line power from the utility was
running high, over 127 volts, making over-voltage spikes prevalent.  On a
couple of spikes I saw the drives disconnect.

So it could be power related.
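
If you can log line voltage (for example from a UPS or a plug-in meter),
comparing the readings against the times the array drops out would help
confirm or rule this out.  A minimal sketch, assuming you already have
(timestamp, volts) samples from somewhere; the sample values, the function
name and the 127 V limit are illustrative only and not tied to any
particular UPS or tool:

```python
# Hypothetical sketch: flag line-voltage samples above a chosen limit.
# The readings below are made-up example data, not real measurements.

def over_voltage(samples, limit=127.0):
    """Return the (timestamp, volts) pairs whose voltage exceeds limit."""
    return [(ts, volts) for ts, volts in samples if volts > limit]


readings = [
    ("13:00", 121.4),
    ("13:05", 127.8),
    ("13:10", 124.0),
    ("13:15", 129.1),
]

for ts, volts in over_voltage(readings):
    print(f"{ts}: {volts:.1f} V exceeds limit")
```

If the flagged times line up with the controller reporting failed drives,
that points at power rather than the disks or the OS.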

On temperature, I would install an external temperature probe and check the
readings from that.  Some remote KVM solutions now include temperature probes.

         -Derek



