dealing with a failing drive

Sun Nov 25 21:07:18 PST 2007

Are we looking at the same output?

Here's the output of idacontrol show from one of my DL360 servers:

mail# idacontrol show
cmd_show_all()
[Compaq Integrated Array controller]
          Controller uptime: 301 hours 54 minutes 22 seconds
           Firmware Version: 1.50 (running) 1.50 (ROM)
        Revision -
                   Hardware: 2
                  Marketing: A
             SCSI bus count: 2
         Max drives per bus: 16
            Maximum request: 65535 blocks

Logical drive 0: 17359MB (35553120 sectors), blocksize=512
         Status: Logical drive ok
           Mode: Mirroring (RAID1)
       Drive ID: 00000000
    Drive Label:
bus 1 target 0 lun 0:
        enclosure 0, bay 0, connector 2J
        <COMPAQ  BB01813467       3BM0G606000071011MHF 3B07> direct-access
        17361MB (35556888 512 byte sectors, 1088 reserved)
        Sync, Ultra2, Wide - Configured in a logical volume.
bus 1 target 1 lun 0:
        enclosure 0, bay 1, connector 2J
        <COMPAQ  BF01864663       3EV0J0V3000072363NRD 3B0B> direct-access
        17361MB (35556888 512 byte sectors, 1088 reserved)
        Sync, Ultra2, Wide - Configured in a logical volume.
bus 1 target 7 lun 0:
        enclosure 0, bay 7, connector 2J
        <COMPAQ  PROLIANT 4L2I     JB21> non-disk
        Async
mail#

There are two physical disks in the server: bus 1 target 0 and
bus 1 target 1.  Those ARE the physical disks.  If one of them
has failed, then instead of:

 Sync, Ultra2, Wide - Configured in a logical volume.

you will see something like:

 Sync, Ultra2, Wide - Unconfigured

or nothing at all.
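
A check like this is easy to script.  What follows is only a rough
sketch, assuming the output always looks like the listing above.  The
function names, the expected_disks parameter, and the choice of Python
are my own illustration and are not part of idacontrol.  Anything other
than "Logical drive ok" on a Status line, or a direct-access device
without "Configured in a logical volume", is treated as a problem.

```python
import re
import subprocess

# Strings copied from the healthy `idacontrol show` listing above.
DISK_OK = "Configured in a logical volume"
LOGICAL_OK = "Logical drive ok"
BLOCK_HEADER = re.compile(r"^bus (\d+) target (\d+) lun (\d+):$")


def find_problems(show_output, expected_disks=None):
    """Return a list of problem descriptions found in `idacontrol show` text.

    expected_disks is a hypothetical helper parameter: if given, a listing
    with fewer direct-access devices than that is also reported, to cover
    the case where a failed disk shows nothing at all.
    """
    problems = []
    devices = {}      # "bus B target T lun L" -> list of detail lines
    current = None
    for line in show_output.splitlines():
        stripped = line.strip()
        header = BLOCK_HEADER.match(stripped)
        if header:
            current = "bus {} target {} lun {}".format(*header.groups())
            devices[current] = []
        elif stripped.startswith("Status:"):
            # Status line of a logical drive, not of a physical device.
            current = None
            status = stripped[len("Status:"):].strip()
            if status != LOGICAL_OK:
                problems.append("logical drive status: " + status)
        elif current is not None and line[:1].isspace():
            # Indented detail line belonging to the current device.
            devices[current].append(stripped)
        else:
            current = None

    # Only direct-access devices are disks; the backplane is "non-disk".
    disks = {name: details for name, details in devices.items()
             if any("direct-access" in d for d in details)}
    for name, details in disks.items():
        if not any(DISK_OK in d for d in details):
            problems.append(name + ": not reported as '" + DISK_OK + "'")
    if expected_disks is not None and len(disks) < expected_disks:
        problems.append("only {} of {} expected disks listed".format(
            len(disks), expected_disks))
    return problems


def main():
    """Run `idacontrol show`, print any problems, return an exit status."""
    result = subprocess.run(["idacontrol", "show"],
                            capture_output=True, text=True)
    found = find_problems(result.stdout, expected_disks=2)
    for problem in found:
        print(problem)
    return 1 if found else 0
```

To run it from cron, add "import sys" and "sys.exit(main())" at the
bottom.  The script prints nothing when all is well, and cron normally
mails any output to the job's owner, so you only hear from it when a
disk drops out.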

It is normal for idacontrol to generate soft write errors.  The
developer knows about this.  There's really no easy way to make
it not happen.  It doesn't hurt anything, however.

If the RAID card itself is flaky, you can't really detect that from
software.  Even the Windows RAID utilities that HP/Compaq supplies
won't tell you this.

The "by the book" way of troubleshooting these servers is this: if you
get a disk failure, you immediately swap the disk.  Then, if the failure
happens again and you're pretty sure it's not the disk, you take the
server down, boot it into Compaq Diagnostics, and let it run for a day
or so.

It is not uncommon to end up with several additional hard drives
that you don't need in the process of identifying a bad RAID card
in a server.  We have all done it; it comes with the territory.  If
you cannot afford it, stay away from these servers.  Remember, these
servers are designed for a medium to large corporation that has
a lot of resources.

To give you a typical scenario, a couple of weeks ago one of our
mailservers running on a Proliant 1600R started freezing up.  I had the
admin pull the entire disk array and put the disks into our backup
server, which went online in place of the original.  The original
server was pulled and put on a test bench.  About a week later, after
much hair-pulling, running of diagnostics, and so on, the admin finally
discovered that the processor board had worked its way almost out of
the socket.

Ted

> -----Original Message-----
> From: owner-freebsd-questions at freebsd.org
> [mailto:owner-freebsd-questions at freebsd.org]On Behalf Of David Newman
> Sent: Sunday, November 25, 2007 2:58 PM
> To: Ted Mittelstaedt
> Cc: freebsd-questions at freebsd.org
> Subject: Re: dealing with a failing drive
>
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 11/24/07 12:39 PM, Ted Mittelstaedt wrote:
> > The output of idacontrol show will show if one of the
> > hard disks in the SmartArray has failed.  Your choice with
> > a hardware array is to either run it with redundancy or not.
> > (ie: raid5 or mirroring or striping)  You have to choose
> > which is more important for you.
> >
> > IMHO it is very foolish to stripe an array that you have
> > critical data on and assume that you can predict a failure
> > of a disk using smart or other monitoring, and replace it
> > in advance of a failure.  If your concern is redundancy, then
> > add more disks to the array and create a raid 5 or a mirror.
> > Then ignore all the predictive junk and let the array card
> > concern itself with detecting if a drive has failed.  Run
> > idacontrol periodically out of a script that checks for a
> > failure of a disk and e-mails you if there is one.
>
> Thanks, this is good advice, but it doesn't answer the specific
> questions I had:
>
> 1. How to diagnose the health of a *physical* disk that's part of a RAID
> array (RAID1, in this case) in an old Compaq Proliant server?
>
> 2. Is it normal for idacontrol to generate soft write errors?
>
> Backstory here is that Proliant server #1 generated beaucoup hard and
> soft read and write errors and eventually locked up. I thought it was
> one of the disks but replacing one at a time didn't help. So I took both
> disks and put them in identical Proliant server #2. Ergo, I would
> conclude server #1's RAID controller flaked out.
>
> idacontrol is useful for telling the health of the logical disk. What it
> doesn't tell me (or maybe I just don't see it) is whether the physical
> disks are ok, and those "soft write errors" concern me. I had a failure
> situation, and need to figure out whether just the controller was bad or
> whether I need to replace at least one disk too.
>
> Thanks again!
>
> dn
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.3 (Darwin)
>
> iD8DBQFHSf39yPxGVjntI4IRAp1yAJ4vMV9FkeaBsHRr/Z5WpCL27wJ3tACfS+pT
> 3UVlscnQUZhe8ulHksKDWsY=
> =Om7/
> -----END PGP SIGNATURE-----