gvinum losing state about failed drives

Sun Mar 12 11:19:08 UTC 2006

Hi,

My hardware:

    Intel L440GX+ serverboard, 2x 700MHz P3, 1GB ECC RAM
    2x Seagate SCSI 73GB off mainboard SCSI controller
    2x add-in Promise ATA133 controllers
    4x Hitachi 500GB ATA133 disks off the Promise controllers
    add-in Intel gigabit ethernet controller

My gvinum config:

    12 volumes mirrored across da0 and da1
    1 volume 500GB mirrored across ad4 and ad8
    1 volume 500GB mirrored across ad6 and ad10

After upgrading the first server from 4-STABLE to 6-STABLE, I have twice
had two ATA disks become unavailable because the controller stopped
responding.  The first time I lost ad8 and ad10, which contain vol12.p1
and vol13.p1; the second time (after everything had been manually
repaired) I lost vol12.p0 and vol13.p0.

When the ATA controller stops, two gvinum drives go down, and the plexes
and subdisks on them go down as well.  After a reboot, however, all
drives, plexes and subdisks are up again.  Comparing the plexes by hand
(using an optimized cmp, which still takes 5.5 hours for 500GB) shows
that they are not equal, understandably, because some data was updated
while one plex was down.
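
A block-wise comparison of this kind can be sketched roughly as follows.
This is only an illustration in Python, not the optimized cmp mentioned
above; the function name differing_blocks, the 1 MiB block size and the
limit parameter are assumptions made for the example.

```python
BLOCK_SIZE = 1024 * 1024  # assumed read size (1 MiB); tune for the hardware


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE, limit=10):
    """Return the byte offsets of the first `limit` blocks that differ.

    Both paths are read sequentially in `block_size` chunks.  If one
    input is shorter than the other, the blocks past its end count as
    differing.
    """
    offsets = []
    offset = 0
    # Default (buffered) binary mode: read(n) returns n bytes unless EOF.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while len(offsets) < limit:
            chunk_a = a.read(block_size)
            chunk_b = b.read(block_size)
            if not chunk_a and not chunk_b:
                break  # both inputs are exhausted
            if chunk_a != chunk_b:
                offsets.append(offset)
            offset += max(len(chunk_a), len(chunk_b))
    return offsets
```

Pointed at the two plexes of one volume (for example
/dev/gvinum/plex/vol12.p0 and /dev/gvinum/plex/vol12.p1, paths assumed
here), an empty result would mean the plexes are identical, while any
returned offsets show where they have diverged.  For scale, 500GB in
5.5 hours works out to roughly 25 MB/s.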

It seems that the failure of a drive and its subdisks is not recorded in
the metadata of the other drives.

I'm now contemplating a rollback of the upgrade, as this server has been
down too long already, but I'll try to put together a similar setup here
to do more testing.

Regards,

Paul Schenkeveld