Panic after trying to recover from drive failure with geom_vinum

Paul Mather paul at gromit.dlib.vt.edu
Mon Nov 15 10:04:30 PST 2004


I have a 5.3-STABLE system upgraded from a 5.2.1 system that used a
root-on-vinum mirrored setup.  Under both 5.2.1 and 5.3, the system
periodically logs the "TIMEOUT - WRITE_DMA retrying" errors that people
sometimes mention.  Usually it is nothing, but the one that happened
last night seems to have caused geom_vinum to mark the drive as down
and flag all of its plexes and subdisks as down, too:

Nov 15 04:34:14 handle kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=1581375
Nov 15 04:34:15 handle kernel: ad0: FAILURE - WRITE_DMA timed out
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk swap.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex swap.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk root.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex root.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk var.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex var.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk usr.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex usr.p0 is down
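
For anyone wanting to check which objects geom_vinum has flagged without
reading the log by eye, here is a minimal sketch in Python.  It assumes
the messages always have the shape "GEOM_VINUM: <kind> <name> is <state>",
which is all the excerpt above shows; the function names are
hypothetical, not part of any FreeBSD tool.

```python
import re

# Sample lines copied from the log excerpt in this message.
LOG = """\
Nov 15 04:34:14 handle kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=1581375
Nov 15 04:34:15 handle kernel: ad0: FAILURE - WRITE_DMA timed out
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk swap.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex swap.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk root.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex root.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk var.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex var.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk usr.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex usr.p0 is down
"""

# Assumed message shape: "GEOM_VINUM: <kind> <name> is <state>".
# Only "subdisk" and "plex" kinds appear in the excerpt.
PATTERN = re.compile(r"GEOM_VINUM: (\w+) (\S+) is (\w+)")


def gvinum_states(text):
    """Return {kind: {name: state}}, keeping the last state seen per object."""
    states = {}
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            kind, name, state = match.groups()
            states.setdefault(kind, {})[name] = state
    return states


def down_objects(text, kind):
    """List names of the given kind whose last reported state is "down"."""
    objects = gvinum_states(text).get(kind, {})
    return [name for name, state in objects.items() if state == "down"]


if __name__ == "__main__":
    print("plexes down:  ", down_objects(LOG, "plex"))
    print("subdisks down:", down_objects(LOG, "subdisk"))
```

Lines that do not come from GEOM_VINUM (such as the ad0 timeout
messages) are simply skipped.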


Of course, the drive wasn't actually down, but how do I tell geom_vinum
that?  I tried "gvinum start laurel" (laurel is the name of the ad0
drive), but geom_vinum said it couldn't.  So I thought I'd try to
start the plexes individually.  Unfortunately, "gvinum start root.p0"
caused the machine to reboot.  (I was logged in via SSH, so I couldn't
see what happened on the console; I presume there was a panic
followed by a reboot.)

Luckily, when the system came back, "laurel" was now flagged as "up" and
so a "gvinum start" of each plex synchronised them and brought them all
back up.
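
The post-reboot step above can be sketched as a dry run in Python.  This
only prints one "gvinum start <plex>" command per plex and executes
nothing; the plex names come from the log excerpt, while the order and
the helper name are assumptions made for illustration.  It says nothing
about whether starting a plex is safe while the drive is still flagged
down, which is what triggered the reboot before.

```python
# Plex names taken from the log excerpt in this message.
PLEXES = ["swap.p0", "root.p0", "var.p0", "usr.p0"]


def start_commands(plexes):
    """Build one "gvinum start <plex>" argument list per plex name."""
    return [["gvinum", "start", name] for name in plexes]


if __name__ == "__main__":
    for argv in start_commands(PLEXES):
        # Dry run: show the command rather than running it.
        print(" ".join(argv))
```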

My question is this: what would be a better way to recover from this in
the future, i.e., how can I let geom_vinum know that the drive is in
fact "up"?  With classic vinum, "setstate" could have been used as a
last resort.  In retrospect, I thought that perhaps an "atacontrol
detach" followed by an "atacontrol attach" might have brought the
drive's real state to geom_vinum's attention.  Does this sound likely?

I'm just trying to avoid another unnecessary panic+reboot in the
future. :-)

Cheers,

Paul.
-- 
e-mail: paul at gromit.dlib.vt.edu

"Without music to decorate it, time is just a bunch of boring production
 deadlines or dates by which bills must be paid."
        --- Frank Vincent Zappa


More information about the freebsd-geom mailing list