Panic after trying to recover from drive failure with geom_vinum
Paul Mather
paul at gromit.dlib.vt.edu
Mon Nov 15 10:04:30 PST 2004
I have a 5.3-STABLE system upgraded from a 5.2.1 system that used a
root-on-vinum mirrored setup. Under both 5.2.1 and 5.3, the system
periodically gets the "TIMEOUT - WRITE_DMA retrying" errors that people
sometimes mention. Usually they are harmless, but the one that happened
last night apparently caused geom_vinum to mark the drive as down and
flag all its plexes and subdisks as down, too:
Nov 15 04:34:14 handle kernel: ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=1581375
Nov 15 04:34:15 handle kernel: ad0: FAILURE - WRITE_DMA timed out
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk swap.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex swap.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk root.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex root.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk var.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex var.p0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: subdisk usr.p0.s0 is down
Nov 15 04:34:15 handle kernel: GEOM_VINUM: plex usr.p0 is down
Of course, the drive wasn't actually down, but how do I tell geom_vinum
that? I tried "gvinum start laurel" (laurel is the name of the ad0
drive), but geom_vinum said it couldn't. So I thought I'd try to
start the plexes individually. Unfortunately, "gvinum start root.p0"
caused the machine to reboot. (I was logged in via SSH, so I couldn't
see what happened on the console; I presume there was a panic
followed by a reboot.)
Luckily, when the system came back, "laurel" was now flagged as "up" and
so a "gvinum start" of each plex synchronised them and brought them all
back up.
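
For reference, the per-plex restart can be sketched as a small dry-run
helper. This is only an illustration, assuming the kernel log lines look
like the excerpt earlier in this message; the function names
down_plexes and print_start_commands are made up for this sketch. It
prints the "gvinum start" commands rather than running them, so the list
can be checked by hand first.

```sh
#!/bin/sh
# Sketch only: find plexes that GEOM_VINUM logged as down and print
# (not execute) a "gvinum start" command for each of them.
# Assumption: log lines have the form
#   ... kernel: GEOM_VINUM: plex <name> is down

# Read syslog-style lines on stdin; print each down plex name once.
down_plexes() {
    sed -n 's/.*GEOM_VINUM: plex \([^ ]*\) is down.*/\1/p' | sort -u
}

# Read syslog-style lines on stdin; print one command per down plex.
print_start_commands() {
    down_plexes | while read -r plex; do
        echo "gvinum start $plex"
    done
}
```

With the functions loaded, "print_start_commands < /var/log/messages"
would print one command per plex (the log path is an assumption; use
wherever the kernel messages are actually logged). Whether a start
succeeds still depends on the drive itself being marked up.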
My question is this: what would be a better way to recover from this in
the future, i.e., how can I let geom_vinum know the drive is in fact "up"?
With classic vinum, "setstate" could have been used as a last resort. I
thought in retrospect that perhaps an "atacontrol detach" followed by an
"atacontrol attach" might have brought the drive's real state to
geom_vinum's attention. Does this sound likely?
I'm just trying to avoid another unnecessary panic+reboot in the future,
here. :-)
Cheers,
Paul.
--
e-mail: paul at gromit.dlib.vt.edu
"Without music to decorate it, time is just a bunch of boring production
deadlines or dates by which bills must be paid."
--- Frank Vincent Zappa