kern/79035: gvinum unable to create a striped set of mirrored sets/plexes

Sven Willenberger sven at dmv.com
Fri Apr 8 15:27:51 PDT 2005


On Sun, 2005-03-20 at 15:51 +1030, Greg 'groggy' Lehey wrote:
> On Saturday, 19 March 2005 at 23:43:00 -0500, Sven Willenberger wrote:
> > Greg 'groggy' Lehey presumably uttered the following on 03/19/05 22:11:
> >> On Sunday, 20 March 2005 at  2:04:34 +0000, Sven Willenberger wrote:
> >>
> >>> Under the current implementation of gvinum it is possible to create
> >>> a mirrored set of striped plexes but not a striped set of mirrored
> >>> plexes. For purposes of resiliency the latter configuration is
> >>> preferred as illustrated by the following example:
> >>>
> >>> Use 6 disks to create one of 2 different scenarios.
> >>>
> >>> 1) Using the current abilities of gvinum create 2 striped sets using
> >>> 3 disks each: A1 A2 A3 and B1 B2 B3 then create a mirror of those 2
> >>> sets such that A(123) mirrors B(123). In this situation if any drive
> >>> in Set A fails, one still has a working set with Set B. If any drive
> >>> now fails in Set B, the system is shot.
> >>
> >> No, this is not correct.  The plex ("set") only fails when all drives
> >> in it fail.
> >
> > I hope the following diagrams better illustrate what I was trying to
> > point out. Data striped across all the A's and that is mirrored to the B
> > Stripes:
> >
> > ...
> >
> > If A1 fails, then the A Stripe set cannot function (much like in Raid 0,
> > one disk fails the set) meaning that B now is the array:
> 
> No, this is not correct.
> 
> >>> Thus the striping of mirrors (rather than a mirror of striped sets)
> >>> is a more resilient and fault-tolerant setup of a multi-disk array.
> >>
> >> No, you're misunderstanding the current implementation.
> >
> > Perhaps I am ... but unless gvinum somehow reconstructs a 3 disk stripe
> > into a 2 disk stripe in the event one disk fails, I am not sure how.
> 
> Well, you have the source code.  It's not quite the way you look at
> it.  It doesn't have stripes: it has plexes.  And they can be
> incomplete.  If a read to a plex hits a "hole", it automatically
> retries via (possibly all) the other plexes.  Only when all plexes
> have a hole in the same place does the transfer fail.
> 
> You might like to (re)read http://www.vinumvm.org/vinum/intro.html.
> 

I was really hoping that the "holes in the plex" functioning was going
to work but my tests have shown otherwise. I created a gvinum array
consisting of (A striped B) mirror (C striped D) which is the only such
mirror/stripe combination allowed by gvinum for four drives. We have:

_________
| A   B |__
|_______|  |
           |Mirror
_________  |
| C   D |--|
|_______|

Based on what the "plex hole" theory states, Drive A and Drive D could
both fail and the system would read through the holes and pick up data
from B and C (or the converse if B and C failed), functionally
equivalent to a stripe of mirrors. To fail a drive I rebooted
single-user, used dd to write /dev/zero over the beginning of the disk,
and then ran fdisk.

drive d device /dev/da4s1h
drive c device /dev/da3s1h
drive b device /dev/da2s1h
drive a device /dev/da1s1h
volume home
plex name home.p1 org striped 960s vol home
plex name home.p0 org striped 960s vol home
sd name home.p1.s1 drive d len 71681280s driveoffset 265s plex home.p1 plexoffset 960s
sd name home.p1.s0 drive c len 71681280s driveoffset 265s plex home.p1 plexoffset 0s
sd name home.p0.s1 drive b len 71681280s driveoffset 265s plex home.p0 plexoffset 960s
sd name home.p0.s0 drive a len 71681280s driveoffset 265s plex home.p0 plexoffset 0s
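
For illustration, the behaviour expected from the "plex hole"
description can be modelled roughly as below. This is only a sketch
under that reading of the explanation, not gvinum code; the names
PLEXES, drive_for, readable and volume_readable are hypothetical, and
the drive letters and 960-sector stripe size are taken from the
configuration above.

```python
# Hypothetical model of the "plex hole" read path; not gvinum source.
STRIPE_SECTORS = 960  # from "org striped 960s" in the configuration

# Drives backing each plex, listed in plexoffset order (s0, then s1).
PLEXES = {
    "home.p0": ["a", "b"],
    "home.p1": ["c", "d"],
}


def drive_for(plex, sector):
    """Return the drive that holds `sector` within a striped plex."""
    drives = PLEXES[plex]
    return drives[(sector // STRIPE_SECTORS) % len(drives)]


def readable(sector, failed):
    """A read succeeds if at least one plex has no hole at this sector."""
    return any(drive_for(plex, sector) not in failed for plex in PLEXES)


def volume_readable(failed):
    """Check one sector per stripe column, which covers every drive."""
    return all(readable(i * STRIPE_SECTORS, failed) for i in range(2))


for failed in ({"a", "d"}, {"b", "c"}, {"a", "c"}):
    print(sorted(failed), volume_readable(failed))
```

Under this model, losing A and D (or B and C) leaves every sector
readable from some plex, while losing A and C does not.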

In my case:        Fail B     Fail B and C
A = /dev/da1s1h      up          up
B = /dev/da2s1h      down        down
C = /dev/da3s1h      up          down
D = /dev/da4s1h      up          up

1 Volume
V home2              up          down (!)

2 Plexes
P home.p0 (A and B)  down        down
P home.p1 (C and D)  up          down

4 Subdisks
S home.p0.s0 (A)     up          up
S home.p0.s1 (B)     down        down
S home.p1.s0 (C)     up          down
S home.p1.s1 (D)     up          up

Based on this, failing the one drive did in fact fail the plex
(home.p0). Although at that point I realized that failing either drive
on the other plex would also fail that plex, and with it the volume, I
went ahead and failed drive C as well. The result was a failed volume.
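
The state changes in the table above can be reproduced with a small
model. This is an illustrative sketch of the observed behaviour only,
not a description of how gvinum is implemented; SUBDISKS and states are
hypothetical names, and the subdisk-to-drive mapping comes from the
configuration shown earlier.

```python
# Illustrative model of the observed state propagation; not gvinum source.
SUBDISKS = {
    "home.p0.s0": ("home.p0", "a"),
    "home.p0.s1": ("home.p0", "b"),
    "home.p1.s0": ("home.p1", "c"),
    "home.p1.s1": ("home.p1", "d"),
}


def states(failed_drives):
    """Return (subdisk states, plex states, volume state)."""
    sd = {
        name: "down" if drive in failed_drives else "up"
        for name, (_, drive) in SUBDISKS.items()
    }
    plex_names = sorted({plex for plex, _ in SUBDISKS.values()})
    # Observed: a single down subdisk takes the whole striped plex down.
    plex = {
        p: "up"
        if all(sd[n] == "up" for n, (q, _) in SUBDISKS.items() if q == p)
        else "down"
        for p in plex_names
    }
    # Observed: the volume stays up only while at least one plex is up.
    volume = "up" if "up" in plex.values() else "down"
    return sd, plex, volume


for failed in ({"b"}, {"b", "c"}):
    sd, plex, volume = states(failed)
    print(sorted(failed), plex, volume)
```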

With the failed B drive, once I bsdlabeled the disk to include the vinum
slice, I got the message that the plex was now stale (instead of down).
A simple gvinum start home2 changed the state to degraded and the
system rebuilt the array. When both drives failed I had to work a bit
of a kludge in: I ran gvinum setstate -f up home.p1.s0, then gvinum
start home.p0. At that point the system rebuilt itself and it would
appear the data is intact, although I have not completely tested or
verified that last statement.

In essence, I had hoped that the "plex hole" mechanism would provide
the functional equivalent of a striped set of mirrors and so make my
feature request unnecessary, but it did not work out that way. So
please note this as either a renewed request for that feature or a bug
report that the pass-through feature of gvinum plexes is broken.
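
As a rough worked example of what is at stake, the sketch below counts
how many of the six possible two-drive failures each layout survives on
four drives. The mirror-of-stripes rule follows the behaviour observed
in the test above; the stripe-of-mirrors pairing (A with C, B with D)
is an assumption made for illustration, since gvinum does not currently
allow that layout.

```python
from itertools import combinations

DRIVES = ["a", "b", "c", "d"]

# Mirror of stripes, as observed: a striped plex dies with any one drive.
STRIPED_PLEXES = [{"a", "b"}, {"c", "d"}]
# Stripe of mirrors (requested layout): hypothetical pairing of the drives.
MIRRORED_PAIRS = [{"a", "c"}, {"b", "d"}]


def mirror_of_stripes_survives(failed):
    # Up while at least one striped plex has lost no drive.
    return any(not (plex & failed) for plex in STRIPED_PLEXES)


def stripe_of_mirrors_survives(failed):
    # Up while every mirrored pair still has at least one drive.
    return all(pair - failed for pair in MIRRORED_PAIRS)


def survivable(check):
    """Count the two-drive failures that leave the volume up."""
    return sum(check(set(pair)) for pair in combinations(DRIVES, 2))


print("mirror of stripes:", survivable(mirror_of_stripes_survives), "of 6")
print("stripe of mirrors:", survivable(stripe_of_mirrors_survives), "of 6")
```

With these rules the mirror of stripes survives 2 of the 6 double
failures and the stripe of mirrors survives 4.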

Sven


