Gvinum RAID5 performance

Mon Nov 1 05:36:22 PST 2004

> -----Original Message-----
> From: Brad Knowles [mailto:brad at stop.mail-abuse.org] 
> Sent: Monday, 1 November 2004 9:48 PM
> To: Alastair D'Silva
> Cc: current at freebsd.org
> Subject: Re: Gvinum RAID5 performance
> 
> 	Keep in mind that if you've got a five disk RAID-5 array, then 
> for any given block, four of those disks are data and would have to 
> be accessed on every read operation anyway, and only one disk would 
> be parity.  The more disks you have in your RAID array, the lower the 
> parity to data ratio, and the less benefit you would get from 
> checking parity in background.
> 

Not quite true. My general expectation is that a RAID5 setup would be
used where the array sees a very high ratio of reads to writes, so
optimising for the common case by avoiding reading every disk in the
stripe makes sense. By not reading the whole stripe every time a read
request is issued (at least, without caching the results), the expected
throughput of the array would be a little less than
(N - 1) * (drive throughput), whereas the current implementation gives
us an expected throughput of less than (drive throughput).

A brief look at geom_vinum_raid5.c indicates that this was the original
intention, with data for undegraded reads coming from a single subdisk.
I'm guessing that there is some serialisation deeper down - I don't have
time to look into it tonight but maybe tomorrow. If someone could point
me to what processes the queue added to by GV_ENQUEUE, it would save me
some time :)

If my guess about serialisation is correct, it would explain why my
drives are flickering instead of being locked on solid, since they
should be read in parallel (the block size specified in dd was
significantly larger than the stripe size, so the read requests *should*
have been issued in parallel).
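
To make the parallel-read argument concrete, here is a minimal sketch
in Python of how one large read maps onto the subdisks. The layout
(parity rotating from the last disk downwards) and the names
raid5_locate and disks_touched are assumptions made for illustration;
they are not taken from geom_vinum_raid5.c.

```python
def raid5_locate(offset, stripe_size, ndisks):
    """Map a volume byte offset to (disk, byte offset on that disk).

    Hypothetical rotating-parity layout, not gvinum's actual one.
    """
    chunk = offset // stripe_size                # data chunk index in the volume
    row = chunk // (ndisks - 1)                  # stripe row holding that chunk
    parity_disk = (ndisks - 1) - (row % ndisks)  # parity moves one disk per row
    idx = chunk % (ndisks - 1)                   # position among the row's data chunks
    disk = idx if idx < parity_disk else idx + 1 # skip over the parity disk
    return disk, row * stripe_size + offset % stripe_size


def disks_touched(offset, length, stripe_size, ndisks):
    """Return the set of disks that hold data for one read request."""
    disks = set()
    pos, end = offset, offset + length
    while pos < end:
        disk, _ = raid5_locate(pos, stripe_size, ndisks)
        disks.add(disk)
        pos = (pos // stripe_size + 1) * stripe_size  # jump to the next chunk
    return disks


# A 1 MB read against 64 kB stripes on five disks spans every disk.
print(sorted(disks_touched(0, 1024 * 1024, 64 * 1024, 5)))  # [0, 1, 2, 3, 4]
```

Under these assumptions an undegraded read never needs the parity
chunk, and a request much larger than the stripe size has work for
every drive at once.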

> 	Most disks do now have track caches, and they do read and write 
> entire tracks at once.  However, given the multitudes of 
> permutations that go on with data addressing (including bad sector 
> mapping, etc...), what the disk thinks of as a "track" may have 
> absolutely no relationship whatsoever to what the OS or driver sees 
> as related or contiguous data.
> 
> 	Therefore, the track cache may not contribute in any meaningful 
> way to what the RAID-5 implementation needs in terms of a stripe 
> cache.  Moreover, the RAID-5 implementation already knows that it 
> needs to do a full read/write of the entire stripe every time it 
> accesses or writes data to that stripe, and this could easily have 
> destructive interference with the on-disk track cache.

Ok, this makes sense - my understanding of how the on-disk cache
operates is somewhat lacking.

> going on as the data is being accessed.  Fundamentally, RAID-5 is not 
> going to be as fast as directly reading the underlying disk.

Well, my point is that RAID5 should have a greater throughput than a
single drive in reading an undegraded volume, since consecutive (or
random non-conflicting) data can be pulled from different drives, the
same way it can in a conventional stripe or mirror. Verifying the parity
on every request is pointless, as not only does it hinder performance,
but a simple XOR parity check does not tell you where the error was,
only that there was an error.
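
A small Python sketch of that last point, using made-up two-byte
chunks (xor_parity, syndrome and corrupt are illustrative names, not
anything from gvinum): the check value is the same whichever chunk was
damaged, so it can flag an error but not locate it. Rebuilding a chunk
only works when something else already says which chunk is missing.

```python
from functools import reduce


def xor_parity(chunks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def syndrome(chunks, parity):
    """All zero bytes when data and parity agree, non-zero otherwise."""
    return xor_parity(list(chunks) + [parity])


def corrupt(chunks, which):
    """Return a copy with the lowest bit of chunk `which` flipped."""
    out = list(chunks)
    out[which] = bytes([out[which][0] ^ 0x01]) + out[which][1:]
    return out


data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_parity(data)

# The same syndrome results whether chunk 0 or chunk 2 was damaged.
same = syndrome(corrupt(data, 0), parity) == syndrome(corrupt(data, 2), parity)

# With chunk 1 known to be lost (e.g. a dead drive), XOR rebuilds it.
rebuilt = xor_parity([data[0], data[2], parity])
```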

Hmm, now there's an interesting idea - implement an ECC-style algorithm
for the parity calculation to protect against flipped bits. It would
probably not be significantly more computationally intensive than the
simple parity (maybe twice as much, on the assumption that the parity
for each word is calculated once for each row and once for each
column), and it would provide the software with enough information to
regenerate the faulty data, and give the user advance notice of a
failing drive.
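
One possible reading of the row-and-column idea, as a rough Python
sketch: treat a block as a list of integer words, keep one parity bit
per word (the rows) plus the XOR of all words (the columns). A single
flipped bit then disturbs exactly one row parity and one column bit,
which pins it down. The function names and this particular scheme are
assumptions for illustration; a real ECC would need more care.

```python
def parities(words):
    """Return (row_parities, column_parity) for a list of integer words."""
    rows = [bin(w).count("1") & 1 for w in words]  # one parity bit per word
    col = 0
    for w in words:
        col ^= w                                   # one parity bit per bit position
    return rows, col


def locate_and_fix(words, stored_rows, stored_col):
    """Return a corrected copy of words, or raise if the damage is too large."""
    rows, col = parities(words)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, stored_rows)) if a != b]
    diff = col ^ stored_col
    if not bad_rows and diff == 0:
        return list(words)                         # nothing to fix
    if len(bad_rows) == 1 and bin(diff).count("1") == 1:
        fixed = list(words)
        fixed[bad_rows[0]] ^= diff                 # flip the one bad bit back
        return fixed
    raise ValueError("more than a single-bit error; cannot correct")


words = [0x12, 0x34, 0x56, 0x78]
rows, col = parities(words)

damaged = list(words)
damaged[2] ^= 0x08                                 # flip one bit in word 2
repaired = locate_and_fix(damaged, rows, col)
```

This only corrects a single flipped bit per block; anything larger is
detected (in most cases) but not repaired.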

> > I think both approaches have the ability to increase overall
> > reliability as well as improve performance since the drives will
> > not be worked as hard.
> 
> 	A "lazy read parity" RAID-5 implementation might have slightly 
> increased performance over normal RAID-5 on the same

I would say an increase of (N - 2) * (drive throughput) is significant
rather than slight, and from a quick glance at the code, this is the
way the author intended it to operate. Of course, I would really love
it if Lukas could share his knowledge on this, since he wrote the
code :)

BTW, Lukas, I don't buy into the offset calculation justification for
poor performance - the overhead is minimal compared to drive access
times.

On a side note, I was thinking of the following for implementing
growable RAID5:

First, have a few bytes in the vinum header for that
subdisk/plex/volume/whatever (there is a header somewhere describing
the plexes, right?) that store how much of the volume has been
converted to the new (larger) volume.

Now, for every new stripe, read the appropriate data from the old
stripe, write it to disk and update the header. If the power fails at
any point, the header won't be updated, and the original stripe will
still be intact, so we can resume as needed. The only problem that I
can see is that if the power fails (or some other disaster occurs)
during the first few stripes processed, the data is uncertain, as what
is on disk may be from either layout. To combat this, maybe the first
few stripes should be moved block by block, rather than a whole stripe
at a time.
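
A toy Python model of this scheme, under loud assumptions: parity is
ignored, the "disks" are one in-memory grid of chunk numbers, the
header is a dict with a single "done" counter, and the crash always
lands after a row is written but before the header is updated. None of
these names come from vinum. It shows why only the first few rows are
risky: there the new row overwrites old-layout data that has not been
copied yet, so redoing the row after a crash reads the wrong chunks.

```python
def overlapping_rows(k_old, k_new):
    """New-layout rows whose destination still holds uncopied old data."""
    if k_new <= k_old:
        raise ValueError("the new layout must have more data disks")
    rows, r = [], 0
    while (r + 1) * k_old > r * k_new:
        rows.append(r)
        r += 1
    return rows


def make_grid(nchunks, k_old, k_new):
    """grid[row][disk] holding chunk numbers in the old layout."""
    nrows = -(-nchunks // k_old)  # ceiling division
    return [[r * k_old + c if c < k_old and r * k_old + c < nchunks else None
             for c in range(k_new)] for r in range(nrows)]


def restripe(grid, k_old, k_new, nchunks, header, crash_in_row=None):
    """Convert grid in place, resuming at header['done'].

    Returns True when finished, False if the simulated crash hit.
    """
    total_rows = -(-nchunks // k_new)
    for r in range(header["done"], total_rows):
        wanted = range(r * k_new, min((r + 1) * k_new, nchunks))
        buf = [grid[c // k_old][c % k_old] for c in wanted]  # read via old layout
        for i, chunk in enumerate(buf):
            grid[r][i] = chunk                               # write the new row
        if r == crash_in_row:
            return False          # power fails before the header update
        header["done"] = r + 1    # checkpoint: rows below this are converted
    return True


def run(crash_row, nchunks=40, k_old=4, k_new=5):
    """Crash once in crash_row, resume, and read the volume back."""
    grid, header = make_grid(nchunks, k_old, k_new), {"done": 0}
    restripe(grid, k_old, k_new, nchunks, header, crash_in_row=crash_row)
    restripe(grid, k_old, k_new, nchunks, header)
    return [grid[c // k_new][c % k_new] for c in range(nchunks)]
```

Going from four to five data disks, overlapping_rows(4, 5) gives rows
0 to 3. In this model run(6) reads back intact while run(1) does not,
which is the case the block-by-block move is meant to cover.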

-- 
Alastair D'Silva           mob: 0413 485 733
Networking Consultant      fax: 0413 181 661
New Millennium Networking  web: http://www.newmillennium.net.au