2 bonnies can stop disk activity permanently

Eric Anderson anderson at centtech.com
Wed Oct 11 12:41:19 PDT 2006

On 10/11/06 11:55, Scott Long wrote:
> Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
>> wrote:
>>> Mike Tancsa wrote:
>>>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>>> wrote:
>>>>> this is only a crude hack.  I get around this right now by not using a
>>>>> disklabel or fdisk table on arrays where I value speed.  For those, I
>>>>> just put a filesystem directly on the array, and boot off of a small
>>>>> system disk.
>>>> 	How is that done ?  just newfs -O2 -U /dev/da0  ?
>>> Yup.
>> Hi,
>> 	Is this going to work in most/all cases ?  In other words, how
>> do I make sure the file system I lay down is indeed properly /
>> optimally aligned with the underlying structure ?
>> 	---Mike
> UFS1 skips the first 8k of its space to allow for
> bootstrapping/partitioning data.  UFS2 skips the first 64k.
> Blocks are then aligned to that skip.  64K is a good alignment
> for most RAID cases.  But understanding exactly how RAID-5 works
> will help you make appropriate choices.
> (Note that in the following write-up I'm actually describing RAID-4.
> The only difference between RAID-4 and 5 is that the parity data
> is spread out to all of the disks instead of being kept all on a
> single disk.  However, this is just a performance detail, and it's
> easier to describe how things work if you ignore it)
> As you might know, RAID-4/5 takes N disks and writes data to N-1 of
> them while computing and writing a parity calculation to the Nth
> disk.  That parity calculation is a logical XOR of the data disks.
> One of the neat properties of XOR is that it's a reversible algorithm;
> you can take the final answer and re-run the XOR using all but one of
> the original components and get an answer that represents the data of
> the missing component.
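
A minimal sketch of the XOR property described above, in Python. The byte strings are made-up stand-ins for the contents of three data disks, and xor_blocks is a hypothetical helper, not part of any RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Made-up contents of three data disks (illustrative values only).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity disk holds the XOR of all the data disks.
parity = xor_blocks(data)

# Pretend data disk 1 is lost: XOR the survivors with the parity
# to get back what the missing disk held.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Here rebuilt ends up equal to data[1], which is the reversibility being relied on.
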
> The array is divided into 'stripes', each stripe containing an equal
> subsection of each data disk plus the parity disk.  When we talk about
> 'stripe size', what we are referring to is the size of one of those
> subsections.  A 64K stripe size means that each disk is divided into
> 64K subsections.  The total amount of data in a stripe is then a
> function of the stripe size and the number of disks in the array.  If
> you have 5 disks in your array and have set a stripe size of 64K, each
> stripe will hold a total of 256K of data (4 data disks and 1 parity
> disk).
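
The arithmetic above can be written out as a small Python sketch. data_per_stripe is a hypothetical name, and it uses "stripe size" in the sense defined above (the per-disk subsection).

```python
def data_per_stripe(num_disks, stripe_size):
    """Bytes of user data in one stripe of a RAID-4/5 array.

    One disk's worth of each stripe holds parity, so only
    num_disks - 1 subsections carry data.
    """
    if num_disks < 3:
        raise ValueError("RAID-4/5 needs at least 3 disks")
    return (num_disks - 1) * stripe_size

KB = 1024
five_disk_total = data_per_stripe(5, 64 * KB)  # 4 data disks * 64K
```

For the 5-disk, 64K example this gives 256K of data per stripe.
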
> Every time you write to a RAID-5 array, parity needs to be updated.
> As everything operates in terms of the stripes, the most
> straightforward way to do this is to read all of the data from the stripe,
> replace the portion that is being written, recompute the parity, and
> then write out the updates.  This is also the slowest way to do it.
> An easy optimization is to buffer the writes and look for situations
> where all of the data in a stripe is being written sequentially.  If
> all of the data in the stripe is being replaced, there is no need to
> read any of the old data.  Just collect all of the writes together,
> compute the parity, and write everything out all at once.
> Another optimization is to recognize when only one member of the stripe
> is being updated.  For that, you read the parity, read the old data, and
> then XOR out the old data and XOR in the new data.  You still have the
> latency of waiting for a read, but on a busy system you reduce head
> movement on all of the disks, which is a big win.
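
A rough Python sketch of the single-member update just described, checked against the slow full-stripe recompute. The chunk contents are invented for illustration and xor_bytes is a hypothetical helper.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Invented contents of the four data chunks in one stripe.
old_chunks = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]

old_parity = b"\x00\x00"
for chunk in old_chunks:
    old_parity = xor_bytes(old_parity, chunk)

# Update only chunk 2: read old parity and old data,
# XOR out the old data, XOR in the new data.
target = 2
new_data = b"\xde\xad"
new_parity = xor_bytes(xor_bytes(old_parity, old_chunks[target]), new_data)

# Slow path for comparison: recompute parity from every chunk.
updated = list(old_chunks)
updated[target] = new_data
full_parity = b"\x00\x00"
for chunk in updated:
    full_parity = xor_bytes(full_parity, chunk)
```

Both paths produce the same parity, but the short path only touches the target disk and the parity disk.
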
> Both of these optimizations rely on the writes having a certain amount
> of alignment.  If your stripe size is 64k and your writes are 64k, but
> they all start at an 8k offset into the stripe, you lose.  Each 64K
> write will have to touch 56k of one disk and 8k of the next disk.  But,
> an 8k offset can be made to work if you reduce your stripe size to 8k.
> It then becomes an exercise in balancing the parameters of FS block
> size and array stripe size to give you the best performance for your
> needs.  The 64k offset in UFS2 gives you more room to work here, so
> that's why I said at the beginning that it's a good value.  In any case,
> you want to choose parameters that result in each block write covering
> either a single disk or a whole stripe.
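
To make the alignment point concrete, here is a small Python sketch that splits a write into the per-disk subsections it touches. It assumes offsets are measured in the array's data address space and that consecutive subsections sit on consecutive data disks; chunks_touched is a hypothetical helper.

```python
def chunks_touched(offset, length, stripe_size):
    """Return (subsection_index, bytes_written) pairs for one write."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // stripe_size
        boundary = (index + 1) * stripe_size
        n = min(end, boundary) - offset
        pieces.append((index, n))
        offset += n
    return pieces

KB = 1024
# 64k write at an 8k offset with a 64k stripe size: two partial subsections.
misaligned = chunks_touched(8 * KB, 64 * KB, 64 * KB)
# Same write on a 64k boundary: exactly one whole subsection.
aligned = chunks_touched(64 * KB, 64 * KB, 64 * KB)
# 8k offset with the stripe size reduced to 8k: eight whole subsections.
small_stripe = chunks_touched(8 * KB, 64 * KB, 8 * KB)
```

The misaligned case yields 56k on one subsection and 8k on the next, matching the example above, while the other two cases only ever write whole subsections.
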
> Where things really go bad for BSD is when a _63_ sector offset gets
> introduced for the MBR.  Now everything is offset to an odd,
> non-power-of-2 value, and there isn't anything that you can tweak in the
> filesystem or array to compensate.  The best you can do is to manually
> calculate a compensating offset in the disklabel for each partition.
> But at that point, it often becomes easier to just ditch all of that and
> put the filesystem directly on the disk.
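
As a sketch of the manual calculation mentioned above, the following Python works out how many sectors to skip so a partition lands on an alignment boundary. It assumes 512-byte sectors and that the slice begins 63 sectors into the array; compensating_offset is a hypothetical helper, and whether to align to one subsection or to a full stripe is left to the reader.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors

def compensating_offset(slice_start_sectors, align_bytes, sector=SECTOR_BYTES):
    """Sectors to skip inside the slice so the partition start is aligned."""
    if align_bytes % sector:
        raise ValueError("alignment must be a whole number of sectors")
    align_sectors = align_bytes // sector
    return (-slice_start_sectors) % align_sectors

KB = 1024
# Align to a 64k subsection (128 sectors) after the 63-sector MBR offset.
pad_64k = compensating_offset(63, 64 * KB)
# Align to a full 256k stripe of data on a 5-disk array (512 sectors).
pad_256k = compensating_offset(63, 256 * KB)
```

With these assumptions, skipping 65 sectors puts the partition at absolute sector 128 (64k), and skipping 449 sectors puts it at sector 512 (256k).
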
> Scott


Just wanted to say thanks for such a well put explanation on this, with 
all the right details.


Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
