2 bonnies can stop disk activity permanently

Wed Oct 11 12:41:19 PDT 2006

On 10/11/06 11:55, Scott Long wrote:
> Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
>> wrote:
>>
>>
>>> Mike Tancsa wrote:
>>>
>>>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>>> wrote:
>>>>
>>>>
>>>>
>>>>> this is only a crude hack.  I get around this right now by not using a
>>>>> disklabel or fdisk table on arrays where I value speed.  For those, I
>>>>> just put a filesystem directly on the array, and boot off of a small
>>>>> system disk.
>>>>
>>>>
>>>> 	How is that done ?  just newfs -O2 -U /dev/da0  ?
>>> Yup.
>>
>> Hi,
>> 	Is this going to work in most/all cases ?  In other words, how
> >> do I make sure the file system I lay down is indeed properly /
>> optimally aligned with the underlying structure ?
>>
>> 	---Mike
> 
> UFS1 skips the first 8k of its space to allow for
> bootstrapping/partitioning data.  UFS2 skips the first 64k.
> Blocks are then aligned to that skip.  64K is a good alignment
> for most RAID cases.  But understanding exactly how RAID-5 works
> will help you make appropriate choices.
> 
> (Note that in the following write-up I'm actually describing RAID-4.
> The only difference between RAID-4 and 5 is that the parity data
> is spread out to all of the disks instead of being kept all on a
> single disk.  However, this is just a performance detail, and it's
> easier to describe how things work if you ignore it)
> 
> As you might know, RAID-4/5 takes N disks and writes data to N-1 of
> them while computing and writing a parity calculation to the Nth
> disk.  That parity calculation is a logical XOR of the data disks.
> One of the neat properties of XOR is that it's a reversible algorithm;
> you can take the final answer and re-run the XOR using all but one of
> the original components and get an answer that represents the data of
> the missing component.
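
A minimal sketch of the XOR property described above, in Python. The helper name `xor_blocks` and the four sample "disk" blocks are made up for illustration; a real controller works on whole stripe units rather than 4-byte strings.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Four hypothetical data disks, one small block each.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]

# Parity is the XOR of all the data blocks.
parity = xor_blocks(data)

# Pretend the third disk is lost: XOR the survivors with the parity
# to get back the missing block.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
```

Here `rebuilt` comes out equal to `data[2]`, which is the reversibility the paragraph relies on.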
> 
> The array is divided into 'stripes', each stripe containing an equal
> subsection of each data disk plus the parity disk.  When we talk about
> 'stripe size', what we are referring to is the size of one of those
> subsections.  A 64K stripe size means that each disk is divided into
> 64K subsections.  The total amount of data in a stripe is then a
> function of the stripe size and the number of disks in the array.  If
> you have 5 disks in your array and have set a stripe size of 64K, each
> stripe will hold a total of 256K of data (4 data disks and 1 parity
> disk).
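
The arithmetic in the paragraph above can be written out as a short sketch. The function name `stripe_data_kib` is hypothetical; it assumes one parity subsection per stripe, as in the RAID-4/5 layout being described.

```python
def stripe_data_kib(disks, stripe_size_kib):
    """Data held by one full stripe, in KiB: every disk contributes one
    subsection of stripe_size_kib, and one of them holds parity."""
    if disks < 3:
        raise ValueError("RAID-4/5 needs at least 3 disks")
    return (disks - 1) * stripe_size_kib

# The example from the text: 5 disks with a 64K stripe size.
example = stripe_data_kib(5, 64)
```

With 5 disks and a 64K stripe size, `example` is 256, matching the 256K figure in the text.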
> 
> Every time you write to a RAID-5 array, parity needs to be updated.
> As everything operates in terms of the stripes, the most
> straightforward way to do this is to read all of the data from the stripe,
> replace the portion that is being written, recompute the parity, and
> then write out the updates.  This is also the slowest way to do it.
> 
> An easy optimization is to buffer the writes and look for situations
> where all of the data in a stripe is being written sequentially.  If
> all of the data in the stripe is being replaced, there is no need to
> read any of the old data.  Just collect all of the writes together,
> compute the parity, and write everything out all at once.
> 
> Another optimization is to recognize when only one member of the stripe
> is being updated.  For that, you read the parity, read the old data, and
> then XOR out the old data and XOR in the new data.  You still have the
> latency of waiting for a read, but on a busy system you reduce head
> movement on all of the disks, which is a big win.
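
A small sketch of the single-member update described above, checked against a full recompute of the stripe. The block contents and the helper name `xor_bytes` are invented for the example; it only illustrates the XOR arithmetic, not how any particular controller schedules its I/O.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical stripe: four data blocks and their parity.
old_data = [b"\x11\x22", b"\x33\x44", b"\x55\x66", b"\x77\x88"]
old_parity = bytes(2)
for block in old_data:
    old_parity = xor_bytes(old_parity, block)

# Update only member 1: XOR out the old data, XOR in the new data.
new_block = b"\xaa\xbb"
new_parity = xor_bytes(xor_bytes(old_parity, old_data[1]), new_block)

# Slow path for comparison: recompute parity from every data block.
new_data = old_data[:1] + [new_block] + old_data[2:]
full_parity = bytes(2)
for block in new_data:
    full_parity = xor_bytes(full_parity, block)
```

Both paths give the same parity, but the short path only needs the old parity and the old copy of the one block being replaced.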
> 
> Both of these optimizations rely on the writes having a certain amount
> of alignment.  If your stripe size is 64k and your writes are 64k, but
> they all start at an 8k offset into the stripe, you lose.  Each 64K
> write will have to touch 56k of one disk and 8k of the next disk.  But,
> an 8k offset can be made to work if you reduce your stripe size to 8k.
> It then becomes an exercise in balancing the parameters of FS block
> size and array stripe size to give you the best performance for your
> needs.  The 64k offset in UFS2 gives you more room to work here, so
> that's why I say at the beginning that it's a good value.  In any case,
> you want to choose parameters that result in each block write covering
> either a single disk or a whole stripe.
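
To make the 56k/8k split concrete, here is a rough sketch that breaks a write into the per-disk subsections it touches. `split_write` is a hypothetical helper, and offsets are measured from the start of the array's data area, ignoring parity placement.

```python
KIB = 1024

def split_write(offset, length, unit):
    """Return (subsection_index, byte_count) pairs for a write of
    `length` bytes at byte `offset`, with subsections of `unit` bytes."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // unit
        boundary = (index + 1) * unit      # end of the current subsection
        count = min(end, boundary) - offset
        pieces.append((index, count))
        offset += count
    return pieces

# 64k write at an 8k offset with a 64k stripe size: two partial pieces.
misaligned = split_write(8 * KIB, 64 * KIB, 64 * KIB)

# Same write with an 8k stripe size: eight whole subsections.
small_unit = split_write(8 * KIB, 64 * KIB, 8 * KIB)
```

In the first case the write lands as 56k on one subsection and 8k on the next; in the second, every piece is a complete 8k subsection, so no partial subsection has to be read back.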
> 
> Where things really go bad for BSD is when a _63_ sector offset gets
> introduced for the MBR.  Now everything is offset to an odd,
> non-power-of-2 value, and there isn't anything that you can tweak in the
> filesystem or array to compensate.  The best you can do is to manually
> calculate a compensating offset in the disklabel for each partition.
> But at that point, it often becomes easier to just ditch all of that and
> put the filesystem directly on the disk.
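
As a rough illustration of the compensating offset mentioned above, the sketch below computes how many sectors of padding would push a partition start up to the next stripe-size boundary. It assumes 512-byte sectors, and `padding_sectors` is a hypothetical helper rather than anything the disklabel tools provide.

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte sectors

def padding_sectors(start_sector, unit_bytes):
    """Sectors to skip so that start_sector + padding falls on a
    multiple of unit_bytes (the array's stripe size)."""
    unit_sectors = unit_bytes // SECTOR_BYTES
    return (-start_sector) % unit_sectors

# The 63-sector MBR offset in bytes, and how far it is past an 8k boundary.
mbr_offset_bytes = 63 * SECTOR_BYTES
remainder_8k = mbr_offset_bytes % (8 * 1024)

# Padding needed to reach a 64k boundary from sector 63.
pad = padding_sectors(63, 64 * 1024)
```

Under these assumptions the 63-sector offset is 32256 bytes, which is not a multiple of 8k, and 65 sectors of padding would move the start to sector 128, a 64k boundary.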
> 
> Scott

Scott,

Just wanted to say thanks for such a well put explanation on this, with 
all the right details.

Eric

-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------