2 bonnies can stop disk activity permanently
Eric Anderson
anderson at centtech.com
Wed Oct 11 12:41:19 PDT 2006
On 10/11/06 11:55, Scott Long wrote:
> Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you
>> wrote:
>>
>>
>>> Mike Tancsa wrote:
>>>
>>>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you
>>>> wrote:
>>>>
>>>>
>>>>
>>>>> this is only a crude hack. I get around this right now by not using a
>>>>> disklabel or fdisk table on arrays where I value speed. For those, I
>>>>> just put a filesystem directly on the array, and boot off of a small
>>>>> system disk.
>>>>
>>>>
>>>> How is that done? Just newfs -O2 -U /dev/da0?
>>> Yup.
>>
>> Hi,
>> Is this going to work in most/all cases? In other words, how
>> do I make sure the file system I lay down is indeed properly /
>> optimally aligned with the underlying structure?
>>
>> ---Mike
>
> UFS1 skips the first 8k of its space to allow for
> bootstrapping/partitioning data. UFS2 skips the first 64k.
> Blocks are then aligned to that skip. 64K is a good alignment
> for most RAID cases. But understanding exactly how RAID-5 works
> will help you make appropriate choices.
>
> (Note that in the following write-up I'm actually describing RAID-4.
> The only difference between RAID-4 and 5 is that the parity data
> is spread out to all of the disks instead of being kept all on a
> single disk. However, this is just a performance detail, and it's
> easier to describe how things work if you ignore it)
>
> As you might know, RAID-4/5 takes N disks and writes data to N-1 of
> them while computing and writing a parity calculation to the Nth
> disk. That parity calculation is a logical XOR of the data disks.
> One of the neat properties of XOR is that it's a reversible algorithm;
> you can take the final answer and re-run the XOR using all but one of
> the original components and get an answer that represents the data of
> the missing component.
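>
> A toy illustration of that property in C (the byte values are made
> up; this is just the arithmetic, not real driver code):
>
>     #include <stdio.h>
>
>     int
>     main(void)
>     {
>             unsigned char d0 = 0x5a, d1 = 0x33, d2 = 0xc7; /* data disks */
>             unsigned char p = d0 ^ d1 ^ d2;                /* parity disk */
>
>             /* Say disk d1 dies. Re-run the XOR with the parity and
>              * the survivors, and the lost byte falls back out. */
>             unsigned char recovered = p ^ d0 ^ d2;
>
>             printf("lost 0x%02x, recovered 0x%02x\n", d1, recovered);
>             return (0);
>     }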
>
> The array is divided into 'stripes', each stripe containing an equal
> subsection of each data disk plus the parity disk. When we talk about
> 'stripe size', what we are referring to is the size of one of those
> subsections. A 64K stripe size means that each disk is divided into
> 64K subsections. The total amount of data in a stripe is then a
> function of the stripe size and the number of disks in the array. If
> you have 5 disks in your array and have set a stripe size of 64K, each
> stripe will hold a total of 256K of data (4 data disks and 1 parity
> disk).
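>
> A sketch of the addressing math for that 5-disk, 64K-stripe example
> (units are bytes; the names are mine, not any particular driver's):
>
>     #include <stdio.h>
>
>     #define STRIPE_SIZE  (64UL * 1024)          /* per-disk chunk */
>     #define NDATA        4                      /* data disks */
>     #define STRIPE_WIDTH (STRIPE_SIZE * NDATA)  /* 256K data per stripe */
>
>     int
>     main(void)
>     {
>             unsigned long off = 300UL * 1024;   /* some logical offset */
>             unsigned long stripe = off / STRIPE_WIDTH;
>             unsigned long within = off % STRIPE_WIDTH;
>             unsigned long disk = within / STRIPE_SIZE;
>             unsigned long diskoff = within % STRIPE_SIZE;
>
>             /* Prints: stripe 1, data disk 0, offset 45056 */
>             printf("stripe %lu, data disk %lu, offset %lu\n",
>                 stripe, disk, diskoff);
>             return (0);
>     }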
>
> Every time you write to a RAID-5 array, parity needs to be updated.
> As everything operates in terms of the stripes, the most
> straightforward way to do this is to read all of the data from the stripe,
> replace the portion that is being written, recompute the parity, and
> then write out the updates. This is also the slowest way to do it.
>
> An easy optimization is to buffer the writes and look for situations
> where all of the data in a stripe is being written sequentially. If
> all of the data in the stripe is being replaced, there is no need to
> read any of the old data. Just collect all of the writes together,
> compute the parity, and write everything out all at once.
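>
> In that case the parity is just the XOR of the collected data
> buffers, computed in memory with no reads at all. Roughly (the disk
> count is hardwired here for clarity):
>
>     #include <stddef.h>
>
>     /* XOR the four collected data buffers into the parity buffer;
>      * all five writes can then be issued at once. */
>     static void
>     full_stripe_parity(const unsigned char *d[4], unsigned char *p,
>         size_t len)
>     {
>             size_t i;
>
>             for (i = 0; i < len; i++)
>                     p[i] = d[0][i] ^ d[1][i] ^ d[2][i] ^ d[3][i];
>     }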
>
> Another optimization is to recognize when only one member of the stripe
> is being updated. For that, you read the parity, read the old data, and
> then XOR out the old data and XOR in the new data. You still have the
> latency of waiting for a read, but on a busy system you reduce head
> movement on all of the disks, which is a big win.
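>
> In code form the shortcut is a single XOR per byte (sketch only):
>
>     #include <stddef.h>
>
>     /* XORing the old data against the parity cancels it out, and
>      * XORing the new data in completes the update: two reads and
>      * two writes, no matter how many disks are in the array. */
>     static void
>     small_write_parity(unsigned char *parity,
>         const unsigned char *olddata, const unsigned char *newdata,
>         size_t len)
>     {
>             size_t i;
>
>             for (i = 0; i < len; i++)
>                     parity[i] ^= olddata[i] ^ newdata[i];
>     }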
>
> Both of these optimizations rely on the writes having a certain amount
> of alignment. If your stripe size is 64k and your writes are 64k, but
> they all start at an 8k offset into the stripe, you lose. Each 64K
> write will have to touch 56k of one disk and 8k of the next disk. But,
> an 8k offset can be made to work if you reduce your stripe size to 8k.
> It then becomes an exercise in balancing the parameters of FS block
> size and array stripe size to give you the best performance for your
> needs. The 64k offset in UFS2 gives you more room to work here, which
> is why I said at the beginning that it's a good value. In any case,
> you want to choose parameters that result in each block write covering
> either a single disk or a whole stripe.
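>
> One way to sanity-check a layout is to classify each block write
> against the stripe geometry. A sketch, reusing the 64K/4-data-disk
> numbers from above:
>
>     #define STRIPE_SIZE  (64UL * 1024)
>     #define NDATA        4
>     #define STRIPE_WIDTH (STRIPE_SIZE * NDATA)
>
>     /* Nonzero for the two cheap cases: the write replaces whole
>      * stripes (no reads needed), or it stays within one disk's
>      * chunk (two reads). Anything else pays the full
>      * read-modify-write penalty. */
>     static int
>     write_is_cheap(unsigned long off, unsigned long len)
>     {
>             if (off % STRIPE_WIDTH == 0 && len % STRIPE_WIDTH == 0)
>                     return (1);
>             if ((off % STRIPE_SIZE) + len <= STRIPE_SIZE)
>                     return (1);
>             return (0);
>     }
>
> With the 8k-offset example above, write_is_cheap(8192, 65536) comes
> back false, which is exactly the "touch 56k of one disk and 8k of
> the next" case.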
>
> Where things really go bad for BSD is when a _63_ sector offset gets
> introduced for the MBR. Now everything is offset to an odd,
> non-power-of-2 value, and there isn't anything that you can tweak in the
> filesystem or array to compensate. The best you can do is to manually
> calculate a compensating offset in the disklabel for each partition.
> But at that point, it often becomes easier to just ditch all of that
> and put the filesystem directly on the disk.
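>
> The compensating arithmetic looks roughly like this (assuming
> 512-byte sectors and a 64K stripe; adjust for your geometry):
>
>     #include <stdio.h>
>
>     int
>     main(void)
>     {
>             unsigned long mbr = 63;                   /* MBR's offset */
>             unsigned long stripe = 64UL * 1024 / 512; /* 128 sectors */
>
>             /* Round the partition start up to the next stripe
>              * boundary: sector 128 here, i.e. a 65-sector pad. */
>             unsigned long start = (mbr + stripe - 1) / stripe * stripe;
>
>             printf("start at sector %lu (pad of %lu sectors)\n",
>                 start, start - mbr);
>             return (0);
>     }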
>
> Scott
Scott,
Just wanted to say thanks for such a well-put explanation on this,
with all the right details.
Eric
--
------------------------------------------------------------------------
Eric Anderson Sr. Systems Administrator Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------