2 bonnies can stop disk activity permanently

Bruce Evans bde at zeta.org.au
Mon Oct 9 16:13:55 PDT 2006

On Mon, 9 Oct 2006, Scott Long wrote:

> Bruce Evans wrote:
>> ...
>> I suspect the problems are that the 64K-block i/o is usually perfectly
>> misaligned unless the fs itself has 64K-blocks and the fs's partition
>> starts on a 64K-block boundary, and that some hardware or firmware
>> (mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
>> ...
> Yes, it's a well-known problem that the combination of fdisk+disklabel+ufs
> means that all FS blocks are mis-aligned in the worst way possible (blocks
> start on odd sector numbers).  This _horribly_ pessimizes RAID-5 on most
> controllers.

Apparently the internal fs block alignment/size problem is not so well
known.  I knew about the external one but didn't connect it with fs
block sizes at first.  How horribly do aligned 16K-blocks pessimize
RAID-5?  Does it help much to have misaligned 64K-blocks instead of
misaligned 16K-blocks?
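
A rough sketch of the arithmetic behind that question, not a measurement:
it counts how many fs blocks straddle a stripe-unit boundary for a given
starting sector.  The 512-byte sector size, the 64K stripe unit and the
function name are assumptions made for illustration; real controllers
differ, and this says nothing about how much each straddle costs.

```python
SECTOR_BYTES = 512  # assumed sector size


def straddle_fraction(start_sector, block_bytes, stripe_bytes, nblocks=1024):
    """Return the fraction of fs blocks that cross a stripe-unit boundary.

    start_sector is the physical sector where the first fs block begins
    (hypothetical input; the thread notes this is hard to obtain).
    """
    crossing = 0
    for i in range(nblocks):
        first = start_sector * SECTOR_BYTES + i * block_bytes
        last = first + block_bytes - 1
        # The block straddles if its first and last bytes fall in
        # different stripe units.
        if first // stripe_bytes != last // stripe_bytes:
            crossing += 1
    return crossing / nblocks


STRIPE = 64 * 1024  # assumed stripe unit
for block in (16 * 1024, 64 * 1024):
    for start in (63, 128):
        print(start, block, straddle_fraction(start, block, STRIPE))
```

Under these assumptions, a start at sector 63 makes every fourth 16K-block
cross a 64K boundary, while every 64K-block crosses one; a start at sector
128 makes none cross in either case.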

> Solving it reliably
> and automatically is hard, though.  The filesystem ultimately needs to
> know the physical sector that it starts on, and compensate accordingly.
> You could cheat by having the disklabel tools always align partitions,
> but the tool would still need to know the physical address of where it
> starts in the slice.  Either way, something high up needs to get the
> logical to physical translation of the sectors.

The filesystem shouldn't need to know more than that its starting sector
is not physically misaligned.  The clustering code could use knowledge
of physical offsets and alignment requirements to fix up some cases.
My version of newfs_msdosfs(8) uses the (unimplemented) ioctl
DIOCMEDIAOFFSET to ask the system for the physical offset.  Using
this is much easier than parsing XML.
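
As a hypothetical illustration of what a tool could do once it has the
physical offset, the sketch below computes how many sectors to skip so
that the first data block lands on an aligned physical sector.  The
function name and the example alignments are assumptions; this is not
taken from newfs_msdosfs(8) or from any existing ioctl.

```python
def pad_to_alignment(physical_start, align):
    """Sectors to skip so that physical_start + pad is a multiple of align.

    physical_start: physical sector where the partition begins (assumed
                    to come from something like the offset query above).
    align:          required alignment in sectors, e.g. 128 for 64K with
                    512-byte sectors.
    """
    if align <= 0:
        raise ValueError("align must be positive")
    return (-physical_start) % align


# A partition starting at physical sector 63:
print(pad_to_alignment(63, 128))  # pad needed for 64K alignment
print(pad_to_alignment(63, 32))   # pad needed for 16K alignment
```

With a start at sector 63, skipping 65 sectors reaches sector 128 (64K
aligned), and skipping 1 sector reaches sector 64 (16K aligned).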

> Suggestions have been made to just put blind offsets into the disklabel
> tool that assumes the common case (mbr is present and is a known length,
> and that the disklabel is in the first slice of the MBR).  Obviously,
> this is only a crude hack.  I get around this right now by not using a
> disklabel or fdisk table on arrays where I value speed.  For those, I
> just put a filesystem directly on the array, and boot off of a small
> system disk.

I normally align FreeBSD slices and partitions manually to a "cylinder"
boundary, and this sometimes gives alignment to a large power of 2
accidentally.  I normally use a fake cylinder size of 16065 (255 fake
heads and 63 sectors per fake track).  This is just as bad for cylinder
alignment as 63 is for track alignment, but new systems only need it
for compatibility with other systems.
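
A small sketch of why the power-of-2 alignment is only accidental: 16065
is odd, so a cylinder boundary is aligned to exactly the power of 2 that
divides the cylinder number.  The helper name and the sample cylinder
numbers are chosen for illustration only.

```python
HEADS = 255             # fake heads
SECTORS_PER_TRACK = 63  # sectors per fake track
CYLINDER = HEADS * SECTORS_PER_TRACK  # 16065 sectors per fake cylinder


def pow2_alignment(sector):
    """Largest power of two dividing a positive sector number."""
    if sector <= 0:
        raise ValueError("sector must be positive")
    return sector & -sector


for cyl in (1, 2, 8, 128):
    start = cyl * CYLINDER
    print(cyl, start, pow2_alignment(start))
```

A slice starting at cylinder 128 is aligned to 128 sectors (64K with
512-byte sectors), while one starting at cylinder 1 is aligned to a single
sector only.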

