ATA 4K sector issues

Wed Mar 17 20:23:25 UTC 2010

    We experimented a bit with aligning fdisk (dos slices) by changing
    the sector offset to 2 but I came to the conclusion that it was better
    to do the alignment in disklabel / gpt / whatever higher-level
    partitioner floats your boat and not mess with anything the BIOS
    uses to boot the machine.

    My recommendation is to use a 1MB physical base alignment.  That's what
    I adjusted DragonFly's disklabel64 to do.  It's definitely best to
    have the partitioner deal with it instead of having to mess around
    manually because the partitioner can calculate the actual physical
    alignment by querying the kernel's disk subsystem regardless of the
    topology.
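
    As a rough illustration of the arithmetic involved (this is not
    disklabel64's actual code, and the function names are made up for
    the example), a partitioner that knows the absolute byte offset of
    the enclosing slice can pick a partition start whose physical
    offset lands on a 1MB boundary:

```python
ALIGN = 1024 * 1024  # 1MB physical base alignment (value from the text)

def align_up(offset_bytes, align=ALIGN):
    """Round a byte offset up to the next multiple of align."""
    return (offset_bytes + align - 1) // align * align

def aligned_start(slice_start_bytes, wanted_rel_bytes=0, align=ALIGN):
    """Return a partition start, relative to the slice, such that the
    absolute (physical) offset is a multiple of align.

    slice_start_bytes is assumed to be what the kernel's disk subsystem
    reports for the enclosing slice; here it is just a parameter.
    """
    physical = align_up(slice_start_bytes + wanted_rel_bytes, align)
    return physical - slice_start_bytes

# Illustrative case: a DOS slice starting at sector 63 with 512-byte sectors.
slice_start = 63 * 512
rel = aligned_start(slice_start)
```

    The point is that the alignment is computed against the physical
    offset, not the slice-relative one, so the slice itself can stay
    wherever the BIOS expects it.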

    There are several reasons for using a large alignment:

    * A variety of media already uses much larger physical block sizes.
      MLC flash uses 128K and SLC uses 64K blocks.  See the note below
      on why this matters even though SSDs do write combining.

    * A larger alignment is more likely to work well as a default in
      RAID configurations and doesn't hurt non-RAID.

    * The kernel cluster I/O subsystem wants to collect stuff into 64K-256K
      clusters for reading and writing (writing being the most important).
      A larger alignment plus some minor tweaks in the cluster code will
      cause the cluster writes to also be well aligned.

    * Even though UFS does not take advantage of cluster alignment
      (because BMAP tends to align only to the UFS block size, which
      is fairly small, usually <= 32K), filesystems such as ZFS (with
      128K blocks I believe) and HAMMER (with 64K blocks and 8MB super
      blocks) will.  And fixing up UFS isn't difficult.  One might need
      to mess with the cylinder group alignment and make some minor tweaks
      to the bmap allocator but that's about it.

    * A large alignment hurts nothing.  Who cares about ~512K-1MB of wasted
      space at the beginning of the drive?  I don't.

    This is particularly important for SSDs.  Even though SSDs do write
    combining, a properly aligned write will theoretically greatly improve
    write endurance by reducing internal fragmentation, reducing write
    amplification effects, and also reducing the amount of internal
    rewriting the drive does to defragment and wear-level.  It is hard
    to test this but I am seeing wear rates consistent with a 100TB write
    endurance on 40G Intel drives vendor-speced for a 35TB write endurance.

    So even though you might not see a major difference in performance
    you could very well see a big difference in write endurance.  It isn't
    possible to benchmark this with a standard benchmark which keeps the
    SSD 100% active so I've been using real work loads and it just takes
    forever to tick down the SSD's wear-meter.  The SSD also needs idle
    time to implement internal defragmentation and wear leveling efficiently
    (This seems more apparent in the OCZs than in the Intels).

    There are a lot of moving parts in the kernel related to alignment.
    The cluster code and the filesystem block allocation code are the two
    biggest issues and adjustments have to be made to take proper advantage
    of it, particularly for SSDs.

    So the answer is:  Aligning things certainly isn't going to hurt
    anything so you might as well kick it hard (use a large alignment)
    so you don't have to revisit the problem again a year from now.

    --

    For hard drives with larger physical sector sizes it shouldn't matter
    for asynchronous writes.  It really shouldn't.  And nearly all of UFS's
    writes are asynchronous.  That said:

    I read Thiago's posting.  I will note something specific about a ports
    tarball.  Ports has 261,000+ files in it, mostly small.  UFS and the
    cluster code CANNOT COMBINE those writes (because the buffer-cache for
    file data is per-vnode), so UFS will wind up doing a very large number
    of fragment-sized writes.

    These fragment-sized writes (4K in Thiago's aligned test that ran in
    1:25, and 2K in Thiago's aligned test that ran in 10:24) should STILL
    be write-combined in the drive.  That is, UFS STILL has good write
    linearity even with the small writes.

    So I suspect the issue here is that the drive is not properly
    write-combining the writes, possibly coupled with additional issues
    in UFS's bmap and inode allocator that might not be presenting the
    drive with enough write-combinable data that fits in the drive's cache,
    forcing the drive to do a lot of read-before-write.
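
    To make the read-before-write cost concrete, here is a small sketch
    that counts how many physical sectors a single write covers only
    partially.  It assumes 4K physical sectors and treats each write in
    isolation; the function name is invented for the example and it
    models nothing about any particular drive's firmware:

```python
def partial_sectors(offset, length, phys=4096):
    """Count physical sectors that a write covers only partially.

    Each such sector must be read before it can be rewritten, unless
    the drive manages to combine the write with its neighbours first.
    """
    if length <= 0:
        return 0
    end = offset + length
    first = offset // phys          # index of first sector touched
    last = (end - 1) // phys        # index of last sector touched
    head_partial = offset % phys != 0
    tail_partial = end % phys != 0
    if first == last:
        # The whole write sits inside one physical sector.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)

aligned_4k = partial_sectors(0, 4096)            # full sector, no read needed
lone_2k = partial_sectors(0, 2048)               # half a sector
misaligned_4k = partial_sectors(63 * 512, 4096)  # straddles two sectors
combined_2k = partial_sectors(0, 2048 + 2048)    # two adjacent 2K writes merged
```

    Two adjacent 2K fragment writes are harmless if the drive merges
    them into one 4K write before touching the media, which is why a
    failure to write-combine would hurt so much in the 2K case.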

    In terms of write-combinable data and UFS it could be a cylinder-group
    alignment issue.  Bitmap blocks are a particular problem because they
    use an odd-sized block size (typically 6K if I remember right), though
    I'm not sure how the filesystem fragment size affects it.

    You would have to instrument the write activity to determine how
    good the linearity is versus the size of the drive's RAM cache.
    There are definitely several possible explanations for the horrible
    performance when using 2K fragments.
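
    A minimal sketch of what such instrumentation might compute, assuming
    you can capture the writes as a list of (offset, length) pairs in
    issue order.  The trace format, the function names, and the greedy
    merge are all assumptions for illustration; a real drive's cache
    policy is not known and is certainly more complicated:

```python
def linearity(trace):
    """Fraction of writes that start exactly where the previous one ended."""
    if len(trace) < 2:
        return 1.0
    sequential = sum(
        1
        for (prev_off, prev_len), (off, _) in zip(trace, trace[1:])
        if off == prev_off + prev_len
    )
    return sequential / (len(trace) - 1)

def combined_runs(trace, cache_bytes):
    """Greedily merge back-to-back contiguous writes into runs.

    A run is closed when the next write is not contiguous with it or
    when merging would make the run larger than cache_bytes.
    """
    runs = []
    for off, length in trace:
        if runs:
            run_off, run_len = runs[-1]
            if off == run_off + run_len and run_len + length <= cache_bytes:
                runs[-1] = (run_off, run_len + length)
                continue
        runs.append((off, length))
    return runs

# Made-up trace of 2K fragment writes, not measured data.
trace = [(0, 2048), (2048, 2048), (8192, 2048), (4096, 2048)]
score = linearity(trace)
runs = combined_runs(trace, cache_bytes=1 << 20)
```

    Comparing the number of runs against the number of raw writes for
    a few cache sizes would give a first idea of whether the drive is
    even being handed combinable data.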

    ZFS (and also HAMMER) would not have this particular problem.  ZFS
    clearly has other issues in those tests but I don't know enough about
    its internals to guess, other than maybe it is a ZIL tuning issue.

						-Matt