ATA 4K sector issues
Matthew Dillon
dillon at apollo.backplane.com
Wed Mar 17 20:23:25 UTC 2010
We experimented a bit with aligning fdisk (dos slices) by changing
the sector offset to 2 but I came to the conclusion that it was better
to do the alignment in disklabel / gpt / whatever higher-level
partitioner floats your boat and not mess with anything the BIOS
uses to boot the machine.
My recommendation is to use a 1MB physical base alignment. That's what
I adjusted DragonFly's disklabel64 to do. It's definitely best to
have the partitioner deal with it instead of having to mess around
manually because the partitioner can calculate the actual physical
alignment by querying the kernel's disk subsystem regardless of the
topology.
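As a rough sketch of the arithmetic a partitioner has to do (this is not
disklabel64's actual code; the function name and the example offsets are
made up for illustration), rounding a partition start up to a 1MB physical
boundary looks something like this:

```python
ALIGNMENT = 1024 * 1024  # 1MB physical base alignment, in bytes


def aligned_start(slice_base, min_offset, alignment=ALIGNMENT):
    """Return the smallest offset >= min_offset within the slice such that
    slice_base + offset lands on a multiple of alignment (all in bytes).

    slice_base is the physical byte offset of the enclosing slice, which a
    real partitioner would get from the kernel rather than hard-code.
    """
    physical = slice_base + min_offset
    rounded = (physical + alignment - 1) // alignment * alignment
    return rounded - slice_base


# Hypothetical example: a slice that starts at sector 63 (512-byte sectors),
# with the first 16 sectors of the slice reserved for the label.
slice_base = 63 * 512
offset = aligned_start(slice_base, 16 * 512)
print(offset, (slice_base + offset) % ALIGNMENT)
```

The point is that the offset stored in the label is relative to the slice,
so the rounding has to be done on the physical address and then converted
back.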
There are several reasons for using a large alignment:
* A variety of media already uses much larger physical block sizes.
MLC flash uses 128K and SLC uses 64K blocks. See the note below
on why this matters even though SSDs do write combining.
* A larger alignment is more likely to work well as a default in
RAID configurations and doesn't hurt non-RAID.
* The kernel cluster I/O subsystem wants to collect stuff into 64K-256K
clusters for reading and writing (writing being the most important).
A larger alignment plus some minor tweaks in the cluster code will
cause the cluster writes to also be well aligned.
* Even though UFS does not take advantage of cluster alignment
(because BMAP tends to align only to the UFS block size, which
is fairly small, usually <= 32K), filesystems such as ZFS (with
128K blocks I believe) and HAMMER (with 64K blocks and 8MB super
blocks) will. And fixing up UFS isn't difficult. One might need
to mess with the cylinder group alignment and make some minor tweaks
to the bmap allocator but that's about it.
* A large alignment hurts nothing. Who cares about ~512K-1MB of wasted
space at the beginning of the drive? I don't.
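A quick way to see why one large power-of-2 base covers all of the smaller
cases at once: every block size from the list above, up to the 256K
cluster size, divides 1MB evenly. The sketch below just checks that. The
table of sizes is taken from the list (the labels are mine), and the
63-sector base is only there as a contrasting example of the traditional
DOS offset.

```python
KB = 1024
ALIGNMENT = 1024 * KB  # 1MB base alignment, in bytes

# Block sizes mentioned in the list above (labels are illustrative).
BLOCK_SIZES = {
    "4K physical sector": 4 * KB,
    "UFS block (upper end)": 32 * KB,
    "SLC flash block / HAMMER block": 64 * KB,
    "MLC flash block / ZFS block": 128 * KB,
    "largest cluster I/O": 256 * KB,
}


def not_covered(base, sizes):
    """Return the names of block sizes that do NOT evenly divide base."""
    return [name for name, size in sizes.items() if base % size != 0]


print(not_covered(ALIGNMENT, BLOCK_SIZES))   # 1MB base
print(not_covered(63 * 512, BLOCK_SIZES))    # traditional 63-sector base
```

With the 1MB base the first list comes back empty. With the 63-sector base
every entry is listed, including the 4K sector.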
This is particularly important for SSDs. Even though SSDs do write
combining, a properly aligned write will theoretically greatly improve
write endurance by reducing internal fragmentation, reducing write
amplification effects, and also reducing the amount of internal
rewriting the drive does to defragment and wear-level. It is hard
to test this but I am seeing wear rates consistent with a 100TB write
endurance on 40G Intel drives vendor-speced for a 35TB write endurance.
So even though you might not see a major difference in performance
you could very well see a big difference in write endurance. It isn't
possible to benchmark this with a standard benchmark, which keeps the
SSD 100% active, so I've been using real work loads and it just takes
forever to tick down the SSD's wear meter. The SSD also needs idle
time to implement internal defragmentation and wear leveling efficiently
(this seems more apparent in the OCZs than in the Intels).
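For what it's worth, the kind of estimate behind a figure like that is
just a linear extrapolation from the wear meter. The sketch below assumes
wear accumulates linearly with bytes written, which is a simplification,
and the sample numbers (10TB written, 10% of the meter used) are
hypothetical, not measurements from these drives. Only the 35TB vendor
figure comes from the text above.

```python
VENDOR_SPEC_TB = 35  # vendor-specified write endurance mentioned above


def projected_endurance_tb(written_tb, wear_fraction_used):
    """Linearly extrapolate total write endurance from the amount written
    so far and the fraction of the drive's wear meter consumed so far."""
    if not 0 < wear_fraction_used <= 1:
        raise ValueError("wear_fraction_used must be in (0, 1]")
    return written_tb / wear_fraction_used


# Hypothetical reading: 10TB written, wear meter shows 10% consumed.
projected = projected_endurance_tb(written_tb=10.0, wear_fraction_used=0.10)
print(projected, projected / VENDOR_SPEC_TB)
```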
There are a lot of moving parts in the kernel related to alignment.
The cluster code and the filesystem block allocation code are the two
biggest issues and adjustments have to be made to take proper advantage
of it, particularly for SSDs.
So the answer is: Aligning things certainly isn't going to hurt
anything so you might as well kick it hard (use a large alignment)
so you don't have to revisit the problem again a year from now.
--
For hard drives with larger physical sector sizes it shouldn't matter
for asynchronous writes. It really shouldn't. And nearly all of UFS's
writes are asynchronous. That said:
I read Thiago's posting. I will note some specifics about a ports
tarball. Ports has 261,000+ files in it, mostly small. UFS and the
cluster code CANNOT COMBINE those writes (because the buffer-cache for
file data is per-vnode), so UFS will wind up doing a very large number
of fragment-sized writes.
These fragment-sized writes (4K in Thiago's aligned test that ran in
1:25, and 2K in Thiago's aligned test that ran in 10:24) should STILL
be write-combined in the drive. That is, UFS STILL has good write
linearity even with the small writes.
So I suspect the issue here is that the drive is not properly
write-combining the writes, possibly coupled with additional issues
in UFS's bmap and inode allocator that might not be presenting the
drive with enough write-combinable data that fits in the drive's cache,
forcing the drive to do a lot of read-before-write.
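To make the read-before-write point concrete, here is a small sketch that
counts how many 4K physical sectors a single write only partially covers.
If the drive can't combine that write with its neighbors in cache, each
partially covered sector has to be read before it can be rewritten. The
function name is made up and this says nothing about what any particular
drive's firmware really does.

```python
def partial_sectors(offset, length, sector=4096):
    """Count physical sectors that a write of `length` bytes at byte
    `offset` touches without covering completely."""
    if length <= 0:
        return 0
    end = offset + length
    first, last = offset // sector, (end - 1) // sector
    count = 0
    # Only the first and last sectors touched can be partially covered.
    for s in {first, last}:
        lo, hi = s * sector, (s + 1) * sector
        if offset > lo or end < hi:
            count += 1
    return count


print(partial_sectors(0, 4096))         # aligned 4K fragment
print(partial_sectors(0, 2048))         # 2K fragment
print(partial_sectors(63 * 512, 4096))  # 4K fragment at a 63-sector offset
```

An aligned 4K write covers its sector completely. A lone 2K write always
leaves half a sector to be read back, and a 4K write at a 63-sector offset
straddles two sectors and covers neither.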
In terms of write-combinable data and UFS it could be a cylinder-group
alignment issue. Bitmap blocks are a particular problem because they
use an odd-sized block size (typically 6K if I remember right), though
I'm not sure how the filesystem fragment size affects it.
You would have to instrument the write activity to determine how
good the linearity is versus the size of the drive's RAM cache.
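Assuming you can capture the writes as a list of (byte offset, length)
pairs in the order they are issued, a first-cut linearity measure could
look like the sketch below. The trace format, the function names and the
sample trace are all hypothetical. A real test would pull this from
whatever I/O tracing the kernel offers and compare the run lengths against
the drive's actual cache size.

```python
def linearity(trace):
    """Fraction of writes that start exactly where the previous one ended."""
    if len(trace) < 2:
        return 0.0
    sequential = sum(
        1
        for (prev_off, prev_len), (off, _) in zip(trace, trace[1:])
        if off == prev_off + prev_len
    )
    return sequential / (len(trace) - 1)


def longest_run_bytes(trace):
    """Size in bytes of the longest run of back-to-back contiguous writes."""
    best = run = 0
    prev_end = None
    for offset, length in trace:
        run = run + length if offset == prev_end else length
        best = max(best, run)
        prev_end = offset + length
    return best


# Hypothetical trace of 2K fragment writes with one seek in the middle.
trace = [(0, 2048), (2048, 2048), (4096, 2048), (65536, 2048), (67584, 2048)]
print(linearity(trace), longest_run_bytes(trace))
```

A high linearity with runs much shorter than the drive's cache would point
at the drive. Low linearity would point back at the allocator.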
There are definitely several possible explanations for the horrible
performance when using 2K fragments.
ZFS (and also HAMMER) would not have this particular problem. ZFS
clearly has other issues in those tests but I don't know enough about
its internals to guess, other than maybe it is a ZIL tuning issue.
-Matt