NCQ vs UFS/ZFS benchmark [Was: Re: FreeBSD 8.0 Performance (at
dillon at apollo.backplane.com
Fri Dec 4 22:26:20 UTC 2009
The biggest issue we've had (w/ DragonFly) on things like database
benchmarks, and MySQL in particular, is with the large number of
fsync() calls MySQL makes and less with SMP. SMP only really matters
when one is operating out of the cache. Read-heavy from-cache operations
on an open descriptor can run without the BGL on DragonFly with the
flip of a sysctl but it doesn't have nearly the same effect as, say,
disabling fsync has in tests which blow-out the system caches.
Disk I/O is a huge bottleneck so anything disk-bound tends to be
less reliant on cpu parallelism. Anything with NCQ, such as AHCI,
will greatly improve random disk reads and mixed reads & writes,
though it also has a tendency to give writes priority over reads
due to write I/Os returning nearly instantly (until the disk's own
cache fills up, anyway, which is another issue entirely). e.g. if you
have 32 tags and dedicate 1 tag for writes, then load up all 32 tags
(31 parallel reads and 1 parallel write), the write bandwidth will
wind up being far more than 1/32 of the available disk bandwidth.
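A toy back-of-the-envelope model illustrates the effect. The latencies here are assumed round numbers for illustration, not measurements, and this is a simulation, not driver code:

```python
# Toy model: 32 NCQ tags, 1 reserved for writes. Assumption for
# illustration: a write absorbed by the drive cache completes in
# ~0.1 ms while a random read pays ~8 ms of seek/rotation, and each
# completed I/O is immediately replaced with another of the same kind.

READ_MS, WRITE_MS = 8.0, 0.1

def ops_per_second(tags, write_tags):
    reads = (tags - write_tags) * (1000.0 / READ_MS)   # read IOPS
    writes = write_tags * (1000.0 / WRITE_MS)          # write IOPS
    return reads, writes

reads, writes = ops_per_second(32, 1)
write_share = writes / (reads + writes)   # far above the naive 1/32
```

With these assumed numbers the single write tag ends up with roughly 70% of the IOPS, not 1/32, which is exactly the read-starvation tendency described above.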
fsync() is an area where UFS can operate quite efficiently, at least
insofar as block-replacement write()'s which do not have to extend
the file's size. I dunno about ZFS but for something like HAMMER
an efficient fsync() requires implementing a forward (REDO) log
to remove all seeks and degenerate into only linear writes for
the fsync() operation itself. I've made some progress there but I
still have a ways to go.
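The forward-log idea can be sketched as a toy model. This is not HAMMER's actual on-disk format or API; the RedoLog name and the structures are illustrative only, showing how an fsync() degenerates into a single linear append:

```python
# Sketch of a forward (REDO) log: fsync() appends the dirty block
# images to a sequential log and issues one linear write, instead of
# seeking to each block's home location. All names are illustrative.

class RedoLog:
    def __init__(self):
        self.log = []   # stands in for a sequential on-disk log area
        self.head = 0   # append offset (in records)

    def fsync(self, dirty_blocks):
        # One linear append covering every dirty block: no seeks.
        record = [(blkno, data) for blkno, data in sorted(dirty_blocks.items())]
        self.log.append(record)   # models a single device write I/O
        self.head += len(record)
        return 1                  # number of device I/Os issued

log = RedoLog()
ios = log.fsync({7: b"aaa", 2: b"bbb"})   # two dirty blocks, one write
```

The home-location blocks can then be written back lazily, after the fsync() has already returned.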
SSD vs HD will skew the effect different subsystems have on performance,
of course, though even a SSD would benefit from a forward-log capable
of devolving the entire fsync() to a single device write I/O. SMP
becomes more important as I/O subsystems get faster.
One area where locking seems to matter more than SMP is when
one is mixing read() and write() operations on the same vnode. Here
the issue tends to be either:
(1) Holding an exclusive vnode lock during a write() while blocked on
the buffer cache, thus interfering with read()s.
Moving to an offset-range lock for read/write to ensure read/write
atomicity, and to deal with inode updates, solves this issue.
(I have the offset-range locks in DFly but I haven't turned off
the exclusive vnode lock for write()'s yet). I don't quite recall
but I think Linux has given up on guaranteeing read/write atomicity.
Unlocking the vnode while blocked on the buffer cache would also
work, as long as read vs write atomicity mechanics can be maintained
for the duration.
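The offset-range locking scheme can be sketched as a toy lock table. This is a simplified model, not DFly's implementation; the names are illustrative:

```python
# Toy offset-range lock table: a write() locks only [offset,
# offset+len) exclusively, so read()s on non-overlapping ranges of
# the same vnode can proceed concurrently instead of blocking on a
# whole-vnode exclusive lock.

class RangeLocks:
    def __init__(self):
        self.held = []   # list of (start, end, exclusive) ranges

    def _conflicts(self, start, end, exclusive):
        for s, e, ex in self.held:
            # Ranges overlap and at least one side is exclusive.
            if start < e and s < end and (exclusive or ex):
                return True
        return False

    def try_lock(self, start, length, exclusive):
        end = start + length
        if self._conflicts(start, end, exclusive):
            return False
        self.held.append((start, end, exclusive))
        return True

locks = RangeLocks()
locks.try_lock(0, 4096, exclusive=True)      # write() on the first page
locks.try_lock(8192, 4096, exclusive=False)  # read() elsewhere: no conflict
locks.try_lock(0, 512, exclusive=False)      # overlapping read must wait
```

Only the third request fails; a reader touching a different range of the file never blocks on the writer.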
Pre-caching/pre-creating the buffer cache buffers with the vnode
unlocked also helps, but increases cpu overhead as you have to
look up each buffer twice.
(2) A large number of buffer cache buffers undergoing physical write
I/O at once, and thus in a locked state for a long period of time
causing read()'s of the same buffers to block for similarly long
periods of time.
Limiting the number of buffer cache buffers you queue to the
underlying device at any given moment (via bawrite()) mostly
solves this problem. Note that I am not talking about the disk
device's queue here (NCQ vs not NCQ doesn't matter for disk
writes)... you have to actually NOT issue the bawrite() in the
first place so the buffer remains unlocked until the very last
moment.
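The throttling idea can be sketched as a toy flusher loop. The cap value and names here are illustrative, not DFly's actual flusher code:

```python
# Toy flusher: cap how many dirty buffers are handed to the device
# (and therefore locked for I/O) at once; the rest stay dirty but
# unlocked, so read()s of the same buffers don't stall behind writes.

MAX_INFLIGHT = 4   # illustrative cap, would be tuned in practice

def flush_round(dirty, inflight):
    """Issue writes for at most MAX_INFLIGHT - len(inflight) buffers."""
    issued = []
    while dirty and len(inflight) + len(issued) < MAX_INFLIGHT:
        issued.append(dirty.pop(0))   # this is the bawrite() point
    return issued                     # these buffers are now locked

dirty = list(range(10))               # ten dirty buffer numbers
issued = flush_round(dirty, [])       # only 4 go out; 6 stay unlocked
```

The remaining six buffers are still dirty but remain unlocked and fully readable until a later flush round picks them up.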
I implemented this on DFly along with pipelining fixes in the
buffer flusher thread and got very interesting blogbench results
during the pre-cache-blowout phase of the test. Basically
blogbench was able to write() at full speed (with disk I/O
saturated 100% with writes) without any detrimental effect on
read()s (which were being satisfied at full speed from the VM/buffer
cache) during that phase. Before the fixes the two would interfere
with each other quite a bit.
In fact, reducing the amount of time a buffer cache buffer undergoing
write() I/O remains locked is a really difficult problem because
you have to tune the data rate to match the disk drive's actual
write pipeline (which changes depending on the simultaneous read
load) so the disk drive's own caches don't get saturated
with dirty data and stall out the write I/Os that were queued to
it (leading to the related dirty buffer cache buffer remaining
locked longer). I haven't been able to automate it, since there's
no way to query the disk, but I have been able to tune things manually.
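The manual tuning amounts to metering the write submission rate, which can be sketched as a token bucket. The 50 MB/s drain rate and 1 MB burst below are arbitrary illustrative figures, not measured values, and the names are made up for the sketch:

```python
# Token-bucket write throttle: only hand a dirty buffer to the device
# when the budget allows, so the drive's internal write cache is never
# overfilled and queued writes don't stall (keeping buffers locked).

class WriteThrottle:
    def __init__(self, bytes_per_sec, burst):
        self.rate = bytes_per_sec   # hand-tuned drain-rate estimate
        self.burst = burst          # cap, models the drive cache size
        self.tokens = burst

    def tick(self, dt):
        # Refill the budget as time passes, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + self.rate * dt)

    def may_issue(self, nbytes):
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False   # hold the buffer; it stays dirty but unlocked

t = WriteThrottle(bytes_per_sec=50 << 20, burst=1 << 20)
first = t.may_issue(1 << 20)     # first MB fits within the burst
second = t.may_issue(64 << 10)   # budget exhausted until the next tick
t.tick(0.1)                      # 100 ms refills the bucket
```

The hard part, as noted above, is that the right rate depends on the simultaneous read load and can't be queried from the disk, so the constant has to be found by hand.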
In any case, I think the key takeaway here is that there are at least
four (and probably more) different subsystems in the codebase which
must be addressed to get good benchmark results. Many of these
benchmarks are doing simultaneous reads and writes which tend to tickle
(and require) that all the bottlenecks be addressed. SMP becomes more
important when system caches are well-utilized. Disk scheduling and
buffer cache management become more important as the disk gets more
saturated.