NCQ vs UFS/ZFS benchmark [Was: Re: FreeBSD 8.0 Performance (at Phoronix)]

Fri Dec 4 22:26:20 UTC 2009

    The biggest issue we've had (w/ DragonFly) on things like database
    benchmarks, and MySQL in particular, is with the large number of
    fsync() calls MySQL makes and less with SMP.  SMP only really matters
    when one is operating out of the cache.  Read-heavy from-cache operations
    on an open descriptor can run without the BGL on DragonFly with the
    flip of a sysctl but it doesn't have nearly the same effect as, say,
    disabling fsync has in tests which blow out the system caches.

    Disk I/O is a huge bottleneck so anything disk-bound tends to be
    less reliant on cpu parallelism.  Anything with NCQ, such as AHCI,
    will greatly improve random disk reads and mixed reads & writes,
    though it also has a tendency to give writes priority over reads
    due to write I/Os returning nearly instantly (until the disk's own
    cache fills up, anyway, which is another issue entirely).  For example,
    if you have 32 tags and dedicate 1 tag to writes, then load up all 32
    tags (31 parallel reads and 1 parallel write), the write bandwidth will
    wind up being far more than 1/32 of the available disk bandwidth.
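
    The tag arithmetic above can be sketched with a toy model.  It assumes
    each tag completes 1/latency commands per millisecond and that all
    commands are the same size; the function name and the latencies used
    in the note below are made up for illustration, not measured.

```c
#include <assert.h>

/*
 * Toy model, not a measurement: every tag is assumed to complete
 * 1/latency commands per millisecond, and reads and writes are
 * assumed to transfer the same amount of data.  Returns the
 * fraction of completed commands that are writes.
 */
double
write_share(int read_tags, int write_tags, double read_ms, double write_ms)
{
    assert(read_ms > 0.0 && write_ms > 0.0);

    double read_rate = read_tags / read_ms;     /* reads per ms  */
    double write_rate = write_tags / write_ms;  /* writes per ms */

    return write_rate / (read_rate + write_rate);
}
```

    With 31 read tags at an assumed 8ms per random read and 1 write tag
    at an assumed 0.1ms per cached write, write_share(31, 1, 8.0, 0.1)
    comes out around 0.72 rather than 1/32.  With equal latencies it
    falls back to exactly 1/32.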

    fsync() is an area where UFS can operate quite efficiently, at least
    insofar as block-replacement write()'s which do not have to extend
    the file's size.  I dunno about ZFS but for something like HAMMER
    an efficient fsync() requires implementing a forward (REDO) log
    to remove all seeks and degenerate into only linear writes for
    the fsync() operation itself.  I've made some progress there but I
    still have a ways to go.
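
    As a userland analogy only (this is not HAMMER's code, and the record
    layout, names and size limit are all hypothetical), a forward log
    turns the sync into one append plus one fsync() on the log.  A
    minimal sketch, assuming POSIX and a log opened with O_APPEND:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define REDO_MAX_PAYLOAD 4096   /* hypothetical per-record limit */

/* Hypothetical record header: where the data belongs in the real file. */
struct redo_header {
    uint64_t file_offset;
    uint32_t length;
    uint32_t pad;
};

/* Open (or create) the log so that every write lands at the end. */
int
redo_log_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/*
 * Append one REDO record and sync the log.  Header and payload go out
 * in a single write() so the log only ever grows linearly; the
 * in-place update of the real file can be done later.
 * Returns 0 on success, -1 on failure.
 */
int
redo_log_append(int log_fd, uint64_t file_offset, const void *data,
    uint32_t length)
{
    unsigned char rec[sizeof(struct redo_header) + REDO_MAX_PAYLOAD];
    struct redo_header hdr;
    size_t total;

    if (length > REDO_MAX_PAYLOAD)
        return -1;

    memset(&hdr, 0, sizeof(hdr));
    hdr.file_offset = file_offset;
    hdr.length = length;
    memcpy(rec, &hdr, sizeof(hdr));
    memcpy(rec + sizeof(hdr), data, length);

    total = sizeof(hdr) + length;
    if (write(log_fd, rec, total) != (ssize_t)total)
        return -1;
    return fsync(log_fd);
}
```

    A real implementation would also need checksums, replay on mount
    and log space reclamation, none of which is shown here.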

    SSD vs HD will skew the effect different subsystems have on performance,
    of course, though even an SSD would benefit from a forward-log capable
    of devolving the entire fsync() to a single device write I/O.  SMP
    becomes more important as I/O subsystems get faster.

    One area where locking seems to matter more than SMP is when
    one is mixing read() and write() operations on the same vnode.  Here
    the issue tends to be either:

    (1) Holding an exclusive vnode lock during a write() while blocked on
	the buffer cache, thus interfering with read()s.

	Moving to an offset-range lock for read/write to ensure read/write
	atomicity, and to deal with inode updates, solves this issue.
	(I have the offset-range locks in DFly but I haven't turned off
	the exclusive vnode lock for write()'s yet).  I don't quite recall
	but I think Linux has given up on guaranteeing read/write atomicity.

	Unlocking the vnode while blocked on the buffer cache would also
	work, as long as read vs write atomicity mechanics can be maintained
	for the duration.

	Pre-caching/pre-creating the buffer cache buffers with the vnode
	unlocked also helps, but increases cpu overhead as you have to
	look up each buffer twice.

    or

    (2) A large number of buffer cache buffers undergoing physical write
	I/O at once, and thus in a locked state for a long period of time
	causing read()'s of the same buffers to block for similarly long
	periods of time.

	Limiting the number of buffer cache buffers you queue to the
	underlying device at any given moment (via bawrite()) mostly
	solves this problem.  Note that I am not talking about the disk
	device's queue here (NCQ vs not NCQ doesn't matter for disk
	writes)...  you have to actually NOT issue the bawrite() in the
	first place so the buffer remains unlocked until the very last
	moment.

	I implemented this on DFly along with pipelining fixes in the
	buffer flusher thread and got very interesting blogbench results
	during the pre-cache-blowout phase of the test.  Basically
	blogbench was able to write() at full speed (with disk I/O
	saturated 100% with writes) without any detrimental effect on
	read()s (which were being satisfied at full speed from the VM/buffer
	cache) during that phase.  Before the fixes the two would interfere
	with each other quite a bit.

	In fact, reducing the amount of time a buffer cache buffer undergoing
	write() I/O remains locked is a really difficult problem because
	you have to tune the data rate to match the disk drive's actual
	write pipeline (which changes depending on the simultaneous read
	load) so the disk drive's own caches don't get saturated
	with dirty data and stall-out the write I/O's that were queued to
	it (leading to the related dirty buffer cache buffer remaining
	locked longer).  I haven't been able to automate it because there's
	no way to query the disk, but I have been able to tune things manually.
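
    The idea in (2) of not issuing the write until there is room for it
    can be sketched as a simple in-flight counter.  This is a
    single-threaded illustration with hypothetical names, not the DFly
    code; a kernel version would need proper locking, and the actual
    bawrite() call is only indicated by a comment.

```c
#include <assert.h>

/* Hypothetical cap on buffers handed to the device at any one time. */
struct write_throttle {
    int max_inflight;
    int inflight;
};

void
throttle_init(struct write_throttle *t, int max_inflight)
{
    assert(max_inflight > 0);
    t->max_inflight = max_inflight;
    t->inflight = 0;
}

/* Returns 1 if one more buffer may be issued now, 0 if it must wait. */
int
throttle_try_issue(struct write_throttle *t)
{
    if (t->inflight >= t->max_inflight)
        return 0;
    t->inflight++;
    return 1;
}

/* Called when a write I/O completes and its buffer is unlocked. */
void
throttle_complete(struct write_throttle *t)
{
    assert(t->inflight > 0);
    t->inflight--;
}

/*
 * One flusher pass over ndirty dirty buffers.  Only buffers that get
 * a slot are issued (this is where bawrite() would go); the rest stay
 * dirty but unlocked, so read()s of them are not blocked.
 * Returns the number of buffers issued.
 */
int
flush_pass(struct write_throttle *t, int ndirty)
{
    int issued = 0;

    while (issued < ndirty && throttle_try_issue(t))
        issued++;       /* the write would be issued here */
    return issued;
}
```

    With a cap of 4, a pass over 10 dirty buffers issues 4 and leaves 6
    untouched; another 2 are issued only after 2 completions come back.
    Picking the cap is the hard part, since it has to track the drive's
    actual write pipeline.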

    In any case, I think the key takeaway here is that there are at least
    four (and probably more) different subsystems in the codebase which
    must be addressed to get good benchmark results.  Many of these
    benchmarks are doing simultaneous reads and writes, which tends to
    tickle all of the bottlenecks at once (and requires that all of them
    be addressed).  SMP becomes more important when system caches are
    well-utilized.  Disk scheduling and buffer cache management become
    more important as the disk gets more saturated.

						-Matt