FreeBSD 5.3 I/O Performance / Linux 2.6.10 and dragonfly
dillon at apollo.backplane.com
Wed Feb 2 15:04:46 PST 2005
:> I can figure some things out. Clearly the BSD write numbers are dropping
:> at a block size of 2048 due to vfs.write_behind being set to 1.
:Interesting, I didn't know of this. I really should re-read tuning(8). What
:are the dangers of setting it to zero?
There are three issues here. First is how much of the buffer cache you
want to allow a single application to monopolize. Second is our
historically terrible filesystem syncer and buffer cache dirty page
management. Third is the fact that we even *HAVE* a buffer cache for
reads that the system should be extracting directly out of the VM object.
If you turn off write_behind a single application (the benchmark) can
monopolize the buffer cache and greatly reduce the cache performance
of other applications. So e.g. on a large system doing lots of things
you would want to leave this on (in its current incarnation).
The idea behind the write-behind code is to flush out data blocks when
enough data is present to be reasonably efficient to the disk. Right
now that is approximately 64KB of data but 'small writes' do not
trigger the clustering code, hence the 2K transition you are seeing.
The write-behind code also depresses the priority of the underlying
VM pages allowing them to be reused more quickly relative to other
applications running in the system, the idea being that data written
in large blocks is unlikely to be read again any time soon.
The second issue is our historically terrible filesystem syncer. The
write_behind greatly reduces the burden on the buffer cache and makes it
work better. If you turn it off, applications other than the benchmark
trying to use the system will probably get pretty sludgy due to blockages
in the buffer cache created by the benchmark.
In FreeBSD-5 the vnode dirty/clean buffer list is now a splay tree,
which is an improvement over what we had before but the real issue with
the filesystem syncer is the fact that it tries to write out every single
dirty buffer associated with a file all at once. What it really needs to
do (as OpenBSD or NetBSD does) is write out only up to X (say, 1)
megabytes of data, remember where it left off, and then proceed to the
next dirty file.
The write_behind code really needs to be replaced with something integrated
into a filesystem syncer (as described above). That is, it should detect
the existence of a large amount of sequential dirty data and it should
kick another thread to flush it out synchronously, but it should not
try to do it itself asynchronously. The big problem with trying to buffer
that much data asynchronously is that you wind up blocking on the disk
device when the file is removed because so much I/O is marked
'in progress'. The data set size should be increased from 64KB
to 1MB as well.
If the flushing can be done correctly it should be possible to have a
good implementation of write_behind WITHOUT impacting cache performance.
The third issue is the fact that we even have a buffer cache for things
like read() that would be better served going directly to the VM object.
I suspect that cache performance could be increased by a huge amount by
having the file->read go directly to the VM object instead of recursing
through 8 subroutine levels, instantiating, and garbage collecting
buffer cache buffers along the way.
:> clearly, Linux is not bothering to write out ANY data, and then able to
:> take advantage of the fact that the test file is being destroyed by
:> iozone (so it can throw away the data rather than write it out). This
:> skews the numbers to the point where the benchmark doesn't even come
:> close to reflecting reality, though I do believe it points to an issue with
:> the BSDs ... the write_behind heuristic is completely out of date now
:> and needs to be reworked.
:http://www.iozone.org is what I was using to test with. Although right
:now, the box I am trying to put together is a Samba and NFS server for
:mostly static web content.
:In the not too distant future, a file server for IMAP/POP3 front ends. I
:think postmark does a good job at simulating that.
:Are there better benchmarks / methods of testing that would give a more
:fair comparison that you know of? I know all benchmarks have many caveats,
:but I am trying to approach this somewhat methodically. I am just about to
:start another round of testing with nfs using multiple machines pounding
:the one server. I was just going to run postmark on the 3 clients machines
:(starting out at the same time).
Boy, I just don't know. Benchmarks have their uses, but the ones that
simulate more than one process accessing the disk are almost certainly
more realistic than ones like iozone, which just run a single process
and do best when they are allowed to monopolize the entire system.
Bonnie is probably more accurate than iozone; it at least tries a lot
harder to avoid side effects from prior tests.
:Ultimately I don't give a toss if one is 10% or even 20% better than the
:other. For that money, a few hundred dollars in RAM and CPU would change
:that. We are mostly a BSD shop so I don't want to deploy a LINUX box for
:25% faster disk I/O. But if the differences are far more acute, I need to
:perhaps take a bit more notice.
:> The read tests are less clear. iozone runs its read tests just after
:> it runs its write tests, so filesystem syncing and write flushing are
:> going to have a huge effect on the read numbers. I suspect that this
:> is skewing the results across the spectrum. In particular, I don't
:> see anywhere near the difference in cache-read performance between
:> FreeBSD-5 and DragonFly. But I guess I'll have to load up a few test
:> boxes myself and do my own comparisons to figure out what is going on.
Well, the 4-way explains the cache performance on the read tests at
least. What you are seeing is the BGL removal in FreeBSD-5 versus
DFly. Try it on a UP machine, though, and I'll bet the numbers will be
much closer.

In any case, the #1 issue that should be on both our plates is fixing
up the filesystem syncer and modernizing the write_behind code.
<dillon at backplane.com>