FreeBSD 5.3 I/O Performance / Linux 2.6.10 and DragonFly

Matthew Dillon dillon at apollo.backplane.com
Wed Feb 2 15:04:46 PST 2005


:>     I can figure some things out.  Clearly the BSD write numbers are dropping
:>     at a block size of 2048 due to vfs.write_behind being set to 1.
:
:Interesting, I didn't know of this. I really should re-read tuning(8). What 
:are the dangers of setting it to zero?

    There are three issues here.  First is how much of the buffer cache you
    want to allow a single application to monopolize.  Second is our 
    historically terrible filesystem syncer and buffer cache dirty page
    management.  Third is the fact that we even *HAVE* a buffer cache for
    reads that the system should be extracting directly out of the VM object.

    If you turn off write_behind a single application (the benchmark) can
    monopolize the buffer cache and greatly reduce the cache performance
    of other applications.  So e.g. on a large system doing lots of things
    you would want to leave this on (in its current incarnation).
    The idea behind the write-behind code is to flush out data blocks when
    enough data is present to be written reasonably efficiently to the disk.
    Right now that is approximately 64KB of data, but 'small writes' do not
    trigger the clustering code, hence the 2K transition you are seeing.
    The write-behind code also depresses the priority of the underlying
    VM pages allowing them to be reused more quickly relative to other
    applications running in the system, the idea being that data written
    in large blocks is unlikely to be read again any time soon.
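
    As a rough illustration only (the names, struct fields, and cutoffs below
    are invented for the example, not the actual kernel code), the write-behind
    decision boils down to something like:

	#include <stddef.h>

	#define CLUSTER_SIZE	(64 * 1024)	/* ~64KB flush granularity */

	struct wb_state {
		size_t	seq_dirty;	/* sequential dirty bytes not yet flushed */
		int	write_behind;	/* mirrors the vfs.write_behind idea */
	};

	/*
	 * Called after a write() completes.  Returns nonzero when enough
	 * sequential dirty data has accumulated to issue an efficient
	 * cluster write (and, in the real code, depress the page priority).
	 */
	static int
	wb_should_flush(struct wb_state *ws, size_t write_size)
	{
		if (ws->write_behind == 0)
			return (0);		/* vfs.write_behind=0 behavior */
		ws->seq_dirty += write_size;

		/*
		 * Small writes never enter the clustering path at all, which
		 * is what produces the knee around a 2K block size (the
		 * cutoff used here is an arbitrary stand-in).
		 */
		if (write_size < 4096)
			return (0);

		if (ws->seq_dirty >= CLUSTER_SIZE) {
			ws->seq_dirty = 0;	/* flush and start accumulating again */
			return (1);
		}
		return (0);
	}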

    The second issue is our historically terrible filesystem syncer.  The
    write_behind greatly reduces the burden on the buffer cache and makes it
    work better.  If you turn it off, applications other than the benchmark
    trying to use the system will probably get pretty sludgy due to blockages
    in the buffer cache created by the benchmark.
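
    (As an aside, the knob itself is just a sysctl, so besides running
    'sysctl vfs.write_behind=0' you can flip it from a C program with
    sysctlbyname(3).  Minimal example; needs root to actually change it:)

	#include <sys/types.h>
	#include <sys/sysctl.h>
	#include <stdio.h>

	int
	main(void)
	{
		int old, new = 0;
		size_t oldlen = sizeof(old);

		/* Read the current value and set vfs.write_behind to 0. */
		if (sysctlbyname("vfs.write_behind", &old, &oldlen,
		    &new, sizeof(new)) == -1) {
			perror("sysctlbyname");
			return (1);
		}
		printf("vfs.write_behind was %d, now 0\n", old);
		return (0);
	}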

    In FreeBSD-5 the vnode dirty/clean buffer list is now a splay tree,
    which is an improvement over what we had before, but the real issue with
    the filesystem syncer is the fact that it tries to write out every single
    dirty buffer associated with a file all at once.  What it really needs to
    do (and OpenBSD or NetBSD does this) is only write out up to X (say, 1)
    megabytes of data, remember where it left off, and then proceed to the
    next dirty file.
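
    For concreteness, a bounded per-file pass might look roughly like the
    sketch below (plain user-level C with invented types; real vnode and
    buffer handling, locking, and error paths are all omitted):

	#include <stddef.h>

	#define SYNC_QUANTUM	(1024 * 1024)	/* at most ~1MB per file per pass */

	struct dirty_file {
		struct dirty_file *next;
		size_t	resume_off;	/* where the previous pass left off */
		size_t	dirty_len;	/* total dirty bytes in the file */
	};

	/* Stand-in for the real buffer flush; pretend it wrote 'len' bytes. */
	static size_t
	flush_range(struct dirty_file *f, size_t off, size_t len)
	{
		(void)f; (void)off;
		return (len);
	}

	static void
	syncer_pass(struct dirty_file *head)
	{
		struct dirty_file *f;
		size_t todo;

		for (f = head; f != NULL; f = f->next) {
			todo = f->dirty_len - f->resume_off;
			if (todo == 0)
				continue;
			if (todo > SYNC_QUANTUM)	/* don't camp on one file */
				todo = SYNC_QUANTUM;
			f->resume_off += flush_range(f, f->resume_off, todo);
			/* move on; come back to this file on the next pass */
		}
	}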

    The write_behind code really needs to be replaced with something integrated
    into a filesystem syncer (as described above).  That is, it should detect
    the existence of a large amount of sequential dirty data and it should
    kick another thread to flush it out synchronously, but it should not
    try to do it itself asynchronously.  The big problem with trying to buffer
    that much data asynchronously is that you wind up blocking on the disk
    device when the file is removed because so much I/O is marked 
    'in progress'.   The data set size should be increased from 64KB
    to 1MB as well.

    If the flushing can be done correctly it should be possible to have a
    good implementation of write_behind WITHOUT impacting cache performance.
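
    In userland terms the hand-off could be sketched with POSIX threads
    (again, invented names; the kernel primitives and the actual I/O are
    left out).  The point is that the write path only records the work and
    wakes the flusher; it never issues the potentially blocking I/O itself:

	#include <pthread.h>
	#include <stddef.h>

	#define FLUSH_THRESHOLD	(1024 * 1024)	/* ~1MB, per the suggestion above */

	static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  flush_cv   = PTHREAD_COND_INITIALIZER;
	static size_t pending_bytes;		/* sequential dirty data noticed */

	/* Called from the write path when sequential dirty data is detected. */
	void
	note_sequential_dirty(size_t nbytes)
	{
		pthread_mutex_lock(&flush_lock);
		pending_bytes += nbytes;
		if (pending_bytes >= FLUSH_THRESHOLD)
			pthread_cond_signal(&flush_cv);	/* wake the flusher */
		pthread_mutex_unlock(&flush_lock);
	}

	/* Dedicated flusher thread: does the actual writing synchronously. */
	void *
	flusher_thread(void *arg)
	{
		size_t todo;

		(void)arg;
		for (;;) {
			pthread_mutex_lock(&flush_lock);
			while (pending_bytes < FLUSH_THRESHOLD)
				pthread_cond_wait(&flush_cv, &flush_lock);
			todo = pending_bytes;
			pending_bytes = 0;
			pthread_mutex_unlock(&flush_lock);

			/* ... write 'todo' bytes out here, synchronously ... */
			(void)todo;
		}
		return (NULL);
	}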

    The third issue is the fact that we even have a buffer cache for things 
    like read() that would be better served going directly to the VM object.
    I suspect that cache performance could be increased by a huge amount by
    having the file->read go directly to the VM object instead of recursing
    through 8 subroutine levels, instantiating, and garbage collecting the
    buffer cache.
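
    As a user-space caricature of that idea (every type below is invented for
    the example), 'going directly to the VM object' means copying straight
    out of resident pages instead of instantiating and then tearing down an
    intermediate buffer:

	#include <stddef.h>
	#include <string.h>

	#define PAGE_SIZE 4096

	struct vm_object_sketch {
		char	**pages;	/* pages[i] == NULL if not resident */
		size_t	  npages;
	};

	/*
	 * Copy resident data directly from the object's pages into the
	 * caller's buffer.  Returns bytes copied; stops at the first
	 * non-resident page, where the real thing would go to disk.
	 */
	static size_t
	vm_object_read(struct vm_object_sketch *obj, size_t off,
	    void *buf, size_t len)
	{
		size_t copied = 0;
		size_t pidx, poff, chunk;

		while (copied < len) {
			pidx = (off + copied) / PAGE_SIZE;
			poff = (off + copied) % PAGE_SIZE;
			chunk = PAGE_SIZE - poff;
			if (pidx >= obj->npages || obj->pages[pidx] == NULL)
				break;		/* not resident */
			if (chunk > len - copied)
				chunk = len - copied;
			memcpy((char *)buf + copied,
			    obj->pages[pidx] + poff, chunk);
			copied += chunk;
		}
		return (copied);
	}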

:>     clearly, Linux is not bothering to write out ANY data, and is then able
:>     to take advantage of the fact that the test file is being destroyed by
:>     iozone (so it can throw away the data rather than write it out).  This
:>     skews the numbers to the point where the benchmark doesn't even come
:>     close to reflecting reality, though I do believe it points to an issue with
:>     the BSDs ... the write_behind heuristic is completely out of date now
:>     and needs to be reworked.
:
:http://www.iozone.org is what I was using to test with.  Although right 
:now, the box I am trying to put together is a Samba and NFS server for 
:mostly static web content.
:
:In the not too distant future, a file server for IMAP/POP3 front ends.  I 
:think postmark does a good job at simulating that.
:
:Are there better benchmarks / methods of testing that would give a more 
:fair comparison that you know of? I know all benchmarks have many caveats, 
:but I am trying to approach this somewhat methodically.  I am just about to 
:start another round of testing with nfs using multiple machines pounding 
:the one server.  I was just going to run postmark on the 3 client machines 
:(starting out at the same time).

    Boy, I just don't know.  Benchmarks have their uses, but the ones that
    simulate more than one process accessing the disk are almost certainly
    more realistic than ones like iozone, which just runs a single process
    and does best when it is allowed to monopolize the entire system.
    Bonnie is probably more accurate than iozone; it at least tries a lot
    harder to avoid side effects from prior tests.

:Ultimately I don't give a toss if one is 10% or even 20% better than the 
:other.  For that money, a few hundred dollars in RAM and CPU would change 
:that.  We are mostly a BSD shop so I don't want to deploy a LINUX box for 
:25% faster disk I/O.  But if the differences are far more acute, I need to 
:perhaps take a bit more notice.
:
:>     The read tests are less clear.  iozone runs its read tests just after
:>     it runs its write tests, so filesystem syncing and write flushing are
:>     going to have a huge effect on the read numbers.  I suspect that this
:>     is skewing the results across the spectrum.  In particular, I don't
:>     see anywhere near the difference in cache-read performance between
:>     FreeBSD-5 and DragonFly.  But I guess I'll have to load up a few test
:>     boxes myself and do my own comparisons to figure out what is going on.
:>

    Well, the 4-way explains the cache performance on the read tests at
    least.  What you are seeing is the BGL removal in FreeBSD-5 versus
    DFly.  Try it on a UP machine, though, and I'll bet the numbers will be
    reversed.

    In any case, the #1 issue that should be on both our plates is fixing
    up the filesystem syncer and modernizing the write_behind code.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>

