Dump Utility cache efficiency analysis

Wed Jun 24 01:07:53 UTC 2009

:Hello
:
:This is regarding the dump utility cache efficiency analysis post made in
:February '07 by Peter Jeremy [
:http://lists.freebsd.org/pipermail/freebsd-hackers/2007-February/019666.html].
:Is this project still open? I would be interested to begin exploring
:FreeBSD (and contributing) by starting this project.
:
:I do have some basic understanding of the problem at hand - to determine if
:a unified cache would appeal as a more efficient/elegant solution compared
:to the per-process-cache in the Dump utility implementation. I admit I am
:new to this list and FreeBSD so I wouldn't be able to determine what the
:current implementation is, until I get started.
:...

     I think the cache in the dump utility is still the one I worked up
     a long time ago.  It was a quick and dirty job at the time, and it
     was never really designed for parallel operation, which is probably
     why it doesn't work so well in that regard.

     In my opinion, a unified cache would be an excellent improvement.
     Ultimately dump is an I/O-bound process, so I don't think we would
     really need to worry about the minor increase in CPU overhead
     from the additional locking needed.

     There are a few issues you will have to consider:

     * Dump uses a fork model for its children rather than pthreads.  You
       would either have to use the F_*LK fcntl() operations or use a
       simpler flock() scheme to lock across the children.  Alternatively
       you could change dump over to a pthreads model and use pthreads
       mutexes, but that would entail a lot more work.  Dump was never
       designed to be threaded.

     * The general issue with any caching scheme for dump is how much to
       actually cache per I/O vs the size of the cache.  Caching larger
       amounts of data hits diminishing returns as it also increases seek
       times and waste (cached data never used).  Caching smaller amounts
       of data hits diminishing returns as it causes the disk to seek more.
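
    To illustrate the first issue, here is a minimal sketch of locking
    across forked children with the F_*LK fcntl() operations.  It is not
    taken from dump: the function names, the counter file, and the
    whole-file lock are assumptions made up for the example.  Each child
    takes an exclusive lock with F_SETLKW, updates a counter stored at
    offset 0 of a shared file, and releases the lock.  A unified cache
    would protect its shared index in a similar way.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take (F_WRLCK) or release (F_UNLCK) a whole-file POSIX record lock. */
static int
lock_op(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */

    /* F_SETLKW blocks until the lock is granted; retry if interrupted. */
    while (fcntl(fd, F_SETLKW, &fl) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}

/*
 * Hypothetical demo (not dump code): fork nchildren processes that each
 * increment a counter at offset 0 of `path` iters times under the lock.
 * Returns the final counter value, or -1 on error.
 */
long
locked_counter_demo(const char *path, int nchildren, int iters)
{
    uint64_t val = 0;
    int fd, i, j, status, failed = 0;
    pid_t pid;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (pwrite(fd, &val, sizeof(val), 0) != (ssize_t)sizeof(val)) {
        close(fd);
        return -1;
    }

    for (i = 0; i < nchildren; i++) {
        pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            /* Child: fcntl locks are owned per process, so the
             * descriptor inherited from the parent can be used. */
            for (j = 0; j < iters; j++) {
                if (lock_op(fd, F_WRLCK) < 0)
                    _exit(1);
                if (pread(fd, &val, sizeof(val), 0) != (ssize_t)sizeof(val))
                    _exit(1);
                val++;
                if (pwrite(fd, &val, sizeof(val), 0) != (ssize_t)sizeof(val))
                    _exit(1);
                if (lock_op(fd, F_UNLCK) < 0)
                    _exit(1);
            }
            _exit(0);
        }
    }

    /* Parent: reap every child and note any failure. */
    while (wait(&status) > 0) {
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }

    if (pread(fd, &val, sizeof(val), 0) != (ssize_t)sizeof(val))
        failed = 1;
    close(fd);
    unlink(path);
    return failed ? -1 : (long)val;
}
```

    With the lock in place, 4 children doing 200 increments each should
    leave the counter at 800.  Without it, the read-modify-write cycles
    can interleave and lose updates.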

    Disk drives generally do have a track cache, but they typically only
    have 8-16M of cache RAM (32M in newer drives, particularly the higher
    capacity ones).  A track is typically about 1-2M (maybe higher now) so
    it doesn't take much seeking for the drive to blow out its internal
    track cache.  Caching that much data in a single read would probably
    be detrimental anyway.
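
    To put rough numbers on that, the small sketch below divides the
    cache sizes quoted above by the track sizes quoted above.  The
    helper name is made up for illustration, and it assumes the drive
    spends its entire cache on whole tracks, which real firmware may not
    do.

```c
#include <stddef.h>

/*
 * Hypothetical helper: how many whole tracks fit in a drive's cache,
 * assuming the entire cache is used for track data.
 */
size_t
tracks_held(size_t cache_bytes, size_t track_bytes)
{
    if (track_bytes == 0)
        return 0;
    return cache_bytes / track_bytes;
}
```

    An 8M cache with 2M tracks holds 4 tracks, and a 16M cache with 1M
    tracks holds 16, so a few dozen seeks at most can turn over the
    drive's whole cache.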

    This also means you do not necessarily want to cache too much
    linearly-read data, as the disk drive is already doing it for you.

    Because of all of this it is going to be tough to find cache parameters
    that work well generally, and the parameters are going to change
    drastically based on the amount of cache you specify on the command
    line and the size of the partition being dumped.

						-Matt