Flash disks and FFS layout heuristics
Matthew Dillon
dillon at apollo.backplane.com
Tue Apr 1 13:10:24 PDT 2008
:If you've asked, I've missed the question.
:
:We tend to size ram and embedded NAND the same. The latest numbers I can
:discuss are several years old and were 64mb/64mb. Engineering *always*
:wants more of each, but the BOM rules.
64MB is tiny. None of the problems with any of the approaches we've
discussed even exist with devices that small in an embedded system.
You barely need to even implement a filesystem topology, let alone
anything sophisticated.
To be clear, because I really don't understand how you can possibly
argue that the named-block storage layer is bad in a device that small...
the only sophistication a named-block storage model needs is when it must
create a forward lookup on-flash to complement the reverse lookup you
get from the auxiliary storage. Given that you can trivially cache
many translations in memory, not to mention do a boot-up scan of a flash
that small, the only performance impact would be writing out a portion
of the forward translation topology every N (N > 1000) or so page writes
(however many translations can be conveniently cached in system memory).
In a small flash device the embedded application will determine whether
you even need such a table... frankly, unless it's a general purpose
computing device like an iPhone you probably wouldn't need an on-flash
forward lookup, you could simply size the blocks to guarantee 99% flash
utilization versus the number of files you expect to have to maintain
(so, for example, the named block size could be 512K if you expected
to have to maintain 100 files on a 64MB device). This doesn't mean the
filesystem would have to use a 512K block size, that would only be the
case if the filesystem were flash-unaware.
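The sizing arithmetic here is worth spelling out. A minimal sketch, with the 64MB / 512K / 100-file numbers from the paragraph above treated purely as illustrative assumptions:

```python
# Sketch of the named-block sizing arithmetic (illustrative numbers
# only; the "named block" is the translation unit, not necessarily
# the filesystem's block size).
DEVICE_SIZE = 64 * 1024 * 1024      # 64MB embedded flash
NAMED_BLOCK_SIZE = 512 * 1024       # 512K per named block
EXPECTED_FILES = 100

named_blocks = DEVICE_SIZE // NAMED_BLOCK_SIZE   # only 128 of them

# With so few named blocks, the whole name -> physical-location map
# (rebuilt at boot from the reverse lookup in the auxiliary area)
# fits in a trivially small in-memory table, so no on-flash forward
# lookup is needed at all.
assert named_blocks == 128
assert named_blocks >= EXPECTED_FILES   # at least one block per file
```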
It's seriously a non-issue. You are making too many assumptions about
how named blocks would be used, particularly if the filesystem is
flash-aware. Named blocks do not have to be 'sectors' or 'filesystem
blocks' or anything of the sort. They can be much larger.. they can
easily be multiples of a flash page though you don't want to make them
too large because a failed page also fails the whole named-block covering
that page. They can be erase units (probably the best fit).
This leaves the filesystem layer (and here we are talking about a
flash-aware filesystem), with a hellofalot more implementation
flexibility.
FORWARD LOOKUP ON-FLASH TOPOLOGY
There are many choices available for the forward lookup topology,
assuming you need one. Here again we are describing the need to have
one (or at least one that would be considered sophisticated) only for
larger flash devices -- really true solid state storage.
We aren't talking about having to write out tiny little updates to
B-Tree elements... that's stupid. Only an idiot would do that.
Because you can cache new lookups in system memory and because you do
NOT have to flush the forward lookup topology to flash for crash
recovery purposes, the sole limiting factor for the efficiency of the
forward lookup flush to flash is the amount of system memory you are
willing to set aside to cache new translations. Since translations are
fairly small structures we are probably talking not dozens, not hundreds,
but probably at least a thousand translations before any sort of flush
would be needed.
Let's be clear here. That's ONE page write every THOUSAND page writes
worth of overhead. There are no write performance issues.
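To see where the one-in-a-thousand figure comes from, here is a back-of-the-envelope sketch. The 8K page size and 8-byte translation entry (say, a 32-bit named-block id plus a 32-bit physical location) are assumptions for illustration, not a prescribed format:

```python
# Rough overhead arithmetic for the forward-lookup flush.
PAGE_SIZE = 8 * 1024        # assumed flash page size
TRANSLATION_SIZE = 8        # assumed bytes per cached translation

translations_per_page = PAGE_SIZE // TRANSLATION_SIZE   # 1024

# Accumulating that many dirty translations in RAM before flushing
# amortizes one metadata page write over ~1000 data page writes:
overhead = 1 / translations_per_page
assert translations_per_page == 1024
assert overhead < 0.001     # well under 0.1% write overhead
```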
The actual on-flash topology for the forward lookup? With such a large
rollup cache available it almost doesn't matter, but let's say we wanted
to limit forward lookups to 3 levels.
Let's take a 2G flash device with 8K pages, just to be nice about it.
That's 262144 named blocks. Not a very large number, eh? Why you
could almost fit that in system memory (and maybe you can!) and obviate
the need for an on-flash forward lookup topology at all.
But let's say you COULDN'T fit that in system memory. Hmm. 3 levels,
262144 entries maximum (less in real life). That would be a 3-layer
radix tree with a radix of 64. The top layer would almost certainly
be cacheable in system memory (and maybe even two layers) so we are
talking one or two page reads from flash to do the lookup and the
update mechanic, being a radix tree, would be to sync the bits of the
radix tree that were modified by the new translations all the way up
to the root.
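The lookup path through such a tree is just index extraction. A minimal sketch of the 3-level, radix-64 structure described above (2G device / 8K pages gives 2^18 = 262144 named blocks; the bit layout is an assumption, not a real implementation):

```python
# 3-level radix tree with radix 64: 64**3 == 262144 named blocks.
RADIX_BITS = 6              # radix 64 -> 6 bits per level
LEVELS = 3

assert (1 << (RADIX_BITS * LEVELS)) == 262144

def radix_path(block_id):
    """Return the (top, middle, leaf) indices for a named-block id."""
    return (
        (block_id >> 12) & 63,  # top level (cacheable in RAM)
        (block_id >> 6) & 63,   # middle level (often cacheable too)
        block_id & 63,          # leaf level (read from flash)
    )

# A sync walks the dirtied nodes along these paths up to the root.
print(radix_path(100000))   # one index per level
```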
Clearly given the number of 'dirty' translations that would need to be
synchronized, you could easily fill a flash page and then simply retire
the synced translations from system memory, and repeat as often as
necessary to maintain the dirty ratio in the cache in system memory
at appropriate levels. You can also clearly accumulate enough dirty
translations for the sync to be worthwhile... that is, be guaranteed
to fill a whole page. You do NOT have to sync for recovery purposes
so it becomes an issue that is solely related to the system cache
and nothing else.
I'll add something else with regards to radix trees using large radii...
you can usually cache just about the whole damn thing except the leaf
level in system memory. Think about that for a moment and in particular
think about how it greatly reduces the number of actual flash reads
needed to perform the lookup.
I'll add something else with regards to on-storage radix trees. You can
also adjust the layout so the BOTTOM few levels of the radix tree,
relative to some leaf, reside in the same page.
So now we've reduced a random uncached translation lookup to, at worst,
ONE flash page read operation that ALSO guarantees us locality of
reference for nearby file blocks (and hence has no performance issues
for streaming reads either).
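The locality claim can be sketched the same way: if the upper levels are cached in RAM and each flash page packs a parent node together with its 64 leaf translations (the same illustrative 8K-page layout assumed above), then an uncached lookup is one page read, and neighboring named blocks share that read:

```python
# Sketch: which on-flash page holds a block's leaf translation when
# the bottom of the radix tree is packed 64 leaves (plus parent) per
# page. Layout numbers are illustrative assumptions.
LEAVES_PER_PAGE = 64

def leaf_page(block_id):
    """On-flash translation page covering this named block."""
    return block_id // LEAVES_PER_PAGE

# Sequential file blocks resolve through the SAME translation page,
# so a streaming read amortizes the single page read across many
# nearby lookups:
assert leaf_page(1000) == leaf_page(1001)
```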
--
Now, if you want to argue that this model would have serious performance
penalties, please go ahead, I'm all ears.
-Matt
More information about the freebsd-arch mailing list