Flash disks and FFS layout heuristics
Matthew Dillon
dillon at apollo.backplane.com
Sun Mar 30 13:16:57 PDT 2008
:You should try running your experiment using ZFS. Because it is a
:non-overwriting filesystem, it might work better with flash.
:
: Kirk McKusick
I'm assuming ZFS still has to update indices and indirect blocks, though,
which is the primary source for random updates in all filesystems.
The right way to deal with flash is *NOT* to require that the filesystem
be smart about flash storage, but instead to implement an intermediate
storage layer which linearizes the writes to flash and removes all
random erases from the critical path. This also causes erasures to
be evenly spread out on the flash unit and *GREATLY* extends the life
of the flash device (to the point where you can just treat it as a disk
and not have to worry about wearing out cells).
I wrote precisely that 20 years ago for the flash filesystem I built
for our telemetry RTUs. Of course, 20 years ago flash devices were
much smaller, only 1-4MB per chip. But the concept is sound and with
proper design can be implemented for much larger devices.
Basically the general idea is as follows: Break the flash into three
pieces: Two sector translation tables and one bulk storage area.
Whenever a modification is made that involves transitioning bits
from 0->1 (1->0 doesn't need an erase cycle) instead of erasing the
flash sector all you do is allocate a new flash sector, append
an entry to the translation table, and write the data out to the
new flash sector. The logical block is now renumbered. You cache
(some or all of) the translation table in-memory for fast access.
* Appends to the translation table only involve 1->0 transitions. You
don't even have to zero-out the old translation but can use it for
crash recovery purposes. Thus no erasures are needed until the table
becomes full.
* Any non-trivial overwrites append a new sector, again involving only
1->0 transitions and requiring no erasures.
* When the translation table becomes full you repack it into the second
translation table (which then becomes the primary table), and erase
the previous table. You ping-pong the tables (that's why there are
two).
* Bulk space can be allocated linearly until the flash becomes full,
then erased/repacked (you also switch to the alternate translation
table when doing the repacking of the bulk space). This can be a
little tricky but as long as you leave one erase-sector's worth of
space available you can always repack the flash without any
possibility of losing data.
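The write path described above can be sketched in a few lines of Python. This is a toy model for illustration only (the class and its names are mine, not actual driver code): physical sectors are a list, the translation table is an append-only log of (logical, physical) pairs, and a dict plays the role of the in-memory cache.

```python
# Toy model of the remapping scheme: every logical write goes to a
# freshly allocated physical sector and is recorded by appending a
# (logical, physical) entry to the translation table.  Appending
# entries and programming fresh sectors only ever clear bits (1->0),
# so no erase cycle is needed until the table or bulk space fills.

class FlashRemap:
    def __init__(self, nsectors):
        self.bulk = [None] * nsectors   # physical sectors (None = erased)
        self.table = []                 # append-only (logical, physical) log
        self.map = {}                   # in-memory cache of the table
        self.next_free = 0              # bulk space is allocated linearly

    def write(self, logical, data):
        phys = self.next_free           # allocate a fresh erased sector
        self.next_free += 1
        self.bulk[phys] = data          # program: 1->0 transitions only
        self.table.append((logical, phys))  # append-only, no erase
        self.map[logical] = phys        # the logical block is renumbered

    def read(self, logical):
        return self.bulk[self.map[logical]]
```

Note that overwriting a logical block consumes a fresh physical sector while leaving the old copy in place, which is exactly what makes the old table entries usable for crash recovery.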
This latter operation is the most expensive but once some space is freed
up it is possible to repack simultaneously with running new ops, or to
repack continuously as a background operation when space is tight, as
long as you don't get twisted up with a full translation table.
The only 'hard' bit about this design is you need to come up with a
translation table topology that works for large flash devices. My
flash filesystem of long ago just used a linear array and cached the
whole thing with a hash table in memory, so it didn't require a
sophisticated topology on-flash. But for a large flash device you
probably need something a bit more sophisticated that still does not
involve erase cycles in the critical path. The critical point, however,
is that the on-flash translation table does NOT need to be optimal
because you can mirror or cache elements of it in-memory.
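The simple topology described above (a linear on-flash array mirrored by an in-memory hash) implies a mount-time scan, which can be sketched as follows. This is my illustration of the idea, not code from the original filesystem: the table is replayed in append order, newest entry wins, and that ordering is also what provides crash recovery for free, since a write whose table append completed is visible and one whose append didn't is simply absent.

```python
# Mount-time rebuild of the in-memory cache from the on-flash linear
# translation table.  Replaying entries in append order means the most
# recent mapping for each logical block wins, so lookups never need to
# touch the on-flash table again.

def build_cache(linear_table):
    cache = {}
    for logical, phys in linear_table:  # replay in append order
        cache[logical] = phys           # newest entry wins
    return cache
```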
In any case, that's really the only acceptable way to do a flash
filesystem and still be able to guarantee proper wear characteristics
for the flash cells.
-Matt
Matthew Dillon
<dillon at backplane.com>