Flash disks and FFS layout heuristics
Matthew Dillon
dillon at apollo.backplane.com
Sun Mar 30 13:16:57 PDT 2008
:You should try running your experiment using ZFS. Because it is a
:non-overwriting filesystem, it might work better with flash.
:
: Kirk McKusick
I'm assuming ZFS still has to update indices and indirect blocks, though,
which is the primary source for random updates in all filesystems.
The right way to deal with flash is *NOT* to require that the filesystem
be smart about flash storage, but instead to implement an intermediate
storage layer which linearizes the writes to flash and removes all
random erases from the critical path. This also causes erasures to
be evenly spread out on the flash unit and *GREATLY* extends the life
of the flash device (to the point where you can just treat it as a disk
and not have to worry about wearing out cells).
I wrote precisely that 20 years ago for the flash filesystem I built
for our telemetry RTUs. Of course, 20 years ago flash devices were
much smaller, only 1-4MB per chip. But the concept is sound and with
proper design can be implemented for much larger devices.
Basically the general idea is as follows: Break the flash into three
pieces: Two sector translation tables and one bulk storage area.
Whenever a modification is made that involves transitioning bits
from 0->1 (1->0 doesn't need an erase cycle) instead of erasing the
flash sector all you do is allocate a new flash sector, append
an entry to the translation table, and write the data out to the
new flash sector. The logical block is now renumbered. You cache
(some or all of) the translation table in-memory for fast access.
* Appends to the translation table only involve 1->0 transitions. You
don't even have to zero-out the old translation but can use it for
crash recovery purposes. Thus no erasures are needed until the table
becomes full.
* Any non-trivial overwrites append a new sector, again involving only
1->0 transitions and requiring no erasures.
* When the translation table becomes full you repack it into the second
translation table (which then becomes the primary table), and erase
the previous table. You ping-pong the tables (that's why there are
two).
* Bulk space can be allocated linearly until the flash becomes full,
then erased/repacked (you also switch to the alternate translation
table when doing the repacking of the bulk space). This can be a
little tricky but as long as you leave one erase-sector's worth of
space available you can always repack the flash without any
possibility of losing data.
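The write path described above can be sketched in a few lines of Python. This is a toy model for illustration only (the class and its names are mine, not actual driver code): physical sectors are a list, the translation table is an append-only log of (logical, physical) pairs, and a dict plays the role of the in-memory cache.

```python
# Toy model of the remapping scheme: every logical write goes to a
# freshly allocated physical sector and is recorded by appending a
# (logical, physical) entry to the translation table.  Appending
# entries and programming fresh sectors only ever clear bits (1->0),
# so no erase cycle is needed until the table or bulk space fills.

class FlashRemap:
    def __init__(self, nsectors):
        self.bulk = [None] * nsectors   # physical sectors (None = erased)
        self.table = []                 # append-only (logical, physical) log
        self.map = {}                   # in-memory cache of the table
        self.next_free = 0              # bulk space is allocated linearly

    def write(self, logical, data):
        phys = self.next_free           # allocate a fresh erased sector
        self.next_free += 1
        self.bulk[phys] = data          # program: 1->0 transitions only
        self.table.append((logical, phys))  # append-only, no erase
        self.map[logical] = phys        # the logical block is renumbered

    def read(self, logical):
        return self.bulk[self.map[logical]]
```

Note that overwriting a logical block consumes a fresh physical sector while leaving the old copy in place, which is exactly what makes the old table entries usable for crash recovery.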
This latter operation is the most expensive but once some space is freed
up it is possible to repack simultaneously with running new ops, or to
repack continuously as a background operation when space is tight, as
long as you don't get twisted up with a full translation table.
The only 'hard' bit about this design is you need to come up with a
translation table topology that works for large flash devices. My
flash filesystem of long ago just used a linear array and cached the
whole thing with a hash table in memory, so it didn't require a
sophisticated topology on-flash. But for a large flash device you
probably need something a bit more sophisticated that still does not
involve erase cycles in the critical path. The critical point, however,
is that the on-flash translation table does NOT need to be optimal
because you can mirror or cache elements of it in-memory.
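The simple topology described above (a linear on-flash array mirrored by an in-memory hash) implies a mount-time scan, which can be sketched as follows. This is my illustration of the idea, not code from the original filesystem: the table is replayed in append order, newest entry wins, and that ordering is also what provides crash recovery for free, since a write whose table append completed is visible and one whose append didn't is simply absent.

```python
# Mount-time rebuild of the in-memory cache from the on-flash linear
# translation table.  Replaying entries in append order means the most
# recent mapping for each logical block wins, so lookups never need to
# touch the on-flash table again.

def build_cache(linear_table):
    cache = {}
    for logical, phys in linear_table:  # replay in append order
        cache[logical] = phys           # newest entry wins
    return cache
```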
In any case, that's really the only acceptable way to do a flash
filesystem and still be able to guarantee proper wear characteristics
for the flash cells.
-Matt
Matthew Dillon
<dillon at backplane.com>