Flash disks and FFS layout heuristics
Matthew Dillon
dillon at apollo.backplane.com
Tue Apr 1 13:10:24 PDT 2008
:If you've asked, I've missed the question.
:
:We tend to size ram and embedded NAND the same. The latest numbers I can
:discuss are several years old and were 64mb/64mb. Engineering *always*
:wants more of each, but the BOM rules.
64MB is tiny. None of the problems with any of the approaches we've
discussed even exist with devices that small in an embedded system.
You barely need to even implement a filesystem topology, let alone
anything sophisticated.
To be clear, because I really don't understand how you can possibly
argue that the named-block storage layer is bad in a device that small...
the only sophistication a named-block storage model needs is when it must
create a forward lookup on-flash to complement the reverse lookup you
get from the auxiliary storage. Given that you can trivially cache
many translations in memory, not to mention do a boot-up scan of a flash
that small, the only performance impact would be writing out a portion
of the forward translation topology every N (N > 1000) or so page writes
(however many translations can be conveniently cached in system memory).
In a small flash device the embedded application will determine whether
you even need such a table... frankly, unless it's a general purpose
computing device like an iPhone you probably wouldn't need an on-flash
forward lookup, you could simply size the blocks to guarantee 99% flash
utilization versus the number of files you expect to have to maintain
(so, for example, the named block size could be 512K if you expected
to have to maintain 100 files on a 64MB device). This doesn't mean the
filesystem would have to use a 512K block size, that would only be the
case if the filesystem were flash-unaware.
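The sizing arithmetic here is worth spelling out. A minimal sketch, with the 64MB / 512K / 100-file numbers from the paragraph above treated purely as illustrative assumptions:

```python
# Sketch of the named-block sizing arithmetic (illustrative numbers
# only; the "named block" is the translation unit, not necessarily
# the filesystem's block size).
DEVICE_SIZE = 64 * 1024 * 1024      # 64MB embedded flash
NAMED_BLOCK_SIZE = 512 * 1024       # 512K per named block
EXPECTED_FILES = 100

named_blocks = DEVICE_SIZE // NAMED_BLOCK_SIZE   # only 128 of them

# With so few named blocks, the whole name -> physical-location map
# (rebuilt at boot from the reverse lookup in the auxiliary area)
# fits in a trivially small in-memory table, so no on-flash forward
# lookup is needed at all.
assert named_blocks == 128
assert named_blocks >= EXPECTED_FILES   # at least one block per file
```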
It's seriously a non-issue. You are making too many assumptions about
how named blocks would be used, particularly if the filesystem is
flash-aware. Named blocks do not have to be 'sectors' or 'filesystem
blocks' or anything of the sort. They can be much larger.. they can
easily be multiples of a flash page though you don't want to make them
too large because a failed page also fails the whole named-block covering
that page. They can be erase units (probably the best fit).
This leaves the filesystem layer (and here we are talking about a
flash-aware filesystem), with a hellofalot more implementation
flexibility.
FORWARD LOOKUP ON-FLASH TOPOLOGY
There are many choices available for the forward lookup topology,
assuming you need one. Here again we are describing the need to have
one (or at least one that would be considered sophisticated) only for
larger flash devices -- really true solid state storage.
We aren't talking about having to write out tiny little updates to
B-Tree elements... that's stupid. Only an idiot would do that.
Because you can cache new lookups in system memory and because you do
NOT have to flush the forward lookup topology to flash for crash
recovery purposes, the sole limiting factor for the efficiency of the
forward lookup flush to flash is the amount of system memory you are
willing to set aside to cache new translations. Since translations are
fairly small structures we are probably talking not dozens, not hundreds,
but probably at least a thousand translations before any sort of flush
would be needed.
Let's be clear here. That's ONE page write every THOUSAND page writes
worth of overhead. There are no write performance issues.
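To see where the one-in-a-thousand figure comes from, here is a back-of-the-envelope sketch. The 8K page size and 8-byte translation entry (say, a 32-bit named-block id plus a 32-bit physical location) are assumptions for illustration, not a prescribed format:

```python
# Rough overhead arithmetic for the forward-lookup flush.
PAGE_SIZE = 8 * 1024        # assumed flash page size
TRANSLATION_SIZE = 8        # assumed bytes per cached translation

translations_per_page = PAGE_SIZE // TRANSLATION_SIZE   # 1024

# Accumulating that many dirty translations in RAM before flushing
# amortizes one metadata page write over ~1000 data page writes:
overhead = 1 / translations_per_page
assert translations_per_page == 1024
assert overhead < 0.001     # well under 0.1% write overhead
```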
The actual on-flash topology for the forward lookup? With such a large
rollup cache available it almost doesn't matter, but let's say we wanted
to limit forward lookups to 3 levels.
Let's take a 2G flash device with 8K pages, just to be nice about it.
That's 262144 named blocks. Not a very large number, eh? Why you
could almost fit that in system memory (and maybe you can!) and obviate
the need for an on-flash forward lookup topology at all.
But let's say you COULDN'T fit that in system memory. Hmm. 3 levels,
262144 entries maximum (less in real life). That would be a 3-layer
radix tree with a radix of 64. The top layer would almost certainly
be cacheable in system memory (and maybe even two layers) so we are
talking one or two page reads from flash to do the lookup and the
update mechanic, being a radix tree, would be to sync the bits of the
radix tree that were modified by the new translations all the way up
to the root.
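The lookup path through such a tree is just index extraction. A minimal sketch of the 3-level, radix-64 structure described above (2G device / 8K pages gives 2^18 = 262144 named blocks; the bit layout is an assumption, not a real implementation):

```python
# 3-level radix tree with radix 64: 64**3 == 262144 named blocks.
RADIX_BITS = 6              # radix 64 -> 6 bits per level
LEVELS = 3

assert (1 << (RADIX_BITS * LEVELS)) == 262144

def radix_path(block_id):
    """Return the (top, middle, leaf) indices for a named-block id."""
    return (
        (block_id >> 12) & 63,  # top level (cacheable in RAM)
        (block_id >> 6) & 63,   # middle level (often cacheable too)
        block_id & 63,          # leaf level (read from flash)
    )

# A sync walks the dirtied nodes along these paths up to the root.
print(radix_path(100000))   # one index per level
```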
Clearly given the number of 'dirty' translations that would need to be
synchronized, you could easily fill a flash page and then simply retire
the synced translations from system memory, and repeat as often as
necessary to maintain the dirty ratio in the cache in system memory
at appropriate levels. You can also clearly accumulate enough dirty
translations for the sync to be worthwhile... that is, be guaranteed
to fill a whole page. You do NOT have to sync for recovery purposes
so it becomes an issue that is solely related to the system cache
and nothing else.
I'll add something else with regards to radix trees using large radii...
you can usually cache just about the whole damn thing except the leaf
level in system memory. Think about that for a moment and in particular
think about how it greatly reduces the number of actual flash reads
needed to perform the lookup.
I'll add something else with regards to on-storage radix trees. You can
also adjust the layout so the BOTTOM few levels of the radix tree,
relative to some leaf, reside in the same page.
So now we've reduced a random uncached translation lookup to, at worst,
ONE flash page read operation that ALSO guarantees us locality of
reference for nearby file blocks (and hence has no performance issues
for streaming reads either).
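The locality claim can be sketched the same way: if the upper levels are cached in RAM and each flash page packs a parent node together with its 64 leaf translations (the same illustrative 8K-page layout assumed above), then an uncached lookup is one page read, and neighboring named blocks share that read:

```python
# Sketch: which on-flash page holds a block's leaf translation when
# the bottom of the radix tree is packed 64 leaves (plus parent) per
# page. Layout numbers are illustrative assumptions.
LEAVES_PER_PAGE = 64

def leaf_page(block_id):
    """On-flash translation page covering this named block."""
    return block_id // LEAVES_PER_PAGE

# Sequential file blocks resolve through the SAME translation page,
# so a streaming read amortizes the single page read across many
# nearby lookups:
assert leaf_page(1000) == leaf_page(1001)
```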
--
Now, if you want to argue that this model would have serious performance
penalties, please go ahead, I'm all ears.
-Matt
More information about the freebsd-arch mailing list