Flash disks and FFS layout heuristics
Matthew Dillon
dillon at apollo.backplane.com
Mon Mar 31 15:20:09 PDT 2008
:True, but when you're working with a part that does ECC in HW, you're
:stuck with the ECC it does.
Well, I suppose if you can't get access to the original data *OR* the
HW ECC code (to undo the broken correction) you would wind up with an
uncorrectable double error (a failed ECC correction always makes things
worse rather than better, particularly with a 1 bit correct / 2 bit detect
Hamming code). If you DO have access to the original data or the
HW ECC code you can undo the failed correction and then ignore the
hardware ECC and do your own in the auxiliary storage.
I can see why people would want the hardware to do the data validation,
it's a performance issue in many respects, but if the hardware is only
doing a simple ECC and doesn't do a separate CRC it will do more harm
than good and simply cannot be depended upon for anything.
:...
: (some stuff I reordered later on below so I can get this out of the way)
:...
:> For example, the simple addition of some front-end non-volatile
:cache,
:> such as a dime-cap-backed static ram, would have a very serious effect
:> on any such filesystem design.
:
:Yes. However since the phone market makes such a change very unlikely,
:because of cost pressures, it's not one we take into consideration.
For flash storage systems competitive with hard drive storage,
that is, any flash storage device of significant size (e.g. 64GB-1TB or
more), the incremental cost of adding non-volatile front-end cache ram
is going to be in the $1-$3 range. If that vendor can then advertise
a performance advantage over their competitors they can easily price
the drive to match the incremental cost. Very easily. This is what
happened to the HD market.
The economics that drive front-end cache implementations for flash SSD
are going to be the same economics that drive front-end cache
implementations for hard drives. All hard drives these days have at
least 8MB of sector cache (and many have 32MB or more), so the writing
is on the wall. Any filesystem you design which does not take into
account a front-end cache is going to be obsolete in probably less than
2 years (if not already).
For the phone market? You mean small flash storage devices? Performance
is almost irrelevant there and you certainly do not need a front-end
cache, or much of one. A sector translation model for a small flash
storage device (< 2G) is utterly trivial to implement but so is the
log-append model. There is going to be a huge scaling difference
between the two when you get into large amounts of storage.
:Well, those of us who are shipping devices with flash parts in them have
:a somewhat different view on that, which is why I've worked on three
:NAND specific file systems in the last four years. Two of those are in
:use in shipping devices, and are expected to be in use for five or more
:years.
Three in five years? Is that an illustration of my point with regards
to flash filesystem design? Ok, that was a joke :-)
But I don't think we can count small flash storage systems. Both models
devolve into trivialities when you are managing small amounts of
flash storage.
:...
:reference).
:
:You've talked yourself into pretty much the same mistake that led to
:jffs2, which turned out to be a terrible idea.
:
:> DragonFly's HAMMER has pretty good write-locality of
:> reference but still does random updates for B-Tree indices and things
:> like the mtime and atime fields. It also uses numerous blockmaps that
:> could make direct use of a flash sector-mapping translation layer (1). It
:> might be adaptable.
:
:You are pretty much describing the data structures that have made jffs2
:such a poor performer.
:...
:Works OK for NOR. Has interesting problems, mainly with maintaining the
:block number map reliably in storage, when attempted on NAND.
:...
:The devil in the details of your naming scheme turns out to be managing
:the name translation information within the NAND storage itself. This is
:the source of significant performance problems in jffs2, for example,
:and have a huge amount of code complexity in the commercial system I
:work with.
Again, I am not familiar with jffs2 but you are painting with a very
broad brush: that is more than likely an issue specific to the jffs2
design and not to the concept of using named blocks in general.
I understand where you are coming from. Regardless of the model you
use you have to index the data somehow.
What you are advocating is a filesystem which uses an absolute sector
referencing scheme. Any change made to the filesystem requires a new
page to essentially be appended to the flash storage. In order to
properly index the information and maintain the filesystem topology
you also have to recopy *ALL* pages containing references to the
updated absolute sector in order to repoint them to the new
absolute sector. The root of your filesystem winds up being the last
page appended, in simple terms.
While some modifications to this scheme are possible, it's pretty much
the way you have to do things if you use that model.
I really understand that model, and it has the advantage of simplicity
but it also has some severe disadvantages when used as a general
purpose filesystem (versus an embedded filesystem), not the least of
which is that a single update can result in a chain reaction that
requires considerably more write bandwidth, considerably more garbage
collection, and some extra (but probably minor) wear of the flash.
In contrast, if a filesystem is referencing named blocks and you have
to move a block (either due to an error or a modification of that
block through normal filesystem activity), NO changes need to be made
to those elements of the filesystem that pointed to the block that got
moved. All you have to do is append the new block that is renaming the
old one, which includes the name (aka a 64 bit quantity) in its auxiliary
data area, and cache the change in the translation in system memory
until you decide to flush out the named block index (which I will
describe a bit later on)... that's non-critical information, by the
way, and does not have to be written synchronously in order to be crash
recoverable. Write bandwidth is greatly reduced, particularly
because when using a named block you only have to flush the actual
modified page to the flash and nothing else other than a topological
rollup record (which I will describe a bit later on).
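As a rough sketch of that rename-and-cache path (all structure and function names are hypothetical, and this toy direct-mapped cache does no collision chaining): each appended flash page carries its 64-bit name in the auxiliary (spare) area, so the RAM table below can always be rebuilt by scanning those names after a crash, which is exactly why it never needs a synchronous flush.

```c
#include <stdint.h>

#define XLAT_SLOTS 4096                 /* illustrative cache size */

struct xlat {
    uint64_t name;                      /* e.g. inode no + file block no */
    uint32_t phys;                      /* current physical flash page   */
    int      valid;
};

static struct xlat xlat_cache[XLAT_SLOTS];

static struct xlat *xlat_slot(uint64_t name)
{
    /* multiplicative hash; top 12 bits select one of 4096 slots */
    return &xlat_cache[(name * 0x9E3779B97F4A7C15ULL) >> 52];
}

/* A block moved (bad page, wear leveling, or a normal rewrite): the new
 * copy was appended with the same name in its aux area; only this RAM
 * entry changes.  Nothing that *referenced* the name is rewritten. */
static void xlat_remap(uint64_t name, uint32_t new_phys)
{
    struct xlat *x = xlat_slot(name);
    x->name  = name;
    x->phys  = new_phys;
    x->valid = 1;
}

/* Lookup: RAM cache first; a miss would fall through to the last
 * flushed rollup index on the flash (stubbed out here). */
static uint32_t xlat_lookup(uint64_t name)
{
    struct xlat *x = xlat_slot(name);
    if (x->valid && x->name == name)
        return x->phys;
    return UINT32_MAX;                  /* consult on-flash index */
}
```

A periodic background flush of the dirty entries is what produces the topological rollup record; between flushes the table is the only thing that has to change per update.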
This works particularly well with a filesystem designed to use named
blocks because there are *NO* indirect blockmaps to reference data or
inodes in the filesystem. An absolute-sector-based filesystem has
blockmaps, e.g. to locate a block in a file. In a named-block
filesystem the blockmap *IS* the named block. That is, the 'name' of
the named block is effectively the inode number and file block number
combined into one 64 bit (or larger) key.
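The combined key can be sketched as a trivial packing helper; the 32/32 split between inode number and file block number is an assumption for illustration, since the text only requires a 64 bit (or larger) key:

```c
#include <stdint.h>

/* Hypothetical name encoding: inode number in the high 32 bits,
 * logical file block number in the low 32 bits. */
static inline uint64_t block_name(uint32_t inum, uint32_t fblock)
{
    return ((uint64_t)inum << 32) | fblock;
}

static inline uint32_t name_inum(uint64_t name)   { return (uint32_t)(name >> 32); }
static inline uint32_t name_fblock(uint64_t name) { return (uint32_t)name; }
```

With this, appending block 5 of inode 42 means writing one page named block_name(42, 5); there is no blockmap page anywhere in the filesystem to update.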
Let me be clear about this distinction. In a filesystem that references
absolute sectors an append to a file requires (typically) updating a
blockmap of absolute sectors, which in turn requires the blockmap block
to be rewritten, along with any reference to it, and so on and so forth
up the chain. In a named-block filesystem appending to a file simply
means writing out a new named block. The filesystem itself has NO
concept of a blockmap... the blockmap is built into the sector translation
layer.
In other words, a filesystem using the named-block model is not any
more complex than a filesystem using an absolute sector numbering model,
and a filesystem using the named-block model is far easier from the
perspective of caching changes in system memory without requiring a
sync to flash for crash recovery. That is a huge deal.
Now is there some work involved with making the named block translations
efficient? Yes, there is some... but it is really not much more complex
than the work involved in an absolute-sector-based filesystem which
must index files, directories, and so forth within the filesystem
itself.
In particular, when using a named-block model you still have to
occasionally flush out the translation topology to the flash media.
Since this topology references physical block numbers it, in fact, uses
exactly the SAME mechanism that the absolute filesystem model uses
to maintain its topology. In other words, it is no more complex than the
absolute filesystem model.
The big difference is that the translation topology does not have
to be written synchronously and the frequency of the rollup writes
is based ENTIRELY on how much system memory you are willing to
dedicate to caching topological changes. E.g., if you dedicate, say, 100KB
of system memory you can store the topology for, say, 3200 filesystem
updates (using 32 byte structures) before you have to 'flush' it to
the flash. A filesystem based on absolute blocks pretty much has to
cache the related (modified) blocks in memory which are far larger, and
thus must flush them to storage far more frequently. But translations
are tiny little records... 10's of gigabytes worth of updates can be
cached in a small amount of system memory. The translation topology
does NOT have to be synced to disk on fsync() because all the
information can be recovered when the filesystem is mounted after
a reboot. That is critical.
Going back to the absolute filesystem model, such a filesystem does have
the advantage of locality of reference (that is, not so much seek-wise
which is irrelevant for flash but more from the point of view of being
able to chain to the desired information). A filesystem using the
named-block model must lookup the block, typically in a global index,
which means it must maintain a cache of pointers into its translation
tree. This is a little more expensive when looking up inodes but,
actually, I use a very similar scheme in HAMMER (which has a global
B-Tree for everything) and the caching required is so simple that it just
becomes a NOP. Just storing a cached absolute sector number in the
in-memory inode structure for use as a starting point when looking up
elements related to that inode winds up being no less efficient than
embedding a blockmap in an inode as you see in a more typical
filesystem.
In any case, I hope this clarifies the issues. I really do understand
where you are coming from, the simplicity of chaining the physical
topology cannot be denied, and I like the elegance, but I hope I've
shown that it is not actually simplifying the overall design much
over a named block scheme, and that there are some fairly severe
issues that can crop up that are complete non-issues when using a named
block scheme.
Long winded, I know.
-Matt
Matthew Dillon
<dillon at backplane.com>
More information about the freebsd-arch
mailing list