Brainstorm: NAND flash

Poul-Henning Kamp phk at phk.freebsd.dk
Fri Feb 15 16:33:09 PST 2008


In message <A7D3ED75-BC10-4E15-BABE-7182995FAC61 at mac.com>, Marcel Moolenaar writes:

>> Mediasize is about addressability, not about usability, so this
>> assumption is wrong.
>>
>> A GEOM provider is just an addressable array of sectors, it
>> doesn't guarantee that you can read them all or write them
>> all, as is indeed the case when your disk develops a bad sector.
>>
>> NAND is only special due to the OOB stuff, the main page array
>> is just a pretty spotty disk, for all GEOM cares.
>
>The reason I thought this was good is that disks are
>shipped without bad blocks visible to the "application".
>That is: the norm is no bad blocks. With NAND flash
>the norm is that bad blocks are part of the deal. I thought
>that dealing with bad blocks explicitly for NAND would
>level the playing field and make it more consistent...

Well, if you want to take that route, you should not use
GEOM to connect the wear-leveling to the NAND flash in
the first place.

Which option you prefer there is sort of a toss-up.

Putting it under GEOM gives you devices in /dev and other benefits;
using a private interface allows you to tailor it more precisely to
your needs.

I would say put it under GEOM; the bad blocks will not trouble GEOM,
and should somebody get perfect NAND (or care to handle the bad
blocks otherwise), they can stick their filesystem on it directly,
provided they don't need to write to it too much.
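
To put some flesh on that: here is a rough sketch, not the real
driver, of what exposing the raw array as a GEOM provider amounts
to.  The "nand_chip" structure and all names are invented for
illustration; only mediasize and sectorsize are the real provider
fields, and they map exactly as discussed below.

#include <sys/param.h>
#include <geom/geom.h>

struct nand_chip {
        off_t   blockcount;     /* number of erase blocks */
        off_t   blocksize;      /* bytes per erase block */
};

static void
nand_attach_provider(struct g_geom *gp, const struct nand_chip *nc)
{
        struct g_provider *pp;

        pp = g_new_providerf(gp, "nand0");
        pp->sectorsize = nc->blocksize;                 /* one "sector" per erase block */
        pp->mediasize = nc->blockcount * nc->blocksize; /* total byte count */
        g_error_provider(pp, 0);                        /* provider goes live */
}

Nothing in that stops a spotty page array from sitting underneath;
reads and writes that hit a bad block simply fail, just as they do
on a disk with bad sectors.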

>>> dealt with at this level. NANDs don't have sectors.
>>> Attributes of this class include:
>>> 	blockcount - the raw number of blocks
>>
>> This goes in mediasize (as a byte count)
>>
>>> 	blocksize - the number of bytes or pages in a block
>>
>> This goes in sectorsize.
>
>Can't this cause race conditions?
>
>Suppose there happens to be a MBR in the first page at
>offset 0. The MBR class could end up taking the provider,
>when a wear-leveling geom should really take it.

The moment the wear-leveling layer opens the NAND device for writing,
the MBR would get spoiled and disappear.

And the chances of the MBR class finding its metadata in the right
physical sector are pretty small to begin with, if the wear-leveling
is worth anything.

Of course if you do simple bad-block substitution, the chance would
be close to certainty, but the MBR would still get spoiled, so that
would still work.

>I'm ignorant of the obviousness of why sector mapping and
>wear-leveling are to be done at the same time...
>
>...and I presume you can't elaborate...

No I can't.

But I can tell you something about filesystems under BSD license
which might interest you.

Imagine you implement a filesystem that allocates space in
512-byte sectors, even though the underlying device has a
(much) larger sector size.[1]

To reduce the amount of disk I/O, you would obviously want
to avoid doing
	read 64k block
	modify 512 bytes of those
	write 64k block
	read same 64k block
	modify some other 512 bytes of those
	write 64k block again
In particular if writes were very slow or otherwise expensive.

You would of course do this by implementing, as UNIX has always
done, a buffer cache that does the logical/physical translation.
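
In miniature, and with made-up names, that amounts to keeping the
64k device block in core, patching the 512-byte pieces into it
there, and writing the whole thing back once, later:

#include <string.h>
#include <sys/types.h>

#define SECSIZE         512
#define BLKSIZE         (64 * 1024)
#define SECS_PER_BLK    (BLKSIZE / SECSIZE)

/* one in-core copy of a 64k device block (layout invented for the example) */
struct bcache_entry {
        off_t           pblkno;         /* which 64k device block this is */
        int             dirty;          /* needs exactly one write-back, later */
        unsigned char   data[BLKSIZE];
};

/* absorb one 512-byte logical write without touching the device */
static void
bcache_write(struct bcache_entry *be, int secno, const void *src)
{
        memcpy(be->data + secno * SECSIZE, src, SECSIZE);
        be->dirty = 1;          /* the 64k block goes out once, eventually */
}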

BUT, imagine now, as a complication, that your filesystem is
log-structured in somewhat the same hacked-up way as Margo Seltzer's
LFS.

The idea behind LFS is important in this context:  The objective
was to gain write speed by always writing sequentially, basically
treating the disk as a circular buffer, hoping that the RAM cache
would limit the number of seeks for reading, and that the disk would
have enough free space to reduce the workload of the cleaner process.
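
The write side of that idea is almost embarrassingly simple; in
caricature (names made up):

#include <sys/types.h>

/*
 * Everything is written at the current write pointer, which only
 * ever moves forward and wraps, so the disk is treated as one big
 * circular buffer.
 */
static off_t
log_advance(off_t writeptr, off_t disksize, off_t blksize)
{
        writeptr += blksize;            /* the block just written */
        if (writeptr >= disksize)
                writeptr = 0;           /* wrap around to the start of the log */
        return (writeptr);
}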

The trouble with that, of course, is that both assumptions were wrong
until RAM and disks exploded in size just a few years ago.  On a
95% full filesystem, LFS sank under the weight of the cleaner,
and RAM was never big enough to cache all you wanted, and a cache
doesn't help until the second access anyway.

The other important aspect of an LFS is that you need a "cleaner"
process to run ahead of the write pointer and scavenge space.
If it finds a fully used big block, it leaves it alone, but if
it finds a 64k block with only 512 bytes of live data, it copies
those 512 bytes into the write stream so it can mark the 64k
block as free and recycle it.
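
In outline, and again with made-up names, the cleaner's inner loop
is no more than this; the hard part, as we shall see, is knowing
which sectors are still live:

#include <stddef.h>

#define SECSIZE         512
#define BLKSIZE         (64 * 1024)
#define SECS_PER_BLK    (BLKSIZE / SECSIZE)

/*
 * Copy the still-live 512-byte pieces of one 64k block into the
 * write stream; after that the whole 64k block can be marked free
 * and reused.
 */
static void
clean_one_block(const unsigned char blk[BLKSIZE],
    const int live[SECS_PER_BLK],
    void (*append_to_log)(const void *, size_t))
{
        int i;

        for (i = 0; i < SECS_PER_BLK; i++)
                if (live[i])
                        append_to_log(blk + i * SECSIZE, SECSIZE);
        /* the caller now marks the source block free and moves on */
}

Filling in the "live" array is precisely the mapping problem
discussed next.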

Margo's LFS was a fiasco, but we can still learn from it:

The source of trouble, as far as I have been able to find out, is
that the filesystem naming layer (in her case UFS) needs a logical
block number which must be determined before the physical block
number has been allocated, so the logical block number must be
translated to a physical number through some sort of table.

You obviously would _not_ want two copies of the data in the cache,
one under the logical and one under the physical block number, so
you have to pick one or the other.

Margo's choice of the easy solution to the logical/physical mapping
problem in LFS sucked badly when it came to writing the "cleaner"
process: a mapping that gives you only the logical->physical
translation cheaply, and requires you to read many blocks from disk
to reverse it, does not help you when you read a physical sector and
need to find out whether it is still in use, and where it belongs in
the logical space.

Which is exactly what the cleaner needs to do.
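
To put the two directions of the mapping side by side (the layout
here is invented, but the principle is what per-segment summary
blocks buy you): the naming layer only ever asks logical->physical,
which a flat table answers cheaply, while the cleaner has to go the
other way, so each segment needs a record of which logical block
every physical sector was written for:

#include <sys/types.h>

#define SECS_PER_SEG    128     /* 512-byte sectors in a 64k segment */

/* forward map: what the naming layer needs */
struct fwd_map {
        off_t   *log2phys;      /* indexed by logical block number */
        size_t  nblocks;
};

/* per-segment summary: the reverse mapping the cleaner needs */
struct seg_summary {
        off_t   owner[SECS_PER_SEG];    /* owning logical block, or -1 if unused */
};

/* is physical sector 'i' of the segment whose first sector is 'segstart' live? */
static int
sector_is_live(const struct fwd_map *fm, const struct seg_summary *ss,
    off_t segstart, int i)
{
        off_t lbn = ss->owner[i];

        /* live only if the forward map still points back at this very spot */
        return (lbn >= 0 && lbn < (off_t)fm->nblocks &&
            fm->log2phys[lbn] == segstart + i);
}

With a summary like that, the cleaner only reads the segment it is
about to clean; without it, reversing the map means scanning large
parts of the disk.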

I believe in the end her choice made it so damn hard that the cleaner
never happened during the time she took an interest in LFS (which
lasted exactly until she got her PhD, I believe?).  Ousterhout had
some very good and relevant, but harsh, words for her about that.

(Sprite's LFS, by Ousterhout, is also worth studying; it was better
designed, but also more narrowly tailored to the Sprite OS, and thus
we cannot learn as much from it today.)

This is all from memory; I haven't bothered to look up the LFS source
code or the correspondence on Ousterhout's page, so some details may
be slightly off, for which I apologize.

Poul-Henning

[1] It's interesting that Sun gave up on this and had to get
special firmware for CD-ROM drives, but that's an entirely
different story and not relevant :-)

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

