ZFS: How to enable cache and logs.

Thu May 12 08:59:49 UTC 2011

On Wed, May 11, 2011 at 06:04:33PM -0700 I heard the voice of
Jeremy Chadwick, and lo! it spake thus:
>
> (What confuses me about the "idle GC" method is how it determines
> what it can erase -- if the OS didn't tell it what it's using, how
> does it know it can erase the page?)

I'm no expert either, but the following is my understanding...

Remember that SSD's (like ZFS, a layer higher up) don't overwrite
blocks; they write new data to a new block and update the pointers in
the level above them (the disk LBA in this case) to point at the new
location.

So when you overwrite LBA 12345 on the disk with new data, what
actually happens is that the SSD writes that data to currently empty
flash $SOMEWHERE, and updates its internal table so that LBA 12345
requests go there.  The bit of flash that was previously considered LBA
12345 still contains the old data, but is now "free" as far as the
drive is concerned (though not immediately writable, as it needs to be
erased first).  Sorta like rm'ing a file doesn't actually delete its
contents, just the name pointing to it.
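
To make the remapping concrete, here's a rough Python sketch of the
idea.  Everything in it (the ToyFTL class, its fields, the slot
numbering) is invented for illustration; real firmware is far more
involved, and I'm not claiming any drive works exactly like this.

```python
# Toy flash translation layer: a sketch only, not real firmware logic.
class ToyFTL:
    def __init__(self, n_slots):
        self.flash = [None] * n_slots   # physical slots; None means erased
        self.lba_map = {}               # logical LBA -> physical slot index
        self.stale = set()              # slots holding superseded data
        self.next_free = 0              # next erased slot to write into

    def write(self, lba, data):
        slot = self.next_free
        if slot >= len(self.flash):
            raise RuntimeError("no erased slots left; a real drive would GC")
        self.next_free += 1
        self.flash[slot] = data         # new data always goes to fresh flash
        old = self.lba_map.get(lba)
        if old is not None:
            self.stale.add(old)         # old copy is dead but not yet erased
        self.lba_map[lba] = slot        # repoint the LBA at the new location

    def read(self, lba):
        return self.flash[self.lba_map[lba]]


ftl = ToyFTL(8)
ftl.write(12345, "old")
ftl.write(12345, "new")
```

After the second write, slot 0 still physically holds "old", but the
map points LBA 12345 at slot 1 and slot 0 sits on the stale list
waiting to be erased.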

Where GC comes in is that the size you can write/address is smaller
than the size flash has to be erased in.  To pick numbers that are in
the right ballpark (it will vary per drive), you have 512 byte blocks
that you can read/write (like any other drive), but you can only erase
a page of 8k at a time.  So let's suppose you write 16 kB of data to a
fresh drive.  You've written 32 512-byte blocks, which completely fill
up 2 8k pages.  All nice and compact.
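
The arithmetic, spelled out in Python with the same ballpark numbers
(the constant names are mine; also note that what I'm calling a "page"
is what flash datasheets usually call an erase block, and real ones
are typically a good deal bigger than 8k):

```python
SECTOR = 512            # smallest unit the host reads/writes, in bytes
PAGE = 8 * 1024         # smallest unit that can be erased, in bytes
written = 16 * 1024     # amount of data written to the fresh drive

sectors_written = written // SECTOR     # 512-byte blocks used
pages_filled = written // PAGE          # 8k pages completely filled
sectors_per_page = PAGE // SECTOR       # blocks that share one erase unit
```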

Now let's suppose you overwrite from 4k-8k and 12k-16k.  Now we have
8k of remaining useful data, but it's spread out over 2 8k pages (4k
in each).  We can't write new stuff to those two now "empty" 4k sections,
because we have to erase before we can write, and we can only erase
the whole 8k page.  This is where the GC kicks in; it knows (because
those two LBA ranges have been overwritten) that they're no longer
needed, and can notice that all the remaining important data in those
two pages can actually fit in a single page.  So, it can read 0k-4k
and 8k-12k, and write them into a new empty page.  Update its LBA map
to point those logical addresses over to the new in-flash location,
and now the entirety of those two original 8k pages is unused.  So now
it can go ahead and erase them both, and put them on the "ready for
reuse" list.
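
Here's that compaction step as a small Python sketch, using the same
example.  The page names A-D, the 4k "chunk" granularity and the
collect() function are all made up for illustration; I'm assuming the
two overwrites landed in some third page C.

```python
CHUNKS_PER_PAGE = 2  # an 8k page modelled as two 4k chunks

# Each page is a list of (logical_offset_in_k, is_live) entries.
pages = {
    "A": [(0, True), (4, False)],    # 4k-8k was overwritten elsewhere
    "B": [(8, True), (12, False)],   # 12k-16k was overwritten elsewhere
    "C": [(4, True), (12, True)],    # where the overwrites landed
    "D": [],                         # erased, ready for use
}
lba_map = {0: "A", 8: "B", 4: "C", 12: "C"}


def collect(victims, target):
    """Copy live chunks from the victim pages into target, erase victims."""
    live = [off for v in victims for off, is_live in pages[v] if is_live]
    if len(live) + len(pages[target]) > CHUNKS_PER_PAGE:
        raise ValueError("live data does not fit in the target page")
    for off in live:
        pages[target].append((off, True))
        lba_map[off] = target        # repoint the logical address
    for v in victims:
        pages[v] = []                # whole-page erase, now reusable
    return len(live)


copied = collect(["A", "B"], "D")
```

Two chunks get copied, and pages A and B both come back empty, so the
drive has traded one page (D) for two.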

Now, as for TRIM.  There are two ways that a block (or set of blocks)
can become "no longer needed".  One is that they're overwritten with
new data; the drive knows that and can mark them as unused like above.
The other is that they contain data for a file that's deleted.  But
the drive has no idea what files being deleted means.  All that
happens from the drive's perspective is an overwrite of some LBA's
that, to the OS, contain directory info.  It has no way of knowing
that this impacts these other LBA's that held a file.  TRIM allows the OS
to say "OK, these LBA's?  Yeah, you can trash 'em now."  And so they
end up on the dead list, ready for the GC to collapse them away like
above.
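
A toy Python model of the difference, from the drive's side.  ToyDrive
and the specific LBA numbers are hypothetical; the point is only that
an 'rm' looks like a write to a directory block, and nothing else,
until a TRIM arrives.

```python
class ToyDrive:
    def __init__(self):
        self.live = set()   # LBAs the drive believes hold needed data

    def write(self, lba):
        self.live.add(lba)

    def trim(self, lbas):
        # The OS explicitly says these LBAs no longer matter.
        self.live.difference_update(lbas)


DIR_LBA = 10                    # assumed location of the directory block
FILE_LBAS = [54321, 54322]      # assumed blocks holding the file's data

drive = ToyDrive()
drive.write(DIR_LBA)
for lba in FILE_LBAS:
    drive.write(lba)

# "rm": the OS rewrites the directory block and touches nothing else.
drive.write(DIR_LBA)
live_after_rm = set(drive.live)

# Only now does the drive learn the file's blocks are dead.
drive.trim(FILE_LBAS)
live_after_trim = set(drive.live)
```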

So neither TRIM nor GC is a replacement for the other.  GC is about
collapsing away reapable space (and also serves a purpose in
wear-levelling, but that's unimportant in this discussion).  The drive
automatically knows about space that's reapable because it was
rewritten.  TRIM lets it know about space that's reapable because of
deletion.  Without that, you could delete a file (so LBA 54321 no
longer contains useful info, and doesn't need to be preserved), but
since the drive doesn't know that, not only can the GC not compact
away that space, it has to go ahead and re-copy that block as if it
held good data when it shuffles stuff around, so you're creating extra
wear.
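
To put a number on that extra copying, here's a minimal Python sketch.
The chunks_to_copy() helper and the page contents are invented for the
example; it just counts what a GC pass would have to preserve.

```python
def chunks_to_copy(page_lbas, known_dead):
    """LBAs the GC must re-copy before it can erase this page."""
    return [lba for lba in page_lbas if lba not in known_dead]


page = [100, 101, 54321, 54322]     # LBAs stored in one erase unit
overwritten = {100}                 # drive knows this one is dead
deleted_by_os = {54321, 54322}      # dead, but only the OS knows

without_trim = chunks_to_copy(page, overwritten)
with_trim = chunks_to_copy(page, overwritten | deleted_by_os)
```

Without TRIM the GC shuffles three chunks along, two of which are
garbage; with it, only the one genuinely live chunk gets rewritten.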

GC can't make TRIM "unnecessary", any more than a book can make a
flashlight unnecessary.  TRIM is one of the ways you provide info for
the GC to use.  One thing that CAN make TRIM less important is writing
in a "compact" manner (e.g., always write new data to the lowest
available LBA).  Assuming you oscillate around a steady disk usage (or
slowly increase), that means that you'll tend to overwrite space for
deleted files relatively soon, so the drive gets to know about the
reapable space that way.  With more random or other LBA allocation, or
if you shrink the used space significantly, a deleted block may hang
around unwritten to for much longer, and so have more chance for the
GC to unnecessarily recopy and recopy it.
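
As an illustration of the "lowest available LBA" policy, a small Python
sketch follows.  LowestFirstAllocator is a hypothetical name, and no
particular filesystem is claimed to allocate this way; it only shows
why a freed low LBA gets overwritten on the very next allocation.

```python
import heapq


class LowestFirstAllocator:
    def __init__(self, n_lbas):
        self.free = list(range(n_lbas))
        heapq.heapify(self.free)        # min-heap of free LBAs

    def alloc(self):
        return heapq.heappop(self.free)  # always the lowest free LBA

    def release(self, lba):
        heapq.heappush(self.free, lba)   # e.g. when a file is deleted


a = LowestFirstAllocator(100)
used = [a.alloc() for _ in range(10)]   # fills LBAs 0 through 9
a.release(3)                            # delete whatever lived at LBA 3
reused = a.alloc()                      # next write lands on LBA 3 again
```

Because the next write goes straight back to LBA 3, the drive sees an
overwrite and learns the old contents are dead without any TRIM.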

This leaves entirely to one side annoying implementational issues.
I'm given to understand that due to some combination of "dumb firmware
implementation" and "dumb standardized requirements", TRIM can be an
unbelievably expensive command, so doing it as part of e.g. 'rm' may
damage performance outrageously.  That may point to a better
implementation being "rack up a list of LBA's and flush periodically",
or "scan filesystem weekly and send TRIM's for all empty LBA's" or the
like.  But again, that's implementation.
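
For what the "rack up a list and flush periodically" idea might look
like, here's a hedged Python sketch.  TrimBatcher, its threshold and
the send_trim callback are all assumptions of mine, not any real
driver interface; a real one would also want a timer-based flush.

```python
class TrimBatcher:
    def __init__(self, send_trim, threshold):
        self.send_trim = send_trim      # callable taking [(start, count), ...]
        self.threshold = threshold      # flush once this many LBAs queue up
        self.pending = set()

    def discard(self, lba):
        self.pending.add(lba)
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        ranges = []
        for lba in sorted(self.pending):
            if ranges and lba == ranges[-1][0] + ranges[-1][1]:
                # Extends the previous run, so grow its count by one.
                ranges[-1] = (ranges[-1][0], ranges[-1][1] + 1)
            else:
                ranges.append((lba, 1))
        self.pending.clear()
        self.send_trim(ranges)          # one expensive command, many LBAs


sent = []                               # stands in for the real device
batcher = TrimBatcher(sent.append, threshold=4)
for lba in (7, 5, 6, 20):
    batcher.discard(lba)
```

Four separate discards end up as a single call covering two ranges,
(5, 3) and (20, 1), rather than four individual TRIMs.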

-- 
Matthew Fuller     (MF4839)   |  fullermd at over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
           On the Internet, nobody can hear you scream.