ZFS vfs.zfs.cache_flush_disable and ZIL reliability

Jeremy Chadwick freebsd at jdc.parodius.com
Thu Mar 17 07:46:01 UTC 2011


On Wed, Mar 16, 2011 at 11:16:18PM -0800, Marcus Reid wrote:
> I was just doing some reading about write barriers being used in
> filesystems to ensure that the journal is complete to prevent data
> corruption on unexpected failure.  This is done by flushing the
> disk cache after making a journal entry and before writing to
> the rest of the fs.
> 
> I figured I'd look to see what different filesystems do for this.
> 
> In Linux, ext3 and ext4 have a "barrier" mount option which controls
> this.  It's the subject of much debate and was turned off by default
> until 2.6.28 in ext4 (it's still off by default in ext3) because it
> can significantly reduce performance in some workloads.
> 
> FreeBSD g_journal is not configurable -- it flushes the cache and
> looks to be safe.  My only worry is that it looks like it might even
> flush it too often, but there may be a reason for the extra flush.
> 
> I'm having a hard time finding where the rubber meets the road with
> the ZFS ZIL, though (one does not just walk into Mordor).  I got as
> far as finding the vfs.zfs.cache_flush_disable sysctl, which sets
> zfs_nocacheflush, which is referenced in zil_add_block() in zil.c,
> but I haven't found where the actual flushing happens.  Can someone
> who is more familiar with it comment on whether this flush actually
> happens?

I think what you might be looking for is BIO_FLUSH, the GEOM/block-layer
request the kernel sends down to tell a device to flush its write
cache.  I could have the name wrong; someone will need to correct me.
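
If memory serves, zil_add_block() just records which vdevs need
flushing (and bails out early when zfs_nocacheflush is set); the flush
itself is issued later from zil_flush_vdevs() in zil.c via zio_flush(),
and on FreeBSD vdev_geom turns that into a BIO_FLUSH on the underlying
GEOM provider.  Don't quote me on the exact call chain; I'm going from
memory.

If you want to poke at the same primitive from userland, something like
the untested sketch below should work, assuming I'm remembering the
ioctl name right.  /dev/ada0 is just an example device; run it as root
against a disk you can afford to poke at.

/*
 * Untested sketch: ask the kernel to flush a disk's write cache via
 * the DIOCGFLUSH ioctl, which GEOM services with a BIO_FLUSH to the
 * provider (the same request ZFS ends up issuing for ZIL writes when
 * vfs.zfs.cache_flush_disable is 0, as far as I can tell).
 */
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/disk.h>           /* DIOCGFLUSH */

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        /* Example device only; point this at something you own. */
        const char *dev = (argc > 1) ? argv[1] : "/dev/ada0";
        int fd;

        /*
         * O_RDWR so the GEOM consumer gets write access; read-only
         * may be sufficient, I haven't checked.
         */
        fd = open(dev, O_RDWR);
        if (fd == -1)
                err(1, "open(%s)", dev);

        /* Blocks until the flush request completes (or fails). */
        if (ioctl(fd, DIOCGFLUSH) == -1)
                err(1, "DIOCGFLUSH on %s", dev);

        printf("%s: write cache flush completed\n", dev);
        close(fd);
        return (0);
}

Whether the drive honestly commits its cache to the platters when it
gets that request is another matter entirely.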

Whenever this topic comes up, I always ask people the same two questions:


1) What *absolute guarantee* do you have that the data *actually gets
written to the platters* when BIO_FLUSH is issued?  You can
sync/sync/sync all you want; there is no guarantee that the drive
itself (that is to say, the write cache that lives on the drive) has
actually committed all of its data to the platters.

2) What do you think will happen when the hard disk abruptly loses
power?  Could be the system PSU dying, could be the power circuitry on
the drive failing, could be a "quirk" that causes the drive to
power-cycle itself, etc...

General question to users and/or developers:

Can someone please explain to me why people are so horribly focused (I
would go so far as to say OCD) on this topic?

Won't there *always* be some degree of potential data loss in the
above two circumstances?  Shouldn't the concern be less about "how much
data just got lost" and more about "is the filesystem still usable and
clean/correct?"  (ZFS gives you the latter two, assuming you're using
mirror or raidz.)

Sorry for the rant; I just keep seeing this topic come up over and over
and over and over and over, and it blows my mind.

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |


