Turn off RAID read and write caching with ZFS?

Karl Denninger karl at denninger.net
Thu May 22 14:00:45 UTC 2014


On 5/22/2014 8:33 AM, Bob Friesenhahn wrote:
> On Thu, 22 May 2014, Karl Denninger wrote:
>>
>> Write-caching is very evil in a ZFS world, because ZFS checksums each 
>> block. If the filesystem gets back an "OK" for a block not actually 
>> on the disk ZFS will presume the checksum is ok.  If that assumption 
>> proves to be false down the road you're going to have a very bad day.
>
> I don't agree with the above statement.  Non-volatile write caching is 
> very beneficial for zfs since it allows transactions (particularly 
> synchronous zil writes) to complete much quicker. This is important 
> for NFS servers and for databases.  What is important is that the 
> cache either be non-volatile (e.g. battery-backed RAM) or absolutely 
> observe zfs's cache flush requests.  Volatile caches which don't obey 
> cache flush requests can result in a corrupted pool on power loss, 
> system panic, or controller failure.
>
> Some plug-in RAID cards have poorly performing firmware which causes 
> problems.  Only testing or experience from other users can help 
> identify such cards so that they can be avoided or set to their least 
> harmful configuration.
>
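The checksum mechanism both posts refer to can be sketched in a few lines. This is an illustrative toy model only, not ZFS code; the class and names are hypothetical. The point it shows: a checksum recorded at write time catches a block that later comes back corrupted, but only if the data the checksum was computed over actually reached the media.

```python
import hashlib

class ChecksummedStore:
    """Toy model of per-block checksumming (hypothetical, not ZFS)."""

    def __init__(self):
        self.blocks = {}     # block id -> data actually "on disk"
        self.checksums = {}  # block id -> checksum recorded at write time

    def write(self, blkid, data):
        # Checksum is computed when the write is issued...
        self.checksums[blkid] = hashlib.sha256(data).digest()
        self.blocks[blkid] = data

    def read(self, blkid):
        # ...and verified on every read.
        data = self.blocks[blkid]
        if hashlib.sha256(data).digest() != self.checksums[blkid]:
            raise IOError(f"checksum mismatch on block {blkid}")
        return data

store = ChecksummedStore()
store.write(0, b"payload")
store.blocks[0] = b"garbage"   # simulate silent corruption on the media
# store.read(0) now raises IOError: the recorded checksum catches it
```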
Let's think this one through.

You have a disk attached to such a controller.

It has a battery-backed RAM cache and JBOD drives on it.

Your database says "Write/Commit" and the controller does, to cache, and 
says "ok, done."  The data is now in the battery-backed cache. Let's 
further assume the cache is ECC-corrected and we'll accept the risk of 
an undetected ECC failure (very, very long odds on that one so that 
seems reasonable.)

Some time passes and other I/O takes place without incident.

Now the *DRIVE* returns an unrecoverable data error during the actual 
write to spinning rust when the controller (eventually) flushes its cache.

Note that the controller can't rebuild the drive as it doesn't have a 
second copy; it's JBOD.  When does the operating system find out about 
the fault and what locality of the fault does it learn about?

Be very careful with your assumptions here.  If there is more than one 
filesystem on that drive the I/O that actually returns a fault (because 
of when it is detected) may in fact be to a *different filesystem* than 
the one that actually faulted!

The only safe thing for the adapter to do if it detects a failure on a 
deferred (battery-backed) write is to declare the entire *disk* dead and 
return errors for all subsequent I/O attempts to it, effectively forcing 
all data on that pack to be declared "gone" at the OS level.  You had 
better hope the adapter does that (are you sure yours does?) or you're 
going to get a surprise of a most unpleasant sort, because there is no 
way for the adapter to go back and declare a 
formerly-committed-and-confirmed I/O invalid.

At a minimum, by doing this you have multiplied a single-block failure 
into a failure of *all* blocks on the media as soon as the first one 
fails.  In practice that may not be all that far off the mark (drives 
have a distressing habit of failing far more than one block at a time) 
but forcing that behavior is something you should be aware of.

There is a very good argument for what amounts to a battery-backed RAM 
"disk" for the ZIL, for the reasons you noted.  And I do agree there are 
significant performance improvements to be had from battery-backed RAM 
adapters in a ZFS environment.  (By the way, set the zfs logbias to 
"throughput" rather than "latency" if you're using a controller cache, 
since ZFS cannot deterministically predict the controller's latency and 
that can lead to some really odd behavior.)  But in terms of operational 
integrity you are taking on risk by doing this.
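For reference, that logbias tweak is a one-liner; the dataset name tank/db below is a placeholder for your own pool/dataset:

```shell
# Prefer throughput over latency for synchronous writes when a
# controller cache sits between ZFS and the disks.
zfs set logbias=throughput tank/db

# Confirm the property took effect.
zfs get logbias tank/db
```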

Then again, we lived with that risk in the world before ZFS and 
hardware-backed RAID: an *undetected* sector fault was potentially 
ruinous, and since individual blocks were not checksummed it did 
occasionally happen.

All configurations carry risk and you have to evaluate which ones you're 
willing to live with and which ones you simply cannot accept.

-- 
-- Karl
karl at denninger.net

