Turn off RAID read and write caching with ZFS?

Karl Denninger karl at denninger.net
Thu May 22 12:52:20 UTC 2014

On 5/22/2014 5:38 AM, Jeff Chan wrote:
> As mentioned before we have a server with the LSI 2208 RAID chip which
> apparently doesn't seem to have HBA firmware available.  (If anyone
> knows of one, please let me know.)  Therefore we are running each drive
> as separate, individual RAID0, and we've turned off the RAID hardware
> read and write caching on the claim it performs better with ZFS, such
> as:
> http://forums.freenas.org/index.php?threads/disable-cache-flush.12253/
> " cyberjock, Apr 7, 2013
>      AAh. You have a RAID controller with on-card RAM. Based on my
> testing with 3 different RAID controllers that had RAM and benchmark
> and real world tests, here's my recommended settings for ZFS users:
>      1. Disable your on-card write cache. Believe it or not this
> improves write performance significantly. I was very disappointed with
> this choice, but it seems to be a universal truth. I upgraded one of
> the cards to 4GB of cache a few months before going to ZFS and I'm
> disappointed that I wasted my money. It helped a LOT on the Windows
> server, but in FreeBSD it's a performance killer. :(
>      2. If your RAID controller supports read-ahead cache, you should
> be setting to either "disabled", the most "conservative"(smallest
> read-ahead) or "normal"(medium size read-ahead). I found that
> "conservative" was better for random reads from lots of users and the
> "normal" was better for things where you were constantly reading a
> file in order(such as copying a single very large file). If you choose
> anything else for the read-ahead size the latency of your zpool will
> go way up because any read by the zpool will be multiplied by 100x
> because the RAID card is constantly reading a bunch of sectors before
> and after the one sector or area requested."
> Does anyone have any comments or test results about this?  I have not
> attempted to test it independently.  Should we run with RAID hardware
> caching on or off?
That's mostly right.

Write-caching is very evil in a ZFS world, because ZFS checksums each 
block.  If the filesystem gets back an "OK" for a block that is not 
actually on the disk, ZFS will presume the block and its checksum are 
safely stored.  If that assumption proves false down the road (say, the 
cache is lost before it is flushed), you're going to have a very bad day.
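To make the failure mode concrete, here is a minimal toy model in 
Python.  It is only a sketch: ToyController and survives_power_loss are 
hypothetical names, not any real firmware or ZFS interface, and the 
model assumes the on-card cache is volatile with no battery or flash 
backup.

```python
import hashlib


class ToyController:
    """Hypothetical controller model; not any real RAID firmware."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.disk = {}    # blocks that are really on stable storage
        self.cache = {}   # volatile on-card RAM

    def write(self, lba, data):
        if self.write_back:
            self.cache[lba] = data   # acknowledged before reaching the disk
        else:
            self.disk[lba] = data    # acknowledged only once on the disk
        return "OK"

    def power_loss(self):
        # Assumption: no battery or flash backup, so cached writes vanish.
        self.cache.clear()

    def read(self, lba):
        return self.cache.get(lba, self.disk.get(lba))


def survives_power_loss(write_back):
    """Return True if the block read back still matches its checksum."""
    ctrl = ToyController(write_back)
    ctrl.disk[7] = b"old contents"              # what was there before
    new = b"new contents"
    expected = hashlib.sha256(new).digest()     # checksum the filesystem keeps
    ctrl.write(7, new)                          # controller answers "OK"
    ctrl.power_loss()
    got = ctrl.read(7)
    return got is not None and hashlib.sha256(got).digest() == expected


print("write-through survives:", survives_power_loss(False))
print("write-back survives:   ", survives_power_loss(True))
```

In the write-back case the filesystem was told "OK", yet the block on 
the platter is the old one, so the stored checksum no longer matches 
what comes back.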

READ caching is not so simple.  The problem that comes about is that in 
order to obtain the best speed from a spinning piece of rust you must 
read whole tracks.  If you don't you take a latency penalty every time 
you want a sector, because you must wait for the rust to pass under the 
head.  If you read a single sector and then come back to read a second 
one, inter-sector gap sync is lost and you get to wait for another 
rotation.

Therefore what you WANT for spinning rust in virtually all cases is for 
all reads coming off the rust to be one full **TRACK** in size. If you 
wind up only using one sector of that track you still don't get hurt 
materially because you had to wait for the rotational latency anyway as 
soon as you move the head.
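A rough back-of-the-envelope version of that argument, as a small 
Python sketch.  The spindle speeds are illustrative assumptions, and the 
arithmetic ignores seek time, transfer time and command overhead; it 
only shows how long one rotation takes and how long you wait on average 
for a given sector to come around.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def average_rotational_latency_ms(rpm):
    """Expected wait for a random sector: half a rotation on average."""
    return rotation_ms(rpm) / 2.0


# Illustrative spindle speeds, not taken from any particular drive.
for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>6} RPM: full rotation {rotation_ms(rpm):5.2f} ms, "
          f"average wait {average_rotational_latency_ms(rpm):5.2f} ms")
```

Since you pay roughly half a rotation just to reach the first sector, 
picking up the rest of the track costs at most one further rotation, 
whereas coming back later for a neighboring sector costs another wait 
of the same order.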

Unfortunately, working this out with the sort of certainty you need to 
optimize for a given workload stopped being easy in the disk drive 
world quite a long time ago.  It used to be that MFM ST506-style drives 
had 17 sectors per track and RLL (2,7) ones had 26.  Then areal density 
became 
the limit and variable geometry showed up, frustrating an operating 
system (or disk controller!) that tried to, at the driver level, issue 
one DMA command per physical track in an attempt to capitalize on the 
fact that all but the first sector read for a given rotation were 
essentially "free".

Modern drives typically try to compensate for their variable geometry 
through their own read-ahead cache, but the exact details of their 
algorithms are typically not exposed.

What I would love to find is a "buffered" controller that recognizes all 
of this and works as follows:

1. Writes, when committed, are committed and no return is made until 
storage has written the data and claims it's on the disk.  If the 
sector(s) written are in the buffer memory (from a previous read in 2 
below) then the write physically alters both the disk AND the buffer.

2. Reads are always one full track in size and go into the buffer memory 
on an LRU basis.  A read for a sector already in the buffer memory 
results in no physical I/O taking place.  The controller does not store 
sectors per se in the buffer; it stores tracks.  This requires that the 
adapter be able to discern the *actual* underlying geometry of the drive 
so it knows where track boundaries are.  Yes, I know drive caches 
themselves try to do this, but how well do they manage?  Evidence 
suggests that it's not particularly effective.
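The two rules above can be sketched as a toy model in Python.  This is 
only an illustration of the wished-for behavior, not any real 
controller: TrackCache and its parameters are hypothetical, the "disk" 
is a plain list of sectors, and a fixed sectors-per-track value is 
assumed even though real drives vary it by zone.

```python
from collections import OrderedDict


class TrackCache:
    """Toy model of the wished-for controller; all names are hypothetical."""

    def __init__(self, disk, sectors_per_track, max_tracks):
        self.disk = disk                  # list of sector payloads
        self.spt = sectors_per_track      # assumed fixed; real drives vary by zone
        self.max_tracks = max_tracks
        self.tracks = OrderedDict()       # track number -> sectors, in LRU order
        self.physical_reads = 0           # counts whole-track reads from the disk

    def _load_track(self, track):
        start = track * self.spt
        self.physical_reads += 1
        self.tracks[track] = list(self.disk[start:start + self.spt])
        if len(self.tracks) > self.max_tracks:
            self.tracks.popitem(last=False)   # evict least recently used track

    def read(self, lba):
        track, offset = divmod(lba, self.spt)
        if track in self.tracks:
            self.tracks.move_to_end(track)    # buffer hit: no physical I/O
        else:
            self._load_track(track)           # miss: read the full track
        return self.tracks[track][offset]

    def write(self, lba, data):
        self.disk[lba] = data                 # write-through: disk first
        track, offset = divmod(lba, self.spt)
        if track in self.tracks:
            self.tracks[track][offset] = data # keep the buffered copy in step
        return "OK"                           # returned only after the disk write


disk = [f"sector-{i}" for i in range(12)]
cache = TrackCache(disk, sectors_per_track=4, max_tracks=2)
cache.read(1)
cache.read(3)                  # same track, served from the buffer
cache.write(2, "rewritten")    # alters both the disk and the buffer
print(cache.read(2), cache.physical_reads)
```

The hard part in practice is the one this sketch assumes away: knowing 
where the real track boundaries are.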

Without this, read caching is a crapshoot that is difficult to tune and 
very workload-dependent in terms of what delivers the best performance.  
All you can do is tune (if you're able with a given controller) and test.

-- Karl
karl at denninger.net
