Turn off RAID read and write caching with ZFS?
Karl Denninger
karl at denninger.net
Thu May 22 12:52:20 UTC 2014
On 5/22/2014 5:38 AM, Jeff Chan wrote:
> As mentioned before we have a server with the LSI 2208 RAID chip which
> apparently doesn't seem to have HBA firmware available. (If anyone
> knows of one, please let me know.) Therefore we are running each drive
> as separate, individual RAID0, and we've turned off the RAID hardware
> read and write caching on the claim it performs better with ZFS, such
> as:
>
>
> http://forums.freenas.org/index.php?threads/disable-cache-flush.12253/
>
> " cyberjock, Apr 7, 2013
>
> AAh. You have a RAID controller with on-card RAM. Based on my
> testing with 3 different RAID controllers that had RAM and benchmark
> and real world tests, here's my recommended settings for ZFS users:
>
> 1. Disable your on-card write cache. Believe it or not this
> improves write performance significantly. I was very disappointed with
> this choice, but it seems to be a universal truth. I upgraded one of
> the cards to 4GB of cache a few months before going to ZFS and I'm
> disappointed that I wasted my money. It helped a LOT on the Windows
> server, but in FreeBSD it's a performance killer. :("
>
> 2. If your RAID controller supports read-ahead cache, you should
> be setting to either "disabled", the most "conservative"(smallest
> read-ahead) or "normal"(medium size read-ahead). I found that
> "conservative" was better for random reads from lots of users and the
> "normal" was better for things where you were constantly reading a
> file in order(such as copying a single very large file). If you choose
> anything else for the read-ahead size the latency of your zpool will
> go way up because any read by the zpool will be multiplied by 100x
> because the RAID card is constantly reading a bunch of sectors before
> and after the one sector or area requested."
>
>
>
> Does anyone have any comments or test results about this? I have not
> attempted to test it independently. Should we run with RAID hardware
> caching on or off?
>
That's mostly right.
Write-caching is very evil in a ZFS world, because ZFS checksums each
block. If the filesystem gets back an "OK" for a block that is not
actually on the disk, ZFS will presume the block was written and that
its checksum is ok. If that assumption proves to be false down the road
you're going to have a very bad day.
READ caching is not so simple. The problem that comes about is that in
order to obtain the best speed from a spinning piece of rust you must
read whole tracks. If you don't you take a latency penalty every time
you want a sector, because you must wait for the rust to pass under the
head. If you read a single sector and then come back to read a second
one, inter-sector gap sync is lost and you get to wait for another rotation.
Therefore what you WANT for spinning rust in virtually all cases is for
all reads coming off the rust to be one full **TRACK** in size. If you
wind up only using one sector of that track you still don't get hurt
materially because you had to wait for the rotational latency anyway as
soon as you move the head.
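To put rough numbers on the rotational-latency argument above, here is a
minimal sketch. The 7200 RPM spindle speed is an assumption chosen for
illustration, the function names are hypothetical, and the model ignores
seek and transfer time; it also assumes a full-track read can start at
whichever sector arrives under the head first.

```python
def rotation_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def whole_track_ms(rpm: float) -> float:
    """One full-track read: a single revolution once the head is on track.

    Assumption: the read may begin at any sector, so no extra wait.
    """
    return rotation_ms(rpm)


def sector_by_sector_ms(rpm: float, sectors: int) -> float:
    """Reading `sectors` sectors one request at a time.

    The first sector costs an average half revolution; every later one
    misses the inter-sector gap and costs a full extra revolution.
    """
    if sectors < 1:
        raise ValueError("sectors must be at least 1")
    return 0.5 * rotation_ms(rpm) + (sectors - 1) * rotation_ms(rpm)


if __name__ == "__main__":
    rpm = 7200          # assumed spindle speed, not from the original post
    sectors = 17        # the old 17-sectors-per-track figure
    print(f"one revolution:      {rotation_ms(rpm):.2f} ms")
    print(f"whole track at once: {whole_track_ms(rpm):.2f} ms")
    print(f"sector by sector:    {sector_by_sector_ms(rpm, sectors):.2f} ms")
```

Under these assumptions the single-sector pattern is more than an order
of magnitude slower than pulling the whole track in one go, which is the
reason for wanting track-sized reads.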
Unfortunately, quite a long time ago in the disk drive world it stopped
being easy to figure out track geometry with the sort of certainty that
you need to best optimize for a workload. It used to be that ST506-style
MFM drives had 17 sectors per track and RLL (2,7) ones had 26. Then
areal density became
the limit and variable geometry showed up, frustrating an operating
system (or disk controller!) that tried to, at the driver level, issue
one DMA command per physical track in an attempt to capitalize on the
fact that all but the first sector read for a given rotation were
essentially "free".
Modern drives typically try to compensate for their
variable-geometryness through their own read-ahead cache, but the exact
details of their algorithm are typically not exposed.
What I would love to find is a "buffered" controller that recognizes all
of this and works as follows:
1. Writes, when committed, are committed and no return is made until
storage has written the data and claims it's on the disk. If the
sector(s) written are in the buffer memory (from a previous read in 2
below) then the write physically alters both the disk AND the buffer.
2. Reads are always one full track in size and go into the buffer memory
on a LRU basis. A read for a sector already in the buffer memory
results in no physical I/O taking place. The controller does not store
sectors per se in the buffer; it stores tracks. This requires that the
adapter be able to discern the *actual* underlying geometry of the drive
so it knows where track boundaries are. Yes, I know drive caches
themselves try to do this, but how well do they manage? Evidence
suggests that it's not particularly effective.
Without this, read cache is a crapshoot that gets difficult to tune and
is very workload-dependent in terms of what delivers best performance.
All you can do is tune (if you're able with a given controller) and test.
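The wished-for controller in points 1 and 2 above can be sketched in a
few lines. This is only an illustration of the idea, not any real
controller's firmware: the class name TrackCache is hypothetical, a
Python list stands in for the drive, and a fixed sectors-per-track value
is assumed even though real drives vary it by zone.

```python
from collections import OrderedDict


class TrackCache:
    """Write-through, track-granular LRU read cache (illustrative only)."""

    def __init__(self, disk, sectors_per_track, max_tracks):
        self.disk = disk                  # list of sector payloads standing in for the drive
        self.spt = sectors_per_track      # assumed constant; real geometry is zoned
        self.max_tracks = max_tracks      # buffer capacity, counted in tracks
        self.tracks = OrderedDict()       # track number -> list of sectors, oldest first
        self.physical_reads = 0           # number of full-track reads issued

    def _load_track(self, track):
        """Read one whole track from the disk into the buffer."""
        start = track * self.spt
        self.physical_reads += 1
        data = list(self.disk[start:start + self.spt])
        self.tracks[track] = data
        if len(self.tracks) > self.max_tracks:
            self.tracks.popitem(last=False)   # evict the least recently used track
        return data

    def read(self, lba):
        """Return one sector; a buffer hit causes no physical I/O."""
        track, offset = divmod(lba, self.spt)
        if track in self.tracks:
            self.tracks.move_to_end(track)    # mark as most recently used
            data = self.tracks[track]
        else:
            data = self._load_track(track)
        return data[offset]

    def write(self, lba, payload):
        """Commit to the disk first, then patch any buffered copy."""
        self.disk[lba] = payload              # returns only after the "disk" has the data
        track, offset = divmod(lba, self.spt)
        if track in self.tracks:
            self.tracks[track][offset] = payload


if __name__ == "__main__":
    disk = list(range(8))                     # 2 tracks of 4 sectors
    cache = TrackCache(disk, sectors_per_track=4, max_tracks=1)
    cache.read(1)
    cache.read(2)                             # same track, served from the buffer
    cache.write(2, 99)                        # updates disk and buffer together
    print(cache.read(2), cache.physical_reads)
```

The hard part, discovering where the track boundaries really are, is
exactly what the sketch assumes away with its fixed sectors_per_track.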
--
-- Karl
karl at denninger.net