Poor RAID performance demystified

Yar Tikhiy yar.tikhiy at gmail.com
Thu Nov 25 09:50:16 UTC 2010


Hi all,

This issue has been raised periodically on various lists and forums,
and I recently ran into it myself, so I feel I should just post my
findings here.

Every now and then somebody complains about extremely poor RAID
performance.  What those reports have in common is that they usually
mention FreeBSD and HP RAID controllers, and all of them involve load
patterns from PostgreSQL.  We are about to see why that is so.

People get surprisingly low disk I/O performance (e.g., 1-2MB/s) in
spite of numerous spindles striped in the array when the benchmark
involves a lot of tiny DB transactions.  On the same array, sequential
read and write rates can be more than satisfactory.

That happens because PostgreSQL in its default configuration is
*remarkably* stringent about flushing every transaction out to disk
before proceeding to the next one.  The PG folks know that well.
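
For reference, here is a minimal excerpt of the settings involved, in
their default state (these are real postgresql.conf parameters, but
the exact defaults can vary by PG version):

    # postgresql.conf -- the defaults that force a flush on every commit
    fsync = on                 # actually flush WAL to stable storage
    synchronous_commit = on    # wait for the flush before reporting success
    #wal_sync_method = fsync   # flush primitive; default is platform-dependent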

But, as is known from practice, the application flushing its data
wouldn't by itself be sufficient for this effect to be so pronounced.
What _might_ be happening here is that HP RAIDs as driven by FreeBSD
fully honor flush requests all the way down the disk stack, whereas
other popular RAID / OS combos can effectively ignore them to a
certain extent due to latent write-back caching, e.g., in the drives
themselves.
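
If you want to know where your own setup stands, camcontrol(8) can
show whether the drives' own write-back cache is enabled (da0 and
ada0 below are placeholders for your devices):

    # SCSI/SAS disks: check the WCE bit in the caching mode page
    camcontrol modepage da0 -m 8
    # SATA disks: look for the "write cache" line in the feature list
    camcontrol identify ada0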

Why does striping fail to speed things up?  Because the transactions
are tiny and every disk write ends up blocked waiting for a single
spindle to handle it.  No striping can speed up 8K or 16K synchronous
writes because they are seek limited, not bandwidth limited.
(Likewise, no RAID or cache can speed up highly random reads of just
a few blocks each, as reads are synchronous by nature: you can't use
the data before it has been read in.)

It is easy to check if you are hitting this kind of bottleneck.
While running your benchmark, watch the output from iostat, systat
-vm, or gstat.  The average I/O size will closely match the FS block
size (the default is 16K now on FFS), and the tps (transfers per
second) value will be quite close to your disks' RPM rate expressed
in revolutions per second, since each synchronous write has to wait
roughly one platter revolution.  E.g., with 10K RPM disks you are
going to get 10000 / 60 = ~170 tps, and with 15K RPM disks it'll be
around 250 tps.  You are just hitting very basic laws of nature and
logic here.
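
For instance (da0 below is a placeholder for your array device):

    # average I/O size is the KB/t column, transfers per second is tps
    iostat da0 1
    # the same numbers per GEOM provider
    gstat -f '^da0$'
    # or the per-disk rows in the full system view
    systat -vmstat 1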

The final question is, of course, what to do about this issue.  First
of all, decide whether 150 or 200 write transactions per second are
going to be enough for your task; your actual load pattern can be
quite different from that in the benchmark.  If you still need
greater write performance on tiny transactions, consider getting a
battery backup unit (BBU) for your RAID adapter.  Quite remarkably,
HP refers to them as "Write-back Cache Enablers" because installing
one is the only way to get an HP RAID adapter to do write-back
caching.  A write-back cache with a BBU will let the adapter delay
and coalesce tiny writes without jeopardizing DB integrity.  However,
you'll need to trust your BBU, as your DB integrity will be staked on
it (the PG folks are somewhat skeptical about BBUs).

On the other hand, just fiddling with the PG settings to disable
transaction flushing is a sure recipe for disaster.  Fortunately,
there is a trade-off mode in PG where it does transaction coalescing
by itself -- search for synchronous_commit.  Its downside is that,
should the system crash, the few most recent transactions can be lost
after they were reported as successful to the SQL client.  That can
be OK or not OK depending on the task, and synchronous_commit can be
toggled on a per-session or per-transaction basis to fine-tune the
trade-off.
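
For example, in SQL (synchronous_commit has been there since PG 8.3):

    -- Relax durability only where losing the last few commits is OK:
    SET synchronous_commit TO off;        -- for the rest of this session
    BEGIN;
    SET LOCAL synchronous_commit TO off;  -- for this transaction only
    -- ... the flood of tiny writes ...
    COMMIT;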

That's it, folks.

Thanks,
Yar

