SATA RAID & write cache state

Kenneth D. Merry ken at freebsd.org
Mon Oct 11 14:03:05 PDT 2004


On Mon, Oct 11, 2004 at 09:08:01 -0700, Roisin Murphy wrote:
> thanks for your response
> 
> well, I'm not really concerned with read/write performance or with
> losing a little bit of cached data that the controller didn't manage
> to write out in the event of a power outage or anything breaking in
> the box.  I don't plan to store anything critical.  It will be used
> to dump lots of uncompressed pictures and video, but still the
> performance doesn't matter.  The only thing I want to avoid is an
> inconsistent state of the array (= dead RAID) if anything breaks in
> the box and the actual disks & parity are not in sync.  From what Ken
> mentioned, it seems like ATA/SATA is a loose specification, and even
> if some SATA disks do support tagged queueing, the controller most
> likely won't use it.  So my best bet would be to get those
> manufacturer floppy utilities and set the write cache default to off
> on all the disks.
> 
> > Well, any decent SCSI RAID controller will either just disable the
> > write cache altogether, or it will give the user the option of
> > disabling the write cache.
> > Of course, a battery-backed cache is useless if write caching is
> > turned on on your drives.  So it will be a useless feature with most
> > ATA or SATA RAID controllers, because it's unlikely that they would
> > want to tank their performance badly by disabling write caching.
> 
> 1. So if SCSI controllers most likely switch the disk caches off, and
> optimize the writes with their own on-board cache, why wouldn't a
> better/smarter ATA/SATA controller be able to do the same? I mean, how
> does a SATA disk with its cache off differ from a SCSI disk with its
> cache off?

See my previous mail.  SATA disks differ in two ways:

1.  Many don't support tagged queueing.

2.  If the SATA disk does support tagged queueing, there is still a
fundamental problem with the queueing model in SATA (and probably ATA, not
sure).  According to a coworker of mine (a hardware engineer) who is a SATA
expert, the status phase on the bus is the same phase as the data phase.
So on a write, you basically have to send all the data to the drive, and
the drive has to send the status back, before the drive can accept any
more data for another queued write command.  That effectively limits you
to writing data for one command at a time.

My coworker also mentioned that in order to figure this out, you have to go
look at some of the state diagrams in the SATA spec.
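
To put rough numbers on it, here is a toy model in C.  All of the
timings are invented (real drives and buses are messier), and it
assumes the write cache is off, so status only comes back after the
media write:

    /*
     * Toy model of the serialization described above.  The numbers
     * are invented; the point is the shape of the effect, not real
     * SATA/SCSI timings.
     */
    #include <stdio.h>

    int
    main(void)
    {
        double xfer = 0.5;        /* ms to move one write's data */
        double seek_random = 8.0; /* ms avg seek, one write at a time */
        double seek_sorted = 2.0; /* ms avg seek when the drive holds
                                     many writes and can reorder them */
        int n = 32;               /* queued write commands */

        /*
         * SCSI-style queueing: the drive has all n writes in hand,
         * sorts them, and the data transfers hide behind the seeks.
         */
        double queued = xfer + n * seek_sorted;

        /*
         * The SATA model described above: the drive only ever holds
         * the data for one write, so there is nothing to reorder,
         * and each transfer waits for the previous status.
         */
        double serialized = n * (xfer + seek_random);

        printf("queued: %.1f ms  serialized: %.1f ms\n",
            queued, serialized);
        return (0);
    }

With the invented numbers the queued case comes out roughly four times
faster; the real lesson is that losing the ability to overlap and
reorder writes costs far more than the raw bus time suggests.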

The reason you can get good performance with SCSI disks by doing caching in
the RAID controller is that SCSI disks do tagged queueing, and their
queueing model isn't broken.

> > Also, keep in mind that with any RAID controller that does RAID-5 or
> > RAID-1, you should get a battery-backed cache.  It may be an option, but
> > you should get it.  This will protect you from the RAID-5/RAID-1 write
> > hole.  That is, when you have a crash, you don't know:
> > 1.  What writes you have outstanding.
> > 2.  Whether all, part, or none of those writes got committed.
> 
> 2. So the backup battery (the Intel RAID card has that option too), is
> that primarily meant to save the data from the controller's cache?

Yes.

> Without the backup battery, could you still end up with an
> inconsistent/dead array?

You could end up with an inconsistent array, yes.  A "dead" array implies
that you have two failed disks.  Lack of a battery won't cause that.

> > So without a battery-backed cache, you will have to scrub your entire
> > array to make sure the parity is consistent, and you still will not know
> > whether some of your data was corrupted.  All you can really do is sync the
> > parity.
> 
> 3. I would assume that this parity syncing will be automatic, or not
> necessary at all.  Isn't the controller acting kind of like a
> transactional DB, so even if the machine crashes in any way, the
> array still survives in a consistent state, only losing the last
> active writes?  So yes, you could end up with a corrupted file or
> two, but not with an inconsistent parity/dead array?  And with a
> battery-backed controller, you could avoid it altogether, unless the
> actual controller breaks?

The thing to realize about a RAID controller is that it will generally ack
writes as soon as it DMAs them into its cache; i.e., most RAID controllers
run in write-back caching mode.  So the OS (the filesystem, really) thinks
that the data has been committed to media, but it hasn't.

As for the RAID controller, the writes it does for any given I/O are not
atomic.  Without a battery-backed cache, or some sort of (very slow) I/O
log on the disk to tell it what writes are outstanding, it will not know
what I/O was active at the time of a crash.  So it will not know whether
all, part, or none of those I/Os made it to disk.  This is called the
RAID-5/RAID-1 write hole.
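
If it helps, here is a tiny in-memory sketch of the hole in C.  The
"disks" are just arrays and all of the values are made up; the comment
marks where a crash would bite:

    /*
     * Toy, in-memory sketch of the RAID-5 write hole.  The "disks"
     * are plain arrays; nothing here is real device I/O.
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define NDISK 4             /* one stripe: 3 data + 1 parity */
    #define BLKSZ 8             /* tiny blocks, for illustration */

    static uint8_t disk[NDISK][BLKSZ];  /* disk[3] holds parity */

    /* What the parity *should* be: the XOR of all data blocks. */
    static void
    compute_parity(uint8_t *out)
    {
        int d, i;

        memset(out, 0, BLKSZ);
        for (d = 0; d < NDISK - 1; d++)
            for (i = 0; i < BLKSZ; i++)
                out[i] ^= disk[d][i];
    }

    int
    main(void)
    {
        uint8_t newdata[BLKSZ], good[BLKSZ];

        compute_parity(disk[3]);    /* start out consistent */

        /* A small write to data disk 0: the data block goes out... */
        memset(newdata, 0xab, BLKSZ);
        memcpy(disk[0], newdata, BLKSZ);

        /* ...and the box crashes before the matching parity write. */

        compute_parity(good);
        printf("parity consistent after crash: %s\n",
            memcmp(good, disk[3], BLKSZ) == 0 ? "yes" : "no");
        return (0);
    }

Run it and it prints "no": the data write landed, the parity write
didn't, and nothing on disk records which stripe was in flight.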

So if you don't have a battery-backed cache, and therefore you don't know
which stripes on your array had I/O outstanding, you have to scrub the
entire array.  (This is assuming RAID-5.)  That means you read all the data
blocks and recompute the parity for every data block.  That takes quite a
while.
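
With no battery, the recovery path degenerates to something like this
sketch, where disk_read()/disk_write() are hypothetical stand-ins for
the real device I/O:

    /*
     * Sketch of what a RAID-5 scrub amounts to.  disk_read() and
     * disk_write() are hypothetical stand-ins, not a real API.
     */
    #include <stdint.h>
    #include <string.h>

    #define NDATA 3             /* data disks; disk NDATA is parity */
    #define BLKSZ 512

    void disk_read(int disk, uint64_t blkno, uint8_t *buf);
    void disk_write(int disk, uint64_t blkno, const uint8_t *buf);

    void
    scrub(uint64_t nblocks)
    {
        uint8_t data[BLKSZ], parity[BLKSZ];
        uint64_t blk;
        int d, i;

        /* Every stripe, since we don't know which ones were dirty. */
        for (blk = 0; blk < nblocks; blk++) {
            memset(parity, 0, BLKSZ);
            for (d = 0; d < NDATA; d++) {
                disk_read(d, blk, data);
                for (i = 0; i < BLKSZ; i++)
                    parity[i] ^= data[i];
            }
            disk_write(NDATA, blk, parity);
        }
    }

The outer loop walks every stripe on the array, which is exactly why
it takes so long.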

With a battery-backed cache, you get two pieces of information:

1.  What I/Os were outstanding to the disk.
2.  The actual data for those I/Os.

Since you know what I/Os were outstanding to disk, and you have the
actual data, all you have to do is replay those I/Os and your array is
completely consistent.  You don't have to do a (very time-consuming)
scrub of the entire array, and you won't have any data corruption.
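
The recovery path then looks more like this sketch.  The log layout
and disk_write() are invented for illustration; a real controller's
NVRAM format is firmware-specific:

    /*
     * Sketch of crash recovery with a battery-backed cache.  The
     * struct and disk_write() are invented, not a real interface.
     */
    #include <stdint.h>

    #define BLKSZ 512

    struct logged_io {
        int      disk;           /* target disk */
        uint64_t blkno;          /* target block */
        uint8_t  data[BLKSZ];    /* the data, held up by the battery */
    };

    void disk_write(int disk, uint64_t blkno, const uint8_t *buf);

    /*
     * After a crash we know which I/Os were outstanding (1) and we
     * still have their data (2), so replaying them makes every
     * affected stripe consistent.  No scrub needed.
     */
    void
    replay_cache(const struct logged_io *log, int nentries)
    {
        int i;

        for (i = 0; i < nentries; i++)
            disk_write(log[i].disk, log[i].blkno, log[i].data);
    }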

Because of the RAID-5 write hole, the array can not only give the user
back old data if a particular piece of data wasn't written around the
time of a crash, it can also give the user back corrupt data in the
event of a disk failure.  (That's because your parity isn't consistent
any more.  Even scrubbing will only make the parity consistent; it
won't solve the partial write problem.)
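
To make that concrete with some made-up byte values:

    /*
     * Made-up numbers for the disk failure case: parity went stale
     * in the write hole, a data disk then dies, and the rebuilt
     * block is garbage.
     */
    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint8_t d0 = 0x11, d1 = 0x22, d2 = 0x33;  /* data blocks */
        uint8_t parity = d0 ^ d1 ^ d2;            /* consistent */
        uint8_t rebuilt;

        d0 = 0x44;  /* the crash landed after this data write... */
        /* ...but before the parity write: parity is now stale. */

        /* Now disk 1 dies.  Rebuild its block from the survivors: */
        rebuilt = d0 ^ d2 ^ parity;

        printf("real d1 = 0x%02x, rebuilt d1 = 0x%02x\n", d1, rebuilt);
        return (0);
    }

This prints 0x22 versus 0x77: the rebuilt block matches neither the
old data nor anything the user ever wrote, because the stale parity no
longer corresponds to the data it is being combined with.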

> 4. Is any software RAID solution able to deliver a safe RAID-5
> setup?  Cache off on all the disks is a must, of course.  Does
> anybody have a success story with software solutions, proven by
> simulated crashes?  :) All I keep hearing about vinum is 'not for
> production use'/'horror stories'

Well, with (S)ATA disks, I don't think you'll get good performance with the
write cache off, with software or hardware RAID.

With a software RAID solution, you won't be able to cover the RAID-5/RAID-1
write hole because you don't have a battery-backed cache for the RAID
controller.  (The RAID controller is now the host OS.)

If you want the cheapest solution, and something somewhat reliable, you
could go with software RAID on SATA disks and put the machine on a UPS.
With a UPS, you could run with write caching turned on on the drives,
and without a battery-backed RAID controller cache, and not worry about
it too much.  As long as your UPS has enough power to shut the machine
down, you'll be fine.

In any case, software RAID will actually be faster than most hardware RAID
controllers, because your Pentium 4/Opteron/etc. is much faster than the
processor on hardware RAID cards.

The only problem with pure software RAID is that you generally can't boot
off of anything other than a RAID-1.  (Even then you may have issues.)  So
your boot disk won't have any protection.  That's why various vendors
(Promise, Broadcom, Adaptec, maybe Intel, etc.) have come out with
controllers that are basically standard SATA controllers with a special
BIOS that lets you boot from a RAID-0, RAID-5, etc.  Software RAID takes
over once the OS boots.

Ken
-- 
Kenneth Merry
ken at FreeBSD.ORG

