Does UFS2 send BIO_FLUSH to GEOM when updating metadata (with
softupdates)?
Kostik Belousov
kostikbel at gmail.com
Sat Nov 26 08:41:56 UTC 2011
On Sat, Nov 26, 2011 at 12:13:54PM +0400, Lev Serebryakov wrote:
> Hello, Kostik.
> You wrote on 26 November 2011, 12:03:51:
>
> >> You are entirely correct when you say that the requirement for
> >> SU and SU+J is that notification of a disk-write completion
> >> means that the data is on the disk (stable). The problem
> >> that arises is that (apparently) some tag-queue implementations
> >> report back that tags have been written when in fact they have
> >> not been written.
> > Right, and my belief is that real hardware is not much affected,
> You have the wrong idea about modern hardware, sorry.
>
> Again: don't forget multi-megabyte caches, and the absence of any
> guarantee about the order in which these caches will be flushed. Many
> controllers, and the drives themselves, group writes. And if a companion
> for a data block in the cache is found earlier than a companion for a
> metadata block (as the drive doesn't distinguish them), or a wait timeout
> expires, the data block will be written first. The same applies to two
> metadata blocks, of course. And it is not a question of BROKEN QUEUEING.
>
> Again, I'm not speaking about cheap ATA drives here, but about
> expensive high-performance RAID controllers and server drives with
> huge caches.
>
> > except probably some ultra-cheap and old ATA disks. Another issue
> > is broken-by-design 'drivers' whose authors do not understand the
> > environment they are programming for.
> And, again, either you have a storage stack that is synchronous from
> top to bottom, with performance that will be miserable compared to
> other OSes, or you need to give some freedom to driver authors and
> provide them with hints about the semantics of individual operations.
> Every drive and controller that does write caching and reordering
> (except old, cheap, broken ATA ones) HAS flags and knobs to send an
> individual block to the platters as soon as possible. But right now
> drivers don't have any idea when they should use these flags. And they
> don't use them.
Sigh. The in-kernel i/o subsystem is already asynchronous by design.
You have to make special arrangements to get synchronous writes of
blocks (like using bwrite()). And even if you do, the buf layer emulates
the sync operation by blocking the thread until the async event happens.
All buffer i/o (including reads) operates using bio_done callbacks called
at the end of the operation. In fact, there is inherent ugliness due to
the async nature, namely the kernel-owned buffer locks. Getting rid of
them would be much more useful than breaking UFS.
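To make that concrete, here is a minimal userland sketch of the
"synchronous write emulated on top of an async, callback-driven layer"
pattern described above. This is not the actual buf/bio code; every
name in it (io_request, submit_async, hw_completion) is invented for
illustration. The caller submits the request and then sleeps until the
done callback fires, which is the same shape the buf layer gives you
with bwrite().

/*
 * Userland model only: a "sync" write built on an async completion
 * callback.  All names are hypothetical, not FreeBSD kernel APIs.
 * Build with: cc -o syncwrite syncwrite.c -lpthread
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct io_request {
	const char	*data;		/* payload to "write" */
	bool		 done;		/* set by the completion callback */
	pthread_mutex_t	 lock;
	pthread_cond_t	 cv;
	void		(*bio_done)(struct io_request *);
};

/* Completion callback: plays the role of the bio_done handler. */
static void
io_done(struct io_request *req)
{
	pthread_mutex_lock(&req->lock);
	req->done = true;
	pthread_cond_signal(&req->cv);
	pthread_mutex_unlock(&req->lock);
}

/* "Hardware" thread: reports completion only after the media write. */
static void *
hw_completion(void *arg)
{
	struct io_request *req = arg;

	usleep(10000);			/* pretend the platter write took 10ms */
	req->bio_done(req);		/* report completion upward */
	return (NULL);
}

/* Asynchronous submission: returns immediately, like a queued request. */
static void
submit_async(struct io_request *req)
{
	pthread_t t;

	pthread_create(&t, NULL, hw_completion, req);
	pthread_detach(t);
}

/* Sync write emulated on the async path: block until the callback runs. */
static void
write_sync(struct io_request *req)
{
	submit_async(req);
	pthread_mutex_lock(&req->lock);
	while (!req->done)
		pthread_cond_wait(&req->cv, &req->lock);
	pthread_mutex_unlock(&req->lock);
}

int
main(void)
{
	struct io_request req = { .data = "metadata block", .bio_done = io_done };

	pthread_mutex_init(&req.lock, NULL);
	pthread_cond_init(&req.cv, NULL);
	write_sync(&req);
	printf("write of '%s' reported stable\n", req.data);
	return (0);
}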
A non-broken driver must not return the 'completed' bio up the queue
until the write has been sent to the hardware and the hardware has
reported completion. RAID controllers which aggressively cache writes
use NVRAM or battery backup, and do not allow the write cache to be
turned on if the battery is non-functional. I have not seen SU
inconsistencies on RAID 6 on mfi(4), despite one of our machines having
had the unfortunate habit of dropping its boot disk off the SATA channel
every week, for 2 years.
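For illustration only, here is a toy model of that completion contract.
None of these names are real driver entry points; the point is solely
where the done callback is allowed to fire.

/*
 * Toy model of the "do not complete until the hardware says so" rule.
 * struct fake_bio and the *_strategy functions are invented names.
 */
#include <stdio.h>

struct fake_bio {
	const char *desc;
	void      (*done)(struct fake_bio *);
};

static void
report_done(struct fake_bio *bp)
{
	printf("%s: completion delivered to upper layer\n", bp->desc);
}

/*
 * BROKEN: the request is merely placed on the controller's queue, yet
 * the driver already tells the upper layers it is "completed".  SU/SU+J
 * will now assume the block is stable on media when it is not.
 */
static void
broken_strategy(struct fake_bio *bp)
{
	/* ... enqueue bp to the controller ... */
	bp->done(bp);		/* WRONG: the media write has not happened yet */
}

/*
 * CORRECT: only queue the request here; bp->done() is called later,
 * from the interrupt/completion path, once the hardware has actually
 * reported that the write finished.
 */
static void
correct_strategy(struct fake_bio *bp)
{
	/* ... enqueue bp; do NOT call bp->done() here ... */
	(void)bp;
}

static void
correct_hw_interrupt(struct fake_bio *bp)
{
	/* hardware says the write is on stable storage */
	bp->done(bp);		/* only now may the bio go back up */
}

int
main(void)
{
	struct fake_bio a = { "broken driver", report_done };
	struct fake_bio b = { "correct driver", report_done };

	broken_strategy(&a);
	correct_strategy(&b);
	correct_hw_interrupt(&b);	/* simulate the later interrupt */
	return (0);
}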
>
> > I do not see how this proposal changes much, except limiting the
> > potential havoc to the last 100ms of system operation. In fact,
> > reordering, besides causing fs consistency problems, may cause
> > security issues as well [*]. If user data is written into reused
> > blocks, but the metadata update was ordered after the data write, we
> > can end up with an arbitrary overwrite of sensitive authorization or
> > accounting information.
> That is why metadata requests should be marked as non-reorderable,
> non-queueable. Individual requests, not some global barrier every 100ms.
>
You again missed the point: if metadata is not reorderable but user
data is, you get security issues. They are similar (but inverse) to what
I described in the previous paragraph.
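To illustrate the objection, here is a toy scheduler sketch. All names
are invented and this is not GEOM code; the per-request "ordered" flag
stands in for the kind of marking proposed above. The metadata request
keeps its place, yet an unflagged user-data write is still free to reach
the media first, which is exactly the window that lets freshly written
user data land in a block the on-disk metadata still assigns to a
privileged file.

/*
 * Toy scheduler: ordering only the metadata request does not order the
 * data write relative to it.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_req {
	const char *what;
	bool        ordered;	/* metadata: must not be reordered */
};

/*
 * "Schedule" two requests the way a cache-happy device might: an
 * unordered request may be issued ahead of whatever preceded it.
 */
static void
issue_pair(struct toy_req *first, struct toy_req *second)
{
	if (!second->ordered) {
		printf("issued: %s\n", second->what);
		printf("issued: %s\n", first->what);
	} else {
		printf("issued: %s\n", first->what);
		printf("issued: %s\n", second->what);
	}
}

int
main(void)
{
	/* metadata that frees block 100 from a privileged file ...      */
	struct toy_req meta = { "metadata: free block 100", true };
	/* ... followed by user data destined for the reused block 100   */
	struct toy_req data = { "user data into block 100", false };

	/*
	 * The metadata request itself is non-reorderable, yet the data
	 * write hits the media first; crash in between and the on-disk
	 * metadata still says block 100 belongs to the old file.
	 */
	issue_pair(&meta, &data);
	return (0);
}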