Re: bio re-ordering

From: Mehmet Erol Sanliturk <>
Date: Fri, 18 Feb 2022 17:47:09 UTC
On Fri, Feb 18, 2022 at 7:31 PM Warner Losh <> wrote:

> So I spent some time looking at what BIO_ORDERED means in today's kernel
> and flavored it with my understanding of the ordering guarantees for BIO
> requests from when I wrote the CAM I/O scheduler. It's kinda long, but it
> spells out what BIO_ORDERED means, where it can come from, and who depends
> on it for what.
> On Fri, Feb 18, 2022 at 1:36 AM Peter Jeremy <> wrote:
>> On 2022-Feb-17 17:48:14 -0800, John-Mark Gurney <> wrote:
>> >Peter Jeremy wrote this message on Sat, Feb 05, 2022 at 20:50 +1100:
>> >> I've raised
>> to
>> >> make geom_gate support BIO_ORDERED.  Exposing the BIO_ORDERED flag to
>> >> userland is quite easy (once a decision is made as to how to do that).
>> >> Enhancing the geom_gate clients to correctly implement BIO_ORDERED is
>> >> somewhat harder.
>> >
>> >The clients are single-threaded wrt I/Os, so I don't think updating them
>> >is required.
>> ggatec(8) and ggated(8) will not reorder I/Os.  I'm not sure about hast.
>> >I do have patches to improve things by making ggated multithreaded to
>> >improve IOPs, and so making this improvement would allow those patches
>> >to be useful.
>> Likewise, I found ggatec and ggated to be too slow for my purposes and
>> so I've implemented my own variant (not network API compatible) that
>> can/does reorder requests.  That was when I noticed that BIO_ORDERED
>> wasn't implemented.
>> >I do have a question though, what is the exact semantics of _ORDERED?
>> I can't authoritatively answer this, sorry.
> This is underdocumented. Clients, in general, are expected to cope with
> I/O that completes in an arbitrary order. They are expected not to schedule
> new I/O that depends on old I/O completing, for whatever reason (usually
> on-media consistency). BIO_ORDERED is used to create a full barrier
> in the stream of I/Os. The comments in the code say, vaguely:
> /*
>  * This bio must be executed after all previous bios in the queue have been
>  * executed, and before any successive bios can be executed.
>  */
> Drivers implement this as a partitioning of requests. All requests before
> it are completed, then the BIO_ORDERED operation is done, then requests
> after it are scheduled with the device.
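The partitioning described here can be sketched as a tiny model. This is hypothetical illustration code, not an actual FreeBSD driver: the `req_t`/`softq_t` types and the `issue_some`/`complete` helpers are made up, with the `barrier` flag standing in for BIO_ORDERED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAXQ 16

typedef struct {
    int  id;
    bool barrier;   /* stands in for BIO_ORDERED */
    bool done;
} req_t;

typedef struct {
    req_t *q[MAXQ];
    size_t n;
    int    inflight; /* issued to the "device", not yet completed */
} softq_t;

/*
 * Issue as many requests as the ordering rules allow.  A barrier request
 * is only issued once everything before it has completed, and nothing
 * after it is issued until the barrier itself completes.
 */
static size_t
issue_some(softq_t *sq, req_t **issued, size_t max)
{
    size_t cnt = 0;

    while (sq->n > 0 && cnt < max) {
        req_t *r = sq->q[0];

        if (r->barrier && sq->inflight > 0)
            break;              /* drain everything before the barrier */
        for (size_t i = 1; i < sq->n; i++)  /* dequeue head */
            sq->q[i - 1] = sq->q[i];
        sq->n--;
        sq->inflight++;
        issued[cnt++] = r;
        if (r->barrier)
            break;              /* nothing after it until it completes */
    }
    return (cnt);
}

static void
complete(softq_t *sq, req_t *r)
{
    r->done = true;
    sq->inflight--;
}
```

With a queue of [normal, barrier, normal], each call to `issue_some()` releases exactly one phase: the pre-barrier requests, then the barrier alone, then the rest.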
> BIO_FLUSH, I think, is the only remaining operation where this is done
> directly. xen/.../blkback.c, geom_io.c, and ffs_softdep.c are the only
> places that set it, and all on BIO_FLUSH operations. bio/buf clients
> depend on this to ensure metadata on the drive is in a consistent state
> after it's been updated. xen/.../blkback.c also sets it for all
> BLKIF_OP_WRITE_BARRIER operations (so write barriers).
> In the upper layers, we have struct buf instead of struct bio to describe
> future I/Os that the buffer cache may need to do. There's a flag B_BARRIER
> that gets turned into BIO_ORDERED in geom_vfs. B_BARRIER is set in only
> two places (and copied in one other) in vfs_bio.c: babarrierwrite and
> bbarrierwrite, for async vs sync writes respectively.
> CAM will set BIO_ORDERED for all BIO_ZONE commands for reasons that are
> at best unclear to me, but which won't matter for this discussion.
> ffs_alloc.c (so UFS again) is the only place that uses babarrierwrite. It
> is used to ensure that all inode initializations are completed before the
> cylinder group bitmap is written out. This is done by newfs, when new
> cylinder groups are created with growfs, and apparently in a few other
> cases where additional inodes are created in newly-created UFS2
> filesystems. This can be disabled with vfs.ffs.doasyncinodeinit=0 when
> barrier writes aren't working as advertised, but there's a big performance
> hit from doing so until all the inodes for the filesystem have been
> lazily populated.
> No place uses bbarrierwrite that I can find.
> Based on all of that, CAM's dynamic I/O scheduler will reorder reads
> around a BIO_ORDERED operation, but not writes, trims, or flushes. Since,
> in general, operations happen in an arbitrary order, scheduling both a
> read and a write at the same time for the same block will produce
> undefined results.
> Different drivers handle this differently. CAM will honor the BIO_ORDERED
> flag by scheduling the I/O with an ordering tag so that the SCSI hardware
> will properly order the result. The simpler ATA version will use a non-NCQ
> request to force the proper ordering (since to send a non-NCQ request, you
> have to drain the queue, do that one command, and then start up again).
> nvd will just throw I/O at the device until it encounters a BIO_ORDERED
> request. Then it will queue everything until all the current requests
> complete, then do the ordered request, then do the rest of the queued I/O
> as if it had just shown up.
> Most drivers use bioq_disksort(), which will queue the request to the end
> of the bioq and mark things so that all I/Os after it are in a new
> 'elevator car' for its elevator-sort algorithm. This means that CAM's
> normal way of dequeuing requests will preserve ordering through the periph
> driver's start routine (where the dynamic scheduler will honor it for
> writes, but not reads, while the default scheduler will honor it for
> both).
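The 'elevator car' bookkeeping can be modeled roughly as follows. This is a hypothetical sketch, not the real bioq_disksort() (which lives in the kernel and differs in detail): each barrier bumps a car (generation) counter, and sorting by block number only ever reorders requests within the same car, so nothing sorts across a barrier. The `sreq_t`/`bioq_t` names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAXQ 16

typedef struct {
    long blkno;
    int  car;    /* elevator car (barrier generation) */
} sreq_t;

typedef struct {
    sreq_t q[MAXQ];
    size_t n;
    int    car;  /* car assigned to newly inserted requests */
} bioq_t;

/*
 * Insert sorted by block number, but never across a car boundary:
 * requests in an older car always stay ahead.  An ordered request gets
 * a car all to itself, so nothing sorts ahead of it and later inserts
 * land behind it.
 */
static void
bioq_insert(bioq_t *bq, long blkno, bool ordered)
{
    if (ordered)
        bq->car++;              /* barrier starts its own car */

    sreq_t r = { blkno, bq->car };
    size_t i = bq->n;

    while (i > 0 && bq->q[i - 1].car == r.car &&
        bq->q[i - 1].blkno > r.blkno) {
        bq->q[i] = bq->q[i - 1];
        i--;
    }
    bq->q[i] = r;
    bq->n++;

    if (ordered)
        bq->car++;              /* later requests ride the next car */
}
```

Dequeuing from the front then naturally yields all requests of one car (sorted) before any request of the next, which is the ordering the periph start routine sees.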
>> >And right now, the ggate protocol (from what I remember) doesn't have
>> >a way to know when the remote kernel has received notification that an
>> >IO is complete.
>> A G_GATE_CMD_START write request will be sent to the remote system and
>> issued as a pwrite(2), then an acknowledgement packet will be returned
>> and passed back to the local kernel via G_GATE_CMD_DONE.  There's no
>> support for BIO_FLUSH or BIO_ORDERED, so there's no way for the local
>> kernel to know when the write has been written to non-volatile store.
> That's unfortunate. UFS can work around the BIO_ORDERED problem with
> a simple setting, but not the BIO_FLUSH problem.
>> >> I've done some experiments and OpenZFS doesn't generate BIO_ORDERED
>> >> operations so I've also raised
>> >> I haven't looked into how difficult that would be to fix.
>> Unrelated to the above but for completeness:  OpenZFS avoids the need
>> for BIO_ORDERED by not issuing additional I/Os until previous I/Os have
>> been retired when ordering is important.  (It does rely on BIO_FLUSH).
> To be clear: OpenZFS won't schedule new I/Os until the BIO_FLUSH it sends
> down w/o the BIO_ORDERED flag completes, right? The parenthetical confuses
> me on how to parse it: does ZFS depend on BIO_FLUSH completing with all
> blocks flushed to stable media, or on BIO_FLUSH being strongly ordered
> relative to other commands? I think you mean the former, but want to make
> sure.
> The root of this problem, I think, is the following:
>      % man 9 bio
>      No manual entry for bio


> I think I'll have to massage this email into an appropriate man page.
> At the very least, I should turn some/all of the above into a blog post :)
> Warner


The above sentence is WONDERFUL ...

In some of my messages, I have been suggesting that we:

- Make the Handbook parts and man pages a "blog" system,
- Attach the related messages to these parts,
- Relay comments/questions about these pages to the mailing lists,
- After a while, or at suitable times, move the "knowledge" (meaning the
  "what to do") in these messages into the related parts.

My impression is that my ideas have not been very effective.

If the above sentence can "converge" to such a structure, it may be
really WONDERFUL ...

With my best wishes ,

Mehmet Erol Sanliturk