Trying to understand how aio(4), mount(8) async (and gjournal) relate

Wed Mar 25 05:50:40 PDT 2009

On Wed, 25 Mar 2009, Jan Mikkelsen wrote:

> [Jumping into a conversation on aio, async mounts, etc.]
>
> I have had a few questions for a while that I haven't asked yet; this seems
> like an appropriate time to ask them!
>
> Is it reasonable to open a file with O_FSYNC and then use aio_write() to 
> issue multiple writes, and then assume that the data is on disk when the aio 
> completes?

I know very little about aio, but looking at the sources seems to show that
O_FSYNC (or mounting with the sync option) just defeats the asyncness of
aio.  aio seems to use only fo_write() for writing, so at lower (file
system) levels, O_FSYNC has the same behaviour as for write(2) -- it syncs
the i/o at the end of the call in the usual case where fo_write = vn_write.

> Can I get I/O parallelism using this approach?

Apparently not.

> I recall reading (some time 
> ago) that FreeBSD doesn't do I/O parallelism on a single file descriptor.  Is 
> that true?  Do I need to open the file multiple times in order to get I/O 
> parallelism?

The fs part of vn_write() is serialized, now using the exclusive vnode lock.
The code is essentially:

 	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
 	error = VOP_WRITE(...);		/* this soon reaches foofs_write() */
 	VOP_UNLOCK(vp);

In the usual case without O_FSYNC, foofs should try to only schedule
the i/o (by writing it to the buffer cache and not waiting), so that
the actual i/o is done in parallel later.  However, foofs might need
to do some physical input in order to tell where to write (e.g., reading
indirect block(s)) or some physical output of metadata needed for
consistency (e.g., writing indirect blocks), and any such i/o is
serialized.  (I think most file systems avoid writing to the inode on
every foofs_write(), though not doing so requires tricks to maintain
consistency.  No tricks seem to be available for indirect blocks, so
ffs without soft updates always writes them synchronously (except in
my version where the async mount option actually works for indirect
blocks).)

O_FSYNC should cause almost all writes related to the file to be
synced at the end of foofs_write().  Thus it forces all i/o to be
serialized.
Some exceptions to "all":
- at least in ffs, bitmap blocks are not synced.  This is safe since
   fsck can always recover bitmap blocks.
- at least in ffs, directories above the file are not synced by fsync()
   for the file.  This is normally harmless because critical directory
   operations are normally synchronous (or ordered relative to everything
   including related file operations in the case of soft updates), and
   fsync() is not specified to do this (?), but perhaps careful
   applications should fsync() all the directories too, and with the
   async mount option, even the most critical directory operation
   (creation of the file's directory entry) is asynchronous (except
   bugs make it partly synchronous).
- at least in ffs, with the async mount option, fsync() is more broken than
   it should be broken -- it syncs everything except for the most critical
   metadata (the inode) and directories above the file.
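
The "fsync() the directories too" idea above can be sketched as follows.
This is only an illustration: durable_create() is a hypothetical helper,
not something from ffs or libc, and it assumes the file system lets you
open a directory read-only and fsync() it.  It syncs only the immediate
parent directory, and it cannot work around the async mount breakage
described above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create filepath (which must live in dirpath),
 * write len bytes to it, then fsync() the file and its directory.
 * Returns 0 on success, -1 on failure with errno set.
 */
int
durable_create(const char *dirpath, const char *filepath,
    const void *buf, size_t len)
{
	const char *p = buf;
	int dfd, fd, saved;

	fd = open(filepath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		return (-1);
	/* write(2) may be short; loop until everything is handed over. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n == -1) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += n;
		len -= (size_t)n;
	}
	/* Sync the file's data and inode. */
	if (fsync(fd) == -1)
		goto fail;
	if (close(fd) == -1)
		return (-1);

	/* Sync the directory that holds the new entry. */
	dfd = open(dirpath, O_RDONLY);
	if (dfd == -1)
		return (-1);
	if (fsync(dfd) == -1) {
		saved = errno;
		close(dfd);
		errno = saved;
		return (-1);
	}
	return (close(dfd));

fail:
	saved = errno;
	close(fd);
	errno = saved;
	return (-1);
}
```

A caller would use something like durable_create("/var/db/app",
"/var/db/app/index", buf, len), where both paths are made up for the
example.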

> You can see where I'm going with this:  What I'd really like to do is open a 
> file with O_FSYNC | O_DIRECT | O_EXCL, and then do lots of aio operations on 
> it using chunks that are a multiple of the page size with buffers that are
> aligned on page boundaries.  I'd like to know when aio writes are "really" 
> complete to maintain various kinds of on-disk structures (eg. b-trees).  I'd 
> also like to avoid calling fsync(2).

Calling fsync() or aio_fsync() seems to be necessary.  More
global options like the sync mount flag and O_FSYNC don't provide
enough control.  I can't find any aio interfaces to select or poll for
completion.  It seems to have only aio_return() to test for completion,
with the possibly unwanted side effect of doing the completion if
possible.  I don't trust aio_return() to test that _all_ the things
that would be done by the file system for fsync(2) have been done.
aio_fsync() ensures doing these things by calling the file system
(VOP_FSYNC()), but aio_return() doesn't seem to go near the file system.

BTW, I just remembered that there is no mount option or file flag to
give fully sync metadata.  At least in ffs, all inode-change operations
(chmod(), chown(), fchmod(), fchown(), etc.) are async, irrespective of
mount options and O_FSYNC.  It takes a syscall calling VOP_FSYNC() or
an unrelated inode update to sync the metadata for these operations.

Bruce