Trying to understand how aio(4), mount(8) async (and gjournal) relate

Wed Mar 25 10:22:13 PDT 2009

Bruce Evans wrote:
> On Wed, 25 Mar 2009, Jan Mikkelsen wrote:
> 
>> [Jumping into a conversation on aio, async mounts, etc.]
>>
>> I have had a few questions for a while that I haven't asked yet; this
>> seems like an appropriate time to ask them!
>>
>> Is it reasonable to open a file with O_FSYNC and then use aio_write() 
>> to issue multiple writes, and then assume that the data is on disk 
>> when the aio completes?
> 
> I know very little about aio, but looking at the sources seems to show that
> O_FSYNC (or mounting with the sync option) just defeats the asyncness of
> aio.  aio seems to use only fo_write() for writing, so at lower (file
> system) levels, O_FSYNC has the same behaviour as for write(2) -- it syncs
> the i/o at the end of the call in the usual case where fo_write = vn_write.
> 
>> Can I get I/O parallelism using this approach?
> 
> Apparently not.
> 
>> I recall reading (some time ago) that FreeBSD doesn't do I/O 
>> parallelism on a single file descriptor.  Is that true?  Do I need to 
>> open the file multiple times in order to get I/O parallelism?
> 
> The fs part of vn_write() is serialized, now using the exclusive vnode
> lock.  The code is essentially:
> 
>     vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>     error = VOP_WRITE(...);        /* this soon reaches foofs_write() */
>     VOP_UNLOCK(vp);
> 
> In the usual case without O_FSYNC, foofs should try to only schedule
> the i/o (by writing it to the buffer cache and not waiting), so that
> the actual i/o is done in parallel later.  However, foofs might need
> to do some physical input in order to tell where to write (e.g., reading
> indirect block(s)) or some physical output of metadata needed for
> consistency (e.g., writing indirect blocks), and any such i/o is
> serialized.  (I think most file systems avoid writing to the inode on
> every foofs_write(), though not doing so requires tricks to maintain
> consistency.  No tricks seem to be available for indirect blocks, so
> ffs without soft updates always writes them synchronously (except in
> my version where the async mount option actually works for indirect
> blocks).)
> 
> O_FSYNC should cause almost all writes related to the file to be
> synced at the end of foofs_write().  Thus it forces all i/o to be
> serialized.
> Some exceptions to "all":
> - at least in ffs, bitmap blocks are not synced.  This is safe since
>   fsck can always recover bitmap blocks.
> - at least in ffs, directories above the file are not synced by fsync()
>   for the file.  This is normally harmless because critical directory
>   operations are normally synchronous (or ordered relative to everything
>   including related file operations in the case of soft updates), and
>   fsync() is not specified to do this (?), but perhaps careful
>   applications should fsync() all the directories too, and with the
>   async mount option, even the most critical directory operation
>   (creation of the file's directory entry) is asynchronous (except
>   bugs make it partly synchronous).
> - at least in ffs, with the async mount option, fsync() is more broken than
>   it should be broken -- it syncs everything except for the most critical
>   metadata (the inode) and directories above the file.
> 
>> You can see where I'm going with this:  What I'd really like to do is 
>> open a file with O_FSYNC | O_DIRECT | O_EXCL, and then do lots of aio 
>> operations on it using chunks that are a multiple of the page size with
>> buffers that are aligned on page boundaries.  I'd like to know when 
>> aio writes are "really" complete to maintain various kinds of on-disk 
>> structures (e.g. b-trees).  I'd also like to avoid calling fsync(2).
> 
> Calling fsync() or aio_waitcomplete() seems to be necessary.  More
> global options like the sync mount flag and O_FSYNC don't provide
> enough control.  I can't find any aio interfaces to select or poll for
> completion.

aio does have a comprehensive interface with kqueue.
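
A minimal sketch of what that can look like, for illustration only.  The
helper name aio_write_wait() is hypothetical.  On FreeBSD the request asks
for an EVFILT_AIO event through SIGEV_KEVENT and the caller blocks in
kevent(2); on other systems the sketch falls back to plain aio_suspend(),
which is an assumption made only so that it builds elsewhere.  Note that
completion here means the aio request finished, not that everything fsync(2)
would push out is on disk.

```c
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/event.h>
#include <sys/time.h>
#endif

/*
 * Hypothetical helper: queue one aio_write() and block until it completes.
 * Returns the number of bytes written, or -1 with errno set.
 */
static ssize_t
aio_write_wait(int fd, void *buf, size_t len, off_t off)
{
	struct aiocb cb;
	const struct aiocb *list[1] = { &cb };
	int kq = -1, err;
	ssize_t n;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;
	cb.aio_sigevent.sigev_notify = SIGEV_NONE;

#if defined(__FreeBSD__)
	/* Ask the kernel to post an EVFILT_AIO event on kq at completion. */
	if ((kq = kqueue()) == -1)
		return (-1);
	cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
	cb.aio_sigevent.sigev_notify_kqueue = kq;
	cb.aio_sigevent.sigev_value.sival_ptr = &cb;	/* comes back as udata */
#endif

	if (aio_write(&cb) == -1) {
		err = errno;
		if (kq != -1)
			close(kq);
		errno = err;
		return (-1);
	}

#if defined(__FreeBSD__)
	{
		/* Sleep until the completion event arrives. */
		struct kevent ev;
		(void)kevent(kq, NULL, 0, &ev, 1, NULL);
	}
#endif

	/* Portable path, and a safety net if kevent() was interrupted. */
	while (aio_error(&cb) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);

	err = aio_error(&cb);
	if (err == -1)
		err = errno;
	n = aio_return(&cb);		/* reap the request exactly once */
	if (kq != -1)
		close(kq);
	if (err != 0) {
		errno = err;
		return (-1);
	}
	return (n);
}
```

A real program would keep one kqueue open, queue many requests against it,
and match each returned event to its request through the udata pointer
instead of creating a kqueue per write as this sketch does.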

>  It seems to have only aio_return() to test for completion,
> with the possibly unwanted side effect of doing the completion if
> possible.  I don't trust aio_return() to test that _all_ the things
> that would be done by the file system for fsync(2) have been done.
> aio_waitcomplete ensures doing these things by calling the file system
> (VOP_FSYNC()), but aio_return() doesn't seem to go near the file system.
> 
> BTW, I just remembered that there is no mount option or file flag to
> give fully sync metadata.  At least in ffs, all inode-change operations
> (chmod(), chown(), fchmod(), fchown(), etc.) are async, irrespective of
> mount options and O_FSYNC.  It takes a syscall calling VOP_FSYNC() or
> an unrelated inode update to sync the metadata for these operations.
> 
> Bruce
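
One way to act on the "an explicit sync is necessary" conclusion without
O_FSYNC, sketched under stated assumptions: queue the writes with
aio_write() so that they can overlap, collect them, and then queue a single
POSIX aio_fsync(O_SYNC) as the durability point.  aio_fsync() is not
mentioned in the thread; it is swapped in here as the asynchronous
counterpart of fsync(2).  The names write_chunks_durably() and wait_done()
and the MAX_CHUNKS limit are hypothetical.  This does not sync the
directories above the file, and it cannot be better than fsync(2) itself on
an async mount.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Block until one request leaves EINPROGRESS; return its final status. */
static int
wait_done(struct aiocb *cb)
{
	const struct aiocb *list[1] = { cb };
	int err;

	while ((err = aio_error(cb)) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);
	return (err == -1 ? errno : err);
}

/*
 * Hypothetical helper: write nchunks chunks of size chunk from buf to fd,
 * starting at offset 0, then sync the file.  Returns 0 or -1 with errno set.
 */
static int
write_chunks_durably(int fd, char *buf, size_t chunk, int nchunks)
{
	enum { MAX_CHUNKS = 16 };	/* arbitrary limit for the sketch */
	struct aiocb cbs[MAX_CHUNKS], sync_cb;
	int i, queued = 0, err = 0;

	if (nchunks < 1 || nchunks > MAX_CHUNKS) {
		errno = EINVAL;
		return (-1);
	}

	/* Queue all writes up front so the kernel may overlap them. */
	memset(cbs, 0, sizeof(cbs));
	for (i = 0; i < nchunks; i++) {
		cbs[i].aio_fildes = fd;
		cbs[i].aio_buf = buf + (size_t)i * chunk;
		cbs[i].aio_nbytes = chunk;
		cbs[i].aio_offset = (off_t)i * (off_t)chunk;
		cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
		if (aio_write(&cbs[i]) == -1) {
			err = errno;
			break;
		}
		queued++;
	}

	/* Reap every queued write, even after an error, before returning. */
	for (i = 0; i < queued; i++) {
		int e = wait_done(&cbs[i]);
		ssize_t n = aio_return(&cbs[i]);

		if (e != 0) {
			if (err == 0)
				err = e;
		} else if ((size_t)n != chunk && err == 0)
			err = EIO;	/* short write */
	}
	if (err != 0) {
		errno = err;
		return (-1);
	}

	/* One sync request for the file; wait for it like any other request. */
	memset(&sync_cb, 0, sizeof(sync_cb));
	sync_cb.aio_fildes = fd;
	sync_cb.aio_sigevent.sigev_notify = SIGEV_NONE;
	if (aio_fsync(O_SYNC, &sync_cb) == -1)
		return (-1);
	err = wait_done(&sync_cb);
	(void)aio_return(&sync_cb);
	if (err != 0) {
		errno = err;
		return (-1);
	}
	return (0);
}
```

The sketch blocks while waiting, which keeps it short; a program that
maintains something like a b-tree would more likely do other work while the
sync request is in flight and only update its on-disk pointers once the
sync has completed.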