Trying to understand how aio(4), mount(8) async (and gjournal) relate
Julian Elischer
julian at elischer.org
Wed Mar 25 10:22:13 PDT 2009
Bruce Evans wrote:
> On Wed, 25 Mar 2009, Jan Mikkelsen wrote:
>
>> [Jumping into a conversation on aio, async mounts, etc.]
>>
>> I have had a few questions for a while that I haven't asked yet; this
>> seems like an appropriate time to ask them!
>>
>> Is it reasonable to open a file with O_FSYNC and then use aio_write()
>> to issue multiple writes, and then assume that the data is on disk
>> when the aio completes?
>
> I know very little about aio, but looking at the sources seems to show that
> O_FSYNC (or mounting with the sync option) just defeats the asyncness of
> aio. aio seems to use only fo_write() for writing, so at lower (file
> system) levels, O_FSYNC has the same behaviour as for write(2) -- it syncs
> the i/o at the end of the call in the usual case where fo_write = vn_write.
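
A minimal sketch of the pattern being asked about, using only the POSIX aio calls aio_write(), aio_suspend(), aio_error() and aio_return(). The helper name write_sync_aio() is hypothetical and not from this thread, and the O_SYNC fallback is an assumption for systems that lack the BSD spelling O_FSYNC. If the reading of the sources above is right, the request completes only after the synchronous write has finished, so nothing is gained in parallelism:

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_FSYNC
#define O_FSYNC O_SYNC  /* assumption: O_FSYNC is the BSD name for O_SYNC */
#endif

/*
 * Hypothetical helper: write len bytes at offset 0 of path with aio_write()
 * on a descriptor opened with O_FSYNC.  Returns the byte count or -1.
 */
static ssize_t
write_sync_aio(const char *path, const void *buf, size_t len)
{
    struct aiocb cb;
    const struct aiocb *list[1];
    ssize_t n;
    int fd, err;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_FSYNC, 0644);
    if (fd == -1)
        return (-1);

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_write(&cb) == -1) {
        close(fd);
        return (-1);
    }

    /* Sleep until the request is no longer in progress. */
    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        (void)aio_suspend(list, 1, NULL);

    err = aio_error(&cb);   /* 0 on success, else an errno value */
    n = aio_return(&cb);    /* must be called exactly once per request */
    close(fd);
    if (err != 0) {
        errno = err;
        return (-1);
    }
    return (n);
}
```

A caller would use it as write_sync_aio("/tmp/example.dat", data, len) and compare the result with len.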
>
>> Can I get I/O parallelism using this approach?
>
> Apparently not.
>
>> I recall reading (some time ago) that FreeBSD doesn't do I/O
>> parallelism on a single file descriptor. Is that true? Do I need to
>> open the file multiple times in order to get I/O parallelism?
>
> The fs part of vn_write() is serialized, now using the exclusive vnode
> lock. The code is essentially:
>
>     vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>     error = VOP_WRITE(...);   /* this soon reaches foofs_write() */
>     VOP_UNLOCK(vp);
>
> In the usual case without O_FSYNC, foofs should try to only schedule
> the i/o (by writing it to the buffer cache and not waiting), so that
> the actual i/o is done in parallel later. However, foofs might need
> to do some physical input in order to tell where to write (e.g., reading
> indirect block(s)) or some physical output of metadata needed for
> consistency (e.g., writing indirect blocks), and any such i/o is
> serialized. (I think most file systems avoid writing to the inode on
> every foofs_write(), though not doing so requires tricks to maintain
> consistency. No tricks seem to be available for indirect blocks, so
> ffs without soft updates always writes them synchronously (except in
> my version where the async mount option actually works for indirect
> blocks).)
>
> O_FSYNC should cause almost all writes related to the file to be
> synced at the end of foofs_write(). Thus it forces all i/o to be
> serialized.
> Some exceptions to "all":
> - at least in ffs, bitmap blocks are not synced. This is safe since
> fsck can always recover bitmap blocks.
> - at least in ffs, directories above the file are not synced by fsync()
> for the file. This is normally harmless because critical directory
> operations are normally synchronous (or ordered relative to everything
> including related file operations in the case of soft updates), and
> fsync() is not specified to do this (?), but perhaps careful
> applications should fsync() all the directories too, and with the
> async mount option, even the most critical directory operation
> (creation of the file's directory entry) is asynchronous (except
> that bugs make it partly synchronous).
> - at least in ffs, with the async mount option, fsync() is more broken than
> it should be -- it syncs everything except for the most critical
> metadata (the inode) and directories above the file.
>
>> You can see where I'm going with this: What I'd really like to do is
>> open a file with O_FSYNC | O_DIRECT | O_EXCL, and then do lots of aio
>> operations on it using chunks that are a multiple of the page size with
>> buffers that are aligned on page boundaries. I'd like to know when
>> aio writes are "really" complete to maintain various kinds of on-disk
>> structures (e.g. b-trees). I'd also like to avoid calling fsync(2).
>
> Calling fsync() or aio_waitcomplete() seems to be necessary. More
> global options like the sync mount flag and O_FSYNC don't provide
> enough control. I can't find any aio interfaces to select or poll for
> completion.
It does have a comprehensive interface with kqueue.
> It seems to have only aio_return() to test for completion,
> with the possibly unwanted side effect of doing the completion if
> possible. I don't trust aio_return() to test that _all_ the things
> that would be done by the file system for fsync(2) have been done.
> aio_waitcomplete ensures doing these things by calling the file system
> (VOP_FSYNC()), but aio_return() doesn't seem to go near the file system.
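
One way to combine the two ideas (several page-aligned writes in flight, then an explicit fsync()) is sketched below. It is not taken from the thread: the helper name write_chunks_then_fsync() and the constant NREQ are hypothetical, O_DIRECT and O_FSYNC are deliberately left out so that the writes are not forced to be synchronous, and completion is detected with the portable aio_error()/aio_suspend() pair rather than with kqueue:

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NREQ 4  /* hypothetical: number of writes kept in flight */

/*
 * Hypothetical helper: queue NREQ page-sized, page-aligned writes at
 * consecutive offsets, wait for all of them, then fsync() once.
 * Returns 0 on success, -1 on failure.
 */
static int
write_chunks_then_fsync(const char *path)
{
    struct aiocb cb[NREQ];
    const struct aiocb *one[1];
    void *buf[NREQ] = { NULL };
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    int fd, i, queued = 0, rc = 0;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return (-1);

    memset(cb, 0, sizeof(cb));
    for (i = 0; i < NREQ; i++) {
        if (posix_memalign(&buf[i], pagesz, pagesz) != 0) {
            rc = -1;
            break;
        }
        memset(buf[i], 'a' + i, pagesz);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf = buf[i];
        cb[i].aio_nbytes = pagesz;
        cb[i].aio_offset = (off_t)i * (off_t)pagesz;
        if (aio_write(&cb[i]) == -1) {
            rc = -1;
            break;
        }
        queued++;
    }

    /* Reap every queued request; the buffers must stay valid until then. */
    for (i = 0; i < queued; i++) {
        one[0] = &cb[i];
        while (aio_error(&cb[i]) == EINPROGRESS)
            (void)aio_suspend(one, 1, NULL);
        if (aio_return(&cb[i]) != (ssize_t)pagesz)
            rc = -1;
    }

    /* Only after the fsync() is the data treated as being on disk. */
    if (rc == 0 && fsync(fd) == -1)
        rc = -1;

    for (i = 0; i < NREQ; i++)
        free(buf[i]);
    close(fd);
    return (rc);
}
```

The single fsync() at the end trades per-write durability for parallelism; an application maintaining on-disk structures would place it at each point where ordering matters.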
>
> BTW, I just remembered that there is no mount option or file flag to
> give fully sync metadata. At least in ffs, all inode-change operations
> (chmod(), chown(), fchmod(), fchown(), etc.) are async, irrespective of
> mount options and O_FSYNC. It takes a syscall calling VOP_FSYNC() or
> an unrelated inode update to sync the metadata for these operations.
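
For the inode-change case, the "syscall calling VOP_FSYNC()" can simply be fsync(2) on the same file. A minimal sketch, with a hypothetical helper name chmod_durably() and the assumption that fsync() is accepted on a descriptor opened read-only:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: change the mode of path and then fsync() the file
 * so that the inode update is pushed out.  Returns 0 on success, -1 on
 * failure.
 */
static int
chmod_durably(const char *path, mode_t mode)
{
    int fd, rc = -1;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return (-1);
    if (fchmod(fd, mode) == 0 && fsync(fd) == 0)
        rc = 0;
    close(fd);
    return (rc);
}
```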
>
> Bruce