Strange IO performance with UFS

Bruce Evans brde at optusnet.com.au
Wed Jul 9 12:23:52 UTC 2014


On Tue, 8 Jul 2014, Don Lewis wrote:

> On  5 Jul, Konstantin Belousov wrote:
>> On Sat, Jul 05, 2014 at 06:18:07PM +0200, Roger Pau Monné wrote:
>
>>> As can be seen from the log above, at first the workload runs fine,
>>> and the disk is only performing writes, but at some point (in this
>>> case around 40% of completion) it starts performing this
>>> read-before-write dance that completely screws up performance.
>>
>> I reproduced this locally.  I think my patch is useless for the fio/4k write
>> situation.
>>
>> What happens is indeed related to the amount of available memory.
>> When the size of the file written by fio is larger than memory, the
>> system has to recycle the cached pages.  So after some point, a
>> write has to do read-before-write, and this does not happen at EOF
>> (since fio pre-allocated the job file).
>
> I reproduced this locally with dd if=/dev/zero bs=4k conv=notrunc ...
> For the small file case, if I flush the file from the cache by
> unmounting and remounting the filesystem where it resides, then I
> see lots of reads right from the start.

This seems to be related to kern/178997 ("Heavy disk I/O may hang system").
Test programs doing more complicated versions of conv=notrunc caused
even worse problems when run in parallel.  I lost track of what happened
with that.  I think kib committed a partial fix that doesn't apply to
the old version of FreeBSD that I use.
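
For reference, the kind of rewrite-in-place pattern I mean looks
roughly like this (a minimal hypothetical sketch, not one of the
actual test programs from the PR; the 4K write size is just an
example and the file size is assumed to be a multiple of it):

/*
 * Overwrite an existing file in place in small blocks without
 * truncating it (roughly what dd conv=notrunc does).  On a file
 * system whose block size is larger than BSIZE, every write is a
 * partial-block write, so the kernel must read the rest of the
 * block whenever it is not already cached.  Running several of
 * these in parallel on files larger than RAM makes things worse.
 */
#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define	BSIZE	4096			/* example small write size */

int
main(int argc, char **argv)
{
	struct stat sb;
	char buf[BSIZE];
	off_t off;
	int fd;

	if (argc != 2)
		errx(1, "usage: rewrite file");
	fd = open(argv[1], O_WRONLY);	/* no O_TRUNC: keep old contents */
	if (fd == -1)
		err(1, "open");
	if (fstat(fd, &sb) == -1)
		err(1, "fstat");
	memset(buf, 'x', sizeof(buf));
	for (off = 0; off < sb.st_size; off += BSIZE)
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
			err(1, "pwrite");
	return (0);
}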

>> In fact, I used a 10G file on an 8G machine, but I interrupted fio
>> before it finished the job.  The longer the previous job runs, the
>> longer the new job goes without issuing reads.  If I allow the job to
>> completely fill the cache, then the reads start immediately on the
>> next job run.
>>
>> I do not see how anything could be changed there, if we want to keep
>> the user's file content on partial block writes, and we do.
>
> About the only thing I can think of that might help is to trigger
> readahead when we detect sequential small writes.  We'll still have to
> do the reads, but hopefully they will be larger and occupy less time in
> the critical path.

ffs_balloc*() already uses cluster_write(), so sequential small writes
normally do at least 128K of readahead and you should rarely see the
4K reads (except with O_DIRECT?).

msdosfs is missing this readahead.  I never got around to sending
my patches for this to kib in the PR 178997 discussion.
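
From userland, one can also force the same effect explicitly: read
the region that is about to be partially overwritten in one large
sequential read, then do the small writes, so the data needed for
read-before-write is already cached and the reads happen as big
transfers outside the critical write path.  A hypothetical sketch
(the names, the 1M window and the 4K write size are all made up,
and the file is assumed to be at least one window long):

/*
 * Prefetch-then-write: one big read to warm the cache for a window,
 * then small overwrites within that window.
 */
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define	WINDOW	(1024 * 1024)		/* made-up prefetch window */
#define	BSIZE	4096			/* made-up small write size */

static void
rewrite_window(int fd, off_t off)
{
	static char rbuf[WINDOW];
	char wbuf[BSIZE];
	off_t o;

	/* One large read brings the window's blocks into the cache. */
	if (pread(fd, rbuf, sizeof(rbuf), off) == -1)
		err(1, "pread");
	memset(wbuf, 'x', sizeof(wbuf));
	for (o = off; o < off + WINDOW; o += BSIZE)
		if (pwrite(fd, wbuf, sizeof(wbuf), o) != (ssize_t)sizeof(wbuf))
			err(1, "pwrite");
}

int
main(int argc, char **argv)
{
	int fd;

	if (argc != 2)
		errx(1, "usage: prefetchwrite file");
	fd = open(argv[1], O_RDWR);
	if (fd == -1)
		err(1, "open");
	rewrite_window(fd, 0);		/* e.g. the first 1M of the file */
	return (0);
}

posix_fadvise(2) with POSIX_FADV_WILLNEED might start such a prefetch
asynchronously on systems where it is actually implemented; the plain
read is just the simplest way to sketch it.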

Here I see full clustering with 64K clusters on the old version of
FreeBSD, but my drive doesn't like going back and forth, so the writes
run 8 times slower than without the reads instead of only 2 times
slower.  (It's an old ATA drive with a ~1MB buffer, but it apparently
has dumb firmware, so seeking back just 64K is too much for it to
cache.)
I just remembered I have a newer SATA drive with a ~32MB buffer.  It
only goes 3 times slower.  The second drive is also on a not quite
so old version of FreeBSD that certainly doesn't have any workarounds
for PR 178997.  All file systems were mounted async, which shouldn't
affect this much.

> Writing a multiple of the filesystem blocksize is still the most
> efficient strategy.

Except when the filesystem block size is too large to be efficient.
The FreeBSD ffs default block size of 32K is slow for small files.
Fragments reduce its space wastage but interact badly with the
buffer cache.  Linux avoids some of these problems by using smaller
filesystem block sizes and not using fragments (at least in old
filesystems).
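
For applications that can choose their write size, asking the file
system what it wants is cheap.  A minimal FreeBSD-flavoured sketch
(the program name is made up; it uses f_iosize from statfs(2), which
for ffs is the block size, while f_bsize there is the fragment size):

/*
 * Write in multiples of the file system's preferred I/O size so that
 * every write covers whole blocks and no read-before-write is needed.
 */
#include <sys/param.h>
#include <sys/mount.h>

#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct statfs sfs;
	char *buf;
	size_t bsize;
	int fd;

	if (argc != 2)
		errx(1, "usage: blockwrite file");
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd == -1)
		err(1, "open");
	if (fstatfs(fd, &sfs) == -1)
		err(1, "fstatfs");
	bsize = sfs.f_iosize;	/* e.g. 32K for a default ffs */
	if ((buf = malloc(bsize)) == NULL)
		err(1, "malloc");
	memset(buf, 'x', bsize);
	/* Block-sized, block-aligned write: no partial blocks to read. */
	if (pwrite(fd, buf, bsize, 0) != (ssize_t)bsize)
		err(1, "pwrite");
	return (0);
}

On other systems, statvfs(3)'s f_bsize/f_frsize give roughly the same
information, with the usual caveats about what each system puts there.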

Bruce

