Strange IO performance with UFS
Julian Elischer
julian at freebsd.org
Sat Jul 5 14:24:45 UTC 2014
On 7/5/14, 7:24 PM, Konstantin Belousov wrote:
> On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monn? wrote:
>> On 05/07/14 11:58, Konstantin Belousov wrote:
>>> On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monn? wrote:
>>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3
>>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3
>>>> kernel`g_io_request+0x384 kernel`ufs_strategy+0x8a
>>>> kernel`VOP_STRATEGY_APV+0xf5 kernel`bufstrategy+0x46
>>>> kernel`cluster_read+0x5e6 kernel`ffs_balloc_ufs2+0x1be2
>>>> kernel`ffs_write+0x310 kernel`VOP_WRITE_APV+0x166
>>>> kernel`vn_write+0x2eb kernel`vn_io_fault_doio+0x22
>>>> kernel`vn_io_fault1+0x78 kernel`vn_io_fault+0x173
>>>> kernel`dofilewrite+0x85 kernel`kern_writev+0x65
>>>> kernel`sys_write+0x63
>>>>
>>>> This can also be seen by running iostat in parallel with the fio
>>>> workload:
>>>>
>>>> device r/s w/s kr/s kw/s qlen svc_t %b ada0
>>>> 243.3 233.7 31053.3 29919.1 31 57.4 100
>>>>
>>>> This clearly shows that even when I was doing a sequential write
>>>> (the fio workload shown above), the disk was actually reading
>>>> more data than writing it, which makes no sense, and all the
>>>> reads come from the path trace shown above.
>>> The backtrace above means that the BA_CLRBUF was specified for
>>> UFS_BALLOC(). In turns, this occurs when the write size is less
>>> than the UFS block size. UFS has to read the block to ensure that
>>> partial write does not corrupt the rest of the buffer.
>> Thanks for the clarification, that makes sense. I'm not opening the
>> file with O_DIRECT, so shouldn't the write be cached in memory and
>> flushed to disk when we have the full block? It's a sequential write,
>> so the whole block is going to be rewritten very soon.
>>
>>> You can get the block size for file with stat(2), st_blksize field
>>> of the struct stat, or using statfs(2), field f_iosize of struct
>>> statfs, or just looking at the dumpfs output for your filesystem,
>>> the bsize value. For modern UFS typical value is 32KB.
>> Yes, block size is 32KB, checked with dumpfs. I've changed the block
>> size in fio to 32k and then I get the expected results in iostat and fio:
>>
>> extended device statistics
>> device r/s w/s kr/s kw/s qlen svc_t %b
>> ada0 1.0 658.2 31.1 84245.1 58 108.4 101
>> extended device statistics
>> device r/s w/s kr/s kw/s qlen svc_t %b
>> ada0 0.0 689.8 0.0 88291.4 54 112.1 99
>> extended device statistics
>> device r/s w/s kr/s kw/s qlen svc_t %b
>> ada0 1.0 593.3 30.6 75936.9 80 111.7 97
>>
>> write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec
> The current code in ffs_write() only avoids read before write when
> write covers complete block. I think we can somewhat loose the test
> to also avoid read when we are at EOF and write covers completely
> the valid portion of the last block.
>
> This leaves the unwritten portion of the block with the garbage. I
> believe that it is not harmful, since the only way for usermode to
> access that garbage is through the mmap(2). The vnode_generic_getpages()
> zeroes out parts of the page which are after EOF.
I have vague memories of this being in a security bulletin once along
the lines of "random data disclosure" by making tons of 1 frag size
files and then mmapping them.
>
> Try this, almost completely untested:
>
> commit 30375741f5b15609e51cac5b242ecfe7d614e902
> Author: Konstantin Belousov <kib at freebsd.org>
> Date: Sat Jul 5 14:19:39 2014 +0300
>
> Do not do read-before-write if the written area completely covers
> the valid portion of the block at EOF.
>
> diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c
> index 423d811..b725932 100644
> --- a/sys/ufs/ffs/ffs_vnops.c
> +++ b/sys/ufs/ffs/ffs_vnops.c
> @@ -729,10 +729,12 @@ ffs_write(ap)
> vnode_pager_setsize(vp, uio->uio_offset + xfersize);
>
> /*
> - * We must perform a read-before-write if the transfer size
> - * does not cover the entire buffer.
> + * We must perform a read-before-write if the transfer
> + * size does not cover the entire buffer or the valid
> + * part of the last buffer for the file.
> */
> - if (fs->fs_bsize > xfersize)
> + if (fs->fs_bsize > xfersize && (blkoffset != 0 ||
> + uio->uio_offset + xfersize < ip->i_size))
> flags |= BA_CLRBUF;
> else
> flags &= ~BA_CLRBUF;
More information about the freebsd-fs
mailing list