Strange IO performance with UFS

Sat Jul 5 14:24:45 UTC 2014

On 7/5/14, 7:24 PM, Konstantin Belousov wrote:
> On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monn? wrote:
>> On 05/07/14 11:58, Konstantin Belousov wrote:
>>> On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monn? wrote:
>>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3
>>>> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3
>>>> kernel`g_io_request+0x384 kernel`ufs_strategy+0x8a
>>>> kernel`VOP_STRATEGY_APV+0xf5 kernel`bufstrategy+0x46
>>>> kernel`cluster_read+0x5e6 kernel`ffs_balloc_ufs2+0x1be2
>>>> kernel`ffs_write+0x310 kernel`VOP_WRITE_APV+0x166
>>>> kernel`vn_write+0x2eb kernel`vn_io_fault_doio+0x22
>>>> kernel`vn_io_fault1+0x78 kernel`vn_io_fault+0x173
>>>> kernel`dofilewrite+0x85 kernel`kern_writev+0x65
>>>> kernel`sys_write+0x63
>>>>
>>>> This can also be seen by running iostat in parallel with the fio
>>>> workload:
>>>>
>>>> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b ada0
>>>> 243.3 233.7 31053.3 29919.1   31  57.4 100
>>>>
>>>> This clearly shows that even when I was doing a sequential write
>>>> (the fio workload shown above), the disk was actually reading
>>>> more data than writing it, which makes no sense, and all the
>>>> reads come from the path trace shown above.
>>> The backtrace above means that the BA_CLRBUF was specified for
>>> UFS_BALLOC(). In turns, this occurs when the write size is less
>>> than the UFS block size. UFS has to read the block to ensure that
>>> partial write does not corrupt the rest of the buffer.
>> Thanks for the clarification, that makes sense. I'm not opening the
>> file with O_DIRECT, so shouldn't the write be cached in memory and
>> flushed to disk when we have the full block? It's a sequential write,
>> so the whole block is going to be rewritten very soon.
>>
>>> You can get the block size for file with stat(2), st_blksize field
>>> of the struct stat, or using statfs(2), field f_iosize of struct
>>> statfs, or just looking at the dumpfs output for your filesystem,
>>> the bsize value.  For modern UFS typical value is 32KB.
>> Yes, block size is 32KB, checked with dumpfs. I've changed the block
>> size in fio to 32k and then I get the expected results in iostat and fio:
>>
>>                          extended device statistics
>> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
>> ada0       1.0 658.2    31.1 84245.1   58 108.4 101
>>                          extended device statistics
>> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
>> ada0       0.0 689.8     0.0 88291.4   54 112.1  99
>>                          extended device statistics
>> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
>> ada0       1.0 593.3    30.6 75936.9   80 111.7  97
>>
>> write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec
> The current code in ffs_write() only avoids read before write when
> write covers complete block.  I think we can somewhat loose the test
> to also avoid read when we are at EOF and write covers completely
> the valid portion of the last block.
>
> This leaves the unwritten portion of the block with the garbage. I
> believe that it is not harmful, since the only way for usermode to
> access that garbage is through the mmap(2). The vnode_generic_getpages()
> zeroes out parts of the page which are after EOF.

I have vague memories of this being in a security bulletin once along 
the lines of "random data disclosure"  by making tons of 1 frag size 
files and then mmapping them.
>
> Try this, almost completely untested:
>
> commit 30375741f5b15609e51cac5b242ecfe7d614e902
> Author: Konstantin Belousov <kib at freebsd.org>
> Date:   Sat Jul 5 14:19:39 2014 +0300
>
>      Do not do read-before-write if the written area completely covers
>      the valid portion of the block at EOF.
>
> diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c
> index 423d811..b725932 100644
> --- a/sys/ufs/ffs/ffs_vnops.c
> +++ b/sys/ufs/ffs/ffs_vnops.c
> @@ -729,10 +729,12 @@ ffs_write(ap)
>   			vnode_pager_setsize(vp, uio->uio_offset + xfersize);
>   
>   		/*
> -		 * We must perform a read-before-write if the transfer size
> -		 * does not cover the entire buffer.
> +		 * We must perform a read-before-write if the transfer
> +		 * size does not cover the entire buffer or the valid
> +		 * part of the last buffer for the file.
>   		 */
> -		if (fs->fs_bsize > xfersize)
> +		if (fs->fs_bsize > xfersize && (blkoffset != 0 ||
> +		    uio->uio_offset + xfersize < ip->i_size))
>   			flags |= BA_CLRBUF;
>   		else
>   			flags &= ~BA_CLRBUF;