Strange IO performance with UFS

Konstantin Belousov kostikbel at gmail.com
Sat Jul 5 11:24:57 UTC 2014


On Sat, Jul 05, 2014 at 12:35:11PM +0200, Roger Pau Monn? wrote:
> On 05/07/14 11:58, Konstantin Belousov wrote:
> > On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monn? wrote:
> >> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3 
> >> kernel`g_io_request+0x384 kernel`g_part_start+0x2c3 
> >> kernel`g_io_request+0x384 kernel`ufs_strategy+0x8a 
> >> kernel`VOP_STRATEGY_APV+0xf5 kernel`bufstrategy+0x46 
> >> kernel`cluster_read+0x5e6 kernel`ffs_balloc_ufs2+0x1be2 
> >> kernel`ffs_write+0x310 kernel`VOP_WRITE_APV+0x166 
> >> kernel`vn_write+0x2eb kernel`vn_io_fault_doio+0x22 
> >> kernel`vn_io_fault1+0x78 kernel`vn_io_fault+0x173 
> >> kernel`dofilewrite+0x85 kernel`kern_writev+0x65 
> >> kernel`sys_write+0x63
> >> 
> >> This can also be seen by running iostat in parallel with the fio
> >> workload:
> >> 
> >> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b ada0
> >> 243.3 233.7 31053.3 29919.1   31  57.4 100
> >> 
> >> This clearly shows that even when I was doing a sequential write
> >> (the fio workload shown above), the disk was actually reading
> >> more data than writing it, which makes no sense, and all the
> >> reads come from the path trace shown above.
> > 
> > The backtrace above means that the BA_CLRBUF was specified for
> > UFS_BALLOC(). In turns, this occurs when the write size is less
> > than the UFS block size. UFS has to read the block to ensure that
> > partial write does not corrupt the rest of the buffer.
> 
> Thanks for the clarification, that makes sense. I'm not opening the
> file with O_DIRECT, so shouldn't the write be cached in memory and
> flushed to disk when we have the full block? It's a sequential write,
> so the whole block is going to be rewritten very soon.
> 
> > 
> > You can get the block size for file with stat(2), st_blksize field
> > of the struct stat, or using statfs(2), field f_iosize of struct
> > statfs, or just looking at the dumpfs output for your filesystem,
> > the bsize value.  For modern UFS typical value is 32KB.
> 
> Yes, block size is 32KB, checked with dumpfs. I've changed the block
> size in fio to 32k and then I get the expected results in iostat and fio:
> 
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       1.0 658.2    31.1 84245.1   58 108.4 101
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       0.0 689.8     0.0 88291.4   54 112.1  99
>                         extended device statistics
> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
> ada0       1.0 593.3    30.6 75936.9   80 111.7  97
> 
> write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec

The current code in ffs_write() only avoids read before write when
write covers complete block.  I think we can somewhat loose the test
to also avoid read when we are at EOF and write covers completely
the valid portion of the last block.

This leaves the unwritten portion of the block with the garbage. I
believe that it is not harmful, since the only way for usermode to
access that garbage is through the mmap(2). The vnode_generic_getpages()
zeroes out parts of the page which are after EOF.

Try this, almost completely untested:

commit 30375741f5b15609e51cac5b242ecfe7d614e902
Author: Konstantin Belousov <kib at freebsd.org>
Date:   Sat Jul 5 14:19:39 2014 +0300

    Do not do read-before-write if the written area completely covers
    the valid portion of the block at EOF.

diff --git a/sys/ufs/ffs/ffs_vnops.c b/sys/ufs/ffs/ffs_vnops.c
index 423d811..b725932 100644
--- a/sys/ufs/ffs/ffs_vnops.c
+++ b/sys/ufs/ffs/ffs_vnops.c
@@ -729,10 +729,12 @@ ffs_write(ap)
 			vnode_pager_setsize(vp, uio->uio_offset + xfersize);
 
 		/*
-		 * We must perform a read-before-write if the transfer size
-		 * does not cover the entire buffer.
+		 * We must perform a read-before-write if the transfer
+		 * size does not cover the entire buffer or the valid
+		 * part of the last buffer for the file.
 		 */
-		if (fs->fs_bsize > xfersize)
+		if (fs->fs_bsize > xfersize && (blkoffset != 0 ||
+		    uio->uio_offset + xfersize < ip->i_size))
 			flags |= BA_CLRBUF;
 		else
 			flags &= ~BA_CLRBUF;
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20140705/363827db/attachment.sig>


More information about the freebsd-fs mailing list