Strange IO performance with UFS

Roger Pau Monné roger.pau at citrix.com
Sat Jul 5 10:35:18 UTC 2014


On 05/07/14 11:58, Konstantin Belousov wrote:
> On Sat, Jul 05, 2014 at 11:32:06AM +0200, Roger Pau Monné wrote:
>> On 04/07/14 23:19, Stefan Parvu wrote:
>>> Hi,
>>> 
>>>>> I'm doing some tests on IO performance using fio, and I've
>>>>> found something weird when using UFS and large files. I
>>>>> have the following very simple sequential fio workload:
>>> 
>>> System: FreeBSD ox 10.0-RELEASE-p6 FreeBSD 10.0-RELEASE-p6 #0:
>>> Tue Jun 24 07:47:37 UTC 2014 
>>> root at amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
>>> amd64
>>> 
>>> 
>>> 1. Seq Write to 1 file, 10GB size, single writer, block 4k,
>>> UFS2:
>>> 
>>> I tried a sequential write with a single writer, using an IOSIZE
>>> similar to your example (10 GB), to a 14TB hardware RAID 10 LSI
>>> device, using fio 2.1.9 under FreeBSD 10.0.
>>> 
>>> Result: Run status group 0 (all jobs): WRITE: io=10240MB,
>>> aggrb=460993KB/s, minb=460993KB/s, maxb=460993KB/s, 
>>> mint=22746msec, maxt=22746msec
>> 
>> This looks much better than what I've seen in my benchmarks; how
>> much memory does the system have?
>> 
>> In my case I've seen the read issue when trying to write to files
>> that were larger than the memory in the system. My box has 6GB of
>> RAM and I was using a 10GB file.
>> 
>>> 
>>> 
>>> 2. Seq Write to 2500 files, each file 5MB size, multiple
>>> writers, UFS2:
>>> 
>>> Result: Run status group 0 (all jobs): WRITE: io=12500MB,
>>> aggrb=167429KB/s, minb=334KB/s, maxb=9968KB/s, mint=2568msec,
>>> maxt=76450msec
>>> 
>>> Questions:
>>> 
>>> - where are you writing, what storage: hardware or software RAID?
>> 
>> The storage is a simple SATA disk, no RAID:
>> 
>> pass0 at ahcich0 bus 0 scbus0 target 0 lun 0
>> pass0: <ST500DM002-1BD142 KC45> ATA-8 SATA 3.x device
>> pass0: Serial Number Z3T3FJXL
>> pass0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
>> pass0: Command Queueing enabled
>> 
>>> - are you using time-based fio tests?
>> 
>> I'm using the following fio workload, as stated in the first
>> email:
>> 
>> [global]
>> rw=write
>> size=4g
>> bs=4k
>> 
>> [job1]
>> 
>> The problem doesn't seem to be related to the hardware (I've also
>> seen this when running inside a VM), but to UFS itself, which at
>> some point (or maybe under certain conditions) starts issuing a lot
>> of reads while servicing a simple write:
>> 
>> kernel`g_io_request+0x384
>> kernel`g_part_start+0x2c3
>> kernel`g_io_request+0x384
>> kernel`g_part_start+0x2c3
>> kernel`g_io_request+0x384
>> kernel`ufs_strategy+0x8a
>> kernel`VOP_STRATEGY_APV+0xf5
>> kernel`bufstrategy+0x46
>> kernel`cluster_read+0x5e6
>> kernel`ffs_balloc_ufs2+0x1be2
>> kernel`ffs_write+0x310
>> kernel`VOP_WRITE_APV+0x166
>> kernel`vn_write+0x2eb
>> kernel`vn_io_fault_doio+0x22
>> kernel`vn_io_fault1+0x78
>> kernel`vn_io_fault+0x173
>> kernel`dofilewrite+0x85
>> kernel`kern_writev+0x65
>> kernel`sys_write+0x63
>> 
>> This can also be seen by running iostat in parallel with the fio
>> workload:
>> 
>> device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
>> ada0     243.3 233.7 31053.3 29919.1   31  57.4 100
>> 
>> This clearly shows that even though I was doing a sequential write
>> (the fio workload shown above), the disk was actually reading more
>> data than it was writing, which makes no sense to me, and all of the
>> reads come from the code path shown in the trace above.
> 
> The backtrace above means that BA_CLRBUF was specified for
> UFS_BALLOC(). In turn, this occurs when the write size is less than
> the UFS block size. UFS has to read the block to ensure that the
> partial write does not corrupt the rest of the buffer.

Thanks for the clarification, that makes sense. I'm not opening the
file with O_DIRECT, so shouldn't the write be cached in memory and
flushed to disk when we have the full block? It's a sequential write,
so the whole block is going to be rewritten very soon.
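
Just to check that I understand the mechanism, here is a rough
userspace sketch of the idea (illustration only, not the actual
UFS_BALLOC()/BA_CLRBUF code; the write_within_block() helper and the
hard-coded 32KB block size are made up for the example):

#include <sys/types.h>
#include <string.h>
#include <unistd.h>

#define FS_BLOCK_SIZE 32768	/* assume a 32KB UFS block, as per dumpfs */

/*
 * Write len bytes at offset off within block blkno of fd, assuming
 * off + len <= FS_BLOCK_SIZE.
 */
static ssize_t
write_within_block(int fd, off_t blkno, size_t off, const char *buf,
    size_t len)
{
	char block[FS_BLOCK_SIZE];
	off_t blkoff = blkno * FS_BLOCK_SIZE;

	if (off == 0 && len == FS_BLOCK_SIZE) {
		/*
		 * The whole block is overwritten: the old contents are
		 * irrelevant, so nothing needs to be read from disk.
		 */
		memcpy(block, buf, len);
	} else {
		/*
		 * Partial write: the rest of the block must be preserved,
		 * so the old contents are read in first -- these are the
		 * extra reads showing up in iostat with bs=4k.
		 */
		if (pread(fd, block, sizeof(block), blkoff) < 0)
			return (-1);
		memcpy(block + off, buf, len);
	}
	return (pwrite(fd, block, sizeof(block), blkoff));
}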

> 
> You can get the block size for a file with stat(2) (the st_blksize
> field of struct stat), with statfs(2) (the f_iosize field of struct
> statfs), or just by looking at the dumpfs output for your
> filesystem (the bsize value).  For modern UFS the typical value is 32KB.

Yes, the block size is 32KB, checked with dumpfs. I've changed the
block size in fio to 32k, and now I get the expected results in iostat
and fio:

                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       1.0 658.2    31.1 84245.1   58 108.4 101
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       0.0 689.8     0.0 88291.4   54 112.1  99
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
ada0       1.0 593.3    30.6 75936.9   80 111.7  97

write: io=10240MB, bw=81704KB/s, iops=2553, runt=128339msec
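
For completeness, a small sketch of checking those values
programmatically with stat(2) and statfs(2), as suggested above (the
"testfile" path is just a placeholder):

#include <sys/param.h>
#include <sys/mount.h>
#include <sys/stat.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const char *path = "testfile";	/* example path */
	struct stat sb;
	struct statfs fsb;

	if (stat(path, &sb) != 0)
		err(1, "stat(%s)", path);
	if (statfs(path, &fsb) != 0)
		err(1, "statfs(%s)", path);

	/* Preferred I/O size for the file and for its filesystem. */
	printf("st_blksize: %jd\n", (intmax_t)sb.st_blksize);
	printf("f_iosize:   %ju\n", (uintmax_t)fsb.f_iosize);
	/*
	 * Writing in multiples of this size (32KB here) avoids the
	 * read-before-write behaviour seen with bs=4k.
	 */
	return (0);
}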

Roger.

