ZFS slow reads for unallocated blocks

Matthew Ahrens mahrens at delphix.com
Fri Apr 19 18:22:42 UTC 2013


Sorry I'm late to the game here, just saw this email now.

Yes, this is also a problem on illumos, though much less so on my system,
only about 2x.  It looks like the difference is due to the fact that the
zeroed dbufs are not cached, so we have to zero the entire dbuf (e.g. 128k)
for every read syscall (e.g. 8k).  Increasing the size of the reads to
match the recordsize results in performance parity between reading cached
data and sparse zeros.

You can see this behavior in the following dtrace, which shows that we are
initializing the dbuf in dbuf_read_impl() as many times as we do syscalls:

sudo dtrace -n 'dbuf_read_impl:entry/pid==$target/{@[probefunc] = count()}'
-c 'dd if=t100m of=/dev/null bs=8k'
dtrace: description 'dbuf_read_impl:entry' matched 1 probe
12800+0 records in
12800+0 records out
dtrace: pid 29419 has exited

  dbuf_read_impl                                                12800
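
To put numbers on it: reading the 100MB file in 8k chunks takes 12800
syscalls, and each one re-zeroes a full 128k dbuf, so we end up bzero'ing
12800 * 128k = 1.6GB to return 100MB of zeros.  That 16x zeroing
amplification disappears once the read size matches the recordsize.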

--matt


On Sat, Apr 13, 2013 at 12:24 PM, Adam Nowacki <nowakpl at platinum.linux.pl> wrote:

> Including zfs at illumos on this. To recap:
>
> Reads from sparse files are slow, with speed proportional to the ratio of
> read size to filesystem recordsize. There is no physical disk I/O.
>
> # zfs create -o atime=off -o recordsize=128k -o compression=off -o
> sync=disabled -o mountpoint=/home/testfs home/testfs
> # dd if=/dev/random of=/home/testfs/random10m bs=1024k count=10
> # truncate -s 10m /home/testfs/trunc10m
> # dd if=/home/testfs/random10m of=/dev/null bs=512
> 10485760 bytes transferred in 0.078637 secs (133344041 bytes/sec)
> # dd if=/home/testfs/trunc10m of=/dev/null bs=512
> 10485760 bytes transferred in 1.011500 secs (10366544 bytes/sec)
>
> # zfs create -o atime=off -o recordsize=8M -o compression=off -o
> sync=disabled -o mountpoint=/home/testfs home/testfs
> # dd if=/home/testfs/random10m of=/dev/null bs=512
> 10485760 bytes transferred in 0.080430 secs (130371205 bytes/sec)
> # dd if=/home/testfs/trunc10m of=/dev/null bs=512
> 10485760 bytes transferred in 72.465486 secs (144700 bytes/sec)
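>
> The arithmetic is consistent: each 512-byte read zeroes one full record,
> so the 8M recordsize should cost about 64x more zeroing per syscall than
> 128k, and the measured sparse-read times scale accordingly: 72.5s vs
> 1.0s, roughly 72x.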
>
> This is from FreeBSD 9.1. A possible solution is at
> http://tepeserwery.pl/nowak/freebsd/zfs_sparse_optimization_v2.patch.txt
> - untested yet; the system will be busy building packages for a few more days.
>
>
> On 2013-04-13 19:11, Will Andrews wrote:
>
>> Hi,
>>
>> I think the idea of using a pre-zeroed region as the 'source' is a good
>> one, but probably it would be better to set a special flag on a hole
>> dbuf than to require caller flags.  That way, ZFS can lazily evaluate
>> the hole dbuf (i.e. avoid zeroing db_data until it has to).  However,
>> that could be complicated by the fact that there are many potential
>> users of hole dbufs that would want to write to the dbuf.
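>>
>> A minimal sketch of what that flag might look like (DB_HOLE_LAZY and
>> dbuf_materialize_hole() are made-up names for illustration, not actual
>> ZFS identifiers):
>>
>>     /* Hole found on the read path: tag the dbuf, don't bzero it yet. */
>>     static void
>>     dbuf_read_hole_lazy(dmu_buf_impl_t *db)
>>     {
>>             db->db_state = DB_HOLE_LAZY;   /* db_data left untouched */
>>     }
>>
>>     /*
>>      * Anyone about to dirty the buffer materializes the zeros first;
>>      * pure readers could instead copy from a shared pre-zeroed page.
>>      */
>>     static void
>>     dbuf_materialize_hole(dmu_buf_impl_t *db)
>>     {
>>             if (db->db_state == DB_HOLE_LAZY) {
>>                     bzero(db->db.db_data, db->db.db_size);
>>                     db->db_state = DB_CACHED;
>>             }
>>     }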
>>
>> This sort of optimization should be brought to the illumos zfs list.  As
>> it stands, your patch is also FreeBSD-specific, since 'zero_region' only
>> exists in vm/vm_kern.c.  Given the frequency of zero-copying, however,
>> it's quite possible there are other versions of this region elsewhere.
>>
>> --Will.
>>
>>
>> On Sat, Apr 13, 2013 at 6:04 AM, Adam Nowacki
>> <nowakpl at platinum.linux.pl> wrote:
>>
>>     Temporary dbufs are created for each missing (unallocated on disk)
>>     record, including indirects if the hole is large enough. Those dbufs
>>     never find their way into the ARC and are freed at the end of dmu_read_uio.
>>
>>     A small read (from a hole) would in the best case bzero 128KiB
>>     (one full recordsize; more if indirects are missing too) ... and I'm
>>     running a modified ZFS with record sizes up to 8MiB.
>>
>>     # zfs create -o atime=off -o recordsize=8M -o compression=off -o
>>     mountpoint=/home/testfs home/testfs
>>     # truncate -s 8m /home/testfs/trunc8m
>>     # dd if=/dev/zero of=/home/testfs/zero8m bs=8m count=1
>>     1+0 records in
>>     1+0 records out
>>     8388608 bytes transferred in 0.010193 secs (822987745 bytes/sec)
>>
>>     # time cat /home/testfs/trunc8m > /dev/null
>>     0.000u 6.111s 0:06.11 100.0%    15+2753k 0+0io 0pf+0w
>>
>>     # time cat /home/testfs/zero8m > /dev/null
>>     0.000u 0.010s 0:00.01 100.0%    12+2168k 0+0io 0pf+0w
>>
>>     600x increase in system time and close to 1MB/s - insanity.
>>
>>     The fix: a lot of the code needed to handle this efficiently was
>>     already there.
>>
>>     dbuf_hold_impl has an int fail_sparse argument that makes it return
>>     ENOENT for holes. I just had to get there, and then back to
>>     dmu_read_uio where zeroing can happen at byte granularity.
>>
>>     ... didn't have time to actually test it yet.
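>>
>>     Roughly the shape of the idea (a FreeBSD-flavored sketch only: the
>>     loop over records and error handling are omitted, and nbytes/bufoff
>>     stand for the span of the current record covered by the request and
>>     the offset within it):
>>
>>     err = dbuf_hold_impl(dn, 0, blkid, TRUE /* fail_sparse */, FTAG, &db);
>>     if (err == ENOENT) {
>>             /* Hole: feed zeros straight into the uio, no dbuf, no bzero. */
>>             err = uiomove(__DECONST(void *, zero_region),
>>                 MIN(nbytes, ZERO_REGION_SIZE), uio);
>>     } else if (err == 0) {
>>             /* Allocated block: copy out of the (cacheable) dbuf as usual. */
>>             err = uiomove((char *)db->db.db_data + bufoff, nbytes, uio);
>>             dbuf_rele(db, FTAG);
>>     }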
>>
>>
>>     On 2013-04-13 12:24, Andriy Gapon wrote:
>>
>>         on 13/04/2013 02:35 Adam Nowacki said the following:
>>
>>             http://tepeserwery.pl/nowak/freebsd/zfs_sparse_optimization.patch.txt
>>
>>             Does it look sane?
>>
>>
>>         It's hard to tell from a quick look since the change is not small.
>>         What is your idea of the problem and the fix?
>>
>>             On 2013-04-12 09:03, Andriy Gapon wrote:
>>
>>
>>                 ENOTIME to really investigate, but here is a basic
>>                 profile result for those
>>                 interested:
>>                                  kernel`bzero+0xa
>>                                  kernel`dmu_buf_hold_array_by_dnode+0x1cf
>>                                  kernel`dmu_read_uio+0x66
>>                                  kernel`zfs_freebsd_read+0x3c0
>>                                  kernel`VOP_READ_APV+0x92
>>                                  kernel`vn_read+0x1a3
>>                                  kernel`vn_io_fault+0x23a
>>                                  kernel`dofileread+0x7b
>>                                  kernel`sys_read+0x9e
>>                                  kernel`amd64_syscall+0x238
>>                                  kernel`0xffffffff80747e4b
>>
>>                 That's where > 99% of time is spent.
>>
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>


More information about the freebsd-fs mailing list