ZFS slow reads for unallocated blocks

Adam Nowacki nowakpl at platinum.linux.pl
Sat Apr 13 19:24:33 UTC 2013


Including zfs at illumos on this. To recap:

Reads from sparse files are slow, with speed proportional to the ratio of 
read size to filesystem recordsize. There is no physical disk I/O.

# zfs create -o atime=off -o recordsize=128k -o compression=off \
    -o sync=disabled -o mountpoint=/home/testfs home/testfs
# dd if=/dev/random of=/home/testfs/random10m bs=1024k count=10
# truncate -s 10m /home/testfs/trunc10m
# dd if=/home/testfs/random10m of=/dev/null bs=512
10485760 bytes transferred in 0.078637 secs (133344041 bytes/sec)
# dd if=/home/testfs/trunc10m of=/dev/null bs=512
10485760 bytes transferred in 1.011500 secs (10366544 bytes/sec)

# zfs create -o atime=off -o recordsize=8M -o compression=off \
    -o sync=disabled -o mountpoint=/home/testfs home/testfs
# dd if=/home/testfs/random10m of=/dev/null bs=512
10485760 bytes transferred in 0.080430 secs (130371205 bytes/sec)
# dd if=/home/testfs/trunc10m of=/dev/null bs=512
10485760 bytes transferred in 72.465486 secs (144700 bytes/sec)
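
A rough model of these numbers, as a sketch only. It assumes that every
read from a hole zeroes one whole record-sized temporary buffer, which is
what the bzero profile quoted further down suggests, and that reads do not
straddle record boundaries. The function names are made up for
illustration and are not part of ZFS.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes zeroed per byte actually returned, under the assumption that a
 * read of read_size bytes from a hole zeroes one full record.
 */
static uint64_t
zero_amplification(uint64_t read_size, uint64_t recordsize)
{
	assert(read_size > 0 && read_size <= recordsize);
	return (recordsize / read_size);
}

/*
 * Total bytes zeroed while reading file_size bytes of hole in chunks of
 * read_size bytes (same assumption: one full record zeroed per read).
 */
static uint64_t
total_bytes_zeroed(uint64_t file_size, uint64_t read_size,
    uint64_t recordsize)
{
	uint64_t nreads;

	assert(read_size > 0 && read_size <= recordsize);
	nreads = (file_size + read_size - 1) / read_size;	/* round up */
	return (nreads * recordsize);
}
```

For the 10 MiB file read with bs=512 this gives 20480 reads, so about
2.5 GiB zeroed at recordsize=128k (256 bytes zeroed per byte returned) and
160 GiB at recordsize=8M (16384 per byte). Both fit the measured 1.0 s and
72.5 s if bzero runs at roughly 2.4 to 2.7 GB/s. The model predicts a 64x
slowdown between the two record sizes; the measured one is about 72x.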

This is from FreeBSD 9.1. A possible solution is at 
http://tepeserwery.pl/nowak/freebsd/zfs_sparse_optimization_v2.patch.txt 
- not tested yet, as the system will be busy building packages for a few 
more days.
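
To illustrate the idea of zeroing at byte granularity (described in the
quoted message below), here is a small user-space sketch. It is not the
patch and not ZFS code: struct toy_file and toy_read are hypothetical
names, and a hole is simply modelled as a NULL record pointer. The point
is only that a read crossing a hole needs to zero the requested bytes,
not a whole record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy object: fixed-size records, NULL marks a hole. */
struct toy_file {
	size_t recordsize;
	size_t nrecords;
	const unsigned char **records;
};

/*
 * Copy len bytes starting at offset off into dst.
 * Allocated records are copied; for holes only the requested bytes are
 * zeroed in dst. Returns the number of bytes that had to be zeroed.
 */
static size_t
toy_read(const struct toy_file *f, size_t off, size_t len,
    unsigned char *dst)
{
	size_t zeroed = 0;

	assert(off + len <= f->recordsize * f->nrecords);
	while (len > 0) {
		size_t rec = off / f->recordsize;	/* record index */
		size_t recoff = off % f->recordsize;	/* offset in record */
		size_t n = f->recordsize - recoff;	/* bytes left in record */

		if (n > len)
			n = len;
		if (f->records[rec] == NULL) {
			/* Hole: zero just the bytes the caller asked for. */
			memset(dst, 0, n);
			zeroed += n;
		} else {
			memcpy(dst, f->records[rec] + recoff, n);
		}
		dst += n;
		off += n;
		len -= n;
	}
	return (zeroed);
}
```

With this shape a 512 byte read from a hole zeroes 512 bytes regardless
of recordsize, instead of 128KiB or 8MiB.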

On 2013-04-13 19:11, Will Andrews wrote:
> Hi,
>
> I think the idea of using a pre-zeroed region as the 'source' is a good
> one, but probably it would be better to set a special flag on a hole
> dbuf than to require caller flags.  That way, ZFS can lazily evaluate
> the hole dbuf (i.e. avoid zeroing db_data until it has to).  However,
> that could be complicated by the fact that there are many potential
> users of hole dbufs that would want to write to the dbuf.
>
> This sort of optimization should be brought to the illumos zfs list.  As
> it stands, your patch is also FreeBSD-specific, since 'zero_region' only
> exists in vm/vm_kern.c.  Given the frequency of zero-copying, however,
> it's quite possible there are other versions of this region elsewhere.
>
> --Will.
>
>
> On Sat, Apr 13, 2013 at 6:04 AM, Adam Nowacki
> <nowakpl at platinum.linux.pl> wrote:
>
>     Temporary dbufs are created for each missing (unallocated on disk)
>     record, including indirects if the hole is large enough. Those dbufs
>     never find their way into the ARC and are freed at the end of
>     dmu_read_uio.
>
>     A small read (from a hole) would in the best case bzero 128KiB
>     (recordsize, more if missing indirects) ... and I'm running modified
>     ZFS with record sizes up to 8MiB.
>
>     # zfs create -o atime=off -o recordsize=8M -o compression=off \
>         -o mountpoint=/home/testfs home/testfs
>     # truncate -s 8m /home/testfs/trunc8m
>     # dd if=/dev/zero of=/home/testfs/zero8m bs=8m count=1
>     1+0 records in
>     1+0 records out
>     8388608 bytes transferred in 0.010193 secs (822987745 bytes/sec)
>
>     # time cat /home/testfs/trunc8m > /dev/null
>     0.000u 6.111s 0:06.11 100.0%    15+2753k 0+0io 0pf+0w
>
>     # time cat /home/testfs/zero8m > /dev/null
>     0.000u 0.010s 0:00.01 100.0%    12+2168k 0+0io 0pf+0w
>
>     600x increase in system time and close to 1MB/s - insanity.
>
>     The fix - a lot of the code to efficiently handle this was already
>     there.
>
>     dbuf_hold_impl has an int fail_sparse argument that makes it return
>     ENOENT for holes. I just had to get there and somehow get back to
>     dmu_read_uio, where zeroing can happen at byte granularity.
>
>     ... didn't have time to actually test it yet.
>
>
>     On 2013-04-13 12:24, Andriy Gapon wrote:
>
>         on 13/04/2013 02:35 Adam Nowacki said the following:
>
>             http://tepeserwery.pl/nowak/freebsd/zfs_sparse_optimization.patch.txt
>
>             Does it look sane?
>
>
>         It's hard to tell from a quick look since the change is not small.
>         What is your idea of the problem and the fix?
>
>             On 2013-04-12 09:03, Andriy Gapon wrote:
>
>
>                 ENOTIME to really investigate, but here is a basic
>                 profile result for those
>                 interested:
>                                  kernel`bzero+0xa
>                                  kernel`dmu_buf_hold_array_by_dnode+0x1cf
>                                  kernel`dmu_read_uio+0x66
>                                  kernel`zfs_freebsd_read+0x3c0
>                                  kernel`VOP_READ_APV+0x92
>                                  kernel`vn_read+0x1a3
>                                  kernel`vn_io_fault+0x23a
>                                  kernel`dofileread+0x7b
>                                  kernel`sys_read+0x9e
>                                  kernel`amd64_syscall+0x238
>                                  kernel`0xffffffff80747e4b
>
>                 That's where > 99% of time is spent.
>
>
>
>
>
>     _______________________________________________
>     freebsd-fs at freebsd.org mailing list
>     http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>     To unsubscribe, send any mail to
>     "freebsd-fs-unsubscribe at freebsd.org"
>
>


