ZFS slow reads for unallocated blocks

Adam Nowacki nowakpl at platinum.linux.pl
Thu Apr 11 22:12:48 UTC 2013


On 2013-04-11 20:23, Jeremy Chadwick wrote:
> On Thu, Apr 11, 2013 at 10:14:28AM -0700, Jeremy Chadwick wrote:
>> On Thu, Apr 11, 2013 at 06:52:19PM +0200, Adam Nowacki wrote:
>>> This one is quite weird - reads from files that were created and
>>> resized with ftruncate (so no actual data on disk) are considerably
>>> slower and use more CPU time than files with data. If compression is
>>> enabled this will also affect files with long runs of zeroes as ZFS
>>> won't write any data to disk in this case. There is no I/O on the
>>> pool during the read tests - all fits into 10GB ARC.
>>>
>>> FreeBSD storage 9.1-RELEASE FreeBSD 9.1-RELEASE #0: Sat Feb 23
>>> 15:51:26 UTC 2013     root at storage:/usr/obj/usr/src/sys/GENERIC
>>> amd64
>>>
>>> Mem: 264M Active, 82M Inact, 12G Wired, 100M Cache, 13M Buf, 3279M Free
>>> Swap: 5120M Total, 5120M Free
>>>
>>> # zfs create -o atime=off -o recordsize=128k -o compression=off -o
>>> mountpoint=/home/testfs home/testfs
>>>
>>> --- truncated file:
>>>
>>> # time truncate -s 1G /home/testfs/trunc1g
>>> 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
>>>
>>> # time dd if=/home/testfs/trunc1g of=/dev/null bs=1024k
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 0.434817 secs (2469410435 bytes/sec)
>>> 0.000u 0.435s 0:00.43 100.0%    26+2813k 0+0io 0pf+0w
>>>
>>> # time dd if=/home/testfs/trunc1g of=/dev/null bs=16k
>>> 65536+0 records in
>>> 65536+0 records out
>>> 1073741824 bytes transferred in 3.809560 secs (281854564 bytes/sec)
>>> 0.000u 3.779s 0:03.81 98.9%     25+2755k 0+0io 0pf+0w
>>>
>>> # time cat /home/testfs/trunc1g > /dev/null
>>> 0.070u 14.031s 0:14.19 99.3%    15+2755k 0+0io 0pf+0w
>>>         ^^^^^^^ 14 seconds compared to 1 second for random data
>>>
>>> --- file filled with zeroes:
>>>
>>> # time dd if=/dev/zero of=/home/testfs/zero1g bs=1024k count=1024
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 2.375426 secs (452020732 bytes/sec)
>>> 0.000u 0.525s 0:02.37 21.9%     23+2533k 1+1io 0pf+0w
>>>
>>> # time dd if=/home/testfs/zero1g of=/dev/null bs=1024k
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 0.199078 secs (5393571244 bytes/sec)
>>> 0.000u 0.200s 0:00.20 100.0%    26+2808k 0+0io 0pf+0w
>>>
>>> # time dd if=/home/testfs/zero1g of=/dev/null bs=16k
>>> 65536+0 records in
>>> 65536+0 records out
>>> 1073741824 bytes transferred in 0.436472 secs (2460046434 bytes/sec)
>>> 0.015u 0.421s 0:00.43 100.0%    26+2813k 0+0io 0pf+0w
>>>
>>> # time cat /home/testfs/zero1g > /dev/null
>>> 0.023u 1.156s 0:01.18 99.1%     15+2779k 0+0io 0pf+0w
>>>
>>> --- file filled with random bytes:
>>>
>>> # time dd if=/dev/random of=/home/testfs/random1g bs=1024k count=1024
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 16.116569 secs (66623474 bytes/sec)
>>> 0.000u 13.214s 0:16.11 81.9%    25+2750k 0+1io 0pf+0w
>>>
>>> # time dd if=/home/testfs/random1g of=/dev/null bs=1024k
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 0.207115 secs (5184280044 bytes/sec)
>>> 0.000u 0.208s 0:00.20 100.0%    26+2808k 0+0io 0pf+0w
>>>
>>> # time dd if=/home/testfs/random1g of=/dev/null bs=16k
>>> 65536+0 records in
>>> 65536+0 records out
>>> 1073741824 bytes transferred in 0.432518 secs (2482536705 bytes/sec)
>>> 0.023u 0.409s 0:00.43 97.6%     26+2828k 0+0io 0pf+0w
>>>
>>> # time cat /home/testfs/random1g > /dev/null
>>> 0.031u 1.053s 0:01.08 100.0%    15+2770k 0+0io 0pf+0w
>>>
>>> --- compression on:
>>>
>>> # zfs create -o atime=off -o recordsize=128k -o compression=lzjb -o
>>> mountpoint=/home/testfs home/testfs
>>>
>>> --- file filled with zeroes:
>>>
>>> # time dd if=/dev/zero of=/home/testfs/zero1g bs=1024k count=1024
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 1.007765 secs (1065468404 bytes/sec)
>>> 0.000u 0.458s 0:01.01 44.5%     26+2880k 1+1io 0pf+0w
>>>
>>> # time dd if=/home/testfs/zero1g of=/dev/null bs=1024k
>>> 1024+0 records in
>>> 1024+0 records out
>>> 1073741824 bytes transferred in 0.630737 secs (1702360431 bytes/sec)
>>> 0.000u 0.630s 0:00.63 100.0%    25+2742k 0+0io 0pf+0w
>>>
>>> # time dd if=/home/testfs/zero1g of=/dev/null bs=16k
>>> 65536+0 records in
>>> 65536+0 records out
>>> 1073741824 bytes transferred in 4.089175 secs (262581530 bytes/sec)
>>> 0.015u 4.036s 0:04.09 98.7%     25+2758k 0+0io 0pf+0w
>>>
>>> # time cat /home/testfs/zero1g > /dev/null
>>> 0.031u 15.863s 0:15.95 99.6%    15+2754k 0+0io 0pf+0w
>>>         ^^^^^^^
>>>
>>> --- it appears recordsize has a huge effect on this (recordsize=32k):
>>>
>>> # zfs create -o atime=off -o recordsize=32k -o compression=off -o
>>> mountpoint=/home/testfs home/testfs
>>>
>>> # time truncate -s 1G testfs/trunc1g
>>> 0.000u 0.000s 0:00.01 0.0%      0+0k 1+0io 0pf+0w
>>>
>>> # time cat /home/testfs/trunc1g > /dev/null
>>> 0.047u 5.842s 0:05.93 99.1%     15+2761k 0+0io 0pf+0w
>>>         ^^^^^^
>>>
>>> --- recordsize=4k:
>>>
>>> # zfs create -o atime=off -o recordsize=4k -o compression=off -o
>>> mountpoint=/home/testfs home/testfs
>>>
>>> # time truncate -s 1G testfs/trunc1g
>>> 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
>>>
>>> # time cat /home/testfs/trunc1g > /dev/null
>>> 0.047u 1.441s 0:01.52 97.3%     15+2768k 0+0io 0pf+0w
>>>         ^^^^^^
>>
>> Take a look at src/bin/cat/cat.c, function raw_cat().
>>
>> Therein lies the answer.
>>
>> TL;DR -- cat != dd.
>
> I wanted to follow up on this, because I received an off-list private
> flame basically telling me to shut the fuck up.

Indeed. I wasn't expecting an answer like that (TL;DR) from freebsd-fs. 
And it angered me quite a lot since answers like that will just scare 
away those with actual knowledge of the internals.

> Compression has nothing to do with this.  recordsize plays a role for
> what should become obvious reasons -- keep reading.

It has quite a lot to do with it, if you understand how ZFS handles 
compression and all-zero records. Without compression, file data is 
stored as is - if the file was written with all zeroes, then those 
zeroes end up written to the disks. With compression there is an 
additional step (separate from the compression itself): if a 
record-sized block of all zeroes is written, that block remains 
unallocated, with no data stored on disk - there is no L0 record for 
that block. This produces the same behavior as if the file had been 
extended with ftruncate.
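
To verify this, here is a rough sketch (untested as written; it assumes 
SEEK_DATA works on ZFS on this FreeBSD release, and the holecheck.c 
name is just for the example). Against trunc1g it should report that 
the whole file is a hole, and with compression enabled the same should 
hold for zero1g, while the uncompressed zero1g and random1g report data 
at offset 0:

/*
 * holecheck.c - report whether a file contains any allocated data.
 * lseek(SEEK_DATA) from offset 0 fails with ENXIO when the entire
 * file is a hole (no L0 records), which is what I expect for the
 * truncated file and, with compression on, the all-zero file.
 */
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	int fd;
	off_t data;

	if (argc != 2) {
		fprintf(stderr, "usage: holecheck file\n");
		return (1);
	}
	if ((fd = open(argv[1], O_RDONLY)) == -1) {
		perror("open");
		return (1);
	}
	data = lseek(fd, 0, SEEK_DATA);
	if (data == -1 && errno == ENXIO)
		printf("%s: entirely a hole, no data blocks\n", argv[1]);
	else if (data == -1)
		perror("lseek");
	else
		printf("%s: first data at offset %jd\n", argv[1],
		    (intmax_t)data);
	close(fd);
	return (0);
}

# cc -o holecheck holecheck.c
# ./holecheck /home/testfs/trunc1g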

> Again: cat is not dd.  cat has its own set of logic for how it
> calculates the "optimal block size" to use when calling read(2).  dd,
> operates differently, and lets you set the blocksize using bs.

I've already answered you off the list - cat was used to show a "real 
life" example of just how bad the slowdown is. Most software in the 
wild won't be using large buffer sizes. I've given you another example 
(again sent in private, but worth repeating on the list below) where 
this issue is clearly visible:

# time md5 testfs/zero1g
MD5 (testfs/zero1g) = cd573cfaace07e7949bc0c46028904ff
2.495u 3.384s 0:05.88 99.8%     15+167k 0+0io 0pf+0w

# time md5 testfs/trunc1g
MD5 (testfs/trunc1g) = cd573cfaace07e7949bc0c46028904ff
2.711u 65.452s 1:08.55 99.4%    15+167k 0+0io 0pf+0w

The difference is 3 seconds versus 65 seconds of system time for the 
same 1GB of zeroes. Clearly there is room for improvement - and almost 
all of the extra time is spent in the kernel.
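
For anyone who wants to reproduce this without md5, a plain read loop 
with a small buffer shows the same pattern (untested sketch; the 16k 
buffer is only my guess at what typical userland software uses, not 
what md5(1) actually does internally). Timed against trunc1g versus 
random1g, the difference should show up almost entirely as system time:

/*
 * smallread.c - read a file sequentially with a small fixed buffer,
 * the way most "real life" software does, and discard the data.
 */
#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define BUFSZ	(16 * 1024)

int
main(int argc, char *argv[])
{
	static char buf[BUFSZ];
	ssize_t n;
	off_t total = 0;
	int fd;

	if (argc != 2)
		errx(1, "usage: smallread file");
	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open %s", argv[1]);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;
	if (n == -1)
		err(1, "read");
	printf("read %jd bytes\n", (intmax_t)total);
	close(fd);
	return (0);
}

# cc -o smallread smallread.c
# time ./smallread /home/testfs/trunc1g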

> My conclusion is that ZFS handles sparse/truncated files very
> differently than UFS.  Those familiar with the internals of ZFS can
> probably explain this dilemma.

And finally we are back to the root of the problem. Is this a 
FreeBSD-only issue, or should it be brought upstream to illumos? All my 
systems are FreeBSD.


