ZFS slow reads for unallocated blocks

Jeremy Chadwick jdc at koitsu.org
Thu Apr 11 18:23:23 UTC 2013


On Thu, Apr 11, 2013 at 10:14:28AM -0700, Jeremy Chadwick wrote:
> On Thu, Apr 11, 2013 at 06:52:19PM +0200, Adam Nowacki wrote:
> > This one is quite weird - reads from files that were created and
> > resized with ftruncate (so no actual data on disk) are considerably
> > slower and use more CPU time than files with data. If compression is
> > enabled this will also affect files with long runs of zeroes as ZFS
> > won't write any data to disk in this case. There is no I/O on the
> > pool during the read tests - all fits into 10GB ARC.
> > 
> > FreeBSD storage 9.1-RELEASE FreeBSD 9.1-RELEASE #0: Sat Feb 23
> > 15:51:26 UTC 2013     root at storage:/usr/obj/usr/src/sys/GENERIC
> > amd64
> > 
> > Mem: 264M Active, 82M Inact, 12G Wired, 100M Cache, 13M Buf, 3279M Free
> > Swap: 5120M Total, 5120M Free
> > 
> > # zfs create -o atime=off -o recordsize=128k -o compression=off -o
> > mountpoint=/home/testfs home/testfs
> > 
> > --- truncated file:
> > 
> > # time truncate -s 1G /home/testfs/trunc1g
> > 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
> > 
> > # time dd if=/home/testfs/trunc1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.434817 secs (2469410435 bytes/sec)
> > 0.000u 0.435s 0:00.43 100.0%    26+2813k 0+0io 0pf+0w
> > 
> > # time dd if=/home/testfs/trunc1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 3.809560 secs (281854564 bytes/sec)
> > 0.000u 3.779s 0:03.81 98.9%     25+2755k 0+0io 0pf+0w
> > 
> > # time cat /home/testfs/trunc1g > /dev/null
> > 0.070u 14.031s 0:14.19 99.3%    15+2755k 0+0io 0pf+0w
> >        ^^^^^^^ 14 seconds compared to 1 second for random data
> > 
> > --- file filled with zeroes:
> > 
> > # time dd if=/dev/zero of=/home/testfs/zero1g bs=1024k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 2.375426 secs (452020732 bytes/sec)
> > 0.000u 0.525s 0:02.37 21.9%     23+2533k 1+1io 0pf+0w
> > 
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.199078 secs (5393571244 bytes/sec)
> > 0.000u 0.200s 0:00.20 100.0%    26+2808k 0+0io 0pf+0w
> > 
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 0.436472 secs (2460046434 bytes/sec)
> > 0.015u 0.421s 0:00.43 100.0%    26+2813k 0+0io 0pf+0w
> > 
> > # time cat /home/testfs/zero1g > /dev/null
> > 0.023u 1.156s 0:01.18 99.1%     15+2779k 0+0io 0pf+0w
> > 
> > --- file filled with random bytes:
> > 
> > # time dd if=/dev/random of=/home/testfs/random1g bs=1024k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 16.116569 secs (66623474 bytes/sec)
> > 0.000u 13.214s 0:16.11 81.9%    25+2750k 0+1io 0pf+0w
> > 
> > # time dd if=/home/testfs/random1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.207115 secs (5184280044 bytes/sec)
> > 0.000u 0.208s 0:00.20 100.0%    26+2808k 0+0io 0pf+0w
> > 
> > # time dd if=/home/testfs/random1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 0.432518 secs (2482536705 bytes/sec)
> > 0.023u 0.409s 0:00.43 97.6%     26+2828k 0+0io 0pf+0w
> > 
> > # time cat /home/testfs/random1g > /dev/null
> > 0.031u 1.053s 0:01.08 100.0%    15+2770k 0+0io 0pf+0w
> > 
> > --- compression on:
> > 
> > # zfs create -o atime=off -o recordsize=128k -o compression=lzjb -o
> > mountpoint=/home/testfs home/testfs
> > 
> > --- file filled with zeroes:
> > 
> > # time dd if=/dev/zero of=/home/testfs/zero1g bs=1024k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 1.007765 secs (1065468404 bytes/sec)
> > 0.000u 0.458s 0:01.01 44.5%     26+2880k 1+1io 0pf+0w
> > 
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=1024k
> > 1024+0 records in
> > 1024+0 records out
> > 1073741824 bytes transferred in 0.630737 secs (1702360431 bytes/sec)
> > 0.000u 0.630s 0:00.63 100.0%    25+2742k 0+0io 0pf+0w
> > 
> > # time dd if=/home/testfs/zero1g of=/dev/null bs=16k
> > 65536+0 records in
> > 65536+0 records out
> > 1073741824 bytes transferred in 4.089175 secs (262581530 bytes/sec)
> > 0.015u 4.036s 0:04.09 98.7%     25+2758k 0+0io 0pf+0w
> > 
> > # time cat /home/testfs/zero1g > /dev/null
> > 0.031u 15.863s 0:15.95 99.6%    15+2754k 0+0io 0pf+0w
> >        ^^^^^^^
> > 
> > --- it appears recordsize has a huge effect on this (recordsize=32k):
> > 
> > # zfs create -o atime=off -o recordsize=32k -o compression=off -o
> > mountpoint=/home/testfs home/testfs
> > 
> > # time truncate -s 1G testfs/trunc1g
> > 0.000u 0.000s 0:00.01 0.0%      0+0k 1+0io 0pf+0w
> > 
> > # time cat /home/testfs/trunc1g > /dev/null
> > 0.047u 5.842s 0:05.93 99.1%     15+2761k 0+0io 0pf+0w
> >        ^^^^^^
> > 
> > --- recordsize=4k:
> > 
> > # zfs create -o atime=off -o recordsize=4k -o compression=off -o
> > mountpoint=/home/testfs home/testfs
> > 
> > # time truncate -s 1G testfs/trunc1g
> > 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
> > 
> > # time cat /home/testfs/trunc1g > /dev/null
> > 0.047u 1.441s 0:01.52 97.3%     15+2768k 0+0io 0pf+0w
> >        ^^^^^^
> 
> Take a look at src/bin/cat/cat.c, function raw_cat().
> 
> Therein lies the answer.
> 
> TL;DR -- cat != dd.

I wanted to follow up on this, because I received an off-list private
flame basically telling me to shut the fuck up.

Compression has nothing to do with this.  recordsize plays a role for
what should become obvious reasons -- keep reading.

Again: cat is not dd.  cat has its own set of logic for how it
calculates the "optimal block size" to use when calling read(2).  dd,
operates differently, and lets you set the blocksize using bs.

A modified cat binary (work/src/bin/cat/cat) which prints out the
relevant "calculation" bits within raw_cat().  And yes, $cwd is a ZFS
filesystem, with a default recordsize (128KB):

$ truncate -s 1g testfile
$ time work/src/bin/cat/cat testfile > /dev/null
raw_cat()
PHYSPAGES_THRESHOLD: 32768
BUFSIZE_MAX:         2097152
BUFSIZE_SMALL:       131072
_SC_PHYS_PAGES:      2088077
bsize              = 4096

real    0m13.067s
user    0m0.000s
sys     0m13.070s

cat, in my case, issues read(fd, 4096).  That's a 4KByte block size.
Now let's use dd with the same size (bs=4096):

$ time dd if=testfile of=/dev/null bs=4k
262144+0 records in
262144+0 records out
1073741824 bytes transferred in 13.031543 secs (82395601 bytes/sec)

real    0m13.033s
user    0m0.023s
sys     0m13.000s

Increase the block size to dd (from 4k up to 8k) and the speed will
double:

$ time dd if=testfile of=/dev/null bs=8k
131072+0 records in
131072+0 records out
1073741824 bytes transferred in 6.545117 secs (164052352 bytes/sec)

real    0m6.546s
user    0m0.026s
sys     0m6.519s

Now let's move to a UFS2+SU filesystem, where the behaviour is
different:

$ truncate -s 1g /tmp/testfile
$ time work/src/bin/cat/cat /tmp/testfile > /dev/null
raw_cat()
PHYSPAGES_THRESHOLD: 32768
BUFSIZE_MAX:         2097152
BUFSIZE_SMALL:       131072
_SC_PHYS_PAGES:      2088077
bsize              = 4096

real    0m1.191s
user    0m0.031s
sys     0m1.159s

$ time dd if=/tmp/testfile of=/dev/null bs=4k
262144+0 records in
262144+0 records out
1073741824 bytes transferred in 0.898440 secs (1195117846 bytes/sec)

real    0m0.900s
user    0m0.039s
sys     0m0.860s

$ time dd if=/tmp/testfile of=/dev/null bs=8k
131072+0 records in
131072+0 records out
1073741824 bytes transferred in 0.768045 secs (1398019512 bytes/sec)

real    0m0.769s
user    0m0.024s
sys     0m0.745s

My conclusion is that ZFS handles sparse/truncated files very
differently than UFS.  Those familiar with the internals of ZFS can
probably explain this dilemma.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |


More information about the freebsd-fs mailing list