ZFS read performance anomaly / performance regression
Garrett Wollman
wollman at freebsd.org
Sat Jan 25 06:07:59 UTC 2014
Consider three 1-TiB files, equal in dignity, all on the same ZFS
dataset:
-rw-r--r-- 1 root wheel 1099511627776 Jan 7 19:03 test
-rw-r--r-- 1 root wheel 1099511627776 Jan 24 22:38 test-truncate
-rw-r--r-- 1 root wheel 1099511628800 Jan 24 21:32 test2
File "test" was created by writing zeros; "test2" was created by
seeking to 1 TiB and writing a block of zeros, and "test-truncate" was
created with "truncate -s 1T". Compression is disabled on this
dataset, and "uname -a" reports:
FreeBSD nfs-prod-3.csail.mit.edu 9.2-RELEASE-p2 FreeBSD 9.2-RELEASE-p2 #2 r260414M: Tue Jan 7 17:32:43 EST 2014 wollman at xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL amd64
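For concreteness, the creation commands were roughly the following
(block sizes here are illustrative, reconstructed from the file sizes
above; only the truncate invocation is exact):

dd if=/dev/zero of=test bs=1m count=1048576      # writes 1 TiB of zeros
dd if=/dev/zero of=test2 bs=1k oseek=1g count=1  # seeks 1 TiB, writes one 1 KiB block
truncate -s 1T test-truncate                     # writes no data blocks at all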
A naive observer would expect that reading "test" should be
significantly slower than reading either of the other files, since it
requires reading data from the disk, and that there should be no
significant difference between the other two, which consist entirely
of cacheable metadata and have no, or very little, data stored on
disk.
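One quick way to see which files actually allocate blocks is to
compare apparent size with disk usage, e.g.:

ls -ls test test2 test-truncate   # first column is allocated blocks (512-byte units)
du -h  test test2 test-truncate   # on-disk usage

The hole-only files should report essentially no allocated space.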
But no:
x noholes
+ oseek
* truncate
+------------------------------------------------------------------------------+
|x x x x * *+ * * * + + + + |
| |________AM______| |________A__M___|__|_______A__M______||
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 71.300931 82.89801 77.044237 76.452097 4.3961747
+ 5 97.931676 111.36284 108.52787 106.5687 5.5595081
Difference at 95.0% confidence
30.1166 +/- 7.30929
39.3928% +/- 9.56062%
(Student's t, pooled s = 5.01171)
* 5 89.187745 101.74893 98.825988 97.497923 4.9188155
Difference at 95.0% confidence
21.0458 +/- 6.80337
27.5281% +/- 8.89887%
(Student's t, pooled s = 4.66482)
(Values are the time in seconds to read 128 GiB of each file, which
is more than the available memory for caching on this server.
Measurements were interleaved abcabcabcabcabc, but the server was
neither rebooted nor completely idle.  Lower numbers are better.)
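For reference, the timing loop was essentially repeated dd reads; a
minimal sketch (the 1 MiB block size and the /tank/test mountpoint
are illustrative, not the real ones):

for pass in 1 2 3 4 5; do
    for f in test test2 test-truncate; do
        /usr/bin/time dd if=/tank/test/$f of=/dev/null bs=1m count=131072
    done
done

1 MiB x 131072 blocks = 128 GiB per read, and the inner loop produces
the abcabc... interleaving described above.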
Now consider the same test on an older, slower (but also completely
idle) server, running an older FreeBSD release, but with an identical
zpool configuration (88 spindles plus log and cache SSDs):
FreeBSD nfs-backup-1.csail.mit.edu 9.1-RELEASE-p7 FreeBSD 9.1-RELEASE-p7 #34 r255669M: Fri Oct 11 13:04:40 EDT 2013 wollman at xyz.csail.mit.edu:/usr/obj/usr/src/sys/CSAIL amd64
Files were created in the same manner in a fresh, unsnapshotted
dataset on this server, and the same shell script was used to time
reads (even though the older machine has significantly less memory):
x noholes2 (9.1)
+ oseek2 (9.1)
* truncate2 (9.1)
+------------------------------------------------------------------------------+
|** + x |
|** * + + ++ xxx|
||A_| |_MA__| |A||
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 82.813107 83.970546 83.527358 83.427247 0.43568711
+ 5 47.283031 51.450488 49.063725 49.493986 1.6446236
Difference at 95.0% confidence
-33.9333 +/- 1.75456
-40.6741% +/- 2.10311%
(Student's t, pooled s = 1.20304)
* 5 44.460431 46.148431 44.636276 44.937538 0.70509763
Difference at 95.0% confidence
-38.4897 +/- 0.854768
-46.1357% +/- 1.02457%
(Student's t, pooled s = 0.586083)
This is more like what I was expecting! But there's still this
anomaly:
x oseek2 (9.1)
+ truncate2 (9.1)
+------------------------------------------------------------------------------+
| +++ + + x x x x x|
||___M___A______| |____________M___A_________________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 47.283031 51.450488 49.063725 49.493986 1.6446236
+ 5 44.460431 46.148431 44.636276 44.937538 0.70509763
Difference at 95.0% confidence
-4.55645 +/- 1.84536
-9.20606% +/- 3.72846%
(Student's t, pooled s = 1.2653)
Why is it 9% faster to read the beginning of a 1 TiB file that was
created with truncate(1) than to read the beginning of a 1 TiB file
that was created with dd oseek=?
Even more disturbing: the newer, faster server is in fact faster when
reading from disk:
x noholes (9.2)
+ noholes2 (9.1)
+------------------------------------------------------------------------------+
|x x x x * + ++ +|
| |_________________________A___M______________________| |__A_| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 71.300931 82.89801 77.044237 76.452097 4.3961747
+ 5 82.813107 83.970546 83.527358 83.427247 0.43568711
Difference at 95.0% confidence
6.97515 +/- 4.55587
9.12356% +/- 5.95912%
(Student's t, pooled s = 3.12379)
But the faster (9.2) server is much, much slower when reading from
holes:
x truncate (9.2)
+ truncate2 (9.1)
+------------------------------------------------------------------------------+
|++ |
|++ + x x x x x |
||A| |_____A_M____||
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 89.187745 101.74893 98.825988 97.497923 4.9188155
+ 5 44.460431 46.148431 44.636276 44.937538 0.70509763
Difference at 95.0% confidence
-52.5604 +/- 5.1245
-53.9092% +/- 5.25601%
(Student's t, pooled s = 3.51368)
Repeating the same experiments in a fresh dataset with compression
turned on gives the expected results: ZFS automatically punches a big
hole in the "no holes" test file, and the read performance is
indistinguishable among all three creation methods. Unfortunately,
this means that the newer, faster 9.2 server slows all accesses down
to the speed of reading holes, whereas the older, slower 9.1 server
speeds all accesses up:
x noholes+lzjb (9.2)
+ noholes2+lzjb (9.1)
+------------------------------------------------------------------------------+
| + x |
|++++ x x xx|
||A_| |_AM||
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 100.23982 103.88802 102.84441 102.38784 1.4890753
+ 5 43.747693 45.784714 44.914809 44.902244 0.82465096
Difference at 95.0% confidence
-57.4856 +/- 1.75541
-56.1449% +/- 1.71447%
(Student's t, pooled s = 1.20362)
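For reference, the compressed variant was set up along these lines
(pool and dataset names are placeholders):

zfs create -o compression=lzjb tank/lzjbtest
dd if=/dev/zero of=/tank/lzjbtest/noholes+lzjb bs=1m count=1048576
# with lzjb enabled, the all-zero blocks are never written out, so
# even this file ends up as nothing but a hole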
The 9.2 machine is unquestionably faster, at least when shoving data
around without touching the filesystem ("dd if=/dev/zero of=/dev/null"
with other parameters as for the filesystem-based tests):
x overhead (9.2)
+ overhead2 (9.1)
+------------------------------------------------------------------------------+
| x + + |
| x x x + + +|
||M__A___| |____________A_______M____||
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 5.971316 6.357959 5.980432 6.0935198 0.17238797
+ 5 8.015839 9.168025 8.948119 8.6249242 0.55062308
Difference at 95.0% confidence
2.5314 +/- 0.595023
41.5426% +/- 9.76484%
(Student's t, pooled s = 0.407985)
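(The overhead test itself was just something like

/usr/bin/time dd if=/dev/zero of=/dev/null bs=1m count=131072

with the same, illustrative, block size and count as the filesystem
reads above.)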
What's going on here?
-GAWollman