ZFS vdev I/O questions
Daniel Kalchev
daniel at digsys.bg
Tue Jul 16 11:41:44 UTC 2013
I am observing some "strange" behaviour with I/O spread on ZFS vdevs and
thought I might ask if someone has observed it too.
The system hardware is a Supermicro X8DTH-6F board with an integrated
LSI2008 controller, two Xeon E5620 CPUs and 72GB of RAM (6x4 + 6x8 GB
modules).
It runs 9-stable r252690.
It currently has an 18-drive zpool, split into three 6-drive raidz2
vdevs, plus ZIL and L2ARC on separate SSDs (240GB Intel 520). The ZIL
consists of two partitions on the boot SSDs (Intel 320), not mirrored.
The zpool layout is:
  pool: storage
 state: ONLINE
  scan: scrub canceled on Thu Jul 11 17:14:50 2013
config:

        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          raidz2-0      ONLINE       0     0     0
            gpt/disk00  ONLINE       0     0     0
            gpt/disk01  ONLINE       0     0     0
            gpt/disk02  ONLINE       0     0     0
            gpt/disk03  ONLINE       0     0     0
            gpt/disk04  ONLINE       0     0     0
            gpt/disk05  ONLINE       0     0     0
          raidz2-1      ONLINE       0     0     0
            gpt/disk06  ONLINE       0     0     0
            gpt/disk07  ONLINE       0     0     0
            gpt/disk08  ONLINE       0     0     0
            gpt/disk09  ONLINE       0     0     0
            gpt/disk10  ONLINE       0     0     0
            gpt/disk11  ONLINE       0     0     0
          raidz2-2      ONLINE       0     0     0
            gpt/disk12  ONLINE       0     0     0
            gpt/disk13  ONLINE       0     0     0
            gpt/disk14  ONLINE       0     0     0
            gpt/disk15  ONLINE       0     0     0
            gpt/disk16  ONLINE       0     0     0
            gpt/disk17  ONLINE       0     0     0
        logs
          ada0p2        ONLINE       0     0     0
          ada1p2        ONLINE       0     0     0
        cache
          da20p2        ONLINE       0     0     0
The zdb output:
storage:
    version: 5000
    name: 'storage'
    state: 0
    txg: 5258772
    pool_guid: 17094379857311239400
    hostid: 3505628652
    hostname: 'a1.register.bg'
    vdev_children: 5
    vdev_tree:
        type: 'root'
        id: 0
        guid: 17094379857311239400
        children[0]:
            type: 'raidz'
            id: 0
            guid: 2748500753748741494
            nparity: 2
            metaslab_array: 33
            metaslab_shift: 37
            ashift: 12
            asize: 18003521961984
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 5074824874132816460
                path: '/dev/gpt/disk00'
                phys_path: '/dev/gpt/disk00'
                whole_disk: 1
                DTL: 378
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 14410366944090513563
                path: '/dev/gpt/disk01'
                phys_path: '/dev/gpt/disk01'
                whole_disk: 1
                DTL: 53
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 3526681390841761237
                path: '/dev/gpt/disk02'
                phys_path: '/dev/gpt/disk02'
                whole_disk: 1
                DTL: 52
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 3773850995072323004
                path: '/dev/gpt/disk03'
                phys_path: '/dev/gpt/disk03'
                whole_disk: 1
                DTL: 51
                create_txg: 4
            children[4]:
                type: 'disk'
                id: 4
                guid: 16528489666301728411
                path: '/dev/gpt/disk04'
                phys_path: '/dev/gpt/disk04'
                whole_disk: 1
                DTL: 50
                create_txg: 4
            children[5]:
                type: 'disk'
                id: 5
                guid: 11222774817699257051
                path: '/dev/gpt/disk05'
                phys_path: '/dev/gpt/disk05'
                whole_disk: 1
                DTL: 44147
                create_txg: 4
        children[1]:
            type: 'raidz'
            id: 1
            guid: 614220834244218709
            nparity: 2
            metaslab_array: 39
            metaslab_shift: 37
            ashift: 12
            asize: 18003521961984
            is_log: 0
            create_txg: 40
            children[0]:
                type: 'disk'
                id: 0
                guid: 8076478524731550200
                path: '/dev/gpt/disk06'
                phys_path: '/dev/gpt/disk06'
                whole_disk: 1
                DTL: 2914
                create_txg: 40
            children[1]:
                type: 'disk'
                id: 1
                guid: 1689851194543981566
                path: '/dev/gpt/disk07'
                phys_path: '/dev/gpt/disk07'
                whole_disk: 1
                DTL: 48
                create_txg: 40
            children[2]:
                type: 'disk'
                id: 2
                guid: 9743236178648200269
                path: '/dev/gpt/disk08'
                phys_path: '/dev/gpt/disk08'
                whole_disk: 1
                DTL: 47
                create_txg: 40
            children[3]:
                type: 'disk'
                id: 3
                guid: 10157617457760516410
                path: '/dev/gpt/disk09'
                phys_path: '/dev/gpt/disk09'
                whole_disk: 1
                DTL: 46
                create_txg: 40
            children[4]:
                type: 'disk'
                id: 4
                guid: 5035981195206926078
                path: '/dev/gpt/disk10'
                phys_path: '/dev/gpt/disk10'
                whole_disk: 1
                DTL: 45
                create_txg: 40
            children[5]:
                type: 'disk'
                id: 5
                guid: 4975835521778875251
                path: '/dev/gpt/disk11'
                phys_path: '/dev/gpt/disk11'
                whole_disk: 1
                DTL: 44149
                create_txg: 40
        children[2]:
            type: 'raidz'
            id: 2
            guid: 7453512836015019221
            nparity: 2
            metaslab_array: 38974
            metaslab_shift: 37
            ashift: 12
            asize: 18003521961984
            is_log: 0
            create_txg: 4455560
            children[0]:
                type: 'disk'
                id: 0
                guid: 11182458869377968267
                path: '/dev/gpt/disk12'
                phys_path: '/dev/gpt/disk12'
                whole_disk: 1
                DTL: 45059
                create_txg: 4455560
            children[1]:
                type: 'disk'
                id: 1
                guid: 5844283175515272344
                path: '/dev/gpt/disk13'
                phys_path: '/dev/gpt/disk13'
                whole_disk: 1
                DTL: 44145
                create_txg: 4455560
            children[2]:
                type: 'disk'
                id: 2
                guid: 13095364699938843583
                path: '/dev/gpt/disk14'
                phys_path: '/dev/gpt/disk14'
                whole_disk: 1
                DTL: 44144
                create_txg: 4455560
            children[3]:
                type: 'disk'
                id: 3
                guid: 5196507898996589388
                path: '/dev/gpt/disk15'
                phys_path: '/dev/gpt/disk15'
                whole_disk: 1
                DTL: 44143
                create_txg: 4455560
            children[4]:
                type: 'disk'
                id: 4
                guid: 12809770017318709512
                path: '/dev/gpt/disk16'
                phys_path: '/dev/gpt/disk16'
                whole_disk: 1
                DTL: 44142
                create_txg: 4455560
            children[5]:
                type: 'disk'
                id: 5
                guid: 7339883019925920701
                path: '/dev/gpt/disk17'
                phys_path: '/dev/gpt/disk17'
                whole_disk: 1
                DTL: 44141
                create_txg: 4455560
        children[3]:
            type: 'disk'
            id: 3
            guid: 18011869864924559827
            path: '/dev/ada0p2'
            phys_path: '/dev/ada0p2'
            whole_disk: 1
            metaslab_array: 16675
            metaslab_shift: 26
            ashift: 12
            asize: 8585216000
            is_log: 1
            DTL: 86787
            create_txg: 5182360
        children[4]:
            type: 'disk'
            id: 4
            guid: 1338775535758010670
            path: '/dev/ada1p2'
            phys_path: '/dev/ada1p2'
            whole_disk: 1
            metaslab_array: 16693
            metaslab_shift: 26
            ashift: 12
            asize: 8585216000
            is_log: 1
            DTL: 86788
            create_txg: 5182377
    features_for_read:
Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640,
firmware MKAOA3B0) -- these are 512-byte-sector drives, but da0 has been
replaced by a Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166,
firmware CC24) -- a 4K-sector drive of a new generation (note the
relatively 'old' firmware, which can't be upgraded).
Drives da6-da17 are also Seagate Barracuda 7200.14 (AF), but
ST3000DM001-9YN166 with firmware CC4H -- the more "normal" part number.
Some still have firmware CC4C, which I am replacing drive by drive (but
other than the excessive load-cycle counts, no other issues so far).
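(For reference, the model and firmware revision of each drive can be
read back with camcontrol, e.g.:)

# camcontrol devlist         # shows <vendor model firmware> for every device
# camcontrol inquiry da0     # the same for a single drive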
The only ZFS-related tuning is in /etc/sysctl.conf:
# improve ZFS resilver
vfs.zfs.resilver_delay=0
vfs.zfs.scrub_delay=0
vfs.zfs.top_maxinflight=128
vfs.zfs.resilver_min_time_ms=5000
vfs.zfs.vdev.max_pending=24
# L2ARC:
vfs.zfs.l2arc_norw=0
vfs.zfs.l2arc_write_max=83886080
vfs.zfs.l2arc_write_boost=83886080
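(These are applied at boot from sysctl.conf; the live values can be
double-checked at runtime, e.g.:)

# sysctl vfs.zfs.vdev.max_pending vfs.zfs.l2arc_write_max
vfs.zfs.vdev.max_pending: 24
vfs.zfs.l2arc_write_max: 83886080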
The pool, of course, had dedup enabled, with serious dedup ratios --
over 10x. In general, with the ZIL and L2ARC, the only trouble I have
seen with dedup is when deleting lots of data... which this server has
seen plenty of. During this experiment, I have moved most of the data to
another server and am un-dedup'ing the last remaining TBs.
While doing zfs destroy on a 2-3TB dataset, I observe very annoying
behaviour. The pool stays mostly idle, accepting almost no I/O and doing
only small random reads, like this:
$ zpool iostat storage 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     45.3T  3.45T    466      0  1.82M      0
storage     45.3T  3.45T     50      0   203K      0
storage     45.3T  3.45T     45     25   183K  1.70M
storage     45.3T  3.45T     49      0   199K      0
storage     45.3T  3.45T     50      0   202K      0
storage     45.3T  3.45T     51      0   204K      0
storage     45.3T  3.45T     57      0   230K      0
storage     45.3T  3.45T     65      0   260K      0
storage     45.3T  3.45T     68     25   274K  1.70M
storage     45.3T  3.45T     65      0   260K      0
storage     45.3T  3.45T     64      0   260K      0
storage     45.3T  3.45T     67      0   272K      0
storage     45.3T  3.45T     66      0   266K      0
storage     45.3T  3.45T     64      0   258K      0
storage     45.3T  3.45T     62     25   250K  1.70M
storage     45.3T  3.45T     57      0   231K      0
storage     45.3T  3.45T     58      0   235K      0
storage     45.3T  3.45T     66      0   267K      0
storage     45.3T  3.45T     64      0   257K      0
storage     45.3T  3.45T     60      0   241K      0
storage     45.3T  3.45T     50      0   203K      0
storage     45.3T  3.45T     52     25   209K  1.70M
storage     45.3T  3.45T     54      0   217K      0
storage     45.3T  3.45T     51      0   205K      0
storage     45.3T  3.45T     54      0   216K      0
storage     45.3T  3.45T     55      0   222K      0
storage     45.3T  3.45T     56      0   226K      0
storage     45.3T  3.45T     65      0   264K      0
storage     45.3T  3.45T     71      0   286K      0
The write peaks are from processes syncing data to the pool -- in this
state there are no reads (the data the sync processes deal with is
already in the ARC).
Then it switches to writing back to the pool (perhaps DDT metadata):
storage     45.3T  3.45T     17  24.4K  69.6K  97.5M
storage     45.3T  3.45T      0  19.6K      0  78.5M
storage     45.3T  3.45T      0  14.2K      0  56.8M
storage     45.3T  3.45T      0  7.90K      0  31.6M
storage     45.3T  3.45T      0  7.81K      0  32.8M
storage     45.3T  3.45T      0  9.54K      0  38.2M
storage     45.3T  3.45T      0  7.07K      0  28.3M
storage     45.3T  3.45T      0  7.70K      0  30.8M
storage     45.3T  3.45T      0  6.19K      0  24.8M
storage     45.3T  3.45T      0  5.45K      0  21.8M
storage     45.3T  3.45T      0  5.78K      0  24.7M
storage     45.3T  3.45T      0  5.29K      0  21.2M
storage     45.3T  3.45T      0  5.69K      0  22.8M
storage     45.3T  3.45T      0  5.52K      0  22.1M
storage     45.3T  3.45T      0  3.26K      0  13.1M
storage     45.3T  3.45T      0  1.77K      0  7.10M
storage     45.3T  3.45T      0  1.63K      0  8.14M
storage     45.3T  3.45T      0  1.41K      0  5.64M
storage     45.3T  3.45T      0  1.22K      0  4.88M
storage     45.3T  3.45T      0  1.27K      0  5.09M
storage     45.3T  3.45T      0  1.06K      0  4.26M
storage     45.3T  3.45T      0  1.07K      0  4.30M
storage     45.3T  3.45T      0    979      0  3.83M
storage     45.3T  3.45T      0   1002      0  3.91M
storage     45.3T  3.45T      0   1010      0  3.95M
storage     45.3T  3.45T      0    948  2.40K  3.71M
storage     45.3T  3.45T      0    939      0  3.67M
storage     45.3T  3.45T      0   1023      0  7.10M
storage     45.3T  3.45T      0  1.01K  4.80K  4.04M
storage     45.3T  3.45T      0    822      0  3.22M
storage     45.3T  3.45T      0    434      0  1.70M
storage     45.3T  3.45T      0    398  2.40K  1.56M
For quite some time, there are no reads from the pool. When that
happens, gstat (gstat -f 'da[0-9]*$') displays something like this:
dT: 1.001s  w: 1.000s  filter: da[0-9]*$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   24   1338      0      0    0.0   1338  12224   17.8  100.9| da0
   24   6888      0      0    0.0   6888  60720    3.5  100.0| da1
   24   6464      0      0    0.0   6464  71997    3.7  100.0| da2
   24   6117      0      0    0.0   6117  82386    3.9   99.9| da3
   24   6455      0      0    0.0   6455  66822    3.7  100.0| da4
   24   6782      0      0    0.0   6782  69207    3.5  100.0| da5
   24    698      0      0    0.0    698  27533   34.1   99.6| da6
   24    590      0      0    0.0    590  21627   40.9   99.7| da7
   24    561      0      0    0.0    561  21031   42.8  100.2| da8
   24    724      0      0    0.0    724  25583   33.1   99.9| da9
   24    567      0      0    0.0    567  22965   41.4   98.0| da10
   24    566      0      0    0.0    566  21834   42.4   99.9| da11
   24    586      0      0    0.0    586   4899   43.5  100.2| da12
   24    487      0      0    0.0    487   4008   49.3  100.9| da13
   24    628      0      0    0.0    628   5007   38.9  100.2| da14
   24    714      0      0    0.0    714   5706   33.8   99.9| da15
   24    595      0      0    0.0    595   4831   39.8   99.8| da16
   24    485      0      0    0.0    485   3932   49.2  100.1| da17
    0      0      0      0    0.0      0      0    0.0    0.0| da18
    0      0      0      0    0.0      0      0    0.0    0.0| da19
    0      0      0      0    0.0      0      0    0.0    0.0| da20
    0      0      0      0    0.0      0      0    0.0    0.0| ada0
    0      0      0      0    0.0      0      0    0.0    0.0| ada1
(Drives da18 and da19 are spares, da20 is the L2ARC SSD, and ada0 and
ada1 are the boot SSDs in a separate zpool.)
Now, here comes the weird part. The gstat display shows intensive writes
to all vdevs (da0-da5, da6-da11, da12-da17); then one of the vdevs
completes its writes and stops, while the other vdevs continue, until at
the end only one vdev is still writing -- until, as it seems, the data
is completely written to all vdevs (this can be observed in the zpool
iostat output above, with the write IOPS decreasing every 10 seconds).
Then there is a few-second "do nothing" period, and then we are back to
the small reads.
The other observation concerns the first vdev: the 512-byte-sector
drives (da1-da5) do a lot of I/O quickly, complete first and then sit
idle, while da0 continues to write for many more seconds. They
consistently show many more IOPS than the other drives for this type of
activity -- on streaming writes all drives behave more or less the same.
It is only in this un-dedup scenario that the difference is so
pronounced.
All the vdevs in the pool have ashift=12, so the theory that ZFS
actually issues 512-byte writes to these drives can't be true, can it?
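(The ashift values are easy to confirm from the cached pool
configuration; with vdev_children: 5 there should be one ashift line per
top-level vdev, and per the zdb dump above all of them are 12:)

# zdb -C storage | grep ashift
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12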
Another worry is this Seagate Barracuda 7200.14 (AF)
(ST3000DM001-1CH166, firmware CC24) drive. It seems to be constantly
under-performing. Does anyone know if it is really that different from
the ST3000DM001-9YN166 drives? Maybe I should just replace it?
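(A crude out-of-band comparison between the suspect drive and one of the
-9YN166 drives could be done with diskinfo's command-overhead and
transfer-rate tests while the pool is quiet -- a rough check, not a real
benchmark:)

# diskinfo -ct /dev/da0      # the ST3000DM001-1CH166
# diskinfo -ct /dev/da6      # one of the ST3000DM001-9YN166 drives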
My concern is the bursty and irregular nature of writing to the vdevs.
As it is now, a write operation to the pool needs to wait for all of the
vdev writes to complete, which in this case takes tens of seconds. A
single under-performing drive in a vdev will slow down the entire pool.
Perhaps ZFS could prioritize vdev usage based on vdev throughput,
similar to how it already prioritizes allocations based on how full each
vdev is.
Also, what is ZFS doing during the idle periods? Are there some timeouts
involved? It is certainly not using any CPU, and the small random I/O is
certainly not loading the disks.
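(One way to see what the pool is waiting on during those idle stretches
might be to inspect the kernel stacks of the ZFS worker threads -- a
rough sketch, with <pid> standing in for the zfskern process id:)

# ps ax | grep -i zfskern    # find the zfskern kernel process id
# procstat -kk <pid>         # dump its thread stacks; look at txg_sync and the zio threads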
Then, I have the 240GB L2ARC and secondarycache=metadata set on the
pool. Yet the DDT apparently does not want to go there... Is there a way
to "force" it to be loaded into L2ARC? Before the last big delete, I
had:
zdb -D storage
DDT-sha256-zap-duplicate: 19907778 entries, size 1603 on disk, 259 in core
DDT-sha256-zap-unique: 30101659 entries, size 1428 on disk, 230 in core
dedup = 1.98, compress = 1.00, copies = 1.03, dedup * compress / copies = 1.92
With time, the in core values stay more or less the same.
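(A back-of-the-envelope check with the entry counts and sizes above says
the DDT should easily fit by size alone: roughly 11 GiB in core and
roughly 70 GiB on disk, well under the 240GB L2ARC:)

$ echo '(19907778*259 + 30101659*230) / 1024^3' | bc
11
$ echo '(19907778*1603 + 30101659*1428) / 1024^3' | bc
69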
I also discovered that the L2ARC drive apparently is not subject to TRIM
for some reason. TRIM works on the boot drives, but those are connected
to the motherboard SATA ports.
# sysctl kern.cam.da.20
kern.cam.da.20.delete_method: ATA_TRIM
kern.cam.da.20.minimum_cmd_size: 6
kern.cam.da.20.sort_io_queue: 0
kern.cam.da.20.error_inject: 0
# sysctl -a | grep trim
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
kstat.zfs.misc.zio_trim.bytes: 139489971200
kstat.zfs.misc.zio_trim.success: 628351
kstat.zfs.misc.zio_trim.unsupported: 622819
kstat.zfs.misc.zio_trim.failed: 0
Yet I don't observe any BIO_DELETE activity on this drive with gstat -d.
Wasn't TRIM supposed to work on drives attached to the LSI2008 in
9-stable?
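(For reference, the check is gstat with delete statistics enabled,
filtered to the L2ARC drive; the d/s and delete-kBps columns it adds
stay at zero here:)

# gstat -d -f 'da20$'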
Daniel