ZFS vdev I/O questions

Daniel Kalchev daniel at digsys.bg
Tue Jul 16 11:41:44 UTC 2013


I am observing some "strange" behaviour in how I/O is spread across ZFS 
vdevs and thought I might ask if someone has observed it too.

The system hardware is a Supermicro X8DTH-6F board with an integrated 
LSI2008 controller, two Xeon E5620 CPUs and 72GB of RAM (6x4 + 6x8 GB 
modules).
It runs 9-stable r252690.

It currently has an 18-drive zpool, split into three 6-drive raidz2 
vdevs, plus ZIL and L2ARC on separate SSDs (240GB Intel 520). The ZIL 
consists of two partitions on the boot SSDs (Intel 320), not mirrored. 
The zpool layout is:

   pool: storage
  state: ONLINE
   scan: scrub canceled on Thu Jul 11 17:14:50 2013
config:

         NAME            STATE     READ WRITE CKSUM
         storage         ONLINE       0     0     0
           raidz2-0      ONLINE       0     0     0
             gpt/disk00  ONLINE       0     0     0
             gpt/disk01  ONLINE       0     0     0
             gpt/disk02  ONLINE       0     0     0
             gpt/disk03  ONLINE       0     0     0
             gpt/disk04  ONLINE       0     0     0
             gpt/disk05  ONLINE       0     0     0
           raidz2-1      ONLINE       0     0     0
             gpt/disk06  ONLINE       0     0     0
             gpt/disk07  ONLINE       0     0     0
             gpt/disk08  ONLINE       0     0     0
             gpt/disk09  ONLINE       0     0     0
             gpt/disk10  ONLINE       0     0     0
             gpt/disk11  ONLINE       0     0     0
           raidz2-2      ONLINE       0     0     0
             gpt/disk12  ONLINE       0     0     0
             gpt/disk13  ONLINE       0     0     0
             gpt/disk14  ONLINE       0     0     0
             gpt/disk15  ONLINE       0     0     0
             gpt/disk16  ONLINE       0     0     0
             gpt/disk17  ONLINE       0     0     0
         logs
           ada0p2        ONLINE       0     0     0
           ada1p2        ONLINE       0     0     0
         cache
           da20p2        ONLINE       0     0     0
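
(For completeness: a log/cache layout like this is what you get from 
something along the lines of

# zpool add storage log ada0p2 ada1p2
# zpool add storage cache da20p2

-- mentioned just to make clear that the two log partitions are 
separate top-level vdevs, not a mirror.)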


zdb output

storage:
     version: 5000
     name: 'storage'
     state: 0
     txg: 5258772
     pool_guid: 17094379857311239400
     hostid: 3505628652
     hostname: 'a1.register.bg'
     vdev_children: 5
     vdev_tree:
         type: 'root'
         id: 0
         guid: 17094379857311239400
         children[0]:
             type: 'raidz'
             id: 0
             guid: 2748500753748741494
             nparity: 2
             metaslab_array: 33
             metaslab_shift: 37
             ashift: 12
             asize: 18003521961984
             is_log: 0
             create_txg: 4
             children[0]:
                 type: 'disk'
                 id: 0
                 guid: 5074824874132816460
                 path: '/dev/gpt/disk00'
                 phys_path: '/dev/gpt/disk00'
                 whole_disk: 1
                 DTL: 378
                 create_txg: 4
             children[1]:
                 type: 'disk'
                 id: 1
                 guid: 14410366944090513563
                 path: '/dev/gpt/disk01'
                 phys_path: '/dev/gpt/disk01'
                 whole_disk: 1
                 DTL: 53
                 create_txg: 4
             children[2]:
                 type: 'disk'
                 id: 2
                 guid: 3526681390841761237
                 path: '/dev/gpt/disk02'
                 phys_path: '/dev/gpt/disk02'
                 whole_disk: 1
                 DTL: 52
                 create_txg: 4
             children[3]:
                 type: 'disk'
                 id: 3
                 guid: 3773850995072323004
                 path: '/dev/gpt/disk03'
                 phys_path: '/dev/gpt/disk03'
                 whole_disk: 1
                 DTL: 51
                 create_txg: 4
             children[4]:
                 type: 'disk'
                 id: 4
                 guid: 16528489666301728411
                 path: '/dev/gpt/disk04'
                 phys_path: '/dev/gpt/disk04'
                 whole_disk: 1
                 DTL: 50
                 create_txg: 4
             children[5]:
                 type: 'disk'
                 id: 5
                 guid: 11222774817699257051
                 path: '/dev/gpt/disk05'
                 phys_path: '/dev/gpt/disk05'
                 whole_disk: 1
                 DTL: 44147
                 create_txg: 4
         children[1]:
             type: 'raidz'
             id: 1
             guid: 614220834244218709
             nparity: 2
             metaslab_array: 39
             metaslab_shift: 37
             ashift: 12
             asize: 18003521961984
             is_log: 0
             create_txg: 40
             children[0]:
                 type: 'disk'
                 id: 0
                 guid: 8076478524731550200
                 path: '/dev/gpt/disk06'
                 phys_path: '/dev/gpt/disk06'
                 whole_disk: 1
                 DTL: 2914
                 create_txg: 40
             children[1]:
                 type: 'disk'
                 id: 1
                 guid: 1689851194543981566
                 path: '/dev/gpt/disk07'
                 phys_path: '/dev/gpt/disk07'
                 whole_disk: 1
                 DTL: 48
                 create_txg: 40
             children[2]:
                 type: 'disk'
                 id: 2
                 guid: 9743236178648200269
                 path: '/dev/gpt/disk08'
                 phys_path: '/dev/gpt/disk08'
                 whole_disk: 1
                 DTL: 47
                 create_txg: 40
             children[3]:
                 type: 'disk'
                 id: 3
                 guid: 10157617457760516410
                 path: '/dev/gpt/disk09'
                 phys_path: '/dev/gpt/disk09'
                 whole_disk: 1
                 DTL: 46
                 create_txg: 40
             children[4]:
                 type: 'disk'
                 id: 4
                 guid: 5035981195206926078
                 path: '/dev/gpt/disk10'
                 phys_path: '/dev/gpt/disk10'
                 whole_disk: 1
                 DTL: 45
                 create_txg: 40
             children[5]:
                 type: 'disk'
                 id: 5
                 guid: 4975835521778875251
                 path: '/dev/gpt/disk11'
                 phys_path: '/dev/gpt/disk11'
                 whole_disk: 1
                 DTL: 44149
                 create_txg: 40
         children[2]:
             type: 'raidz'
             id: 2
             guid: 7453512836015019221
             nparity: 2
             metaslab_array: 38974
             metaslab_shift: 37
             ashift: 12
             asize: 18003521961984
             is_log: 0
             create_txg: 4455560
             children[0]:
                 type: 'disk'
                 id: 0
                 guid: 11182458869377968267
                 path: '/dev/gpt/disk12'
                 phys_path: '/dev/gpt/disk12'
                 whole_disk: 1
                 DTL: 45059
                 create_txg: 4455560
             children[1]:
                 type: 'disk'
                 id: 1
                 guid: 5844283175515272344
                 path: '/dev/gpt/disk13'
                 phys_path: '/dev/gpt/disk13'
                 whole_disk: 1
                 DTL: 44145
                 create_txg: 4455560
             children[2]:
                 type: 'disk'
                 id: 2
                 guid: 13095364699938843583
                 path: '/dev/gpt/disk14'
                 phys_path: '/dev/gpt/disk14'
                 whole_disk: 1
                 DTL: 44144
                 create_txg: 4455560
             children[3]:
                 type: 'disk'
                 id: 3
                 guid: 5196507898996589388
                 path: '/dev/gpt/disk15'
                 phys_path: '/dev/gpt/disk15'
                 whole_disk: 1
                 DTL: 44143
                 create_txg: 4455560
             children[4]:
                 type: 'disk'
                 id: 4
                 guid: 12809770017318709512
                 path: '/dev/gpt/disk16'
                 phys_path: '/dev/gpt/disk16'
                 whole_disk: 1
                 DTL: 44142
                 create_txg: 4455560
             children[5]:
                 type: 'disk'
                 id: 5
                 guid: 7339883019925920701
                 path: '/dev/gpt/disk17'
                 phys_path: '/dev/gpt/disk17'
                 whole_disk: 1
                 DTL: 44141
                 create_txg: 4455560
         children[3]:
             type: 'disk'
             id: 3
             guid: 18011869864924559827
             path: '/dev/ada0p2'
             phys_path: '/dev/ada0p2'
             whole_disk: 1
             metaslab_array: 16675
             metaslab_shift: 26
             ashift: 12
             asize: 8585216000
             is_log: 1
             DTL: 86787
             create_txg: 5182360
         children[4]:
             type: 'disk'
             id: 4
             guid: 1338775535758010670
             path: '/dev/ada1p2'
             phys_path: '/dev/ada1p2'
             whole_disk: 1
             metaslab_array: 16693
             metaslab_shift: 26
             ashift: 12
             asize: 8585216000
             is_log: 1
             DTL: 86788
             create_txg: 5182377
     features_for_read:

Drives da0-da5 were Hitachi Deskstar 7K3000 (Hitachi HDS723030ALA640, 
firmware MKAOA3B0) -- these are 512-byte-sector drives, but da0 has been 
replaced by a Seagate Barracuda 7200.14 (AF) (ST3000DM001-1CH166, 
firmware CC24) -- a 4k-sector drive of a newer generation (note the 
relatively 'old' firmware, which cannot be upgraded).
Drives da6-da17 are also Seagate Barracuda 7200.14 (AF), but 
ST3000DM001-9YN166 with firmware CC4H -- the more "normal" part number. 
Some still have firmware CC4C, which I am replacing drive by drive (but 
other than the excessive load-cycle counts, no other issues so far).
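
Side note: what each drive actually reports for logical/physical sector 
size can be double-checked with, for example,

# camcontrol identify da0 | grep 'sector size'
# diskinfo -v /dev/da0

where da0 is just an example; diskinfo's "stripesize" should reflect 
the physical sector size the drive advertises.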

The only ZFS-related tuning is in /etc/sysctl.conf:
# improve ZFS resilver
vfs.zfs.resilver_delay=0
vfs.zfs.scrub_delay=0
vfs.zfs.top_maxinflight=128
vfs.zfs.resilver_min_time_ms=5000
vfs.zfs.vdev.max_pending=24
# L2ARC:
vfs.zfs.l2arc_norw=0
vfs.zfs.l2arc_write_max=83886080
vfs.zfs.l2arc_write_boost=83886080
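
These are set from /etc/sysctl.conf at boot; to rule out a tunable that 
did not actually take effect, the effective values can be verified at 
runtime, e.g.:

# sysctl vfs.zfs.vdev.max_pending vfs.zfs.top_maxinflight
# sysctl vfs.zfs.l2arc_write_max vfs.zfs.l2arc_write_boost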


The pool, of course, has dedup enabled and has had serious dedup 
ratios, over 10x. In general, with the ZIL and L2ARC, the only trouble 
I have seen with dedup is when deleting lots of data... which this 
server has seen plenty of. For this experiment, I have moved most of 
the data to another server and am un-dedup'ing the last remaining TBs.
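
The progress of the un-dedup can be followed with plain zpool/zdb 
commands, e.g.

# zpool list storage
# zdb -D storage

(the latter is the same command whose output is quoted further below).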

While doing zfs destroy on a 2-3TB dataset, I observe very annoying 
behaviour. The pool stays mostly idle, accepting almost no I/O and 
doing only small random reads, like this:

$ zpool iostat storage 10
storage     45.3T  3.45T    466      0  1.82M      0
storage     45.3T  3.45T     50      0   203K      0
storage     45.3T  3.45T     45     25   183K  1.70M
storage     45.3T  3.45T     49      0   199K      0
storage     45.3T  3.45T     50      0   202K      0
storage     45.3T  3.45T     51      0   204K      0
storage     45.3T  3.45T     57      0   230K      0
storage     45.3T  3.45T     65      0   260K      0
storage     45.3T  3.45T     68     25   274K  1.70M
storage     45.3T  3.45T     65      0   260K      0
storage     45.3T  3.45T     64      0   260K      0
storage     45.3T  3.45T     67      0   272K      0
storage     45.3T  3.45T     66      0   266K      0
storage     45.3T  3.45T     64      0   258K      0
storage     45.3T  3.45T     62     25   250K  1.70M
storage     45.3T  3.45T     57      0   231K      0
storage     45.3T  3.45T     58      0   235K      0
storage     45.3T  3.45T     66      0   267K      0
storage     45.3T  3.45T     64      0   257K      0
storage     45.3T  3.45T     60      0   241K      0
storage     45.3T  3.45T     50      0   203K      0
storage     45.3T  3.45T     52     25   209K  1.70M
storage     45.3T  3.45T     54      0   217K      0
storage     45.3T  3.45T     51      0   205K      0
storage     45.3T  3.45T     54      0   216K      0
storage     45.3T  3.45T     55      0   222K      0
storage     45.3T  3.45T     56      0   226K      0
storage     45.3T  3.45T     65      0   264K      0
storage     45.3T  3.45T     71      0   286K      0

The write peaks are from processes syncing data to the pool -- in this 
state it does no reads (the data the sync processes deal with is 
already in ARC).
Then the pool switches to writing back (perhaps DDT metadata):

storage     45.3T  3.45T     17  24.4K  69.6K  97.5M
storage     45.3T  3.45T      0  19.6K      0  78.5M
storage     45.3T  3.45T      0  14.2K      0  56.8M
storage     45.3T  3.45T      0  7.90K      0  31.6M
storage     45.3T  3.45T      0  7.81K      0  32.8M
storage     45.3T  3.45T      0  9.54K      0  38.2M
storage     45.3T  3.45T      0  7.07K      0  28.3M
storage     45.3T  3.45T      0  7.70K      0  30.8M
storage     45.3T  3.45T      0  6.19K      0  24.8M
storage     45.3T  3.45T      0  5.45K      0  21.8M
storage     45.3T  3.45T      0  5.78K      0  24.7M
storage     45.3T  3.45T      0  5.29K      0  21.2M
storage     45.3T  3.45T      0  5.69K      0  22.8M
storage     45.3T  3.45T      0  5.52K      0  22.1M
storage     45.3T  3.45T      0  3.26K      0  13.1M
storage     45.3T  3.45T      0  1.77K      0  7.10M
storage     45.3T  3.45T      0  1.63K      0  8.14M
storage     45.3T  3.45T      0  1.41K      0  5.64M
storage     45.3T  3.45T      0  1.22K      0  4.88M
storage     45.3T  3.45T      0  1.27K      0  5.09M
storage     45.3T  3.45T      0  1.06K      0  4.26M
storage     45.3T  3.45T      0  1.07K      0  4.30M
storage     45.3T  3.45T      0    979      0  3.83M
storage     45.3T  3.45T      0   1002      0  3.91M
storage     45.3T  3.45T      0   1010      0  3.95M
storage     45.3T  3.45T      0    948  2.40K  3.71M
storage     45.3T  3.45T      0    939      0  3.67M
storage     45.3T  3.45T      0   1023      0  7.10M
storage     45.3T  3.45T      0  1.01K  4.80K  4.04M
storage     45.3T  3.45T      0    822      0  3.22M
storage     45.3T  3.45T      0    434      0  1.70M
storage     45.3T  3.45T      0    398  2.40K  1.56M

For quite some time, there are no reads from the pool. When that 
happens, gstat (gstat -f 'da[0-9]*$') displays something like this:


dT: 1.001s  w: 1.000s  filter: da[0-9]*$
  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    24   1338      0      0    0.0   1338  12224   17.8  100.9| da0
    24   6888      0      0    0.0   6888  60720    3.5  100.0| da1
    24   6464      0      0    0.0   6464  71997    3.7  100.0| da2
    24   6117      0      0    0.0   6117  82386    3.9   99.9| da3
    24   6455      0      0    0.0   6455  66822    3.7  100.0| da4
    24   6782      0      0    0.0   6782  69207    3.5  100.0| da5
    24    698      0      0    0.0    698  27533   34.1   99.6| da6
    24    590      0      0    0.0    590  21627   40.9   99.7| da7
    24    561      0      0    0.0    561  21031   42.8  100.2| da8
    24    724      0      0    0.0    724  25583   33.1   99.9| da9
    24    567      0      0    0.0    567  22965   41.4   98.0| da10
    24    566      0      0    0.0    566  21834   42.4   99.9| da11
    24    586      0      0    0.0    586   4899   43.5  100.2| da12
    24    487      0      0    0.0    487   4008   49.3  100.9| da13
    24    628      0      0    0.0    628   5007   38.9  100.2| da14
    24    714      0      0    0.0    714   5706   33.8   99.9| da15
    24    595      0      0    0.0    595   4831   39.8   99.8| da16
    24    485      0      0    0.0    485   3932   49.2  100.1| da17
     0      0      0      0    0.0      0      0    0.0    0.0| da18
     0      0      0      0    0.0      0      0    0.0    0.0| da19
     0      0      0      0    0.0      0      0    0.0    0.0| da20
     0      0      0      0    0.0      0      0    0.0    0.0| ada0
     0      0      0      0    0.0      0      0    0.0    0.0| ada1


(drives da18 and da19 are spares, da20 is the L2ARC SSD, and ada0 and 
ada1 are the boot SSDs in a separate zpool)
Now, here comes the weird part. The gstat display shows intensive 
writes to all vdevs (da0-da5, da6-da11, da12-da17); then one vdev 
finishes its writes and stops, while the other vdevs continue, until at 
the end only a single vdev is still writing -- until, as it seems, the 
data is completely written to all vdevs (this can be seen in the zpool 
iostat output above as the write IOPS decreasing every 10 seconds). 
Then there is a few-second "do nothing" period, and we are back to the 
small reads.
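
A per-vdev view of the same thing is easier to follow with one-second 
samples, e.g.

# zpool iostat -v storage 1

which shows directly which raidz2 group is still flushing while the 
others have already gone quiet.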

The other observation I have is with the first vdev: the 
512-byte-sector drives do a lot of I/O quickly, complete first and then 
sit idle, while da0 continues to write for many more seconds. They 
consistently show many more IOPS than the other drives for this type of 
activity -- on streaming writes all the drives behave more or less the 
same. It is only in this un-dedup scenario that the difference is so 
pronounced.

All the vdevs in the pool have ashift=12, so the theory that ZFS 
actually issues 512-byte writes to these drives can't be true, can it?
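
(For what it's worth, the gstat sample above also argues against 
512-byte writes: da1 is doing roughly 60720 kBps / 6888 w/s = ~8.8 KB 
per write, and even da0 about 12224 / 1338 = ~9.1 KB per write.)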

Another worry is this Seagate Barracuda 7200.14 (AF) 
(ST3000DM001-1CH166, firmware CC24) drive. It seems to be constantly 
under-performing. Does anyone know whether it really is that different 
from the ST3000DM001-9YN166 drives? Maybe I should just replace it?
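
If it comes to that, a crude read-only comparison of the raw drives -- 
assuming a short benchmark against live pool members is acceptable -- 
would be something like

# diskinfo -t /dev/da0
# diskinfo -t /dev/da6

to see whether the -1CH166 really is slower at the transfer/seek level 
than a -9YN166.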

My concern is the bursty and irregular nature of writing to the vdevs. 
As it is now, a write operation to the pool needs to wait for all of 
the vdev writes to complete, which in this case takes tens of seconds. 
A single underperforming drive in a vdev will slow down the entire 
pool. Perhaps ZFS could prioritize vdev usage based on vdev throughput, 
similar to how it already prioritizes writes based on how full each 
vdev is.

Also, what is ZFS doing during the idle periods? Are there some timeouts 
involved? It is certainly not using any CPU... The small random I/O is 
certainly not loading the disks.
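
The only idea I have for peeking at this is to dump the kernel stacks 
of the ZFS threads during such an idle stretch, e.g.

# procstat -kk -a | egrep -i 'txg|zio|zfs'

assuming that shows something useful -- maybe someone has a better 
suggestion.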

Then, I have a 240GB L2ARC and secondarycache=metadata set on the pool. 
Yet the DDT apparently does not want to go there... Is there a way to 
"force" it to be loaded into L2ARC? Before the last big delete, I had:

zdb -D storage
DDT-sha256-zap-duplicate: 19907778 entries, size 1603 on disk, 259 in core
DDT-sha256-zap-unique: 30101659 entries, size 1428 on disk, 230 in core

dedup = 1.98, compress = 1.00, copies = 1.03, dedup * compress / copies 
= 1.92

Over time, the in-core values stay more or less the same.
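
For reference, what actually lands in L2ARC can at least be watched via 
the ARC kstats:

# sysctl kstat.zfs.misc.arcstats | egrep 'l2_(size|hdr_size|hits|misses)'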

I also discovered that the L2ARC drive apparently is not subject to 
TRIM for some reason. TRIM works on the boot drives, but those are 
connected to the motherboard SATA ports.

# sysctl kern.cam.da.20
kern.cam.da.20.delete_method: ATA_TRIM
kern.cam.da.20.minimum_cmd_size: 6
kern.cam.da.20.sort_io_queue: 0
kern.cam.da.20.error_inject: 0

# sysctl -a | grep trim
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
kstat.zfs.misc.zio_trim.bytes: 139489971200
kstat.zfs.misc.zio_trim.success: 628351
kstat.zfs.misc.zio_trim.unsupported: 622819
kstat.zfs.misc.zio_trim.failed: 0

Yet I don't observe any BIO_DELETE activity on this drive with gstat -d.
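
Concretely, I am watching

# gstat -d -f 'da20$'

alongside the kstat.zfs.misc.zio_trim counters quoted above.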

Wasn't TRIM supposed to work on drives attached to LSI2008 in 9-stable?

Daniel

