Dev help requested: short DTrace instrumentation script to narrow down cause of a longstanding ZFS write issue

Stilez Stilezy stilezy at gmail.com
Fri Dec 21 13:19:25 UTC 2018


To add some more data about this issue: I just tried the suggestions
offered by Delphix for write throttle problems
<https://www.delphix.com/blog/delphix-engineering/tuning-openzfs-write-throttle>
(specifically the last part, about when ZFS genuinely can't keep up).

I used the same setup as above (same hardware, clean FreeNAS 11.2, a single
SMB client writing a single 100GB file to the pool, over an otherwise-idle
10G LAN). I modified only a few sysctls, in line with the guide; everything
else was left at the FreeNAS 11.2 defaults.

The sysctls I modified were vfs.zfs.delay_scale,
vfs.zfs.delay_min_dirty_percent, and
vfs.zfs.vdev.async_write_active_min/max_dirty_percent. The usual Chelsio
sysctl for max MTU 4096 under 11.2-REL was in use, and RNG entropy
harvesting from the 10G NIC was disabled. Apart from that, the ZFS sysctls
were at their default settings.

Hopefully this gives some idea of what's up. I kept it very simple.

I started by looking at dirty data writeout patterns before making any
sysctl changes. This is the output of the well-known combined dirty.d +
duration.d script (from the Delphix post; a reconstruction is reproduced
further below for reference) when the file transfer began on the client. It
seems to show matters getting out of hand with dirty data writeout almost
immediately, by the 4th TXG sync:

dtrace -s dirty_and_duration.d tank
dtrace: script 'dirty_and_duration.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  6  63989                  none:txg-synced    0MB of 4096MB synced in 0.47 seconds
  6  63989                  none:txg-synced    0MB of 4096MB synced in 0.00 seconds
  6  63989                  none:txg-synced    0MB of 4096MB synced in 0.00 seconds
  5  63989                  none:txg-synced   64MB of 4096MB synced in 0.76 seconds <-- copy begins
  1  63989                  none:txg-synced  305MB of 4096MB synced in 0.32 seconds
  2  63989                  none:txg-synced  438MB of 4096MB synced in 2.13 seconds
  1  63989                  none:txg-synced  819MB of 4096MB synced in 7.88 seconds <-- 4th sync is already taking too long for 5-second TXGs to keep up

And 7.9 seconds to write out 819MB!

A write-out rate of about 104MB/sec, when the pool is 43% empty and ZFS can
write to four 7200 rpm mirrors simultaneously, with no parity data to
calculate? Something has to be wrong.
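
For reference, the combined dirty_and_duration.d script is essentially the
dirty.d and duration.d examples from the Delphix post merged into one file.
A sketch is below - reconstructed from those examples rather than pasted
verbatim from my copy, so details such as the second figure reported may
differ, and it assumes the kernel symbol is still `zfs_dirty_data_max on
FreeBSD, as in the post:

txg-syncing
/stringof(((dsl_pool_t *)arg0)->dp_spa->spa_name) == $$1/
{
        /* note when this txg sync starts and how much dirty data it holds */
        start = timestamp;
        dirty = ((dsl_pool_t *)arg0)->dp_dirty_total;
}

txg-synced
/start && stringof(((dsl_pool_t *)arg0)->dp_spa->spa_name) == $$1/
{
        /* report dirty data against the cap, and how long the sync took */
        this->d = timestamp - start;
        printf("%4dMB of %4dMB synced in %d.%02d seconds",
            dirty / 1024 / 1024, `zfs_dirty_data_max / 1024 / 1024,
            this->d / 1000000000, this->d / 10000000 % 100);
        start = 0;
}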

Having no idea why writeout is so incredibly slow, I first tried to get it
to write consistently, without backlogging/stalling the file transfer.
Following the guide, I started by increasing delay_scale (from its current
value of 500K). The Delphix guide says to try doubling it or more. I took
it up to 1M, then 2M, then 5M, then 50M - a factor of 100x - and the dtrace
one-liner below showed that even at that scaling level, ZFS was still
needing to impose delays of 50M - 100M most of the time - and despite doing
so, was still unable to prevent the issue:

# dtrace -n delay-mintime'{ @ = quantize(arg2); }'
dtrace: description 'delay-mintime' matched 1 probe
^C

           value  ------------- Distribution ------------- count
             128 |                                         0
             256 |                                         1
             512 |                                         0
            1024 |                                         1
            2048 |                                         1
            4096 |                                         4
            8192 |                                         6
           16384 |                                         13
           32768 |                                         26
           65536 |                                         51
          131072 |                                         102
          262144 |                                         203
          524288 |@                                        399
         1048576 |@                                        775
         2097152 |@@                                       1453
         4194304 |@@@@                                     2603
         8388608 |@@@@@@                                   4243
        16777216 |@@@@@                                    4048
        33554432 |@@@@@@                                   4485
        67108864 |@@@@@@@@@@@@@@@                          11208
       134217728 |                                         0
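
If it would help, I can re-run that with the delays bucketed per interval,
to show how they evolve over the course of a transfer - e.g. a variant
along these lines (untested):

# dtrace -n 'delay-mintime { @ = quantize(arg2); } tick-10s { printa(@); trunc(@); }'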


Next, I tried the following part of the Delphix guide: taking delay_scale
back down to "sane" values and reducing the dirty data write-out
percentages. I finally got past the problem that way - but I had to take
delay_min_dirty_percent and vdev.async_write_active_max_dirty_percent down
to 15%, and delay_scale up to 2M, to do it. Those are insanely limited
values for an enterprise server handling a single file transfer.
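
In sysctl terms (assuming I have the 11.2 OID names right), that final
combination was along the lines of:

vfs.zfs.delay_scale=2000000
vfs.zfs.delay_min_dirty_percent=15
vfs.zfs.vdev.async_write_active_max_dirty_percent=15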

I checked disk IO. As shown below, all disks seemed to be performing
roughly equally, suggesting the issue isn't the HBA or "bad" HDDs:

# zpool iostat -T d -v tank 1
                                           capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
tank                                    14.5T  10.9T      0  52.3K      0   210M
  mirror                                4.12T  3.13T      0  8.60K      0  34.7M
    da6                                     -      -      0    778      0  35.1M
    da5                                     -      -      0    945      0  35.8M
    da7                                     -      -      0  1.01K      0  36.8M
  mirror                                3.83T  1.61T      0  20.7K      0  82.9M
    da8                                     -      -      0  5.01K      0  83.0M
    da9                                     -      -      0  5.11K      0  83.0M
    da10                                    -      -      0  3.14K      0  82.9M
  mirror                                3.81T  1.62T      0  12.2K      0  48.8M
    da4                                     -      -      0  2.17K      0  48.8M
    da11                                    -      -      0  2.13K      0  50.7M
    da2                                     -      -      0  2.73K      0  50.7M
  mirror                                2.69T  4.57T      0  10.8K      0  43.6M
    da0                                     -      -      0  1.26K      0  45.5M
    da1                                     -      -      0  1.29K      0  43.7M
    da3                                     -      -      0  1.50K      0  45.8M


I gather the stats here aren't reliable, but if that's any indication of
the average bytes per write, it also suggests something odd is going on.
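
If it would help rule the individual disks in or out, I can also capture
per-device I/O latency while a transfer runs, with something along these
lines (a sketch using the io provider; I haven't checked how the devinfo
fields are populated on this box, so dev_name may need to be dev_statname
or similar):

io:::start
{
        /* key on the buf/bio pointer so start and done can be matched */
        iostart[arg0] = timestamp;
}

io:::done
/iostart[arg0]/
{
        /* per-device latency distribution in nanoseconds */
        @lat[args[1]->dev_name] = quantize(timestamp - iostart[arg0]);
        iostart[arg0] = 0;
}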

The vdevs comprise matched new enterprise 7200 rpm HDDs; the top two vdevs
have slightly faster HDDs than the bottom two. Writing a single large
sequential file to a 57% full pool, data should be written to all four
vdevs at once, hence topping out at roughly 4 x single-HDD speed as an
ideal ceiling, say about 400 - 800MB/sec (the HDDs individually benchmark
at around 180 - 200 MB/sec sequential write).

I double-checked by testing local write speed with a 5GB test file,
pre-created on a different device so that it was already in ZFS ARC/L2ARC
and reading it wouldn't slow the writing down. I copied the file to three
destinations, to allow the effects to be distinguished: /dev/null (pure
read, no disk writes), the boot pool (ZFS on mirrored slow-ish SATA SSDs,
54% full), and the same pool as above. I also used dtrace to watch
writeouts, as above, to compare with the LAN file transfer's writeout
patterns. I can't decide whether this shows the time taken is all down to
/dev/random (used to create the test files), or whether it shows the
problem is still there and hasn't actually gone away, despite the
apparently smooth LAN dataflow achieved above:
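
(Note: for these runs I had added a seconds-since-boot timestamp to the
txg-synced printf - something like

    printf("[At timestamp: %d.%03d sec] ", timestamp / 1000000000,
        (timestamp / 1000000) % 1000);

prepended to the existing output - hence the extra field compared with the
first run above.)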

# dd if=/dev/random of=/TEST0.DAT bs=1048576 count=5120   <-- saved to boot pool
# dd if=/dev/random of=TEST1.DAT bs=1048576 count=5120    <-- saved to tank

# /usr/bin/time -h cp /TEST0.DAT /dev/null                <-- read speed for boot pool file
1.25s real              0.00s user              1.25s sys

# /usr/bin/time -h cp /TEST0.DAT TEST0-COPY1.DAT          <-- copy boot pool file to tank
2m30.34s real           0.00s user              2.29s sys

  1  63989                  none:txg-synced [At timestamp: 9507491.794 sec]      0MB of  476MB synced in 0.00 seconds
  2  63989                  none:txg-synced [At timestamp: 9507496.057 sec]     64MB of  476MB synced in 0.88 seconds
  7  63989                  none:txg-synced [At timestamp: 9507499.424 sec]    476MB of  476MB synced in 3.36 seconds
  2  63989                  none:txg-synced [At timestamp: 9507515.465 sec]    476MB of  476MB synced in 16.04 seconds
  4  63989                  none:txg-synced [At timestamp: 9507532.055 sec]    476MB of  476MB synced in 16.59 seconds
  4  63989                  none:txg-synced [At timestamp: 9507548.635 sec]    476MB of  476MB synced in 16.58 seconds
  5  63989                  none:txg-synced [At timestamp: 9507564.808 sec]    476MB of  476MB synced in 16.17 seconds
  4  63989                  none:txg-synced [At timestamp: 9507580.779 sec]    476MB of  476MB synced in 15.97 seconds
  7  63989                  none:txg-synced [At timestamp: 9507598.004 sec]    476MB of  476MB synced in 17.22 seconds
  2  63989                  none:txg-synced [At timestamp: 9507614.247 sec]    476MB of  476MB synced in 16.24 seconds
  4  63989                  none:txg-synced [At timestamp: 9507630.392 sec]    476MB of  476MB synced in 16.14 seconds
  4  63989                  none:txg-synced [At timestamp: 9507646.723 sec]    476MB of  476MB synced in 16.33 seconds
  3  63989                  none:txg-synced [At timestamp: 9507657.185 sec]    300MB of  476MB synced in 10.46 seconds
  3  63989                  none:txg-synced [At timestamp: 9507657.218 sec]      0MB of  476MB synced in 0.03 seconds
       [completed]

# /usr/bin/time -h cp TEST1.DAT /dev/null                 <-- read speed for tank file
1.21s real              0.00s user              1.20s sys

# /usr/bin/time -h cp TEST1.DAT /TEST1-COPY1.DAT          <-- copy tank file to boot pool
2m0.98s real            0.00s user              2.57s sys

  2  63989                  none:txg-synced [At timestamp: 9507827.131 sec]      0MB of  476MB synced in 0.01 seconds
  0  63989                  none:txg-synced [At timestamp: 9507831.391 sec]     64MB of  476MB synced in 1.59 seconds
  2  63989                  none:txg-synced [At timestamp: 9507840.045 sec]    384MB of  476MB synced in 8.65 seconds
  2  63989                  none:txg-synced [At timestamp: 9507840.045 sec]     62MB of  476MB synced in 0.00 seconds
  0  63989                  none:txg-synced [At timestamp: 9507841.743 sec]     64MB of  476MB synced in 1.69 seconds
  2  63989                  none:txg-synced [At timestamp: 9507850.117 sec]    368MB of  476MB synced in 8.37 seconds
  2  63989                  none:txg-synced [At timestamp: 9507850.118 sec]     60MB of  476MB synced in 0.00 seconds
  1  63989                  none:txg-synced [At timestamp: 9507851.751 sec]     62MB of  476MB synced in 1.63 seconds
  7  63989                  none:txg-synced [At timestamp: 9507859.728 sec]    351MB of  476MB synced in 7.97 seconds
  7  63989                  none:txg-synced [At timestamp: 9507859.728 sec]     58MB of  476MB synced in 0.00 seconds
  2  63989                  none:txg-synced [At timestamp: 9507861.298 sec]     61MB of  476MB synced in 1.56 seconds
  5  63989                  none:txg-synced [At timestamp: 9507868.970 sec]    336MB of  476MB synced in 7.67 seconds
  5  63989                  none:txg-synced [At timestamp: 9507868.971 sec]     56MB of  476MB synced in 0.00 seconds
  2  63989                  none:txg-synced [At timestamp: 9507870.507 sec]     60MB of  476MB synced in 1.53 seconds
  6  63989                  none:txg-synced [At timestamp: 9507877.644 sec]    322MB of  476MB synced in 7.13 seconds
  6  63989                  none:txg-synced [At timestamp: 9507877.664 sec]     55MB of  476MB synced in 0.02 seconds
  4  63989                  none:txg-synced [At timestamp: 9507879.197 sec]     96MB of  476MB synced in 1.53 seconds
       [same pattern until completed]


Hmm. 16-17 seconds on average to write out 476MB that's already in
ARC/L2ARC from a previous write? 150 seconds to write out a simple 5GB
file? That's about 34 MB/sec effective copying speed, copying a file
locally from ARC/L2ARC to a pool of 4 fast striped mirrors that should be
able to write at 600-800MB/sec.  It does confirm, however, that the issue
isn't LAN related, since it happens even with local file writing.  The boot
pool isn't much faster; although it's SSD, it has only one mirror, so it's
at a 4x speed disadvantage from the start.  Very, very wrong-looking.....

Finally, I returned to dirty.d/duration.d to compare the later TXG writeout
patterns (using the new, much tighter settings that didn't stall LAN
writes) with those from the initial test run. This time it can be seen
handling the transfer fine - but it also shows again how much performance
had to be strangled in order to do so:

# dtrace -s dirty_and_duration.d tank
dtrace: script 'dirty_and_duration.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  63989                  none:txg-synced [At timestamp: 9503423.244 sec]      0MB of  476MB synced in 0.00 seconds
  4  63989                  none:txg-synced [At timestamp: 9503429.056 sec]      0MB of  476MB synced in 0.50 seconds
  3  63989                  none:txg-synced [At timestamp: 9503431.789 sec]     64MB of  476MB synced in 0.69 seconds
  7  63989                  none:txg-synced [At timestamp: 9503433.114 sec]    280MB of  476MB synced in 1.32 seconds
  0  63989                  none:txg-synced [At timestamp: 9503435.103 sec]    476MB of  476MB synced in 1.98 seconds
  2  63989                  none:txg-synced [At timestamp: 9503436.962 sec]    476MB of  476MB synced in 1.85 seconds
  6  63989                  none:txg-synced [At timestamp: 9503438.810 sec]    470MB of  476MB synced in 1.84 seconds
  3  63989                  none:txg-synced [At timestamp: 9503440.791 sec]    474MB of  476MB synced in 1.98 seconds
  0  63989                  none:txg-synced [At timestamp: 9503442.726 sec]    476MB of  476MB synced in 1.93 seconds
  2  63989                  none:txg-synced [At timestamp: 9503445.287 sec]    476MB of  476MB synced in 2.56 seconds
  2  63989                  none:txg-synced [At timestamp: 9503447.193 sec]    476MB of  476MB synced in 1.90 seconds
  2  63989                  none:txg-synced [At timestamp: 9503449.566 sec]    476MB of  476MB synced in 2.37 seconds
  3  63989                  none:txg-synced [At timestamp: 9503451.476 sec]    476MB of  476MB synced in 1.91 seconds
  2  63989                  none:txg-synced [At timestamp: 9503453.207 sec]    476MB of  476MB synced in 1.73 seconds
  4  63989                  none:txg-synced [At timestamp: 9503454.983 sec]    411MB of  476MB synced in 1.77 seconds
  3  63989                  none:txg-synced [At timestamp: 9503456.860 sec]    476MB of  476MB synced in 1.87 seconds
  3  63989                  none:txg-synced [At timestamp: 9503458.820 sec]    473MB of  476MB synced in 1.95 seconds
  3  63989                  none:txg-synced [At timestamp: 9503460.682 sec]    476MB of  476MB synced in 1.86 seconds
  4  63989                  none:txg-synced [At timestamp: 9503462.676 sec]    476MB of  476MB synced in 1.99 seconds
  0  63989                  none:txg-synced [At timestamp: 9503464.346 sec]    476MB of  476MB synced in 1.66 seconds

Crunching the numbers, the average write-out transfer rate is 242-245
MB/sec (411-476MB in ~1.8 sec for individual TXGs, and 7064 MB in 29.2 sec
overall). It looks like something has hit a limit here too - TXGs are being
written out about every 1.8 sec, and are also taking about 1.8 sec to write
out. Put another way, to get to a point where ZFS was inserting enough
delay that the file transfer was consistent and no longer backlogging and
stalling the NIC buffer for tens of seconds or minutes at a time, I had to
push the overall file transfer down to about 240 MB/sec - which, for a
single client writing a single file to an enterprise server able to write
to 4 vdevs in parallel, is crazy low.


I don't know if this helps, but I feel it's worth conveying how badly my
server's ZFS async write pathway is writing out, even under what is AFAIK a
reasonable/default config, with fast hardware appropriate to the load and
fast striped mirrors - and why I think it's indicative of something deeper
that I'm not seeing, whether mis-set sysctls or a dirty data bug, as
suggested in previous replies when this issue was raised on this list.

So this is why I want to dig "under the hood" with dtrace, but I need help
writing the necessary short instrumentation scripts to narrow down where
it's going wrong internally.
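
To make the request concrete, this is the sort of thing I have in mind as a
starting point - timings only, since I'm not sure of the argument types,
and the function names (zfs_write, zil_commit) are my best guesses from
browsing the code, alongside the delay/txg probes already used above.
Corrections very welcome:

/* top of the write path: data being accepted from the writer */
fbt::zfs_write:entry  { self->wts = timestamp; }

fbt::zfs_write:return
/self->wts/
{
        @t["zfs_write (ns)"] = quantize(timestamp - self->wts);
        self->wts = 0;
}

/* write-throttle delays actually imposed on the writer */
delay-mintime
{
        @t["throttle delay requested (arg2)"] = quantize(arg2);
}

/* ZIL commits, i.e. the SLOG leg of the path */
fbt::zil_commit:entry  { self->zts = timestamp; }

fbt::zil_commit:return
/self->zts/
{
        @t["zil_commit (ns)"] = quantize(timestamp - self->zts);
        self->zts = 0;
}

/* TXG sync to the main pool, as per dirty.d/duration.d */
txg-syncing  { syncstart = timestamp; }

txg-synced
/syncstart/
{
        @t["txg sync (ns)"] = quantize(timestamp - syncstart);
        syncstart = 0;
}

Obviously I'd want to add byte counts and restrict it to the one pool once
someone confirms the right probe points and arguments - this is just to
show the shape of what I'm after, i.e. the four stages timestamped against
each other.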

Help kindly appreciated!

Stilez



On Fri, 21 Dec 2018 at 03:18, Stilez Stilezy <stilezy at gmail.com> wrote:

> I'm experiencing a persistent timeout problem that only occurs during
> network writes across a 10G LAN to a ZFS pool. Work to date confirms it's a
> real server-side issue arising within ZFS rather than the client, the
> network hardware, or FreeBSD's network I/O handling, and that it involves
> either ZFS mishandling dirty data free-up (there are bugs logged for this)
> or ZFS not correctly delaying a too-fast write source. Of course it could
> also be mis-set sysctls, but I've replicated it with a default FreeNAS
> setup, so it would have to be mis-configured in that as well, if so.
>
> To save time, can I skip the detailed debugging done to date and just say
> that this is where I've got to so far. I asked about this previously, long
> ago, and it was suggested at the time that it could be a dirty data bug.
> With more experience under my belt, I'd like to dig deeper and try to
> narrow down the bottleneck point using dtrace.
>
> Specifically, I'd like help with a short instrumentation script to capture
> four key events within the ZFS write data pathway, in order to pin down
> which part of the ZFS data flow is implicated in the bottleneck.  I need
> help because it requires detailed knowledge of the ZFS codebase.  I get the
> impression it could be a very short script - if I can narrow it down, maybe
> I can find a fix, or at least better evidence of the exact point where the
> bug/issue is arising.
>
> *Symptoms:*
>
> I'm using ZFS on enterprise hardware (Supermicro, Xeon E5-1620 v4, 128GB
> ECC, Intel P3700 NVMe SLOG/ZIL, 7200 mirrored enterprise SAS3 HDD pool,
> 9311 SAS3 HBA IT mode, Chelsio NICs, 10G fibre LAN, Netgear managed SFP+
> switch).
>
> With a quiesced "known good" 10G network, and just a single SMB, iSCSI or
> netcat client (the same happens with all 3 protocols) doing a single 100GB
> file write, I can see the space in the network RCV buffer gradually being
> driven down to zero without getting time to recover much, causing the TCP
> window to drop to << 80 bytes in tcpdump as it tries to cope, and
> eventually (when the TCP window is driven to zero often enough) the file
> write session fails due to the persistent zero-window condition. Jumps in
> the network RCV queue seem to occur just before disk activity in the HDD
> pool, adding to the evidence that it's related to the ZFS write pathway.
>
> All the evidence suggests the network and client sides aren't the issue.
> What's happening is that ZFS, whether misconfigured or hitting a bug, is
> neither pulling new data into the next TXG in a timely manner nor delaying
> it gracefully. File data is coming into the NIC RCV buffer at 10G speed,
> faster than ZFS is allowing it to be removed, and ZFS isn't effectively
> signalling the client to slow down or delay.
>
>
> *dtrace:*
>
> The most direct route to narrow it down, would involve instrumenting to
> log the timestamps, sizes, and time taken, for (I think!) 4 events. I don't
> know the exact names used for these calls, so I have to describe their
> effect. Help with identifying appropriate functions/instrumentation points,
> even if it's not clear how to instrument them,  would be great!
>
>    - There should be a function that gets called/polled when ZFS has
>    space in its write pipeline, it checks to see if any ongoing write has a
>    chunk of data to be written, and if so, pulls data from the source
>    stream/pipe/descriptor (in this case the network RCV buffer, could be any
>    source of an ongoing file write). This will let me instrument the events,
>    when ZFS requests/gets source data from any ongoing file write, in this
>    case pulling it from the network buffer.
>    To instrument:    timestamp, max bytes to read, actual bytes read
>
>    - Within the ZFS code, the write pathway involves (simplistically)
>    newly received write data and any related metadata/consequent updates being
>    added to the current TXG and SLOG/ZIL, and ultimately, written out to the
>    pool.
>    To instrument for those 3 events:     timestamp, bytes written,
>    write-out time taken (ideally also, dirty data cache size
>    before/after).
>
>
> Although the output file will be huge, it should be easy to analyse the
> timestamped list of events+dataflow and see which of the last 3 processes
> is causing the bottleneck.
>
> I would enjoy doing the work and tracking this down, but I need a little
> help with the instrumentation aspect, which really needs developer-level
> knowledge of the code.
>
> Can someone kind please help with an appropriate short instrumentation
> script to capture the raw data for these 4 items :)
>
> (And any other data that in your experience, would be useful)
>
> Thank you!
>
> Stilez
>

