Dev help requested: short DTrace instrumentation script to narrow down cause of a longstanding ZFS write issue

Stilez Stilezy stilezy at gmail.com
Fri Dec 21 03:18:43 UTC 2018


I'm experiencing a persistent timeout problem that only occurs during
network writes across a 10G LAN to a ZFS pool. Work to date confirms it's a
real server-side issue arising on the ZFS side, not in the client, the
network hardware, or FreeBSD's network I/O handling, and it involves either
ZFS mishandling dirty data free-up (there are bugs logged for this) or ZFS
not correctly delaying a too-fast write source. Of course it could also be
mis-set sysctls, but I've replicated it with a default FreeNAS setup, so
those would have to be mis-configured as well, if so.

To save time, I'll skip the detailed debugging done to date and just say
that this is where I've got to so far. I asked about this previously, long
ago, and it was suggested at the time that it could be a dirty data bug.
With more experience under my belt, I'd like to dig deeper and try to
narrow down the point that is bottlenecking, using DTrace.

Specifically, I'd like help with a short instrumentation script to capture
four key events within the ZFS write data pathway, in order to pin down
which part of the ZFS data flow is implicated in the bottleneck. I need
help because it requires detailed knowledge of the ZFS codebase. I get the
impression it could be a very short script - if I can narrow it down, maybe
I can find a fix, or at least better evidence of the exact point where the
bug/issue is arising.

*Symptoms:*

I'm using ZFS on enterprise hardware (Supermicro, Xeon E5-1620 v4, 128GB
ECC, Intel P3700 NVMe SLOG/ZIL, 7200 rpm mirrored enterprise SAS3 HDD pool,
9311 SAS3 HBA IT mode, Chelsio NICs, 10G fibre LAN, Netgear managed SFP+
switch).

With a quiesced "known good" 10G network, and just a single SMB, iSCSI or
netcat client (the same happens with all 3 protocols) doing a single 100GB
file write, I can see the network RCV buffer gradually being driven down to
zero without getting time to recover much. The TCP window drops to << 80
bytes in tcpdump as the stack tries to cope, and eventually (when the TCP
window is driven to zero often enough) the file write session fails due to
the persistent zero-window condition. Jumps in the network RCV queue seem
to coincide with the moments just before disk activity in the HDD pool,
adding to the evidence that it's related to the ZFS write pathway.
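
As a cross-check, it might also be worth logging the advertised window from
the server side with DTrace at the same time, so the zero-window events can
be correlated against the ZFS events requested below. This is only a rough
sketch and assumes FreeBSD's tcp DTrace provider with the illumos-style
argument layout (args[2] = ipinfo_t, args[4] = tcpinfo_t), which I haven't
verified:

    /* tcpwin.d - log outbound segments advertising a small receive window */
    #pragma D option quiet

    tcp:::send
    /args[4]->tcp_window < 1000/
    {
        printf("%d -> %s:%d advertised win=%d\n", timestamp,
            args[2]->ip_daddr, args[4]->tcp_dport, args[4]->tcp_window);
    }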

All the evidence suggests the network and client sides aren't the issue.
What seems to be happening is that ZFS, whether misconfigured or hitting a
bug, is neither pulling new data into the next TXG in a timely manner nor
delaying the writer gracefully. File data is coming into the NIC RCV buffer
at 10G speed, faster than ZFS is allowing it to be drained, and ZFS isn't
effectively signalling the client to slow down or back off.
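
One quick sanity check on the "not delaying gracefully" half of that, even
before the fuller instrumentation below, would be to see whether the ZFS
write throttle is engaging at all. My guess (untested, and assuming the fbt
provider exposes dmu_tx_wait() and dmu_tx_delay() in zfs.ko - the latter is
static so may have been inlined away - and that dmu_tx_delay()'s second
argument is the dirty byte count, as in the OpenZFS source) is something
like:

    /* throttle.d - is the ZFS write throttle delaying writers at all? */
    #pragma D option quiet

    fbt::dmu_tx_delay:entry
    {
        @delays["dmu_tx_delay calls"] = count();
        /* arg1 assumed to be the dirty-data byte count at the delay */
        @dirty["dirty bytes when delayed"] = quantize(arg1);
    }

    fbt::dmu_tx_wait:entry
    {
        self->tw = timestamp;
    }

    fbt::dmu_tx_wait:return
    /self->tw/
    {
        @waitns["dmu_tx_wait time (ns)"] = quantize(timestamp - self->tw);
        self->tw = 0;
    }

    tick-10s
    {
        printa(@delays);
        printa(@dirty);
        printa(@waitns);
        clear(@delays);
        trunc(@dirty);
        trunc(@waitns);
    }

If those stay at zero while the RCV buffer is collapsing, that would point
at the throttle never engaging rather than engaging too slowly.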


*dtrace:*

The most direct route to narrowing it down would involve instrumenting to
log the timestamps, sizes, and time taken for (I think!) 4 events. I don't
know the exact names used for these calls, so I have to describe their
effect. Help with identifying the appropriate functions/instrumentation
points, even if it's not clear how to instrument them, would be great!

   - There should be a function that gets called/polled when ZFS has space
   in its write pipeline; it checks whether any ongoing write has a chunk of
   data to be written and, if so, pulls data from the source
   stream/pipe/descriptor (in this case the network RCV buffer, but it could
   be any source of an ongoing file write). This will let me instrument the
   events when ZFS requests/gets source data for an ongoing file write, in
   this case pulling it from the network buffer (see the sketch after this
   list).
* To instrument:    timestamp, max bytes to read, actual bytes read*

   - Within the ZFS code, the write pathway involves (simplistically) newly
   received write data and any related metadata/consequent updates being
   added to the current TXG and the SLOG/ZIL, and ultimately being written
   out to the pool.
   *To instrument for those 3 events:     timestamp, bytes written,
   write-out time taken.      (Ideally also, dirty data cache size
   before/after)*
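
To get the ball rolling, here's my rough, untested attempt at what such a
script might look like. The function names, argument positions and struct
fields are all guesses from reading about the write path, and I'd welcome
corrections: I'm assuming fbt exposes zfs_write(), dmu_tx_assign(),
zil_commit(), dsl_pool_sync() and spa_sync() in zfs.ko, that zfs_write()'s
second argument is the uio, that dsl_pool_sync() and spa_sync() take
(pool/spa, txg), and that CTF data is available for the casts.

    /* zfswrite.d - rough sketch of the 4 instrumentation points above */
    #pragma D option quiet

    /* 1: ZFS pulling write data from the source, in the writing thread */
    fbt::zfs_write:entry
    {
        self->wr_ts = timestamp;
        self->wr_uio = (struct uio *)arg1;   /* assumed: arg1 is the uio */
        self->wr_req = self->wr_uio->uio_resid;
    }

    fbt::zfs_write:return
    /self->wr_ts/
    {
        printf("%d zfs_write req=%d done=%d ns=%d\n", timestamp,
            self->wr_req, self->wr_req - self->wr_uio->uio_resid,
            timestamp - self->wr_ts);
        self->wr_ts = 0; self->wr_uio = 0; self->wr_req = 0;
    }

    /* 2: admitting the write into the open TXG (where the throttle bites) */
    fbt::dmu_tx_assign:entry
    {
        self->txa_ts = timestamp;
    }

    fbt::dmu_tx_assign:return
    /self->txa_ts/
    {
        printf("%d dmu_tx_assign ns=%d\n", timestamp,
            timestamp - self->txa_ts);
        self->txa_ts = 0;
    }

    /* 3: ZIL/SLOG commit for synchronous write semantics */
    fbt::zil_commit:entry
    {
        self->zil_ts = timestamp;
    }

    fbt::zil_commit:return
    /self->zil_ts/
    {
        printf("%d zil_commit ns=%d\n", timestamp, timestamp - self->zil_ts);
        self->zil_ts = 0;
    }

    /* 4: TXG sync - write-out of the TXG to the main pool */
    fbt::dsl_pool_sync:entry
    {
        /* dirty data outstanding as the sync starts (assumed field name) */
        printf("%d dsl_pool_sync txg=%d dirty=%d\n", timestamp, arg1,
            ((struct dsl_pool *)arg0)->dp_dirty_total);
    }

    fbt::spa_sync:entry
    {
        self->sync_ts = timestamp;
        self->sync_txg = arg1;
    }

    fbt::spa_sync:return
    /self->sync_ts/
    {
        printf("%d spa_sync txg=%d ns=%d\n", timestamp, self->sync_txg,
            timestamp - self->sync_ts);
        self->sync_ts = 0; self->sync_txg = 0;
    }

The idea would be to run it as something like "dtrace -s zfswrite.d >
log.txt" while a single 100GB write is in flight, and then line the
timestamps up against the RCV buffer/TCP window behaviour.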


Although the output file will be huge, it should be easy to analyse the
timestamped list of events and data flow and see which of the last 3 stages
is causing the bottleneck.

I would enjoy doing the work and tracking this down, but I need a little
help with the instrumentation aspect, which really needs developer-level
knowledge of the code.

Could someone kind please help with an appropriate short instrumentation
script to capture the raw data for these 4 items? :)

(And any other data that, in your experience, would be useful.)

Thank you!

Stilez

