ZFS dedup write pathway - possible inefficiency or..?

Stilez Stilezy stilezy at gmail.com
Mon Apr 2 13:36:58 UTC 2018


Hi list,

I'm writing here because of the fairly technical nature of this question,
which concerns 2 possibly related issues within ZFS. The first issue is
specific to the dedup write pathway. I've tested locally to the point where
it doesn't seem to be due to inadequate hardware, and it's very consistent
and specific, even under idle conditions/minimal load, so I'm wondering
whether there's a code bottleneck affecting just the dedup write pathway.
The second issue is that in some scenarios ZFS doesn't read from the IO
buffers when I'd expect it to, causing netio issues elsewhere.

I should say that I'm aware of the intense nature of dedup processing and
hopefully I'm not a noob user asking as usual about dedup on crap hardware
that can't do the job with data that shouldn't be anywhere near a dedup
engine.  That's not the case here. The system I'm testing on is built and
specced to handle dedup and has a ton of dedupable data with a high ratio,
so it's not an edge case:   it should be ideal.

I'm not really asking for help in resolving the issue. My question is aimed
at understanding more, technically, about the bottlenecks/issues so I can
make intelligent decisions about how to approach them.  More to the point,
my gut feeling is that there is some kind of issue/inefficiency in the dedup
write pathway, plus a second issue whereby ZFS isn't always reading from the
network buffers as it should:  netio rcv buffers periodically fail to empty
during ZFS processing, apparently in a way related to txg handling, and this
causes netrcv buffer backup and TCP zero-window issuance within
milliseconds, persisting almost continually for lengthy periods. That
doesn't look right.

My main reason for posting here is that if these do turn out to be genuine
inefficiencies/issues, I'd like to ask whether it's sensible to put an
enhancement request into bugzilla for either or both of them.  Either way,
these probably need technical/dev/committer insight, as I'd like to find
out (1) whether it's possible to guess what the internal underlying ZFS
issues are, and (2) whether it's worth putting in enhancement/fix requests.



*TEST HARDWARE / OS:*

   - *Baseboard/CPU/RAM* = Supermicro X10 series + Xeon E5-1680 v4 (3.4+
   GHz, octo core, 20MB cache, Broadwell generation) + 128 GB ECC @ 2400
   - *Main pool* = array of 12 enterprise 7200 RPM SAS HDDs hanging off 2 x
   LSI 9311 PCIe3 HBAs. The HDDs are configured in ZFS as 4 x (3-way mirrors).
   Cache drives = Intel P3700 NVMe SLOG/ZIL (reckoned to be very good for
   reliable low write latency) and 250GB Samsung NVMe (L2ARC)
   - *NIC* = 10G Chelsio (if it matters)
   - *Power stability* = EVGA Supernova Platinum 1600W + APC 1500VA UPS
   - *OS version* = clean install from the full FreeBSD 11.1 amd64 ISO onto a
   wiped boot SSD mirror (tested with both a bare install and also prepackaged
   as "FreeNAS")
   - *Installed sw:* Very little running beyond bare OS - no jails, no
   bhyve, no mods/patches, no custom kernel. Samba and iperf for testing
   across LAN (see below).

   - *Main pool:*   The main pool has >22 TB capacity and in physical terms
   it's about 55% full. The data is nicely balanced across the disks, which
   are almost all the same (or very similar) performance. The data in the
   existing pool is highly dedupable - it has a ratio of about 4x and judging
   by zdb's output (total blocks x bytes needed per block) the DDT is about 50
   GB.
   - *Sysctls / loader / rc:* Various sysctls - I can list them all if
   required. In particular, ARC metadata is allowed about 75GB of RAM so that
   the DDT isn't likely to be forced out, with the remainder split between
   the OS and other file caching (about 10G for the OS and about 35G of ARC
   not reserved for metadata). *Significant values if needed:*
   *vfs.zfs.arc_meta_limit* 75G,
   *vfs.zfs.l2arc_write_max/write_boost* 300000000 (300MB/s),
   *vfs.zfs.vdev.cache.size* 200MB, *vfs.zfs.delay_min_dirty_percent* 70
   (sketched in config form just after this list). Also various tunables for
   efficient 10G networking, including testing with large receive buffer
   sizes.
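
For concreteness, those values look something like this in config form (a
sketch only - whether each one belongs in /boot/loader.conf or
/etc/sysctl.conf depends on which are boot-time-only tunables on a given
install, and the byte conversions are mine):

    # ZFS tunables as listed above (illustrative layout, not a recommendation)
    vfs.zfs.arc_meta_limit="80530636800"        # ~75 GB allowed for ARC metadata/DDT
    vfs.zfs.l2arc_write_max="300000000"         # 300 MB/s L2ARC feed rate
    vfs.zfs.l2arc_write_boost="300000000"
    vfs.zfs.vdev.cache.size="209715200"         # 200 MB vdev cache
    vfs.zfs.delay_min_dirty_percent="70"        # write throttle delay starts later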

In theory, this should be a fairly powerful setup for handling the heavy
workload of a small-scale dedup pool, with no parity data/RAIDZ, in quiet
circumstances (a sketch of how the 50 GB DDT estimate was arrived at follows
below). Certainly I'm not expecting the dedup write outcomes I'm seeing.
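
For reference, this is roughly how the 50 GB DDT estimate above was arrived
at. The pool name and the numbers in the comments are illustrative rather
than my actual output, and the ~320 bytes/entry figure is the commonly
quoted in-core estimate:

    # Dump DDT statistics for the pool (read-only; safe on a live pool)
    zdb -DD poolname
    #
    # The output ends with a dedup ratio summary plus a histogram giving the
    # total number of allocated DDT entries.  A rough in-core size is then
    #     total entries x ~320 bytes per entry
    # e.g. ~170 million entries x 320 bytes ~= 54 GB, i.e. "about 50 GB".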



*TEST SETUP:*
I attached 2 x fast wiped SSDs capable of 500 MB/s+ read/write, formatted as
UFS, and an additional 3-way mirror of further 7200 RPM enterprise SAS
drives on the same HBAs, for testing.
I created a second pool on the temporary HDD mirror, configured identically
to the main pool but empty and with dedup=off (roughly as sketched below).
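
For completeness, the test pool creation was along these lines. Device names
are illustrative, and the dataset properties shown are placeholders standing
in for whatever the main pool actually uses:

    # Temporary test pool: one 3-way mirror on the extra SAS drives, dedup off
    zpool create testpool mirror da12 da13 da14
    zfs set dedup=off testpool
    zfs set compression=lz4 testpool     # placeholder - matched to the main pool
    zfs set atime=off testpool           # placeholder - matched to the main pool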

I copied a few very large files and a directory of smaller files onto the
SSDs (30, 50 and 110GB single files, plus a mixed dir of 3MB mp3s,
datasets, ISOs etc), and also copied them to twin SSDs on my directly
connected workstation.  I hash-checked the copies to ensure they were
identical, so that dedup would probably match the blocks they contain.
Then I tested copying the files onto both dedup and non-dedup pools,
locally (CLI) and across the LAN (Samba) as well as testing raw networking
IO (iperf).  In each case I copied the files to/from newly created empty
dirs, with the intent that the dedup pool would dedup these against
existing copies, and the non-dedup pool would just write them as normal.
The network and server were both checked as being quiet/idle apart from
these copies (previous write flushing finished, all netio/diskio/CPU idle
for several seconds, etc).
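
The hash check was nothing fancier than comparing SHA-256 digests of each
copy, along these lines (paths illustrative):

    # Confirm the staged copies are byte-identical before using them for tests
    sha256 /testssd/bigfile-110G.img
    sha256 /mnt/workstation-copy/bigfile-110G.img
    # any mismatch would mean dedup couldn't be expected to match those blocks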

I copied the files from SSD (client/UFS) to the dedup pool and the
non-dedup pool, repeatedly and in turn, to offset/minimise issues related
to non-cached vs cached data, and to ensure that when dedup was on,
performance was measured with the DDT already containing entries for the
blocks and those entries already known to be cached in ARC.  In theory, the
writes would be identical other than dedup on/off, and I repeated the runs
to check the results were stable.  When copying data, I watched common
system stats (gstat, iostat, netstat, top, via SSH in multiple windows, all
updating every 1 sec); a typical run is sketched below.
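
A typical local run looked roughly like the following (pool, dataset and
file names are illustrative), with the monitoring tools in their own SSH
windows:

    # Timed copies of the same source file into fresh, empty dirs on each pool
    zfs create mainpool/dedup-test         # dedup=on inherited from the main pool
    zfs create testpool/nodedup-test       # dedup=off on the temporary pool
    /usr/bin/time -h cp /testssd/bigfile-110G.img /mainpool/dedup-test/
    /usr/bin/time -h cp /testssd/bigfile-110G.img /testpool/nodedup-test/

    # Monitoring, each in its own window, updating every second
    gstat -p         # per-disk busy % and latency
    iostat -x 1      # per-device throughput
    netstat -w 1     # network traffic per second
    top -SH          # per-thread CPU, including kernel threads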



*RESULTS:*
Checking with iperf and Samba showed that the system was very fast for
reading from the pool and for networking (both ways). It was capable of up
to 1 GByte/sec in both directions (duplex) on Samba, fractionally more on
iperf. But when writing data, whether locally via the CLI or across the LAN
with Samba, writing to the dedup pool was consistently 10x ~ 20x slower
than writing to the non-dedup pool (raw file write speeds of 30~50 MB/s
dedup vs 400~1000 MB/s non-dedup, as seen by the client on a 100 GB
single-file transfer, before allowing for caching effects, with nothing
else going on).

I had known dedup would impact RAM and performance, but I had expected good
CPU and hardware to mitigate it a lot, and it wasn't being mitigated much,
if at all.  It was impacting so much that when writing across Samba, the
networking subsystem could be seen in tcpdump (filter sketched below) to be
driven to smaller windows and floods of tiny and zero windows on TCP, in
order to allow *something* within ZFS a lot of extra time for write request
handling.
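
The zero windows are easy to catch on the wire; the filter was something
like this (the interface name and the assumption that Samba is on port 445
are the obvious guesses to adjust):

    # TCP segments advertising a zero receive window on the Samba connection
    tcpdump -nn -i cxl0 'tcp port 445 and tcp[14:2] = 0'

    # and the receive queue backing up shows in the Recv-Q column of:
    netstat -an -p tcp | grep '\.445 '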

Nothing like this outcome happened during non-dedup pool writing, or during
dedup reading, of these files. But the server's performance consistently
dropped by 10x ~ 20x when writing to the dedup pool.  The system should
almost surely have enough RAM and a high standard of hardware/setup. DiskIO
and txgs looked about right, networking looked sane, and the issue affected
only dedup writing.  The main suspect seems likely to be either
CPU/threading or an unexpectedly huge avalanche of required metadata
updates. With a large amount of RAM to play with, a very fast ZIL, and a
large diskIO caching setup to even out diskIO, it doesn't seem *that*
likely to be down to metadata updates, and "top" showed that most of the
CPU was idle, but I can't tell if that's cause or effect. I altered a
number of tunables to increase TXG size/max dirty data/write coalescing
(the sort of thing sketched below), to the point it was writing in
noticeable bursts, and even so it didn't help.
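
The sort of tunables I mean are these. Current values can be read back with
sysctl; the example numbers below only illustrate the direction of the
changes, they're not a recommendation, and some of these may be boot-time
tunables on a given version:

    # Inspect the current TXG / dirty-data settings
    sysctl vfs.zfs.dirty_data_max vfs.zfs.txg.timeout vfs.zfs.delay_min_dirty_percent

    # The direction of the changes tried (illustrative values only)
    sysctl vfs.zfs.dirty_data_max=12884901888    # allow ~12 GB of dirty data
    sysctl vfs.zfs.txg.timeout=10                # sync txgs less often
    sysctl vfs.zfs.delay_min_dirty_percent=70    # start delaying writers later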

I also noticed as an aside, if relevant, that where I'd expected one TXG to
be building up while the previous TXG was writing out, that wasn't what was
happening when writing across the LAN. What I saw, consistently and
regularly, was that ZFS would stop pulling data off the network buffers for
several seconds at a time. At 10G speeds the netrcv buffer backed up within
milliseconds, causing zero windows. Then, abruptly, the buffer would almost
instantly empty, and this seemed to coincide with the start of a high level
of HDD write-out. I'm not sure why networking is being stalled rather than
continuing smoothly - perhaps someone will know? (A rough way to check the
txg correlation is sketched below.)
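
If anyone wants to check the txg correlation themselves, one rough approach
(assuming the fbt probes resolve on this kernel, which I haven't verified)
is to timestamp spa_sync and line it up against samples of the netrcv
buffer / Recv-Q:

    # Timestamp each txg sync so it can be lined up with when the rcv buffer drains
    dtrace -qn '
        fbt::spa_sync:entry  { self->t = timestamp;
                               printf("%Y  spa_sync start\n", walltimestamp); }
        fbt::spa_sync:return /self->t/ {
                               printf("%Y  spa_sync done after %d ms\n",
                                      walltimestamp,
                                      (timestamp - self->t) / 1000000);
                               self->t = 0; }'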

I posted some technical details elsewhere - graphs showing netrcv buffer
fill rate (which matches to the millisecond what you'd see if ZFS
completely stopped reading incoming data for a lengthy period), and other
screenshots. If useful I'll add links in a followup.



*DISCUSSION / QUESTIONS:*
I suspect that the reason dedup writes (only) were so slow is that somewhere
in the dedup write pathway - where it hashes data, matches it to a DDT
entry, and optionally verifies it - something inefficient is going on, and
it's slowing down the entire pathway. Perhaps it's only using a single
core? Perhaps metadata updates are more serious, or less efficient, than I
realised? I'm not sure what's up. But it's very consistent; I've repeated
this on multiple platforms and installs since doing the first tests.  I
don't know where to look further and I probably need input from someone
knowledgeable about the internals of the ZFS subsystem to do more (a
profiling sketch of the kind of next step I have in mind follows below).
I'd like to nail this down more closely and get ideas about what's
(probably) up.  And I'd like to see if it can be improved for others, by
feeding back into bugzilla if helpful.
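
The obvious next diagnostic I can think of, for anyone better placed to
interpret it, is a kernel CPU profile taken during a dedup write, to see
whether the time is concentrated in the hashing/DDT code path. A generic
dtrace profiling sketch, nothing ZFS-specific about it:

    # Sample on-CPU kernel stacks for 30 seconds during a dedup write, then
    # print the 20 hottest stacks - look for sha256/ddt_* frames
    dtrace -n '
        profile-997 /arg0/ { @[stack()] = count(); }
        tick-30s { trunc(@, 20); exit(0); }'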


So my questions are -

   1. There seem to be two ZFS issues, and they're somehow linked:

   *(A)* the dedup write pathway suffers from what feels like an
   unexpectedly horrible slowdown that's excessive in the scenario;  *and*

   *(B)* ZFS seems to stop pulling data from the network rcv buffer during a
   significant part of its processing cycle, to the point that netio is
   forced to a zero window for lengthy periods, and for most of the time,
   whereas I'd expect incoming network data to be processed into a new txg
   regardless of other processing going on, and not cause congestion unless
   a lot more was happening.

   Does anyone "in the know" on the technical side have an insight into
   what might be going on with either of these, or suggest any
   diagnostics/further info that's useful to pin it down?

   2. I'd expect dedup writes to take *some* kind of performance impact due
   to the processing required. But is the slowdown on dedup writes
   (specifically) that I'm seeing usual *to this extent, on this class of
   hardware*, on an idle system with just one large file being written
   between 2 local file systems?

   3. If either of these matters *does* turn out to be a threading or other
   clear inefficiency on the write pathway or anywhere else, is it likely to
   be useful if I file an enhancement request in bugzilla?


After all, dedup is incredibly useful in a small number of scenarios, and if
a server with this hardware under a single-user load is struggling that
much, it would be interesting to know the technical point where it's
occurring. (Equally, I guess many people advocate not using ZFS dedup in
almost any scenario, because end users inevitably use it on completely
inadequate hardware or on totally unsuitable data, so perhaps it's a
pathologised area with little patience and an attitude of "don't expect
much to be done now it's stable"!)


Anyhow, hoping for an insightful reply!
Thank you

Stilez

