Strange slowdown when cache devices enabled in ZFS

Brendan Gregg brendan.gregg at joyent.com
Wed May 8 21:35:48 UTC 2013


Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
|
| The following settings in /etc/sysctl.conf prevent the "stalls"
| completely,
| even when the L2ARC devices are 100% full and all RAM is wired into the
| ARC.  Been running without issues for 5 days now:
|
| vfs.zfs.l2arc_norw=0                  # Default is 1
| vfs.zfs.l2arc_feed_again=0            # Default is 1
| vfs.zfs.l2arc_noprefetch=0            # Default is 1
| vfs.zfs.l2arc_feed_min_ms=1000        # Default is 200
| vfs.zfs.l2arc_write_boost=320000000   # Default is 8 MBps
| vfs.zfs.l2arc_write_max=160000000     # Default is 8 MBps
|
| With these settings, I'm also able to expand the ARC to use the full 128
| GB of RAM in the biggest box, and to use both L2ARC devices (60 GB in total).
| And, can set primarycache and secondarycache to all (the default) instead
| of just metadata.
|[...]

The thread earlier described a 100% CPU-bound l2arc_feed_thread, which
could be caused by these settings:

vfs.zfs.l2arc_write_boost=320000000   # Default is 8 MBps
vfs.zfs.l2arc_write_max=160000000     # Default is 8 MBps

If I'm reading that correctly, it increases the write max and boost to
160 Mbytes and 320 Mbytes. To satisfy these rates, the L2ARC feed thread
must scan memory from the tail of the ARC lists, lists which may be composed
of tiny buffers (e.g., 8 Kbytes). Increasing that scan 20-fold could saturate
a CPU. And if it doesn't find many bytes to write out, it will rescan the
same buffers on the next interval, wasting CPU cycles.
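
To put rough numbers on that scan work, here is an illustrative sketch
(not the kernel code; it assumes 8 Kbyte buffers and a feed interval of
about one second):

    # Rough estimate of how many ARC buffers the feed thread must scan
    # per interval to satisfy a given write max. Assumes 8 Kbyte buffers
    # and a ~1 second feed interval; real buffer sizes vary.
    buf_size = 8 * 1024                 # bytes per ARC buffer (assumed)
    default_max = 8 * 1024 * 1024       # default l2arc_write_max: 8 Mbytes
    tuned_max = 160 * 1000 * 1000       # the tuned value above

    print(default_max // buf_size)      # 1024 buffers per interval
    print(tuned_max // buf_size)        # 19531 buffers per interval, ~20x more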

I understand the intent was probably to warm up the L2ARC faster. There is
no easy way to do this: you are bounded by the throughput of random reads
from the pool disks.

Random read workloads usually have a 4-16 Kbyte record size. The L2ARC
feed thread can't consume uncached data faster than the random reads can be
served from disk. Therefore, at 8 Kbytes, you need at least 1,000 random read
disk IOPS to achieve a rate of 8 Mbytes/sec from the ARC list tails, which,
for rotational disks performing roughly 100 random IOPS (use a different rate
if you like), means about ten disks, depending on the ZFS RAID config.
All to feed at 8 Mbytes/sec. This is why 8 Mbytes/sec (plus the boost) is
the default.

To feed at 160 Mbytes/sec with an 8 Kbyte record size, you'll need at least
20,000 random read disk IOPS. How many spindles does that take? A lot. Do
you have a lot?
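
As a sketch of that arithmetic (illustrative only; substitute your own
record size and per-disk IOPS):

    # How many spindles does it take to supply a given L2ARC feed rate?
    # Illustrative figures: 8 Kbyte records, ~100 random IOPS per disk.
    def disks_needed(feed_rate, recsize=8192, disk_iops=100):
        iops_needed = feed_rate / recsize     # random reads/sec required
        return iops_needed / disk_iops        # spindles to deliver them

    print(disks_needed(8 * 1024 * 1024))      # default 8 Mbytes/sec: ~10 disks
    print(disks_needed(160 * 1000 * 1000))    # tuned 160 Mbytes/sec: ~195 disks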

I wanted to point this out because the warm-up problem isn't the
l2arc_feed_thread (that it scans, how far it scans, whether it rescans,
etc.): it's the input to the system.

...

I just noticed that the ZFS Tuning Guide at
https://wiki.freebsd.org/ZFSTuningGuide says:

"
vfs.zfs.l2arc_write_max

vfs.zfs.l2arc_write_boost

The former value sets the runtime max that data will be loaded into L2ARC.
The latter can be used to accelerate the loading of a freshly booted
system. For a device capable of 400MB/sec reasonable values might be 200MB
and 380MB respectively. Note that the same caveats apply about these
sysctls and pool imports as the previous one. Setting these values properly
is the difference between an L2ARC subsystem that can take days to heat up
versus one that heats up in minutes.
"

This advice seems a little unwise: you could tune the feed rates that high,
if you have enough spindles to feed it, but I think for most people this
will waste CPU cycles failing to find buffers to cache. Can the author
please double-check?

Brendan

-- 
Brendan Gregg, Joyent                      http://dtrace.org/blogs/brendan
