Strange slowdown when cache devices enabled in ZFS

Brendan Gregg brendan.gregg at joyent.com
Wed May 8 21:46:52 UTC 2013


On Wed, May 8, 2013 at 2:35 PM, Brendan Gregg <brendan.gregg at joyent.com> wrote:

> Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
> |
> | The following settings in /etc/sysctl.conf prevent the "stalls"
> | completely, even when the L2ARC devices are 100% full and all RAM is
> | wired into the ARC.  Been running without issues for 5 days now:
> |
> | vfs.zfs.l2arc_norw=0                 # Default is 1
> | vfs.zfs.l2arc_feed_again=0           # Default is 1
> | vfs.zfs.l2arc_noprefetch=0           # Default is 0
> | vfs.zfs.l2arc_feed_min_ms=1000       # Default is 200
> | vfs.zfs.l2arc_write_boost=320000000  # Default is 8 MBps
> | vfs.zfs.l2arc_write_max=160000000    # Default is 8 MBps
> |
> | With these settings, I'm also able to expand the ARC to use the full
> | 128 GB of RAM in the biggest box, and to use both L2ARC devices
> | (60 GB in total).  And, can set primarycache and secondarycache to
> | all (the default) instead of just metadata.
> |[...]
>
> The thread earlier described a 100% CPU-bound l2arc_feed_thread, which
> could be caused by these settings:
>
> vfs.zfs.l2arc_write_boost=320000000           # Default is 8 MBps
> vfs.zfs.l2arc_write_max=160000000             # Default is 8 MBps
>
> If I'm reading that correctly, it's increasing the write max and boost
> to 160 Mbytes and 320 Mbytes respectively. To satisfy these, the L2ARC
> must scan memory from the tail of the ARC lists, lists which may be
> composed of tiny buffers (e.g., 8 Kbytes). Increasing that scan
> 20-fold could saturate a CPU. And if it doesn't find many bytes to
> write out, it will rescan the same buffers on the next interval,
> wasting CPU cycles.
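>
> (A rough way to confirm this on FreeBSD: "top -SH" shows kernel
> threads; if the L2ARC feed thread, which runs under the zfskern
> process, is pinned near 100% CPU, the scan rate is the likely culprit.
> Thread naming may vary by version.)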
>
> I understand the intent was probably to warm up the L2ARC faster. There is
> no easy way to do this: you are bounded by the throughput of random reads
> from the pool disks.
>
> Random read workloads usually have a 4 - 16 Kbyte record size. The
> l2arc feed thread can't consume uncached data faster than the random
> reads can be returned from disk. Therefore, at 8 Kbytes, you need at
> least 1,000 random read disk IOPS to achieve a rate of 8 Mbytes/sec
> from the ARC list tails, which, for rotational disks performing
> roughly 100 random IOPS each (use a different rate if you like), means
> about a dozen disks, depending on the ZFS RAID config. All to feed at
> 8 Mbytes/sec. This is why 8 Mbytes/sec (plus the boost) is the
> default.
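>
> Spelling out that arithmetic (using the 8 Kbyte record size and ~100
> random IOPS per disk from above):
>
>     8 Mbytes/sec    / 8 Kbytes per read  = 1,000 reads/sec
>     1,000 reads/sec / 100 IOPS per disk  = ~10 disks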
>
> To feed at 160 Mbytes/sec, with an 8 Kbyte recsize, you'll need at least
> 20,000 random read disk IOPS. How many spindles does that take? A lot. Do
> you have a lot?
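>
> The same arithmetic at the tuned rate:
>
>     160 Mbytes/sec   / 8 Kbytes per read  = 20,000 reads/sec
>     20,000 reads/sec / 100 IOPS per disk  = ~200 disks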
>
> I wanted to point this out because the warm-up problem isn't the
> l2arc_feed_thread (what it scans, how far it scans, whether it
> rescans, etc.) – it's the input to the system.
>
> ...
>
> I just noticed that https://wiki.freebsd.org/ZFSTuningGuide says:
>
> "
> vfs.zfs.l2arc_write_max
>
> vfs.zfs.l2arc_write_boost
>
> The former value sets the runtime max that data will be loaded into L2ARC.
> The latter can be used to accelerate the loading of a freshly booted
> system. For a device capable of 400MB/sec reasonable values might be 200MB
> and 380MB respectively. Note that the same caveats apply about these
> sysctls and pool imports as the previous one. Setting these values properly
> is the difference between an L2ARC subsystem that can take days to heat up
> versus one that heats up in minutes.
> "
>
> This advice seems a little unwise: you could tune the feed rates that
> high – if you have enough spindles to feed it – but I think for most
> people this will waste CPU cycles failing to find buffers to cache.
> Can the author please double-check?
>
>

Sorry - just noticed that vfs.zfs.l2arc_noprefetch=0 was also set, and the
guide recommends that. What I described was for the default of 1, where
only random reads feed the L2ARC. Streaming workloads can feed it much
quicker, so you can increase the feed rate if either you have a lot of
spindles or are caching streaming workloads – both providing the
throughput desired.
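
As an illustration only (not a recommendation; the values are in bytes,
and the pool must actually be able to supply this throughput), a setup
caching streaming workloads might look like:

    vfs.zfs.l2arc_noprefetch=0        # let prefetched (streaming) buffers feed the L2ARC
    vfs.zfs.l2arc_write_max=67108864  # 64 Mbytes per feed interval (one second by default)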

Back when the L2ARC was developed, the SSD max throughput (around 200
Mbytes/sec) could not compete with the pool disks (say, 12 x 180
Mbytes/sec), so it didn't make sense to cache sequential workloads in the
L2ARC. It's another subtlety that the ZFSTuningGuide might want to explain:
your pool disks might already be very good at streaming workloads – better
than the L2ARC – and so you want to leave sequential workloads to them.
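
Using those example numbers:

    12 disks x 180 Mbytes/sec = 2,160 Mbytes/sec streaming from the pool,
    versus ~200 Mbytes/sec from a single L2ARC SSD

so serving a sequential read from the L2ARC could be an order of
magnitude slower than reading it from the pool disks directly.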

Brendan

-- 
Brendan Gregg, Joyent                      http://dtrace.org/blogs/brendan

