Commit r345200 (new ARC reclamation threads) looks suspicious to me - second potential problem
karl at denninger.net
Wed May 22 15:47:44 UTC 2019
On 5/22/2019 10:19 AM, Alexander Motin wrote:
> On 20.05.2019 12:42, Mark Johnston wrote:
>> On Mon, May 20, 2019 at 07:05:07PM +0300, Lev Serebryakov wrote:
>>> I'm looking at last commit to
>>> 'sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c' (r345200) and
>>> have another question.
>>> Here are such code:
>>> /*
>>>  * Kick off asynchronous kmem_reap()'s of all our caches.
>>>  */
>>> arc_kmem_reap_soon();
>>>
>>> /*
>>>  * Wait at least arc_kmem_cache_reap_retry_ms between
>>>  * arc_kmem_reap_soon() calls. Without this check it is possible to
>>>  * end up in a situation where we spend lots of time reaping
>>>  * caches, while we're near arc_c_min. Waiting here also gives the
>>>  * subsequent free memory check a chance of finding that the
>>>  * asynchronous reap has already freed enough memory, and we don't
>>>  * need to call arc_reduce_target_size().
>>>  */
>>> delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);
>>> But looks like `arc_kmem_reap_soon()` is synchronous on FreeBSD! So,
>>> this `delay()` looks very wrong. Am I right?
> Why is it wrong?
>>> Looks like it should be `#ifdef illumos`.
>> See also r338142, which I believe was reverted by the update.
> My r345200 indeed reverted that value, but I don't see a problem there.
> When the OS needs more RAM, the pagedaemon will drain UMA caches by
> itself. I don't see a point in re-draining UMA caches in a tight loop
> without delay. If the caches are not sufficient to sustain one second of
> workload, then the usefulness of such caches is not very clear and
> shrinking ARC to free some space may be the right move. Also, making ZFS
> drain its caches more actively than anything else in the system looks
> unfair to me.
There is a long-standing pathology with the older implementation. The
short answer is that memory cached in UMA but not allocated to the
current working set is completely wasted unless it is quickly re-used.
A small buffer between what is currently in use and what is allocated
is ok, but the UMA system will leave large amounts allocated but
unused. Reclaiming that after a reasonable amount of time is a very
good thing.
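For reference, the "reasonable amount of time" in the quoted code is set
by the delay() argument, which converts arc_kmem_cache_reap_retry_ms to
clock ticks and rounds up. A minimal userland sketch of that arithmetic
follows; the helper name ms_to_ticks is hypothetical and is not in
arc.c, and the hz values used below are only illustrative:

```c
#include <assert.h>

/*
 * Same expression as the delay() argument quoted above: convert a
 * millisecond interval to clock ticks, rounding up so that any non-zero
 * interval on a coarse clock still waits at least one tick.
 * ms_to_ticks is a hypothetical name used only for this sketch.
 */
static int
ms_to_ticks(int hz, int ms)
{
	assert(hz > 0 && ms >= 0);
	return ((hz * ms + 999) / 1000);
}
```

With hz = 100, ms_to_ticks(100, 15) gives 2 ticks (15 ms rounds up to
20 ms), and a retry interval of 0 gives 0 ticks, i.e. no wait at all.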
The other problem is that disk cache should NEVER be preferred over
working set space. It's always wrong to do so because a working set
page-out is 1 *guaranteed* I/O (to page it out) and possibly 2 I/Os (if
the page is required again and thus must be paged back in), while a
disk cache page is 1 *possible* I/O avoided (if the disk cache block is
requested again).
It is never the right move to intentionally take an I/O in order to
avoid a *possible* I/O. Under certain workloads making that choice leads
to severe pathological behavior (~30 second "pauses" where the system is
doing I/O like crazy but a desired process, such as a database or a
shell, does nothing while it waits for its working set to be paged back
in) when there are gigabytes (or 10s of gigabytes) of ARC outstanding.
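The I/O accounting above can be put as a rough expected-cost comparison.
This is only a sketch of the argument, not anything from the kernel: the
function names and the reuse probabilities are hypothetical, and it
assumes the evicted cache block is clean, so dropping it costs no write.

```c
#include <assert.h>

/*
 * Expected I/Os caused by paging out one working-set page: the page-out
 * itself is certain (1 I/O), and the page-in happens only if the page is
 * needed again (probability p_reuse).
 */
static double
pageout_expected_ios(double p_reuse)
{
	assert(p_reuse >= 0.0 && p_reuse <= 1.0);
	return (1.0 + p_reuse);
}

/*
 * Expected I/Os caused by dropping one clean disk-cache block instead:
 * nothing is written, and one read is needed only if the block is
 * requested again (probability p_rerequest).
 */
static double
cache_evict_expected_ios(double p_rerequest)
{
	assert(p_rerequest >= 0.0 && p_rerequest <= 1.0);
	return (p_rerequest);
}
```

Under these assumptions the page-out never costs less than 1 I/O and the
cache eviction never costs more than 1 I/O, whatever the probabilities
are, which is the point being made above.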
-- Karl Denninger
S/MIME Email accepted and preferred