Commit r345200 (new ARC reclamation threads) looks suspicious to me - second potential problem

Wed May 22 15:47:44 UTC 2019

On 5/22/2019 10:19 AM, Alexander Motin wrote:
> On 20.05.2019 12:42, Mark Johnston wrote:
>> On Mon, May 20, 2019 at 07:05:07PM +0300, Lev Serebryakov wrote:
>>>   I'm looking at last commit to
>>> 'sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c' (r345200) and
>>> have another question.
>>>
>>>   Here is the code:
>>>
>>>         /*
>>>          * Kick off asynchronous kmem_reap()'s of all our caches.
>>>          */
>>>         arc_kmem_reap_soon();
>>>
>>>         /*
>>>          * Wait at least arc_kmem_cache_reap_retry_ms between
>>>          * arc_kmem_reap_soon() calls. Without this check it is possible to
>>>          * end up in a situation where we spend lots of time reaping
>>>          * caches, while we're near arc_c_min.  Waiting here also gives the
>>>          * subsequent free memory check a chance of finding that the
>>>          * asynchronous reap has already freed enough memory, and we don't
>>>          * need to call arc_reduce_target_size().
>>>          */
>>>         delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);
>>>
>>>   But it looks like `arc_kmem_reap_soon()` is synchronous on FreeBSD!
>>> So this `delay()` looks very wrong. Am I right?
> Why is it wrong?
>
>>>    Looks like it should be `#ifdef illumos`.
>> See also r338142, which I believe was reverted by the update.
> My r345200 indeed reverted that value, but I don't see a problem there.
> When the OS needs more RAM, pagedaemon will drain UMA caches by itself.
> I don't see a point in re-draining UMA caches in a tight loop without
> delay.  If caches are not sufficient to sustain one second of workload,
> then the usefulness of such caches is not very clear, and shrinking ARC
> to free some space may be the right move.  Also, making ZFS drain its
> caches more actively than anything else in the system looks unfair to me.
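
For reference, the delay() argument in the quoted code is a round-up
conversion from milliseconds to clock ticks. A minimal standalone sketch
of that arithmetic, assuming hz is the tick rate in ticks per second;
ms_to_ticks_roundup is a hypothetical name for illustration, not a
kernel function, and the values in the comments are examples only:

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of arc.c): convert a delay in
 * milliseconds to clock ticks, rounding up so that any non-zero
 * delay waits for at least one tick.
 */
static int
ms_to_ticks_roundup(int hz, int ms)
{
	assert(hz > 0 && ms >= 0);
	/* Same expression as the delay() argument in the quoted code. */
	return ((hz * ms + 999) / 1000);
}

/*
 * Illustrative values only:
 *   ms_to_ticks_roundup(1000, 1000) == 1000  (one second at hz = 1000)
 *   ms_to_ticks_roundup(100, 5)     == 1     (rounded up to one tick)
 *   ms_to_ticks_roundup(1000, 0)    == 0     (a zero setting means no wait)
 */
```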

There is a long-standing pathology with the older implementation. The
short answer is that memory held in UMA but not allocated to the current
working set is completely wasted -- unless it is quickly re-used.  So a
small buffer between what is in use and what is allocated is ok, but the
UMA system will leave large amounts allocated but unused. Reclaiming
that after a reasonable amount of time is a very good thing.

The other problem is that disk cache should NEVER be preferred over
working set space.  It's always wrong to do so because a working set
page-out is 1 *guaranteed* I/O (to page it out) and possibly 2 I/Os (if
it is required again and thus must be paged back in), while a disk cache
page is 1 *possible* I/O avoided (if the disk cache block is requested
again).
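
The accounting above can be written down as expected I/O counts. This is
only a sketch of the argument as stated, under the simplifying assumption
that every I/O costs the same; the function and parameter names are
hypothetical, and the probabilities are inputs, not measurements:

```c
#include <assert.h>

/*
 * Expected I/Os from paging out a working set page: one guaranteed
 * write, plus one read with probability p_needed_again.
 */
static double
working_set_evict_cost(double p_needed_again)
{
	assert(p_needed_again >= 0.0 && p_needed_again <= 1.0);
	return (1.0 + p_needed_again);
}

/*
 * Expected I/Os from dropping a disk cache block instead: nothing
 * now, plus one read with probability p_requested_again.
 */
static double
cache_evict_cost(double p_requested_again)
{
	assert(p_requested_again >= 0.0 && p_requested_again <= 1.0);
	return (p_requested_again);
}
```

Under these assumptions the page-out costs at least 1.0 and dropping the
cache block costs at most 1.0, so the page-out is never cheaper; the two
are equal only when the page is never needed again and the cache block
is certain to be re-read.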

It is never the right move to intentionally take an I/O in order to
avoid a *possible* I/O. Under certain workloads making that choice leads
to severe pathological behavior (~30 second "pauses" where the system is
doing I/O like crazy but a desired process -- such as a database or a
shell -- does nothing while it waits for its working set to be paged
back in) when there are gigabytes (or tens of gigabytes) of ARC
outstanding.

-- 
-- Karl Denninger
/The Market-Ticker/
S/MIME Email accepted and preferred