svn commit: r351673 - in head: lib/libmemstat share/man/man9 sys/cddl/compat/opensolaris/kern sys/kern sys/vm

Slawa Olhovchenkov slw at zxy.spb.ru
Wed Sep 4 14:45:28 UTC 2019


On Tue, Sep 03, 2019 at 06:01:06PM -0400, Mark Johnston wrote:

> > > have you considered running UMA_RECLAIM_TRIM periodically, even without
> > > memory pressure?
> > > I think that with such periodic trimming there would be less need to invoke
> > > vm_lowmem().
> 
> Slawa and I talked about this in the past.  His complaint is that a
> large cache can take a significant amount of time to trim, and it

Not only a large cache. I also see a large mbuf cache (10GB+ after peak network
activity) and many other zones. For example, on a live server running stable/11,
the cache sizes of zones in MB:

    54 RADIX NODE
    55 zio_data_buf_131072
    73 zio_data_buf_12288
    77 zio_data_buf_98304
    93 socket
    99 tcpcb
  1072 mbuf
  1136 zio_buf_131072
  1443 zio_data_buf_1048576
 17242 mbuf_jumbo_page

> manifests as a spike of CPU usage and contention on the zone lock.  In
> particular, keg_drain() iterates over the list of free slabs with the
> keg lock held, and if many items were freed to the keg while
> trimming/draining, the list can be quite long.  This can have effects
> outside the zone, for example if we are reclaiming items from zones used
> by other UMA zones, like the bucket or slab zones.
> 
> Reclaiming cached items when there is no demand for free pages seems
> wrong to me.  We historically had similar problems with the page daemon,
> which last year was changed to perform smaller reclamations at a greater
> frequency.  I suspect a better approach for UMA would be to similarly
> increase reclaim frequency and reduce the number of items freed in one
> go.

My points are the following:

1. Memory sizes today are quite big: 64GB is a minimum, 256GB is not extraordinary.
   As a result, the amount of memory processed at a lowmem event can be very large
   compared to historical sizes.
2. Memory reclamation is very expensive at the last stage. As a result, reclaiming
   some 10GB can take 10s or more.
3. Memory depletion can be very fast at current speeds (40Gbit network connectivity
   is 5GB/s). As a result of (2+3), reclamation at a lowmem event may be too slow to
   compensate for the depletion.
4. Many subsystems now try not to trigger lowmem through automatic memory depletion.
   Large amounts of unused memory in zone caches cause inefficient memory use (see
   above -- about 18GB of memory could serve as cache or be used some other way, but
   currently it just sits in zone caches; lowmem is not triggered because all
   consumers try to leave sufficient free memory).
5. NUMA makes the situation worse because (as I see it) memory can be allocated from
   dom1 and freed to dom0. As a result, the zone cache in dom0 grows and is never
   used. Currently the kernel is not fully NUMA-aware; much work is needed.
6. All of this can exhaust memory below vmd_free_reserved and slow down many
   operations in the kernel.

I have seen all of this.
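
For illustration only, the periodic trimming discussed above might look roughly
like the sketch below. The kproc, its name, and the uma_trim_period tunable are
my invention; only uma_reclaim(UMA_RECLAIM_TRIM) is the real interface added by
r351673.

    /*
     * Sketch of a periodic trimmer; the kproc and uma_trim_period are
     * assumptions, uma_reclaim(UMA_RECLAIM_TRIM) is from r351673.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/kthread.h>
    #include <sys/proc.h>
    #include <vm/uma.h>

    static int uma_trim_period = 10;	/* seconds; hypothetical tunable */

    static void
    uma_trim_kproc(void *arg __unused)
    {
            for (;;) {
                    pause("umatrim", uma_trim_period * hz);
                    /*
                     * Trim caches back toward their recent working-set
                     * size in small, frequent steps instead of draining
                     * everything when a lowmem event finally fires.
                     */
                    uma_reclaim(UMA_RECLAIM_TRIM);
            }
    }

    static void
    uma_trim_init(void *arg __unused)
    {
            struct proc *p;

            if (kproc_create(uma_trim_kproc, NULL, &p, 0, 0, "uma trim") != 0)
                    printf("uma trim: cannot start kproc\n");
    }
    SYSINIT(umatrim, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, uma_trim_init, NULL);

This still does not bound the time of a single trim pass; points 1-3 above are
about exactly that.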

> > > Also, I think that we would be able to retire (or re-purpose) lowmem_period.
> > > E.g., the trimming would be done every lowmem_period, but vm_lowmem() would not
> > > be throttled.
> 
> Some of the vm_lowmem eventhandlers probably shouldn't be called each
> time the page daemon scans the inactive queue (every 0.1s under memory
> pressure).  ufsdirhash_lowmem and mb_reclaim in particular don't seem
> like they need to be invoked very frequently.  We could easily define
> multiple eventhandlers to differentiate between these cases, though.
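
A sketch of such a split, for illustration only: the vm_lowmem_rare list and
the throttle state are hypothetical; EVENTHANDLER_DECLARE,
EVENTHANDLER_LIST_DEFINE, EVENTHANDLER_INVOKE, vm_lowmem_handler_t,
VM_LOW_PAGES, time_uptime, and lowmem_period (vm.lowmem_period) are existing
kernel API.

    /* In sys/eventhandler.h: a second, rarely invoked handler list
     * (the name vm_lowmem_rare is hypothetical). */
    EVENTHANDLER_DECLARE(vm_lowmem_rare, vm_lowmem_handler_t);

    /* In vm_pageout.c: */
    EVENTHANDLER_LIST_DEFINE(vm_lowmem_rare);

    static time_t lowmem_rare_uptime;	/* hypothetical throttle state */

    static void
    vm_pageout_lowmem_sketch(void)
    {
            /* Cheap handlers, e.g. the ARC, run on every inactive scan. */
            EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);

            /* Expensive handlers (ufsdirhash_lowmem, mb_reclaim) stay
             * throttled to once per vm.lowmem_period seconds. */
            if (time_uptime - lowmem_rare_uptime >= lowmem_period) {
                    lowmem_rare_uptime = time_uptime;
                    EVENTHANDLER_INVOKE(vm_lowmem_rare, VM_LOW_PAGES);
            }
    }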
> 
> > > One example of the throttling of vm_lowmem being bad is its interaction with the
> > > ZFS ARC.  When there is a spike in memory usage we want the ARC to adapt as
> > > quickly as possible.  But at present the lowmem_period logic interferes with that.
> >
> > Some time ago, I sent Mark a patch that implements this logic,
> > specifically for ARC and mbuf cooperation.
> >
> > The main problem I see with this work is the very slow
> > vm_page_free(). Maybe it is faster now...
> 
> How did you determine this?

This was your guess:

======
>         while ((slab = SLIST_FIRST(&freeslabs)) != NULL) {
>                 SLIST_REMOVE(&freeslabs, slab, uma_slab, us_hlink);
>                 keg_free_slab(keg, slab, keg->uk_ipers);
>         }
> 2019 Feb  2 19:49:54.800524364       zio_data_buf_1048576  1032605 cache_reclaim limit      100 dom 0 nitems     1672 imin      298
> 2019 Feb  2 19:49:54.800524364       zio_data_buf_1048576  1033736 cache_reclaim recla      149 dom 0 nitems     1672 imin      298
> 2019 Feb  2 19:49:54.802524468       zio_data_buf_1048576  3119710 cache_reclaim limit      100 dom 1 nitems        1 imin        0
> 2019 Feb  2 19:49:54.802524468       zio_data_buf_1048576  3127550 keg_drain2
> 2019 Feb  2 19:49:54.803524487       zio_data_buf_1048576  4444219 keg_drain3
> 2019 Feb  2 19:49:54.838524634       zio_data_buf_1048576 39553705 keg_drain4
> 2019 Feb  2 19:49:54.838524634       zio_data_buf_1048576 39565323 zone_reclaim:return
>
> 35109.486 us (~35.1 ms) for the last loop, 149 items freed.

35ms to free 149MB (38144 4KB pages), so roughly 1us per page.  That
does seem like a lot, but freeing a page (vm_page_free(m)) is much
more expensive than freeing an item to UMA (i.e., uma_zfree()).
Most of that time will be spent in _kmem_unback().
======
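
(For reference, the arithmetic: 149 items x 1MB = 149MB = 38144 4KB pages; the
keg_drain3 -> keg_drain4 gap above is 39553705 - 4444219 = 35109486 ns, about
35.1 ms, i.e. roughly 0.92 us per page.)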

> You are on stable/12 I believe, so r350374 might help if you do not
> already have it.

I have not tried that yet.

>  I guess the vm_page_free() calls are coming from the UMA trimmer?

Indirectly, from keg_drain().
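
The path is roughly this (function names as I understand head around r351673;
page_free is the keg's default uk_freef):

    uma_reclaim(UMA_RECLAIM_TRIM) -> zone_reclaim() -> keg_drain()
        -> keg_free_slab() -> page_free() -> _kmem_unback() -> vm_page_free()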

