uma: zone fragmentation

Artem Belevich fbsdlist at src.cx
Fri Aug 27 21:49:34 UTC 2010


On Sun, Aug 22, 2010 at 1:45 PM, Andriy Gapon <avg at icyb.net.ua> wrote:
> Unfortunately I don't have any conclusive results to report.
> The numbers seem to be better with the patch, but they are changing all the time
> depending on system usage.
> I couldn't think of any good test that would reflect real-world usage patterns,
> which I believe to be not entirely random.

I do see a measurable improvement on my system. Without this change my
ARC would grow up to ~5200M and then oscillate pretty close to that
number. With your patch applied the ARC reaches ~5900M. In both cases I
end up with ~7000M worth of wired memory. So, in the normal case about
1800M of memory is lost to fragmentation, and your patch reduces that
amount to ~1100M, which is a noticeable improvement.

On a side note -- how hard would it be to voluntarily drain the UMA
zones used by the ARC? When I enable vfs.zfs.zio.use_uma I see a lot
of memory in those zones listed as 'free' which could be used for
something else. In absolute terms it's the large-item zones that waste
the most memory in my case: there are ~1000 free items sitting in
rarely used ~100K-sized zones. That memory will be released when the
pagedaemon wakes up, but the same event also back-pressures the ARC.
With the ZIO UMA allocator my ARC size never grows above ~4800M --
and that's *with* your patch applied. Without the patch it was even
worse. My guess is that if we drained those zones manually, the memory
would find better use elsewhere, for instance as ARC data.

--Artem



On Sun, Aug 22, 2010 at 1:45 PM, Andriy Gapon <avg at icyb.net.ua> wrote:
>
> It seems that with the inclusion of ZFS, which is a significant UMA user even
> when UMA is not used for the ARC, zone fragmentation becomes an issue.
> For example, on my systems with 4GB of RAM I routinely observe several hundred
> megabytes in free items even after zone draining (via the lowmem event).
>
> I wrote a one-liner (quite a long line, though) for post-processing vmstat -z
> output; here's an example:
> $ vmstat -z | sed -e 's/ /_/' -e 's/:_* / /' -e 's/,//g' | tail +3 | awk 'BEGIN
> { total = 0; } { total += $2 * $5; print $2 * $5, $1, $4, $5, $2;} END { print
> total, "total"; }' | sort -n | tail -10
> 6771456 256 7749 26451 256
> 10710144 128 173499 83673 128
> 13400424 VM_OBJECT 33055 62039 216
> 17189568 zfs_znode_cache 33259 48834 352
> 19983840 VNODE 33455 41633 480
> 30936464 arc_buf_hdr_t 145387 148733 208
> 57030400 dmu_buf_impl_t 82816 254600 224
> 57619296 dnode_t 78811 73494 784
> 62067712 512 71050 121226 512
> 302164776 total
>
> When UMA is used for the ARC, the "wasted" memory grows above 1GB, effectively
> making that setup unusable for me.
>
> I see that in OpenSolaris they developed a few measures to (try to) prevent
> fragmentation and perform defragmentation.
>
> First, they keep their equivalent of the partial slab list sorted by the number
> of used items, thus trying to fill up the most-used slab first.
> Second, they allow setting a 'move' callback for a zone and have a special
> monitoring thread that tries to compact slabs when zone fragmentation goes above
> a certain limit.
> The details can be found here (lengthy comment at the beginning and links in it):
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c
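>
> For reference, the registration side of their interface looks roughly like
> this (a sketch pieced together from that comment, so the details may be off):
>
> /* The client registers a callback that can relocate a live object. */
> typedef enum kmem_cbrc {
>         KMEM_CBRC_YES,          /* object was moved to the new buffer */
>         KMEM_CBRC_NO,           /* cannot be moved (e.g. currently held) */
>         KMEM_CBRC_LATER,        /* busy right now, try again later */
>         KMEM_CBRC_DONT_NEED,    /* object is not needed and can be freed */
>         KMEM_CBRC_DONT_KNOW     /* not recognized as a live object */
> } kmem_cbrc_t;
>
> void kmem_cache_set_move(kmem_cache_t *cache,
>     kmem_cbrc_t (*move)(void *old, void *new, size_t size, void *arg));
>
> ZFS, for example, registers zfs_znode_move() for its znode cache, and the
> maintenance code calls the callback to empty out sparsely used slabs.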
>
> Not sure if we would want to implement anything like that or some alternative,
> but zone fragmentation seems to have become an issue, at least for ZFS.
>
> I am testing the following primitive patch that tries to "lazily sort" (or
> pseudo-sort) the slab partial list.  A linked list is not the kind of data
> structure that's easy to keep sorted in an efficient manner.
>
> diff --git a/sys/vm/uma_core.c b/sys/vm/uma_core.c
> index 2dcd14f..ed07ecb 100644
> --- a/sys/vm/uma_core.c
> +++ b/sys/vm/uma_core.c
> @@ -2727,14 +2727,26 @@ zone_free_item(uma_zone_t zone, void *item, void *udata,
>        }
>        MPASS(keg == slab->us_keg);
>
> -       /* Do we need to remove from any lists? */
> +       /* Move to the appropriate list or re-queue further from the head. */
>        if (slab->us_freecount+1 == keg->uk_ipers) {
> +               /* Partial -> free. */
>                LIST_REMOVE(slab, us_link);
>                LIST_INSERT_HEAD(&keg->uk_free_slab, slab, us_link);
>        } else if (slab->us_freecount == 0) {
> +               /* Full -> partial. */
>                LIST_REMOVE(slab, us_link);
>                LIST_INSERT_HEAD(&keg->uk_part_slab, slab, us_link);
>        }
> +       else {
> +               /* Partial -> partial. */
> +               uma_slab_t tmp;
> +
> +               tmp = LIST_NEXT(slab, us_link);
> +               if (tmp != NULL && slab->us_freecount > tmp->us_freecount) {
> +                       LIST_REMOVE(slab, us_link);
> +                       LIST_INSERT_AFTER(tmp, slab, us_link);
> +               }
> +       }
>
>        /* Slab management stuff */
>        freei = ((unsigned long)item - (unsigned long)slab->us_data)
>
>
> Unfortunately I don't have any conclusive results to report.
> The numbers seem to be better with the patch, but they are changing all the time
> depending on system usage.
> I couldn't think of any good test that would reflect real-world usage patterns,
> which I believe to be not entirely random.
>
> --
> Andriy Gapon
> _______________________________________________
> freebsd-hackers at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"
>

