[patch] zfs kmem fragmentation

Mon May 4 22:34:58 UTC 2009

On Sat, 2 May 2009, Ben Kelly wrote:

> Hello all,
>
> Lately I've been looking into the "kmem too small" panics that often occur 
> with zfs if you don't restrict the arc.  What I found in my test environment 
> was that everything works well until the kmem usage hits the 75% limit set in 
> arc.c.  At this point the arc is shrunk and slabs are reclaimed from uma. 
> Unfortunately, every time this reclamation process runs the kmem space 
> becomes more fragmented.  The vast majority of the time my machine hits the 
> "kmem too small" panic it has over 200MB of kmem space available, but the 
> largest fragment is less than 128KB.

Which consumers make kmem requests of 128KB and over?  What ultimately 
trips the panic?

>
> Ideally things would be arranged to free memory without fragmentation.  I 
> have tried a few things along those lines, but none of them have been 
> successful so far.  I'm going to continue that work, but in the meantime I've 
> put together a patch that tries to avoid fragmentation by slowing kmem growth 
> before the aggressive reclamation process is required:
>
> http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff
>
> It uses the following heuristics to do this:
>
> - Start arc_c at arc_c_min instead of arc_c_max.  This causes the system to 
> warm up more slowly.
> - Halve the rate at which arc_c grows when kmem exceeds 
> kmem_slow_growth_thresh
> - Stop arc_c growth when kmem exceeds kmem_target
> - Evict arc data when kmem usage exceeds kmem_target
> - If kmem usage exceeds kmem_target then ask the pagedaemon to reclaim pages
> - If the largest kmem fragment is less than kmem_fragment_target then ask 
> the pagedaemon to reclaim pages
> - If the largest kmem fragment is less than kmem_fragment_thresh then 
> force the aggressive kmem/arc reclamation process
>
> The defaults for the various targets and thresholds are:
>
> kmem_reclaim_threshold = 7/8 kmem
> kmem_target = 3/4 kmem
> kmem_slow_growth_threshold = 5/8 kmem
> kmem_fragment_target = 1/8 kmem
> kmem_fragment_thresh = 1/16 kmem
>
> With this patch I've been able to run my load tests with the default arc size 
> with kmem values of 512MB to 700MB.  I tried one loaded run with a 300MB 
> kmem, but it panicked due to legitimate, non-fragmented kmem exhaustion.
>
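The heuristics and default thresholds quoted above can be read as a small decision function.  What follows is only a userspace sketch of that reading, not code from the patch: the struct and function names (kmem_limits, kmem_actions, kmem_limits_init, arc_growth, kmem_check) are hypothetical, and the role given to kmem_reclaim_threshold is an assumption, since the list above does not say what crossing it triggers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Thresholds in bytes, derived from the quoted default fractions of kmem. */
struct kmem_limits {
    uint64_t reclaim_threshold;     /* 7/8 kmem */
    uint64_t target;                /* 3/4 kmem */
    uint64_t slow_growth_threshold; /* 5/8 kmem */
    uint64_t fragment_target;       /* 1/8 kmem */
    uint64_t fragment_thresh;       /* 1/16 kmem */
};

/* What the heuristics ask for at a given moment (hypothetical grouping). */
struct kmem_actions {
    bool evict_arc;
    bool wake_pagedaemon;
    bool aggressive_reclaim;
};

static struct kmem_limits
kmem_limits_init(uint64_t kmem_size)
{
    struct kmem_limits l;

    assert(kmem_size >= 16);
    l.reclaim_threshold = kmem_size / 8 * 7;
    l.target = kmem_size / 4 * 3;
    l.slow_growth_threshold = kmem_size / 8 * 5;
    l.fragment_target = kmem_size / 8;
    l.fragment_thresh = kmem_size / 16;
    return (l);
}

/* How much of a requested arc_c increment is allowed at this kmem usage. */
static uint64_t
arc_growth(const struct kmem_limits *l, uint64_t used, uint64_t requested)
{
    if (used > l->target)
        return (0);                 /* stop growth above kmem_target */
    if (used > l->slow_growth_threshold)
        return (requested / 2);     /* halve growth above the slow mark */
    return (requested);
}

static struct kmem_actions
kmem_check(const struct kmem_limits *l, uint64_t used,
    uint64_t largest_fragment)
{
    struct kmem_actions a = { false, false, false };

    if (used > l->target) {
        a.evict_arc = true;
        a.wake_pagedaemon = true;
    }
    if (largest_fragment < l->fragment_target)
        a.wake_pagedaemon = true;
    /*
     * ASSUMPTION: usage above kmem_reclaim_threshold also forces the
     * aggressive path; the list above only names the fragment trigger.
     */
    if (largest_fragment < l->fragment_thresh ||
        used > l->reclaim_threshold)
        a.aggressive_reclaim = true;
    return (a);
}
```

With a 512MB kmem these defaults put kmem_slow_growth_threshold at 320MB, kmem_target at 384MB and kmem_reclaim_threshold at 448MB, while the fragment marks sit at 64MB and 32MB.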

May I suggest an alternate approach:  have you considered placing zfs in 
its own kernel submap?  If all of its allocations are of a like size, 
fragmentation won't be an issue and it can be constrained to a fixed size 
without placing pressure on other kmem_map consumers.  This is the 
approach taken for the buffer cache.  It makes a good deal of sense.  If 
arc can be taught to handle allocation failures we could eliminate the 
panic entirely by simply causing arc to run out of space and flush more 
buffers.

Do you believe this would also address the problem?
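
To make the idea concrete, here is a userspace analogy, not kernel code and not the submap API.  A fixed-size arena hands out only 128KB blocks, so any free block satisfies any request and the arena cannot fragment; when it is full, the allocation fails and the cache evicts its oldest buffer rather than panicking.  All names (pool, cache, pool_alloc, cache_get_buf and so on) are hypothetical, and the FIFO cache is only a stand-in for the arc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE  (128 * 1024)    /* like-sized allocations only */
#define MAX_BLOCKS  64

/* Fixed-size arena: every block is BLOCK_SIZE, so any free block fits. */
struct pool {
    unsigned char *base;
    size_t nblocks;
    size_t nfree;
    size_t free_idx[MAX_BLOCKS];    /* stack of free block indices */
};

/* Toy FIFO cache standing in for the arc. */
struct cache {
    void *bufs[MAX_BLOCKS];
    size_t head;
    size_t count;
};

static int
pool_init(struct pool *p, size_t nblocks)
{
    assert(nblocks > 0 && nblocks <= MAX_BLOCKS);
    p->base = malloc(nblocks * BLOCK_SIZE);
    if (p->base == NULL)
        return (-1);
    p->nblocks = p->nfree = nblocks;
    for (size_t i = 0; i < nblocks; i++)
        p->free_idx[i] = i;
    return (0);
}

/* Returns NULL when the arena is exhausted instead of panicking. */
static void *
pool_alloc(struct pool *p)
{
    if (p->nfree == 0)
        return (NULL);
    return (p->base + p->free_idx[--p->nfree] * BLOCK_SIZE);
}

static void
pool_free(struct pool *p, void *blk)
{
    size_t idx = (size_t)((unsigned char *)blk - p->base) / BLOCK_SIZE;

    assert(idx < p->nblocks && p->nfree < p->nblocks);
    p->free_idx[p->nfree++] = idx;
}

/* Removes and returns the oldest cached buffer, or NULL if none. */
static void *
cache_evict_oldest(struct cache *c)
{
    void *victim;

    if (c->count == 0)
        return (NULL);
    victim = c->bufs[c->head];
    c->head = (c->head + 1) % MAX_BLOCKS;
    c->count--;
    return (victim);
}

/* Gets a buffer for the cache; on allocation failure, evicts and retries. */
static void *
cache_get_buf(struct cache *c, struct pool *p)
{
    void *blk = pool_alloc(p);

    if (blk == NULL) {
        void *victim = cache_evict_oldest(c);

        if (victim == NULL)
            return (NULL);
        pool_free(p, victim);
        blk = pool_alloc(p);
    }
    if (blk != NULL) {
        c->bufs[(c->head + c->count) % MAX_BLOCKS] = blk;
        c->count++;
    }
    return (blk);
}
```

The arena size is fixed at pool_init() time, which mirrors constraining the submap to a fixed size; nothing outside the pool is touched when the cache is under pressure.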

Thanks,
Jeff

> Please note that you may still encounter some fragmentation.  It's possible 
> for the system to get stuck in a degraded state where it's constantly trying 
> to free pages and memory in an attempt to fix the fragmentation.  If the system 
> is in this state the kstat.zfs.misc.arcstats.fragmented_kmem_count sysctl 
> will be increasing at a fairly rapid rate.
>
> Anyway, I just thought I would put this out there in case anyone wanted to 
> try to test with it.  I've mainly been loading it using rsync between two 
> pools on a non-SMP, i386, with 2GB memory.
>
> Also, if anyone is interested in helping with the fragmentation problem 
> please let me know.  At this point I think the best odds are to modify UMA to 
> allow some zones to use a custom slab size of 128KB (max zfs buffer size) so 
> that most of the allocations from kmem are the same size.  It also occurred 
> to me that much of this mess would be simpler if kmem information were passed 
> up through the vnode so that the top layer entities like pagedaemon could 
> make better choices for the overall memory usage of the system.  Right now we 
> have a sub-system two or three layers down making decisions for everyone. 
> Anyway, suggestions and insights are more than welcome.
>
> Thanks!
>
> - Ben
> _______________________________________________
> freebsd-current at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe at freebsd.org"