zfs + uma

Jeff Roberson jroberson at jroberson.net
Sun Sep 19 08:26:37 UTC 2010


On Sun, 19 Sep 2010, Andriy Gapon wrote:

> on 19/09/2010 01:16 Jeff Roberson said the following:
>> Not specifically in reaction to Robert's comment, but I would like to add
>> my thoughts on this notion of resource balancing in buckets.  I really
>> prefer not to do any specific per-zone tuning except in extreme cases,
>> because quite often the decisions we make don't apply to some class of
>> machines or workloads.  I would instead prefer to keep the algorithm
>> adaptable.
>
> Agree.
>
>> I like the idea of weighting the bucket decisions by the size of the item.
>> Obviously this has some flaws with compound objects, but in the general
>> case it is good.  We should consider increasing the cost of bucket
>> expansion based on the size of the item.  Right now buckets are expanded
>> fairly readily.
>>
>> We could also consider decreasing the default bucket size for a zone based
>> on vm pressure and use.  Right now there is no downward pressure on bucket
>> size, only upward pressure based on trips to the slab layer.
>>
>> Additionally we could make a last-ditch flush mechanism that runs on each
>> cpu in turn and flushes some or all of the buckets in per-cpu caches.
>> Presently that is not done due to synchronization issues; it can't be done
>> from a central place.  It could be done with a callout mechanism or a for
>> loop that binds to each core in succession.
>
> I like all three of the above approaches.
> The last one is a bit hard to implement; the first two seem easier.

All the last one requires is a loop calling sched_bind() on each available 
cpu.
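
For concreteness, a minimal sketch of that loop (untested;
uma_drain_cpu_caches() is a hypothetical helper standing in for whatever
would actually flush the current cpu's buckets):

#include <sys/param.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>

static void uma_drain_cpu_caches(void);		/* hypothetical */

static void
uma_flush_pcpu_caches(void)
{
	struct thread *td = curthread;
	int cpu;

	CPU_FOREACH(cpu) {
		thread_lock(td);
		sched_bind(td, cpu);	/* migrate to this cpu */
		thread_unlock(td);
		/* Bound here, so the drain touches only local buckets. */
		uma_drain_cpu_caches();
	}
	thread_lock(td);
	sched_unbind(td);		/* allow normal migration again */
	thread_unlock(td);
}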

>
>> I believe the combination of these approaches would largely solve the
>> problem and should require relatively little new code.  It should also
>> preserve the adaptable nature of the system without penalizing
>> resource-heavy systems.  I would be happy to review patches from anyone
>> who wishes to undertake it.
>
> FWIW, the approach of simply limiting maximum bucket size based on item size
> seems to work rather well too, as my testing with zfs+uma shows.
> I will also try to add code to completely bypass the per-cpu cache for "really
> huge" items.

I don't like the bypass idea, because even with very large buffers you can
still have high enough turnover to require per-cpu caching.  Kip
specifically added UMA support to address this issue in zfs.  If you have
allocations which don't require per-cpu caching and are very large, why
even use UMA?

One thing that would be nice, if we are frequently using page-size
allocations, is to eliminate the requirement for a slab header for each
page.  A header seems unnecessary for any zone where the number of items
per slab is 1, but it would require careful modification to support
properly.
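
The reason it looks feasible: with one item per slab, the item's page alone
identifies the slab, which the existing vtoslab() lookup in vm/uma_int.h
already exploits.  A sketch of that identity (slab_of() is hypothetical;
only vtoslab() is real):

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/uma.h>
#include <vm/uma_int.h>

/*
 * Hypothetical illustration: with one item per slab, the per-page
 * header stores nothing that couldn't be derived from, or hung off,
 * the vm_page itself, as vtoslab() demonstrates.
 */
static uma_slab_t
slab_of(void *item)
{
	return (vtoslab(trunc_page((vm_offset_t)item)));
}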

Thanks,
Jeff

>
> -- 
> Andriy Gapon
>

