superpages for UMA
Alexander V. Chernikov
melifaro at FreeBSD.org
Mon Aug 18 15:03:23 UTC 2014
Hello list.
Currently UMA(9) uses PAGE_SIZE kegs to store items in.
This seems fine for most usage scenarios; however, there are some where
a very large number of items is required.
I've run into this problem while using ipfw tables (radix-based) with
~50k records. This is what
`pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` shows:
PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0 unresolved

%SAMP  IMAGE    FUNCTION   CALLERS
 28.7  kernel   rn_match   ipfw_lookup_table:21.7 rtalloc_fib_nolock:7.0
 25.5  ipfw.ko  ipfw_chk   ipfw_check_hook
  6.0  kernel   rn_lookup  ipfw_lookup_table
Some numbers: a table entry occupies 128 bytes, so we can store no more
than ~30 records in a single page-sized keg, and 50k records require
more than 1500 kegs.
As far as I understand, the second-level TLB on modern Intel CPUs may
have 256 or 512 entries (for 4K pages), so touching such a large number
of pages results in TLB misses happening constantly.
Other examples:
Route tables (in the current implementation): struct rte occupies more
than 128 bytes, and storing a full view (> 500k routes) would result in
TLB misses happening all of the time.
Various kinds of stateful packet processing: a modern SLB/firewall can
have millions of states. Regardless of state size, PAGE_SIZE'd kegs are
not the best choice.
All of these can be addressed:
ipfw tables/ipfw dynamic state allocation code can (and will) be
rewritten to use UMA + uma_zone_set_allocf() (suggested by glebius),
and radix should simply be changed to a different lookup algorithm (as
is happening in ipfw tables).
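A rough sketch of the uma_zone_set_allocf() direction, for discussion (uma_zone_set_allocf() and contigmalloc() are real KPIs, but the allocator body, the malloc type, the zone name, and the error handling here are illustrative assumptions, not a buildable module):

```c
/* Kernel-only sketch; error handling and teardown omitted. */
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/uma.h>

/*
 * Hypothetical backing allocator that hands UMA physically
 * contiguous, 2M-aligned chunks, so a whole keg can be covered
 * by a single superpage TLB entry.
 */
static void *
superpage_alloc(uma_zone_t zone, int bytes, uint8_t *pflag, int wait)
{

	*pflag = UMA_SLAB_KERNEL;
	return (contigmalloc(bytes, M_TEMP, wait, 0, ~(vm_paddr_t)0,
	    2 * 1024 * 1024, 0));
}

static uma_zone_t table_zone;

static void
table_zone_init(void)
{

	/* 128-byte items, as in the ipfw table example above. */
	table_zone = uma_zcreate("ipfw_table_ent", 128, NULL, NULL,
	    NULL, NULL, UMA_ALIGN_PTR, 0);
	uma_zone_set_allocf(table_zone, superpage_alloc);
}
```

Whether UMA would actually request large enough slabs from such an allocator (rather than PAGE_SIZE ones) is exactly the part that would need the new flag discussed below.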
However, we may consider adding another UMA flag to allocate
2M/1G-sized kegs per request.
(Additionally, the Intel Haswell arch has 512 STLB entries shared
between 4K/2M pages, so it should help the former.)
What do you think?