When will ZFS become stable?

Mon Jan 7 15:39:14 PST 2008

On Tue, 8 Jan 2008, Vadim Goncharov wrote:

>> To make life slightly more complicated, small malloc allocations are 
>> actually implemented using uma -- there are a small number of small object 
>> size zones reserved for this purpose, and malloc just rounds up to the next 
>> such bucket size and allocates from that bucket.  For larger sizes, 
>> malloc goes through uma, but pretty much directly to VM which makes pages 
>> available directly.  So when you look at "vmstat -z" output, be aware that 
>> some of the information presented there (zones named things like "128", 
>> "256", etc) are actually the pools from which malloc allocations come, so 
>> there's double-counting.
>
> Yes, I knew that, but I didn't know exactly what the column names mean. 
> Requests/Failures, I guess, are pure statistics, and Size is the size of one 
> element, but why is USED + FREE != LIMIT (on those where the limit is non-zero)?

Possibly we should rename the "FREE" column to "CACHE" -- the free count is 
the number of items in the UMA cache.  These may be hung in buckets off the 
per-CPU cache, or be spare buckets in the zone.  Either way, the memory has to 
be reclaimed before it can be used for other purposes, and generally, for 
complex objects, allocating from the cache is much quicker than going back to 
VM for more memory.  LIMIT is an administrative limit that may be configured on 
the zone, and is configured for some but not all zones.
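
To make the column relationship concrete, here is a minimal toy model in 
Python.  It is only a sketch, not UMA's actual data structures: the ToyZone 
class and its methods are hypothetical, and the rule that the cap counts used 
plus cached items is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyZone:
    """Toy stand-in for one "vmstat -z" row; not UMA's real data structures."""
    size: int       # bytes per item (SIZE)
    limit: int      # administrative cap on items, 0 = none configured (LIMIT)
    used: int = 0   # items currently handed out (USED)
    free: int = 0   # items parked in the cache, ready for reuse (FREE)

    def alloc(self) -> bool:
        if self.free > 0:
            # Fast path: reuse a cached item without going back to VM.
            self.free -= 1
            self.used += 1
            return True
        # Assumption for this sketch: the cap counts used plus cached items.
        if self.limit and self.used + self.free >= self.limit:
            return False
        self.used += 1  # slow path: pretend fresh memory was fetched
        return True

    def release(self) -> None:
        if self.used == 0:
            raise ValueError("nothing to release")
        # Freed items go back to the cache, not straight back to VM.
        self.used -= 1
        self.free += 1

    def reclaim(self) -> int:
        # Drain the cache and report how many bytes became available.
        drained = self.free * self.size
        self.free = 0
        return drained


zone = ToyZone(size=256, limit=10)
for _ in range(4):
    zone.alloc()
zone.release()
print(zone.used, zone.free, zone.limit)
```

In this toy, USED + FREE is 4 while LIMIT stays at 10: the limit is a ceiling, 
not a description of what the zone currently holds, and the cached item only 
becomes available for other purposes after reclaim() is called.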

I'll let someone with a bit more VM experience follow up with more information 
about how the various maps and submaps relate to each other.

>> Kernel memory, as seen above, is a bit of a convoluted concept. Simple 
>> memory allocated by the kernel for its internal data 
>> structures, such as vnodes, sockets, mbufs, etc, is almost always not 
>> something that can be paged, as it may be accessed from contexts where 
>> blocking on I/O is not permitted (for example, in interrupt threads or with 
>> critical mutexes held). However, other memory in the kernel map may well be 
>> pageable, such as kernel thread stacks for sleeping user threads
>
> We can assume for simplicity that their memory is not-so-kernel but part of 
> the process address space :)

If it is mapped into the kernel address space, then it still counts towards 
the limit on the map.  There are really two critical resources: memory itself, 
and address space to map it into.  Over time, the balance between address 
space and memory changes -- for a long time, 32 bits was the 640k of the UNIX 
world, so there was always plenty of address space and not enough memory to 
fill it.  More recently, physical memory started to overtake address space, 
and now with the advent of widely available 64-bit systems, it's swinging in 
the other direction.  The trick is always in how to tune things, as tuning 
parameters designed for "memory is bounded and address space is infinite" 
often work less well when that's not the case.  In the early 5.x series, we 
had a lot of kernel panics because kernel constants were scaling to physical 
memory rather than address space, so the kernel would run out of address 
space, for example.
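
As a rough illustration of that failure mode, the arithmetic can be sketched 
in Python.  Every number below is hypothetical (a 1 GiB kernel map and a table 
sized at one quarter of its basis), and table_size() and fits_in_kernel_map() 
are made-up helpers, not real kernel tunables; the point is only what happens 
when the scaling basis is physical memory rather than address space.

```python
GIB = 1024 ** 3

# All numbers here are hypothetical, chosen only to make the arithmetic visible.
KERNEL_ADDRESS_SPACE = 1 * GIB   # assumed size of the kernel map
SCALE_DIVISOR = 4                # assumed: table sized at 1/4 of its basis


def table_size(basis_bytes: int) -> int:
    """Size a kernel table as a fixed fraction of some basis."""
    return basis_bytes // SCALE_DIVISOR


def fits_in_kernel_map(nbytes: int) -> bool:
    """True if the table could be mapped into the assumed kernel map."""
    return nbytes <= KERNEL_ADDRESS_SPACE


for ram in (GIB // 2, 2 * GIB, 8 * GIB):
    by_ram = table_size(ram)
    # Clamp the basis so the table can never outgrow the address space.
    by_min = table_size(min(ram, KERNEL_ADDRESS_SPACE))
    print(f"RAM {ram / GIB:4.1f} GiB: "
          f"scaled by RAM -> {by_ram / GIB:.3f} GiB "
          f"(fits: {fits_in_kernel_map(by_ram)}), "
          f"scaled by min(RAM, address space) -> {by_min / GIB:.3f} GiB "
          f"(fits: {fits_in_kernel_map(by_min)})")
```

With these assumed values, scaling by physical memory alone asks for 2 GiB of 
table on the 8 GiB machine, which cannot fit in a 1 GiB kernel map, while 
clamping the basis to the address space keeps the request at 0.25 GiB.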

>> (which can be swapped out under heavy memory load), pipe buffers, and 
>> general cached data for the buffer cache / file system, which will be paged 
>> out or discarded when memory pressure goes up.
>
> Umm. I think there is no point in swapping disk cache, which can simply be 
> discarded, so the main part of kernel memory that is actually swappable is 
> anonymous pipe(2) buffers?

Yes, that's what I meant.  There are some other types of pageable kernel 
memory, such as memory used for swap-backed md devices.

Robert N M Watson
Computer Laboratory
University of Cambridge