Cleanup and untangling of kernel VM initialization
Konstantin Belousov
kostikbel at gmail.com
Fri Mar 8 09:16:41 UTC 2013
On Thu, Mar 07, 2013 at 06:03:51PM +0100, Andre Oppermann wrote:
> On 01.02.2013 18:09, Alan Cox wrote:
> > On 02/01/2013 07:25, Andre Oppermann wrote:
> >> Rebase auto-sizing of limits on the available KVM/kmem_map instead of
> >> physical memory. Depending on the kernel and architecture configuration
> >> these two can be very different.
> >>
> >> Comments and reviews appreciated.
> >>
> >
> > I would really like to see the issues with the current auto-sizing code
> > addressed before any of the stylistic changes or en-masse conversions to
> > SYSINIT()s are considered. In particular, can we please start with the
> > patch that moves the pipe_map initialization? After that, I think that
> > we should revisit tunable_mbinit() and "maxmbufmem".
>
> OK. I'm trying to describe and explain the big picture for myself and
> other interested observers. The following text and explanations are going
> to be verbose and sometimes redundant. If something is incorrect or
> incomplete please yell; I'm not an expert in all these parts and may
> easily have missed some subtle aspects.
>
> The kernel_map serves as the container of the entire available kernel VM
> address space, including the kernel text, data and bss itself, as well as
> other bootstrapped and pre-VM allocated structures.
>
> The kernel_map should cover a reasonably large amount of address space to
> be able to serve the various kernel subsystems' demands for memory
> allocation. The cpu architecture's address range (32 or 64 bits) puts a
> hard ceiling on the total size of the kernel_map. Depending on the
> architecture the kernel_map covers a specific portion of the total
> addressable range.
>
> * VM_MIN_KERNEL_ADDRESS
> * [KERNBASE]
> * kernel_map [actually mapped KVM range, direct allocations]
> * kernel text, data, bss
> * bootstrap and statically allocated structures [pmap]
> * virtual_avail [start of useable KVM]
> * kmem_map [submap for (most) UMA zones and kernel malloc]
> * exec_map [submap for temporary mapping during process exec()]
> * pipe_map [submap for temporary buffering of data between piped processes]
> * clean_map [submap for buffer_map and pager_map]
> * buffer_map [submap for BIO buffers]
> * pager_map [submap for temporary pager IO holding]
> * memguard_map [submap for debugging of UMA and kernel malloc]
> * ... [kernel_map direct allocations, free and unused space]
> * kernel_map [end of kernel_map]
> * ...
> * virtual_end [end of possible KVM]
> * VM_MAX_KERNEL_ADDRESS
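For illustration, the mechanism for carving such a submap out of kernel_map
is kmem_suballoc(). The snippet below is only a sketch written from memory,
not verbatim source; "foo_map" and foo_map_init() are made-up names, and the
kmem_suballoc() signature assumed is the 9.x-era one:

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <vm/vm.h>
  #include <vm/vm_extern.h>
  #include <vm/vm_kern.h>
  #include <vm/vm_map.h>

  static vm_map_t foo_map;              /* hypothetical example submap */

  static void
  foo_map_init(vm_size_t foo_map_size)
  {
          vm_offset_t minaddr, maxaddr;

          /*
           * Reserve foo_map_size bytes of KVA inside kernel_map and get
           * back a new map covering exactly that range; minaddr and
           * maxaddr receive the bounds of the reservation.
           */
          foo_map = kmem_suballoc(kernel_map, &minaddr, &maxaddr,
              foo_map_size, FALSE);
  }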
>
> Some of the kernel_map's submaps are special in that they are non-pageable
> and pre-allocate the necessary pmap structures to avoid page
> faults. The pre-allocation consumes physical memory. Thus a submap's
> pre-allocation should not be larger than a reasonably small fraction
> of available physical memory to leave enough space for other kernel
> and userspace memory demands.
Preallocation is done to ensure that calls to functions like pmap_qenter()
always succeed and never need to sleep to do so.
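To make that concrete, the pattern the pre-allocation enables looks roughly
like this (a sketch only, not verbatim source; the helper name is made up):

  #include <sys/param.h>
  #include <vm/vm.h>
  #include <vm/pmap.h>
  #include <vm/vm_page.h>

  /*
   * Map "count" pages into a KVA window "va" that was reserved from a
   * submap whose page-table pages were pre-allocated.  Because the PTEs
   * already exist, pmap_qenter() only fills them in; it cannot fail and
   * never sleeps, which is what makes it safe on I/O paths.
   */
  static void
  map_pages_for_io(vm_offset_t va, vm_page_t *pages, int count)
  {
          pmap_qenter(va, pages, count);  /* enter temporary wired mappings */
          /* ... perform the I/O on the mapped range ... */
          pmap_qremove(va, count);        /* tear the mappings down again */
  }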
>
> The pseudo-code for a dynamic calculation of a submap size would look like this:
>
>   submap.size = min(physmem.size / pmap.prealloc_max_fraction /
>                     pmap.size_per_page * page_size,
>                     kernel_map.free_size)
>
> The pmap.prealloc_max_fraction is the largest fraction of physical
> memory we allow the pre-allocated pmap structures of a single submap
> to occupy.
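Spelled out as stand-alone userland C, purely to illustrate the rule; all of
the numbers below are made-up examples:

  #include <stdint.h>
  #include <stdio.h>

  int
  main(void)
  {
          uint64_t physmem_size = 4ULL << 30;     /* example: 4 GB of RAM */
          uint64_t kmap_free = 1ULL << 30;        /* example: 1 GB of free KVM */
          uint64_t page_size = 4096;
          uint64_t prealloc_max_fraction = 32;    /* at most 1/32 of RAM for pmap structures */
          uint64_t pmap_size_per_page = 8;        /* example: 8 bytes of PTE per mapped page */
          uint64_t submap_size;

          /* The rule quoted above, step by step. */
          submap_size = physmem_size / prealloc_max_fraction /
              pmap_size_per_page * page_size;
          if (submap_size > kmap_free)            /* min(..., kernel_map.free_size) */
                  submap_size = kmap_free;

          printf("submap size: %ju MB\n", (uintmax_t)(submap_size >> 20));
          return (0);
  }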
>
> Separate submaps are usually used to segregate certain types of memory
> usage and to have individual limits applied to them:
>
> kmem_map: tries to be as large as possible. It serves the bulk of
> all dynamically allocated kernel memory usage. It is the memory
> pool used by UMA and kernel malloc. Almost all kernel structures
> come from here: process and thread structures, file descriptors,
> mbufs and mbuf clusters, network connection control blocks, sockets,
> etc... It is not pageable. Calculation: currently only partially done
> dynamically; the MD parts can specify particular min and max limits
> and scaling factors. It can likely be generalized, with only very
> special platforms requiring additional limits.
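The existing auto-sizing in kmeminit() is, roughly speaking, a clamp of
physmem / scale between an MD minimum and maximum. Modeled stand-alone
below; the numbers are examples only and the VM_KMEM_SIZE_MIN/MAX/SCALE
values differ per architecture:

  #include <stdint.h>
  #include <stdio.h>

  int
  main(void)
  {
          uint64_t physmem = 2ULL << 30;          /* example: 2 GB of RAM */
          uint64_t kmem_min = 12ULL << 20;        /* stand-in for VM_KMEM_SIZE_MIN */
          uint64_t kmem_max = 1ULL << 32;         /* stand-in for VM_KMEM_SIZE_MAX */
          uint64_t kmem_scale = 3;                /* stand-in for VM_KMEM_SIZE_SCALE */
          uint64_t kmem_size;

          kmem_size = physmem / kmem_scale;       /* 1/scale of physical memory ... */
          if (kmem_size < kmem_min)
                  kmem_size = kmem_min;           /* ... but at least the MD minimum ... */
          if (kmem_size > kmem_max)
                  kmem_size = kmem_max;           /* ... and at most the MD maximum. */

          printf("vm_kmem_size: %ju MB\n", (uintmax_t)(kmem_size >> 20));
          return (0);
  }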
>
> exec_map: is used as temporary storage to set up a process's address
> space and related items. It is very small and by default contains
> only 16 entries. Calculation: (exec_map_entries * round_page(PATH_MAX
> + ARG_MAX)).
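For reference, the corresponding setup in vm_ksubmap_init() is, from memory,
something like the following (paraphrased, not verbatim):

          exec_map = kmem_suballoc(kernel_map, &minaddr, &maxaddr,
              exec_map_entries * round_page(PATH_MAX + ARG_MAX), FALSE);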
>
> pipe_map: is used to move piped data between processes. It is
> pageable memory. Calculation: min(physmem.size, kernel_map.size) /
> 64.
>
> clean_map: overarching submap to contain the buffer_map and
> pager_map. Likely no longer necessary and a leftover from earlier
> incarnations of the kernel VM.
>
> buffer_map: is used for BIO structures to perform IO between the
> kernel VM and storage media (disk). Not pageable. Calculation:
> min(physmem.size, kernel_map.size) / 4 up to 64MB and 1/10
> thereafter.
>
> pager_map: is used for pager IO to a storage media (disk). Not
> pageable. Calculation: MAXPHYS * min(max(nbuf/4, 16), 256).
It is more versatile: the space is used for pbufs, and pbufs currently
also serve physio, clustering, and aio needs.
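To see what the three calculations above (pipe_map, buffer_map, pager_map)
yield together, here is a stand-alone userland model of them. It follows the
rules as quoted rather than the exact kernel code, and the RAM/KVM/BKVASIZE
numbers are examples only:

  #include <stdint.h>
  #include <stdio.h>

  #define MB              (1ULL << 20)
  #define MAXPHYS_BYTES   (128ULL * 1024)         /* typical MAXPHYS */
  #define BKVA            (16ULL * 1024)          /* example BKVASIZE */

  static uint64_t
  min64(uint64_t a, uint64_t b)
  {
          return (a < b ? a : b);
  }

  int
  main(void)
  {
          uint64_t physmem = 2048 * MB;           /* example: 2 GB of RAM */
          uint64_t kmapsize = 1536 * MB;          /* example: 1.5 GB of KVM */
          uint64_t base = min64(physmem, kmapsize);
          uint64_t pipe_map_size, buffer_map_size, pager_map_size, nbuf;

          /* pipe_map: min(physmem.size, kernel_map.size) / 64 */
          pipe_map_size = base / 64;

          /* buffer_map: 1/4 of the first 64 MB, 1/10 of the rest */
          if (base <= 64 * MB)
                  buffer_map_size = base / 4;
          else
                  buffer_map_size = (64 * MB) / 4 + (base - 64 * MB) / 10;
          nbuf = buffer_map_size / BKVA;          /* rough buffer count */

          /* pager_map: MAXPHYS * min(max(nbuf / 4, 16), 256) */
          pager_map_size = MAXPHYS_BYTES *
              min64(nbuf / 4 > 16 ? nbuf / 4 : 16, 256);

          printf("pipe_map:   %4ju MB\n", (uintmax_t)(pipe_map_size / MB));
          printf("buffer_map: %4ju MB\n", (uintmax_t)(buffer_map_size / MB));
          printf("pager_map:  %4ju MB\n", (uintmax_t)(pager_map_size / MB));
          return (0);
  }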
>
> memguard_map: is a special debugging submap substituting parts of
> kmem_map. Normally not used.
>
> There is some competition between these maps for physical memory. One
> has to be careful to find a total balance among them wrt. static and
> dynamic physical memory use.
They mostly compete for KVA, not for physical memory.
>
> Within the submaps, especially the kmem_map, we have a number of
> dynamic UMA suballocators where we have to put a ceiling on their
> total memory usage to prevent them from consuming all physical *and/or*
> kmem_map virtual memory. This is done with UMA zone limits.
Note that architectures with direct maps do not use kmem_map for
small allocations. The uma_small_alloc() utilizes the direct map
for the VA of the new page. kmem_map is only needed when an allocation
spans multiple pages, to provide the contiguous virtual mapping.
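A condensed sketch of what such a direct-map backed page allocator does, in
the spirit of the amd64 uma_small_alloc() (simplified and from memory, not
verbatim source):

  #include <sys/param.h>
  #include <sys/malloc.h>
  #include <vm/vm.h>
  #include <vm/vm_page.h>
  #include <machine/vmparam.h>

  /*
   * Allocate one page and hand out its direct-map address, so neither
   * kmem_map KVA nor new page-table entries are needed.
   */
  static void *
  dmap_page_alloc_sketch(int wait)
  {
          vm_page_t m;
          int pflags;

          pflags = VM_ALLOC_WIRED | VM_ALLOC_NOOBJ |
              ((wait & M_WAITOK) != 0 ? VM_ALLOC_NORMAL : VM_ALLOC_INTERRUPT);
          m = vm_page_alloc(NULL, 0, pflags);
          if (m == NULL)
                  return (NULL);
          return ((void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)));
  }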
>
> No externally exploitable single UMA zone should be able to consume
> all available physical memory. This applies for example to the
> number of processes, file descriptors, sockets, mbufs and mbuf
> clusters. These need to be limited to a fraction of available physical
> memory that is reasonable yet still permits heavy work-loads. However
> there is going to be overcommit among them and not all of them can be
> at their limit at the same time. Probably none of these UMA zones
> should be allowed to occupy more than 1/2 of all available physical
> memory. Often individual UMA zone limits have to be put into context
> and related to other concurrent UMA zones. This usually means a reduced
> limit for a particular zone. Balancing this takes a certain amount of
> voodoo magic and knowledge of common extreme work-loads. On the other
> hand, for most of those zones allocations are permitted to fail,
> rendering an attempt at connection establishment unsuccessful; it can
> be retried later.
>
> Generic pseudo-code: UMA zone limit = min(kmem_map.size, physmem.size)
> / 4 (or other appropriate fraction).
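Translated into code, applying such a cap to a zone would look roughly like
the sketch below. "foo_zone", foo_zone_limit() and item_size are
placeholders; vm_kmem_size and physmem are the existing globals:

  #include <sys/param.h>
  #include <sys/systm.h>
  #include <vm/uma.h>

  extern u_long vm_kmem_size;           /* kmem_map size, set up by kmeminit() */

  /*
   * Cap a hypothetical zone at 1/4 of min(kmem_map size, physical
   * memory), expressed as a number of items.  physmem is in pages,
   * vm_kmem_size in bytes.
   */
  static void
  foo_zone_limit(uma_zone_t foo_zone, size_t item_size)
  {
          u_long limit;

          limit = MIN(vm_kmem_size, (u_long)physmem * PAGE_SIZE) / 4;
          uma_zone_set_max(foo_zone, (int)(limit / item_size));
  }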
>
> It could be that some of the kernel_map submaps are no longer
> necessary and their purpose could simply be emulated by using an
> appropriately limited UMA zone. For example the exec_map is very small
> and only used for the exec arguments. Putting this into pageable
> memory isn't very useful anymore.
I disagree. Keeping the strings copied on execve() pageable is good;
the default maximum of around 260KB for the strings is quite a load
on the allocator.
>
> Also the interesting construct of the clean_map containing only
> the buffer_map and pager_map doesn't seem necessary anymore and is
> probably a remnant of an earlier incarnation of the VM.
>
> Comments, discussion and additional input welcome.
>
> -- Andre