About Transparent Superpages and Non-transparent superpages
Cedric Blancher
cedric.blancher at gmail.com
Sat Sep 21 01:09:26 UTC 2013
On 20 September 2013 17:20, Sebastian Kuzminsky <S.Kuzminsky at f5.com> wrote:
> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>
>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on amd64. We're basing our work on 9.1.0 for now.
>> >
>> >An early preview is available here:
>> >
>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>
>> That is cool.
>>
>> What type of applications can take advantage of the 1Gb page size?
>> And is it transparent? Or applications need to be modified?
>
> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is backed by 1 gig superpages.
>
> It's not transparent for userspace: applications need to pass a new flag to mmap() to get 1 gig pages.
That may be the wrong approach. What happens if x86 gets more
huge/large page sizes like SPARC does (hint: sign an NDA with Intel
and AMD and get surprised, and then allocate 16 more bits for mmap()
if you wish to stick with your approach)? For example, SPARC64 does
8k, 64k, 512k, 4M, 32M, 256M, 2GB and 256GB pages (actual page sizes
differ from MMU to MMU implementation, and can be probed via pagesize
-a).
A much better option would be to follow the Solaris API, which can
enumerate the available page sizes and then set one for the heap,
the stack or a given address range (the last one is how largepages
are used for file I/O via mmap()).
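A minimal sketch of the enumeration step, assuming getpagesizes(3C)
from <sys/mman.h> on Solaris-derived systems. The helper names
list_page_sizes() and print_page_sizes(), the MAX_SIZES bound and the
non-Solaris fallback (base page size only) are made up for this
illustration:

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>
#if defined(__sun)
#include <sys/mman.h>           /* getpagesizes(3C) */
#endif

#define MAX_SIZES 32            /* hypothetical upper bound for this sketch */

/* Fill sizes[] with the supported page sizes; return how many, or -1. */
int list_page_sizes(size_t sizes[], int max)
{
    if (sizes == NULL || max < 1)
        return -1;
#if defined(__sun)
    return getpagesizes(sizes, max);
#else
    /* Assumed fallback: report only the base page size. */
    long base = sysconf(_SC_PAGESIZE);
    if (base <= 0)
        return -1;
    sizes[0] = (size_t)base;
    return 1;
#endif
}

/* Print one size per line, similar in spirit to `pagesize -a`. */
void print_page_sizes(void)
{
    size_t sizes[MAX_SIZES];
    int n = list_page_sizes(sizes, MAX_SIZES);

    for (int i = 0; i < n; i++)
        printf("%zu\n", sizes[i]);
}
```

Calling print_page_sizes() from main() should list whatever the MMU
in the machine offers, so an application never has to hardcode a
size or a per-size mmap() flag.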
For example, ksh93 uses this to request 64k pages for the stack (this
mainly aims at SPARC, where 64k stack pages can be a real performance
booster if you shuffle a lot of strings via the stack):
-----------
int main(int argc, char *argv[])
{
#if _lib_memcntl
	/* advise larger stack size */
	struct memcntl_mha mha;
	mha.mha_cmd = MHA_MAPSIZE_STACK;
	mha.mha_flags = 0;
	mha.mha_pagesize = 64 * 1024;
	(void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#endif
	return(sh_main(argc, argv, (Shinit_f)0));
}
-----------
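The same call works for the heap. Below is a rough sketch, not taken
from ksh93, that checks the return value instead of discarding it.
The function name advise_heap_pagesize() and the ENOSYS fallback on
systems without memcntl(2) are assumptions made for this example:

```c
#include <errno.h>
#include <stddef.h>
#if defined(__sun)
#include <sys/types.h>
#include <sys/mman.h>
#endif

/*
 * Ask for `pagesize`-byte pages for the process heap; 0 lets the
 * system choose. Returns 0 on success, -1 with errno set otherwise.
 */
int advise_heap_pagesize(size_t pagesize)
{
#if defined(__sun)
    struct memcntl_mha mha;

    mha.mha_cmd = MHA_MAPSIZE_BSSBRK;
    mha.mha_flags = 0;              /* reserved, must be 0 */
    mha.mha_pagesize = pagesize;
    /* addr and len must be NULL and 0 for MHA_MAPSIZE_BSSBRK */
    return memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#else
    /* Assumption: no memcntl(2) outside Solaris-derived systems. */
    (void)pagesize;
    errno = ENOSYS;
    return -1;
#endif
}
```

Since this is only advice, a caller can treat a failure as
non-fatal and simply carry on with the default page size.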
Below is the memcntl(2) manpage describing the API:
---------------------------------------
System Calls memcntl(2)
NAME
memcntl - memory management control
SYNOPSIS
#include <sys/types.h>
#include <sys/mman.h>
int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
int attr, int mask);
DESCRIPTION
The memcntl() function allows the calling process to apply a
variety of control operations over the address space
identified by the mappings established for the address range
[addr, addr + len).
The addr argument must be a multiple of the pagesize as
returned by sysconf(3C). The scope of the control operations
can be further defined with additional selection criteria
(in the form of attributes) according to the bit pattern
contained in attr.
The following attributes specify page mapping selection
criteria:
SHARED Page is mapped shared.
PRIVATE Page is mapped private.
The following attributes specify page protection selection
criteria. The selection criteria are constructed by a
bitwise OR operation on the attribute bits and must match
exactly.
PROT_READ Page can be read.
PROT_WRITE Page can be written.
PROT_EXEC Page can be executed.
The following criteria may also be specified:
SunOS 5.11 Last change: 10 Apr 2007 1
PROC_TEXT Process text.
PROC_DATA Process data.
The PROC_TEXT attribute specifies all privately mapped
segments with read and execute permission, and the PROC_DATA
attribute specifies all privately mapped segments with write
permission.
Selection criteria can be used to describe various abstract
memory objects within the address space on which to operate.
If an operation shall not be constrained by the selection
criteria, attr must have the value 0.
The operation to be performed is identified by the argument
cmd. The symbolic names for the operations are defined in
<sys/mman.h> as follows:
MC_LOCK
Lock in memory all pages in the range with attributes
attr. A given page may be locked multiple times through
different mappings; however, within a given mapping,
page locks do not nest. Multiple lock operations on the
same address in the same process will all be removed
with a single unlock operation. A page locked in one
process and mapped in another (or visible through a
different mapping in the locking process) is locked in
memory as long as the locking process does neither an
implicit nor explicit unlock operation. If a locked
mapping is removed, or a page is deleted through file
removal or truncation, an unlock operation is implicitly
performed. If a writable MAP_PRIVATE page in the address
range is changed, the lock will be transferred to the
private page.
The arg argument is not used, but must be 0 to ensure
compatibility with potential future enhancements.
MC_LOCKAS
Lock in memory all pages mapped by the address space
with attributes attr. The addr and len arguments are not
used, but must be NULL and 0 respectively, to ensure
compatibility with potential future enhancements. The
arg argument is a bit pattern built from the flags:
MCL_CURRENT Lock current mappings.
MCL_FUTURE Lock future mappings.
The value of arg determines whether the pages to be
locked are those currently mapped by the address space,
those that will be mapped in the future, or both. If
MCL_FUTURE is specified, then all mappings subsequently
added to the address space will be locked, provided
sufficient memory is available.
MC_SYNC
Write to their backing storage locations all modified
pages in the range with attributes attr. Optionally,
invalidate cache copies. The backing storage for a
modified MAP_SHARED mapping is the file the page is mapped
to; the backing storage for a modified MAP_PRIVATE
mapping is its swap area. The arg argument is a bit pattern
built from the flags used to control the behavior of the
operation:
MS_ASYNC Perform asynchronous writes.
MS_SYNC Perform synchronous writes.
MS_INVALIDATE Invalidate mappings.
MS_ASYNC Return immediately once all write operations
are scheduled; with MS_SYNC the function will not return
until all write operations are completed.
MS_INVALIDATE Invalidate all cached copies of data in
memory, so that further references to the pages will be
obtained by the system from their backing storage
locations. This operation should be used by applications
that require a memory object to be in a known state.
MC_UNLOCK
Unlock all pages in the range with attributes attr. The
arg argument is not used, but must be 0 to ensure
compatibility with potential future enhancements.
MC_UNLOCKAS
Remove address space memory locks and locks on all pages
in the address space with attributes attr. The addr,
len, and arg arguments are not used, but must be NULL, 0
and 0, respectively, to ensure compatibility with
potential future enhancements.
MC_HAT_ADVISE
Advise system how a region of user-mapped memory will be
accessed. The arg argument is interpreted as a "struct
memcntl_mha *". The following members are defined in a
struct memcntl_mha:
uint_t mha_cmd;
uint_t mha_flags;
size_t mha_pagesize;
The accepted values for mha_cmd are:
MHA_MAPSIZE_VA
MHA_MAPSIZE_STACK
MHA_MAPSIZE_BSSBRK
The mha_flags member is reserved for future use and must
always be set to 0. The mha_pagesize member must be a
valid size as obtained from getpagesizes(3C) or the
constant value 0 to allow the system to choose an
appropriate hardware address translation mapping size.
MHA_MAPSIZE_VA sets the preferred hardware address
translation mapping size of the region of memory from
addr to addr + len. Both addr and len must be aligned to
an mha_pagesize boundary. The entire virtual address
region from addr to addr + len must not have any holes.
Permissions within each mha_pagesize-aligned portion of
the region must be consistent. When a size of 0 is
specified, the system selects an appropriate size based
on the size and alignment of the memory region, type of
processor, and other considerations.
MHA_MAPSIZE_STACK sets the preferred hardware address
translation mapping size of the process main thread
stack segment. The addr and len arguments must be NULL
and 0, respectively.
MHA_MAPSIZE_BSSBRK sets the preferred hardware address
translation mapping size of the process heap. The addr
and len arguments must be NULL and 0, respectively. See
the NOTES section of the ppgsz(1) manual page for
additional information on process heap alignment.
The attr argument must be 0 for all MC_HAT_ADVISE
operations.
The mask argument must be 0; it is reserved for future use.
Locks established with the lock operations are not inherited
by a child process after fork(2). The memcntl() function
fails if it attempts to lock more memory than a
system-specific limit.
Due to the potential impact on system resources, the
operations MC_LOCKAS, MC_LOCK, MC_UNLOCKAS, and MC_UNLOCK are
restricted to privileged processes.
USAGE
The memcntl() function subsumes the operations of plock(3C).
MC_HAT_ADVISE is intended to improve performance of
applications that use large amounts of memory on processors that
support multiple hardware address translation mapping sizes;
however, it should be used with care. Not all processors
support all sizes with equal efficiency. Use of larger sizes
may also introduce extra overhead that could reduce
performance or available memory. Using large sizes for one
application may reduce available resources for other applications
and result in slower system wide performance.
RETURN VALUES
Upon successful completion, memcntl() returns 0; otherwise,
it returns -1 and sets errno to indicate an error.
ERRORS
The memcntl() function will fail if:
EAGAIN When the selection criteria match, some or all of
the memory identified by the operation could not
be locked when MC_LOCK or MC_LOCKAS was specified,
some or all mappings in the address range [addr,
addr + len) are locked for I/O when MC_HAT_ADVISE
was specified, or the system has insufficient
resources when MC_HAT_ADVISE was specified.
The cmd is MC_LOCK or MC_LOCKAS and locking the
memory identified by this operation would exceed a
limit or resource control on locked memory.
EBUSY When the selection criteria match, some or all of
the addresses in the range [addr, addr + len) are
locked and MC_SYNC with the MS_INVALIDATE option
was specified.
EINVAL The addr argument specifies invalid selection
criteria or is not a multiple of the page size as
returned by sysconf(3C); the addr and/or len
argument does not have the value 0 when MC_LOCKAS
or MC_UNLOCKAS is specified; the arg argument is
not valid for the function specified; mha_pagesize
or mha_cmd is invalid; or MC_HAT_ADVISE is
specified and not all pages in the specified region
have the same access permissions within the given
size boundaries.
ENOMEM When the selection criteria match, some or all of
the addresses in the range [addr, addr + len) are
invalid for the address space of a process or
specify one or more pages which are not mapped.
EPERM The {PRIV_PROC_LOCK_MEMORY} privilege is not
asserted in the effective set of the calling
process and MC_LOCK, MC_LOCKAS, MC_UNLOCK, or
MC_UNLOCKAS was specified.
ATTRIBUTES
See attributes(5) for descriptions of the following
attributes:
____________________________________________________________
| ATTRIBUTE TYPE | ATTRIBUTE VALUE |
|______________________________|______________________________|
| MT-Level | MT-Safe |
|______________________________|______________________________|
SEE ALSO
ppgsz(1), fork(2), mmap(2), mprotect(2), getpagesizes(3C),
mlock(3C), mlockall(3C), msync(3C), plock(3C), sysconf(3C),
attributes(5), privileges(5)
---------------------------------------
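To illustrate the MHA_MAPSIZE_VA rules from the manpage (addr and len
aligned to mha_pagesize, 0 meaning "let the system choose"), here is
a rough sketch. The helpers pick_page_size(), range_is_aligned() and
advise_range() are hypothetical names, and the ENOSYS fallback on
systems without memcntl(2) is an assumption of this example:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__sun)
#include <sys/types.h>
#include <sys/mman.h>
#endif

/* Largest entry of sizes[] that is <= len; 0 means "system chooses". */
size_t pick_page_size(const size_t sizes[], int n, size_t len)
{
    size_t best = 0;

    for (int i = 0; i < n; i++)
        if (sizes[i] <= len && sizes[i] > best)
            best = sizes[i];
    return best;
}

/* 1 if addr and len are both multiples of pagesize (a power of two). */
int range_is_aligned(const void *addr, size_t len, size_t pagesize)
{
    if (pagesize == 0 || (pagesize & (pagesize - 1)) != 0)
        return 0;
    return (((uintptr_t)addr | len) & (pagesize - 1)) == 0;
}

/* Advise a preferred page size for [addr, addr + len). */
int advise_range(void *addr, size_t len, size_t pagesize)
{
    /* Catch the misaligned case early rather than in the kernel. */
    if (pagesize != 0 && !range_is_aligned(addr, len, pagesize)) {
        errno = EINVAL;
        return -1;
    }
#if defined(__sun)
    struct memcntl_mha mha;

    mha.mha_cmd = MHA_MAPSIZE_VA;
    mha.mha_flags = 0;              /* reserved, must be 0 */
    mha.mha_pagesize = pagesize;
    return memcntl((caddr_t)addr, len, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#else
    /* Assumption: no memcntl(2) outside Solaris-derived systems. */
    (void)addr;
    (void)len;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would typically feed the array returned by getpagesizes(3C)
into pick_page_size() and pass the result to advise_range() for an
mmap()ed file region; none of this needs a new mmap() flag per page
size.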
Ced
--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur