About Transparent Superpages and Non-transparent superpages

Cedric Blancher cedric.blancher at gmail.com
Sat Sep 21 01:09:26 UTC 2013


On 20 September 2013 17:20, Sebastian Kuzminsky <S.Kuzminsky at f5.com> wrote:
> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>
>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on amd64.  We're basing our work on 9.1.0 for now.
>> >
>> >An early preview is available here:
>> >
>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>
>> That is cool.
>>
>> What type of applications can take advantage of the 1Gb page size?
>> And is it transparent? Or applications need to be modified?
>
> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is backed by 1 gig superpages.
>
> It's not transparent for userspace: applications need to pass a new flag to mmap() to get 1 gig pages.

That may be the wrong approach. What happens if x86 gets more
huge/large page sizes, as SPARC has (hint: sign an NDA with Intel and
AMD and be surprised, and then allocate 16 more bits for mmap() if
you wish to stick with your approach)? For example, SPARC64 does 8k,
64k, 512k, 4M, 32M, 256M, 2GB and 256GB pages (the actual page sizes
differ from one MMU implementation to another, and can be probed via
pagesize -a).

A much better option would be to follow the Solaris API, which has
calls to enumerate the available page sizes and then set one for the
heap, the stack, or a given address range (the last one is how
large pages get used for file I/O via mmap()).
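
To make the "enumerate" half concrete, here is a minimal sketch built on
getpagesizes(), which exists on Solaris (getpagesizes(3C)) and also on
FreeBSD (getpagesizes(3)). The wrapper name list_page_sizes() and the
single-size fallback for other systems are my own assumptions for
illustration, not part of either API:

```c
#include <stddef.h>
#include <unistd.h>
#if defined(__sun) || defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/mman.h>   /* getpagesizes() */
#endif

/*
 * Hypothetical wrapper: fill sizes[] with up to max supported page
 * sizes and return how many were stored, or -1 on error.
 */
int list_page_sizes(size_t sizes[], int max)
{
    if (sizes == NULL || max < 1)
        return -1;
#if defined(__sun) || defined(__FreeBSD__)
    /* The system reports every page size the MMU supports. */
    return getpagesizes(sizes, max);
#else
    /* Assumed fallback: only the base page size is known here. */
    long base = sysconf(_SC_PAGESIZE);
    if (base <= 0)
        return -1;
    sizes[0] = (size_t)base;
    return 1;
#endif
}
```

A caller would pick one of the returned sizes and hand it to whatever
"set the page size" call the platform offers, instead of hard-coding a
flag per size.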

For example, ksh93 uses this to request 64k pages for the stack (this
mainly targets SPARC, where 64k stack pages can be a real performance
boost if you shuffle a lot of strings via the stack):
-----------
int main(int argc, char *argv[])
{
#if _lib_memcntl
        /* advise larger stack size */
        struct memcntl_mha mha;
        mha.mha_cmd = MHA_MAPSIZE_STACK;
        mha.mha_flags = 0;
        mha.mha_pagesize = 64 * 1024;
        (void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#endif
        return(sh_main(argc, argv, (Shinit_f)0));
}
-----------
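
The same call works for the heap. A rough sketch, assuming the
MHA_MAPSIZE_BSSBRK command from the manpage quoted in this mail; the
function name advise_heap_pagesize() and the ENOSYS fallback on
non-Solaris systems are hypothetical, added only so the snippet
compiles everywhere:

```c
#include <stddef.h>
#include <errno.h>
#if defined(__sun)
#include <sys/types.h>
#include <sys/mman.h>   /* memcntl(), struct memcntl_mha */
#endif

/*
 * Hypothetical helper: ask for a preferred heap page size.
 * A pagesize of 0 lets the system choose, per memcntl(2).
 * Returns 0 on success, -1 with errno set on failure.
 */
int advise_heap_pagesize(size_t pagesize)
{
#if defined(__sun)
    struct memcntl_mha mha;

    mha.mha_cmd = MHA_MAPSIZE_BSSBRK;
    mha.mha_flags = 0;              /* reserved, must be 0 */
    mha.mha_pagesize = pagesize;
    /* addr and len must be NULL and 0 for the heap variant. */
    return memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#else
    /* Assumption: no memcntl() here, so report "not supported". */
    (void)pagesize;
    errno = ENOSYS;
    return -1;
#endif
}
```

Since this is only advice, a caller can ignore a failure and carry on
with the default page size, the way the ksh93 code discards the result.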

Below is the memcntl(2) manpage describing the API:
---------------------------------------



System Calls                                           memcntl(2)



NAME
     memcntl - memory management control

SYNOPSIS
     #include <sys/types.h>
     #include <sys/mman.h>

     int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
          int attr, int mask);


DESCRIPTION
     The memcntl() function allows the calling process to apply a
     variety of control operations over the address space identi-
     fied by the  mappings  established  for  the  address  range
     [addr, addr + len).


     The addr argument must be a  multiple  of  the  pagesize  as
     returned by sysconf(3C). The scope of the control operations
     can be further defined with  additional  selection  criteria
     (in  the  form  of  attributes) according to the bit pattern
     contained in attr.


     The following attributes specify page mapping selection cri-
     teria:

     SHARED     Page is mapped shared.


     PRIVATE    Page is mapped private.



     The following attributes specify page  protection  selection
     criteria.  The  selection criteria are constructed by a bit-
     wise OR operation on  the  attribute  bits  and  must  match
     exactly.

     PROT_READ     Page can be read.


     PROT_WRITE    Page can be written.


     PROT_EXEC     Page can be executed.



     The following criteria may also be specified:




SunOS 5.11          Last change: 10 Apr 2007                    1

     PROC_TEXT    Process text.


     PROC_DATA    Process data.



     The PROC_TEXT attribute specifies all privately mapped  seg-
     ments  with  read  and execute permission, and the PROC_DATA
     attribute specifies all privately mapped segments with write
     permission.


     Selection criteria can be used to describe various  abstract
     memory objects within the address space on which to operate.
     If an operation shall not be constrained  by  the  selection
     criteria, attr must have the value 0.


     The operation to be performed is identified by the  argument
     cmd.  The  symbolic  names for the operations are defined in
     <sys/mman.h> as follows:

     MC_LOCK

         Lock in memory all pages in the  range  with  attributes
         attr.  A given page may be locked multiple times through
         different mappings; however,  within  a  given  mapping,
         page  locks do not nest. Multiple lock operations on the
         same address in the same process  will  all  be  removed
         with  a  single  unlock  operation. A page locked in one
         process and mapped in another (or visible through a dif-
         ferent  mapping  in  the  locking  process) is locked in
         memory as long as the locking process  does  neither  an
         implicit nor explicit unlock operation. If a locked map-
         ping is removed, or a page is deleted through file remo-
         val  or  truncation,  an  unlock operation is implicitly
         performed. If a writable MAP_PRIVATE page in the address
         range  is  changed,  the lock will be transferred to the
         private page.

         The arg argument is not used, but must be  0  to  ensure
         compatibility with potential future enhancements.


     MC_LOCKAS

         Lock in memory all pages mapped  by  the  address  space
         with attributes attr. The addr and len arguments are not
         used, but must be NULL and  0  respectively,  to  ensure
         compatibility  with  potential future enhancements.  The
         arg argument is a bit pattern built from the flags:

         MCL_CURRENT    Lock current mappings.


         MCL_FUTURE     Lock future mappings.

         The value of arg determines  whether  the  pages  to  be
         locked  are those currently mapped by the address space,
         those that will be mapped in the  future,  or  both.  If
         MCL_FUTURE  is specified, then all mappings subsequently
         added to the address space will be locked, provided suf-
         ficient memory is available.


     MC_SYNC

         Write to their backing storage  locations  all  modified
         pages  in  the  range  with attributes attr. Optionally,
         invalidate cache copies. The backing storage for a modified
         MAP_SHARED mapping is the file the page is mapped
         to; the backing storage for a modified MAP_PRIVATE mapping
         is its swap area. The arg argument is a bit pattern
         built from the flags used to control the behavior of the
         operation:

         MS_ASYNC         Perform asynchronous writes.


         MS_SYNC          Perform synchronous writes.


         MS_INVALIDATE    Invalidate mappings.

         MS_ASYNC Return immediately once  all  write  operations
         are scheduled; with MS_SYNC the function will not return
         until all write operations are completed.

         MS_INVALIDATE Invalidate all cached copies  of  data  in
         memory,  so that further references to the pages will be
         obtained by the system from their backing storage  loca-
         tions.  This  operation  should  be used by applications
         that require a memory object to be in a known state.


     MC_UNLOCK

         Unlock all pages in the range with attributes attr.  The
         arg argument is not used, but must be 0 to ensure
         compatibility with potential future enhancements.


     MC_UNLOCKAS

         Remove address space memory locks and locks on all pages
         in  the  address  space  with attributes attr. The addr,
         len, and arg arguments are not used, but must be NULL, 0
         and 0, respectively, to ensure compatibility with potential
         future enhancements.


     MC_HAT_ADVISE

         Advise system how a region of user-mapped memory will be
         accessed.  The  arg argument is interpreted as a "struct
         memcntl_mha *". The following members are defined  in  a
         struct memcntl_mha:

           uint_t mha_cmd;
           uint_t mha_flags;
           size_t mha_pagesize;

         The accepted values for mha_cmd are:

           MHA_MAPSIZE_VA
           MHA_MAPSIZE_STACK
           MHA_MAPSIZE_BSSBRK

         The mha_flags member is reserved for future use and must
         always  be  set  to 0. The mha_pagesize member must be a
         valid size as obtained from getpagesizes(3C) or the con-
         stant value 0 to allow the system to choose an appropri-
         ate hardware address translation mapping size.

         MHA_MAPSIZE_VA  sets  the  preferred  hardware   address
         translation  mapping  size  of the region of memory from
         addr to addr + len. Both addr and len must be aligned to
         an  mha_pagesize  boundary.  The  entire virtual address
         region from addr to addr + len must not have any  holes.
         Permissions  within each mha_pagesize-aligned portion of
         the region must be consistent.  When  a  size  of  0  is
         specified,  the system selects an appropriate size based
         on the size and alignment of the memory region, type  of
         processor, and other considerations.

         MHA_MAPSIZE_STACK sets the  preferred  hardware  address
         translation  mapping  size  of  the  process main thread
         stack segment. The addr and len arguments must be  NULL
         and 0, respectively.

         MHA_MAPSIZE_BSSBRK sets the preferred  hardware  address
         translation  mapping  size of the process heap. The addr
         and len arguments must be NULL and 0, respectively.  See
         the  NOTES section of the ppgsz(1) manual page for
         additional information on process heap alignment.

         The attr argument must be 0 for all MC_HAT_ADVISE
         operations.



     The mask argument must be 0; it is reserved for future use.


     Locks established with the lock operations are not inherited
     by  a  child  process  after fork(2). The memcntl() function
     fails if it attempts to lock  more  memory  than  a  system-
     specific limit.


     Due to the potential impact on system resources, the  opera-
     tions  MC_LOCKAS,  MC_LOCK,  MC_UNLOCKAS,  and MC_UNLOCK are
     restricted to privileged processes.

USAGE
     The memcntl() function subsumes the operations of plock(3C).


     MC_HAT_ADVISE is intended to improve performance of applica-
     tions  that  use  large amounts of memory on processors that
     support multiple hardware address translation mapping sizes;
     however,  it  should  be  used with care. Not all processors
     support all sizes with equal efficiency. Use of larger sizes
     may  also introduce extra overhead that could reduce perfor-
     mance or available memory.  Using large sizes for one appli-
     cation may reduce available resources for other applications
     and result in slower system wide performance.

RETURN VALUES
     Upon successful completion, memcntl() returns 0;  otherwise,
     it returns -1 and sets errno to indicate an error.

ERRORS
     The memcntl() function will fail if:

     EAGAIN    When the selection criteria match, some or all  of
               the  memory  identified by the operation could not
               be locked when MC_LOCK or MC_LOCKAS was specified,
               some  or  all mappings in the address range [addr,
               addr + len) are locked for I/O when  MC_HAT_ADVISE
               was  specified,  or  the  system  has insufficient
               resources when MC_HAT_ADVISE was specified.

               The cmd is MC_LOCK or MC_LOCKAS  and  locking  the
               memory identified by this operation would exceed a
               limit or resource control on locked memory.


     EBUSY     When the selection criteria match, some or all  of
               the  addresses in the range [addr, addr + len) are
               locked and MC_SYNC with the  MS_INVALIDATE  option
               was specified.


     EINVAL    The addr argument specifies invalid selection
               criteria  or  is  not  a multiple of the page size as
               returned by   sysconf(3C);  the  addr  and/or  len
               argument  does not have the value 0 when MC_LOCKAS
               or MC_UNLOCKAS is specified; the arg  argument  is
               not valid for the function specified; mha_pagesize
               or mha_cmd is invalid; or MC_HAT_ADVISE is  speci-
               fied  and  not  all  pages in the specified region
               have the same access permissions within the  given
               size boundaries.


     ENOMEM    When the selection criteria match, some or all  of
               the  addresses in the range [addr, addr + len) are
               invalid for the address  space  of  a  process  or
               specify one or more pages which are not mapped.


     EPERM     The  {PRIV_PROC_LOCK_MEMORY}  privilege   is   not
               asserted  in the effective set of the calling pro-
               cess  and  MC_LOCK,   MC_LOCKAS,   MC_UNLOCK,   or
               MC_UNLOCKAS was specified.


ATTRIBUTES
     See attributes(5) for descriptions of the  following  attri-
     butes:



     ____________________________________________________________
    |       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
    |_____________________________|_____________________________|
    | MT-Level                    | MT-Safe                     |
    |_____________________________|_____________________________|


SEE ALSO
     ppgsz(1), fork(2), mmap(2),  mprotect(2),  getpagesizes(3C),
     mlock(3C),  mlockall(3C), msync(3C), plock(3C), sysconf(3C),
     attributes(5), privileges(5)

---------------------------------------
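
One practical consequence of the MHA_MAPSIZE_VA rules above: both addr
and len have to be aligned to mha_pagesize, so an arbitrary mapping
usually needs trimming first. A small sketch of that arithmetic; the
helper name align_range() is hypothetical, and it assumes the page size
is a power of two:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: shrink [addr, addr + len) to the largest
 * sub-range whose start and length are multiples of pagesize.
 * Returns 0 and fills *out_addr / *out_len on success, or -1 if
 * pagesize is not a power of two or no aligned page fits.
 */
int align_range(uintptr_t addr, size_t len, size_t pagesize,
                uintptr_t *out_addr, size_t *out_len)
{
    if (pagesize == 0 || (pagesize & (pagesize - 1)) != 0)
        return -1;

    uintptr_t mask = (uintptr_t)pagesize - 1;
    uintptr_t end = addr + len;
    if (end < addr)
        return -1;                      /* range wraps around */

    uintptr_t start = (addr + mask) & ~mask;    /* round start up */
    if (start < addr)
        return -1;                      /* rounding overflowed */
    uintptr_t stop = end & ~mask;               /* round end down */
    if (stop <= start)
        return -1;                      /* no whole page inside */

    *out_addr = start;
    *out_len = (size_t)(stop - start);
    return 0;
}
```

On Solaris the resulting start and length could then be passed to
memcntl() with MC_HAT_ADVISE and MHA_MAPSIZE_VA; the leftover head and
tail of the mapping simply stay on the default page size.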

Ced
-- 
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur