mlockall() failure and direction for possible solution

Sun Apr 5 11:43:01 PDT 2009

Kostik Belousov wrote:
> On Sun, Apr 05, 2009 at 01:51:44PM +0200, Hans Ottevanger wrote:
>> Hi folks,
>>
>> As has been noted before, there is an issue with the mlockall() system
>> call always failing on (at least) the amd64 architecture. This is quite
>> evident by the automounter (as configured out-of-the-box) printing error
>> messages on startup like:
>>
>> Couldn't lock process pages in memory using mlockall()
>>
>> I have verified the occurrence of this issue on the amd64 platform on
>> 7.1-STABLE and 8.0-CURRENT. On the i386 platform this problem does not
>> occur.
>>
>> To investigate this issue a bit further I ran the following trivial program:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> int main(int argc, char *argv[])
>> {
>>         if (mlockall(MCL_CURRENT|MCL_FUTURE) == -1)
>>                 perror(argv[0]);
>>
>>         char command[80];
>>         snprintf(command, 80, "procstat -v %d", getpid());
>>         system(command);
>>
>>         exit(0);
>> }
>>
>> which yields (using 8.0-CURRENT as of today, on an Intel DP965LT board
>> with a Q6600 and 8 Gbyte RAM, GENERIC kernel stripped of unused devices,
>> output folded to 72 characters per line):
>>
>> /mltest: Resource temporarily unavailable
>>   PID              START                END PRT  RES PRES REF SHD FL TP
>> PATH
>>  1064           0x400000           0x401000 r-x    1    0   1   0 CN vn
>> /root/mlockall/mltest
>>  1064           0x500000           0x501000 rw-    1    0   1   0 CN df
>>  1064           0x501000           0x600000 rwx  255    0   1   0 -- df
>>  1064        0x800500000        0x80052c000 r-x   44    0  64  31 CN vn
>> /libexec/ld-elf.so.1
>>  1064        0x80052c000        0x800534000 rw-    8    0   1   0 C- df
>>  1064        0x80062b000        0x800633000 rw-    8    0   1   0 CN vn
>> /libexec/ld-elf.so.1
>>  1064        0x800633000        0x80063f000 rw-   12    0   1   0 C- df
>>  1064        0x80063f000        0x80072e000 r-x  239    0 128  62 CN vn
>> /lib/libc.so.7
>>  1064        0x80072e000        0x80072f000 r-x    1    0   1   0 CN vn
>> /lib/libc.so.7
>>  1064        0x80072f000        0x80082f000 r-x   51    0 128  62 CN vn
>> /lib/libc.so.7
>>  1064        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
>> /lib/libc.so.7
>>  1064        0x80084f000        0x800865000 rw-    6    0   1   0 CN df
>>  1064        0x800900000        0x800965000 rw-  101    0   1   0 -- df
>>  1064        0x800965000        0x800a00000 rw-  155    0   1   0 -- df
>>  1064     0x7ffffffe0000     0x800000000000 rwx    3    0   1   0 C- df
>>
>> I have hunted down the exact location in the kernel where the call to 
>> mlockall() returns an error (just using printf's, debugging using 
>> Firewire proved not to be as trivial to set up as it was just a few 
>> years ago). It appears that while wiring the memory, finally vm_fault() 
>> is called and it bails out at line 412 of vm_fault.c. The virtual 
>> address of the page that the system is attempting to wire (argument 
>> vaddr of vm_fault()) is 0x800762000. From the procstat output above it 
>> appears that this is in the third region backed by /lib/libc.so.7.
>>
>> This made me think that the issue might be somehow related to the way in 
>> which dynamic libraries are linked at runtime. Indeed, if the above program 
>> is linked -statically- it does not fail. Also, if the program is compiled 
>> and linked -dynamically- on an i386 platform and run on an amd64, it runs 
>> successfully.
>>
>> To make a long story at least a bit shorter, I found that the problem is 
>> in /usr/src/libexec/rtld-elf/map_object.c at line 156. Here a contiguous 
>> region is staked out for the code and data. For the amd64, where the 
>> required alignment of the segments is 1 Mbyte, this causes a region to 
>> be mapped that is far larger than the library file by which it is 
>> backed. Addresses that are not backed by the file cannot be resident and 
>> hence the region cannot be locked into memory. On the i386 architecture 
>> this problem does not occur since the alignment of the segments is just 
>> 4 Kbytes. I suspect that the problem also occurs at least on the sparc64 
>> architecture.
>>
>> As a first step to a possible solution you can apply the attached 
>> (provisional) patch, that uses an anonymous, read-only mapping to create 
>> the required region.
>>
>> The output of the above program then becomes:
>>
>>   PID              START                END PRT  RES PRES REF SHD FL TP
>> PATH
>>  1302           0x400000           0x401000 r-x    1    0   1   0 CN vn
>> /root/mlockall/mltest
>>  1302           0x500000           0x501000 rw-    1    0   1   0 -- df
>>  1302        0x800500000        0x80052c000 r-x   44    0   8   4 CN vn
>> /libexec/ld-elf.so.1
>>  1302        0x80052c000        0x800534000 rw-    8    0   1   0 -- df
>>  1302        0x80062b000        0x800633000 rw-    8    0   1   0 C- vn
>> /libexec/ld-elf.so.1
>>  1302        0x800633000        0x80063f000 rw-   12    0   1   0 -- df
>>  1302        0x80063f000        0x80072e000 r-x  239    0 124  62 CN vn
>> /lib/libc.so.7
>>  1302        0x80072e000        0x80072f000 r-x    1    0   1   0 C- vn
>> /lib/libc.so.7
>>  1302        0x80072f000        0x80082f000 r--  256    0   1   0 -- df
>>  1302        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
>> /lib/libc.so.7
>>  1302        0x80084f000        0x800865000 rw-   22    0   1   0 -- df
>>  1302     0x7ffffffe0000     0x800000000000 rwx   32    0   1   0 -- df
>>
>> i.e. mlockall() does not return an error anymore.
>>
>> I still have the following questions:
>>
>> 1. Is it worth the trouble to solve the mlockall() problem at all? Should 
>> I file a PR ?
> Yes. Do as you want, but I see no reason.
> 
> Your analysis looks correct and useful.
> 
>> 2. Can someone confirm that it also occurs on the other 64 bit 
>> architectures ?
>>
>> 3. It might be more elegant to use PROT_NONE instead of PROT_READ when 
>> just staking out the address space. Currently mlockall() returns an 
>> error when attempting that, so most likely mlockall() would need to be 
>> changed to ignore regions mapped with PROT_NONE. On the other hand, the 
>> pthread implementation uses PROT_NONE to create red zones on the stack 
>> and mlockall() apparently succeeds with threaded applications (using the 
>> provided patch). Any opinions/ideas/hints ?
> I think that it is better to unmap the holes, instead of making some
> mapping.
> 

In that way you free up virtual address space and make it available to 
the next call to mmap() with the first argument set to zero (i.e. where 
the caller does not care about the exact location), if the requested 
space fits in the hole you left. In this way unrelated mappings could 
end up between the regions of your dynamic libraries. I don't think that 
would be desirable. Using PROT_NONE would prevent such a mix-up: the 
address space is still there, but not accessible.

BTW: Note that even in the current implementation there is a hole 
available between the regions for /libexec/ld-elf.so.1 itself, starting 
at 0x800534000 in the above examples.

> Please, try this patch instead.
> 

I have tried your patch on my amd64 8.0-CURRENT system and it works 
perfectly with the described test program. I will stress test it later 
by running a "make buildworld".

It can easily be demonstrated, however, that allocations using mmap() as 
described above may end up in "strange" locations.

> diff --git a/libexec/rtld-elf/map_object.c b/libexec/rtld-elf/map_object.c
> index 2d06074..3266af0 100644
> --- a/libexec/rtld-elf/map_object.c
> +++ b/libexec/rtld-elf/map_object.c
> @@ -83,6 +83,7 @@ map_object(int fd, const char *path, const struct stat *sb)
>      Elf_Addr bss_vaddr;
>      Elf_Addr bss_vlimit;
>      caddr_t bss_addr;
> +    size_t hole;
>  
>      hdr = get_elf_header(fd, path);
>      if (hdr == NULL)
> @@ -91,8 +92,7 @@ map_object(int fd, const char *path, const struct stat *sb)
>      /*
>       * Scan the program header entries, and save key information.
>       *
> -     * We rely on there being exactly two load segments, text and data,
> -     * in that order.
> +     * We expect that the loadable segments are ordered by load address.
>       */
>      phdr = (Elf_Phdr *) ((char *)hdr + hdr->e_phoff);
>      phsize  = hdr->e_phnum * sizeof (phdr[0]);
> @@ -214,6 +214,17 @@ map_object(int fd, const char *path, const struct stat *sb)
>  		return NULL;
>  	    }
>  	}
> +
> +	/* Unmap the region between two non-adjusted ELF segments */
> +	if (i < nsegs) {
> +	    hole = trunc_page(segs[i + 1]->p_vaddr) - bss_vlimit;
> +	    if (hole > 0 && munmap(mapbase + bss_vlimit, hole) == -1) {
> +		_rtld_error("%s: munmap hole failed: %s", path,
> +		    strerror(errno));
> +		return NULL;
> +	    }
> +	}
> +
>  	if (phdr_vaddr == 0 && data_offset <= hdr->e_phoff &&
>  	  (data_vlimit - data_vaddr + data_offset) >=
>  	  (hdr->e_phoff + hdr->e_phnum * sizeof (Elf_Phdr))) {