mlockall() failure and direction for possible solution

Sun Apr 5 08:59:24 PDT 2009

On Sun, Apr 05, 2009 at 01:51:44PM +0200, Hans Ottevanger wrote:
> Hi folks,
> 
> As has been noted before, there is an issue with the mlockall() system
> call always failing on (at least) the amd64 architecture. This is quite
> evident from the automounter (as configured out-of-the-box) printing error
> messages on startup like:
> 
> Couldn't lock process pages in memory using mlockall()
> 
> I have verified the occurrence of this issue on the amd64 platform on
> 7.1-STABLE and 8.0-CURRENT. On the i386 platform this problem does not
> occur.
> 
> To investigate this issue a bit further I ran the following trivial program:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> 
> int main(int argc, char *argv[])
> {
>         if (mlockall(MCL_CURRENT|MCL_FUTURE) == -1)
>                 perror(argv[0]);
> 
>         char command[80];
>         snprintf(command, 80, "procstat -v %d", getpid());
>         system(command);
> 
>         exit(0);
> }
> 
> which yields (using 8.0-CURRENT as of today, on an Intel DP965LT board
> with a Q6600 and 8 Gbyte RAM, GENERIC kernel stripped of unused devices,
> output folded to 72 characters per line):
> 
> /mltest: Resource temporarily unavailable
>   PID              START                END PRT  RES PRES REF SHD FL TP
> PATH
>  1064           0x400000           0x401000 r-x    1    0   1   0 CN vn
> /root/mlockall/mltest
>  1064           0x500000           0x501000 rw-    1    0   1   0 CN df
>  1064           0x501000           0x600000 rwx  255    0   1   0 -- df
>  1064        0x800500000        0x80052c000 r-x   44    0  64  31 CN vn
> /libexec/ld-elf.so.1
>  1064        0x80052c000        0x800534000 rw-    8    0   1   0 C- df
>  1064        0x80062b000        0x800633000 rw-    8    0   1   0 CN vn
> /libexec/ld-elf.so.1
>  1064        0x800633000        0x80063f000 rw-   12    0   1   0 C- df
>  1064        0x80063f000        0x80072e000 r-x  239    0 128  62 CN vn
> /lib/libc.so.7
>  1064        0x80072e000        0x80072f000 r-x    1    0   1   0 CN vn
> /lib/libc.so.7
>  1064        0x80072f000        0x80082f000 r-x   51    0 128  62 CN vn
> /lib/libc.so.7
>  1064        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
> /lib/libc.so.7
>  1064        0x80084f000        0x800865000 rw-    6    0   1   0 CN df
>  1064        0x800900000        0x800965000 rw-  101    0   1   0 -- df
>  1064        0x800965000        0x800a00000 rw-  155    0   1   0 -- df
>  1064     0x7ffffffe0000     0x800000000000 rwx    3    0   1   0 C- df
> 
> I have hunted down the exact location in the kernel where the call to 
> mlockall() returns an error (just using printf's, debugging using 
> Firewire proved not to be as trivial to set up as it was just a few 
> years ago). It appears that while wiring the memory, vm_fault() is
> eventually called and bails out at line 412 of vm_fault.c. The virtual
> address of the page that the system is attempting to wire (argument
> vaddr of vm_fault()) is 0x800762000. From the procstat output above it
> appears that this is in the third region backed by /lib/libc.so.7.
> 
> This made me think that the issue might be somehow related to the way in
> which dynamic libraries are linked at runtime. Indeed, if the above program
> is linked -statically- it does not fail. Also if the program is compiled
> and linked -dynamically- on an i386 platform and run on an amd64, it runs
> successfully.
> 
> To make a long story at least a bit shorter, I found that the problem is 
> in /usr/src/libexec/rtld-elf/map_object.c at line 156. Here a contiguous
> region is staked out for the code and data. For the amd64, where the
> required alignment of the segments is 1 Mbyte, this causes a region to
> be mapped that is far larger than the library file by which it is 
> backed. Addresses that are not backed by the file cannot be resident and 
> hence the region cannot be locked into memory. On the i386 architecture 
> this problem does not occur since the alignment of the segments is just 
> 4 Kbytes. I suspect that the problem also occurs at least on the sparc64 
> architecture.
> 
> As a first step to a possible solution you can apply the attached 
> (provisional) patch, that uses an anonymous, read-only mapping to create 
> the required region.
> 
> The output of the above program then becomes:
> 
>   PID              START                END PRT  RES PRES REF SHD FL TP
> PATH
>  1302           0x400000           0x401000 r-x    1    0   1   0 CN vn
> /root/mlockall/mltest
>  1302           0x500000           0x501000 rw-    1    0   1   0 -- df
>  1302        0x800500000        0x80052c000 r-x   44    0   8   4 CN vn
> /libexec/ld-elf.so.1
>  1302        0x80052c000        0x800534000 rw-    8    0   1   0 -- df
>  1302        0x80062b000        0x800633000 rw-    8    0   1   0 C- vn
> /libexec/ld-elf.so.1
>  1302        0x800633000        0x80063f000 rw-   12    0   1   0 -- df
>  1302        0x80063f000        0x80072e000 r-x  239    0 124  62 CN vn
> /lib/libc.so.7
>  1302        0x80072e000        0x80072f000 r-x    1    0   1   0 C- vn
> /lib/libc.so.7
>  1302        0x80072f000        0x80082f000 r--  256    0   1   0 -- df
>  1302        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
> /lib/libc.so.7
>  1302        0x80084f000        0x800865000 rw-   22    0   1   0 -- df
>  1302     0x7ffffffe0000     0x800000000000 rwx   32    0   1   0 -- df
> 
> i.e. mlockall() does not return an error anymore.
> 
> I still have the following questions:
> 
> 1. Is it worth the trouble to solve the mlockall() problem at all? Should
> I file a PR?
Yes, it is worth fixing. File a PR if you want, but I see no need for one.

Your analysis looks correct and useful.

> 
> 2. Can someone confirm that it also occurs on the other 64 bit 
> architectures ?
> 
> 3. It might be more elegant to use PROT_NONE instead of PROT_READ when 
> just staking out the address space. Currently mlockall() returns an 
> error when attempting that, so most likely mlockall() would need to be 
> changed to ignore regions mapped with PROT_NONE. On the other hand, the 
> pthread implementation uses PROT_NONE to create red zones on the stack 
> and mlockall() apparently succeeds with threaded applications (using the 
> provided patch). Any opinions/ideas/hints ?
I think it is better to unmap the holes than to cover them with another
mapping.

Please, try this patch instead.

diff --git a/libexec/rtld-elf/map_object.c b/libexec/rtld-elf/map_object.c
index 2d06074..3266af0 100644
--- a/libexec/rtld-elf/map_object.c
+++ b/libexec/rtld-elf/map_object.c
@@ -83,6 +83,7 @@ map_object(int fd, const char *path, const struct stat *sb)
     Elf_Addr bss_vaddr;
     Elf_Addr bss_vlimit;
     caddr_t bss_addr;
+    size_t hole;
 
     hdr = get_elf_header(fd, path);
     if (hdr == NULL)
@@ -91,8 +92,7 @@ map_object(int fd, const char *path, const struct stat *sb)
     /*
      * Scan the program header entries, and save key information.
      *
-     * We rely on there being exactly two load segments, text and data,
-     * in that order.
+     * We expect that the loadable segments are ordered by load address.
      */
     phdr = (Elf_Phdr *) ((char *)hdr + hdr->e_phoff);
     phsize  = hdr->e_phnum * sizeof (phdr[0]);
@@ -214,6 +214,17 @@ map_object(int fd, const char *path, const struct stat *sb)
 		return NULL;
 	    }
 	}
+
+	/* Unmap the region between two non-adjacent ELF segments */
+	if (i < nsegs) {
+	    hole = trunc_page(segs[i + 1]->p_vaddr) - bss_vlimit;
+	    if (hole > 0 && munmap(mapbase + bss_vlimit, hole) == -1) {
+		_rtld_error("%s: munmap hole failed: %s", path,
+		    strerror(errno));
+		return NULL;
+	    }
+	}
+
 	if (phdr_vaddr == 0 && data_offset <= hdr->e_phoff &&
 	  (data_vlimit - data_vaddr + data_offset) >=
 	  (hdr->e_phoff + hdr->e_phnum * sizeof (Elf_Phdr))) {