[rfc] allow to boot with >= 256GB physmem

Alan Cox alan.l.cox at gmail.com
Fri Jan 21 21:43:24 UTC 2011


On Fri, Jan 21, 2011 at 2:58 PM, Alan Cox <alan.l.cox at gmail.com> wrote:

> On Fri, Jan 21, 2011 at 11:44 AM, John Baldwin <jhb at freebsd.org> wrote:
>
>> On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote:
>> > Hello.
>> >
>> > Some time ago I ran into a problem booting with 400GB of physmem.
>> > The problem is that the type of vm.max_proc_mmap overflows with
>> > such a high value, which results in a broken mmap() syscall.
>> > The max_proc_mmap value is a signed int, roughly calculated
>> > in vmmapentry_rsrc_init() as a quotient of the u_long vm_kmem_size:
>> > vm_kmem_size / sizeof(struct vm_map_entry) / 100.
>> >
>> > Although the value was quite low at the time it was introduced
>> > in svn r57263 (e.g. the related commit log states:
>> > "The value defaults to around 9000 for a 128MB machine."),
>> > the problem is observed on amd64, where KVA space after
>> > r212784 is in practice bounded only by the physical memory size.
>> >
>> > With INT_MAX being 0x7fffffff here, and sizeof(struct vm_map_entry)
>> > being 120, slightly less than 256GB is enough
>> > to reproduce the problem.
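The arithmetic behind the "slightly less than 256GB" figure can be sketched as follows. This is only an illustration, not the kernel code: it assumes a 32-bit int and that the intermediate quotient vm_kmem_size / sizeof(struct vm_map_entry) is stored into the int before the division by 100, which is what the quoted figure implies. The helper names and VM_MAP_ENTRY_SIZE are made up for this sketch; the value 120 comes from the message.

```c
#include <limits.h>
#include <stdint.h>

/* Assumed size of struct vm_map_entry on amd64, as quoted in the message. */
#define VM_MAP_ENTRY_SIZE 120ULL

/*
 * Smallest vm_kmem_size, in bytes, for which the intermediate quotient
 * vm_kmem_size / VM_MAP_ENTRY_SIZE no longer fits in an int.
 */
static uint64_t
overflow_threshold_bytes(void)
{
    return (((uint64_t)INT_MAX + 1) * VM_MAP_ENTRY_SIZE);
}

/* Returns 1 if the intermediate quotient exceeds INT_MAX, else 0. */
static int
quotient_overflows_int(uint64_t vm_kmem_size)
{
    return (vm_kmem_size / VM_MAP_ENTRY_SIZE > (uint64_t)INT_MAX);
}
```

Under these assumptions the threshold works out to 2^31 * 120 bytes, i.e. 240GB in binary units, so a 400GB machine is well past it.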
>> >
>> > I rewrote vmmapentry_rsrc_init() to set a large enough limit for
>> > max_proc_mmap while protecting against integer overflow.
>> > As this value can also be tuned at runtime, I also added a
>> > simple foot-shooting guard to its sysctl handler.
>> > I'm not sure, though, whether the second part is worth committing.
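One way to protect the initialization from overflow is to do the whole computation in a wide unsigned type and clamp the result before narrowing it. The sketch below is not the patch linked in this message; clamp_max_proc_mmap() and its parameters are hypothetical names used only to illustrate the idea.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Compute kmem_size / entry_size / 100 in 64-bit arithmetic and clamp the
 * result to INT_MAX, so that the value always fits in a signed int.
 * Hypothetical helper, not the actual vmmapentry_rsrc_init().
 */
static int
clamp_max_proc_mmap(uint64_t kmem_size, uint64_t entry_size)
{
    uint64_t limit;

    if (entry_size == 0)
        return (0);
    limit = kmem_size / entry_size / 100;
    if (limit > (uint64_t)INT_MAX)
        limit = INT_MAX;
    return ((int)limit);
}
```

Widening the variable and the sysctl to long would remove the need for the clamp on 64-bit platforms.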
>> >
>> > As this patch may cause some bikeshedding,
>> > I'd like to hear your comments before I commit it.
>> >
>> > http://plukky.net/~pluknet/patches/max_proc_mmap.diff
>>
>> Is there any reason we can't just make this variable and sysctl a long?
>>
>>
> Or just delete it.
>
> 1. Contrary to what the commit message says, this sysctl does not
> effectively limit the number of vm map entries.  It only limits the number
> that are created by one system call, mmap().  Other system calls create vm
> map entries just as easily, for example, mprotect(), madvise(), mlock(), and
> minherit().  Basically, anything that alters the properties of a mapping.
> Thus, in 2000, after this sysctl was added, the same resource exhaustion
> induced crash could have been reproduced by trivially changing the program
> in PR/16573 to do an mprotect() or two.
>
> In a nutshell, if you want to really limit the number of vm map entries
> that a process can allocate, the implementation is a bit more involved than
> what was done for this sysctl.
>
> 2. UMA implements M_WAITOK, whereas the old zone allocator in 2000 did
> not.  Moreover, vm map entries for user maps are allocated with M_WAITOK.
> So, the exact crash reported in PR/16573 couldn't happen any longer.
>
>
Actually, I take back part of what I said here.  The old zone allocator did
implement something like M_WAITOK, and that appears to have been used for
user maps.  However, the crash described in PR/16573 was actually on the
allocation of a vm map entry within the *kernel* address space for a process
U area.  This type of allocation did not use the old zone allocator's
equivalent to M_WAITOK.  However, we no longer have U areas, so the exact
crash scenario is clearly no longer possible.  Interestingly, the sysctl in
question has no direct effect on the allocation of kernel vm map entries.

So, I remain skeptical that this sysctl is preventing any resource
exhaustion based panics in the current kernel.  Again, I would be thrilled
to see one or more people do some testing, such as rerunning the program
from PR/16573.


> 3. We now have the "vmemoryuse" resource limit.  When this sysctl was
> defined, we didn't.  Limiting the virtual memory indirectly but effectively
> limits the number of vm map entries that a process can allocate.
>
> In summary, I would do a little due diligence, for example, run the program
> from PR/16573 with the limit disabled.  If you can't reproduce the crash, in
> other words, nothing contradicts point #2 above, then I would just delete
> this sysctl.
>
> Alan
>
>


More information about the freebsd-hackers mailing list