Memory reserves or lack thereof
Konstantin Belousov
kostikbel at gmail.com
Mon Nov 12 13:36:54 UTC 2012
On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov <kostikbel at gmail.com>wrote:
>
> > On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
> > > I have a memory subsystem design question that I'm hoping someone can
> > answer.
> > >
> > > I've been looking at a machine that is completely out of memory, as in
> > >
> > > v_free_count = 0,
> > > v_cache_count = 0,
> > >
> > > I wondered how a machine could completely run out of memory like this,
> > especially after finding a lack of interrupt storms or other pathologies
> > that would tend to overcommit memory. So I started investigating.
> > >
> > > Most allocators come down to vm_page_alloc(), which has this guard:
> > >
> > > if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> > > page_req = VM_ALLOC_SYSTEM;
> > > };
> > >
> > > if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> > > (page_req == VM_ALLOC_SYSTEM &&
> > > cnt.v_free_count + cnt.v_cache_count >
> > cnt.v_interrupt_free_min) ||
> > > (page_req == VM_ALLOC_INTERRUPT &&
> > > cnt.v_free_count + cnt.v_cache_count > 0)) {
> > >
> > > The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate
> > every last page.
> > >
> > > >From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
> > perhaps only used from interrupt threads. Not so, see kmem_malloc() or
> > uma_small_alloc() which both contain this mapping:
> > >
> > > if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> > > pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> > > else
> > > pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> > >
> > > Note that M_USE_RESERVE has been deprecated and is used in just a
> > handful of places. Also note that lots of code paths come through these
> > routines.
> > >
> > > What this means is essentially _any_ allocation using M_NOWAIT will
> > bypass whatever reserves have been held back and will take every last page
> > available.
> > >
> > > There is no documentation stating M_NOWAIT has this side effect of
> > essentially being privileged, so any innocuous piece of code that can't
> > block will use it. And of course M_NOWAIT is literally used all over.
> > >
> > > It looks to me like the design goal of the BSD allocators is on
> > recovery; it will give all pages away knowing it can recover.
> > >
> > > Am I missing anything? I would have expected some small number of pages
> > to be held in reserve just in case. And I didn't expect M_NOWAIT to be a
> > sort of back door for grabbing memory.
> > >
> >
> > Your analysis is right, there is nothing to add or correct.
> > This is the reason to strongly prefer M_WAITOK.
> >
>
> Agreed. Once upon time, before SMPng, M_NOWAIT was rarely used. It was
> well understand that it should only be used by interrupt handlers.
>
> The trouble is that M_NOWAIT conflates two orthogonal things. The obvious
> being that the allocation shouldn't sleep. The other being how far we're
> willing to deplete the cache/free page queues.
>
> When fine-grained locking got sprinkled throughout the kernel, we all to
> often found ourselves wanting to do allocations without the possibility of
> blocking. So, M_NOWAIT became commonplace, where it wasn't before.
>
> This had the unintended consequence of introducing a lot of memory
> allocations in the top-half of the kernel, i.e., non-interrupt handling
> code, that were digging deep into the cache/free page queues.
>
> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
> allocation is less likely to succeed than an "M_NOWAIT" allocation.
> However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
> could only allocate a free page. M_USE_RESERVE said that it ok to allocate
> a cached page even though M_NOWAIT was specified. Consequently, the system
> wouldn't dig as far into the free page queue if M_USE_RESERVE was
> specified, because it was allowed to reclaim a cached page.
>
> In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
> dig any deeper into the cache/free page queues than M_WAITOK does and
> reintroduce a M_USE_RESERVE-like flag that says dig deep into the
> cache/free page queues. The trouble is that we then need to identify all
> of those places that are implicitly depending on the current behavior of
> M_NOWAIT also digging deep into the cache/free page queues so that we can
> add an explicit M_USE_RESERVE.
>
> Alan
>
> P.S. I suspect that we should also increase the size of the "page reserve"
> that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How
> many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
> could actually be satisfied by two pages?
I am almost sure that most of people who put the M_NOWAIT flag, do not
know the 'allow the deeper drain of free queue' effect. As such, I believe
we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectations
of the problematic places would be in the swapout path.
I found a single explicit use of M_USE_RESERVE in the kernel,
so the flip is relatively simple.
Below is the patch which I only compile-tested on amd64, and which booted
fine.
Peter, could you, please, give it a run, to see obvious deadlocks, if any ?
diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
index dc9c307..ab1e869 100644
--- a/sys/amd64/amd64/uma_machdep.c
+++ b/sys/amd64/amd64/uma_machdep.c
@@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
#include <sys/param.h>
#include <sys/lock.h>
+#include <sys/malloc.h>
#include <sys/mutex.h>
#include <sys/systm.h>
#include <vm/vm.h>
@@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
int pflags;
*flags = UMA_SLAB_PRIV;
- if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
- if (wait & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags);
if (m == NULL) {
diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
index f60cdb1..75366e3 100644
--- a/sys/arm/arm/vm_machdep.c
+++ b/sys/arm/arm/vm_machdep.c
@@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
return (ret);
}
- if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
- if (wait & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
if (m == NULL) {
diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
index 71caa29..2ce1ca6 100644
--- a/sys/fs/devfs/devfs_devs.c
+++ b/sys/fs/devfs/devfs_devs.c
@@ -121,7 +121,7 @@ devfs_alloc(int flags)
struct cdev *cdev;
struct timespec ts;
- cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
+ cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
if (cdp == NULL)
return (NULL);
diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
index 37353ff..9f77762 100644
--- a/sys/ia64/ia64/uma_machdep.c
+++ b/sys/ia64/ia64/uma_machdep.c
@@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
int pflags;
*flags = UMA_SLAB_PRIV;
- if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
- if (wait & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
index 798e632..24baef0 100644
--- a/sys/mips/mips/uma_machdep.c
+++ b/sys/mips/mips/uma_machdep.c
@@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
void *va;
*flags = UMA_SLAB_PRIV;
-
- if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT;
- else
- pflags = VM_ALLOC_SYSTEM;
+ pflags = m2vm_flags(wait, 0);
for (;;) {
m = pmap_alloc_direct_page(0, pflags);
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index a491680..3e320b9 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
*flags = UMA_SLAB_PRIV;
needed_lock = !PMAP_LOCKED(kernel_pmap);
- if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
- if (wait & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
index 162c7fb..3882bfa 100644
--- a/sys/powerpc/aim/slb.c
+++ b/sys/powerpc/aim/slb.c
@@ -483,12 +483,7 @@ slb_uma_real_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
realmax = platform_real_maxaddr();
*flags = UMA_SLAB_PRIV;
- if ((wait & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
- if (wait & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc_contig(NULL, 0, pflags, 1, 0, realmax,
diff --git a/sys/powerpc/aim/uma_machdep.c b/sys/powerpc/aim/uma_machdep.c
index 39deb43..23a333f 100644
--- a/sys/powerpc/aim/uma_machdep.c
+++ b/sys/powerpc/aim/uma_machdep.c
@@ -56,12 +56,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
int pflags;
*flags = UMA_SLAB_PRIV;
- if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
- if (wait & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/sparc64/sparc64/vm_machdep.c b/sys/sparc64/sparc64/vm_machdep.c
index cdb94c7..573ab3a 100644
--- a/sys/sparc64/sparc64/vm_machdep.c
+++ b/sys/sparc64/sparc64/vm_machdep.c
@@ -501,14 +501,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
PMAP_STATS_INC(uma_nsmall_alloc);
*flags = UMA_SLAB_PRIV;
-
- if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-
- if (wait & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
for (;;) {
m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/vm/vm_kern.c b/sys/vm/vm_kern.c
index 46e7f1c..e4c3704 100644
--- a/sys/vm/vm_kern.c
+++ b/sys/vm/vm_kern.c
@@ -222,12 +222,7 @@ kmem_alloc_attr(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
vm_object_reference(object);
vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
VM_PROT_ALL, 0);
- if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
- if (flags & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
+ pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
VM_OBJECT_LOCK(object);
end_offset = offset + size;
for (; offset < end_offset; offset += PAGE_SIZE) {
@@ -296,14 +291,7 @@ kmem_alloc_contig(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
vm_object_reference(object);
vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
VM_PROT_ALL, 0);
- if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
- if (flags & M_ZERO)
- pflags |= VM_ALLOC_ZERO;
- if (flags & M_NODUMP)
- pflags |= VM_ALLOC_NODUMP;
+ pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
VM_OBJECT_LOCK(object);
tries = 0;
retry:
@@ -487,11 +475,7 @@ kmem_back(vm_map_t map, vm_offset_t addr, vm_size_t size, int flags)
entry->wired_count == 0 && (entry->eflags & MAP_ENTRY_IN_TRANSITION)
== 0, ("kmem_back: entry not found or misaligned"));
- if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
- else
- pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-
+ pflags = m2vm_flags(flags, VM_ALLOC_WIRED);
if (flags & M_ZERO)
pflags |= VM_ALLOC_ZERO;
if (flags & M_NODUMP)
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 70b8416..0286a6d 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -344,6 +344,24 @@ extern struct mtx_padalign vm_page_queue_mtx;
#define VM_ALLOC_COUNT_SHIFT 16
#define VM_ALLOC_COUNT(count) ((count) << VM_ALLOC_COUNT_SHIFT)
+#ifdef M_NOWAIT
+static inline int
+m2vm_flags(int malloc_flags, int alloc_flags)
+{
+ int pflags;
+
+ if ((malloc_flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
+ pflags = VM_ALLOC_SYSTEM | alloc_flags;
+ else
+ pflags = VM_ALLOC_INTERRUPT | alloc_flags;
+ if (malloc_flags & M_ZERO)
+ pflags |= VM_ALLOC_ZERO;
+ if (malloc_flags & M_NODUMP)
+ pflags |= VM_ALLOC_NODUMP;
+ return (pflags);
+}
+#endif
+
void vm_page_busy(vm_page_t m);
void vm_page_flash(vm_page_t m);
void vm_page_io_start(vm_page_t m);
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-hackers/attachments/20121112/fea64beb/attachment.sig>
More information about the freebsd-hackers
mailing list