svn commit: r343030 - in head/sys: cam conf dev/md dev/nvme fs/fuse fs/nfsclient fs/smbfs kern sys ufs/ffs vm

Fri Feb 15 07:13:36 UTC 2019

On Thu, 14 Feb 2019, Gleb Smirnoff wrote:

> On Wed, Feb 13, 2019 at 07:24:50PM -0600, Justin Hibbits wrote:
> J> This seems to break 32-bit platforms, or at least 32-bit book-e
> J> powerpc, which has a limited KVA space (~500MB).  It preallocates I've
> J> seen over 2500 pbufs, at 128kB each, eating up over 300MB KVA,
> J> leaving very little left for the rest of runtime.
> J>
> J> I spent a couple hours earlier today debugging with Mark Johnston, and
> J> his consensus is that the vnode_pbuf_zone is too big on 32-bit
> J> platforms.  Unfortunately I know very little about this area, so can't
> J> provide much extra insight, but can readily reproduce the issues I see
> J> triggered by this change, so am willing to help where I can.
>
> Ok, let's roll back to old default on 32-bit platforms and somewhat
> reduce the default on 64-bits.

This reduces the largest allocation by a factor of 16 on 32-bit arches
(back to where it was), but it leaves the other allocations unchanged,
so the total allocation is still almost 5 times larger than before
(down from 20 times larger).  E.g., with the usual limit of 256 on
nswbuf, the total allocation was 32MB with overcommit by a factor of
about 5/2 on all systems, but it is now almost 80MB with no overcommit
on 32-bit systems.  Approximately 0MB of the extras are available on
systems with 1GB kva, and less on systems with 512MB kva.
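
The sizes above can be rechecked with a few lines of Python.  This is only
a sketch of the arithmetic: the 128KB buffer size is taken from Justin's
report, the 5/2 factor is the overcommit ratio quoted above, and the helper
name pbuf_kva_mib is made up for illustration.

```python
def pbuf_kva_mib(count, buf_kib=128):
    """KVA in MiB consumed by `count` pbufs of `buf_kib` KiB each."""
    return count * buf_kib / 1024

# nswbuf at its usual limit of 256: the old shared pool.
old_total = pbuf_kva_mib(256)

# Assumption: the per-subsystem limits sum to about 5/2 of the old pool,
# so preallocating every limit with no overcommit costs about this much.
new_total = old_total * 5 / 2

# Justin's report: over 2500 pbufs preallocated on book-e powerpc.
reported = pbuf_kva_mib(2500)

print(old_total, new_total, reported)
```

The second number is an upper estimate (the real total is "almost" 80MB),
and the third matches the "over 300MB" seen on powerpc.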

> Can you please confirm that the patch attached works for you?

I don't have any systems affected by the bug, except when I boot with
small hw.physmem or large kmem to test things.  hw.physmem=72m leaves
about 2MB available to map into buffers, and doesn't properly reduce
nswbuf, so almost 80MB of kva is still used for pbufs.  Allocating these
must fail due to the RAM shortage.  The old value of 32MB gives much the
same failures (in practice, a larger operation like fork or exec tends
to fail first).  Limiting available kva is more interesting, and I haven't
tested reducing it intentionally (except once I expanded kmem a lot to
put a maximal md malloc()-backed disk in it).  Expanding kmem steals from
residual kva, and residual kva is not properly scaled except in my version.
Large allocations tend to cause panics at boot time, except for ones that
crash because they don't check for errors.

Here is debugging output for large allocations (1MB or more) at boot time
on i386:

XX pae_mode=0 with ~2.7 GB mapped RAM:
XX kva_alloc: large allocation: 7490 pages: 0x5800000[0x1d42000]       vm radix
XX kva_alloc: large allocation: 6164 pages: 0x8400000[0x1814000]       pmap init
XX kva_alloc: large allocation: 28876 pages: 0xa000000[0x70cc000]      buf
XX kmem_suballoc: large allocation: 1364 pages: 0x11400000[0x554000]   exec
XX kmem_suballoc: large allocation: 10986 pages: 0x11954000[0x2aea000] pipe
XX kva_alloc: large allocation: 6656 pages: 0x14800000[0x1a00000]      sfbuf

It went far above the old size of 1GB to nearly 1.5GB, but there is plenty
to spare out of 4GB.  Versions that fitted in 1GB started these allocations
about 256MB lower and were otherwise similar.
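
For anyone wanting to total up such output, here is a rough Python sketch
that parses the pae_mode=0 lines above.  It assumes 4KB pages and reads
each line as "start[size]"; that reading is my interpretation of the debug
format, checked only by confirming that pages * 4096 equals the bracketed
value.  The names parse_alloc and LINE_RE are made up for this example.

```python
import re

PAGE_SIZE = 4096  # assumption: 4KB pages, as on i386

LINE_RE = re.compile(
    r"(\w+): large allocation: (\d+) pages: "
    r"0x([0-9a-f]+)\[0x([0-9a-f]+)\]\s+(.+)"
)

def parse_alloc(line):
    """Return (func, pages, start, size, label) for one debug line."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a large-allocation line: %r" % line)
    func, pages, start, size, label = m.groups()
    return func, int(pages), int(start, 16), int(size, 16), label.strip()

LOG = """\
kva_alloc: large allocation: 7490 pages: 0x5800000[0x1d42000]       vm radix
kva_alloc: large allocation: 6164 pages: 0x8400000[0x1814000]       pmap init
kva_alloc: large allocation: 28876 pages: 0xa000000[0x70cc000]      buf
kmem_suballoc: large allocation: 1364 pages: 0x11400000[0x554000]   exec
kmem_suballoc: large allocation: 10986 pages: 0x11954000[0x2aea000] pipe
kva_alloc: large allocation: 6656 pages: 0x14800000[0x1a00000]      sfbuf
"""

allocs = [parse_alloc(line) for line in LOG.splitlines()]

# A line is self-consistent if pages * PAGE_SIZE equals the bracketed size.
consistent = all(p * PAGE_SIZE == size for _, p, _, size, _ in allocs)

total_pages = sum(a[1] for a in allocs)
total_mib = total_pages * PAGE_SIZE / 2**20

print(consistent, total_pages, total_mib)
```

The six large allocations add up to about 240MB, so they account for only
part of the kva in use at that point.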

XX pae_mode=1 with 16 GB mapped RAM:
XX kva_alloc: large allocation: 43832 pages: 0x14e00000[0xab38000]     vm radix
XX kva_alloc: large allocation: 15668 pages: 0x20000000[0x3d34000]     pmap init
XX kva_alloc: large allocation: 28876 pages: 0x23e00000[0x70cc000]     buf
XX kmem_suballoc: large allocation: 1364 pages: 0x2b000000[0x554000]   exec
XX kmem_suballoc: large allocation: 16320 pages: 0x2b554000[0x3fc0000] pipe
XX kva_alloc: large allocation: 6656 pages: 0x2f600000[0x1a00000]      sfbuf

Only the vm radix and pmap init allocations are different, and they start
much higher.  The allocations now go over 3GB without any useful expansion
except for the page tables.  PAE didn't work with 16 GB RAM and 1 GB
kva, except in my version.  PAE needed to be configured with 2 GB of kva
to work with 16 GB RAM, but that was not the default or clearly documented.

XX old PAE fixed to fit and work with 16GB RAM in 1GB KVA:
XX kva_alloc: large allocation: 15691 pages: 0xd2c00000[0x3d4b000]   pmap init
XX kva_alloc: large allocation: 43917 pages: 0xd6a00000[0xab8d000]   vm radix
XX kva_alloc: large allocation: 27300 pages: 0xe1600000[0x6aa4000]   buf
XX kmem_suballoc: large allocation: 1364 pages: 0xe8200000[0x554000] exec
XX kmem_suballoc: large allocation: 2291 pages: 0xe8754000[0x8f3000] pipe
XX kva_alloc: large allocation: 6336 pages: 0xe9200000[0x18c0000]    sfbuf

PAE uses much more kva (almost 256MB extra) before the pmap and radix
initializations here too.  This is page table metadata before kva
allocations are available.  The fixes start by keeping track of this
amount.  It is about 1/16 of the address space for PAE in 1GB, so all
later scaling was off by a factor of 16/15 (too high), and since there
was less than 1/16 of 1GB to spare, PAE didn't fit.

Only 'pipe' is reduced significantly to fit.  swzone is reduced to 1 page
in all cases, so it doesn't show here.  It is about the same as sfbuf IIRC.
The fixes were developed before reducing swzone and needed to squeeze
harder to fit.  Otherwise, panics tended to occur in the swzone allocation.

sfbuf is the most mis-scaled and must be reduced significantly when RAM
is small, and could be reduced under kva pressure too.  It was the hardest
to debug since it doesn't check for allocation failures.

The above leaves more than 256MB at the end.  This is mostly reserved for
kmem.  kmem ends up at about 200MB (down from 341MB).

XX old non-PAE with fixes needed for old PAE, ~2.7 GB RAM in 1GB KVA:
XX kva_alloc: large allocation: 7517 pages: 0xc4c00000[0x1d5d000]     pmap init
XX kva_alloc: large allocation: 6164 pages: 0xc7000000[0x1814000]     vm radix
XX kva_alloc: large allocation: 42848 pages: 0xc8c00000[0xa760000]    buf
XX kmem_suballoc: large allocation: 1364 pages: 0xd3400000[0x554000]  exec
XX kmem_suballoc: large allocation: 4120 pages: 0xd3954000[0x1018000] pipe
XX kva_alloc: large allocation: 6656 pages: 0xd5000000[0x1a00000]     sfbuf

Since pmap starts almost 256MB lower and the pmap and radix allocations are
naturally much smaller, and I still shrink 'pipe', there is plenty of space
for useful expansion.  I only expand 'buf' back to a value that gives the
historical maxbufspace, and kmem a lot, and vnode space in kmem a lot.  The
space at the end is about 700MB.  kmem is 527MB (up from 341MB).
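
The "almost 256MB" figures can be read off the first address in each
listing.  A small sketch of that subtraction, using only addresses that
appear in the debug output above (the helper name gap_mib is made up):

```python
MIB = 2**20

def gap_mib(pae_start, non_pae_start):
    """Distance in MiB between the first large allocation with and without PAE."""
    return (pae_start - non_pae_start) / MIB

# pae_mode=1 versus pae_mode=0 listings (first line of each).
new_gap = gap_mib(0x14e00000, 0x5800000)

# Old PAE versus old non-PAE listings in 1GB KVA (first line of each).
old_gap = gap_mib(0xd2c00000, 0xc4c00000)

print(new_gap, old_gap)
```

This gives 246MB for the pae_mode listings and 224MB for the old 1GB
layouts, both consistent with "almost 256MB".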

Back to -current.  The 128KB allocations go somewhere in gaps between
the reported allocations (left by smaller aligned uma allocations?),
then at the end.  dmesg is not spammed by printing such small
allocations, but combined they are 279MB without this patch.
pbuf_prealloc() is called towards the end of the boot, long after all
the allocations reported above.  It uses space that is supposed to be
reserved for kmem when kva is small.  It allocates many buffers (perhaps
100) in gaps before starting a contiguous range of allocations at the
end.  Using the gaps is good for minimizing fragmentation provided
these buffers are never freed.
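
The gap-filling behaviour described above can be imitated with a toy
first-fit allocator.  This is only an illustration: the message does not
say which policy the kernel's kva allocator really uses, so first fit is
an assumption, alignment is ignored, and the gap addresses and sizes below
are invented.  Only the 128KB buffer size comes from the thread.

```python
BUF_SIZE = 128 * 1024  # pbuf size quoted earlier in the thread

def first_fit(free_ranges, size, count):
    """Place `count` buffers of `size` bytes using first fit.

    free_ranges is a list of (start, length) tuples sorted by start.
    Returns the start address chosen for each buffer, in order.
    """
    free = [[start, length] for start, length in free_ranges]
    placed = []
    for _ in range(count):
        for r in free:
            if r[1] >= size:
                placed.append(r[0])
                r[0] += size   # consume the front of this range
                r[1] -= size
                break
        else:
            raise MemoryError("no free range can hold %d bytes" % size)
    return placed

# Hypothetical layout: two small gaps left between earlier allocations,
# then one large free region at the end of kva.
END_REGION = 0x30000000
free_ranges = [
    (0x11000000, 3 * BUF_SIZE),
    (0x14000000, 2 * BUF_SIZE + 4096),
    (END_REGION, 100 * BUF_SIZE),
]

addrs = first_fit(free_ranges, BUF_SIZE, 8)
in_gaps = [a for a in addrs if a < END_REGION]
at_end = [a for a in addrs if a >= END_REGION]

print(len(in_gaps), [hex(a) for a in at_end])
```

With these made-up gaps, the first five buffers land in the gaps and the
remaining three form a contiguous run at the end, which is the pattern
seen at boot.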

Bruce