amd64: change VM_KMEM_SIZE_SCALE to 1?

Andriy Gapon avg at freebsd.org
Mon Jul 26 19:43:27 UTC 2010


on 26/07/2010 22:30 Alan Cox said the following:
> On Mon, Jul 26, 2010 at 1:19 PM, Andriy Gapon <avg at freebsd.org> wrote:
> 
>     on 26/07/2010 20:04 Matthew Fleming said the following:
>     > On Mon, Jul 26, 2010 at 9:07 AM, Andriy Gapon <avg at freebsd.org> wrote:
>     >> Does anyone know of any reason why VM_KMEM_SIZE_SCALE on amd64 should
>     >> not be set to 1?  I mean things potentially breaking, or some
>     >> unpleasant surprise for an administrator/user...
>     >
>     > As I understand it, it's merely a resource usage issue.  amd64 needs
>     > page table entries for the expected virtual address space, so allowing
>     > more than e.g. 1/3 of physical memory means needing more PTEs.  But
>     > the memory overhead isn't all that large IIRC: each 4k page of physical
>     > memory devoted to PTEs maps 512 4k virtual pages, or 2MB, so e.g. it
>     > takes about 4MB reserved as PTE pages to map 2GB of kernel virtual
>     > address space.
> 
>     My understanding is that page table entries are only allocated when actual
>     (physical) memory allocation is done.  But I am not sure.
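
For reference, a back-of-envelope version of that arithmetic (assuming 4 KB
pages and 512 PTEs per page-table page on amd64, and ignoring the higher
page-table levels; the little program below is purely illustrative):

  #include <stdio.h>

  int
  main(void)
  {
          /* One 4 KB page of PTEs (512 entries) maps 512 * 4 KB = 2 MB. */
          unsigned long long mapped_per_pt_page = 512ULL * 4096;
          unsigned long long kva = 2ULL << 30;    /* 2 GB of kernel VA */
          unsigned long long pt_pages = kva / mapped_per_pt_page;

          printf("%llu PTE pages (%llu MB) to map %llu GB of KVA\n",
              pt_pages, pt_pages * 4096 >> 20, kva >> 30);
          return (0);
  }

which gives 1024 PTE pages, i.e. 4 MB, for 2 GB of kernel VA - the same
estimate as above.
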
> 
>     > Having cut my OS teeth on AIX/PowerPC where virtual address space is
>     > free and has no relation to the size of the hardware page table, the
>     > FreeBSD architecture limiting the size of the kernel virtual space
>     > seemed weird to me.  However, since FreeBSD also does not page kernel
>     > data to disk, there's a good reason to limit the size of the kernel's
>     > virtual space, since that also limits the kernel's physical space.
>     >
>     > In other words, setting it to 1 could lead to the system running out of
>     > memory without ever failing kernel malloc requests.  I'm not
>     > entirely sure this is a new problem since one could also chew through
>     > physical memory with sub-page uma allocations as well on amd64.
> 
>     Well, personally I would prefer the kernel eating a lot of memory over
>     getting a "kmem_map too small" panic.  Unexpectedly large memory usage
>     by the kernel can be detected and diagnosed, and then proper limits and
>     (auto-)tuning could be put in place.  A panic at some random allocation
>     is not that helpful.  Besides, presently there are more and more
>     workloads that require a lot of kernel memory - e.g. ZFS is gaining
>     popularity.
> 
> 
> Like what exactly?  Since I increased the size of the kernel address
> space for amd64 to 512GB, and thus the size of the kernel heap was no
> longer limited by virtual address space size, but only by the
> auto-tuning based upon physical memory size, I am not aware of any
> "kmem_map to small" panics that are not ZFS/ARC related.

Well, I meant exactly those.

>     Hence, the question/suggestion.
> 
>     Of course, things can be tuned by hand, but I think that
>     VM_KMEM_SIZE_SCALE=1 would be a more reasonable default than the
>     current value.
> 
> 
> Even this would not eliminate the ZFS/ARC panics.  I have heard that
> some people must configure the kmem_map to 1.5 times a machine's
> physical memory size to avoid panics.  The reason is that unlike the
> traditional FreeBSD way of caching file data, the ZFS/ARC wants to have
> every page of cached data *mapped* (and wired) in the kernel address
> space.  Over time, the available, unused space in the kmem_map becomes
> fragmented, and even though the ARC thinks that it has not reached its
> size limit, kmem_malloc() cannot find contiguous space to satisfy the
> allocation request.  To see this described in great detail, do a web
> search for an e-mail by Ben Kelly with the subject "[patch] zfs kmem
> fragmentation".

Yes, I am aware of the fragmentation issue.
But I haven't hit that panic myself since setting vm.kmem_size_scale="1" in
loader.conf.
Of course, what I propose would not fix the fragmentation issue.
But... it's something that ZFS users (especially serious deployments like file
servers) would want to do anyway, and it won't cause any harm for others.
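
For concreteness, what I have in mind is just something along these lines in
/boot/loader.conf (the exact values are purely illustrative, not a
recommendation; as I understand it, the auto-tuning roughly divides physical
memory by the scale factor):

  # Let the auto-tuned kmem_map grow to roughly the size of physical memory
  # instead of a fraction of it (same effect as building with
  # VM_KMEM_SIZE_SCALE=1).
  vm.kmem_size_scale="1"

  # Alternatively, ZFS-heavy boxes often pin the size directly, sometimes
  # well above physical memory to leave headroom for kmem_map fragmentation:
  #vm.kmem_size="24G"
  #vfs.zfs.arc_max="12G"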

> As far as eliminating or reducing the manual tuning that many ZFS users
> do, I would love to see someone tackle the overly conservative hard
> limit that we place on the number of vnode structures.  The current hard
> limit was put in place when we had just introduced mutexes into many
> structures, and a mutex was much larger than it is today.

I agree.  But that's a little bit different topic.
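
(For anyone following along: that limit shows up as kern.maxvnodes, with the
current count in vfs.numvnodes, so one can inspect it and, purely as an
example, bump it at run time with something like

  sysctl kern.maxvnodes vfs.numvnodes
  sysctl kern.maxvnodes=400000

but, as said, that is a separate topic from the kmem_map sizing.)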

-- 
Andriy Gapon

