svn commit: r344188 - in head: lib/libc/sys sys/vm

Sat Feb 16 15:58:55 UTC 2019

On Fri, 15 Feb 2019, Gleb Smirnoff wrote:

> Log:
>  For 32-bit machines  rollback the default number of vnode pager pbufs
>  back to the lever before r343030.  For 64-bit machines reduce it slightly,
>  too.  Together with r343030 I bumped the limit up to the value we use at
>  Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
>  good default.

This is only a rollback for the vnode pager pbufs sub-pool.  Total
resource usage (preallocated kva and the maximum RAM that can be mapped
into this kva) is still about 5/2 times higher than before in my
configuration.  It would be 7/2 times higher if I configured fuse and
smbfs.  r343030 changed the allocation methods in all subsystems except
out-of-tree modules, and broke at least the KBI for these modules (*),
so it is easy to find the full expansion except for these modules by
looking at the diffs (I found the use in fuse and smbfs by grepping
some subtrees).  Also, the user's and vfs_bio's resource limit is still
broken by expanding it by this factor of 5/2 or more.

In the old allocation method, there was a single pool of pbufs of size
nswbuf, which normally sits at its limiting value of 256.  This magic
256 is hard-coded in vfs_bio.c, but a user who somehow knows about this
and about the tunable kern.nswbuf can override it.
The limit of 256 was documented in pbuf(9), but the tunable was never
documented AFAIK.  The variable nswbuf for this was documented in
pbuf(9).  The 256 entries are shared between any number of subsystems.
Most subsystems limited themselves to nswbuf/2 entries, and the man
page recommended this.  This gave overcommit by a factor of about 5/2
in my configuration (there are 7 subsystems, but some of these have a
smaller limit).

Now each subsystem has a separate pool.  The size of the sub-pool
is still usually nswbuf / 2.  This gives overallocation by a factor of
about 5/2 in my configuration.  The overcommit only causes minor
performance problems.  2 subsystems might use all of the buffers, and
then all the other subsystems have to wait, but it is rare for even
2 subsystems to be under load at the same time.  It is more of a problem
that the limit is too small for a single subsystem.  The overallocation
gives worse problems such as crashing at boot time or a little later when
the user or auto-tuning has maxed out nswbuf.
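
The difference between the old overcommit and the new overallocation can
be sketched numerically.  This is only an illustration: the generic
subsystem names and the split of the smaller limits are invented so that
7 limits sum to 5/2 * nswbuf.  Only the count of 7, the usual nswbuf / 2
cap and the 5/2 factor come from the description above.

```python
NSWBUF = 256  # the hard-coded limiting value discussed above

# Hypothetical per-subsystem caps: four at nswbuf / 2, three smaller ones.
# The smaller values are made up to reproduce the 5/2 factor.
limits = {
    "subsys1": NSWBUF // 2,
    "subsys2": NSWBUF // 2,
    "subsys3": NSWBUF // 2,
    "subsys4": NSWBUF // 2,
    "subsys5": NSWBUF // 4,
    "subsys6": NSWBUF // 8,
    "subsys7": NSWBUF // 8,
}


def old_scheme(nswbuf, limits):
    """Shared pool: nswbuf pbufs are preallocated once; the per-subsystem
    caps only bound how many each user may take from that pool."""
    preallocated = nswbuf
    overcommit = sum(limits.values()) / nswbuf
    return preallocated, overcommit


def new_scheme(nswbuf, limits):
    """Separate pools: every subsystem's cap is preallocated on its own."""
    preallocated = sum(limits.values())
    overallocation = preallocated / nswbuf
    return preallocated, overallocation


print("old:", old_scheme(NSWBUF, limits))
print("new:", new_scheme(NSWBUF, limits))
```

With these made-up limits the old method preallocates 256 pbufs and
overcommits them by 2.5, while the new method preallocates 640.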

>  Provide a loader tunable to change vnode pager pbufs count. Document it.

This only controls 1 of the subsystems.

It is too much to have a sysctl for each of the subsystems.  Some users
don't even know about the global sysctl kern.nswbuf that was enough for
sendfile on larger (mostly 64-bit) systems.  Just increase nswbuf a lot.
This wastes kva for most of the subsystems, but kva is cheap if the address
space is large.

Now the user has to know even more arcane details to limit the kva, and
it is impossible to recover the old behaviour.  To get the old limit,
kern.nswbuf must be set to (256 * 2 / 5) in my configuration, but
that significantly decreases the number of buffers for each subsystem.
Users might already have set kern.nswbuf to a large value.  Since most
subsystems used to use half of that many buffers, the wastage from setting
it large for the benefit of 1 subsystem was at most a factor of 2.  Now
the limit can't be increased as much without running out of kva, and the
safe increase is more arcane and machine-dependent (starting with the
undocumented default being 8 times higher for 64-bit systems, but only
for 1 of the subsystems).
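
The arithmetic behind the (256 * 2 / 5) figure, as a rough sketch.  The
function names are made up for illustration, the 5/2 factor is the one
from my configuration, and the vnode pager's separate default is ignored.

```python
OLD_NSWBUF = 256
FACTOR_NUM, FACTOR_DEN = 5, 2  # overallocation factor of about 5/2


def nswbuf_for_old_total(old_nswbuf=OLD_NSWBUF):
    """kern.nswbuf value that brings the total preallocation back to
    roughly old_nswbuf pbufs, rounded down."""
    return old_nswbuf * FACTOR_DEN // FACTOR_NUM


def per_subsystem(nswbuf):
    """Usual sub-pool size (and usual old per-subsystem cap)."""
    return nswbuf // 2


reduced = nswbuf_for_old_total()
print("kern.nswbuf to recover the old total:", reduced)
print("pbufs per subsystem before:", per_subsystem(OLD_NSWBUF))
print("pbufs per subsystem after: ", per_subsystem(reduced))
```

So recovering the old total means setting kern.nswbuf to 102, which cuts
the usual per-subsystem count from 128 to 51.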

(*) The KBI was getpbuf(), trypbuf() and relpbuf(), and this was very
easy to (ab)use.  Any number of subsystems can try to use the shared
pool.  This is abuse because a small fixed-size pool can't support
an unbounded number of subsystems.  Now getpbuf() doesn't exist (but
is still referred to in swap_pager.c), and there is no man page for
the new allocation method.  The boot-time preallocation can't work
for modules loaded later, and leaves unusable allocations for modules
unloaded later.  Modules apparently have to do their own preallocation.
They should probably not use pbufs at all, and do their own allocations
too.

It is now clear that there has always been a problem with the default
limits.  The magic number of 256 hasn't been changed since before
FreeBSD-1.  There were no pbufs in FreeBSD-1, but there was nswbuf and
it was dynamically tuned but limited to 256.  I think 256 meant "infinity"
in 1992, but it wasn't large enough even then.  Before r343030 it was
effectively even smaller, since there were more subsystems by then than
in FreeBSD-1.

nswbuf needs to be very large to support slow devices.  By the very
nature of slow devices, the i/o queue tends to fill up with buffers
for the slowest device, and if there is a buffer shortage then everything
else has to wait for this device to free the buffers.
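
A toy model of this effect, not a model of the real kernel queues.  All
of it is assumed for illustration: two devices share one pool, each
submits one request per tick, the fast device completes one i/o per tick,
the slow device completes one every slow_period ticks, and the slow
device's request is ahead in the queue on each tick.

```python
def simulate(pool_size, ticks, slow_period):
    """Return (buffers held per device, ticks on which the fast device
    could not get a buffer)."""
    held = {"slow": 0, "fast": 0}
    fast_stalls = 0
    for t in range(ticks):
        # Completions: the fast device finishes one i/o every tick, the
        # slow device only one every slow_period ticks.
        if held["fast"] > 0:
            held["fast"] -= 1
        if t % slow_period == 0 and held["slow"] > 0:
            held["slow"] -= 1
        # New requests: one per device per tick, slow device first.
        for dev in ("slow", "fast"):
            if held["slow"] + held["fast"] < pool_size:
                held[dev] += 1
            elif dev == "fast":
                fast_stalls += 1
    return held, fast_stalls


held, stalls = simulate(pool_size=8, ticks=100, slow_period=4)
print("buffers held at the end:", held)
print("ticks the fast device had to wait:", stalls)
```

In this model the slow device ends up holding the whole pool after a few
ticks, and from then on the fast device waits on every tick.  A larger
pool only postpones that point.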

Slowness is relative.  In FreeBSD-1, floppy disk devices were still in
use and were especially slow.  Now hard disks are slow relative to fast
SSDs.  But the number of buffers was unchanged.  It is still essentially
unchanged except for vn pager pbufs.  The hard disks can complete 128
i/o's for a full queue much faster than a floppy disk, so the relative
slowness might be similar, but now there are more subsystems and some
systems have many more disks.

I have seen this queueing problem before mainly for DVD disks, but thought
it was more in the buffer cache than in pbufs.

Testing this by increasing and decreasing kern.nswbuf didn't show much
change in makeworld benchmarks.  They still have idle time with large
variance, as if something waits for buffers and doesn't get woken up
promptly.  Only clpbufs are used much.  The counts now available in
uma statistics show the strange behaviour that the free count rarely
reaches the limit, but with larger limits the free count goes above
smaller limits.

Bruce