svn commit: r317061 - in head: libexec/rpc.rstatd sys/amd64/amd64 sys/amd64/include sys/arm/arm sys/arm/include sys/arm64/include sys/cddl/contrib/opensolaris/uts/common/fs/zfs sys/compat/linprocfs...

Bruce Evans brde at optusnet.com.au
Mon May 1 08:25:27 UTC 2017


On Sun, 30 Apr 2017, Konstantin Belousov wrote:

> On Wed, Apr 19, 2017 at 02:09:58PM +1000, Bruce Evans wrote:
>> On Tue, 18 Apr 2017, Alan Somers wrote:
>>
>>> On Mon, Apr 17, 2017 at 11:34 AM, Gleb Smirnoff <glebius at freebsd.org> wrote:
>>
>>>>   head/usr.bin/top/machine.c
>>>>   head/usr.bin/vmstat/vmstat.c
>>
>> The previous 2 lines turn out to be relevant.  I missed the update to
>> vmstat and checked a wrong version in my bug report described below
>> I checked an updated top, but the change seemed too small to help.
>> systat was not updated.
>>
>>> This change broke backwards compatibility with old top binaries.  When
>>> I use a kernel at version 317094 but a top from 14-April, I get the
>>> error "top: sysctl(vm.stats.vm.v_swappgsin...) failed: Cannot allocate
>>> memory".  I get the same error when running top from an 11.0-RELEASE
>>> jail.  Can you please add backward compatibility shims?
>>
>> I sent a much longer (30 times longer) bug report to the author of the
>> bug.
>>
>> Most or all statistics utilities that use vmmeter are broken.  vmstat
>> is too broken to even notice the bug -- it silently ignores the error,
>> an has type errors which prevent other errors which it doesn't ignore.
>> The result is silently truncating to 32 bits, like a quick fix would
>> do intentionally.  systat -v detects the errors but prints warning
>> messages to the status line where they overwrite each other so make
>> the problem look smaller than it is.  The result is unsilently
>> truncating to 32 bits.
>>
>> I've never seen any backwards compatibility shims added for sysctls.
>> None should be needed, but there are few or no utilities other than
>> sysctl(8) which use sysctl(3) in a portable way.
> See vfs.bufspace handled by vfs_bio.c:sysctl_bufspace().

Ugh.

My fix for this continues to work fine.  It handles about 162
combinations of integer sizes/signedness for all sysctls in not much
more space than sysctl_bufspace().  Unfortunately, the full version
breaks the API slightly (it even breaks sysctl(8)), and the dumbed
down version causes problems although it technically doesn't change
the API.

sysctl_bufspace() also changes the API slightly, much like my dumbed
down version.  This tends to break probes.

> In fact, not only top and vmstat are broken. Quite unexpected and
> worrying, reboot(8) is broken as well. This is due to get_pageins()
> failing and system stopping userspace much earlier than it is desirable.
>
> I now believe that compat shims to handle int-sized vmstat queries are
> required.

Here are my current versions and a test program (mainly for old design
and implementation bugs with short reads and writes).

XX Index: kern_sysctl.c
XX ===================================================================
XX --- kern_sysctl.c	(revision 317604)
XX +++ kern_sysctl.c	(working copy)

This version changes the API too much, and I now use the simpler patch
to fix the immediate problem with counters.

The full version of it does:
- let userland do reads and writes of integers with any size large enough
   to hold the value
- new errno EOVERFLOW for userland sizes that are not large enough.  This
   replaces ENOMEM in most cases for integer sysctls.  Only oldlen = 0
   should mean to probe.
- new errno EOVERFLOW for kernel sizes that are not large enough.    Be
   more careful not to write anything if the kernel size is not large
   enough.  The previous errno for this is EINVAL.
- it is no longer very fragile to read with a type larger than the kernel
   type.  The kernel expands to the requested size.  This used to be a
   non-error with garbage in the bits after the end of the unexpanded
   kernel size.
- it is now possible to write with a type smaller than the kernel type.
   The kernel expands to its size.  The case is documented to fail with
   EINVAL, and works as documented IIRC.
- it is now possible to write correctly with a type larger than the
   kernel type, provided the value fits.  The case is documented to fail
   with EINVAL, but IIRC it is the case that has been broken for almost
   20 years.  Usually, the bug is harmless since the value fits (due to
   are 0's or 1's in the unwritten top bits).

The dumbed down version keeps ENOMEM instead of EOVERFLOW, and doesn't
allow all combinations of sizes.  For the counter ABI change, the case
where the userland variable is smaller must be supported for reading,
and the opposite is not needed yet.  I forget what happens for writing.

XX @@ -1344,6 +1344,302 @@
XX  	return (error);
XX  }
XX 
XX +int sysctl_uicvt(const void *src, size_t srclen, void *dst, size_t dstlen);
XX +int
XX +sysctl_uicvt(const void *src, size_t srclen, void *dst, size_t dstlen)
XX +{
XX +	uintmax_t val, valtrunc;
XX +	uint64_t val64;
XX +	uint32_t val32;
XX +	uint16_t val16;
XX +	uint8_t val8;
XX +
XX +	if (dstlen > sizeof(uintmax_t))
XX +		dstlen = sizeof(uintmax_t);
XX +	if (dstlen > srclen)
XX +		dstlen = srclen;

The conversion functions are trivial, but not as small or as fast as
they could be.  In most case, no conversion is needed.

These dstlen adjustments are dumbing down.  The first one is mainly
to reduce the number of cases, but is not quite write.  Suppose that
the userland uintmax_t is larger than the kernel's, or userland is
reading into a large u_char array or struct for some reason.  Then
reducing dstlen to sizeof(uintmax_t) gives garbage after the end of
a uintmax_t, the same as in the current API.

The second one loses the size adjustment in 1 direction -- see below.

This is complicated and perhaps buggy because it tries to support
both directions in userland <-> kernel in shared code.

XX +	switch (srclen) {
XX +	case 1:
XX +		val = *(const int8_t *)src;
XX +		break;
XX +	case 2:
XX +		val = *(const int16_t *)src;
XX +		break;
XX +	case 4:
XX +		val = *(const int32_t *)src;
XX +		break;
XX +	case 8:
XX +		val = *(const int64_t *)src;
XX +		break;
XX +	default:
XX +		bcopy(src, dst, MIN(srclen, dstlen));
XX +		return (srclen);
XX +	}
XX +
XX +	switch (dstlen) {
XX +	case 1:
XX +		valtrunc = val8 = val;
XX +		src = &val8;
XX +		break;
XX +	case 2:
XX +		valtrunc = val16 = val;
XX +		src = &val16;
XX +		break;
XX +	case 4:
XX +		valtrunc = val32 = val;
XX +		src = &val32;
XX +		break;
XX +	case 8:
XX +		valtrunc = val64 = val;
XX +		src = &val64;
XX +		break;
XX +	default:
XX +		valtrunc = val - 1;
XX +		break;
XX +	}
XX +	if (valtrunc != val) {
XX +		if (dstlen != 0)
XX +			printf("val %#jx, valtrunc %#jx\n", val, valtrunc);
XX +		bcopy(src, dst, dstlen);
XX +		return (srclen);
XX +	} else {
XX +		bcopy(src, dst, dstlen);
XX +		return (dstlen);
XX +	}
XX +}
XX +
XX +/* Output a signed integer if possible. */
XX +static int
XX +sysctl_out_i(struct sysctl_req *req, intmax_t val, size_t kernlen)
XX +{
XX +	size_t len;
XX +	intmax_t valtrunc;
XX +	int64_t val64;
XX +	int32_t val32;
XX +	int16_t val16;
XX +	int8_t val8;
XX +	int error;
XX +
XX +	len = req->oldlen != 0 ? req->oldlen : kernlen;
XX +	/*
XX +	 * XXX: sysctl(8) determines the correct length, but then twee-ly
XX +	 * doubles it, so we can't tell the application's integer size.
XX +	 * If we supplied the doubled length (as is possible up to size 4),
XX +	 * then sysctl(8) would treat the result as an array.
XX +	 *
XX +	 * Work around this by only supplying the default length if that
XX +	 * fits.  This restores the original broken behaviour of not
XX +	 * extending, so applications get garbage in the top bits if they
XX +	 * use a larger integer size.
XX +	 */
XX +	if (len > kernlen)
XX +		len = kernlen;

See the comment.  Without this hack, sysctl(8) prints most variables as
"v 0", where v is the correct output, because it sees int as [u_]int[2].
Statistics utilities are not as "smart" as sysctl, so they kept seeing
plain [u_]int, as intended for the counter variables which became uint64_t.


XX +	switch (len) {
XX +	case 1:
XX +		valtrunc = val8 = val;
XX +		error = SYSCTL_OUT(req, &val8, 1);
XX +		break;
XX +	case 2:
XX +		valtrunc = val16 = val;
XX +		error = SYSCTL_OUT(req, &val16, 2);
XX +		break;
XX +	case 4:
XX +		valtrunc = val32 = val;
XX +		error = SYSCTL_OUT(req, &val32, 4);
XX +		break;
XX +	case 8:
XX +		valtrunc = val64 = val;
XX +		error = SYSCTL_OUT(req, &val64, 8);
XX +		break;
XX +	default:
XX +		/* XXX: not quite compat; should convert back to kernel type. */
XX +		return (SYSCTL_OUT(req, &val, sizeof(val)));
XX +	}
XX +	if (valtrunc != val && error == 0) {
XX +		printf("val %#jx, valtrunc %#jx\n", val, valtrunc);
XX +		error = ENOMEM;		/* should be EOVERFLOW */
XX +	}
XX +	return (error);
XX +}
XX +
XX +/* Output an usigned integer if possible. */
XX +static int
XX +sysctl_out_ui(struct sysctl_req *req, uintmax_t val, size_t kernlen)
XX +{
XX +	size_t len;
XX +	uintmax_t valtrunc;
XX +	uint64_t val64;
XX +	uint32_t val32;
XX +	uint16_t val16;
XX +	uint8_t val8;
XX +	int error;
XX +
XX +	len = req->oldlen != 0 ? req->oldlen : kernlen;
XX +	if (len > kernlen)
XX +		len = kernlen;
XX +	switch (len) {
XX +	case 1:
XX +		valtrunc = val8 = val;
XX +		error = SYSCTL_OUT(req, &val8, 1);
XX +		break;
XX +	case 2:
XX +		valtrunc = val16 = val;
XX +		error = SYSCTL_OUT(req, &val16, 2);
XX +		break;
XX +	case 4:
XX +		valtrunc = val32 = val;
XX +		error = SYSCTL_OUT(req, &val32, 4);
XX +		break;
XX +	case 8:
XX +		valtrunc = val64 = val;
XX +		error = SYSCTL_OUT(req, &val64, 8);
XX +		break;
XX +	default:
XX +		return (SYSCTL_OUT(req, &val, sizeof(val)));
XX +	}
XX +	if (valtrunc != val && error == 0) {
XX +		printf("val %#jx, valtrunc %#jx\n", val, valtrunc);
XX +		error = ENOMEM;		/* should be EOVERFLOW */
XX +	}
XX +	return (error);
XX +}
XX +
XX +/* Input a signed integer if possible. */
XX +static int
XX +sysctl_in_i(struct sysctl_req *req, void *valp, size_t kernlen)
XX +{
XX +	intmax_t val;
XX +	int64_t val64;
XX +	int32_t val32;
XX +	int16_t val16;
XX +	int8_t val8;
XX +	int error;
XX +
XX +	switch (req->newlen) {
XX +	case 1:
XX +		error = SYSCTL_IN(req, &val8, 1);
XX +		val = val8;
XX +		break;
XX +	case 2:
XX +		error = SYSCTL_IN(req, &val16, 2);
XX +		val = val16;
XX +		break;
XX +	case 4:
XX +		error = SYSCTL_IN(req, &val32, 4);
XX +		val = val32;
XX +		break;
XX +	case 8:
XX +		error = SYSCTL_IN(req, &val64, 8);
XX +		val = val64;
XX +		break;
XX +	default:
XX +		/* Only kernel lengths are supported; we hard-code {1,2,4,8}. */
XX +		error = EINVAL;
XX +		break;
XX +	}
XX +	if (error != 0)
XX +		return (error);
XX +
XX +	switch (kernlen) {
XX +	case 1:
XX +		if ((int8_t)val == val)
XX +			*(int8_t *)valp = val;
XX +		else
XX +			error = EINVAL;	/* should be EOVERFLOW everywhere */
XX +		break;
XX +	case 2:
XX +		if ((int16_t)val == val)
XX +			*(int16_t *)valp = val;
XX +		else
XX +			error = EINVAL;
XX +		break;
XX +	case 4:
XX +		if ((int32_t)val == val)
XX +			*(int32_t *)valp = val;
XX +		else
XX +			error = EINVAL;
XX +		break;
XX +	case 8:
XX +		if ((uint64_t)val == val)
XX +			*(uint64_t *)valp = val;
XX +		else
XX +			error = EINVAL;
XX +	default:
XX +		error = EDOOFUS;
XX +		break;
XX +	}
XX +	return (error);
XX +}
XX +
XX +/* Input an unsigned integer if possible. */
XX +static int
XX +sysctl_in_ui(struct sysctl_req *req, void *valp, size_t kernlen)
XX +{
XX +	uintmax_t val;
XX +	uint64_t val64;
XX +	uint32_t val32;
XX +	uint16_t val16;
XX +	uint8_t val8;
XX +	int error;
XX +
XX +	switch (req->newlen) {
XX +	case 1:
XX +		error = SYSCTL_IN(req, &val8, 1);
XX +		val = val8;
XX +		break;
XX +	case 2:
XX +		error = SYSCTL_IN(req, &val16, 2);
XX +		val = val16;
XX +		break;
XX +	case 4:
XX +		error = SYSCTL_IN(req, &val32, 4);
XX +		val = val32;
XX +		break;
XX +	case 8:
XX +	default:
XX +		if (req->newlen == sizeof(val))
XX +			error = SYSCTL_IN(req, &val, sizeof(val));
XX +		else if (req->newlen == 8) {
XX +			error = SYSCTL_IN(req, &val64, 8);
XX +			val = val64;
XX +		} else
XX +			error = EINVAL;
XX +		break;
XX +	}
XX +	if (error != 0)
XX +		return (error);
XX +
XX +	switch (kernlen) {
XX +	case 1:
XX +		if ((uint8_t)val == val)
XX +			*(uint8_t *)valp = val;
XX +		else
XX +			error = EINVAL;
XX +		break;
XX +	case 2:
XX +		if ((uint16_t)val == val)
XX +			*(uint16_t *)valp = val;
XX +		else
XX +			error = EINVAL;
XX +		break;
XX +	case 4:
XX +		if ((uint32_t)val == val)
XX +			*(uint32_t *)valp = val;
XX +		else
XX +			error = EINVAL;
XX +		break;
XX +	case 8:
XX +		if ((uint64_t)val == val)
XX +			*(uint64_t *)valp = val;
XX +		else
XX +			error = EINVAL;
XX +		break;
XX +	default:
XX +		error = EDOOFUS;
XX +	}
XX +	return (error);
XX +}
XX +
XX  /*
XX   * Handle an int, signed or unsigned.
XX   * Two cases:
XX @@ -1354,24 +1650,35 @@
XX  int
XX  sysctl_handle_int(SYSCTL_HANDLER_ARGS)
XX  {
XX -	int tmpout, error = 0;
XX +	int error;
XX 
XX -	/*
XX -	 * Attempt to get a coherent snapshot by making a copy of the data.
XX -	 */
XX -	if (arg1)
XX -		tmpout = *(int *)arg1;
XX -	else
XX -		tmpout = arg2;
XX -	error = SYSCTL_OUT(req, &tmpout, sizeof(int));
XX +	switch (oidp->oid_kind & CTLTYPE) {
XX +	case CTLTYPE_UINT:
XX +		error = sysctl_out_ui(req,
XX +		    arg1 != NULL ? *(u_int *)arg1 : arg2, sizeof(u_int));
XX +		break;
XX +	case CTLTYPE_INT:
XX +	default:
XX +		error = sysctl_out_i(req,
XX +		    arg1 != NULL ? *(int *)arg1 : arg2, sizeof(int));
XX +		break;
XX +	}
XX 
XX  	if (error || !req->newptr)
XX  		return (error);
XX 
XX  	if (!arg1)
XX -		error = EPERM;
XX -	else
XX -		error = SYSCTL_IN(req, arg1, sizeof(int));
XX +		return (EPERM);
XX +
XX +	switch (oidp->oid_kind & CTLTYPE) {
XX +	case CTLTYPE_UINT:
XX +		error = sysctl_in_ui(req, arg1, sizeof(int)); 
XX +		break;
XX +	case CTLTYPE_INT:
XX +	default:
XX +		error = sysctl_in_i(req, arg1, sizeof(int)); 
XX +		break;
XX +	}
XX  	return (error);
XX  }

I only fixed sysctl_handle_int(), including style bugs and its type
puns for u_int.  All of the integer handling functions should be in
this one, with the case statement expanded in the obvious way.

XX 
XX Index: subr_counter.c
XX ===================================================================
XX --- subr_counter.c	(revision 317604)
XX +++ subr_counter.c	(working copy)
XX @@ -78,11 +78,15 @@
XX  sysctl_handle_counter_u64(SYSCTL_HANDLER_ARGS)
XX  {
XX  	uint64_t out;
XX +	uint32_t out32;
XX  	int error;
XX 
XX  	out = counter_u64_fetch(*(counter_u64_t *)arg1);
XX 
XX -	error = SYSCTL_OUT(req, &out, sizeof(uint64_t));
XX +	if (req->oldlen == 4 && (out32 = out) == out)
XX +		error = SYSCTL_OUT(req, &out32, sizeof(out32));
XX +	else
XX +		error = SYSCTL_OUT(req, &out, sizeof(out));
XX 
XX  	if (error || !req->newptr)
XX  		return (error);

This does the minimum necessary to fix the current problem.

This works until the counters become too large for a u_int.  There
could be an option to get truncation with no error, but that would
require an API change, so applications should be required to do the
truncation for themself if that is what they want.

"Smart" applications are still broken by the above.  They might have
support for reading with 64 bit sizes, but probe with 32-bit sizes.
Then the above will succeed early, but later when the counters
overflow the sysctl will start failing.  It fails with the old
errno of ENOMEM, so the "smart" applications should start using 64
bit sizes, but it might assume that the initial sizing is enough for
integers.  After all, the size of an integer can't change in the
kernel while the kernel is running, and the API tacitly implies
returning actual kernel integers and not their representation in
any convenient smaller type.

Test program.  It tests mainly for old bugs involving no errors for
short reads and writes.  After the API change, these non-errors become
correct.

YY #include <sys/types.h>
YY #include <sys/sysctl.h>
YY 
YY #include <err.h>
YY #include <errno.h>
YY #include <stdint.h>
YY #include <stdlib.h>
YY 
YY #define	TESTMIB	"vfs.ffs.doasyncfree"
YY 
YY /* Use for initialization and later tests. */
YY static void
YY get(int32_t *oldvalp)
YY {
YY 	size_t oldlen;
YY 
YY 	oldlen = sizeof(*oldvalp);
YY 	if (sysctlbyname(TESTMIB, oldvalp, &oldlen, NULL, 0) != 0)
YY 		err(1, "basic: read failed");
YY 	/*
YY 	 * It is currently difficult to find the correct size/type to use.
YY 	 * Assume that the size is 32 bits here since the focus is not quite
YY 	 * on the design errors behind that, and we want to do a safe null
YY 	 * change to the MIB (the kernel variable should be a bool or a
YY 	 * 1-bit bit-field, and we should be able to set it to 0 or 1 but
YY 	 * no other values using any integer type (but not a bit-field).
YY 	 */
YY 	if (oldlen != sizeof(*oldvalp))
YY 		err(1, "basic: size mismatch");
YY }
YY 
YY /* Use for initial test and unclobber. */
YY static void
YY set(int32_t newval)
YY {
YY 	size_t newlen;
YY 
YY 	newlen = sizeof(newval);
YY 	if (sysctlbyname(TESTMIB, NULL, NULL, &newval,
YY 	    newlen) != 0)
YY 		err(1, "basic: write failed");
YY }
YY 
YY int
YY main(void)
YY {
YY 	size_t newlen, oldlen;
YY 	int64_t longnewval, longoldval;
YY 	int32_t oldval, origval;
YY 	int16_t shortnewval, shortoldval;
YY 	int error;
YY 
YY 	/* Normal operation. */
YY 	get(&origval);
YY 	set(origval);
YY 	warnx("basic: passed: val %d", origval);
YY 
YY 	/* Long read.  No bugs expected.  Just design errors. */
YY 	shortoldval = ~origval;
YY 	oldlen = sizeof(shortoldval);
YY 	error = sysctlbyname(TESTMIB, &shortoldval, &oldlen, NULL, 0);
YY 	if (error == 0 && shortoldval == origval)
YY 		warnx("long read: unexpectedly worked right");
YY 	else if (error == 0)
YY 		warnx("long read: unexpectedly succeeded, but gave garbage");
YY 	else if (errno != ENOMEM)
YY 		warn("long read: failed with unexpected errno");
YY 	else
YY 		warnx("long read: passed");
YY 
YY 	/* Short read.  No bugs expected.  Just design errors. */
YY 	longoldval = ~origval;
YY 	oldlen = sizeof(longoldval);
YY 	error = sysctlbyname(TESTMIB, &longoldval, &oldlen, NULL, 0);
YY 	if (error != 0)
YY 		warn("short read: failed");
YY 	else if (longoldval != origval)
YY 		warnx("short read: passed: garbage in top bits as expected");
YY 	else if (oldlen == sizeof(longoldval) && longoldval == origval)
YY 		warnx("short read: unexpectedly worked right");
YY 	else
YY 		warnx("short read: succeeded with right val but wrong length");
YY 
YY 	/* Long write.  Expect success with top bits not written (broken). */
YY 	longnewval = 0x100000666;	/* unrepresentable */
YY 	newlen = sizeof(longnewval);
YY 	error = sysctlbyname(TESTMIB, NULL, NULL, &longnewval, newlen);
YY 	get(&oldval);
YY 	if (error != 0 && errno == EINVAL)
YY 		warnx("long write: unexpectedly unbroken (failed with EINVAL)");
YY 	else if (error == 0 && oldval == 0x666)
YY 		warnx("long write: broken as expected (success and value)");
YY 	else if (error == 0)
YY 		warnx("long write: more broken than expected (succ. and !val)");
YY 	else
YY 		warn("long write: failed with unexpected errno");
YY 	if (oldval == origval)
YY 		warnx("long write: preserved value");
YY 	else if (oldval == longnewval)
YY 		warnx("long write: set value correctly (impossible)");
YY 	else
YY 		warnx("long write: clobbered value");
YY 	set(origval);
YY 
YY 	/* Another case for long write. */
YY 	longnewval = -1;	/* representable assuming signed */
YY 	newlen = sizeof(longnewval);
YY 	error = sysctlbyname(TESTMIB, NULL, NULL, &longnewval, newlen);
YY 	get(&oldval);
YY 	if (error != 0 && errno == EINVAL)
YY 		warnx("long write: unexpectedly unbroken (failed with EINVAL)");
YY 	else if (error == 0 && oldval == longnewval)
YY 		warnx("long write: broken as expected, or working");
YY 	else if (error == 0)
YY 		warnx("long write: more broken than expected (succ. and !val)");
YY 	else
YY 		warn("long write: failed with unexpected errno");
YY 	if (oldval == origval)
YY 		warnx("long write: preserved value");
YY 	else if (oldval == longnewval)
YY 		warnx("long write: set value correctly");
YY 	else
YY 		warnx("long write: clobbered value");
YY 	set(origval);
YY 
YY 	/* Short write.  No bugs expected.  Just design errors. */
YY 	shortnewval = origval + 1;
YY 	newlen = sizeof(shortnewval);
YY 	error = sysctlbyname(TESTMIB, NULL, NULL, &shortnewval, newlen);
YY 	get(&oldval);
YY 	if (error == 0 && oldval == shortnewval)
YY 		warnx("short write: unexpectedly worked");
YY 	else if (error == 0)
YY 		warnx("short write: unexpectedly succeed but set garbage");
YY 	else if (errno != EINVAL)
YY 		warn("short write: failed with unexpected errno");
YY 	else
YY 		warnx("short write: passed");
YY 	if (oldval == origval)
YY 		warnx("short write: preserved value");
YY 	else if (oldval == shortnewval)
YY 		warnx("short write: set value correctly");
YY 	else
YY 		warnx("short write: clobbered value");
YY 	set(origval);
YY 	return (0);
YY }

Bruce


More information about the svn-src-head mailing list