svn commit: r326736 - head/usr.bin/wc

Bruce Evans brde at optusnet.com.au
Sun Dec 10 08:53:42 UTC 2017


On Sat, 9 Dec 2017, Conrad Meyer wrote:

> Log:
>  wc(1): Extend non-controversial optimizations to '-c' mode
>
>  wc(1)'s slow path for counting words or multibyte characters requires
>  conversion of the 8-bit input stream to wide characters.  However, a faster
>  path can be used for counting only lines ('-l' -- newlines have the same
>  representation in all supported encodings) or bytes ('-c').
>
>  The existing line count optimization was not used if the input was the
>  implicit stdin.  Additionally, it wasn't used if only byte counting was
>  requested.  This change expands the fast path to both of these scenarios.

This seems reasonable, but I don't like the complicated logic for handling
combinations of lines and bytes in the same loop.  Bytes alone just require
adding up the byte counts from read() (in the slow case where st_size
cannot be used).  This is obviously faster than using cat(1) (modulo
block size pessimizations), since cat has to do output.
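
The bytes-only case boils down to something like this minimal sketch (the
helper name and buffer size here are illustrative, not the committed code):

#include <stdint.h>
#include <unistd.h>

static int
count_bytes(int fd, uintmax_t *charct)
{
	char buf[64 * 1024];		/* MAXBSIZE-sized in the real code */
	ssize_t len;

	/* Bytes alone: just add up the byte counts from read(). */
	while ((len = read(fd, buf, sizeof(buf))) > 0)
		*charct += len;
	return (len == 0 ? 0 : -1);	/* -1 on a read error */
}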

PS: actually, it pessimizes the fastest case of wc -c on a regular file.
See below.  (The wc -l code is before the fastest case.  That was not
a very good organization, but it worked since wc -l can never use the
fastest case.)

Your change seems to work almost right for wc -c but not for wc -m.  On
freefall, dd bs=1m goes at 32GB/sec.
     (I think this depends on some vm magic to reduce copies.  1m is
     large and asks for cache busting, but it should only use 4K of L1
     in the kernel and 1 or 2 times 1m in dd.  Apparently dd is avoiding
     the output.  My throughput utility which does avoid the output is
     only 1GB/sec faster).
wc -c is now mediocre.  It goes at 5.5GB/sec for a pipe from dd bs=1m.
times slower is too many, but it is better than cat, which is 11.5
times slower (3GB/sec).  1m is too large a block size for me, and dd bs=64k
is only slightly slower (28GB/sec).  I blame pipes and slow syscalls for the
slowness of dd bs=1m </dev/zero | wc -c.

wc -m is still slow (on a 1-byte wide locale).  It goes at 200MB/sec, 27
times slower than wc -c.  The 1-byte wide case is supposed to be the same
as wc -c, but I think this is an invalid optimization.

>  Expanding the buffer size from 64 kB helps reduce the number of read(2)
>  calls needed, but exactly what impact that change has and what size to
>  expand the buffer to are still under discussion.

It would be wrong.  I didn't see any discussion of this.

Other utilities have different bugs in this area.  The worst are all
utilities that use stdio.  Just using stdio asks for slowness, but
fread() and fwrite() are not too bad for efficiency provided stdio's
block sizing is not trusted and setvbuf() is used to override it.
stdio naively trusts stat() to return a usable block size in st_blksize,
but stat() rarely does so.  This doesn't matter much for buffered i/o,
unless you want to get near 32GB/sec.  dd bs=4k goes at 10GB/sec on
freefall.  This wastes some CPU, but it is hard to produce i/o at even
1/10 of that speed.  Just don't go 8 times slower than that by using
a block size of 512.
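
For stdio users, overriding the default sizing looks something like this
(the 64k figure is just an example, as above):

#include <stdio.h>

static char iobuf[64 * 1024];

static int
use_explicit_buffer(FILE *fp)
{
	/* Must be done before the first read or write on fp. */
	return (setvbuf(fp, iobuf, _IOFBF, sizeof(iobuf)));
}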

One of the kernel bugs is using PAGE_SIZE as a default for st_blksize.
I notice this often when I downgrade to versions of FreeBSD with this
bug and do silly things like wc on disks and not so silly things like
cmp on disks.  I work around this using dd to change the block size.
This uses more CPU, but not enough to matter for disks (except SSD ones).

Ugh, -current has broken st_blksize for pipes.  It is now hard-coded as
PAGE_SIZE.  In old versions, it was pipe->pipe_buffer_size.  Pipe code
has internal problems configuring the latter.  It uses bogusly ifdefed
constants PIPE_SIZE = 16K and BIG_PIPE_SIZE = 64K, and an obscure method
to select between these.

Only socket code handles st_blksize perfectly.  It sets st_blksize to
sb_hiwat.  The user has complete control over socket buffer sizes
and watermarks, and the defaults are better.
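
Both of these st_blksize claims are easy to check with fstat() (a quick
sketch; the SO_RCVBUF value is arbitrary):

#include <sys/socket.h>
#include <sys/stat.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct stat sb;
	int fds[2], s, rcvbuf;

	if (pipe(fds) == 0 && fstat(fds[0], &sb) == 0)
		printf("pipe   st_blksize = %jd\n", (intmax_t)sb.st_blksize);

	s = socket(AF_LOCAL, SOCK_STREAM, 0);
	if (s != -1) {
		rcvbuf = 128 * 1024;
		(void)setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf,
		    sizeof(rcvbuf));
		/* Per the above, st_blksize should track sb_hiwat. */
		if (fstat(s, &sb) == 0)
			printf("socket st_blksize = %jd\n",
			    (intmax_t)sb.st_blksize);
	}
	return (0);
}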

> Modified: head/usr.bin/wc/wc.c
> ==============================================================================
> --- head/usr.bin/wc/wc.c	Sat Dec  9 21:04:56 2017	(r326735)
> +++ head/usr.bin/wc/wc.c	Sat Dec  9 21:55:19 2017	(r326736)
> @@ -206,30 +206,30 @@ cnt(const char *file)
> ...
> +	else if ((fd = open(file, O_RDONLY, 0)) < 0) {
> +		xo_warn("%s: open", file);
> +		return (1);
> +	}
> +	if (doword || (domulti && MB_CUR_MAX != 1))
> +		goto word;

Checks of MB_CUR_MAX like this are supposed to make wc -m do the same as
wc -c when the multi-byte width is 1.  Is this really valid?  Even with
ASCII, is it reasonable to treat bytes with no encoding as not there at
all, or as an encoding error?  I don't know if mb conversion functions
are capable of or allowed to ignore bytes with no encoding, especially on
exotic machines with 32-bit chars (now allowed by POSIX).
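
As one data point, mbrtowc(3) reports a byte with no encoding as an error
rather than silently skipping it, at least for UTF-8 (a quick sketch; the
locale name is an assumption):

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int
main(void)
{
	mbstate_t st;
	wchar_t wc;
	size_t r;

	(void)setlocale(LC_CTYPE, "en_US.UTF-8");	/* assumed to exist */
	memset(&st, 0, sizeof(st));
	r = mbrtowc(&wc, "\xff", 1, &st);	/* 0xff has no UTF-8 encoding */
	printf("mbrtowc: %s\n",
	    r == (size_t)-1 ? "(size_t)-1, i.e. EILSEQ" : "converted");
	return (0);
}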

This check seemed to work before for making wc -c as slow as wc -m
for (domulti && MB_CUR_MAX != 1) (just doing this check in inner loops
is pessimal).  I can't see why the above doesn't make wc -m as fast
as wc -c in this case.

> +	/*
> +	 * Line counting is split out because it's a lot faster to get
> +	 * lines than to get words, since the word count requires some
> +	 * logic.
> +	 */
> +	if (doline || dochar) {
> +		while ((len = read(fd, buf, MAXBSIZE))) {
> +			if (len == -1) {
> +				xo_warn("%s: read", file);
> +				(void)close(fd);
> +				return (1);
> +			}
> +			if (siginfo) {
> +				show_cnt(file, linect, wordct, charct,
> +				    llct);
> +			}

Style bugs:
- excessive line splitting
- excessive parentheses
The second one was in the old code.  The first one was not in the old code,
since there the splitting was needed because the indentation was larger.

The siginfo checks in the inner loop of the slow case help make it slower.
Here there is only 1 per read so the pessimization is small.

> +			charct += len;
> +			if (doline) {
> 				for (p = buf; len--; ++p)
> 					if (*p == '\n') {
> 						if (tmpll > llct)

If doing only chars, then this just adds up the byte counts, except for
the error checks (needed) and the siginfo tests (hair).

Old versions would have been much simpler using unsafe signal handlers.
This doesn't require adding siginfo checks in all i/o loops, but just
prints immediately and usually unsafely in signal handlers.  I think
the "safe" method doesn't even work for large i/o's or small blocked
i/o's.  Unsafe siginfo prints in the middle of the i/o.  Most BSD programs
can't handle EINTR in i/o loops, so most siginfo setups preserve the
default restarting of syscalls, although not restarting is required to
get back to the i/o loops to print.
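
For reference, the "safe" arrangement discussed here is roughly (a sketch,
not the wc code itself):

#include <signal.h>

/* The handler only sets a flag; the i/o loop checks it once per read(). */
static volatile sig_atomic_t siginfo;

static void
siginfo_handler(int sig)
{
	(void)sig;
	siginfo = 1;
}

static void
setup_siginfo(void)
{
	struct sigaction sa;

	sa.sa_handler = siginfo_handler;
	(void)sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;	/* keep the default syscall restarting */
	(void)sigaction(SIGINFO, &sa, NULL);
}

With SA_RESTART kept, a SIGINFO delivered during a long read() is not
reported until that read() returns, which is the limitation described above.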

> +	/*
> +	 * If all we need is the number of characters and it's a
> +	 * regular file, just stat the puppy.
> +	 */

The comment is still precious, and is now misformatted by not adjusting
for the new indentation.

> +	if (dochar || domulti) {
> +		if (fstat(fd, &sb)) {
> +			xo_warn("%s: fstat", file);
> +			(void)close(fd);
> +			return (1);
> +		}
> +		if (S_ISREG(sb.st_mode)) {
> +			reset_siginfo();
> +			charct = sb.st_size;
> 			show_cnt(file, linect, wordct, charct, llct);
> +			tcharct += charct;
> 			(void)close(fd);
> 			return (0);
> -		}

This naively trusts S_ISREG() to work, so never worked on FreeBSD's irregular
"regular" files in procfs and perhaps other places.

Similar checks come close to breaking cp and cmp on these files.  But cp
and cmp do dubious mmapping and a differently bad fallback to a slow case.
mmap() fails for most irregular regular files, so the slow case is reached.

wc could also use mmap() if it cared about efficiency, but this is hard to
do right.  No FreeBSD utilities understand buffer sizing even for read() and
write(), and mmap() is harder.  The slow case in wc is too slow for better
i/o to make much difference.  wc -c on a regular file doesn't do i/o.  This
leaves only wc -l as possibly benefiting from mmap().
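
If someone did try it for wc -l, the scan over a mapped region is simple
enough (a sketch; the caller would still need windowing as discussed for cp
below, page-aligned offsets, and a read() fallback when mmap() fails):

#include <sys/types.h>
#include <sys/mman.h>
#include <stdint.h>
#include <string.h>

static intmax_t
count_newlines_mapped(int fd, off_t off, size_t len)
{
	intmax_t linect;
	char *p;
	const char *q, *end;

	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	if (p == MAP_FAILED)
		return (-1);		/* fall back to read() */
	linect = 0;
	for (q = p, end = p + len;
	    (q = memchr(q, '\n', end - q)) != NULL; q++)
		linect++;
	(void)munmap(p, len);
	return (linect);
}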

cmp is too aggressive using mmap().  For regular files, it tries to map
the whole file, and falls back to a very slow method using getc() if
mmap() fails.  For non-regular files, it always uses the slow method.
This method wastes a lot of CPU for getc() and wastes a lot of bandwidth
for some file types using the st_blksize pessimizations.  Mapping large
files busts caches just to allow a simpler terminating condition which
saves a few cycles per char.  Comparison uses a slow per-char method in
all cases to simplify printing differences for the usual case.

cp still uses the 25+ year old option VM_AND_BUFFER_CACHE_SYNCHRONIZED
in its Makefile in case vm is still 25 years old.  Other utilities just
assume that mmap() works and that using it is actually an optimization.
cp still uses the 25+ year old hard coded limit of 8MB on mmapping().
That used to be too large, but is now a bit small.  However, with more
caching everywhere it is hard to know the best size and method.  I
don't like the mmap() method, but it works best for my usual case of
lots of small files in the buffer cache.  All methods are fast then.  For
large files only on disks, everything is slow and mmap() makes little
difference to the speed of the copy.  It might use less CPU but waste
more memory.  For very large files, it is clearly worst to map whole
files like cmp does.  This wastes memory in the application and gives
vm the wrong hint that the whole files will be used again soon.  cp's
window of 8MB is not too bad.  Perhaps smaller is better for data
that won't be used again soon.  Then we only want the features of
mmap that allow it to reduce copying and buffering.  cp sometimes knows
when the data won't be used again soon because it is copying large
data (large files or many small files).
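
Roughly what cp's windowed method amounts to (a sketch with the 8MB figure
from above hard-coded; error reporting and the read()/write() fallback are
omitted):

#include <sys/param.h>
#include <sys/mman.h>
#include <unistd.h>

#define MMAP_WINDOW	(8 * 1024 * 1024)	/* cp's old limit */

static int
copy_mapped(int from_fd, int to_fd, off_t size)
{
	off_t off;
	size_t len;
	char *p;

	for (off = 0; off < size; off += len) {
		len = MIN((size_t)(size - off), (size_t)MMAP_WINDOW);
		p = mmap(NULL, len, PROT_READ, MAP_SHARED, from_fd, off);
		if (p == MAP_FAILED)
			return (-1);
		if (write(to_fd, p, len) != (ssize_t)len) {
			(void)munmap(p, len);
			return (-1);
		}
		(void)munmap(p, len);
	}
	return (0);
}

A smaller window would be gentler on the caches for data that won't be used
again soon, which is the tradeoff described above.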

I debugged why wc -m doesn't work as documented or as the code seems
to support for locales that don't support multibyte characters.  It
is just that -m turns off dochar.  This used to have no effect since
wc -c used the slow case like wc -m.  Now the complicated logic
is broken.  The fastest case of wc -c (where it is supposed to just
fstat() the file) is also broken by essentially the same error.  Now
this case uses the new second-fastest read() method for wc -c.  Bizarrely,
wc -m still works as intended by using the fastest case.

Code with broken logic:

X 	if (doword || (domulti && MB_CUR_MAX != 1))
X 		goto word;

The point of this MB_CUR_MAX != 1 check is to ensure reaching the slow
case for multi-byte characters.  Otherwise, we would fall through to
test for the fastest case.  The test there (and others before there)
depends on testing MB_CUR_MAX here.

X 	/*
X 	 * Line counting is split out because it's a lot faster to get
X 	 * lines than to get words, since the word count requires some
X 	 * logic.
X 	 */
X 	if (doline || dochar) {

This used to only test doline (but handle dochar).

Testing dochar here breaks the fastest case for dochar (but not for domulti)
by handling dochar here in a slower way.

X 		[... rest of second-fastest case]
X 	}
X 	/*
X 	 * If all we need is the number of characters and it's a
X 	 * regular file, just stat the puppy.
X 	 */
X 	if (dochar || domulti) {

This used to handle dochar and the width 1 case of domulti in the fastest
way.  It still works for domulti, but is unreachable for dochar, since
dochar was eaten above.

X 		[... fastest case using fstat()]
X 	}

To fix this, first move the fstat() case earlier.  The condition for using
this case can be written as (doline == 0).  This is clearer than testing
all the flags again and much clearer than testing only half the flags again
as is done now.
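
In code, the suggested shape is roughly (a sketch of the ordering, not a
tested patch; the names are those of the existing cnt()):

	if (doword || (domulti && MB_CUR_MAX != 1))
		goto word;
	/*
	 * Fastest case: no line count is needed, so just fstat() regular
	 * files for the byte count.
	 */
	if (doline == 0) {
		if (fstat(fd, &sb)) {
			xo_warn("%s: fstat", file);
			(void)close(fd);
			return (1);
		}
		if (S_ISREG(sb.st_mode)) {
			reset_siginfo();
			charct = sb.st_size;
			show_cnt(file, linect, wordct, charct, llct);
			tcharct += charct;
			(void)close(fd);
			return (0);
		}
	}
	/* Second-fastest case: bytes, and newlines if doline. */
	[... read() loop as above, with no further flag tests]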

Then after handling the fastest case, don't test any more, and reduce the
indentation.  The test in the current code is the same as a null test
except it has different bugs than a null test.  We want to treat -m as -c,
so the test should be of (dochar || domulti || doline), but one of thes
flags is known to be set since doword is known to be clear.  The current
test just breaks the fastest case for -c and the second-fastest case for
-m, but avoids breaking the fastest case for -m.  Actually, the broken
cases for -m are reversed when -l is used.  Then -m behaves the same as -c
for width 1.

Perhaps some more reorganization can be used to reduce the indentation.
Gotos are already used to do not much more than reduce indentation.

Bruce

