bin/84992: gcc4.x cleanup of usr.bin/hexdump

Bruce Evans bde at
Wed Aug 17 02:52:26 GMT 2005

On Tue, 16 Aug 2005, Divacky Roman wrote:

>> Description:
> gcc4.x (tested with gcc41) cleanup of usr.bin/hexdump
>> How-To-Repeat:
> apply the patch
>> Fix:
> Index: conv.c
> ===================================================================
> RCS file: /home/ncvs/src/usr.bin/hexdump/conv.c,v
> retrieving revision 1.8
> diff -u -r1.8 conv.c
> --- conv.c	16 Jul 2004 11:07:07 -0000	1.8
> +++ conv.c	11 Aug 2005 13:22:00 -0000
> @@ -103,7 +103,7 @@
> 	if (odmode && MB_CUR_MAX > 1) {
> 		oclen = 0;
> retry:
> -		clen = mbrtowc(&wc, p, bufsize, &pr->mbstate);
> +		clen = mbrtowc(&wc, (const char *)p, bufsize, &pr->mbstate);

Adding casts is a cleandown.  p should have type "char *" to begin with.
If p actually needs to be "unsigned char *" then it is unclear that
mbrtowc() can handle it correctly and casting it to "char *" breaks the
warning about this.

The possible critical differences between unsigned chars and signed chars
are especially clear in hexdump.  hexdump wants to print raw bytes so it
should use unsigned chars just like it already does except without the
type mismatches from passes pointers to string functions that want signed
chars.  Signed chars may have any number of padding bits that wouldn't be
printed right if the bytes are accessed as signed chars.  The padding
bits may contain any number of trap representations which would abort the
program before it prints anything.  Signed chars must have a single sign
bit, but it may be anywhere and may be printed weirdly.  See section of the draft C99 standard.

It is clear that string functions can't in general handle arrays of
unsigned chars via type punning as above, since the unsigned chars may
have trap representations when punned to signed chars -- explicit
conversion at the level of chars is required, although it might not
work right for hexdump since the conversion might lose bits (it _must_
lose any trap bits).  What is unclear is whether the problems can
actually occur, even in theory.  Note that the above code is only
executed in the odmode case due to the "hd" mode wanting to be rawer.

> 		if (clen == 0)
> 			clen = 1;
> 		else if (clen == (size_t)-1 || (clen == (size_t)-2 &&
> @@ -118,7 +118,7 @@
> 			 * can complete it.
> 			 */
> 			oclen = bufsize;
> -			bufsize = peek(p = peekbuf, MB_CUR_MAX);
> +			bufsize = peek(p = (u_char *)peekbuf, MB_CUR_MAX);
> 			goto retry;
> 		}
> 		clen += oclen;

Similarly, in the reverse direction.  peek() wants unsigned chars so
using possibly-signed chars for its buffers is nonsense.  It is unclear
if peek() actually needs unsigned chars.

> Index: parse.c
> ===================================================================
> RCS file: /home/ncvs/src/usr.bin/hexdump/parse.c,v
> retrieving revision 1.13
> diff -u -r1.13 parse.c
> --- parse.c	22 Jul 2004 13:14:42 -0000	1.13
> +++ parse.c	11 Aug 2005 13:22:00 -0000
> @@ -54,7 +54,7 @@
> void
> addfile(char *name)
> {
> -	unsigned char *p;
> +	char *p;
> 	FILE *fp;
> 	int ch;
> 	char buf[2048 + 1];
> @@ -79,7 +79,7 @@
> void
> add(const char *fmt)
> {
> -	unsigned const char *p, *savep;
> +	const char *p, *savep;
> 	static FS **nextfs;
> 	FS *tfs;
> 	FU *tfu, **nextfu;

Changing the basic type to plain char gives bugs even on normal machines
(2's complement with 8-bit chars).  The data will have been read using
read(2) or passed on the command line via char **argv  so it naturally
has type char, but hexdump depends critically on the reverse of the
bogus casts above -- it puns the plain chars to unsigned chars to get
rid of the sign bit.  Here p points into the format string so the type
pun is more bogus than for file data, but it is intentional -- p had
type plain "char *" in rev.1.1, but it was changed to "unsigned char
*" in rev.1.3 to work around many sign extension bugs that occur in
practice.  The first one in add() is the well-known bug of using
isspace() on a possibly-signed char.  This gives undefined behaviour
if the value is negative but different from EOF, and bogus defined
behaviour if the value is EOF.  The format string can be specified on
the command line so it is easy for users who use more than the ASCII
character set to put a negative char in it.

[... Similarly to undo most of the work-arounds in rev.1.3].


More information about the freebsd-bugs mailing list