svn commit: r196752 - head/lib/libc/stdtime

Bruce Evans brde at optusnet.com.au
Fri Sep 4 03:42:16 UTC 2009


On Thu, 3 Sep 2009, [utf-8] Dag-Erling Smørgrav wrote:

> Andrey Chernov <ache at nagual.pp.ru> writes:
>> Thanx for detailed explanation.
>
> np; I've made that mistake many times myself.

Andrey doesn't make that mistake many times (at least this millenium),
but fixes it a lot.

> What do you think of the attached patch?

Why not use the same text as POSIX?  I don't like repeating this ad
nauseum, but POSIX does.  (In POSIX.1-2001-draft7, this repetition is
responsible for 1/3 of the lines matching of "unsigned char".)
C99 says this only once for ctype functions, at the beginning of the
section on <ctype.h>.

The patch is missing the corresponding text for wctype functions
(argmuments of type wint_t generally give undefined behaviour unless
their value is representable as a wchar_t or equal to the value of
WEOF).  In FreeBSD, wint_t has the same type as wchar_t and that type
is int, so bugs in this area are latent (unportable code might run
on FreeBSD, and there might be problems with the sign bit and/or with
WEOF being indistinguishable from a valid wide char encoding.
gcc/config/i386 has some targets with 16-bit wide chars, and these
have to be more careful about both.  The non-broken ones use uint16_t
for wchar_t and int for wint_t)>

% Index: lib/libc/locale/iscntrl.3
% ===================================================================
% --- lib/libc/locale/iscntrl.3	(revision 196695)
% +++ lib/libc/locale/iscntrl.3	(working copy)
% @@ -32,7 +32,7 @@
%  .\"     @(#)iscntrl.3	8.1 (Berkeley) 6/4/93
%  .\" $FreeBSD$
%  .\"
% -.Dd July 17, 2005
% +.Dd September 3, 2009
%  .Dt ISCNTRL 3
%  .Os
%  .Sh NAME
% @@ -65,6 +65,15 @@
%  .It "\&031\ EM \t032\ SUB \t033\ ESC \t034\ FS \t035\ GS"
%  .It "\&036\ RS \t037\ US \t177\ DEL"
%  .El
% +.Pp
% +.Em NOTE :
% +if the value passed to the
% +.Fn iscntrl
% +function is a
% +.Vt signed char ,
% +as is usually the case, it must be cast to an
% +.Vt unsigned char
% +to avoid sign-extension errors.

This wording is poor.  It should be something like "if the value to be
passed is represented as a signed char" ...  I don't know a good easy
way to fix "must be cast ... to avoid sign-extension errors".  The
value must be converted to one representable as an unsigned char to
work, but that is not always possible, and blindly casting may give
a wrong value.  In most cases, it is an error to represent the value
as a signed char to begin with, and better to cast a pointer from
"char *" to "u_char":

 	char *p;			/* Already a bad type. */

 	... isalpha(*p);		/* Wrong since *p might be < 0. */
 	... isalpha((u_char)*p);	/* Mishandles exotic machines. */
 	... isalpha(*(u_char *)p);	/* Mishandles diff. exotic machines. */

"unsigned char" is spelled u_char in KNF.  Here this makes the above lines
barely fit in 80 columns before quoting.

Exotic machines include:
- simple ones complement with no trap representations, signed chars  and
   CHAR_BIT = 8.  I think '\xff' is -0.  Casting this to u_char gives
   0, which is probably not what is wanted, while casting the pointer
   to it gives 0xff (if the bits are there in the memory copy), which
   is probably what is wanted.
- ones with chars being signed and signed chars having a larger range
   than unsigned ones, or with u_char of a different size than char
   (not sure if the latter is allowed).  Then casting the pointer will
   give a reasonable value, though possibly not the wanted one (the
   difficulty is more in advancing the pointer), while blindly casting
   the value may corrupt many more values than -0.
- ones with trap representations, instead of or in addition to the above
   complications.  Now *p will trap on trap reps but there should be no
   trap reps in valid data, while casting the pointer will always give
   a reasonable value, possibly including trap bits.

Sign extension for passing a signed char is not an error.  The error is
passing a negative value.

Neither C99 nor POSIX gives any advice.  They just say that the behaviour
is undefined if the value is not representable as an unsigned char or
equal to the value of EOF.

% Index: lib/libc/locale/isrune.3
% ===================================================================
% --- lib/libc/locale/isrune.3	(revision 196695)
% +++ lib/libc/locale/isrune.3	(working copy)
% @@ -25,7 +25,7 @@
%  .\"
%  .\" $FreeBSD$
%  .\"
% -.Dd March 30, 2004
% +.Dd September 3, 2009
%  .Dt ISRUNE 3
%  .Os
%  .Sh NAME
% @@ -46,6 +46,15 @@
%  .Tn ASCII
%  character set, this is equivalent to
%  .Fn isascii .
% +.Pp
% +.Em NOTE :
% +if the value passed to the
% +.Fn isrune
% +function is a
% +.Vt signed char ,
% +as is usually the case, it must be cast to an
% +.Vt unsigned char
% +to avoid sign-extension errors.
%  .Sh RETURN VALUES
%  The
%  .Fn isrune

This seems to be wrong.  isascii() works on all int args, so I assume
isrune() does too.  isrune.3 also misdescribes its arg as "the character"
-- this prejudges the arg.

% Index: lib/libc/locale/digittoint.3
% ===================================================================
% --- lib/libc/locale/digittoint.3	(revision 196695)
% +++ lib/libc/locale/digittoint.3	(working copy)
% @@ -28,7 +28,7 @@
%  .\"	@(#)digittoint.3	8.1 (Berkeley) 6/4/93
%  .\" $FreeBSD$
%  .\"
% -.Dd April 6, 2001
% +.Dd September 3, 2009
%  .Dt DIGITTOINT 3
%  .Os
%  .Sh NAME
% @@ -46,6 +46,15 @@
%  function converts a numeric character to its corresponding integer value.
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
%  The character can be any decimal digit or hexadecimal digit.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
%  With hexadecimal characters, the case of the values does not matter.
% +.Pp
% +.Em NOTE :
% +if the value passed to the
% +.Fn digittoint
% +function is a
% +.Vt signed char ,
% +as is usually the case, it must be cast to an
% +.Vt unsigned char
% +to avoid sign-extension errors.
%  .Sh RETURN VALUES
%  The
%  .Fn digittoint

Here the behaviour already seems to be undefined unless the arg is a
decimal or hex digit.  Like isascii(), this function isn't in C99, so
it can be more or less strict than other ctype functions.

% Index: lib/libc/locale/isascii.3
% ===================================================================
% --- lib/libc/locale/isascii.3	(revision 196695)
% +++ lib/libc/locale/isascii.3	(working copy)
% @@ -28,7 +28,7 @@
%  .\"     @(#)isascii.3	8.2 (Berkeley) 12/11/93
%  .\" $FreeBSD$
%  .\"
% -.Dd October 6, 2002
% +.Dd September 3, 2009
%  .Dt ISASCII 3
%  .Os
%  .Sh NAME
% @@ -47,6 +47,15 @@
%  .Tn ASCII
%  character, which is any character
%  between 0 and octal 0177 inclusive.
% +.Pp
% +.Em NOTE :
% +if the value passed to the
% +.Fn isascii
% +function is a
% +.Vt signed char ,
% +as is usually the case, it must be cast to an
% +.Vt unsigned char
% +to avoid sign-extension errors.
%  .Sh SEE ALSO
%  .Xr ctype 3 ,
%  .Xr iswascii 3 ,

This is wrong -- see above.  POSIX specifically says that "The isascii()
function is defined on all integer values".  Please check POSIX for any
other special cases -- I only remembered this one and passed most of your
other changes since most of the functions are in C99.

% Index: lib/libc/locale/toascii.3
% ===================================================================
% --- lib/libc/locale/toascii.3	(revision 196695)
% +++ lib/libc/locale/toascii.3	(working copy)
% @@ -28,7 +28,7 @@
%  .\"	@(#)toascii.3	8.1 (Berkeley) 6/4/93
%  .\" $FreeBSD$
%  .\"
% -.Dd June 4, 1993
% +.Dd September 3, 2009
%  .Dt TOASCII 3
%  .Os
%  .Sh NAME
% @@ -45,6 +45,15 @@
%  .Fn toascii
%  function strips all but the low 7 bits from a letter,
%  including parity or other marker bits.
% +.Pp
% +.Em NOTE :
% +if the value passed to the
% +.Fn toascii
% +function is a
% +.Vt signed char ,
% +as is usually the case, it must be cast to an
% +.Vt unsigned char
% +to avoid sign-extension errors.
%  .Sh RETURN VALUES
%  The
%  .Fn toascii

This seems wrong too.  POSIX doesn't say anything specific for it (neither
the restriction nor lack of it).  It just says that the result is
<arg> and 0x7f where the above says "strips all but the low 7 bits from
a letter".  Another old bug is the above saying "letter" -- toascii()
always worked on at least ASCII non-letters.

% Index: lib/libc/locale/toupper.3
% ===================================================================
% --- lib/libc/locale/toupper.3	(revision 196695)
% +++ lib/libc/locale/toupper.3	(working copy)
% @@ -32,7 +32,7 @@
%  .\"	@(#)toupper.3	8.1 (Berkeley) 6/4/93
%  .\" $FreeBSD$
%  .\"
% -.Dd July 17, 2005
% +.Dd September 3, 2009
%  .Dt TOUPPER 3
%  .Os
%  .Sh NAME
% @@ -53,6 +53,15 @@
%  .Vt "unsigned char"
%  or the value of
%  .Dv EOF .
% +.Pp
% +.Em NOTE :
% +if the value passed to the
% +.Fn toupper
% +function is a
% +.Vt signed char ,
% +as is usually the case, it must be cast to an
% +.Vt unsigned char
% +to avoid sign-extension errors.
%  .Sh RETURN VALUES
%  If the argument is a lower-case letter, the
%  .Fn toupper

toupper.2 already says that "the value of the argument is representable
as an unsigned char or the value EOF" (it should say "the argument is
representable as an unsigned char or equal to the value of EOF").  The
new text, if any, should be mixed with this.

Presumably similarly for tolower (I didn't noticed these problems
earlier).

Bruce


More information about the svn-src-all mailing list