gnu/116363: isspace broken for UTF-8 locales

Sun Sep 16 10:01:47 PDT 2007

On Sun, Sep 16, 2007 at 04:40:07PM +0000, Andrey Chernov wrote:
> The following reply was made to PR gnu/116363; it has been noted by GNATS.
> 
> From: Andrey Chernov <ache at nagual.pp.ru>
> To: Hye-Shik Chang <perky at FreeBSD.org>
> Cc: Petr Hroudny <petr.hroudny at gmail.com>, freebsd-gnats-submit at FreeBSD.org,
>         jkoshy at FreeBSD.org, i18n at FreeBSD.org
> Subject: Re: gnu/116363: isspace broken for UTF-8 locales
> Date: Sun, 16 Sep 2007 20:34:07 +0400
> 
>  On Mon, Sep 17, 2007 at 01:22:14AM +0900, Hye-Shik Chang wrote:
>  > In fact, UTF-8.src defines values for not UTF-8 but Unicode codepoints.
>  > Using the Unicode codepoint as wchar_t's internal representation gives
>  > much benefit.  I think we would be better to make isspace() and
>  > other ctypes functions aware of "encoding".  IIRC, tjr@ provided the
>  > workaround as in the URL mentioned above and said that it would get
>  > a chance to be fixed in 6 or 7 on 2004.
>  
>  Currently wchar_t represents given encoding in all places including 
>  wc<->mbr conversions. To make it UCS-4-only instead we need to rewrite the 

Oops, sorry for my oversight: we really do have UCS-4 as wchar_t, so no 
UTF-8.src replacement is needed. 

In that case iswspace(0xA0) should return 1 but isspace(0xA0) should not, 
so this looks like a bug in isspace() (and the other plain ctype 
functions). It seems that even isspace(' ') is illegal in a UTF-8 locale 
because all chars are wide, but I am not sure.
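
To compare the two functions on a given system, something like the 
following rough sketch could be used. The helper names (narrow_is_space, 
wide_is_space, report_space) are made up for illustration and are not 
part of libc. The values printed for 0xA0 depend on the platform's 
locale data, so no particular output is assumed here.

```c
#include <ctype.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: classify a single byte value with the plain
 * ctype function. isspace() is only defined for EOF and for values
 * representable as unsigned char, hence the cast. */
int narrow_is_space(int c)
{
    return isspace((unsigned char)c) != 0;
}

/* Hypothetical helper: classify a wide character with the wctype
 * function. */
int wide_is_space(wint_t wc)
{
    return iswspace(wc) != 0;
}

/* Hypothetical helper: print both classifications for one value,
 * under whatever LC_CTYPE locale is currently in effect. */
void report_space(unsigned int value)
{
    printf("0x%02X: isspace=%d iswspace=%d\n",
           value,
           narrow_is_space((int)value),
           wide_is_space((wint_t)value));
}
```

Calling setlocale(LC_CTYPE, "") from <locale.h> with a UTF-8 locale in 
the environment and then report_space(0xA0) shows whether the two 
disagree. Note that a lone 0xA0 byte is a continuation byte in UTF-8, 
not a complete character.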

-- 
http://ache.pp.ru/