svn commit: r250883 - in head: include include/xlocale lib/libc/locale sys/sys tools/regression/lib/libc/locale

Tue May 21 22:00:06 UTC 2013

On Tue, May 21, 2013 at 07:59:37PM +0000, Ed Schouten wrote:
> Author: ed
> Date: Tue May 21 19:59:37 2013
> New Revision: 250883
> URL: http://svnweb.freebsd.org/changeset/base/250883

> Log:
>   Add <uchar.h>.

Looks like an interesting approach to make applications LC_CTYPE locale
aware.

>   The <uchar.h> header, part of C11, adds a small number of utility
>   functions for 16/32-bit "universal" characters, which may or may not be
>   UTF-16/32. As our wchar_t is already ISO 10646, simply add light-weight
>   wrappers around wcrtomb() and mbrtowc().

Our wchar_t is only ISO 10646 for UTF-8 and possibly US-ASCII and
ISO8859-1 (subset) locales. However, we support various other charsets
such as ISO8859-2 and even some non-Unicode multibyte encodings
(although I have no idea how complete the support for the latter is).
Using the new functions with such locales is likely to produce garbage.
Some sort of iconv is necessary to avoid this mojibake and we do not
have it in base.

I encountered the same issue when implementing $'\uXXXX' and
$'\UXXXXXXXX' for /bin/sh. I decided to implement full support only for
UTF-8 locales and restrict the rest to US-ASCII, converting all other
(valid) characters to question marks. I did not see a reason to treat
Western Europe preferentially by adding a simple ISO8859-1 mapping.

For these functions, it is probably best to fail (such as with [EILSEQ])
when a character is encountered for which the conversion is not
implemented.

>   While there, also add (non-yet-standard) _l functions, similar to the
>   ones we already have for the other locale-dependent functions.

This makes sense.

> [snip]
> Added: head/lib/libc/locale/c16rtomb.c
> ==============================================================================
> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
> +++ head/lib/libc/locale/c16rtomb.c	Tue May 21 19:59:37 2013	(r250883)
> [snip]
> +typedef struct {
> +	char16_t	lead_surrogate;
> +	mbstate_t	c32_mbstate;
> +} _Char16State;
> +
> +size_t
> +c16rtomb_l(char * __restrict s, char16_t c16, mbstate_t * __restrict ps,
> +    locale_t locale)
> +{
> +	_Char16State *cs;
> +	char32_t c32;
> +
> +	FIX_LOCALE(locale);
> +	if (ps == NULL)
> +		ps = &locale->c16rtomb;
> +	cs = (_Char16State *)ps;
> +
> +	/* If s is a null pointer, the value of parameter c16 is ignored. */
> +	if (s == NULL) {
> +		c32 = 0;
> +	} else if (cs->lead_surrogate >= 0xd800 &&
> +	    cs->lead_surrogate <= 0xdbff) {
> +		/* We should see a trail surrogate now. */
> +		if (c16 < 0xdc00 || c16 > 0xdfff) {
> +			errno = EILSEQ;
> +			return ((size_t)-1);
> +		}
> +		c32 = 0x10000 + ((cs->lead_surrogate & 0x3ff) << 10 |
> +		    (c16 & 0x3ff));
> +	} else if (c16 >= 0xd800 && c16 <= 0xdbff) {
> +		/* Store lead surrogate for next invocation. */
> +		cs->lead_surrogate = c16;
> +		return (0);
> +	} else {
> +		/* Regular character. */
> +		c32 = c16;
> +	}
> +	cs->lead_surrogate = 0;
> +
> +	return (c32rtomb_l(s, c32, &cs->c32_mbstate, locale));
> +}

You can probably avoid the potential buffer overflow by copying the
trailing 126 bytes of the _Char16State into a new mbstate_t object, and
back after calling c32rtomb_l(). Likewise for mbrtoc16_l(). (The
committed code allows using 120 bytes which should still be more than
enough.)

There may also be a strict-aliasing violation if ps == NULL, as the
mbstate_t objects in __xlocale_C_locale and __xlocale_global_locale have
a declared type. If ps != NULL, it is OK as long as the object is a
separate compilation unit.

-- 
Jilles Tjoelker