converting strings from utf8

Tim Kientzle kientzle at freebsd.org
Wed Nov 5 13:37:15 PST 2008


Maksim Yevmenkin wrote:
> 
> can i use wcstombs(3) to convert a string presented in utf8 into
> current locale? basically i'm looking for something like iconv from
> ports but included into base system.

This isn't as easy as it should be, unfortunately.
wcstombs() only converts wide characters into the current
locale's multibyte encoding, and UTF-8 is itself a multibyte
encoding, so you have to convert to wide characters first.
You could in theory use the following:
   * Set the locale to a UTF-8 locale
   * Use mbstowcs() to convert UTF-8 into wide characters
   * Set the locale to your preferred locale
   * Use wcstombs() to convert wide characters to your locale
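
The four steps above might look roughly like the following. This is
only a sketch: utf8_to_locale() is a hypothetical helper, not part of
any library, and both locale names are supplied by the caller because
valid names differ from system to system.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the UTF-8 string `in` to the encoding of
 * `target_locale` by switching LC_CTYPE twice.  Returns the number of
 * bytes written (not counting the NUL), or -1 on any failure.
 */
static long
utf8_to_locale(const char *utf8_locale, const char *target_locale,
    const char *in, char *out, size_t outsize)
{
	const char *cur;
	char *saved;
	wchar_t *wide = NULL;
	size_t len = strlen(in);
	size_t n;
	long result = -1;

	/* Copy the current locale name; later setlocale() calls may
	 * overwrite the string that setlocale(LC_CTYPE, NULL) returns. */
	cur = setlocale(LC_CTYPE, NULL);
	if (cur == NULL || (saved = malloc(strlen(cur) + 1)) == NULL)
		return (-1);
	strcpy(saved, cur);

	/* Step 1: switch to a UTF-8 locale (the name is system-dependent). */
	if (setlocale(LC_CTYPE, utf8_locale) == NULL)
		goto done;

	/* Step 2: UTF-8 to wide characters.  A string of len bytes never
	 * produces more than len wide characters. */
	wide = malloc((len + 1) * sizeof(*wide));
	if (wide == NULL || mbstowcs(wide, in, len + 1) == (size_t)-1)
		goto done;

	/* Step 3: switch to the locale we actually want. */
	if (setlocale(LC_CTYPE, target_locale) == NULL)
		goto done;

	/* Step 4: wide characters to the target encoding.  One
	 * unconvertible character makes the whole call fail. */
	n = wcstombs(out, wide, outsize);
	if (n == (size_t)-1 || n >= outsize)
		goto done;		/* failed, or no room for the NUL */
	result = (long)n;
done:
	setlocale(LC_CTYPE, saved);	/* restore the caller's locale */
	free(saved);
	free(wide);
	return (result);
}
```

A call such as utf8_to_locale("en_US.UTF-8", "C", s, buf, sizeof(buf))
assumes both names exist on the system at hand. Since setlocale()
changes process-wide state, this is also awkward in threaded programs.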

Besides being ugly, the locale names themselves are not
standardized, so it's hard to do this portably.  For a
lot of applications, the error handling in wcstombs() is
also troublesome: it fails for the entire string, returning
(size_t)-1, if any one character can't be converted.

When I had to do this for libarchive, where the code had
to be very portable (which precluded using iconv), I ended
up doing the following:
   * Wrote my own converter from UTF-8 to wide characters
     (fortunately, UTF-8 is pretty simple to decode; this
     is about 20-30 lines of C)
   * Used wctomb() to convert one character at a time from
     wide characters to the current locale
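
For illustration, a UTF-8 decoder of about that size could look like
the sketch below. It is not the libarchive code; utf8_decode() is a
hypothetical name, and the sketch assumes that wchar_t holds Unicode
code points, which is true on many systems but not guaranteed by C.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from s (at most n
 * bytes available) into *pwc.  Returns the number of bytes consumed,
 * or -1 if the sequence is malformed or truncated.
 */
static int
utf8_decode(wchar_t *pwc, const unsigned char *s, size_t n)
{
	unsigned long wc, min;
	int len, i;

	if (n == 0)
		return (-1);
	if (s[0] < 0x80) {			/* plain ASCII */
		*pwc = (wchar_t)s[0];
		return (1);
	} else if ((s[0] & 0xE0) == 0xC0) {	/* 110xxxxx: 2 bytes */
		len = 2; wc = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {	/* 1110xxxx: 3 bytes */
		len = 3; wc = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {	/* 11110xxx: 4 bytes */
		len = 4; wc = s[0] & 0x07; min = 0x10000;
	} else
		return (-1);			/* stray continuation or invalid lead byte */

	if (n < (size_t)len)
		return (-1);			/* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return (-1);		/* not a 10xxxxxx continuation byte */
		wc = (wc << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, UTF-16 surrogates and out-of-range values. */
	if (wc < min || wc > 0x10FFFF || (wc >= 0xD800 && wc <= 0xDFFF))
		return (-1);
	*pwc = (wchar_t)wc;
	return (len);
}
```

The caller advances through the input by the returned length and
decides what to do when -1 comes back, for example skipping one byte
and emitting a replacement character.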

I've found that wctomb() is more portable than a lot of
the other functions (it's in C89, whereas the restartable
conversion routines such as wcrtomb() only arrived with
the 1995 amendment and C99) and provides better
error-handling capabilities since it operates on one
character at a time (so you can, for instance, convert
characters that aren't supported in the current locale
into '?' or some kind of \-escape).
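
A minimal sketch of that per-character approach, substituting '?' for
anything the current locale cannot represent. Again, this is not the
libarchive code, and wide_to_locale() is a hypothetical name.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the wide string `in` to the current
 * locale's encoding, one character at a time.  Characters the locale
 * cannot represent become '?'.  Output is truncated if it does not
 * fit.  Returns the number of bytes written, not counting the NUL.
 */
static size_t
wide_to_locale(const wchar_t *in, char *out, size_t outsize)
{
	char buf[MB_LEN_MAX];	/* large enough for one character in any locale */
	size_t used = 0;
	int n;

	if (outsize == 0)
		return (0);
	wctomb(NULL, L'\0');		/* reset any shift state */
	for (; *in != L'\0'; in++) {
		n = wctomb(buf, *in);
		if (n <= 0) {
			/* Not representable: substitute instead of failing. */
			buf[0] = '?';
			n = 1;
			wctomb(NULL, L'\0');
		}
		if (used + (size_t)n >= outsize)
			break;		/* keep room for the terminating NUL */
		memcpy(out + used, buf, (size_t)n);
		used += (size_t)n;
	}
	out[used] = '\0';
	return (used);
}
```

The substitution branch is the only place that would change if you
preferred a \-escape to '?'.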

Feel free to copy any of my code from libarchive if it helps.

Tim


More information about the freebsd-hackers mailing list