Basic UTF-8 support for sh(1)

Fri Feb 25 14:53:13 UTC 2011

Here is a patch that adds basic UTF-8 support to sh(1). It takes effect
only when the current locale uses UTF-8.

Features:
* ${#var} counts codepoints. (Really, it counts bytes with
  (b & 0xc0) != 0x80, that is, every byte that is not a UTF-8
  continuation byte.)
* ?, [...] patterns match codepoints instead of bytes, so that ${var#?}
  removes the first codepoint, not the first byte. They do not match
  invalid sequences. However, * continues to match any string and an
  invalid sequence matches an identical invalid sequence. (This differs
  from fnmatch(3).)
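
The counting rule for ${#var} can be sketched in a few lines of C. This
is a minimal illustration of the (b & 0xc0) != 0x80 test, not code from
the patch; the name utf8_count is hypothetical.

```c
#include <stddef.h>

/*
 * Count the bytes of s that are not UTF-8 continuation bytes
 * (bit pattern 10xxxxxx). For well-formed UTF-8 this equals the
 * number of codepoints. Hypothetical helper, not from the patch.
 */
size_t
utf8_count(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xc0) != 0x80)
			n++;
	return n;
}
```

With this rule, a precomposed U+00E9 (bytes c3 a9) counts as one, while
"e" followed by combining U+0301 (bytes 65 cc 81) counts as two. Stray
continuation bytes in invalid input are simply not counted.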

Internal:
* The CTL* bytes are moved to byte values that cannot occur in UTF-8 so
  that mbrtowc(3) can be used directly. The new values do occur in
  iso-8859-* encodings.
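
For reference, the byte values that never appear in well-formed UTF-8
(as defined by RFC 3629) can be tested as below. This is only a sketch
of the general property. Which of these values the patch actually
assigns to the CTL* bytes is not stated in this message, and the name
utf8_never_occurs is hypothetical.

```c
/*
 * Return 1 if byte b can never appear in well-formed UTF-8
 * (RFC 3629). 0xc0 and 0xc1 could only start overlong two-byte
 * forms; 0xf5..0xff would start sequences beyond U+10FFFF or are
 * not lead bytes at all. Hypothetical helper, not from the patch.
 */
int
utf8_never_occurs(unsigned char b)
{
	return b == 0xc0 || b == 0xc1 || b >= 0xf5;
}
```

All of these values are ordinary characters in ISO-8859-1; 0xc0, for
example, is a capital A with grave accent.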

Limitations:
* Only UTF-8 support is added, not any other multibyte encodings. I do
  not want to bloat up sh with mbrtowc(3) and similar everywhere.
* Invalid sequences may not be handled as desired. It seems aborting on
  invalid UTF-8 sequences would break things, so they are let through.
  This also avoids bloating the code up with checking everywhere.
* There is no special treatment for combining characters. An accented
  letter may match ? or ?? or an even longer pattern, depending on the
  normalization form. This matches other code in FreeBSD and is usually
  good enough because normalization forms that use as few codepoints as
  possible tend to be used.
* IFS remains byte-based as in ksh93 (but unlike bash and zsh).
* Our version of libedit does not support UTF-8, so sh will still be
  rather unpleasant to use interactively with characters outside
  US-ASCII.

Is this useful and worth the (small) bloat?

A somewhat related feature is support for \uNNNN and \UNNNNNNNN
sequences in $'...'. This will be added to POSIX (see
http://austingroupbugs.net/view.php?id=249) and I plan to add it to sh.
Ideally, these would be converted using iconv(3). However, as long as
iconv(3) is not unconditionally available in base, or is not supposed
to be used there, the codepoints can be encoded directly in UTF-8 for
UTF-8 locales, leaving other locales with question marks.
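
Encoding a codepoint as UTF-8 without iconv(3) is short. The following
is a minimal sketch of the standard encoding, not code from the patch;
the name utf8_encode and its interface are assumptions made for this
example.

```c
#include <stddef.h>

/*
 * Encode codepoint cp as UTF-8 into buf, which must hold at least
 * 4 bytes. Returns the number of bytes written, or 0 if cp is a
 * surrogate or lies above U+10FFFF. Hypothetical helper.
 */
size_t
utf8_encode(unsigned long cp, unsigned char *buf)
{
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		/* Surrogates are not valid scalar values. */
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0;
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp <= 0x10ffff) {
		buf[0] = (unsigned char)(0xf0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;
}
```

A caller handling \u00e9 in a UTF-8 locale would emit the two bytes
c3 a9. In any other locale, or when the function returns 0, it could
fall back to a question mark as described above.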

-- 
Jilles Tjoelker
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sh-utf8.patch
Type: text/x-diff
Size: 8370 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-hackers/attachments/20110225/2675671c/sh-utf8.bin