Basic UTF-8 support for sh(1)
Jilles Tjoelker
jilles at stack.nl
Fri Feb 25 14:53:13 UTC 2011
Here is a patch that adds basic UTF-8 support to sh(1). This is enabled
if the locale is set appropriately.
Features:
* ${#var} counts codepoints. (Really, it counts the bytes b with
(b & 0xc0) != 0x80, that is, all bytes except UTF-8 continuation bytes.)
* The ? and [...] patterns match codepoints instead of bytes. They do
not match invalid sequences. This is so that ${var#?} removes the first
codepoint, not the first byte. However, * continues to match any
string, and an invalid sequence matches an identical invalid sequence.
(This differs from fnmatch(3).)
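
As a rough illustration of the ${#var} counting rule above, here is a
minimal sketch in C. It is not taken from the patch; the function name
utf8_count is made up for this example. It simply counts every byte that
is not a continuation byte, so for valid UTF-8 the result is the number
of codepoints, and stray continuation bytes in invalid input are not
counted at all.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the bytes in s that are not UTF-8 continuation bytes
 * (continuation bytes have the bit pattern 10xxxxxx).
 * For valid UTF-8 this equals the number of codepoints.
 * utf8_count is a hypothetical name, not a function from the patch.
 */
static size_t
utf8_count(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xc0) != 0x80)
			n++;
	return n;
}
```

With this rule, the two-byte sequence for U+00E9 counts as one and the
three-byte sequence for U+20AC also counts as one.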
Internal:
* CTL* bytes are moved to bytes that cannot occur in UTF-8 so that
mbrtowc(3) can be used directly. The new locations do occur in
iso-8859-* encodings.
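
For reference, the byte values that can never occur in well-formed UTF-8
are 0xc0, 0xc1 and 0xf5 through 0xff. The sketch below only tests for
that set; it does not show the actual CTL* values chosen in the patch,
and the name utf8_never_byte is made up for this example.

```c
#include <stdbool.h>

/*
 * Return true if byte b can never appear in well-formed UTF-8:
 * 0xc0 and 0xc1 could only start overlong two-byte forms, and
 * 0xf5..0xff would start sequences above U+10FFFF or are unused.
 * utf8_never_byte is a hypothetical name; the CTL* values actually
 * used by the patch are not reproduced here.
 */
static bool
utf8_never_byte(unsigned char b)
{
	return b == 0xc0 || b == 0xc1 || b >= 0xf5;
}
```

All of these values are ordinary printable characters in iso-8859-1,
which is why the new locations can still collide with input in those
encodings.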
Limitations:
* Only UTF-8 support is added, not any other multibyte encodings. I do
not want to bloat up sh with mbrtowc(3) and similar everywhere.
* Invalid sequences may not be handled as desired. It seems aborting on
invalid UTF-8 sequences would break things, so they are let through.
This also avoids bloating the code up with checking everywhere.
* There is no special treatment for combining characters: accented
letters may match ? or ?? or even more, depending on normalization
form. This matches other code in FreeBSD and is usually good enough
because normalization forms that use as few codepoints as possible
tend to be used.
* IFS remains byte-based as in ksh93 (but unlike bash and zsh).
* Our version of libedit does not support UTF-8, so sh will still be
rather unpleasant to use interactively with characters outside
US-ASCII.
Is this useful and worth the (small) bloat?
A somewhat related feature is support for \uNNNN and \UNNNNNNNN
sequences in $'...'. This will be added to POSIX (see
http://austingroupbugs.net/view.php?id=249) and I plan to add it to sh.
Ideally, these would be converted using iconv(3), but as long as iconv
is not unconditionally available in base, or is not supposed to be used
there, the codepoints can be encoded in UTF-8 for UTF-8 locales, with
other locales getting question marks instead.
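
Encoding a codepoint directly as UTF-8, without iconv(3), could look
roughly like the following sketch. It is not part of the patch; the name
utf8_encode and the choice to reject surrogates and values above
U+10FFFF by returning 0 are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode codepoint cp as UTF-8 into buf, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * (U+D800..U+DFFF) or above U+10FFFF.
 * utf8_encode is a hypothetical name, not a function from the patch.
 */
static size_t
utf8_encode(unsigned long cp, unsigned char *buf)
{
	assert(buf != NULL);
	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xc0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp >= 0xd800 && cp <= 0xdfff)
		return 0;
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xe0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp <= 0x10ffff) {
		buf[0] = (unsigned char)(0xf0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;
}
```

A caller handling \uNNNN in a non-UTF-8 locale would skip this and emit
a question mark, as described above.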
--
Jilles Tjoelker
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sh-utf8.patch
Type: text/x-diff
Size: 8370 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-hackers/attachments/20110225/2675671c/sh-utf8.bin