[Bug 243229] awk in base system does not work with UTF-8 strings correctly

Fri Jan 10 01:47:08 UTC 2020

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243229

--- Comment #1 from Conrad Meyer <cem at freebsd.org> ---
I'm not sure it makes sense to compute length() on UTF-8 strings as Unicode
codepoints.  That said, POSIX awk is somewhat clear that you're correct:

> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.

However, the resulting behavior around indexing is nutty: this implies that
the offsets used by index(), match(), etc., are measured in *characters*.  To
do this efficiently one probably has to convert non-ASCII strings to wchar_t
and operate on those.  As you can imagine, that would immensely slow down awk
as a fast stream processing utility.
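
To make the byte/character distinction concrete, here is a small sketch in
Python (not awk, and not a claim about how any awk implementation works).  The
sample string and the awk_index() helper are made up for illustration, and
Python's decoded str stands in for a wchar_t conversion:

```python
# Compare byte-oriented and character-oriented results for one UTF-8 string.
# Hypothetical sample: "naive cafe" with two accented letters, written with
# escapes so the example does not depend on the source file's encoding.
text = "na\u00efve caf\u00e9"      # decoded form: one element per codepoint
raw = text.encode("utf-8")         # encoded form: one element per byte


def awk_index(haystack, needle):
    """Mimic awk's index(): 1-based position, 0 when not found."""
    return haystack.find(needle) + 1


byte_length = len(raw)             # what a byte-oriented length() reports
char_length = len(text)            # what a codepoint-oriented length() reports

byte_index = awk_index(raw, "caf\u00e9".encode("utf-8"))
char_index = awk_index(text, "caf\u00e9")

print("length:", byte_length, "bytes vs", char_length, "characters")
print("index: ", byte_index, "in bytes vs", char_index, "in characters")
```

The two accented letters each take two bytes in UTF-8, so every byte offset
after the first one is shifted relative to the character offset.  Getting
character offsets requires either decoding the string first (as above) or
walking it byte by byte, which is the extra work discussed in the paragraph
above.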

POSIX is more explicit about toupper() and tolower(), where taking locale into
consideration is easier.

I guess I'm not clear on what value there is in a length() function that
operates on codepoints rather than bytes.
