[Bug 243229] awk in base system does not work with UTF-8 strings correctly
bugzilla-noreply at freebsd.org
Fri Jan 10 01:47:08 UTC 2020
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243229
--- Comment #1 from Conrad Meyer <cem at freebsd.org> ---
I'm not sure it makes sense to compute length() on UTF-8 strings in Unicode
codepoints. That said, POSIX awk is somewhat clear that you're correct:
> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.
However, the resulting behavior around indexing is nutty: this implies that
the results of index(), match(), etc., are measured in *characters*, not
bytes. To do this efficiently one probably has to convert non-ASCII strings
to wchar_t and operate on those. As you can imagine, that would immensely
slow down awk as a fast stream processing utility.
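To make the byte/character distinction concrete, here is a minimal sketch in
Python (not awk, and not from the bug report). The sample string and the
helper names are made up for illustration; the index helpers assume awk's
convention of 1-based positions with 0 meaning "not found".

```python
# Illustrative only: compare byte-based and character-based results for
# the kind of work awk's length() and index() do. All names here are
# hypothetical helpers, not part of any awk implementation.

def byte_length(s: str) -> int:
    """Length of the UTF-8 encoding (the byte-oriented answer)."""
    return len(s.encode("utf-8"))

def char_length(s: str) -> int:
    """Length in Unicode code points (the character-oriented answer)."""
    return len(s)

def byte_index(s: str, t: str) -> int:
    """1-based byte offset of t in s, or 0 if t does not occur."""
    return s.encode("utf-8").find(t.encode("utf-8")) + 1

def char_index(s: str, t: str) -> int:
    """1-based code point offset of t in s, or 0 if t does not occur."""
    return s.find(t) + 1

# U+00EF (i with diaeresis) takes two bytes in UTF-8.
sample = "na\u00efve"

print(byte_length(sample), char_length(sample))          # 6 5
print(byte_index(sample, "v"), char_index(sample, "v"))  # 5 4
```

The two answers diverge as soon as a multi-byte character appears before the
position of interest, which is why character-based index() and match() need
either a decoded copy of the string or a scan from its start.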
POSIX is more explicit about toupper() and tolower(), where taking locale into
consideration is easier.
I guess I'm not clear on what value there is in a length() function that
operates on codepoints rather than bytes.