Uppercase RE matching problems in FreeBSD 11
Stefan Bethke
stb at lassitu.de
Sun Nov 6 21:49:54 UTC 2016
On 06.11.2016, at 22:27, Baptiste Daroussin <bapt at FreeBSD.org> wrote:
>
>> But under what circumstances would [A-Z] mean anything other than a character whose Unicode codepoint is between U+0041 and U+005A, inclusive? Especially given the locale in the example is en_US.UTF-8. Or, put another way, why would an implementation interpret [A-Z] as anything other than [ABCDE…XYZ]?
>
> The collation rules for Unicode come from http://cldr.unicode.org/ and they do
> match the ones on Linux, for example, and the ones on illumos.
>
> Some GNU tools explicitly decide to be non-locale-aware to avoid that
> kind of "surprise".
>>
>> From reading your reference, I can see in 9.3.5.7:
>>> In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior[…]
>>
>> So even if the observed behaviour is conforming, I’d think it’s still highly undesirable.
>>
> That works for the POSIX locale, aka C, aka the ASCII-only world.
So what do I set my LANG and LC variables to? I do want UTF-8, but I do also want my scripts to continue to work. Clearly, en_US.UTF-8 is not what I want. Is it C.UTF-8? Or do I set LANG=en_US.UTF-8 and LC_COLLATE=C?
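The two readings of [A-Z] at issue here can be illustrated with a small sketch. It is not how any particular regex implementation works; it only compares a range taken over codepoint order (as in the C/POSIX locale) with a range taken over a collation sequence. The interleaved sequence (a A b B ... z Z) is a simplified assumption standing in for a CLDR-style ordering, and the names CODEPOINT_ORDER, INTERLEAVED_ORDER and range_members are made up for the illustration.

```python
import string

# Codepoint order, as used by the C/POSIX locale for ASCII:
# 'A'..'Z' followed by 'a'..'z'.
CODEPOINT_ORDER = sorted(string.ascii_letters)

# ASSUMPTION: a simplified, hypothetical collation sequence that
# interleaves cases (a A b B ... z Z). Real CLDR-derived tables are
# far larger and more subtle than this.
INTERLEAVED_ORDER = [
    c
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for c in pair
]


def range_members(first, last, sequence):
    """Return the elements of `sequence` from `first` to `last`, inclusive."""
    start = sequence.index(first)
    end = sequence.index(last)
    return sequence[start:end + 1]


# "[A-Z]" read as a range over each ordering.
c_range = range_members("A", "Z", CODEPOINT_ORDER)
collated_range = range_members("A", "Z", INTERLEAVED_ORDER)

print("".join(c_range))
print("".join(collated_range))
```

Under the assumed interleaved sequence, the range contains every lowercase letter except "a", which would be consistent with the kind of surprise discussed in this thread. Under codepoint order it contains only the 26 uppercase letters.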
Stefan
--
Stefan Bethke <stb at lassitu.de> Fon +49 151 14070811