Uppercase RE matching problems in FreeBSD 11

Tue Nov 8 20:07:05 UTC 2016

On Nov 8, 2016, at 11:54 AM, Stefan Ehmann <shoesoft at gmx.net> wrote:
> On 07.11.2016 22:13, Charles Swiger wrote:
>> On Nov 6, 2016, at 1:49 PM, Stefan Bethke <stb at lassitu.de> wrote:
>>> On 06.11.2016 at 22:27, Baptiste Daroussin
>>> <bapt at FreeBSD.org> wrote:
>>>> That works for POSIX locale aka C aka ASCII only world
>>> 
>>> So what do I set my LANG and LC variables to?  I do want UTF-8, but
>>> I do also want my scripts to continue to work.  Clearly,
>>> en_US.UTF-8 is not what I want.  Is it C.UTF-8?  Or do I set
>>> LANG=en_US.UTF-8 and LC_COLLATE=C?
>> 
>> If you want to use a UTF8 locale, then you must start using character
>> classes like '[:upper:]' and '[:lower:]' because those will-- or at
>> least "should", modulo bugs-- properly handle the collation issues
>> including for languages which do not possess a 1-1 mapping between
>> upper and lower case letters.
>> 
>> Someone with a German email address is presumably familiar with ß /
>> Eszett...?  :-)
> 
> Character classes work fine for [a-z], but I don't know of a simple way
> to match a range like [a-k].

True.  If you need smaller ranges, I don't see a portable way of doing
so in any locale other than POSIX / "C", beyond listing the characters
out explicitly (e.g. '[abcdefghijk]').  Or:

> Personally, I prefer the "Rational Range Interpretation" because it
> doesn't break backward compatibility and is still standard compliant.

...yes, +1.  Many of the GNU tools like grep and gawk have adopted this,
but they are replacing the system regex routines with their own code.

However, you can't rely on RRI without first testing whether there is a
gawk in the $PATH, or whether /usr/bin/awk (or whichever awk you end up
invoking) is really GNU awk.
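
As a rough sketch of the two fallbacks discussed above (spelling a
sub-range out by hand, and checking for GNU awk before trusting RRI),
something like the following might do.  The function and variable names
are made up for illustration, and the check assumes that GNU awk
identifies itself with a "GNU Awk" banner when given --version; other
awks either print something else or reject the option.

```sh
#!/bin/sh
# Sketch only: a_to_k, starts_with_a_to_k and is_gnu_awk are
# hypothetical names, not part of any existing tool.

# Fallback 1: list the characters explicitly, so the match does not
# depend on how the current locale collates a range such as [a-k].
a_to_k='[abcdefghijk]'

starts_with_a_to_k() {
    # Succeeds if the first character of $1 is one of a..k (lowercase).
    printf '%s\n' "$1" | grep -q "^${a_to_k}"
}

# Fallback 2: only rely on RRI after confirming the awk is GNU awk.
# $1 is the command to probe and defaults to "awk".  stdin is taken
# from /dev/null so an awk that ignores --version cannot block on input.
is_gnu_awk() {
    "${1:-awk}" --version </dev/null 2>/dev/null | grep -q 'GNU Awk'
}

if is_gnu_awk awk; then
    echo "awk is GNU awk"
else
    echo "awk is not GNU awk; use explicit lists or character classes"
fi
```

The same probe can be pointed at a specific binary, for example
"is_gnu_awk /usr/bin/awk", before a script decides which form of
bracket expression to use.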

Regards,
-- 
-Chuck