Re: Confusion with grep & locale?

From: Stefan Esser <se_at_freebsd.org>
Date: Fri, 20 Aug 2021 12:47:11 UTC
On 20.08.21 at 11:03, Helge Oldach wrote:
> Hi all,
> 
> I'm confused about the FreeBSD behaviour with respect to locales
> and grep - specifically, it seems case sensitivity is not handled
> consistently when grepping character ranges. It looks to me like 11 and
> 13 are not behaving consistently; however, I'm unclear why.
> 
> # uname -a
> FreeBSD 11STABLE 11.4-STABLE FreeBSD 11.4-STABLE #1059 r368289M: Thu Dec  3 01:48:30 UTC 2020     root@XXX  amd64
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> bla
> Bla

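This is not unexpected, since the default collating sequence for many UTF-8
locales has lower case letters precede their upper case versions,
i.e.: "aAbBcC...". With that ordering, the range [A-Z] starts after "a" and
ends at "Z", so it includes every lower case letter from "b" to "z".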

	https://developer.mimer.com/services/sql-unicode-collation-charts/

Here is a collation chart for English:

	https://download.mimer.com/pub/developer/charts/english.htm

But POSIX makes no guarantees about range expressions in locales other
than POSIX or C; there the behaviour is unspecified.
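
To make the effect of such an ordering concrete, here is a minimal sketch
that evaluates a range against a hand-written "aAbB...zZ" sequence. The
sequence string and the helper names pos and in_range are made up for
illustration; a real locale defines far more collating elements and this
is not how grep or libc implement ranges.

```shell
#!/bin/sh
# Illustration only: a hand-written sequence, NOT read from any real locale.
SEQ='aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ'

# pos CHAR: print the 1-based position of CHAR in SEQ (0 if absent).
pos() {
    prefix=${SEQ%%"$1"*}          # everything before the first CHAR
    if [ "$prefix" = "$SEQ" ]; then
        echo 0
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

# in_range CHAR LO HI: succeed if CHAR sorts between LO and HI, inclusive.
in_range() {
    p=$(pos "$1"); lo=$(pos "$2"); hi=$(pos "$3")
    [ "$p" -ne 0 ] && [ "$p" -ge "$lo" ] && [ "$p" -le "$hi" ]
}

for c in a b z A Z; do
    if in_range "$c" A Z; then
        echo "$c: inside [A-Z]"
    else
        echo "$c: outside [A-Z]"
    fi
done
```

Under this assumed ordering only "a" falls outside [A-Z], which is why
"bla" (containing "b" and "l") matches.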

> # uname -a
> FreeBSD 13STABLE 13.0-STABLE FreeBSD 13.0-STABLE #49 stable/13-n246779-64085efb677-dirty: Mon Aug 16 08:42:53 CEST 2021     root@XXX  amd64
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> bla
> Bla

This one is unexpected: the upper case letters should form a range of
their own that does not include any lower case letters.

> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla

Correct.

> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla

Here I would have expected the result you got with en_US.ISO8859-1 ...

> For comparison, a Linux RHEL box delivers the expected results:
> 
> # uname -a
> Linux rhel.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
> # export LANG=en_US.ISO8859-1
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=C
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla
> # export LANG=en_US.UTF-8
> # (echo bla; echo Bla) | grep '[A-Z]'
> Bla

Seems that this version uses a POSIX style collating sequence for UTF-8.
It would be interesting to test with ranges that contain accented
characters or German Umlaut characters.

> There is nothing special in the environment, specifically no LC_xxx nor
> MM_CHARSET in either case.

LANG sets the default for LC_COLLATE, unless LC_COLLATE or LC_ALL
overrides it.
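
The precedence can be written down as a small sketch. The function name
effective_collate is hypothetical, and falling back to "C" when nothing
is set is an assumption (the real default is implementation-defined):

```shell
#!/bin/sh
# effective_collate: print the locale name that governs LC_COLLATE,
# following the precedence LC_ALL > LC_COLLATE > LANG.
# An empty variable counts as unset; "C" is an assumed stand-in for the
# implementation-defined default when none of the three is set.
effective_collate() {
    echo "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

( unset LC_ALL LC_COLLATE; LANG=en_US.UTF-8; effective_collate )     # en_US.UTF-8
( unset LC_ALL; LANG=en_US.UTF-8; LC_COLLATE=C; effective_collate )  # C
```

On a real system the locale(1) utility, run without arguments, reports
the values actually in effect.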

> Any guidance is appreciated... Thanks!

Definitely a bug in the definition of the collating sequences.

And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
to be within [a-z], while de_DE.UTF-8 does not (but should).

It seems that the collating sequences for ISO8859-1 and UTF-8 have been
swapped, with each locale using the one that belongs to the other.


Some platforms have switched to the POSIX style collating sequence so
that the traditional [A-Z] keeps working as a synonym for [[:upper:]],
since a lot of shell scripts have been written with that assumption for
decades.
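
A script that depends on the traditional meaning of [A-Z] can protect
itself by pinning the locale for the one command that uses the range.
A minimal sketch, with has_ascii_upper as a made-up helper name:

```shell
#!/bin/sh
# has_ascii_upper: succeed if stdin contains an ASCII upper case letter.
# LC_ALL=C applies to this one grep invocation only, so [A-Z] is
# evaluated in byte order no matter what LANG the caller has set.
has_ascii_upper() {
    LC_ALL=C grep -q '[A-Z]'
}

for word in bla Bla; do
    if printf '%s\n' "$word" | has_ascii_upper; then
        echo "$word: contains A-Z"
    else
        echo "$word: no A-Z"
    fi
done
```

The trade-off is that the C locale only knows ASCII letters, so accented
upper case characters are not covered by this check.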

BTW, character classes work for your examples and more:

# (echo bla; echo Bla) | LANG=en_US.ISO8859-1 grep '[[:upper:]]'
Bla
# (echo bla; echo Bla) | LANG=en_US.UTF-8 grep '[[:upper:]]'
Bla

# (echo "o"; echo "ö") | LANG=de_DE.ISO8859-1 grep '[[:lower:]]'
o
# (echo "o"; echo "ö") | LANG=de_DE.UTF-8 grep '[[:lower:]]'
o
ö
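
For anyone who wants to repeat these checks on other systems, here is a
rough sketch of a probe script. The function name probe is made up, the
locale names are the FreeBSD spellings used in this thread (other
systems may spell them differently, in which case the probe is skipped),
and the "ö" is produced with printf octal escapes so that each locale
gets the byte sequence of its own encoding. No particular result is
implied for the non-C locales; that depends on the installed locale data.

```shell
#!/bin/sh
# probe LOCALE FORMAT: feed the bytes produced by printf FORMAT to grep
# under LOCALE and report whether they match [a-z] and [[:lower:]].
probe() {
    case $1 in
        C|POSIX) ;;   # always available
        *)
            # Skip locales that are not installed instead of silently
            # testing whatever fallback the system would pick.
            if ! locale -a 2>/dev/null | LC_ALL=C grep -qxF -- "$1"; then
                echo "$1: not installed, skipped"
                return 0
            fi
            ;;
    esac
    for expr in '[a-z]' '[[:lower:]]'; do
        if printf "$2\n" | LC_ALL=$1 grep -q "$expr" 2>/dev/null; then
            echo "$1 $expr: match"
        else
            echo "$1 $expr: no match"
        fi
    done
}

probe C 'o'
probe de_DE.ISO8859-1 '\366'       # "ö" as a single Latin-1 byte
probe de_DE.UTF-8 '\303\266'       # "ö" as two UTF-8 bytes
```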

Regards, STefan