Re: Confusion with grep & locale?

From: Warner Losh <imp_at_bsdimp.com>
Date: Fri, 20 Aug 2021 15:09:03 UTC
On Fri, Aug 20, 2021 at 8:19 AM Helge Oldach <freebsd@oldach.net> wrote:

> Stefan Esser wrote on Fri, 20 Aug 2021 14:47:11 +0200 (CEST):
> > Am 20.08.21 um 11:03 schrieb Helge Oldach:
> > But POSIX makes no guarantees for locales other than POSIX or C.
>
> OK, thanks for the explanation. That clarifies a lot for me. Although
> it's not really POLA. :-)
>
> Thanks a lot also to Stefan Ehmann for the pointer to gawk oddities.
>
> > > # export LANG=en_US.ISO8859-1
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > bla
> > > Bla
> >
> > This one is unexpected, the upper case should be a range of its own
> > and should not include any lower case letters.
> >
> > > # export LANG=en_US.UTF-8
> > > # (echo bla; echo Bla) | grep '[A-Z]'
> > > Bla
> >
> > Here I had expected the result you got with en_US.ISO8859-1 ...
>
> > Definitely a bug in the definition of the collating sequences.
> >
> > And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
> > to be within [a-z], while de_DE.UTF-8 does not (but should).
> >
> > Seems that the correct collating sequences for ISO8859-1 and UTF-8 are
> > each assigned to the other one.
>
> PR 257972 raised.
>

I've looked at that, and I don't think it's a bug, since POSIX leaves the
behavior of range expressions unspecified outside the POSIX (C) locale.


> > > There is nothing special in the environment, specifically no LC_xxx nor
> > > MM_CHARSET in either case.
> >
> > LANG defines LC_COLLATE, unless overridden.
>
> Indeed. I just explicitly mentioned *no* LC_xxx to clarify that it's not
> overridden. :-)
>
> > BTW, character classes work for your examples and more:
>
> Certainly they do. But they are harder to type... :-)
>

I think that a range like A-Za-z is unspecified outside the C locale, but
a character class such as [:alpha:] is well defined. Most shell scripts
use the 'C' locale for this very reason.
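
A minimal sketch of that approach, using the input from the example
quoted above. The sample() helper and the variable names are made up for
illustration; the only assumption is a POSIX sh and grep. Setting LC_ALL
for the single grep command overrides LANG and any LC_* variable for that
command only.

```shell
#!/bin/sh
# Hypothetical helper: emits the two test lines used earlier in the thread.
sample() { printf 'bla\nBla\n'; }

# Force the C locale for this one grep invocation. LC_ALL takes precedence
# over LANG and LC_COLLATE, and in the C locale the range A-Z covers only
# the 26 ASCII uppercase letters.
range_c=$(sample | LC_ALL=C grep '[A-Z]')

# The character class form: [:upper:] is defined by the locale's character
# classification rather than by its collating sequence.
class_c=$(sample | LC_ALL=C grep '[[:upper:]]')

printf 'range in C locale: %s\n' "$range_c"
printf 'class in C locale: %s\n' "$class_c"
```

Both commands should match only "Bla" here. Dropping LC_ALL=C from the
second command and relying on [[:upper:]] alone is the variant that stays
meaningful under other locales, although what counts as an uppercase
letter then depends on the locale in use.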

Warner