Re: find(1): I18N gone wild ?

From: Yuri <yuri_at_aetern.org>
Date: Fri, 21 Apr 2023 18:03:30 UTC
Mark Millard wrote:
> Dimitry Andric <dim_at_FreeBSD.org> wrote on
> Date: Fri, 21 Apr 2023 10:38:05 UTC :
> 
>> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
>>> From: Poul-Henning Kamp <phk@phk.freebsd.dk>
>>> Date: Monday, 17 April 2023 23:06
>>> To: current@freebsd.org
>>> Subject: find(1): I18N gone wild ?
>>> This surprised me:
>>>
>>> # mkdir /tmp/P
>>> # cd /tmp/P
>>> # touch FOO
>>> # touch bar
>>> # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
>>> ./FOO
>>> # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
>>> ./FOO
>>> ./bar
>>>
>>> Really ?!
>> ...
>>> My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents remark.
>>
>> Same here. However, I have read that with unicode, you should *never*
>> use [A-Z] or [0-9], but character classes instead. That seems to give
>> both files on macOS and Linux with [[:alpha:]]:
>>
>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
>> ./BAR
>> ./foo
>>
>> and only the lowercase file with [[:lower:]]:
>>
>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
>> ./foo
>>
>> But on FreeBSD, these don't work at all:
>>
>> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
>> <nothing>
>>
>> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
>> <nothing>
>>
>> This is an interesting rabbit hole... :)
> 
> FreeBSD:
> 
>      -name pattern
>              True if the last component of the pathname being examined matches
>              pattern.  Special shell pattern matching characters (“[”, “]”,
>              “*”, and “?”) may be used as part of pattern.  These characters
>              may be matched explicitly by escaping them with a backslash
>              (“\”).
> 
> I conclude that [[:alpha:]] and [[:lower:]] were not
> considered "Special shell pattern"s. "man glob"
> indicates it is a shell specific builtin.
> 
> macOS says similarly. Different shells, different
> pattern notations and capabilities? Well, "man bash"
> reports:
[snip]
> Seems like: pick your shell (as shown by echo $SHELL) and
> that picks the pattern match rules used. (May be controllable
> in the specific shell.)

No, the pattern is not passed to the shell, and which shell you use should
not matter (as long as the pattern is properly quoted or escaped, so the
shell hands it to find unchanged).  The rules are here:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13

...which in turn refers to the following link for bracket expressions:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
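
To see that the shell stays out of the picture, here is a rough sketch that
hands a quoted pattern to find in a scratch directory.  try_pattern is a
made-up helper name, the scratch directory template is my own choice, and I
use LC_ALL rather than LANG only so that any inherited LC_* settings cannot
interfere.  In the C locale the range is plain byte order, so only the
uppercase name should match; what other locales and the [[:class:]] forms
give depends on your libc.

```shell
#!/bin/sh
# try_pattern is a hypothetical helper for this sketch, not part of find(1).
# $1 = value for LC_ALL, $2 = pattern handed verbatim to find -name
try_pattern() {
    (
        # Scratch directory holding the two files from the original report.
        dir=$(mktemp -d "${TMPDIR:-/tmp}/findpat.XXXXXX") || exit 1
        cd "$dir" || exit 1
        touch FOO bar
        # "$2" is passed through unchanged: find does the matching itself,
        # the shell never expands the pattern.
        LC_ALL=$1 find . -name "$2" -print | sort
        cd / && rm -rf "$dir"
    )
}

# Single quotes at the call site keep the shell from globbing the pattern.
try_pattern C '[A-Z]*'
```

Running the same call under sh, bash, or zsh should give the same result,
since the quoted pattern reaches find untouched in each case.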

Why we don't support everything those rules require is a different story.