Re: find(1): I18N gone wild ?
Date: Fri, 21 Apr 2023 17:41:45 UTC
Dimitry Andric <dim_at_FreeBSD.org> wrote on
Date: Fri, 21 Apr 2023 10:38:05 UTC :
> On 21 Apr 2023, at 12:01, Ronald Klop <ronald-lists@klop.ws> wrote:
> > From: Poul-Henning Kamp <phk@phk.freebsd.dk>
> > Date: Monday, 17 April 2023 23:06
> > To: current@freebsd.org
> > Subject: find(1): I18N gone wild ?
> > This surprised me:
> >
> > # mkdir /tmp/P
> > # cd /tmp/P
> > # touch FOO
> > # touch bar
> > # env LANG=C.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > # env LANG=en_US.UTF-8 find . -name '[A-Z]*' -print
> > ./FOO
> > ./bar
> >
> > Really ?!
> ...
> > My Mac and a Linux server only give ./FOO in both cases. Just a 2 cents remark.
>
> Same here. However, I have read that with unicode, you should *never*
> use [A-Z] or [0-9], but character classes instead. That seems to give
> both files on macOS and Linux with [[:alpha:]]:
>
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> ./BAR
> ./foo
>
> and only the lowercase file with [[:lower:]]:
>
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> ./foo
>
> But on FreeBSD, these don't work at all:
>
> $ LANG=en_US.UTF-8 find . -name '[[:alpha:]]*' -print
> <nothing>
>
> $ LANG=en_US.UTF-8 find . -name '[[:lower:]]*' -print
> <nothing>
>
> This is an interesting rabbit hole... :)
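For anyone who wants to repeat the comparison quoted above, here is a minimal sketch. The scratch directory from mktemp and the helper name "match" are my own assumptions (the original used /tmp/P), and the extra locales may not be installed everywhere. Only the C-locale range result is something I would expect to be the same on every platform; the rest is exactly what is in question in this thread.

```shell
#!/bin/sh
# Recreate the two test files in a scratch directory.
# (mktemp -d is an assumption; the original report used /tmp/P.)
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
touch FOO bar

# match LOCALE PATTERN: hypothetical helper that runs find with the
# given locale and a quoted -name pattern, sorting the output.
match() {
    LC_ALL=$1 find . -name "$2" -print | sort
}

# Baseline: in the C locale the range [A-Z] covers only ASCII capitals.
echo "C, [A-Z]*:"
match C '[A-Z]*'

# The cases under discussion; results differ between platforms.
for loc in C.UTF-8 en_US.UTF-8; do
    for pat in '[A-Z]*' '[[:upper:]]*' '[[:lower:]]*'; do
        echo "$loc, $pat:"
        match "$loc" "$pat"
    done
done
```

Comparing the output of the loop across FreeBSD, macOS and Linux shows the differences reported above in one place.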
FreeBSD:
-name pattern
True if the last component of the pathname being examined matches
pattern. Special shell pattern matching characters (“[”, “]”,
“*”, and “?”) may be used as part of pattern. These characters
may be matched explicitly by escaping them with a backslash
(“\”).
I conclude that [[:alpha:]] and [[:lower:]] are not
counted among the "Special shell pattern matching
characters" there. "man glob" indicates that glob is a
shell-specific builtin. macOS says much the same. Do
different shells mean different pattern notations and
capabilities? Well, "man bash" reports:
QUOTE
Pattern Matching
. . .
Within [ and ], character classes can be specified using the syntax [:class:], where class is one of the following classes defined in the POSIX standard:
alnum alpha ascii blank cntrl digit graph lower print punct space upper word xdigit
A character class matches any character belonging to that class. The word character class matches letters, digits, and the character _.
Within [ and ], an equivalence class can be specified using the syntax [=c=], which matches all characters with the same collation weight (as defined by the current locale) as the
character c.
Within [ and ], the syntax [.symbol.] matches the collating symbol symbol.
END QUOTE
"man zsh" does not document patterns but:
sh-3.2$ echo $SHELL
/bin/zsh
sh-3.2$ find . -name '[[:lower:]]*' -print
./bar
% ls -Tldt /bin/*sh
-r-xr-xr-x 1 root wheel 1326688 Feb 9 01:39:53 2023 /bin/bash
-rwxr-xr-x 2 root wheel 1153216 Feb 9 01:39:53 2023 /bin/csh
-rwxr-xr-x 1 root wheel 307232 Feb 9 01:39:53 2023 /bin/dash
-r-xr-xr-x 1 root wheel 2598864 Feb 9 01:39:53 2023 /bin/ksh
-rwxr-xr-x 1 root wheel 134000 Feb 9 01:39:53 2023 /bin/sh
-rwxr-xr-x 2 root wheel 1153216 Feb 9 01:39:53 2023 /bin/tcsh
-rwxr-xr-x 1 root wheel 1377616 Feb 9 01:39:53 2023 /bin/zsh
But in each of them, even bash:
% echo $SHELL
/bin/zsh
Since "find" is not part of the kernel, its behavior
may vary across Linux distributions.
Picking one . . .
openSUSE tumbleweed:
-name pattern
Base of file name (the path with the leading directories removed) matches shell pattern pattern. Because the leading directories are removed, the file names considered for a match
with -name will never include a slash, so `-name a/b' will never match anything (you probably need to use -path instead). A warning is issued if you try to do this, unless the
environment variable POSIXLY_CORRECT is set. The metacharacters (`*', `?', and `[]') match a `.' at the start of the base name (this is a change in findutils-4.2.2; see section
STANDARDS CONFORMANCE below). To ignore a directory and the files under it, use -prune rather than checking every file in the tree; see an example in the description of that action.
Braces are not recognised as being special, despite the fact that some shells including Bash imbue braces with a special meaning in shell patterns. The filename matching is
performed with the use of the fnmatch(3) library function. Don't forget to enclose the pattern in quotes in order to protect it from expansion by the shell.
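The quoting advice at the end of that excerpt can be checked without involving find at all. A minimal sketch: "show_args" is a hypothetical helper that just prints what the shell hands to a command, the scratch directory is an assumption, and the C locale is forced so the unquoted range behaves the same everywhere.

```shell
#!/bin/sh
# Force the C locale so the unquoted [a-z] range is ASCII-only.
LC_ALL=C; export LC_ALL

dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
touch FOO bar

# show_args: hypothetical stand-in for find; prints each argument
# it receives on its own line, wrapped in <>.
show_args() {
    printf '<%s>\n' "$@"
}

# Quoted: the pattern reaches the command unchanged.
show_args -name '[a-z]*'

# Unquoted: the shell expands the pattern against the current
# directory first, so the command sees a file name instead.
show_args -name [a-z]*
```

In the quoted case the command receives the pattern itself; in the unquoted case the shell has already replaced it with "bar" before the command starts.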
"man 3 fnmatch" says:
The fnmatch() function checks whether the string argument matches the pattern argument, which is a shell wildcard pattern (see glob(7)).
"man 7 glob" (not shell specific) in turn has a section on
"Character classes and internationalization" that reports:
QUOTE
. . .
. . . Therefore, POSIX extended the bracket notation greatly,
both for wildcard patterns and for regular expressions. In the above we saw three types of items that can occur in a bracket expression: namely (i) the negation, (ii) explicit single
characters, and (iii) ranges. POSIX specifies ranges in an internationally more useful way and adds three more types:
(iii) Ranges X-Y comprise all characters that fall between X and Y (inclusive) in the current collating sequence as defined by the LC_COLLATE category in the current locale.
(iv) Named character classes, like
[:alnum:] [:alpha:] [:blank:] [:cntrl:]
[:digit:] [:graph:] [:lower:] [:print:]
[:punct:] [:space:] [:upper:] [:xdigit:]
so that one can say "[[:lower:]]" instead of "[a-z]", and have things work in Denmark, too, where there are three letters past 'z' in the alphabet. These character classes are defined by
the LC_CTYPE category in the current locale.
(v) Collating symbols, like "[.ch.]" or "[.a-acute.]", where the string between "[." and ".]" is a collating element defined for the current locale. Note that this may be a multicharacter
element.
(vi) Equivalence class expressions, like "[=a=]", where the string between "[=" and "=]" is any collating element from its equivalence class, as defined for the current locale. For
example, "[[=a=]]" might be equivalent to "[aáàäâ]", that is, to "[a[.a-acute.][.a-grave.][.a-umlaut.][.a-circumflex.]]".
END QUOTE
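The named classes from that excerpt can be tried directly in a shell "case" statement, which takes the same wildcard notation. A minimal sketch: "classify" is a hypothetical helper name, the sample names are made up for illustration, and the C locale is forced so LC_CTYPE gives the plain ASCII classes.

```shell
#!/bin/sh
# Force the C locale so the classes are the ASCII ones.
LC_ALL=C; export LC_ALL

# classify NAME: hypothetical helper that reports which named
# character class the first character of NAME falls into.
classify() {
    case $1 in
        [[:upper:]]*) echo upper ;;
        [[:lower:]]*) echo lower ;;
        [[:digit:]]*) echo digit ;;
        *)            echo other ;;
    esac
}

for name in FOO bar 9lives _tmp; do
    printf '%s: %s\n' "$name" "$(classify "$name")"
done
```

Changing LC_ALL to another installed locale and feeding in non-ASCII names is a quick way to see where LC_CTYPE starts to matter.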
# file /usr/bin/sh
/usr/bin/sh: symbolic link to bash
It seems that picking your shell (as shown by echo $SHELL)
also picks the pattern-matching rules used. (That may be
controllable within the specific shell.)
===
Mark Millard
marklmi at yahoo.com