Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z

In reply to: Yuri : "Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z"
From: Yuri <yuri_at_aetern.org>
Date: Fri, 21 Apr 2023 20:05:55 UTC
Yuri wrote:
> parv/FreeBSD wrote:
>> Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
>> (via
>> https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html )
>>>
>>> ... However, I have read that with unicode, you should *never*
>>> use [A-Z] or [0-9], but character classes instead. That seems to give
>>> both files on macOS and Linux with [[:alpha:]]:
>> ...
>>
>> Subject to the locale, the problem with that is that "[[:alpha:]]" will
>> match more than the 26 English letters "A" through "Z" (as well as
>> lower case "a" through "z"), even if none of the 26 * 2 English
>> letters appear in a string.
> 
> (replying to random recent message)
> 
> And there is a bit of fairly recent history for fnmatch() related to
> [a-z]; the same was done for regex with the same outcome: the attempt to
> make the [a-z] (and, I guess, [A-Z]) range non-collating failed.  I am
> not aware of the failures encountered; hopefully someone remembers:

I just tried a less intrusive change that seems to help with these ranges
(though the question of what failed previously remains open):

diff --git a/lib/libc/gen/fnmatch.c b/lib/libc/gen/fnmatch.c
index 40670545993..3234c1aaaa4 100644
--- a/lib/libc/gen/fnmatch.c
+++ b/lib/libc/gen/fnmatch.c
@@ -295,10 +295,11 @@ rangematch(const char *pattern, wchar_t test, int flags, char **newp,
                        if (flags & FNM_CASEFOLD)
                                c2 = towlower(c2);

-                       if (table->__collate_load_error ?
+                       if (table->__collate_load_error ||
+                           iswascii(test) ?
                            c <= test && test <= c2 :
-                              __wcollate_range_cmp(c, test) <= 0
-                           && __wcollate_range_cmp(test, c2) <= 0
+                           __wcollate_range_cmp(c, test) <= 0 &&
+                           __wcollate_range_cmp(test, c2) <= 0
                           )
                                ok = 1;
                } else if (c == test)
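
For readers who do not have the libc sources at hand, the effect of the
change can be sketched as a standalone function.  This is only an
illustration, not the libc code: is_ascii_wc(), range_cmp() and in_range()
are hypothetical names, wcscoll(3) stands in for the internal
__wcollate_range_cmp(), and the collate_unavailable flag stands in for
table->__collate_load_error.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper: true for code points in the 7-bit ASCII range. */
static bool
is_ascii_wc(wchar_t wc)
{
	return ((unsigned long)wc <= 0x7fUL);
}

/*
 * Assumption: wcscoll(3) on one-character strings is used here as a
 * portable stand-in for the libc-internal __wcollate_range_cmp().
 */
static int
range_cmp(wchar_t a, wchar_t b)
{
	wchar_t s1[2] = { a, L'\0' };
	wchar_t s2[2] = { b, L'\0' };

	return (wcscoll(s1, s2));
}

/*
 * Is 'test' inside the bracket range lo-hi?
 * ASCII test characters (or a locale without collation data) are
 * compared by code point; everything else by collation order.
 */
static bool
in_range(wchar_t lo, wchar_t test, wchar_t hi, bool collate_unavailable)
{
	if (collate_unavailable || is_ascii_wc(test))
		return (lo <= test && test <= hi);
	return (range_cmp(lo, test) <= 0 && range_cmp(test, hi) <= 0);
}
```

With this sketch, in_range(L'a', L'M', L'z', false) is false in any locale,
because 'M' is ASCII and is therefore compared by code point.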

$ LC_ALL=en_US.UTF-8 \
  LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 \
  find . -name '[a-z]*'
./bar
$ LC_ALL=en_US.UTF-8 \
  LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 \
  find . -name '[A-Z]*'
./FOO
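
The same check can probably be made without find(1) by calling fnmatch(3)
directly.  A minimal sketch follows; match_in_locale() is a hypothetical
helper name, and the locale string passed in is assumed to be installed on
the system.

```c
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: switch to the given locale, then test 'name'
 * against the glob 'pattern'.
 * Returns 1 on match, 0 on no match, -1 if the locale cannot be set
 * or fnmatch() reports an error.
 */
static int
match_in_locale(const char *locale, const char *pattern, const char *name)
{
	int r;

	if (setlocale(LC_ALL, locale) == NULL)
		return (-1);
	r = fnmatch(pattern, name, 0);
	if (r == 0)
		return (1);
	if (r == FNM_NOMATCH)
		return (0);
	return (-1);
}
```

In the C locale, match_in_locale("C", "[a-z]*", "FOO") returns 0.  Passing
"en_US.UTF-8" instead exercises the collating code path that the patch
touches, so the result there depends on which libc is loaded.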

> --------
> commit 5a5807dd4ca34467ac5fb458bc19f12bf62075a5
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date:   Sun Jul 10 03:49:38 2016 +0000
> 
> Remove broken support for collation in [a-z] type ranges.
> Only first 256 wide chars are considered currently, all other are just
> dropped from the range. Proper implementation require reverse tables
> database lookup, since objects are really big as max UTF-8 (1114112
> code points), so just the same scanning as it was for 256 chars will
> slow things down.
> 
> POSIX does not require collation for [a-z] type ranges and does not
> prohibit it for non-POSIX locales. POSIX require collation for ranges
> only for POSIX (or C) locale which is equal to ASCII and binary for
> other chars, so we already have it.
> 
> No other *BSD implements collation for [a-z] type ranges.
> 
> Restore ABI compatibility with unused now __collate_range_cmp() which
> is visible from outside (will be removed later).
> --------
> commit 1daad8f5ad767dfe7896b8d1959a329785c9a76b
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date:   Thu Jul 14 08:18:12 2016 +0000
> 
> Back out non-collating [a-z] ranges.
> Instead of changing whole course to another POSIX-permitted way
> for consistency and uniformity I decide to completely ignore missing
> regex functionality and concentrate on fixing bugs in what we have now,
> too many small obstacles instead, counting ports.
> --------
> commit 12eae8c8f346cb459a388259ca98faebdac47038
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date:   Thu Jul 14 09:07:25 2016 +0000
> 
> 1) Eliminate possibility to call __*collate_range_cmp() with incomplete
> locale (which cause core dump) by removing whole 'table' argument
> by which it passed.
> 
> 2) Restore __collate_range_cmp() in __sccl().
> 
> 3) Collating [a-z] range in regcomp() only for single bytes locales
> (we can't do it now for other ones). In previous state only first 256
> wchars are considered and all others are just silently dropped from the
> range.
> --------
>