Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z
- In reply to: Yuri : "Re: find(1): I18N gone wild? [[:alpha:]] not a substitute to refer 26 English letters A-Z"
Date: Fri, 21 Apr 2023 20:05:55 UTC
Yuri wrote:
> parv/FreeBSD wrote:
>> Wrote Dimitry Andric on Fri, 21 Apr 2023 10:38:05 UTC
>> (via
>> https://lists.freebsd.org/archives/freebsd-current/2023-April/003556.html )
>>>
>>> ... However, I have read that with unicode, you should *never*
>>> use [A-Z] or [0-9], but character classes instead. That seems to give
>>> both files on macOS and Linux with [[:alpha:]]:
>> ...
>>
>> Subject to the locale, the problem with that is that "[[:alpha:]]" will
>> match more than the 26 English letters "A" through "Z" (besides also
>> matching lower case "a" through "z"), even if none of the 26 * 2 English
>> letters appear in a string.
>
> (replying to random recent message)
>
> And there is a bit of fairly recent history for fnmatch() related to
> [a-z]; the same was done for regex with the same outcome: an attempt to
> make the [a-z] (and, I guess, [A-Z]) range non-collating failed. I am not
> aware of the failures encountered; hopefully someone remembers:
I just tried a less intrusive change that seems to help with these ranges
(though the question of what failed previously remains open):
diff --git a/lib/libc/gen/fnmatch.c b/lib/libc/gen/fnmatch.c
index 40670545993..3234c1aaaa4 100644
--- a/lib/libc/gen/fnmatch.c
+++ b/lib/libc/gen/fnmatch.c
@@ -295,10 +295,11 @@ rangematch(const char *pattern, wchar_t test, int flags, char **newp,
 			if (flags & FNM_CASEFOLD)
 				c2 = towlower(c2);
 
-			if (table->__collate_load_error ?
+			if (table->__collate_load_error ||
+			    iswascii(test) ?
 			    c <= test && test <= c2 :
-			       __wcollate_range_cmp(c, test) <= 0
-			    && __wcollate_range_cmp(test, c2) <= 0
+			    __wcollate_range_cmp(c, test) <= 0 &&
+			    __wcollate_range_cmp(test, c2) <= 0
 			   )
 				ok = 1;
 		} else if (c == test)

$ LC_ALL=en_US.UTF-8 \
    LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 \
    find . -name '[a-z]*'
./bar
$ LC_ALL=en_US.UTF-8 \
    LD_PRELOAD=/usr/obj/home/yuri/ws/find/amd64.amd64/lib/libc/libc.so.7 \
    find . -name '[A-Z]*'
./FOO
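
For anyone who wants to experiment outside libc, the decision made by the
patched condition can be roughly sketched as a standalone function. This is
only an approximation under stated assumptions: range_cmp() and in_range()
are hypothetical names, wcscoll(3) stands in for the libc-internal
__wcollate_range_cmp(), and the "code point below 128" check is a portable
spelling of iswascii(). Unlike the real rangematch(), it never looks at
__collate_load_error.

```c
#include <wchar.h>

/*
 * Hypothetical stand-in for the libc-internal __wcollate_range_cmp():
 * compares two single wide characters in the current locale's
 * collation order using wcscoll(3).
 */
int
range_cmp(wchar_t a, wchar_t b)
{
	wchar_t s1[2] = { a, L'\0' };
	wchar_t s2[2] = { b, L'\0' };

	return (wcscoll(s1, s2));
}

/*
 * Return 1 if 'test' falls inside the bracket range [lo-hi], else 0.
 * ASCII characters are compared by code point; everything else goes
 * through the collation order, as in the patched condition.
 */
int
in_range(wchar_t lo, wchar_t hi, wchar_t test)
{
	/* Portable equivalent of iswascii(test). */
	if ((unsigned long)test < 128)
		return (lo <= test && test <= hi);
	return (range_cmp(lo, test) <= 0 && range_cmp(test, hi) <= 0);
}
```

With this, in_range(L'a', L'z', L'B') is 0 in any locale, because 'B' is
ASCII and is compared by code point only. Non-ASCII characters still depend
on the collation tables of whatever locale is active.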
> --------
> commit 5a5807dd4ca34467ac5fb458bc19f12bf62075a5
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date: Sun Jul 10 03:49:38 2016 +0000
>
> Remove broken support for collation in [a-z] type ranges.
> Only first 256 wide chars are considered currently, all other are just
> dropped from the range. Proper implementation require reverse tables
> database lookup, since objects are really big as max UTF-8 (1114112
> code points), so just the same scanning as it was for 256 chars will
> slow things down.
>
> POSIX does not require collation for [a-z] type ranges and does not
> prohibit it for non-POSIX locales. POSIX require collation for ranges
> only for POSIX (or C) locale which is equal to ASCII and binary for
> other chars, so we already have it.
>
> No other *BSD implements collation for [a-z] type ranges.
>
> Restore ABI compatibility with unused now __collate_range_cmp() which
> is visible from outside (will be removed later).
> --------
> commit 1daad8f5ad767dfe7896b8d1959a329785c9a76b
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date: Thu Jul 14 08:18:12 2016 +0000
>
> Back out non-collating [a-z] ranges.
> Instead of changing whole course to another POSIX-permitted way
> for consistency and uniformity I decide to completely ignore missing
> regex functionality and concentrate on fixing bugs in what we have now,
> too many small obstacles instead, counting ports.
> --------
> commit 12eae8c8f346cb459a388259ca98faebdac47038
> Author: Andrey A. Chernov <ache@FreeBSD.org>
> Date: Thu Jul 14 09:07:25 2016 +0000
>
> 1) Eliminate possibility to call __*collate_range_cmp() with incomplete
> locale (which cause core dump) by removing whole 'table' argument
> by which it passed.
>
> 2) Restore __collate_range_cmp() in __sccl().
>
> 3) Collating [a-z] range in regcomp() only for single bytes locales
> (we can't do it now for other ones). In previous state only first 256
> wchars are considered and all others are just silently dropped from the
> range.
> --------
>
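
To compare the C/POSIX locale behaviour mentioned in the first commit
message with some other locale on a given system, a small helper along these
lines might be handy. It is only a sketch: match_in_locale() is a
hypothetical name, and it relies on nothing beyond the standard setlocale(3)
and fnmatch(3) calls. In the "C" locale the ranges compare by byte value;
what happens for something like en_US.UTF-8 depends on the libc and its
collation tables, so no result is claimed for that here.

```c
#include <fnmatch.h>
#include <locale.h>

/*
 * Hypothetical helper: match 'name' against 'pattern' under the given
 * locale.  Returns 1 on match, 0 on mismatch, and -1 if the locale is
 * not available on this system.
 */
int
match_in_locale(const char *locale, const char *pattern, const char *name)
{
	if (setlocale(LC_ALL, locale) == NULL)
		return (-1);
	return (fnmatch(pattern, name, 0) == 0);
}
```

For example, match_in_locale("C", "[a-z]*", "FOO") returns 0 and
match_in_locale("C", "[A-Z]*", "FOO") returns 1. Passing "en_US.UTF-8"
instead of "C" shows what the installed libc does with the same patterns.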