libc/regex: r302824 added invalid check breaking collating ranges
    Kyle Evans 
    kevans at freebsd.org
       
    Tue Jan 23 07:15:17 UTC 2018
    
    
  
(apologies for broken quoting, on mobile)
On Jan 22, 2018 11:54 PM, "Kyle Evans" <kevans at freebsd.org> wrote:
On Mon, Jan 22, 2018 at 6:53 PM, Yuri Pankov <yuripv at icloud.com> wrote:
> (CCing Kyle as he's working on regex at the moment and not because he
broke
> something)
Phew, brief moment of panic until I read this line and went back to
actually look at the full revision number you wrote. =p
I note here that I know incredibly little about collation issues, so
feel free to enlighten me.
> Hi,
>
> r302284 added an invalid check which breaks collating ranges:
>
> -if (table->__collate_load_error) {
> -    (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
> +    (void)REQUIRE(start <= finish, REG_ERANGE);
>
> The "MB_CUR_MAX > 1" is wrong, we should be doing proper comparison
> according to current locale's collation and not simply comparing the
wchar_t
> values.
Right; that seems like almost the complete opposite of when you might
be able to do a direct comparison like this. As you mention, though,
even MB_CUR_MAX == 1 wouldn't necessarily be safe.
> Example -- see Table 1 in http://www.unicode.org/reports/tr10/:
>
> Let's try Swedish collation:
> $ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[ö-z]'
> grep: invalid character range
> $ echo 'test' | LC_COLLATE=se_SE.UTF-8 grep '[z-ö]'
>
> OK, the above seems to be correct, 'ö' > 'z' in Swedish collation, but we
> just got lucky here, as wchar_t comparison gives us the same result.
>
> Now German one:
> $ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[ö-z]'
> grep: invalid character range
> $ echo 'test' | LC_COLLATE=de_DE.UTF-8 grep '[z-ö]'
>
> Same, but according to the table, 'ö' < 'z' in German collation!
>
> I think the fix here would be to drop the "if (table->__collate_load_error
> || MB_CUR_MAX > 1)" block entirely as we no longer use the "table" so
> there's no point in getting it and checking error, wcscoll() which would
be
> called eventually in p_range_cmp() does the table handling itself, and we
> can't use the direct comparison for anything other than 'C' locale (not
sure
> if it's applicable even there).
I've arrived at the same conclusion, and it makes me both sad and
happy at the same time. I had a lot more written here, but erased it
because I clearly fail to read some of this stuff.
I have a proposed patch at [1] that addresses this in theory, but the
patch is not right and breaks things in en_US.UTF-8 at the very least.
`echo "TEST" | sed -e 's/[s-t]/x/g'` outputs "TExT"; debug output
shows that it's adding to the character set: 'S', 's', 't', and
U+00DF. I've not been able to dig any deeper and figure out why
p_range_comp/wcscoll place these things in the range of [s-t] (115 -
116) with en_US.UTF-8 -- the answer isn't immediately obvious to me.
I see now where I goofed this up- please disregard.
Do note that this patch might not apply against a clean -head, but the
spirit of it is still there. It implements your suggestion, but
rewrites the loop in the (former) second branch to keep track of
contiguous ranges of characters. This is a (pehaps misguided) attempt
to take more of the performance hit at expr compilation time rather
than execution, reducing cs->wides as much as possible so that we
don't have to do as many comparisons in not worst-case scenarios.
I've also rearranged the loops in regex2.h:CHIN1 so that we check the
smaller/faster comparisons before checking the larger character set at
wide. I don't know if it really makes for any of a performance
difference, but it seems worth it to at least benchmark for some more
complex character sets one can build in different locales with
mixtures of actually contiguous ranges (in the current locale) and
randomly distributed non-contiguous characters.
[1] https://people.freebsd.org/~kevans/regex-collation.diff
    
    
More information about the freebsd-hackers
mailing list