libc/regex: r302824 added invalid check breaking collating ranges

Tue Jan 23 19:11:09 UTC 2018

On Tue, Jan 23, 2018 at 08:10:32AM -0600, Kyle Evans wrote:
> On Mon, Jan 22, 2018 at 11:36 PM, Yuri Pankov <yuripv at icloud.com> wrote:
>> On Tue, Jan 23, 2018 at 03:53:19AM +0300, Yuri Pankov wrote:
>>>
>>> (CCing Kyle as he's working on regex at the moment and not because he
>>> broke something)
>>>
>>> Hi,
>>>
>>> r302824 added an invalid check which breaks collating ranges:
>>>
>>> -if (table->__collate_load_error) {
>>> -    (void)REQUIRE((uch)start <= (uch)finish, REG_ERANGE);
>>> +if (table->__collate_load_error || MB_CUR_MAX > 1) {
>>> +    (void)REQUIRE(start <= finish, REG_ERANGE);
>>>
>>> The "MB_CUR_MAX > 1" check is wrong: we should be doing a proper
>>> comparison according to the current locale's collation and not simply
>>> comparing the wchar_t values.
>>
>>
>> After re-reading the specification I now see that what looked like a bug is
>> actually an implementation choice, though one that needs to be
>> documented.  I'll update the man page if anyone is willing to review (and
>> commit) the changes.
> 
> Can you point to the section of specification that indicates this is
> OK behavior? It doesn't seem desirable, but I see that GNU systems
> will operate in the same manner that we do now.

Here -- 
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:
------------------------------------------------------------------------
In the POSIX locale, a range expression represents the set of collating 
elements that fall between two elements in the collation sequence, 
inclusive. In other locales, a range expression has unspecified 
behavior: strictly conforming applications shall not rely on whether the 
range expression is valid, or on the set of collating elements matched.
------------------------------------------------------------------------
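
To make the guaranteed part concrete, here is a minimal sketch of what an
application can rely on in the POSIX locale. try_match() is a hypothetical
helper written for this mail, not something from libc, and it assumes the
program has not called setlocale(), so it is still running in the "C"
locale:

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if `text` matches `pattern`, 0 if it
 * does not, and -1 if regcomp() rejects the pattern (e.g. REG_ERANGE
 * for a range whose start sorts after its end).
 */
static int
try_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return (-1);
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return (rc == 0 ? 1 : 0);
}
```

In the "C" locale, try_match("^[a-c]$", "b") should match and
try_match("^[a-c]$", "B") should not; in any other locale the quoted text
above means neither result can be relied upon.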

I've tried to "fix" what I was seeing as well, and yes, everything 
outside of ASCII is ugly, e.g. with lookups based on collation order, 
Cyrillic 'а-я' would match much more than you would expect (capital 
chars and a lot of other symbols).
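
As a rough illustration of the two kinds of lookup being compared here
(this is not the regcomp.c code; both function names are made up for the
example), a range membership test can be written either on the raw
wchar_t values or on the collation order of the current locale:

```c
#include <wchar.h>

/* Code point lookup: compare the wchar_t values directly. */
static int
in_range_codepoint(wchar_t lo, wchar_t hi, wchar_t c)
{
    return (lo <= c && c <= hi);
}

/*
 * Collating lookup: compare one-character strings with wcscoll(),
 * which orders them according to the current LC_COLLATE setting.
 */
static int
in_range_collate(wchar_t lo, wchar_t hi, wchar_t c)
{
    wchar_t l[2] = { lo, L'\0' };
    wchar_t h[2] = { hi, L'\0' };
    wchar_t s[2] = { c, L'\0' };

    return (wcscoll(l, s) <= 0 && wcscoll(s, h) <= 0);
}
```

In the "C" locale the two agree for ASCII. After setlocale() to a UTF-8
locale whose collation interleaves cases, in_range_collate() may accept
capitals and other symbols that in_range_codepoint() rejects, which is
the surprise described above.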

So what we have currently looks the least evil to me:

- non-collating ASCII lookups for any locale -- looking at the log for
   regcomp.c there was an attempt to "fix" this, but it was reverted as
   a lot of existing code relies on this;
- non-collating multi-byte locale lookups -- they will work for almost
   all cases, and where they don't, well, POSIX says it's unspecified :D
- collating single-byte locale lookups for outside of ASCII range --
   they make sense as collation order there doesn't seem to mix
   small/caps/other characters together.

What I think we need to do is document this as an implementation choice in 
the code and regex(3) "IMPLEMENTATION NOTES" so that another poor soul 
doesn't come trying to fix it as I did :-)