Uppercase RE matching problems in FreeBSD 11
Mark Martinec
Mark.Martinec+freebsd at ijs.si
Sun Nov 6 12:26:58 UTC 2016
2016-11-06 12:07, Baptiste Daroussin wrote:
> Yes, A-Z only means uppercase in an ASCII-only world. In a Unicode world
> it means AaBb...Z, because there are far more characters than simple A-Z.
> In FreeBSD 11 we have a Unicode collation instead of falling back on
> LC_COLLATE=C, which means ASCII only.
>
> For regexps, for example, one should use the character classes [:upper:]
> or [:lower:].
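
A minimal sketch of that suggestion, assuming the sample line from the
report quoted further down (copied into a variable here, with spacing as it
appears in the quote). The class name goes inside a bracket expression, so
the range [A-Z] becomes [[:upper:]]:

```shell
#!/bin/sh
# Sample line copied from the report quoted later in this message.
line='kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016'

# [[:upper:]] selects the "upper" character class of the current locale
# rather than a collation range, so it does not depend on sort order.
# The leading .* is greedy, so the match starts at the last uppercase letter.
result=$(printf '%s\n' "$line" | sed -e 's/.*\([[:upper:]].*\)$/\1/')
printf '%s\n' "$result"
```

This should print "Nov 5 16:18:34 2016" whatever collation is in effect,
because the class is looked up through LC_CTYPE and not through the sort
order.
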
It is a good idea to keep LC_COLLATE and LC_NUMERIC (and LC_MONETARY?) at "C"
when LANG or LC_CTYPE is set to something else; otherwise unexpected things
may happen.
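
A rough sketch of such a setup, assuming a Bourne-style shell and the
en_US.UTF-8 locale from the report; the three-line test input is made up
for illustration:

```shell
#!/bin/sh
# LC_ALL, if set, overrides every LC_* category, so clear it first.
unset LC_ALL

# Assumed settings: en_US.UTF-8 is the locale named in the report.
LANG=en_US.UTF-8
LC_COLLATE=C
export LANG LC_COLLATE

# With LC_COLLATE=C the range [A-Z] is compared by code value again,
# so only the uppercase line should get through.
matched=$(printf '%s\n' a b Z | sed -n -e '/^[A-Z]$/p')
printf '%s\n' "$matched"
```

With these settings only "Z" should be printed, while LANG still selects
UTF-8 for everything else.
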
Mark
> On Sat, Nov 05, 2016 at 08:23:25PM -0500, Greg Rivers wrote:
>> I happened to run an old script today that uses sed(1) to extract the system
>> boot time from the kern.boottime sysctl MIB. On 11.0 this no longer works as
>> expected:
>>
>> $ sysctl kern.boottime
>> kern.boottime: { sec = 1478380714, usec = 145351 } Sat Nov 5 16:18:34 2016
>> $ sysctl kern.boottime | sed -e 's/.*\([A-Z].*\)$/\1/'
>> v 5 16:18:34 2016
>>
>> sed passes over 'S' and 'N' until it hits 'v', which it considers uppercase
>> apparently. This is with LANG=en_US.UTF-8. If I set LANG=C, it works as
>> expected:
>>
>> $ sysctl kern.boottime | LANG=C sed -e 's/.*\([A-Z].*\)$/\1/'
>> Nov 5 16:18:34 2016
>>
>> Testing every lowercase character separately gives even more inconsistent
>> results:
>>
>> $ cat <<! | LANG=en_US.UTF-8 sed -n -e '/^[A-Z]$/'p
>> > a
>> > b
>> > c
>> > d
>> > e
>> > f
>> > g
>> > h
>> > i
>> > j
>> > k
>> > l
>> > m
>> > n
>> > o
>> > p
>> > q
>> > r
>> > s
>> > t
>> > u
>> > v
>> > w
>> > x
>> > y
>> > z
>> > !
>> b
>> c
>> d
>> e
>> f
>> g
>> h
>> i
>> j
>> k
>> l
>> m
>> n
>> o
>> p
>> q
>> r
>> s
>> t
>> u
>> v
>> w
>> x
>> y
>> z
>>
>> Here sed thinks every lowercase character except for 'a' is uppercase! This
>> differs from the first test where sed did not think 'o' is uppercase. Again,
>> the above behaves as expected with LANG=C.
>>
>> Does anyone have any insight into this? This is likely to break a lot of
>> existing code.