Re: FreeBSD awk behavior change proposal

From: Stefan Esser <>
Date: Sat, 10 Jul 2021 16:24:56 +0200
On 10.07.21 at 14:22, Rodney W. Grimes wrote:
>> On 09.07.21 at 15:21, Rodney W. Grimes wrote:
>>>> Greetings,
>>>> I've posted which eliminates the last
>>>> delta we have from upstream one-true-awk. This delta has basically been
>>>> rejected by upstream as being a really bad idea. Let me give some
>>>> background.
>>>> In 2005, FreeBSD changed one-true-awk to honor the locale's collating order.
>>>> This was billed as a temporary patch. It was also compatible with
>>>> the then-current behavior of gawk. That temporary patch has lasted 16
>>>> years now.
>>>> However, IEEE Std 1003.1-2008 changed the behavior of ranges in regular
>>>> expressions outside of the "C" and "POSIX" locales to be undefined.
>>>> Starting in 2011, gawk 4.0 stopped using the locale for the range
>>>> regular expressions and used the traditional behavior only. The
>>>> maintainer had grown weary of answering why '[A-Z]' would sometimes
>>>> match lower-case expressions. The details are explained here:
>>>> To restore compatibility with other implementations of awk, revert this
>>>> patch. FreeBSD is the odd system out. It also has the nice side effect
>>>> of eliminating the last of our differences with upstream one-true-awk.
>>>> I'd like to commit the change at least to -current. Ideally, I'd like to MFC
>>>> the change. I believe better compatibility with gawk and other awk
>>>> implementations justifies this change in behavior because the current
>>>> behavior is outside the mainstream enough to be considered a bug.
>>>> I'd like to solicit input before I do this, however.
>>> My only concern is whether anything in the ports system gets
>>> tickled by this change. I know it's a PITA, but maybe have an exp
>>> run done?  I reviewed and accepted the differential, and by examination
>>> I do not see how this could cause an issue now, so meh, give it a long
>>> bake in -current and things should be ok.
>> While possible in theory, I do not see how the ports system could
>> be affected in practice.
>> Ports are built in a C/POSIX locale on the official builders, and
>> thus using a different locale and collating sequence on a user's
>> system could break the port, but should never be a requirement.
>> I have checked the port Makefiles for occurrences of LANG or LC_*
>> outside specific command invocations (e.g. to set the locale for
>> a sort command). These are the results:
>> - ${USE_LOCALE} is used in, but the only case where
>>   a locale other than C or en_US.UTF-8 is specified is shells/fd
>>   which has USE_LOCALE=ja (i.e. does not specify an encoding).
>> - ${ELIXIR_LOCALE} is used to set LANG and LC_ALL for USES=elixir.
>>   But ELIXIR_LOCALE is only ever set to en_US.UTF-8, AFAICT.
>> - print/libpaper explicitly requests LANG=C LC_ALL=C for AWK.
>> - The only port that requests a locale that is not en_US.UTF-8,
>>   en_US.ISO8859-1, or C is textproc/te-hunspell, which uses
>>   LANG=te_IN.utf8 LC_ALL=te_IN.utf8 to execute wordlist2hunspell,
>>   but only for this single shell script that does not invoke AWK
>>   and which does internally use LC_ALL=C for sort and uniq to
>>   make those not depend on an externally set locale.
>> All other cases where LC_* or LANG are used in port Makefiles are
>> in e.g. EXTRACT_CMD, TEST_ENV or in patch files, but those do
>> enforce a C or C.UTF-8 locale (or en_US.*) and thus have no effect
>> on the proposed change to AWK (besides often only setting the locale
>> for a TAR file extraction).
>> If an exp-run is planned for other reasons, using the modified
>> AWK could be thrown in as a low-risk modification.
>> But I do not see any possible effect on the ports system, after
>> performing a grep for LANG and LC_* on the Makefiles and patch
>> files.
>> Regards, STefan
> My concerns are/were along the lines that awk was explicitly
> setting or actually ignoring the user's LC. What happens if
> some user WANTS to build a port using some other locale? Could
> that lead to bad awk results and a failed build?

I understand your concern, but since the collating sequence is
different for different encodings even for the same location
(e.g. *.UTF-8 vs *.ISO8859-1) the result would be unpredictable,
even for a port that addresses a specific country. The same
applies to the different collating sequences of the Cyrillic or
Japanese encodings that exist.

Yes, in theory that is possible, but it would lead to a program
that is bound not only to some language but also the specific
encoding it has been built with - and that would be a paradoxical
situation for a locale-aware program.

And considering the fact that all other operating systems are
already in line with the proposed change, these sources could
not be built on them. If some software is meant to be built on
Linux, then the new AWK behavior will already have been assumed.

For that reason, I have looked for locales passed by our ports
system, not in any port's sources: a port whose sources depended
on the old behavior would not build on Linux.

We only have to check that the ports system does not in
any way depend on the current behavior of AWK, and that's what
I have looked for without finding a single case where it might.
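A check of this kind could be sketched roughly as follows. This is only an
illustration in Python, not the grep that was actually run: the function
name `suspicious_locales`, the `EXPECTED` allow-list (C, POSIX, C.UTF-8 and
en_US.*) and the sample Makefile text are all assumptions made up for the
example.

```python
import re

# Assumption: locales treated as unremarkable in this thread.
EXPECTED = re.compile(r"^(C|POSIX|C\.UTF-8|en_US\..*)$")

# Matches assignments such as LANG=C or LC_ALL=te_IN.utf8 inside a line.
ASSIGN = re.compile(r"\b(LANG|LC_[A-Z]+)=([^\s\"']+)")


def suspicious_locales(makefile_text):
    """Return (line number, variable, value) for every locale assignment
    whose value is not on the EXPECTED allow-list."""
    hits = []
    for lineno, line in enumerate(makefile_text.splitlines(), start=1):
        for var, value in ASSIGN.findall(line):
            if not EXPECTED.match(value):
                hits.append((lineno, var, value))
    return hits


# Hypothetical Makefile fragment, invented for this example.
SAMPLE = (
    "TEST_ENV=\tLANG=C LC_ALL=C\n"
    "do-build:\n"
    "\tLANG=te_IN.utf8 LC_ALL=te_IN.utf8 ${SH} wordlist2hunspell\n"
)

for hit in suspicious_locales(SAMPLE):
    print(hit)
```

Values given through a make variable (e.g. LANG=${ELIXIR_LOCALE}) are also
reported by this sketch and would still need to be resolved by hand.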

> Though it is true this would not affect the production FreeBSD
> build infrastructure, my concerns are more along the lines of
> users that DO build for other locales that this change MAY
> affect.

The most obvious effect of the proposed change is that [A-Z]
will definitely not include lower case letters, while it does
in a number of UTF-8 locales (but not ISO8859-x) now.
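To illustrate why a collated range can pick up lower-case letters, here is
a toy model in Python. The `COLLATION` string is a hypothetical
"aAbB...zZ" ordering chosen for the example; it is not the collation table
of any real FreeBSD locale, and the two helper functions are not taken
from awk's sources.

```python
import string

# Hypothetical collation order "aAbBcC...zZ" (an assumption for this demo).
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_codepoint(ch, lo, hi):
    """Traditional behavior: compare raw code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collated(ch, lo, hi, order=COLLATION):
    """Locale-style behavior: compare positions in a collation order.
    Characters missing from the order never match."""
    pos = order.find(ch)
    return pos != -1 and order.index(lo) <= pos <= order.index(hi)


# Columns: character, in [A-Z] by code point, in [A-Z] by toy collation.
for ch in "aAbzZ":
    print(ch, in_range_codepoint(ch, "A", "Z"), in_range_collated(ch, "A", "Z"))
```

In this toy ordering every lower-case letter except "a" sorts between "A"
and "Z", so the collated range matches it, while the code-point range
matches upper-case letters only.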

> An exp run would not catch this, so is rather pointless,
> but we should keep our eyes out for this other failure
> mechanism.
> Regards,

I'm supporting the proposed change, and in the unlikely case
that there is a port that stops working due to this change,
I'd be willing to fix the underlying issue, just assign the
PR to me ...

Regards, STefan

Received on Sat Jul 10 2021 - 14:24:56 UTC
