Re: FreeBSD awk behavior change proposal

From: Rodney W. Grimes <freebsd-rwg_at_gndrsh.dnsmgr.net>
Date: Sat, 10 Jul 2021 12:22:05 UTC
> Am 09.07.21 um 15:21 schrieb Rodney W. Grimes:
> >> Greetings,
> >>
> >> I've posted https://reviews.freebsd.org/D31114 which eliminates the last
> >> delta we have from upstream one-true-awk. This delta has basically been
> >> rejected by upstream as being a really bad idea. Let me give some
> >> background.
> >>
> >> In 2005, FreeBSD changed one-true-awk to honor the locale's collating order.
> >> https://svnweb.freebsd.org/base/head/usr.bin/awk/b.c.diff?annotate=146322&pathrev=201988
> >> This was billed as a temporary patch. It was also compatible with
> >> the then-current behavior of gawk. That temporary patch has lasted 16
> >> years now.
> >>
> >> However, IEEE Std 1003.1-2008 changed the behavior of ranges in regular
> >> expressions outside of the "C" and "POSIX" locales to be undefined.
> >>
> >> Starting in 2011, gawk 4.0 stopped using the locale for the range
> >> regular expressions and used the traditional behavior only. The
> >> maintainer had grown weary of answering why '[A-Z]' would sometimes
> >> match lower-case letters. The details are explained here:
> >> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
> >>
> >> To restore compatibility with other implementations of awk, revert this
> >> patch. FreeBSD is the odd system out. It also has the nice side effect
> >> of eliminating the last of our differences with upstream one-true-awk.
> >>
> >> I'd like to commit the change at least to -current. Ideally, I'd like to MFC
> >> the change. I believe better compatibility with gawk and other awk
> >> implementations justifies this change in behavior because the current
> >> behavior is outside the mainstream enough to be considered a bug.
> >>
> >> I'd like to solicit input before I do this, however.
> > 
> > My only concern is whether anything in the ports system gets
> > tickled by this change. I know it's a PITA, but maybe have an exp
> > run done?  I reviewed and accepted the differential, and by examination
> > I do not see how this could cause an issue now, so meh, give it a long
> > bake in -current and things should be ok.
> 
> While possible in theory, I do not see how the ports system could
> be affected in practice.
> 
> Ports are built in a C/POSIX locale on the official builders, so
> while a different locale and collating sequence on a user's system
> could break a port, no port should ever require one.
> 
> I have checked the port Makefiles for occurrences of LANG or LC_*
> outside specific command invocations (e.g. to set the locale for
> a sort command). These are the results:
> 
> - ${USE_LOCALE} is used in bsd.port.mk, but the only case where
>   a locale other than C or en_US.UTF-8 is specified is shells/fd
>   which has USE_LOCALE=ja (i.e. does not specify an encoding).
> 
> - ${ELIXIR_LOCALE} is used to set LANG and LC_ALL for USES=elixir.
>   But ELIXIR_LOCALE is only ever set to en_US.UTF-8, AFAICT.
> 
> - print/libpaper explicitly requests LANG=C LC_ALL=C for AWK.
> 
> - The only port that requests a locale that is not en_US.UTF-8,
>   en_US.ISO8859-1, or C is textproc/te-hunspell, which uses
>   LANG=te_IN.utf8 LC_ALL=te_IN.utf8 to execute wordlist2hunspell,
>   but only for this single shell script that does not invoke AWK
>   and which does internally use LC_ALL=C for sort and uniq to
>   make those not depend on an externally set locale.
> 
> All other cases where LC_* or LANG are used in port Makefiles are
> in e.g. EXTRACT_CMD, TEST_ENV or in patch files, but those do
> enforce a C or C.UTF-8 locale (or en_US.*) and thus have no effect
> on the proposed change to AWK (besides often only setting the locale
> for a TAR file extraction).
> 
> If an exp-run is planned for other reasons, using the modified
> AWK could be thrown in as a low-risk modification.
> 
> But I do not see any possible effect on the ports system, after
> performing a grep for LANG and LC_* on the Makefiles and patch
> files.
> 
> Regards, STefan
> 

My concerns are/were along the lines of awk explicitly setting,
or actually ignoring, the user's LC_* settings: what happens if
some user WANTS to build a port using some other locale? Could
that lead to bad awk results and a failed build?

Though it is true this would not affect the production FreeBSD
build infrastructure, my concerns are more along the lines of
users that DO build for other locales that this change MAY
affect.

An exp run would not catch this, so it is rather pointless,
but we should keep an eye out for this other failure
mechanism.
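
For what it's worth, anything that depends on range behavior can
sidestep the question by pinning the locale for its awk call, much
as print/libpaper is described as doing above. A minimal sketch,
assuming a POSIX sh and any awk in PATH; the function name
upper_only and the sample input are made up for illustration:

```shell
# Hypothetical helper: run awk with the locale pinned to C so that
# the range [A-Z] does not depend on the caller's LANG or LC_*.
upper_only() {
    LC_ALL=C awk '/[A-Z]/'
}

# One lower-case and one upper-case line as sample input.
printf 'abc\nABC\n' | upper_only
# prints: ABC
```

Because LC_ALL overrides LANG and the individual LC_* variables,
the result should be the same whatever locale the user builds in.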

Regards,
-- 
Rod Grimes                                                 rgrimes@freebsd.org