svn commit: r361884 - in head/usr.bin/sed: . tests

Kyle Evans kevans at freebsd.org
Sun Jun 7 14:04:14 UTC 2020


On Sun, Jun 7, 2020 at 8:31 AM Rodney W. Grimes
<freebsd at gndrsh.dnsmgr.net> wrote:
>
> > Author: kevans
> > Date: Sun Jun  7 04:32:38 2020
> > New Revision: 361884
> > URL: https://svnweb.freebsd.org/changeset/base/361884
> >
> > Log:
> >   sed: attempt to learn about hex escapes (e.g. \x27)
> >
> >   Somewhat predictably, software often wants to use \x27/\x24 among others so
> >   that they can decline worrying about ugly escaping, if said escaping is even
> >   possible. Right now, this software is using these and getting the wrong
> >   results, as we'll interpret those as x27 and x24 respectively. Some examples
> >   of this, when an exp-run was run, were science/octopus and misc/vifm.
> >
> >   Go ahead and process these at all times.  We allow either one or two digits,
> >   and the tests account for both.  If extra digits are specified, e.g. \x2727,
> >   then the third and fourth digits are interpreted literally as one might
> >   expect.
>
> Does it work to do \\x27, i.e. I want it to NOT do \x27 so I can sed
> on files that contain sequences of escapes.

I'm so glad you asked this. :-) For your immediate answer: yes, the
semantics there work as you expect.

For the long answer, that's actually what you should have been doing
all along. Raising awareness of that fact is what PR 229925 aims to
do: it switches our interpretation of the undefined behavior (UB) for
escaped ordinary characters so that such an escape is an error unless
it is specially interpreted.

Prior to this change, if you had:

printf "\\\\x27\n" | sed -e 's/\x27//'

What you end up with is actually *not* an empty string with a newline,
but just a single backslash! The \x27 in the pattern gets passed
through to the underlying regex(3) implementation, which then happily
interprets \x => x and matches (and removes) the literal 'x27',
leaving \. That is perhaps not what you might have expected if \x27
didn't have special meaning, and it almost certainly isn't what you
wanted. With the new sed, you can change 'x27' to 'b27' in both
strings above to see what I mean.
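
As a minimal sketch of the escaped form asked about above (the sample
input and the QUOTE replacement text are made up for illustration),
doubling the backslash in the pattern matches the four literal
characters \x27, and should not depend on whether sed knows about hex
escapes:

```sh
# Hypothetical input: one line holding the four literal characters \x27.
input='\x27'

# In the pattern, \\ matches a single literal backslash, so \\x27 matches
# the text \x27 itself rather than an apostrophe (hex 27) or a bare x27.
result=$(printf '%s\n' "$input" | sed -e 's/\\x27/QUOTE/')

printf '%s\n' "$result"
```

Single quotes are used around both the input and the sed script so
that the shell passes the backslashes through untouched.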

In the New World Order, all regex(3) users will be forced to be
precise here so that we don't get it wrong. This is especially
important when I add GNU extensions to libregex, because some of those
escaped ordinaries will then be granted special meaning: \s will no
longer match a literal s but will instead match [[:space:]]. Using the
unadulterated libc regex(3) interface will instead give you an error,
which lets you detect whether you're accidentally using libc regex(3)
rather than the GNU-extended libregex.
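
One way to stay precise regardless of which regex flavor is in play (a
sketch with made-up sample input, not something this commit requires)
is to spell out the POSIX character class when whitespace is meant, and
to leave an ordinary s unescaped when a literal s is meant:

```sh
# Hypothetical sample line containing both a space and the letter s.
line='yes no'

# [[:space:]] is the POSIX class for whitespace; no backslash escape needed.
spaces=$(printf '%s\n' "$line" | sed -e 's/[[:space:]]/_/')

# An unescaped s is an ordinary character and matches a literal s.
letters=$(printf '%s\n' "$line" | sed -e 's/s/S/')

printf '%s\n%s\n' "$spaces" "$letters"
```

Neither pattern relies on an escaped ordinary character, so neither
should change meaning when escapes such as \s gain special treatment.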

This is going to be a large and potentially world-breaking change for
many, but I think we'll all be better for it in the end. The symbol
version of regcomp will get bumped, so that older binaries will
continue to operate with the old escaping behavior in case that was
actually pertinent to their functionality.

Thanks,

Kyle Evans

