bin/153502: regex(3) bug with UTF-8 locale
Mathieu
sigsys at gmail.com
Tue Dec 28 18:00:33 UTC 2010
>Number: 153502
>Category: bin
>Synopsis: regex(3) bug with UTF-8 locale
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Tue Dec 28 18:00:32 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator: Mathieu
>Release: 8.1-STABLE, 7.3-RELEASE-p3
>Organization:
>Environment:
8.1-STABLE/amd64 r212312M
7.3-RELEASE-p3/i386 r215233M
>Description:
I'm seeing odd behavior from programs that use regex(3), such as less(1), vi(1) and sed(1), when LANG=en_US.UTF-8 is set and the input is UTF-8.
Sometimes it seems to work right:
$ echo 'é' | sed -ne '/^.$/p'
é
$ echo 'éé' | sed -ne '/^..$/p'
éé
$ echo 'aéa' | sed -ne '/a.a/p'
aéa
$ echo 'aéa' | sed -ne '/a.*a/p'
aéa
$ echo 'aaéaa' | sed -ne '/aa.aa/p'
aaéaa
$ echo 'aéaéa' | sed -ne '/a.a.a/p'
aéaéa
But not always. These should each print their input line, but print nothing:
$ echo 'éa' | sed -ne '/.a/p'
$ echo 'aéaa' | sed -ne '/a.aa/p'
$ echo 'éaé' | sed -ne '/.a./p'
It seems that ".*", ".+", ".{0,}" and ".{1,}" work correctly, but ".{0,1}", ".{1,1}" and a lone "." do not always match.
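To check whether the problem is in regex(3) itself rather than in sed(1), the same patterns can be fed straight to regcomp(3) and regexec(3). The following is only a minimal sketch and not part of the original report: the helper names match_line() and report_cases() are hypothetical, the patterns are compiled as basic regular expressions on the assumption that this mirrors sed(1) without -E, and the locale name is whatever the caller passes in.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>

/*
 * UTF-8 encoding of U+00E9 (e with acute accent). Kept as its own string
 * literal so the hex escape cannot absorb a following letter such as 'a'.
 */
#define E_ACUTE "\xc3\xa9"

/*
 * Hypothetical helper: compile `pattern` as a basic regular expression and
 * test it against `text` in the current locale.
 * Returns 1 on a match, 0 otherwise, -1 if the pattern does not compile.
 */
static int
match_line(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0 ? 1 : 0;
}

/*
 * Hypothetical driver: switch to `locale_name`, run a few of the report's
 * pattern/input pairs and print one result line per pair.
 * Returns the number of pairs that did not match, or -1 if the locale
 * could not be set.
 */
static int
report_cases(const char *locale_name)
{
	static const struct {
		const char *pattern;
		const char *text;
	} cases[] = {
		{ "^.$",  E_ACUTE },			/* printed by sed in the report */
		{ "a.a",  "a" E_ACUTE "a" },		/* printed by sed in the report */
		{ ".a",   E_ACUTE "a" },		/* no output in the report */
		{ "a.aa", "a" E_ACUTE "aa" },		/* no output in the report */
		{ ".a.",  E_ACUTE "a" E_ACUTE },	/* no output in the report */
	};
	size_t i;
	int misses = 0;

	if (setlocale(LC_ALL, locale_name) == NULL)
		return -1;
	for (i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
		int r = match_line(cases[i].pattern, cases[i].text);

		printf("%-6s %-8s %s\n", cases[i].pattern, cases[i].text,
		    r == 1 ? "match" : "NO MATCH");
		if (r != 1)
			misses++;
	}
	return misses;
}
```

Calling report_cases("en_US.UTF-8") from a small main() and seeing "NO MATCH" for the last three pairs would suggest the fault lies in the library and not in the individual programs.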
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: