Re: Grep with non-ascii

From: Eugene Grosbein <eugen_at_grosbein.net>
Date: Sat, 04 Feb 2023 01:41:17 UTC
03.02.2023 21:18, Eivind Nicolay Evensen wrote:

> Den Fri, 3 Feb 2023 19:12:32 +0700
> skrev Eugene Grosbein <eugen@grosbein.net>:
> 
>> 03.02.2023 17:06, Eivind Nicolay Evensen wrote:
>>> Hello.
>>>
>>> I just noticed this today:
>>>   
>>> elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø  
>>> grep: trailing backslash (\)  
>>> elg!ene[~]> echo $LC_CTYPE $LANG  
>>> nb_NO.ISO8859-1 nb_NO.ISO8859-1
>>>
>>> While I have the result I envisioned with gnugrep:
>>>   
>>> elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø  
>>> bø
>>> øl
>>>
>>> Also, on OpenIndiana, linux and Netbsd, grep gives the proper
>>> result.
>>>
>>> Is lib/libc/regex the right place to look into this if I
>>> find the time, or does anybody know this enough to know the
>>> problem?  
>>
>> Try single quotes instead of double quotes.
>> And please specify system version and shell name, and shell version
>> if it's not in base system.
> 
> This is
> elg!ene[~]> uname -a
> FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD 13.2-PRERELEASE
> #1: Tue Jan 31 11:23:29 CET 2023
> ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> amd64
> 
> Using the tcsh that comes with it. But I don't think the quotes matter
> much because of this:
> 
> elg!ene[~]> grep ø
> grep: trailing backslash (\)
> 
> The output was more just to have something to look for, like
> with ggrep but anyway:
> 
> elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø
> grep: trailing backslash (\)
> 
> And obviously:
> 
> elg!ene[~]> printf 'bø\nhei\nøl\n' 
> bø
> hei
> øl
> 
> And it seems to be the same for any 8859-1 character not part
> of ascii:
> 
> elg!ene[~]> grep ä
> grep: trailing backslash (\)
> elg!ene[~]> grep ß
> grep: trailing backslash (\)
> elg!ene[~]> grep ç
> grep: trailing backslash (\)

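I checked it with the ru_RU.KOI8-R locale and the same problem showed up, with every Cyrillic letter.
The following one-liner prints the octal code and the character for each affected position in the upper half of the 8-bit character table, that is, each byte for which grep exits with a status greater than 1 (an error rather than a plain non-match).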

$ jot -w '%o' - 128 255 1 | xargs -n2 -I^ printf '^ \^\n' | while read octal char; do grep -q "$char" /etc/motd 2>/dev/null; [ $? -gt 1 ] && echo $octal $char; done
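
For readability, the one-liner above can be unrolled into a small function. This is only a sketch of the same check, not a replacement for it: the function name check_high_bytes and the GREP variable are made up for illustration, and it uses a plain counting loop instead of jot and xargs so that it also runs on systems without jot.

```shell
#!/bin/sh
# Hypothetical helper (name and GREP variable are assumptions, not part of
# any base system): list the bytes 0200..0377 for which grep reports an error.
# Override GREP to compare implementations, e.g. GREP=ggrep.
GREP=${GREP:-grep}

check_high_bytes() {
    file=$1
    i=128
    while [ "$i" -le 255 ]; do
        octal=$(printf '%o' "$i")
        # Build the single byte from its octal escape, e.g. \200.
        char=$(printf "\\$octal")
        status=0
        "$GREP" -q "$char" "$file" 2>/dev/null || status=$?
        # grep exits 0 on a match, 1 on no match, and >1 on an error.
        if [ "$status" -gt 1 ]; then
            printf '%s %s\n' "$octal" "$char"
        fi
        i=$((i + 1))
    done
}
```

Calling check_high_bytes /etc/motd under the affected locale should list the same positions as the one-liner; running it again with GREP=ggrep gives a comparison against GNU grep, assuming that is installed under that name.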

Note that this problem does not exist in FreeBSD 12.4 or earlier versions, so this is a recent regression.
Surely that's due to the grep command being GNU grep in 12.4 but BSD grep in 13.x.