Re: Grep with non-ascii

From: Eivind Nicolay Evensen <eivinde_at_terraplane.org>
Date: Sat, 04 Feb 2023 09:47:27 UTC
Den Sat, 4 Feb 2023 08:41:17 +0700
skrev Eugene Grosbein <eugen@grosbein.net>:

> 03.02.2023 21:18, Eivind Nicolay Evensen wrote:
> 
> > Den Fri, 3 Feb 2023 19:12:32 +0700
> > skrev Eugene Grosbein <eugen@grosbein.net>:
> >   
> >> 03.02.2023 17:06, Eivind Nicolay Evensen wrote:  
> >>> Hello.
> >>>
> >>> I just noticed this today:
> >>>     
> >>> elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø    
> >>> grep: trailing backslash (\)    
> >>> elg!ene[~]> echo $LC_CTYPE $LANG    
> >>> nb_NO.ISO8859-1 nb_NO.ISO8859-1
> >>>
> >>> While I have the result I envisioned with gnugrep:
> >>>     
> >>> elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø    
> >>> bø
> >>> øl
> >>>
> >>> Also, on OpenIndiana, Linux and NetBSD, grep gives the proper
> >>> result.
> >>>
> >>> Is lib/libc/regex the right place to look into this if I
> >>> find the time, or does anybody know this enough to know the
> >>> problem?    
> >>
> >> Try single quotes instead of double quotes.
> >> And please specify system version and shell name, and shell version
> >> if it's not in the base system.
> > 
> > This is  
> > elg!ene[~]> uname -a  
> > FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD
> > 13.2-PRERELEASE #1: Tue Jan 31 11:23:29 CET 2023
> > ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> > amd64
> > 
> > Using the tcsh that comes with it. But I don't think the quotes
> > matter much because of this:
> >   
> > elg!ene[~]> grep ø  
> > grep: trailing backslash (\)
> > 
> > The output was more just to have something to look for, like
> > with ggrep but anyway:
> >   
> > elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø  
> > grep: trailing backslash (\)
> > 
> > And obviously:
> >   
> > elg!ene[~]> printf 'bø\nhei\nøl\n'   
> > bø
> > hei
> > øl
> > 
> > And it seems to be the same for any 8859-1 character not part
> > of ascii:
> >   
> > elg!ene[~]> grep ä  
> > grep: trailing backslash (\)  
> > elg!ene[~]> grep ß  
> > grep: trailing backslash (\)  
> > elg!ene[~]> grep ç  
> > grep: trailing backslash (\)  
> 
> I checked it with the ru_RU.KOI8-R locale and the same problem
> manifested with every Cyrillic letter. The following line shows the
> octal codes and characters of the affected positions in the upper half
> of the 8-bit character table.
> 
> $ jot -w '%o' - 128 255 1 | xargs -n2 -I^ printf '^ \^\n' |
>   while read octal char; do
>     grep -q "$char" /etc/motd 2>/dev/null
>     [ $? -gt 1 ] && echo $octal $char
>   done
> 
> Note that this problem does not exist in 12.4 or earlier FreeBSD
> versions, so this is a recent regression. Surely that's due to the grep
> command being GNU grep in 12.4 but BSD grep in 13.x.

That makes sense, since I know for certain I have grepped for
Norwegian words containing æøå without seeing this problem before.
And I switched from 11 to 13 very late, and only because I wanted
to use hardware unsupported by the old one, so that would explain why
it took me so long to discover.
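
For anyone who wants to repeat the probe quoted above without
depending on terminal encoding or on the contents of /etc/motd, here is
a minimal sketch in plain sh. It relies only on grep's exit status
convention (0 = match, 1 = no match, greater than 1 = error), which is
the same test the quoted one-liner uses. The function names
grep_status_name, probe_byte and probe_high_bytes are made up for this
example, and which bytes get reported will depend on the grep
implementation and the locale in effect.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any tool.

# Map a grep exit status to a label: 0 = match, 1 = no match, >1 = error.
grep_status_name() {
    case "$1" in
        0) echo match ;;
        1) echo nomatch ;;
        *) echo error ;;
    esac
}

# probe_byte OCTAL: build one byte from its octal code, feed it to grep
# both as the input line and as the pattern, and print the label.
probe_byte() {
    byte=$(printf "\\$1")
    printf '%s\n' "$byte" | grep -q -- "$byte" 2>/dev/null
    grep_status_name $?
}

# Print the octal code of every byte in 128..255 that makes grep fail
# with an error (as opposed to simply not matching).
probe_high_bytes() {
    code=128
    while [ "$code" -le 255 ]; do
        oct=$(printf '%o' "$code")
        [ "$(probe_byte "$oct")" = error ] && echo "$oct"
        code=$((code + 1))
    done
    return 0
}
```

Running "probe_byte 370" under nb_NO.ISO8859-1 checks ø on its own, and
"probe_high_bytes" walks the whole upper half of the table. A lone
backslash ("probe_byte 134") is a handy control, since that pattern
really does end in a backslash.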




-- 
Eivind Nicolay Evensen