Re: Grep with non-ascii

From: Tomoaki AOKI <junchoon_at_dec.sakura.ne.jp>
Date: Fri, 03 Feb 2023 16:06:05 UTC
On Fri, 3 Feb 2023 15:18:53 +0100
Eivind Nicolay Evensen <eivinde@terraplane.org> wrote:

> Den Fri, 3 Feb 2023 19:12:32 +0700
> skrev Eugene Grosbein <eugen@grosbein.net>:
> 
> > 03.02.2023 17:06, Eivind Nicolay Evensen wrote:
> > > Hello.
> > > 
> > > I just noticed this today:
> > >   
> > > elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø  
> > > grep: trailing backslash (\)  
> > > elg!ene[~]> echo $LC_CTYPE $LANG  
> > > nb_NO.ISO8859-1 nb_NO.ISO8859-1
> > > 
> > > While I have the result I envisioned with gnugrep:
> > >   
> > > elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø  
> > > bø
> > > øl
> > > 
> > > Also, on OpenIndiana, linux and Netbsd, grep gives the proper
> > > result.
> > > 
> > > Is lib/libc/regex the right place to look into this if I
> > > find the time, or does anybody know this enough to know the
> > > problem?  
> > 
> > Try single quotes instead of double quotes.
> > And please specify system version and shell name, and shell version
> > if it's not in base system.
> 
> This is
> elg!ene[~]> uname -a
> FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD 13.2-PRERELEASE
> #1: Tue Jan 31 11:23:29 CET 2023
> ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> amd64
> 
> Using the tcsh that comes with it. But I don't think the quotes matter
> much because of this:
> 
> elg!ene[~]> grep ø
> grep: trailing backslash (\)
> 
> The output was more just to have something to look for, like
> with ggrep but anyway:
> 
> elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø
> grep: trailing backslash (\)
> 
> And obviously:
> 
> elg!ene[~]> printf 'bø\nhei\nøl\n' 
> bø
> hei
> øl
> 
> And it seems to be the same for any 8859-1 character not part
> of ascii:
> 
> elg!ene[~]> grep ä
> grep: trailing backslash (\)
> elg!ene[~]> grep ß
> grep: trailing backslash (\)
> elg!ene[~]> grep ç
> grep: trailing backslash (\)
> 
> -- 
> Eivind Nicolay Evensen

This reminds me of a very, very old problem with Japanese characters.
Do any of the characters you mentioned include the byte 0x5c in the
nb_NO.ISO8859-1 charset?
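
One way to check this is to encode the reported characters in
ISO 8859-1 and look for the byte 0x5c. A minimal Python sketch, using
only the standard codecs; the helper name contains_backslash_byte is
made up for illustration:

```python
# Check whether any reported character contains byte 0x5c (the ASCII
# backslash) when encoded in ISO 8859-1.
# contains_backslash_byte is an illustrative helper, not an existing API.
BACKSLASH = 0x5C


def contains_backslash_byte(text: str, encoding: str) -> bool:
    """Return True if the encoded form of text contains byte 0x5c."""
    return BACKSLASH in text.encode(encoding)


reported = ["ø", "ä", "ß", "ç"]  # characters from the quoted report
for ch in reported:
    raw = ch.encode("iso-8859-1")
    print(ch, raw.hex(), contains_backslash_byte(ch, "iso-8859-1"))
```

If every line prints False, the 0x5c theory does not apply to this
charset.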

In the dirty, ugly DOS era, Shift-JIS (CP932) was the mainstream
encoding in Japan. In this charset, some 2-byte kanji characters have
0x5c as their second byte.

This caused imported, non-Japanese-aware software to mishandle Japanese
text, and the workaround was to add extra 0x5c bytes after the
problematic characters. :-(

For example, 表示 as a Shift-JIS byte stream is 0x95 0x5c 0x8e 0xa6, and
as 0x5c was usually treated as a backslash, i.e. the escape character,
non-Japanese software modified it to 0x95 0x8e 0xa6.
As this misconversion often happened recursively, the required number
of extra 0x5c bytes varied, varied and varied!!!!! Crazily.
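
The mangling can be reproduced with a small Python sketch.
naive_unescape is a hypothetical stand-in for the encoding-unaware
software of that era, not any real program; it simply treats every
0x5c as an escape character and drops it.

```python
# Sketch of how encoding-unaware escape handling damages Shift-JIS text.
# naive_unescape is a hypothetical model, not code from a real program.

def naive_unescape(data: bytes) -> bytes:
    """Drop each 0x5c and keep the byte after it, ignoring multi-byte chars."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x5C and i + 1 < len(data):
            out.append(data[i + 1])  # keep the "escaped" byte only
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


original = "表示".encode("shift_jis")
mangled = naive_unescape(original)

# The old workaround: double every 0x5c so one pass restores the original.
padded = original.replace(b"\x5c", b"\x5c\x5c")

print(original.hex(" "))                # 95 5c 8e a6
print(mangled.hex(" "))                 # 95 8e a6
print(naive_unescape(padded).hex(" "))  # 95 5c 8e a6
```

Each additional unescape pass needs the 0x5c doubled once more, which
is why the required amount of padding depended on how many layers of
software touched the text.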

If this is a case like the above, the only solution is to move to a
character set containing ALL characters from all over the world.

AFAIK, there are only two candidates, TRON code [1] and Unicode (UCS,
ISO/IEC 10646) [2]. As TRON code is very rarely used, the only
practical candidate would be Unicode.
Note that Unicode is usually encoded as one of UTF-8, UTF-16 or UTF-32
for data transfer (sometimes raw UCS-2?).


[1] https://en.wikipedia.org/wiki/TRON_(encoding)
[2] https://en.wikipedia.org/wiki/Unicode

P.S.
In UTF-8, the character ø is encoded as 0xC3 0xB8. So it should be
OK.
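
For reference, a short Python sketch comparing the two encodings of ø.
It uses only the standard codecs; the variable names are illustrative.

```python
# Compare the bytes of "ø" in ISO 8859-1 and UTF-8 and look for 0x5c.
ch = "ø"
latin1 = ch.encode("iso-8859-1")
utf8 = ch.encode("utf-8")

print(latin1.hex(" "))  # f8
print(utf8.hex(" "))    # c3 b8

# Every byte of a UTF-8 multi-byte sequence is 0x80 or above, so a 0x5c
# byte in UTF-8 text can only be a real backslash.
print(all(b >= 0x80 for b in utf8))  # True
```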

-- 
Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>