Re: Grep with non-ascii

From: Tomoaki AOKI <junchoon_at_dec.sakura.ne.jp>
Date: Sat, 04 Feb 2023 03:46:02 UTC
On Fri, 3 Feb 2023 17:31:55 +0100
Eivind Nicolay Evensen <eivinde@terraplane.org> wrote:

> Den Sat, 4 Feb 2023 01:06:05 +0900
> skrev Tomoaki AOKI <junchoon@dec.sakura.ne.jp>:
> 
> > On Fri, 3 Feb 2023 15:18:53 +0100
> > Eivind Nicolay Evensen <eivinde@terraplane.org> wrote:
> > 
> > > Den Fri, 3 Feb 2023 19:12:32 +0700
> > > skrev Eugene Grosbein <eugen@grosbein.net>:
> > >   
> > > > 03.02.2023 17:06, Eivind Nicolay Evensen wrote:  
> > > > > Hello.
> > > > > 
> > > > > I just noticed this today:
> > > > >     
> > > > > elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø    
> > > > > grep: trailing backslash (\)    
> > > > > elg!ene[~]> echo $LC_CTYPE $LANG    
> > > > > nb_NO.ISO8859-1 nb_NO.ISO8859-1
> > > > > 
> > > > > While I have the result I envisioned with gnugrep:
> > > > >     
> > > > > elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø    
> > > > > bø
> > > > > øl
> > > > > 
> > > > > Also, on OpenIndiana, linux and Netbsd, grep gives the proper
> > > > > result.
> > > > > 
> > > > > Is lib/libc/regex the right place to look into this if I
> > > > > find the time, or does anybody know this enough to know the
> > > > > problem?    
> > > > 
> > > > Try single quotes instead of double quotes.
> > > > And please specify system version and shell name, and shell
> > > > version if it's not in base system.  
> > > 
> > > This is  
> > > elg!ene[~]> uname -a  
> > > FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD
> > > 13.2-PRERELEASE #1: Tue Jan 31 11:23:29 CET 2023
> > > ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> > > amd64
> > > 
> > > Using the tcsh that comes with it. But I don't think the quotes
> > > matter much because of this:
> > >   
> > > elg!ene[~]> grep ø  
> > > grep: trailing backslash (\)
> > > 
> > > The output was more just to have something to look for, like
> > > with ggrep but anyway:
> > >   
> > > elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø  
> > > grep: trailing backslash (\)
> > > 
> > > And obviously:
> > >   
> > > elg!ene[~]> printf 'bø\nhei\nøl\n'   
> > > bø
> > > hei
> > > øl
> > > 
> > > And it seems to be the same for any 8859-1 character not part
> > > of ascii:
> > >   
> > > elg!ene[~]> grep ä  
> > > grep: trailing backslash (\)  
> > > elg!ene[~]> grep ß  
> > > grep: trailing backslash (\)  
> > > elg!ene[~]> grep ç  
> > > grep: trailing backslash (\)
> > > 
> > > -- 
> > > Eivind Nicolay Evensen  
> > 
> > I recalled a very, very old problem with Japanese characters.
> > Do the characters you mentioned include 0x5c in the nb_NO.ISO8859-1
> > charset?
> > 
> > In the dirty, ugly DOS era, Shift-JIS (CP932) was the mainstream in Japan.
> > In this charset, some 2-byte kanji characters have 0x5c as their second
> > byte.
> > 
> > This caused imported, non-Japanese-aware software to mishandle Japanese
> > text, and the workaround was to add extra 0x5c bytes after problematic
> > characters. :-(
> > 
> > For example, 表示 in a Shift-JIS bytestream was 0x95 0x5c 0x8e 0xa6, and
> > as 0x5c was usually treated as backslash, the escape character, it was
> > modified to 0x95 0x8e 0xa6 by non-Japanese software.
> > As this mis-conversion often happened recursively, the required
> > number of extra 0x5c bytes varied, varied and varied!!!!! Crazily.
> > 
> > If this is a case like the above, the only solution is to move to a
> > character set containing ALL characters from all over the world.
> > 
> > AFAIK, there are only two candidates, TRON code [1] and Unicode
> > (UCS, ISO/IEC 10646) [2]. And as TRON code is very rarely used, the
> > actual candidate would be Unicode only.
> > Note that Unicode is usually encoded as one of UTF-8, UTF-16 or UTF-32
> > for data transfer (sometimes raw UCS-2?).
> > 
> > 
> > [1] https://en.wikipedia.org/wiki/TRON_(encoding)
> > [2] https://en.wikipedia.org/wiki/Unicode
> > 
> > P.S.
> > In UTF-8, the character ø is encoded as 0xC3 0xB8. So it should be
> > OK.
> 
> In 8859-1, "ø" is:
> 
> elg!ene[~]> printf ø |hexdump -C
> 00000000  f8                                                |ø|
> 00000001
> 
> so this does not seem to be the problem here. And all those
> characters I tried are one-byte (all 8859-1 are):
> 
> elg!ene[~]> printf "äßç" |hexdump -C
> 00000000  e4 df e7                                          |äßç|
> 00000003
> 
> So I do not believe this is the same problem. I did, however,
> find it interesting that multi-byte character sets may have been
> in use longer than I imagined.
> 
> 
> -- 
> Eivind Nicolay Evensen
> 

OK. Agreed. Sorry for the noise.

Possibly, 8-bit (non-7-bit) characters that are not valid UTF-8 or
in-memory Unicode are not converted properly in BSD grep?
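
The byte-level facts behind this guess can be checked with a small Python
sketch. It only inspects the pattern bytes (taking ø in ISO8859-1, as in
the hexdump quoted above); it says nothing about what BSD grep actually
does internally, which remains an open question here.

```python
# Sketch: look at the pattern "ø" as the ISO8859-1 bytes grep would receive.
pattern = "ø".encode("iso8859-1")    # one byte, 0xf8, matching the hexdump

# Is there a literal backslash (0x5c) anywhere in the pattern?
has_backslash = b"\\" in pattern

# Is the same byte sequence valid when interpreted as UTF-8?
try:
    pattern.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(pattern.hex(), has_backslash, valid_utf8)
```

So the pattern holds no 0x5c at all, and the lone byte is not a valid
UTF-8 sequence, which fits the guess above but does not prove it.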

The 0x5c (backslash) problem was a nightmare for Japanese programmers and
early adopters (running IBM PC software on the NEC PC-98 with a simulator)
at the time; it arrived together with Turbo C, which had not yet been
properly ported for Shift-JIS. :-(
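
For anyone who never ran into it, here is a minimal Python sketch of that
0x5c problem, using the Shift-JIS byte sequence from my earlier mail.
naive_unescape() is a hypothetical stand-in for a byte-oriented tool that
knows nothing about Shift-JIS; it is not taken from any real program.

```python
# Shift-JIS bytes from the earlier mail; the second byte is 0x5c.
raw = bytes([0x95, 0x5C, 0x8E, 0xA6])
print(raw.decode("shift_jis"))       # decodes as two kanji


def naive_unescape(data: bytes) -> bytes:
    """Hypothetical tool: treat 0x5c as an escape, drop it, keep the next byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x5C and i + 1 < len(data):
            i += 1                   # skip the "escape" byte itself
        out.append(data[i])
        i += 1
    return bytes(out)


# One pass eats the second byte of the first kanji.
damaged = naive_unescape(raw)
print(damaged.hex(" "))

# The old workaround: double every 0x5c so one pass restores the original.
padded = raw.replace(b"\\", b"\\\\")
print(naive_unescape(padded) == raw)
```

Each additional unescaping pass needs another doubling, which is why the
number of extra 0x5c bytes kept varying.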

Until then, it had been a nightmare only for corporate, professional
programmers.

-- 
Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>