From nobody Sat Feb 04 03:46:02 2023
X-Original-To: stable@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4P7z2w49Mpz3kcmd
	for <stable@mlmmj.nyi.freebsd.org>; Sat,  4 Feb 2023 03:46:16 +0000 (UTC)
	(envelope-from junchoon@dec.sakura.ne.jp)
Received: from www121.sakura.ne.jp (www121.sakura.ne.jp [153.125.133.21])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id 4P7z2t6L2Gz4Cnf
	for <stable@freebsd.org>; Sat,  4 Feb 2023 03:46:14 +0000 (UTC)
	(envelope-from junchoon@dec.sakura.ne.jp)
Authentication-Results: mx1.freebsd.org;
	dkim=none;
	spf=none (mx1.freebsd.org: domain of junchoon@dec.sakura.ne.jp has no SPF policy when checking 153.125.133.21) smtp.mailfrom=junchoon@dec.sakura.ne.jp;
	dmarc=none
Received: from kalamity.joker.local (123-1-88-210.area1b.commufa.jp [123.1.88.210])
	(authenticated bits=0)
	by www121.sakura.ne.jp (8.16.1/8.16.1/[SAKURA-WEB]/20201212) with ESMTPA id 3143k2Af044157
	for <stable@freebsd.org>; Sat, 4 Feb 2023 12:46:03 +0900 (JST)
	(envelope-from junchoon@dec.sakura.ne.jp)
Date: Sat, 4 Feb 2023 12:46:02 +0900
From: Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
To: stable@freebsd.org
Subject: Re: Grep with non-ascii
Message-Id: <20230204124602.abe78f4a441f747941d3f858@dec.sakura.ne.jp>
In-Reply-To: <20230203173155.179902a4@elg.hjerdalen.lokalnett>
References: <20230203110642.70e4a076@elg.hjerdalen.lokalnett>
	<819a4336-9689-bdbe-a90d-8f1d7b842662@grosbein.net>
	<20230203151853.02732bd6@elg.hjerdalen.lokalnett>
	<20230204010605.4874609f80eed28543407807@dec.sakura.ne.jp>
	<20230203173155.179902a4@elg.hjerdalen.lokalnett>
Organization: Junchoon corps
X-Mailer: Sylpheed 3.7.0 (GTK+ 2.24.33; amd64-portbld-freebsd13.0)
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-stable
List-Help: <mailto:stable+help@freebsd.org>
List-Post: <mailto:stable@freebsd.org>
List-Subscribe: <mailto:stable+subscribe@freebsd.org>
List-Unsubscribe: <mailto:stable+unsubscribe@freebsd.org>
Sender: owner-freebsd-stable@freebsd.org
X-BeenThere: freebsd-stable@freebsd.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spamd-Result: default: False [-1.60 / 15.00];
	AUTH_NA(1.00)[];
	NEURAL_HAM_MEDIUM(-1.00)[-1.000];
	NEURAL_HAM_LONG(-1.00)[-1.000];
	NEURAL_HAM_SHORT(-1.00)[-0.999];
	MV_CASE(0.50)[];
	MIME_GOOD(-0.10)[text/plain];
	R_DKIM_NA(0.00)[];
	ASN(0.00)[asn:7684, ipnet:153.125.128.0/18, country:JP];
	FROM_EQ_ENVFROM(0.00)[];
	MIME_TRACE(0.00)[0:+];
	R_SPF_NA(0.00)[no SPF record];
	MLMMJ_DEST(0.00)[stable@freebsd.org];
	RCVD_TLS_LAST(0.00)[];
	HAS_ORG_HEADER(0.00)[];
	RCVD_COUNT_TWO(0.00)[2];
	FROM_HAS_DN(0.00)[];
	ARC_NA(0.00)[];
	RCVD_VIA_SMTP_AUTH(0.00)[];
	DMARC_NA(0.00)[sakura.ne.jp];
	TO_MATCH_ENVRCPT_ALL(0.00)[];
	TO_DN_NONE(0.00)[];
	PREVIOUSLY_DELIVERED(0.00)[stable@freebsd.org];
	RCPT_COUNT_ONE(0.00)[1];
	MID_RHS_MATCH_FROM(0.00)[]
X-Rspamd-Queue-Id: 4P7z2t6L2Gz4Cnf
X-Spamd-Bar: -
X-ThisMailContainsUnwantedMimeParts: N

On Fri, 3 Feb 2023 17:31:55 +0100
Eivind Nicolay Evensen <eivinde@terraplane.org> wrote:

> Den Sat, 4 Feb 2023 01:06:05 +0900
> skrev Tomoaki AOKI <junchoon@dec.sakura.ne.jp>:
> 
> > On Fri, 3 Feb 2023 15:18:53 +0100
> > Eivind Nicolay Evensen <eivinde@terraplane.org> wrote:
> > 
> > > Den Fri, 3 Feb 2023 19:12:32 +0700
> > > skrev Eugene Grosbein <eugen@grosbein.net>:
> > >   
> > > > 03.02.2023 17:06, Eivind Nicolay Evensen wrote:  
> > > > > Hello.
> > > > > 
> > > > > I just noticed this today:
> > > > >     
> > > > > elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø    
> > > > > grep: trailing backslash (\)    
> > > > > elg!ene[~]> echo $LC_CTYPE $LANG    
> > > > > nb_NO.ISO8859-1 nb_NO.ISO8859-1
> > > > > 
> > > > > While I have the result I envisioned with gnugrep:
> > > > >     
> > > > > elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø    
> > > > > bø
> > > > > øl
> > > > > 
> > > > > Also, on OpenIndiana, linux and Netbsd, grep gives the proper
> > > > > result.
> > > > > 
> > > > > Is lib/libc/regex the right place to look into this if I
> > > > > find the time, or does anybody know this enough to know the
> > > > > problem?    
> > > > 
> > > > Try single quotes instead of double quotes.
> > > > > And please specify system version and shell name, and shell
> > > > > version if it's not in base system.
> > > 
> > > This is  
> > > elg!ene[~]> uname -a  
> > > FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD
> > > 13.2-PRERELEASE #1: Tue Jan 31 11:23:29 CET 2023
> > > ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> > > amd64
> > > 
> > > Using the tcsh that comes with it. But I don't think the quotes
> > > matter much because of this:
> > >   
> > > elg!ene[~]> grep ø  
> > > grep: trailing backslash (\)
> > > 
> > > The output was more just to have something to look for, like
> > > with ggrep but anyway:
> > >   
> > > elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø  
> > > grep: trailing backslash (\)
> > > 
> > > And obviously:
> > >   
> > > elg!ene[~]> printf 'bø\nhei\nøl\n'   
> > > bø
> > > hei
> > > øl
> > > 
> > > And it seems to be the same for any 8859-1 character not part
> > > of ascii:
> > >   
> > > elg!ene[~]> grep ä  
> > > grep: trailing backslash (\)  
> > > elg!ene[~]> grep ß  
> > > grep: trailing backslash (\)  
> > > elg!ene[~]> grep ç  
> > > grep: trailing backslash (\)
> > > 
> > > -- 
> > > Eivind Nicolay Evensen  
> > 
> > I recalled a very, very old problem with Japanese characters.
> > Do the characters you mentioned include 0x5c in the nb_NO.ISO8859-1
> > charset?
> > 
> > In the dirty, ugly DOS era, Shift-JIS (CP932) was the mainstream in
> > Japan. In this charset, some 2-byte kanji characters have 0x5c as
> > their second byte.
> > 
> > This caused imported, non-Japanese-aware software to mishandle
> > Japanese text, and the workaround was to add extra 0x5c bytes after
> > the problematic characters. :-(
> > 
> > For example, 表示 in a Shift-JIS bytestream was 0x95 0x5c 0x8e 0xa6,
> > and as 0x5c was usually treated as backslash, the escape character,
> > it was modified to 0x95 0x8e 0xa6 by non-Japanese software.
> > As this mis-conversion often happened recursively, the required
> > number of extra 0x5c bytes varied, varied and varied!!!!! Crazily.
> > 
> > If this is a case like the above, the only solution is to move to a
> > character set containing ALL characters from all over the world.
> > 
> > AFAIK, there are only two candidates, TRON code [1] and Unicode
> > (UCS, ISO/IEC 10646) [2]. As TRON code is very rarely used, the only
> > practical candidate would be Unicode.
> > Note that Unicode is usually encoded as UTF-8, UTF-16 or UTF-32
> > for data transfer (sometimes raw UCS-2?).
> > 
> > 
> > [1] https://en.wikipedia.org/wiki/TRON_(encoding)
> > [2] https://en.wikipedia.org/wiki/Unicode
> > 
> > P.S.
> > In UTF-8, the character ø is encoded as 0xC3 0xB8. So it should be
> > OK.
> 
> In 8859-1, "ø" is:
> 
> elg!ene[~]> printf ø |hexdump -C
> 00000000  f8                                                |ø|
> 00000001
> 
> so this does not seem to be the problem here. And all those
> characters I tried are one-byte (all 8859-1 are):
> 
> elg!ene[~]> printf "äßç" |hexdump -C
> 00000000  e4 df e7                                          |äßç|
> 00000003
> 
> So I do not believe this is the same problem. I did, however,
> find it interesting that multi-byte character sets may have been
> in use longer than I imagined.
> 
> 
> -- 
> Eivind Nicolay Evensen
> 

OK. Agreed. Sorry for the noise.

Possibly BSD grep does not properly convert 8-bit (non-7-bit)
characters that are not valid UTF-8 or in-memory Unicode?
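
A minimal sketch of the byte-level facts behind this guess, in Python.
It only shows how the single byte 0xf8 from the hexdump above looks to
an ISO8859-1 decoder and to a UTF-8 decoder; it says nothing about what
BSD grep or lib/libc/regex actually does internally. The helper name
is_valid_utf8 is made up for illustration.

```python
# The byte that hexdump printed for "ø" under nb_NO.ISO8859-1.
raw = b"\xf8"

# In ISO8859-1 every byte maps to exactly one character.
as_latin1 = raw.decode("iso8859-1")  # U+00F8, i.e. "ø"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lone 0xf8 is not a complete UTF-8 sequence, so a decoder that
# assumes UTF-8 input cannot interpret it as a character.
print(is_valid_utf8(raw))

# The same character in UTF-8 is the two bytes 0xC3 0xB8.
as_utf8 = as_latin1.encode("utf-8")
print(as_utf8.hex())
```

So the pattern byte is a perfectly good character in the user's locale,
but garbage to any code path that expects UTF-8.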

The 0x5c (backslash) problem was a nightmare for Japanese programmers
and early adopters (those running IBM PC software on the NEC PC-98 with
a simulator) at the time. It arrived together with Turbo C, which had
not yet been properly ported for Shift-JIS. :-(
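
For anyone who never met it, here is a small Python sketch of that
0x5c problem, using the byte sequence 0x95 0x5c 0x8e 0xa6 from my
earlier mail (the two kanji U+8868 U+793A). The function
naive_unescape is a hypothetical stand-in for software that treats
every 0x5c as an escape character; it is not taken from any real tool.

```python
# Two kanji whose Shift-JIS encoding contains 0x5c as a trail byte.
text = "\u8868\u793a"
sjis = text.encode("shift_jis")
print(sjis.hex(" "))


def naive_unescape(data: bytes) -> bytes:
    """Treat 0x5c as an escape byte: drop it and keep the next byte.

    Hypothetical model of charset-unaware software; it knows nothing
    about 2-byte characters.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x5C and i + 1 < len(data):
            i += 1  # skip the "backslash" itself
        out.append(data[i])
        i += 1
    return bytes(out)


# One pass eats the trail byte of the first kanji and corrupts the text.
damaged = naive_unescape(sjis)
print(damaged.hex(" "))

# The old workaround: double every 0x5c so that one pass restores it.
padded = sjis.replace(b"\x5c", b"\x5c\x5c")
restored = naive_unescape(padded)
print(restored == sjis)
```

Each additional unescaping pass needs another round of doubling, which
is why the number of extra 0x5c bytes kept changing.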

Until then, it had been a nightmare only for corporate, professional
programmers.

-- 
Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>