Date: Sat, 4 Feb 2023 12:46:02 +0900
From: Tomoaki AOKI
To: stable@freebsd.org
Subject: Re: Grep with non-ascii
List-Archive: https://lists.freebsd.org/archives/freebsd-stable

On Fri, 3 Feb 2023 17:31:55 +0100
Eivind Nicolay Evensen wrote:

> On Sat, 4 Feb 2023 01:06:05 +0900
> Tomoaki AOKI wrote:
>
> > On Fri, 3 Feb 2023 15:18:53 +0100
> > Eivind Nicolay Evensen wrote:
> >
> > > On Fri, 3 Feb 2023 19:12:32 +0700
> > > Eugene Grosbein wrote:
> > >
> > > > 03.02.2023 17:06, Eivind Nicolay Evensen wrote:
> > > > > Hello.
> > > > >
> > > > > I just noticed this today:
> > > > >
> > > > > elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø
> > > > > grep: trailing backslash (\)
> > > > > elg!ene[~]> echo $LC_CTYPE $LANG
> > > > > nb_NO.ISO8859-1 nb_NO.ISO8859-1
> > > > >
> > > > > While I have the result I envisioned with gnugrep:
> > > > >
> > > > > elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø
> > > > > bø
> > > > > øl
> > > > >
> > > > > Also, on OpenIndiana, Linux and NetBSD, grep gives the proper
> > > > > result.
> > > > >
> > > > > Is lib/libc/regex the right place to look into this if I
> > > > > find the time, or does anybody know this enough to know the
> > > > > problem?
> > > >
> > > > Try single quotes instead of double quotes.
> > > > And please specify system version and shell name, and shell
> > > > version if it's not in base system.
> > >
> > > This is
> > > elg!ene[~]> uname -a
> > > FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD
> > > 13.2-PRERELEASE #1: Tue Jan 31 11:23:29 CET 2023
> > > ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> > > amd64
> > >
> > > Using the tcsh that comes with it. But I don't think the quotes
> > > matter much because of this:
> > >
> > > elg!ene[~]> grep ø
> > > grep: trailing backslash (\)
> > >
> > > The output was more just to have something to look for, like
> > > with ggrep but anyway:
> > >
> > > elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø
> > > grep: trailing backslash (\)
> > >
> > > And obviously:
> > >
> > > elg!ene[~]> printf 'bø\nhei\nøl\n'
> > > bø
> > > hei
> > > øl
> > >
> > > And it seems to be the same for any 8859-1 character not part
> > > of ASCII:
> > >
> > > elg!ene[~]> grep ä
> > > grep: trailing backslash (\)
> > > elg!ene[~]> grep ß
> > > grep: trailing backslash (\)
> > > elg!ene[~]> grep ç
> > > grep: trailing backslash (\)
> > >
> > > --
> > > Eivind Nicolay Evensen
> >
> > I recalled a very, very old problem with Japanese characters.
> > Do the characters you mentioned include 0x5c in the nb_NO.ISO8859-1
> > charset?
> >
> > In the dirty, ugly DOS era, Shift-JIS (CP932) was the mainstream in
> > Japan. In this charset, some 2-byte kanji characters have 0x5c as
> > their second byte.
> >
> > This caused imported, non-Japanese-aware software to mishandle
> > Japanese text, and the workaround was to add an extra 0x5c after
> > problematic characters. :-(
> >
> > For example, 表示 in a Shift-JIS byte stream was 0x95 0x5c 0x8e 0xa6,
> > and since 0x5c was usually treated as backslash, the escape character,
> > it was modified to 0x95 0x8e 0xa6 by non-Japanese software.
> > As this mis-conversion often happened recursively, the required
> > number of extra 0x5c bytes varied, varied and varied!!!!! Crazily.
> >
> > If this is a case like the above, the only solution is to move to
> > a character set containing ALL characters from all over the world.
> >
> > AFAIK, there are only two candidates, TRON code [1] and Unicode
> > (UCS, ISO/IEC 10646) [2]. And since TRON code is very rarely used,
> > the only actual candidate would be Unicode.
> > Note that Unicode is usually encoded as one of UTF-8, UTF-16 or
> > UTF-32 for data transfer (sometimes raw UCS-2?).
> >
> >
> > [1] https://en.wikipedia.org/wiki/TRON_(encoding)
> > [2] https://en.wikipedia.org/wiki/Unicode
> >
> > P.S.
> > In UTF-8, the character ø is encoded as 0xC3 0xB8. So it should be
> > OK.
>
> In 8859-1, "ø" is:
>
> elg!ene[~]> printf ø |hexdump -C
> 00000000  f8                                                |ø|
> 00000001
>
> so this does not seem to be the problem here. And all those
> characters I tried are one-byte (all 8859-1 are):
>
> elg!ene[~]> printf "äßç" |hexdump -C
> 00000000  e4 df e7                                          |äßç|
> 00000003
>
> So I do not believe this is the same problem. I did, however,
> find it interesting that multi-byte character sets may have been
> in use longer than I imagined.
>
>
> --
> Eivind Nicolay Evensen

OK. Agreed. Sorry for the noise.

Possibly, 8-bit (non-7-bit) characters that are not part of UTF-8 or
in-memory Unicode are not converted properly in BSD grep?

The 0x5c (backslash) problem was a nightmare for Japanese programmers
and early adopters (running IBM PC software on the NEC PC98 with a
simulator) at that time, and came in conjunction with Turbo C, which
was not yet properly ported for Shift-JIS. :-(
Until then, it was a nightmare for corporate, professional
programmers only.

--
Tomoaki AOKI
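
The byte values quoted in this thread can be checked with a small sketch. Python is an assumption here, since the thread itself only uses the shell. `naive_unescape` is a hypothetical helper that models encoding-unaware software treating every 0x5c as an escape character; it is not taken from grep or from any real library.

```python
# Sketch only: reproduces the byte values quoted in the thread.
# naive_unescape() is a hypothetical model of encoding-unaware software,
# not code from BSD grep or libc regex.

BACKSLASH = 0x5C

# The ISO8859-1 characters tried in the thread, and ø in UTF-8.
latin1_bytes = "øäßç".encode("iso8859-1")
utf8_o_slash = "ø".encode("utf-8")

# The Shift-JIS byte sequence quoted in the thread.
sjis_bytes = bytes([0x95, 0x5C, 0x8E, 0xA6])


def contains_backslash_byte(data: bytes) -> bool:
    """Return True if any byte in data equals 0x5c."""
    return BACKSLASH in data


def naive_unescape(data: bytes) -> bytes:
    """Drop each 0x5c and keep the byte after it, ignoring the encoding."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == BACKSLASH and i + 1 < len(data):
            out.append(data[i + 1])  # keep the "escaped" byte literally
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


print(latin1_bytes.hex(" "))                   # f8 e4 df e7
print(utf8_o_slash.hex(" "))                   # c3 b8
print(contains_backslash_byte(latin1_bytes))   # False
print(contains_backslash_byte(sjis_bytes))     # True
print(naive_unescape(sjis_bytes).hex(" "))     # 95 8e a6
```

None of the ISO8859-1 bytes is 0x5c, which matches the conclusion above that the Shift-JIS problem is not what triggers the grep error. Doubling the 0x5c before it reaches `naive_unescape` restores the original four bytes, which is the workaround described for Shift-JIS text.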