Date: Sat, 4 Feb 2023 01:06:05 +0900
From: Tomoaki AOKI
To: stable@freebsd.org
Subject: Re: Grep with non-ascii
Message-Id: <20230204010605.4874609f80eed28543407807@dec.sakura.ne.jp>
In-Reply-To: <20230203151853.02732bd6@elg.hjerdalen.lokalnett>
References: <20230203110642.70e4a076@elg.hjerdalen.lokalnett>
 <819a4336-9689-bdbe-a90d-8f1d7b842662@grosbein.net>
 <20230203151853.02732bd6@elg.hjerdalen.lokalnett>
Organization: Junchoon corps
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable
Mime-Version: 1.0
Content-Type:
text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On Fri, 3 Feb 2023 15:18:53 +0100
Eivind Nicolay Evensen wrote:

> On Fri, 3 Feb 2023 19:12:32 +0700
> Eugene Grosbein wrote:
>
> > 03.02.2023 17:06, Eivind Nicolay Evensen wrote:
> > > Hello.
> > >
> > > I just noticed this today:
> > >
> > > elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø
> > > grep: trailing backslash (\)
> > > elg!ene[~]> echo $LC_CTYPE $LANG
> > > nb_NO.ISO8859-1 nb_NO.ISO8859-1
> > >
> > > While I have the result I envisioned with gnugrep:
> > >
> > > elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø
> > > bø
> > > øl
> > >
> > > Also, on OpenIndiana, Linux and NetBSD, grep gives the proper
> > > result.
> > >
> > > Is lib/libc/regex the right place to look into this if I
> > > find the time, or does anybody know this enough to know the
> > > problem?
> >
> > Try single quotes instead of double quotes.
> > And please specify system version and shell name, and shell version
> > if it's not in base system.
>
> This is
> elg!ene[~]> uname -a
> FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD 13.2-PRERELEASE
> #1: Tue Jan 31 11:23:29 CET 2023
> ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> amd64
>
> Using the tcsh that comes with it. But I don't think the quotes matter
> much because of this:
>
> elg!ene[~]> grep ø
> grep: trailing backslash (\)
>
> The output was more just to have something to look for, like
> with ggrep but anyway:
>
> elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø
> grep: trailing backslash (\)
>
> And obviously:
>
> elg!ene[~]> printf 'bø\nhei\nøl\n'
> bø
> hei
> øl
>
> And it seems to be the same for any 8859-1 character not part
> of ascii:
>
> elg!ene[~]> grep ä
> grep: trailing backslash (\)
> elg!ene[~]> grep ß
> grep: trailing backslash (\)
> elg!ene[~]> grep ç
> grep: trailing backslash (\)
>
> --
> Eivind Nicolay Evensen

This reminds me of a very, very old problem with Japanese characters.
Do the characters you mentioned include 0x5c in the nb_NO.ISO8859-1
charset?

In the dirty, ugly DOS era, Shift-JIS (CP932) was the mainstream
encoding in Japan. In this charset, some 2-byte kanji characters have
0x5c as their second byte. This caused imported, non-Japanese-aware
software to mishandle Japanese text, and the workaround was to add
extra 0x5c bytes after the problematic characters. :-(

For example, 表示 in a Shift-JIS byte stream is 0x95 0x5c 0x8e 0xa6.
Because 0x5c was usually treated as backslash, the escape character,
non-Japanese software turned it into 0x95 0x8e 0xa6.

Since this mis-conversion often happened recursively, the number of
extra 0x5c bytes required kept varying. Crazy!

If your case is like the one above, the only solution is to move to a
character set that contains ALL characters from all over the world.
AFAIK there are only two candidates, TRON code [1] and Unicode
(UCS, ISO/IEC 10646) [2]. Since TRON code is very rarely used, the
only practical candidate is Unicode.
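
To make the Shift-JIS case above concrete, here is a small Python
sketch. It is only an illustration: naive_unescape() and hexdump() are
hypothetical helpers I made up to mimic a byte-oriented program that
treats every 0x5c as an escape character. It does not reproduce
whatever FreeBSD grep actually does. The last loop only prints how the
characters from this thread are encoded in ISO 8859-1 and UTF-8.

```python
def naive_unescape(data: bytes) -> bytes:
    """Hypothetical stand-in for encoding-unaware software: every 0x5c is
    taken as an escape character, dropped, and the next byte kept as-is."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x5C and i + 1 < len(data):
            i += 1  # skip the "backslash", keep the byte that follows it
        out.append(data[i])
        i += 1
    return bytes(out)


def hexdump(data: bytes) -> str:
    """Format bytes the way they are written in this mail."""
    return " ".join(f"0x{b:02x}" for b in data)


sjis = "表示".encode("shift_jis")
print(hexdump(sjis))                  # 0x95 0x5c 0x8e 0xa6
print(hexdump(naive_unescape(sjis)))  # 0x95 0x8e 0xa6 (0x5c is lost)

# The old workaround: double the 0x5c so that one copy survives one pass.
padded = sjis.replace(b"\x5c", b"\x5c\x5c")
assert naive_unescape(padded) == sjis

# Characters from this thread, as ISO 8859-1 and as UTF-8 bytes.
for ch in "øäßç":
    latin1 = ch.encode("iso-8859-1")
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", "latin1:", hexdump(latin1), "utf-8:", hexdump(utf8))
    assert b"\x5c" not in latin1 and b"\x5c" not in utf8
```

Each extra unescaping pass eats one more 0x5c, which is why the number
of padding bytes depended on how many times the text was processed. In
ISO 8859-1 the four characters above are single bytes above 0x7f, and
none of them is 0x5c.
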
Note that Unicode is usually encoded as one of UTF-8, UTF-16 or UTF-32
for data transfer (sometimes raw UCS-2?).

[1] https://en.wikipedia.org/wiki/TRON_(encoding)
[2] https://en.wikipedia.org/wiki/Unicode

P.S.
In UTF-8, the character ø is encoded as 0xC3 0xB8, so it should be OK.

--
Tomoaki AOKI