From: Eugene Grosbein
To: Eivind Nicolay Evensen
Cc: freebsd-stable@freebsd.org
Subject: Re: Grep with non-ascii
Date: Sat, 4 Feb 2023 08:41:17 +0700
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable

03.02.2023 21:18, Eivind Nicolay Evensen wrote:
> On Fri, 3 Feb 2023 19:12:32 +0700,
> Eugene Grosbein wrote:
>
>> 03.02.2023 17:06, Eivind Nicolay Evensen wrote:
>>> Hello.
>>>
>>> I just noticed this today:
>>>
>>> elg!ene[~]> printf "bø\nhei\nøl\n" | grep ø
>>> grep: trailing backslash (\)
>>> elg!ene[~]> echo $LC_CTYPE $LANG
>>> nb_NO.ISO8859-1 nb_NO.ISO8859-1
>>>
>>> While I have the result I envisioned with gnugrep:
>>>
>>> elg!ene[~]> printf "bø\nhei\nøl\n" | ggrep ø
>>> bø
>>> øl
>>>
>>> Also, on OpenIndiana, Linux and NetBSD, grep gives the proper
>>> result.
>>>
>>> Is lib/libc/regex the right place to look into this if I
>>> find the time, or does anybody know this enough to know the
>>> problem?
>>
>> Try single quotes instead of double quotes.
>> And please specify system version and shell name, and shell version
>> if it's not in base system.
>
> This is
> elg!ene[~]> uname -a
> FreeBSD elg.hjerdalen.lokalnett 13.2-PRERELEASE FreeBSD 13.2-PRERELEASE
> #1: Tue Jan 31 11:23:29 CET 2023
> ene@elg.hjerdalen.lokalnett:/usr/obj/usr/src/amd64.amd64/sys/ENE-spurv
> amd64
>
> Using the tcsh that comes with it. But I don't think the quotes matter
> much because of this:
>
> elg!ene[~]> grep ø
> grep: trailing backslash (\)
>
> The output was more just to have something to look for, like
> with ggrep but anyway:
>
> elg!ene[~]> printf 'bø\nhei\nøl\n' |grep ø
> grep: trailing backslash (\)
>
> And obviously:
>
> elg!ene[~]> printf 'bø\nhei\nøl\n'
> bø
> hei
> øl
>
> And it seems to be the same for any 8859-1 character not part
> of ascii:
>
> elg!ene[~]> grep ä
> grep: trailing backslash (\)
> elg!ene[~]> grep ß
> grep: trailing backslash (\)
> elg!ene[~]> grep ç
> grep: trailing backslash (\)

I checked it with the ru_RU.KOI8-R locale and the same problem showed up
with every Cyrillic letter. The following one-liner prints the octal code
and the character for each affected position in the upper half of the
8-bit character table (it reports every byte for which grep exits with a
status greater than 1, that is, an error rather than "no match"):

$ jot -w '%o' - 128 255 1 | xargs -n2 -I^ printf '^ \^\n' | while read octal char; do grep -q "$char" /etc/motd 2>/dev/null; [ $? -gt 1 ] && echo $octal $char; done

Note that this problem does not exist in FreeBSD 12.4 or earlier versions,
so this is a recent regression. Surely that's because the grep command is
GNU grep in 12.4 but BSD grep in 13.x.
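
For systems without jot(1), the same check can be sketched in plain
POSIX sh. This is only an illustration of the idea and not a replacement
for the one-liner: the function name scan_high_bytes and the GREP
variable are my own hypothetical names, introduced so that a different
grep binary (for example ggrep) can be swapped in for comparison. The
result still depends on the locale set in the environment.

```shell
#!/bin/sh
# Sketch: report high-half bytes (octal 200-377) for which grep exits
# with an error (status > 1) instead of "match" (0) or "no match" (1).
# GREP and scan_high_bytes are hypothetical names used for illustration.
GREP=${GREP:-grep}

scan_high_bytes() {
    file=$1
    code=128
    while [ "$code" -le 255 ]; do
        octal=$(printf '%o' "$code")
        # Turn the octal code into the single byte it stands for.
        char=$(printf "\\$octal")
        "$GREP" -q -- "$char" "$file" 2>/dev/null
        status=$?
        # Only an exit status above 1 indicates a grep error.
        if [ "$status" -gt 1 ]; then
            printf '%s %s\n' "$octal" "$char"
        fi
        code=$((code + 1))
    done
}
```

Calling "scan_high_bytes /etc/motd" under the affected locale should list
the same positions as the one-liner, and running it again with GREP=ggrep
gives a side-by-side comparison with GNU grep.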