From: Tomoaki AOKI
To: George Mitchell
Cc: FreeBSD Hackers
Date: Thu, 29 Feb 2024 13:01:20 +0900
Subject: Re: ISO-8859-1 file name in UTF-8 file system
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Wed, 28 Feb 2024 20:30:19 -0500
George Mitchell wrote:

> (I tried sending this to freebsd-python, but I can't post there
> because I haven't subscribed, and I'm hoping someone here will have
> a suggestion.  Thanks for your indulgence.)
>
> In Python 3.9 on FreeBSD 13.2-RELEASE, sys.getfilesystemencoding()
> reports 'utf-8'.  However, a couple of ancient files on one of my
> disks have names that were evidently ISO-8859-1 encoded at the time
> they were originally created.  When I os.walk() through a directory
> with one of these files, the UTF-8 string name of the file has, for
> example, a '\udcc3' in it.  Literally, the file name on disk had
> hex c3 at that position (ISO-8859-1 for Ã), and I guess \udcc3 is a
> surrogate for the 0xc3, which is incomprehensible in conformant
> UTF-8 (though I don't understand "surrogates" in UTF-8 and you can't
> take that last statement as gospel).
>
> Be that as it may, what can I do at this point to transmogrify that
> Python str with the \udcc3 back into the literal bytes found in the
> file name on the disk, so that I can then encode them into proper
> UTF-8 from ISO-8859-1?                                -- George

Use converters/convmv [1] to rename files?
I used it to convert ShiftJIS (CP932) filenames to UTF-8 long, long ago.
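As for the literal question (getting the on-disk bytes back from the str), here is a minimal sketch. It assumes the '\udcc3' comes from Python's "surrogateescape" error handler (PEP 383), which maps each undecodable byte 0xNN to the lone surrogate U+DCNN. The helper names raw_bytes and recover_name are made up for illustration, and the fallback to ISO-8859-1 is only a guess that fits the files described above.

```python
def raw_bytes(name: str) -> bytes:
    """Undo surrogateescape: '\\udcc3' becomes the original byte 0xc3.

    os.fsencode(name) does the same thing using the interpreter's own
    filesystem encoding; UTF-8 is spelled out here to match the
    sys.getfilesystemencoding() value reported in the question.
    """
    return name.encode("utf-8", "surrogateescape")


def recover_name(name: str, legacy: str = "iso-8859-1") -> str:
    """Return a clean str for a file name as reported by os.walk().

    Names whose bytes are already valid UTF-8 are returned unchanged.
    Anything else is assumed (hypothetically) to be in `legacy`.
    """
    raw = raw_bytes(name)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)
```

Inside an os.walk() loop, something like os.rename(os.path.join(dirpath, name), os.path.join(dirpath, recover_name(name))) should then give the file a proper UTF-8 name. I have not tried this on your files, so test it on a copy first.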
[1] https://www.freshports.org/converters/convmv/

-- 
Tomoaki AOKI