Re: ISO-8859-1 file name in UTF-8 file system

In reply to: Chris Torek : "Re: ISO-8859-1 file name in UTF-8 file system"

From: George Mitchell <george+freebsd_at_m5p.com>
Date: Thu, 29 Feb 2024 18:11:29 UTC

On 2/28/24 23:22, Chris Torek wrote:
> On Wed, Feb 28, 2024 at 5:31 PM George Mitchell <george+freebsd@m5p.com> wrote:
>> [...]
>> Be that as it may, what can I do at this point to transmogrify that
>> Python str with the \udcc3 back into the literal bytes found in the
>> file name on the disk, so that I can then encode them into proper
>> UTF-8 from ISO-8859-1?                                    -- George
> 
> I ran into this problem ages ago on another system.  Here is what I did
> (note that some modern Python checkers hate the lambda form, I wrote
> this a long time ago):
> 
> if sys.version_info[0] >= 3:
>      # Python3 encodes "impossible" strings using UTF-8 and
>      # surrogate escapes.  For instance, a file named <\300><\300>eek
>      # (where \300 is octal 300, 0xc0 hex) turns into '\udcc0\udcc0eek'.
>      # This is how we can losslessly re-encode this as a byte string:
>      path_to_bytes = lambda path: path.encode('utf8', 'surrogateescape')
> 
>      # If we wish to print one of these byte strings, we have a
>      # problem, because they're not valid UTF-8.  This method
>      # treats the encoded bytes as pass-through, which is
>      # probably the best we can do.
>      bpath_to_str = lambda path: path.decode('unicode_escape')
> else:
>      # Python2 just uses byte strings, so OS paths are already
>      # byte strings and we return them unmodified.
>      path_to_bytes = lambda path: path
>      bpath_to_str = lambda path: path
> 
> Chris
> 
This is what I needed!  Specifically, upon first getting the file name
from os.walk, I immediately replace it (call it 'orig') with:
fn = orig.encode('utf8', 'surrogateescape').decode('iso8859-1')
Works perfectly!
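
For anyone who wants to try this without a badly named file at hand, here
is a minimal sketch of the same round trip. The helper names
(recover_latin1_name, fix_name) and the sample file name are hypothetical,
made up for illustration. The check in fix_name is an assumption on my
part and not something from Chris's code: it skips names that are already
valid UTF-8, since decoding those as ISO-8859-1 would garble them.

```python
def recover_latin1_name(name: str) -> str:
    """Undo surrogate escapes, then read the raw bytes as ISO-8859-1."""
    # 'surrogateescape' turns each \udcXX code point back into byte 0xXX,
    # which reproduces the exact bytes stored in the directory entry.
    raw = name.encode('utf8', 'surrogateescape')
    # Every byte value is a valid ISO-8859-1 character, so this cannot fail.
    return raw.decode('iso8859-1')


def fix_name(name: str) -> str:
    """Convert only names that contain surrogate escapes (assumed policy)."""
    try:
        # Strict UTF-8 encoding rejects lone surrogates such as \udce9.
        name.encode('utf8')
    except UnicodeEncodeError:
        return recover_latin1_name(name)
    return name


# Simulate what Python hands back for an ISO-8859-1 name on a UTF-8 system:
# the byte 0xe9 is not valid UTF-8 here, so it becomes '\udce9'.
disk_bytes = 'café.txt'.encode('iso8859-1')            # b'caf\xe9.txt'
orig = disk_bytes.decode('utf8', 'surrogateescape')    # 'caf\udce9.txt'

fn = fix_name(orig)
print(fn)                  # café.txt
print(fn.encode('utf8'))   # b'caf\xc3\xa9.txt'
```

Note that a name mixing valid UTF-8 sequences with stray ISO-8859-1 bytes
would still come out partly garbled, because the whole name is decoded as
ISO-8859-1 once any surrogate is found.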

On 2/28/24 23:01, Tomoaki AOKI wrote:
 > [...]
 > Use converters/convmv [1] to rename files?
 >
 > I used it to convert ShiftJIS (CP932) filenames to UTF-8 long, long ago.
 >
 > [1] https://www.freshports.org/converters/convmv/
 >
I didn't try this, but at first I did simply rename the files manually.
Thanks for the suggestion, though I ended up using Chris's approach.