Re: shell script for removing unprintable characters in file names
- In reply to: Per olof Ljungmark : "shell script for removing unprintable characters in file names"
Date: Sat, 30 Nov 2024 20:06:15 UTC
On 11/30/24 07:33, Per olof Ljungmark wrote:
> Hi,
>
> I am tasked with recovering hundreds or more files created with unknown
> OSs and have unknown characters in the name, replaced with a '?'.
>
> Like file?nam?.???
>
> Please, if you have such a script can you post or email it? Replacing
> the unknown character with anything, like '-' or '_' using whatever
> shell, sh, bash or csh.
>
> Thanks a lot!
>
> Per
The first thing to understand is that traditional Unix file names may
be composed of any 8-bit bytes except the directory path delimiter
(forward slash '/', hexadecimal 0x2f) and the C programming language
string terminator (NUL, hexadecimal 0x00).
Create a test file with some control characters in the file name:
2024-11-30 11:40:47 dpchrist@laalaa ~/sandbox/rename
$ cat /etc/debian_version ; uname -a
11.11
Linux laalaa 5.10.0-33-amd64 #1 SMP Debian 5.10.226-1 (2024-10-03) x86_64 GNU/Linux
2024-11-30 11:41:04 dpchrist@laalaa ~/sandbox/rename
$ perl -e 'open FH, ">", "space tab\tnewline\n"; print FH "hello, world!\n"; close FH'
2024-11-30 11:41:55 dpchrist@laalaa ~/sandbox/rename
$ ls
'space tab'$'\t''newline'$'\n'
2024-11-30 11:42:09 dpchrist@laalaa ~/sandbox/rename
$ ls | hexdump -C
00000000 73 70 61 63 65 20 74 61 62 09 6e 65 77 6c 69 6e |space tab.newlin|
00000010 65 0a 0a |e..|
00000013
2024-11-30 11:42:31 dpchrist@laalaa ~/sandbox/rename
$ cat 'space tab'$'\t''newline'$'\n'
hello, world!
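Before renaming anything, it can help to see which names are affected. The following is a minimal sketch, not a tested tool: list_odd_names is a hypothetical helper name, and it assumes a find(1) that supports "-exec ... {} +" and character classes in -name patterns (GNU find does). Names are handed to the inner shell as arguments, so a newline inside a name is never mistaken for a line separator. Each non-printable byte is shown as '?':

```shell
#!/bin/sh
# list_odd_names DIR   (hypothetical helper, for illustration only)
# Print the base name of every entry under DIR whose name contains at
# least one byte outside printable ASCII, one name per output line,
# with each such byte displayed as '?'.
list_odd_names() {
    # LC_ALL=C makes [:print:] mean the bytes 0x20 through 0x7e.
    LC_ALL=C find "$1" -name '*[![:print:]]*' -exec sh -c '
        for path do
            # Strip the directory part, then mask non-printable bytes.
            printf "%s" "${path##*/}" | LC_ALL=C tr -c "[:print:]" "?"
            printf "\n"
        done' sh {} +
}

# Example: list_odd_names .
```

Spaces are printable, so a name that contains only spaces is not listed; change [:print:] to [:graph:] in both places if spaces should count as well.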
Perl and the URI::Escape module can be used to replace problematic
characters with percent-hexadecimal escape codes:
2024-11-30 11:47:11 dpchrist@laalaa ~/sandbox/rename
$ perl -v | head -n 2 | tail -n 1
This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-gnu-thread-multi
2024-11-30 11:48:11 dpchrist@laalaa ~/sandbox/rename
$ perl -mURI::Escape -e 'print $URI::Escape::VERSION, $/'
5.08
2024-11-30 11:49:25 dpchrist@laalaa ~/sandbox/rename
$ perl -mURI::Escape=uri_escape -e 'foreach (@ARGV) {$in=$_;
$out=uri_escape($_); rename($in, $out) or die "failed to rename $in"}' *
2024-11-30 11:50:54 dpchrist@laalaa ~/sandbox/rename
$ ls
space%20tab%09newline%0A
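If Perl or URI::Escape is not available, the same encoding can be approximated in plain sh. This is only a sketch under assumptions: pct_encode is a hypothetical function name, and the set of bytes left unencoded (A-Z, a-z, 0-9, '-', '.', '_', '~') is my understanding of what uri_escape leaves alone by default. It relies on od(1) and on printf accepting 0x-prefixed numbers, which bash and dash both do:

```shell
#!/bin/sh
# pct_encode STRING   (hypothetical helper, for illustration only)
# Print STRING with every byte outside A-Z a-z 0-9 - . _ ~ replaced
# by a percent sign and two uppercase hexadecimal digits.
pct_encode() {
    # od prints one two-digit hex value per input byte; -v prevents
    # od from collapsing repeated lines into a "*".
    for hex in $(printf '%s' "$1" | od -An -v -tx1); do
        case $hex in
            2d|2e|5f|7e|3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
                # Safe byte: convert hex to an octal escape and print it.
                printf "\\$(printf '%03o' "0x$hex")" ;;
            *)
                # Anything else, including '%' itself: emit %XX.
                printf '%%%02X' "0x$hex" ;;
        esac
    done
    printf '\n'
}

# Example: mv -- "$old" "$(pct_encode "$old")"
```

Because '%' is itself encoded as %25, the result can still be decoded unambiguously later.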
By encoding problematic characters, rather than replacing them with a
constant placeholder character ('?', '_', etc.), information is
preserved and the process is reversible:
2024-11-30 11:50:57 dpchrist@laalaa ~/sandbox/rename
$ perl -mURI::Escape=uri_unescape -e 'foreach (@ARGV) {$in=$_;
$out=uri_unescape($_); rename($in, $out) or die "failed to rename $in"}' *
2024-11-30 11:51:07 dpchrist@laalaa ~/sandbox/rename
$ ls
'space tab'$'\t''newline'$'\n'
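For comparison, here is a sketch of the placeholder approach the original question asked for, replacing each problematic byte with '_'. It is lossy, so two different names can collapse into the same result; the sketch skips a rename when the target already exists rather than overwriting it. rename_tree is a hypothetical name, the set of bytes kept is my own choice, and -mindepth is an extension found in GNU and BSD find rather than in POSIX:

```shell
#!/bin/sh
# rename_tree ROOT   (hypothetical helper, for illustration only)
# Below ROOT, replace every byte in a file or directory name that is
# not one of A-Z a-z 0-9 . _ - with '_'. ROOT itself is not renamed.
rename_tree() {
    # -depth visits a directory's contents before the directory, so
    # children are renamed while their parent path is still valid.
    LC_ALL=C find "$1" -depth -mindepth 1 \
        -name '*[!A-Za-z0-9._-]*' -exec sh -c '
        for path do
            dir=${path%/*}
            base=${path##*/}
            new=$(printf "%s" "$base" |
                  LC_ALL=C tr -c "A-Za-z0-9._-" "_")
            # Never overwrite: two names may map to the same result.
            if [ -e "$dir/$new" ] || [ -L "$dir/$new" ]; then
                printf "skipped, target exists: %s\n" "$dir/$new" >&2
            else
                mv -- "$path" "$dir/$new"
            fi
        done' sh {} +
}

# Example: rename_tree /path/to/recovered/files
```

Trying this on a copy of the data first seems prudent, since unlike the percent-encoding above there is no way to recover the original names afterwards.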
HTH,
David
p.s. I tried rename(1), but it has problems with newlines out of the
box. Writing a pair of one-liners was faster than trying to work around
rename(1):
https://manpages.debian.org/bullseye/rename/rename.1.en.html