Re: Tool to compare directories and delete duplicate files from one directory
- In reply to: Sysadmin Lists : "Re: Tool to compare directories and delete duplicate files from one directory"
Date: Sun, 14 May 2023 01:55:26 UTC
On 5/12/23 10:24, Sysadmin Lists wrote:
> Curiosity got the better of me. I've been searching for a project that
> requires multi-dimensional arrays in BSD awk (which doesn't support them
> directly). But after writing one, I realized there was a more efficient way
> without them: only run `stat' on files whose relative path and name match
> [nonplussed]. Here's that one.
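>
> For the record, the portable way to fake a multi-dimensional array in any
> POSIX awk (BSD awk included) is the SUBSEP trick: the subscripts are joined
> into a single string key. A minimal sketch, separate from the script below:
>
> # sizes[dir, name] is really sizes[dir SUBSEP name] -- one flat array
> echo "photos img001.jpg 4096" |
> awk '{ sizes[$1, $2] = $3 }
> END { if (("photos", "img001.jpg") in sizes) print sizes["photos", "img001.jpg"] }'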
>
> #!/bin/sh -e
> # remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
> if [ "X$1" = "X-n" ]; then n=1; shift; fi
>
> echo "Building file list from ... ${@}"
>
> find "${@}" -xdev -type f |
> awk -v n="$n" 'BEGIN { cmd = "stat -f %z "; q = "\"" # BSD stat: size in bytes
> for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
> { files[$0] = match($0, "^(" args ")/?") + RLENGTH } # index of filename
> END { for (i in ARGV) sub("/+$", "", ARGV[i]) # remove trailing-/s
> print "Comparing files ..."
> for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
> for (j = i + 1; j < x; j++)
> if (ARGV[j] "/" substr(file, files[file]) in files) {
> dup = ARGV[j] "/" substr(file, files[file])
> cmd q file q | getline fil_s; close(cmd q file q)
> cmd q dup q | getline dup_s; close(cmd q dup q)
> if (dup_s == fil_s) act(file, dup, "dup")
> else act(file, dup, "diff") }
> delete files[file]
> } }
>
> function act(file, dup, message) {
> print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
> if (!n) system("rm -vi " q dup q " </dev/tty") # quoted: names with spaces
> }' "${@}"
>
> Priority is given by argument order (first highest, last lowest). Unless '-n'
> is given, the user is prompted to delete each lower-priority dupe as it is
> found; with '-n' the script only reports what it finds. Comparing by name and
> size alone seems odd (a simple `diff' of the two files would be easier and
> more trustworthy). Surprisingly, accounting for a mixture of dirnames with
> and without trailing slashes (dir1 dir2/) was the tricky part.
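>
> For example, saved as `dupes.sh' (hypothetical name), a report-only run
> might look like:
>
> $ ./dupes.sh -n photos backup/photos
> Building file list from ... photos backup/photos
> Comparing files ...
> duplicates: backup/photos/img001.jpg photos/img001.jpg
> difference: backup/photos/img002.jpg photos/img002.jpg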
>
> Fun challenge. Learned a lot about awk.
I wrestled with a Perl script years ago, before I knew of fdupes(1),
jdupes(1), etc. Brute-force O(N^2) comparison worked for toy datasets, but
was impractical when I applied it to a directory containing thousands of
files and hundreds of gigabytes. (The OP mentioned 12 TB.) Practical
considerations of run time, memory usage, disk I/O, etc., drove me to the
kinds of optimizations fdupes(1) and jdupes(1) mention: group files by size
first, and only hash or byte-compare the files within each size group.
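
Roughly, the size-first idea looks like this in shell (a sketch, assuming GNU
find, xargs, and coreutils, and that no filename contains a newline):

# print size<TAB>path, keep only sizes seen more than once, then hash
# just those files and print groups of identical hashes
find "$@" -xdev -type f -printf '%s\t%p\n' |
awk -F'\t' '{ n[$1]++; p[$1] = p[$1] ? p[$1] "\n" $2 : $2 }
END { for (s in n) if (n[s] > 1) print p[s] }' |
xargs -r -d '\n' md5sum |
sort | uniq -w32 --all-repeated=separate
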
I do not know Awk, so it is hard to comment on your script. I suggest
commenting out any create/update/delete code, running the script against
larger and larger datasets, and seeing what optimizations you can add.
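Something like this throwaway harness (hypothetical paths; `dupes.sh' is
whatever the script was saved as) makes the growth in run time easy to watch:

# 1000 random 4 KB files in a/, with file100..file199 duplicated into b/
mkdir -p /tmp/dup-test/a /tmp/dup-test/b
i=0
while [ $i -lt 1000 ]; do
    i=$((i + 1))
    head -c 4096 /dev/urandom > /tmp/dup-test/a/file$i
done
cp /tmp/dup-test/a/file1?? /tmp/dup-test/b/
time ./dupes.sh -n /tmp/dup-test/a /tmp/dup-test/b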
David