Re: Tool to compare directories and delete duplicate files from one directory

From: Sysadmin Lists <sysadmin.lists_at_mailfence.com>
Date: Sun, 14 May 2023 22:48:52 UTC
> ----------------------------------------
> From: David Christensen <dpchrist@holgerdanske.com>
> Date: May 13, 2023, 6:55:26 PM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
> 
> I wrestled with a Perl script years ago when I did not know of 
> fdupes(1), jdupes(1), etc..  Brute force O(N^2) comparison worked for 
> toy datasets, but was impractical when I applied it to a directory 
> containing thousands of files and hundreds of gigabytes.  (The OP 
> mentioned 12 TB.)  Practical considerations of run time, memory usage, 
> disk I/O, etc., drove me to find the kinds of optimizations fdupes(1) 
> and jdupes(1) mention.
> 
> 
> I do not know Awk, so it is hard to comment on your script.  I suggest 
> commenting out any create/update/delete code, running the script against 
> larger and larger datasets, and seeing what optimizations you can add.
> 
> 
> David

All good points, and why I rewrote it without multi-dimensional arrays.
Initially, `stat' was run on every file encountered, and sizes were then
compared on matched path/filename pairs. The multi-dimensional arrays stored
the filenames, paths, and sizes (hence, multi-d). But that's wasteful, since we
only care about size if there is a duplicate somewhere. This version runs
`stat' only when an apparent duplicate is found, which cuts down the `stat'
calls significantly.
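
To illustrate the idea outside the script, here's a stripped-down sketch for
the two-directory case (dir1/dir2 are placeholder paths; no dry-run handling,
no quoting edge cases): `stat' only ever runs when the same relative name
shows up under both roots.

dir1=/placeholder/one dir2=/placeholder/two
find "$dir1" "$dir2" -xdev -type f | awk -v d1="$dir1" -v d2="$dir2" '
{
    rel = $0
    if (!sub("^" d1 "/", "", rel))        # strip whichever root the path is under
        sub("^" d2 "/", "", rel)
    if (rel in seen) {                    # apparent duplicate: same relative name twice
        c1 = "stat -f %z \"" seen[rel] "\""; c1 | getline s1; close(c1)
        c2 = "stat -f %z \"" $0 "\"";        c2 | getline s2; close(c2)
        print (s1 == s2 ? "duplicates:" : "difference:"), seen[rel], $0
    } else seen[rel] = $0                 # first sighting: no stat at all
}'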

The reason awk is so efficient at these kinds of tasks is that it works by
string comparison and string manipulation, which are cheap when done properly.
The most resource-intensive part of the program is the initial `find' command,
which traverses the directories given on the command line once; the traversal
lands in the filesystem cache, so running `find' twice in succession uses the
cache the second time.
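
You can see the cache effect by timing the same `find' twice back to back
(the paths here are placeholders):

time find /placeholder/one /placeholder/two -xdev -type f > /dev/null  # cold: reads the disk
time find /placeholder/one /placeholder/two -xdev -type f > /dev/null  # warm: served from cache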

It even trims the list of files as it goes, so there are fewer names left to
test against as it runs.
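
A toy illustration of that trimming, with four made-up keys standing in for
the file list:

awk 'BEGIN {
    files["a"]; files["b"]; files["c"]; files["d"]
    for (f in files) {
        n = 0; for (g in files) n++      # how many entries are left to test against?
        print f, "tested against", n, "remaining entries"
        delete files[f]                  # done with f, drop it from the list
    }
}'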

I've run it on paths containing 40,000+ files and it takes less than one
second on the second run, and less than five seconds on the first. File sizes
don't matter, since we only do a `stat' call to retrieve each file's recorded
size; we never compare contents.
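
That's the difference between asking the filesystem for metadata and actually
reading the data; for example (file names are placeholders):

stat -f %z /placeholder/one/big.iso                    # prints the byte count, returns instantly
cmp /placeholder/one/big.iso /placeholder/two/big.iso  # a content comparison would read the files themselves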

That said, I found a bug when command-line paths share similar leading path
components, and realized I wasn't protecting names containing whitespace. This
version fixes both. Try it out using the [-n] flag (there's an example
invocation after the script below): it doesn't do anything but find files,
compare names, and compare sizes on apparent duplicates. During non-dry-run
calls the `rm' command is further protected by its [-i] flag, which prompts the
user before deleting anything.

#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi

echo "Building files list from: ${@}"

find "${@}" -xdev -type f |
awk -v n=$n 'BEGIN { cmd = "stat -f %z "          # BSD stat(1): print file size in bytes
# join the given dirs into one regex alternation, then zero ARGC so awk reads the find output
for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
# for every path piped in, remember it and where its dir-relative part begins
     { files[$0] = match($0, "(" args ")/?") + RLENGTH }
END  { for (i in ARGV) sub("/*$", "/", ARGV[i])   # make each dir end in a single "/"
       print "Comparing files ..."
       # walk the dirs in command-line order, looking only at files under dir i
       for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
           # does a later dir hold the same relative name?  only then stat both
           for (j = i + 1; j < x; j++)
               if (ARGV[j] substr(file, files[file]) in files) {
                   dup = ARGV[j] substr(file, files[file])
                   cmd "\"" file "\"" | getline fil_s; close(cmd "\"" file "\"")
                   cmd "\"" dup  "\"" | getline dup_s; close(cmd "\"" dup  "\"")
                   if (dup_s == fil_s) act("dup")
                   else act("diff") }
           delete files[file]   # trim the list as we go: fewer entries on later passes
     } }
function act(message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    # stdin is still the find pipe, so let rm -i read its answer from the tty
    if (!n) system("rm -vi \"" dup "\" </dev/tty")
}' "${@}"
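
For anyone trying it out, a typical first pass might look like this (the
script name and paths are just placeholders):

sh dedup.sh -n /pool/backup1 /pool/backup2    # dry run: report only, nothing is removed
sh dedup.sh /pool/backup1 /pool/backup2       # real run: rm -vi still asks before each delete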


-- 
Sent with https://mailfence.com  
Secure and private email