Re: Tool to compare directories and delete duplicate files from one directory
Date: Mon, 15 May 2023 22:26:07 UTC
> ----------------------------------------
> From: David Christensen <dpchrist@holgerdanske.com>
> Date: May 15, 2023, 1:43:38 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> It looks like your script only finds duplicates when the subpath is
> identical (?):
>
Yeah. Wasn't that the original problem description? I went off the example
given by Paul earlier in this thread, and it looked like only files with
matching subpaths were being considered (because the OP accidentally rsync'd
files from a source to a bunch of destination dirs).

If we're simply looking for files that have the same name anywhere in the set
of dirs, and then comparing their sizes to decide whether they're assumed (!)
duplicates or differ, that's way easier to program.
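
For illustration, here is a minimal sketch of that simpler approach. It is not
the script from this thread: the function names `fsize' and `same_name_report'
are made up for the example, it assumes file names contain no tabs or newlines,
it uses `ls -ln' for the size so that it runs on both BSD and GNU userlands,
and it only compares each file against the first one seen with the same name.
It also forks `ls' once per file, so it makes no claim to being fast.

```shell
#!/bin/sh
# Sketch only: report files that share a basename anywhere under the
# given dirs, and whether their sizes match. Helper names are hypothetical.

# fsize FILE: print the size in bytes (5th field of "ls -ln")
fsize() {
    ls -ln -- "$1" | awk '{ print $5 }'
}

# same_name_report DIR...: one line per file whose basename was seen before
same_name_report() {
    find "$@" -type f | while IFS= read -r path; do
        # emit: basename <TAB> size <TAB> full path
        printf '%s\t%s\t%s\n' "${path##*/}" "$(fsize "$path")" "$path"
    done | sort | awk -F '\t' '
        # same basename as the first file of the current group: compare sizes
        $1 == name {
            tag = ($2 == size) ? "same size:" : "size differs:"
            print tag, first, $3
        }
        # new basename: remember the first file of the group
        $1 != name { name = $1; size = $2; first = $3 }'
}
```

Calling `same_name_report foo bar' would print one "same size:" or
"size differs:" line for each file whose name was already seen elsewhere.
Equal size is still only an assumption of equality; something like `cmp'
would be needed to confirm the contents match.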

As a side note on performance, I ran the program on a set of 8 dirs containing
over 750,000 files and 300G of data. Here are the results:

    real 0m10.791s
    user 0m5.361s
    sys 0m5.928s

And here are the results for counting the files in the dirs using `wc':

    real 0m12.464s
    user 0m0.834s
    sys 0m11.671s

That means the program processed the list of files quicker than `wc' could
count them, which is wild. Obviously, as the number of apparent duplicates
increases, so does the number of `stat' calls, and the run-time with it.
But this shows how efficient awk is at comparing strings.
> 2023-05-15 01:38:20 dpchrist@vf1 /vf1zpool1/dpchrist
> $ cp -Ra foo bar
>
> 2023-05-15 01:39:18 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
>
> 2023-05-15 01:39:41 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
> 26 24 82
>
> 2023-05-15 01:39:44 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
> 26 24 82
>
> 2023-05-15 01:40:10 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> duplicates: bar/1/i-j foo/1/i-j
> duplicates: bar/1/2/e foo/1/2/e
> duplicates: bar/1/a-b foo/1/a-b
> duplicates: bar/1/g foo/1/g
> duplicates: bar/1/2/i foo/1/2/i
> duplicates: bar/q-r foo/q-r
> duplicates: bar/m-n foo/m-n
> duplicates: bar/1/2/m foo/1/2/m
> duplicates: bar/c foo/c
> duplicates: bar/e-f foo/e-f
> duplicates: bar/1/s foo/1/s
> duplicates: bar/k foo/k
> duplicates: bar/o foo/o
> duplicates: bar/q foo/q
> duplicates: bar/1/c-d foo/1/c-d
> duplicates: bar/1/2/s-t foo/1/2/s-t
> duplicates: bar/1/2/o-p foo/1/2/o-p
> duplicates: bar/1/2/k-l foo/1/2/k-l
> duplicates: bar/g-h foo/g-h
>
> 2023-05-15 01:40:22 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
> 26 24 82
>
> 2023-05-15 01:40:29 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 bar | wc
> 26 24 82
>
> 2023-05-15 01:40:34 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo bar
> Building files list from: foo bar
> Comparing files ...
> duplicates: bar/1/2/a foo/1/2/a
> remove bar/1/2/a? n
> duplicates: bar/1/i-j foo/1/i-j
> remove bar/1/i-j? n
> duplicates: bar/1/2/e foo/1/2/e
> remove bar/1/2/e? n
> duplicates: bar/1/a-b foo/1/a-b
> remove bar/1/a-b? n
> duplicates: bar/1/g foo/1/g
> remove bar/1/g? n
> duplicates: bar/1/2/i foo/1/2/i
> remove bar/1/2/i? n
> duplicates: bar/q-r foo/q-r
> remove bar/q-r? n
> duplicates: bar/m-n foo/m-n
> remove bar/m-n? n
> duplicates: bar/1/2/m foo/1/2/m
> remove bar/1/2/m? n
> duplicates: bar/c foo/c
> remove bar/c? n
> duplicates: bar/e-f foo/e-f
> remove bar/e-f? n
> duplicates: bar/1/s foo/1/s
> remove bar/1/s? n
> duplicates: bar/k foo/k
> remove bar/k? n
> duplicates: bar/o foo/o
> remove bar/o? n
> duplicates: bar/q foo/q
> remove bar/q? n
> duplicates: bar/1/c-d foo/1/c-d
> remove bar/1/c-d? n
> duplicates: bar/1/2/s-t foo/1/2/s-t
> remove bar/1/2/s-t? n
> duplicates: bar/1/2/o-p foo/1/2/o-p
> remove bar/1/2/o-p? n
> duplicates: bar/1/2/k-l foo/1/2/k-l
> remove bar/1/2/k-l? n
> duplicates: bar/g-h foo/g-h
> remove bar/g-h? n
>
>
> David
>
Thanks for running that test. It's working as designed. However, it doesn't
check whether the apparent duplicate is literally the same file (same inode),
encountered through an overlapping directory or a hard link. The version
below does (although it might be a moot point if I misunderstood the original
problem).
#!/bin/sh -e
# remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
if [ "X$1" = "X-n" ]; then n=1; shift; fi
echo "Building files list from: ${@}"
find "${@}" -xdev -type f |
awk -d1 -v n=$n 'BEGIN { cmd = "stat -f \"%i %z\" "
    # join the dir arguments into one alternation; ARGC = 0 so awk reads stdin
    for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
# for each path, remember the offset where its subpath (after the dir) starts
{ files[$0] = match($0, "(" args ")/?") + RLENGTH }
END { for (i in ARGV) sub("/*$", "/", ARGV[i])
    print "Comparing files ..."
    for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
        for (j = i +1; j < x; j++)
            if (ARGV[j] substr(file, files[file]) in files) {
                dup = ARGV[j] substr(file, files[file])
                # fetch inode and size of both candidates
                cmd "\"" file "\"" | getline; close(cmd "\"" file "\"")
                fil_i = $1; fil_s = $2
                cmd "\"" dup "\"" | getline; close(cmd "\"" dup "\"")
                dup_i = $1; dup_s = $2
                # same inode: one file seen twice (hard link or overlapping dirs)
                if (fil_i == dup_i) continue
                if (fil_s == dup_s) { act("dup") } else act("diff") }
        delete files[file]
    } }
function act(message) {
    print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
    if (!n) system("rm -vi \"" dup "\" </dev/tty")
}' "${@}"
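
To see in isolation why the `fil_i == dup_i' test matters, here is a small
sketch of the same-inode check. The helper names `inode_of' and `same_inode'
are invented for this example, and it reads the inode from `ls -di' rather
than `stat -f' so that it is not FreeBSD-specific. It assumes both paths live
on the same filesystem, since inode numbers are only unique per filesystem.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the script above.

# inode_of FILE: print the inode number as reported by "ls -di"
inode_of() {
    ls -di -- "$1" | awk '{ print $1 }'
}

# same_inode A B: succeed if both names refer to the same inode
same_inode() {
    [ "$(inode_of "$1")" = "$(inode_of "$2")" ]
}
```

A hard link made with `ln orig hard' should make `same_inode orig hard'
succeed, whereas a copy made with `cp orig copy' should not, even though the
copy has the same size. Removing the "duplicate" in the first case would only
drop one name for the data, which is why the script skips such pairs.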