Re: Tool to compare directories and delete duplicate files from one directory
- In reply to: Sysadmin Lists : "Re: Tool to compare directories and delete duplicate files from one directory"
Date: Sat, 20 May 2023 21:59:41 UTC
> ----------------------------------------
> From: Sysadmin Lists <sysadmin.lists@mailfence.com>
> Date: May 19, 2023, 10:19:33 AM
> To: <questions@freebsd.org>
> Subject: Re: Tool to compare directories and delete duplicate files from one directory
>
>
> Performance is pretty good:
> $ time dedup_multidirs.sh -V dedup{1..13}
> DEBUG: 313087 files, 3497 duplicates, 309590 unique, 42848 stat calls
> # 773723 differences: same filenames, different sizes or hashes
>
> real 1m32.719s
> user 0m50.671s
> sys 0m44.054s
>
> $ du -xs dedup{1..13} | awk '{ sum = sum + $1 } END { print sum }'
> 219195746 # 200G+ of data
Found a bug; fixing it shaved 10 seconds off the run time:
--------------------------------------------------------------------------------
diff --git a/dedup_multidirs.sh b/dedup_multidirs.sh
index 8563d49..86c5f07 100755
--- a/dedup_multidirs.sh
+++ b/dedup_multidirs.sh
@@ -48,8 +48,8 @@ END { for (i in ARGV) sub("/*$", "/", ARGV[i])
processed[d]
hits++ }
else act("diff")
- if (c++ == hasf[ARGV[k], file])
- break
+ if (++c == hasf[ARGV[k], file])
+ { c = 0; break }
} } } }
if (e) debug(3)
processed[dups[file, j]]; delete dups[file, j]
--------------------------------------------------------------------------------
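The hunk swaps a post-increment for a pre-increment in the loop's exit test. A minimal sketch of why that matters, using made-up values (the `limit` of 3 and the loop bound of 10 are hypothetical and not taken from dedup_multidirs.sh): with `c++` the comparison sees the old value of `c`, so the loop runs one pass longer than with `++c`.

```shell
#!/bin/sh
# Illustration only: count loop passes with post- vs pre-increment.
# "limit" and the loop bound are hypothetical, not from dedup_multidirs.sh.
count_iterations() {
    awk -v mode="$1" -v limit=3 'BEGIN {
        c = 0; n = 0
        for (i = 1; i <= 10; i++) {
            n++                                    # passes through the loop body
            if (mode == "post") { if (c++ == limit) break }   # compares old c
            else                { if (++c == limit) break }   # compares new c
        }
        print n
    }'
}

count_iterations post   # prints 4
count_iterations pre    # prints 3
```

The patch also resets `c` to 0 before breaking, which presumably keeps a stale count from leaking into the next pass; the surrounding code is not shown in the hunk, so that reading is an assumption.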
As a sanity check, I measured how long it would take merely to store every
encountered file, grouped by filename. It turns out to be very slow:
total files: 14347
real 1m37.176s
user 1m36.823s
sys 0m0.212s
--------------------------------------------------------------------------------
# Store every path read from input; the value is the path's basename.
{ files[$0] = substr($0, match($0, /[^\/]+$/)); tfiles++ }
END {
    for (f in files)
        if (!(f in processed)) {
            processed[f]; dups[f]; hits[files[f]]++
            # Rescan the stored paths for others ending in the same basename.
            for (s in files) {
                if (f != s && !(s in processed))
                    if (s ~ "/" files[f] "$") {
                        processed[s]; dups[s]; hits[files[f]]++
                    }
            }
            compare(dups)
            for (f in dups) { delete dups[f]; delete files[f] }
        }
    for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
    close("sort")
    print "total files:", tfiles
}
function compare(array, f) {
    for (f in array) { }    # do nothing
}
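For comparison, here is a minimal sketch of the same grouping done in a single pass, keying an array on the basename instead of rescanning the stored paths with a regex match. It is not part of the original script; the function name `group_by_basename` and the `members` array are hypothetical, and it assumes one path per line on standard input (paths containing newlines are not handled).

```shell
#!/bin/sh
# Sketch: group paths by basename in one pass over stdin.
# group_by_basename and members[] are illustrative names only.
group_by_basename() {
    awk '{
        name = $0
        sub(/.*\//, "", name)              # strip everything up to the last slash
        members[name, ++hits[name]] = $0   # keep each path under its basename
        tfiles++
    }
    END {
        for (h in hits) printf("%6d %s\n", hits[h], h) | "sort"
        close("sort")
        print "total files:", tfiles
    }'
}
```

It could be fed with something like `find dedup1 dedup2 -type f | group_by_basename`. Because the basename is used as an array key and not as a regex, filenames containing characters such as `.` or `[` are matched literally.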