Re: Tool to compare directories and delete duplicate files from one directory

From: David Christensen <dpchrist_at_holgerdanske.com>
Date: Mon, 15 May 2023 08:29:38 UTC

On 5/14/23 15:48, Sysadmin Lists wrote:
> #!/bin/sh -e
> # remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
> if [ "X$1" = "X-n" ]; then n=1; shift; fi
> 
> echo "Building files list from: ${@}"
> 
> find "${@}" -xdev -type f |
> awk -v n=$n 'BEGIN { cmd = "stat -f %z "
> for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
>       { files[$0] = match($0, "(" args ")/?") + RLENGTH }
> END  { for (i in ARGV) sub("/*$", "/", ARGV[i])
>         print "Comparing files ..."
>         for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
>             for (j = i +1; j < x; j++)
>                 if (ARGV[j] substr(file, files[file]) in files) {
>                     dup = ARGV[j] substr(file, files[file])
>                     cmd "\"" file "\"" | getline fil_s; close(cmd "\"" file "\"")
>                     cmd "\"" dup  "\"" | getline dup_s; close(cmd "\"" dup  "\"")
>                     if (dup_s == fil_s) act("dup")
>                     else act("diff") }
>             delete files[file]
>       } }
> function act(message) {
>      print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
>      if (!n) system("rm -vi \"" dup "\" </dev/tty")
> }' "${@}"


A virtual machine for testing:

2023-05-15 00:59:39 dpchrist@vf1 /vf1zpool1/dpchrist
$ freebsd-version ; uname -a ; perl -v | grep . | head -n 1
12.4-RELEASE-p2
FreeBSD vf1.tracy.holgerdanske.com 12.4-RELEASE-p1 FreeBSD 12.4-RELEASE-p1 GENERIC  amd64
This is perl 5, version 32, subversion 1 (v5.32.1) built for amd64-freebsd-thread-multi


A Perl script to generate a test tree (tuned to generate a small tree by 
default):

2023-05-15 01:09:12 dpchrist@vf1 /vf1zpool1/dpchrist
$ cat ~/bin/t_dir_tree
#!/usr/bin/env perl
# $Id: t_dir_tree,v 1.4 2023/05/15 08:09:08 dpchrist Exp $
# Generate tree of random directories and files with duplicates
# By David Paul Christensen dpchrist@holgerdanske.com
# Public Domain
use strict;
use warnings;
use File::Path		qw( make_path );
use Getopt::Long;
my $dd = '/usr/bin/env dd';
my $d=3; my $f=10; my $m=1E3; my $u=1;
GetOptions('d=i'=>\$d,'f=i'=>\$f,'m=i'=>\$m,'u=i'=>\$u) && @ARGV == 1
     or die "Usage: t_dir_tree [-d=NDIR] [-f=NFILE] [-m=MAXFILESIZE]",
	   " [-u=MAXDUP] PATH";
my $p = shift;
die "$0: refusing to overwrite existing path '$p'" if -e $p;
my %dp = (0 => $p);
map {$dp{$_} = $dp{int(rand($_))} . "/$_"} 1 .. $d-1;
print map {"$_ directory$/"} make_path(values %dp);
my $n = 'a';
for (0 .. $f-1) {
     my $nsave = $n;
     my $of = $dp{int(rand($d))} . '/' . $n++;
     my $bs = 1 + int(rand($m));	# 1 .. $m; dd rejects a zero block size
     print "$of file size=$bs$/";
     qx($dd if=/dev/random of=$of bs=$bs count=1 2>/dev/null);
     die if $?;
     for (0 .. int(rand($u))) {
	my $df = $dp{int(rand($d))} . '/' . $nsave . '-' . $n++;
	print "$df file size=$bs$/";
	qx($dd if=$of of=$df bs=$bs count=1 2>/dev/null);
	die if $?;
     }
}


Create a test tree:

2023-05-15 01:10:29 dpchrist@vf1 /vf1zpool1/dpchrist
$ t_dir_tree foo
foo directory
foo/1 directory
foo/1/2 directory
foo/1/2/a file size=784
foo/1/a-b file size=784
foo/c file size=655
foo/1/c-d file size=655
foo/1/2/e file size=885
foo/e-f file size=885
foo/1/g file size=267
foo/g-h file size=267
foo/1/2/i file size=438
foo/1/i-j file size=438
foo/k file size=902
foo/1/2/k-l file size=902
foo/1/2/m file size=520
foo/m-n file size=520
foo/o file size=91
foo/1/2/o-p file size=91
foo/q file size=928
foo/q-r file size=928
foo/1/s file size=22
foo/1/2/s-t file size=22


Take a recursive listing and word count of the tree as a baseline for detecting changes:

2023-05-15 01:16:27 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
       26      24      82
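
The listing above only notices files that appear, disappear, or are renamed. A stricter check is sketched below, assuming cksum(1) is available and file names contain no newlines; tree_sum and the demo paths are hypothetical names, not part of the original test:

```sh
#!/bin/sh
# tree_sum DIR: print one checksum line summarizing every regular file in DIR.
# cksum(1) prints "CRC SIZE PATH" per file; a C-locale sort fixes the order.
tree_sum() {
    find "$1" -type f -exec cksum {} + | LC_ALL=C sort | cksum
}

# Demo on a throwaway tree so that nothing real is touched.
demo=$(mktemp -d "${TMPDIR:-/tmp}/treesum.XXXXXX")
mkdir -p "$demo/foo/1"
printf 'abc' > "$demo/foo/a"
printf 'abc' > "$demo/foo/1/a-b"

before=$(tree_sum "$demo/foo")
printf 'abd' > "$demo/foo/a"    # same name and size, different content
after=$(tree_sum "$demo/foo")

if [ "$before" = "$after" ]; then echo "unchanged"; else echo "changed"; fi
rm -rf "$demo"
```

In the demo, "ls -R1 | wc" would report the same counts before and after, while the checksum comparison reports a change.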


fdupes(1) finds duplicates (-r recurses into subdirectories, -f omits the first file in each set of matches) and does not change the tree:

2023-05-15 01:16:32 dpchrist@vf1 /vf1zpool1/dpchrist
$ fdupes -fr foo | grep .
foo/q-r
foo/e-f
foo/o
foo/m-n
foo/1/2/a
foo/c
foo/1/2/i
foo/1/2/s-t
foo/g-h
foo/1/2/k-l

2023-05-15 01:17:57 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
       26      24      82
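
For comparison, a rough content-based duplicate finder can be sketched in sh. This is not how fdupes(1) works internally; it assumes cksum(1) CRC plus size is a good enough key for a test tree and that file names contain no newlines. dup_groups and the demo paths are hypothetical names:

```sh
#!/bin/sh
# dup_groups DIR: print every regular file under DIR whose cksum(1) CRC and
# size match a file listed earlier (order fixed by a C-locale sort).
dup_groups() {
    find "$1" -type f -exec cksum {} + | LC_ALL=C sort |
    awk '{ key = $1 " " $2                        # CRC and byte count
           path = $0
           sub(/^[0-9]+ [0-9]+ /, "", path)       # keep only the path
           if (key in first) print path           # later member of a set
           else first[key] = path }'
}

# Demo on a throwaway tree: two identical files and one different file.
demo=$(mktemp -d "${TMPDIR:-/tmp}/dups.XXXXXX")
mkdir -p "$demo/foo/1"
printf 'hello' > "$demo/foo/a"
printf 'hello' > "$demo/foo/1/a-b"
printf 'hi'    > "$demo/foo/c"

dups=$(dup_groups "$demo/foo")
echo "$dups"
rm -rf "$demo"
```

Like "fdupes -f", the sketch lists all but one member of each set, so its output could be reviewed before anything is removed.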


Your script does not appear to report or remove anything when given the single directory foo, with or without -n (?):

2023-05-15 01:19:00 dpchrist@vf1 /vf1zpool1/dpchrist
$ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo
Building files list from: foo
Comparing files ...

2023-05-15 01:19:33 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
       26      24      82

2023-05-15 01:19:35 dpchrist@vf1 /vf1zpool1/dpchrist
$ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo
Building files list from: foo
Comparing files ...

2023-05-15 01:19:48 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
       26      24      82
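
One possible reading of the quoted awk code is that it looks for a file at the same relative path under a later directory argument (the inner loop starts at j = i + 1), so with a single argument there would be nothing to compare. A minimal sh sketch of that cross-directory idea follows. It uses cmp(1) rather than the size comparison in the quoted script, it only reports, and report_dups and the demo paths are hypothetical names:

```sh
#!/bin/sh
# report_dups KEEP OTHER: for each regular file under KEEP, report the file at
# the same relative path under OTHER when the contents are identical.
# Nothing is removed. Assumes no newlines in file names.
report_dups() {
    ( cd "$1" && find . -type f ) |
    while IFS= read -r rel; do
        rel=${rel#./}                             # strip the leading "./"
        if [ -f "$2/$rel" ] && cmp -s "$1/$rel" "$2/$rel"; then
            echo "duplicates: $2/$rel $1/$rel"
        fi
    done
}

# Demo on a throwaway pair of trees.
demo=$(mktemp -d "${TMPDIR:-/tmp}/cross.XXXXXX")
mkdir -p "$demo/d1/x" "$demo/d2/x"
printf 'same' > "$demo/d1/x/a"
printf 'same' > "$demo/d2/x/a"
printf 'one'  > "$demo/d1/b"
printf 'two'  > "$demo/d2/b"

out=$(report_dups "$demo/d1" "$demo/d2")
echo "$out"
rm -rf "$demo"
```

If that reading is right, the quoted script would need at least two directory arguments before it reports anything, which is a different job from what "fdupes -r foo" does within one tree.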


David