Re: Tool to compare directories and delete duplicate files from one directory

From: David Christensen <dpchrist_at_holgerdanske.com>
Date: Mon, 15 May 2023 08:43:38 UTC
On 5/15/23 01:29, David Christensen wrote:
> On 5/14/23 15:48, Sysadmin Lists wrote:
>> #!/bin/sh -e
>> # remove or report duplicate files: $0 [-n] dir[1] dir[2] ... dir[n]
>> if [ "X$1" = "X-n" ]; then n=1; shift; fi
>>
>> echo "Building files list from: ${@}"
>>
>> find "${@}" -xdev -type f |
>> awk -v n=$n 'BEGIN { cmd = "stat -f %z "
>> for (x = 1; x < ARGC; x++) args = args ? args "|" ARGV[x] : ARGV[x]; ARGC = 0 }
>>       { files[$0] = match($0, "(" args ")/?") + RLENGTH }
>> END  { for (i in ARGV) sub("/*$", "/", ARGV[i])
>>         print "Comparing files ..."
>>         for (i = 1; i < x; i++) for (file in files) if (file ~ "^" ARGV[i]) {
>>             for (j = i +1; j < x; j++)
>>                 if (ARGV[j] substr(file, files[file]) in files) {
>>                     dup = ARGV[j] substr(file, files[file])
>>                     cmd "\"" file "\"" | getline fil_s; close(cmd "\"" file "\"")
>>                     cmd "\"" dup  "\"" | getline dup_s; close(cmd "\"" dup  "\"")
>>                     if (dup_s == fil_s) act("dup")
>>                     else act("diff") }
>>             delete files[file]
>>       } }
>> function act(message) {
>>      print ((message == "dup") ? "duplicates:" : "difference:"), dup, file
>>      if (!n) system("rm -vi \"" dup "\" </dev/tty")
>> }' "${@}"

> Your script does not appear to do anything (?):
> 
> 2023-05-15 01:19:00 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo
> Building files list from: foo
> Comparing files ...
> 
> 2023-05-15 01:19:33 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>        26      24      82
> 
> 2023-05-15 01:19:35 dpchrist@vf1 /vf1zpool1/dpchrist
> $ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo
> Building files list from: foo
> Comparing files ...
> 
> 2023-05-15 01:19:48 dpchrist@vf1 /vf1zpool1/dpchrist
> $ ls -R1 foo | wc
>        26      24      82


It looks like your script only finds duplicates when the subpath is 
identical (?):

2023-05-15 01:38:20 dpchrist@vf1 /vf1zpool1/dpchrist
$ cp -Ra foo bar

2023-05-15 01:39:18 dpchrist@vf1 /vf1zpool1/dpchrist
$ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
Building files list from: foo bar
Comparing files ...
duplicates: bar/1/2/a foo/1/2/a
duplicates: bar/1/i-j foo/1/i-j
duplicates: bar/1/2/e foo/1/2/e
duplicates: bar/1/a-b foo/1/a-b
duplicates: bar/1/g foo/1/g
duplicates: bar/1/2/i foo/1/2/i
duplicates: bar/q-r foo/q-r
duplicates: bar/m-n foo/m-n
duplicates: bar/1/2/m foo/1/2/m
duplicates: bar/c foo/c
duplicates: bar/e-f foo/e-f
duplicates: bar/1/s foo/1/s
duplicates: bar/k foo/k
duplicates: bar/o foo/o
duplicates: bar/q foo/q
duplicates: bar/1/c-d foo/1/c-d
duplicates: bar/1/2/s-t foo/1/2/s-t
duplicates: bar/1/2/o-p foo/1/2/o-p
duplicates: bar/1/2/k-l foo/1/2/k-l
duplicates: bar/g-h foo/g-h

2023-05-15 01:39:41 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
       26      24      82

2023-05-15 01:39:44 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 bar | wc
       26      24      82

2023-05-15 01:40:10 dpchrist@vf1 /vf1zpool1/dpchrist
$ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh -n foo bar
Building files list from: foo bar
Comparing files ...
duplicates: bar/1/2/a foo/1/2/a
duplicates: bar/1/i-j foo/1/i-j
duplicates: bar/1/2/e foo/1/2/e
duplicates: bar/1/a-b foo/1/a-b
duplicates: bar/1/g foo/1/g
duplicates: bar/1/2/i foo/1/2/i
duplicates: bar/q-r foo/q-r
duplicates: bar/m-n foo/m-n
duplicates: bar/1/2/m foo/1/2/m
duplicates: bar/c foo/c
duplicates: bar/e-f foo/e-f
duplicates: bar/1/s foo/1/s
duplicates: bar/k foo/k
duplicates: bar/o foo/o
duplicates: bar/q foo/q
duplicates: bar/1/c-d foo/1/c-d
duplicates: bar/1/2/s-t foo/1/2/s-t
duplicates: bar/1/2/o-p foo/1/2/o-p
duplicates: bar/1/2/k-l foo/1/2/k-l
duplicates: bar/g-h foo/g-h

2023-05-15 01:40:22 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 foo | wc
       26      24      82

2023-05-15 01:40:29 dpchrist@vf1 /vf1zpool1/dpchrist
$ ls -R1 bar | wc
       26      24      82

2023-05-15 01:40:34 dpchrist@vf1 /vf1zpool1/dpchrist
$ sysadmin.lists_mailfence.com-20230514-1548-find-dupes.sh foo bar
Building files list from: foo bar
Comparing files ...
duplicates: bar/1/2/a foo/1/2/a
remove bar/1/2/a? n
duplicates: bar/1/i-j foo/1/i-j
remove bar/1/i-j? n
duplicates: bar/1/2/e foo/1/2/e
remove bar/1/2/e? n
duplicates: bar/1/a-b foo/1/a-b
remove bar/1/a-b? n
duplicates: bar/1/g foo/1/g
remove bar/1/g? n
duplicates: bar/1/2/i foo/1/2/i
remove bar/1/2/i? n
duplicates: bar/q-r foo/q-r
remove bar/q-r? n
duplicates: bar/m-n foo/m-n
remove bar/m-n? n
duplicates: bar/1/2/m foo/1/2/m
remove bar/1/2/m? n
duplicates: bar/c foo/c
remove bar/c? n
duplicates: bar/e-f foo/e-f
remove bar/e-f? n
duplicates: bar/1/s foo/1/s
remove bar/1/s? n
duplicates: bar/k foo/k
remove bar/k? n
duplicates: bar/o foo/o
remove bar/o? n
duplicates: bar/q foo/q
remove bar/q? n
duplicates: bar/1/c-d foo/1/c-d
remove bar/1/c-d? n
duplicates: bar/1/2/s-t foo/1/2/s-t
remove bar/1/2/s-t? n
duplicates: bar/1/2/o-p foo/1/2/o-p
remove bar/1/2/o-p? n
duplicates: bar/1/2/k-l foo/1/2/k-l
remove bar/1/2/k-l? n
duplicates: bar/g-h foo/g-h
remove bar/g-h? n
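
As far as I can tell from the awk code, the script treats two files as 
duplicates when they share a relative path and "stat -f %z" prints the 
same size for both. For comparison, here is a minimal sketch of matching 
on content instead of subpath. It is not the quoted script's method: 
find_content_dupes is a hypothetical name, it assumes file names contain 
no newlines, it runs cmp on every pair of files (slow on large trees), 
and it only reports, never removes.

```shell
#!/bin/sh
# Hypothetical helper (not part of the quoted script): report files under
# "$2" whose contents match a file under "$1", whatever their subpaths.
# Assumption: file names contain no newline characters.
find_content_dupes() {
    find "$1" -xdev -type f | while IFS= read -r a; do
        find "$2" -xdev -type f | while IFS= read -r b; do
            # cmp -s is silent and exits 0 only for byte-identical files
            if cmp -s "$a" "$b"; then
                printf 'duplicates: %s %s\n' "$b" "$a"
            fi
        done
    done
}
```

Called as "find_content_dupes foo bar", it prints one "duplicates:" line 
per matching pair, with the bar path first, the same order the script 
uses in its output.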


David