Re: Tool to compare directories and delete duplicate files from one directory

From: Kaya Saman <kayasaman_at_optiplex-networks.com>
Date: Fri, 05 May 2023 02:30:14 UTC
On 5/5/23 03:08, Paul Procacci wrote:
> There are multiple reasons why it may not work.  My guess is that it's 
> due to the characters that could be showing up within the filenames 
> and whatnot.
>
> This can be solved with an interpreted language that's a bit more 
> forgiving.
> Take the following perl script.  It does the same thing as the shell 
> script (almost).  It renames the source file instead of making a copy 
> of it.
>
> run as:  ./test.pl /absolute/path/to/master_dir 
> /absolute_path_to_dir_x
>
> ################################################################################### 
>
> #!/usr/bin/env perl
>
> use strict;
> use warnings;
>
> sub msgDie
> {
>   my ($ret) = shift;
>   my ($msg) = shift // "$0 dir_base dir\n";
>   print $msg;
>   exit($ret);
> }
>
> msgDie(1) unless(scalar @ARGV == 2);
>
> my $base = $ARGV[0];
> my $dir  = $ARGV[1];
>
> msgDie(1, "base directory doesn't exist\n") unless -d $base;
> msgDie(1, "source directory doesn't exist\n") unless -d $dir;
>
> opendir(my $dh, $dir) or msgDie(1, "Unable to open directory: $dir\n");
> while(readdir $dh)
> {
>   next if($_ eq '.' || $_ eq '..');
>   if( ! -f "$base/$_" ){
>     rename("$dir/$_", "$base/$_");
>     next;
>   }
>
>   my ($ref) = (stat("$base/$_"))[7];
>   my ($src) = (stat("$dir/$_"))[7];
>   unlink("$dir/$_") if($ref == $src);
> }
> ###################################################################################
>
> ~Paul
>
>

This didn't seem to work :-(

What exactly happened is this:

I created a set of test directories in /tmp, so I have /tmp/test1 and 
/tmp/test2.


To mimic the structure of the directories I intend to run this on, I 
did this:


I created a subdir called dupdir in both /tmp/test1 and /tmp/test2.


/tmp/test2/dupdir contains these files: dup and dup1


/tmp/test1/dupdir contains a modified 'dup' file and an unmodified copy of the dup1 file.


However, now things get interesting, as dup from test1 contains 
"1234567" and dup from test2 contains "111"; this is to simulate the 
file size difference.


I then ran: ./test.pl /tmp/test1 /tmp/test2
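
For anyone wanting to reproduce this, the layout described above can be recreated with roughly the following sketch. It assumes a scratch directory (the ROOT variable is hypothetical; my test used /tmp directly) and placeholder contents for dup1, since only the contents of dup matter for the size difference.

```shell
#!/bin/sh
# Recreate the two-tree test layout under a scratch directory.
# ROOT is an assumption for illustration; the original test used /tmp.
ROOT="${ROOT:-$(mktemp -d)}"

mkdir -p "$ROOT/test1/dupdir" "$ROOT/test2/dupdir"

# 'dup' has a different size in each tree.
printf '1234567\n' > "$ROOT/test1/dupdir/dup"
printf '111\n'     > "$ROOT/test2/dupdir/dup"

# 'dup1' is an identical copy in both trees (contents are a placeholder).
printf 'same contents\n' > "$ROOT/test2/dupdir/dup1"
cp "$ROOT/test2/dupdir/dup1" "$ROOT/test1/dupdir/dup1"

# List what was created.
find "$ROOT" -type f | sort
```

Note that the files sit one level down, inside dupdir, rather than directly in test1 and test2.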


The expected behavior is that I should retain the file 'dup' in test1 
while 'dup1' should be removed.


In my actual file system I have many of these subdirs, so a fair test 
would probably be something like creating:

/tmp/test1/dupdir1

/tmp/test2/dupdir1

/tmp/test1/dupdir2

/tmp/test2/dupdir2


then putting the file dup into dupdir1 and dup1 into dupdir2
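
Since the quoted script only reads the top level of the directory it is given, it never looks inside subdirs like these. A minimal dry-run sketch of a recursive comparison is below. It is not the script from the thread: list_dups and DEMO are hypothetical names, it assumes the copy under the second directory is the one to get rid of (as in the quoted script), and it only prints candidates without deleting anything.

```shell
#!/bin/sh
# Dry run: print files under DIR that also exist at the same relative
# path under BASE with the same size in bytes. Nothing is deleted.
# Limitation: filenames containing newlines are not handled.
list_dups() {
    base=$1
    dir=$2
    # Walk DIR recursively; find prints paths relative to DIR.
    (cd "$dir" && find . -type f) | while IFS= read -r rel; do
        [ -f "$base/$rel" ] || continue
        # $(( ... )) strips the padding some wc implementations emit.
        size_base=$(( $(wc -c < "$base/$rel") ))
        size_dir=$(( $(wc -c < "$dir/$rel") ))
        # Size check only; cmp -s would compare the actual bytes.
        if [ "$size_base" -eq "$size_dir" ]; then
            printf '%s\n' "$dir/${rel#./}"
        fi
    done
}

# Demo on a scratch copy of the dupdir1/dupdir2 layout.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/test1/dupdir1" "$DEMO/test1/dupdir2" \
         "$DEMO/test2/dupdir1" "$DEMO/test2/dupdir2"
printf '1234567\n' > "$DEMO/test1/dupdir1/dup"
printf '111\n'     > "$DEMO/test2/dupdir1/dup"
printf 'same\n'    > "$DEMO/test1/dupdir2/dup1"
printf 'same\n'    > "$DEMO/test2/dupdir2/dup1"

list_dups "$DEMO/test1" "$DEMO/test2"
```

On the demo layout this should list only the dup1 under test2/dupdir2, because the two dup files differ in size. Piping the reviewed output to a removal step would be a separate decision.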


I guess my issue is complex? If only I had used the 
--remove-source-files option during my initial rsync, I wouldn't have 
had to worry about any of this: I was already using --ignore-existing, 
so that would have done the trick. But I decided to play it safe 
instead and have now ended up with a slight headache on my hands.