Re: Tool to compare directories and delete duplicate files from one directory

From: Paul Procacci <pprocacci_at_gmail.com>
Date: Fri, 05 May 2023 03:36:19 UTC
On Thu, May 4, 2023 at 11:20 PM Kaya Saman <kayasaman@optiplex-networks.com>
wrote:

>
> On 5/5/23 04:01, Paul Procacci wrote:
>
> On Thu, May 4, 2023 at 10:30 PM Kaya Saman <
> kayasaman@optiplex-networks.com> wrote:
>
>>
>> On 5/5/23 03:08, Paul Procacci wrote:
>>
>> There are multiple reasons why it may not work.  My guess is that unusual
>> characters (spaces, quotes and the like) showing up within the filenames
>> are tripping up the shell script.
>>
>> This can be solved with an interpreted language that's a bit more
>> forgiving.
>> Take the following perl script.  It does the same thing as the shell
>> script (almost).  It renames the source file instead of making a copy of it.
>>
>> run as:  ./test.pl /absolute/path/to/master_dir /absolute_path_to_dir_x
>>
>> ###################################################################################
>>
>> #!/usr/bin/env perl
>>
>> use strict;
>> use warnings;
>>
>> sub msgDie
>> {
>>   my ($ret) = shift;
>>   my ($msg) = shift // "$0 dir_base dir\n";
>>   print $msg;
>>   exit($ret);
>> }
>>
>> msgDie(1) unless(scalar @ARGV == 2);
>>
>> my $base = $ARGV[0];
>> my $dir  = $ARGV[1];
>>
>> msgDie(1, "base directory doesn't exist\n") unless -d $base;
>> msgDie(1, "source directory doesn't exist\n") unless -d $dir;
>>
>> opendir(my $dh, $dir) or msgDie(1, "Unable to open directory: $dir\n");
>> while(readdir $dh)
>> {
>>   next if($_ eq '.' || $_ eq '..');
>>   if( ! -f "$base/$_" ){
>>     rename("$dir/$_", "$base/$_");
>>     next;
>>   }
>>
>>   my ($ref) = (stat("$base/$_"))[7];
>>   my ($src) = (stat("$dir/$_"))[7];
>>   unlink("$dir/$_") if($ref == $src);
>> }
>>
>> ###################################################################################
>>
>> ~Paul
>>
>>
>>
>> This didn't seem to work :-(
>>
>>
>> What exactly happened is this:
>>
>>
>> I created a set of test directories in /tmp
>>
>>
>> So, I have /tmp/test1 and /tmp/test2
>>
>>
>> To mimic the structure of the directories I intend to run this on, I
>> did this:
>>
>>
>> create a subdir called: dupdir in /tmp/test1 and /tmp/test2
>>
>>
>> /tmp/test2/dupdir contains these files: dup and dup1
>>
>>
>> /tmp/test1/dupdir contains a modified 'dup' file but copied dup1 file.
>>
>>
>> However*, now things get interesting as dup from test1 contains "1234567"
>> and dup from test2 contains "111" <- this is to simulate the file size
>> difference.
>>
>>
>>
>>
>>
>>
> Worked for me!  Regardless.  Use rsync then.
>
> rsync --ignore-existing --remove-source-files  /src /dest
>
> This would at the very least move non-existent files from the source over to the dest AND remove those source files AFTER the transfer happens.
>
> You'll be halfway there doing that.  What you'll be left with are files that exist in BOTH src AND DEST.
>
>
> ~Paul
>
>
> Paul, I think we've got wires crossed....
>
>
> I *have* already performed the rsync. Apologies if I wasn't clear!
>
>
> The problem I am faced with is that the destination directory is already
> populated with the information from 3 source directories.
>
>
> I need to remove the sync'ed files in the source directories and leave
> files that match in name but are of different sizes.
>
>
> The problem is I can't use rsync again for this as there aren't any
> options to simply compare files based on size. I can't use the --existing
> option as the files exist in both directories....
>
>
> This is the dilemma I am facing:
>
>
> ls -l /merged_dir/folder/
>
> 234904506 - file 'a'
>
>
> ls -l /source_dir/folder/
>
> 1080918146 - file 'a'
>
>
> so in this case file 'a' is in both directories with the same name but
> different size. I need to keep both versions. However, *if* they were the
> same size then remove the file in the source_dir.....
>
>
> That's all.. I don't need to transfer anything or copy anything at all...
> just compare and remove files of same name and size.
>
>
> Hopefully I am explaining better and things are more clear? Again I
> apologize for the confusion  :-(
>

You're at least partially right that I was confused, because comparing by
name and by size makes no sense to me.  Change a single byte in one copy and
the two files still have the same name and the same size, yet they are
different!  ;)
Is the below output what you're expecting to happen?

% mkdir a b
% echo 1111 > a/test.txt
% echo 1111 > b/test.txt
% ./test.pl a b
% ls -l a b
a:
total 5
-rw-r--r--  1 pprocacci  pprocacci  5 May  5 03:26 test.txt

b:
total 0

----------

The below perl script is what was run above.  1) Find a file in directory
"b".  2) Go to the top of the loop if the file doesn't exist in directory
"a".  3) Go to the top of the loop if the file sizes do not match.  4)
Unlink the file from "b" if neither condition 2 nor condition 3 skipped it.

#################################################
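#!/usr/bin/env perl

use strict;
use warnings;

# Print a message (the usage line by default) to STDERR and exit with $ret.
sub msgDie
{
  my ($ret) = shift;
  my ($msg) = shift // "$0 dir_base dir\n";
  print STDERR $msg;
  exit($ret);
}

# Exactly two arguments are required; compare the count numerically.
msgDie(1) unless(scalar @ARGV == 2);

my $base = $ARGV[0];
my $dir  = $ARGV[1];

msgDie(1, "base directory doesn't exist\n") unless -d $base;
msgDie(1, "source directory doesn't exist\n") unless -d $dir;

opendir(my $dh, $dir) or msgDie(1, "Unable to open directory: $dir: $!\n");
while(defined(my $name = readdir $dh))
{
  # Only plain files present in both directories are candidates.
  next if($name eq '.' || $name eq '..');
  next if(! -f "$dir/$name");
  next if(! -f "$base/$name");

  # Element 7 of stat() is the file size in bytes.
  my ($ref) = (stat("$base/$name"))[7];
  my ($src) = (stat("$dir/$name"))[7];
  next unless(defined $ref && defined $src);

  # Same name and same size: remove the copy in the source directory.
  if($ref == $src){
    unlink("$dir/$name") or warn "Unable to remove $dir/$name: $!\n";
  }
}
closedir($dh);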
#################################################
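
Since size alone can't tell two same-length files apart, a variant that also
compares contents may be safer.  The following is only a sketch, not what was
run above: remove_identical is a hypothetical helper name, and it assumes the
core modules File::Find, File::Compare and File::Spec are acceptable.  It also
descends into subdirectories, which a single readdir pass does not do.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use File::Compare qw(compare);
use File::Find    qw(find);
use File::Spec;

# Hypothetical helper (an assumption, not part of the script above): walk
# $dir recursively and unlink every plain file whose counterpart at the same
# relative path under $base has the same size and the same contents.
# Returns the sorted list of relative paths that were removed.
sub remove_identical
{
  my ($base, $dir) = @_;
  my @removed;

  find({
    no_chdir => 1,
    wanted   => sub {
      my $src = $File::Find::name;
      return unless -f $src;

      my $rel = File::Spec->abs2rel($src, $dir);
      my $ref = File::Spec->catfile($base, $rel);
      return unless -f $ref;

      # Cheap check first: files of different sizes can't be duplicates.
      my $src_size = (stat($src))[7];
      my $ref_size = (stat($ref))[7];
      return unless defined $src_size && defined $ref_size;
      return unless $src_size == $ref_size;

      # compare() returns 0 when the contents are equal, 1 when they
      # differ, and -1 on error.
      return unless compare($src, $ref) == 0;

      if(unlink($src)){
        push @removed, $rel;
      } else {
        warn "Unable to remove $src: $!\n";
      }
    },
  }, $dir);

  @removed = sort @removed;
  return @removed;
}

# Run from the command line only when both directories are given.
if(@ARGV == 2){
  print "removed $_\n" for remove_identical(@ARGV);
}
```

Run it the same way as the script above, e.g. ./test2.pl a b (the file name is
made up).  Files that share a name but differ in size or content are left in
place in both directories, and nothing is ever removed from the base directory.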

-- 
__________________

:(){ :|:& };: