Re: Tool to compare directories and delete duplicate files from one directory

From: Kaya Saman <kayasaman_at_optiplex-networks.com>
Date: Thu, 04 May 2023 23:53:14 UTC
On 5/4/23 23:32, Paul Procacci wrote:
>
>
> On Thu, May 4, 2023 at 5:47 PM Kaya Saman 
> <kayasaman@optiplex-networks.com> wrote:
>
>
>     On 5/4/23 17:29, Paul Procacci wrote:
>>
>>
>>     On Thu, May 4, 2023 at 11:53 AM Kaya Saman
>>     <kayasaman@optiplex-networks.com> wrote:
>>
>>         Hi,
>>
>>
>>         I'm wondering if anyone knows of a tool like diff or so that
>>         can also
>>         delete files based on name and size from either left/right or
>>         source/destination directory?
>>
>>
>>         Basically, what I have done is perform an rsync, without the
>>         --remove-source-files option, onto a newly bought and created
>>         disk pool (yes, a zpool) on which I am trying to consolidate my
>>         data, as it's currently spread out over multiple pools with the
>>         same folder name.
>>
>>
>>         The main issue I am facing is that if I perform another rsync
>>         with the --remove-source-files option, rsync will delete files
>>         based on name, while there are some files that have the same
>>         name but not the same size, and I would like to retain these
>>         files.
>>
>>
>>         Right now I have looked at many different options in both
>>         rsync and
>>         other tools but found nothing suitable. I even tested using a
>>         few test
>>         dirs and files that I put into /tmp and whatever I tried, the
>>         files of
>>         different size either got transferred or deleted.
>>
>>
>>         What would be a good way to approach this problem?
>>
>>
>>         Even if I create some kind of shell script and use diff, I
>>         think it will
>>         only compare names and not file sizes.
>>
>>
>>         I'm really lost here....
>>
>>
>>         Regards,
>>
>>
>>         Kaya
>>
>>
>>
>>
>>     It sounds like you want fdupes.  It's in the ports tree.
>>
>>     ~Paul
>>
>>     -- 
>>     __________________
>>
>>     :(){ :|:& };:
>
>
>
>     I tried fdupes and installed it a while back. For me it felt like
>     it only works on a single directory.
>
>
>     My dir structure is that I have:
>
>
>     /dir <- main directory where everything has now been rsync'ed to
>
>     /dir_1 <- old directory with partial content
>
>     /dir_2 <- more partial content
>
>     /dir_3 <- more partial content
>
>
>     The key thing here is that I need to compare:
>
>
>     /dir_(x) with /dir
>
>
>     if the files are different sizes in /dir_(x) then leave them,
>     otherwise delete if both name and file size are the same.
>
>
> Then a tiny shell script does the job, assuming your file names don't
> contain spaces or other weird characters:
>
> #!/bin/sh
>
> for i in b c d;
> do
>   ls $i/ | while read file;
>   do
>     [ ! -f a/$file ] && cp $i/$file a/$file && continue
>
>     # Same name on both sides: delete the source copy if the sizes match
>     ref=`stat -f '%z' a/$file`
>     src=`stat -f '%z' $i/$file`
>     [ $ref -eq $src ] && rm -f $i/$file
>
>   done
> done
>
> Change paths accordingly and backup your stuff. ;)
>
> ~Paul
>
> -- 
> __________________
>
> :(){ :|:& };:


Thanks Paul,


I should be able to work with this. There are actually spaces and weird
characters in the file names, so I assume quoting the variable, as in
"$file", should allow for that?
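
On the quoting question, here is a minimal sketch, assuming a POSIX sh.
Wrapping every expansion in double quotes, as in "$file", keeps a name
with spaces as a single argument. The helper names file_size and
same_size are made up for illustration. The stat -c fallback is only an
assumption so the snippet also runs where stat is the GNU one; on
FreeBSD it ends up using the stat -f '%z' form from the script above.

```sh
#!/bin/sh

# file_size PATH: print the size of PATH in bytes.
# Assumption: try GNU stat (-c %s) first, then fall back to BSD stat (-f %z).
file_size() {
    stat -c %s "$1" 2>/dev/null || stat -f %z "$1"
}

# same_size A B: succeed only if both are regular files of equal size.
# Every expansion is double-quoted, so spaces in names are harmless.
same_size() {
    [ -f "$1" ] && [ -f "$2" ] || return 1
    [ "$(file_size "$1")" -eq "$(file_size "$2")" ]
}
```

For example, same_size "/dir/my file.txt" "/dir_1/my file.txt" returns
success only when both files exist and have the same byte count.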


I don't think I need the line after the 'do' statement, do I? From what I
understand, it copies the file from directory i to directory a? As I
explained initially, the files have already been rsync'ed, so I just need
to compare and delete accordingly.

Each rsync run took around a week to complete. Currently zfs list shows
around 12TB usage for /dir, the merged directory, and that's with
compression enabled.


A quick Google shows that I can use something like this:

search_dir=/the/path/to/base/dir
for entry in "$search_dir"/*
do
   echo "$entry"
done


That should list the files in the directory, though this might be Bash
and not Csh.
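
As far as I can tell, that for loop is plain POSIX sh syntax, so it
should run under /bin/sh as well as Bash; csh would need foreach
instead. Here is a small sketch of the same idea, with a made-up
function name list_files, that prints only regular files and copes with
an empty directory:

```sh
#!/bin/sh

# list_files DIR: print the name of each regular file directly inside DIR.
list_files() {
    for entry in "$1"/*
    do
        # An unmatched glob stays as the literal pattern; the -f test
        # filters that out, along with subdirectories.
        [ -f "$entry" ] || continue
        printf '%s\n' "${entry##*/}"
    done
}
```

Note that the * glob does not match names starting with a dot, so
hidden files are skipped by this form.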


Otherwise, clunkily (my scripting style is pretty rubbish and
inefficient), I could do something like this (it probably won't work!):


#!/bin/sh


# fb = file base: the copy already merged into $dir_base
# fm = file merge: the file in the old directory, which rsync has already
#      merged unless the size was different


dir_base=/dir
dir_merge=/dir_1


for fm in "$dir_merge"/*
do
   # Only regular files that also exist under the same name in $dir_base
   [ -f "$fm" ] || continue
   fb="$dir_base/${fm##*/}"
   [ -f "$fb" ] || continue

   # Delete the old copy only when the sizes match
   ref=`stat -f '%z' "$fb"`
   src=`stat -f '%z' "$fm"`
   [ "$ref" -eq "$src" ] && rm -f "$fm"

done
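
Since the real directories presumably contain subdirectories, here is a
hedged sketch of a recursive variant with a dry-run switch. It is only
one possible approach: the names prune_dupes, file_size and DRY_RUN are
invented for illustration, and the stat -c fallback is an assumption for
systems with GNU stat. It uses find instead of a glob so nested files
are visited, and it only prints what it would delete unless DRY_RUN is
set to 0.

```sh
#!/bin/sh

# file_size PATH: size in bytes.
# Assumption: GNU stat (-c %s) first, BSD stat (-f %z) as fallback.
file_size() {
    stat -c %s "$1" 2>/dev/null || stat -f %z "$1"
}

# prune_dupes BASE OLD: for every regular file under OLD, remove it when
# BASE holds a file with the same relative path and the same size.
# Pass OLD without a trailing slash. Names containing newlines are not
# handled, because find's output is read line by line.
prune_dupes() {
    base=$1
    old=$2
    find "$old" -type f | while IFS= read -r fm
    do
        rel=${fm#"$old"/}              # path relative to OLD
        fb=$base/$rel                  # matching path under BASE
        [ -f "$fb" ] || continue
        [ "$(file_size "$fb")" -eq "$(file_size "$fm")" ] || continue
        if [ "${DRY_RUN:-1}" = 0 ]; then
            rm -f -- "$fm"
        else
            printf 'would delete: %s\n' "$fm"
        fi
    done
}
```

Running prune_dupes /dir /dir_1 would only list the candidates; the same
call after setting DRY_RUN=0 removes them. Equal size does not prove
equal content, so the usual advice about backups still applies.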



Regards,


Kaya