Re: Tool to compare directories and delete duplicate files from one directory

From: Paul Procacci <pprocacci_at_gmail.com>
Date: Fri, 05 May 2023 00:13:02 UTC
On Thu, May 4, 2023 at 7:53 PM Kaya Saman <kayasaman@optiplex-networks.com>
wrote:

>
> On 5/4/23 23:32, Paul Procacci wrote:
>
>
>
> On Thu, May 4, 2023 at 5:47 PM Kaya Saman <kayasaman@optiplex-networks.com>
> wrote:
>
>>
>> On 5/4/23 17:29, Paul Procacci wrote:
>>
>>
>>
>> On Thu, May 4, 2023 at 11:53 AM Kaya Saman <
>> kayasaman@optiplex-networks.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> I'm wondering if anyone knows of a tool like diff or so that can also
>>> delete files based on name and size from either left/right or
>>> source/destination directory?
>>>
>>>
>>> Basically what I have done is performed an rsync without using the
>>> --remove-source-files option onto a newly bought and created disk pool
>>> (yes zpool) that i am trying to consolidate my data - as it's currently
>>> spread out over multiple pools with the same folder name.
>>>
>>>
>>> The issue I am facing mainly is that I perform another rsync and use the
>>> --remove-source-files option, rsync will delete files based on name
>>> while there are some files that have the same name but not same size and
>>> I would like to retain these files.
>>>
>>>
>>> Right now I have looked at many different options in both rsync and
>>> other tools but found nothing suitable. I even tested using a few test
>>> dirs and files that I put into /tmp and whatever I tried, the files of
>>> different size either got transferred or deleted.
>>>
>>>
>>> How would be a good way to approach this problem?
>>>
>>>
>>> Even if I create some kind of shell script and use diff, I think it will
>>> only compare names and not file sizes.
>>>
>>>
>>> I'm really lost here....
>>>
>>>
>>> Regards,
>>>
>>>
>>> Kaya
>>>
>>>
>>>
>>>
>> It sounds like you want fdupes.  It's in the ports tree.
>>
>> ~Paul
>>
>> --
>> __________________
>>
>> :(){ :|:& };:
>>
>>
>>
>> I tried fdupes and installed it a while back. For me it felt like it only
>> works on a single directory.
>>
>>
>> My dir structure is that I have:
>>
>>
>> /dir <- main directory where everything has now been rsync'ed to
>>
>> /dir_1 <- old directory with partial content
>>
>> /dir_2 <- more partial content
>>
>> /dir_3 <- more partial content
>>
>>
>> The key thing here is that I need to compare:
>>
>>
>> /dir_(x) with /dir
>>
>>
>> if the files are different sizes in /dir_(x) then leave them, otherwise
>> delete if both name and file size are the same.
>>
>
> Then a tiny shell script does the job assuming your files don't have any
> spaces and no weird characters exist:
>
> #!/bin/sh
>
> for i in b c d;
> do
>   ls $i/ | while read file;
>   do
>     [ ! -f a/$file ] && cp $i/$file a/$file && continue
>
>     ref=`stat -f '%z' a/$file`
>     src=`stat -f '%z' $i/$file`
>     [ $ref -eq $src ] && rm -f $i/$file
>
>   done
> done
>
> Change paths accordingly and backup your stuff. ;)
>
> ~Paul
>
> --
> __________________
>
> :(){ :|:& };:
>
>
> Thanks Paul,
>
>
> I should be able to work with this. There are actually spaces and weird
> characters in the file names so I assume doing something like "file" should
> allow for that?
>
>
> I don't think I need the line after the 'do' statement do I? From what I
> understand it copies the file from directory i to directory a? As I
> explained initially, the files have already been rsync'ed so I just need to
> compare and delete accordingly.
>
> When I performed the rsync it took around a week to complete per run,
> currently zfs list shows around 12TB usage for my /dir but that's with
> compression enabled, of the merged directory.
>
>
> A quick Google shows that I can use something like this:
>
> search_dir=/the/path/to/base/dir
> for entry in "$search_dir"/*
> do
>   echo "$entry"
> done
>
>
> To list the files in the directory, though this might be Bash and not Csh.
>
>
> Otherwise clunkily (my scripting style is pretty rubbish and non
> efficient), I could do something like (it probably won't work!):
>
>
> #!/bin/sh
>
>
> #fb = file base
>
> #fm = file merge: file that has already been merged using rsync unless
> #     size was different
>
>
> dir_base=/dir
> for fb in "$dir_base"/*
> do
>   echo "$fb"
> done
>
>
> dir_merge=/dir_1
> for fm in "$dir_merge"/*
> do
>   echo "$fm"
> done
>
>
>   do
>
>     ref=`stat -f '%z' $dir_base/$fb`
>     src=`stat -f '%z' $dir_merge/$fm`
>     [ $ref -eq $src ] && rm -f $dir_merge/$fm
>
>   done
>
>
>
> Regards,
>
>
> Kaya
>

What I provided is exactly what you need, as it loops through all the
directories.  You just have to provide the list of source directories in
that first for loop.
You can alter it by removing the first for loop, but then you'll need to run
it once for each directory you want to check.

Enclosing the variables in quotes may or may not help.  A quote is itself a
valid character in a filename, so quoting may not always work as expected.
If you're reasonably sure your filenames do not contain quotes, then you
have a better chance of it working.

If worst comes to worst, you'll need something like:
find /path -print0 | xargs -0 -n 1 <args>
to cope with weird characters in filenames.
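
For the record, here's roughly how that find/xargs route could look.  Treat
it as a sketch under a few assumptions: prune_same_size is a name I made up,
it only looks at regular files at the top level of the source directory, it
never copies anything, and it only prints the rm commands instead of running
them.  The stat fallback assumes GNU stat wants -c %s where the BSD one
wants -f %z.

```shell
#!/bin/sh
# Sketch only. Usage: prune_same_size BASE_DIR SRC_DIR
# (prune_same_size is an illustrative name, not an existing tool.)
prune_same_size() {
  # -print0 and -0 separate names with NUL bytes, so spaces and other
  # odd characters in filenames pass through untouched.
  find "$2" -maxdepth 1 -type f -print0 |
    xargs -0 -n 1 sh -c '
      base=$1
      f=$2
      name=${f##*/}
      # Only files that also exist in the base dir are candidates.
      [ -f "$base/$name" ] || exit 0
      # GNU-style stat first, then the BSD form used in this thread.
      size() { stat -c %s "$1" 2>/dev/null || stat -f %z "$1"; }
      # Dry run: print the rm command rather than executing it.
      if [ "$(size "$base/$name")" -eq "$(size "$f")" ]; then
        echo rm -f "$f"
      fi
    ' sh "$1"
}
```

Call it as: prune_same_size /dir /dir_1
Once the printed list looks right, drop the echo so the rm actually runs.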

In either case, now that you know you have at least spaces and some special
characters, adding quotes is probably the correct course of action.
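
To see what the quotes buy you with spaces, here's a tiny illustration.
count_args and the filename are made up for the example.

```shell
#!/bin/sh
# count_args prints how many separate arguments it received.
count_args() {
  echo "$#"
}

file='my holiday photo.jpg'      # hypothetical name containing spaces

unquoted=$(count_args $file)     # word splitting: three arguments
quoted=$(count_args "$file")     # one argument, spaces preserved

echo "unquoted: $unquoted, quoted: $quoted"
# prints: unquoted: 3, quoted: 1
```

Unquoted, a command like rm would be handed three names that don't exist
instead of the one that does.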

As an aside, I don't use this syntax:    for entry in "$search_dir"/*
You're certainly free to do so, but I personally avoid globs when possible.
Maybe not so much in scripts like this, but on the command line those globs
can expand to a list that exceeds the maximum size allowed for command-line
arguments.
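
To be fair to the glob, that limit (getconf ARG_MAX will show it) only
applies when the expanded list is handed to another program, so a for loop
over a glob stays inside the shell.  A small sketch, with count_entries
being a made-up helper:

```shell
#!/bin/sh
# count_entries is an illustrative helper: it counts the entries in a
# directory with a for loop over a glob. The expansion is consumed by
# the shell itself, so no argument list is passed to another program.
count_entries() {
  n=0
  for entry in "$1"/*
  do
    # With no matches the pattern is left unexpanded, so skip it.
    [ -e "$entry" ] || [ -L "$entry" ] || continue
    n=$((n + 1))
  done
  echo "$n"
}
```

Something like ls "$dir"/* on the same directory is where the limit would
bite, since there the whole list goes to ls as arguments.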

Revised script adding comments:
-----------------------------------------------------
#!/bin/sh

#
# dir_1, dir_2, and dir_3 are the directories I want to search through.
for i in dir_1 dir_2 dir_3;
do
  # Retrieve the filenames within each of those directories
  ls $i/ | while read file;
  do
    # If the file doesn't exist in the base dir, copy it and continue with
    # the top of the loop.
    [ ! -f dir_base/$file ] && cp $i/$file dir_base/ && continue

    #
    # Getting to this point means the file exists in both locations.
    #

    # Get the file size as it is in the dir_base
    ref=`stat -f '%z' dir_base/$file`

    # Get the file size as it is in $i
    src=`stat -f '%z' $i/$file`

    # If the sizes are the same, remove the file from the source directory
    [ $ref -eq $src ] && rm -f $i/$file

  done
done
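
If you go the quoting route, the same loop could look roughly like the
sketch below.  merge_dir and file_size are names made up for illustration.
file_size tries GNU stat's -c '%s' before falling back to the -f '%z' form
used above, which is an assumption about which stat you have.  Names
containing newlines will still break the ls | while read part.

```shell
#!/bin/sh
# Sketch: quoted variant of the loop above. merge_dir and file_size are
# illustrative names, not part of the original script.

# Print a file's size in bytes. Tries GNU-style stat first, then the
# BSD form used in this thread.
file_size() {
  stat -c '%s' "$1" 2>/dev/null || stat -f '%z' "$1"
}

# Usage: merge_dir BASE_DIR SRC_DIR
merge_dir() {
  base=$1
  src=$2
  # IFS= and -r keep leading/trailing spaces and backslashes intact.
  ls "$src"/ | while IFS= read -r file
  do
    # Not in the base dir yet: copy it over and move on.
    if [ ! -f "$base/$file" ]; then
      cp "$src/$file" "$base/"
      continue
    fi
    ref=$(file_size "$base/$file")
    cur=$(file_size "$src/$file")
    # Same name and same size: remove the copy in the source dir.
    if [ "$ref" -eq "$cur" ]; then
      rm -f "$src/$file"
    fi
  done
}
```

Run it once per source directory, for example:
for d in dir_1 dir_2 dir_3; do merge_dir dir_base "$d"; done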



-- 
__________________

:(){ :|:& };: