Re: Tool to compare directories and delete duplicate files from one directory

From: David Christensen <dpchrist_at_holgerdanske.com>
Date: Thu, 04 May 2023 21:21:23 UTC

On 5/4/23 08:53, Kaya Saman wrote:
> Hi,
> 
> 
> I'm wondering if anyone knows of a tool like diff or so that can also 
> delete files based on name and size from either left/right or 
> source/destination directory?
> 
> 
> Basically what I have done is performed an rsync without using the 
> --remove-source-files option onto a newly bought and created disk pool 
> (yes zpool) that i am trying to consolidate my data - as it's currently 
> spread out over multiple pools with the same folder name.
> 
> 
> The issue I am facing mainly is that I perform another rsync and use the 
> --remove-source-files option, rsync will delete files based on name 
> while there are some files that have the same name but not same size and 
> I would like to retain these files.
> 
> 
> Right now I have looked at many different options in both rsync and 
> other tools but found nothing suitable. I even tested using a few test 
> dirs and files that I put into /tmp and whatever I tried, the files of 
> different size either got transferred or deleted.
> 
> 
> How would be a good way to approach this problem?
> 
> 
> Even if I create some kind of shell script and use diff, I think it will 
> only compare names and not file sizes.
> 
> 
> I'm really lost here....
> 
> 
> Regards,
> 
> 
> Kaya


Mounting the source file system and destination file system on the same 
host will simplify matters.  sshfs(1) works, but is not fast.  Samba is 
fast.  I have never used NFS, but it should be fast.


While I know of several programs that can do copying and have 
destination file name collision detection (and/or destination content 
collision detection), AIUI their collision resolution is limited to 
cancel or overwrite (perhaps conditionally, such as newer source mtime; 
e.g. cp(1) --update).


I would approach the problem by writing a program or script that does 
the copy and collision detection, plus has the collision resolution I 
want.  For example, compare the source and destination contents.  If the
contents are the same, do not copy.  If the contents differ, copy to a 
destination file name that is a unique variant of the source file name. 
The challenge then becomes finding a unique destination file name. 
Inserting an encoded (e.g. hexadecimal, base32, URL-safe base64) secure
hash (e.g. SHA1, SHA256) of the file contents into the destination file
name should make it very unlikely that two source files with the same
name, but differing contents, would have colliding variant names.  In
addition, it would be good to include a --directory=DIR option (similar 
to tar(1)).
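

The approach above could be sketched in Python roughly as follows.  This
is only an illustration under my own assumptions, not a tested tool: the
names merge_tree, variant_name and sha256_hex, the 16-character digest
prefix, and the --dry-run option are all made up for the example.  It
assumes both trees are mounted on the same host, and it treats an
existing variant name as "already copied" because the name embeds the
content hash.

```python
import argparse
import filecmp
import hashlib
import os
import shutil
from pathlib import Path


def sha256_hex(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def variant_name(dst, digest, length=16):
    """Insert a truncated digest before the suffix: a.txt -> a.<digest>.txt"""
    return dst.with_name(f"{dst.stem}.{digest[:length]}{dst.suffix}")


def merge_tree(src_root, dst_root, dry_run=False):
    """Copy src_root into dst_root; return a list of (action, src, dst)."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    actions = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        dst = dst_root / src.relative_to(src_root)
        if not dst.exists():
            action = "copy"            # no name collision
        elif filecmp.cmp(src, dst, shallow=False):
            action = "skip"            # same name, same contents
        else:
            # Same name, different contents: pick a hash-based variant name.
            dst = variant_name(dst, sha256_hex(src))
            # Assumption: an existing variant holds these same contents.
            action = "skip" if dst.exists() else "copy-variant"
        if action != "skip" and not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)     # copy2 also preserves mtime
        actions.append((action, src, dst))
    return actions


def main(argv=None):
    parser = argparse.ArgumentParser(description="Merge SRC into DST.")
    parser.add_argument("src")
    parser.add_argument("dst")
    parser.add_argument("--directory", metavar="DIR",
                        help="change to DIR before resolving SRC and DST")
    parser.add_argument("--dry-run", action="store_true",
                        help="report actions without copying anything")
    args = parser.parse_args(argv)
    if args.directory:
        os.chdir(args.directory)
    for action, src, dst in merge_tree(args.src, args.dst, args.dry_run):
        print(f"{action}\t{src}\t{dst}")
```

Calling main() from a script's entry point, first with --dry-run, would
print one line per source file so the plan can be reviewed before
anything is written.  The sketch never deletes anything from the source;
removing source files is left as a separate, deliberate step once the
destination has been verified.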


David