cksum entire dir??

Thu Sep 13 20:15:57 UTC 2012

Here's a simple, system-independent way to find duplicate files.  All you
need is something to generate a digest you trust (MD5, SHA1, whatever) plus
normal Unix stuff: awk, expand, grep, join, sort, and uniq.

Generate the signatures (md5 -r is the BSD spelling; on GNU systems md5sum
prints the same digest-then-filename layout):

  me% cd ~/bin
  me% find . -type f -print0 | xargs -0 md5 -r | sort > /tmp/sig1

  me% cat /tmp/sig1
  0287839688bd660676582266685b05bd ./mkrcs
  0b97494883c76da546e3603d1b65e7b2 ./pwgen
  ddbed53e795724e4a6683e7b0987284c ./authlog
  ddbed53e795724e4a6683e7b0987284c ./cmdlog
  fdff1fd84d47f76dbd4954c607d66714 ./dbrun
  ff5e24efec5cf1e17cf32c58e9c4b317 ./tr0
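
If you want that step to run unchanged on both BSD and GNU boxes, something
like the following might do. It's only a sketch: "sigs" is a name I made up,
and it assumes either md5sum or the BSD md5 is on your PATH. It also swaps
xargs -0 for find's "-exec ... {} +", which POSIX specifies.

```shell
# Hypothetical helper: print "digest filename" lines, sorted by digest,
# for every regular file under the directory given as $1.
sigs() {
    (
        cd "$1" || exit 1
        if command -v md5sum >/dev/null 2>&1; then
            # GNU coreutils: digest first, then the filename
            find . -type f -exec md5sum {} +
        else
            # Assumption: BSD md5, where -r puts the digest first
            find . -type f -exec md5 -r {} +
        fi | LC_ALL=C sort
    )
}
```

Usage would be along the lines of: sigs ~/bin > /tmp/sig1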

Find duplicate signatures (uniq needs sorted input, which sig1 already is):

  me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^  *1 "
        2 ddbed53e795724e4a6683e7b0987284c

  me% awk '{print $1}' /tmp/sig1 | uniq -c | expand | grep -v "^  *1 " |
      awk '{print $2}' > /tmp/sig2
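
A shorter route to the same list, if you don't need the counts: uniq -d
prints one copy of each repeated line, so the expand/grep stage can go.
This is a sketch, and "dup_sigs" is just a name I picked for illustration.

```shell
# Hypothetical helper: read "digest filename" lines (sorted by digest)
# on stdin and print each digest that shows up more than once.
dup_sigs() {
    awk '{print $1}' | uniq -d
}
```

Usage would be: dup_sigs < /tmp/sig1 > /tmp/sig2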

Associate the duplicates with files:

  me% join /tmp/sig[12]
  ddbed53e795724e4a6683e7b0987284c ./authlog
  ddbed53e795724e4a6683e7b0987284c ./cmdlog

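If your filenames contain whitespace, join splits each name into fields and
glues them back together with single blanks, so runs of blanks or tabs get
mangled.  You can URL-encode the names, play some games with awk, or use perl.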
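
One possible awk game, as a rough sketch: treat only the first field as the
key and print matching lines untouched, so blanks inside the name survive.
"dup_files" is a made-up name, and the input is assumed to be sorted
"digest filename" lines like /tmp/sig1.  Names containing newlines would
still break it.

```shell
# Hypothetical helper: read sorted "digest filename" lines on stdin and
# print, unmodified, every line whose digest occurs more than once.
dup_files() {
    awk '
        {
            if ($1 == prev) {
                # Second or later line of a group: flush the held first
                # line once, then print the current line as-is.
                if (held != "") { print held; held = "" }
                print
            } else {
                # New digest: hold the line until we know it repeats.
                held = $0
            }
            prev = $1
        }'
}
```

Usage would be: dup_files < /tmp/sig1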

-- 
Karl Vogel                      I don't speak for the USAF or my company

This is really a lovely horse, I once rode her mother.
                                       --Ted Walsh, Horse Racing Commentator