

Ask HN: How to find duplicated directory subtrees - harperlee

Due to sloppy backups and synchronization (with crashplan, dropbox, and manual copies going back some years), I have a large collection of files that has been duplicated, partially merged, and updated locally. It's quite a mess. Do you know if there's any tool that would help to identify subtrees of the whole set that are identical, similar-and-strictly-newer, or similar? How would you approach the task of trimming it down to one clean version?
======
geoff-codes
Identical is pretty easy. Maybe something like this.

    
    
        #!/bin/sh
        # Usage: [dir] [depth], in either order. Hashes the contents of
        # directories under dir and reports sets with identical contents.
        here=$PWD
        dir=.
        depth=
        rm -f /tmp/list
        
        # Each argument is either an existing path (the tree to scan) or a depth.
        for i in "$@"; do
          [ -e "$i" ] && dir=$i || depth=$i
        done
        
        # -mindepth/-maxdepth pins the scan to one level when a depth is given;
        # the while-read loop survives spaces in paths, unlike for-over-$(find).
        find -L "$dir" ${depth:+-mindepth $depth -maxdepth $depth} -type d 2>/dev/null |
        while IFS= read -r each; do
          [ -x "$each" ] && cd "$each" || continue
          # Hash file names plus contents: tar's v listing (merged via 2>&1)
          # makes renames change the hash; timestamps are not part of it.
          sha=$(tar c . 2>/dev/null | tar xmvO 2>&1 | shasum | sed 's| .*||')
          # Skip empty directories, which would all hash identically.
          [ "$(find . 2>/dev/null | wc -l)" -gt 1 ] && echo "$sha $each" >> /tmp/list
          cd "$here"
        done
        
        # Any hash seen more than once marks a set of identical trees.
        for sha in $(sed 's| .*||' /tmp/list | sort | uniq -d); do
          echo Identical directories:
          grep "^$sha" /tmp/list | sed 's|^[^ ]* ||'
          echo
        done

Similar and "similar-and-strictly-newer" are both much trickier, as you have to
invent a rubric for what "similar" means, and `diff -qr` isn't going to tell
you if, say, the files are mostly the same but have been moved into a
subdirectory. So I'd probably use git, traversing the file tree by moving the
.git dir around, adding each candidate directory as a different branch, and
doing a `git gc` each time to try to keep the size of the index manageable.
Then doing a `git diff [--word-diff] [--stat] --find-copies-harder` between
branches will pick up files that have been moved around, etc. You could
literally do this for every directory and subdirectory, but if you can narrow
it down to, say, directories with the same basename, it would be substantially
easier.
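
A rough sketch of that idea, copying each candidate into the work tree
instead of physically moving the .git dir around. `backup1`, `backup2`, and
the `dedup` repo are hypothetical names, and the candidates are assumed not
to be git repos themselves:

        # One orphan branch per candidate directory, then a cross-branch
        # diff with --find-copies-harder to pair up moved/renamed files.
        git init dedup && cd dedup
        for src in ../backup1 ../backup2; do
          git switch --orphan "$(basename "$src")"  # empty branch, empty worktree
          cp -R "$src"/. .
          git add -A && git commit -qm "snapshot of $src"
          git gc --quiet                            # keep the object store compact
        done
        git diff --stat --find-copies-harder backup1 backup2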

On the other hand, I suffer from this same ailment and mostly don't bother
sorting it out. I just use something like
[trimtrees.pl](http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002001/eg/trimtrees.pl)
to make hard links between identical files, keeping the size of the
monstrosity in check.
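
A minimal sketch of that hard-linking idea, assuming paths without embedded
newlines and everything on one filesystem (hard links can't cross
filesystems):

        # Hash every file, then hard-link later duplicates to the first
        # file seen with each hash.
        find . -type f -exec shasum {} + | sort |
        while read -r hash path; do
          if [ "$hash" = "$last" ]; then
            ln -f "$keep" "$path"      # replace duplicate with a hard link
          else
            last=$hash keep=$path
          fi
        done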

------
hugopeixoto
I have exactly the same issue: tons of backup archives nested inside each
other. I have been reducing the mess manually, as I have not found a tool that
would help. My approach is to use `diff -qr` between directories I know
contain the same project and manually merge them / pick the best one. Merging
is definitely the worst part of the process, as it cannot be done
automatically. I thought about building a tool that builds some sort of sha
tree over the directories, but on the first try it generated many spurious
matches due to empty folders and stuff like that.
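
A sketch of that "sha tree" idea, for what it's worth: a Merkle-style hash
where a directory's hash is derived from its children's names and hashes,
with empty directories skipped so they can't match each other. `treehash` is
a hypothetical helper, not an existing tool:

        treehash() {
          # Recursive calls run inside $(...) subshells, so the plain
          # variables here don't clobber the caller's.
          entries=$(
            for f in "$1"/*; do                  # note: the glob skips dotfiles
              if [ -f "$f" ]; then
                printf '%s %s\n' "$(shasum < "$f" | cut -d' ' -f1)" "${f##*/}"
              elif [ -d "$f" ]; then
                sub=$(treehash "$f") && printf '%s %s\n' "$sub" "${f##*/}"
              fi
            done
          )
          [ -n "$entries" ] || return 1          # empty dirs contribute nothing
          printf '%s\n' "$entries" | shasum | cut -d' ' -f1
        }

Running it over every candidate (`for d in */; do h=$(treehash "$d") && echo
"$h $d"; done | sort`) groups identical subtrees by hash, at the cost of
rehashing files once per enclosing level.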

------
mtmail
When I sorted my photos, which were also copies of copies (especially after
moving laptops several times), I ended up creating a list of md5sums and
sorting by count to find duplicate folders.
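
One way to make that concrete, using hypothetical scratch files under /tmp:
hashes that occur more than once are duplicates, and directories contributing
many of them float to the top of the report:

        find . -type f -exec md5sum {} + | sort > /tmp/sums
        awk '{ print $1 }' /tmp/sums | uniq -d > /tmp/dup_hashes
        grep -Ff /tmp/dup_hashes /tmp/sums |    # keep only duplicated files
          sed 's|^[0-9a-f]*  ||; s|/[^/]*$||' | # strip hash and filename -> dir
          sort | uniq -c | sort -rn             # dirs with most duplicates first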

I'd be interested in something more reusable as well.

