
Verifying copies - snaky
http://jrs-s.net/2016/06/29/verifying-copies/
======
chungy
tar doesn't sort entries at archive creation time. I noticed this caveat when
I tried to use it to compare trees in the past. It's a detail most people
don't seem to realize about their file system, but files aren't (usually, on
most FSes) sorted in any particular order. By creation time order, MAYBE (no
guarantee, often not), but as directories get modified and files are deleted
and added, this gets clobbered anyhow. Use "ls -f" to see the raw, unsorted
list of a directory's contents. Since it is faster for tar to just use this
order than to pre-sort entries, that's the order they'll get added in.

What you want is a tar that is deterministically identical regardless of the
underlying file system's order. Older GNU tar releases have no such option
(1.28 and later add --sort=name), but that doesn't quite mean the end of the
quest. Behold:

    find bin -print0 | sort -Vz | tar -cf bin.tar --no-recursion --null -T -

The first two shouldn't be hard to parse. find emits file names separated by
NUL characters (so any whitespace or other special characters won't be an
issue) and passes them on to "sort", which uses a version sort (identical
rules to dpkg version sorting, and will disregard the locale sort) and both
reads and emits NUL-separated entries. The tar command should be pretty
self-explanatory: "-T -" tells it to read the file list from stdin, --null
specifies that the input list is separated by NULs, and --no-recursion
prevents it from recursing into directories, so it just adds an entry in the
tar for each directory -- the find|sort has already taken care of the
directory members, sorted. Optionally add --numeric-owner to tar if you want
to compare trees on different computers, but results might vary (rsync and
tar normally send/store both the id and name, and if the name exists on the
local machine, they use that instead of the id).
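
As a rough sketch of how the pipeline above could be used to compare two trees, the helper below hashes a sorted tar stream instead of writing an archive. The name `tree_digest` is made up for illustration, and the sketch assumes GNU find, sort, tar and sha256sum. It swaps the version sort for a bytewise sort under LC_ALL=C, which is enough for determinism. Because tar records modes, owners and modification times, two trees only get the same digest if that metadata matches as well as the contents.

```shell
# tree_digest DIR: print a SHA-256 digest of a deterministically ordered
# tar stream of DIR. Hypothetical helper, not part of tar itself.
tree_digest() {
    (
        # cd first so archived paths are relative ("./bar") and therefore
        # identical for two copies that live under different parents.
        cd "$1" || exit 1
        # LC_ALL=C makes the sort bytewise, independent of the locale.
        find . -print0 | LC_ALL=C sort -z |
            tar -cf - --no-recursion --null -T - |
            sha256sum | cut -d ' ' -f 1
    )
}
```

Running `tree_digest foo` and `tree_digest foo2` after a `cp -a foo foo2` should print the same digest; a difference means either contents or metadata diverged, and the digest alone does not say which.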

~~~
keeperofdakeys
You can probably replace that --no-recursion option on tar with '\! -type d'
on find.

~~~
chungy
If you don't care about comparing directory metadata, sure. Otherwise, it's
still useful.

------
liw
Since I develop backup software, this is a problem I needed to solve (e.g.,
for verifying that restore works). So I wrote a tool:
[http://liw.fi/summain/](http://liw.fi/summain/)

It produces output that is meant to be usefully diffable.

~~~
voltagex_
That's a neat looking application - bonus points for a good manpage, too.

That said, how does it handle firstpath/somefile.bin and
secondpath/somefile.bin being identical? This breaks your "diffable" output
because the paths are different.

~~~
liw
You run it in such a way that the paths are identical. For example, by using
the -r option.

    $ mkdir foo
    $ echo foo > foo/bar
    $ cp -a foo foo2
    $ summain -r foo > foo.summain
    $ summain -r foo2 > foo2.summain
    $ diff foo*.summain

------
keeperofdakeys
The easiest way to verify file checksums is to use "rsync -c". Rsync will
usually skip checksum verification of files if both the modification time and
size match on the source and destination; the "-c" option tells it to always
compute a checksum.

Sometimes I also use the following "one" liner, if it's local.

    
    
        diff <(sort <(cd /path/to/source; \
                      find . -type f -print0 | \
                      xargs -0 sha1sum)) \
             <(sort <(cd /path/to/destination; \
                      find . -type f -print0 | \
                      xargs -0 sha1sum))
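
A variation on the one-liner above, offered as a sketch rather than a drop-in replacement: it sorts by path before hashing, so the diff lines up file by file instead of by hash value, and it uses sha256sum instead of sha1sum. The function names `manifest` and `same_files` are invented for this example, and `xargs -r` and `sort -z` are assumed to be the GNU versions.

```shell
# manifest DIR: print "<sha256>  <relative path>" for every regular file
# under DIR, ordered by path. Hypothetical helper name.
manifest() {
    (
        cd "$1" || exit 1
        # Sort the paths first; xargs keeps that order when it runs sha256sum.
        find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum
    )
}

# same_files SRC DST: succeed if both trees hold the same regular files
# with the same contents; otherwise print the manifest diff and fail.
same_files() {
    a=$(mktemp) || return 2
    b=$(mktemp) || return 2
    manifest "$1" > "$a"
    manifest "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Usage would be `same_files /path/to/source /path/to/destination`. Only regular file contents are compared, so ownership, permissions, timestamps and empty directories are ignored.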
    

Another nice rsync tip is that if you aren't using -a, you should at least use
-t (preserve modification time). This will make a second rsync faster as it
can skip files if the modification time and size match.

------
mdadm
Is it just me, or is anyone else getting a 403 when trying to access this
page?

~~~
guitarbill
ditto (i'm using chrome). weirdly, after looking at the version on
archive.org, it started working, although that's probably coincidence.

------
wtbob
> For example, if we rsync -a /source /target, we trust that the contents of
> /target will exactly match the contents of /source

You might trust that, but you'd be wrong:

    
    
        $ cd /tmp
        $ mkdir foo
        $ cd foo
        $ mkdir bar
        $ mkdir baz
        $ touch bar/quux
        $ rsync -a ./bar ./baz
        $ ls baz
        bar
        $ ls -R baz
        baz:
        bar
        
        baz/bar:
        quux
    

You see, without a final '/' on the source, rsync will put the source _into_
the target, rather than synchronising the source and the target. Also, if you
really want a sync as opposed to just a full copy, you probably want to add
--delete (which will delete files in the target which don't exist in the
source).

So you probably want rsync -a --delete source/ target/.
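
To make the trailing-slash form harder to get wrong, the command above could be wrapped in a small function. This is only a sketch: `sync_dir` is a hypothetical name, and it assumes both arguments are local directory paths.

```shell
# sync_dir SRC DST: make DST mirror the contents of SRC.
# Hypothetical helper; the trailing slashes are the important part.
sync_dir() {
    # "${1%/}/" strips one trailing slash if present and then adds exactly
    # one, so the call behaves the same whether or not the caller typed it.
    # --delete removes files in DST that no longer exist in SRC.
    rsync -a --delete "${1%/}/" "${2%/}/"
}
```

With the directories from the session above, `sync_dir bar baz` should leave baz/quux in place rather than baz/bar/quux, and remove anything else that was sitting in baz.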

~~~
zeveb
Incidentally, given the directory structure above, here's a good way to
calculate sha256 checksums over the files. It shells out rather than use the
IRONCLAD package to calculate the checksums.

    
    
        (let ((src #P"/tmp/foo/bar/")
              (dst #P"/tmp/foo/baz/"))
          (flet ((sha256sum (path)
                   "Return the SHA256 checksum of PATH as a string (which is
        good enough for our purposes here)."
                   (first (split-sequence:split-sequence
                           #\Space
                           (uiop:run-program `("sha256sum" ,(namestring path))
                                             :output :string)))))
            (loop for src-path in (directory (uiop:merge-pathnames* "**/*.*" src))
               for dst-path = (uiop:merge-pathnames* (uiop:subpathp src-path src) dst)
               for file = (uiop:truename* dst-path)
               if file
               when (uiop:file-pathname-p file)
               do (unless (string= (sha256sum src-path) (sha256sum dst-path))
                    (warn "~a does not match ~a" dst-path src-path))
               end
               else
               do (warn "~a not found" dst-path))))

------
ktta
cached:
[http://web.archive.org/web/20160722100828/http://jrs-s.net/2...](http://web.archive.org/web/20160722100828/http://jrs-s.net/2016/06/29/verifying-copies/)

------
ashitlerferad
Just use diffoscope:

[http://diffoscope.org/](http://diffoscope.org/)

------
rxm
Nice read on how to check your rsyncs.

------
based2
[https://youtrack.jetbrains.com/issue/IDEA-158003#tab=Similar...](https://youtrack.jetbrains.com/issue/IDEA-158003#tab=Similar%20Issues)

------
hairymonkeybone
None of the methods described in the article is correct. md5sum does not
compare files, it compares checksums. One mostly-correct solution is the 'd'
(diff) option to GNU tar.
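
One way the tar approach might look in practice, as a sketch assuming GNU tar: archive the source to a pipe and have a second tar compare that stream against the destination. The name `tar_compare` is hypothetical. It compares file contents byte for byte along with mode, owner and modification time, but it only checks members that are in the archive, so extra files in the destination go unnoticed.

```shell
# tar_compare SRC DST: stream SRC as a tar archive and let GNU tar's
# --diff (-d) mode compare each member against the same path under DST.
# Hypothetical helper name; prints differences and returns non-zero on any.
tar_compare() {
    tar -cf - -C "$1" . | tar -df - -C "$2"
}
```

Usage would be `tar_compare /path/to/source /path/to/destination`, with no output and a zero exit status when nothing differs.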

~~~
pjc50
> md5sum does not compare files, it compares checksums.

... which is, almost all of the time, good enough and requires much less
bandwidth between the copies. These days it might be better to pick SHA256.

