

Tracking Changes in Directories with Python - tsileo
http://thomassileo.com/blog/2013/12/12/tracking-changes-in-directories-with-python/

======
yummyfajitas
I can't tell exactly what the goal is, but inotify might also be a simpler
solution.

(Specifically, if the goal is to monitor changes as they happen and the
service can be assumed to be continually running.)

~~~
tsileo
The service is not continually running, I use this method to make incremental
backups with archives stored on AWS Glacier and meta-data stored on S3 (the
index is stored on S3, and I can't access files on Glacier to compute deltas).
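To illustrate the kind of setup being described (a hash index kept alongside the archives so changes can be computed without touching the Glacier data), here is a rough stdlib sketch; the function names and the choice of SHA-256 are mine, not the article's exact code:

```python
import hashlib
import os

def dir_index(root):
    """Map each relative file path under root to its SHA-256 digest."""
    index = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            index[os.path.relpath(path, root)] = digest
    return index

def diff_index(old, new):
    """Return (added, removed, changed) path sets between two indexes."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {p for p in set(old) & set(new) if old[p] != new[p]}
    return added, removed, changed
```

Diffing yesterday's index against today's is enough to decide which files need a new archive, without ever reading back the archived data.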

~~~
cmsd2
I've been thinking about the best way to do this, and I too didn't want to
rely on having a backup script running all the time.

Windows seems to have a low-level API for querying the filesystem/vfs for
changes since you last looked: [http://msdn.microsoft.com/en-us/library/windows/desktop/aa36...](http://msdn.microsoft.com/en-us/library/windows/desktop/aa363798\(v=vs.85\).aspx)

And BTRFS has some ability to do this with find-new:
[http://www.tummy.com/blogs/2010/11/01/fun-with-btrfs-what-fi...](http://www.tummy.com/blogs/2010/11/01/fun-with-btrfs-what-files-have-changed/)

It's nice that btrfs has these interesting new features (see also
send/receive), but they're not exposed through the VFS, and I suppose they
never will be.

~~~
xradionut
The Windows solution would seem to only work if you have admin access to the
specific server or the rights to access the journal.

------
johtso
The watchdog library is great for this. It comes with an API and a command
line tool:
[https://github.com/gorakhargosh/watchdog](https://github.com/gorakhargosh/watchdog)

Compatibility:

    
    
      Linux 2.6 (inotify)
      Mac OS X (FSEvents, kqueue)
      FreeBSD/BSD (kqueue)
      Windows (ReadDirectoryChangesW with I/O completion ports; ReadDirectoryChangesW worker threads)
      OS-independent (polling the disk for directory snapshots and comparing them periodically; slow and not recommended)
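The OS-independent polling mode at the bottom of that list amounts to taking periodic directory snapshots and diffing them. A minimal stdlib sketch of that idea (the function names are mine, not watchdog's API):

```python
import os

def snapshot(root):
    """Record (mtime, size) for every file under root."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_mtime, st.st_size)
    return state

def poll_changes(before, after):
    """Compare two snapshots and report created/deleted/modified paths."""
    created = [p for p in after if p not in before]
    deleted = [p for p in before if p not in after]
    modified = [p for p in after if p in before and after[p] != before[p]]
    return created, deleted, modified
```

Walking and stat-ing the whole tree on every poll is exactly why this mode is slow on large directories, and why the native notification APIs above are preferred when available.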

------
netnichols
Since he asks for feedback at the end...

Using sha256 just to compute changes is probably overkill. Using md5 instead
is almost certainly adequate and will be a good deal faster.

~~~
pudquick
> good deal faster

You could always try it and see :)

Example test: openssl speed md5 sha256
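The same comparison can be run from Python with hashlib; this timing helper is a rough sketch of my own, and the actual numbers depend on the platform's OpenSSL build:

```python
import hashlib
import time

def hash_throughput(algo, mib=16):
    """Approximate MiB/s for a hashlib algorithm on an in-memory buffer."""
    data = b"\x00" * (1024 * 1024)  # 1 MiB of zeros
    start = time.perf_counter()
    for _ in range(mib):
        hashlib.new(algo, data).digest()
    return mib / (time.perf_counter() - start)

for algo in ("md5", "sha256"):
    print(f"{algo}: {hash_throughput(algo):.0f} MiB/s")
```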

As for why it was chosen, I think it's because there are known examples of MD5
hash collisions (though the likelihood of one on a filesystem is remote), and
SHA-1 was probably skipped because it's considered plausible that a collision
could be created (though so far only with weakened versions of SHA-1).

But - all this to say: the chances of having two different files with the same
MD5 hash _and_ identical size are vanishingly small. As such, for the known
MD5 collision mechanisms, the differing file size would be enough evidence
that something has changed.

... Why he didn't include file size in the metadata check, I can't tell you.
Timestamps can be faked - but generating a hash collision with a file of equal
size is a Hard problem.
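The size-plus-hash check is cheap to implement: stat the file first and only treat a matching hash as "unchanged" when the size matches too. A sketch under that assumption (the tuple format and names are illustrative):

```python
import hashlib
import os

def signature(path):
    """Return a (size, md5) pair for a file, hashing in 64 KiB chunks."""
    size = os.path.getsize(path)
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return size, h.hexdigest()

def unchanged(path, old_sig):
    """A file is unchanged only if both size and hash still match."""
    return signature(path) == old_sig
```

A size mismatch short-circuits any collision concern: an attacker would need a colliding file of exactly equal length, which is the Hard problem mentioned above.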

~~~
tsileo
Thanks for the feedback!

I hadn't thought about combining the file size with the hash to reduce
collisions, but I chose to stick with the last-modified time in the article
because it can take hours to compute hashes for a big directory tree.

Tools like rsync rely on the last-modified time by default, and since I want
to use this to track my own files, I won't be faking it, so I think it's not a
big deal?

~~~
netnichols
It's not just that it could be faked; you could also modify a file by accident
without its modification date changing. For example, say you edit a photo, but
later run a script that resets the file's modification date to the EXIF date
embedded in the photo.

So I guess the point is that also including the file size will be one more
(fast) data point to help ensure 'accurate' change tracking, without adding
the overhead of computing content hashes.
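Since both fields come out of the same stat() call, adding size next to the last-modified time really is free. A sketch of such an index entry (field and function names are my own, not the article's):

```python
import os

def index_entry(path):
    """mtime and size come from one stat() call, so size is effectively free."""
    st = os.stat(path)
    return {"mtime": st.st_mtime, "size": st.st_size}

def maybe_changed(path, old_entry):
    """A mismatch on either field flags the file for re-hashing or re-upload."""
    return index_entry(path) != old_entry
```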

~~~
tsileo
I think I will add the file size to the index, since it's really cheap.

Thanks!

