

Ask HN: Freely available test-data for a binary diff corpus? - thristian

As a hobby project, I'm working on a binary diff tool[1] that might someday compete with xdelta[2] or even (if I'm lucky) bsdiff[3]. Since binary diffing, like file-compression, ought to work well on a variety of inputs, I'm looking for some "representative" examples of source/target pairs I could add to my test suite and distribute for others to replicate my findings.

File-compression hobbyists have the Calgary Corpus[4] and the Canterbury Corpus[5] among others, but I can't find an equivalent binary-diff or delta-encoding corpus. So far I've mostly been working with fan-translations of video-games, which is good exercise for my code, but makes it legally impossible to share my test data.

Any suggestions?

[1]: https://gitorious.org/python-blip
[2]: http://code.google.com/p/xdelta/
[3]: http://www.daemonology.net/bsdiff/
[4]: http://en.wikipedia.org/wiki/Calgary_Corpus
[5]: http://corpus.canterbury.ac.nz/purpose.html
======
0x0
How about using the binary installers for various versions of open source
software? Run diffs on things like firefox-setup-3.5.0.exe to
firefox-setup-3.5.1.exe and such.

~~~
thristian
The trouble with binary installers is that they tend to be heavily compressed
to begin with, which obscures whatever similarities a diff tool might find -
in much the same way that compressing a file prevents other compression tools
from doing a very good job on it.
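
A rough way to see that effect, as a sketch rather than a real benchmark: the Python below builds two nearly identical byte strings, then measures how many 32-byte windows of the target also appear somewhere in the source, before and after zlib compression. The test data, the `shared_window_fraction` helper, and the 32-byte window are all assumptions made up for illustration, and the window count is only a crude stand-in for what an actual delta encoder could copy.

```python
import random
import zlib


def shared_window_fraction(source: bytes, target: bytes, window: int = 32) -> float:
    """Fraction of window-sized slices of target that occur anywhere in source."""
    seen = {source[i:i + window] for i in range(len(source) - window + 1)}
    total = len(target) - window + 1
    hits = sum(target[i:i + window] in seen for i in range(total))
    return hits / total


# Hypothetical test data: pseudo-random "text" drawn from a small vocabulary,
# so that it is compressible but not trivially repetitive.
rng = random.Random(0)
vocabulary = [b"word%02d" % i for i in range(64)]
source = b" ".join(rng.choice(vocabulary) for _ in range(3000))

# The target is the source with four new bytes inserted near the start.
target = source[:100] + b"\x00\x01\x02\x03" + source[100:]

# Similarity of the raw files versus similarity of their compressed forms.
raw = shared_window_fraction(source, target)
packed = shared_window_fraction(zlib.compress(source), zlib.compress(target))

print(f"raw: {raw:.3f}  compressed: {packed:.3f}")
```

On the raw bytes, only the windows that overlap the inserted bytes fail to match. After compression the small insertion changes the Huffman coding of everything that follows, so far fewer windows should line up.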

I'll add binary tarballs of some open-source software to my suite, but it'd be
nice if I had some files that _weren't_ x86 executables, too. :)

