

Normalized Compression Distance (Wikipedia) - tylermauthe
http://en.wikipedia.org/wiki/Normalized_compression_distance

======
bediger4000
The NCD that Wikipedia has you calculate is not the only one.

"Common Pitfalls Using the Normalized Compression Distance: What to Watch Out
For In A Compressor", by Manuel Cebrian, Manuel Alfonsecsa and Alfonso Ortega
([http://www.ims.cuhk.edu.hk/~cis/2005.4/01.pdf](http://www.ims.cuhk.edu.hk/~cis/2005.4/01.pdf))
has another way to calculate it:

NCD(x, y) = max(C(xy) - C(x), C(yx) - C(y))/max(C(x), C(y))

Where x and y are strings, xy is x concatenated with y, and C(x) is the
compressed length of string x.

This version of NCD is 0.0 for NCD(x,x) for any x, while the Wikipedia version
isn't necessarily so. Also this NCD(x,y) = NCD(y,x) which is really handy.

For an interesting use, see:
[http://clair.si.umich.edu/si767/papers/Week12/msa/Bennett.pd...](http://clair.si.umich.edu/si767/papers/Week12/msa/Bennett.pdf),
"Chain Letters and Evolutionary Histories". Pretty darn good.

------
tylermauthe
This clever technique has all sorts of applications. Say you have 3 strings,
NCD allows you to determine which two of the three are most closely related.

I have used this professionally for: \- Detecting plagiarism in submitted
works \- Detecting languages of strings to catch translation issues

