

Machine Learning Teaches Me How to Write Better AI - moconnor
http://yieldthought.com/post/95722882055/machine-learning-teaches-me-how-to-write-better-ai

======
moconnor
Some more details:

* I used the complearn package from [http://complearn.org/](http://complearn.org/) - it hasn't been updated in a while and needed a few minor changes to compile on Ubuntu 14; drop me a line if you'd like the patch.

* Similarity clustering with NCD works best when you can visually inspect the items and get _some_ level of understanding as to what they are doing. I recently clustered all of our web logs on a monthly basis and found that, while e.g. summer months clustered together (typically around releases and conferences), there were a couple of unexpected winter months in there too. The logs were too large to inspect for differences by hand; in a case like that it is more useful to cluster e.g. individual requests, or summary tables of each month. (There's a minimal sketch of the NCD computation itself after this list.)

* Wesnoth is a great example of an open source game that continues to grow and develop and the team are friendly and welcoming. I can highly recommend contributing to it!
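
For anyone who wants to try this: the distance itself is only a few lines around a compressor. Here's a minimal sketch in Python, with zlib standing in purely for illustration rather than anything complearn-specific; the formula is the standard NCD definition:

    import zlib

    def compressed_size(data: bytes) -> int:
        # C(x): the size of the compressor's output for x.
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        # Roughly 0 for near-identical inputs, roughly 1 for unrelated
        # ones (real compressors can push it slightly above 1).
        cx, cy = compressed_size(x), compressed_size(y)
        return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)

To cluster monthly logs you'd compute this for every pair of months and feed the resulting distance matrix to whatever clustering method you prefer.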

~~~
__Joker
_Similarity clustering with NCD works best when you can visually inspect the items..._

Doesn't this apply to all clustering algorithms? And isn't that exactly what we want to avoid with clustering, since data in higher dimensions is not easy to visualize?

~~~
moconnor
It's a particular problem with NCD because you don't always know what the
compressor is measuring. At least with the L2 distance on a vector you know
that those points were close to each other under a well-defined notion of
'close'. If it is surprising that they are close, you might want to
investigate whether your feature selection really makes sense for these items.

With NCD the compressor selects its own features in a largely opaque way. This
makes it fun to use but difficult to debug!
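
As a contrived sketch of that opacity (zlib again, purely for illustration): two byte strings that share no content at all can still come out looking 'close', simply because both are low-entropy and compress to almost nothing:

    import os
    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    zeros, ones = b"\x00" * 1000, b"\x01" * 1000
    print(ncd(zeros, ones))   # well below 1, despite sharing no byte values
    print(ncd(os.urandom(1000), os.urandom(1000)))  # near 1: unrelated noise

If those two items landed in the same cluster there would be no feature vector to inspect, only the compressor's behaviour to reason about.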

------
mswen
Thanks for the description of your workflow and the reflective thinking about
your journey to get there. I was just looking at both R and Julia for
similarity distances earlier today and thinking about which might work better
in an automated analytic application. Your post convinced me to go take a look
at complearn.org.

What is your sense of how it compares with kNN or hierarchical clustering at
the clustering level, or with various statistical distance measures like
Euclidean, Manhattan, etc. at the raw-distance level?

~~~
moconnor
NCD is just a distance measure, so it can be used for kNN; in this case it
actually was used for hierarchical clustering, in the form of the unrooted
binary tree.
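
A hedged sketch of that pipeline in Python, with scipy's average-linkage standing in for complearn's tree search (a different algorithm) and invented items:

    import zlib
    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import squareform

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    # Hypothetical items; note they need not all be the same length.
    names = ["a1", "a2", "b1", "b2"]
    items = [b"attack north then regroup " * 40,
             b"attack north then retreat " * 35,
             b"defend the keep at all costs " * 50,
             b"defend the keep and counter " * 45]

    # Pairwise NCD matrix. NCD(x, x) isn't exactly 0 with a real
    # compressor, so the diagonal is left at 0 explicitly.
    n = len(items)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = ncd(items[i], items[j])

    # Average-linkage clustering on the precomputed distances; a
    # dendrogram is the rooted cousin of the unrooted binary tree.
    Z = linkage(squareform(d), method="average")
    print(dendrogram(Z, labels=names, no_plot=True)["ivl"])  # leaf order

For kNN you'd hand the same precomputed matrix to any neighbour search that accepts pairwise distances.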

As for how NCD compares to Euclidean, Manhattan distances and so on: it's
interesting. One of NCD's strengths is that you can apply it even if your data
is not readily representable as a fixed-length vector of numbers, whereas
most other distance measures require this and it's not always convenient to
represent input data in that form.

For example, in this application games may be of widely varying lengths. That
doesn't matter when computing the NCD, but for a vector distance I'd have had
to pad out the "missing" values from shorter runs to the lengths of longer
ones, truncate the long ones, or resize them all with some smoothing.

None of those seemed like particularly good options for this application.
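
To make that concrete (the same zlib-based ncd() sketch as above, with invented 'games'): the two inputs below differ in length by an order of magnitude and NCD consumes them as they are:

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    short_game = b"\x01\x02\x03\x02" * 10                 # 40 'moves'
    long_game = b"\x01\x02\x03\x02" * 100 + b"\x07" * 8   # 408 'moves'

    # No padding, truncating or resizing step needed first.
    print(ncd(short_game, long_game))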

Also, with time series (which this essentially is), most compressors are good
at recognizing patterns repeated across items. That means that 0, 0, 4, 7, 9,
4, 0 is very close to 0, 4, 7, 9, 4, 0, 0 under NCD, but comparatively
distant under Euclidean, Manhattan and cosine distances, because those
measures compare the vectors element by element, position by position.
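
You can check that directly, with one caveat: on a literal 7-element sequence the compressor's fixed overhead swamps any structure, so this sketch tiles the two patterns to give the compressor something to find:

    import zlib
    import numpy as np

    def ncd(x: bytes, y: bytes) -> float:
        c = lambda d: len(zlib.compress(d, 9))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    a = bytes([0, 0, 4, 7, 9, 4, 0]) * 200  # the pattern, repeated
    b = bytes([0, 4, 7, 9, 4, 0, 0]) * 200  # the same pattern, shifted

    print("NCD:      ", ncd(a, b))  # small: b reuses a's pattern
    va = np.frombuffer(a, dtype=np.uint8).astype(float)
    vb = np.frombuffer(b, dtype=np.uint8).astype(float)
    print("Euclidean:", np.linalg.norm(va - vb))  # large: positions mismatch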

In short, your mileage may vary and will depend on the application. I like NCD
because it's super easy to throw it at almost any problem and get a quick
understanding of how the data is structured.

The algorithms in complearn are slow. I'm currently experimenting with much
faster ones that scale better.

