

Measure distance in n dimensions with Pythagoras - herdrick
http://betterexplained.com/articles/measure-any-distance-with-the-pythagorean-theorem/
Very simple, but there are some good examples here about how you might do this to quantify similarity in users based on their expressed preferences.  Simple techniques are often best.
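The preference-similarity idea from the article can be sketched in a few lines. This is a minimal illustration with invented 1-5 movie ratings, not code from the article:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length preference vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical ratings of three movies by three users
alice = [5, 1, 4]
bob   = [5, 2, 4]   # tastes close to Alice's
carol = [1, 5, 2]   # quite different tastes

# Smaller distance = more similar users
assert euclidean(alice, bob) < euclidean(alice, carol)
```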
======
pixcavator
You can't apply this stuff mindlessly. Consider this suggestion from the
article: "Differences between people: (Height, Weight, Age)". If you use the
formula, you end up with inch^2 + pound^2 + year^2. Mixing units is bad
enough. Now imagine how the results will change if you switch to the metric
system.
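To make the unit problem concrete, here's a made-up sketch (the heights and weights are invented): which of two people counts as "nearer" to a reference person flips when you switch from imperial to metric, because the raw formula mixes incommensurable units. The usual fix is to normalize each feature (e.g. to z-scores) before computing distance:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (height, weight) for three hypothetical people
p_imp = (70, 150)   # inches, pounds
q_imp = (70, 160)   # same height, 10 lb heavier
r_imp = (61, 150)   # 9 in shorter, same weight

# The same people in metric units (cm, kg)
to_metric = lambda h, w: (h * 2.54, w * 0.4536)
p_met, q_met, r_met = (to_metric(*x) for x in (p_imp, q_imp, r_imp))

# In imperial units R looks closer to P; in metric units Q does.
assert dist(p_imp, r_imp) < dist(p_imp, q_imp)
assert dist(p_met, q_met) < dist(p_met, r_met)
```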

~~~
bluishgreen
I think the missing element is this: you need to establish that the domain
that you are trying to measure is a metric space before trying to measure
distance using this formula.

<http://en.wikipedia.org/wiki/Metric_space>

edit: read the second section on how to establish something is a metric space.
(in the wiki article)
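The axioms in that article can be spot-checked mechanically on a finite sample of points. A sketch (passing the sample can't prove a function is a metric, but failing it disproves it — squared Euclidean distance fails the triangle inequality, for instance):

```python
def is_metric_on_sample(d, points, tol=1e-9):
    """Spot-check the metric axioms (non-negativity, d(x,x)=0,
    symmetry, triangle inequality) on a finite set of points."""
    for x in points:
        if abs(d(x, x)) > tol:
            return False
        for y in points:
            if d(x, y) < -tol or abs(d(x, y) - d(y, x)) > tol:
                return False
            for z in points:
                if d(x, z) > d(x, y) + d(y, z) + tol:
                    return False
    return True

euclid = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
squared = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))

pts = [(0, 0), (1, 0), (2, 0), (0, 3)]
assert is_metric_on_sample(euclid, pts)
# Squared distance violates the triangle inequality:
# d((0,0),(2,0)) = 4 > d((0,0),(1,0)) + d((1,0),(2,0)) = 2
assert not is_metric_on_sample(squared, pts)
```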

~~~
pixcavator
Exactly! It is a big mistake to assume that you can express relations between
the entities you study with a single number. However, I've never heard of
non-metric topological spaces used in applications.

------
jgrahamc
The use of the color distance measure turns out to be important in spam
filtering because of the 'Camouflage' trick used by some spammers where
similar, but not identical colors are used to mask chunks of 'good' text
inserted to try to fool spam filters.

From <http://www.jgc.org/tsc/>

Camouflage (GWI!Camouflage!HTML) What: Like Invisible Ink, but instead of
using identical colors (e.g. white on white) use very similar colors. Date
added: June 2, 2003. Example from the wild: the colors 113333, 123939, and
423939 are chosen to be very similar without being the same:

<table bgcolor="#113333"><tr><td><font color="#123939">those rearing
lands</font><br> <table><tr><td><br><font color="yellow" size=5><b>Plasticine
sex-cartoons.</b></font><br> <font color="#423939">eel harness
highest</font><br> <font color="white" size=3>Absolutely new category of adu1t
sites. </td></tr></table> <font color="#123939">nobody jets
held<br>Northumbria- diamond sleep</font></td></tr></table>
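Detecting that trick reduces to exactly the distance measure under discussion: parse the hex colors and flag text whose color is close to, but not identical with, the background. A sketch (the threshold is made up; a real filter would likely work in a perceptual color space rather than raw RGB):

```python
import math

def rgb(hex_color):
    """Parse '#RRGGBB' into an (r, g, b) tuple of ints."""
    h = hex_color.lstrip('#')
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def color_distance(c1, c2):
    """Euclidean distance between two colors in RGB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rgb(c1), rgb(c2))))

# Colors from the spam example above
bg, text = '#113333', '#123939'
d = color_distance(bg, text)          # sqrt(1 + 36 + 36), about 8.5

# Small but nonzero distance: nearly invisible text, the camouflage signature.
CAMOUFLAGE_THRESHOLD = 30  # invented cutoff, for illustration only
assert 0 < d < CAMOUFLAGE_THRESHOLD
```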

------
ivankirigin
Cool.

But finding the distance between two N-D points isn't really hard at all.

What is hard is finding the distance between a point and a set of points. Or a
set and a set.

Doing exhaustive search is wasteful.

If you're interested, look into K-D trees as a real solution. Best-bin-first
modified K-D trees are the basis of the SIFT object instance recognition
feature matching algorithm. Break an image into a set of ND features. Matching
a geometrically consistent subset of those features to a previously seen
object works extremely well.

The algorithm is general. Change and add features to make it more powerful or
faster. But the idea of using a set of ND linear features to describe an
object should last for years.
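A toy k-d tree in Python shows the basic idea (this is plain exact nearest-neighbour search, not the best-bin-first variant SIFT uses — the pruning step is what saves you from exhaustive search):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree, splitting on axes in rotation."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        'point': points[mid],
        'left': build_kdtree(points[:mid], depth + 1),
        'right': build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Exact nearest-neighbour search with branch pruning."""
    if node is None:
        return best
    if best is None or math.dist(node['point'], target) < math.dist(best, target):
        best = node['point']
    axis = depth % len(target)
    diff = target[axis] - node['point'][axis]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    best = nearest(near, target, depth + 1, best)
    # Only visit the far side if the splitting plane is closer than the
    # current best -- this pruning is what beats exhaustive search.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
assert nearest(tree, (9, 2)) == (8, 1)
```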

------
tel
Not to snipe at anyone, but skills like this seem pretty essential in solving
problems in general. Little tricks from linear algebra like magnitude (what
this post concerns) and projections show up in interesting and useful places.

Additionally, while they /are/ harder to access, reading and understanding
various proofs in Math can be an even more beautiful and enlightening
experience than just seeing the practical result.

Maybe it's the difference between realizing that setf can be used on all the
generalized variables and actually reading the source and seeing why. Math is
full of clever hacks.

------
ced
Pythagoras' theorem only holds in Euclidean geometry (or so Wikipedia says).
The computer world has no notion of space, so there's no a priori reason for
choosing Pythagoras over other norms%. Has anyone here experimented with
alternatives?

%See for example: <http://en.wikipedia.org/wiki/Distance>

~~~
pixcavator
I've used the 1-norm, aka "Manhattan", distance (with weights of course) for
visual image search. I think the 1-norm is always a better choice than the
Euclidean distance unless there is some clear geometric context.

~~~
machine
The 1-norm tends to also be less sensitive to outliers, and in machine
learning, 1-norm regularization leads to sparse solutions. The real reason
2-norm is popular is that it is easy to minimize (differentiable).
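The outlier-sensitivity point is easy to see with made-up vectors: the 2-norm squares differences, so one large deviation dominates; the 1-norm just sums them. A sketch:

```python
def norm1(a, b):
    """Manhattan (1-norm) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def norm2(a, b):
    """Euclidean (2-norm) distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

ref   = (0, 0, 0, 0)
even  = (3, 3, 3, 3)    # moderate differences in every feature
spike = (0, 0, 0, 10)   # identical except one outlier feature

# 1-norm says `even` is farther (12 vs 10);
# 2-norm says `spike` is farther (6 vs 10) -- the outlier dominates.
assert norm1(ref, even) > norm1(ref, spike)
assert norm2(ref, even) < norm2(ref, spike)
```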

------
ed
Very timely post. I was in bed last night trying to visualize how you might go
about measuring distance in 4-dimensional space...

While the method presented is really simple, it's tricky to think of it as
"intuitive" when n>3 without looking at a proof.
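One way to ground the n>3 case without a full proof is the inductive picture: each extra coordinate is just the 2-D theorem applied once more, collapsing the previous distance into one leg of a new right triangle. A sketch:

```python
import math

# 4-D distance by applying the 2-D theorem three times: each hypot call
# folds one more coordinate into a running hypotenuse.
def dist4(p, q):
    dx, dy, dz, dw = (a - b for a, b in zip(p, q))
    return math.hypot(math.hypot(math.hypot(dx, dy), dz), dw)

p, q = (1, 2, 3, 4), (2, 4, 6, 8)
direct = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
assert abs(dist4(p, q) - direct) < 1e-12
```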


