

Data Mining: Finding Similar Items and Users - bad_user
http://bionicspirit.com/blog/2012/01/16/cosine-similarity-euclidean-distance.html

======
zeratul
First, a distinction should be made between sparse (nearly binary) data and
dense (nearly continuous) data. Second, a distinction should be made between
unsupervised and supervised problems. Then present the most common similarity
measures and variable-selection algorithms for each of the four resulting
types of problems.

This is a vast area of research. Diving in head-first might result in serious
injury. For example, the R package _simba_
(<http://cran.r-project.org/web/packages/simba/index.html>) lists 56 different
similarity/dissimilarity measures just for binary data.
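To make the distinction concrete, here is a minimal plain-Python sketch of one
common measure from each family: Jaccard for binary data, cosine for dense
data (illustrative only, not a recommendation over the 56 others):

```python
import math

def jaccard(a, b):
    """Jaccard similarity for binary data: treat each item as the set
    of features that are present."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine similarity for dense vectors: angle between them,
    ignoring magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Binary data: items share 2 of 4 distinct features.
print(jaccard([1, 2, 3], [2, 3, 4]))     # 0.5
# Dense data: same direction, different magnitude.
print(cosine([1.0, 2.0], [2.0, 4.0]))    # 1.0
```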

------
kaddar
This is a bit of an oversimplification of data mining, to the point where I am
not sure it is useful. Most interesting data lives in only a subset of a large
feature set, where most features are irrelevant to the similarity metric. Take
movies, for example: if you tried to find similar movies using all features,
key grip names and minor actors would unrealistically mess up your similarity
score. This relates to the "curse of dimensionality".

Many data mining approaches therefore start with feature selection or feature
extraction: a step that finds the relevant feature subsets, or discovers the
underlying features of the data set.

Reverse image search and the winning solution to the Netflix Prize both used
feature extraction approaches.
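The feature-selection idea can be sketched in a few lines; this hypothetical
variance threshold is far cruder than what real systems use, but it shows the
shape of the step:

```python
import statistics

def select_features(rows, min_variance=0.01):
    """Keep indices of features whose variance across rows exceeds a
    threshold. A near-constant feature (a key grip's name appearing on
    almost no movies, say) carries little signal for similarity."""
    keep = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if statistics.pvariance(column) > min_variance:
            keep.append(j)
    return keep

# Feature 0 is constant, so only features 1 and 2 survive.
print(select_features([[1, 0, 5], [1, 1, 7], [1, 0, 9]]))  # [1, 2]
```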

~~~
bad_user
It's an introduction with which you can solve many simple use cases.
Obviously it won't get you to win the Netflix Prize :)

------
arnoldoMuller
You can solve complex things with k-nearest neighbours as long as you use an
appropriate distance function. That's the beauty of it: the distance function
abstracts away the complexity. I tackled a tricky biology problem by applying
a cascade of similarity filters. Check out the presentation I gave at
Clojure/conj 2011: <http://prezi.com/zaaoq6pjrl2z/clojure-conj-final/>
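The point about the distance function carrying all the complexity fits in a
few lines of Python; the cascade below is a hypothetical sketch, not the
actual filters from the talk:

```python
def knn(query, items, distance, k=3):
    """k-nearest-neighbour: all the domain knowledge lives in `distance`.
    Swap the distance function and the same code solves a new problem."""
    return sorted(items, key=lambda item: distance(query, item))[:k]

def cascade(candidates, filters):
    """Apply (similarity_fn, threshold) filters in order, cheapest first,
    so expensive comparisons only run on survivors of earlier stages."""
    for similarity, threshold in filters:
        candidates = [c for c in candidates if similarity(c) >= threshold]
    return candidates

# Toy example: nearest numbers under absolute difference.
print(knn(6, [1, 5, 9, 10], lambda q, x: abs(q - x), k=2))  # [5, 9]
```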

My startup provides a very fast similarity engine (in a DB of 100 million
objects I can find similar objects in under 20 ms with one CPU), in case
you're worried about scalability. URL: <http://simmachines.com>

~~~
dantheman
Arnoldo, is there a video of your talk online?

~~~
arnoldoMuller
Hi Danny:

Unfortunately not yet; they will be released, but I'm not sure when.

~~~
dantheman
Do you use twitter or have a blog?

~~~
arnoldoMuller
My twitter is: @amuller :) I will e-mail you when the videos are released.

------
boolean
This was useful. In my case I have a million articles, and I want to group
related ones together, similar to Google News. I'm guessing I can use one of
the algorithms (cosine similarity) to calculate the similarity of every pair
of articles and group the close ones together. Any recommendations on how I
should go about it? I'm trying to find Python libraries that can make this
easier.
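Sketching what I have in mind (a toy TF-IDF plus cosine, in plain Python;
I assume a real library would do this better):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: lists of tokens. Returns sparse {term: weight} vectors,
    weighting terms by frequency in the doc and rarity in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def sparse_cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Though I realize comparing every pair is O(n²), which won't fly for a million
articles; presumably some indexing or hashing trick is needed there.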

~~~
gtani
Not entirely clear what you're asking: whether it's clustering by topic,
picking out specific named/physical entities, or maybe sentiment analysis.

Two good first steps to look into, depending on your needs, are Bayesian
classifiers and SVD (reduction of high dimensionality; its application to text
processing was patented as Latent Semantic Indexing/Analysis, LSI or LSA, by
IBM, though I don't know if that patent has lapsed).
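A tiny numpy sketch of the SVD idea on a toy term-document matrix (data and
`k` are illustrative only):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# Docs 0-1 are about pets, docs 2-3 about finance.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # "cat"
    [1.0, 1.0, 0.0, 0.0],   # "dog"
    [0.0, 0.0, 1.0, 1.0],   # "stock"
    [0.0, 0.0, 1.0, 1.0],   # "bond"
])

# Truncated SVD: keep only the top-k latent "topics".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T   # one row per document

def latent_cosine(i, j):
    """Cosine similarity between documents i and j in the latent space."""
    a, b = doc_vecs[i], doc_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Documents 0 and 1 come out with similarity 1.0, documents 0 and 2 with 0.0,
matching the two obvious topics.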

------
chrisacky
Thanks for your contribution.

Like many people, I'm unfamiliar with data mining for this type of matching.

For example, I want to provide "similar items" for vacation rentals, where the
"dimensions" or attributes could be "location", "bedrooms", "price", etc.
It's hard to quantify what might be more relevant to someone based on the
properties they have previously been viewing.

Instead I have just taken the approach of creating a bounding box based on the
geo coordinates, and then offering up similar properties within the user's
search price range. But I would really love to eventually implement something
like your original article. (Suggestions welcome.)
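Roughly what I'm doing now, sketched in Python (field names and thresholds
are just illustrative):

```python
def similar_rentals(target, rentals, box_deg=0.5, price_band=0.25):
    """Rentals inside a lat/lon bounding box around `target` that are
    also within +/- price_band of its price."""
    lo = target["price"] * (1 - price_band)
    hi = target["price"] * (1 + price_band)
    return [
        r for r in rentals
        if r is not target
        and abs(r["lat"] - target["lat"]) <= box_deg
        and abs(r["lon"] - target["lon"]) <= box_deg
        and lo <= r["price"] <= hi
    ]
```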

~~~
bad_user
Amazon takes an indirect approach. They are not necessarily comparing items
directly to offer suggestions (although they probably take categories into
consideration), because they have a good stream of traffic, ratings and
purchases to rely on. Having more, better data gives better results than
smart algorithms.

Their suggestions are of the form: customers who viewed this item also viewed;
customers who viewed this item ended up buying; customers who bought this
product also bought these other products.

That last metric in particular is interesting, because it tells you, for a
given product, which complementary products customers may be interested in.
So you don't actually have to somehow measure the physical properties of
the objects being sold to discover relationships.
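That "also bought" signal is just co-occurrence counting over orders; a
minimal sketch (illustrative Python, obviously not Amazon's actual system):

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_purchase_counts(orders):
    """orders: iterable of sets of product ids bought together.
    Returns {product: Counter of products bought alongside it}."""
    co = defaultdict(Counter)
    for order in orders:
        for a, b in combinations(sorted(order), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def also_bought(co, product, k=3):
    """Top-k products most often bought alongside `product`."""
    return [p for p, _ in co[product].most_common(k)]
```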

In your case I don't have knowledge about the problem domain to give advice,
but "customers that viewed this deal also viewed ..." is always a great
addition. Also add ratings and follow-up on people with emails to rate on
their vacation, after coming back from the trip. I don't know how well it will
work - there's no general solution, you try something and if it doesn't work,
try something else.

~~~
chrisacky
Thanks for the tip bad_user. I hadn't actually thought about trying to figure
out "customers that viewed this item also viewed".

I could very easily create something that takes every visit, MapReduces it,
and then tracks the entropy between potential matches to provide the "best"
match, based on user visits to that property as well.

To extend the example, it would also be really great to know that people from
Germany aren't interested in the slightest in our Italian properties, based on
trends in their national behaviour.

------
jacabado
relevant: "A collection of command-line tools for researchers in machine
learning, data mining, and related fields." <http://waffles.sourceforge.net/>

------
swiil
I think the community needs this sort of tutorial to really get value out of
the vast amounts of data that we collectively gather. Everyone needs a place
to start.

~~~
gtani
The "Collective Intelligence" books, by Alag and by Marmanis/Babenko, are
well done (the source code is Java). Along with the NLP texts by
Jurafsky/Martin and Manning/Schütze, the Norvig/Russell AI text, and a large
number of good texts on data mining (the first one I bought, by Witten/Frank,
has recently been updated and has lots of Weka examples), these should give
you a good base. The data collection and cleaning side (spidering,
scraping/information extraction, database dedup/record linkage) is the part
that's not as well documented.

<http://www.manning.com/alag/>

<http://www.manning.com/marmanis/>

------
herval
On a related note, I just started the "Programming Collective Intelligence"
book and I'm in the process of porting its code to Ruby. In case anyone wants
to contribute: <https://github.com/herval/ruby_intelligence>

------
joshu
This is a nice intro, but it doesn't get into how to do this for a very large
data set.

