

Similarity search based on kNN search of Delicious tags. - photon_off
http://www.moreofit.com/#AutoSuggest

======
photon_off
I posted this when I first launched at the end of July, and I figured it's
relevant to share it again to show just how useful Delicious data is.
Moreofit.com has an index of about 250,000 websites and their tag weights,
courtesy of Delicious.com. They're normalized and stored in a manner that
allows for a somewhat efficient kNN search to be applied for a given set of
tags. The results: For a given URL, you can see the "closest" URLs in 50,000+
dimensional tag space. If you go into "custom tag search" you can vary the
weights of the tags and explore the tag space that way, too.
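
To make the "custom tag search" idea concrete, here is a minimal sketch of one
way such a lookup could work. It assumes an inverted index from tag to
(url, weight) pairs, which is only a guess at what "stored in a manner that
allows for a somewhat efficient kNN search" might mean, not Moreofit's actual
storage. The names unit, build_postings and custom_tag_search, and the
example sites, are all hypothetical.

```python
import math
from collections import defaultdict


def unit(tags):
    """Scale a {tag: weight} dict to unit length (empty dict if all zero)."""
    norm = math.sqrt(sum(w * w for w in tags.values()))
    return {t: w / norm for t, w in tags.items()} if norm else {}


def build_postings(sites):
    """Map each tag to the (url, normalized weight) pairs that carry it."""
    postings = defaultdict(list)
    for url, tags in sites.items():
        for tag, weight in unit(tags).items():
            postings[tag].append((url, weight))
    return postings


def custom_tag_search(postings, query, k=3):
    """Score only URLs that share a tag with the query, by dot product."""
    scores = defaultdict(float)
    for tag, q_weight in unit(query).items():
        for url, weight in postings.get(tag, []):
            scores[url] += q_weight * weight
    # Highest score first; ties broken by URL so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:k]]


# Made-up tag weights, purely for illustration.
sites = {
    "news.example": {"news": 8, "tech": 2},
    "blog.example": {"tech": 6, "programming": 4},
    "food.example": {"recipes": 9, "cooking": 1},
}
postings = build_postings(sites)
results = custom_tag_search(postings, {"tech": 3, "programming": 1})
```

Because every vector is scaled to unit length, ranking by descending dot
product gives the same order as ranking by ascending Euclidean distance, and
URLs that share no tag with the query are never touched.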

More precisely: each website has a vector of approximately 50,000 dimensions,
49,990 of which are 0, but 10 of which have a value. The "closest" URLs in
this space are those which have the smallest distance from the searched URL.
I've thought of doing singular value decomposition, but frankly never got
around to it, because the results are really quite good. This is just the tip
of the iceberg -- I'm sure Delicious has well over 10,000,000 URLs.
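
As a rough illustration of the distance idea above: with only about 10 nonzero
entries per site, each vector can be kept as a small dict rather than a
50,000-element array. The sketch below assumes Euclidean distance on
unit-length vectors and a brute-force scan; the post does not say which
metric or search structure the site really uses, and the function names and
example data are invented.

```python
import math


def normalize(tags):
    """Scale a sparse {tag: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in tags.values()))
    return {t: w / norm for t, w in tags.items()} if norm else {}


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = a.keys() | b.keys()  # tags absent from a dict count as 0
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def nearest(query, index, k=3):
    """Brute-force kNN: the k URLs whose vectors are closest to the query."""
    q = normalize(query)
    scored = sorted((distance(q, vec), url) for url, vec in index.items())
    return [url for _, url in scored[:k]]


# Made-up sites and tag counts, purely for illustration.
index = {
    "a.example": normalize({"python": 5, "programming": 3}),
    "b.example": normalize({"python": 4, "tutorial": 2}),
    "c.example": normalize({"cooking": 6, "recipes": 4}),
}
closest = nearest({"python": 1, "programming": 1}, index, k=2)
```

Only the tags present in either vector are visited, so each comparison costs
roughly 20 operations instead of 50,000. The scan over all URLs is still
linear, which is why some kind of index would be needed at 250,000 sites.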

Delicious.com has an extremely rich set of data: A very large, and arguably
the most relevant, portion of the web has been described by hand#, by real
people, with no incentive to cheat or skew the results. The amount of man-
hours spent tagging and organizing the web is really astounding, and I'm glad
I managed to make something useful out of it.

I really hope Yahoo comes to their senses and realizes how valuable this data
is. Just combining URL (or domain) popularity into search results offers so
much more value. One could create a killer search engine with this data.

#: A lot of the end result of tag weights has to do with how the URL was
initially tagged, as a lot of people opt to use "auto-tagging", which just
copies the most popular tags. Not only are the end results of tagging quite
interesting, but I suspect researching how a URL's popularity and description
have ebbed and flowed over time would be awesome.

