Similarity search based on kNN search of Delicious tags.

photon_off · on Dec 17, 2010

I posted this when I first launched at the end of July, and I figured it's relevant to share it again to show just how useful Delicious data is. Moreofit.com has an index of about 250,000 websites and their tag weights, courtesy of Delicious.com. They're normalized and stored in a manner that allows a somewhat efficient kNN search to be run for a given set of tags. The result: for a given URL, you can see the "closest" URLs in a tag space of 50,000+ dimensions. If you go into "custom tag search" you can also vary the weights of the tags and explore the tag space that way.
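To make "normalized tag weights" concrete, here is a minimal sketch of one common approach: scaling raw tag counts to a unit-length sparse vector. The post doesn't say which normalization Moreofit actually uses, so the L2 norm, the normalize helper, and the sample counts are all assumptions for illustration.

```python
import math

def normalize(tag_counts):
    """Scale raw tag counts to a unit-length sparse vector (tag -> weight).

    Assumption: L2 normalization; the real site may normalize differently.
    """
    norm = math.sqrt(sum(c * c for c in tag_counts.values()))
    if norm == 0:
        return {}
    return {tag: c / norm for tag, c in tag_counts.items()}

# Hypothetical raw tag counts for one URL (made-up numbers).
site = normalize({"python": 30, "programming": 40})

# A "custom tag search" query is the same kind of vector, built from
# user-chosen weights instead of observed counts.
query = normalize({"python": 3, "tutorial": 1})
```

Only the tags that actually appear are stored, so a query built by hand from a few weighted tags is compared against sites in exactly the same way as a query derived from a URL.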

More precisely: each website has a vector of approximately 50,000 dimensions, about 49,990 of which are 0 and about 10 of which have a value. The "closest" URLs in this space are those with the smallest distance from the searched URL. I've thought of doing singular value decomposition, but frankly never got around to it, because the results are really quite good. This is just the tip of the iceberg -- I'm sure Delicious has well over 10,000,000 URLs.
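A rough sketch of what a kNN search over vectors this sparse could look like. This is not the actual Moreofit code: the Euclidean distance, the inverted index from tag to URLs, and every name and sample value below are assumptions. The index is one plausible way to get the "somewhat efficient" part, since only sites sharing at least one tag with the query get scored.

```python
import heapq
import math
from collections import defaultdict

def sparse_distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def build_index(sites):
    """Inverted index: tag -> set of URLs that carry that tag."""
    index = defaultdict(set)
    for url, vec in sites.items():
        for tag in vec:
            index[tag].add(url)
    return index

def knn(query, sites, index, k=3):
    """Return up to k (distance, url) pairs, nearest first.

    Only sites sharing at least one tag with the query are scored.
    """
    candidates = set()
    for tag in query:
        candidates |= index.get(tag, set())
    scored = ((sparse_distance(query, sites[u]), u) for u in candidates)
    return heapq.nsmallest(k, scored)

# Hypothetical, already-normalized tag vectors.
sites = {
    "a.example": {"python": 0.6, "programming": 0.8},
    "b.example": {"python": 1.0},
    "c.example": {"cooking": 1.0},
}
index = build_index(sites)
result = knn({"python": 1.0}, sites, index, k=3)
```

If the vectors are unit length with non-negative weights, any site that shares no tag with the query sits at the maximum possible distance, so skipping those sites does not change which ones rank nearest. With roughly 10 nonzero tags per site, each distance computation touches about 20 entries rather than 50,000.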

Delicious.com has an extremely rich set of data: a very large, and arguably the most relevant, portion of the web has been described by hand#, by real people, with no incentive to cheat or skew the results. The number of man-hours spent tagging and organizing the web is really astounding, and I'm glad I managed to make something useful out of it.

I really hope Yahoo comes to their senses and realizes how valuable this data is. Just combining URL (or domain) popularity into search results offers so much more value. One could create a killer search engine with this data.

#: A lot of the end results of tag weights have to do with how the URL was initially tagged, as a lot of people opt to use "auto-tagging", which just copies the most popular tags. Not only are the end results of tagging quite interesting, but I suspect researching how a URL's popularity and description have ebbed and flowed over time would be awesome.