I cooked up a prototype of this project about a year and a half ago. While the results were encouraging, I lost motivation and it just collected dust for all that time. In the past 3 weeks I've revived the project and given new life to it.
The project started when I wondered: "What if delicious tag signatures were vectors, would their nearest neighbors in n-dimensional space be a good match for similarity?" The answer took a while to figure out, but it was a yes. For me at least, it wasn't an easy problem to solve: finding the nearest neighbors when each vector has about 10 non-zero dimensions in a total space of about 50,000 dimensions. Figuring out a way to solve this in a timely manner was a joy.
At the time this project was first started and the first prototype completed, there were absolutely zero sites, other than google's sub-par "similar" link, that offered this. Well, there was "similicious" but its index was tiny, and results weren't too impressive. Now I count 5 full-fledged services that do a pretty good job. Some are ranked quite highly and probably make a decent buck. I really think I missed a decent opportunity by not pushing through the last 20% of the project, and losing motivation entirely.
But, like they say, better late than never.
Questions/comments are welcomed/anxiously awaited.
Just a random, half-baked airport musing, but what if you had a related service that filtered out links which were similar and likely to be competitors? Someone might want to restrict what sites their crowdsourced content might link to.
I've considered the uses of moreofit for competitor-like stuff like this. I could make a service which (when installed on your site) scans users' histories for the top 500 competitor websites, aggregates those results, and lets you know which competitors' websites your visitors have most often been to. Also, if they've been to X competitor, there's a Y% chance they've been to Z competitor.
I haphazardly concluded that the value seems to be in the service itself, not the "finding competitors" aspect. You likely know who your competitors are, or can find out with a 1 time search on a similarity site such as mine.
That doesn't mean the idea has no legs. I imagine it's something people would pay for, and probably already are. It's just not my cup of tea at the moment to be censoring or sniffing people's histories.
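For what it's worth, the "if they've been to X, there's a Y% chance they've been to Z" figure is just a conditional probability over aggregated visit sets. A minimal sketch, assuming the per-visitor data is already a set of competitor sites (how that data is collected is out of scope here, and the function name is made up for illustration):

```python
from collections import Counter
from itertools import permutations

def co_visit_rates(visits):
    """visits: list of sets, one set of competitor sites per visitor.

    Returns {(x, z): P(visited z | visited x)} for every ordered pair
    of sites that were seen together at least once.
    """
    seen = Counter()      # how many visitors saw each site
    together = Counter()  # how many visitors saw both sites of an ordered pair
    for sites in visits:
        seen.update(sites)
        together.update(permutations(sorted(sites), 2))
    return {(x, z): count / seen[x] for (x, z), count in together.items()}

# Hypothetical toy data: four visitors.
rates = co_visit_rates([{"x", "z"}, {"x"}, {"x", "z", "w"}, {"z"}])
print(rates[("x", "z")])  # share of x-visitors who also visited z
```

Note the rates are not symmetric: P(z | x) and P(x | z) divide by different visitor counts.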
The recent posts on how easy it is to launch a start up these days screamed "glut" to me. One solution is to target a niche that wouldn't be well enough served by a universal solution.
Anyway, what I'm getting at is that monitoring competitor popularity would be a valuable service (including an early warning system for detecting emerging competitors.) I think many of us do this every so often manually; and/or have google alerts set up for it. But a service that just does this would broaden the appeal - especially if you could somehow show that your estimates are more accurate... I have no idea of how you could do this... I guess you'd need a genuinely accurate measure, and then compare yours and manual searches with it. But if you could demonstrate that credibly and compellingly, it'd be gold.
I don't think you need to censor or sniff to do this.
If I understand you correctly, the service would work as follows: You enter a URL you want to monitor. My service would show you trends of similar websites to watch out for. It would base it on some sort of secret sauce that measures similarity and popularity of competitors. Correct?
That's an interesting idea. What types of current websites/businesses do you think could benefit from this right now? Could you give me examples? This could help me see if it's possible to monitor similar websites and develop trends.
Measuring a website's popularity is pretty straightforward -- you make some sort of formula based on how often that URL is discussed (from backtype), linked to (various sources), bookmarked, and ranked (alexa, quantcast, et al). It's a matter of mashing up that data with the similarity data (over time) to determine the trends of similar sites.
I didn't start the thread, but here's a tweak of this idea:
"You enter a URL you want to monitor."
That's what I was thinking of doing, and I'd love to collaborate. But the URL sniffing hack is being patched up in the next generation of browsers. When it gets patched, that would essentially make the service useless... unless you can think of some other value offering. It's a pity, because I came up with a beautiful way of checking thousands per second with no locking up or freezing or any sort of negative feedback on the UI.
That's right - though you enter your own URL. Extra idea: it might also show you complements (products/services that fit in nicely with yours). These are usually pretty obvious in established industries - razors + shaving cream; milk + eggs; tyres + brake pads - but not so obvious in emerging industries. You might not have even thought of the category of a certain complement, let alone the name of it, nor specific brands that do it. It's valuable, because you might consider targeting people who buy the complements. It's a kind of market research.
(Unhelpfully) I think every website could benefit from it. Probably emergent and fast-moving markets would benefit most (because they don't yet know their competitors/complements + it's all changing very fast anyway). A very specific start would be... y-combinator offerings (that have just launched - not the established ones); for fun, you could also do it for emerging languages (like Mirah). These aren't ideal biz markets, but they would gain attention here on HN, and might validate the basic idea (to see if it's possible.) For example, if you were able to find a competitor to a YC company that they didn't know about, it would be pretty compelling evidence (and will surely happen, since startups rarely look very closely for competitors; they are too busy doing). Which is a good thing. Knowing about competitors too early can be artificially discouraging. So this is probably most helpful for companies that already have enough traction that they wouldn't be discouraged. Competitors are good for giving you ideas; for demonstrating market existence (good for raising funds...); for suggesting markets you hadn't thought of. Complements are always good.
Perhaps also, companies that are always launching new products might appreciate it: eg Procter and Gamble (though in their market, there are only a few competitors and they are well-known...). It might help in industries that are often disrupted - ie high-tech industries. Examples might be the established competitors of YC companies. If you could demonstrate legitimacy, big enterprise might pay a lot (perhaps equivalent to how much it would cost to hire someone to do the same work manually). You can be an arms dealer, and sell to both sides. :-) That's the nature of competition.
Amazingly, matching can be done with a single MySQL query that takes anywhere between .05 and 2.0 seconds (on average .25 if I do a whole batch of URLs) -- it depends on how popular the tags of the URL are. Those numbers were too large for my liking, so I cooked up a Java program that loads the relevant db tables into memory and does the matching itself. The program uses about 80MB of memory (though it only 'really' uses 35MB -- damn Java heap) and takes .008s to compute 500 matches for a given URL -- which I think is suitably fast. However, it's only 8ms because I've optimized it quite well -- a brute force match would take URL_INDEX_SIZE^2 * 2 * log2(NUMBER_OF_TAGS_PER_URL) multiplications.
The main bottleneck is inserting those 500 rows into a separate table, which is currently taking .1 seconds. I'm sure there's a better way to do this... I'll have to look into it if it becomes a scaling pain.
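To make the in-memory matching concrete, here is a rough sketch of one common way to avoid the brute-force comparison: an inverted index from tag to the URLs carrying it, so only URLs sharing at least one tag with the query ever get scored. This is an assumption about the general approach, not the actual Java program; the names, the toy data and the use of a plain dot product as the score are all illustrative.

```python
from collections import defaultdict

def build_index(vectors):
    """Map each tag to the list of (url, weight) pairs that carry it."""
    index = defaultdict(list)
    for url, tags in vectors.items():
        for tag, weight in tags.items():
            index[tag].append((url, weight))
    return index

def top_matches(query_url, vectors, index, k=500):
    """Accumulate dot products only for URLs sharing a tag with the query."""
    scores = defaultdict(float)
    for tag, q_weight in vectors[query_url].items():
        for url, weight in index[tag]:
            if url != query_url:
                scores[url] += q_weight * weight
    # Highest score first; ties broken by URL for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Hypothetical toy data: each URL maps to a handful of weighted tags.
vectors = {
    "a.example": {"php": 0.8, "docs": 0.6},
    "b.example": {"php": 0.6, "tutorial": 0.8},
    "c.example": {"cooking": 1.0},
}
index = build_index(vectors)
print(top_matches("a.example", vectors, index))
```

The cost per query depends on how many URLs carry the query's tags, which would fit the observation that timing varies with tag popularity.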
Interesting. Sounds like you're effectively doing a sparse matrix-vector multiplication for each request?
If you want to buy yourself a bit more time performance-wise, might be worth looking into using a numerical computing library for sparse linear algebra which has optimised routines for this (e.g. http://math.nist.gov/spblas/ at a first glance). Looks like there's been some recent work on parallelising this via the GPU too: http://graphics.cs.uiuc.edu/~wnbell/
Some kind of dimensionality reduction is probably one of the next steps scalability-wise though, and should hopefully improve the results too if you tune it right. Latent Semantic Analysis is the term to google if you're not already aware of it. Testing the success of your predictive model might be tricky though in the absence of user-specific tagging data.
I experimented with this stuff a while back for music recommendations - got as far as doing some simple SVD-based dimensionality reduction but the results weren't as good as I'd hoped and it got put on the backburner. So, respect for polishing this one up for launch :)
Good suggestion for dimensionality reduction. Most ready-made packages you will find under the name PCA though (principal component analysis). It's almost always good to reduce dimensionality first: depending on the number of components you choose, you can remove a lot of the noise from the data.
I'm curious why it didn't work for music recommendation for you. In the netflix challenge for movie recommendation, pretty much everyone ended up using SVD-based methods and variations thereof.
> I'm curious why it didn't work for music recommendation for you. In the netflix challenge for movie recommendation, pretty much everyone ended up using SVD-based methods and variations thereof.
I expect the algorithm did reasonably well given the data available. Probably a mix of reasons why I was disappointed though:
* My expectations were too high, I didn't realise quite how hard of a problem good music recommendations are
* Dataset wasn't large enough
* Dataset was based on deliberately-stated opinions (ratings) rather than actual observed behaviour (listening data)
* I didn't find a way to use the timing data associated with the ratings
* Figuring how best to normalise the dataset prior to attempting SVD was tricky and I'm not convinced I found the best way
I admit I was also sort of naively hoping that the 'features' identified by SVD would have at least a vaguely-human-recognisable theme to them. The first 2 or 3 of them did but from there on in it all looked pretty random.
Also, like a lot of recommenders, it gave the impression of having based its recommendations on some kind of generic, averaged-out 'middle point' of your overall usage data.
Really I'd rather it clustered my usage data, then recommended new things in and around each of the clusters.
I have some ideas for what I'd do differently the next time around - that being one - but also ideas about how better to tie algorithmic recommendation tools in to human interaction.
Yeah, there is always the danger that the data is just too little or too noisy. But the technical issues you mention (normalization, subtracting bias, using temporal data) all came up in the netflix movie recommendation competition as well so you can always look at how people handled it there.
Really amazing input, thank you. Unfortunately, this stuff is over my head, in two ways.
First, I haven't touched C++ in five years and don't want to spend weeks figuring out how to get my data from MySQL loaded into those data structures, how to create sockets so my PHP scripts can connect to it, handling errors, etc, etc. It's probably not as hard as I'm imagining it to be, and I'd love doing it, but I personally can't afford getting sucked into those details (no matter how appealing they are).
Second, I'm not familiar enough with the math of this library to know how to use it. I have about 200,000 vectors with 60,000 dimensions, each vector being extremely sparse. Given one vector, I'd like to compute the dot product of it with all others, then sort by value. I have no clue how to do this with that library.
It sounds like you have a lot of experience with this type of stuff. How much do you think my results could be improved (specifically)? Which results do you think are insufficient, and why?
Thanks for your input, I'm excited to nerd out on this type of stuff... even if it is over my head at times.
I wrote something similar a few years ago (finding photos with the most-similar tag vectors among a set of ~500,000). The core lookup was done by a few hundred lines of C code returning answers in ~0.1s, IIRC. This used a standard info-retrieval method, not a linear-algebra library -- all vectors pre-loaded into RAM in a sparse representation, then for each query scan them all, keeping a min-heap of the best vectors so far. (This could be sped up by indexing the vectors and skipping the ones with no tags in common with the query, but I didn't need to bother.) Email if you'd like to discuss my experience with this, though it looks like you already get decently fast answers.
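The scan-with-a-min-heap method described above can be sketched in a few lines. This is a Python illustration rather than the original C code, and it assumes vectors are stored as {tag: weight} dicts and scored by dot product; the function names are made up.

```python
import heapq

def dot(a, b):
    """Dot product of two sparse vectors stored as {tag: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[t] for t, w in a.items() if t in b)

def scan_top_k(query, vectors, k):
    """Scan every vector once, keeping the k best scores in a min-heap."""
    heap = []  # heap[0] is always the weakest of the current best k
    for name, vec in vectors.items():
        score = dot(query, vec)
        if len(heap) < k:
            heapq.heappush(heap, (score, name))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, name))
    return sorted(heap, reverse=True)

# Hypothetical toy data.
photos = {
    "p1": {"cat": 1.0, "pet": 1.0},
    "p2": {"cat": 1.0},
    "p3": {"car": 1.0},
    "p4": {"pet": 2.0, "cat": 1.0},
}
print(scan_top_k({"cat": 1.0, "pet": 1.0}, photos, k=2))
```

The min-heap is the right way round here: the smallest of the current best sits at the root, so each new candidate needs only one comparison to decide whether it belongs in the top k.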
You can do queries besides "sites most like site X", you know -- "What sites are like a cross of X and Y?"
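One simple way to read "a cross of X and Y" is to average the two tag vectors and use the result as the query. A hedged sketch, assuming {tag: weight} dicts and cosine similarity as the score (both are my assumptions, not necessarily what moreofit does):

```python
import math

def normalise(vec):
    """Scale a sparse {tag: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def blend(x, y):
    """Average two unit-length tag vectors into a single query vector."""
    x, y = normalise(x), normalise(y)
    merged = {t: (x.get(t, 0.0) + y.get(t, 0.0)) / 2 for t in set(x) | set(y)}
    return normalise(merged)

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    a, b = normalise(a), normalise(b)
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical toy data.
sites = {
    "news": {"news": 1.0},
    "tech": {"tech": 1.0},
    "technews": {"news": 1.0, "tech": 1.0},
    "food": {"recipes": 1.0},
}
query = blend(sites["news"], sites["tech"])
ranked = sorted(sites, key=lambda s: cosine(query, sites[s]), reverse=True)
print(ranked)
```

Normalising before averaging keeps a heavily tagged site from drowning out the other one in the blend.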
Is there any way to do this once and for all for a link? As you're crawling the database you look up the link and do this once and for all for it. So that the next time a user asks a query if it's over there you can directly return that, if it isn't then you can use more resources to crunch it for them.
Is this possible? Also, is there any way I can help you to make this? I don't expect any monetary benefit. I just want to learn.
Another suggestion: on the result page make the main link for each result go to the "moreofit" page of the given link, rather than the actual external websites. (just like you do for the top pages). You want to keep people on your site :) and it's probably better for pagerank also.
Have instead a small arrow icon for the actual external links (that would be useful for the top pages also, there is no external link there at the moment).
Well, the more internal links and the fewer external links you have, the better for pagerank. PR just formalizes the random surfer model: what is the probability that someone who just randomly clicks links is on your page after n clicks?
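For anyone unfamiliar with the random surfer model, here is a minimal power-iteration sketch. It assumes the textbook formulation with a damping factor (the surfer follows a link with probability 0.85 and otherwise jumps to a random page); the graph is made up, and this is not a claim about how Google computes rankings today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}, every target must also be a key.

    Returns {page: probability the random surfer is on that page}.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random-jump share, spread evenly over all pages.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: the surfer jumps to a random page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical three-page site.
graph = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"]}
for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Each pass is one more "click" by the surfer; the scores settle after a few dozen passes on a graph this small.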
Nice idea, well done. Probably you could use PCA to reduce the dimensionality of the data before doing the nn search. This seems a bit like the netflix movie recommendations data so you could probably use ideas from there. Sorry, probably you already do :)
Oh, I got it now, you use the tags attached to the link.
First I thought that you use the individual bookmarking of each user which would be a HUGE binary, sparse matrix, however, that could really be reduced with PCA and it would give a "people with similar tastes to those who bookmarked your site, also bookmarked ... " effect, possibly an even richer source of information.
Your service is great anyhow, got to love a site that says that my site is similar to paulgraham.com and joelonsoftware.com
I use the tags attached to a URL, as well as their relative magnitudes. In essence, each page has a 10-vector in R^60,000, and moreofit's job is to find kNN.
If you click "auto-suggest" moreofit will check your browser history for the top 2500 or so URLs, to give you a jumping off point. I'm not currently logging this data, because I don't think it's nice to, but I think it'd provide a gold mine of information to mull over.
OK, I got it. Well, each URL is still a (60,000)-dimensional vector, just that 59990 entries happen to be zero and the 10 that are non-zero have certain values which show how many times that tag was chosen. Problem is that not the same ten are nonzero for each URL. Did I get it right?
So what I was suggesting was to perform some off-the-shelf projection method on this data, for example principal components analysis (PCA) which will project your data from 60,000 into say 20-dimensions (as you choose) and then for each URL you will have just 20 numbers which are directly comparable across URLs. You can then more easily do k-nn on this data.
The advantage is that:
1) you reduce noise and the data will be easier to handle
2) the obtained new "components" can be interpreted as more subtle descriptors of the URLs than the original tags.
For example, one component could become the "techiness" of the URL, another the "personalness", another the "comediness", etc. All this comes out automatically from the data and you can interpret it by looking at what mixture each component has of the original features (the tags).
This is just a suggestion, I mean, the algorithm seems to be great as it is, but still, the technique seems a bit crude: if one page is labeled "tech" the other "technology", I think it will miss the connection. However, PCA could find out that these features are related. etc. etc. etc.
> OK, I got it. Well, each URL is still a (60,000)-dimensional vector, just that 59990 entries happen to be zero and the 10 that are non-zero have certain values which show how many times that tag was chosen. Problem is that not the same ten are nonzero for each URL. Did I get it right?
Interesting suggestions with the PCA. One of the strengths of having most of the vectors contain 0 values is that it makes computation much, much easier, even if the vector size is so large. I can quite quickly dismiss any vectors that have 0 dimensions in common. If I were to condense the 60,000 dimensions to something like 20, I'd have to do far more calculations to determine the top matches, as it would be tougher to distinguish vectors that are clearly not a match. I'd have to switch algorithms to something more suited for "more data points in less dimensions," which is something I'd love to do but cannot afford to spend time on right now.
If this type of analysis is as flexible as you say it is, then it's possible I could get some gains from condensing the dimensions down to something like 1000, or 500, which would take care of superficially similar tags like "tech" and "technology", while still maintaining the diversity that is there. Looking at the most common tags, I can say with pretty good certainty that many of them will not be correlated enough to condense them together without losing some inherent "match" capability -- but then again I know nothing about PCA.
I'm not at all familiar with PCA. I'm going to go research it as soon as I finish writing this and other responses. If you could, though, perhaps you could explain briefly how it works? Your explanations resonate well with me. As I understand it, it's a method for projecting high amounts of dimensions into lower amounts, probably by figuring out correlations between each dimension then "combining" them in some sort of way. Something akin to creating the desired number of projected dimensions such that their orthogonality is as high as possible. Am I on the right track?
> One of the strengths of having most of the vectors contain 0 values is that it makes computation much, much easier, even if the vector size is so large. I can quite quickly dismiss any vectors that have 0 dimensions in common.
I agree, and depending on the data, it might totally be worth sticking with this for simplicity. One has to think if it is good or not to dismiss these pairs which have zero dimensions in common. Say if one url has "people, tech, jokes", the other has "ppl, technology, fun", maybe they should not be missed. On the other hand, maybe because of the tag suggestions, as you said earlier, the tags might be quite homogeneous.
Your intuition about PCA is very correct. The way it works is that it splits your matrix into a product of two matrices. Say you have a large matrix with n rows (urls) and 60000 columns (tags). Let's call this X.
PCA will give you X = A*S (approximately)
Matrix A will have n rows and say 20 columns, matrix S will have 20 rows and 60000 columns (notice that there are far fewer entries in total, this is a compression of the data)
Matrix A will tell you how the original URLs are represented in the new features, matrix S will tell you how the new features are represented in the old features (a linear mixture of the old features a.k.a. the tags). Depending on how you do this factorization into two matrices, you can have different methods but PCA is optimal in certain important ways.
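Written out, the shapes look like this. The second line is one standard way to get such a factorization (truncated SVD of the column-centred matrix), which is an assumption on my part since no particular method is named above; strictly speaking it is the centred matrix that is approximated by A S.

```latex
X \in \mathbb{R}^{n \times 60000}, \qquad
X \approx A\,S, \qquad
A \in \mathbb{R}^{n \times 20}, \quad
S \in \mathbb{R}^{20 \times 60000}

\bar{X} = X - \mathbf{1}\mu^{\top} = U \Sigma V^{\top}
\;\Longrightarrow\;
A = U_{20}\,\Sigma_{20}, \quad
S = V_{20}^{\top}, \quad
\bar{X} \approx A\,S

\text{entries stored: } 20\,(n + 60000) \quad \text{instead of} \quad 60000\,n
```

With the roughly 200,000 URLs mentioned elsewhere in the thread, 20(n + 60000) comes to about 5.2 million numbers. That is a compression relative to the dense matrix; a sparse store with about 10 non-zero tags per URL is already only around 2 million entries, so the gain would be in comparability and noise reduction more than in memory.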
Lot of nice intros if you search for "pca tutorial" and there are ready implementations in most languages. The basic ideas are simple but then the rabbithole goes deeper and deeper - just like this thread, so I'll stop here :)
A lot of the results look great, like ikea.com, for example. However, trying tougher queries yields no results. Php.net has no results (message still says 500 results found), and the message explains results were filtered for the tag 'php'; clicking "click here to undo this" gives a 404 error. Also, number10.gov.uk, the equivalent of whitehouse.gov for Britain gives a message the site is not popular enough for results. The overlay for "about these results" doesn't have a 'close' prompt, so users who may not be technical enough to try clicking outside the message area would be stuck on that screen. Overall, it looks like it has nice potential. Is there a way to see beyond the top 10 results? The first 10 results for Mixergy.com look great, but I'd like to see more results! :)
I just fixed that broken link. The automatic hiding of a tag is meant primarily to hide tons of duplicates that would happen if you searched for "amazon.com" for example. I guess I should check to see if it hides too much and if so, leave it in.
Paging is not implemented yet... and curiously enough you're the first to request it. Surprised it took so long!
Which "tougher queries" did you try? If you get a response that "the page was not popular enough" then there's not much I can do. I'm relying on delicious to have tags for the URL, and if none are present, then I'm SOL.
Thanks for the feedback. I'll have pagination done shortly.
I tried searching for "Mixergy" and only got one response. Then I saw a message that said, "did you mean Mixergy." I clicked it even though I didn't know what the difference was between what I typed and the response.
I think the message could make it clearer that you're asking about a domain. Like this:
"did you mean Mixergy.com"
Some duplicates can be removed by selecting the "domain only" option, since pages are often indexed separately "foo.com" and "foo.com/home/". I agree that this should be automatically completed -- it's a difficult task that I'm not keen on doing yet.
Medium-sized niche sites, like dpreview.com, work especially well. I'm glad you noticed. If you'd like to explore a little bit more, you can try  and fussing about with the tags. Maybe change one to be "satire" and see what happens. The possibilities are endless, as in essence you are exploring a 60,000 dimensional universe of links. (By the way, the "tag search" option is available by searching for a site, then clicking "search by tag signature" just below the first hilited result.)
I was working on something like this a little while ago. I also saw all the competition (sites similar to a site that shows similar sites :)
I wasn't smart enough to do what you did, I was thinking of somehow building the directory by hand, or by turkers, etc
What finally killed my ambitions was sears.com. I reasoned that a similar-site search would only make the big time if the search was perfect, but what is a perfect match for sears.com? truevalue.com? perfumes.com? clothes.com? or another department store like kmart? You descend into category madness.
All those competitors of yours are also latecomers, I remember in the 90's netscape's browser had a pulldown of similar sites built in.
I think its very possible you'll make a few bucks off this if you can get some buzz, but I'm not sure if these similar site searches will ever be a killer app because the problem is so huge.
Oh, I also discovered that if you put the domain you are matching in the URL, like you are doing now, Adsense does a great job at targeting ads.
One more thing, you forgot the bookmarklet! I want to be on a site, click the moreofit bookmarklet, and do a search. But of course you're now in subdomain hell: if they are on secure.shoes.com, do you search secure.shoes.com or shoes.com?
Hi Kevin! Glad somebody else has a shared interest in this problem.
Sears.com is a very difficult problem to solve. The "category madness" you bring up certainly is a very good point, and I rely on the wisdom of crowds to mitigate this issue. After a site has been tagged thousands of times, I'm assuming that the unique description of the site is pretty good. However, there is a slight "snowball effect" present with my dataset (delicious): When a URL is first tagged, those tags are suggested to all future people who bookmark it, causing the initial set of tags to have more bias.
At any rate, back to sears.com. The current matches now are all stores, but they're not as specific as you suggest they could be. Moreofit offers a "tag filter" feature and a tag signature matching feature for this occasion. Of course, it all depends on the user being savvy enough to use these things.
On moreofit, you can specify tags and how important they are for your search.
I agree that it will never be a "killer app"; however, I do think it's pretty useful, and a preliminary search on Google shows a lot of interest in searches for "sites like chatroulette" and "sites like XXX". I'm hoping to serve those customers' needs.
The bookmarklet is coming soon -- and subdomain hell isn't a big issue -- search the subdomain first, if no matches, then the main domain.
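The "subdomain first, then main domain" rule can be sketched like this. The function names and the `lookup` callback are hypothetical, and the sketch naively treats the last two labels as the main domain, which would be wrong for suffixes like .co.uk.

```python
def candidate_hosts(host):
    """Yield the full host first, then progressively shorter parent domains."""
    labels = host.lower().split(".")
    # Naive assumption: stop once only two labels (e.g. shoes.com) remain.
    for i in range(len(labels) - 1):
        yield ".".join(labels[i:])

def find_matches(host, lookup):
    """Return (host_used, results) for the most specific host with results."""
    for candidate in candidate_hosts(host):
        results = lookup(candidate)
        if results:
            return candidate, results
    return None, []

# Hypothetical index: only the main domain has similarity data.
index = {"shoes.com": ["boots.example"]}
print(find_matches("secure.shoes.com", lambda h: index.get(h, [])))
```

A real version would probably want a public-suffix list to decide where the main domain starts.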
The problem with searches for "sites like xx" is that people entering those search strings are likely to be fairly unsophisticated users (who aren't aware of the "similar" button or related: syntax, let alone the possibility of there being dedicated sites for that kind of search). Moreofit.com looks more like a power user tool.
The other problem is that Google's results for the suggested "sites like" queries are pretty helpful - they generally return articles and q&a pages with exactly the lists the end users are looking for. It might require more clicks to get there, but even the non-savvy user should be able to manage it.
I was impressed by the results returned by moreofit though, especially for websites that don't have numerous linkbait articles about "Top 10 websites for ....".
The tag importance sliders are especially neat, though one minor usability issue is that the update button (bottom right) is a long way away from the sliders and the text advising you to click it. The Google display ads returned when you do a URL search are horrendously irrelevant and probably deter users from clicking on the targeted text ads above them.
I'm trying to make it so a search for "sites like bla" would bring me somewhere near the top, with a link to "top ten sites like bla" which I feel would appeal well to these users.
Google's results for these searches are OK. I've noticed it takes a substantial amount of time to scan through the links and figure out which forum posts might contain some good information. When clicking through, sometimes I'd find good suggestions, sometimes not. It's a similar experience if you search for some obscure error code and end up scouring a few pages to find out how somebody else solved it (sometimes in one of those god awful non-threaded 1-comment-per-page bulletin boards -- how do they still exist?).
I'll work on fixing that update button issue.
If there's one thing I noticed, it's that hardly anybody has used the "tag sliders" feature, or any other search options for that matter. They're either content enough, or don't care enough, to delve further into results. From a technology standpoint, I find the tag sliders feature to be really exciting. You can craft your own URL and see if anything matches it.
The horrendous Google Ads are something out of my control -- Google won't allow me to supply the keywords for the page, and if they haven't indexed the page yet, they'll show crappy generic horoscope ads. I need suggestions as to how to get more relevant ads up there. Perhaps an ad service that allows me to supply the keywords. Know of any?
The only thing I'd say is to make the "update" button more prominent (eg add another one beneath the "searching for tags" sliders). A few times I wondered why it was not updating the results, and had to hunt around for it. Maybe some people, at the margin in terms of interest, would give up on it because of this.
It would be nice if there was some natural way for the results to update without specifically requesting it - eg. like how the "moreofit" links work. The way google does it in "search tools" for time choice is interesting: it provides typical defaults, that are linked (eg. Past 24 hours,Past week etc). I've always used it by just clicking on one of them, so the issue of "updating" doesn't occur. [reminiscent of 37signals' defaults]. Perhaps the truth is that while someone might theoretically want the results from the past 22 hours, in practice no-one ever does. Thus, the defaults are adequate. For your importance sliders, I guess this could be done with two links "important" vs "secondary" for each slider.
Of course, sometimes you want to assemble a search, and only then execute it (not have it update after every single term is added). For this, the standard practice seems to be to have a "search" button directly underneath however the search is configured (eg this is what google search does for "custom range", for time, in web tools). So this would mean adding a "search" button beneath the sliders (not labeled "update", even though that is also an accurate description - I guess, maybe use "moreofit", if you want to promote the brand at possible expense of usability).
tl;dr add a "search" button beneath the "searching for tags" sliders.
Thanks for the feedback. I think adding another update button below the tag sliders is a good idea.
As per the "more natural way of updating results", I might use Ajax to plop in the new results when any search options are changed.. though this could lead to further complications: Currently you can click on a tag to add it to the "include" and "exclude" tag filters. If you click a tag and suddenly the results you were looking at abruptly disappear and reappear, it might not be so nice.
What do you think of this idea: Any time a change is made to the search options, the "update" button starts glowing (fading in and out or changes colors) so as to notify the user that it wants to be clicked?
Glowing is an interesting idea, but I think just placing the search button beneath would be helpful enough - then it's obvious to the user that they need to hit that, and it's easy to find. That would solve both of the problems that I had, anyway.
"The way google does it in "search tools" for time choice is interesting"
Google actually does it about four different ways. There're defaults for common cases like past day/week/month, but there's also software that tries to guess if a particular date range is interesting, eg:
Then there's "Latest", which actually does update in real time, if it has new results. And then there's "Custom date range", which lets you pick with a calendar widget and then requires that you hit Search again, much like MoreOfIt does.
Finally, somebody noticed! Hardly anybody is using "auto suggest" ... which is good -- I'm glad people have URLs in mind to search. Yes, I'm using the browser history hack that's been around for ages. It's going to be patched up by all major browsers in probably under a year, so don't worry. I think the latest safari already has it covered.
DDG appears to be using similarsites.com, a competitor that cropped up while I let this project collect dust. Their results are admittedly "good enough," but not as good as moreofit, at least in my biased opinion :)
Google's "similar" link is akin to my results with "popularity" sort, only worse. That is, Google seems to do this: "What other URLs have some keywords in common with this one?" And of those, it sorts by "which is most popular?". The results, well, they're not at all impressive which is why I made this in the first place.
However, Google's index is much greater. Currently you can't search for non-semi-popular URLs on moreofit, but you can on Google. So they have me beaten there.
Where would you like me to say it? It will probably go in an "about" page, but I don't think there's any benefit to the possible confusion of "what is the delicious API?" by putting it on the front page.