The project started when I wondered: "If delicious tag signatures were vectors, would their nearest neighbors in n-dimensional space be a good measure of similarity?" The answer took a while to figure out, but it was yes. For me at least, it wasn't an easy problem to solve: finding the nearest neighbors when each vector has about 10 nonzero dimensions in a total dimensional space of roughly 50,000. Figuring out a way to solve this in a timely manner was a joy.
At the time this project was first started and the first prototype completed, there were absolutely zero sites, other than Google's sub-par "similar" link, that offered this. Well, there was "similicious," but its index was tiny and its results weren't too impressive. Now I count five full-fledged services that do a pretty good job. Some are ranked quite highly and probably make a decent buck. I really think I missed a decent opportunity by losing motivation and not pushing through the last 20% of the project.
But, like they say, better late than never.
Questions/comments are welcomed/anxiously awaited.
If I might ask, how did the original URL-URL recommender work? Was it simply a query to find other URLs with the most matching tags, or was tag order also taken into account?
So I never looked at the global tags - user1's use of tag x was distinct from user2's.
Man, I miss that dataset.
I haphazardly concluded that the value seems to be in the service itself, not the "finding competitors" aspect. You likely know who your competitors are, or can find out with a one-time search on a similarity site such as mine.
That doesn't mean the idea has no legs. I imagine it's something people would pay for, and probably already are. It's just not my cup of tea at the moment to be censoring or sniffing people's histories.
Anyway, what I'm getting at is that monitoring competitor popularity would be a valuable service (including an early warning system for detecting emerging competitors.) I think many of us do this every so often manually; and/or have google alerts set up for it. But a service that just does this would broaden the appeal - especially if you could somehow show that your estimates are more accurate... I have no idea of how you could do this... I guess you'd need a genuinely accurate measure, and then compare yours and manual searches with it. But if you could demonstrate that credibly and compellingly, it'd be gold.
I don't think you need to censor or sniff to do this.
That's an interesting idea. What types of current websites/businesses do you think could benefit from this right now? Could you give me examples? This could help me see if it's possible to monitor similar websites and develop trends.
Measuring a website's popularity is pretty straightforward -- you make some sort of formula based on how often that URL is discussed (from backtype), linked to (various sources), bookmarked, and ranked (alexa, quantcast, et al). It's a matter of mashing up that data with the similarity data (over time) to determine the trends of similar sites.
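To make that concrete, here's a minimal sketch of such a formula in Python. The signal names and weights are my own assumptions for illustration (not anything moreofit actually computes); counts are log-scaled so no single huge signal dominates, and the trend is just the score delta over time.

```python
import math

def popularity_score(signals, weights=None):
    """Combine per-source counts (mentions, inbound links, bookmarks)
    into one score. log1p keeps a single huge count from dominating.
    Signal names and weights are illustrative assumptions."""
    weights = weights or {"mentions": 1.0, "links": 2.0, "bookmarks": 1.5}
    score = 0.0
    for name, value in signals.items():
        score += weights.get(name, 1.0) * math.log1p(value)
    return score

# Trend = change in score between two snapshots of the same URL.
week1 = popularity_score({"mentions": 10, "links": 40, "bookmarks": 120})
week2 = popularity_score({"mentions": 25, "links": 55, "bookmarks": 300})
trend = week2 - week1  # positive => the site is gaining popularity
```

The interesting part would be joining these per-URL trends with the similarity data, so rising scores among a site's neighbors flag emerging competitors.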
"You enter a URL you want to monitor."
(Unhelpfully) I think every website could benefit from it. Emergent, fast-moving markets would probably benefit most (because they don't yet know their competitors/complements, and it's all changing very fast anyway). A very specific start would be Y Combinator offerings that have just launched (not the established ones); for fun, you could also do it for emerging languages (like Mirah). These aren't ideal business markets, but they would gain attention here on HN and might validate the basic idea (to see if it's possible). For example, if you were able to find a competitor to a YC company that they didn't know about, it would be pretty compelling evidence (and it will surely happen, since startups rarely look very closely for competitors; they are too busy doing, which is a good thing). Knowing about competitors too early can be artificially discouraging, so this is probably most helpful for companies that already have enough traction that they wouldn't be discouraged. Competitors are good for giving you ideas, for demonstrating that a market exists (good for raising funds), and for suggesting markets you hadn't thought of. Complements are always good.
Perhaps also, companies that are always launching new products might appreciate it: e.g. Procter & Gamble (though in their market there are only a few competitors and they are well-known...). It might help in industries that are often disrupted, i.e. high-tech industries. Examples might be the established competitors of YC companies. If you could demonstrate legitimacy, big enterprise might pay a lot (perhaps equivalent to how much it would cost to hire someone to do the same work manually). You can be an arms dealer and sell to both sides. :-) That's the nature of competition.
EDIT http://www.mirah.org hasn't got results yet. JRuby's competitors are:
Groovy and http://rjb.rubyforge.org... There's a Japanese (I think) news site. That might be helpful, if you can detect similarities across languages...
Interesting stuff, by the way.
The main bottleneck is inserting those 500 rows into a separate table, which currently takes 0.1 seconds. I'm sure there's a better way to do this... I'll have to look into it if it becomes a scaling pain.
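One common fix is batching the 500 inserts into a single statement instead of 500 round-trips. A sketch using Python's sqlite3 as a stand-in (with MySQL, the analogous trick is one multi-row `INSERT ... VALUES (...), (...), ...`); table and column names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (query_id INTEGER, url TEXT, score REAL)")

# 500 similarity results for one query, as (query_id, url, score) tuples.
rows = [(1, "http://example%d.com" % i, 1.0 / (i + 1)) for i in range(500)]

# One batched call inside a single transaction, not 500 separate INSERTs.
with conn:
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```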
If you want to buy yourself a bit more time performance-wise, might be worth looking into using a numerical computing library for sparse linear algebra which has optimised routines for this (e.g. http://math.nist.gov/spblas/ at a first glance). Looks like there's been some recent work on parallelising this via the GPU too: http://graphics.cs.uiuc.edu/~wnbell/
Some kind of dimensionality reduction is probably one of the next steps scalability-wise though, and should hopefully improve the results too if you tune it right. Latent Semantic Analysis is the term to google if you're not already aware of it. Testing the success of your predictive model might be tricky though in the absence of user-specific tagging data.
I experimented with this stuff a while back for music recommendations - got as far as doing some simple SVD-based dimensionality reduction but the results weren't as good as I'd hoped and it got put on the backburner. So, respect for polishing this one up for launch :)
I'm curious why it didn't work for music recommendation for you. In the netflix challenge for movie recommendation, pretty much everyone ended up using SVD-based methods and variations thereof.
If anyone is interested, this is the most accessible write-up of the idea I know: http://sifter.org/~simon/journal/20061211.html
I expect the algorithm did reasonably well given the data available. Probably a mix of reasons why I was disappointed though:
* My expectations were too high, I didn't realise quite how hard of a problem good music recommendations are
* Dataset wasn't large enough
* Dataset was based on deliberately-stated opinions (ratings) rather than actual observed behaviour (listening data)
* I didn't find a way to use the timing data associated with the ratings
* Figuring how best to normalise the dataset prior to attempting SVD was tricky and I'm not convinced I found the best way
I admit I was also sort of naively hoping that the 'features' identified by SVD would have at least a vaguely-human-recognisable theme to them. The first 2 or 3 of them did but from there on in it all looked pretty random.
Also, like a lot of recommenders, it gave the impression of having based its recommendations on some kind of generic, averaged-out 'middle point' of your overall usage data.
Really I'd rather it clustered my usage data, then recommended new things in and around each of the clusters.
I have some ideas for what I'd do differently the next time around - that being one - but also ideas about how better to tie algorithmic recommendation tools in to human interaction.
Some useful information can be found in the report of the winners:
There's a lot of fancy stuff, which would be overkill in a real system, but a lot of practical info too.
Also, there was a paper by the winners, "Collaborative Filtering with Temporal Dynamics," which might be useful and I think is freely available.
First, I haven't touched C++ in five years and don't want to spend weeks figuring out how to get my data from MySQL loaded into those data structures, how to create sockets so my PHP scripts can connect to it, handling errors, etc, etc. It's probably not as hard as I'm imagining it to be, and I'd love doing it, but I personally can't afford getting sucked into those details (no matter how appealing they are).
Second, I'm not familiar enough with the math of this library to know how to use it. I have about 200,000 vectors with 60,000 dimensions, each vector being extremely sparse. Given one vector, I'd like to compute the dot product of it with all others, then sort by value. I have no clue how to do this with that library.
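For what it's worth, the operation I'm describing can be sketched in pure Python without any library: represent each sparse vector as a dict holding only its nonzero entries, dot the query against every vector, then sort. (The data and weights below are made up for illustration.)

```python
# Sparse vectors as {tag: weight} dicts; only nonzero entries are stored.

def sparse_dot(a, b):
    # Iterate over the smaller dict; only shared tags contribute.
    if len(b) < len(a):
        a, b = b, a
    return sum(w * b[tag] for tag, w in a.items() if tag in b)

def rank_by_similarity(query, vectors):
    """vectors: {url: sparse_vector}. Returns (url, score), best first."""
    scores = [(url, sparse_dot(query, v)) for url, v in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

vectors = {
    "a.com": {"tech": 3, "news": 1},
    "b.com": {"cooking": 5},
    "c.com": {"tech": 1, "programming": 2},
}
ranked = rank_by_similarity({"tech": 2, "programming": 1}, vectors)
# a.com: 3*2 = 6; c.com: 1*2 + 2*1 = 4; b.com: no shared tags, 0
```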
It sounds like you have a lot of experience with this type of stuff. How much do you think my results could be improved (specifically)? Which results do you think are insufficient, and why?
Thanks for your input, I'm excited to nerd out on this type of stuff... even if it is over my head at times.
You can do queries besides "sites most like site X", you know -- "What sites are like a cross of X and Y?"
Is this possible? Also, is there any way I can help you to make this? I don't expect any monetary benefit. I just want to learn.
Instead, have a small arrow icon for the actual external links (that would be useful for the top pages too; there's no external link there at the moment).
I'd like to keep the results rather straightforward. What I will do is link them all to a redirect script so I can log which results are most popular and possibly create a smarter results algorithm.
How does including external links affect pagerank?
Your service is great anyhow, got to love a site that says that my site is similar to paulgraham.com and joelonsoftware.com
If you click "auto-suggest" moreofit will check your browser history for the top 2500 or so URLs, to give you a jumping off point. I'm not currently logging this data, because I don't think it's nice to, but I think it'd provide a gold mine of information to mull over.
So what I was suggesting was to perform some off-the-shelf projection method on this data, for example principal components analysis (PCA) which will project your data from 60,000 into say 20-dimensions (as you choose) and then for each URL you will have just 20 numbers which are directly comparable across URLs. You can then more easily do k-nn on this data.
The advantage is that:
1) you reduce noise and the data will be easier to handle
2) the obtained new "components" can be interpreted as more subtle descriptors of the URLs than the original tags.
For example one component could become the "techiness" of the URL, the other the "personalness" the other the "comediness", etc. All this comes out automatically from the data and you can interpret it by looking at what mixture each component will have of the original features (the tags)..
This is just a suggestion, I mean, the algorithm seems to be great as it is, but still, the technique seems a bit crude: if one page is labeled "tech" the other "technology", I think it will miss the connection. However, PCA could find out that these features are related. etc. etc. etc.
Interesting suggestions with the PCA. One of the strengths of having most of the vectors contain 0 values is that it makes computation much, much easier, even if the vector size is so large. I can quite quickly dismiss any vectors that have 0 dimensions in common. If I were to condense the 60,000 dimensions to something like 20, I'd have to do far more calculations to determine the top matches, as it would be tougher to distinguish vectors that are clearly not a match. I'd have to switch algorithms to something more suited for "more data points in less dimensions," which is something I'd love to do but cannot afford to spend time on right now.
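The quick dismissal I'm describing can be sketched with an inverted index from tags to URLs: at query time, only URLs sharing at least one tag with the query are ever touched. (Data here is made up for illustration.)

```python
from collections import defaultdict

vectors = {
    "a.com": {"tech": 3, "news": 1},
    "b.com": {"cooking": 5},
    "c.com": {"tech": 1, "programming": 2},
}

# Build tag -> set-of-URLs once, up front.
index = defaultdict(set)
for url, vec in vectors.items():
    for tag in vec:
        index[tag].add(url)

def candidates(query):
    """Union of URLs sharing at least one tag with the query;
    everything else has a dot product of 0 and is never examined."""
    urls = set()
    for tag in query:
        urls |= index.get(tag, set())
    return urls

hits = candidates({"tech": 2})  # b.com is skipped entirely
```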
If this type of analysis is as flexible as you say it is, then its possible I could get some gains from condensing the dimensions down to something like 1000, or 500, which would take care of superficial similar tags like "tech" and "technology", while still maintaining the diversity that is there. Looking at the most common tags, I can say with pretty good certainty that many of them will not be correlated enough to condense them together without losing some inherent "match" capability -- but then again I know nothing about PCA.
I'm not at all familiar with PCA. I'm going to go research it as soon as I finish writing this and other responses. If you could, though, perhaps you could explain briefly how it works? Your explanations resonate well with me. As I understand it, it's a method for projecting high amounts of dimensions into lower amounts, probably by figuring out correlations between each dimension then "combining" them in some sort of way. Something akin to creating the desired number of projected dimensions such that their orthogonality is as high as possible. Am I on the right track?
I agree, and depending on the data, it might totally be worth sticking with this for simplicity. One has to think if it is good or not to dismiss these pairs which have zero dimensions in common. Say if one url has "people, tech, jokes", the other has "ppl, technology, fun", maybe they should not be missed. On the other hand, maybe because of the tag suggestions, as you said earlier, the tags might be quite homogeneous.
Your intuition about PCA is very correct. The way it works is that it splits your matrix into a product of two matrices. Say you have a large matrix with n rows (urls) and 60000 columns (tags). Let's call this X.
PCA will give you X = A*S (approximately)
Matrix A will have n rows and say 20 columns; matrix S will have 20 rows and 60000 columns (notice that there are far fewer entries in total; this is a compression of the data).
Matrix A will tell you how the original URLs are represented in the new features; matrix S will tell you how the new features are represented in the old features (a linear mixture of the old features, a.k.a. the tags). Depending on how you do this factorization into two matrices you get different methods, but PCA is optimal in certain important ways.
Lot of nice intros if you search for "pca tutorial" and there are ready implementations in most languages. The basic ideas are simple but then the rabbithole goes deeper and deeper - just like this thread, so I'll stop here :)
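For the curious, here's a toy pure-Python illustration of the factorization idea: power iteration on X^T X recovers the first principal direction (the first row of S), and projecting X's rows onto it gives the first column of A. A real system would use a proper PCA/SVD library; this just shows the mechanics on a made-up matrix where two "tags" always co-occur and collapse into one component.

```python
import math
import random

def first_component(X, iters=100, seed=0):
    """Power iteration on X^T X: returns the first principal direction,
    a unit vector over the columns (tags)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        Xv = [sum(row[j] * v[j] for j in range(d)) for row in X]
        w = [sum(X[i][j] * Xv[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v

# Columns 0 and 1 ("tech" and "technology", say) always appear together.
X = [[1, 1, 0], [2, 2, 0], [3, 3, 0]]
s = first_component(X)  # first row of S, roughly [0.71, 0.71, 0]
a = [sum(row[j] * s[j] for j in range(3)) for row in X]  # first column of A
```

The merged component weights the two correlated tags equally and ignores the unused third one, which is exactly the "tech"/"technology" effect described above.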
I just fixed that broken link. The automatic hiding of a tag is meant primarily to hide tons of duplicates that would happen if you searched for "amazon.com" for example. I guess I should check to see if it hides too much and if so, leave it in.
Paging is not implemented yet... and curiously enough you're the first to request it. Surprised it took so long!
Which "tougher queries" did you try? If you get a response that "the page was not popular enough" then there's not much I can do. I'm relying on delicious to have tags for the URL, and if none are present, then I'm SOL.
Thanks for the feedback. I'll have pagination done shortly.
I tried searching for "Mixergy" and only got one response. Then I saw a message that said, "did you mean Mixergy." I clicked it even though I didn't know what the difference was between what I typed and the response.
I think the message could make it clearer that you're asking about a domain. Like this:
"did you mean Mixergy.com"
And I love the site. I bookmarked it.
It also works for international websites (the results for my favorite German-English dictionary [+]) but you could maybe work on eliminating duplicates.
Some duplicates can be removed by selecting the "domain only" option, since pages are often indexed separately as "foo.com" and "foo.com/home/". I agree that this should be done automatically -- it's a difficult task that I'm not keen on tackling yet.
Medium-sized niche sites, like dpreview.com, work especially well. I'm glad you noticed. If you'd like to explore a little bit more, you can try fussing about with the tags. Maybe change one to "satire" and see what happens. The possibilities are endless, as in essence you are exploring a 60,000-dimensional universe of links. (By the way, the "tag search" option is available by searching for a site, then clicking "search by tag signature" just below the first highlighted result.)
Again, thanks for the feedback and I'm glad you found the results satisfactory.
I wasn't smart enough to do what you did, I was thinking of somehow building the directory by hand, or by turkers, etc
What finally killed my ambitions was sears.com. I reasoned that a similar-site search would only make the big time if the search was perfect, but what is a perfect match for sears.com? truevalue.com? perfumes.com? clothes.com? or another department store like kmart? You descend into category madness.
All those competitors of yours are also latecomers; I remember in the '90s Netscape's browser had a pulldown of similar sites built in.
I think it's very possible you'll make a few bucks off this if you can get some buzz, but I'm not sure these similar-site searches will ever be a killer app because the problem is so huge.
Oh, I also discovered that if you put the domain you are matching in the URL, like you are doing now, Adsense does a great job at targeting ads.
One more thing, you forgot the bookmarklet! I want to be on a site, click the moreofit bookmarklet, and do a search. But of course you're now in subdomain hell: if they are on secure.shoes.com, do you search secure.shoes.com or shoes.com?
Sears.com is a very difficult problem to solve. The "category madness" you bring up certainly is a very good point, and I rely on the wisdom of crowds to mitigate this issue. After a site has been tagged thousands of times, I'm assuming that the unique description of the site is pretty good. However, there is a slight "snowball effect" present with my dataset (delicious): When a URL is first tagged, those tags are suggested to all future people who bookmark it, causing the initial set of tags to have more bias.
At any rate, back to sears.com. The current matches now are all stores, but they're not as specific as you suggest they could be. Moreofit offers a "tag filter" feature and a tag signature matching feature for this occasion. Of course, it all depends on the user being savvy enough to use these things.
On moreofit, you can specify tags and how important they are for your search.
Try these results to see what I mean: http://bit.ly/cVXgED
I agree that it will never be a "killer app," however, I do think it pretty useful and a preliminary search on Google shows a lot of interest in searches for "sites like chatroulette" and "sites like XXX". I'm hoping to serve those customers' needs.
The bookmarklet is coming soon -- and subdomain hell isn't a big issue: search the subdomain first and, if there are no matches, fall back to the main domain.
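That fallback can be sketched in a few lines. (The "last two labels" guess at the main domain is a deliberate simplification: real code would want a public-suffix list to handle TLDs like .co.uk.)

```python
def hosts_to_try(host):
    """Search the exact host first, then fall back to the main domain.
    Naively assumes the last two labels form the main domain, which
    breaks for suffixes like .co.uk (a public-suffix list fixes that)."""
    tries = [host]
    parts = host.split(".")
    if len(parts) > 2:
        tries.append(".".join(parts[-2:]))
    return tries

order = hosts_to_try("secure.shoes.com")  # try the subdomain, then shoes.com
```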
The other problem is that Google's results for the suggested "sites like" queries are pretty helpful - they generally return articles and Q&A pages with exactly the lists the end users are looking for. It might require more clicks to get there, but even the non-savvy user should be able to manage it.
I was impressed by the results returned by moreofit though, especially for websites that don't have numerous linkbait articles about "Top 10 websites for ....".
The tag importance sliders are especially neat, though one minor usability issue is that the update button (bottom right) is a long way away from the sliders and the text advising you to click it. The Google display ads returned when you do a URL search are horrendously irrelevant and probably deter users from clicking on the targeted text ads above them.
Google's results for these searches are OK. I've noticed it takes a substantial amount of time to scan through the links and figure out which forum posts might contain some good information. When clicking through, sometimes I'd find good suggestions, sometimes not. It's a similar experience if you search for some obscure error code and end up scouring a few pages to find out how somebody else solved it (sometimes in one of those god awful non-threaded 1-comment-per-page bulletin boards -- how do they still exist?).
I'll work on fixing that update button issue.
If there's one thing I noticed, it's that hardly anybody has used the "tag sliders" feature, or any other search options for that matter. They're either content enough, or don't care enough, to delve further into results. From a technology standpoint, I find the tag sliders feature to be really exciting. You can craft your own URL and see if anything matches it.
The horrendous Google Ads are something out of my control -- Google won't allow me to supply the keywords for the page, and if they haven't indexed the page yet, they'll show crappy generic horoscope ads. I need suggestions as to how to get more relevant ads up there. Perhaps an ad service that allows me to supply the keywords. Know of any?
And, of course, thanks for the feedback.
You may want to play up the "best" angle more than the "first" one.
I'm glad you think moreofit's results "suck less". :)
The only thing I'd say is to make the "update" button more prominent (e.g. add another one beneath the "searching for tags" sliders). A few times I wondered why it was not updating the results and had to hunt around for it. Some people, at the margin in terms of interest, might give up on it because of this.
It would be nice if there was some natural way for the results to update without specifically requesting it - e.g. like how the "moreofit" links work. The way Google does it in "search tools" for the time choice is interesting: it provides typical defaults that are linked (e.g. Past 24 hours, Past week, etc.). I've always used it by just clicking on one of them, so the issue of "updating" doesn't occur. [reminiscent of 37signals' defaults] Perhaps the truth is that while someone might theoretically want the results from the past 22 hours, in practice no one ever does. Thus, the defaults are adequate. For your importance sliders, I guess this could be done with two links, "important" vs. "secondary", for each slider.
Of course, sometimes you want to assemble a search and only then execute it (not have it update after every single term is added). For this, the standard practice seems to be a "search" button directly underneath however the search is configured (e.g. this is what Google search does for "custom range", for time, in web tools). So this would mean adding a "search" button beneath the sliders (not labeled "update", even though that is also an accurate description - I guess maybe use "moreofit", if you want to promote the brand at possible expense of usability).
tl;dr add a "search" button beneath the "searching for tags" sliders.
As per the "more natural way of updating results", I might use Ajax to plop in the new results when any search options are changed.. though this could lead to further complications: Currently you can click on a tag to add it to the "include" and "exclude" tag filters. If you click a tag and suddenly the results you were looking at abruptly disappear and reappear, it might not be so nice.
What do you think of this idea: Any time a change is made to the search options, the "update" button starts glowing (fading in and out or changes colors) so as to notify the user that it wants to be clicked?
Google actually does it about four different ways. There're defaults for common cases like past day/week/month, but there's also software that tries to guess if a particular date range is interesting, eg:
Then there's "Latest", which actually does update in real time, if it has new results. And then there's "Custom date range", which lets you pick with a calendar widget and then requires that you hit Search again, much like MoreOfIt does.
It worked better on reddit.com though.
Here's results for "reader.google.com", except omitting any results that have been tagged with "google":
However, Google's index is much greater. Currently you can't search for non-semi-popular URLs on moreofit, but you can on Google. So they have me beaten there.
What is DDG?
At any rate, if you search for "reddit" it should suggest "were you looking for sites like reddit.com?"
Where would you like me to say it? It will probably go in an "about" page, but I don't think there's any benefit to the possible confusion of "what is the delicious API?" by putting it on the front page.