

My website similarity search engine (feedback please) - photon_off
http://www.moreofit.com

======
photon_off
I cooked up a prototype of this project about a year and a half ago. While the
results were encouraging, I lost motivation and it just collected dust for all
that time. In the past 3 weeks I've revived the project and given new life to
it.

The project started when I wondered: "What if delicious tag signatures were
vectors, would their nearest neighbors in n-dimensional space be a good match
for similarity?" The answer took a while to figure out, but it was a yes. For
me at least, it wasn't an easy problem to solve: finding the nearest neighbors
when each vector has 10 dimensions and the total dimensional space is about
50,000. Figuring out a way to solve this in a timely manner was a joy.
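
To make the idea concrete, here is a minimal sketch of nearest-neighbor search over sparse tag vectors. It is not moreofit's actual method, which the post doesn't describe; it assumes cosine similarity plus an inverted index so that only URLs sharing at least one tag are ever compared. The `.example` hosts, tag weights, and function names are all made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical tag signatures: each URL maps a handful of tags to a weight.
signatures = {
    "cameras-a.example": {"photography": 0.9, "camera": 0.8, "reviews": 0.7},
    "cameras-b.example": {"photography": 0.8, "camera": 0.9, "reviews": 0.6},
    "furniture.example": {"furniture": 0.9, "design": 0.6, "shopping": 0.7},
    "crafts.example": {"shopping": 0.8, "design": 0.5, "crafts": 0.9},
}


def build_index(sigs):
    """Build an inverted index: tag -> list of (url, weight)."""
    index = defaultdict(list)
    for url, tags in sigs.items():
        for tag, weight in tags.items():
            index[tag].append((url, weight))
    return index


def norm(vec):
    """Euclidean length of a sparse vector stored as a dict."""
    return math.sqrt(sum(w * w for w in vec.values()))


def nearest(query_url, sigs, index, k=3):
    """Return up to k (url, cosine similarity) pairs, best match first."""
    query = sigs[query_url]
    dots = defaultdict(float)
    # Only URLs that share a tag with the query accumulate a dot product,
    # so most of the sparse space is never touched.
    for tag, query_weight in query.items():
        for url, weight in index.get(tag, []):
            if url != query_url:
                dots[url] += query_weight * weight
    query_norm = norm(query)
    scored = [(dot / (query_norm * norm(sigs[url])), url)
              for url, dot in dots.items()]
    scored.sort(reverse=True)
    return [(url, score) for score, url in scored[:k]]


index = build_index(signatures)
print(nearest("cameras-a.example", signatures, index))
```

With only about 10 tags per URL, the inverted index keeps each query proportional to the number of URLs sharing those tags rather than to the full 50,000-dimensional space.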

At the time this project was first started and the first prototype completed,
there were absolutely zero sites, other than google's sub-par "similar" link,
that offered this. Well, there was "similicious" but its index was tiny, and
results weren't too impressive. Now I count 5 full fledged services that do a
pretty good job. Some are ranked quite highly and probably make a decent buck.
I really think I missed a decent opportunity by not pushing through the last
20% of the project, and losing motivation entirely.

But, like they say, better late than never.

Questions/comments are welcomed/anxiously awaited.

~~~
joshu
curiously, delicious itself used to have a URL-URL recommender, but it didn't
really scale and I had to turn it off.

~~~
photon_off
First off -- thanks for delicious. Not sure if it's apparent, but moreofit
uses its rich data.

If I might ask, how did the original URL-URL recommender work? Was it simply a
query to find other URLs with the most matched tags, or was tag order also
taken into account?

~~~
joshu
It looked at the other URLs the users bookmarked in the same way, normalized
partially for popularity.

So I never looked at the global tags - user1's use of tag x was distinct from
user2's.
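
One way to read that description, as a rough sketch only: for every (user, tag) pair that includes the target URL, count the other URLs filed under that same pair, then divide by a damped popularity term. The bookmark data, the `damping` exponent, and the function name are assumptions for illustration and not how delicious actually implemented it.

```python
from collections import defaultdict

# Hypothetical bookmarks as (user, url, tag) triples.
bookmarks = [
    ("alice", "a.example", "python"),
    ("alice", "b.example", "python"),
    ("alice", "c.example", "cooking"),
    ("bob", "a.example", "code"),
    ("bob", "b.example", "code"),
    ("bob", "popular.example", "code"),
    ("carol", "popular.example", "misc"),
    ("carol", "c.example", "misc"),
    ("dave", "popular.example", "news"),
    ("erin", "popular.example", "news"),
]


def recommend(target, bookmarks, damping=0.5):
    """Rank URLs that users filed alongside `target` under the same tag."""
    by_user_tag = defaultdict(set)  # (user, tag) -> URLs; tags stay per-user
    savers = defaultdict(set)       # url -> users who bookmarked it
    for user, url, tag in bookmarks:
        by_user_tag[(user, tag)].add(url)
        savers[url].add(user)

    counts = defaultdict(int)
    for urls in by_user_tag.values():
        if target in urls:
            for other in urls - {target}:
                counts[other] += 1

    # damping < 1 normalizes only partially for popularity.
    scores = {url: count / (len(savers[url]) ** damping)
              for url, count in counts.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(recommend("a.example", bookmarks))
```

Because the grouping key is (user, tag), alice's "python" and bob's "code" are never merged into a global tag, which matches the point that one user's use of a tag is distinct from another's.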

Man, I miss that dataset.

------
jeromec
A lot of the results look great, like ikea.com, for example. However, trying
tougher queries yields no results. Php.net has no results (message still says
500 results found), and the message explains results were filtered for the tag
'php'; clicking "click here to undo this" gives a 404 error. Also,
number10.gov.uk, the equivalent of whitehouse.gov for Britain gives a message
the site is not popular enough for results. The overlay for "about these
results" doesn't have a 'close' prompt, so users who may not be technical
enough to try clicking outside the message area would be stuck on that screen.
Overall, it looks like it has nice potential. Is there a way to see beyond the
top 10 results? The first 10 results for Mixergy.com look great, but I'd like
to see more results! :)

~~~
photon_off
Great feedback.

I just fixed that broken link. The automatic hiding of a tag is meant
primarily to hide tons of duplicates that would happen if you searched for
"amazon.com" for example. I guess I should check to see if it hides _too much_
and if so, leave it in.

Paging is not implemented yet... and curiously enough you're the first to
request it. Surprised it took so long!

Which "tougher queries" did you try? If you get a response that "the page was
not popular enough" then there's not much I can do. I'm relying on delicious
to have tags for the URL, and if none are present, then I'm SOL.

Thanks for the feedback. I'll have pagination done shortly.

------
ugh
Nice work! I tried amongst others “dpreview.com” and the results I got were
very exhaustive. Pretty much the list I would have made had you asked me for
websites that review digital cameras.

It also works for international websites (see the results for my favorite
German-English dictionary [+]) but you could maybe work on eliminating
duplicates.

[+] <http://www.moreofit.com/similar-to/dict.leo.org/Top_10_Sites_Like_Leo_Dict/>

~~~
photon_off
Thanks for the feedback.

Some duplicates can be removed by selecting the "domain only" option, since
pages are often indexed separately "foo.com" and "foo.com/home/". I agree that
this should be automatically completed -- it's a difficult task that I'm not
keen on doing yet.
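
For the easy part of that problem, a minimal sketch: collapse each result to its host and keep the first (highest-ranked) hit per host. The helper names are hypothetical, and this deliberately ignores the hard case where two pages on one domain are genuinely different sites.

```python
from urllib.parse import urlsplit


def domain_key(url):
    """Reduce a URL to a lowercase host without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def dedupe_by_domain(ranked_urls):
    """Keep only the first result for each host, preserving rank order."""
    seen = set()
    kept = []
    for url in ranked_urls:
        key = domain_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


print(dedupe_by_domain(["foo.com", "foo.com/home/", "www.foo.com/about", "bar.com"]))
```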

Medium-sized niche sites, like dpreview.com, work especially well. I'm glad
you noticed. If you'd like to explore a little bit more, you can try [1] and
fussing about with the tags. Maybe change one to be "satire" and see what
happens. The possibilities are endless, as in essence you are exploring a
60,000 dimensional universe of links. (By the way, the "tag search" option is
available by searching for a site, then clicking "search by tag signature"
just below the first highlighted result.)

[1] <http://bit.ly/bcNumG>

Again, thanks for the feedback and I'm glad you found the results
satisfactory.

------
nostrademons
Google's had this since about 2000, but its results suck more:

<http://www.google.com/search?q=related:news.ycombinator.com>

You may want to play up the "best" angle more than the "first" one.

~~~
photon_off
Good point, though I suppose by "first" I meant "first whose primary purpose
is to provide similar sites".

I'm glad you think moreofit's results "suck less". :)

~~~
cj
Yeah, definitely don't play the first angle. The first thing I thought of when
I read that was Google's similar link, which I use pretty often.

------
KevinMS
I was working on something like this a little while ago. I also saw all the
competition (sites similar to a site that shows similar sites :)

I wasn't smart enough to do what you did, I was thinking of somehow building
the directory by hand, or by turkers, etc

What finally killed my ambitions was sears.com. I reasoned that a similar-site
search would only make the big time if the search was perfect, but what is a
perfect match for sears.com? truevalue.com? perfumes.com? clothes.com? or
another department store like kmart? You descend into category madness.

All those competitors of yours are also latecomers, I remember in the 90's
netscape's browser had a pulldown of similar sites built in.

I think it's very possible you'll make a few bucks off this if you can get some
buzz, but I'm not sure if these similar site searches will ever be a killer
app because the problem is so huge.

Oh, I also discovered that if you put the domain you are matching in the URL,
like you are doing now, Adsense does a great job at targeting ads.

One more thing, you forgot the bookmarklet! I want to be on a site, click the
moreofit bookmarklet, and do a search. But of course you're now in subdomain
hell, if they are on secure.shoes.com, do you search secure.shoes.com or
shoes.com?

~~~
photon_off
Hi Kevin! Glad somebody else has a shared interest in this problem.

Sears.com is a very difficult problem to solve. The "category madness" you
bring up certainly is a very good point, and I rely on the wisdom of crowds to
mitigate this issue. After a site has been tagged thousands of times, I'm
assuming that the unique description of the site is pretty good. However,
there is a slight "snowball effect" present with my dataset (delicious): When
a URL is first tagged, those tags are suggested to all future people who
bookmark it, causing the initial set of tags to have more bias.

At any rate, back to sears.com. The current matches now are all stores, but
they're not as specific as you suggest they could be. Moreofit offers a "tag
filter" feature, and a tag signature matching feature for this occasion. Of
course, it all depends on the user being savvy enough to use these things.

On moreofit, you can specify tags and how important they are for your search.

Try these results to see what I mean: <http://bit.ly/cVXgED>

I agree that it will never be a "killer app," however, I do think it's pretty
useful and a preliminary search on Google shows a lot of interest in searches
for "sites like chatroulette" and "sites like XXX". I'm hoping to serve those
customers' needs.

The bookmarklet is coming soon -- and subdomain hell isn't a big issue --
search the subdomain first, if no matches, then the main domain.
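
A minimal sketch of that fallback, assuming the index is a plain dict from host to results (a hypothetical stand-in for the real lookup). It stops at two labels, which is naive: hosts under suffixes like gov.uk would need something like a public suffix list to find the true main domain.

```python
def lookup_with_fallback(host, index):
    """Try the full host first, then drop leading labels one at a time."""
    labels = host.lower().split(".")
    # Never go below two labels, so "shoes.com" is tried but bare "com" is not.
    for start in range(max(len(labels) - 1, 1)):
        candidate = ".".join(labels[start:])
        results = index.get(candidate)
        if results:
            return candidate, results
    return None, []


# Hypothetical index with no entry for the subdomain.
index = {"shoes.com": ["boots.example", "sandals.example"]}
print(lookup_with_fallback("secure.shoes.com", index))
```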

~~~
notahacker
The problem with searches for "sites like xx" is that people entering those
search strings are likely to be fairly unsophisticated users (who aren't aware
of the "similar" button or related: syntax, let alone the possibility of there
being dedicated sites for that kind of search). Moreofit.com looks more like a
power user tool.

The other problem is that Google's results for the suggested "sites like"
queries are pretty helpful - they generally return articles and q&a pages with
exactly the lists the end users are looking for. It might require more clicks
to get there, but even the non-savvy user should be able to manage it.

I was impressed by the results returned by moreofit though, especially for
websites that don't have numerous linkbait articles about "Top 10 websites for
....".

The tag importance sliders are especially neat, though one minor usability
issue is that the update button (bottom right) is a long way away from the
sliders and the text advising you to click it. The Google display ads returned
when you do a URL search are horrendously irrelevant and probably deter users
from clicking on the targeted text ads above them.

~~~
photon_off
I'm trying to make it so a search for "sites like bla" would bring me
somewhere near the top, with a link to "top ten sites like bla" which I feel
would appeal well to these users.

Google's results for these searches are OK. I've noticed it takes a
substantial amount of time to scan through the links and figure out which
forum posts might contain some good information. When clicking through,
sometimes I'd find good suggestions, sometimes not. It's a similar experience
if you search for some obscure error code and end up scouring a few pages to
find out how somebody else solved it (sometimes in one of those god awful non-
threaded 1-comment-per-page bulletin boards -- how do they still exist?).

I'll work on fixing that update button issue.

If there's one thing I noticed, it's that hardly anybody has used the "tag
sliders" feature, or any other search options for that matter. They're either
content enough, or don't care enough, to delve further into results. From a
technology standpoint, I find the tag sliders feature to be really exciting.
You can craft your own URL and see if anything matches it.

The horrendous Google Ads are something out of my control -- Google won't
allow me to supply the keywords for the page, and if they haven't indexed the
page yet, they'll show crappy generic horoscope ads. I need suggestions as to
how to get more relevant ads up there. Perhaps an ad service that allows _me_
to supply the keywords. Know of any?

And, of course, thanks for the feedback.

------
kingkilr
Honestly, I clicked it, entered my blog, and expected a clusterfuck. Couldn't
have been more wrong, the results are probably as close as anything I'd
expect:
<http://www.moreofit.com/similar-to/alexgaynor.net/Top_10_Sites_Like_Alexgaynor/>

~~~
photon_off
Hey, thanks.

------
10ren
excellent work!

The only thing I'd say is to make the "update" button more prominent (eg add
another one beneath the "searching for tags" sliders). A few times I wondered
why it was not updating the results, and had to hunt around for it. Maybe some
people, at the margin in terms of interest, would give up on it because of
this.

It would be nice if there was some natural way for the results to update
without specifically requesting it - eg. like how the "moreofit" links work.
The way google does it in "search tools" for time choice is interesting: it
provides typical defaults, that are linked (eg. Past 24 hours,Past week etc).
I've _always_ used it by just clicking on one of them, so the issue of
"updating" doesn't occur. [reminiscent of 37signals' defaults]. Perhaps the
truth is that while someone might theoretically want the results from the past
22 hours, in practice no-one ever does. Thus, the defaults are adequate. For
your importance sliders, I guess this could be done with two links "important"
vs "secondary" for each slider.

Of course, sometimes you want to assemble a search, and only _then_ execute it
(not have it update after every single term is added). For this, the standard
practice seems to be to have a "search" button directly underneath however the
search is configured (eg this is what google search does for "custom range", for
time, in web tools). So this would mean adding a "search" button beneath the
sliders (not labeled "update", even though that is also an accurate
description - I guess, maybe use "moreofit", if you want to promote the brand
at possible expense of usability).

tl;dr add a "search" button beneath the "searching for tags" sliders.

~~~
photon_off
Thanks for the feedback. I think adding another update button below the tag
sliders is a good idea.

As per the "more natural way of updating results", I might use Ajax to plop in
the new results when any search options are changed.. though this could lead
to further complications: Currently you can click on a tag to add it to the
"include" and "exclude" tag filters. If you click a tag and suddenly the
results you were looking at abruptly disappear and reappear, it might not be
so nice.

What do you think of this idea: Any time a change is made to the search
options, the "update" button starts glowing (fading in and out or changes
colors) so as to notify the user that it wants to be clicked?

~~~
10ren
Glowing is an interesting idea, but I think just placing the search button
beneath would be helpful enough - then it's obvious to the user that they need
to hit that, and it's easy to find. That would solve both of the problems that
I had, anyway.

------
jluxenberg
Got a little creeped out by your "auto suggest" feature. I'm guessing you're
using the CSS technique described here
<http://www.merchantos.com/makebeta/tools/spyjax/> ?

~~~
photon_off
Finally, somebody noticed! Hardly anybody is using "auto suggest" ... which is
good -- I'm glad people have URLs in mind to search. Yes, I'm using the
browser history hack that's been around for ages. It's going to be patched up
by all major browsers in probably under a year, so don't worry. I think the
latest safari already has it covered.

------
yarone
Wow. After some brief testing, I am also very impressed with the results. Well
done.

------
kanak
I tried it on google reader (<http://reader.google.com>) and most of the top
matches are google reader itself.

It worked better on reddit.com though.

~~~
photon_off
That's a prime example of when to filter by tag. (You can do this by clicking
on any tag)

Here's results for "reader.google.com", except omitting any results that have
been tagged with "google":

<http://bit.ly/dnsgfk>
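
The include/exclude filtering can be sketched roughly as below, assuming each result carries its own tag list. The function name and sample data are hypothetical, and the real site may well apply the filter inside the search rather than afterwards.

```python
def filter_by_tags(results, include=(), exclude=()):
    """Keep results that have every `include` tag and no `exclude` tag."""
    include, exclude = set(include), set(exclude)
    kept = []
    for url, tags in results:
        tag_set = set(tags)
        if include <= tag_set and not (exclude & tag_set):
            kept.append((url, tags))
    return kept


# Hypothetical results: (url, tags) pairs in ranked order.
results = [
    ("reader-clone.example", ["google", "rss", "reader"]),
    ("feeds.example", ["rss", "reader"]),
    ("news.example", ["news", "rss"]),
]
print(filter_by_tags(results, exclude=["google"]))
```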

------
xtacy
What's interesting is that I can use the website to find websites similar to
itself.. or, wait a minute, can I? :)

~~~
photon_off
I suppose it is _a little_ ironic that moreofit.com isn't itself indexed. It's
not popular enough :P

------
fuzzythinker
How are the results compared to google or DDG's in your test samples?

~~~
photon_off
Google's "similar" link is akin to my results with "popularity" sort, only
worse. That is, Google seems to do this: "What other URLs have some keywords
in common with this one?" And of those, it sorts by "which is most popular?".
The results, well, they're not at all impressive which is why I made this in
the first place.

However, Google's index is much greater. Currently you can't search for non-
semi-popular URLs on moreofit, but you can on Google. So they have me beaten
there.

What is DDG?

~~~
TimMontague
DDG: <http://duckduckgo.com>

------
jamram82
reddit and reddit.com show two different search results. I thought reddit
would refer to reddit.com

~~~
photon_off
The query "reddit" is assumed to mean you are searching for sites that have
been tagged as "reddit", which is different from "reddit.com" which is assumed
to be a URL.

At any rate, if you search for "reddit" it should suggest "were you looking
for sites like reddit.com?"
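
A minimal sketch of that rule, assuming a query counts as a URL only if it looks like a dotted host name, and that a tag query is checked against known hosts to produce the "were you looking for" hint. The regex, the `known_hosts` set, and the return shape are all illustrative assumptions.

```python
import re

# Matches an optional scheme, a dotted host name, and an optional path.
HOST_RE = re.compile(
    r"^(?:https?://)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?$",
    re.IGNORECASE,
)


def classify_query(query, known_hosts):
    """Treat dotted queries as URLs and everything else as a tag."""
    q = query.strip()
    if HOST_RE.match(q):
        return {"kind": "url", "value": q, "suggestion": None}
    # For a tag query, suggest an indexed host whose first label matches it.
    suggestion = None
    for host in sorted(known_hosts):
        if host.split(".")[0] == q.lower():
            suggestion = host
            break
    return {"kind": "tag", "value": q, "suggestion": suggestion}


print(classify_query("reddit", {"reddit.com", "digg.com"}))
print(classify_query("reddit.com", {"reddit.com", "digg.com"}))
```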

------
iterationx
reddit search returns reddit, but the other results look good

------
adrianwaj
delicious API? Why not just say so?

~~~
photon_off
Delicious API.

Where would you like me to say it? It will probably go in an "about" page, but
I don't think there's any benefit to the possible confusion of "what is the
delicious API?" by putting it on the front page.

~~~
adrianwaj
Just a very small "Powered by Delicious" at the bottom of the page would give
some added credibility and clarification.

