
Quora Blocks Startup Search Engines - bjonathan
http://www.readwriteweb.com/hack/2011/01/quora-blocks-startup-search-en.php
======
joblessjunkie
When our startup was just getting going, search spiders were an unexpected
problem for us.

Our project was a bit odd, in that we had millions of documents available to
browse. We wanted them all to be available to search engines, but it was an
operational headache to keep that much data moving.

Our normal daily traffic might have been 10,000 users with perhaps 50K page
requests. We built our first small web server to handle this load.

We were caught blindsided by the Googlebot, which from day one demanded
hundreds of thousands of pages per day. Googlebot offers no option to slow it
down, and if you try to slow it down on purpose your search ranking will
suffer.

So 90% of our early operational burden was devoted to keeping Googlebot happy.
This was an unexpected burden, but it was worth it to be in the Google index,
because they sent a lot of traffic our way and kept our fledgling startup
alive.

But it was definitely _not_ worth it to spend equivalent operational expense
to be in marginal search engines that don't refer much traffic. To a site
operator, they don't offer any value in exchange for what they demand.

It was very interesting to see the relative volume of requests that we got
from Googlebot versus the other crawlers. While Googlebot found us on day one,
it took over 3 months before we appeared in Microsoft's index, and even then,
they had perhaps only 10% of the coverage of our site when compared to Google.

~~~
ig1
You can slow down Googlebot via Google Webmaster Tools.

~~~
joblessjunkie
At the time, I recall that our choices were between "too fast" or "way, way
too fast".

------
niyazpk
Did anyone actually go through the robots.txt file?
<http://www.quora.com/robots.txt>

The first two lines are:

    
    
        # If you operate a search engine and would like to crawl Quora, please
        # email info@quora.com. Thanks.
    

Looks sensible to me.

~~~
JoachimSchipper
Yeah, because mailing every two-cent site to get permission to crawl is a
sensible way to build a search engine.

Quora should just get over themselves.

~~~
pclark
Ironically, I believe this is to protect users' privacy on the Quora
site. They have an option to hide your username from your answers, so if
someone were to Google your name your Quora answers wouldn't appear.

~~~
jcr
If one wants to be an idiot or an asshole on the Internet, then one is free to
do so. If one wants to attach their real name, address or some other
identifiable bit of information to such outbursts, then once again, one is
free to do so.

The same goes for life off of the Internet, which of course could be recorded
or reported, and subsequently placed on the Internet.

The real problem is people mistakenly believe they can secretly operate with
impunity in an age of increasing transparency.

~~~
pclark
You've successfully identified a fundamental problem with society, and it is
by no means limited to Quora.

I think the search engine privacy feature is a fantastic innovation from Quora
to help avoid _potential_ problems in the future (e.g., you Google my name
and see my Quora answers on Sex).

~~~
jcr
The supposed "potential problems" are fictitious.

If you wrote public answers on questions of sex, why should they be hidden?
--You made your answers public _intentionally_ and also provided the answers
_intentionally_ so others would hopefully learn from them. Additionally, you
did both while knowing your name is attached to them. In other words, you
_want_ to be known both for and by your contributions.

If you did not wish to be known for and by your public contributions, then the
only answer is to either not make public contributions, or go to rather
extreme lengths to only make public contributions anonymously. The latter is
fraught with plenty of caveats to maintain anonymity correctly, so the only
real answer is to only make public contributions you would want attributed to
you.

Back in the 80's when Chico State was the number 1 party school in the US,
recruiters would show up in droves looking to get new employees. The unstated
reason was very simple; if someone could graduate from Chico, it pretty much
proved the person could not only get things done but also have a really fun
time doing it.

Who would you rather work with? --A fiercely academic but socially limited
graduate, or the guy who can get stuff done and also do keg stands?

If I searched on your name and found your public opinions, I'd think far more
of you for having the stones to make your own contributions than I would be
concerned about your predilection for midget porn.

Of course, the trouble with trying to be non-judgmental and unbiased is
assuming others will do the same.

------
mmaunder
Running a scraping site pretending to be a search engine is an oldie, but a
goodie.

[http://www.google.com/#sclient=psy&hl=en&q=site:duck...](http://www.google.com/#sclient=psy&hl=en&q=site:duckduckgo.com)

Way to play the openness card Gabriel.

------
citricsquid
"...Google, Bing, Blekko and other big players access..."

Wait, since when was Blekko big? Also, why has every article about search
engines recently ended up talking about DDG? :p

~~~
redthrowaway
>why has every article about search engines recently ended up talking about
DDG?

Because they're the new darling, and tech types like them. It's always nice to
see someone come in as a substantial underdog and make a real effort to
succeed. Also, their results seem to be at least as good as Google's which is
pretty impressive.

~~~
nostrademons
"Also, their results seem to be at least as good as Google's which is pretty
impressive."

They're Bing with some twiddles on top. DDG uses Yahoo! BOSS, which is Yahoo's
API to its search service, which is now powered by Bing. Your DDG results all
originally come from Bing's index, they just get filtered and possibly re-
ranked by some custom code Gabriel's written.

~~~
epi0Bauqu
Actually, that's not true and it really isn't hard to verify so I'm not sure
why you keep saying that.

~~~
coderdude
Where are you getting your results?

~~~
epi0Bauqu
From the FAQ:

How do you get your results?

 _From over 30 sources, including DuckDuckBot (our own crawler), crowd-sourced
sites, Yahoo! BOSS, embed.ly, WolframAlpha, EntireWeb, Bing & Blekko._

Just do some searches and compare to Bing, etc., e.g.
<http://duckduckgo.com/?q=hacker+news> vs
<http://www.bing.com/search?q=hacker%20news> vs
[http://www.google.com/search?hl=en&q=hacker%20news](http://www.google.com/search?hl=en&q=hacker%20news)

~~~
coderdude
Alright. I think the spirit of what our ancestor poster was saying is that
it's mostly just powered by an API (by which most of the "work" is being
done). In a previous discussion I asked if you were going to do full-scale Web
crawls, and you said:

>We never stopped crawling; we just scaled it back for specific purposes,
mainly now for spam detection and zero-click info

So if you've scaled back the crawling and are mostly using it just for spam
detection and zero-click info (forgive my ignorance, but I'm not sure what
zero-click info is) then does that mean that you are getting the majority of
your result data from outside APIs like BOSS?

Edit: And does that mean that you store those pages in your database and use
your own ranking algorithm, or are you using their ranked results for a query
and then rearranging them on the fly?

~~~
epi0Bauqu
No, the spirit of the above comment was condescending, dismissive and only
mentioned one source: _They're Bing with some twiddles on top._ (I believe he
is a Google employee.)

About 50% of queries show information from my index. I've concentrated on the
fat head of the search space, where I get a lot of volume (1 and 2 word
queries). For long-tail stuff, yes, the majority of results come from external
APIs, but even there it is inaccurate to say they are all derived from a
direct call to one API. It really varies by query. Some will look like Bing
for sure, but others will look very different. I'm not going to disclose
everything I do of course.

~~~
coderdude
>No, the spirit of the above comment was condescending, dismissive

Well, I didn't even think of it that way. Naturally you're probably looking at
these types of criticisms under a magnifying glass since it's your project.

>About 50% of queries show information from my index. I've concentrated on the
fat head of the search space, where I get a lot of volume (1 and 2 word
queries).

See now that's actually good to know! From many people's point of view
(including mine), DDG _is_ Bing with some twiddles on top. Maybe out of
ignorance, but what can you expect? If you aren't already enamored by the idea
of DDG then you probably don't care enough to find out what you're really
doing on the back-end, which is what really matters most to me in these
discussions. (Not the results since unless I'm told otherwise I just assume
it's Bing or Yahoo.)

Edit: It's quite humorous to be down-voted by the fan boys for expressing an
honest and IMO helpful opinion to the owner of DDG. If all he gets is blind
love he won't know how to get the rest of us on board. Believe it or not, the
down vote button isn't the disagree button. If you have something constructive
to add, say it.

~~~
redthrowaway
See, and you were doing so well until the edit. I was about to upvote your
rather interesting reply, but reddit has left an unforgivably bad taste in my
mouth for people who say "not sure why the downvotes" or "I'll probably get
downvoted, but..."

I'm not going to downvote you, but next time just say what you have to say. If
people like it and upvote it, great. If not, it's a meaningless metric on a
fairly small social news site. There are far more important things to worry
about.

~~~
coderdude
I understand your point, but I'm not worried about the actual points being
added or subtracted from my karma. People here tend to continue up-voting
something that's been up-voted or continue down-voting something that's been
down-voted. I'm attempting to curb any further lack of real consideration for
my comments due to the blind nature of many of the users here regarding voting
etiquette. If you've seen me around here before you'll know that I often post
comments that attempt to uphold what I consider to be the HN standard. Which
is why I ended it with "if you have something constructive to add, say it." I
realize some users may find this grating, but perhaps those users aren't the
target audience for the remark to begin with. Thanks for the heads up though.

To add to that, if I were trolling or being rude or vicious I would not have
edited to bring up the down-voting. I hate to see this site slip into the bad
habit of down-voting legitimate and helpful points due to simple disagreement
with the commenter. With so many new users coming to HN daily it's important
to occasionally remind people that this isn't Reddit, it's not Digg, and it's
not Slashdot.

------
Samuel_Michon
From the article:

 _"[Quora's] robots.txt file explicitly grants Google, Bing, Blekko and other
big players access"_

Blekko is a big player? It launched two months ago and it's still in public
beta. According to Compete.com [1], it attracts only 120,000 visitors a month.

I'd say that if Blekko is a big player, DuckDuckGo is one too.

[1] <http://siteanalytics.compete.com/blekko.com/?metric=uv>

~~~
rudiger
Blekko has around 800 servers, each with 64 GB RAM and 8 TB of storage across
SATA drives.

[http://www.readwriteweb.com/hack/2010/12/the-secrets-behind-...](http://www.readwriteweb.com/hack/2010/12/the-secrets-behind-blekkos-search-technology.php)

Has DuckDuckGo published the details of their data-center anywhere?

Also, Blekko may have launched recently, but it was founded in mid-2007.

------
bobf
I agree wholeheartedly. For comparison, I currently see ~95% of search engine
referral traffic from Google, 2.8% from Yahoo, 1.4% from Bing, and <1% from
all other search engines combined. It makes no sense to let X, Y, Z random
search engines index content like Google, when they send nowhere near the
traffic. As others have mentioned, blocking by robots.txt often doesn't work
as some bots don't obey it. I currently use an extensive rewrite rule to block
those bots' User-Agents, and block specific IPs as a last resort.
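
For illustration only, the same two-step policy (match on User-Agent first, fall back to IP ranges) can be sketched in application code rather than as a rewrite rule. The bot names and the IP range below are hypothetical placeholders, not a real blocklist, and `should_block` is a name made up for this sketch.

```python
import ipaddress

# Hypothetical values for illustration; not taken from any real blocklist.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot", "badcrawler")
# 203.0.113.0/24 is a range reserved for documentation (TEST-NET-3).
BLOCKED_NETWORKS = (ipaddress.ip_network("203.0.113.0/24"),)


def should_block(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked User-Agent or IP range."""
    agent = user_agent.lower()
    # First line of defence: case-insensitive substring match on the User-Agent.
    if any(token in agent for token in BLOCKED_AGENT_SUBSTRINGS):
        return True
    # Last resort: block by source address, for bots that spoof their User-Agent.
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

A request handler would call `should_block(...)` before doing any real work and answer with a 403 when it returns `True`.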

~~~
pilif
The problem is that by disallowing those smaller players, you are preventing them
from getting bigger because they would need to be able to index sites to
actually become good enough to attract visitors.

Also, when disallowing a spider you can be sure that you won't get ANY
visitors from them, as you are not in their index to start with.

I mean: if I disallow any bot but Googlebot, I will have 100% Google referrals
by design.

I can understand that in an emergency situation you might want to block some
bots, but blocking all of them because they are currently small feels a) a bit
unfair, because it doesn't give them the chance to grow, and b) shortsighted,
because you never know whether they might get big enough or whether the
visitors they deliver might be better ad-clickers.

------
terryjsmith
Speaking from experience, this is why robots.txt files don't work; they've
become anti-competitive in nature. If you actually build code to obey them you
get locked out of actual sites, while the real scrapers you don't want don't
listen to robots.txt anyways.

If a site really doesn't want you there, they can block you at the IP/User-
Agent level, which is what Quora will end up doing.
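
To make "code that obeys robots.txt" concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text is a simplified stand-in for a file that admits one big crawler and shuts out everyone else (it is not Quora's actual file), and `HypotheticalStartupBot` is an invented name.

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in: one named crawler is allowed, everyone else is not.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_crawl(user_agent: str, url: str) -> bool:
    """Ask the standard-library parser whether this agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "HypotheticalStartupBot"):
        print(agent, may_crawl(agent, "http://example.com/some-question"))
```

A well-behaved crawler performs this check before every fetch, which is exactly why it ends up locked out while a scraper that skips the check does not.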

------
alphadog
What's the penalty (if any) for ignoring robots.txt and crawling anyway? Any
citations of it being legally enforced via TOS or ASP?

~~~
moultano
robots.txt in the past has been the legal defense against claims of copyright
infringement.

~~~
uxp
Curious if you have any citations for this remark. I have always been under
the impression that robots.txt is entirely optional, both to implement and to
respect.

~~~
moultano
IANAL and looking for other sources, but it's discussed along with many other
factors here: <http://www.benedict.com/Digital/Internet/Field/Field.aspx>

Essentially, the existence of robots.txt and meta noindex has made courts more
comfortable ruling that the index of search engines constitutes fair use.

~~~
uxp
But this still doesn't specifically answer the question, "What is the penalty
for ignoring 'User-agent:* Disallow: /' in robots.txt, or meta tag no-index",
only that without a robots.txt or no-index tag, crawlers can't be found liable
for accessing information that is freely available to them.

Interesting link, however. I was unaware of that case.

------
shirtless_coder
I don't blame them for doing this. They are trying to protect their assets
(questions and answers). However, this is not a good move for two reasons:

1\. Honoring robots.txt is optional, so this doesn't actually offer any form
of asset protection.

2\. They are losing serious amounts of exposure that they would reap if they'd
lift this restriction so that smaller engines can find them.

------
wslh
Is robots.txt against net neutrality? Net neutrality for bots.

~~~
metageek
I think the answer is no, because it's on an endpoint. Net neutrality is about
making sure that middlemen don't abuse their position.

------
jay_kyburz
Question: Are smaller search companies allowed to build off of the results of
other search providers, or is this blocked by the TOS?

~~~
alanh
Google in particular absolutely does not allow automated search queries.

~~~
dasil003
Why does <http://lmgtfy.com/> still exist then?

~~~
unshift
lmgtfy just issues a redirect, it doesn't perform a query and return results
of its own
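
As a rough sketch of that distinction (assuming the site works as described here, which I have not verified), a redirect-only service just builds a search URL and hands it back to the browser. It never queries the search engine itself. The function name is hypothetical.

```python
from urllib.parse import urlencode


def search_redirect(query: str) -> tuple:
    """Build a 302 response that sends the user's own browser to the search page."""
    location = "http://www.google.com/search?" + urlencode({"q": query})
    # No request is made to the search engine here; the browser follows the redirect.
    return 302, {"Location": location}
```

The automated-query restriction is about software fetching and reusing result pages, which this never does.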

------
dedward
Did anyone look at the robots.txt?

It looks to me like it doesn't block anything - it does have some specific
settings tailored for various big-boy search engines, with a catch-all rule at
the end for those not otherwise defined.

It also says right there at the top "# If you operate a search engine and
would like to crawl Quora, please email info@quora.com. Thanks."

So they're trying to get the right data to the right search engines the right
way - makes sense to me.

    User-agent: *
    Allow: /$
    Allow: /about$
    Allow: /about/
    Allow: /jobs$
    Allow: /challenges$
    Allow: /press$
    Allow: /login$
    Allow: /login/
    Allow: /signup$

~~~
jarek
The important part is Disallow: / right after the allows.
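
To see what the catch-all section actually permits, here is a small sketch that evaluates the quoted rules plus that final `Disallow: /`. It assumes the matching convention documented by the major crawlers and in RFC 9309: `*` is a wildcard, a trailing `$` anchors the end of the path, the longest matching pattern wins, and an allow wins a tie. Whether every crawler reads the file that way is an assumption, and the helper names are made up.

```python
import re

# Rules quoted in the thread, followed by the "Disallow: /" mentioned above.
RULES = [
    ("allow", "/$"), ("allow", "/about$"), ("allow", "/about/"),
    ("allow", "/jobs$"), ("allow", "/challenges$"), ("allow", "/press$"),
    ("allow", "/login$"), ("allow", "/login/"), ("allow", "/signup$"),
    ("disallow", "/"),
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt pattern: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_allowed(path: str) -> bool:
    """Longest matching pattern wins; allow wins a tie; no match means allowed."""
    best = (-1, True)
    for kind, pattern in RULES:
        # re.match anchors at the start of the path, so unanchored patterns act as prefixes.
        if pattern_to_regex(pattern).match(path):
            best = max(best, (len(pattern), kind == "allow"))
    return best[1]
```

Under those assumptions, an unlisted crawler may fetch the home page, `/about` and the login and signup pages, but any question page falls through to `Disallow: /`.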

------
dedward
Search engines can be a real PITA for a site. They can take a site down. When
it's my site - I don't have to let you, or anyone else, index it. We can do it
nicely through robots.txt, or cold-war style through firewalls and scripts and
tarpits and who knows what else. So - assuming Quora took the time to put what
they did in their robots.txt, in a world where many sites still don't bother
at all - one can assume they are paying close attention to the business value
they are extracting from search engine driven traffic.

------
joe_the_user
This is indeed a disturbing, oligopolistic trend.

Craigslist's policy of specifically disallowing only classified search engines
should look kinder in retrospect.

It is reasonable to block sites which aim to merely re-frame your content
within a "search" parameter. But that kind of thing can be dealt with through
terms of service.

The problem is that it takes time to actively filter against offenders.

------
arihant
If they are blocking to save resources, then it might be worth noting that
startup search engines will not bomb their servers the way Google and Bing
can. Are there any published numbers as to how the crawling load varies from
startups to Googlebot?

------
jmount
Just about everybody effectively blocks (or throttles to near zero) all but the
top 3 or so crawlers. For a popular commerce site a significant fraction (like
1/3rd) of the infrastructure capacity is just responding to crawler traffic.
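
One common way to implement "throttle to near zero" is a token bucket per crawler class. The sketch below is illustrative only: the class, the budget numbers and the two-tier split are assumptions made for the example, not a description of any particular site's setup.

```python
class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = 0.0

    def allow(self, now: float) -> bool:
        """Refill according to elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical policy: a generous budget for top crawlers, near zero for the rest.
BUDGETS = {
    "top": TokenBucket(rate=5.0, capacity=10.0),
    "other": TokenBucket(rate=0.01, capacity=1.0),
}


def admit(crawler_class: str, now: float) -> bool:
    """Decide whether to serve this crawler request or answer with a 429/503."""
    return BUDGETS[crawler_class].allow(now)
```

In a real server `now` would come from something like `time.monotonic()`. It is passed in explicitly here so the behaviour is easy to test.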

------
rwhitman
Aggressive spiders are a pain in the ass. I'm with them on this.

~~~
mtogo
Search engine spiders, too.

