

Hacker News bans Google and all other search engines - jsrfded
http://news.ycombinator.com/robots.txt
Looks like news.ycombinator.com is rejecting all search engine robots with "User-Agent: * Disallow: /". This is unfortunate: I often do site: searches to find old threads here, and since news.ycombinator.com doesn't have its own site search, that's the only way to find them.<p>pg, what's up?
======
pg
Don't worry, it doesn't mean anything. The software for ranking applications
runs on the same server, and it is horribly inefficient (something 4 people
use every 6 months doesn't tend to get optimized much). This weekend all of us
were reading applications at the same time, and the system was getting so slow
that I banned crawlers for a bit to buy us some margin. (Traffic from crawlers
is much more expensive for us than traffic from human users, because it
interacts badly with lazy item loading.) We only finished reading applications
an hour before I had to leave for SXSW, so I forgot to set robots.txt back to
the normal one, but I just did now.
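
Roughly what "interacts badly with lazy item loading" means, as a minimal
Python sketch with invented names (HN itself is written in Arc, so this is
only an illustration of the failure mode, not the real code):

    # Items load lazily from disk into memory on first access.
    item_cache = {}

    def load_item(item_id):
        """Simulate an expensive disk read."""
        return {"id": item_id, "title": "..."}

    def get_item(item_id):
        # Human readers hit a small hot set, so the cache stays small.
        # A crawler walks every item page, so the entire archive ends
        # up resident in memory.
        if item_id not in item_cache:
            item_cache[item_id] = load_item(item_id)
        return item_cache[item_id]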

~~~
lisper
Dude, why not just get yourself another server?

~~~
sbhat7
"something 4 people use every 6 months doesn't tend to get optimized much"

~~~
_delirium
Yeah, but the usual case is still to throw stuff like that on a spare "random
odds and ends go here" server, not a production one whose
performance/responsiveness you care about.

~~~
kmt
It likely has a dependency on HN: looking up users, karma, etc., which is just
simpler to do on the same machine.

~~~
markbao
Ranking applications by HN karma? I hope not.

~~~
davidu
They've been clear that they do look at your HN posts and comments, etc.

It may or may not be used for ranking in specific, but it's definitely used
and they've always been upfront about that when asked.

~~~
idoh
Also, they use Hacker News to communicate with applicants. I applied a while
back, and one day while browsing I got a message through HN, something like
"pg has a question for you" in the header bar. Clicking through showed the
question and a place to put the answer.

Basically, there's a whole messaging system in hacker news that most people
don't see.

------
eel
What does this mean for <http://www.searchyc.com>? Will it stop too?
(After all, it must be a bot too, and a robots.txt like this implies that no
bots /should/ scrape the site.) Or will it be allowed?
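
For reference, a well-behaved bot decides this with a robots.txt check before
fetching anything. A minimal sketch using Python's stdlib, assuming the
blanket "Disallow: /" rule quoted above:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://news.ycombinator.com/robots.txt")
    rp.read()

    # Under "User-Agent: *" / "Disallow: /", both of these return False,
    # so any compliant bot, searchyc included, would stop crawling.
    print(rp.can_fetch("*", "http://news.ycombinator.com/"))
    print(rp.can_fetch("*", "http://news.ycombinator.com/item?id=1"))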

~~~
qeorge
That was my immediate question too. In addition to searchyc, we lose every
other HN mash-up, present and future.

Maybe this will finally bring about an HN API.

------
g0atbutt
That's pretty disappointing. I'd love to hear an official explanation of why
this choice was made.

~~~
cookiecaper
I quite agree. I use Google to search HN. I just used it the other day,
because I didn't remember this one guy's startup or username, but I knew it
involved family websites. I was able to find it after a bit of search-fu. I am
sad to see that HN won't be indexed by Google anymore. :(

~~~
dantheman
One way to search news.ycombinator is to visit searchyc.com

~~~
JeffJenkins
Unless they're obeying robots.txt, which they probably should be

------
sahaj
i just googled this:
<http://www.google.com/search?q=site:news.ycombinator.com+Hacker+News+bans+Google+and+all+other+search+engines>

and found this: <http://news.ycombinator.com/item?id=165279>

one of the top replies:

 _My vote is to constrain growth as much as possible, at least that which
comes from stupid sources. Smart hackers will find this site just fine without
Yahoo or MSN, probably even Google. As "evil" as blocking sites and crawlers
may sound, I think these types of measures will be necessary to preserve the
quality of content here. Whatever actions further that objective have my
vote._

~~~
borism
ugh

perhaps smart hackers shouldn't block everything out like this without any
discretion whatsoever?

------
petercooper
Will this stop Google Reader and other RSS readers from legitimately using
the RSS feed here? After all, Google Reader uses a bot to read the feed and
allows us to search it from within their app, much like their search engine
does with regular pages :-) (That is, "web robots" aren't just spiders.)

Also things like <http://hacker-newspaper.gilesb.com/> and
<http://hnsort.com/> become less legitimate due to this. If the reason is to
reduce the SEO benefits of getting a link on HN, just "nofollow" everything
instead.

(Update: Googling on this topic brought up a page of my own where a Google
Reader engineer explained how Google Reader deals with robots.txt -
<http://www.petercooper.co.uk/google-reader-ignores-robottxt-rules-51.html> -
though their definition of Web robot is far from universal)

------
proee
This makes me sad. I often do site:news.ycombinator.com google searches for
key topics.

~~~
Mc_Big_G
I do exactly the same thing at least once a day. I can always remember just a
snippet of the post or subject I want to find and that type of query always
gets me what I want. I just used it today to find the YC company who is doing
the journalism stuff so I could send the link to my nephew.

------
pierrefar
Does that also kill Search YC?

What about the "official" HNSearch
(<http://www.webmynd.com/html/hackernews.html> )?

~~~
blasdel
The 'official' HNSearch is completely useless for anyone who would ever want
to search HN. It's a Firefox Extension (!) that adds a 400px sidebar to your
google search results pages showing site: searches along with an extra adsense
ad. It's less useless than the 5 IE toolbars your annoying relatives live
with, but just as obnoxious.

I just installed it again for a laugh, and it actually redirects to
searchyc.com results when you click on its "Hacker News" header — even though
pg staunchly refuses to ever link to searchyc. I made fun of him for it at his
party before Startup School and he tried to laugh it off with "they're a YC
company!" as if that weren't obvious on the face of it.

~~~
tptacek
What's pg's problem with SearchYC? It's incredibly useful.

~~~
jacquesm
That it competes with webmynd ?

It is a clear conflict of interest, HN should simply get the best search
available, so the 'funded' webmynd guys should get off their butts and fix it
or HN should support searchyc.

~~~
tptacek
I thought Webmynd was a browser plugin. I've never even _considered_ using it
for HN search.

~~~
jacquesm
It is. Check the link at the bottom of every HN page:

<http://www.webmynd.com/html/hackernews.html>

------
chaosmachine
If this is an anti-spam measure, I expect it will be about as effective as
"nofollow" was. There are still plenty of good reasons for spammers to submit
crap. RSS and Twitter syndication of links, for example.
RSS and Twitter syndication of links, for example.

On the other hand, if the goal is to push HN back into semi-obscurity by
making it harder to find, it might work.

------
tkiley
Is this permanent?

HN is a good source of google juice for interesting new startups, and it would
be a shame to see that go away...

~~~
qeorge
That's probably the main issue. HN's authority and DoFollow'd links make it a
prime target for linkspam.

There's a lot of reasons I'm bummed about this, but I have to say: startups
missing out on free PageRank is not one of them.

~~~
ericb
The links are nofollow'd now.

edit: Oh wait, seems that only comment links are nofollow'd.

~~~
weaksauce
Actually, some of the links are nofollowed only until the story hits a karma
threshold. Not sure of the exact value, but it seems that a story with >50
votes is not currently nofollowed.

~~~
new_account
It seems that a lot is going on - in secret. Example: I just discovered that
if I reload the page my votes for comments get rolled back.

~~~
blasdel
It wasn't rolled back: it was never counted.

There's a disenfranchisement heuristic that causes your votes to be recorded
(can't vote again) but not counted. It usually persists for about a day. I
think it's supposed to kill voting rings but its behavior is pretty odd.

The number in your browser gets incremented or decremented client-side
independently of reality — it's doing a blind GET request by pre-fetching an
'image', not making an XHR POST and getting the new karma value in the
response. Like a lot of things in Arc, it's just poor HTTP.
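
The difference at the protocol level, sketched in Python against hypothetical
endpoints (example.com stands in for the real URLs; this illustrates the
point, not HN's actual code):

    import urllib.request

    # What the page effectively does: a blind GET whose response body is
    # discarded, so the client never learns whether the vote counted.
    urllib.request.urlopen("http://example.com/vote?for=123&dir=up")

    # The alternative: an explicit POST whose response carries the real
    # new score, which the client would render instead of guessing.
    resp = urllib.request.urlopen(
        "http://example.com/vote", data=b"for=123&dir=up")
    new_score = int(resp.read())  # e.g. 42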

------
Matt_Cutts
Does anyone know why HN decided to disallow search engines? It's certainly up
to the website owner to decide, but there are good ways to (say) reduce the
load on the web server without blocking search engines entirely.

~~~
seiji
The HN code keeps everything in memory (lazily loaded from disk). If every
search engine and me-too crawler out there hits every page, you constantly
have the entire HN corpus in memory. I assume the memory overload is what
leads to the site being slow and unreliable. I don't know if the arc code has
any LRU scheme or if everything sticks around until the server falls over.

HN just needs a super simple caching proxy in front of it to reduce load on
the app server. Alternatively, just generate static pages for all topics older
than 5 days.
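
The shape of that proxy is simple. A toy sketch in Python's stdlib, with an
assumed backend address; a real deployment would use something like Varnish
or Squid:

    import time
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BACKEND = "http://127.0.0.1:8080"  # assumed app-server address
    TTL = 300                          # serve cached copies for 5 minutes
    cache = {}                         # path -> (fetched_at, body)

    class CachingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            entry = cache.get(self.path)
            if entry is None or time.time() - entry[0] > TTL:
                # Cache miss or stale: fetch once from the app server.
                body = urllib.request.urlopen(BACKEND + self.path).read()
                cache[self.path] = (time.time(), body)
            else:
                body = entry[1]
            self.send_response(200)
            # Simplified; a real proxy would forward upstream headers.
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8000), CachingProxy).serve_forever()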

Will sysadmin for YC dinner invites.

------
simonw
I noticed that the other day while playing around with the YQL console - YQL
obeys robots.txt, so the Hacker News data table doesn't work any more.

------
tptacek
I'm interested in the justification for this, but I'm happy about it. I'm
actually uncomfortable with how high Hacker News comments score on Google.

~~~
jamesbressi
Please enlighten me as to why you are "uncomfortable with how high Hacker News
comments score on Google." I mean it sincerely, I'd like to know because I
have no clue why that is an issue?

~~~
tptacek
Talking to people in a relatively small forum like Hacker News: totally
comfortable.

Talking with the entire world by having my comments associated in ways I don't
expect with keywords: less comfortable.

~~~
jamesbressi
Thank you and that makes total sense. Case in point:

I googled an executive of a company that was looking to do business with me. I
came across some google groups discussions that he was part of from a couple
years ago... and let's just say the material would be a dream for someone with
a vendetta against the person.

Thanks for the perspective!

------
redsymbol
Everyone RELAX. There's probably a good explanation for this, maybe it was
even an (easily correctable) accident. Give Paul a chance to respond before
raising the pitchforks, okay?

------
chaosmachine
It appears to have been changed, just seconds ago. Now it reads:

    User-Agent: *
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /threads?

~~~
Matt_Cutts
That seems like a nice middle ground--it should save server load, but search
engines can still get to useful content.

------
WalkingDead
I guess that might have to do with the load these search engines generate.
Previously in a thread we checked that comments posted on HN appear in Google
search within a minute of posting. That should create a good amount of load
on HN. Multiply it by all the search engines in the wild, and pg probably
decided blocking those won't do any harm. My guess. Could be wrong, or it
might be temporary or a mistake.

------
mikecane
What is the reason for this?

~~~
zitterbewegung
Probably to stop linkspam?

~~~
jsrfded
It's unfortunate because I often use site: to find old threads on ycombinator.
Especially since news.ycombinator.com doesn't have its own search facility,
external search engines are the only way to comb through old posts here.

~~~
johnfn
<http://www.searchyc.com/> still exists, but it's up in the air whether they
will continue.

~~~
chengmi
Not unless pg replies to this with "Please stop".

------
CoreDumpling
Well, at least my profile on HN will no longer be the first result for a
Google search for my username (it can now revert to some blog that's not
mine). But I really wish HN were searchable; I try to find insight on the
comment threads here all the time.

------
ars
The robots page is returning the MIME type text/html instead of text/plain.
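
Easy to check from Python; the convention is that robots.txt is served as
text/plain:

    import urllib.request

    resp = urllib.request.urlopen("http://news.ycombinator.com/robots.txt")
    print(resp.headers.get("Content-Type"))  # should be text/plain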

------
brianr
Whoa there, everything important is still crawlable and indexable. Here's what
robots.txt says right now:

    
    
      User-Agent: *
      Disallow: /x?
      Disallow: /vote?
      Disallow: /reply?
      Disallow: /submitted?
      Disallow: /threads?
    

This just disallows those pages... not the home page, and not the /item?
action (note the url of this page).

~~~
mcav
It used to just disallow all. PG must be changing it as we speak.

------
nirmal
Looks like the Readable Feeds stuff is still working. Not sure how this will
affect it in the future.

<http://andrewtrusty.appspot.com/readability/>

------
prs
Truth be told: On various occasions the site: operator on Google came in handy
for me to dig out some nugget of information from the archives of HN.

I can't wait to hear the reason behind that decision.

------
tsally
So is SearchYC done, or do they use a scraper rather than a crawler?

~~~
grinich
What's the difference?

~~~
tsally
In my mind legitimate crawlers are large engineered programs that respect
robots.txt, whereas a scraper might just be a Perl script some dude wrote to
check a website for updates periodically. Technically I suppose they all
should probably respect robots.txt, but in my experience something you think
of as a scraper rarely does.

------
sev
I'd like to understand why this decision has been made, as well as why the
explanation has been delayed (or will not be given at all) directly from the
source.

------
euroclydon
Didn't I read here that the MSN bot refuses to obey the robots.txt file? Maybe
I'll have to search HN with it.

------
AdamN
I think this is still on the front page because people want it to stay this
way.

------
rmason
I think that it's brilliant. Short term it gets publicity and drives people to
the group who are curious.

Long term PG signs a deal with Bing to be the exclusive search engine for
Hacker News that pays for the servers and bandwidth.

Down mod me if you will but it's simply brilliant.

------
andrewcooke
surprised no-one has suggested that this is to improve performance. this place
can be very slow sometimes, and blocking robots can reduce load.

~~~
keltex
That's not a legitimate reason to ban Google. You can always throttle their
indexing using Google's Webmaster Tools. Or you can use search-engine-specific
lines in robots.txt (Bing, anyone?) to ban misbehaving bots.

The only reason to do this is if you don't want your site indexed by Google,
and I really can't think of a legitimate reason to want that.
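
For instance, a robots.txt that throttles or targets specific bots instead of
banning everyone. Crawl-delay is a nonstandard directive that msnbot and
Yahoo's Slurp honor but Google ignores (Google is throttled via Webmaster
Tools), and the misbehaving-bot name here is made up:

    User-agent: msnbot
    Crawl-delay: 10

    User-agent: SomeMisbehavingBot
    Disallow: /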

~~~
andrewcooke
true, but it's a busy time for yc. maybe this was just a stop-gap measure?
when did it start?

------
jchrisa
not strictly true (read the code), but of all the sites I know, HN is best
poised to implement their own search and drop Google. go for it!

------
ddsmooth
That notwithstanding, if you go to Google.com, type hacker news into the
search box and click "I'm Feeling Lucky" you will find yourself at a familiar
web page. ;-)

~~~
jsrfded
Google indexes robots.txt-banned sites based on inbound links, but notice that
the snippet is gone. Google also won't be able to index HN posts and comments
anymore.

------
karlzt
I just searched on Google and it's not blocked.

------
_pius
Very lame.

------
jgavris
_very_ interesting

~~~
jgavris
the way i see it, they're going to have to put up a loginwall to stop
crawlers...and then we'll just crawl back in anyway...

------
python123
This is horrible. Now what's a talentless, middle-aged career software
engineer with hopeless dreams of entrepreneurship going to do with his time?

------
hackermom
i really enjoyed that the news item linked to the robots.txt file :) it's
nerdy, it's short, it's clear and concise, all in one :)

------
quinto42
This is fantastic! No more spam links.

------
fnazeeri
Bye bye HN...it was good to know you.

