Hacker News bans Google and all other search engines (ycombinator.com)
318 points by jsrfded on Mar 15, 2010 | hide | past | web | favorite | 120 comments

Don't worry, it doesn't mean anything. The software for ranking applications runs on the same server, and it is horribly inefficient (something 4 people use every 6 months doesn't tend to get optimized much). This weekend all of us were reading applications at the same time, and the system was getting so slow that I banned crawlers for a bit to buy us some margin. (Traffic from crawlers is much more expensive for us than traffic from human users, because it interacts badly with lazy item loading.) We only finished reading applications an hour before I had to leave for SXSW, so I forgot to set robots.txt back to the normal one, but I just did now.

That makes sense. Thanks for the explanation pg!

Dude, why not just get yourself another server?

"something 4 people use every 6 months doesn't tend to get optimized much"

Yeah, but the usual case is still to throw stuff like that on a spare "random odds and ends go here" server, not a production one whose performance/responsiveness you care about.

It likely has a dependency on HN: looking up users, karma, etc., which is just simpler to do on the same machine.

Ranking applications by HN karma? I hope not.

They've been clear that they do look at your HN posts and comments, etc.

It may or may not be used for ranking in specific, but it's definitely used and they've always been upfront about that when asked.

Also, they use hacker news to communicate to applicants. I applied a while back, and one day while browsing I got a message through HN, something like "pg has a question for you" in the header bar. Clicking through is the question, and place to put the answer.

Basically, there's a whole messaging system in hacker news that most people don't see.

Yeah, this is known :) but an algo that incorporates HN karma wouldn't be the best thing :P

I was told once that they don't rank on karma, but if they recognize your name from discussions around here it helps your case.

I'd be really worried about some algorithm somewhere deciding to crawl less frequently even after the temporary ban has been lifted.

Robot bans can render your archive.org records inaccessible, zap all of your formerly-200 URLs out of the major search indices, and yes, kill your inclusion in the list of topical sites.

Much better to set a temporary Crawl-Delay directive. Otherwise you're not just telling the engines to pause crawling you, you're telling them "take all of my pages out of your index."
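For illustration, a throttling robots.txt might look something like this (the 10-second value is made up; note that Crawl-delay is a non-standard directive that Yahoo and Bing honor but Google ignores, which is why Google offers a rate setting in Webmaster Tools instead):

```
User-agent: *
Crawl-delay: 10
```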

What are the details of Crawl-Delay? If it's supposed to be "number of seconds between hits", then for some sites, 1 hit per second may as well be "not crawled at all"...

Right, but his point is that at least this way you won't get delisted :)

Why is this still on the front page of HN? This seems like such a non-story.

Have you considered using last-modified/etag headers to control the crawlers? If you detect a crawler, and it requests a page that it has visited before, you could just return a 304 without doing anything else.
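A minimal sketch of that conditional-GET idea using Python's standard http.server (the page table and ETag scheme here are invented for illustration; HN itself runs on Arc, not Python):

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the server's rendered pages.
PAGES = {"/item?id=1": b"<html>cached item page</html>"}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path, b"Unknown.")
        etag = '"%s"' % hashlib.md5(body).hexdigest()
        # If the crawler sends back the ETag we gave it last time,
        # answer 304 and skip regenerating (and resending) the page.
        if self.headers.get("If-None-Match") == etag:
            self.send_response(304)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", etag)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)
```

A crawler that stores the ETag then pays only for a header exchange on revisits, not a full page render.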

On a related note, the threads page seems to load very slowly now. I guess that's because it places a heavy load on the DB, grabbing all your comments from different threads.

What does this mean for http://www.searchyc.com? Will it stop too? (After all, it must be a bot too, and a robots.txt like this implies that no bots /should/ scrape the site.) Or will it be allowed?

That was my immediate question too. In addition to searchyc, we lose every other HN mash-up present and future.

Maybe this will finally bring about an HN API.

robots.txt is really only like a note on the front door of your unlocked house that says "please only look at these items in these rooms" or "please do not look in my house." If someone wants to write a program to index the site, they can still do so. Google, MSN, Bing, Yahoo, et al. will respect the robots.txt file, though.

Or searchyc could simply ignore robots.txt

That's pretty disappointing. I'd love to hear an official explanation of why this choice was made.

I quite agree. I use Google to search HN. I just used it the other day, because I didn't remember this one guy's startup or username, but I knew it involved family websites. I was able to find it after a bit of search-fu. I am sad to see that HN won't be indexed by Google anymore. :(

Yes, with the way that references vanish from HN, I really need Google to find things a couple of days old. I have tried other search methods and found them not worthwhile.


One way to search news.ycombinator is to visit searchyc.com

Unless they're obeying robots.txt, which they probably should be

I tried searching for google go with and without quotes and the results were the same. Also, very few of the results actually pertained to Google Go. I would still rather have Google...

Hmm, does search-fu qualify as something you can put on a resume?

I wonder if part of the reason is that the Arc web server doesn't send HTTP status codes properly. Every URL path, valid or not, returns a 200, even if the content is only "Unknown."

For example, check the headers sent back for:


    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    Connection: close
If a search engine sees that, it will periodically revisit it forever.

The correct solution is to fix the Arc server to return correct HTTP headers, not block everything using robots.txt.
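The fix being suggested amounts to a dispatch rule like the following sketch (route names are hypothetical; the real Arc routing is different):

```python
# A well-behaved server returns 404 for paths it cannot serve,
# so crawlers drop them from the index instead of revisiting forever.
KNOWN_PREFIXES = ("/item", "/news", "/user", "/threads")  # hypothetical route table

def status_for(path: str) -> int:
    """Return 200 for recognized routes, 404 for everything else."""
    return 200 if path.startswith(KNOWN_PREFIXES) else 404
```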

That may be the correct solution, but I'm sure the current option was chosen more because it was the quick solution.

Maybe they're blocking everything while they're in the process of fixing it.

Until someone creates an HTML link to such a document, how does this problem manifest itself?

The site itself created zillions of them with its random, expiring fnids. Although the scheme has changed now, crawlers would still hit those obsolete URLs.


This sucks for one major reason. Sometimes I want to dig up a link that was on HN, and I don't know where else to find it. So I search google with site:ycombinator.com. If they block Google, then that won't work.

At the very least, add a search engine to HN itself so I can search for old links.

http://searchyc.com/ (assuming it's not blocked also)

Agreed 100%

Why produce a great site and then not let your users access its valuable content?

Things scroll by the front page so fast it's not really fair not to provide a search feature.

I don't care if they let external search engines (Google et al) index it, but they should at least provide /some/ kind of internal search feature themselves.

IMO providing a search feature is probably at least as effective as, if not more effective than, the "noprocrast" flag.

After all, if I know I can search past items at any point in the future, I don't feel a pressure to load up the front page several times a day.

Whereas if the fast-scrolling front page is the only way to access items, it encourages that sort of rat-hitting-the-bar behavior I learned about in college psych classes ;)

HN is very, very slow. Removing crawlers likely helps with that.

Removing a leg or two could also help one towards a weight loss goal.

But then you lose a lot of options for exercise. Walking and running are definitely out.

I would bet good money that it's nothing a good dedicated server with, say, 32 GiB of RAM and 2 or 4 SAS spindles couldn't fix.

You can set Googlebot's crawl rate with Google Webmaster Tools if that's the issue.

The robots.txt file's content-type is text/html, not text/plain. It won't work. I'm still able to access the contents of this site from Yahoo Pipes. Don't worry about it.

i just googled this: http://www.google.com/search?q=site:news.ycombinator.com+Hac...

and found this: http://news.ycombinator.com/item?id=165279

one of the top replies:

My vote is to constrain growth as much as possible, at least that which comes from stupid sources. Smart hackers will find this site just fine without Yahoo or MSN, probably even Google. As "evil" as blocking sites and crawlers may sound, I think these types of measures will be necessary to preserve the quality of content here. Whatever actions further that objective have my vote.


perhaps smart hackers shouldn't block everything out like this without any discretion whatsoever?

Will this break Google Reader and other RSS readers from legitimately using the RSS feed here? After all, Google Reader uses a bot to read the feed and allows us to search it from within their app.. much like their search engine does with regular pages :-) (That is, "web robots" aren't just spiders.)

Also things like http://hacker-newspaper.gilesb.com/ and http://hnsort.com/ become less legitimate due to this. If the reason is to reduce the SEO benefits of getting a link on HN, just "nofollow" everything instead..

(Update: Googling on this topic brought up a page of my own where a Google Reader engineer explained how Google Reader deals with robots.txt - http://www.petercooper.co.uk/google-reader-ignores-robottxt-... - though their definition of Web robot is far from universal)

This makes me sad. I often do site:news.ycombinator.com google searches for key topics.

I do exactly the same thing at least once a day. I can always remember just a snippet of the post or subject I want to find and that type of query always gets me what I want. I just used it today to find the YC company who is doing the journalism stuff so I could send the link to my nephew.

That's my standard method as well to search HN.

Does that also kill Search YC?

What about the "official" HNSearch (http://www.webmynd.com/html/hackernews.html )?

The 'official' HNSearch is completely useless for anyone who would ever want to search HN. It's a Firefox Extension (!) that adds a 400px sidebar to your google search results pages showing site: searches along with an extra adsense ad. It's less useless than the 5 IE toolbars your annoying relatives live with, but just as obnoxious.

I just installed it again for a laugh, and it actually redirects to searchyc.com results when you click on its "Hacker News" header — even though pg staunchly refuses to ever link to searchyc. I made fun of him for it at his party before Startup School and he tried to laugh it off with "they're a YC company!" as if that weren't obvious on the face of it.

I don't use either. I'm interested for two reasons:

1. I use Google site: searches a LOT. I can remember reading about something I need now a long time ago on HN and I go find it.

2. One item on my "if I have time" projects list is to do "something" "fun" with HN submissions. Now I'm officially barred, regardless of what I would have thought up.

What's pg's problem with SearchYC? It's incredibly useful.

That it competes with webmynd ?

It is a clear conflict of interest. HN should simply get the best search available, so the 'funded' webmynd guys should get off their butts and fix it, or HN should support searchyc.

I thought Webmynd was a browser plugin. I've never even considered using it for HN search.

It is. Check the link at the bottom of every HN page:


He invited alaskamiller to that private party too — I was kicking myself later for not putting a little more fuel on that fire by pulling him into that conversation (not that I didn't piss off pg enough).

If this is an anti-spam measure, I expect it will be about as effective as "no follow" was. There's still plenty of good reasons for spammers to submit crap. RSS and Twitter syndication of links, for example.

On the other hand, if the goal is to push HN back into semi-obscurity by making it harder to find, it might work.

Is this permanent?

HN is a good source of google juice for interesting new startups, and it would be a shame to see that go away...

That's probably the main issue. HN's authority and DoFollow'd links make it a prime target for linkspam.

There's a lot of reasons I'm bummed about this, but I have to say: startups missing out on free PageRank is not one of them.

The way it's done now is that all new submissions are nofollowed until they hit a certain karma threshold. A while back when I looked at this, the threshold was about 3-5 upvotes.

This is actually a very neat and simple way to exercise editorial control to remove the nofollow: if enough people in the community like it, there is something useful about it and it is vetted.
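As a sketch, the policy described above amounts to something like this (the threshold value is just the commenter's 3-5 vote estimate, not a confirmed number):

```python
NOFOLLOW_THRESHOLD = 5  # assumed; the comment above estimates 3-5 upvotes

def render_link(url: str, karma: int) -> str:
    """Render a submission link, keeping rel=nofollow until the
    community has vetted the submission with enough upvotes."""
    rel = "" if karma >= NOFOLLOW_THRESHOLD else ' rel="nofollow"'
    return '<a href="%s"%s>%s</a>' % (url, rel, url)
```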

The links are nofollow'd now.

edit: Oh wait, seems that only comment links are nofollow'd.

Actually some of the links are nofollowed until it hits a karma threshold. Not sure of the value but it seems that a story with >50 votes is not currently nofollowed.

It seems that a lot is going on - in secret. Example: I just discovered that if I reload the page my votes for comments get rolled back.

It wasn't rolled back: it was never counted.

There's a disenfranchisement heuristic that causes your votes to be recorded (can't vote again) but not counted. It usually persists for about a day. I think it's supposed to kill voting rings but its behavior is pretty odd.

The number in your browser gets incremented or decremented client-side independently of reality — it's doing a blind GET request by pre-fetching an 'image', not making an XHR POST and getting the new karma value in the response. Like a lot of things in Arc, it's just poor HTTP.

Does anyone know why HN decided to disallow search engines? It's certainly up to the website owner to decide, but there are good ways to (say) reduce the load on the web server without blocking search engines entirely.

The HN code keeps everything in memory (lazily loaded from disk). If every search engine and me-too crawler out there hits every page, you constantly have the entire HN corpus in memory. I assume the memory overload is what leads to the site being slow and unreliable. I don't know if the arc code has any LRU scheme or if everything sticks around until the server falls over.

HN just needs a super simple caching proxy in front of it to reduce load on the app server. Alternatively, just generate static pages for all topics older than 5 days.

Will sysadmin for YC dinner invites.
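For what it's worth, a front-side cache along those lines might be sketched in nginx as follows (paths, ports, and cache timings are all invented for illustration):

```
proxy_cache_path /var/cache/hn levels=1:2 keys_zone=hn:10m max_size=1g;

server {
    listen 80;
    location / {
        proxy_pass http://127.0.0.1:8080;   # the Arc app server
        proxy_cache hn;
        proxy_cache_valid 200 5m;           # serve cached copies for 5 minutes
    }
}
```

Crawler traffic would then mostly hit the cache instead of forcing the app server to render pages.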

I noticed that the other day while playing around with the YQL console - YQL obeys robots.txt, so the Hacker News data table doesn't work any more.

I'm interested in the justification for this, but I'm happy about it. I'm actually uncomfortable with how high Hacker News comments score on Google.

Please enlighten me as to why you are "uncomfortable with how high Hacker News comments score on Google." I mean it sincerely, I'd like to know because I have no clue why that is an issue?

Talking to people in a relatively small forum like Hacker News: totally comfortable.

Talking with the entire world by having my comments associated in ways I don't expect with keywords: less comfortable.

Thank you, that makes total sense. Case in point:

I googled an executive of a company that was looking to do business with me. I came across some google groups discussions that he was part of from a couple years ago... and let's just say the material would be a dream for someone with a vendetta against the person.

Thanks for the perspective!

start posting your comments in rot13? ;)

Some Brainfuck to implement this suggestion:


Everyone RELAX. There's probably a good explanation for this, maybe it was even an (easily correctable) accident. Give Paul a chance to respond before raising the pitchforks, okay?

It appears to have been changed, just seconds ago. Now it reads:

  User-Agent: * 
  Disallow: /x?
  Disallow: /vote?
  Disallow: /reply?
  Disallow: /submitted?
  Disallow: /threads?

That seems like a nice middle ground--it should save server load, but search engines can still get to useful content.

I guess that might have to do with the load these search engines generate. Previously, in another thread, we found that comments posted on HN appear in Google search within a minute of posting. That must create a good amount of load on HN. Multiply it by all the search engines in the wild, and pg probably decided blocking them wouldn't do any harm. My guess; could be wrong, or it might be temporary or a mistake.

What is the reason for this?

My bet is that it is for technical reasons. Whenever you see a reply link (among other things) the server has to persist a continuation for it. Getting all of those pages crawled must generate a massive number of continuations to persist.

I know that this is one of the reasons that vote-links look the way they do.

If that's not the reason, then my #2 guess is that getting the whole site crawled is moving too many comments into the cache. If I remember correctly, whenever an uncached comment gets accessed, it gets cached (basically adding it to a hash), and thus added to the in-memory ram. That gradually raises the memory requirements for running the app, until pg does a restart, which resets the cache as well.

And the regular restarts invalidate every single continuation-based URL.
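A toy model of the continuation problem described above (all names are invented; the real Arc code differs): every rendered page mints fnid-style entries that live in RAM until a restart, so a full crawl pins one continuation per link into memory.

```python
import secrets

# fnid -> closure; grows unbounded until the server restarts.
CONTINUATIONS = {}

def make_fnid(handler):
    """Register a continuation under a fresh random id, as each
    reply/more link on a rendered page would."""
    fnid = secrets.token_hex(8)
    CONTINUATIONS[fnid] = handler
    return fnid

def render_page(item_ids):
    """Rendering a page of items creates one continuation per link,
    so a crawler fetching every page inflates the table."""
    return ["/x?fnid=" + make_fnid(lambda i=i: i) for i in item_ids]
```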

Probably to stop linkspam?

It's unfortunate because I often use site: to find old threads on ycombinator. especially since news.ycombinator.com doesn't have its own search facility, external search engines are the old way to comb through old posts here.

http://www.searchyc.com/ still exists, but it's up in the air if they will continue.

Not unless pg replies to this with "Please stop".

That doesn't make sense. Why not just add rel=nofollow to all links?

OK, thanks.

maybe to see which firms ignore robots.txt files?

Well, at least my profile on HN will no longer be the first result for a Google search for my username (it can now revert to some blog that's not mine). But I really wish HN were searchable; I try to find insight on the comment threads here all the time.

The robots page is returning the mime-type text/html instead of text/plain

Whoa there, everything important is still crawlable and indexable. Here's what robots.txt says right now:

  User-Agent: *
  Disallow: /x?
  Disallow: /vote?
  Disallow: /reply?
  Disallow: /submitted?
  Disallow: /threads?
This just disallows those pages... not the home page, and not the /item? action (note the url of this page).

It used to just disallow all. PG must be changing it as we speak.

looks like it's been updated.

Looks like the Readable Feeds stuff is still working. Not sure how this will affect it in the future.


Truth be told: On various occasions the site: operator on Google came in handy for me to dig out some nugget of information from the archives of HN.

I can't wait to hear the reason behind that decision.

So is SearchYC done, or do they use a scraper not a crawler?

What's the difference?

In my mind legitimate crawlers are large engineered programs that respect robots.txt, whereas a scraper might just be a Perl script some dude wrote to check a website for updates periodically. Technically I suppose they all should probably respect robots.txt, but in my experience something you think of as a scraper rarely does.
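For reference, a well-behaved bot can check the rules with Python's standard urllib.robotparser. Here it parses the partial-disallow rules quoted elsewhere in this thread:

```python
import urllib.robotparser

# Build a parser from robots.txt lines directly (no network fetch needed).
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /x?",
    "Disallow: /threads?",
])

# Item pages are allowed; threads pages are blocked.
print(rp.can_fetch("MyBot", "http://news.ycombinator.com/item?id=1"))     # → True
print(rp.can_fetch("MyBot", "http://news.ycombinator.com/threads?id=pg")) # → False
```

A "Perl script some dude wrote" skips exactly this check, which is the practical difference being described.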

Well, what if I just saved every HTML page I visited. I'm clearly not a bot.

I'd like to understand why this decision has been made, as well as why the explanation has been delayed (or will not be given at all) directly from the source.

Didn't I read here that the MSN bot refuses to obey the robots.txt file? Maybe I'll have to search HN with it.

I think this is still on the front page because people want it to stay this way.

I think that it's brilliant. Short term, it gets publicity and drives curious people to the site.

Long term PG signs a deal with Bing to be the exclusive search engine for Hacker News that pays for the servers and bandwidth.

Down mod me if you will but it's simply brilliant.

surprised no-one has suggested that this is to improve performance. this place can be very slow sometimes, and blocking robots can reduce load.

That's not a legitimate reason to ban google. You can always (using google's webmaster tools) throttle their indexing. Or you can use search-engine specific lines in robots.txt (Bing anyone?) to ban misbehaving bots.

The only reason to do this is if you don't want your site indexed by Google, and I really can't think of a legitimate reason to want that.

... And you can speed the site up. There's no reason HN has to be so slow.

true, but it's a busy time for yc. maybe this was just a stop-gap measure? when did it start?

not strictly true (read the code), but of all the sites I know, HN is best poised to implement their own search and drop Google. go for it!

That notwithstanding, if you go to Google.com, type hacker news into the search box and click "I'm Feeling Lucky" you will find yourself at a familiar web page. ;-)

Google indexes robots banned sites based on links. But notice that the snippet is gone. Google also won't be able to index HN posts and comments anymore.

I just searched on Google and it's not blocked.

Very lame.

very interesting

the way i see it, they're going to have to put up a loginwall to stop crawlers...and then we'll just crawl back in anyway...

This is horrible. Now what's a talentless, middle-aged career software engineer with hopeless dreams of entrepreneurship going to do with his time?

i really enjoyed that the news item linked to the robots.txt file :) it's nerdy, it's short, it's clear and concise, all in one :)

This is fantastic! No more spam links.

Bye bye HN...it was good to know you.
