
I think I know what the problem is; we're detecting HN as a dead page. It's unclear whether this happened on the HN side or on Google's side, but I'm pinging the right people to ask whether we can get this fixed pretty quickly.

Added: Looks like HN has been blocking Googlebot, so our automated systems started to think that HN was dead. I dropped an email to PG to ask what he'd like us to do.

I sent you an email about this.

(A couple weeks ago I banned all Google crawler IPs except one. Crawlers are disproportionately bad for HN's performance because HN is optimized to serve recent stuff, which is usually in memory.)

Hi Paul,

A site can be crawled from any number of Googlebot IP addresses, and so blocking all except one doesn't help in throttling crawling.

If you verify the site in Webmaster Tools, we have a tool you can use to set a slower crawl rate for Googlebot, regardless of which specific IP address ends up crawling the site.

Let me know if you need more help.

Edit: Detailed instructions to set a custom crawl rate:

1. Verify the site in Webmaster Tools.

2. On the site's dashboard, the left hand side menu has an entry called Site Settings. Expand that and choose the Settings submenu.

3. The page there has a crawl rate setting (last one). It defaults to "Let Google determine my crawl rate (recommended)". Select "Set custom crawl rate" instead.

4. That opens up a form where you can choose your desired crawl rate in crawls per second.

If there is a specific problem with Googlebot, you can reach the team as follows:

1. To the right hand side of the Crawl Rate setting is a link called "Learn More". Click that to open a yellow box.

2. In the box is a link called "Report a problem with Googlebot", which will take you to a form you can fill out with full details.



I would like to set that crawl rate but do not see why I must register at Google to do so. Why can't Google support the Crawl-Delay directive in robots.txt for this?

My plane is about to take off, but very briefly: people sometimes shoot themselves in the foot and get it way, way wrong. Like "crawl a single page from my website every five years" wrong.

Crawl-Delay is (in my opinion) not the best measure. We tend to talk about "hostload," which is the inverse: the number of simultaneous connections that are allowed.

Another great way to shoot yourself in the foot (and get it way, way wrong) is to block all Googlebot IPs except for one.

Instead of completely disregarding Crawl-Delay, why not support it up to a maximum value that is deemed sensible? This would prevent people from completely shooting themselves in the foot, and it would surely be better than completely disregarding it.
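A minimal sketch of what "support it up to a maximum" could look like on the crawler side, using Python's standard `urllib.robotparser`. The 30-second ceiling, the 1-second default, and the function name are assumptions for illustration; nothing in this thread says how Google would pick those values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical ceiling; the thread does not name a specific value.
MAX_CRAWL_DELAY_SECONDS = 30.0


def effective_crawl_delay(robots_txt, user_agent,
                          default=1.0,
                          ceiling=MAX_CRAWL_DELAY_SECONDS):
    """Return the Crawl-delay a site asked for, capped at `ceiling` seconds."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(user_agent)
    if requested is None:
        # No usable Crawl-delay line: fall back to the crawler's own default.
        return default
    return min(float(requested), ceiling)


# A "once every five years" style mistake gets clamped instead of obeyed.
five_years = "User-agent: *\nCrawl-delay: 157680000\n"
print(effective_crawl_delay(five_years, "Googlebot"))
```

The site owner's intent is still honored for sane values, and only the extreme ones are overridden.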

I would think that the number of people who (a) know how to create a valid robots.txt file, (b) have some idea of how to use the "crawl-delay" directive and (c) write a "shoot-themselves-in-the-foot" worthy error is vanishingly small.

As opposed to --for illustrative purposes-- the vanishingly small number of people who know how to block IP addresses and manage to get their site to disappear from Google's listings?


A few years ago, I did pretty much the same thing myself. Thankfully the late summer was our slow season and the site recovered pretty quickly from my bone-headed move, but the split second after I realized what I'd done was bone-chilling.

I think just about everyone has thought at some point that they understood how something worked, only to have had things go pear-shaped on them.

The lesson: people are not fully knowledgeable about everything, even the smart and talented ones.

Perhaps I'm making an error assuming that a website as influential as Hacker News has a "real live Webmaster" to do things like write robots.txt files.

I alluded to some of the ways that I've seen people shoot themselves in the foot in a blog post a few years ago: http://www.mattcutts.com/blog/the-web-is-a-fuzz-test-patch-y...

"You would not believe the sort of weird, random, ill-formed stuff that some people put up on the web: everything from tables nested to infinity and beyond, to web documents with a filetype of exe, to executables returned as text documents. In a 1996 paper titled "An Investigation of Documents from the World Wide Web," Inktomi Eric Brewer and colleagues discovered that over 40% of web pages had at least one syntax error".

We can often figure out the intent of the site owner, but mistakes do happen.

The number of webpages with HTML that's just plain wrong (and renders fine!) is staggering. I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.

If you're writing HTML, you should be validating it: http://validator.w3.org/

Google.com has 39 errors and 2 warnings. Among other things, they don't close their body or html tags.

Is there any real downside to having syntax errors?

The downside is maintainability. If your website follows the rules, you can be pretty confident that any weird behaviour you see is a problem with the browser (which is additional context you can use when googling for a solution). If your website requires browsers to quietly patch it into a working state, you have no guarantees that they'll all do it the same way and you'll probably spend a bunch of time working around the differing behaviour.

Obviously, that's not a problem if you already know exactly how different browsers will treat your code, or you're using parsing errors so elemental that they must be patched up identically for the page to work. For example, on the Google homepage, they don't escape ampersands that appear in URLs (like href="http://example.com/?foo=bar&baz=qux"; the & should be &amp;). That's a syntax error, but one that maybe 80% of the web commits, so any browser that couldn't handle it wouldn't be very useful.
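For anyone generating links from code, the ampersand case is easy to get right. A small sketch with Python's standard `html.escape`; the `link_tag` helper and the example URL are just illustrations:

```python
from html import escape


def link_tag(url, text):
    """Build an <a> element with the URL and link text escaped for HTML."""
    # quote=True also escapes double quotes, which matters inside href="...".
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(text))


print(link_tag("http://example.com/?foo=bar&baz=qux", "example"))
# prints: <a href="http://example.com/?foo=bar&amp;baz=qux">example</a>
```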

Particularly before HTML5.

In Google's case, neglecting to close the tags is intentional, for performance. See http://code.google.com/intl/fr-FR/speed/articles/optimizing-...

It's interesting that you apparently actually checked a validator to get the error count, and yet the two things you cite as errors are not errors, have never been errors, and are not listed in the errors returned by the validator. Both opening and closing tags for the html, body, and head elements are optional, in all versions of HTML that I am aware of (outside of XHTML, which has never been seriously used on the open web as it isn't supported by IE pre-9). There is a tag reported unclosed by the validator, but that's the center tag.

Anyhow, one downside to having syntax errors might be that parsers which aren't as clever as those in web browsers, and which haven't caught up with the HTML5 parser standard, might choke on your page. This means that crawlers and other software that might try to extract semantic information (like microformat/microdata parsers) might not be able to parse your page. Google probably doesn't need to worry about this too much; there's no real benefit they get from having anyone crawl or extract information from their home page, and there is significant benefit from reducing the number of bytes as much as possible while still remaining compatible with all common web browsers.

I really wish that HTML5 would stop calling many of these problems "errors." They are really more like warnings in any other compiler. There is well-defined, sensible behavior for them specified in the standard. There is no real guesswork being made on the part of the parser, in which the user's intentions are unclear and the parser just needs to make an arbitrary choice and keep going (except for the unclosed center tag, because unclosed tags for anything but the few valid ones can indicate that someone made a mistake in authoring). Many of the "errors" are stylistic warnings, saying that you should use CSS instead of the older presentational attributes, but all of the presentational attributes are still defined and still will be indefinitely, as no one can remove support for them without breaking the web.

For the Google homepage, every byte counts. I'm not surprised.

The introduction of the menu bar and other javascripty-ness demonstrates this is less the case than it used to be.

Not really, but there are many benefits to keeping your HTML clean. Google seems to just use whatever works, in order to support all kinds of ancient browsers.

It looks like Google Front page developers simply don't care about HTML compliance.

There is no reason to allow most of these errors other than coding sloppiness.


> I often wonder what the web would be like if web browsers threw an error upon encountering a syntax error rather than making a best effort to render.

The web would have died in stillbirth and it would never have grown to where it is now.

"Be generous in what you accept" (part of Postel's Law) is a cornerstone of what made the internet great.

XHTML had a "die upon failure" mode, and it has died. Why do you think XHTML was abandoned and lots of people are using HTML5 now?

"...everything from tables nested to infinity..."

The irony of that statement on Hacker News is pretty amazing. Have you looked at how the threads are rendered on this page? It is tables all the way down.

Considering it's Google and we're talking about almost the whole population of the Earth, the vanishingly small percentage of the entire population of the planet would still be at least hundreds of thousands of people.

I do understand the thought but I think it is not a good gesture to do. You could always cap crawl-delay at a reasonable maximum and additionally allow people to fix mistakes through the webmaster tools (eg if they told your bots to stay away for a long time but in the meantime want to revert that).

Maybe instead that hostload could be parsed from robots.txt? It sure seems like the better mechanic to tweak for load issues (while traffic/bandwidth issues are still unresolved).

Matt, how does a sitemap fit into this? If I'm not mistaken, you can suggest some refresh rate there, too. Do you take that into account?

Could google vary the crawling rate on each site and see what effects that has on response times, and develop an algorithm to adjust crawl speed so as not to affect site performance too much? If google starts crawling a site and notices sequential crawl requests are answered in .5s w/ .1s stddev and it starts crawling with 10 parallel connections and the answers are 2s w/ 1s stddev, clearly that's a problem because user experience for real people will be impacted. Maybe google could automatically email webmaster@ and notify them of performance issues it sees when crawling.
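Roughly, the feedback loop I have in mind could look like the sketch below: compare latency at the current parallelism against a sequential baseline, back off hard when the site slows down, and probe upward slowly otherwise. The 1.5x slowdown limit, the cap of 10 connections, and the function name are all made-up parameters, not anything Google has described.

```python
from statistics import mean


def next_concurrency(baseline_latencies, current_latencies, current_concurrency,
                     slowdown_limit=1.5, max_concurrency=10):
    """Pick the number of parallel connections for the next crawl window."""
    baseline = mean(baseline_latencies)   # latency seen when crawling sequentially
    observed = mean(current_latencies)    # latency seen at the current parallelism
    if observed > baseline * slowdown_limit:
        # The site is visibly slower under our load: halve the parallelism.
        return max(1, current_concurrency // 2)
    # The site is coping: add one connection, up to the cap.
    return min(max_concurrency, current_concurrency + 1)


# 0.5s responses sequentially, 2s responses with 10 connections: back off.
print(next_concurrency([0.4, 0.5, 0.6], [1.0, 2.0, 3.0], 10))
```

A real crawler would presumably also look at variance and error rates, but mean latency alone already catches the 0.5s to 2s case above.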

Another thing that might help google is for them to announce and support some meta tag that would allow site owners (or web app devs) to declare how likely a page is to change in the future. Google could store that with the page metadata and when crawling a site for updates, particularly when rate limited via webmaster tools, it could first crawl those pages most likely to have changed. Forum/discussion sites could add the meta tags to older threads (particularly once they're no longer open for comments) announcing to google that those thread pages are unlikely to change in the future. For sites with lots of old threads (or lots of pages generated from data stored in a DB and not all of which can be cached), that sort of feature would help the site during google crawls and would help google keep more recent pages up to date without crawling entire sites.

> declare how likely a page is to change in the future

I believe you can do that using a sitemap.xml
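The sitemap protocol has a `<changefreq>` element for this, with values like `hourly` and `never`. It is only a hint, so a crawler may ignore it. A rough sketch of generating one with Python's standard library; the URLs below are placeholders, not real HN pages:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # changefreq is a hint to crawlers, not a command.
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap([
    ("https://example.com/newest", "always"),        # front page, changes constantly
    ("https://example.com/old-thread-1", "never"),   # closed thread, archived
]))
```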

Have you considered putting a caching reverse proxy in front of the arc app to keep the backend from having to render all of the old pages?

It seems like the only dynamic element of old articles is the "$x days ago" bit and that'd be pretty easy to turn into something static by instead just putting in timestamps in the actual HTML and using Javascript to transform them into how many hours / days ago they were. Then the crawlers would just be pulling out cached, pre-rendered HTML.

There's an example of doing such with nginx here:


With that you'd just have to send out the HTTP header from the arc app saying that current articles expire immediately, and old ones don't.
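As a sketch of the header logic only (in Python rather than arc, and with a made-up 14-day cutoff for what counts as an "old" article):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff after which a thread is treated as archived.
ARCHIVE_AFTER = timedelta(days=14)


def cache_control_for(posted_at, now):
    """Return a Cache-Control value based on how old the article is."""
    if now - posted_at >= ARCHIVE_AFTER:
        # Old article: let the proxy keep it for a year.
        return "public, max-age=31536000"
    # Current article: expires immediately, so the proxy goes back to the app.
    return "public, max-age=0"


now = datetime.now(timezone.utc)
print(cache_control_for(now - timedelta(days=20), now))
print(cache_control_for(now - timedelta(hours=1), now))
```

The proxy then serves archived pages without touching the backend, which is exactly the traffic crawlers generate.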

I believe Rtm has already set one up.

The conspicuous lack of a "Server:" header inclines me to believe that that's probably not the case (most web servers set one indicating the server software and version). Here are the headers that HN sends out from an old post (20 days ago):

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=utf-8
  Cache-Control: private
  Connection: close
  Cache-Control: max-age=0

My favorite part of HN's headers: the lines are separated by naked LFs instead of CRLF, in violation of the HTTP spec

This is a common violation that everyone accepts. It's definitely done by 'bad' clients - not sure how often servers send bare LF.

(I used to telnet to port 80 for testing, and type GET / HTTP/1.0 <enter> <enter>, and that should be LF on Linux & Mac)
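The HTTP specs themselves recommend that tolerant parsers treat a bare LF as a line terminator, which is why this works in practice. A toy sketch of that leniency (not taken from any real client):

```python
def parse_header_block(raw):
    """Split the header block of a raw HTTP message into lines.

    Accepts both CRLF and bare LF line endings and stops at the first
    empty line, which marks the end of the headers.
    """
    lines = []
    for line in raw.split(b"\n"):
        line = line.rstrip(b"\r")   # drop the CR if the sender used CRLF
        if not line:
            break                   # blank line: the body starts after this
        lines.append(line.decode("iso-8859-1"))
    return lines


lf_only = b"HTTP/1.1 200 OK\nContent-Type: text/html; charset=utf-8\n\nbody"
crlf = b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n\r\nbody"
print(parse_header_block(lf_only) == parse_header_block(crlf))
```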

You don't have a problem with one of the most trafficked sites for programming/web startup-related news implementing HTTP incorrectly?

Do you ignore whether your HTML is valid just because the browser rendered it correctly?


I've got real work to do. Making a validator happy is fake work.

By ensuring that your pages are valid, you make it ever so much more likely that you will not have to scramble around wasting time at a most inopportune time when the new version of a browser comes out which handles your non-standards compliant tag soup differently than the current version of the browser.

So, do you want to pay the price upfront when you can plan for it or afterwards when the fix must be done immediately because customers are complaining?

In my experience, I've had to scramble to fix browser compatibility issues every bit as often with 100% standards-compliant code, as I have had to with some incredibly laughably bad HTML.

I'd much rather pay the exact price later, than an inflated price now.

Some of us actually care about interoperability, maintainability and writing good code in general as opposed to just cowboying stuff together as quickly as possible

I used to preach the same thing.

Then, working at a startup taught me that it's not black and white. Several quotes come to mind, but Voltaire's is my favorite:

"The perfect is the enemy of the good."

Having worked at startups with both cowboys with a "get 'er done" attitude and people who actually care about software craftsmanship, I'll take the latter any day.

Cowboys may get things "done" quickly, but that doesn't help when things are subtly broken, have interoperability problems, or are nearly impossible to extend without breaking.

Why don't you bother doing your real work right the first time? As long as there's a well defined spec, you might as well follow it instead of being creative and original when it comes to implementing standards.

You don't know that everyone accepts it. Even if they did, it doesn't make it right.

I submitted a patch for this in the pecl_http PHP library:


We use varnish for caching and check the useragent for requests.

If the cache has a copy of an article that is a few hours old it will just give that version to Googlebot while if it thinks a human is requesting the page then it will go to the backend and fetch the latest version.
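The decision itself is simple. Here it is as a Python sketch rather than our actual varnish config; the three-hour limit, the marker list, and the function names are stand-ins. Keep in mind the User-Agent header is trivially spoofed, so this only decides freshness, nothing security-related.

```python
# Hypothetical values for illustration only.
BOT_MARKERS = ("googlebot",)
BOT_MAX_AGE_SECONDS = 3 * 3600


def choose_response(user_agent, cached_body, cached_at, now, fetch_backend):
    """Serve a slightly stale cached copy to crawlers, fresh pages to humans."""
    is_bot = any(marker in user_agent.lower() for marker in BOT_MARKERS)
    fresh_enough = (cached_body is not None
                    and now - cached_at <= BOT_MAX_AGE_SECONDS)
    if is_bot and fresh_enough:
        return cached_body       # crawler: a few hours old is fine
    return fetch_backend()       # human, or the cache is too old: hit the backend


page = choose_response(
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
    "cached page", cached_at=0, now=3600,
    fetch_backend=lambda: "fresh page",
)
print(page)
```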


+1 for varnish. It's stupidly[1] fast and there shouldn't be much trickery required to deflect most of HN's traffic (e.g. ~10 sec expiry for "live" pages, infinite expiry for archived pages).

[1] 15k reqs/sec on a moderate box

Gotcha--thanks, Paul. I'm about to get on a plane, but we'll get this figured out where we're not sending as much hostload toward HN.

If only we could all get our Google woes fixed in such a manner.

seriously. i can't get a hospital to show up in google maps... no human for me to talk to. HN is number 4 instead of number 1 on google page one, and they're right on it.

This is why I don't understand the anti-promotion crowd who think promoting oneself and building an audience is a bad thing. Having the implicit threat of an audience you can address is a major lever to getting decent service nowadays.

I'm certainly not against promoting myself where I think some attention is merited, but ultimately that kind of thing is close to a zero sum game in that the amount of 'famous' people is fairly limited.

Which hospital? Seems like this thread has at least one pair of google-eyeballs looking at it.

i'd rather not drop the name here. but it's a verified listing and i've used the "report a problem" 3 times now. If they're not checking that... :(

Bad PR always seems to get instant action - it's more damage control than anything else.

If you have an audience, you have PR power.

You can fix this yourself.

In the lower-right corner of Google Maps there is a tiny link that says, "Edit in Google Map Maker". Click this link and you can edit Google Maps. Your edits get sent to Google and they'll approve/deny it in typically a few days.

it's a verified listing (do you know how hard it is to convince the IT dept to take their automated phone system offline so i could verify a google maps listing? It was nuts, and no, google didn't offer the postcard method) so i don't see why i have to enter the same info again, but i did.

the listing only shows up if you type the exact name of the hospital into the search bar, which is useless.

Does that mean at Google you can manually set the system to treat different sites differently?

I don't mean the ranking but other aspects - like you guys blacklisted some domains which produce low quality content in wholesale. (I don't know if the algorithm was tweaked to detect and filter such sources or if it was a manual thing.)

> Does that mean at Google you can manually set the system to treat different sites differently?

Webmaster Tools has a crawl rate slider which operates on a site-by-site basis, and that's existed for quite a while now.

If you're asking if they can manually boost a site's ranking, hopefully that isn't what's being suggested.

Could you use some sort of sitemap or other way to provide the data to Google that isn't so damaging to site performance? Or in Google Webmaster tools turn down the rate of crawling?

Just realized that this could be a problem for lots of sites, and I'm curious as to what the best solution is, since not everyone has Matt Cutts reading their site and helping out.

We do have a self-service tool in webmaster tools for people who prefer to be crawled slower.

Do you support/respect the "crawl-delay" directive?


Nope. As mentioned above, apparently Google thinks people will "shoot themselves in the foot" with the crawl-delay directive, while they won't with Google's special interface (which requires registering and logging in).

I can't imagine that they are just guessing about this. I'm sure someone tried implementing it and was horrified at the actual results before they gave up on it.

What more can you expect from the World's Largest Adware Company?

Seriously, one thing about Google is that they seem to really like ensuring people are logged on, preferably at all times. Fortunately recent changes to Google Apps (promoting apps user accounts to full Google accounts) have made this more complex on my side and probably degraded the level of actionable info they can get out of it.

reddit had the same problem. We set up a separate server just for the google crawler with its own copy of the database, so that the queries for old pages didn't slow down everyone else.

Seeing threads like this reminds me that HN is still a pretty tight-knit community of real people doing real things.

It's good to see this stuff sometimes. Thanks, Matt!


And then reading some of the other threads on this topic is a bit...something.

Guys, can you calm the conspiracy theory nonsense a bit? Please?

If you're not on this site very much, you might not realize that Matt pops into almost every thread where google is doing something strange regardless of who they're doing it to, and tries to help figure out what is happening. This isn't HN getting some sort of preferential treatment, this is just the effect of having a userbase full of hackers.

You'd see the same type of thing on /. years ago if you frequented it enough.

This is nothing new. This is what a good community looks like. Everybody relax.

Honestly if you read the things that Matt and Pierre have said, they just looked at "freshness" (I believe that is what it is called), and inferred that PG had blocked their crawlers.

This is all stuff you can get from within google webmaster tools (which isn't some secret whoooo insider google thing. It's something they offer to everybody, and it's just like analytics.)

OH! Wait! I mean (hold on, let me spin up my google conspiracy theory generator): thehackernews.com has more ads on it so google is intentionally tweaking their algo to serve that page at a higher point than the real HN because of ads!


C'mon, guys, look at their user pages. They're both just active users of the site trying to help out.

"This isn't HN getting some sort of preferential treatment, this is just the effect of having a userbase full of hackers"

Of course it's preferential treatment. And if you scan the last month or two of Matt's comments they are general in nature and not specific as in:

"I think I know what the problem is; we're detecting HN as a dead page. It's unclear whether this happened on the HN side or on Google's side, but I'm pinging the right people to ask whether we can get this fixed pretty quickly."

You don't think "pinging the right people" and "get this fixed pretty quickly" is preferential treatment?

He's done the same thing for nearly everyone that's asked about something Google-related here. So yes, everyone on HN gets special treatment.

Answering people's individual questions doesn't scale to the entire Internet, so Google really has no choice but to address problems on a case-by-case basis. In this case, Matt reads HN and personally wants to solve the problem. That's the only way Google could possibly work, so that's how they do it.

The preferential treatment is kind of annoying TBH. I've seen sites disappear from Google's listing for months as a result of simple issues such as a change of domain name (with all the required redirects in place).

Yet here, a website owner is purposely blocking the crawler and they jump with solutions to try to fix the problem. Sigh.

All these years, you thought you were working with machines, and finally it turns out coding is a "people business". Did they change, or did you? Is quality less important than networking now because social networks make quality the equivalent of good networking? Or because of some corruption in the pure bootstrap capitalism of code=web=money=power? I don't know the answers to these questions, but I am concerned about being told to accept the output of raw logic and algorithms as holy writ, only being told a bit later that everything's negotiable if you have a personal relationship.

It really is but I was personally kind of glad not to see HN have that high of a ranking on Google for the term.

I remember when I first started visiting HN I saw all these smart people and the tight community and I was amazed that something that felt so close-knit and exclusive yet was still open could still exist these days.

I was a lurker for a long time before I actually signed up and participated because I honestly felt like I wasn't entitled to be part of "the group" and I should somehow earn my wings. Then in late 2010 I signed up but didn't submit for a bit and didn't join discussions. I still felt like I didn't have enough to offer. I now feel like I've somehow earned the right to be part of this community though in hindsight I'm quite embarrassed of my first few submissions.

So this story does have a point that I'm about to get to. I first heard of HN through an article in GQ and then forgot the link. I couldn't find the site again after searching Google for "Hacker News" as easily as I thought. This frustrated me slightly back then but now I think it's a good thing.

As the size of a community gets larger the quality of comments and submissions usually decreases. Letting people join HN freely and openly is a great thing but I fear that if it became a huge sensation then we'd be inundated by garbage submissions and comments way more frequently. I know about the post on how newbies often say HN is becoming Reddit and all that so I do try to remember that.

So the point is that not everyone respects communities like this, or is as thoughtful as I was about joining and about how they choose to interact on communities like HN, so I feel like maybe it's okay if Google isn't giving us the best ranking for certain terms. I mean, HN is still easy to find, just not that easy to stumble over.

Advantages of having a deep Google guy as a user :)

The question is, 'Was Matt Cutts ever THE Googleguy?'

   ( http://www.webmasterworld.com/profilev4.cgi?action=view&member=GoogleGuy )
I always wondered who it really was years back.

Yes, I'm GoogleGuy. Steven Levy documented that in his recent "In The Plex" book, so there's no harm in verifying it now.

Matt, when you have a free minute, can you take a look at this?


Sure, some indexing folks are discussing this now.

I appreciate it. I set up a 301 from dl.php?id=x to /myproduct/ for each of the ids last night as a last-ditch effort.

btw, turns out the redirect I set up two days ago had broken a couple of links, but not the ones I care about. Fixed now, though.

Has Google considered an indicator next to results for this type of case? A single red exclamation mark with further details when the mouse is over it, for example.

I'm aware of webmaster tools, but it seems not all webmasters are.

We do include some information in the search results for cases where our users could be harmed, e.g. if we detect malware or if we think that your site may have been hacked. We've been exploring other ways to alert site owners that there's something important affecting their site.

But people do choose to remove their sites, so we can't always tell between a mistake vs. someone who genuinely prefers not to be in Google's index.

Does it perhaps return something to the tune of "Unknown or expired link"? :-P

Nope, it looks like we haven't been able to fetch even the robots.txt file for a few days, so our system is trying to play it safe by assuming that HN is unreachable/dead.


what about the massive amount of duplicate content indexed via this domain: hackerne.ws ?

can you improve my site also Matt and fix the issue? i had the same problem some time ago - some wordpress plugin blocked googlebot unintentionally and i lost the ranking.

i am sure there are many folks here as well who had similar problems, why not help us all out ))

I never thought that Google Search would be so laughable.

Excellent contribution... Ugh, this is why I don't mind if HN is listed on Google or not.

So which part is laughable? The part where they're able to crawl the vast expanses of the web and return relevant results for the majority of their users? Or is it the part where they came out of nowhere to dominate search because they did it better than the rest?

Come on now, you can't be all things to all people. Google is far from perfect but for a lot of us it's much closer to perfect than the competition and they're constantly trying to improve it. Why don't you go ask Matt Cutts to fix whichever parts of it you think are laughable to your liking? He's been hanging around here and he doesn't seem shy about answering people's questions and concerns. I do doubt he'd give the time of day to a one sentence remark that adds nothing of value whatsoever to the larger discussion or any of its offshoots.

No, the part where instead of getting better, Google search keeps getting worse. The verbatim mode is finally there but much of the time when it mucks with my search terms it's wrong. The UI is also cluttered. They are doing a lot of things that do nothing to improve the quality of the search, so when I want to learn something, I often skip Google and go straight to Stack Overflow or Wikipedia. I did jump the gun on this issue, though...turns out some googlebot blocking was involved.

Apparently the page in the top spot now hasn't been updated since: 2011.07.08

It now only consists of ads, a twitter feed, and an "Abba-da-dabba-da-dabba-dabba Dat’s all folks!" line.

So what if Googlebot thought HN was dead? Why would it opt to show a "more dead" page in place of it?

I think you're fibbing.

The other page isn't dead, they can access it.

yes - it looks more like google had been changing their algorithm recently - in favor of older sites , versus the content -

and it didnt turn out quite well ...
