Tell HN: The front page of Hacker News has been deindexed from Google
66 points by Roedou 1545 days ago | 38 comments
You can confirm this by searching for 'hacker news' in Google; the #1 ranking URL is /newest, rather than the front page. This isn't term specific - the site doesn't appear for other terms that it usually ranks well for, such as "news.ycombinator.com" or "hn".

I've checked the usual technical reasons (html head canonical/robots meta tag, http headers, robots.txt issues) but I don't see anything untoward.
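
For anyone who wants to repeat the head and header checks, here is a rough sketch in Python using only the standard library. It runs against hard-coded sample markup and sample headers rather than the live site; the sample HTML, the example.com URLs, and the names `HeadDirectives` and `indexing_blockers` are all made up for illustration, and the list of signals it looks at is an assumption about what "the usual" checks are, not a complete audit.

```python
from html.parser import HTMLParser


class HeadDirectives(HTMLParser):
    """Collects robots-style meta tags and the canonical link from a page."""

    def __init__(self):
        super().__init__()
        self.meta_robots = {}   # e.g. {"googlebot": "noindex"}
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        rel = (attrs.get("rel") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            self.meta_robots[name] = (attrs.get("content") or "").lower()
        elif tag == "link" and rel == "canonical":
            self.canonical = attrs.get("href")


def indexing_blockers(html, headers, page_url):
    """Returns human-readable reasons the page might be kept out of an index."""
    parser = HeadDirectives()
    parser.feed(html)
    reasons = []
    for name, content in parser.meta_robots.items():
        if "noindex" in content:
            reasons.append(f"meta {name} says noindex")
    if parser.canonical and parser.canonical != page_url:
        reasons.append(f"canonical points elsewhere: {parser.canonical}")
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    if "noindex" in lowered.get("x-robots-tag", ""):
        reasons.append("X-Robots-Tag header says noindex")
    return reasons


# Hypothetical sample inputs, not taken from any real site.
clean_html = "<html><head><title>Example</title></head><body></body></html>"
blocked_html = (
    '<html><head><meta name="googlebot" content="noindex">'
    '<link rel="canonical" href="https://example.com/newest"></head></html>'
)

print(indexing_blockers(clean_html, {}, "https://example.com/"))
print(indexing_blockers(blocked_html, {"X-Robots-Tag": "noindex"}, "https://example.com/"))
```

An empty list only means none of these particular signals were found; it says nothing about crawl blocking by IP address or about a removal on the search engine's side.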

I'll keep looking into it, but I'm posting this here in case the admins/mods have made any changes recently that could have had an effect. There's a possibility that the URL has been removed by Google for some particular reason, though I can't think of many pages that deserve it less than HN.

I'll update this thread if I see anything, but hopefully someone else will post an answer before I figure it out....

It's not that PG has a grudge against Google (or vice versa) or anything like that. I believe that search engine bots crawl Hacker News hard enough that PG blocks most crawling by bots. In the case of Google, he does allow us to crawl from some IP addresses, but it's true that Google isn't able to crawl/index every page on Hacker News.

Here's a link where I answered the same question about three weeks ago: https://news.ycombinator.com/item?id=5837004 , so this isn't a new issue. In fact, PG has been blocking various bots since 2011 or so; https://news.ycombinator.com/item?id=3277661 is one of the original discussions about this.

And to show this isn't a Google-specific issue, note that Bing's #1 result for the search [hacker news] is a completely different site, thehackernews.com: http://www.bing.com/search?q=hacker+news

In general, I think PG's priority is to have a useful, interesting site for hackers. That takes precedence and is the reason why I believe PG blocks most bots: so that crawling doesn't overload the site.

Thanks for that Matt; I didn't see that recent post or your comment, so sorry for dragging you back here to repeat yourself.

Looks like I'm going to have to stop relying on searching 'hn' when using a different computer, and start typing in the full URL. First world problems are such a burden.

No worries at all. I don't think the HN thread from three weeks ago made it to the front page (I happened to see it while browsing on /newest). I figured someone would notice and ask about this, so I'm happy to have the chance to explain.

Hey Matt,

I'm sorry to reach out to you directly on a public forum like this, but my company's website encountered a major negative SEO attack last month and we were hit with a manual penalty by Google today. I thought you might be interested to hear about what happened, and of course I would like to resolve it, as I do my best to always keep my company's SEO efforts within Google's guidelines. Please reach out via email to me at mbrody@myclean.com if we can help each other fix this! Thanks again for everything you do to help make the web a better place, and I understand in advance if you're too busy to respond.

Best regards,

Mike B.

Don't apologize to just Matt; your pseudo-apology and better-sent-as-an-email question pissed me off. Why would you take up three extra lines with a BS platitude and a signature? Please keep personal requests for assistance to better-suited channels.

Just downvote and move along. Why the hostility?

Doesn't Googlebot respect Crawl-Delay in robots.txt? PG has set it to be 30 secs - https://news.ycombinator.com/robots.txt - which IMHO should not cause any load issues given the overall traffic profile of HN.
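
As a rough illustration of how a crawler that chooses to honour the directive could read it, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt text below is a made-up sample built around the 30-second figure, not a copy of the real HN file, and `ExampleBot` and the example.com URLs are hypothetical. Whether any particular crawler actually obeys the value is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration; NOT a copy of the real HN file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Seconds the file asks this (hypothetical) bot to wait between requests.
delay = parser.crawl_delay("ExampleBot")

# Whether the file permits fetching each path at all.
allowed_home = parser.can_fetch("ExampleBot", "https://example.com/")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private")

print(delay, allowed_home, allowed_private)
```

A polite crawler would sleep for `delay` seconds between requests; at 30 seconds that works out to at most 2,880 requests per day from that bot.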

Googlebot doesn't respect the crawl-delay setting in robots.txt. https://groups.google.com/forum/#!topic/Google_Webmaster_Hel...

As I understand it, the best way to lower the crawl rate is to log into Google Webmaster Tools and manually lower your crawl rate. The crawl rate delays expire every 90 days, so I set a calendar reminder to renew the crawl delay every 3 months. https://support.google.com/webmasters/answer/48620?hl=en

Mmm, seems kind of like a feature. In fact, maybe PG should robots.txt Google entirely. It seems like HN has been getting mentions in other media with increasing frequency. If you can't find the site just because Google doesn't list it, then I have to wonder what you are actually doing here. This wouldn't be the first way that HN sets a bar for new users either; the "Create Account" form is already hidden under "submit".

HNSearch works great for HN specific searches anyway.

This has happened before, and usually has a non-pitchforky reason (e.g. PG pulled it temporarily because of a network/server issue). I'm sure it will be back soon, and we will have a rather reasonable answer as to why. There are way too many Google employees who frequent and enjoy HN for it to be banned for some arbitrary reason.

And of course, the network has specific functions for censorship as required by child protection laws. "Just a network error" really doesn't guarantee that the network wasn't doing something nefarious itself.

If you are using DuckDuckGo, you can use the !hn bang to send your query to hnsearch.com

This is trivial to do in any modern browser without DDG.

I found this old thread, where pg had blocked most of the Google bots, and it caused Google to think the site was down:


Could be a similar issue? I'll take a look.

pg also commented that he doesn't want traffic from Google anyway: https://news.ycombinator.com/item?id=5808990

In which case, he should add: <meta name="googlebot" content="noindex"> to the html head of every page.

(I have to say, that's a smart way of avoiding any Eternal Septembering, but it'd be a shame. I often use Google to find old HN threads that I vaguely remember from months or years ago.)

you may want to consider using hnsearch.

(Google's search is often better for this purpose: it has features like synonym and typo fixing, and because it indexes entire pages it lets you match keywords across whole threads. hnsearch is myopic, looking only at individual posts.)


Even sites with good search functions are often still way outclassed by google search with "site:..."—and most sites don't have good search functions...

Matt Cutts browses this site. Maybe he knows the reason why?

The .org site hasn't been delisted, so it's obviously not based on content:


This is most likely the same reason digg's frontpage was deindexed. There's no "content" per se, it's just links. Someone will notice, add an exception, and all is well.

Unlike Digg, HN has a substantial amount of content in the comments pages though, which are heavily indexed.

Edit - All the comment pages are still indexed just fine. It's /only/ the front-page. Which, imo, doesn't really matter anyway.

That wasn't the reason for Digg's issue at all: Google had tried to manually deindex some pages from the site, but made a mistake and pulled the whole domain. They reincluded it shortly after.

The comments are the real content of this site.

Sounds like overaggressive spam detection.

This sounds like the case. Google is getting aggressive with its Panda updates, and as a previous commenter noted, the HN homepage is just links. Since that triggers Panda, it's a good bet that Google went a little overboard (not unprecedented).

> the HN homepage is just links. Since that triggers Panda

To be more specific, Panda is triggered by low quality/duplicate content. 'Penguin' is triggered by spammy/bad backlinks.

I'm not saying you're wrong (a page of links would look pretty low quality to google's algo), I just wanted to add on for clarity's sake.

Yes, I see where I was unclear. It's not the links themselves, but the lack of original, robust content.

The Panda update definitely pushed lazy/low-quality pages down the SERPs, but doesn't tend to deindex pages.

Also: while the page doesn't have any of its own unique content, it presumably still has high engagement and low bounce rate.

That's a good point. Hadn't thought the issue through that far.

Please don't post on HN to ask or tell us something (e.g. to ask us questions about Y Combinator, or to ask or complain about moderation). If you want to say something to us, please send it to info@ycombinator.com.


I too had noticed this. It's unfortunate because searching via Google with site:news.ycombinator.com in the query is much better than HN's own search when you have a good idea what you're looking for (spearfishing search vs BFS)

This isn't the first time this has happened, and I suspect that it won't be the last.

The pagerank has fallen from a 6 to a 3 as well.

Google is evil. Screw them. I refuse to use Google or their services. Make the switch. They deindex a lot of sites they don't agree with. Not saying that is the case here but they've been known to do it.

Link backing up your claims?

Switch to what?
