Hacker News
Ask HN: What do you do when your website seems to be penalised by Google?
41 points by mmavnn on Jan 12, 2015 | 20 comments
I have a personal blog (mostly dev related); it's been going for a while. On a couple of specialist subjects (F# type providers being the main example) some of the posts are reasonably popular and linked to by many other people. Although it's a small site, on these subjects it tends to show up on the first page on Bing, Duck Duck Go, etc. for searches like "Type provider tutorial", and right at the top if you use a specific phrase (like the title of my most popular post, "Type Providers from the Ground Up").

Google hates it. Basically, however specific the query, my blog never turns up unless you actually put the base url into your query. Ironically, plenty of spam sites' copies of the posts appear quite high in the search results.

What do you do in these types of situations? I've done no SEO beyond writing content, so I'm pretty sure I've used no "black hat" techniques. I've no ads, no duplicate content. Google webmaster tools claims the site is not blacklisted and that there is nothing wrong with it.

It feels wrong and possibly pointless to start again several years down the line with a new url just because Google doesn't seem to like the current one; but on the other hand, the lack of organic search results will always be a limit on the readership. For a personal blog this is irritating and disappointing - if I was freelance or this was my company blog, it would be a real and immediate financial hit.

Thoughts or advice for people facing this situation?

This is a hard thing to debug from outside the Googleplex, but you are currently serving a canonical tag:

<link rel="canonical" href="http://blog.mavnn.co.uk/type-providers-from-the-ground-up">

for a URL which cannot possibly return an HTTP 200. (It 301s to a URL with a / on the end.)

This combination could cause Google to conclude that you have no page which requires inclusion in their main index.
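A rough way to check for this condition yourself, sketched in Python with only the standard library. The function names (`find_canonical`, `canonical_problem`, `head_status`) are made up for this illustration, and the check only covers the one thing described above: whether the canonical URL answers 200 directly rather than redirecting.

```python
import http.client
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urlsplit


class CanonicalParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


def head_status(url: str) -> int:
    """HEAD the URL and return its status. http.client never follows redirects."""
    parts = urlsplit(url)
    secure = parts.scheme == "https"
    conn_cls = http.client.HTTPSConnection if secure else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def canonical_problem(html: str, get_status: Callable[[str], int]) -> Optional[str]:
    """Describe what is wrong with the canonical tag, or return None if it is fine."""
    url = find_canonical(html)
    if url is None:
        return "no canonical tag found"
    status = get_status(url)
    if status != 200:
        return f"canonical {url} returned {status}, expected 200"
    return None
```

Calling `canonical_problem(page_html, head_status)` runs the check against the live site; passing a stub for `get_status` keeps it testable offline.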

Interesting; that's autogenerated by Octopress (or Jekyll). I'd never noticed it's missing the '/'.

Known Octopress issue, apparently: http://hackingoff.com/blog/octopress-default-seo-flaws/

Thanks, patio11

Is it irony or something else that a system of supposed vast computing power and learning (and certain real world power via its distribution of search riches) is broken by something as tiny as a missing slash?

URLs are really, really, really, really hard to get right on a large scale. For a side project I've written my own crawler/indexer and I try to do deduplication where possible, and the reality is that:

can serve entirely different content from

depending on the server (and application) configuration.

Pretty much the only way to 100% reliably deduplicate URLs is to look at their content, and somehow magically compare content that can change from page load to page load -- which is a whole other problem.

Exactly. It's so difficult to get URLs "right", and that's quite non-obvious until you do something like writing a crawler.

Another example is whether foo.com/bar is the same as foo.com/BAR. Usually yes, but it's entirely possible that they will serve different content.

Also, which URL parameters should be disregarded, and which should be considered important? A crawler must do quite a bit of nontrivial page introspection in order to figure out the answer to that all on its own.

Often pages that are essentially the same will be a bit different. Timestamps and time-sensitive data (eg. listings on a marketplace) will trip you up, here.
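To make the guesswork concrete, here is a minimal Python sketch of the kind of heuristic key a crawler might build. Every rule in it is an assumption that some servers will violate, which is exactly the problem: the name `dedup_key` and the `IGNORED_PARAMS` list are invented for this example, and a real crawler would still have to confirm a match by comparing content.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters assumed not to change the page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def dedup_key(url: str) -> str:
    """Build a heuristic key for URLs that *probably* serve the same page."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()          # host names are case-insensitive
    path = parts.path.rstrip("/") or "/"   # guess: a trailing slash does not matter
    path = path.lower()                    # guess: path case does not matter
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS       # guess: these parameters are noise
    )
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), netloc, path, urlencode(query), ""))
```

With these rules `http://foo.com/bar/` and `http://foo.com/BAR` collapse to the same key, which is wrong for any server that treats them as different pages.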

I wouldn't say the crawler is broken at all. It's picky, as it should be. A URL that ends with a / can be an entirely different web page from a URL that doesn't end with a /.

Also, you might ask why Google won't ignore the canonical URL if it's an invalid URL. Well, that's what you get with the canonical URL: you're explicitly telling Google this is the "real" URL of the web page. You can't have it both ways, and then complain Google is ignoring your canonical tag.

Well, if you give a canonical tag out, you're really taking on the responsibility of resolving different URLs, so you should make sure you do it right.

Regarding the spam sites, in your RSS feed, you are publishing your full articles. More than likely, the scraper sites are pulling directly from these feeds, publishing quickly and getting Googlebot to see the content before it hits your site (thus receiving attribution). I would suggest:

1) Summaries only in RSS feeds.
2) Throttle the RSS feed back by several hours so that your latest article is not listed immediately.
3) Upon publishing, immediately link to the article via all of your social media outlets.
4) When internally linking within articles, use full URL paths and not relative. (If the spam sites are directly pulling your content and not cleaning up, you may be able to get a link back to your site from the scraped content.)
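For point 4, a static site generator could rewrite links at build time. Below is a minimal Python sketch of the idea; the function name `absolutize_links` is invented here, and the regular expression only handles double-quoted `href` attributes, so treat it as an illustration rather than a robust HTML rewriter.

```python
import re
from urllib.parse import urljoin

# Matches double-quoted href attributes only (an assumption of this sketch).
HREF = re.compile(r'href="([^"]*)"')


def absolutize_links(html: str, page_url: str) -> str:
    """Rewrite every href in the HTML to an absolute URL based on page_url."""
    return HREF.sub(
        lambda match: 'href="%s"' % urljoin(page_url, match.group(1)),
        html,
    )
```

Links that are already absolute pass through `urljoin` unchanged, so running this over a whole post is safe for external links too.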

When publishing, timing is everything. Just my $0.02 based on my own experiences dealing with spam sites.

On a side note, even though we are in the age of HTML5, I would still suggest sticking with one H1 tag per page, if possible.

This sucks. I'm not saying it's not the answer, but the fact that you have to castrate your feed because spam sites can actually get "SEO credit" for your content just sucks. I always loved RSS feeds that published the full text, because I could read whole articles without having to click through.

Semantic web could fix this a little by making it easier to scrape with the <article> tag, but publishing content is exactly what RSS was meant to do.

I wish Google would (if even possible) find a better way to fix this. In the same way that there's an actual argument against single page apps because "they can't be indexed" or "SEO, man." Discoverability shouldn't be holding back progress (in an ideal world, I know). Rather, indexing should adapt to new technology so that we can make a better web that's still discoverable by users.

I agree, but with sites that do not have much authority (aka PageRank), it's difficult to determine attribution when scraped content is coming online just as quickly as an original post. Googlebot will generally hit a site several times a day, but if it's hitting the spammer's site first or if the spammer site has more authority, it's a long uphill climb to get things turned around.

I should also point out that this is just one thing to consider amongst the other points already made by others.

I must admit, I'm not desperate enough for new readers to hamstring my own publishing to my existing readers - so summary RSS feed is out for me.

I'd consider delaying RSS publication but that's actually very awkward as it's an Octopress website (i.e. created and pushed as a set of static pages, including things like the rss feed).

Early last year Google asked people to report such problems. You probably won't see any direct impact from doing so, but there's no harm in trying.

The form is still live at least: http://searchengineland.com/google-scraper-tool-185532

First, make sure your sitemap.xml is exhaustive. Then check the number of indexed pages in Google Webmaster Tools (after a couple of days, if you had to update your sitemap). If few pages are indexed, go through the checklist at https://ligatures.net/content/expertise/site-not-indexing-ch... to fix possible issues. If your pages are still not displayed in search results, then you are likely another victim of a well-known chicken-and-egg problem for content-based sites: you need links for ranking and you need ranking to attract links. Yet most niches are saturated and you are likely crushed by competition. The only efficient way out is to obtain dofollow backlinks from sites/blogs which are:

i) Not under your control (a forum profile link, for instance, is under your control)
ii) Editorially reviewed
iii) Relevant to your topics
iv) Already trusted by Google
v) Popular

Other links won't make much of a difference.
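The "exhaustive sitemap" step is easy to check mechanically. Here is a small Python sketch, assuming the sitemap follows the standard sitemaps.org 0.9 schema and that you can produce your own list of URLs that should be in it; the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_sitemap(xml_text, expected_urls):
    """Return the expected URLs that the sitemap does not list, sorted."""
    return sorted(set(expected_urls) - sitemap_urls(xml_text))
```

The comparison is an exact string match, so a URL listed without its trailing slash will show up as missing.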

I have the same situation: with no manual actions listed in Webmaster Tools, my website has been wiped from search since 5th Dec 2014.

It appears if you search domain.com and if you search site:domain.com, but if you search just "domain" it doesn't appear, even though the website has been well indexed for more than 10 years.

I'm very worried because I can't contact Google to find out what happened: as I said, I don't have any manual penalties to ask them to reconsider, and I'm losing my own users who search for the domain.

Have you registered your site with Google webmaster tools (https://www.google.com/webmasters/tools/)?

That would be a good first step for seeing if Google are having specific trouble with anything on your site

I had a similar problem and it turned out that google webmaster tools has an interesting 'feature'.

When you register a domain, say for example http://www.cnn.com, that does not necessarily register http://cnn.com (without the www. before the domain name), even if you have forced your server to use it.

If you have set up your web server like that, make sure you add the non www. domain to webmaster tools as well. For some strange reason, it seems some subdomains alongside certain settings in the webmaster tools area, will tell google that the two sites are totally different.

I have yes, no mention of any issues from the tools (it even claims to like the sitemap).

Presumably all of the spam copies are actually harming your rankings as well, as Google will see those as duplicate content.

(Note that I'm guessing here, I have no particular authority in the area)

hi, first things first

do not use the word "penalty" - you want to show up for a certain query in google, you think you should show up for it, you don't show up for it.

that is the issue, nothing else.

you formulated a hypothesis: you think your site is "penalised by Google"

ok, go to google webmaster tools and verify

  - http://blog.mavnn.co.uk/
  - http://mavnn.co.uk/
  - http://www.mavnn.co.uk/
check the "Site Messages" navigation point of all these domain variations, if you have a penalty, then there will be a message. pro tip: only ever talk about a "penalty" if you get a message that says you have a "penalty". (everything else is just SEO b#llshit talk)

my guess: there won't be such a message.

ok, the second guess is the wrong canonical, you already fixed that one. but: if you point a canonical to an HTTP 301 redirect, and the redirect points back to the original URL, google will basically ignore the canonical. the canonical could have been the issue, but as https://www.google.com/search?q=site%3Ablog.mavnn.co.uk%2Fty... has been indexed (without ending slash) i doubt it.

ok, let's look at anything that might be unusual about your site

i.e.: your start page http://www.mavnn.co.uk/

basically it consists of a "Hello World", a link to a broken URL and a link to a piece of duplicated text.

"Hello World" is a typical "this server was just set up, nothing to see here" message.

your start page is not indexed (see: site:http://www.mavnn.co.uk/ )

that is strange. let's formulate a hypothesis.

your start page communicates a basic "this server has just been set up, nothing to see here" message. a) google has no interest in indexing such websites b) webmasters are pretty pissed if their newly set up servers are indexed in this way, as newly set up servers are usually not very secure yet

additionally, google sees common subdomains, e.g. blog.example.com, as part of the main site and not as independent web properties (yeah, they figured that one out quite some time ago).

hypothesis: you communicate via your start page that your website is - probably - not yet set up, and that is why google does not send you traffic.

my bet is that this is the case. why? because your start page is the one thing that is definitely not ... like other websites out there.

fix it, two possibilities: http://www.mavnn.co.uk/ -> HTTP 301 -> http://blog.mavnn.co.uk/

or you set up a proper start page, with some text about what this is and some links to your other resources.

after you have done one of these, do a fetch-as-googlebot (via google webmaster tools) and click the "submit to index" button.

wait two days.

if that does not help, test another hypothesis or post in the google webmaster forum; the google guys actually dig these kinds of errors.

