

Ask HN: What do you do when your website seems to be penalised by Google? - mmavnn

I have a personal blog (mostly dev related); it's been going for a while. On a couple of specialist subjects (F# type providers being the main example) some of the posts are reasonably popular and linked to by many other people. Although it's a small site, on these subjects it tends to show up on the first page of Bing, DuckDuckGo, etc. for searches like "Type provider tutorial", and right at the top if you use a specific phrase (like the title of my most popular post, "Type Providers from the Ground Up").

Google hates it. Basically, however specific the query, my blog never turns up unless you actually put the base URL into your query. Ironically, plenty of spam sites' copies of the posts appear quite high in the search results.

What do you do in these kinds of situations? I've done no SEO beyond writing content, so I'm pretty sure I've used no "black hat" techniques. I've no ads, no duplicate content. Google Webmaster Tools claims the site is not blacklisted and that there is nothing wrong with it.

It feels wrong and possibly pointless to start again several years down the line with a new URL just because Google doesn't seem to like the current one; but on the other hand, the lack of organic search results will always be a limit on the readership. For a personal blog this is irritating and disappointing; if I was freelance or this was my company blog, it would be a real and immediate financial hit.

Thoughts or advice for people facing this situation?
======
patio11
This is a hard thing to debug from outside the Googleplex, but you are
currently serving a canonical tag:

    <link rel="canonical" href="http://blog.mavnn.co.uk/type-providers-from-the-ground-up">

for a URL which cannot possibly return an HTTP 200. (It 301s to a URL with a /
on the end.)

This combination _could_ cause Google to conclude that you have no page which
requires inclusion in their main index.
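
This kind of mismatch is easy to introduce when a template builds the canonical href from a post slug. A minimal sketch of the fix in Python (the helper name is mine, not anything from the blog's actual stack): emit the canonical URL in the slashed form the server actually answers with a 200, instead of the form that 301s.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_with_trailing_slash(url: str) -> str:
    """Append a trailing slash to the path so the canonical URL matches
    the one that returns HTTP 200, not one that 301s to the slashed form."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Leave the root path and file-like paths (e.g. /feed.xml) alone.
    if path and not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"
    return urlunsplit((scheme, netloc, path, query, fragment))

print(canonical_with_trailing_slash(
    "http://blog.mavnn.co.uk/type-providers-from-the-ground-up"))
# -> http://blog.mavnn.co.uk/type-providers-from-the-ground-up/
```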

~~~
ovi256
Is it irony or something else that a system of supposedly vast computing power
and learning (and certain real-world power via its distribution of search
riches) is broken by something as tiny as a missing slash?

~~~
thaumaturgy
URLs are really, really, really, _really_ hard to get right on a large scale.
For a side project I've written my own crawler/indexer and I try to do
deduplication where possible, and the reality is that:

    domain.com/this-page-here

can serve entirely different content from

    domain.com/this-page-here/

depending on the server (and application) configuration.

Pretty much the only way to 100% reliably deduplicate URLs is to look at their
content, and somehow magically compare content that can change from page load
to page load -- which is a whole other problem.
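
To make the content-comparison approach concrete, here is a naive fingerprinting sketch (my own illustration, not from any real crawler): strip markup, collapse whitespace, and hash. Note it is defeated by exactly the problem mentioned above -- any timestamp or per-load variation changes the hash.

```python
import hashlib
import re

def content_fingerprint(html_text: str) -> str:
    """Crude near-duplicate signal: strip tags, collapse whitespace,
    lowercase, and hash. Matching fingerprints strongly suggest the
    same document served under two different URLs."""
    text = re.sub(r"<[^>]+>", " ", html_text)       # drop markup
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = content_fingerprint("<html><body><h1>Hello</h1> world </body></html>")
b = content_fingerprint("<html><body>\n<h1>Hello</h1>\nworld\n</body></html>")
print(a == b)  # -> True: same page, different whitespace/markup layout
```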

~~~
jfoster
Exactly. It's so difficult to get URLs "right", and that's quite non-obvious
until you do something like writing a crawler.

Another example is whether foo.com/bar is the same as foo.com/BAR. Usually
yes, but it's entirely possible that they will serve different content.

Also, which URL parameters should be disregarded, and which should be
considered important? A crawler must do quite a bit of nontrivial page
introspection in order to figure out the answer to that all on its own.

Often pages that are essentially the same will be a bit different. Timestamps
and time-sensitive data (eg. listings on a marketplace) will trip you up,
here.
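
The hard part is deciding which parameters matter, and a static denylist like the sketch below (parameter names illustrative, not exhaustive) is only a first approximation -- a real crawler has to learn this per site by comparing page content.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Analytics tags that (almost) never change the page content.
IGNORABLE = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
             "utm_content", "fbclid", "gclid"}

def strip_ignorable_params(url: str) -> str:
    """Drop known-irrelevant query parameters before deduplicating URLs."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in IGNORABLE]
    return urlunsplit((scheme, netloc, path, urlencode(kept), fragment))

print(strip_ignorable_params("https://foo.com/bar?id=7&utm_source=rss"))
# -> https://foo.com/bar?id=7
```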

------
mtbcoder
Regarding the spam sites, in your RSS feed, you are publishing your full
articles. More than likely, the scraper sites are pulling directly from these
feeds, publishing quickly and getting Googlebot to see the content before it
hits your site (thus receiving attribution). I would suggest:

1) Summaries only in RSS feeds.

2) Throttle the RSS feed back by several hours so that your latest article is not listed immediately.

3) Upon publishing, immediately link to the article via all of your social media outlets.

4) When internally linking within articles, use full URL paths, not relative ones. (If the spam sites are pulling your content directly and not cleaning it up, you may be able to get a link back to your site from the scraped content.)
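
Point 4 can be sketched in a couple of lines of Python (the base URL is the example blog; the helper name is hypothetical): rewrite relative hrefs to absolute ones before the post goes into the feed, so scraped copies keep pointing home.

```python
from urllib.parse import urljoin

# Base URL of the post as it appears in the feed (illustrative).
BASE = "http://blog.mavnn.co.uk/type-providers-from-the-ground-up/"

def absolutize(href: str) -> str:
    """Resolve a relative internal link against the post's own URL."""
    return urljoin(BASE, href)

print(absolutize("../another-post/"))
# -> http://blog.mavnn.co.uk/another-post/
```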

When publishing, timing is everything. Just my $0.02 based on my own
experiences dealing with spam sites.

On a side note, even though we are in the age of HTML5, I would still suggest
sticking with one H1 tag per page, if possible.

~~~
andrewstuart2
This sucks. I'm not saying it's not the answer, but the fact that you have to
castrate your feed because spam sites can actually get "SEO credit" for your
content just sucks. I always loved RSS feeds that published the full text,
because I could read whole articles without having to click through.

The semantic web could fix this _a little_ by making it easier to scrape with
the <article> tag, but publishing content is exactly what RSS was meant to do.

I wish Google would (if even possible) find a better way to fix this. In the
same way that there's an actual argument _against_ single page apps because
"they can't be indexed" or "SEO, man." Discoverability shouldn't be holding
back progress (in an ideal world, I know). Rather, indexing should adapt to
new technology so that we can make a better web that's still discoverable by
users.

~~~
mtbcoder
I agree, but with sites that do not have much authority (aka PageRank), it's
difficult to determine attribution when scraped content is coming online just
as quickly as an original post. Googlebot will generally hit a site several
times a day, but if it's hitting the spammer's site first or if the spammer
site has more authority, it's a long uphill climb to get things turned around.

I should also point out that this is just one thing to consider amongst the
other points already made by others.

------
dredge
Early last year Google asked people to report such problems. You probably
won't see any direct impact from doing so, but there's no harm in trying.

The form is still live at least: [http://searchengineland.com/google-scraper-tool-185532](http://searchengineland.com/google-scraper-tool-185532)

------
JVerstry
First, make sure your sitemap.xml is exhaustive. Then check the number of
indexed pages in Google Webmaster Tools (after a couple of days, if you had to
update your sitemap). If few pages are indexed, go through the checklist at
[https://ligatures.net/content/expertise/site-not-indexing-ch...](https://ligatures.net/content/expertise/site-not-indexing-checklist.html)
to fix possible issues. If your pages are still not displayed in search
results, then you are likely another victim of a well-known chicken-and-egg
problem for content-based sites: you need links for ranking and you need
ranking to attract links. Yet most niches are saturated, and you are likely
crushed by the competition. The only efficient way out is to obtain dofollow
backlinks from sites/blogs which are:

i) not under your control (a forum profile link, for example, is under your control);
ii) editorially reviewed;
iii) on topics relevant to yours;
iv) already trusted by Google;
v) popular.

Other links won't make much of a difference.
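
For reference, a minimal sitemap.xml has this shape (URL and date here are illustrative, matching the example blog):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://blog.mavnn.co.uk/type-providers-from-the-ground-up/</loc>
    <lastmod>2014-12-01</lastmod>
  </url>
</urlset>
```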

------
borrame
I have the same situation: no manual actions in Webmaster Tools, but my
website has been wiped from search since 5th Dec 2014.

It appears if you search for domain.com, and if you search for
site:domain.com, but if you search for just "domain" it doesn't appear, even
though the website had been well indexed for more than 10 years.

I'm very worried, because I can't contact Google to find out what happened:
as I said, I don't have any manual penalties to request reconsideration for,
and I'm losing my own users who search for the domain.

------
rfergie
Have you registered your site with Google Webmaster Tools
([https://www.google.com/webmasters/tools/](https://www.google.com/webmasters/tools/))?

That would be a good first step for seeing if Google is having specific
trouble with anything on your site.

~~~
Sealy
I had a similar problem, and it turned out that Google Webmaster Tools has an
interesting 'feature'.

Registering a domain, say for example
[http://www.cnn.com](http://www.cnn.com), does not necessarily cover the case
where you have forced your server to use
[http://cnn.com](http://cnn.com) (without the www. before the domain name).

If you have set up your web server like that, make sure you add the non-www.
domain to Webmaster Tools as well. For some strange reason, certain
subdomains, alongside certain settings in the Webmaster Tools area, will tell
Google that the two sites are totally different.
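
One common fix is to collapse the two variants at the server, so only one hostname ever serves content. A sketch, assuming nginx (adapt to your own server; hostnames are the example from above):

```nginx
server {
    listen 80;
    server_name www.cnn.com;               # the variant you don't want indexed
    return 301 http://cnn.com$request_uri; # permanent redirect to the canonical host
}
```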

------
kevinbowman
Presumably all of the spam copies are actually harming your rankings as well,
as Google will see those as duplicate content.

(Note that I'm guessing here; I have no particular authority in this area.)

------
franze
Hi, first things first.

Do not use the word "penalty". You want to show up for a certain query in
Google; you think you should show up for it; you don't show up for it.

That is the issue, nothing else.

You formulated a hypothesis: you think your site is "penalised by Google".

OK, go to Google Webmaster Tools and verify

    - http://blog.mavnn.co.uk/
    - http://mavnn.co.uk/
    - http://www.mavnn.co.uk/

Check the "Site Messages" section for all of these domain variations; if you
have a penalty, there will be a message. Pro tip: only ever talk about a
"penalty" if you get a message that says you have a "penalty". (Everything
else is just SEO b#llshit talk.)

My guess: there won't be such a message.

OK, the second guess is the wrong canonical, which you already fixed. But if
you point a canonical at a URL that 301-redirects back to the original URL,
Google will basically ignore the canonical. The canonical could have been the
issue, but as
[https://www.google.com/search?q=site%3Ablog.mavnn.co.uk%2Fty...](https://www.google.com/search?q=site%3Ablog.mavnn.co.uk%2Ftype-providers-from-the-ground-up&pws=0&hl=en)
shows, the page has been indexed (without the trailing slash), so I doubt it.

OK, let's look at anything that might be unusual about your site.

For example, your start page,
[http://www.mavnn.co.uk/](http://www.mavnn.co.uk/), basically consists of a
"Hello World", a link to a broken URL, and a link to a piece of duplicated
text.

"Hello World" is a typical "this server was just set up, nothing to see here"
message.

Your start page is not indexed (see:
site:[http://www.mavnn.co.uk/](http://www.mavnn.co.uk/)).

That is strange. Let's formulate a hypothesis.

Your start page communicates a basic "this server has just been set up,
nothing to see here" message. Google has a) no interest in indexing such
websites, and b) webmasters are pretty pissed if their newly set up servers
are indexed in this state, as newly set up servers are usually not very
secure yet.

Additionally, Google treats common subdomains, i.e. blog.example.com, as part
of the main site and not as independent web properties (yeah, they figured
that one out quite some time ago).

Hypothesis: you communicate via your start page that your website is not yet
properly set up, and that is why it does not send you traffic.

My bet is that this is the case. Why? Because your start page is the one
thing that is definitely not ... like other websites out there.

Fix it. Two possibilities: [http://www.mavnn.co.uk/](http://www.mavnn.co.uk/)
-> HTTP 301 -> [http://blog.mavnn.co.uk/](http://blog.mavnn.co.uk/)

Or you set up a proper start page: some text about what this is, and some
links to your other resources.

After you have done one of these, do a fetch-as-Googlebot (via Google
Webmaster Tools) and click the "Submit to index" button.

Wait two days.

If it still doesn't show up, test another hypothesis, or post in the Google
webmaster forum; the Google folks actually dig these kinds of errors.

