Digg banned from Google? (martinmacdonald.net)
96 points by trevin on March 20, 2013 | 73 comments



This has nothing to do with Reader. We were tackling a spammer and inadvertently took action on the root page of digg.com.

Here's the official statement from Google: "We're sorry about the inconvenience this morning to people trying to search for Digg. In the process of removing a spammy submitted link on Digg.com, we inadvertently applied the webspam action to the whole site. We're correcting this, and the fix should be deployed shortly."

From talking to the relevant engineer, I think digg.com should be fully back in our results within 15 minutes or so. After that, we'll be looking into what protections or process improvements would make this less likely to happen in the future.

Added: I believe Digg is fully back now.


If this happened to a less popular site, what chance would the site owner have of getting attention to the problem, and getting it fixed?


None.

Even if you know people within Google, there's so much fear of allegations of impropriety that employees are too afraid to even ask the appropriate team if there's a possible mistake that they should look at.


Hey relix, it took an unfortunate chain of corner cases for this to happen, and for this situation it was actually more likely for the corner cases to hit a larger site rather than a less popular site.

In general, when a member of the webspam team directly applies a manual webspam action against a site, we also drop a note to the site owner at http://google.com/webmasters/ . That helps the site owner tell whether something is going on with manual spam vs. just algorithmic ranking. Then any site can do a reconsideration request at the same place or post in our webmaster forum at https://productforums.google.com/forum/#!forum/webmasters .

People like to scrutinize Google, so I've noticed that "Google unfairly penalized me" blog posts make their way to us pretty often.


Hi Matt,

That doesn't match my experience. Could you explain the penalty against onlineslangdictionary.com?

Showing citations of slang use[1] caused what appears to be an algorithmic penalty. The correlation between showing citations and the presence of a penalty is apparent:

http://onlineslangdictionary.com/static/images/panda/overvie...

Missing from those 3 charts is the one showing that citations were once again removed over 120 days ago, yet the penalty remains. It would appear that the algorithmic penalty was turned into a manual penalty.

I've followed all procedures including those listed in your comment, without resolution.

[1] By citations of slang use, I mean short (1-3 sentence) attributed excerpts of published works, shown within the appropriate definitions, as evidence of the correctness of those definitions. All citations were gathered and posted by hand.


Hi Walter, the only manual webspam action I see regarding onlineslangdictionary.com is from several years ago (are you familiar with a company called Web Build Pages or someone named Jim Boykin?), but that no longer applies here.

You're affected by a couple algorithms in our general web ranking. The first is our page layout algorithm. See http://googlewebmastercentral.blogspot.com/2012/01/page-layo... or http://searchengineland.com/google-may-penalize-ad-heavy-pag... for more context on that. In particular, comparing a page like http://onlineslangdictionary.com/meaning-definition-of/compy to a page like http://www.urbandictionary.com/define.php?term=Pepperazzi... , your site has much more prominent ads above the fold compared to Urban Dictionary.

Your site is also affected by our Panda algorithm. Here's a blog post we wrote to give guidance to sites that are affected by Panda: http://googlewebmastercentral.blogspot.com/2011/05/more-guid...


Hi, this is Jim Boykin. I have no record of ever doing anything for onlineslangdictionary.com... but I guess it "no longer applies here". It's just strange that you'd associate me with this website... hmm, I wonder who else Google associates me with whom I've had nothing to do with...


P.S. One other quick thing. I saw you sending me tweets, but the tweets looked fairly repetitive, and you hadn't chosen a Twitter avatar. I get a lot of tweets from bots, and this looked fairly close to bot-like to me: https://twitter.com/mattcutts/status/315232934040846337/phot... That (plus the fact that the site had no current manual webspam actions, plus the fact that I wasn't sure what you meant by citations) meant that I didn't reply. Hope that helps.


Makes sense. All the tweets were by hand. I tried to tweet every weekday but missed some, then eventually gave up.

I just didn't know what to do upon getting no feedback from you guys after posting to the Google Webmaster forums, filing reconsideration requests, contacting friends at Google, posting to and commenting on Reddit about it, commenting on HN about it, posting to Facebook, blogging, and tweeting about it, and putting a yellow box at the top of all pages on the site mentioning the penalty and linking to a page with the details.

Thanks again for talking with me about this. (I'd still like to hear about Web Build Pages / Jim Boykin and the rest - https://news.ycombinator.com/item?id=5444996 ...)


After trying for 1 year and 11 months to get anyone from Google to talk to me about the penalty, I can't tell you how ecstatic I am that you responded. Thank you! I would have responded sooner but I wanted to deploy some changes to the site, and also I was floored by some of the things in your comment and didn't know quite how to respond.

I'm very happy to read that there's no current manual action against onlineslangdictionary.com.

Hi Walter, the only manual webspam action I see regarding onlineslangdictionary.com is from several years ago...

Oh? I've never received a notice about having a penalty or about a penalty being removed. When was that penalty in place?

(are you familiar with a company called Web Build Pages or someone named Jim Boykin?)

Nope. I hadn't heard of either until I read your comment. Why do you ask? Did he/they cause the manual penalty against my site and then cause the manual penalty to be removed? How?

You're affected by a couple algorithms in our general web ranking. The first is our page layout algorithm. See <snip>. In particular, comparing a page like http://onlineslangdictionary.com/meaning-definition-of/compy to a page like http://www.urbandictionary.com/define.php?term=Pepperazzi..., your site has much more prominent ads above the fold compared to Urban Dictionary.

Interesting! I had read that post on the Webmaster Central Blog before, but never even considered that the layout algorithm was penalizing my site, for a few reasons:

1. The upper leaderboard ad + side wide-skyscraper ad combination is so commonly used everywhere on the web.

2. I removed the leaderboard ad from the entire site from 11 May 2011 through 31 August 2011 and found it had no effect on the site's ranking. (It also had no effect on user behavior, such as bounce rate or time-on-site.)

3. My site isn't one of the sites "that go much further to load the top of the page with ads to an excessive degree or that make it hard to find the actual original content on the page."

I have removed all advertising from onlineslangdictionary.com, and also removed the yellow box at the top informing visitors why they no longer have access to the citations of slang use. (Better safe than sorry, I guess.) The page layout penalty should no longer be a problem.

(Since ads are no longer on my site, for reference, here are screenshots of those two URLs you linked, one from my site and one from urbandictionary.com:

    OSD: http://onlineslangdictionary.com/static/images/layout-algorithm/2013-03-24-osd.com-atf-def-of-compy.png
    UD: http://onlineslangdictionary.com/static/images/layout-algorithm/2013-03-24-ud.com-atf-def-of-pepperazzi.png
Interestingly, with my screen size, Urban Dictionary has more pixels above the fold dedicated to ads.)

Your site is also affected by our Panda algorithm. Here's a blog post we wrote to give guidance to sites that are affected by Panda: http://googlewebmastercentral.blogspot.com/2011/05/more-guid...

I've read that article in the past, and gave it a re-read. I understand Panda is about penalizing low-quality sites.

High-quality dictionaries have citations of use from published sources. Citations prove the definitions are correct, provide real-world illustrations of proper usage, are just plain interesting, etc. Penalizing a dictionary for showing citations is like penalizing Wikipedia for having lots of numbered sentence fragments at the bottom of their articles. That's how they prove that their claims are factual.

onlineslangdictionary.com had around 5,000 citations of slang use, collected and added by hand. The presence/absence of citations on the site is the only thing I've found to correlate with the presence/absence of a penalty (http://onlineslangdictionary.com/static/images/panda/overvie...).

Due to Panda, they were removed for non-authenticated users (including Googlebot) most recently starting 16 November 2012. They have been unavailable to authenticated users since 8 March 2013. Because of a coding mistake on my part, they were visible for between 3 and 4 hours on 12 March 2013. (Basically, I accidentally inverted the logic of the 'if' statement that checks whether citations need to be removed - the answer should always be "yes" - causing the code to not remove them.) I fixed the bug as soon as I noticed it, and filed an updated reconsideration request.
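(For illustration only - the site's actual code isn't public, so the names below are hypothetical - the inverted-condition bug described above is roughly this shape:)

  import re

  def strip_citations(html):
      # Toy helper: drop anything wrapped in a hypothetical <cite>...</cite> block.
      return re.sub(r"<cite>.*?</cite>", "", html, flags=re.DOTALL)

  def render_definition(html):
      remove_citations = True  # as noted above, the answer should always be "yes" right now
      # Buggy version (live for a few hours): the test read "if not remove_citations:",
      # so the citations were never stripped. Corrected version:
      if remove_citations:
          html = strip_citations(html)
      return html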

The citations are gone. All content on the site is 100% original. It's got the only real, free slang thesaurus on the web. There are other unique features. I don't know what Panda would be penalizing the site for.

I started The Online Slang Dictionary in 1996, and have been working on it full-time for the past 6 years. My goal is to create the "Wiktionary of Slang" - not a flash-in-the-pan, made-for-AdSense site. I was delighted with the site's ranking between the penalties: from 8 days after I first removed the citations until 3 days after I put them back on the site (13 November 2011 until 9 October 2012).

It would be awesome to have the chance to once again compete on a level playing field with other slang websites. I'd love to have the time to implement the new features I've been dying to add, rather than spending time (over a year now) trying to guess why Google is penalizing the site and fixing those guesses - since my data shows that site growth is impossible with the penalties in place.


Thanks Matt, that's good to know.


You bet. I thought https://news.ycombinator.com/item?id=5422855 was a pretty good example of that. I'd never heard of Amy Wilentz, but her blog post made it our way via several channels and made the front page of HN today. Admittedly it was a bad fact to get wrong, but I think it still demonstrates that blogging about Google doing something suboptimally can get a fair amount of attention.


A similar issue (removed from search results, but remained in index) happened to my site due to a DNS issue last year, and I had to perform some magic steps+ in webmaster tools and we re-appeared in search within a day or so.

+ no magic, I just don't remember what exactly I had to do.


From the Star Trek TOS episode "The Ultimate Computer" - Captain James T. Kirk: "And how long will it be before all of us simply get in the way?"


So was it the well-known missing-robots.txt-kills-a-site problem (as described in the original article), or a manual action gone astray?

When you're looking at protections, you need to ask what happens if this is a mom-and-pop site and not a well-known SV company with high-level Google contacts.


I can't speak for Google or Matt Cutts. But it sounds like he said it was a Google bug.


Hey Matt, if it's a problem with one link alone, then take action on that one page alone - that could be considered a proportionate action; otherwise it comes across as overreaction. In the name of quality, one should not take a whole site down for a day. This action of yours is completely wrong.


Why not take the 'nuclear launch' method? Every time someone's about to de-list [large number] of pages, a second human has to confirm it.

(This is presuming humans were involved at all)
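(A hedged sketch of that suggestion - the threshold, names, and workflow here are entirely made up, not anything Google has described:)

  def apply_webspam_action(domain, pages_affected, approvals):
      # Hypothetical guard: any action touching more than a threshold of pages
      # needs sign-off from two different humans before it runs.
      THRESHOLD = 10000
      if pages_affected > THRESHOLD and len(set(approvals)) < 2:
          raise PermissionError("large de-listing requires a second reviewer")
      print("applying action to", domain)  # proceed with the action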


Is this a stunt to put Digg.com on the map again?


I would imagine it was more like:

Apply actions to (tick one) :

[] Individual Link [] Individual Page [] Individual Keyword [] Whole Domain

[SUBMIT]

And the manual reviewer was drunk on Google Juice :)


I see so many flaws in this process, especially for small business owners who may not be able to get the attention that Digg just did.


@Matt How can 1 bad link on a site take down a whole site? Is this normal?


Doing a site:digg.com/news/ search on Bing shows a lot of pages like these:

http://digg.com/news/gaming/ing_bank_i_ilanlar

and even more duplicate tag and rss pages for "site:digg.com/tag/" and "site:digg.com rss".

These /news/ pages 302 redirect to many different sites (some are bound to contain spam or be of lower quality).

Using 302 redirects for these links is bad practice. Some link shorteners (ab)use 302 Found (instead of 301 Moved Permanently) to hoard content that doesn't belong to them. The content behind these links can't be found on digg.com, so Digg too is using the wrong redirect and associating itself with all the pages it links to.
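(A quick way to see which status code one of those links answers with - a sketch assuming the third-party requests library, using the example URL from above:)

  import requests

  # Fetch the Digg short link without following the redirect,
  # so the 301-vs-302 distinction stays visible.
  resp = requests.get("http://digg.com/news/gaming/ing_bank_i_ilanlar",
                      allow_redirects=False, timeout=10)
  print(resp.status_code)               # 302 at the time of this thread
  print(resp.headers.get("Location"))   # where the link actually points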

Besides that, Digg.com acts like a single-page web app for most of its content. There are no discussion pages or detail pages for the stories. The content that does appear is a near-duplicate of other content on the web, especially for popular stories, where many blogs just copy the title and the first intro paragraph.


I saw a similar pattern in the links that remain. It's sad to see the link authority of a site get plundered, and it seems like something inside Google's indexer realized that was going on and deleted it.

As a search engine, it's something you have to do if you want to consider site authority in your ranking model.


It's funny, but I've been pretty careful to add nocache/noindex headers for many of the pages on the site I work on when a given page is no longer active... Also, I would think that Digg might consider a rel=nofollow for any links that aren't "popular" yet. That may be the best way to handle the link spam.
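(Roughly what that suggestion could look like - a hypothetical sketch, not Digg's actual templating code:)

  def render_story_link(url, title, is_popular):
      # Only let links that have already proven popular pass link equity;
      # everything else gets rel="nofollow".
      rel = "" if is_popular else ' rel="nofollow"'
      return '<a href="{0}"{1}>{2}</a>'.format(url, rel, title)

  print(render_story_link("http://example.com/story", "Example story", False))
  # -> <a href="http://example.com/story" rel="nofollow">Example story</a>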


The article is down, but here is the text from Google's cache:

    --snip-- 
Something interesting has just come across one of my networks (hat tip to datadial): just a few days after Digg announced that they are building a replacement for the much-loved Google Reader, they have (coincidentally?) disappeared from the primary Google index.

[image unavailable]

Is it an SEO penalty for links? That seems to be the number one reason that brands are getting booted from Google's index these days… Some conspiracy theorists will no doubt be proclaiming it's something to do with their announcement to build a replica of the now-defunct Google Reader, but personally I really can't see that having any effect. Could it?

Doing a site: search for Digg certainly demonstrates that they are no longer in the index:

[image unavailable]

It's likely (only if it is link-based, however) that it would be down to what individuals who submit content do after the fact - i.e. sending spammy links at their posts to try to build PageRank and create "authority", which they then pass back to their own sites. Digg has long been listed in every "linkwheel" seller's handbook, and if that is the reason, then what does it mean for every community site on the internet?

Will we have to manually approve all new links soon at this rate? Come on Google - WTF - let the internet know what you're doing, please.

    --end snip--
Without the images, maybe I am missing some context. But it seems hyperbolic, considering digg is not serving a robots.txt anymore [1]. It is probably just a blunder on Digg's part.

[1]: At the time of my comment, digg was serving an xhtml document with status 404 at /robots.txt. Now it appears to be a valid robots file.

PS: I am enjoying the irony of having a copy of the article, even though the site is down, because of Google's cache. Need to pontificate about how Google is potentially evil but can't keep your server running? Don't worry, people can read it via Google's cache.


I just tried digg.com/robots.txt, and for me it says:

  User-agent: *
  Disallow:

Which explains everything without conspiracy theories


A "Disallow:" (without anything after it) is saying allow all robots access to the entire server. (If you need something beyond my word on the topic, there's a thorough explanation at http://www.robotstxt.org/robotstxt.html)


Yes, this seems to explain it.

http://www.robotstxt.org/robotstxt.html


No, it doesn't. Your linked example contains:

  User-agent: *
  Disallow: /
While the (new) digg.com/robots.txt contains:

  User-agent: *
  Disallow:
Those are very different. The former essentially disallows bots from crawling the entire site, while the latter disallows nothing - effectively allowing everything. The syntax is unusual, granted, for historical reasons.
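(You can verify the difference with Python's standard-library robots.txt parser - a small check, nothing Digg-specific about it:)

  from urllib.robotparser import RobotFileParser

  def allowed(robots_txt, url, agent="*"):
      parser = RobotFileParser()
      parser.parse(robots_txt.splitlines())
      return parser.can_fetch(agent, url)

  print(allowed("User-agent: *\nDisallow:", "http://digg.com/"))    # True:  empty Disallow blocks nothing
  print(allowed("User-agent: *\nDisallow: /", "http://digg.com/"))  # False: "/" blocks the whole site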


Why don't people just post things via a Coral Cache link when it's not a major site (that should obviously be able to handle the traffic)?


Because it isn't proper etiquette; it is their content and they should get the views for it.

Some people may not be able to handle the traffic, and caches are always nice, but pageviews are also a nice thing when you run a website, even if they don't generate any revenue.


Ok. True enough. That said, having a Coral Cache link auto-generated for each article[1] has been a requested feature of HN for a while. E.g.

  1. Article Title (example.com) [Coral Cache Link]
That way it can be cached prior to going down.

[1] At least, each article that hits the front page.


It's possible (though not necessarily likely) that they finally got banned for their toolbar, which is basically designed to scam Google.


Are they still using it? The current Digg is a whole new site, rewritten from the ground up by Betaworks. I'm not seeing the toolbar at all.


Hmm it appears they killed it in 2010, though I still see some browser plugins available.


I'm not familiar with their toolbar. How does it scam Google?


It makes it look like you're still on Digg even when you click a link to go to another site. So essentially it makes it look like people are spending a lot more time on Digg than they actually are.


Google "digggate".


What year are you posting from?


I'd wait for an official Digg statement, because Digg may have requested to be removed from Google search results (the net result would be the same).


From my experience, when Google de-indexes a site, they also suppress any PageRank the Google Toolbar would have shown for it; that doesn't appear to be the case here - digg.com is still PR8.

Having toolbar PageRank[1] and yet no cached page[2] is not something I've seen before.

[1] http://toolbarqueries.google.com/tbr?features=Rank&sourc...

[2] https://www.google.com/search?q=cache:digg.com


Your experience may be limited, because this is very common. PageRank and indexed status operate independently of each other; it's not uncommon to see a site that was deindexed still maintain PR for quite some time (and often, indefinitely). However, if your site is deindexed, PR means "nothing" because your links no longer provide juice.

Valid PR while being deindexed is one of the (many) tricks that Google has added in the last couple years to try to reduce the usefulness of getting PR for blackhat purposes


Maybe they nuked themselves:

http://digg.com/robots.txt


I'm pretty sure getting 404 from robots.txt will not get you nuked from Google :)


It's cloaked so only Google can see it - try changing your user agent to match Googlebot's.


Using the googlebot User-Agent string gives me:

  User-agent: *
  Disallow:
This robots.txt should allow all bots to crawl the entire website. However, I think Google also penalizes websites that serve different content to Googlebot than to non-bot user agents.
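(One rough way to check for that kind of cloaking yourself - a sketch using only the standard library; error handling for a 404 is omitted:)

  from urllib.request import Request, urlopen

  def fetch_robots(user_agent):
      req = Request("http://digg.com/robots.txt",
                    headers={"User-Agent": user_agent})
      return urlopen(req, timeout=10).read().decode("utf-8", "replace")

  # Compare what an ordinary browser UA and a Googlebot UA are served.
  normal = fetch_robots("Mozilla/5.0")
  bot = fetch_robots("Googlebot/2.1 (+http://www.google.com/bot.html)")
  print("identical" if normal == bot else "different - possible cloaking")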


This is the same as what I see, using a standard-issue Firefox UA string.


they updated it


That's weird because http://blog.digg.com/robots.txt does not cloak. Why cloak one and leave the other?

  Sitemap: http://blog.digg.com/sitemap-pages.xml
  Sitemap: http://blog.digg.com/sitemap1.xml

  User-agent: *
  Disallow: /private
  Disallow: /random
  Disallow: /day
  Crawl-delay: 1


You do not need a robots.txt or a sitemap XML to be listed.


  curl -I http://digg.com/robots.txt
  HTTP/1.1 404 Not Found
  Content-length: 1443
  Content-Type: text/html
  Date: Wed, 20 Mar 2013 16:27:13 GMT
  Server: nginx/1.1.19
  Connection: keep-alive
An HTTP 404 for robots.txt should not be an issue - but maybe there was a robots.txt with something else in it instead, and now it's gone.

But even so, if there was an all-blocking robots.txt, site:digg.com would still show the URLs of digg.com, as crawling is optional for indexing. (Maybe they used the undocumented Noindex: / directive, but I doubt it.)

If you want to achieve such a clean removal, in most cases you must request a complete removal via Google Webmaster Tools.

So yeah, somewhere, someone might have screwed up - most likely on Digg's side, maybe on Google's side, maybe a combination.

UPDATE: now it's

    curl -i http://digg.com/robots.txt
    HTTP/1.1 200 OK
    Cache-Control: public
    Content-Type: text/plain
    Date: Wed, 20 Mar 2013 17:05:29 GMT
    Etag: "c47ccf1a49c24cc5842430aa75c72ef491292412"
    Last-Modified: Wed, 20 Mar 2013 16:51:48 GMT
    Server: TornadoServer/2.2
    Content-Length: 24
    Connection: keep-alive
    
    User-agent: * 
    Disallow:
which is the best-practice robots.txt (allow all - or rather: "hey, I have a valid robots.txt file, but the second statement is not a valid disallow directive, so this means allow all, as there is no other disallow directive"). (Note: I once coded https://npmjs.org/package/robotstxt which tried to reimplement Google's robots.txt spec https://developers.google.com/webmasters/control-crawl-index... so I have some experience reading robots.txt files.)

Still, it wasn't a robots.txt issue.

Hmmm, just a hypothesis: maybe, just maybe, someone thought it was a good idea to remove www.digg.com from Google via Google Webmaster Tools (their main domain is digg.com, not www.digg.com, and they definitely had some www.digg.com URLs indexed - even Bing has some of these URLs indexed: http://www.bing.com/search?q=site%3Awww.digg.com&go=&... ). But if Google is (maybe) set to treat www.digg.com and digg.com as the same (via the settings panel in Google Webmaster Tools), then removing www.digg.com could result in removing digg.com as well (something similar happened to a client of mine years ago). So it could be the issue, or it could not be - we would need more data (access to GWT) to verify this.


It's up for me with:

  User-agent: *
  Disallow:


I noticed that too; they are also missing a sitemap.xml.


As another data point: I just clicked that link in Chrome, and I get the robots.txt with * Disallow like others are saying. Weird that some people are getting a 404.


It seems to be fixed now:

    User-agent: *
    Disallow:


Getting a 404 on that file.


Google admitted the mistake: “We’re sorry about the inconvenience this morning to people trying to search for Digg. In the process of removing a spammy link on Digg.com, we inadvertently applied the webspam action to the whole site. We’re correcting this, and the fix should be deployed shortly.”

http://thenextweb.com/google/2013/03/20/google-seems-to-have...


Google has rules concerning SEO. A warning for the rest of us not to try to cheat our way to a better ranking.


Their robots.txt file clearly asks to not be crawled at all.

  User-agent: *
  Disallow:
http://digg.com/robots.txt


Incorrect. To configure your robots.txt so the site is not crawled at all, use:

  User-agent: *
  Disallow: /
Allow indexing of everything with:

  User-agent: *
  Disallow:
It seems they are at this very moment struggling to change things.


You are totally right. Disregard everything I said.


Disallow nothing -> allow everything.


Betteridge's law of headlines strikes again.


http://digg.com/robots.txt <-- 404 Not Found

I know a robots.txt 404 shouldn't de-list you, but I would have expected digg to have one?

Maybe they requested de-listing and removed their robots.txt? Who knows!


There is one there now:

  User-agent: *
  Disallow:


Yup - it's back now. But it 100% didn't exist when I posted earlier :)


http://digg.com/robots.txt looks fine to me


[deleted]


They are in Yahoo

EDIT: I fucking hate HN's 'Delete' function.


What I love most about this situation is the multiple theories, when in essence Google puts it very simply: WE F*ED UP. Sorry!


If you assume conspiracy, then you're most likely ignoring a simple fact of life in the technology world: someone screwed up.


It doesn't matter if they were banned - they didn't care about SEO in the first place.

All those old URLs they had for years? All disappeared as they wanted to start fresh.


And nothing of value was lost.



