This has nothing to do with Reader. We were tackling a spammer and inadvertently took action on the root page of digg.com.
Here's the official statement from Google: "We're sorry about the inconvenience this morning to people trying to search for Digg. In the process of removing a spammy submitted link on Digg.com, we inadvertently applied the webspam action to the whole site. We're correcting this, and the fix should be deployed shortly."
From talking to the relevant engineer, I think digg.com should be fully back in our results within 15 minutes or so. After that, we'll be looking into what protections or process improvements would make this less likely to happen in the future.
Even if you know people within Google, there's so much fear of allegations of impropriety that employees are too afraid to even ask the appropriate team if there's a possible mistake that they should look at.
Hey relix, it took an unfortunate chain of corner cases for this to happen, and for this situation it was actually more likely for the corner cases to hit a larger site rather than a less popular site.
In general, when a member of the webspam team directly applies a manual webspam action against a site, we also drop a note to the site owner at http://google.com/webmasters/ . That helps the site owner tell whether something is going on with manual spam vs. just algorithmic ranking. Then any site can do a reconsideration request at the same place or post in our webmaster forum at https://productforums.google.com/forum/#!forum/webmasters .
People like to scrutinize Google, so I've noticed that "Google unfairly penalized me" blog posts make their way to us pretty often.
That doesn't match my experience. Could you explain the penalty against onlineslangdictionary.com?
Showing citations of slang use[1] caused what appears to be an algorithmic penalty. The correlation between showing citations and the presence of a penalty is apparent:
[charts unavailable]
Missing from those 3 charts is the one showing that citations were once again removed over 120 days ago, yet the penalty remains. It would appear that the algorithmic penalty was turned into a manual penalty.
I've followed all procedures including those listed in your comment, without resolution.
[1] By citations of slang use, I mean short (1-3 sentence) attributed excerpts of published works, shown within the appropriate definitions, as evidence of the correctness of those definitions. All citations were gathered and posted by hand.
Hi Walter, the only manual webspam action I see regarding onlineslangdictionary.com is from several years ago (are you familiar with a company called Web Build Pages or someone named Jim Boykin?), but that no longer applies here.
Hi, this is Jim Boykin. I have no record of ever doing anything for onlineslangdictionary.com... but I guess it "no longer applies here". It's just strange that you'd associate me with this website. Hmm, I wonder who else Google has associated me with whom I've had nothing to do with...
P.S. One other quick thing. I saw you sending me tweets, but the tweets looked fairly repetitive, and you hadn't chosen a Twitter avatar. I get a lot of tweets from bots, and this looked fairly close to bot-like to me: https://twitter.com/mattcutts/status/315232934040846337/phot... That (plus the fact that the site had no current manual webspam actions, plus the fact that I wasn't sure what you meant by citations) meant that I didn't reply. Hope that helps.
Makes sense. All the tweets were by hand. I tried to tweet every weekday but missed some, then eventually gave up.
I just didn't know what to do upon getting no feedback from you guys after posting to the Google Webmaster forums, filing reconsideration requests, contacting friends at Google, posting to and commenting on Reddit about it, commenting on HN about it, posting to Facebook, blogging, and tweeting about it, and putting a yellow box at the top of all pages on the site mentioning the penalty and linking to a page with the details.
After trying for 1 year and 11 months to get anyone from Google to talk to me about the penalty, I can't tell you how ecstatic I am that you responded. Thank you! I would have responded sooner but I wanted to deploy some changes to the site, and also I was floored by some of the things in your comment and didn't know quite how to respond.
I'm very happy to read that there's no current manual action against onlineslangdictionary.com.
Hi Walter, the only manual webspam action I see regarding onlineslangdictionary.com is from several years ago...
Oh? I've never received a notice about having a penalty or about a penalty being removed. When was that penalty in place?
(are you familiar with a company called Web Build Pages or someone named Jim Boykin?)
Nope. I hadn't heard of either until I read your comment. Why do you ask? Did he/they cause the manual penalty against my site and then cause the manual penalty to be removed? How?
Interesting! I had read that post on the Webmaster Central Blog before, but never even considered that the layout algorithm was penalizing my site, for a few reasons:
1. The upper leaderboard ad + sidebar skyscraper ad combination is so commonly used everywhere on the web.
2. I removed the leaderboard ad from the entire site from 11 May 2011 through 31 August 2011 and found it had no effect on the site's ranking. (It also had no effect on user behavior, such as bounce rate or time-on-site.)
3. My site isn't one of the sites "that go much further to load the top of the page with ads to an excessive degree or that make it hard to find the actual original content on the page."
I have removed all advertising from onlineslangdictionary.com, and also removed the yellow box at the top informing visitors why they no longer have access to the citations of slang use. (Better safe than sorry, I guess.) The page layout penalty should no longer be a problem.
(Since ads are no longer on my site, for reference, here are screenshots of those two URLs you linked, one from my site and one from urbandictionary.com:
[screenshots unavailable])
I've read that article in the past, and gave it a re-read. I understand Panda is about penalizing low-quality sites.
High-quality dictionaries have citations of use from published sources. Citations prove the definitions are correct, provide real-world illustrations of proper usage, are just plain interesting, etc. Penalizing a dictionary for showing citations is like penalizing Wikipedia for having lots of numbered sentence fragments at the bottom of their articles. That's how they prove that their claims are factual.
onlineslangdictionary.com had around 5,000 citations of slang use, collected and added by hand. The presence/absence of citations on the site is the only thing I've found to correlate with the presence/absence of a penalty (http://onlineslangdictionary.com/static/images/panda/overvie...).
Due to Panda, they were removed for non-authenticated users (including Googlebot) most recently starting 16 November 2012. They have been unavailable to authenticated users starting 8 March 2013. Because of a coding mistake on my part, they were visible for between 3 and 4 hours on 12 March 2013. (Basically: I accidentally inverted the logic of the 'if' statement that checks whether citations need to be removed (the answer should always be "yes") causing the code to not remove citations.) I fixed the bug as soon as I noticed it, and filed an updated reconsideration request.
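For what it's worth, here is a minimal sketch of the kind of inverted check described above. All function and variable names are invented for illustration; this is not the site's actual code:

    # Hypothetical reconstruction of the bug: citations must always be
    # stripped before rendering, but the test was accidentally negated.
    def render_definition(html, user_is_authenticated):
        if not citations_must_be_removed(user_is_authenticated):  # BUG: stray "not"
            html = strip_citations(html)
        return html

    # Corrected version: drop the stray "not".
    def render_definition_fixed(html, user_is_authenticated):
        if citations_must_be_removed(user_is_authenticated):
            html = strip_citations(html)
        return html

    def citations_must_be_removed(user_is_authenticated):
        # Per the comment above, as of March 2013 the answer is always "yes".
        return True

    def strip_citations(html):
        # Placeholder for whatever removes citation markup from the page.
        return html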
The citations are gone. All content on the site is 100% original. It's got the only real, free slang thesaurus on the web. There are other unique features. I don't know what Panda would be penalizing the site for.
I started The Online Slang Dictionary in 1996, and have been working on it full-time for the past 6 years. My goal is to create the "Wiktionary of Slang" - not a flash-in-the-pan made-for-AdSense site. I was delighted with the site's ranking Between The Penalties: from 8 days after I first removed the citations until 3 days after I put them back on the site (13 November 2011 until 9 October 2012).
It would be awesome to have the chance to once again compete on a level playing field with other slang websites. I'd love to have the time to implement the new features I've been dying to add, rather than spending time (over a year now) trying to guess why Google is penalizing the site and fixing those guesses - since my data shows that site growth is impossible with the penalties in place.
You bet. I thought https://news.ycombinator.com/item?id=5422855 was a pretty good example of that. I'd never heard of Amy Wilentz, but her blog post made its way to us via several channels and made the front page of HN today. Admittedly it was a bad fact to get wrong, but I think it still demonstrates that blogging about Google doing something suboptimally can get a fair amount of attention.
A similar issue (removed from search results, but remained in index) happened to my site due to a DNS issue last year, and I had to perform some magic steps+ in webmaster tools and we re-appeared in search within a day or so.
+ no magic, I just don't remember what exactly I had to do.
So was it the well-known missing-robots.txt-kills-a-site problem (as described in the original article), or a manual action gone astray?
When you're looking at protections, you need to ask what happens if this is a mom-and-pop site and not a well-known SV company with high-level Google contacts.
Hey Matt, if it's a problem with one link alone, then just take action on that one page alone; that would be a proportionate action. Otherwise it comes across as overreaction. In the name of quality, one should not take a whole site down for a day. This action of yours is completely wrong.
And there are even more duplicate tag and RSS pages for "site:digg.com/tag/" and "site:digg.com rss".
These /news/ pages 302-redirect to many different sites (some are bound to contain spam or be of lower quality).
302 redirects for these links are bad practice. Some link shorteners (ab)use 302 Found (instead of 301 Moved Permanently) to hoard content that doesn't belong to them. The content for these links can't be found on digg.com, so they too use the wrong redirect and associate themselves with all the pages they link to.
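The difference shows up right in the response headers. As an illustration (this /news/ path and the destination URL are made up):

    curl -I http://digg.com/news/example-story
    HTTP/1.1 302 Found
    Location: http://example.com/the-actual-story

A 302 tells search engines the content still nominally lives at the digg.com URL, so the crawler keeps associating it with digg.com; an "HTTP/1.1 301 Moved Permanently" response would instead hand the page's identity (and the credit) over to the destination.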
Besides that: Digg.com acts like a single-page web app for most of its content. There are no discussion pages or detail pages for the stories. The content that does appear is a near duplicate of other content on the web, especially with popular stories, where many blogs just copy the title and the first intro paragraph.
I saw a similar pattern in the links that remain. It's sad to see the link authority of a site get plundered, and it seems like something inside Google's indexer realized that was going on and deleted it.
As a search engine, it's something you have to do if you want to consider site authority in your ranking model.
It's funny, but I've been pretty careful to add the nocache, noindex headers on many of the pages of the site I work on where a given page is no longer active... Also, I would think that Digg would maybe consider a rel=nofollow for any links that aren't "popular" yet. That may be the best way to handle the link spam.
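For reference, a sketch of both mechanisms (the markup and URL are illustrative; note that "noarchive" is the standard spelling for suppressing the cached copy):

    <!-- In the head of an inactive page: keep it out of the index and cache -->
    <meta name="robots" content="noindex, noarchive">

    <!-- On a not-yet-popular submission: don't pass link authority -->
    <a href="http://example.com/submitted-story" rel="nofollow">Story title</a>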
The article is down, but here is the text from Google's cache:
--snip--
Something interesting has just come across one of my networks (hat tip to datadial). Just a few days after Digg announced that they are building a replacement for the much-loved Google Reader, they have (coincidentally?) disappeared from the primary Google index.
[image unavailable]
Is it an SEO penalty for links? That seems to be the number one reason that brands are getting booted from Google's index these days… Some conspiracy theorists will no doubt be proclaiming it's something to do with their announcement to build a replica of the now-defunct Google Reader, but personally I really can't see that having any effect. Could it?
Doing a site: search for Digg certainly demonstrates that they are no longer in the index:
[image unavailable]
It's likely (only if it is link-based, however) that it would be down to what individuals who submit content do after the fact – i.e. sending spammy links at their posts to try to build the PageRank and create “authority” which they then pass back to their own sites. Digg has long been listed in every “linkwheel” seller's handbook, and if that is the reason, then what does it mean for every community site on the internet?
Will we have to manually approve all new links soon at this rate? Come on Google – WTF – let the internet know what you're doing, please.
--end snip--
Without the images, maybe I am missing some context. But it seems hyperbolic, considering digg is not serving a robots.txt anymore [1]. It is probably just a blunder on Digg's part.
[1]: At the time of my comment, digg was serving an xhtml document with status 404 at /robots.txt. Now it appears to be a valid robots file.
PS: I am enjoying the irony of having a copy of the article, even though the site is down, because of Google's cache. Need to pontificate about how Google is potentially evil but can't keep your server running? Don't worry, people can read it via Google's cache.
A "Disallow:" (without anything after it) is saying allow all robots access to the entire server. (If you need something beyond my word on the topic, there's a thorough explanation at http://www.robotstxt.org/robotstxt.html)
Those are very different. The former essentially disallows bots from crawling the entire site, while the latter disallows nothing - effectively allowing everything. The syntax is unusual, granted, for historical reasons.
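Side by side, the two files being contrasted:

    # Blocks all compliant robots from the entire site:
    User-agent: *
    Disallow: /

    # Disallows nothing, i.e. allows everything:
    User-agent: *
    Disallow: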
Because it isn't proper etiquette: it is their content, and they should get the views for it.
Some sites may not be able to handle the traffic, and caches are always nice, but pageviews are also a nice thing when you run a website, even if they don't generate any revenue.
It makes it look like you're still on Digg even when you click a link to go to another site. So essentially it makes it look like people are spending a lot more time on Digg than they actually are.
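(For context, the mechanism being described is framing. A minimal sketch, with made-up markup and URL, of how a digg.com short URL could keep Digg's own chrome around someone else's page:)

    <!-- Hypothetical page served from a digg.com short URL -->
    <div>Digg toolbar: vote / share / comment</div>
    <iframe src="http://example.com/the-actual-story"
            style="border:0; width:100%; height:90%"></iframe>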
From my experience, when Google de-indexes a site, they also suppress any PageRank the Google Toolbar would have shown for it; that doesn't appear to be the case here, digg.com is still PR8.
Having toolbar PageRank[1] and yet no cached page[2] is not something I've seen before.
Your experience may be limited, because this is very common. PageRank and indexed status operate independently of each other; it's not uncommon to see a site that was deindexed still maintain PR for quite some time (and often, indefinitely). However, if your site is deindexed, PR means "nothing" because your links no longer provide juice.
Valid PR while being deindexed is one of the (many) tricks that Google has added in the last couple of years to try to reduce the usefulness of getting PR for blackhat purposes.
This robots.txt should allow all bots to crawl the entire website. However, I think Google also penalizes websites that serve different content to Googlebot than to non-bot user agents.
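A quick way to spot-check for that kind of cloaking (the URL here is illustrative) is to fetch the same page with and without Googlebot's user-agent string and compare:

    curl -s -A "Mozilla/5.0" http://example.com/page > normal.html
    curl -s -A "Googlebot/2.1 (+http://www.google.com/bot.html)" http://example.com/page > bot.html
    diff normal.html bot.html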
curl -I http://digg.com/robots.txt
HTTP/1.1 404 Not Found
Content-length: 1443
Content-Type: text/html
Date: Wed, 20 Mar 2013 16:27:13 GMT
Server: nginx/1.1.19
Connection: keep-alive
An HTTP 404 robots.txt should not be an issue - but maybe there was a robots.txt with something else in it before, and now it's gone.
But even so, if there had been an all-blocking robots.txt, site:digg.com would still show the URLs of digg.com, as crawling is optional for indexing. (Maybe they used the undocumented "Noindex: /" directive, but I doubt it.)
If you want to achieve such a clean removal, in most cases you must request a complete removal via Google Webmaster Tools.
So yeah, somewhere, someone might have screwed up - most likely on Digg's side, maybe on Google's side, maybe a combination.
UPDATE: now it's
curl -i http://digg.com/robots.txt
HTTP/1.1 200 OK
Cache-Control: public
Content-Type: text/plain
Date: Wed, 20 Mar 2013 17:05:29 GMT
Etag: "c47ccf1a49c24cc5842430aa75c72ef491292412"
Last-Modified: Wed, 20 Mar 2013 16:51:48 GMT
Server: TornadoServer/2.2
Content-Length: 24
Connection: keep-alive
User-agent: *
Disallow:
which is the best-practice robots.txt (allow all - or rather: "hey, I have a valid robots.txt file, but the Disallow directive is empty, so nothing is disallowed, and this means allow all"). (Note: I once coded https://npmjs.org/package/robotstxt which tried to reimplement Google's spec of the robots.txt https://developers.google.com/webmasters/control-crawl-index... so I have some experience reading robots.txt files.)
Still, it wasn't a robots.txt issue.
Hmm, just a hypothesis: maybe, just maybe, someone thought it was a good idea to remove www.digg.com from Google via Google Webmaster Tools (their main domain is digg.com, not www.digg.com, and they definitely had some www.digg.com URLs indexed - even Bing has some of these URLs indexed: http://www.bing.com/search?q=site%3Awww.digg.com&go=&... ). But Google may be set to treat www.digg.com and digg.com as the same (via the settings panel in Google Webmaster Tools), so removing www.digg.com could then remove digg.com as well (something similar happened to a client of mine years ago). So it could be the issue, or it could not be; we would need more data (access to GWT) to verify this.
As another data point: I just clicked that link in Chrome, and I get the robots.txt with * Disallow like others are saying. Weird that some people are getting a 404.
Google admitted the mistake:
“We’re sorry about the inconvenience this morning to people trying to search for Digg. In the process of removing a spammy link on Digg.com, we inadvertently applied the webspam action to the whole site. We’re correcting this, and the fix should be deployed shortly.”
Added: I believe Digg is fully back now.