Google algorithm change launched
688 points by Matt_Cutts on Jan 28, 2011 | 183 comments
Earlier this week Google launched an algorithmic change that will tend to rank scraper sites or sites with less original content lower. The net effect is that searchers are more likely to see the sites that wrote the original content. An example would be that stackoverflow.com will tend to rank higher than sites that just reuse stackoverflow.com's content. Note that the algorithmic change isn't specific to stackoverflow.com though.

I know a few people here on HN had mentioned specific queries like [pass json body to spring mvc] or [aws s3 emr pig], and those look better to me now. I know that the people here all have their favorite programming-related query, so I wanted to ask if anyone notices a search where a site like efreedom ranks higher than SO now? Most of the searches I tried looked like they were returning SO at the appropriate times/slots now.




Will this change affect sites like filestube.com and freshwap.net? FilesTube ranks for the majority of long-tail keywords, even those not related to downloads/torrents/rapidshare.

I see filestube's auto-generated search listing pages ranking on Google all the time. Pages like: http://www.filestube.com/m/matt+cutts http://www.filestube.com/g/google+scraper

Same goes for freshwap: http://www.freshwap.net/387/dl/Google+Matt+Cutts

These sites will give out an auto-generated page for every keyword you enter into them. Apparently, Google loves to index them... there are 126 million pages of FilesTube indexed in Google. I thought indexing the search listing pages of other search engines was against Google's policies.


There are some really annoying torrent sites like this. I mean, sites that pretend they have search results for whatever torrent you're searching for. Those show up on Google a lot and they're useless.


This is another class of problems we're working on. Expect some changes here in the next few months.


Your account page doesn't say anything. I assume you work for Google?


Yep, in fact I wrote the change we're talking about in this thread. :)


Hah. Good to know. I guess I'm your 'enemy' then, since if there was a label for me it'd probably be blackhatter. Though my 'spam' tends to be of much greater quality than most of the stuff BHW sorts of people produce (I have some original content written, and excluding my bottom-of-the-barrel sites, the others have their automated parts, like scraping, edited/checked by a hired person).


Though my 'spam' tends to be of much greater quality than most of the stuff BHW sorts of people produce

You threw a vitamin pill into a bucket of mud?


die die die.


It is acceptable to rank these websites for keywords related to torrents and downloads. But those auto-generated pages have been ranking even for keywords not related to torrents/downloads.


Matt, I just went through my search history because I remembered a very specific instance of seeing this. Here's the query.

http://www.google.com/search?q=nstoolbar+bottom+bar&ie=u...

You'll notice that efreedom.com shows on the first page with content taken directly from stackoverflow. While stackoverflow does show in the results, the exact page that efreedom copies does not. Anyway, I'm glad you guys are taking this seriously.

For reference here is what I see right now - http://dl.dropbox.com/u/1437645/googlesearchresult.png


It looks like we've got SO above efreedom for that query, but it's always nice to find a url that we didn't have that we'd like to be indexed. That lets us check whether we can improve our crawling/indexing. Thanks for the example!


The _relevant_ SO result http://stackoverflow.com/questions/1097115/bottom-bar-in-nsw... shows up in the collapsed results below the 2nd SO result!!


Yes, SO is above efreedom in this instance, but the SO results are actually worse than the efreedom result based on the query.


This is a situation I've seen many times in the past. Often the right site is on top, but it's showing the wrong result, meanwhile the scraper site surfaces the correct one.

It seems like just showing a few more results from the "real" site would solve the problem.


This is why I love Hacker News.


If you tell me that sites who chronically scrape content elsewhere get a "pull-down weight" attached to the whole of their content, you've made my day.


That wouldn't be nice. There is a big difference between copying content from Stack Overflow and running a Planet news aggregator on a community site for your open source project. The first is an ad scam; the second is a valuable service to the members of the community.


For [nstoolbar bottom bar] I see (from an incognito window):

#2) stackoverflow.com/questions/3977343/nstoolbar-on-bottom-of-window

#3) stackoverflow.com/questions/tagged/nstoolbar

#9) efreedom.com/Question/1-3977343/NSToolbar-Bottom-Window

The #9 result is copied from stackoverflow.com/questions/3977343/nstoolbar-on-bottom-of-window, which is the same as the #2 result.

It's actually really hard to do this as well as Google has, given that the stackoverflow page doesn't have the word "bar" on it anywhere, while the efreedom page does.

Neither Bing nor DDG seems to have the StackOverflow page or the Efreedom page. I would give an uninformed guess that they probably can't match the StackOverflow page because it doesn't contain "bar", and are probably doing some sort of hand demotion/removal for efreedom. As a result, you get no pointers to the information you want instead of two.


For [nstoolbar bottom bar], searched from my logged-in window, a permalink to this comment is the top result.

Um...


Clearly by observing the search results, we have changed them.


Ironically, this post now ranks at number 1 for the query. The world works in strange ways, it seems.


Did you try the query with the &pws=0 (disable personalized search) option? That solved the problem for me. I do not see efreedom on top for this query.


I'm seeing two SO URLs (5 and 6) before efreedom (8th) when I click that link.


The two SO urls that rank above the efreedom url are for different questions. The identical SO link (http://stackoverflow.com/questions/3977343/nstoolbar-on-bott...) doesn't seem to show up in the results at all.

Two SO links on the first page above any efreedom urls is a positive thing, but even so, if content is apparently considered good enough to make it to the first page, it should come from the original source rather than a scraper.


Yes, but 2 things.

1. The SO results shown don't address the query I submitted as well as the efreedom one does.

2. Why doesn't the stackoverflow article that is being duped show in the results?


So HN is in the first spot for this query now. And it redirects to this question. Make of that what you will.


Make that first two spots.


It's strange how these results vary from person to person. When I click that link, I see an eFreedom link (with SO content) at #6 and a different SO link at #7.


Not exactly a scraper site, but if you do a search for "learn to hack" the top result is just a list of SEO keywords:

http://www.learn-to-hack.com/

Several of the other results are rather dubious as well. The reason I bring it up is that the Squidoo lens that comes up is something I made, and while certainly not perfect, it's still much better than many of the SEO spam sites and fake eBooks that rank above it. (And plus, the ad revenue is going to charity rather than some shady organized crime ring.)

Anyway sorry if it's a faux pas to complain about my own stuff, but I feel like it's a legitimate problem with the way Google works.


>Anyway sorry if it's a faux pas to complain about my own stuff, but I feel like it's a legitimate problem with the way Google works.

Definitely not a faux pas. Thanks for the example. My biggest annoyance in threads like these is people who write essays about their site losing traffic but then aren't willing to provide a URL for people to check out.


FYI:

Don't flag this because there's no link or "citation". Matt Cutts is the web spam guy at Google.



Yup, I actually posted over here at HN a little bit before I did a post on my personal blog.


I'd like to mention that this is really an impressive way of working with the community, Matt. Companies say all the time that they value their customers' opinions, but rarely do you see a grievance that's posted on a social news site being a) initially responded to by the guy who's responsible for it, and b) amended by that person and his team with a request for review. It's heightened my faith that maybe Google can actually keep that small-company feel no matter how large they get. I really hope you guys continue in this fashion, and thanks for the search fix.


HN has been a really high signal/noise site in discussing these issues, so it only seemed fair to give folks a heads-up here and see what other issues people were seeing. But thank you. :)


Yeah, I recently posted a comment starting with "Any Google employees here", knowing that there is more than one employee who posts here.


Is this change restricted to programming-related queries only?

I noticed today that a search for "mubarrek london" returns a page of results where every result on the first page, besides the top one, is spam from www.88searchengines.com, www.30searchengines.com, www.70searchengines.com, etc.

I know this might not be related to topic of scraper sites directly, but not sure how else one can easily report these types of things.


I can't seem to replicate this:

http://i.imgur.com/kDwPd.png

Are you sure there isn't something else going on?


Perhaps it's related to location; it's reproducible with an incognito search (to make sure no personalized settings are being applied): http://i.imgur.com/Yd7FN.png


I do see these problematic results:

http://www.tristanperry.com/pics/SpammyResult.jpg

Searching from the UK, if that changes anything.


http://cd34.com/mubarek+london.png

I get similar results to his, both with personalization and in incognito mode.


Tried the query and was able to replicate this https://skitch.com/techsutra/rm3gb/mubarrek-london-google-se...


Google doesn't link to search engines in its results; that's their policy, so I would think those sites will be removed from their index at some point.


I would like to point out one interesting thing I noticed today. I was looking for "gcc optimization flags for xeon".

The following query is from google.com and contains no efreedom on the front page: http://www.google.com/search?hl=en&q=gcc+optimization+fl...

Now, the same query from google.ie (the Ireland site) contains two efreedom results on the first page!

http://www.google.ie/search?hl=en&q=gcc+optimization+fla...

Why this strange search behavior for a query that has no relevance to the user's location?

P.S. I was logged on to my google account while searching, not sure if that has any effect whatsoever.


The high-order bit for me is that Stack Overflow is #1 for both pages. We can do localization based on TLD: the query [bank] will return different results on google.com vs. google.ie, for example. Without doing a deep dig yet, we might have thought that open-mag.com was more helpful for European searchers, for example. It looks like that site is based in Portugal.


Kudos for keeping locational variety. Is there a flag/keyword we can enter into the search to ignore location?


Such a flag would be reallllllly handy. I constantly switch between Google's .com/.nl sites based on what language I'm searching in and what I'm looking for.


Agreed


http://google.com/#q=kudos&hl=fi&gl=fi

hl refers to the UI language and gl to localized results.


Some people may recommend http://www.google.com/ncr, but have you tried actually asking for English when you request pages on Google? It sounds like you don't have your Accept-Language header set to English. If an Accept-Language: en header doesn't fix the problem, you should complain on HN (with a new thread) about it instead of using /ncr. Google really should support that.


UPDATE: I am even more surprised to see that when I logged out of my account and searched again on google.ie, one of the efreedom results disappeared!!

So, does this mean that Google search now takes into account my past search result clicks (or rather mis-clicks)? Or ranks content in some way that proves efreedom is somehow more relevant to me?


If you're using personalized search, we do look at past clicks to change your ranking. So if you clicked on efreedom in the past, that could affect your ranking. You can add "&pws=0" (or use incognito mode in Chrome without logging in) to turn off personalized search and see whether that's the factor for you.
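For example (the query here is just the one from this thread), appending the parameter to a results URL looks like:

  http://www.google.com/search?q=nstoolbar+bottom+bar&pws=0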


If I'm clicking around to crappy sites to figure out why they rank highly I don't really want them to get even higher in my personal search ranking!

Is deleting them from search history enough to correct this effect?


Wow, you learn a lot around here if you're paying attention! I often wondered whether or not my results were being personalized. Good to know.


Adding &pws=0 turned off personalized search and hence got rid of the efreedom result.

Also, I cleared my web history and searched again. The results are the same now, signed in or signed out. Problem solved! :)

Thanks!


Glad to hear it. &pws=0 is handy to diagnose whether personalized search is causing something to rank higher for you.


It's fairly trivial to add a custom search engine to Chrome or FF that automatically appends &pws=0.


In Chrome, do this by adding an engine with the following URL, and make it default if you want:

  {google:baseURL}search?{google:RLZ}{google:acceptedSuggestion}{google:originalQueryForSuggestion}sourceid=chrome&ie={inputEncoding}&pws=0&q=%s


Google has been doing this for some time now.


The two links show the same results for me now (logged in). Perhaps the index/algorithm changes just took a while to be deployed to all servers.


Glad to see that Matt went out of his way to help us geeks and the Stack Overflow community!

In the future, is there going to be some way for webmasters to do something like rel="canonical" across domains, so if I want to syndicate a piece of content across two properties I own, I can indicate which one is the original source? My understanding is that rel="canonical" is only meant to be used between pages on the same root domain today, but I could be mistaken.


Syndicating content across your own domains is a common practice and will not be affected by this.


rel="canonical does work across domains.



Good to know, thanks! I wasn't sure if that was the case or not.

Is Google the only engine that supports it currently or is it standardized?


This is awesome! Now if only I could completely remove certain sites from my search results, a la the Google Wiki stuff. I'd love to drop swik and expertsexchange and a few other annoying sites. It's possible this algorithm change will make these sites less annoying to me, though.


Bonus trick: Experts Exchange actually often has some good stuff on it. The pages are designed to make you think you have to buy a subscription to see the answers, but if you simply scroll down past the ads for their service, the answers (which aren't scraped from elsewhere) are visible near the bottom of the page. Yes, it's annoying, and I wish they didn't work this way, but there are times when I've found Experts Exchange to be the most useful result.


This only seems to be the case when coming from Google; if you refresh the page or paste the link to a co-worker, the results are gone.

As for the long list of crap between the question and the answer, use AdBlock to remove the various DIVs.


That may be the case sometimes, but as far as I'm concerned, they are spam. I would like the option to blacklist sites from my results. Wouldn't it be great if we had the choice?


> which aren't scraped from elsewhere

They used to be scraped from usenet. They may well still be, but I don't have any current evidence. Are they still doing that?


Second that. WTB ability to customize my search results by blacklisting certain sites. Big change I know, but would be a hugely useful service Google could provide.


I created a site that uses Google CSE to do just this. The link is: http://blacklist-search.appspot.com/


Never mind, I just went out and found a script for it: http://userscripts.org/scripts/show/33156


Why not just use the "-" operator?

Create something like:

http://www.google.com/search?q=google+-google.com&ie=utf...

I suspect you could hack together a quick Greasemonkey script to automatically inject your negative site list when the URL is generated. A sketch of the idea follows.
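For illustration, here's that idea sketched in Python rather than as an actual Greasemonkey script; the blacklist is made up (domains taken from this thread), and -site: is just the precise form of the minus operator:

  # Sketch: build a Google query URL with -site: exclusions appended
  # from a personal blacklist.
  from urllib.parse import quote_plus

  BLACKLIST = ["efreedom.com", "questionhub.com"]  # example blacklist

  def build_query_url(query):
      exclusions = " ".join("-site:" + d for d in BLACKLIST)
      return "http://www.google.com/search?q=" + quote_plus(query + " " + exclusions)

  print(build_query_url("nstoolbar bottom bar"))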


Because Chrome still doesn't sync your "search engine" settings. So maintaining a massive customized search engine across the multiple computers I use would be a pain. If this was a native feature it would work so long as I logged in to Google. Of course, if they fix chrome I'll be happy too.


This only works for a small number of blacklisted domains.


Here's a query that shows efreedom above SO. I did site:efreedom.com and picked a few titles until I found one.

query: Mailengine with .NET API

The right page that should show up is

http://stackoverflow.com/questions/1720900/mailengine-with-n...

If you google the link below you'll see that the page is indexed.

http://stackoverflow.com/questions/1720900/mailengine-with-n...

However, the page doesn't show up when you use the query I mentioned.


Matt, this is great news.

How about sites that rank well with no content, just navigation? Here's an example:

http://www.collegegrad.com/entryleveljob/entrylevelaccountin...

It's generally a high quality site, but that page has absolutely no relevance to the query except for a title tag and some internal anchor text. The search terms aren't even on the page.

If I remember correctly, it used to rank #1 for "accounting entry level jobs," and now it's down to #8. My question is why is it even ranking at all? It's not even low quality content. It's no content.


I notice that you run a competing site to collegegrad.com - namely onedayonejob.com. Obviously you are likely to have paid attention to some of the things your competitor has been doing. But are you sure you're not just using this as an opportunity to stick it to them? If so, that would strike me as a distasteful use of this forum, especially since Matt has been very gracious to give this opportunity to the HN community.


Even if they were direct competitors with the exact same market and product (which they don't appear to be), that could mean he's got a much better idea of how his competitors may be doing spammy things than we do. Since these are algorithmic changes, not site-specific changes, any resulting fix would get applied to his own site as well.

The motives here don't offend me.


I completely understand why you'd question my motives. I thought really hard about whether I should post this example or not. I tried to find an alternative example that would demonstrate the same problem, but this is the only one that I could come up with. I'm sure if I spent significantly more time looking I could have found something similar, but this example is very apparent to me because it's in SERPs that I watch closely.

I'm not trying to "out" this site. I don't think they've done anything wrong or manipulative—they just have some pages with no content that are ranking well. I think that this type of issue should be on Matt's radar. Most of the site is of extremely high quality, and it deserves the high rankings that it gets.

It's a page that is solely navigational structure, yet it ranks very well for a very specific keyword because of title tags and anchor text. We've already seen that Google has a problem (that they're working towards fixing) with low content from low and medium quality sites. What are they doing about low quality content (or non-content) from high quality sites?


Hey Matt, thanks for the update. I wanted to know how it would work in this case:

http://www.google.com/#q=major+online+dating+sites+koopa&...

We submitted an article to EzineArticles from our blog, blog.koopa.com. Just wondering why our blog is not listed at all, but the EzineArticles copy of our article is? How does this work when our authors submit to article directories?

On that same note, I notice a bunch of other sites just below which have copied the Ezine article to a tee and are ranking higher than our original blog that posted it.

Is this related to the current algorithm change, or is it just that our blog may not be indexed yet? Thanks for the update once again. I love the fact that you and the Google team are constantly updating and changing your algo to give value to rightful content owners.


What is the url of your blog?



Yes - I've been doing a lot of Yii-related searches this past week, and I've noticed that a lot of the time a site called 'devcomments.com' pops up somewhere in the first 5 results, usually with a bogus page that does contain the keywords you were looking for, but not the actual discussion/forum thread. It appears they have copied their content directly from the official site, yiiframework.com.

Example: http://www.google.com/search?q=Any+yii+way+to+get+the+previo...

In that particular instance it is result number 3 (and the original is number 1), but on more than one occasion it was the top result and it's never what you are looking for.


While I applaud such an improvement, I still despair over Google's inability to handle context. For instance, the search phrase 'tex decorative rules' is totally mismanaged. First it overrides the search and changes tex to text. Humorously enough, when you counter-override back to tex, the search is even worse. We won't even speak of what happens when you change tex to latex - porno doth ever rise to the top, I guess. While we may be in a new millennium, some things remain far behind. And yes, I know that things are being worked on by those who were academics and are now working at Google, but still...


tex is a hard one. And [latex decorative rules] is even trickier--most people wouldn't expect technical documentation if they just saw that as a random query. That's more of an issue to pass on to the synonyms team and not in the scope of this change, but I'll pass this one on to the right folks.


Would it be feasible to introduce a search customization setting: "lower the estimated probability that I mistyped"? I assume Google uses something like two cut-offs, A < B, where if P(mistype) > A, it suggests but doesn't search for the correction, while if P(mistype) > B, it displays auto-corrected results. Simply a user option to nudge P(mistype) lower would probably suffice, or an option to increase B. Something like the sketch below.
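A minimal sketch of that two-cutoff model, with invented numbers (A and B are assumptions, not anything Google has published):

  # Hypothetical two-cutoff model for query auto-correction.
  A, B = 0.4, 0.9  # invented thresholds, A < B

  def correction_behavior(p_mistype, nudge=0.0):
      p = max(0.0, p_mistype - nudge)  # a user setting could lower the estimate
      if p > B:
          return "show corrected results, offer 'Search instead for' the original"
      if p > A:
          return "show original results, suggest a correction"
      return "show original results unchanged"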

A major usability problem for me with auto-corrected results is the two lines with links: "Showing results for <link>" and "Search instead for <link>". Interpreting those two lines takes me out of the flow of analyzing search results. I haven't yet been conditioned to click on the second link. I never want to click on the first link, because google is already showing me those results.

I frequently search for unusual acronyms, terms, variables, made-up words, and gibberish that shows up in logs, among other things that google likes to "correct" for my benefit. I realize those use cases are not typical, but current auto-correction behavior can be extremely aggravating in those cases.

Auto-suggest a correction all you want, but auto-correcting and suggesting the original is going too far I think, unless there are zero hits, or some equivalently strong algorithmic determination is made that I couldn't possibly want the query the way I typed it.


Personally, I think this would be a good idea. You could learn how tech/Google-savvy different users are and adjust the search results accordingly. Someone who has done a site: search is probably more likely to know their way around Google, for example.


Nice to hear about this change, Matt. "Fuckin' efreedom..." is an oft-heard missive around our office; godspeed in ranking them and their ilk down.


Since Matt is responding here, I figured this is worth a shot, no harm in asking. Matt, would love a response from you if you get a chance, since the Webmaster Tools appeals process gives no insight whatsoever to our situation.

Following on from one of the comments here, namely the idea that "value is in the eye of the beholder", I'd like to raise our own plight. I run a number of aggregator sites - the largest and oldest of them being celebrifi.com, which was a PageRank 5 until Google de-indexed us in December (along with some of our other sites, but interestingly not all).

A little background - the purpose of the sites is to aggregate, organize, rank and add context to what's happening in the news, with each site focusing on a specific vertical. Think Techmeme, but with more context.

I'll be the first to admit, there is no original content, but I strongly believe that we "add value" by figuring out what exactly is going on in any given story or blog post.

We add value to publishers, by always linking to the original source (indeed, many publishers directly request that we add their feeds to the sources we track), we respect copyright by only displaying a short snippet of the original text and only displaying thumbnail images and we add value to users by giving them easy access to a lot more content on the same topic/story, all in the same place.

Google's Quality Guidelines clearly state that duplicate content is penalized, and that is totally fine with us, but is it right to totally de-index a site for duplicate content? I wouldn't even want to rank above the original source for any given piece of content, as I respect the hard work that writers and publishers put into creating quality content, but aggregators who add value have a role to play in the content ecosystem. Digg, for example, uses the "wisdom of the crowd" to aggregate and rank content - hence adds value. Topix takes a local approach to aggregating content, and uses comments to rank content - hence adds value. We take a verticalized approach to aggregating and ranking content, and hence I believe that we add value.

As mentioned above, we got de-indexed in December, and despite going through the appeals process, fixing a few things on our end to do with sitemaps, and clearing out some older "low quality" sources that we were tracking, we received no clarity into what our crime was.

Matt, I'd like to raise this issue with you - both as it relates to us, but also as a general industry question - are all aggregators going to be de-indexed? And if not, which aggregators are and which aren't? What are the criteria, and who decides? If it's algorithmic, then I am very curious to know what on our sites triggered the de-indexing, and even more curious to know why some of our sites got de-indexed and some didn't.

I have great respect for Google's efforts to clean up spam and low-quality content - and would always expect to see original content ranking higher than aggregated content. But to completely de-index an established aggregator site and strip it of its PageRank seems very draconian.

I would love to hear your/Google's position, and look forward to some more clarity on both our situation, and the future of news/content aggregators.

Respectfully yours, Niko


Respectfully - Sites like yours ruin the internet. You admit it yourself: you produce no content of your own. I do NOT want to see results from pages of that ilk when I google for things. Google made the right choice to de-index it.


Google (in the context of search) produce no content of their own, are they ruining the internet?


The goal from the user's perspective is to get to the content they want as quickly as possible. A search engine helps in that, as presumably you don't know where the content you want is if you're visiting a search engine. A search engine that links to an aggregator site doesn't - the search engine should just send you to the original content directly.

Presumably, aggregator sites by themselves also help in content discovery. I find a lot of content through Hacker News. But they should do so by being good enough to be a destination in themselves. An aggregator that needs to be found by a search engine isn't doing users any favors.


I understand what you're saying, but doesn't Google News do the exact same thing?

Google News provides snippets of content and helps people discover the news, providing direct links to the original source. For Google to de-index a site like celebrifi while running a competing product (Google News) smells a bit of monopolistic behavior. I suspect it's unintentional, but Google is going to have to walk a very, very fine line as it starts de-indexing certain sites.


When News results show up on the search result page, they link directly to the story; they don't link to the Google News landing page or category where you'd then have to click on a link.

Google News itself falls into the second category - an aggregator that stands on its own, as a destination. Much like Hacker News. I personally don't use it, but the people that do go because they find it lets them discover a bunch of content that they otherwise wouldn't know about.


I know you're saying that Google News is a dedicated site, separate from the search results. But Google links to its own aggregation service from its results on a regular basis, and at the very top of almost every results page in Google there is a link to the Google News version of the search.

Google News does stand alone as an aggregator, but you have to admit that it is promoted heavily by Google search. If GOOG keeps doing stuff like that, I suspect a lot more companies are going to start taking umbrage and challenging this behavior in court, claiming that it's anti-competitive.


No, because Google search doesn't intentionally pollute other sites with aggregated content. I don't see Google search pages in Bing's results, outranking original content. I only see Google results if I go to google.com and specifically request them. That's the difference.


I go to Google when I want to find something. I click on a link when I think I have found what I'm looking for. When I click on that link, I want to go straight to my destination, not get lost in a maze of sites that do nothing but link to each other and dilute the content.

I go to Google because it provides value to me. Aggregator / republishing sites trick you into visiting them with the lure of what looks like actual, original content. This makes the internet less useful because you end up reading the same content over and over.


It depends what you think the goal is. I expect Google's goal with searching is to get you to the source data. What is the point in going to an aggregator for a query? Isn't that what google is in a nutshell? I'm not advancing a position, just genuinely posing the question. What role should aggregated results play in a search query if any at all?


The point of using an aggregator is that it supplies MORE information surrounding your topic than the original source might. The original article or video might be great, but have no surrounding contextually relevant information that might also be useful or helpful. I personally love using sites that aggregate information for me about my hobbies, whether it be games or sports. I find out more about what is going on because that site has done the "value added" work of finding and organizing it for me. It saves me time, because there are millions of sites out there that might have great original content, but I will never find them for many reasons, not the least of which is that their SEO might stink. So thank you to aggregators whose SEO chops are good!


I have often found myself landing on an aggregator for a particular query that has been a helpful resource for related problems.

Of course original content has priority - the original poster explicitly agreed with that, and there are some sites that produce no value at all - but saying that every site that purely uses other sites' content is ruining the internet is obviously very wrong, considering this is Google we are talking about.


From looking at some Google Places pages, I would say they are the kings of scraping content and definitely contribute to the ruining of the Internet.

An example would be my wife's company, which is a large non-profit science center that helps educate kids and adults about science and has a wonderful website that they pay a lot of money to maintain and support. However, Google doesn't see any problem with the fact that they've scraped my wife's company website to get their contact data, hours, and a few other points that they then put on their own page (Google Places - whatever that is), along with their own Google ads, as well as links to my wife's company's competitors, all for the purpose of keeping visitors on their site so that they can get the ad dollars while adding no additional value of their own.

To me this is the biggest scam in the world and makes them look like giant lying crooks on the web, because they steal content from others while banning competitors that do the same thing, all the while telling people that they're doing it for the good of the web.

Yeah, right?!? Google is doing it for the good of your pocketbooks. Here's a thought: why doesn't Google start paying sites, or splitting profits with the sites that they scrape data from and then make money off of? They are stealing visitors, traffic, and ad dollars while providing no additional content or benefit to the reader. I believe Google is getting really close to anti-trust issues and should be looked at very closely by the US government. They broke Ma Bell up in the 80's because it got too big, and I see something like this coming down the line for Google.

People are starting to get concerned that their hands are dabbling in many areas that cross-support each other, putting a lot of control and power into one company's hands. Google is not the Internet; they make no original content of their own, they regularly police the web and try to tell us what is relevant or not, similar to a dictator in a closed society, and they make money off all of this, so of course it's in their best interest to keep it going and make us think we need them to show us the web. Truth be told, I can find anything I'm looking for on any engine, and I often find good and bad results on all of them, so to me Google is just another search engine, and if they keep going down their slippery slope they will wind up like AltaVista.

Here's a suggestion: drop your bunk PR and link-structured algo and develop a new tool that doesn't support spamming the web with paid links. Google created this monster when they put so much emphasis on links, and people all over the web buy and sell links to push their rankings up. This model is so old and outdated that it doesn't make sense. Now Google is putting band-aids all over their old bunk algo, trying to keep their outdated ways going, instead of putting in the time, dollars, and effort to make a new model that would actually produce good, natural results instead of manipulated results that have to be regularly policed and adjusted to keep going. The fact that Google has to regularly update their algo shows that they know there are problems with it, but they just keep sticking more patches and band-aids on it instead of building a new model that will not support the spam techniques on the web.

All I can say is CONGRATS on putting another band-aid on the OLD ALGO. Maybe this one will only affect a small number of legit sites, and, heck, those are casualties of the war on spam, right?

Good luck to all the legit sites that are trying to jump through Google's hoops! Just remember, if you don't make it through today you will be gone tomorrow.


Like I said, and others have said as well, value is in the eye of the beholder. If you follow your own logic, then every single aggregator, including the likes of HuffPo (something like 60-70% of their content is little "curated snippets" from around the web), is ruining the internet. There is very little "truly original" content out there; virtually every single blog rehashes the content of others. Is Techmeme of no value? Is Digg of no value? Is Topix of no value?

Also, we do actually produce some original content; we have a small staff of writers who create "featured posts" every day. I sold us a little short there.


Those services add value to the content - comments, votes, opinion, and/or journalistic research.

You guys have it, but it's buried under a pile of spam. Lose the spam and you'll recover your integrity.


The original content you do have just seems to be short rewritten versions of other online sources, and any comments appear to just be random tweets that use one of the same keywords.


I'm curious, what is your opinion of TechMeme?


Techmeme, like Google, is a great and valuable resource.

However, the debate is whether or not these aggregators sites should show up in search results instead of the original source.

I use Techmeme all day long, but don't expect it to ever show up in a search result when I am looking for a specific topic. I want the original site. I can always go to Techmeme directly to get the "value-added" context on my own.

Aggregator sites are great destinations. They shouldn't be in search results (at least, not above the original sites).


I disagree. Sites that aggregate information make it easier to peruse what you are specifically interested in without poring through an RSS feed or visiting site after site. And if only a portion of the text is presented, with a link to the original post, you can always visit the originating site to get the full article.


> Sites like yours ruin the internet. You admit it yourself: you produce no content of your own

are you talking to the parent post or matt?

/sarcasm


Though sarcastic, I think this succinctly summarizes the issue.

Aggregators, it's time to wake up. Google is not your friend, Google is your competitor.

We can agree that Bing and Google are competitors, right? And we can agree that Bing labels themselves a "Decision Engine" and not a "Search Engine" right?

The line between a Search->Decision->Topical Aggregation Engine is so blurry I'm not sure it exists anymore. Google's stated mission is to "organize the world's information." - they are going to add social and they are going to use your preferences to figure out which stories/pages are interesting to you.


As an aggregation site, your SEO shouldn't expect to outrank the content you're sourcing - that's unethical. At best, your SEO should focus on the service you provide.

Looking at your site, it's easy to see why Google bumped you down:

http://technifi.com/

http://technifi.com/news/Egypt-Leaves-the-Internet-3841691.h...

It looks like your service is nothing like Techmeme, adding ad-related pages prior to accessing significant content. And the context-aware parts seem to lack any form of editorial choice, such as the Google-image Vodafone photos. In short, you have no insight, let alone opinion, and thus add nothing of value to the information.

Now aware of your site, I don't find any benefit and would ignore it or add it to my block list in my custom Google search.


Let me address these points one by one:

1. We don't expect our SEO to outrank the original source, I was very clear about that in my original post. SEO is a tool to be used among many other tools to ensure that content is properly "classified", nothing more than that. When a search engine visits, you want that search engine to immediately know what any given page is about, and we do that very well.

2. The Vodafone image concern I don't understand - we identified Vodafone as one of the main entities in the story (Vodafone shut off service to Egypt), and that is why the image is showing up there. If you visit the Renesys blog (the original source) you will see that Vodafone is mentioned there. The Vodafone image is not an ad, it is there to identify what the story is mainly about - context.

3. Regarding adding insight, opinion or value - we believe that the role of algorithmic news aggregation is not to have an opinion, but instead to uncover news that you may not otherwise have been aware of, give you context in the form of links to the main entities in that story, or other stories on the same topic. We believe we do that very accurately, and are always working on making it better and smarter.

4. Regarding the comparison to Techmeme - everyone has their favorite aggregator, and I am a big fan of Techmeme as well. I would, however, point out that Techmeme also displays ads in order to make a living, and they also have "content pages" that you can find through Google search. Indeed, virtually all aggregators run ads against the aggregated content they display on their sites, and all aggregators have "content pages" above and beyond a homepage.

The issue I am trying to uncover is not whether you like our sites or not, but rather what exactly we have done wrong, when compared with other aggregators, that caused us to not just have our pages bumped down in ranking, but to be totally de-indexed and have our PageRank stripped away (from a respectable 5 on Celebrifi). That, and to understand better what the future of content or news aggregation might be - should we expect all aggregators to become de-indexed? Are there guidelines that should be followed, or changes that should be made, to not fall afoul of Google? Are all aggregators on a level playing field, or will the lesser-known ones be shut off while the chosen few with established brand names survive?

These are valid concerns, not just for us but for many others in this space, and we are more than prepared to put in whatever changes might be needed to get back into Google's good books.


Radley is spot on. Instead of trying to counter his points, you need to read them over and over. If I want to find out about Charlie Sheen's cocaine and pussy habit, I will go to TMZ. Your 'magic algorithms' are NOT what decides what's hot and what's not. It's money through ads on duplicate content. http://informifi.com/company.php?page=about

The big aggregators will win and rise to the top. The knock-offs will be de-indexed. There's only so much room for news and gossip sites that don't add value; don't take it personally, just move on.


Perhaps you're looking at it the wrong way. You haven't been de-indexed, Google simply found other sites that were more relevant.


It's pretty clear to me that if I were to Google for something, I'd want to see the original source of the content over a link to a page on an aggregator site.

On the other hand, if I wanted to see an aggregator site like Huffingtonpost or Digg, then I could Google for a site like that, or even Google for a keyword about a discussion that happens on the aggregator site. That seems legitimate to me, but showing up when I'm googling for something related to the original source is obviously not.


It's because when I'm searching for something, I want a link to the original source. I don't really want a link to your page. The results page would become polluted if it contained links to all aggregators such as yours.

It's unfair to be on the results page with somebody else's content when the people searching for that content are actually targeting the original source, not yours.


I have to second Niko's position here. Our main business model is "adding value through aggregation". We provide users in many markets aggregated content from a variety of publicly available sources on specific topics. While much of the content on our sites is technically "duplicate", so is much of the content on the NY Times. They syndicate content from the AP, as do many news sites. We do the same, and always take care to credit the authors and provide valuable links back to the original content source. Does this algorithmic change affect sites like ours?


So, which SEO forum was a link to this thread posted to?


Can you elaborate on your use of the word "algorithmic change?" I'm not sure you're using it in the sense that I'm used to and I'm interested in your assertion that simple aggregation adds value.

That said, I'd be careful of analogizing aggregators to wire services. The AP actually employs their own reporters.


I see you lost half a million visitors in traffic after the de-ranking... Ouch!

IMO, you are kind of hiding the original source and making it hard to find. Yes, you show the source and your call to action is "Read More", but your site does not give the appearance of a news aggregator.


Sounds good, thanks for the update Matt. I must ask though: I wonder whether you've seen the "January 26 2011 Traffic Change - Back to 'Zombie Traffic' " discussion over at Webmaster World? A number of webmasters there (who own websites with fully unique content) are reporting that they've seen lower quality content sites and/or content scrapers rank above their established sites with good content, starting from around the 26th Jan.

There are no specific queries/websites discussed there (I might be wrong, but I think Webmaster World has some rules preventing specific discussion of websites/queries), although I thought I'd flag it up since some webmasters have noticed adverse effects from a Jan 26th algo change, and it sounds like this might be the cause.

Anywhoo, that being said: it's great to see Google continuing to be on the ball and responding to the recent feedback from various blogs and other sources (e.g. here).


I love Webmaster World, but one frustrating aspect of webmaster forums is that specific sites and queries are rarely given. It can be very hard to assess what's really going on since you don't know the specific site that's being discussed.


searchers are more likely to see the sites that wrote the original content

This is great news, but I have to wonder: How do you figure out which site wrote the original content?

I'm wondering based on what happened to Tarsnap last week, where hackzq8search.appspot.com outranked tarsnap.com on a search for "tarsnap".



Will this new change impact the issue where scrapers that take videos and video descriptions from youtube and turn them into 'blog posts' show up higher than the youtube page that contains the actual original content? I worked for nearly a year producing a few hundred videos only to find that spam sites (usually running adsense ads) were showing up ahead of the video on youtube.com and my own site. Our site's domain name was even plastered all over the description and it still didn't matter. The spam sites still showed up way ahead of us in the search results.


This is good news, but I feel like it is only treating a symptom not the actual disease.

If the algorithm properly detected site relevance, importance and viewer satisfaction, those copycat sites should never have ranked higher in the first place. In a way this is admitting that it is impossible to stop the gaming of search engine optimization, and that the only way to deal with it is to "win" in some special cases.

That being said, I provide no real solution; this is a huge problem with millions behind each side.

Although this is good news, it is also gloomy news.


How would an algorithm detect viewer satisfaction?


How long a user stays on a page (a hint would be how soon they make their next search/click), whether or not they continue looking for something under a similar search query, having actual buttons users can click to rate, etc. A toy sketch of the first signal is below.
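As a toy sketch of that first signal, here's dwell time inferred from consecutive result clicks; the click-log format is invented for illustration:

  # Toy sketch: estimate per-result dwell time from a click log.
  def dwell_times(clicks):
      # clicks: chronologically ordered list of (timestamp_seconds, url)
      times = []
      for (t1, url), (t2, _) in zip(clicks, clicks[1:]):
          times.append((url, t2 - t1))  # a very short dwell may hint at a bad result
      return times

  # dwell_times([(0, "a.com"), (5, "b.com"), (95, "c.com")])
  # -> [("a.com", 5), ("b.com", 90)]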


Spam gets upvoted to the top of reddit on a regular basis. Most people are pretty apathetic to legitimacy most of the time. All we ever have is proxies for quality, the hope is that we have enough of them to cover the space.


I imagine there's not much you can be specific about when talking about Google's algorithm, but can you at least disclose whether the identification of "original" content providers is "determined" (by some automatic process) or "specified" (manual intervention)?

Stack Overflow is an obvious beneficiary of this new change; I'm just wondering if smaller content providers might benefit as well?


There's nothing manual about this. Everybody who authors their own content should be helped. :)


Fantastic, though I'd rather see Google block sites like efreedom altogether.


I think that would be a dangerous precedent to set. Whose opinion should decide what sites add value to content that may not be completely original? I'm not saying efreedom is not "being evil", but I could foresee someone somewhere using another's original content and making it more accessible. For example someone could easily improve the accessibility of the content from experts-exchange.com.


Google already makes that decision about lots of search results, usage on their services etc.


The fact that efreedom results are showing up in the results is irritating by itself. We all know efreedom is spam, and so does Google.

Now I am always on alert when clicking on a link in the results, to avoid spam pages like efreedom, expertexchange and whatnot. Why not just remove them from search results?

They are causing enough bad publicity to Google.


"scraper sites or sites with less original content"

To clarify, this change measures unique content site-wide, not just for a particular web page/url?

I may be too focused on semantics, but it would be important if you are trying to maximize visibility for a single page in Google's search results and your other pages have a significant amount of duplicate content.

For instance, one page on your site is about a WordPress plugin you've created (it's totally unique), but on most of your other pages, you've copied and organized relevant sections of the WordPress Codex so that your users can easily find the documentation needed to customize the plugin. Is your unique webpage about the plugin safe? I was always under the impression that search rankings were determined on a page-by-page basis rather than site-wide.

P.S. Sorry if this was already asked and I missed it.


I don't suppose you'll tell us how you know whether a site's content is original or not.

What if you have a blog with lots of quotes from other sites; will that hurt your rankings because Google sees "unoriginal" content?

Is there some ratio of original to unoriginal content that must be met to keep from being flagged as a scraper?


http://www.google.com/search?sourceid=chrome&ie=UTF-8...

for me, the 10th result is:

www.questionhub.com/StackOverflow/3910933 - Cached

I think it is safe to say they are scraping, and while there are some SO answers, that particular answer does correctly interpret what my intention was.

In fact, the SO page scraped shows up as the 12th entry.


Matt, would you comment on Google's thought process on curation? There are sites that are generally strongly disliked.

Examples: Demand Media in general, or say w3schools.com for JavaScript/CSS/HTML.

I can understand that there would likely be a deluge of legal action against Google if this was done in a heavy handed way.

Or is it a principled thing where everyone should be treated equally? If so, isn't that a lot of algorithmic ideology? In practice, hard rules without human judgment lead to ridiculous edge cases or bypassing of the rules' intent.

I always assumed you curated, in a way, via teams that specialize in creating topic-specific search techniques, which then all get combined via topic detection or some other meta-algorithm. That would be a good balance between manual curation and a scalable approach.


I ran across this list just now on sites that may be using SO content: http://meta.stackoverflow.com/questions/24611/is-it-legal-to...


Bravo!


I'll be curious to collect programming-related queries where we're not returning Stack Overflow or some other site that we should. Computer science and programming queries are easy for engineers to assess and say "Ah, here's something we need to do better on."


What's the best way to get these to you (after this article has dropped off the front page)?


I'll still circle back to this page for quite a while, or you can tweet them to me. If it's long feedback, you can blog it and tweet a link to the blog post.


This comment made me chuckle. I agree all this should be discussed out in the open, but blogging and then tweeting links to the blog post....what happened to email!? :)


I get way too much email already, but the main reason is that email is a poor use of my cycles for doing support. In the 10 minutes that I would use to reply to one person, I could make a webmaster video that 1000+ people would benefit from, or 1/6th of a blog post that could answer other peoples' question for a year or more. Even Twitter is 1 to many instead of 1 to 1. I do get and reply to a ton of email, but when possible I try to communicate in ways that will help multiple people at once.


This is the best legitimate advertisement for twitter that I've ever read.


As the guy in charge of fighting webspam, maybe this will help you find or improve upon some algorithmic means of detecting when lots of unrelated wikis start to get exploited to feed up advertisements.

http://www.bay12forums.com/smf/index.php?topic=73045.0

Apparently, for a short time (until the admins fixed it), a wiki for a game I play was hijacked to spam Google. Actually, said wiki (magmawiki.com) is under attack right now from spam user accounts. I don't run the place, but I thought I'd pass the information along because it might give you something to examine and because I have to imagine that anything that helps Google identify spam-hijacked wikis would be helpful to everyone.


Will there be any penalization for sites that use Wikipedia content? Specifically, a site that would NOT show up in the same search results as a Wikipedia page, but a commerce site that uses some unaltered wiki content in the descriptors of its products?

I've seen many sites use this method of adding text to an image-heavy site, and I have watched one particular site that uses it drop dramatically out of the search results since October 28th, and again around Dec 28th.

Would you advise discontinuing the use of Wikipedia content even though the targeted keywords differ from those of the actual Wikipedia page?


Anyone else notice that even Matt's site ranks below other sites that have copied the content after this change? Here's a search for the 3rd paragraph of content in the original blog post:

http://goo.gl/vVs8A

Some sites, including this one: http://www.boonebank.com/brc/SBR_template.cfm?Document=headl...

... which links back to the original post (is that not supposed to acknowledge the original source of the article?) and quotes Matt, ranks higher in my SERP results.


If a site ranks high in a search result, isn't it because it must have good backlinks?

Shouldn't that be more important than whether it has original content or not?

Maybe an aggregator displays the content in a more useful way than the originator so it gets linked to more than the originator.

If the content creator doesn't like his content being copied, he can take it up with the copier. It's not Google's job to get involved in that.

Google's job is to give the searcher a list of the sites that match his keywords in an unbiased way. They should do that based mostly on what the internet thinks is the best site, not what Google thinks is the best site.


At what point does Google cross over from "ranked by algorithm" to "ranked by algorithm selection as editorialized by Googlers and bloggers?"

Publicly discussing algorithms changes like this seems like a potential PR problem.


The whole history of Google is trying to find algorithms that encode our philosophy and mental model of what we think users want. We've been discussing algorithmic changes with people online since 2001, when GoogleGuy would show up on webmasterworld.com to dispel misconceptions.


This algorithm change could come across as more editorial and less empirical. It's publicized as highly targeted ("slightly over 2% of queries"), responding to a small tech discussion (Atwood, SO, HN, quality launch meeting), and seemingly rolled out in a short period of time. The media could easily boil this down to "a small number of people felt sites they liked were under-ranked, so Google moved them up a week later."

I mention this because Google often talks publicly about being entirely algorithmic, and elements of this narrative feel human.


2% of queries is huge. If you average that across a population (which of course isn't how it's actually distributed...), a searcher could expect to run across such a query every few days.


One potential solution to spam and personalization of results lies in doing the job on your machine, much like you can get rid of spam in your local mailbox.

I'd recommend 'Seeks' (http://www.seeks-project.info/). It requires that you run it on your machine, or use a public node. While on your machine, it 'learns' from your navigation and re-ranks results based on this local data. Additionally, I use regexps to remove websites I don't want to hear about, like expertexchange.com.


This is great news, thanks very much.


My current favourite spammy search (Windows process names) seems slightly improved, but still produces mostly unusable results.

Try something like: http://www.google.com/search?q=hidfind.exe

Another similar search area (Windows drivers for various hardware) has improved a lot judging from the few examples I just tried - actual hardware manufacturers are now amongst the top results for a change.

So that's quite a step forward...


Repeating a quick search I was doing a week or so ago to find drivers, I'm not seeing an improvement. Asus actually has a great driver site that is way better and less hassle than the spam sites, but I've yet to see it show up in search results. I only found it because one of the spam sites was using Asus's servers to serve up the drivers.

http://www.google.co.nz/search?sourceid=chrome&ie=UTF-8&...


When I searched for "rabbitmq exchange declare" (no quotes), I noticed the following.

Mailing list entries from http://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-... show up from old.nabble.com before the original source.

From a quick look there is significant improvement.


One caveat, though I imagine this has been thought of before, is that mobile versions of sites often have the same content as full-browser versions of sites.

So ideally, perhaps m.google.com would be able to sort through this and not penalize the duplicated nature of the mobile version... Anyway, something to think about if you haven't already.


So does this mean that Stack Overflow could remove the tag from the start of the title and still rank above the scrapers? I find it really distracting and often do a double take on an SO result in Google, because I've scanned the first word and seen a spammy-looking tag first.


I run a letting agency (estate agency). I put the properties on my low-ranking website. The content is not scraped, but republished to several property portals, some of them big names.

Who gets penalised? The higher-ranking, more frequently indexed property portals, or my local website?


Now that's good news. Thanks!


Matt, kudos! This is great news.

A question I've always pondered: what is Google's approach to more clever forms of rehashed content that involve photoshopping/cropping an original image? Is this something Google looks at?

Example source photo: (http://shoes.n-sb.org/img/thumbs/472153d5fc674fa1c685f3c7814...)

rehashed image: (http://images.sneakernews.com/wp-content/uploads/2011/01/nik...)


I got 99 comments but this ain't one :)


Thanks Matt, this is awesome!


[dead]


Google doesn't just remove sites from their index. They will always be there. But if you think a different site should rank higher than these spammy ones, you should mention it and maybe someone can look at it. (I don't work for Google.)


I meant remove them at least from my front page.


For a while there was a Google feature that let you do that, along with up/downvoting results. It got removed though, and I've never read why :/


Maybe here is a good place to ask - maybe someone can answer - what does the term "original content" actually mean? Original in what sense?

If you have a blog about wines, what do you need to write in order to be an "original" blog about wines? Does it mean mostly:

1) expressing original opinions about wines,
2) original sentence structure and original wording, but the same opinions that 10 other sites write about,
3) original brands of wine you talk about - original names mentioned,
4) original content in the sense that you have not copied the text 100% from someone else,
5) a combination of the factors mentioned above, or
6) something else that makes your content original?


In order to be considered "original" from the search engine perspective, it needs to be around 30% new content. There are websites like http://www.copyscape.com/ that can tell you how original something appears.

Internet marketers use software to "spin" text and make it appear unique. Sometimes the text is annoying for humans to read, but search engines eat it up. You can take one article and turn it into 50 with the click of a couple of buttons.
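As a minimal sketch of how that kind of "originality" scoring can work, here's w-shingling plus Jaccard similarity; the window size is an assumption, and this is not how Copyscape or Google actually work:

  # Sketch: score near-duplicate text via w-shingles and Jaccard similarity.
  def shingles(text, w=4):
      words = text.lower().split()
      return {" ".join(words[i:i + w]) for i in range(max(0, len(words) - w + 1))}

  def jaccard(a, b):
      sa, sb = shingles(a), shingles(b)
      if not sa and not sb:
          return 0.0
      return len(sa & sb) / len(sa | sb)

  # A score near 1.0 suggests a copy; "spun" text is engineered to push
  # this kind of score down without adding genuinely new content.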


I think this is a much simpler concept: not being a literal copy of someone else's text.

To be an "original" wine blog, write your own blog entries. Don't copy someone else's.


While you are at it, can you go ahead and de-index this site: http://www.google.com/search?q=site:livestrong.com


Hi Matt, I'm sure you're busy and don't want to answer all SEO-related questions anyway. I wanted to ask, though, what Google thinks about the importance of exact anchor text in rankings, as remarked upon in http://www.seomoz.org/blog/how-organized-crime-is-taking-con...?


Very happy to see the end of copied Wikipedia sites.


Please don't take my word for it.

Hit http://duckduckgo.com with your programming queries.

No - I am not affiliated and am not getting paid to write this.


Then why the advertisement? This is about Google's search result algorithms.


Because he couldn't find a better place to spam with traffic.


Hi Matt. Why doesn't Google test the results after changes to the algorithm? A simple program could store and compare n SERP results before and after the changes.


We do.


I still see sites with low-quality duplicate content but heavily stuffed keywords. Here's one such example: http://www.google.com/search?hl=en&q=internet+phone+serv... Check the listing for www.internetphoneguide.org. Also, for http://www.google.com/search?hl=en&q=voip+phone+service see the listing for www.zimbio.com, which is not an authority on the topic. Last but not least, for http://www.google.com/search?hl=en&q=home+phone+service the site www.freelifelinephone.com jumped to page 1 overnight with the latest update.



