"As “pure webspam” has decreased over time, attention has shifted instead to “content farms,” which are sites with shallow or low-quality content. In 2010, we launched two major algorithmic changes focused on low-quality sites. "
Looks like 2011 is the year that Google kills the scrapers. Look for an uptick in the sensitivity of the duplicate content penalty.
I personally am always baffled how some of the really large, spammy sites with low quality content are rewarded with premium AdSense accounts and great rankings.
The link given in the first example above got 6 upvotes, and there is little info in the post other than the link.
A basic description is all that most web users want. If you are doing some deep research, then these sites probably aren't good fits, but when you just want a quick overview (which is what most Google users want), they can be useful.
I have a sneaking suspicion that you are a webmaster who sometimes finds that your site(s) rank below the sites that you bash.
>> I have a sneaking suspicion that you are a webmaster who sometimes finds that your site(s) rank below the sites that you bash.
Honestly, that isn't the case :) I'm not a fan of monitoring the SERPs for a bunch of keywords and seeing where I rank. I sometimes do it for a few choice keywords, and I've never seen WiseGeek, About, or eHow rank above me (or have pages written on the searched-for topics). So that's not the case.
I pointed out BizRate since their site is mainly auto-generated and they have a keyword-stuffing section, neither of which seems to provide much (if any) value to users.
As for WiseGeek, they might have some useful content, but I've seen plenty of pages with almost as many words of adverts as text, which clearly isn't desirable. And some of its (no doubt quickly written) content is either unhelpful or in some cases wrong.
"Wireless flash drives communicate with wireless devices using wireless protocols."
You don't say?
"Advanced wireless flash drives stream data to more than one device at a time."
Well, we're the only one out there, so I guess they're all advanced!
"Insert the USB portion of your wireless flash drive into an available USB port on your computer. Your computer should automatically recognize the device. If not, click the "Start" button and then click "My Computer." Double click the flash drive in the removable media section. This opens the drive and displays the files. Double click the executable file (start.exe, for example) to start the installation process."
Uh, no. No software installation is ever required, a point we make abundantly clear on our web site.
Basically, it's all wrong.
I agree with Ryan that these kinds of sites are worse than scrapers because they're just as useless but algorithmically much harder to detect.
They're made to earn their owners revenue, not to help people with their particular queries.
I hope the point is clear -- "to earn revenue" isn't an appropriate test. But your point is well taken. I'd prefer a search to turn up the site made by the guy in the garage who is passionate about the subject rather than the copy churned out by a $4/hr content farm.
I think if we're to continue complaining about spam we have to actually define what we mean by spam. And I have this strange feeling that not everyone will agree...
So you don't want to just say "these are spam."
I recall About.com being a decent site a while back, but I believe it was later acquired by ... Yahoo!? Go figure.
I think that might be what makes it so complicated.
About.com is owned by the New York Times Company, not Yahoo.
I'm not saying that a clone will never be listed above SO, but it definitely happens less often compared to several weeks ago.
This happens for more than Stack Overflow clones. Mailing lists, Linux man pages, FAQs, published Linux articles, etc. all have clone pages that are obvious link farms (sometimes they even include ads that attempt to harm my computer) that rank higher than the "official" (or at least less-noisy) pages.
Ideally, I'd like to completely remove domains from results, as has been discussed elsewhere on HN. Hopefully this upcoming push for social networking that Google has will reintroduce a better-implemented "SearchWiki" feature...
> I've been tracking how often this happens over the last month.
> it definitely happens less often compared to several weeks ago
> I am seeing many, many more clone sites in my search
> results in the last few months
Result #4 at the moment is "AWS Developer Forums: Interactions between S3, EMR and HDFS ..." on http://www.hackzq8search.appspot.com/developer...com/...
What's sublime about this example is that:
1. hackzq8search is a clone of AWS's websites amazonwebservices.com, aws.typepad.com, etc.
2. hackzq8search is hosted on appspot.com, Google's App Engine domain
3. hackzq8search is over quota, so the site doesn't show any content anyway.
Yet this site was the top search result, beating out the site it was cloning, time and time again on my AWS/EMR-related searches this week.
The one mitigating aspect is that hackzq8search's URL naming scheme is easily decodable -- the hackzq8search URL includes the full URL of the cloned page, so I can write a Greasemonkey script to extract the proper original URL.
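For what it's worth, here's a minimal sketch of that script (written as TypeScript that compiles down to plain JS for Greasemonkey; the marker string and the embedding scheme are assumptions based on the URLs I've seen, so treat them as placeholders):

    // Sketch: rewrite hackzq8search clone links back to the original URL.
    // Assumption (hypothetical): the clone URL embeds the full original URL
    // right after its hostname, e.g.
    //   http://www.hackzq8search.appspot.com/developer.amazonwebservices.com/some/path
    const MARKER = "hackzq8search.appspot.com/";

    function extractOriginal(cloneUrl: string): string | null {
      const i = cloneUrl.indexOf(MARKER);
      if (i === -1) return null; // not a clone link
      const embedded = cloneUrl.slice(i + MARKER.length);
      return embedded.length > 0 ? "http://" + embedded : null;
    }

    // Rewrite every matching anchor on the search results page.
    document.querySelectorAll("a").forEach((a) => {
      const original = extractOriginal(a.href);
      if (original !== null) a.href = original;
    });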
I found a glimmer of optimism in that the site has been slowly fading in SEO success this week: I complained about https://encrypted.google.com/search?q=aws+s3+security+sox+pc... on Thursday, but by Friday the hackzq8search result was gone from the first page of search results.
It's still not hard to slam some AWS-related keywords into Google and get these bogus results, though.
To be clear: the webspam team does reserve the right to take manual action to correct spam problems, and we do. That not only helps Google be responsive, it also improves our algorithms, because we get to use that data to train better ones. With Stack Overflow, I especially wanted to see Google tackle this instance with algorithms first.
Sites that are the victims of content cloning tend to be very visible and valuable, so maybe a little manual curation would be appropriate.
> the Stack Overflow cloners could just make other websites
Not really? The point is not to tag the clones but to tag the original; everything that is not the original and that has copied content is a clone, regardless of its name, domain, or country.
The primary input to search engines comes from web crawlers... the question of what came "first" for duplicated content is already difficult to answer, and (I would guess) it would get much, much worse in the inevitable arms race if something like this were implemented.
Because we don't understand what's hard, we think you're not really trying, and then we make up evil reasons to explain that.
I believe that if people better understood the difficulties of spam fighting, they would be more sympathetic.
Not necessarily. The rate at which Google refreshes its crawl of a site, and how deeply it crawls, depends on how often the site updates and on its PageRank numbers. If a scraper site updates more often and has higher PR than the sites it's scraping, Google will be more likely to find the content there than at its source. Identifying the scraper copy as canonical because it was encountered first would be wrong.
But I'd like to say one other thing. Why is Google only doing something about web spam now, after people have pointed out how bad things have been getting? Has anybody considered creating a small team just to oversee public perception of the search results and try to keep on top of things in the future?
The efreedom answer at the 5th position is actually the most relevant - the stackoverflow question from which it was copied doesn't even show up on the first page. There is one stackoverflow result on the first page, but it deals with a more complex related issue, not the simple question I was looking for.
Of course, then all the content-copy farms will respond by copying valid content plus word lists - hopefully Google knows how to detect that.
It almost feels like a cache miss when I have to drop down to the official site/documentation, since that typically requires a greater time investment to read through to find the relevant sections.
I guess that's a tribute to how well Stack Overflow works, most of the time. And also to how lazy I am.
Stack Overflow comes in at number 8, while clones are at 6 and 10.
The reason Q&A sites are so visible is that people tend to type questions into their search engines, so Q&A sites are a good match for those queries.
And then another SEO cycle would start. Don't forget that before Google came along, nobody was trying to 'game the system' with backlinks and other trickery; the fact that Google is successful is what caused people to start gaming Google.
Typically you treat the search engine as a black box: you observe what goes into it (web pages, links between them, and queries) and you try to infer its internal operations based on what comes out (results ranked in order of what the engine considers to be important).
Careful analysis will then reveal, to a greater or lesser extent, which elements matter most to it, and then the gaming will commence. Only by drastically changing the algorithm faster than the gamers can reverse-engineer its inner workings can a search engine keep ahead, but there are only so many ways in which you can realistically build a search engine with present technology.
Your ideal, I'm afraid, is not going to be built any time soon, if you have any ideas on how to go about this then I'm all ears.
Moreover, the unique licensing around SO content, along with its mass, presents an interesting edge case for Google. They should of course fix it, but it's not indicative of the average or mode experience.
1. They diminish the value of Google Search as an advertising platform. And Google Search is likely the most valuable virtual estate on the net. I more often click on ads in Google Search than I click on ads on all other sites combined. This is because when I'm on Google Search I'm actually searching for something, so I might click on a relevant ad.
2. They diminish the value of AdWords content network ads. People pay Google to display their ads because they believe they get better return for their money there than on the alternatives (Yahoo and Microsoft). Ads on low quality sites are unlikely to be competitive, so these sites decrease the relative value of AdWords.
That is, high-ranked low-quality sites with AdSense are a double threat to the main source of income for Google, and I expect Google to make fighting them a top priority.
Why, then, aren't they more successful? My guess: because the problem is a lot harder than any armchair designer would believe. Problems tend to be a lot simpler when you are not the one who must solve them.
I disagree on point 2. Users on low-quality AdSense sites almost certainly arrived there from a search engine, so if I can display my adverts on a user's landing page, it will be almost as good as if they arrived on my site straight from Google.
1: One of Google's major weaknesses is the concentration of its revenue around a few large websites. Having the most popular advertising network on the web (AdSense) is an important asset for pretty much anything else Google wants to do.
If what you said were true, the greatest threat to the Google Search page would be a strong AdSense market with high quality content which everybody finds using organic search.
2: AdSense does not compete with the content network because it is long tail. The value for advertisers is in reach and diversity.
E.g., this query -- "travel agent vermont" (which I got from this post complaining about Mahalo spamming the web and Google not enforcing its own QC standards: http://smackdown.blogsblogsblogs.com/2010/03/08/mahalo-com-m...) -- still returns a Mahalo result in the top 10.
For an example (which was submitted as search feedback a month ago), try searching for "XMPP load balancing" and look at the third organic link.
The short version is: they used to serve answers to the spiders and ads and pitches to the surfers, and got busted for it. So now they show the answer at the bottom of a pile of ads and pitches.
But they still suck. Horribly. And are the number one example I hear when people say "I wish Google would let me blacklist domains".
Edit: Even a GM script to remove all the leading spammy divs would do...
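Something along these lines might be enough (a sketch in TypeScript, compiled to JS for Greasemonkey; the selectors are hypothetical placeholders, since I haven't inspected efreedom's actual markup):

    // Sketch: strip the ad/pitch divs stacked above the actual answer.
    // The selectors below are made-up placeholders -- inspect the page and
    // substitute whatever class names the site really uses.
    const SPAM_SELECTORS = [".ad-block", ".sponsor-pitch", ".promo-overlay"];

    for (const selector of SPAM_SELECTORS) {
      document.querySelectorAll(selector).forEach((el) => el.remove());
    }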
On Chrome you have Search Engine Blacklist.
On Firefox you can use the filter option of Optimize Google.
The indexed content is blurred out and there's a big ad overlaying it, with no close button or other method to display the content, other than the Google cache link.
Clicking the link again shows the additional content at the bottom of the page as you described. So there's some other algorithm at play.
Do you consider that site to be a quality & non-spam source of information?
The question is one of rankings. The only time there is a problem is when a spammy site ranks above a more relevant site for a particular search. If I enter a very specific query that only hits a spammy site, then I should see the spammy site, because it's there. Google is a search index, not a "visitors guide to the internet."
If they're serious about their standards, they would remove Mahalo en masse from their index.
edit: Or, to satisfy lukev, they can keep the index as-is but make sure Mahalo pages never rank high in results.
That's not quite what I've been reading. I believe the more common claim is that Google has a disincentive to algorithmically weed out the kind of drivel that exists for no other reason than to make its publisher money via AdSense. It's about aggregate effects, not failure to clamp down on individual sites. Or, put another way, it's not whether certain sites are serving Google ads; it's that this kind of content is usually associated with AdSense.
AdSense is definitely a problem for search quality. It creates the same imperative for the content farm as Google Search has: get the user to click off the page as soon as possible. And the easiest way to do that is to create high-ranking but unsatisfying content with lots of ad links mixed in.
If Google did not operate AdSense, it seems hard to believe the company would not have penalized this sort of behavior ages ago. A love for AdSense is probably the single largest thing spam sites have in common worldwide.
Disagree. Our quality guidelines at http://www.google.com/support/webmasters/bin/answer.py?hl=en... say "Don't create multiple pages, subdomains, or domains with substantially duplicate content." Duplicate content can be content copied within the site itself or copied from other sites.
Stack Overflow is a bit of a weird case, by the way, because their content license allowed anyone to copy their content. If they didn't have that license, we could consider the clones of SO to be scraper sites that clearly violate our guidelines.
Can you speak about the possibility for personal domain blacklists for Google accounts? I know giving users the option to remove sites from their own search results is talked about a lot in these HN threads. Is there any talk internally about implementing something like this?
Likewise, there are 200,000,000+ domain names. Even if every single employee did nothing but webmaster support 24/7, each Google employee would need to do tech support for 10,000 domains. The same argument goes for supporting hundreds of thousands of advertisers.
The problem of user, customer, and advertiser support at web-wide levels is very hard. That's why we've looked for scalable solutions like blogging and videos. I've made 300+ videos that have gotten 3.5M views: http://www.youtube.com/user/GoogleWebmasterHelp for example. There's no way I could talk to that many webmasters personally.
So we haven't found a way to do 1:1 conversation for everyone that has a question about Google. That's not even raising the back-and-forth that some people want to have with Google. See http://www.google.com/support/forum/p/Webmasters/thread?tid=... and http://www.google.com/support/forum/p/Webmasters/thread?fid=... to get a glimpse at the sort of prolonged conversations that people want to have with Google. In short: it's a hard problem.
Example: my AdSense account or site gets banned in a non-automated way because there is some problem with the content -- not by an algorithm, but because somebody actually looked at my site.
I should, in that case, have a chance to communicate with Google. This is inherently scalable, as everything started with a 1:1 action.
Google needs many happy users to be able to offer an attractive product to its customers. They depend on people using them just as much as they depend on advertisers paying them money. Both sides of the market are important, advertisers and users. Google cannot just ignore one side.
Those who use Google to search don’t pay Google any money and are, in that sense, not customers. You shouldn’t read much more into that word, though.
(Two sided markets and network effects are fascinating and relevant to so many discussions on HN. Wikipedia has a pretty good primer: http://en.wikipedia.org/wiki/Two-sided_market)
I'm skeptical, because spammy-ad clickthrough rates are already low and trending lower, and I speculate google has great incentive to send people where they want to go lest their competitors get stronger.
It just wouldn't make sense for Google to suddenly abandon that (very successful) strategy and say "let's keep spammy/low-quality sites around and send users there because we make money off the ads." We make more money long-term when our users get great results. That makes them more happy and thus more loyal.
My preference is just to enforce a hard time deadline. If the ads team starts to miss that deadline and revenue decreases, then they're highly motivated to speed their system up. :)
While I might not love eHow's process and tactics, I'll admit that I've benefited from one or two of their easy-to-read articles. (Though I'll often second source that information.)
I think Google's in an interesting pickle deciding whether results should be fast food or fine dining.
Google could serve up 'fast food' in results (e.g., farmed content) and it would likely be 'good enough' for MOST people. Plenty of folks will eat fast food even if there is a better alternative down the road.
I'd like to see more fine dining results, but I'm not sure I'm in the majority there. Quality for the early-adopter might look different than quality for the late-majority.
If the algorithm is leveraging user behavior as a signal, doesn't it follow that popular and familiar sites and brands may gain more trust and authority?
Is Google thinking at all about search in this way?
As Matt's post suggests, it could simply be that people's expectations are rising -- search results are getting so good in general (which they are) that we notice the problems more. Or it could be that Google is focused on a narrow definition of "spam" that doesn't cover content farms. It could even be that both sides are "right" -- that overall search quality is rising even as the content farm problem worsens, if Google has been successfully reducing other causes of low search quality.
I'd love to see some hard analysis of this. For instance, pick a reasonably large set of sample queries, and show what the results looked like five years ago and what they look like today. Of course, you'd first have to find a set of sample queries and results from five years ago.
These songs and albums are not available legitimately through torrents. What value is there in providing links to pirated content? I understand that Google is not under any legal obligation to remove these results, but as a non-pirate these results are significantly lowering my perception of the quality of Google's search results.
Does that mean that Google manually decreases the rankings of spammy sites that its algorithms haven't caught? Does this entail decreasing the rank of the entire domain, or the IP? Does blacklisting ever happen?
I ask since Google has previously said it doesn't wish to manually interfere with search results.
 "The second reason we have a principle against manually adjusting our results is that often a broken query is just a symptom of a potential improvement to be made to our ranking algorithm" - http://googleblog.blogspot.com/2008/07/introduction-to-googl...
Although our first instinct is to look for an algorithmic solution, yes, we can. The blog post you mentioned says:
"I should add, however, that there are clear written policies for websites recommended by Google, and we do take action on sites that are in violation of our policies or for a small number of other reasons (e.g. legal requirements, child porn, viruses/malware, etc)."
As the quote mentions, we do reserve the right to take action on sites that violate our quality guidelines. The guidelines are here, by the way: http://www.google.com/support/webmasters/bin/answer.py?hl=en...
Edit: By specifics, I don't necessarily mean implementation details, just anything more informative and plan-of-action than acknowledging the problem.
I am not sure if Posterous offers the ability to add a robots.txt file with which I can tell search engines to exclude it, but would that prevent my primary site's ranking from being affected?
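For reference, if Posterous does let you serve a robots.txt at the subdomain root (an assumption -- I haven't checked), the file that tells all crawlers to skip the whole subdomain is just:

    User-agent: *
    Disallow: /

The catch is that robots.txt only works from the root of the (sub)domain, so a hosted service has to expose it for you; it would at least keep crawlers from fetching the duplicate copy.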
Though, I do have a question about this... Sometimes I find that one normal (do-follow) link on a page is appropriate for my readers (e.g. the content matches, and I approve of the site), so in that situation I would like to sell the link -- and absolutely with a CLEAR "sponsored link" label in plain view, but without rel="nofollow". AFAIK Google doesn't approve of do-follow sold links in any way. I find this a bit of a problem, because good links from commercial activity can be just as appropriate and can benefit my readers.
Any advice, or perhaps future plans, on this?
In terms of AdSense: if you really think about it, AdSense content on a page should probably be a slightly negative ranking signal, not merely a neutral one. The very best quality pages have no ads. Think of government pages, nonprofits, .edu, quality personal blogs, etc. If no one is making money off a page (no ads), then whatever purpose it has, it is likely to be non-spammy.
We see a virtually unbounded number of problems with our search results, and we're working constantly to fix them. Most of the people I talk to who work on search have the attitude that Google is horribly broken all the time, it's just also measurably the best thing available.
Google as a company, and search quality in particular, does not rest on its laurels. The people who hate Google's search results the most all work at Google. If you think you hate Google's search results as much as we do, you should come work for us. :)
Why not allow the community to sort this out? "Google Custom Search" already exists. Google could extend it in the direction of letting people customize Google search to exclude certain sites from the results (right now it is only possible to specify a list of sites to include in the search).
Blacklists for at least specific fields of search would emerge very quickly. People could select which blacklists to use, if any.
- Treat it as a signal when power searchers start adding -somedomain.xyz to their searches (see the sketch after this list).
- Increase spam reporting by adding some kind of feedback to the spam reporting feature. I think I'd love to get an automated mail saying something like: "The site somespamdomain.xyz that you and others reported x days ago is now handled by our improved algorithms." Submitting spam reports really doesn't feel useful when it seems like nothing ever happens.
- Adding weight to spam reports. You know a lot about us, and I guess you can filter out who are power searchers. This could help stop people from gaming the system into blocking competitors.
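In the meantime, the exclusion idea already composes mechanically today with the -site: operator. Here's a minimal sketch, in TypeScript, of bolting a personal blacklist onto queries client-side (the listed domains are just placeholders):

    // Sketch: rewrite a query to exclude personally blacklisted domains
    // using Google's -site: operator. The domains below are placeholders.
    const BLACKLIST = ["efreedom.com", "somespamdomain.xyz"];

    function withBlacklist(query: string): string {
      const exclusions = BLACKLIST.map((d) => "-site:" + d).join(" ");
      return query + " " + exclusions;
    }

    // Example: withBlacklist("xmpp load balancing")
    //   => "xmpp load balancing -site:efreedom.com -site:somespamdomain.xyz"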
They don't want useless content in their SERPs. If someone goes directly to the domain, that's in no way related to search quality.
In years past, Google's results were measurably less relevant than they are today. In the time between "then" and "now", we've grown more accustomed to high-quality, fast, relevant results. I think this makes it seem like small problems in search are bigger than they are.
It would be great if there was a "Google of 2004" to test this side by side, but I don't think that is possible. :)
Google issues an announcement via blog post. TC and others start to pick it up. And the original author of the blog post takes questions and provides technical answers, where allowed, on HN.
That said, I agree with Google, users' expectations have skyrocketed, and it is tough to keep pace with them.
Personally I tend to find what I'm looking for by adding a few more words, but in the case of reviews and tech stuff it doesn't always work and I often have to rewrite my query one or more times to get something valuable.
This is one of my pet peeves. If I search for "product X review", most of the results I get back are of the form "be the first to review product X", which is absolutely not what I want.
If the measurement system is able to detect spam, why is spam not removed in the first place, before it has a chance to show up in the metrics...?
Second biggest issue is poor Wikipedia articles appearing in the top results for almost any reference-type query. Many less frequently updated Wikipedia articles are nothing but regurgitated content lifted from other quality sources. What makes it worse is that Wikipedia uses nofollow for its links, so even if those sources are linked in the reference section, they won't get any credit. It's interesting to see so many people complain about low-quality content on commercial sites but never mention Wikipedia, which is a much bigger offender (I guess this might be because Wikipedia gets its content for free and doesn't have ads, while other sites pay for the content and do have ads).
Third, I hope Google doesn't make any changes without checking very carefully that good sites will not be negatively affected. For example, newspapers will often have the exact same articles from the AP, but also original content based on their own reporting. Punishing them for having duplicate content would not work well. There are many similar possible pitfalls.
The long answer is that without Wikipedia results, Google's search quality would be at an all-time low in terms of relevance, freshness and comprehensiveness.
What you are saying != how you are performing.
Time and time again Google has failed. I've already moved on to Bing and DuckDuckGo, and I would recommend you do too -- unless you like digging through hordes of useless SERPs.
Numerous college & university courses use our book as a required textbook. Consequently, many teachers & professors link to it from course websites to let their students know where to download the book's code examples. Yes, that probably helps the site's Google rank, but that was never one of our goals.
I'm not sure why you thought it was a link farm. Perhaps you should have taken a closer look at the site before using it as a bad example?
Pardon if this seems nitpicky, but I just wanted to share one of my recent experiences where Google failed me (I have another example too, but that search was personal), while Twitter did not.
I'm just now testing this query, but when I did [fallston md walmart], there is a really comprehensive article at http://www.exploreharford.com/news/3074/work-start-fallston-... that shows up at #5. That article mentions that "The new road will lead into a 147,465-square-foot Walmart, which has been planned with a possible 57,628-square-foot expansion" Then I did the search [walmart supercenter size square feet]. The #1 and #2 results are both pretty good, e.g. the page at http://walmartstores.com/aboutus/7606.aspx says that an average store is 108K square feet, and supercenters average 185K square feet.
As a human, I can deduce that 147K square feet (with 57K square feet of potential expansion) implies that it will probably be a supercenter, but it's practically impossible for a search engine to figure out this sort of multi-step computation right now. My guess is that in a couple months when the store opens up, this information will be more findable on the web--for example, somewhere on Walmart's web site. But for now, this is a really hard search because there's just not much info on the web.
I appreciate you sharing this search. I think it's a pretty fair illustration of how people's expectations of Google have grown exponentially over the years. :)
Google, of all companies, I would have thought would understand and respect the importance of giving people power over their own technology experiences.
I played with AdSense a lot in the past, and if you did too, you should know how spam sites generate a lot more clicks than sites where the user is actually focused on reading content...
A proposal: http://www.saigonist.com/content/google-spam-content-farm-fi...
They'd be a lot more credible without the corpo-speak junk in the first paragraph.
I'm sure that Google does better than it did in October 2000. There'd be something horribly wrong if that weren't the case. But that's not relevant to the specific concerns people have about what's happening here and now.
Regarding the specific concerns that people have been raising in the last couple months, I think the blog post tried to acknowledge a recent uptick in spam and then describe some ways we plan to tackle the issue.
Personally I think any type of "scheming" in technology will eventually get caught and then all of a sudden there goes your business model.
So I guess it depends. If a 'content farm' produces good, helpful content, then that's great and should be encouraged (even if it is done on a massive scale). But when it comes to a case like WiseGeek, where content is spat out en masse even when it's crappy and really short, it becomes a problem.
Until the query "buy viagra" is no longer littered with forum-link and comment spam and parasite pages, Google's algo is still not "fixed".
With a billion searches a day, I won't claim that we'll get every query right. But just because we don't "solve" one query to your satisfaction doesn't mean we're not working really hard on the problem. It's not an easy problem. :)
There's a growing concern/consensus that in loads of non-spammy niches, the only way to get decent rankings is to build links.
I know this is naturally something that would be very hard for Google to tackle (since if 'junk' backlinks = domain penalty, black hatters would simply spam their competitors' sites with backlinks and get their competitors deindexed), but are there active efforts being made so that gray and black hat link building campaigns aren't (in some cases) pretty essential to a site getting good rankings?
The simplest response is that link spam works, has worked for years, and coming into 2011, it appears to be as strong as ever in terms of ranking sites for competitive niches.
It seems to me a $200M company offering to sell a product to you via their website would seem more legit than, say, a site you're redirected to after someone hacked some .edu account. (At this moment, the second result on Google for buy viagra appears to be a link to www.engineering.uga.edu, but that redirects to some pharmacy site if you click through from Google (but not if you load it directly).)
So yes, I'd say there's at least some way to compare the legitimacy here.
I expected more. It reads like content-farm copy.
To respond to that challenge, we recently launched a redesigned document-level classifier that makes it harder for spammy on-page content to rank highly. ... We’ve also radically improved our ability to detect hacked sites, which were a major source of spam in 2010. And we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.
And they are saying pretty clearly that they think they can do a better job on content farms:
Nonetheless, we hear the feedback from the web loud and clear: people are asking for even stronger action on content farms and sites that consist primarily of spammy or low-quality content. ... The fact is that we’re not perfect, and combined with users’ skyrocketing expectations of Google, these imperfections get magnified in perception. However, we can and should do better.
When is "recently"?
> We’ve also radically improved our ability to detect hacked sites
Since when? All the complaints I have read are from the last few weeks, and the hacked-site spam was still there when I looked up a medical complaint this week.
> And we’re evaluating multiple changes that should help drive spam levels even lower
Oh, are you now? Of course... Translation: "We are looking into it." D'uh.
> one change that primarily affects sites that copy others’ content and sites with low levels of original content
This last part is the only datum I got from the article -- they explicitly respond to Stack Overflow, etc. It is still fluff, though.
What I would have preferred:
"I made a change on 2011-01-15 and you should see it here, here, and here." ('Here' can be broadly defined.)