How Organized Spam is Taking Control of Google's Search Results (seomoz.org)
136 points by JoelSutherland 2250 days ago | 51 comments



Part of the problem with "spammy" content coming out on top is often that the competition from real content is pretty thin.

To take the example of Pandora jewelry, Pandora is a company that controls its marketing channels with an iron fist. They're very careful to partner with better-than-average jewelers and each retailer has an exclusive territory. So far as I know, there's no legitimate channel for new Pandora products online (everybody who claims to sell them on eBay seems to have fewer than 10 feedbacks).

Thus, other than Pandora's official website there's no legitimate e-commerce presence for Pandora online, so there's nothing to compete with the junk. Somebody might randomly write about them, but there's nobody (legitimate) who's got a feedback loop going where revenue supports content creation and marketing efforts -- which will inevitably come out on top against amateur competition.

Demand Media, ExpertsExchange and quite a few junk sites similarly thrive on the lack of good content. I was having trouble changing the ribbon on an old typewriter a few weeks ago, and web searches asking about this particular model turned up junk pages with advice like:

(1) Buy a new typewriter ribbon, (2) Take the old typewriter ribbon out, (3) Put the new typewriter ribbon in

Now, these pages were keyword stuffed with the name of the typewriter, but they didn't even bother to have an affiliate (or other) link to a place where I could buy the goddamn typewriter ribbon, which according to them is 33% of the work!

Once more, the feedback loop doesn't exist to nourish a good answer here, so of course the blight is going to move in.


To reinforce the feedback angle, I think it's important to point out that the weeds starve out the opportunity for genuine content to grow into a valuable audience. Even when there's real content that wants to compete, the risk is that the upfront investment required to peek out above all the crap exceeds the profit from serving that niche.


[negative allelopathy]

And, in those cases where the content mills do have a few morsels of useful information, they've usually just pushed other less professionally/cynically optimized sources for the same info down off the first page of results.


you bet


That's why sometimes I think sites that do manual culling like Quora are greatly valuable. Sure, when they have sizeable information, people will try to spam it too, but that spam will soon get cleaned up. Just like Wikipedia.


I think you've hit the nail on the head.

This complaint seems to come from the recent attitude that absolutely everything is on the web. But tightly controlled products inherently don't have much web presence, since their producers like control, and all the search terms in the article seemed to point to these.

"NFL Jersey" - That sounds like "spam bait" already. Lots of (rather unsophisticated) people want these and the NFL itself controls its authorized products.

If you tried "cheap viagra", your results wouldn't be encouraging.

Sorry if the Interwebs are not meeting your expectations...

Perhaps Google should rate the quality of its own searches... That idea is serious.


Well, the genie is out of the bottle, because Google created that market. If you type a question, Google will find an answer. If nobody is interested in the problem, spammers will give you something.


I run a blog host, so I get to see these spammers at work. Every day, they sign up for several hundred new accounts and post informative articles on how to find NFL Jerseys, Ugg Boots and Tiffany Jewelry, all with plenty of links back to sites like the one in the article.

The scary thing is that it's not automated. There are real people pasting in content and checking to see that it's correct. Fortunately for me, it's all going straight into my bayesian filter's spam corpus and making it easier to detect, but even for my one site it must be costing somebody a lot of money to post it all.

If Google had an API to report this stuff, I'd be happy to forward it along to them on the fly. Seems that there are plenty of User-generated-content sites like mine with a ton of valuable spam data if anybody figured out a way to use it.

If anybody's interested, here's what we're doing to keep the site spam free:

http://expatsoftware.com/articles/2010/03/care-and-feeding-o...
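For anyone curious what a bayesian filter fed by a spam corpus looks like in practice, here is a minimal sketch. It is a generic naive Bayes classifier with add-one smoothing, not the filter described in the linked article; the class name `NaiveBayesSpamFilter` and the training posts are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical, tiny multinomial naive Bayes classifier (add-one smoothing)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one labelled post to the corpus for that label."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total_docs = sum(self.doc_counts.values())
        # Smoothed log prior for the label, plus the log likelihood of each token.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total_words + vocab))
        return score

    def is_spam(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")


# Made-up training data: every confirmed spam post grows the spam corpus.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap nfl jerseys wholesale ugg boots discount", "spam")
spam_filter.train("buy tiffany jewelry cheap ugg boots outlet", "spam")
spam_filter.train("we moved to spain and wrote about the visa process", "ham")
spam_filter.train("notes on deploying our blog software this week", "ham")

print(spam_filter.is_spam("discount ugg boots and cheap nfl jerseys"))
print(spam_filter.is_spam("our notes about the visa process"))
```

The point of the sketch is the feedback effect mentioned above: each hand-pasted spam post that gets caught adds its words to the spam counts, which makes the next similar post easier to catch.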


That gives me an idea. What if Google were to provide free anti-spam tools like Akismet that integrate with forums, blogs, and wikis? They could detect the spam patterns and essentially blacklist those spam sites from the search engine. With enough sites using their tools, Google could build a significant dataset of what these sites are trying to do to generate spam links.

Wouldn't detecting who's trying to generate link spam be a fairly effective way of removing them from search engine results?
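A rough sketch of how that aggregation might work, assuming each participating site reports the links found in posts it has flagged as spam. Nothing here is a real Google or Akismet API; `build_blacklist`, the `min_sites` threshold, and the `.example` domains are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_blacklist(reports, min_sites=3):
    """Return linked domains reported as spam by at least `min_sites` distinct sites.

    `reports` is an iterable of (reporting_site, spam_link_url) pairs.
    """
    reporters = defaultdict(set)
    for site, url in reports:
        domain = urlparse(url).hostname
        if domain:
            # A set of reporters means one noisy site cannot blacklist a domain alone.
            reporters[domain].add(site)
    return {domain for domain, sites in reporters.items() if len(sites) >= min_sites}


# Made-up reports from three independent sites.
reports = [
    ("blog-a.example", "http://cheap-jerseys.example/nfl"),
    ("forum-b.example", "http://cheap-jerseys.example/ugg"),
    ("wiki-c.example", "http://cheap-jerseys.example/"),
    ("blog-a.example", "http://one-off.example/page"),
    ("blog-a.example", "http://cheap-jerseys.example/again"),
]
print(build_blacklist(reports))
```

Counting distinct reporting sites rather than raw reports is a deliberate choice in this sketch: it makes it harder for a single site (or a single malicious reporter) to get a competitor delisted.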


I ran across this a few months ago myself while searching for bicycle info. Note that this is completely different than the spam content (SO, eFreedom, etc.) issue.

I wanted to learn more about the bianchi infinito and what I found was a number of web stores selling bicycles at impossible prices. I mean oscommerce or zencart or whatever instances of legitimate looking web stores. Then when you look closer they're mostly in Indonesia and in order to complete the purchase you have to bank wire the money.

I think it's spilling over from alibaba and similar sites where 99% of the vendors are scams. Now those vendors are creating whole ecommerce web presences to make their scam sales.

The stores I saw didn't usually make the first page of results. Usually third or fourth. Sometimes second. Anyway I was a bit shocked at how many fake stores there were and how they ranked as highly as many legitimate bike shops.


I think I would be interested in having the option to filter out all search results for shopping sites. There already is Google Product Search for that.


Funny this is being posted the same day as Matt Cutts' HN post which is currently at #1.

Also per exhibit one of the article: The first hit for "nfl jerseys" I get, even with pws=0 is to nflshop.com. The website that nfl.com links you to when you click on the "shop" link.

More bandwagon jumping about google spam being out of control? I like to think so.


Note that Cutts is talking about a different problem: sites that 'syndicate' content and end up ranking better than the original site. People have been complaining about this for years (usually third-tier bloggers who don't have much ranking power) but people perceived that this became a crisis in the last few months.

The morals are different too. Some people might not like eFreedom, but the fact is that StackOverflow is CC-BY-SA. Anybody who wants to repackage StackOverflow content in a different way is free to do that. I do think that StackOverflow should generally outrank eFreedom, but a site like eFreedom can potentially add a lot of value.

On the other hand, other spam sites are generating original crap content with their own crap content generation system... And if they aren't, they can switch to some other content generation method to get around duplicate content filtering.

(And speaking of which, duplicate content filtering of some kind is absolutely essential for a workable web search engine... It's not even a matter of spam. Building a search engine for one of the largest units of a large Uni, we found that there were many documents that were duplicated all over the place for all sorts of reasons, and that since the on-page factors are the same, these tend to form 'plugs' of search results that displace other results.)
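To make the 'plugs' point concrete, here is a minimal sketch of one common way to collapse near-duplicates in a ranked result list: word shingles compared by Jaccard similarity. This is an assumption about technique, not a description of what that university search engine or Google actually does; the function names, the `threshold` value, and the sample documents are illustrative.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def collapse_duplicates(ranked_docs, threshold=0.8):
    """Keep the first of each group of near-duplicates, preserving rank order."""
    kept = []
    for doc_id, text in ranked_docs:
        sig = shingles(text)
        # Keep the document only if it is not too similar to anything already kept.
        if all(jaccard(sig, kept_sig) < threshold for _, kept_sig in kept):
            kept.append((doc_id, sig))
    return [doc_id for doc_id, _ in kept]


# Made-up ranked results: the same page mirrored in two places, plus one distinct page.
docs = [
    ("dept/policy.html", "Parking permits must be renewed every year by all staff"),
    ("mirror/policy.html", "Parking permits must be renewed every year by all staff"),
    ("dept/faq.html", "Where to find the campus map and shuttle schedule"),
]
print(collapse_duplicates(docs))
```

The pairwise comparison is quadratic in the number of results, which is fine for a single results page; a crawler-scale index would need something like hashed signatures instead.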


At least they are impartial, today I have seen osdir outrank google on google for a result from the archives of a googlegroups hosted mailing list. But it is annoying.


Not annoying, a good thing. Searching a group from within google groups often doesn't work correctly.


"...but a site like eFreedom can potentially add a lot of value."

Genuine question here...are you talking specifically about eFreedom, and if so exactly what value does it add? When I've inadvertently stumbled in there, the questions and answers are an exact ripoff of SO, and I (and I suspect everyone else) just immediately click on the "from StackOverflow" link so all the responses in the original can be read.


Whoever runs eFreedom has addressed this at http://efreedom.com/About/

"Simplification of the user interface. We show only the accepted answer (or highest voted answer if no answer has been accepted yet). We removed the sidebar, comments and vote counts in order to minimize distraction. This gets you to your answer and on to your project quickly."

There are a few more points listed there such as translation and displaying snippets of related questions that seem to show a genuine effort to help answer questions.

I think eFreedom nicely illustrates the problem Matt Cutts and the folks at Google face. The site appears to be playing by Stackoverflow's and Google's rules, possibly doing SEO better than Stackoverflow. If Google also ranks based on page load times, eFreedom might be helped even more, since the site lacks the majority of Stackoverflow's features and might load faster. So a programmer who has never heard of or cared about Stackoverflow, interested only in finding a specific answer that includes their Google query, might see nothing wrong with the eFreedom response. Suppose the majority doing Google searches preferred eFreedom based on measured clicks. Should the fact that Stackoverflow was the originator of the content guarantee them a higher page rank? What if Stackoverflow was slow? And stepping back from these two specific sites, how do you deal with that across all sites and their clones?

I avoid eFreedom links because I enjoy participating in Stackoverflow and use the other features, and despite the assurances of the folks at eFreedom the site still seems shady. It seems perfectly reasonable to me that Google would take steps to ensure Stackoverflow ranks higher. But that's just one case out of many. It seems like a tough problem for Google to solve across the Internet.


Thanks for the link, but sorry, I don't get it. The accepted answer on SO is right at the top, and if I thought eFreedom had a better interface we wouldn't be clicking through to SO each time.

To each his own, I suppose, but I'm not impressed by what appears to be their "value" - SEO - and agree the site seems a bit shady.


For one thing, eFreedom.com actually answers your question. This is different from ExpertsExchange (which promises you might get an answer if you fork over $, yeah right) or eHow which only sometimes answers your question, and if it does, does the worst possible job that could possibly be done.

Community sites, at least in their early phases, need to focus on getting people to put content in more than they need to focus on making it easy for people to get it out. Delicious is the classic example: it's a roach motel which makes it very easy to put your bookmarks in, but doesn't provide a useful browsing interface for your and other people's bookmarks (other than having a list of recently hot for various tags.)

Particularly in the semantic age I think there's a lot of room for remixing CC content to improve browsing and discoverability.


>For one thing, eFreedom.com actually answers your question. This is different from ExpertsExchange (which promises you might get an answer if you fork over $, yeah right)

Not sure if I'm taking you too literally, but Experts Exchange does answer your question without having to pay (scroll down).


I don't understand this. Are you saying that efreedom adds original content to that which they scrape? I avoid them like the plague, but my exposure has taught me that they are merely reprinting SO content with crappy formatting. Not much of a value-add in my eyes.


No, I'm not really defending eFreedom. However, I think that sites that are ~like~ eFreedom in some ways could be useful. For instance, large scale text mining could create things that are more than the sum of their parts.

For example, I think within 10-20 years at the most we'll have systems that can decompose text into facts and then reassemble it into 'original' text.


Actually there are link farms that are doing exactly that in order to appear to robots to be original text. However, it's just chunks of text "mined" into a mass of subject-focussed sentence fragments. That's the thing, the race to the bottom is: original content is scraped without improvement in order to pay someone else via ads, and original content is generated without regard to coherence in order to pay someone via ads.

Two ways to do this: you have good content either left intact (no value-add) or rearranged or otherwise structurally corrupted in order to appear to be a different/better answer (value-minus), or you have advertisers being led to believe their ads are showing on relevant content, when it's really just a jumble of random words loosely oriented around a concept. "The dog was dog walking. Dog food always is in the grocery store. RALEY's. It dogged him for years..." so on and so forth.

On one hand users are being defrauded and on the other, the advertisers/affiliates. There is no defense for eFreedom, nabble, mail-archive, and their ilk. They are bad people, bad for business and bad for the internet. I sincerely believe this.


Well, I'm not trying to belabor the point and maybe I haven't looked enough on eFreedom, but I don't see any original content there, it's just a scrape. Even the "related links" is scraped.

I still don't get the value in that. Where is the "content in" (which I take to mean content generation rather than duplication) that you mention?

I agree with you about the other sites.


I see NFLShop.com as #1, but some of the rest appear to be spam (in some cases it's hard to tell if they are legitimate or not). I looked for legitimate sellers of NFL jerseys and by and large they have terrible SEO. I don't know how companies haven't gotten wise, but check out FinishLine.com's NFL jerseys landing page:

http://www.finishline.com/store/shop/nfl/nfl-jerseys/_/N-2z7...

They have "nfl-jerseys" in the URL which is about the only redeeming thing. The page title is unrelated, which is what would show in a SERP. I clicked on the top result, a women's Ben Roethlisberger jersey and the page has nearly zero information and the images 404 (!).

http://www.finishline.com/store/product/reebok-womens-pittsb...

Compare that to one of the results that comes up in Google and you can see why. Great titles, URLs, the filters don't require forms.


http://duckduckgo.com/?q=nfl+jerseys

I believe Google's results are less spammy than duckduckgo for this particular query.


Bing has similarly poor results. This isn't a Google problem -- it's endemic to all search engines.


Agreed. The article's title deliberately said Google, as if the other engines are doing a better job.


No, the article's title said Google as if the other engines do not exist / are irrelevant.


"Google needs to greatly lower the value of keyword-rich anchor texts."

Won't this have a lot of adverse effects? And if keywords in anchor text become less valuable, can't spammers compensate by ramping up their existing efforts?

"I would not be surprised to see Google shift even more ranking signal power from anchor-text heavy links to relevant social media “chatter”."

Why would this be harder to game than links?

Spam happens because search is hard. There are probably solutions, but they're not as easy to come by as the ones suggested in tfa. Still, it's good to see this sort of community feedback on search results, especially given how responsive the search team is to this sort of thing. Keep up the good work, guys.


There might be an AI solution to discriminating real social media chatter from fake content.



Yes, they sell copied stuff. Here in China there are huge organizations handling these things. No surprise that in some of the examples the wording sounds very Chinese and even some Chinese characters appear. I am pretty certain where those guys are located.


Contraband and counterfeits have long been a source of funding for gangs -- both sides of the Northern Irish conflict, ETA in the Basque country has a commercial wing, and I'm sure the more traditional mafias are into this sort of thing too.


There's a change slowly rolling out that improves the [nfl jerseys] query substantially. I'll check on the others. Thanks for the examples.

Some really dramatic changes to how we use links are on the way. (Sorry I can't say anything more specific. This is a really sensitive area.)


Warning: contrarian rant ahead.

Something has been bugging me for a while, and it took a few hours after I read this article to figure out what it was.

I love the coining of a new term: "organized spam", and I love calling out things that are wrong, but I wonder if we're not taking this crime metaphor a bit too far.

Look guys, it's a search engine. You type in a search term, it gives you results. There's nothing magic or special about it -- anybody with a smidgen of database training can make one (although nowhere near as good as Google's, granted).

Although some of these examples involve people ripping other people off, I get the feeling that somehow Google has become such a part of our lives that we feel as if somehow these folks trading links and trying to get attention are acting criminally. That anything that gets in the way of my getting instant information is a crime against humanity. That really bugs me.

It's not. Get over yourself. Sure, large parts of this may be well-funded, but there's nothing necessarily criminal going on. For instance lots of poor people in lots of third-world countries are making money dropping by my blog each day and telling me how awesome I am. It's not expected, but I'm happy they're making a few dollars. I can live with the inconvenience or try to fix it on my end. I don't need to blame them.

I don't like the state of Google search right now either, although I'm still a loyal customer. But what I see in the marketplace is humans reacting logically to their best interests. If you're going to monetize google search so that billions of dollars flows through it, there's going to be some ancillary effects that nobody predicted. Instead of blaming the people, understand that the people are just regular, intelligent folks doing the best they can. Hell, my wife is in a social group with a lady who made several thousand dollars adding advertiser text to her blogs -- until Google delisted her. She saw nothing wrong with it, and still is pretty pissed at Google. From her standpoint Google crapped all over her party.

And yes, Google has every right to delist sites and such. More power to them. I hope they continue to delist and evolve their search engine. I hope they get a handle on this. But I think we should all separate our well-wishes for Google's success from our opinions of our fellow man. I've heard linkspammers and spammers called "subhuman" and all sorts of nasty things. While there are criminals who are trying to rip you off, there's no evidence that there are more criminals on the web than anywhere else. Most of these people are trying to make a living. The fact they might inconvenience you on your way to get an answer to a technical question or find the latest mp3 you have to have is really not that high on their list of priorities -- nor should it be.

Google needs to do a better job. Period. There seems to be this "conversation machine" right now where people post articles showing how bad search is, then folks come out and rant, then Google makes an announcement. Repeat and rinse. It's as if we went down to the local newsstand and asked the grocer for a magazine on trucks. He gives us a bunch of magazines on boats, so -- we blame the magazine publishers! It's simply not logical. A little perspective, please. Google is the provider here and those of us who like them should try to help out. But we shouldn't cross the line into thinking that anybody that annoys Google or a searcher is somehow evil or criminal. That's crazy. Much better to understand people as rational actors than to demonize anybody who tricks some random American internet company.

</rant>


I'm not going to comment on your entire post, nor am I against what you're saying, but I did want to comment on this bit:

>It's as if we went down to the local newsstand and asked the grocer for a magazine on trucks. He gives us a bunch of magazines on boats, so -- we blame the magazine publishers!

This analogy would be more true to the situation at hand if you say that the magazine publishers are using methods that they know will increase their chances of getting boat magazines in front of your eyes when you're seeking truck magazines. Do you think my assertion is off-base?


Yes I do, and here's why.

The magazine publishers are free to configure their magazines and the world around them in any way they wish. The newsstand operator is responsible for what goes on inside his stand. If he's serving up junk, do we go blaming the rest of the world for the quality of his service?

Somehow we've taken Google out of the picture as an independent agent. It's as if whatever program they are running is somehow golden, and by outsiders changing the inputs that Google uses so that it doesn't work correctly that somehow the outsiders are at fault. But outsiders don't set the inputs -- Google does. Outsiders don't write the ranking algorithm -- Google does. Outsiders don't make money from having ads alongside search results and tracking individuals' search behavior -- Google does. Outsiders are free to do whatever they want -- that's the entire reason for picking one search provider over another, the fact that one engine can take the world as it is and do a better job of organizing it than another one can.

If we don't expect Google to be responsible for how they process data -- if we somehow take Google's poor results and put the blame on the world at large, then exactly what of value is Google providing here in our relationship?

Like I said, I'm a fan. I want them to do well. I'm happy to help if I can. But hell if I'm going to let Google off the hook for providing good search results simply because the nature of the internet has changed. Things change. That's what they're supposed to do.

This is like writing a web app that is open to SQL injection attacks and then getting pissed at everybody else when they crash your system. Except there's one big difference: with an SQL injection attack there is an outsider directly interacting with your system, perhaps malevolently. With Google, outsiders don't even enter data in, Google goes and gets it. We've got the shoe on the wrong foot, as my mother used to say.


I understand what you're getting at, but you're way off the mark here:

> The magazine publishers are free to configure their magazines and the world around them in any way they wish. The newsstand operator is responsible for what goes on inside his stand. If he's serving up junk, do we go blaming the rest of the world for the quality of his service?

There's a difference between selling junk and selling something that's obviously criminal. If you walked into a store where every magazine said "VIAGRA - 50% OFF. MAIL US YOUR MONEY", would you honestly tell me that the magazines in question were perfectly ok and it was the magazine vendor who did something wrong?

It's good that you realize that it's humans that are committing crimes, and not subhuman beings. But that doesn't excuse them nor should you.

Yes, Google has some level of responsibility here and they should be held accountable. But they're not the ones actually committing the crime.


You seem to be ignoring the fact that these sites are selling fake merchandise or downright stealing people's money. White-gray-black-hat SEO is not the point, I think you're conflating the nuts and bolts aspect of this article with the general geek indignation about Google's plummeting search results; but the latter is a tempest in a teapot—despite there being a very vocal minority within hacker circles brandishing a fuming vitriolic hatred of spammers of all types, not too many people really believe there's anything criminal about SEO per se.


Google created an ecosystem, and spammers take steps that are criminal in this particular ecosystem and work against its 'citizens'. Maybe their actions are not criminal in general terms, but it's at least scammy (not to say worse) when you look at it considering only this narrow subject (getting information through the internet). Google needs to do a better job, yes, just like lawmakers need to do it IRL, but it doesn't change the fact that IRL also mostly criminals are the ones that force the law system to adjust. Your post makes a lot of sense, but IMO there's nothing wrong in blaming spammers for the current situation.


I know this isn't a perfect solution. But I made this site that uses Google's Custom search to allow you to maintain your own blacklist so that you can filter out sites you don't want displayed. Here's the link: http://blacklist-search.appspot.com/
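The basic idea of a personal blacklist can be sketched in a few lines. This is not how the linked site is implemented (that is not described here), only a minimal illustration of filtering a result list by domain; `is_blacklisted`, `filter_results`, and the `.example` domains are hypothetical.

```python
from urllib.parse import urlparse


def is_blacklisted(url, blacklist):
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in blacklist)


def filter_results(results, blacklist):
    """Drop every result URL whose host matches the user's blacklist."""
    return [url for url in results if not is_blacklisted(url, blacklist)]


# A made-up personal blacklist and result list.
blacklist = {"spam-jerseys.example"}
results = [
    "http://www.spam-jerseys.example/nfl",
    "http://www.nflshop.com/category/index.jsp?categoryId=2237409",
    "http://notspam-jerseys.example/",
]
print(filter_results(results, blacklist))
```

Matching on the full host or a dot-prefixed suffix, rather than a plain substring, keeps an unrelated domain like `notspam-jerseys.example` from being filtered by accident.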


How is "greatly reducing the value of anchor text" going to improve search? Didn't we all start using google because anchor text was a great ranking signal? It seems the appropriate course of action is to devalue links from sites that either ignorantly or willfully pollute the link-o-sphere.


i agree with people who are complaining. this morning i was looking for a docking station for my cowon mp3 player. that spammy-ass website called techframe kept showing 3 times in the first few results. it was annoying.


I guess google shows different things to different people. My search for NFL Jerseys for example, seems just fine:

http://grab.by/8DYK

what do you think?


Except for the first one, those are the spam jersey sites.


well the results can't all be the first one repeated over and over.

if you search for 'nfl jerseys' you're probably looking to buy a jersey, and at least a few of those (eg football fanatics) do in fact look like legitimate stores.


What if I'm trying to find a sports geek's blog post about the history of NFL jerseys, or something like that? My personal problem with all the search engines is that legitimately interesting/useful amateur content, or even things like mainstream news articles, gets lost in a sea of sites that are trying to sell me stuff when my query could be construed as even remotely commercial. Unfortunately this is a trend I don't see changing, because (a) the hawkers have more expertise and resources than the bloggers when it comes to SEO, and (b) the search engine itself benefits financially by assuming I want to buy things and showing results accordingly (especially if it's Google due to AdSense).


Google could choose not to show bad results if they wanted to, especially since they are scams in this case. They already do this for duplicate content.


Searching from the UK, I get the exact same (really spammy) results that the SEOMoz blog post shows.

Must be a localisation thing I guess?


Looking at the specific examples:

[nfl jerseys]

#1) http://www.nflshop.com/category/index.jsp?categoryId=2237409

Visit nfl.com, click "shop", then choose the "jerseys" tab, and this is the page you are on. Seems perfectly relevant. The domain does not contain "jerseys" in it, and while the title does, it's the jerseys category page for the NFL's shopping website, so that makes sense. Hardly spam.

#2) http://www.footballfanatics.com/NFL_Jerseys

Visit www.clc.com, the collegiate licensing company, click retailers->collegiate retail outlets, Football Fanatics is one of 13 licensed collegiate retailers. Most major college universities sell their football merchandise through them. It's been around (run whois) since 1997, 14 years! Perhaps it's not ideal for NFL (non-college), but it's definitely Not Spam.

Unfortunately, below this some of the results do start getting ugly - there aren't too many online retailers that can legally sell NFL merchandise. Even Amazon is just a storefront for the NFL Shop (see http://www.amazon.com/NFL-Football-Fans/b?node=374273011). That might make it a good result, but it's essentially duplicated content given the NFL Shop result.

[pandora jewelry]

#1/#2) Pandora.net, totally not spam, this is the type in [amazon], get amazon.com kind of result.

#2.5) Below the second result I see a shopping results box which has only pandora jewelry from authorized retailers.

#3) http://www.pandoramoa.com/ - the Pandora Mall of America stores. Authorized pandora retailer. The domain has been around since 2007 (4 years)

Below this, the rest is getting ugly. Similar to [nfl jerseys], there aren't many online retailers legally able to sell pandora jewelry, so once Google has listed the only 3 good results available, what do you want them to do? Try [jewelry] or [necklaces] - queries where there are lots of legit destinations and the top 10 results are all non-spammy.

[thomas sabo]

#1/#2) ThomasSabo.com, just like [pandora jewelry], this is exactly what 99% of the people with this query want.

Same story for the non-existent good results below.

These 3 queries are a very specific type of query where there are only one or two relevant results, but there are lots of sites that "match" the query. I'm not saying the rankings after the first few relevant results are good, but what would you propose to show after those relevant results as an alternative?

Writing an article about a specific class of queries is fine, although the author doesn't really propose a better set of results. The implication is that this issue applies to a broad set of queries, which it doesn't seem to. Ironically, the author's signature line is a link to http://www.tomsgutscheine.de/, whose title translated to English appears to be: "Coupons, Coupon Codes & Coupons (January 2011) - Tom's Coupons".



