I had a discussion with tptacek about this one day. See, I don't think Google (the search engine whose opinions most influence my thoughts -- no offense DDG) sees content farms as a bad thing.
If someone is searching for "how to make a blueberry pie", and they get an article entitled "how to make a blueberry pie", they're happy. Are they actually going to make a blueberry pie? Probably not. Therefore, it doesn't really matter whether they get a good blueberry pie recipe or a bad blueberry pie recipe. As long as they quickly get to a well-designed page with a few bullet points they'll skim fast and a blueberry pie picture on it (a page they won't read anyhow, because no one reads on the Internet), they're happy. Their blueberry pie voyeurism need is fulfilled.
Content mills make that happen, for huge segments of the population. Let me strip that of euphemism: content mills make this happen for women, the elderly, and the technically disinclined. Absent the content mill, there is insufficient "organically produced" content on the things they care about on the Internet, because their participation on the Internet is dramatically less than y'all's participation is and y'all -- speaking in generalities -- do not blog about good blueberry pie recipes.
You can think of content mills as an organism in symbiosis with Google: how do you juice relevance algorithms to identify the sliver of a sliver of a fraction of the Internet which talks about blueberry pies and other things your mom cares about, identify the best tangentially related article, and present it to her every time? Well, you could have your crack teams of geniuses work on it for a few years, even though your favorite tricks like PageRank are likely to function less well because there's less linking data to go around. Or, in the alternative, you could encourage content farming.
It surely has not escaped Google's notice that their bottom line revenue increases by about 80% of the top-line revenue of the entire content farming industry, incidentally. Contextual ads are the perfect monetization vehicle for laser-targeted content produced at quality which will be solely viewed in search mode, and Google owns that entire field.
I work in search quality at Google, and while certainly not everyone agrees that it's a problem, a lot of people do.
I could write a lot about this, but the central issue is that it is very very hard to make changes that sacrifice on-topic-ness for good-ness that don't make the results in general worse. We're working on it though, and I suspect we'll never stop.
I think a lot of the promise lies in, as you said, identifying the tangentially related article, or as I like to frame it, bringing more queries into the head. We've launched a lot of changes that do exactly this. (But you are right, it is difficult, and fundamentally so. Language is hard.)
>virtually every news organization "scrapes" the associated press
If they're not adding real value, like analysis or graphics or commentary or whatnot, why would you want to keep them if they're all just duplicates?
I had a friend work at a startup to solve this exact problem: we read virtually identical articles about the same bit of news on all the news sites. The startup was working on highlighting only the unique bits of each article and recommending the one article that seems to have the most pieces of information. You would read the one and skim to the unique bits of the others, and you would have gotten all angles and facts much more quickly.
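To make the idea concrete, here is a rough sketch of "recommend the fullest article, then show only the unique bits of the others". It is not what that startup built; treating each sentence as one "piece of information" is my own simplifying assumption, and the names `sentences` and `summarize_coverage` are made up for illustration.

```python
import re

def sentences(text):
    """Split text into normalized sentences (lowercased, whitespace collapsed)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(p.lower().split()) for p in parts if p.strip()]

def summarize_coverage(articles):
    """articles: dict of source name -> article text.

    Returns (recommended source, dict of source -> sentences found only there).
    Assumption: one sentence == one piece of information.
    """
    facts = {name: set(sentences(text)) for name, text in articles.items()}
    # Recommend the article containing the most distinct sentences.
    recommended = max(facts, key=lambda name: len(facts[name]))
    unique = {}
    for name, own in facts.items():
        # Everything said by any other source.
        others = set().union(*(f for n, f in facts.items() if n != name))
        unique[name] = own - others
    return recommended, unique
```

A real system would need fuzzy matching, since two outlets rarely word the same fact identically; exact sentence matching only catches verbatim wire copy.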
We do filter near-duplicates within the same set of results. You'll likely see only one copy of an AP story with a link at the bottom saying something like "Repeat this search with the omitted results included"
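For readers wondering what near-duplicate filtering can look like, one textbook approach is word shingles plus Jaccard similarity. The sketch below is only that textbook approach, not a description of what Google runs; the function names, the shingle size of 3 and the 0.8 threshold are all arbitrary assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def filter_near_duplicates(results, threshold=0.8):
    """results: list of (url, text) in ranked order.

    Keeps the first copy of each story and omits later results that are
    at least `threshold` similar to something already kept.
    Returns (kept_urls, omitted_urls).
    """
    kept, omitted, kept_shingles = [], [], []
    for url, text in results:
        s = shingles(text)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            omitted.append(url)
        else:
            kept.append(url)
            kept_shingles.append(s)
    return kept, omitted
```

The omitted list is what would sit behind a "repeat this search with the omitted results included" link. Comparing every result against every kept result is fine for one results page but would not scale to a whole index.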
Google News doesn't show up in search results the way, e.g., Mahalo might. The only search results I've seen that incorporate Google News are built right into the main results page; I don't click through expecting content and get a Google News page instead.
In fact, I haven't seen any Google-owned scraping or aggregating page in a result that I've clicked through. They are big believers in the theory that you should look at exactly one search results page, not a page that takes you to a page that takes you to (...) the result you actually wanted.
>I haven't seen any Google-owned scraping or aggregating page in a result that I've clicked through.
What about Google Health results? Try [Whooping Cough] or similar. Top 'result' is a Google health page whose main column is all content republished from Medline. Right column is essentially 'more results' from News and Scholar.
It's not quite as bad as other paste-together pages of text and more results, but they're creeping in that direction.
It is a problem. Things like MFA spam and Google's recent trend of being less useful as a Search Engine for Programmers (ignoring quotes, ignoring punctuation like ? and !) are enough to make me switch, if I find something better. This article has convinced me to try out DDG for a little while.
Google has done this since launch; I don't think it's a recent trend. I agree though that searching for programming information isn't great. (It's even harder with math.)
I doubt you'll find MFA spam to be better on DDG than on Google, but please, if you see a query where they are beating us, send it over. :) I can guarantee you that I'll get a lot of eyes looking at it.
The BigResource content is scraped from Actionscript.org and surrounded by BigResource's ads, which is pretty standard scraping. Clicking the link to "view original forum thread" redirects to a framed page with more BigResource ads, and the original content in a frame. The frame is handled internally by the site, so I'm doubtful they're even showing an actual link to their scraped content.
My question is this: Is there something that BigResource is doing that exempts them from being classified as spam? As near as I can tell, they add no extra value to the content for the user, and push the legitimate results they scraped further down in results (because they have many, many pages).
Kudos, I was about to submit a longstanding complaint but it looks like it has been fixed. For a long time Google was autocorrecting my searches for gearman as "german" -- not just in a "Did you mean?" but actually giving me search results about germans instead of gearman. I'm not sure when it happened, but results are sane and accurate again, even for things like "gearman workers" and "gearman jobs".
Although I agree with your point, this bit offended me a bit:
>>>content mills make this happen for women, the elderly, and the technically disinclined.<<<
On the other hand, have you ever thought about making an interface that does this at a glance instead of an algorithm? Why should a person go to pages at all? Even better, why can't people see how pages related to their queries are interconnected?
E.g., you enter a few search terms. Google calls a few results up. Now, maybe you can do meta-parsing? It is simply too computationally expensive to parse the entire internet using NLP, but if you can do it at two levels then you should be able to reduce costs and increase efficiency.
After that you show the results in a tree-like structure with screen grabs of the page in question and an automatically generated summary. Now, the user can select any one of these pages and you open them up again according to her/his query. You keep on generating this tree until you hit a semantic dead end, i.e. the pages that are about to be opened have no relation to the original query.
Something like this should be impervious to content farms for a simple reason. The user has the ability to skim over the content presented and can visit what s/he likes. So, the content farms can't bait her/him. Moreover, since several sources are displayed visually side by side and their interconnections and sub references are shown the user should be better able to judge what to visit and what not to visit.
Moreover, you can make it smart and tailor it for users so that links they seldom visit get omitted from subsequent trees.
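A toy sketch of the tree expansion described above, just to show where the "semantic dead end" check would sit. Everything here is an assumption made for illustration: `pages` and `links` stand in for a crawl, `relevance` is a crude word-overlap score rather than real NLP, and the summary is simply the first 60 characters of the page.

```python
def relevance(query, text):
    """Toy relevance: fraction of query words that appear in the page text."""
    q = set(query.lower().split())
    if not q:
        return 0.0
    return len(q & set(text.lower().split())) / len(q)

def build_tree(query, url, pages, links, min_relevance=0.5, max_depth=3, seen=None):
    """Expand a result into a tree of linked pages related to the query.

    pages: dict of url -> page text; links: dict of url -> list of linked urls.
    A branch stops at a "semantic dead end": a linked page whose relevance to
    the original query falls below min_relevance.
    """
    seen = set() if seen is None else seen
    seen.add(url)
    node = {"url": url, "summary": pages[url][:60], "children": []}
    if max_depth == 0:
        return node
    for child in links.get(url, []):
        if child in seen or child not in pages:
            continue  # already shown, or not crawled
        if relevance(query, pages[child]) < min_relevance:
            continue  # semantic dead end: unrelated to the original query
        node["children"].append(
            build_tree(query, child, pages, links, min_relevance, max_depth - 1, seen)
        )
    return node
```

Tailoring the tree per user, as suggested above, could then be a matter of pre-loading `seen` with the links that user seldom visits.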
I've wanted to make something like this for a long time, but I don't have the knowledge to do so. I would love to learn though. I think that the next frontier of search isn't optimizing the algorithms, but to change the way results are presented through more intuitive UIs.
[I am assuming that content farms simply set up pages without verifying the validity of the content]
The content farms hurt most when I'm doing a focused query, e.g. trying to make a purchasing decision or discover a specific fact about a very particular topic.
Many times, when searching for specific products to buy, you get page after page of review aggregation sites filled with affiliate sales links, often with the same reviews, over and over again.
At other times, you get poorly written generic content farms that almost trap the reader with the hope of getting what he wants, until it eventually dawns on him that the site is basically a fluffed out set of Wikipedia pages.
I was looking for motorbike gear some months ago, and ended up on this site. The odd thing about it is it looks a bit like a shop of some kind - "parts trade" and all that - and it even says "search through our large inventory of motorcycle parts", but it's not actually a shop, just a huge web of interlinked articles. The only ad at the current time is a static link to digitalroom.com.
There are other indirect mechanisms whereby content-farm-junk -- aka 'spablum' -- benefits Google.
What if spablum doesn't prevent a person from eventually finishing their task, but lengthens their browsing/searching session, requiring a few reformulations? Up to a certain threshold -- when the person gives up entirely or tries a competitor -- such delaying tactics may give Google more AdWords and AdSense impressions and clickthroughs.
Next, as the quality of natural search results drops, the value of the paid-placement slots goes up -- for both advertisers and users. This is one giant hole in the argument that it's effortless to switch to another search engine, if it offers better natural search results. Lots of the value from Google searches now comes from the relevant ads -- only if Google were to let competitors run ads from the same pool would this switching cost be eliminated.
Finally, the quality arms-race is something deep-pocketed, heavily-PhD'd Google can fight indefinitely... but adds an immense barrier to new shoestring operations. Google, given its position, would be perfectly happy if search quality just hovers where it is (or even declines slightly) -- as long as the cost for potential competitors to achieve that same search quality, at scale, grows more than it does for Google.
I can guarantee you that no one in search quality, outside perhaps of the people who work in UI, cares one bit about the revenue effects of what we do. Someone in the company certainly does, but not anyone in the chain of command of people who have control over ranking launches.
If the results suck, by all means, rail on Google and tell us we're doing a crappy job, but there's no need to suspect an ulterior motive.
Every so often, some well-meaning salesperson will email one of the search quality mailing lists because they think there's a problem with how we are indexing/ranking one of our AdSense or AdWords customers' sites. Every single time, I've seen the person sternly told by a VP to never email those lists again.
I work in search UI and we're explicitly told not to worry about revenue, outside of certain gigantic projects, like, say, redesigning the search page (where we're told to ignore revenue but try not to bankrupt the company). There's a whole other department dedicated to making sure we can continue to pay our salaries; it's not our job to worry about money, only to make users happy.
If the people in charge of rankings do not care one bit about the revenue effects, then simply let the user decide: give me a flag to take all results out from any site containing AdSense. Then the non-technically inclined can get the "how to make blueberry pie" results and the more technically inclined can skip the content mills.
Something like so:
how to make blueberry pie -adsense
edit: didn't notice that previous commenter had said the same thing and moultano had replied "you want the new york times filtered out of the results?"
No, I want an advanced flag that filters ALL sites with AdSense out when I use it, including the New York Times. If I want results from NYT then I won't use the flag when I search. He didn't exactly answer the question of whether we will ever get a filter-out-adsense flag.
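For what it's worth, the opt-in behaviour being asked for is easy to sketch. No such operator exists in Google as far as this thread knows, so the `-adsense` handling, the function names and the result format below are all hypothetical. Detecting AdSense by looking for the `googlesyndication.com` script host in the page HTML is a crude heuristic of my own, not an official signal.

```python
# Assumption: a page "carries AdSense" if its HTML mentions this script host.
ADSENSE_MARKER = "googlesyndication.com"

def parse_query(raw):
    """Split a raw query into (search terms, whether -adsense was given)."""
    terms = raw.split()
    exclude = "-adsense" in terms
    return " ".join(t for t in terms if t != "-adsense"), exclude

def apply_flag(raw_query, results):
    """results: list of dicts with "url" and "html" keys (hypothetical format).

    Results are filtered only when the user explicitly adds -adsense;
    without the flag they pass through untouched.
    """
    query, exclude = parse_query(raw_query)
    if exclude:
        results = [r for r in results if ADSENSE_MARKER not in r["html"]]
    return query, results
```

With the flag the New York Times would vanish along with the content mills; without it, nothing changes, which is the whole point of making it an operator rather than a default.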
Thanks for commenting, and I do trust that those specifically tasked with search quality would seek it, unreservedly, without ulterior motives.
However, once such revenue-enhancing feedback loops exist, they can lead to self-reinforcing behaviors without any conscious intent. Simply the fact that Google has been so wildly profitable, even with search quality stagnant or declining for many spablum-heavy queries, could mean certain radical avenues of inquiry aren't considered. ("We're obviously doing OK overall, our search share, usership, and profits are up every quarter. Let's not rock the boat by starting a war with these content mills, which would surely include lawsuits and congressional hearings calling attention to our market power.")
Even the choice of overall budget for search quality comes into play. The current level of spending has, with its mixture of both qualified success and failure against spablum, done just fine for Google's margins. But I think Google could spend twice as much on search quality, and still be profitable. So, why doesn't it? Well, I don't blame Google for being profit-maximizing, but that means someone is controlling a relative-effort lever that makes search just so good, and no better, because of a tradeoff in which the marginal profitability of even better search is considered.
Also, what if Google's search quality gets monotonically better given the web it has to work with, but the overall economic impact of Google's ad programs and dominant-focal-point-rankings is simultaneously making the average web content worse? Then, each generation of Google's search tech may be better, ceteris paribus, but the net pollution effect still boosts Google's revenues via the three mechanisms I listed: longer search sessions, greater attention to paid areas, a more difficult environment for competitors.
Even your anecdote is not completely reassuring; for an AdSense/AdWords customer, it's not in Google's self-interest to favor those sites in natural results. By ranking highly in organic results, they might not need to buy as much paid placement! So that VP's stern warning can be explained either by a dedication to search independence or simply by ad-revenue-maximizing self-interest.
Will I ever get an advanced search operator that will filter all AdSense-carrying results out of my searches? (I'm going to make one on Blekko, if possible.)
>"We're obviously doing OK overall, our search share, usership, and profits are up every quarter. Let's not rock the boat by starting a war with these content mills, which would surely include lawsuits and congressional hearings calling attention to our market power."
You do have a point there. We do get sued a lot, and in general courts have upheld our view that search results are our constitutionally protected speech, but we are a lot more lawsuit shy than we used to be.
>But I think Google could spend twice as much on that search quality, and still be profitable. So, why doesn't it?
I'm not sure it would help. We have a _lot_ of people working on search quality, and even more working on the infrastructure that supports it. More people would certainly let us scratch more itches and peer in more dark corners, but I'm not sure it would produce the kind of dramatic improvements you are looking for.
>Also, what if Google's search quality gets monotonically better given the web it has to work with, but the overall economic impact of Google's ad programs and dominant-focal-point-rankings is simultaneously making the average web content worse?
That's something a lot of people worry about. I hope that having a financial incentive to make content results in more good content, or at least more content that isn't horrible. Even the existence of content-farms is probably a net positive. For all the crap they produce, a lot of it is filling niches in which there simply isn't any content, and the stuff they produce _is_ better than nothing.
>Will I ever get an advanced search operator that will filter all AdSense-carrying results out of my searches?
You really want the New York Times filtered out of your searches?
It's not that I actually think spending twice as much on search quality is the right decision. Just that the fact that Google could, and doesn't, reminds us that some level of management at Google is trading off search quality against other values. (As a competitive for-profit entity, profitability is high among those other values.) Even without conscious intent, if there's a saturation point where greater search quality doesn't help net profitability, or even where slightly worse quality means more profits, the organization will develop certain practices and shared rationalizations which help them converge on that optimal point.
These indirect mechanisms for avoiding quality when it conflicts with profitability may not seem malicious on their surface. In fact, here's what they might sound like instead:
>I'm not sure [spending more on search quality] would help.
>Even the existence of content-farms is probably a net positive. For all the crap they produce, a lot of it is filling niches in which there simply isn't any content, and the stuff they produce _is_ better than nothing.
And yes, in many searches, I would gladly sacrifice NYTimes results to get rid of EHow and its siblings. (I didn't ask to eliminate AdSense sites from every results-page, just an option to do so when I'm in a MFA-polluted category.)
I think gojomo has a really good idea here. However, there is no reason to bias it against Google's ad network. We need an advanced search option to find non-ad supported sites, that behind the scenes attempts to filter out every known ad network.
The use case wouldn't be an all the time thing, it would be when you are searching and get results that are overly polluted with ad-based content. Having the option to filter to non-ad based content lets you focus on those sites that have a motivation other than page views for providing content. I imagine this would include both sites actually selling products and non-profit sites like Wikipedia and open source.
You only remember the queries that don't work. :) Seriously though, we've measurably seen people expect more from search engines over time, so if your impression has stayed constant then we are doing pretty well. In particular, the average query length has increased by ~ 2 words if I'm not mistaken.
Big changes I can think of in the last few years that we've publicly announced:
* The index is several orders of magnitude larger.
* Most documents that you see in results can now go from crawl to serving in a matter of minutes.
* Ranking for the extreme head is pretty much SEO free.
* Improved ranking for long tail documents with little data available about them.
Those are just the really big launches I can remember off the top of my head.
Most of our day to day work in general falls into two categories.
1. Making our existing systems faster, fresher, higher quality.
2. Searching for brand new sources of data, most of which don't work out that well (as in any other research area), and occasionally finding the diamond in the rough that dramatically changes search results enough for at least the SEO community to notice.
Heh. That's awesome. As an SEO, I know that we definitely don't see the extreme head as being "SEO free" - but perhaps we have different definitions of SEO. The people ranking there have definitely thought about how to do that (and done it effectively).
They haven't done it with scraped links and content farms, but they have definitely optimised their business to do well in search engines.
Yep. And I know the smart SEOs behind at least 3 of those rankings. There are more big brands, but there are also big brands who are not ranking there - and I would argue that is not coincidence. SEO is evolving, but whatever the algorithm, there are opportunities to take steps to have your business perform better in the search results.
Equally (and while I very much respect the work you guys are doing and trying to do) there are massive head terms where I know how much the top-ranking sites are paying for links. When they stop, they slip. The job is not done. (But you know that).
>it would be hard for me to say that the search results from google have increased in quality at all in the past five years
Well, I know SQUAT about Google's search team. But just as a thought experiment, let's ask whether the 'net has changed in the last five years? It seems to me that whatever "algorithm" was in place back then would be gamed to death by now. Spammers would have all their dreck at the top of every search result. And that's not even counting the criminals who want their drive-by download payloads delivered...
There's a constant war against spam and crime, and while Google has a stupendous amount of money to invest in winning it, the other side has an amazing war chest and a massive army fighting against them.
I have no idea, maybe it only takes an hour a week for one engineer to keep things ticking along. But I can easily imagine that the team has to do a lot of work just to keep results from getting worse.
I think at this point that just keeping up with the people trying to game Google is already an order of magnitude 'improvement' (even if it's in absolute terms not much of an improvement to the end user).
When they were playing around with their SearchWiki system, there was an 'X' next to all results that you could click to make them go away, which I was pretty excited about, until I realized it didn't actually permanently banish that site; just removed it from the current results page.
There are a bunch of greasemonkey scripts to let you blacklist specific sites from Google results, but relatively frequent changes to Google's results page seem to make them always fall behind, so it's hard to find a consistently working one. This is the best-maintained one I can find, and it's actually not currently working: http://userscripts.org/scripts/show/44418
Didn't Google some time ago have these up and down arrows, like the ones on comments here on HN? I never quite learned what they were for. That is not far off from marking sites as spam, and I used the up and down arrows perhaps once in the entire time.
Search is different from sites like this. Here we know we are going to spend some time reading interesting content, but not quite what we will be reading. When searching you know you need some information and what you want to do is find it, preferably instantly, and get out of Google immediately. It is easier to click on the next link than to mark some site as spam or click the up and down arrow.
The article closes with, "In some sense, Blekko's approach is more democratic--if any content is good enough for your friends, it's probably good enough for you too."
If I'm going to use a social search, I want to see things that:
1) Are liked by at least one of my friends.
2) Are not explicitly disliked by (a meaningful threshold, perhaps as few as 1) of my friends.
And really, I'm not so sure that I trust #1, but #2 could be useful to me. Especially when "friends" gets replaced with "Hacker News," then I'm much more interested.
In other words, I don't trust the sensitivity of a social network to get me what I need; I may have interests that reach beyond those of any of my associates. However, I do, to an extent, trust its specificity for identifying badness, and that's why I might consider a "social"-esque search filtering service.
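Roughly, the two rules above could be applied as a post-filter on ordinary search results. This is only a sketch of the idea: the `votes` structure (url -> {friend: +1 or -1}), the function name and the parameters are all invented for illustration, and no existing search engine's API is implied.

```python
def social_filter(results, votes, friends, dislike_threshold=1, require_like=True):
    """Filter ranked result URLs using friends' votes.

    results: list of urls in ranked order.
    votes: dict of url -> dict of person -> +1 (like) or -1 (dislike).
    friends: set of people whose votes count.
    Rule 1 (optional): at least one friend liked the result.
    Rule 2: fewer than dislike_threshold friends disliked it.
    """
    kept = []
    for url in results:
        friend_votes = [v for who, v in votes.get(url, {}).items() if who in friends]
        likes = sum(1 for v in friend_votes if v > 0)
        dislikes = sum(1 for v in friend_votes if v < 0)
        if dislikes >= dislike_threshold:
            continue  # rule 2: explicitly disliked by enough friends
        if require_like and likes < 1:
            continue  # rule 1: nobody I know vouched for it
        kept.append(url)
    return kept
```

Passing `require_like=False` gives the "specificity only" variant: nothing is dropped for being unknown to your friends, only for being explicitly disliked by them.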
Actually, Gabe, have you ever considered creating some sort of collaborative filtering tool for duckduck?
Want to thank moultano (from Google) for replying to so many stories.
One technique that I have manually begun employing to stop landing at what I would call value-less sites is looking at the PageRank (Chrome extension) of the host page before considering the individual story.
One reader complained about the hollow/bullshit review sites that match whatever word you are searching for and then make your eyes bleed with the sheer number of ads and inter-related affiliate links once you get there -- these sites never have a high PageRank and can be safely ignored.
I don't know if DuckDuckGo or Blekko (is that the new one in Alpha?) are going to take this type of data into consideration when ranking search results, but I sure do and it has never failed me.
If you want to see a quick example of bullshit websites -- try and Google for ANYTHING health-or-weightloss-related. Try "HGH review", "Sensa review" or just about anything else that you might be curious about in the health/medical realm and I guarantee you that it will be at least 3 pages of search results before you find a single article that is not Google-fodder that actually has real content in it.
In the old days you used to be able to tell a "real" website from a "bullshit" one by looking at how pretty or professional the site is... un/fortunately the barrier to a beautiful site is much lower now, and value-less sites that look as good as professionally developed/run sites are hard to spot instantly.
This is where peeking at the PageRank of the host has helped me quite a bit.
Yes it misses some things, like the case where a useless site produces 1 article that is good, but in general it keeps me sane and stops me from giving up search in general.
It would be nice if there was a Chrome/Firefox extension for Google's search result page that I could click "Submit Complaint" to submit links to Google complaining about the quality of the result or the site itself. I know they would have a lot of noise to work through from something like this, but I would hope over time it would help them be able to spot patterns in these affiliate-linking-ad-smattered-nightmare sites and get rid of them.
Which is probably an unfair comment given Matt Cutts is dealing from inside a massive, listed company, but epi0Bauqu / DDG are certainly addressing a definite user experience problem. Indeed, this is a more tangible benefit for using DDG than the focus on security and privacy, which I don't believe is understood by most users.
This is actually interesting enough that I'm trying DDG as my default search engine for a bit currently. I probably should've already (I intellectually like what they're doing, and check it out occasionally), but Google just works well enough for inertia to keep me there. Getting content-farm results is a common daily annoyance, though, so this is the sort of thing that could make for a noticeable improvement in my Search Happiness even in the short term.
This is a huge problem to address, and could be the thing that changes the dynamics of search. Google has become useless to me for a lot of searches -- many things that attract "MFA" -- because of the overwhelming amount of search spam. The problem has gotten much worse over the last few years because of the content mills presumably. Fortunately Google's still great for sufficiently obscure things that don't lead to transactions (like research papers).
It seems that for every query you really want a good, Wikipedia-quality page summarizing the best knowledge available. These content farms are trying to fill the gap where Wikipedia won't have a corresponding page; however, they do not have enough revenue per page to pay for content that is good enough.
I don't have a problem with content mill sites -- as long as they provide me information quickly in a format I desire.
I think lots of folks want "authoritative". So let's suppose I'm mentally disabled and in search of a recipe for cake. I google "cake" and I get 40,000 sites. Top of the list is the Cake Institute of America. I google airplanes and I'm looking at the history of winged flight. I want to learn to tie my shoes and spend 5 hours on the history of footwear in western culture.
This is just silly. Communications 101 says that the message changes depending on the audience and the medium. Yet those in the search engine business, it seems, want "authoritative" and "best" sources. I could give a shit. I want information custom-made to me -- who I am, how I speak, my culture, my mood, my life.
The "There can be only one" attitude is not helpful. Somewhere, right now, some guy wants to find out how to train speckled-bellied pigeons to dance. And those dang E-how guys probably have a video for it. Back in the day it was painful as heck to find information. Now companies are figuring out how to make each little penny they can on creating content. As long as the content is useful, I think that's awesome.
Having said that, the problem is that the drunken-angry-sailor-web-content is different from the New-England-school-marm-content. That's okay. There's room for growth. Isn't progress a good thing?
But the domain-squatting nonsense, and the sleaze factor some of these companies bring to the table? That's got to go. With lots more tlds I think the domain-name-spamming business has a limited shelf-life, thankfully.
To address your point though, I think what is new about the article is that some search engine is trying to address a problem which annoys a lot of HNers: irrelevant or shallow search results. That problem, I might say, is what differentiated Google from the then MSN, AltaVista, etc.