If someone is searching for "how to make a blueberry pie", and they get an article entitled "how to make a blueberry pie", they're happy. Are they actually going to make a blueberry pie? Probably not. Therefore, it doesn't really matter whether they get a good blueberry pie recipe or a bad blueberry pie recipe. As long as they quickly get to a well-designed page that they won't read anyhow because no one reads on the Internet which has a few bullet points they'll skim fast and a blueberry pie picture on it, they're happy. Their blueberry pie voyeurism need is fulfilled.
Content mills make that happen, for huge segments of the population. Let me strip that of euphemism: content mills make this happen for women, the elderly, and the technically disinclined. Absent the content mill, there is insufficient "organically produced" content on the things they care about on the Internet because their participation on the Internet is dramatically less than y'alls participation is and y'all -- speaking in generalities -- do not blog about good blueberry pie recipes.
You can think of content mills as an organism in symbiosis with Google: how do you juice relevance algorithms to identify the sliver of a sliver of a fraction of the Internet which talks about blueberry pies and other things your mom cares about, identify the best tangentially related article, and present it to her every time? Well, you could have your crack teams of geniuses work on it for a few years, even though your favorite tricks like PageRank are likely to function less well because there's less linking data to go around. Or, in the alternative, you could encourage content farming.
It surely has not escaped Google's notice that their bottom-line revenue increases by about 80% of the top-line revenue of the entire content farming industry, incidentally. Contextual ads are the perfect monetization vehicle for laser-targeted content produced at a quality that will only ever be viewed in search mode, and Google owns that entire field.
I could write a lot about this, but the central issue is that it is very very hard to make changes that sacrifice on-topic-ness for good-ness that don't make the results in general worse. We're working on it though, and I suspect we'll never stop.
I think a lot of the promise lies in as you said, identifying the tangentially related article, or as I like to frame it, bringing more queries into the head. We've launched a lot of changes that do exactly this. (But you are right, it is difficult, and fundamentally so. Language is hard.)
That site has a few thousand crappy reviews scraped from all over the place. It comes up on google all the time and it's something that I automatically ignore.
It then returns a page with your search queries and no information.
What use is a site that simply returns search queries?
Why aren't sites that scrape content blacklisted?
The problem is more difficult than you'd think. For instance, virtually every news organization "scrapes" the associated press, but we wouldn't want to throw out every news organization.
Content-free search result pages are things we do try to remove, even manually if it becomes a big enough problem.
If they're not adding real value, like analysis or graphics or commentary or whatnot, why would you want to keep them if they're all just duplicates?
I had a friend work at a startup to solve this exact problem: we read virtually identical articles about the same bit of news on all the news sites. The startup was working on highlighting only the unique bits of each article and recommending the one article that seemed to have the most pieces of information. You would read that one and skim to the unique bits of the others, and you would have gotten all the angles and facts much more quickly.
Shame they closed it up.
What like Google News?
It's a lame joke but it shows you how fine the line is between 'scraping' and 'aggregating' content.
In fact, I haven't seen any Google-owned scraping or aggregating page in a result that I've clicked through. They are big believers in the theory that you should look at exactly one search results page, not a page that takes you to a page that takes you to (...) the result you actually wanted.
What about Google Health results? Try [Whooping Cough] or similar. Top 'result' is a Google health page whose main column is all content republished from Medline. Right column is essentially 'more results' from News and Scholar.
It's not quite as bad as other paste-together pages of text and more results, but they're creeping in that direction.
Google has done this since launch, I don't think it's a recent trend. I agree though that searching for programming information isn't great. (It's even harder with math.)
I doubt you'll find MFA spam to be better on DDG than on Google, but please, if you see a query where they are beating us, send it over. :) I can guarantee you that I'll get a lot of eyes looking at it.
http://www.google.com/search?q=soap+with+flash+as3 brings up the following link from bigresource.com: http://www.bigresource.com/FLASH-SOAP-with-flash-AS3-PvTRLrv...
The BigResource content is scraped from Actionscript.org and surrounded by BigResource's ads, which is pretty standard scraping. Clicking the link to "view original forum thread" redirects to a framed page with more BigResource ads and the original content in a frame. The frame is handled internally by the site, so I'm doubtful they're even showing an actual link to the content they scraped.
I've seen this specific site debated in this thread: http://www.google.com/support/forum/p/Web+Search/thread?tid=... and I know several users including myself have reported this site as spam (I even went the extra mile and changed Chrome to append "-site:bigresource.com" to queries using the default engine).
My question is this: Is there something that BigResource is doing that exempts them from being classified as spam? As near as I can tell, they add no extra value to the content for the user, and push the legitimate results they scraped further down in results (because they have many, many pages).
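Incidentally, the Chrome workaround mentioned above is just a custom search-engine URL template with the exclusion operator baked in; something along these lines (the exact settings path varies by Chrome version):

```
http://www.google.com/search?q=%s+-site:bigresource.com
```

Chrome substitutes the typed query for `%s`, so every omnibox search automatically excludes the domain.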
I got fed up with searching for CPU instructions and compiler intrinsics on Google. You get mostly pointless forum discussions or MFA-type results.
The same happens with many other technical searches. It looks like the secret PageRank is gamed from both inside and outside.
Note: DDG's search is Yahoo Boss/Bing.
> content mills make this happen for women, the elderly, and the technically disinclined.
On the other hand, have you ever thought about making an interface that does this at a glance instead of an algorithm? Why should a person go to pages at all? Even better why can't people see how pages related to their queries are interconnected?
E.g., you enter a few search terms and Google calls up a few results. Now, maybe you can do meta-parsing? It is simply too computationally expensive to parse the entire Internet using NLP, but if you can do it at two levels then you should be able to reduce costs and increase efficiency.
After that, you show the results in a tree-like structure with screen grabs of the page in question and an automatically generated summary. Now, the user can select any one of these pages and you open them up again according to her/his query. You keep generating this tree until you hit a semantic dead end, i.e. the pages that are about to be opened have no relation to the original query.
Something like this should be impervious to content farms for a simple reason: the user has the ability to skim over the content presented and can visit what s/he likes, so the content farms can't bait her/him. Moreover, since several sources are displayed visually side by side, and their interconnections and sub-references are shown, the user should be better able to judge what to visit and what not to visit.
Moreover, you can make it smart and tailor it for users so that links they seldom visit get omitted from subsequent trees.
I've wanted to make something like this for a long time, but I don't have the knowledge to do so. I would love to learn though. I think that the next frontier of search isn't optimizing the algorithms, but to change the way results are presented through more intuitive UIs.
[I am assuming that content farms simply set up pages without verifying the validity of the content]
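The expand-and-prune loop described above can be sketched roughly. Here is a minimal Python version with a toy word-overlap score standing in for real NLP relevance; every name, data structure, and threshold here is invented for illustration, not a description of any existing system:

```python
def similarity(query, text):
    """Toy stand-in for NLP relevance: Jaccard overlap of word sets."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def build_tree(query, page, fetch_links, depth=0, max_depth=3, threshold=0.2):
    """Expand a page into a tree of linked pages, stopping at a
    'semantic dead end' where similarity to the original query drops
    below the threshold."""
    node = {"page": page, "summary": page["text"][:80], "children": []}
    if depth >= max_depth:
        return node
    for linked in fetch_links(page):
        if similarity(query, linked["text"]) >= threshold:
            node["children"].append(
                build_tree(query, linked, fetch_links,
                           depth + 1, max_depth, threshold))
    return node
```

The per-user tailoring could then be a second pruning pass that drops subtrees rooted at links the user seldom visits.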
What if spablum doesn't prevent a person from eventually finishing their task, but lengthens their browsing/searching session, requiring a few reformulations? Up to a certain threshold -- when the person gives up entirely or tries a competitor -- such delaying tactics may give Google more AdWords and AdSense impressions and clickthroughs.
Next, as the quality of natural search results drops, the value of the paid-placement slots goes up -- for both advertisers and users. This is one giant hole in the argument that it's effortless to switch to another search engine, if it offers better natural search results. Lots of the value from Google searches now comes from the relevant ads -- only if Google were to let competitors run ads from the same pool would this switching cost be eliminated.
Finally, the quality arms-race is something deep-pocketed, heavily-PhD'd Google can fight indefinitely... but adds an immense barrier to new shoestring operations. Google, given its position, would be perfectly happy if search quality just hovers where it is (or even declines slightly) -- as long as the cost for potential competitors to achieve that same search quality, at scale, grows moreso than for Google.
If the results suck, by all means, rail on Google and tell us we're doing a crappy job, but there's no need to suspect an ulterior motive.
Every so often, some well-meaning salesperson will email one of the search quality mailing lists because they think there's a problem with how we are indexing/ranking one of our AdSense or AdWords customers' sites. Every single time, I've seen the person sternly told by a VP to never email those lists again.
However, once such revenue-enhancing feedback loops exist, they can lead to self-reinforcing behaviors without any conscious intent. Simply the fact that Google has been so wildly profitable, even with search quality stagnant or declining for many spablum-heavy queries, could mean certain radical avenues of inquiry aren't considered. ("We're obviously doing OK overall, our search share, usership, and profits are up every quarter. Let's not rock the boat by starting a war with these content mills, which would surely include lawsuits and congressional hearings calling attention to our market power.")
Even the choice of overall budget for search quality comes into play. The current level of spending has, with its mixture of both qualified success and failure against spablum, done just fine for Google's margins. But I think Google could spend twice as much on search quality, and still be profitable. So, why doesn't it? Well, I don't blame Google for being profit-maximizing, but that means someone is controlling a relative-effort lever that makes search just so good, and no better, because of a tradeoff in which the marginal profitability of even better search is considered.
Also, what if Google's search quality gets monotonically better given the web it has to work with, but the overall economic impact of Google's ad programs and dominant-focal-point-rankings is simultaneously making the average web content worse? Then, each generation of Google's search tech may be better, ceteris paribus, but the net pollution effect still boosts Google's revenues via the three mechanisms I listed: longer search sessions, greater attention to paid areas, a more difficult environment for competitors.
Even your anecdote is not completely reassuring; for an AdSense/AdWords customer, it's not in Google's self-interest to favor those sites in natural results. By ranking highly in organic results, they might not need to buy as much paid placement! So that VP's stern warning can be explained by either a dedication to search independence, or to simply ad-revenue-maximizing self-interest.
Will I ever get an advanced search operator that will filter all AdSense-carrying results out of my searches? (I'm going to make one on Blekko, if possible.)
You do have a point there. We do get sued a lot, and in general courts have upheld our view that search results are our constitutionally protected speech, but we are a lot more lawsuit shy than we used to be.
>But I think Google could spend twice as much on search quality, and still be profitable. So, why doesn't it?
I'm not sure it would help. We have a _lot_ of people working on search quality, and even more working on the infrastructure that supports it. More people would certainly let us scratch more itches and peer in more dark corners, but I'm not sure it would produce the kind of dramatic improvements you are looking for.
>Also, what if Google's search quality gets monotonically better given the web it has to work with, but the overall economic impact of Google's ad programs and dominant-focal-point-rankings is simultaneously making the average web content worse?
That's something a lot of people worry about. I hope that having a financial incentive to make content results in more good content, or at least more content that isn't horrible. Even the existence of content-farms is probably a net positive. For all the crap they produce, a lot of it is filling niches in which there simply isn't any content, and the stuff they produce _is_ better than nothing.
>Will I ever get an advanced search operator that will filter all AdSense-carrying results out of my searches?
You really want the New York Times filtered out of your searches?
These indirect mechanisms for avoiding quality when it conflicts with profitability may not seem malicious on their surface. In fact, here's what they might sound like instead:
I'm not sure [spending more on search quality] would help.
Even the existence of content-farms is probably a net positive. For all the crap they produce, a lot of it is filling niches in which there simply isn't any content, and the stuff they produce _is_ better than nothing.
And yes, in many searches, I would gladly sacrifice NYTimes results to get rid of EHow and its siblings. (I didn't ask to eliminate AdSense sites from every results-page, just an option to do so when I'm in a MFA-polluted category.)
Something like so:
how to make blueberry pie -adsense
edit: didn't notice that previous commenter had said the same thing and moultano had replied "you want the new york times filtered out of the results?"
No, I want an advanced flag that filters ALL sites with AdSense out when I use it, including the New York Times. If I want results from NYT then I won't use the flag when I search. He didn't exactly answer the question of whether we'll ever get a filter-out-AdSense flag.
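No such flag exists, but a crude client-side approximation is conceivable, since pages carrying AdSense generally load Google's ad script from pagead2.googlesyndication.com and declare a google_ad_client publisher id. A rough Python heuristic; the markers are an era-specific observation and the function names are invented, not any Google API:

```python
def carries_adsense(html):
    """Heuristic: AdSense units load Google's ad script and declare a
    google_ad_client publisher id somewhere in the page source."""
    markers = ("pagead2.googlesyndication.com", "google_ad_client")
    return any(m in html for m in markers)

def filter_adsense(pages):
    """pages: {url: html}.  Drop pages that appear to carry AdSense."""
    return [url for url, html in pages.items() if not carries_adsense(html)]
```

This would of course require fetching each result page before showing it, which is exactly the kind of cost a search engine avoids and a browser extension could absorb.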
Big changes I can think of in the last few years that we've publicly announced:
* The index is several orders of magnitude larger.
* Most documents that you see in results can now go from crawl to serving in a matter of minutes.
* Ranking for the extreme head is pretty much SEO free.
* Improved ranking for long tail documents with little data available about them.
Those are just the really big launches I can remember off the top of my head.
Most of our day to day work in general falls into two categories.
1. Making our existing systems faster, fresher, higher quality.
2. Searching for brand new sources of data, most of which don't work out that well (like in any other research area), and occasionally finding the diamond in the rough that dramatically changes search results enough for at least the SEO community to notice.
Can you explain what you mean by this?
They haven't done it with scraped links and content farms, but they have definitely optimised their business to do well in search engines.
Equally (and while I very much respect the work you guys are doing and trying to do) there are massive head terms where I know how much the top-ranking sites are paying for links. When they stop, they slip. The job is not done. (But you know that).
In general though, I agree with you, and it will get better.
it would be hard for me to say that the search results from google have increased in quality at all in the past five years
There's a constant war against spam and crime, and while Google has a stupendous amount of money to invest in winning it, the other side has an amazing war chest and a massive army fighting against them.
I have no idea, maybe it only takes an hour a week for one engineer to keep things ticking along. But I can easily imagine that the team has to do a lot of work just to keep results from getting worse.
Many times, when searching for specific products to buy, you get page after page of review aggregation sites filled with affiliate sales links, often with the same reviews, over and over again.
At other times, you get poorly written generic content farms that almost trap the reader with the hope of getting what he wants, until it eventually dawns on him that the site is basically a fluffed out set of wikipedia pages.
Here's one I reported to Google as search spam: http://www.motorcyclepartstrade.com/
I was looking for motorbike gear some months ago, and ended up on this site. The odd thing about it is it looks a bit like a shop of some kind - "parts trade" and all that - and it even says "search through our large inventory of motorcycle parts", but it's not actually a shop, just a huge web of interlinked articles. The only ad at the current time is a static link to digitalroom.com.
I skimmed the immediate replies, and people seem to be taking the "Yes, and" approach, so I'll ask. What the fuck are you talking about, man?
A way to opt-in to using a community managed list of bad domains would be even better.
There are a bunch of greasemonkey scripts to let you blacklist specific sites from Google results, but relatively frequent changes to Google's results page seem to make them always fall behind, so it's hard to find a consistently working one. This is the best-maintained one I can find, and it's actually not currently working: http://userscripts.org/scripts/show/44418
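The core logic those scripts implement is tiny and independent of Google's ever-changing markup, which is why the breakage is all in the DOM scraping, not the filtering. A minimal Python sketch of the same domain-blacklist idea (the domain and function names are just examples):

```python
from urllib.parse import urlparse

# A user- or community-maintained blacklist of domains to suppress.
BLACKLIST = {"bigresource.com"}

def is_blacklisted(url, blacklist=BLACKLIST):
    """True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in blacklist)

def filter_results(urls, blacklist=BLACKLIST):
    """Drop result URLs whose domain is on the blacklist."""
    return [u for u in urls if not is_blacklisted(u, blacklist)]
```

A browser extension would apply the same check to each result link and hide the matching entries; the fragile part is only locating those links in the results page.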
Search is different from sites like this. Here we know we are going to spend some time reading interesting content, but not quite what we will be reading. When searching, you know you need some information, and what you want to do is find it, preferably instantly, and get out of Google immediately. It is easier to click on the next link than to mark some site as spam or click the up and down arrows.
I think lots of folks want "authoritative". So let's suppose I'm mentally disabled and in search of a recipe for cake. I google "cake" and I get 40,000 sites. Top of the list is the Cake Institute of America. I google airplanes and I'm looking at the history of winged flight. I want to learn to tie my shoes and spend 5 hours on the history of footwear in western culture.
This is just silly. Communications 101 says that the message changes depending on the audience and the medium. Yet those in the search engine business, it seems, want "authoritative" and "best" sources. I could give a shit. I want information custom-made to me -- who I am, how I speak, my culture, my mood, my life.
The "There can be only one" attitude is not helpful. Somewhere, right now, some guy wants to find out how to train speckled-bellied pigeons to dance. And those dang E-how guys probably have a video for it. Back in the day it was painful as heck to find information. Now companies are figuring out how to make each little penny they can on creating content. As long as the content is useful, I think that's awesome.
Having said that, the problem is that the drunken-angry-sailor-web-content is different from the New-England-school-marm-content. That's okay. There's room for growth. Isn't progress a good thing?
But the domain-squatting nonsense, and the sleaze factor some of these companies bring to the table? That's got to go. With lots more tlds I think the domain-name-spamming business has a limited shelf-life, thankfully.
If I'm going to use a social search, I want to see things that:
1) Are liked by at least one of my friends.
2) Are not explicitly disliked by (a meaningful threshold, perhaps as few as 1) of my friends.
And really, I'm not so sure that I trust #1, but #2 could be useful to me. Especially when "friends" gets replaced with "Hacker News," then I'm much more interested.
In other words, I don't trust the sensitivity of a social network to get me what I need; I may have interests that reach beyond those of any of my associates. However, I do, to an extent, trust its specificity for identifying badness, and that's why I might consider a "social"-esque search filtering service.
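The two rules can be written down almost verbatim. A minimal Python sketch, with the vote data structures invented purely for illustration:

```python
def social_filter(results, likes, dislikes, dislike_threshold=1):
    """Keep results liked by at least one friend and disliked by fewer
    than `dislike_threshold` friends.  `likes` and `dislikes` map a URL
    to the set of friends who voted that way."""
    kept = []
    for url in results:
        liked = len(likes.get(url, set())) >= 1       # rule 1
        disliked = len(dislikes.get(url, set())) >= dislike_threshold  # rule 2
        if liked and not disliked:
            kept.append(url)
    return kept
```

Trusting rule 2 more than rule 1, as argued above, would just mean dropping the `liked` requirement and keeping only the dislike veto.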
Actually, Gabe, have you ever considered creating some sort of collaborative filtering tool for duckduck?
Which is probably an unfair comment given Matt Cutts is dealing from inside a massive, listed company, but epi0Bauqu / DDG are certainly addressing a definite user experience problem. Indeed, this is a more tangible benefit for using DDG than the focus on security and privacy, which I don't believe is understood by most users.
One technique that I have manually begun employing to stop landing at what I would call value-less sites is looking at the PageRank (Chrome extension) of the host page before considering the individual story.
One reader complained about the hollow/bullshit review sites that match whatever word you are searching for and then make your eyes bleed with the sheer number of ads and interrelated affiliate links once you get there -- these sites never have a high PageRank and can be safely ignored.
I don't know if DuckDuckGo or Blekko (is that the new one in Alpha?) are going to take this type of data into consideration when ranking search results, but I sure do and it has never failed me.
If you want to see a quick example of bullshit websites -- try and Google for ANYTHING health-or-weightloss-related. Try "HGH review", "Sensa review" or just about anything else that you might be curious about in the health/medical realm and I guarantee you that it will be at least 3 pages of search results before you find a single article that is not Google-fodder and actually has real content in it.
In the old days you used to be able to tell a "real" website from a "bullshit" one by looking at how pretty or professional the site was... un/fortunately, the barrier to a beautiful site is much lower now, and value-less sites that look as good as professionally developed/run ones are hard to spot instantly.
This is where peeking at the PageRank of the host has helped me quite a bit.
Yes it misses some things, like the case where a useless site produces 1 article that is good, but in general it keeps me sane and stops me from giving up search in general.
It would be nice if there was a Chrome/Firefox extension for Google's search result page that I could click "Submit Complaint" to submit links to Google complaining about the quality of the result or the site itself. I know they would have a lot of noise to work through from something like this, but I would hope over time it would help them be able to spot patterns in these affiliate-linking-ad-smattered-nightmare sites and get rid of them.
To address your point, though, I think what is new about the article is that some search engine is trying to address a problem which annoys a lot of HNers, a problem I might add which differentiated Google from the then MSN, Altavista, etc.: irrelevant or shallow search results.