
The Search Engine Backlash Against 'Content Mills' - ab9
http://www.technologyreview.com/blog/post.aspx?bid=377&bpid=25532
======
patio11
I had a discussion with tptacek about this one day. See, I don't think Google
(the search engine whose opinions most influence my thoughts -- no offense
DDG) sees content farms as a bad thing.

If someone is searching for "how to make a blueberry pie", and they get an
article entitled "how to make a blueberry pie", they're happy. Are they
actually going to make a blueberry pie? _Probably not_. Therefore, it doesn't
really matter whether they get a good blueberry pie recipe or a bad blueberry
pie recipe. As long as they quickly get to a well-designed page _that they
won't read anyhow because no one reads on the Internet_ which has a few bullet
points they'll skim fast and a blueberry pie picture on it, they're happy.
Their blueberry pie voyeurism need is fulfilled.

Content mills make that happen, for huge segments of the population. Let me
strip that of euphemism: content mills make this happen for women, the
elderly, and the technically disinclined. Absent the content mill, there is
insufficient "organically produced" content on the things they care about on
the Internet because their participation on the Internet is dramatically less
than y'all's participation is and y'all -- speaking in generalities -- do not
blog about good blueberry pie recipes.

You can think of content mills as an organism in symbiosis with Google: how do
you juice relevance algorithms to identify the sliver of a sliver of a
fraction of the Internet which talks about blueberry pies and other things
your mom cares about, identify the best tangentially related article, and
present it to her every time? Well, you could have your crack teams of
geniuses work on it for a few years, even though your favorite tricks like
PageRank are likely to function less well because there's less linking data to
go around. Or, in the alternative, you could encourage content farming.

It surely has not escaped Google's notice that their bottom line revenue
increases by about 80% of the top-line revenue of the entire content farming
industry, incidentally. Contextual ads are the perfect monetization vehicle
for laser-targeted content produced at quality which will be solely viewed in
search mode, and Google _owns_ that entire field.

~~~
moultano
I work in search quality at Google, and while certainly not everyone agrees
that it's a problem, a lot of people do.

I could write a lot about this, but the central issue is that it is very very
hard to make changes that sacrifice on-topic-ness for good-ness that don't
make the results in general worse. We're working on it though, and I suspect
we'll never stop.

I think a lot of the promise lies in, as you said, identifying the tangentially
related article, or as I like to frame it, bringing more queries into the
head. We've launched a lot of changes that do exactly this. (But you are
right, it is difficult, and fundamentally so. Language is hard.)

~~~
w00pla
Can't you stop crawling sites that dish up your search terms for you (without
any content)? A good example is eudict.com - search for an obscure word and
this is the first that pops up.

It then returns a page with your search queries and no information.

What use is a site that simply returns search queries?

Why aren't sites that scrape content blacklisted?

~~~
moultano
>Why aren't sites that scrape content blacklisted?

The problem is more difficult than you'd think. For instance, virtually every
news organization "scrapes" the associated press, but we wouldn't want to
throw out every news organization.

Content-free search result pages are things we do try to remove, even manually
if it becomes a big enough problem.

~~~
pierrefar
_virtually every news organization "scrapes" the associated press_

If they're not adding real value, like analysis or graphics or commentary or
whatnot, why would you want to keep them if they're all just duplicates?

I had a friend work at a startup to solve this exact problem: we read
virtually identical articles about the same bit of news on all the news sites.
The startup was working on highlighting only the unique bits of each article
and recommending the one article that seems to have the most pieces of
information. You would read the one and skim to the unique bits of the others,
and you would have gotten all angles and facts much more quickly.

Shame they closed it up.

~~~
moultano
We do filter near-duplicates within the same set of results. You'll likely see
only one copy of an AP story with a link at the bottom saying something like
"Repeat this search with the omitted results included"

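For readers wondering what near-duplicate filtering within a result set can
look like, here is a minimal sketch. It is an illustration only, not a
description of how Google does it: word shingles, Jaccard similarity, the 0.8
threshold, and every function name below are assumptions chosen for the
example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(results, threshold=0.8):
    """Keep the first (highest-ranked) result of each near-duplicate group.

    results: list of (url, text) pairs in ranked order (hypothetical shape).
    Returns (kept, omitted); omitted is what a "repeat this search with the
    omitted results included" link would bring back.
    """
    kept, omitted, kept_shingles = [], [], []
    for url, text in results:
        s = shingles(text)
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            omitted.append((url, text))
        else:
            kept.append((url, text))
            kept_shingles.append(s)
    return kept, omitted
```

Comparing every result against every kept result is quadratic, which is
tolerable for one page of results but would need hashing tricks at any larger
scale.
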
------
jhickner
I wish google would just implement a way to add a list of blocked domains to
your search preferences. Then I'd never have to see crap from mahalo,
expertsexchange, or the like ever again.

A way to opt-in to using a community managed list of bad domains would be even
better.
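
A rough sketch of what such a preference could do on the client side,
assuming results arrive as a plain list of URLs. The function names, the
`use_community` opt-in flag, and the `.example` domains are all made up for
illustration; no search engine's real API is implied.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_blocked(url, blocked_domains):
    """True if the URL's host is a blocked domain or a subdomain of one."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def filter_results(urls, personal_blocklist, community_blocklist=frozenset(),
                   use_community=False):
    """Drop results from blocked domains.

    The community list only applies when the user has opted in.
    """
    blocked = {d.lower() for d in personal_blocklist}
    if use_community:
        blocked |= {d.lower() for d in community_blocklist}
    return [u for u in urls if not is_blocked(u, blocked)]


urls = [
    "http://www.spam.example/a",
    "https://good.example/b",
    "http://farm.example/d",
]
print(filter_results(urls, {"spam.example"}))
print(filter_results(urls, {"spam.example"}, {"farm.example"}, use_community=True))
```

Matching on the dot-prefixed suffix means blocking `spam.example` also blocks
`www.spam.example` but leaves an unrelated host like `notspam.example` alone.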

~~~
dotcoma
This is such a good idea - and such a simple idea (spamcontent-block, like
adblock) - that we should have had it 5 years ago already...

~~~
Ardit20
Didn't Google some time ago have these up and down arrows, like here on HN
with comments? I never quite learned what they were for. That is not far off
from marking sites as spam, and I used the up and down arrows only perhaps
once in the entire time.

Search is different from sites like this. Here we know we are going to spend
some time reading interesting content, but not quite what we will be reading.
When searching you know you need some information and what you want to do is
find it, preferably instantly, and get out of Google immediately. It is easier
to click on the next link than mark some site for spam or click the up and
down arrow.

------
DanielBMarkham
I don't have a problem with content mill sites -- as long as they provide me
information quickly in a format I desire.

I think lots of folks want "authoritative". So let's suppose I'm mentally
disabled and in search of a recipe for cake. I google "cake" and I get 40,000
sites. Top of the list is the Cake Institute of America. I google airplanes
and I'm looking at the history of winged flight. I want to learn to tie my
shoes and spend 5 hours on the history of footwear in western culture.

This is just silly. Communications 101 says that the message changes depending
on the audience and the medium. Yet those in the search engine business, it
seems, want "authoritative" and "best" sources. I could give a shit. I want
information custom-made to me -- who I am, how I speak, my culture, my mood,
my life.

The "There can be only one" attitude is not helpful. Somewhere, right now,
some guy wants to find out how to train speckled-bellied pigeons to dance. And
those dang E-how guys probably have a video for it. Back in the day it was
painful as heck to find information. Now companies are figuring out how to
make each little penny they can on creating content. As long as the content is
useful, I think that's awesome.

Having said that, the problem is that the drunken-angry-sailor-web-content is
different from the New-England-school-marm-content. That's okay. There's room
for growth. Isn't progress a good thing?

But the domain-squatting nonsense, and the sleaze factor some of these
companies bring to the table? That's got to go. With lots more tlds I think
the domain-name-spamming business has a limited shelf-life, thankfully.

------
carbocation
The article closes with, "In some sense, Blekko's approach is more democratic
--if any content is good enough for your friends, it's probably good enough
for you too."

If I'm going to use a social search, I want to see things that:

1) Are liked by at least one of my friends.

2) Are not explicitly disliked by (a meaningful threshold, perhaps as few as
1) of my friends.

And really, I'm not so sure that I trust #1, but #2 could be useful to me.
Especially when "friends" gets replaced with "Hacker News," then I'm much more
interested.

In other words, I don't trust the sensitivity of a social network to get me
what I need; I may have interests that reach beyond those of any of my
associates. However, I do, to an extent, trust its specificity for identifying
badness, and that's why I might consider a "social"-esque search filtering
service.

Actually, Gabe, have you ever considered creating some sort of collaborative
filtering tool for duckduck?
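
The two rules above can be sketched as a simple filter. This is only a toy
under assumed data shapes: `likes` and `dislikes` map a URL to the set of
users who voted on it, and the names `social_filter`, `min_likes`, and
`dislike_threshold` are invented for the example.

```python
def social_filter(results, likes, dislikes, friends,
                  min_likes=1, dislike_threshold=1):
    """Filter ranked results by friends' votes.

    Rule 1: keep a result only if at least min_likes friends liked it.
    Rule 2: drop a result if dislike_threshold or more friends disliked it.
    Votes from people outside `friends` are ignored.
    """
    friends = set(friends)
    kept = []
    for url in results:
        n_likes = len(likes.get(url, set()) & friends)
        n_dislikes = len(dislikes.get(url, set()) & friends)
        if n_likes >= min_likes and n_dislikes < dislike_threshold:
            kept.append(url)
    return kept
```

Setting `min_likes=0` turns off rule 1 and leaves only the dislike filter,
which is the half I said I would actually trust.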

------
epi0Bauqu
If anyone has domains to report, feel free to email me and I'll add them to my
training set. I've recently thought about open sourcing this whole piece.

------
JacobAldridge
"We cuts out what Cutts leaves in."

Which is probably an unfair comment given Matt Cutts is dealing from inside a
massive, listed company, but epi0Bauqu / DDG are certainly addressing a
definite user experience problem. Indeed, this is a more tangible benefit for
using DDG than the focus on security and privacy, which I don't believe is
understood by most users.

~~~
megablast
Wow, I think the statement is just a bit of fun.

~~~
JacobAldridge
I agree (I made the comment - that wasn't a quote from the article), but I
wanted to clarify to ensure it didn't come across as snarky or unhelpful.

------
_delirium
This is actually interesting enough that I'm trying DDG as my default search
engine for a bit currently. I probably should've already (I intellectually
like what they're doing, and check it out occasionally), but Google just works
well enough for inertia to keep me there. Getting content-farm results is a
common daily annoyance, though, so this is the sort of thing that could make
for a noticeable improvement in my Search Happiness even in the short term.

------
rkalla
Want to thank moultano (from Google) for replying to so many stories.

One technique that I have manually begun employing to stop landing at what I
would call value-less sites is looking at the PageRank (Chrome extension) of
the host page before considering the individual story.

One reader complained about the hollow/bullshit review sites that match
whatever word you are searching for and then make your eyes bleed with the
sheer number of ads and inter-related affiliate links once you get there -- these
sites never have a high page rank and can be safely ignored.

I don't know if DuckDuckGo or Blekko (is that the new one in Alpha?) are going
to take this type of data into consideration when ranking search results, but
_I_ sure do and it has never failed me.

If you want to see a quick example of bullshit websites -- try and Google for
ANYTHING health-or-weightloss-related. Try "HGH review", "Sensa review" or
just about anything else that you might be curious about in the health/medical
realm and I guarantee you that it will be at least 3 pages of search results
before you find a single article that is not Google-fodder that actually has
_real_ content in it.

In the old days you used to be able to tell a "real" website from a "bullshit"
one by looking at how pretty or professional the site is... un/fortunately the
barrier to a beautiful site is much lower now, and value-less sites that look
as good as professionally developed/run sites are hard to spot instantly.

This is where peeking at the PageRank of the host has helped me quite a bit.

Yes it misses some things, like the case where a useless site produces _1_
article that is good, but in general it keeps me sane and stops me from giving
up search in general.

It would be nice if there was a Chrome/Firefox extension for Google's search
result page that I could click "Submit Complaint" to submit links to Google
complaining about the quality of the result or the site itself. I know they
would have a lot of noise to work through from something like this, but I
would hope over time it would help them be able to spot patterns in these
affiliate-linking-ad-smattered-nightmare sites and get rid of them.

------
jasonmorton
This is a huge problem to address, and could be the thing that changes the
dynamics of search. Google has become useless to me for a lot of searches --
many things that attract "MFA" -- because of the overwhelming amount of search
spam. The problem has gotten much worse over the last few years because of the
content mills presumably. Fortunately Google's still great for sufficiently
obscure things that don't lead to transactions (like research papers).

------
rythie
It seems for every query you really want to have a good Wikipedia-quality
page summarizing the best knowledge available. These content farms
are trying to fill the gap where Wikipedia won't have a corresponding page,
however, they do not have enough revenue per page to pay for content that is
good enough.

------
Kaizyn
Nothing to see here. New up and coming search engine contenders don't list
every site Google and Yahoo! do as a way to differentiate themselves from the
well-entrenched competition.

~~~
Ardit20
We live in capitalism. Communicate or survive.

To address your point though, I think what is new about the article is that
some search engine is trying to address a problem which annoys a lot of HNers.
A problem, I might say, which differentiated Google from the then MSN and
AltaVista etc. That is, irrelevant or shallow search results.

