
Google’s decreasingly useful, spam-filled web search - ihodes
http://www.marco.org/2617546197
======
patio11
I think people, possibly including me, get irked with Demand Media et al more
because they're more successful than we think they _deserve_ to be rather than
because they actually decrease the value of the SERPs. For SERPs where DM
ranks well, the results prior to DM existing generally pretty much sucked.
Maybe that is a Google issue, maybe that is an Internet issue (memo to
Internet: middle aged women exist, please write for them, kthxbye), but for
whatever reason, if you routinely Googled for [how do i make a blueberry pie]
every week for the last ten years, I don't think you ever had an awesome
search experience.

DM pages are adequate for much of what they rank for, in much the way that USA
Today is an adequate newspaper, your local state school provides adequate
degrees in history, etc etc. They're adequate in a scalable manner, though,
and they understand Google much better than the average publisher, which means
they get visibility in excess of what some people might expect.

P.S.

Demand Media: <http://www.ehow.com/how_2933_make-blueberry-pie.html>

Virtuous publishers on the Internet:
<http://www.pickyourown.org/blueberrypie.php>

If I wanted to bake a blueberry pie, I'd go for that second page every day of
the week, but it is highly non-obvious to me that it is a _better_ result qua
search engine result than the DM page. I love this example because I think
Google fundamentally doesn't think [how do i make a blueberry pie] is looking
for a blueberry pie recipe. Most searches will _not actually convert_ to pies.
For the 98% of searchers who merely want to satisfy their pie voyeurism need,
the DM content may well be _better_.

~~~
bambax
> _[how do i make a blueberry pie]_

You know very well that this is not how search works. You don't "ask a
question" of the search engine the way you would ask your grandmother.

You type in words that you expect to be in the pages you're looking for, and
the search engine lists pages _that actually contain ALL of those words_.

One of the main improvements of Google in the very early days was that it used
the AND operator by default, whereas competing search engines used OR by
default, resulting in an incredible amount of noise.

In essence, searching for "how do i make a blueberry pie" (with quotes) should
return _only spam_, because only spammy, SEO-optimized sites would contain the
phrase as such. A real recipe would maybe contain the phrase "how TO make a
blueberry pie" but not "how do i..."

- - -

I think your point was that "middle aged women" don't know any of this.

It would be arguable (probably wrong, but still) that people who don't know
this, who didn't make the effort to understand a little about how all of this
works, deserve the spam they get.

There is a good way to discriminate between good and bad content, and that is
to _know a little about what you're searching for_ in order to search for
words that will be present in good-quality content and NOT in spammy pages.

For example, it's reasonable to expect a good recipe to give instructions in
the metric system as well as imperial; if you add "celsius" to the search,
then the second (informative) recipe arrives first:

[http://www.google.com/search?q=how+to+make+blueberry+pie+cel...](http://www.google.com/search?q=how+to+make+blueberry+pie+celsius)

~~~
patio11
_I think your point was that "middle aged women" don't know any of this._

No. That is overbroad, untrue, and would be very injurious to my professional
reputation. I said that the Internet is skewed away from producing content
responsive to their needs, which is about as controversial as saying that they
are slightly underrepresented on HN relative to, I don't know, twenty-
something males.

Non-technical users frequently use natural language search. The experience for
natural language search is fairly poor. There are many classes of search which
offer poor experience, but it is the one which leaps to my head first because
I deal with non-technical users every day.

 _people who don't know this, who didn't make the effort to understand a
little how all of this works, deserve the spam they get._

Words cannot express the depth of my distaste for this position. I will accept
that "Google screwed up" or "I screwed up" if one of my users has a suboptimal
Internet experience (which starts at Google because _Google is the Internet_
and ideally ends at my site), but I cannot accept that she is responsible if
she has a poor user experience. We've got the teams of PhDs, the highly paid
SEO consultants, and the lifetime of building an accurate mental model of how
the devil box works. She wants to teach kids to read, not learn magic
incantations. It should -- "should" in the sense of "would be optimal for the
business", "would be optimal for society", and "as a moral imperative for
computing professionals" -- just work for her.

~~~
bambax
(I really don't understand your first and second sentences (what's
injurious?). Also, I'm 40.)

> _I will accept that "Google screwed up" (...) if one of my users has a
> suboptimal Internet experience_

I respect that, and from a business point of view you're very right.

The problem is, how do you improve her experience without screwing up mine?
Why can't I search for pages that actually contain all the words I'm looking
for, as I typed them, and not words "that were present in the page linking to
this page" or words Google thinks I want although I didn't type them in?

From a "moral" point of view (which you brought up), if she "wants to teach
kids to read" maybe she could start by learning how to spell?

~~~
bambax
I don't know why the above comment is being downvoted, but I'm guessing it's
because it sounds "elitist" (which is apparently a very great crime).

To elaborate, then: I agree with the parent comment that it's Google's job to
make everyone's experience optimal (and not the user's), and it's certainly in
the best interests of Google (or any business) to cater to the needs of as
many of its customers as possible (although in the case of Google, as has been
pointed out many times before, users are in fact the product).

But I would argue that the real elitists are people who think "middle aged
women" shouldn't be expected to actually learn how to use machines.

"Middle aged women" (why single them out?) use machines all the time, whether
at work or at home. They're expected to know how to use a spreadsheet, a word
processor, a food processor. And they do. But somehow this expectation is
lifted for "the Internet". Why?

A search engine is not a person; it's certainly not a mind reader. A search
engine is just a machine.

~~~
ghshephard
The objective is Clarke's third law:

"Any sufficiently advanced technology is indistinguishable from magic."

The winner in search will be the one who strives for that.

If _I_ were presented with "How do I cook a blueberry Pie" - I would
immediately do the following search:

[http://www.foodnetwork.com/search/delegate.do?fnSearchString...](http://www.foodnetwork.com/search/delegate.do?fnSearchString=Blueberry+Pie&fnSearchType=site)

Plus a few other pre-eminent and trusted food-networks (Allrecipes) -
searching for "Blueberry Pie" on each one, I'd then scan the quality of the
comments - looking for insight into other chefs (cross-checking their history
to see if they, in turn, can be trusted) who clearly have tried out the
recipes, and have made relevant comments. I would then identify the recipe
that looked most likely to work for me.

I would expect no less from a sufficiently advanced search engine in this, and
all other domains.

~~~
bambax
The Food Network? Really? The home of "semi-homemade"...? ;-)

About Clarke's law, here's an observation by George Bernard Shaw: _"Build a
system that even a fool can use, and only a fool will want to use it."_

Quotations aside, the process you're describing is certainly excellent; it's
probably what Blekko is trying to pull off, in a scalable way. It'll be
interesting to watch how it plays out.

~~~
ghshephard
The second iteration, of course, is to engage with every (valid, trusted,
revenue-generating, etc.) customer who searched, determine the quality of the
results, and then feed _that_ information back into the algorithms. You could
then bias based on domain experts (a world-class chef's feedback on blueberry
pies counts for more than an anonymous user's).

It may be the case that (AllRecipes, TheFoodNetwork, etc.) are NOT the best
place to search for a recipe, and that, indeed, <http://pickyourown.org> is
knocking it out of the park this week.

There is a lot of room for search to improve - I think the company that beats
Google (if it's not Google that does so first) will be the one that manages to
start creating the search <--> consumer <--> search feedback quality loop.

PageRank was just the beginning.

~~~
eru
Google may already collect enough data to do this. They track clicks on the
search results, so they can see whether you liked the results, whether you
went back to a different result after visiting your first, and whether you
modify your search terms for another search because the first one did not work
out.
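
A toy sketch of that inference, assuming a simplified, invented click-log
format (real signals and thresholds would obviously differ):

    # Infer result quality from a (made-up) click log: a quick return to
    # the results page ("pogo-sticking") counts against a result; a long
    # dwell time or no return counts for it.
    from collections import defaultdict

    # (query, url, seconds before returning to results; None = never came back)
    click_log = [
        ("wire an outlet", "ehow.example/outlet",    8),
        ("wire an outlet", "diy.example/wiring",     None),
        ("wire an outlet", "ehow.example/outlet",    5),
        ("wire an outlet", "forum.example/outlet-q", 240),
    ]

    score = defaultdict(float)
    for query, url, dwell in click_log:
        if dwell is None or dwell > 30:   # satisfied click
            score[(query, url)] += 1.0
        else:                             # bounced straight back to the SERP
            score[(query, url)] -= 1.0

    for (query, url), s in sorted(score.items(), key=lambda kv: -kv[1]):
        print(f"{s:+.1f}  {url}")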

------
moultano
We're working on it (as always). There is a big improvement inspired by the
Stack Overflow post on its way shortly.

If people want to help out, _the best_ thing to do is to post examples of
specific queries. Those become the "fixed points" around which we can tune
until we get it right. The more example queries the better, and I'll make sure
they get to the right people.

A good way to get example queries is to look through your search history,
which if turned on can be found here: <http://www.google.com/searchhistory>

~~~
kitsune__
What happened to exact search queries:

For instance, if I search for "a-r" I receive results for "ar".

I hate this. It makes it impossible to filter irrelevant results.

Or try this query: "a-c" -"ac"

This will return 0 results.

~~~
cap4life
And this is where the downfall of Google begins. I also hate that some exact
queries are being broad-matched without my consent.

------
noibl
In a lot of the comments around this lately, people have been saying that this
is something Google can fix, or needs to fix.

I would suggest that the content farms' success in gaming specifically
Google's algorithm was an inevitability (whatever the current state of the
arms race) and the only thing that will weaken the effectiveness of their
techniques is to expose their business model to a greater range of algorithms.
If you have three or four search engines all working on slightly different
principles, it becomes a lot harder to game them all with the same content,
even if gaming any one of them would be trivial. In other words, competition
in the SE space at the algorithmic level is something we sorely need to see.

In parallel, my suggestion for one new search engine to add to the mix: a
crawler for unsubsidised content. That is, the results consist solely of pages
that don't carry advertising of any kind. This wouldn't exclude ecommerce
sites but would exclude most kinds of affiliate marketing. Subscriber-only
sites could pay to be indexed at a flat rate, though guaranteeing that this
fee wouldn't affect rankings might be tricky. Alternatively a journal-access
style of subscription model could see the SE paying the content site owner
when one of its paying users consumes their information.

~~~
gojomo
_a crawler for unsubsidized content_

Note that this doesn't require an all-new crawl/engine: just for an existing
engine to offer an advanced operator that filters ad-drenched pages from
results. Even just an operator that eliminated AdSense sites would be a big
win for some queries.
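
A minimal sketch of what such a filter could look like; the substring
heuristic and the marker hosts are illustrative assumptions, not how an engine
actually detects ads:

    # Post-filter that drops pages carrying common ad-serving scripts.
    import urllib.request

    AD_MARKERS = ("googlesyndication.com", "doubleclick.net")

    def carries_ads(url):
        # Fetch the page and look for known ad script hosts in the HTML.
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read(200_000).decode("utf-8", errors="replace")
        except OSError:
            return True  # unreachable page: err on the side of dropping it
        return any(marker in html for marker in AD_MARKERS)

    def noads(result_urls):
        return [u for u in result_urls if not carries_ads(u)]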

~~~
noibl
While technically feasible, this would rule out a large number of high quality
content sites which are currently ad-supported. For example, specialist
community forums often carry ads just to cover costs. You need to allow for a
different funding model for sites like that, such as subscriptions.

------
tumult
Neal Stephenson's novel _Anathem_ has a section that talks about how the
'reticulum' (the internet, in the book's fictional world) was overrun with
false copies of documents with slight changes made to them. 99.99% of all of
the information on the internet was spam.

A huge industry of commercialized systems connected to the internet for the
sole purpose of filling it with spam; the corporations would then sell filters
and knowledge of which documents _weren't_ spam back to customers.
Eventually, the algorithms used to modify documents developed a malicious
edge, so that the thousands of spam copies of an original document would be
deceptive in ways that would harm people (e.g., in Marco's electrical plug
wiring example, the document would have been modified so that it could get you
killed by telling you to touch the wrong wire or something.)

Inevitably, it spiraled out of control, and a sophisticated system of social
trust and ranking was put in place by IT workers and systems administrators,
who are a caste and race of people in the fictional world.

Good book. Prescient, even.

~~~
jf
It wouldn't be the first prescient thing Neal Stephenson has put in a book.
Whenever I see Google Earth, I think of Snow Crash.

~~~
acon
Since Google Earth was inspired by Snow Crash I'm not sure you can classify it
as prescient in the same way.

------
andrewljohnson
This is much ado about nothing. Google has a few search problems, and they
always have, and they always improve.

Also, if Marco is going to list some problems, how about listing some problem
searches? I search for what he lists, and the top result is fine in most
cases, and debatable in others.

You folks think Google sucks? I don't. It's awesome, and I rely on it more
every day.

~~~
johnny22
It really depends on what you're searching for. I purchased a handset that
runs Android in September and was hoping to find out when it would be upgraded
to a newer version of Android. The first two pages were almost completely
filled with the same article.

To be fair though, I tried the same search on DuckDuckGo and Bing and got
mostly the same results.

I was just expecting Google to be better than that.

~~~
andrewljohnson
Sounds like an obscure search, and if you aren't going to list the search
terms you tried, it's really hard to judge whether it's user error or Google
error. Google is only psychic to a point.

------
cletus
Sites like Demand Media see a gap for particular content and churn out cheap
crap.

Blogs see an idea in the public consciousness and jump on the bandwagon with
derivative posts.

Anyone see the parallel? Actually, there is a difference: the DM writer got
paid.

Product searches have been screwed for years. I've often wished I could filter
out any price search engines and/or retailers from results. What's worse is
that all these sites have places for reviews (of which there are never any),
but hey, the review keyword is there.

But as for this post there's nothing new here. It's a rehash of a bunch of
other posts from the last month.

I can still find what I want with ease on Google. Am I just some kind of
gifted searcher? I seriously doubt it.

It's like these posts are all making slippery-slope arguments ("there are two
content farm results on the first page; if this trend continues there will be
7000 content farm results") rather than complaining about the actuality.

The other mistake made here is to assume Google's algorithm is static. This is
false. It's a rapidly moving target.

Like another comment says: such noise (spam) isn't unique to Google, so is the
"problem" with Google's index or with the Web itself?

If nothing else these posts all make the case that Google's index is
algorithmic. I say this because at different times you'll see conspiracy
theories about Google promoting certain properties over others.

Here's a question: if Google started blacklisting sites, how soon would the
complaints of censorship or favoritism start?

~~~
bad_user
I agree with most of your points, but I would truly value Google providing me
with the capability of blacklisting websites for myself.

Such a thing is already possible with just browser plugins, but I'd like that
blacklist to follow me around and grow algorithmically (based on preferences
of similar users), and hacking around a product's deficiencies is really not
"voting with your wallet".

And in such a case Google couldn't be accused of censorship / favoritism.

------
DanielBMarkham
I wonder how many duplicate-topic and mostly-duplicate-content articles we're
going to see about how Google provides duplicate-content and duplicate-topic
answers to searches?

My irony meter is pegging.

~~~
ivankirigin
It is a public debate whose nature is similar to discourse that focuses in on
an approximate truth. Your comment is especially unfair when directed at such
a consistently high-quality blog.

~~~
wonderzombie
Could the OP be referring more specifically to HN? HN is akin to a curated
system, and of course there are plenty of duplicate submissions on HN.

I'm not expressing an opinion one way or the other. Rather, that was my take
on his comment, which amused me.

------
fhars
What really irks me in the last few months is that Google increasingly doesn't
actually answer my questions. More often than not, none of the results on the
first page contain all of my search terms, and most of the time it is the most
specific term that is missing everywhere. Or the big G has replaced that term
with something completely unrelated. I have to prefix every search term with a
+ if I want to get a result quality that is even remotely similar to what used
to be the default.

~~~
Matt_Cutts
Example searches? That's the most constructive way to help us improve.

~~~
muuh-gnu
Isn't it telling that you, if you're speaking on behalf of Google, i.e.
working there on search, don't even know where your weaknesses are and don't
know where to even begin? If you have to rely on community input to even spot
a direction for improvement, you probably are in deeper trouble than all these
articles suggest. Would Google even be where it is today if they hadn't
spotted a weakness and improved it a decade (or more) ago?

I've personally mostly given up Google for basically _any_ kind of product
search, because it reminds me of going through my spam-riddled email inbox
before my provider had spam filters (or before I switched to Gmail). Before
that, I had to give up the Google Groups Usenet interface because it was so
cluttered with spam that it wasn't even funny any more how useless it was, and
Google _still_ kept associating its name with it.

If you have to, kick the spammers out manually, to prevent users jumping ship,
until you eventually sort it out algorithmically.

Edit: When you downvote a posting I put time into writing, you could at least
tell me what I actually did that was downvote-worthy, so I can refrain from
doing it in the future, since it is in our common interest to avoid downvoting
and being downvoted. So, can you please elaborate in hindsight?

~~~
Matt_Cutts
I know plenty of weaknesses in Google, and we work hard on the problems that
we think matter the most (e.g. in 2010, we worked a lot on hacked sites so
that regular people wouldn't stumble into an awful experience).

But it's very helpful to get independent, outside examples. It moves the
conversation past "Google sucks" to "Google sucks because of query X."
Sometimes those queries are new, but often what's just as useful is hearing
what people dislike about the current results for the search X.

~~~
MichaelEdits
I find it very hard to believe that Google's Matt Cutts would come down from
the ivory tower to answer comments on a website.

Google suspended my Google Checkout account because two words in the title of
my book (make money editing from home) sent up a flag. No matter how many
times I offered to give them a copy of my manuscript, they declined.

The eventual apology was appreciated, as was the full reinstatement, but it
doesn't change the fact that I was wrongly accused, wrongly convicted, and had
no court of appeal to go to except the very people who shut me down.

I no longer use Gmail, Google Toolbar, Chrome, Google AdSense, Google
Checkout, Picasa, Blogger, Feedburner, Google Webmaster Tools, Google as a
search engine, or whatever else I can think of. Google violated my terms of
service.

Oh, and the nastygrams looked outsourced, and the apology had typos and bad
grammar. You guys should hire an editor.

~~~
MichaelEdits
Actually, I'm not done. I didn't do anything wrong. Google admitted this. It
was their mistake, not mine. They admitted this in their apology, which I did
appreciate.

For my account to be randomly suspended reminds me how much an Act of Google
resembles an Act of God. That's what annoyed me so much. Google did evil that
day.

And that, to bring it back to the point "Matt Cutts" made, is WHY Google
sucks. It can't be fixed.

------
bambax
It's easy to filter out spam once we identify it; so the question is: "what is
spam"?

Some argue that content farms such as Demand Media aren't spammers, because
the content they produce actually satisfies the casual searcher _better_ than
elaborate, scholarly exposés on the same subject. Casual content for casual
searchers.

Others consider content-farm-issued pages the epitome of spam: spam that
doesn't look like spam, and that ends up cluttering search results. Spam is
not irrelevance: _spam is clutter_.

A corollary to "what is spam" is: "who should make the call"?

Originally, Google tasked itself with making this call, and it did a pretty
good job at it.

But why not me? It should be possible for Google to distinguish between
"casual" and serious content, and then let the user decide which they prefer.

Well actually, that is already possible: it's called "reading level" and it's
accessible on the advanced search page.

Searching for "how to wire an outlet" gives ~12 M results, the first of which
comes from about.com.

When filtering the search to display only "advanced reading level" results,
there are only 264,000 results left, the first two coming from Wikipedia (and
the 3rd and 4th still coming from ehow.com).

So Google _already_ knows what is "casual content" and _already_ lets users
filter it out.

Maybe a simple solution would be to add the filter directly in the search
results page instead of having it buried in the advanced options.

------
Osiris
One thing that I notice about the spam sites and scraper sites is that they
often have very similar content and/or layout. What if Google were able to
determine how similar certain sites are and consolidate those into a single
result, like they do with Google News?

Then when I search for AMD Bulldozer news and there are 20 sites all with the
same article from the same date, I wouldn't have to change my search
parameters to show just the last month. Instead, it would determine that the
content was similar, smash it into a single result, and leave room for nine
more less-similar results that may better include what I want.
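
A rough sketch of that consolidation; the texts are invented, and a real
engine would use a scalable near-duplicate scheme (MinHash/SimHash) rather
than this brute-force comparison:

    # Collapse near-duplicate results by word-shingle overlap.
    def shingles(text, k=3):
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def consolidate(results, threshold=0.8):
        # Greedy: keep a result only if it isn't a near-dup of one kept.
        kept = []
        for url, text in results:
            sig = shingles(text)
            if all(jaccard(sig, s) < threshold for _, s in kept):
                kept.append((url, sig))
        return [url for url, _ in kept]

    results = [
        ("site-a.example/amd",  "amd bulldozer chips ship next quarter says amd"),
        ("scraper.example/amd", "amd bulldozer chips ship next quarter says amd"),
        ("site-b.example/amd",  "an in depth look at the bulldozer architecture"),
    ]
    print(consolidate(results))  # the scraper is folded into site-a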

------
petercooper
Decreasingly? This has been a rollercoaster for years. I was more of a
WebmasterWorld regular a few years ago than I am now, but around 2005-2006 a
lot of people thought Google had gone to pot.

[http://www.theregister.co.uk/2006/05/04/google_bigdaddy_chao...](http://www.theregister.co.uk/2006/05/04/google_bigdaddy_chaos/)
<http://www.webmasterworld.com/google/3040496.htm>
<http://www.seo-news.com/archives/2006/apr/6.html>
<http://www.webmasterworld.com/forum30/34407.htm>
<http://www.mattcutts.com/blog/feedback-webspam/>

Plus ça change...

------
ivankirigin
Could it possibly be that Google is in the middle of an innovator's dilemma?

Twitter, hacker news, tumblr, and quora are all really shitty google
replacements. But I use them to get certain kinds of information. It isn't
enough to justify a radical change at google -- especially if they are even
slightly focused on maintaining revenue.

There must be an opportunity for a more curated experience where the browsing
behavior of a few thousand selected people can be used to juice authority. I
don't think the human editors need to know they are doing that job. Maybe they
should use Chrome data for this.

------
jacques_chester
A first step would be to hide the queries data (especially trending queries).
It was an interesting curio but its major consumers now are spammers.

------
ja27
Am I using a different Google than him?

I type in "how to wire an outlet" and all the top results look useful. Sure
there are some ads embedded on the pages and the top hit is about.com with a
10-page slideshow, but every hit looks like it explains exactly how to wire an
outlet.

<http://www.google.com/search?q=how+to+wire+an+outlet>

Even when I try the spammiest searches, it looks like they're returning pretty
relevant results:

<http://www.google.com/search?q=best+price+on+viagra>
<http://www.google.com/search?q=wrist+watch+deals>

~~~
spot
i ignored these "google sucks because of spam" articles until this one. i
tried his first worst example [large sensor compact camera]. almost all the
results on the first page are good. #3 is suite101 which is one of these
farms, but it actually contains good content too. so i will go back to
ignoring these "google sucks because of spam" articles, unless one shows up
that has some quantitative results.

------
FiddlerClamp
My guess for the near-to-mid term is celebrity curation.

I keep thinking about how Roger Ebert, after decades of movie reviews, started
branching out into political (anti-Tea-Party) commentary and other articles.
If you knew that a trusted brand (for many) like Ebert was curating home TVs,
or projectors, or blank DVD media in an unbiased way, wouldn't you want to see
what he had to say?

Or Thomas Dolby on audio equipment, Sting on Tantric books, and so on. They'd
make money through affiliate links or even subscriptions.

------
bambax
Isn't it possible that all of this recent bad press about Google could be a
consequence of "Instant"?

Here's my thinking:

- to get good results, one needs to type as many relevant words as possible

- Instant encourages people to type _fewer and fewer_ words (not even words: a
few keystrokes and you're done)

But if you type very few words, or if you search for "frequent" queries
(generated by Instant in response to your few keystrokes), then all you get is
spam.

Spam is optimized for frequent queries, not very specific ones. Instant should
be renamed _Instant spam_.

~~~
phpnode
That's not really the case: plenty of spammers optimize for long tail searches
because 1. they're easy to rank for, and 2. they're easy to create ambiguous,
autogenerated content for.

~~~
bambax
> _plenty of spammers optimize for long tail searches_

But how do they do it? By nature, there are many more long tail searches than
frequent ones, and each one is rarer (or unique).

How do spammers find them?

~~~
cap4life
Through keyword-generating tools such as the Google AdWords keyword generator
and a host of free online ones (of questionable quality).

------
wheels
I started writing out a comment on the somewhat heretical notion that biasing
search results _against_ AdSense click-throughs would probably be a strong
predictor for spam detection, but the comment got long enough that I folded it
into a blog post:

<http://news.ycombinator.com/item?id=2074621>

------
stcredzero
_It’s impossible to do any meaningful product research with Google._

Right now, I often start my product research within Amazon. However, that's
only a start, as Amazon isn't great for everything. For large appliances,
Consumer Reports is a good starting place. I guess I'm an example of the
switch from search engines to "expert" sites.

~~~
bambax
Amazon is great but search on Amazon doesn't work too well; what works very
well is to use Google and restrict the search to Amazon:

 _site:amazon.com some product_

------
stretchwithme
I have to say there's some truth to this. Why is it that I increasingly must
search through the search results just to find the site that originally
published the string returned in the first 3 to 8 results?

I don't want to patronize all these sites repackaging content created by
others, yet they continually appear before the creator.

------
gallerytungsten
One wonders if Google is becoming the new Yahoo. If so, a big opportunity for
the likes of DuckDuckGo and other nimble searchers. Today's upstarts can also
run on the cloud, sidestepping the need to build Google-scale data centers (at
least initially).

------
randrews
This kind of worries me.

On the one hand, Google isn't the best web search tool. I've switched to
DuckDuckGo, and so has everyone who's seen me use it. But, I think Google
still provides a valuable public service: indexing the entire web and handling
that much traffic is not an easy task, and a lot of other things (like DDG)
depend on that humongous cluster.

So on the one hand I want to see the best search engine win, but on the other
hand if Google goes out of business (or more likely, starts losing money and
canceling projects) then I'm afraid it'll take a lot of things out with it,
with no clear replacement.

~~~
poutine
Indeed, perhaps it's time to give DuckDuckGo a try. They seem to actively
filter out all ad sites. I've been going nuts in the past couple of months
with Google, searching for technical solutions to problems, command-clicking
on links to open them in tabs, and discovering that most are clone ad sites
from a question on Stack Overflow.

~~~
randrews
It was these that won me over: <http://duckduckgo.com/bang.html>. Most of the
time I know what general sort of thing I'm looking for, and I'm happy to give
the search engine hints if it'll help. Even if I don't, it'll essentially
flat-out ask me what sort of thing I want, and then give me more specific
results: <http://duckduckgo.com/?q=ruby>

But, I mean, I'm doing all this by typing it into Chrome's address bar. And a
lot of the time, DDG is just returning results from Google's API. I want to
_use_ DDG, but I want to make sure Google _still exists_ because I need it
even if I don't use it.

~~~
bad_user
I just use the browser, instead of the !bang feature in Duck Duck Go.

Chrome / Firefox have the option of adding a search shortcut.

For instance I type "py package-name" for searching inside Python's index, or
"am product" for searching in Amazon, or "w something" for searching
Wikipedia, or "t some words" for doing a google translate, or "hn something"
for searching Hacker News.

Chrome even does these shortcuts automatically, so you can just type
"amazon.com android" and it will do a search on Amazon; although in Firefox it
is easier to add your own.

~~~
T-hawk
FWIW, Opera has this too, and in fact had it first, by about 2005.

------
mhb
Isn't this how Facebook topples Google and completely dominates the internet?
By incorporating your social graph into your search results, your
relationships can influence what is returned by the search.

Suppose you could create some sort of "friend" list with HN users and that
were used to prioritize your search results. If you get a result you don't
like, click that you don't like it and the software will reduce the weights of
the parts of your social graph which caused that result to be highly scored.

------
radley
I set up a second, filtered search using Google Custom Search and added it to
my browser. I don't always use it, but it's easier to switch to when I
encounter spammy topics (like code look-ups). It's pretty easy to blacklist
fakes... and even useless SEO-heavy sites like experts-exchange, bigresource,
etc.

Here's how, if interested: <http://radleymarx.com/blog/better-search-results/>

~~~
iworkforthem
I am beginning to use this mode of Google search more; it cuts out a lot of
the spam.

------
prawn
Is there a huge incentive for Google to improve if a lot of the content farms
are monetised by AdSense and actually return Google money?

You could argue that they might lose their spot as the default search engine
for a lot of people, but Microsoft has presumably thrown a huge amount of
money and expertise at the problem and hardly dominated. I suspect this is not
going to be a significant problem for Google in a hurry.

------
PaulHoule
I'm not sure if prioritizing links over keywords is really going to help
matters.

I know a lot of 'little guys' who know something about a topic and can write
prolifically, but who suffer under the delusion that 'If I build it, they will
come.' Success in SEO is largely possible because 95% of webmasters have no
idea how to promote content.

I've also developed 'digital libraries' for major academic organizations and a
common thread there is a complete lack of interest in indexability. There's a
lot of fantastic content trapped in the ivory tower because nobody considered
the 'unwritten standards' for how the web worked.

A big part of the problem is that it's very hard to get legitimate links these
days. You used to be able to get into the Yahoo directory for free, but now
you have to pay a $300 a year bribe. Before 2000, it was common for people to
create large collections of links they liked. Today, major players like
Engadget have a policy of not wasting their PageRank on other sites. Afraid of
spam, many blogs and forums are on a hair trigger to stop people from dropping
links in comments, relevant or not.

If legitimate links are harder to get, that 'lowers the bar' for spammers.

A real answer to spam would be to strengthen the signal so it can break
through the noise. It might be helpful to be able to get more feedback from
web users about the quality of pages, but this is tough. The horrible truth is
that there are more pages on the web than there are viewers, so even if you
could get feedback from 10% of viewers, many pages would be badly
undersampled. Spammers would also target any feedback channels that exist, and
with low response and sampling rates, it might be easier to overload the
feedback channel than it is to create link noise.

Another answer is to beat Demand Media at their own game, the same way that
Stack Overflow has beaten the spam sites that dominated programming questions
two years ago.

------
jonknee
I think search quality would go up if Google gave me the option of blocking
domains from SERPs. I never want to see results from a content mill (eHow,
Mahalo, etc.), in addition to all the made-for-AdSense sites I come across
less frequently. They could also use the collective blocking data to help
tweak the spam filter.

------
apedley
I think the integration of social media is a possible solution.
Recommendations and likes (from Facebook or other places) are hard to
artificially jack up and can also offer great results if they're tied to your
friends. I like the direction Bing is going with this. It is the only way I
can see to get large-scale human-edited results for the web.

Unless Google develops highly advanced AI (which is a possibility), computer
algorithms can be gamed. Humans can be gamed as well but because we are all so
different I don't think there is a single approach that would fool a large
segment of the population at once.

~~~
rjvir
"Recommendations and likes (from Facebook) or other places are hard to
artificially jack up"

If recommendations and likes are added to Google's algorithm, people will find
ways to artificially jack them up. For example, marketers have aligned
networks of Digg users to increase the number of Diggs. I find it hard to
believe that Demand Media and others will not be able to artificially inflate
Facebook likes.

------
aheilbut
I'm not quite sure how it will happen, but at some point I think it will be
beneficial to commoditize the underlying crawl and index data, so that there
can be more domain specific focus and more diverse sources of innovation
applied to solving this and other search problems. One or two sources trying
to be all things to all people and all problems isn't going to scale.

Blekko's slashtags are a good start, but it needs to go much further.

------
jeisc
These bad search results are not an IT problem; they are the result of top
Google executive policy: act like you want to be good and do better for
Google's users, but keep serving up the same old stuff because it is tied to
the revenue cow and the paying clients. This problem would never have existed
if Google considered the end-user experience more important than the
advertisers'.

------
jl6
Why does Google no longer offer the option to permanently remove a specific
domain from your search results? My personal search quality would be
dramatically improved if I could specify even a short blacklist.

In fact, dear lazyweb: is there a browser extension or Greasemonkey script
that makes Google return 100 results at a time and then picks out the best 10
based on a blacklist?

~~~
eli
Set Google to return 100 results and then use a userscript or extension to
filter them. E.g.
[https://chrome.google.com/extensions/detail/ddgjlkmkllmpdheg...](https://chrome.google.com/extensions/detail/ddgjlkmkllmpdhegaliddgplookikmjf)
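
The filtering step itself is simple; a sketch in Python (the domains are
examples only, and the domain extraction is deliberately naive):

    # Drop blacklisted domains from the 100 results, keep the first 10.
    from urllib.parse import urlparse

    BLACKLIST = {"ehow.com", "mahalo.com", "experts-exchange.com"}

    def filter_results(result_urls, keep=10):
        survivors = []
        for url in result_urls:
            host = urlparse(url).hostname or ""
            domain = ".".join(host.split(".")[-2:])  # naive eTLD+1
            if domain not in BLACKLIST:
                survivors.append(url)
            if len(survivors) == keep:
                break
        return survivors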

------
micaelwidell
Couldn't part of this problem be solved with an algorithm that identifies when
several pages have roughly the same content (i.e. the original Wikipedia
article + 5 copies of it elsewhere on the web) and then gives the oldest
occurrence in the index a much higher rank?

That would kill the incentive to create these spam sites and give the user the
result s/he was looking for.
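
A sketch of the "oldest copy wins" half of that; the URLs and dates are
invented, and the exact-hash fingerprint is a stand-in for real near-duplicate
detection:

    # Within each group of identical content, boost the earliest-seen copy.
    import hashlib
    from collections import defaultdict

    pages = [  # (url, first-crawled unix time, page text)
        ("wikipedia.example/foo", 1200000000, "the foo article text"),
        ("scraper1.example/foo",  1290000000, "the foo article text"),
        ("scraper2.example/foo",  1293000000, "the foo article text"),
    ]

    groups = defaultdict(list)
    for url, seen, text in pages:
        fingerprint = hashlib.sha1(text.encode()).hexdigest()
        groups[fingerprint].append((seen, url))

    for dupes in groups.values():
        dupes.sort()  # earliest first-crawl date first
        canonical, copies = dupes[0][1], [u for _, u in dupes[1:]]
        print("boost:", canonical, "demote:", copies)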

------
jv22222
Maybe Google could scrape DDG on the fly for each search, then do a diff, and
filter out any results that aren't in DDG... that would be the fastest way to
remove spam ;)

------
yhlasx
People just can't come up with the right search queries and blame the search
engine. I always find what I want via Google.

------
MichaelEdits
I was on the Internet long before Google showed up, and I'll be here long
after Google is dead and forgotten.

------
didip
Google, can't you solve these problems with money?

Pay an army of users to press ham/spam buttons, Mechanical Turk style.

------
shadowpwner
Please, not a rehash of what we've been reading for the last couple weeks.

~~~
macrael
I do find it interesting how this has become a meme, but I think that Marco
has added something interesting to the discussion. I like his categories of
searches, and the decision he outlines that Google needs to make.

But, really, if you don't want to talk about this subject anymore, close the
tab.

~~~
shadowpwner
Apologies. I skimmed through the first few paragraphs and thought that it was
a summary. I was attempting to help HN cut back on multiple posts on the same
old content. Ah well, I guess I won't be doing that any more.

------
iconfinder
I don't think this is surprising - the top management seems more interested in
building OSes and social networks. Search doesn't seem like their highest
priority anymore.

~~~
Matt_Cutts
There are more people working on search quality than ever in our history. The
PR team for search quality does a great job pitching stories about search
quality, how it's hard, etc., but lots of reporters prefer to write about the
shiny things (or more tangible things--you can hold a phone in your hand)
rather than improving search quality.

But Google continues to work hard on improving search every day, even if that
doesn't always get covered.

------
ergo98
_One solution may be for Google to radically change their algorithms and
policies for web search to de-emphasize phrase-matching and more strongly
prioritize inbound links and credibility._

Inbound links and the calculated "credibility" from the same are what killed
the web the first time around. There was once a democratized web era when that
actually worked -- when millions of people had their little Geocities pages
and were linking to the cool stuff -- but in the modern era it's 99% consumers
who cast no votes, and the last 1% is extraordinarily incestuous circular link
love: Marco links to Coding Horror who links to Daring Fireball who links to
Scoble who links to Marco, etc.

People with neither information nor authority end up being the credible
authority on matters they aren't authorities on. Scoble a few years back
pointed out that, according to search engines, he was the most important
Robert in the world. That is a frightening concept.

We will move from an era of search engines to an era of expert engines. Many
of the questions I used to "ask" Google I now ask of Wolfram Alpha, and its
approach has turned out to be quite useful. Expand that computer knowledge
more broadly, improve the human syntax parsing, and we'll have a winner.
Several such systems are built around computer learning of the Wikipedia
corpus.

~~~
mlinsey
 _Several such systems are built around computer learning of the Wikipedia
corpus._

Any ones in particular that you've found work well?

~~~
michaelbuckbee
Powerset worked well enough that Microsoft bought them for $100 million.

