
Google Web spam - Gabriel Weinberg's Blog - taylorwc
http://www.gabrielweinberg.com/blog/2010/07/google-web-spam.html
======
moultano
Being in the index isn't very meaningful. Very little of our spam-
fighting/ranking prevents sites from showing up for "site:" queries, because
in general we think that if someone intends to go to a domain
directly, the only reasonable thing to do is to show that domain.

Unfortunately I don't have a better way of assessing it to offer. Internally,
we often look at impression-weighted precision as a metric, but I don't think
there's an easy way we could expose that to you.

A more reasonable thing to do would be to take a sample of DDG's query logs,
scrape the results from Google, then see what percentage of Google's results
come from your spam domains, but that requires sending a lot more queries to
get any useful data.
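
To make the last suggestion concrete, here is a minimal sketch of the measurement step only, assuming the results have already been collected into a mapping from query to result URLs. The function name `spam_share`, the sample queries, and the domain `example-spam.com` are all hypothetical, and this is a plain unweighted share, not the impression-weighted precision mentioned above.

```python
from urllib.parse import urlparse


def spam_share(results_by_query, spam_domains):
    """Return the fraction of result URLs whose host is a listed spam domain.

    results_by_query: dict mapping a query string to a list of result URLs
                      (assumed to have been gathered beforehand).
    spam_domains:     set of bare domains, e.g. {"example-spam.com"}.
    """
    total = 0
    spam = 0
    for urls in results_by_query.values():
        for url in urls:
            host = (urlparse(url).hostname or "").lower()
            if host.startswith("www."):
                host = host[4:]
            total += 1
            # Count exact matches and subdomains of a listed spam domain.
            if any(host == d or host.endswith("." + d) for d in spam_domains):
                spam += 1
    return spam / total if total else 0.0


# Hypothetical sample data, not real query logs or real results.
sample = {
    "cheap flights": ["http://www.example-spam.com/a", "http://airline.example.org/"],
    "python tutorial": ["http://docs.example.net/tut", "http://ads.example-spam.com/x"],
}
print(spam_share(sample, {"example-spam.com"}))
```

Every result counts equally here; weighting by rank or by how often a query is issued would need data this sketch does not assume.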

~~~
epi0Bauqu
I think this method would violate (at least the spirit of) my privacy policy:
<http://duckduckgo.com/privacy.html>

That said, if anyone has a meaningful sample query set, I'm certainly
interested in running it against my spam index. I see a lot of hits on it via
other search APIs.

~~~
gojomo
You could let people opt their queries into such studies.

~~~
epi0Bauqu
I don't have accounts right now so it's a bit tricky. If there was some way
people could export their Google search history, then I could use that. Feel
free to email it in anonymously.

~~~
moultano
I found this link which looks like it dumps search history as an RSS feed.
That might be a convenient way to send it over.

[https://www.google.com/history/?lookup?q=&output=rss&...](https://www.google.com/history/?lookup?q=&output=rss&num=1000)
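
If someone does export a feed like that, pulling the queries out could look roughly like the sketch below. It assumes the export is ordinary RSS 2.0 and that each `<item>`'s `<title>` holds the query text; that layout is a guess, not something confirmed in this thread, and `queries_from_rss` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def queries_from_rss(rss_text):
    """Collect non-empty <item><title> strings from an RSS document.

    Assumption: each item's title is the search query. Adjust if the
    real export uses a different element for the query text.
    """
    root = ET.fromstring(rss_text)
    queries = []
    for item in root.iter("item"):
        title = item.findtext("title")
        if title and title.strip():
            queries.append(title.strip())
    return queries


# Hypothetical feed contents for illustration only.
example = """<rss version="2.0"><channel>
<title>history</title>
<item><title>cheap flights</title></item>
<item><title> python tutorial </title></item>
<item><title></title></item>
</channel></rss>"""
print(queries_from_rss(example))
```

Only the query strings are kept, which makes it easier to send a list without timestamps or other identifying fields.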

------
sadiq
On a tangential note, it's a shame that Google don't offer a service like
BOSS (though BOSS could be enhanced by offering revenue sharing, as an
alternative to charging per query).

It seems the Google search APIs have actually gone backwards over the last few
years.

~~~
apollo
Yahoo's goal with BOSS is to fragment the search market; Google is so far
ahead that Yahoo knows they can't compete head-on.

Google doesn't want the search market to be fragmented; they want to dominate
the market. I think that's why Google doesn't have a good search API offering.

~~~
irq11
I think DDG is a great example of why Google _doesn't_ have a BOSS-like API.
It seems pretty clear that DDG is violating the TOS of the Yahoo API by mixing
search results. Yahoo seems to be looking the other way (for now), but you can
bet that Google would be less forgiving.

There's not a search engine out there that wants to allow you to muck with
their relevance algorithm by changing the results, and Google has more to lose
from DDG-like activity than it might gain.

~~~
epi0Bauqu
<http://developer.yahoo.com/search/boss/>

------
spec
The writer states himself that the results of "site:" don't mean anything: "Of
course this says nothing about how much they appear in the rankings." So
what's the point of this article?

~~~
epi0Bauqu
They mean something, i.e. that they "are in their index in some form." I've
been blacklisted before, and when you're blacklisted, you don't show up in
site: queries.

That said, I wanted to acknowledge that this isn't ranking data. However,
perhaps as a result of this post, I'll be able to get some and re-post those
results.

~~~
skinnymuch
I'm sure you would agree with this, but in case others are reading, simply
blacklisting these sites wouldn't be the best thing to do. Many are simply
expired or parked pages.

~~~
epi0Bauqu
Google visits domains all the time so they should be aware quite quickly when
things move from spam/parked to non-spam/non-parked. Therefore, I don't see
why they shouldn't all be out of the index until they have useful content on
them.

