
Jason Calacanis Knows He's Spamming Google, He Just Thinks It's No Big Deal - mvandemar
http://smackdown.blogsblogsblogs.com/2010/02/22/apparently-jason-calacanis-knows-hes-spamming-he-just-thinks-its-no-big-deal/
======
dangrossman
While it's always a bit annoying to see a meme take up article slots on Hacker
News regularly, I don't mind this one. I see sites like Mahalo as doing a
massive disservice to huge numbers of people:

* thousands of AdWords advertisers that have paid for their ads to be matched with content sites, not scraper pages with a huge ad-to-text ratio

* thousands of publishers whose work is being scraped, aggregated and outranked without so much as a backlink

* millions of web searchers that are hitting these pages instead of the real sources of the content they were searching for

And calling out companies that harm the fabric of the web for everyone else is
worth doing.

~~~
w00pla
The question is: why doesn't Google do anything about this? There are
countless such pages. Why not just blacklist domains? Or artificially give
them lower PageRank?

Or is it okay as long as they get AdSense money?

~~~
dlib
Isn't there a search engine out there that takes into account the
ratio of advertisements to content? It would be trivial to get rid of Made for
AdSense (MFA) sites by scanning for the AdSense code used and seeing how often
it's used on a page and the percentage of the page it takes up.
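
A minimal sketch of that kind of scan, run against a saved HTML file. The marker string `google_ad_client`, the function names and the 100-words-per-ad threshold are all assumptions chosen for illustration, not anything a search engine is known to use, and the tag stripping is deliberately crude (it counts inline script text as words and ignores tags that span several lines).

```shell
#!/bin/sh
# Sketch only: marker string and threshold are illustrative assumptions.

# Count occurrences of the assumed ad marker in a saved HTML file.
count_ad_blocks() {
    # grep -o prints one line per match; "|| true" keeps a zero-match
    # result from being treated as an error.
    { grep -o 'google_ad_client' "$1" || true; } | wc -l | tr -d ' '
}

# Rough word count of the page once anything that looks like a tag is removed.
count_words() {
    sed -e 's/<[^>]*>/ /g' "$1" | wc -w | tr -d ' '
}

# Succeeds (exit 0) when the page has ads but fewer than
# min_words words per ad block. Second argument defaults to 100.
looks_like_mfa() {
    ads=$(count_ad_blocks "$1")
    words=$(count_words "$1")
    min_words=${2:-100}
    [ "$ads" -gt 0 ] && [ "$words" -lt $((ads * min_words)) ]
}
```

Usage would be something like `if looks_like_mfa page.html; then echo "thin page"; fi`. A ratio like this is only a heuristic, so the threshold would need tuning against real pages before it could be trusted.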

~~~
skinnymuch
Google allows you to have three ads though. If you had a page with three ads
and the best-written 500-word article on its topic, your definition of MFA
sites would have it banned even though it is obviously a ton better than most
sites around.

~~~
prawn
For smaller publishers, it's usually three blocks of ads plus two blocks of
link units. For those not familiar with AdSense, link units are the horizontal
or vertical lists which take the visitor to a search results page rather than
straight to an advertiser's page.

I think a strict definition of MFA is always going to be a little grey, but
there's clearly something wrong with pages that are nothing other than
widgets, scraped content and the like.

------
jsz0
It's really too bad Google doesn't allow you to simply blacklist domains from
your search results permanently. The thing that frustrates me the most about
these spam sites is the fact they're constantly popping up and my downvoting
of the result seems to do absolutely nothing unless I'm using the exact same
search query. Just let me blacklist Mahalo and other sites like it
permanently. Better yet, make it possible to subscribe to a blocklist so the
community can pool its resources and fight back.

~~~
Devilboy
I request this from Google every couple of months. If I could remove an entire
domain from all my personalized search results, I'd be soooo happy.

~~~
benatkin
I've requested this, too. I even switched to Yahoo! for a few hours because
they let me blacklist expertsexchange.com (since changed to redirect to
experts-exchange.com). I don't see the option on Yahoo!'s page anymore.

~~~
elai
expert sex change .com?

~~~
benatkin
Yes. It used to show up in Google search results as expertsexchange.com. After
being thoroughly razzed, they made it redirect to experts-exchange.com.

Since it was called expertsexchange.com at the time I blocked it in the Yahoo!
search results, I took the liberty of referring to it by its old name.

------
japherwocky
Jason Calacanis is the Paris Hilton of the web.

Mahalo is not particularly interesting, not particularly evil, he doesn't
really do anything, and yet we keep talking about him and putting his shit
on the front page of Hacker News.

It's a mediocre aggregator/linkfarm, with some Mechanical Turk-style
incentives for humans to contribute, and a nice chunk of $ in the bank. Just
ignore/mock him for another year or two until the funding runs dry.

~~~
mvandemar
You are absolutely right, it is mediocre crap. The issue is that through the
spam techniques described in the article, he can now gain undeserved top 10
rankings for many phrases using pages that have no business being there.

If you are not part of the web development community then these discussions
will most likely bore the hell out of you. However, if you are, and if you are
aware of how many innocent sites Google bans or penalizes on a daily basis, or
AdSense accounts that get canceled with no appeal for offenses much less than
his, then this stuff actually matters.

------
moe
One of the things I really miss in Google is a persistent blocking preference,
a.k.a. a site blacklist. Mahalo would go straight in there, along with
expert-sexchange, Sedo parking and a few others.

~~~
ryoshu
Depending on the browser you use, this is easy to do. I have experts-exchange
and other useless sites blocked using Greasemonkey/Greasemetal.

~~~
mvandemar
Yeah, Sedo parking is a bitch tho, since by its very nature it's tons of
different domains. You would need a way to block by app.

------
pclark
a year on, maybe not so crazy now eh?
<http://news.ycombinator.com/item?id=485618>

~~~
mvandemar
Heh, got any stock picks you want to share? :)

Nice call.

~~~
pclark
someone's network of contacts > everything

------
jachee
SEO is snake oil. This is further proof that the whole "industry" is ruining
the integrity and usefulness of the internet.

~~~
aaronwall
That is the angle Jason used when he created his steaming pile. It doesn't
mean that anyone in the industry agrees with Jason's strategy.

And if you want to place the blame where it belongs, remember that Google is
the company funding all this content scraping with their ad programs.

I just tried searching for a recent post from an official Google blog (about
AdSense using referral data for more relevant ad targeting) and found a
scraper site with their ads outranking them for their own content. Pretty sad.

~~~
olefoo
Actually, pretty savvy.

Google gets paid for the scraped site because of ad impressions and possible
clicks.

Google does not get paid for running an informative blog, in fact it's a cost
to them.

Google cares about getting paid.

------
axod
>> "Currently when I look, Google tells me that Mahalo has 356,000 pages
indexed"

I see 'Results 1 - 10 of about 2,200,000 from mahalo.com'

Have things stepped up a gear or am I misunderstanding?

~~~
mvandemar
No, different datacenters will show different results... sometimes very
different. It also matters if you are visiting Google.com or one of the
country variants.

According to what Jason said in another comment, however, all of their pages
are listed in their xml sitemap, and all of those are listed in a master xml
sitemap index located here (warning! huge files if you follow the links in the
first one!):

<http://www.mahalo.com/sitemapindex.xml>

Based on what I saw 2 million+ looks like a huge overestimate, if what Jason
said is true.

Edit: My bad, frederickcook's answer was the right one. I didn't realize you
were doing a regular text search.

~~~
chronomex
There are 12 sitemap-mahalo.xml files. The first 11 each have 50,000 URLs in
them; the 12th has 48,661 URLs by my count. That gives a total size of
11 x 50,000 + 48,661 = 598,661 pages.

Methodology:

    
    
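      # Fetch the sitemap index, then every sitemap file it lists
      wget http://www.mahalo.com/sitemapindex.xml
      sed -e 's/[<>]/\n/g' sitemapindex.xml | grep '^http' | xargs wget
      # Print the number of URLs in each downloaded sitemap (GNU sed assumed for \n)
      for P in sitemap-mahalo.xml*; do echo "$P"; sed -e 's/[<>]/\n/g' "$P" | grep -c '^http'; done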
    

Edit: Formatting
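
The methodology above prints one count per file; a small helper can add them up. This is a sketch under two assumptions: the sitemap files are already downloaded and uncompressed, and each page URL sits in its own `<loc>` element, as the sitemap protocol specifies. The function name `total_urls` is made up for this example.

```shell
#!/bin/sh
# Sketch only: assumes local, uncompressed sitemap files with one <loc> per URL.

# Print the total number of <loc> entries across all files given as arguments.
total_urls() {
    total=0
    for f in "$@"; do
        # grep -o emits one line per <loc>; "|| true" tolerates files with none.
        n=$( { grep -o '<loc>' "$f" || true; } | wc -l | tr -d ' ')
        total=$((total + n))
    done
    echo "$total"
}
```

Running `total_urls sitemap-mahalo.xml*` in the download directory would give a single figure to compare against the per-file counts.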

------
jasonmcalacanis
this is getting really old and we're not interested in doing anything black
hat or even gray hat. as such we're doing the following:

1\. we're removing (or building out) any page in our system created by our
users with under 200 words of original content. This will take a couple of
weeks but it's started.

2\. we're not letting users create stub pages (short pages) until we can
noindex them and put them in a different directory (i.e. /stubs/) so google
can easily tell the difference between them.

these pages are < 1% of our revenue and low single digits of our traffic. we
don't benefit from them materially, and I think we're being targeted by Aaron
Wall and other SEOs for my "seo is bullshit" comment from 2005 or so.

I guess that is fine... I gotta live with the ramifications of what I say.
however, for the record I don't believe that SEO is BS any more... when i said
that it was when we were building joystiq and autoblog and we spent zero time
on SEO.

All that being said, we're being targeted by a small group of folks who want
to take us down. we're only going to get stronger from this because our
hundreds of contributors are rallying around building out the short pages.

Topix, Kosmix, NYTimes and Zimbio are all making quality topic pages and are
not getting attacked over it. not sure why there is some double standard.

regardless.... this is not a material thing for us. we're flushing all these
pages and moving them to a different directory going forward so that search
engines know where they are located (i.e. /stubs/ ).

thanks for the ass kicking.... having a horrible day today over this.

jcal

<http://bit.ly/jasondown>

~~~
tdm911
_we're removing (or building out) any page in our system created by our users
with under 200 words of original content. This will take a couple of weeks but
it's started._

or

 _i'm also getting a list of every page under 300 words and having the page
managers build them out in 30 days or deleting them._

from: <http://news.ycombinator.com/item?id=1143512>

is it under 200 words or under 300? are the goal posts moving already?

~~~
jasonmcalacanis
300 is the goal right now.... we're making plans right now.

.... keep the attacks up, only makes us more motivated.

~~~
aaronwall
There is a big difference between stating an observation and an attack. You
can (and have) dismissed the Google guidelines as irrelevant. But you
shouldn't slag off others as spammers if you are going to do far worse.

You are the one who attacked people. We are simply tracking how well you
performed in your efforts.

Hold yourself to your own standards or close your mouth. Period.

~~~
jasonmcalacanis
Aaron: let's be real for a moment. You're a bit of a troll and you're
certainly obsessed with me to an unhealthy level. I'm flattered, but it is
getting old.

You pointed out some minor stuff we could do better, we're doing better. Now
go back to your basement and hit F9 on Chatroulette.

~~~
corruption
Jason, either you are not actually reading what Aaron's saying or you have an
emotional intelligence worse than my mother. Just saying.

Aaron's stating facts. You are responding ad hominem. Perhaps you should hold
off on the reply button.

------
CoachRufus87
this is getting very, very old. can we move on? please??

~~~
prawn
It's obviously of interest and importance to a number of people involved in
this field or troubled by poor quality material showing up in Google. If
you're not one of those people, it's pretty easy to identify these links and
not upvote them or visit the articles/comments, etc.

There are countless articles on HN that I have no interest in (e.g., I don't
even know what Clojure is), but I just don't click through to them.

------
fjabre
Think this thread has gotten way out of hand.

A lot of the comments in here seem like more of a personal attack than
anything else. You might as well change the title of this post to "Jason
Calacanis ruined the Internet" or something to that effect..

Can we stop the drama already? I think we're going to need a hose to control
this mob.

