

Crawling the web at $2 per million pages - yarapavan
http://www.80legs.com/whatitis.html

======
anurag
I tried 80legs a couple of weeks ago when it was still in beta. It works well
if you're willing to write Java code to process HTTP responses (see
<http://80legs.pbworks.com/80Apps>). You can issue new requests in your
processing code and write processed data to files which you can download once
the crawl finishes. Even without Java, you can perform basic keyword/regex
matching on crawled data literally by filling out a web form. And they have an
API of course. Very impressive.

However, if you want to run crawling on your own infrastructure, I recommend
Scrapy (<http://scrapy.org/>), a Python crawling framework introduced on HN
last year. Scrapy solves some of the more time-consuming problems involved in
writing a crawler from scratch (multiple simultaneous requests, pipelined
processing, raw caching, duplicate URL filtering) and comes with nifty
development and administration tools. More importantly, it has an active and
helpful set of core developers and good documentation. I am comfortable with
both Python and Java, but I chose Scrapy over 80legs because I can crawl for
free on the machines I already have, and I can afford to spend more time
crawling from a single IP, whereas 80legs would let me crawl much faster but
isn't free. Also, with Scrapy my bot can be 'naughty' - 80legs jobs obey
robots.txt and limit the crawling rate per domain.
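
To give a flavor, a minimal spider looks roughly like this (API details vary
by Scrapy version, and the spider name, start URL, and selectors are
placeholders, not from any real crawl):

    import scrapy

    class SketchSpider(scrapy.Spider):
        # Everything here is illustrative: name, start URL, and selectors.
        name = "sketch"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Scrapy schedules concurrent requests, filters duplicate URLs,
            # and can cache raw responses to disk for you.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
            }
            # Follow links; already-seen URLs are dropped by the dupefilter.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

You run it with something like "scrapy runspider sketch.py -o pages.json" and
get the yielded items back as JSON.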

When my crawling needs outgrow my infrastructure, I am going to look at 80legs
again.

~~~
jdrock
Just a quick heads up - we'll be looking into allowing JRuby and Jython fairly
soon for creating 80Apps.

You'll also be able to mash up third-party apps and libraries into your own
code. This is something we hope to have in 2-3 months.

~~~
coderdude
How come there isn't a REST API? Java, JRuby, Jython. Ewwww. This isn't how
outfits are doing things anymore...

~~~
jdrock
The crawl runs on a super-heterogeneous network of computers from around the
world. The JVM sandbox is the only thing we know of that works in that
environment, and Java is the only language that has something like it.

Yes, Java isn't as sexy anymore. But running your code on 50,000 computers
sounds sexier to me than a REST API.

------
kbrower
I would love it if it processed, stored, and indexed the data.

<http://parselets.com/> + mysql + sphinx
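
Roughly the pipeline I have in mind, sketched with lxml's CSS selectors
standing in for a Parsley parselet and a made-up `pages` table that Sphinx
would index separately:

    import lxml.html
    import pymysql  # assumes a MySQL database that Sphinx is set up to index

    def store_page(conn, url, html):
        # Extract fields with CSS selectors (a stand-in for a parselet).
        doc = lxml.html.fromstring(html)
        titles = doc.cssselect("title")
        title = titles[0].text_content() if titles else ""
        # Write to a made-up `pages` table; Sphinx reindexes it out of band.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO pages (url, title, body) VALUES (%s, %s, %s)",
                (url, title, doc.text_content()),
            )
        conn.commit()

    # conn = pymysql.connect(host="localhost", user="crawler", db="crawl")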

~~~
qeorge
Wow - parselets.com is amazing. Thanks for sharing.

I think 80legs can do some level of parsing now, but Parsley seems like a
great way to describe the analysis for targeted crawls on predictable data. If
it supported the combination you've described, it would really open up some
cool possibilities.

------
fsniper
That reminds me of my April Fools' joke about MassiveClouding. I imagined a
company buying CPU hours from regular desktop computer users and selling them
to anyone interested in that kind of CPU power. Of course, any computer
software could be run on MassiveClouding without any changes. A nice bit of
imagination on my part :) I don't know about the technology behind 80legs, but
these 50,000 computers might be a botnet.

~~~
Evgeny
Isn't that like what the SETI project does? (Except that they don't buy the
CPU power; they ask people to volunteer it.)

<http://en.wikipedia.org/wiki/SETI#SETI.40home>

"Any individual can become involved with SETI research by downloading the
Berkeley Open Infrastructure for Network Computing (BOINC) software program,
attaching to the SETI@home project, and allowing the program to run as a
background process that uses idle computer power."

~~~
fsniper
Yes, it's something like that. But there is another basic difference: SETI@home
and other folding projects have a centralized infrastructure where the data is
collected and combined, and all the software that runs on the client side is
written for a specific mission, so every program has to be rewritten for these
platforms. My MassiveClouding dream involved some way of dissecting any
software into parts and running them on the clients. That's why it's a dream.

------
sachinag
You know, I'd try this, but the _processing_ is still on me (and I can't code,
so I'm SOL). That, frankly, is the hard part, is it not?

~~~
jennwilson
Hi, I work for 80legs and wanted to address this--we will be building an "App
Store" over the next couple of months that will allow you to use Apps written
by other developers to do some pretty cool stuff. We already have several of
these 80Apps (as we call them) available, many of which have been written by
the semantic search engine company swingly.com. You can view descriptions for
them on the "Create Job" page if you sign up to use our Portal.

------
andreyf
I remember they had an explanation about this, but I don't remember what it
was, and it seems to have disappeared: what are these 50,000 computers they're
using?

~~~
e1ven
Plura has a Java applet which you can embed into a webpage that gets viewed
for a long duration, such as a game or streaming video. Affiliates embed an
iframe which loads the applet as the user plays their game. The user's CPU
usage goes up a bit, and they help generate revenue for the game makers.

~~~
lsb
They're targeting desktop apps. The Java app downloads the pages, so it needs
high permissions, so you get the default Java unfriendly popup asking you for
confirmation.

~~~
e1ven
From <http://pluraprocessing.com/developer/index.html>

    
    
      Plura supports desktop and web-based games.
    
      If the game is hosted on a website, like a Flash game, the 
      developer only needs to include 1 line of iframe code. We 
      will soon be releasing a Javascript API for dynamically 
      controlling how this iframe is loaded, giving the developer 
      control over starting, stopping, and controlling CPU usage 
      in Plura. The iframe loads a Java web applet, which runs 
      completely in memory. This applet is forcibly restricted 
      from accessing the user's computer by the sandbox model 
      provided by Sun.
    
    

So far as I understand, they have multiple models. One affiliate model is
"Plura for Java Applets", whereas another is "Signed Java Applets".

I imagine there may be fewer options for unsigned applets, leaving the
developer with less potential revenue every month, whereas desktop application
developers and signed Java applets have the full set of providers to work
with.

That said, I agree the Java dialog is ugly and scary ;( If I were an
affiliate, I'd want to avoid it. It breaks the user experience of your site
with something gaudy and jarring, not to mention unbranded and unrelated to
the information the user is after.

------
Flemlord
I wonder how they arrived at their $2 per million pricing model. Why not $10
per million? Either one sounds ridiculously inexpensive.

~~~
charlesju
They're probably using EC2s and then it's a function of how many pages can be
scrapped per EC2 compute hour and then taking a premium on top of that.

The reason why they didn't want it to be too expensive (ie. 5 times as
expensive) is (1) competitors can easily equilibrate and steal market share if
their idea works b/c of the economic inefficiencies in their pricing model and
(2) this game plan is more a game of dependence rather than up front profits,
so it makes sense to take very little profit up front to get user traction.

~~~
jdrock
We're not using EC2. AWS doesn't and can't scale to our size, actually :)

Side note: AWS guys have asked if we use them. We said "No, we'd be losing
money then and wouldn't be able to scale to our size with you."

------
mwexler
One limitation is the throttling for each domain. If I had a smaller set of
pages that I wanted to read frequently, this solution would not work. I
understand the need to avoid DoS-ing sites, but in some cases it would be nice
to be able to read a million pages from 3 large domains instead of 1 page each
from a million domains.
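
(By throttling I mean per-domain rate limiting of roughly this shape; the
delay figure below is just an example, not 80legs' actual setting:)

    import time
    from urllib.parse import urlsplit

    last_fetch = {}  # host -> time of the most recent request

    def wait_for_turn(url, delay=5.0):
        # Sleep until at least `delay` seconds have passed since the last
        # request to this host, then record the new fetch time.
        host = urlsplit(url).netloc
        pause = last_fetch.get(host, 0) + delay - time.time()
        if pause > 0:
            time.sleep(pause)
        last_fetch[host] = time.time()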

~~~
bliving
That and...

"Your parseLinks() and processDocument() methods must complete within a total
10 seconds per document processed"

... as a limit on processing leaves room for competition. One advantage of the
BYO-Cloud solution is that you can pay for more intensive processing of the
crawl.
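
On your own infrastructure the budget is whatever you set. A rough sketch of
enforcing a per-document limit yourself (the processing callable is a
stand-in, and the 10 seconds just mirrors the quoted limit):

    import signal

    class Timeout(Exception):
        pass

    def process_with_budget(process, document, seconds=10):
        # Enforce a wall-clock budget per document (Unix-only: uses SIGALRM).
        def on_alarm(signum, frame):
            raise Timeout()
        old = signal.signal(signal.SIGALRM, on_alarm)
        signal.alarm(seconds)
        try:
            return process(document)
        except Timeout:
            return None  # give up on documents that blow the budget
        finally:
            signal.alarm(0)                      # clear any pending alarm
            signal.signal(signal.SIGALRM, old)   # restore the old handler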

------
keefe
They're inflating the cost difference. Cloud: $0.10/CPU-hour for "large
scale" crawling? There's no reason to use a small, unreserved EC2 instance for
large-scale anything. I have a small reserved instance I use for git,
Bugzilla, and so forth; it's slow as hell. That's why there are bigger ones.

~~~
jdrock
The cost savings are still there with reserved instances if you do the math.

Actually, the cost is not the biggest issue with the cloud. If you're talking
about large-scale crawling, AWS will not adequately scale. You can't get
enough nodes or enough bandwidth.

~~~
keefe
My point was just to question the $4/million figure based on a $0.10/hr
instance. Why not run benchmarks against a reserved high-CPU instance for 100M
pages crawled? I was never contending that you can't offer cost savings over
custom crawling on EC2, just questioning the way the numbers on the page were
calculated. If you have enough traffic to keep your instances saturated, of
course it is more economical to buy dedicated servers, but nobody starts out
that way. Why in the world do you think you can't get enough nodes or
bandwidth to do web crawling on EC2? I remember Bezos talking about Animoto
scaling up to 3500 instances almost instantly for video transcoding, and I
find it extremely hard to believe that a task as parallelizable as web
crawling could not be done on EC2. It's one thing to say that you can do it
cheaper; it's another thing entirely to say that it can't be done on EC2 at
all. How did you come to this conclusion?

~~~
jdrock
The $4/million actually is not based on compute time, it's based on data
transfer in/out of AWS. If I included compute time, it would be higher.

And it's the bandwidth aspect that makes web crawling infeasible on AWS. Yes,
you can have a few thousand nodes, but they're all going through a handful of
external IPs, which will cause serious performance issues. The worst case is
that you'll get blocked entirely from the sites you're trying to crawl.

In other words, the bandwidth is not parallelizable on the cloud.

~~~
keefe
I'm really having trouble with this. My understanding is that EC2 provides an
internal and an external IP for each instance
(<http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-19/instance-addressing.html>),
as well as a semi-friendly DNS name. Each of these machines can certainly make
its own requests to arbitrary URLs. I don't see how this is any different from
a bunch of machines sitting in a data center with a shared, dedicated internet
connection. Also, from my rather limited experience of crawling sites, the
only time I drew negative attention was when I did not throttle my crawler. If
you are properly caching to avoid repetitious requests and throttling your
requests, how are you going to get blocked, and why would it be different on
EC2 versus a dedicated hosting center?

~~~
jdrock
So a few points (I assume you're talking about the Elastic IPs):

1. Yes, each instance can have its own IP, but by default, each account is
limited to 5 IP addresses.

2. You can increase your limit, but my guess is that it's difficult to do so.
You have to put forward a special request and have it approved.

You're right that blocking may not be a big issue, but crawling several
different domains quickly will be hard.

Just so you know, we haven't encountered anyone doing large-scale crawling
who considers AWS or the cloud in general to be a realistic option. The
biggest reason is still the cost: the outbound transfer rates just don't make
sense at scale.

~~~
keefe
Elastic IPs are about having the same functionality as static IPs; every
instance has an IP, per the link I posted earlier. Every time you connect a
new network device to any typical network, it gets an IP. I'm not sure how
that relates to scalability of the bandwidth.

You are limited to some number of instances (20? 50?), and yes, you have to
fill out a form to get more. The Animoto example above shows how far you can
go. I would wager that finding the funding for a large number of instances is
more of a problem than getting the approval.

I don't see why crawling several different domains quickly would be hard.
There shouldn't be any difference between a bunch of instances on EC2 and a
bunch of machines in a data center, from a technical point of view.

As far as the cost argument goes, of course I agree with you. If you can
project a high level of CPU/bandwidth usage for an extended period of time
then of course you should buy dedicated servers.

The only argument I was trying to make is that it is completely possible to
do crawling on EC2 or any other cloud provider from a technical point of view;
the only limitation is cost. The advantage of utility computing is that it
offers a cheap way to handle bursty traffic, which you may well run into if
your server utilization projections are off. I don't think you should use it
as your primary set of servers if you can project a large, steady volume of
traffic.

~~~
jdrock
You're right that the data center and the cloud will be very similar, but our
assertion at 80legs is that _both_ are very poor choices.

I'm not arguing that it's impossible to do crawling on the cloud. I'm saying
it's near-impossible to do it at large scale on the cloud. 3500 instances is
pretty good, but that's still an order of magnitude slower than what 80legs is
capable of.

Now, if you show me someone that has 10,000+ instances on the cloud, I may
agree with you!

------
ujjwalg
This seems to be a very neat idea. But what about processing of the crawled
data, for someone who wants to process the data and show it to the end user in
a comprehensive way? We are looking for a solution that is affordable and
provides both.

~~~
jhammerb
Stuff the results of the crawl into Hadoop, write some MapReduce jobs to
process the data, and then serve the structured data back up with HBase?
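
A toy sketch of the MapReduce step via Hadoop Streaming, counting crawled
pages per domain (the "url<TAB>raw_html" record layout of the crawl dump is an
assumption):

    # mapper.py -- assumes each input line is "url<TAB>raw_html"
    import sys
    from urllib.parse import urlsplit

    for line in sys.stdin:
        if not line.strip():
            continue
        url = line.split("\t", 1)[0]
        print("%s\t1" % urlsplit(url).netloc)

    # reducer.py -- Hadoop Streaming hands us the mapper output sorted by key
    import sys

    current, count = None, 0
    for line in sys.stdin:
        if not line.strip():
            continue
        domain, n = line.rstrip("\n").split("\t")
        if domain != current and current is not None:
            print("%s\t%d" % (current, count))
            count = 0
        current = domain
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

Submit with the stock streaming jar, e.g. "hadoop jar hadoop-streaming.jar
-input crawl/ -output counts/ -mapper mapper.py -reducer reducer.py -file
mapper.py -file reducer.py", and then load the structured output into HBase
for serving.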

------
tezza
..:: Loop Detection?? ::..

Are you supposed to implement your own cycle detection?

If not, how deep is the cycle detection that 80legs offers?

--

There are plenty of HoneyTrap[1] OSS projects which will quickly rack up lots
of $$$ if the 80legs spider gets blacklisted.

For those who don't know, these projects create deeply linked pages, and
sometimes create infinite cycles. They are trying to hinder spammers, but may
hinder 80legs too.

--

[1] <http://www.davidnaylor.co.uk/stopping-bad-robots-with-honeytraps.html>

~~~
jdrock
Because of the way 80legs handles crawls, users don't need to worry about loop
detection. There are really two issues here:

1. For each user crawl, we only allow the same URL once, so any simple loop
that involves repeated URLs is eliminated by this process, no matter how large
the loop is.

2. More sophisticated "spider traps" that work with different URLs and
domains can have a limited effect on your crawls, but because of our
per-domain rate throttling, the worst these traps can do is add a few cents
per day to someone's crawl.
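
(For the curious, point 1 is conceptually just a seen-set keyed on a
normalized URL. A toy sketch, not our actual implementation:)

    from urllib.parse import urlsplit, urlunsplit

    seen = set()

    def should_fetch(url):
        # Collapse trivially-different spellings of the same URL, then
        # allow each normalized URL to be fetched at most once per crawl.
        parts = urlsplit(url)
        normalized = urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                                 parts.path or "/", parts.query, ""))
        if normalized in seen:
            return False
        seen.add(normalized)
        return True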

------
yarapavan
MIT Technology Review also has an article on 80legs. Accessible at
<http://www.technologyreview.com/web/23528/>

------
teeja
Pricey!

I'd pimp for Open Google instead.

