
Data Mining 3.4 billion Web pages for $100 of EC2 - chrisacky
http://blog.luckyoyster.com/post/33592990831/data-mining-the-web-100-worth-of-priceless
======
chrisacky
I'm a heavy user of spot requests for my main application stack. I run my
startup's _core_ services on spot requests. These are processes that I need
running 100% of the time (memcached, NFS, Gearman, Varnish, nginx, etc.).

I'm sure you have all heard of the Chaos Monkey that Netflix runs? Well, I
didn't even need to code a chaos monkey... All you have to do is run
_everything_ on spot requests. Eventually you will lose servers at
unpredictable times because someone outbids you[1].

Typical spot pricing is a fraction of on demand, often 3/10ths of the price
or less. For instance, a cp1.medium (4 cores, 4 GB RAM) costs $0.044 per hour
as a spot request, versus $0.186 per hour on demand.

I bid $1 per hour for my spot requests across two zones in the same region. I
group my servers and use ELB (Elastic Load Balancers) to route requests...

Typically, a spot request might last for about a week before it gets killed
because the capacity isn't there. That's when the instances in my other zone
take 100% of the load temporarily. At this point, since I've lost an entire
zone's worth of servers, I have my Auto Scaling group fire up on-demand
instances until I can get some more spot requests fulfilled. Creating a setup
like this took about a week, but the savings are enormous.
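For anyone curious what that bidding setup looks like in code, here's a rough sketch with boto3. The AMI, instance type, and zones are placeholders, not the author's actual values, and the API calls are left commented out:

```python
# Sketch of bidding $1/hour for spot capacity across two zones.
# All IDs/types here are hypothetical placeholders.

def spot_request_params(bid_dollars, zone, ami="ami-xxxxxxxx",
                        instance_type="c1.medium", count=2):
    """Build the parameters for one boto3 request_spot_instances call."""
    return {
        "SpotPrice": f"{bid_dollars:.2f}",  # your ceiling, not what you pay
        "InstanceCount": count,
        "LaunchSpecification": {
            "ImageId": ami,
            "InstanceType": instance_type,
            "Placement": {"AvailabilityZone": zone},
        },
    }

# One request per zone, same $1 ceiling in each:
requests = [spot_request_params(1.00, z)
            for z in ("us-east-1a", "us-east-1b")]

# import boto3
# ec2 = boto3.client("ec2")
# for params in requests:
#     ec2.request_spot_instances(**params)
```

You pay the going spot price, not your bid; the bid only caps how high the price can go before your instances are reclaimed.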

\----

How is the data stored on this setup?

RDS(MySQL) handles all data that can be stored in a database.

Ephemeral storage is used for things that don't need to be persistent (e.g.
transaction logs).

Sessions are managed through Redis. If the Redis servers die, session
handling falls back to MySQL temporarily. (It's a lot slower, but the MySQL
server is on RDS, so it's always running.)
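That fallback is simple to express; here's a minimal sketch (the connection objects are hypothetical stand-ins, not the author's actual interfaces):

```python
def get_session(session_id, redis_conn, mysql_lookup):
    """Fetch a session, preferring Redis; fall back to MySQL (RDS)
    if the Redis node is unreachable."""
    try:
        data = redis_conn.get(session_id)
        if data is not None:
            return data
    except ConnectionError:
        pass  # the Redis box was reclaimed; RDS is still up
    return mysql_lookup(session_id)  # slower, but always available
```

(`redis-py`'s connection errors subclass the built-in `ConnectionError`, so catching the built-in covers that case.)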

Elastic Block Store volumes are automatically mounted to a single instance,
which is then set up as an NFS server so that other servers can read from a
particular mount point. (e.g. a user uploads an image and it's stored on the
NFS mount; a different server reads the file, generates the different
dimensions, uploads all of the files to Amazon S3, and then deletes the
original file on the NFS volume.)

The worst part about losing servers is when the memcached server dies, because
I can lose weeks' worth of cached data. When this happens, I have to boot
up several micro instances that take my "cache warming" list and basically
start repopulating memcached again.
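Splitting that work across several micro instances might look like this (a sketch; the fetch and cache interfaces are assumed, not taken from the post):

```python
def shard_warming_list(keys, n_workers):
    """Split the cache-warming list so each micro instance
    repopulates its own slice of the hot keys."""
    return [keys[i::n_workers] for i in range(n_workers)]

def warm_cache(keys, fetch, cache_set):
    """Recompute each hot key (a MySQL/RDS hit) and write it
    back into memcached."""
    for key in keys:
        cache_set(key, fetch(key))
```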

The entire system is designed to be redundant... I can kill every server and
then run the initialization script to bring the entire stack back up. (It's
basically lots of little cloud-init scripts[2].)

[1] <http://chrisacky.com/images/lulz.png>

[2] <https://help.ubuntu.com/community/CloudInit>

~~~
tszming
Thanks for your detailed explanation.

I haven't used spot instances before, and I'm curious how you handle
termination gracefully, i.e. when a spot instance gets terminated in the
middle of a transaction (uploading a large file, writing to the DB, etc.).

~~~
garindra
I would guess the same way you do when regular servers terminate/die for
whatever reason.
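Concretely, one way to soften spot reclaims (not described in the thread) is to watch the instance metadata service for a termination notice and stop taking new work when one appears. A sketch, assuming the `spot/termination-time` metadata endpoint is available:

```python
import time

# Instance metadata endpoint that returns a timestamp once the
# instance is marked for reclaim (an assumption in this sketch).
TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")

def drain_on_notice(fetch, stop_accepting_work, poll_seconds=5):
    """Poll for a termination notice; once it appears, stop accepting
    new jobs so in-flight work can finish or be requeued.
    `fetch(url)` should return None until a notice exists."""
    while True:
        notice = fetch(TERMINATION_URL)
        if notice:
            stop_accepting_work()
            return notice
        time.sleep(poll_seconds)
```

For anything the notice window can't cover (a half-uploaded file, a half-written row), you still need the usual answer: idempotent jobs and transactions, so a retry from another instance is safe.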

~~~
raphinou
The thing is, this event is usually uncommon, whereas in this architecture it
is quite common.

------
wmf
Bit of a problem with the headline: they didn't crawl anything because Common
Crawl already did that.

~~~
chime
Data Mining != Crawling. I don't see a problem with that.

~~~
Steko
Submitted Title used to say "Crawled"

~~~
sjg007
Is grepped a better choice? You can crawl in memory from a repository, or
"crawl" across the net.

------
sadga
AWS Spot Instances are incredible. They make you a liquidity provider, and
reward you for it.

Paying list price for any load that isn't mission-critical and needed
immediately is insane.

~~~
chaz
Also, if you're looking at a longer-term commitment but want the option to get
out of it, their Reserved Instances can now be bought and sold. Great if your
plans change and you want to recoup some of your costs.

<http://aws.amazon.com/ec2/reserved-instances/marketplace/>

------
baruch
If anyone is interested in spot instance pricing, there's an interesting paper
on the subject: <http://www.cs.technion.ac.il/~ladypine/spotprice-ieee.pdf>

In short, it's not a perfect supply-and-demand market, but it's interesting to
see the details they found.

------
cpenner461
Anyone have any experience/thoughts on using spot instances with Hadoop?
Specifically, regular instances with Hadoop installed (not via Elastic
MapReduce). The cost savings are potentially huge, but I'd hate to lose my
instances 80-90% of the way through a set of long-running (12-48h) jobs. I
guess if I had EBS-backed instances I could relaunch and resume, but I'm not
sure how well that'd work in practice.

~~~
xtrahotsauce
We bring up spot instances all the time with EMR as additional compute
capacity, and we're okay with these instances possibly going away because we
use HDFS very little. Instead, we store almost everything on S3.
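That split (durable data off-cluster, spot nodes for pure compute) is exactly what EMR's instance groups express: CORE nodes hold HDFS, TASK nodes only run tasks, so losing a spot TASK node costs compute, not data. A sketch of the `InstanceGroups` shape for boto3's `run_job_flow` (instance types and the bid are placeholders):

```python
def emr_instance_groups(core_count, spot_task_count, bid):
    """Build the 'InstanceGroups' list for an EMR cluster:
    CORE on demand (they hold HDFS), TASK on spot."""
    return [
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m1.large", "InstanceCount": core_count},
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": f"{bid:.2f}", "InstanceType": "m1.large",
         "InstanceCount": spot_task_count},
    ]

# emr = boto3.client("emr")
# emr.run_job_flow(Name="crawl-job", Instances={
#     "InstanceGroups": emr_instance_groups(2, 10, 0.05), ...})
```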

------
wanghq
"Master Data Collection Service. A very simple REST service accepts GET and
POST requests with key/value pairs. ... we then front end the service with
Passenger Fusion and Apache httpd. This service requires great attention, as
it’s the likeliest bottleneck in the whole architecture." It seems this could
be replaced by DynamoDB.
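A GET/POST key/value service does map naturally onto a DynamoDB table with a single hash key. A sketch with the low-level boto3 API (table and attribute names are made up here):

```python
def to_dynamo_item(key, value, table="crawl_results"):
    """Build a put_item request for one key/value pair, replacing
    one POST to the collection service."""
    return {"TableName": table,
            "Item": {"k": {"S": key}, "v": {"S": value}}}

# dynamodb = boto3.client("dynamodb")
# dynamodb.put_item(**to_dynamo_item("http://example.com/", "fetched"))
```

You'd trade the single-box bottleneck for provisioned write throughput, which scales by configuration instead of by babysitting a server.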

------
luckyoyster
Thanks for all the commentary. We're planning on presenting this work at
re:Invent with the folks from Common Crawl, and also releasing sample code to
GitHub. For those who haven't yet tried spot instances, or looked into the
Common Crawl data set, we highly recommend them!

------
amalag
What about using Elastic MapReduce with spot instances instead of a custom
job queue? Hadoop seems to handle this for us and supports the ARC format as
an InputFormat.

~~~
jeffbarr
You can do that with ease. Here's my blog post:

[http://aws.typepad.com/aws/2011/08/run-amazon-elastic-
mapred...](http://aws.typepad.com/aws/2011/08/run-amazon-elastic-mapreduce-on-
ec2-spot-instances.html)

~~~
amalag
Yes, I was wondering why the author implemented his own queueing mechanism
instead of just using Hadoop via EMR.

------
zerop
I want to use Common Crawl to periodically fetch crawled data for some sites.
How frequently does Common Crawl update its data set? Does it crawl all
sites?

