
Spinning Up a Spark Cluster on Spot Instances: Step by Step - ddrum001
http://insightdataengineering.com/blog/sparkdevops/
======
eranation
Very nice. I prefer using the built-in command-line EC2 scripts packaged with
all Spark versions.

[https://spark.apache.org/docs/latest/ec2-scripts.html](https://spark.apache.org/docs/latest/ec2-scripts.html)

You can even specify the spot instance price, instance type, number of nodes,
etc.

e.g.

./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
  --region=us-west-1 --zone=us-west-1a \
  --spot-price=0.2 --instance-type=m4.4xlarge \
  launch my-spark-cluster
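
You can pass the node count with --slaves, and the same script manages the
rest of the cluster lifecycle. A quick sketch (the slave count is just an
example; see ./spark-ec2 --help for the full option list):

    # launch with 3 worker nodes on spot instances
    ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
      --slaves=3 --spot-price=0.2 launch my-spark-cluster

    # log in to the master; tear everything down when finished
    ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem login my-spark-cluster
    ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem destroy my-spark-cluster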

~~~
ddrum001
Indeed, the script works quite well for Spark - but we also wanted to provide
a guide for those who would like to understand the config files more deeply,
especially those who want to tinker with them down the road (e.g. adding
nodes to the cluster; a sketch of that follows below). We also find it
helpful as preparation for setting up Spark on other platforms, and for
setting up other systems that don't have EC2 scripts yet.
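
In standalone mode, adding a node mostly comes down to pointing a new worker
at the master. A minimal sketch, assuming the same Spark build on every node
(hostnames and the install path are hypothetical):

    # on the new node, register it with the running master:
    /usr/local/spark/sbin/start-slave.sh spark://master-node:7077

    # or, from the master: list the new node in conf/slaves, then
    /usr/local/spark/sbin/start-slaves.sh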

------
boulos
If you'd rather not do all the setup yourself, we've recently announced
Dataproc, a fully-managed service that includes support for Preemptible VMs:
[https://cloud.google.com/dataproc/](https://cloud.google.com/dataproc/).
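
A quick sketch of launching one with preemptible workers (cluster name and
counts are hypothetical, and the flags are as of the current beta - check
gcloud beta dataproc clusters create --help):

    gcloud beta dataproc clusters create my-cluster \
      --num-workers 2 --num-preemptible-workers 2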

Disclaimer: I work on Compute Engine, specifically Preemptible VMs, but didn't
work on Dataproc (though I did add --preemptible to bdutil!)

~~~
stuartaxelowen
It only specifies pricing per CPU - how is that affected by memory per node?
Is that configurable?

~~~
boulos
I wish they had been clearer, sorry about that (I'll send them the
equivalent of a pull request): you pay for _Dataproc_ at a rate of
$0.01/"core"/hour regardless of which instance shape you use. However, you
still pay for the underlying compute and storage; the $0.01/core/hour is only
the "service" fee, so memory affects the VM price but not the Dataproc fee.

------
stuartaxelowen
There's a script for that, actually:

[http://spark.apache.org/docs/latest/ec2-scripts.html](http://spark.apache.org/docs/latest/ec2-scripts.html)

------
technofiend
Man, I wish I could use Spark at work, but it uses a Mavenized build that
requires open internet access for dependency downloads. It's not worth it
when I have to register every dependent jar by hand in my internal
repository.

~~~
angryasian
You can use an uberjar/fat jar and package all the dependencies as a single
jar.
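
A minimal sketch, assuming an sbt project with the sbt-assembly plugin
configured (project and class names are hypothetical):

    # on a machine with internet access, bundle the app plus dependencies:
    sbt assembly

    # ship the single jar inside the firewall and submit it as-is:
    spark-submit --class com.example.Main --master spark://master-node:7077 \
      myapp-assembly-1.0.jar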

------
angryasian
Why wouldn't you just use EMR with YARN and Spark 1.5?
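
A rough sketch with the AWS CLI (key name and sizes are hypothetical; at the
time of writing the emr-4.1.0 release label ships Spark 1.5, and BidPrice
puts the core nodes on spot):

    aws emr create-cluster --name spark-cluster --release-label emr-4.1.0 \
      --applications Name=Spark --ec2-attributes KeyName=awskey \
      --use-default-roles \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
        InstanceGroupType=CORE,InstanceType=m3.xlarge,InstanceCount=2,BidPrice=0.2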

