
Large-Scale Machine Learning with Spark on Amazon EMR
http://blogs.aws.amazon.com/bigdata/post/Tx21LOP0UQ2ZA9N/Large-Scale-Machine-Learning-with-Spark-on-Amazon-EMR
======
MoOmer
I do the same on Google Compute Engine, except without the auto-terminate and
scaling :(

However, Google's bdutil has a great set of shell scripts that automatically set
up the environment, and with minimal changes you can pin the exact Scala/Spark
versions you need.

The fact that I (just one dude) can set up a pipeline and chomp through TBs of
data on clusters with TBs of memory over the course of hours still keeps me in
awe of the advances of both GCE and AWS.

I'll have to give EMR/AWS a shot!

~~~
jeffreysmith
I have nothing against GCE. There's certainly innovation going on there. The
new Dataflow system includes some very exciting and powerful ideas:
[https://cloud.google.com/dataflow/](https://cloud.google.com/dataflow/)

The empowerment of these platforms is something that I'm very excited about.
Spark will let you go from fairly basic processing of small files on your
laptop to processing huge amounts of data very efficiently on a massive
cluster. And EMR makes all of that even easier.

This is something that I hope to convey in my upcoming book:
[http://www.reactivemachinelearning.com/](http://www.reactivemachinelearning.com/)
One of the ideas that I'm playing with is that big data and small data are
basically the same. You should assume that you have an infinite amount of
data, and then you'll build your system to handle whatever comes at it. Even
if you end up not having a ton of data, you won't be sorry you used awesome
tools like Spark.
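That "small data and big data are the same" point can be sketched in Spark 1.x Scala. This is a minimal, hypothetical example (the object name, arguments, and file paths are illustrative, not from the post): the job logic is identical whether it runs on a laptop or a cluster; only the master setting changes.

```scala
// Illustrative sketch: the same Spark job, laptop or cluster.
import org.apache.spark.{SparkConf, SparkContext}

object SameCodeAnySize {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("same-code-any-size")
      // "local[*]" for a laptop run; on EMR/YARN the master would
      // typically come from spark-submit instead of being set here.
      .setMaster(args.headOption.getOrElse("local[*]"))
    val sc = new SparkContext(conf)

    // The same transformations work on a small local file
    // or on terabytes sitting in HDFS/S3.
    val counts = sc.textFile(args.lift(1).getOrElse("data.txt"))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```

The point of the sketch is that nothing in the job itself encodes data size; capacity is a deployment decision, not a code change.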

------
jeffreysmith
Jeff here. Glad that people are interested in this post. Feel free to ping me
with any questions.

~~~
GiusCo
Hi Jeff, congrats on your work. One question of general interest for budding
big data scientists and engineers: do you think, from your position, that
Spark is going to replace Hadoop in the near future, or will they occupy
different niches in the market? Thanks.

~~~
MoOmer
I use Spark in conjunction with many Hadoop ecosystem mainstays: YARN, HDFS,
etc. Hadoop MapReduce can be swapped out for Spark, but many great things
beyond MapReduce have stemmed from the Hadoop project.

~~~
jeffreysmith
Yep, our Spark deployment, like many others, uses YARN and HDFS. EMR has done
solid work to make YARN a great deployment target for jobs built on various
technologies.
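A Spark-on-YARN app like the ones described above can be sketched as follows. This is a hypothetical Spark 1.x example (the class name and HDFS path are illustrative): note that no master is set in code, because under YARN it is supplied at submission time.

```scala
// Illustrative sketch of a Spark job reading from HDFS under YARN.
// Submitted with something like (Spark 1.x syntax):
//   spark-submit --master yarn-cluster --class ReadFromHdfs app.jar
import org.apache.spark.{SparkConf, SparkContext}

object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    // No setMaster here: spark-submit provides the YARN master,
    // and HDFS is addressed directly via an hdfs:// URI.
    val sc = new SparkContext(new SparkConf().setAppName("read-from-hdfs"))
    val lines = sc.textFile("hdfs:///user/hadoop/input/events.log") // illustrative path
    println(s"line count: ${lines.count()}")
    sc.stop()
  }
}
```

This is the modularity being praised: Spark swaps in as the compute engine while YARN handles scheduling and HDFS handles storage, unchanged.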

I'm very much not against the Hadoop ecosystem. The ecosystem represents very
real progress for data infrastructure. But Hadoop MapReduce is just not what
people should be using to build machine learning jobs at scale in 2015.

Spark makes great use of the Hadoop ecosystem, and I'm primarily interested in
future innovations in the big data space that try to work with the Hadoop
ecosystem instead of trying to supplant it. Modularity and composability
benefit us all.

