
Getting started with Spark on AWS - mwakanosya
https://blog.insightdatascience.com/spinning-up-a-spark-cluster-on-spot-instances-step-by-step-e8ed14ebb3b
======
banku_brougham
I chuckle whenever I see the acronym RDD; I can't shake the joke from a while
back that it stands for 'resume-driven development.'

With the largest EC2 instances having 1TB of RAM, many problems that used to
require distributed compute can be more efficiently run on one machine with a
variety of software: R, Python/Pandas, scikit, julia, etc.

When you've cleaned and filtered your data to meet the problem space, you may
well find you don't have big data -- and be relieved that you don't need to
hire distributed systems engineers to keep everything running.

There are still important use cases for distributed compute, however; for
example, small-world graphs.
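
The filter-first idea above can be sketched roughly as follows. This is only a minimal illustration, assuming the raw data is a CSV and that pandas is the tool of choice; the helper name `filter_in_chunks`, the column names, and the tiny in-memory sample are all made up for the example. It streams the file in chunks, keeps only the rows that matter, and then reports how much memory the result actually takes.

```python
import io

import pandas as pd


def filter_in_chunks(source, keep, usecols=None, chunksize=100_000):
    """Stream a CSV in chunks and keep only the rows selected by keep(chunk).

    Hypothetical helper: `keep` takes a DataFrame chunk and returns a
    boolean mask over its rows.
    """
    kept = []
    for chunk in pd.read_csv(source, usecols=usecols, chunksize=chunksize):
        kept.append(chunk[keep(chunk)])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for a large file (made-up data).
raw = io.StringIO(
    "user,country,amount\n"
    "a,US,10\n"
    "b,DE,5\n"
    "c,US,7\n"
    "d,FR,1\n"
)

df = filter_in_chunks(
    raw,
    keep=lambda c: c["country"] == "US",
    usecols=["user", "country", "amount"],
    chunksize=2,  # deliberately small so the sample spans several chunks
)

# Measure the filtered result before deciding whether a cluster is needed.
size_mib = df.memory_usage(deep=True).sum() / 2**20
print(f"{len(df)} rows, {size_mib:.6f} MiB in memory")
```

If the filtered frame comes out at a small fraction of the instance's RAM, a single large machine is probably enough; if it does not, that is a sign the distributed route may be worth its overhead.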

~~~
mrep
4TB nowadays:

| Model | vCPU | Mem (GiB) | SSD Storage (GB) | Dedicated EBS Bandwidth (Mbps) |
|---|---|---|---|---|
| x1e.32xlarge | 128 | 3,904 | 2 x 1,920 | 14,000 |

------
julsimon
Why not use Amazon EMR directly instead of installing Spark on EC2 instances?
[https://aws.amazon.com/emr/](https://aws.amazon.com/emr/)

------
falaki
Here is a much easier way to get started with Spark on AWS for free:
[https://community.cloud.databricks.com](https://community.cloud.databricks.com)

~~~
aynsof
This link requires a login.

