
Job-Scoped Hadoop Clusters with Google Cloud - vgt
https://cloud.google.com/blog/big-data/2017/06/fastest-track-to-apache-hadoop-and-spark-success-using-job-scoped-clusters-on-cloud-native-architecture
======
natekupp
We're also doing this at Thumbtack. We run all of our Spark jobs in job-scoped
Cloud Dataproc clusters. We wrote a custom Airflow operator which launches a
cluster, schedules a job on that cluster, and shuts down the cluster upon job
completion. Since Google can bring up a Spark cluster in under 90 seconds and
bills by the minute, this works really well for us, simplifying our
infrastructure and eliminating resource contention issues.
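
The shape of it, sketched here with the stock Dataproc operators rather than
our custom one (project, jar, and names are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.dataproc_operator import (
        DataprocClusterCreateOperator,
        DataprocClusterDeleteOperator,
        DataProcSparkOperator,
    )

    with DAG('job_scoped_spark', start_date=datetime(2017, 6, 1),
             schedule_interval='@daily') as dag:
        create = DataprocClusterCreateOperator(
            task_id='create_cluster',
            cluster_name='etl-{{ ds_nodash }}',  # one cluster per run
            project_id='my-project',             # hypothetical project
            num_workers=10,
            zone='us-central1-a',
        )
        run = DataProcSparkOperator(
            task_id='run_job',
            cluster_name='etl-{{ ds_nodash }}',
            main_class='com.example.MyJob',                 # hypothetical job
            dataproc_spark_jars=['gs://my-bucket/job.jar'],
        )
        delete = DataprocClusterDeleteOperator(
            task_id='delete_cluster',
            cluster_name='etl-{{ ds_nodash }}',
            project_id='my-project',
            trigger_rule='all_done',  # tear down even if the job fails
        )
        create >> run >> delete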

~~~
vgt
Co-Author of the blog here.

Awesome stuff, glad to see folks leveraging the possibilities! Perhaps as a
follow-up you could write a guest blog on how this works for you! Feel free to
ping me offline.

------
gfodor
Hah, we were doing this with EMR 6 years ago, I guess we were a little early
:)

https://www.youtube.com/watch?v=NF6zwHlbh_I

We built a coordinator that would spin up a specific category of machines for
each stage (some stages were MR jobs, some were Hadoop Streaming jobs) -- for
example, when doing in-memory work it was useful to have fewer nodes with more
RAM, etc.
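
Something like this, in today's terms (a rough sketch with boto3, which
postdates our setup; instance types and names are illustrative):

    import boto3

    # Each pipeline stage gets an instance profile suited to its workload,
    # e.g. fewer, RAM-heavy nodes for the in-memory stages.
    STAGE_FLEETS = {
        'parse_logs': {'InstanceType': 'c4.2xlarge', 'InstanceCount': 20},
        'join_inmem': {'InstanceType': 'r4.4xlarge', 'InstanceCount': 5},
    }

    def run_stage(stage, step):
        """Spin up a stage-scoped EMR cluster that terminates when done."""
        fleet = STAGE_FLEETS[stage]
        emr = boto3.client('emr')
        return emr.run_job_flow(
            Name='stage-' + stage,
            ReleaseLabel='emr-5.6.0',
            Instances={
                'MasterInstanceType': 'm4.large',
                'SlaveInstanceType': fleet['InstanceType'],
                'InstanceCount': fleet['InstanceCount'],
                'KeepJobFlowAliveWhenNoSteps': False,  # auto-terminate
            },
            Steps=[step],
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
        )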

~~~
matt_wulfeck
I haven't watched the video, but what platform did you spin the cluster up on?

I have a hard time seeing this make financial sense on any public cloud other
than Google, which bills by the minute.

~~~
mk89
EMR = AWS

------
matt_wulfeck
Interesting idea. I can see it being worthwhile with per-minute billing, which
AWS doesn't support. Also, EMR charges a per-instance premium; does anyone
know if Google Cloud does the same?

I'm curious how shuffle data is handled. Does the cluster intelligently scale
down and move the shuffle data, or will the entire thing keep running while
waiting for a single skewed reducer to finish? Or does the entire thing run on
a single instance?

~~~
vgt
There is a per-core surcharge of basically $0.01 per hour [1]. And sadly
Dataproc doesn't have a good autoscaling story.

On the flip side, you can use Preemptible VMs for Dataproc, which are 80% off,
and Dataflow has an amazing auto-scaler [2].

[1] https://cloud.google.com/dataproc/docs/resources/pricing

[2] https://cloud.google.com/blog/big-data/2017/01/understanding-cost-versus-speed-tradeoffs-in-google-cloud-dataflow-batch-pipelines
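
Back-of-envelope on what that pricing means for a short-lived cluster (all
prices illustrative and approximate, not official):

    # All prices approximate mid-2017 list prices, not a quote.
    VCPUS_PER_NODE = 4            # e.g. n1-standard-4
    ONDEMAND_PER_HR = 0.19        # ~list price for n1-standard-4
    PREEMPTIBLE_PER_HR = 0.04     # ~80% off on-demand
    SURCHARGE_PER_VCPU_HR = 0.01  # Dataproc per-vCPU premium

    def cluster_cost(nodes, minutes, preemptible_fraction=0.0):
        hours = minutes / 60.0
        compute = (nodes * (1 - preemptible_fraction) * ONDEMAND_PER_HR
                   + nodes * preemptible_fraction * PREEMPTIBLE_PER_HR)
        surcharge = nodes * VCPUS_PER_NODE * SURCHARGE_PER_VCPU_HR
        return (compute + surcharge) * hours

    # 10 workers, 15-minute job, 80% preemptible: ~$0.28 total,
    # thanks to per-minute billing.
    print(round(cluster_cost(10, 15, 0.8), 2))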

~~~
tawkspace
(Disclaimer: work on Dataproc) Though Dataproc doesn't natively have an
autoscaler, Spotify wrote one as part of Spydra:
https://github.com/spotify/spydra

~~~
Xeago
The autoscaler here doesn't solve the downscaling issues mentioned. We've seen
it work fine for scaling up, quickly. For now we're not scaling down: when we
get unlucky, the job finishes faster than the orphaned tasks on removed nodes
would take to time out.

------
rmnoon
What about the traditional IO / data locality win of having your processing
colocated with your DFS? Is GCS bandwidth that amazing?

~~~
vgt
Good question! You may already be familiar with the bandwidth and throughput
of Colossus, the storage system underlying GCS, through using BigQuery:
BigQuery storage and GCS both sit on top of Colossus. It's silly fast :)

Others can chime in more intelligently wrt Spark/Hadoop specifically, but I'll
point out that read latency from GCS would definitely be higher than local-
disk HDFS (especially Local SSD). Throughput, depending on your configuration,
could be much better with GCS. Spark/Hadoop don't take the same care to
optimize the storage-to-compute route as BigQuery does, as evidenced by some
bits of Hive performing serial FS operations.
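
To make the storage side concrete: on Dataproc the GCS connector is
preinstalled, so Spark reads gs:// paths directly with no HDFS staging step.
A tiny sketch (bucket and path are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('gcs-read').getOrCreate()

    # Reads stream straight from GCS; throughput scales with parallelism,
    # though per-request latency is higher than local disk.
    df = spark.read.json('gs://my-bucket/events/2017/06/*.json')
    print(df.count())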

So my answer is: it depends on the configuration of the job, the cluster, how
the data is written, the choice of disk, and so on.

That said, when talking about price-performance, flexibility, scalability, and
ease of operations, I suspect the "job-scoped clusters" setup would have a far
superior TCO. We should try and do the math one day :)

(co-author of blog, work at G)

------
cutler
Can anyone tell me how, as a sole developer, it's possible to gain real-world
experience with distributed Hadoop and Spark given the massive computing
resources required? It just seems like a closed shop to me.

~~~
teraflop
Define "massive". You can learn all the most important aspects of the APIs and
programming model on a single machine, either by running in "local" (non-
distributed) mode or by running a few VMs to simulate a real cluster. And you
can spin up a cluster of 100 machines on either AWS or GCE for about a dollar
per hour.
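
For instance, local mode is just a different master URL; the code is identical
to what you'd run on a 100-node cluster (sketch):

    from pyspark.sql import SparkSession

    # "local[*]" runs the whole Spark stack in one JVM on all local cores.
    spark = (SparkSession.builder
             .master('local[*]')
             .appName('learn-spark')
             .getOrCreate())

    rdd = spark.sparkContext.parallelize(range(1000))
    print(rdd.map(lambda x: x * x).sum())  # same code runs on a real cluster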

