
Ask HN: What are the alternatives for hosting Apache Spark? - muramira
I truly love what Databricks is doing, but their pricing model is unpredictable. Are there any other hosting companies that provide a fixed price?
======
SmirkingRevenge
If your Spark jobs are mostly batch workloads that can tolerate moderately
infrequent failures and restarts, try using Google Dataproc with preemptible
VMs or Amazon EMR with spot instances.

Depending on your use case, you might spend many times less than you would
using regular VMs. Many instances that are several dollars an hour on AWS can
be used for a fraction of the price.

It's also fairly easy to automate the region selection and bid (on AWS, that
is; not sure about gcloud).

If you need streaming, obviously this might not be the way to go.
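
A rough sketch of what automating the region selection and bid could look
like. The price table, the `pick_region_and_bid` helper, and the 20% headroom
are all hypothetical, not anything AWS prescribes; in practice the prices
could be fetched from EC2's spot price history API rather than hard-coded.

```python
from statistics import median

# Hypothetical recent spot prices (USD/hour) for one instance type, keyed by
# region. These numbers are made up for illustration.
SAMPLE_PRICES = {
    "us-east-1": [0.31, 0.29, 0.35, 0.30],
    "us-west-2": [0.27, 0.26, 0.40, 0.28],
    "eu-west-1": [0.33, 0.34, 0.32, 0.36],
}


def pick_region_and_bid(prices_by_region, on_demand_price, headroom=1.2):
    """Return (region, bid) for the region with the lowest median spot price.

    The bid is the median price times `headroom`, capped at the on-demand
    price so a spot instance never costs more than a regular one.
    """
    region = min(prices_by_region, key=lambda r: median(prices_by_region[r]))
    bid = min(median(prices_by_region[region]) * headroom, on_demand_price)
    return region, round(bid, 4)


if __name__ == "__main__":
    print(pick_region_and_bid(SAMPLE_PRICES, on_demand_price=1.00))
```

Using the median rather than the latest price is one possible choice to avoid
chasing short spikes; a real script would probably also look at availability
zones and interruption rates.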

~~~
virgilp
I find the EMR markup to be substantial; if I weren't working in a
corporation, I would stand up my own Spark clusters, e.g. using spark-ec2.

~~~
SmirkingRevenge
It is, but you can run the bare minimum number of core nodes (3 I think?) and
use spot instances for any others.

At a previous job, we just built our own EC2 image that ran Spark in
standalone mode for ephemeral Spark clusters, and it was wonderful and cheap.
And the clusters launched _very_ fast compared to EMR.
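
For the EMR variant (core nodes on-demand, everything else on spot), here is
a minimal sketch of the instance group layout. It only builds the dictionary
in the shape EMR's RunJobFlow API expects for `InstanceGroups`; the helper
name, group names, instance type, and counts are assumptions, and the right
minimum number of core nodes is not confirmed here.

```python
def build_instance_groups(core_count, task_count, instance_type, bid_price):
    """Build an EMR-style InstanceGroups list.

    Master and core nodes are on-demand so the cluster survives spot
    interruptions; task nodes (pure compute) are requested on the spot market.
    """
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                "Name": "task-spot",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                # The API takes the bid as a string in USD per hour.
                "BidPrice": f"{bid_price:.2f}",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


if __name__ == "__main__":
    # Illustrative values only.
    for group in build_instance_groups(2, 10, "m5.xlarge", 0.08):
        print(group)
```

The resulting list would be passed as `Instances={"InstanceGroups": ...}` to
something like boto3's `run_job_flow`, along with the rest of the cluster
configuration, which is omitted here.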

------
perlin
Rewrite all of your jobs using Apache Beam. Then use whatever runner you want:
Spark, Flink, Google Cloud Dataflow, etc.

------
sandGorgon
Google Dataproc - very good, and very soon they will release Kubernetes as the
manager instead of YARN.

~~~
patches11
"very soon they will release Kubernetes as the manager instead of YARN"

Do you have more details on this?

~~~
sandGorgon
[https://cloudplatform.googleblog.com/2018/03/learn-to-run-Ap...](https://cloudplatform.googleblog.com/2018/03/learn-to-run-Apache-Spark-natively-on-Google-Kubernetes-Engine-with-this-tutorial.html?m=1)

------
Zaheer
Check out AWS Glue:
[https://aws.amazon.com/glue/](https://aws.amazon.com/glue/)

Disclosure: I work on this service

~~~
hkchad
So I've been knee-deep in Glue for the past month. Great service, but we
really need more examples and an easier dev flow. I wasn't able to get results
from Jupyter and the dev endpoint that were consistent with what I got when
Glue ran the script directly. So I wasted a lot of time waiting for Glue to
run my job, only for it to eventually fail with a less-than-helpful error
message. Now that I have my jobs running and orchestrated with Step Functions
it's a thing of beauty, but it should have taken 1/10 of the time if I'd had
good examples and a proper environment.

------
tejasmanohar
All 3 major cloud providers have offerings in this space. Amazon [0], Google
[1], Microsoft [2].

[0]: [https://aws.amazon.com/emr/](https://aws.amazon.com/emr/)

[1]: [https://cloud.google.com/dataproc/](https://cloud.google.com/dataproc/)

[2]: [https://azure.microsoft.com/en-us/services/databricks/](https://azure.microsoft.com/en-us/services/databricks/)

------
Sevii
You could give AWS EMR a shot. It probably doesn't offer as much as
Databricks, but it should have consistent pricing.

------
antoncohen
Run Spark on a managed Kubernetes like GKE? There is experimental support for
using Kubernetes as the cluster manager.

[https://apache-spark-on-k8s.github.io/userdocs/index.html](https://apache-spark-on-k8s.github.io/userdocs/index.html)

------
hiyer
You can try Qubole [0]. The pricing is a small percentage of what you pay to
the cloud provider, so it's predictable to an extent.

[0]: [https://www.qubole.com/](https://www.qubole.com/)

Disclosure: I work here.

------
tspann
[https://hortonworks.com/products/data-platforms/cloud/aws/](https://hortonworks.com/products/data-platforms/cloud/aws/)

------
scarecrowx
We're using Spark on EMR with Data Pipeline to do ETL and to run scheduled
jobs. Data Pipeline terminates the cluster once the ETL or job completes,
which helps us save a lot on cost.

------
shelzzzzz
What part of it is unpredictable? I guess if you know how many VMs or EC2
instances you're planning on using, it's the same pricing model as Dataproc or
EMR.
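
To make that concrete, here is a back-of-the-envelope estimate under a
simplified model where each node-hour costs the VM price plus a per-node
platform fee. The function name and all rates are hypothetical; real
providers may bill in different units, so treat this as a sketch only.

```python
def estimate_monthly_cost(nodes, hours_per_day, vm_price_per_hour,
                          platform_fee_per_hour, days=30):
    """Estimate monthly cost in USD for a cluster of identical nodes.

    Assumes every node runs `hours_per_day` hours for `days` days and that
    the platform charges a flat fee per node-hour on top of the VM price.
    """
    node_hours = nodes * hours_per_day * days
    return node_hours * (vm_price_per_hour + platform_fee_per_hour)


if __name__ == "__main__":
    # Made-up rates: 10 nodes, 8 hours a day, $0.50 VM + $0.10 platform fee.
    print(f"${estimate_monthly_cost(10, 8, 0.50, 0.10):,.2f}")
```

If the node count and running hours are known up front, the total is fixed;
the unpredictability tends to come from autoscaling or jobs that run longer
than planned.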

------
curiousDog
Check out Azure Databricks

