
Launch HN: Data Mechanics (YC S19) – The Simplest Way to Run Apache Spark - jstephan
Hi HN,

We're JY & Julien, co-founders of Data Mechanics (https://www.datamechanics.co), a big data platform striving to offer the simplest way to run Apache Spark.

Apache Spark is an open-source distributed computing engine. It's the most widely used technology in big data. First, because it's fast (10-100x faster than Hadoop MapReduce). Second, because it offers simple, high-level APIs in Scala, Python, SQL, and R. In a few lines of code, data scientists and engineers can explore data, train machine learning models, and build batch or streaming pipelines over very large datasets (ranging from tens of GBs to PBs).

While writing Spark applications is pretty easy, managing their infrastructure, deploying them, and keeping them performant and stable in production over time is hard. You need to learn how Apache Spark works under the hood, become an expert with YARN and the JVM, manually choose dozens of infrastructure parameters and Spark configurations, and go through painfully slow iteration cycles to develop, debug, and productionize your app.

As you can tell, before starting Data Mechanics, we were frustrated Spark developers. Julien was a data scientist and data engineer at BlaBlaCar and ContentSquare. JY was the Spark infrastructure team lead at Databricks, the data science platform founded by the creators of Spark. We've designed Data Mechanics so that our peer data scientists and engineers can focus on their core mission - building models and pipelines - while the platform handles the mechanical DevOps work.

To realize this goal, we needed a way to tune infrastructure parameters and Spark configurations automatically. There are dozens of such parameters, but the most critical ones are the amount of memory and CPU allocated to each node, the degree of parallelism of Spark, and the way Spark handles all-to-all data transfer stages (called shuffles).

It takes a lot of expertise and trial-and-error to tune those parameters manually. To do it automatically, we first run the logs and metadata produced by Spark through a set of heuristics that determine whether the application is stable and performant. A Bayesian optimization algorithm then uses this analysis, as well as data from recent runs, to choose the parameters for the next run. It's not perfect - it needs a few iterations, like an engineer would. But the impact is huge, because this happens automatically for every application running on the platform (which would be too time-consuming for an engineer). Take the example of an application gradually going unstable as its input data grows over time. Without us, the application crashes on a random day, and an engineer must spend a day remediating the impact of the outage and debugging the app. Our platform can often anticipate and avoid the outage altogether.

The other way we differentiate is by integrating with the popular tools from the data stack. Enterprise data science platforms tend to require their users to abandon their tools and adopt an end-to-end suite of proprietary solutions: their hosted notebooks, their scheduler, their way of packaging dependencies and version-controlling code. Instead, our users can connect their Jupyter notebook, their Airflow scheduler, and their favourite IDE directly to the platform. This enables a seamless transition from local development to running at scale on the platform.

We also deploy Spark directly on Kubernetes, which wasn't possible until recently (Spark 2.3) - most Spark platforms run on YARN instead. This means our users can package their code dependencies in a Docker image and use many k8s-compatible projects for free (for example, around secrets management and monitoring). Kubernetes does have its inherent complexity. We hide it from our users by deploying Data Mechanics in their cloud account, on a Kubernetes cluster that we manage for them. Our users simply interact with our web UI and our API/CLI - they don't need to poke around Kubernetes unless they really want to.

The platform is available on AWS, GCP, and Azure. Many of our customers use us for their ETL pipelines; they appreciate the ease of use of the platform and the performance boost from automated tuning. We've also helped companies start their first Spark project: a startup is using us to parallelize chemistry computations and accelerate the discovery of drugs. This is our ultimate goal - to make distributed data processing accessible to all.

Of course, we share this mission with many companies out there, but we hope you'll find our angle interesting! We're excited to share our story with the HN community today, and we look forward to hearing about your experience in the data engineering and data science spaces. Have you used Spark, and did you feel the frustrations we talked about? If you're considering Spark for your next project, does our platform look appealing? We don't offer self-service deployment yet, but you can schedule a demo with us from the website and we'll be happy to give you free trial access in exchange for your feedback.

Thank you!
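For readers who haven't touched these knobs: the parameters described above map to ordinary Spark properties passed at submission time. As a rough sketch, here is what an automatically chosen configuration might look like when rendered as a spark-submit command. The property keys are real Spark settings; the values and the jar name are hypothetical, not recommendations.

```python
# Illustrative subset of the configurations a tuning system would adjust.
# Real Spark property keys; hypothetical values.
TUNED_CONFS = {
    "spark.executor.memory": "4g",          # memory allocated to each executor
    "spark.executor.cores": "2",            # CPU cores per executor
    "spark.sql.shuffle.partitions": "200",  # parallelism of shuffle stages
    "spark.shuffle.file.buffer": "1m",      # buffering of shuffle writes
}

def render_spark_submit(app_jar, confs):
    """Render a spark-submit command line from a dict of configurations."""
    conf_flags = " ".join(f"--conf {k}={v}" for k, v in sorted(confs.items()))
    return f"spark-submit {conf_flags} {app_jar}"

print(render_spark_submit("my-app.jar", TUNED_CONFS))
```

The point of automated tuning is that these values get revised after each run based on the observed logs and metrics, rather than being hand-edited by an engineer.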
======
flowerlad
Running Spark on a Kubernetes cluster is already pretty easy, so it is unclear
what value this is adding. Controlling cost is the hard part. You may only
need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes
clusters are not easy to provision and de-provision, so you end up paying for
a cluster 24 hours a day and using it for only 1 hour. If someone came up
with a way to pay for pre-provisioned Kubernetes clusters only for the
duration you use them, that would be interesting.

~~~
quadrature
The problem being solved here is resource tuning, which is a problem you will
eventually encounter as your data org grows. Specifically, in our case, the
authors of our Spark jobs understand the data modelling well but might not
know how to tweak the Spark parameters to optimize execution. As mentioned in
the post, even if you do know what you're doing, the process is long and
time-consuming. So I definitely see the value add here.

If you need ephemeral Spark clusters, Dataproc on GCP will give you that;
there are probably similar services on AWS and Azure.

~~~
waffletower
AWS EMR is a fairly straightforward and reasonably cost-effective way to
manage ephemeral Spark clusters on Amazon Web Services.

------
ojnabieoot
Speaking as someone who might be in your target audience: my experience with
Databricks (back in 2017/2018, without Kubernetes) is that their product is
just as unreliable and frustrating as deploying a Spark cluster manually, but
also more expensive and more time-consuming. It was so bad that I was
wondering if the entire company was a scam - which isn't true, of course. I
suspect a big part of our problem was a shuffle-heavy workload hitting a
relatively new product. But it left a really bad taste in my mouth about the
entire business model of "Spark as a Service."

My impulse reaction to your sales pitch is "their product probably doesn't
work very well and is way too expensive." I know that's unfair, but this
entire idea of "our platform automates away the tedium of Spark clusters" just
strikes me as a bag of magic beans.

What would help a lot with drawing cynical, bitter people like me: _case
studies_ on your website. I know that's a lot to ask for a young startup. But
actual details about either money or developer time saved with Data Mechanics
- specific pains your customers were having and how Data Mechanics addressed
them, or specific analyses your customers were able to do now that they're
spending less time managing Spark. Running a big Spark job in the cloud is a
huge financial risk, and many Spark users are much more concerned about this
than the headaches involved with management - and again, my last experience
with Databricks resulted in more cost and more headaches. I do not think I am
alone here.

I am wondering if you're considering selling your Spark telemetry / parameter
tuning software, or offering it as a service. Speaking personally, I would be
much more open to using Data Mechanics' tools on my own Spark cluster rather
than outsourcing the actual management. At my organization, in addition to
AWS, we also have a local Hadoop cluster with Spark installed; commercial
software that gives better insight into its performance could be very useful.

~~~
glapark
Shuffling in Spark works well for small datasets, but is not reliable for
large datasets because fault tolerance in Spark is incomplete. For example,
check this Jira:

[https://issues.apache.org/jira/browse/SPARK-20178](https://issues.apache.org/jira/browse/SPARK-20178)

So, if your problem was mainly due to a shuffle-heavy workload, then I guess
no managed Spark service would be able to alleviate or eliminate it through
automatic parameter tuning. In other words, your pain might be due to a
fundamental problem in Spark itself.

IMO, Spark is great, but its speed is no longer its key strength. For
example, Hive is much faster than SparkSQL these days.

~~~
jamesblonde
It's worse than that. Shuffle for Spark on Kubernetes is fundamentally broken
and hasn't yet been fixed. The problem is that Docker containers cannot (for
security reasons) share the same host-level disks. There is no external
shuffle service, and disk caching is container-local (not using kernel-level
disk I/O buffering), which kills performance. Google's proposed solution below
is to use NFS to store shuffle files, which is not going to be performant.
Stick with YARN for Spark and only switch when shuffle is fixed for k8s.
Databricks are in no rush to get shuffle fixed for k8s.

References:
[https://youtu.be/GbpMOaSlMJ4?t=1617](https://youtu.be/GbpMOaSlMJ4?t=1617)
[https://t.co/KWDNHjudfY?amp=1](https://t.co/KWDNHjudfY?amp=1)
[https://issues.apache.org/jira/browse/SPARK-25299](https://issues.apache.org/jira/browse/SPARK-25299)

~~~
epdlxjmonad
I agree that Spark on Kubernetes will have a hard time fixing the problem of
shuffling. If they choose to use local disks for a per-node shuffle service,
a performance issue arises because disk caching is container-local. If they
choose to use NFS to store shuffle files, a different kind of performance
issue arises because intermediate files are no longer stored on local disks.
And all of these issues arise on top of fault tolerance not being properly
implemented in Spark.

We are currently trying to fix the first problem in a different context (not
Spark), where worker containers store intermediate shuffle files on local
disks mounted as hostPath volumes. The performance penalty is about 50%
compared with running everything natively. Besides, some containers
occasionally get stuck for a long time. I believe the Spark community will
encounter the same problem if they choose to use local disks for storing
intermediate files.

~~~
jstephan
Glad our post sparked some pretty deep discussions on the future of
Spark-on-k8s! The open-source community is working on several projects to
address this problem. You've mentioned NFS (by Google), but there's also the
possibility of using object storage: mappers would first write to local
disks, and the shuffle data would then be moved asynchronously to the cloud.

Sources:

- the end of this presentation: https://www.slideshare.net/databricks/reliable-performance-at-scale-with-apache-spark-on-kubernetes

- https://issues.apache.org/jira/browse/SPARK-25299

------
izyda
What do you see as your key differentiator from Databricks? what's the key
pain point they weren't/couldn't solve that you are?

~~~
jstephan
(Former Databricks software engineer speaking) The pain point they didn’t
solve (well enough) is Spark cluster management and configuration. From our
experience and user interviews, it’s the critical pain point that still slows
down Spark adoption. Through our automated tuning feature, we’re going further
than them to provide a _serverless experience_ to our users.

This being said, Databricks is a great end-to-end data science platform, with
notable features we lack like collaborative hosted notebooks. A lot of people
don’t want/need the full proprietary feature set of Databricks though. They
choose to build on EMR, Dataproc, and other platforms instead. We hope they’ll
try Data Mechanics now :)

~~~
__vb__
Databricks has other optimizations on top of the open-source Spark version.
Are you maintaining your own version of Spark or using the vanilla version?

One thing I constantly deal with is how to optimize Spark: how to use Ganglia
and the Spark UI to dig into what is causing data skew and slowness while
running jobs. Is this something that you do better than Databricks?

~~~
jstephan
Spark versions: only vanilla (open-source) Spark. But we offer a list of
pre-packaged Docker images with useful libraries (e.g. for ML or for
efficient data access) for each major Spark version. You can use them
directly or build your own Docker image on top of them.

Optimization/monitoring: this topic is very important to us, thanks for
bringing it up. Indeed we automatically tune configurations, but developers
still need to understand the performance of their app to write better code.
We're working on a Spark UI + Ganglia improvement (well, a replacement
really), which we could potentially open source.

Would you mind emailing me (jy@datamechanics.co) or even scheduling a call
with me (https://calendly.com/b/datamechanics/avk7bhxq) so I can show you
what we have in mind and get your feedback? Anyone else interested is welcome
to do the same.

------
apankrat
Only tangentially related -

Data Mechanics was one of the contenders for our company name too! It was one
of my favourite options, in fact. It sounds nice, can be read in two ways,
and works well when shortened - DataMech. But getting datamech.com proved to
be impossible, so we settled on something else. Just my 2c.

~~~
jstephan
"There are two hard things in computer science: cache invalidation, naming
things, and off-by-one errors."

Good luck with your venture :)

------
soumyadeb
>Many of our customers use us for their ETL pipelines; they appreciate the
ease of use of the platform and the performance boost from automated tuning.

This is quite interesting. Founder of RudderStack here (we are a CDI, or
simply put, an open-source Segment equivalent). I have seen a similar pain
point across some of our customers. They use RudderStack to get data into S3
(or equivalent) and then run some kind of post-processing Spark jobs for
analytics/machine-learning use cases. Managing two setups (RudderStack on
Kubernetes + Spark) is a pain.

A single managed solution with Spark on Kubernetes makes so much sense. Would
love to figure out how to integrate with you guys.

~~~
jstephan
Congrats on RudderStack, what you're saying makes a lot of sense. Reaching
out to you directly to follow up on a potential integration!

~~~
soumyadeb
Thanks a lot. Will follow up with you.

------
knes
Awesome! Making Spark more approachable is good news for the wave of new data
engineers.

Do you have any recorded demo you can share where we can see how a user would
set up and integrate with the other tools? That would be neat.

~~~
jstephan
Thanks for the feedback! We're preparing a demo for the upcoming Spark Summit
next month... Stay tuned :)

In the meantime you can book a time with one of our data engineers through the
website to get a live demo:
[https://www.datamechanics.co](https://www.datamechanics.co)

------
apoverton
I've thought about solving this problem with an ML approach like you all are
taking but as you say never had the bandwidth because I was focusing on my
"core missions". I'm no longer a heavy spark user but am very happy to see you
all working on this!

It always seemed so inefficient to me to spend all this time hand tuning jobs
only to have the data change and need to do the same thing again.

Good luck!

~~~
jstephan
Thanks for the wishes! Indeed, it's rarely worth it to build an automated
tuning tool:

- unless you operate at a massive scale (e.g. the Dr. Elephant + TuneIn
projects, originally developed at LinkedIn),

- or you operate a big data platform yourself.

If you're curious about our ML approach, we gave a tech talk about it at last
year's Spark Summit: https://databricks.com/session_eu19/how-to-automate-performance-tuning-for-apache-spark

------
missosoup
Spark is sort of dead though. Dask looks to be the way of the future, in part
because it _doesn't_ take a zillion parameters to tune or consume a bucket of
resources just for overheads. Good luck.

~~~
cozos
Genuinely curious, how do you figure that Spark is "sort of dead"?

~~~
missosoup
I've been in the industry for 10+ years. I've worked with everything from
telco metrics firehoses to bank customer event streams to deep learning.

The venn intersection of conditions where spark makes sense is really rather
narrow. A single high spec instance running leaner tooling will generally meet
one's requirements while blowing spark out of the water in terms of perf and
cost.

Operationally, spark is a huge PITA, hence databricks and a host of other
offerings, I guess including this one, to try to manage the pain. Meanwhile
something like dask-kubernetes will cater to the same use case with
significantly lower operational complexity and again much higher perf and cost
efficiency.

I can't really think of a scenario where I'd choose to use spark on a
greenfield project today.

------
sg47
I saw that dynamic allocation is enabled by default. I thought dynamic
allocation does not work well on k8s if the executors need to be kept around
for serving shuffle files. How does it work in your case?

~~~
jstephan
Thanks, great question!

Dynamic allocation is only enabled on our Spark 3.0 image (from the
3.0-preview branch, since the official 3.0 isn't released yet). It works by
tracking which executors are storing active shuffle files. These executors
will not be removed when downscaling. More info here:
[https://issues.apache.org/jira/browse/SPARK-27963](https://issues.apache.org/jira/browse/SPARK-27963)

It's not perfect, but there are more improvements for dynamic allocation being
worked on (remote shuffle service for Kubernetes).
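To make the mechanism concrete, here is a toy sketch of the behaviour described in SPARK-27963. The four configuration keys are real Spark 3.0 properties; the values, executor names, and the helper function are hypothetical simplifications of what shuffle tracking actually does.

```python
# Real Spark 3.0 property keys behind the behaviour described above;
# the values are hypothetical.
DYN_ALLOC_CONFS = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
}

def removable_executors(idle_executors, executors_with_shuffle_files):
    """Toy model of shuffle tracking: an idle executor may only be
    reclaimed when downscaling if it holds no active shuffle files."""
    return sorted(set(idle_executors) - set(executors_with_shuffle_files))

# Executors 2 and 3 are idle, but 3 still serves shuffle data,
# so only executor 2 is eligible for removal.
print(removable_executors(["exec-2", "exec-3"], ["exec-3"]))
```

The trade-off, as noted, is that executors holding shuffle files stay allocated (and billed) until their shuffle data is no longer needed, which is why a remote shuffle service would improve things further.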

------
perlin
Can confirm that running Spark at scale is difficult. Not even necessarily
talking about scale of data or scale of performance, but organizational scale.
Getting dozens or hundreds of engineers aligned around best practices, tooling,
and local development for Spark is both challenging and extremely rewarding.
When everyone buys into Spark as not just an execution environment but a
programming paradigm, it really unlocks some cool potential. If anyone cares,
this is how I've found it best to get Spark users riding on rails:

* Use a monorepo to "namespace" different projects/teams/whatever. Each namespace has its own build.sbt for Scala jobs and its own Conda/pip requirements file for PySpark. This gives you package isolation so that different projects can bump requirements at their own pace, which is crucial in larger organizations where you might have more siloed development or more legacy applications.

* Build each project in the monorepo into a separate Docker image and tag it accordingly with some combination of the branch and namespace.

* Deploy applications onto Kubernetes by invoking the SparkOperator ([https://github.com/GoogleCloudPlatform/spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)). This abstracts away a lot of the hassle of driver/executor configuration and gives you nice out-of-the-box functionality for scraping Spark metrics.

* For local development, use some type of CLI or Makefile to build/run the image locally. This is where the implementation diverges somewhat from using SparkOperator (unless you want to tell your employees that everyone needs to run Kubernetes on their local machine, which we thought would create too much friction).

* For orchestration, write a custom operator for Airflow that submits a SparkOperator resource to the Kubernetes cluster of your choosing. The operator should supervise the application state, since the SparkOperator doesn’t quite do that well enough for you. This is something I wish we had the opportunity to open source.

* Where it gets tricky is building Spark applications locally and running them remotely. Say you built a job locally and tested it on a small subset of your data. Now you want to see what happens when you run across the full dataset, requiring more than 16 GB of memory (or whatever the developer has on their laptop). You need some way to build your image locally but schedule it remotely. This could be done via the same CLI or Makefile, but you end up with a lot of images and it gets pretty costly. I'm sure we would have figured it out eventually if we hadn't all been laid off last month :P

* BONUS: Use Iceberg or Delta ([https://iceberg.apache.org/](https://iceberg.apache.org/)) ([https://delta.io/](https://delta.io/)). These are table formats that sit on top of distributed file storage like HDFS or S3 and let you partition and query data using the Spark DataFrame API. You get time travel, schema evolution, and a bunch of other sweet features out of the box. They are an evolution of Hadoop-era partitioned file formats and are an absolute must for organizations dealing with lots of data & ML infrastructure.
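To make the SparkOperator step above concrete: an Airflow operator like the one described would submit a SparkApplication custom resource per job. Here is a hedged sketch of a builder for such a manifest. The apiVersion, kind, and field names follow the operator's v1beta2 API; the application name, namespace, image, file path, and sizing values are made up for illustration.

```python
def spark_application_manifest(name, namespace, image, main_file):
    """Build a minimal SparkApplication manifest for the
    GoogleCloudPlatform spark-on-k8s-operator (v1beta2 API).
    All concrete values passed in are hypothetical examples."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "driver": {"cores": 1, "memory": "1g"},
            "executor": {"instances": 2, "cores": 2, "memory": "4g"},
        },
    }

# Hypothetical job built from one namespace of the monorepo described above.
manifest = spark_application_manifest(
    "etl-nightly", "team-a",
    "registry.example.com/team-a/etl:main",
    "local:///app/etl.py",
)
```

A custom Airflow operator would then create this object through the Kubernetes custom-objects API and poll the resource's status field to supervise the application, as described in the orchestration bullet.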

This post took more time than I had wanted, but it actually feels good to
write it down before I forget. I hope it is useful for someone building Spark
infrastructure. I'm sure others have a completely different approach, which
I'd be curious to hear! As someone whose full-time job was basically just to
orchestrate Spark application development, I can say for certain that products
like this are needed for the ecosystem to thrive, and I would probably have
given you my business had the circumstances been right. Good luck to you and
your team.

~~~
jstephan
Thanks for taking the time to write this detailed and thoughtful feedback.
We've implemented some of the points you mentioned (SparkOperator, Airflow
connector; a CLI is WIP) and have projects planned for the others, like
making it easy to transition from local development to remote execution.

Sorry to hear about the layoffs. I'd like to follow up with you to get your
feedback on specific roadmap items we have in mind. Would you email us at
founders@datamechanics.co to schedule a call, or at least keep in touch for
when we have an interesting feature/mockup to show you? Thanks, and good luck
as well!

------
ggregoire
Julien is a really smart guy I had the pleasure of working with.

If you are reading this, I'm glad and very excited for you! Good luck!

~~~
jdumazert
Hey Guillaume, thanks for your wishes! Let's catch up!

------
ev0xmusic
Congrats guys, what you are doing is awesome :)

------
blancothewhite
Very interesting topics in good hands!

