Show HN: Arc, an open-source Databricks alternative (tripl.ai)
175 points by seddonm1 10 months ago | 36 comments

After being frustrated with building 'traditional' ETL (Extract-Transform-Load) pipelines - and around the same time as the famous 'Engineers Shouldn’t Write ETL' blog post - we started building a framework/toolkit to let Technical Business Analysts build reliable data pipelines without much developer support: Arc. It is implemented as a Jupyter Notebooks extension.

Arc is declarative and currently targets the Apache Spark execution engine, but the abstracted API allows the execution engine to be replaced in future without rewriting the logic or intent of the pipeline. It supports parameterized notebooks for building complex pipelines which can be executed in CI/CD environments for safe deployment.

We would be interested to hear your feedback.

> the famous 'Engineers Shouldn’t Write ETL' blog post

for anyone else wondering, this appears to be Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department, by Jeff Magnusson in 2016 https://multithreaded.stitchfix.com/blog/2016/03/16/engineer....

We were also inspired by the same blog post ('Engineers Shouldn’t Write ETL') and built our own internal ETL tools.

Our primary design goal was for the system to be self-service for data scientists. Since our data scientists use pandas dataframes and Jupyter notebooks all the time, we built the system around these two: (1) we have a library (that we call pype) acting as an interface between the database and Python dataframes (similar to the .to_csv method), so there are no SQL queries in ETL scripts, and (2) we schedule (parametrized) notebooks using some special keywords.

We have a demo screencast: https://drive.google.com/file/d/1SVTduaIH_3IsJ-QoGI4mLYZE8Jv...

Looks good. It is nice to see how much influence the 'Engineers Shouldn't Write ETL' post had!

With Apache Arrow (https://arrow.apache.org/) I think the future looks very bright for both of our projects. It is important to have standard open source libraries and my early experiments have shown very good performance results.

> repeatable in that if a job is executed multiple times it will produce the same result

Can you please elaborate on this statement? Does it mean you dump the data the first time the job is executed? If the data is evolving, at some point execution should produce different results.

The better statement would be that this facilitates the development of idempotent jobs and aims to minimise side-effects.
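To make "idempotent job" concrete, here is a minimal sketch rather than anything taken from Arc itself. It assumes a hypothetical `sales` table partitioned by a `run_date` column and uses Python's built-in sqlite3 module as a stand-in for a real warehouse. The load replaces the whole partition for a run inside one transaction, so executing the same job twice leaves the same result.

```python
import sqlite3


def load_partition(conn, run_date, amounts):
    """Replace the partition for run_date so re-running the job changes nothing."""
    with conn:  # one transaction: both statements are committed, or neither is
        conn.execute("DELETE FROM sales WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO sales (run_date, amount) VALUES (?, ?)",
            [(run_date, amount) for amount in amounts],
        )


# Hypothetical target table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (run_date TEXT, amount REAL)")

# Executing the same job twice leaves exactly one copy of the data.
load_partition(conn, "2021-03-25", [10.0, 20.0])
load_partition(conn, "2021-03-25", [10.0, 20.0])
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

An append-only load would double the rows on the second run. If the source data changes between runs, the partition is simply replaced with the newer data, which addresses the question above.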

What is your opinion about https://min.io/ (It also has S3 compatibility AFAIK)

I think you replied to the wrong thread. Perhaps you meant here: https://news.ycombinator.com/item?id=26577176

Sorry, you're right.

I'm in the process of doing something like this internally, at a smaller scale, and it's interesting to see that many of the concepts I've been experimenting with and playing around with are formalized here in a similar manner. My "solution" doesn't build on Spark, as I just don't have enough data to necessitate it. I think the big difference is really the project's SQL first approach, which is probably going to polarize: personally, it's a decision I can't abide by, but I'm sure a lot of people will love that.


Most of the time we run Spark on a single node (i.e. --master local[*]), as we can now easily utilise large nodes with, say, 128 cores and 512GB of memory. Spark scales vertically well, but it also runs relatively well on a small node, like the Docker example on the website running on a laptop. The ability to run SQL against separate storage is Spark's killer feature in my view.

Arc does support the full Scala API which you can implement as a plugin (https://arc.tripl.ai/plugins/) so for advanced teams they have full control.

The reason we went for SQL-first is that we are trying to find the balance that allows Business Analysts to develop their own logic without having to learn Scala or even Python - as they probably already know SQL.

Hopefully some of the ideas are relevant to what you are building.

I'm reading the docs thoroughly, many excellent ideas, and I'm sure I'll be borrowing some concepts. I also want to commend you for putting in the time in developing proper documentation, always greatly appreciated.

Great! I am glad someone is reading them as it has taken a huge amount of effort.

We also have a link from the Jupyter Completer (autocomplete) to the docs.

>Arc is an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines;

I am confused by the title `Arc, an open-source Databricks alternative`. One of the main benefits of Databricks is the managed Spark. This isn't replacing Databricks as such; it is probably an alternative to one of the features in Databricks.

Yeah, agreed. I was a Databricks skeptic when I first came across it, but its value goes a LONG way beyond just managing Spark.

For example, we found that Databricks' Spark (or their 'Delta engine' or whatever it's called) had 50-60% better performance on our workloads than 'core' Spark. I guess that's not surprising when a large proportion of Spark contributors work for you and can performance tune! Not to mention things like MLFlow and all their data engineering stuff.

This is a cool project, and I admire its ambition, but saying it's a real 'alternative' to Databricks is a bit disingenuous.

Databricks writes some good tools, but it can get pretty expensive. Kubeflow has been evolving well and is gaining lots of traction. It's pretty neat from my experience so far.

We provide multiple Docker images (https://github.com/orgs/tripl-ai/packages) that make the Spark deployment easy:

- arc-jupyter: allows you to develop on your local machine (and offline) or you can easily integrate it with a JupyterHub deployment on Kubernetes (https://zero-to-jupyterhub.readthedocs.io/en/stable/index.ht...). We have built JupyterHub on GCP Kubernetes (GKE) with full user-level auth via GCP IAM. If anyone is interested I can publish a secrets-removed version of our script.

- arc: is the execution only docker image (so is smaller than arc-jupyter). We have this orchestrated on Kubernetes too and now that Spark officially supports Kubernetes deployment it is actually really easy to create and destroy clusters on demand.

I like it a lot, but how large scale can it be?

If I want to move whole JDBC-accessible database to warehouse or lakehouse (like Postgres or Oracle to S3 with Iceberg or Snowflake or something), do I have to build a set of configuration for every table, or can I do some wildcards, autodetections, etc?

I like the look of this but worry about adopting something as big as this. That said, things tend to grow, and then I wish I'd started with something like this.

A completely valid concern.

You can see that all stages in the video implement the PipelineStagePlugin: https://arc.tripl.ai/plugins/#pipeline-stage-plugins. This means you can safely remove them from the code base and recompile without that stage at all. These are all dynamically loaded at runtime, so it should be easy to remove stages (and to implement your own custom logic).
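As a rough illustration of the idea of stages that are looked up by name at run time, here is a small Python analogy. It is not Arc's actual Scala plugin interface: the `STAGES` registry, the `stage` decorator, the stage names and the config layout are all made up for this sketch.

```python
from typing import Callable, Dict, List

Rows = List[dict]

# Hypothetical registry mapping a stage type name to its implementation.
STAGES: Dict[str, Callable[[Rows, dict], Rows]] = {}


def stage(name):
    """Register a function as a pipeline stage under the given type name."""
    def register(fn):
        STAGES[name] = fn
        return fn
    return register


@stage("Filter")
def filter_stage(rows, params):
    # Keep rows whose field is at least the configured minimum.
    return [r for r in rows if r[params["field"]] >= params["min"]]


@stage("AddConstant")
def add_constant_stage(rows, params):
    # Add (or overwrite) a field with a constant value on every row.
    return [{**r, params["field"]: params["value"]} for r in rows]


def run_pipeline(config, rows):
    """Look up each stage by its type name at run time and apply it in order."""
    for step in config:
        if step["type"] not in STAGES:
            raise KeyError(f"no plugin registered for stage type {step['type']!r}")
        rows = STAGES[step["type"]](rows, step.get("params", {}))
    return rows


# The pipeline itself is plain data, so it can live in source control.
config = [
    {"type": "Filter", "params": {"field": "amount", "min": 10}},
    {"type": "AddConstant", "params": {"field": "source", "value": "demo"}},
]
result = run_pipeline(config, [{"amount": 5}, {"amount": 15}])
```

Because the runner only knows stages through the registry, deleting a stage function removes it from the system without touching the runner, and a pipeline that still references it fails with a clear error instead of silently skipping the step.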

Similarly the Dockerfile https://github.com/tripl-ai/docker/blob/master/arc/Dockerfil... just includes the relevant plugins (if not in the main Arc repository) so you can easily remove them or the Cloud SDKs/JDBC drivers to reduce your surface area.

We have endeavoured to write a large number of tests but there is always room to add more.

As a data person who despairs at the terrible data pipelines I have to work with, this seems cool! Shall follow with interest.

Yes I think as a community we have largely got our levels of abstraction incorrect:

- Code for pipelines without frameworks leads to huge repetition of logic - or worse, people reimplementing the same 'logic' differently. You also end up with a massive upgrade problem when new versions of the underlying execution engines change.

- Databricks provides a low-level API which leads to a lot of duplication of common code across notebooks (and reusability is difficult).

- The GUI based tools are often too high level so have very high reusability but are difficult to customise - and hard to source control.

We have tried to build an abstraction somewhere in between which gives you the reusability of the GUI tools, plays nicely with source control and has the power to add custom logic via the plugin interface: https://arc.tripl.ai/plugins/ if required.


I'm curious how this compares to www.getdbt.com which seems to target a similar audience (technical analysts wanting to do ETL) with a similar approach (SQL first).

Thanks. dbt is very cool and evolved at the same time but focuses on the Transform step of ETL only. Unfortunately, as data engineers, we still spend a lot of time consolidating the many input sources to perform that transformation and also want to load it to places.

You can see the work we have done to build standardised methods for Extract (https://arc.tripl.ai/extract/) and Load (https://arc.tripl.ai/load/) in the documentation.

The idea makes sense, but Databricks exposes the complete Spark API, is that true for this project as well? Spark is a lot more than Spark SQL.

Yes. Most of the simple stages just invoke the Spark Scala API - for example MLTransform invokes a pretrained SparkML model against a dataframe and returns a new one. You can see the standard Spark ML call: https://github.com/tripl-ai/arc/blob/master/src/main/scala/a.... You can add any plugin you want via the interface: https://arc.tripl.ai/plugins/

This is really defining a dialect that is simpler for Technical Business Analysts to consume and safer than code, plus a notebook environment to build with interactively.

For example, we do a lot of low-level RDD operations through Databricks. From skimming the website I feel something like this is not in the scope of this project.

In the end, I feel, it is about wording. Databricks is a serverless Spark environment with Azure integration and notebooks. Unless the product copies all the aspects (i.e. the hosting) it may not be wise to call it a Databricks alternative.

If I read the title as it is here on HN, I would think it is about the infrastructure and not about a custom low-code JSON-based template language on top of Spark SQL.

Fair enough. This really replaces the Notebook experience of Databricks and gives you the option to self host and as it is dockerised it is easy to deploy on Kubernetes (with JupyterHub) or other container hosts. If you want to do low-level RDD operations you can easily build a plugin.

I think Spark was previously difficult to run; now, with Docker and official Kubernetes support, it is fairly easy.

Edit: I forgot to link to the Deploy repo which does have some scripts to easily deploy to cloud environments (but needs some work): https://github.com/tripl-ai/deploy

Can you specify between complete pulls of the source vs incremental based on a timestamp field?

Yes. We usually use a ConfigExecute (https://arc.tripl.ai/execute/#configexecute) stage to dynamically calculate a runtime parameter and pass that into the JDBCExtract query for example. There is an example here: https://arc.tripl.ai/solutions/#delta-processing
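The two-step pattern described here (first compute a runtime parameter, then pass it into the extract query) can be sketched in plain Python. This is not Arc configuration: the `orders` table, the `updated_at` column and both helper functions are hypothetical, and in-memory sqlite3 databases stand in for the JDBC source and the target.

```python
import sqlite3

# Hypothetical source system with three rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2021-03-01T00:00:00"),
        (2, "2021-03-02T00:00:00"),
        (3, "2021-03-03T00:00:00"),
    ],
)

# Hypothetical target that has already loaded the first row.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
target.execute("INSERT INTO orders VALUES (1, '2021-03-01T00:00:00')")


def high_water_mark(conn):
    """Step 1: compute the runtime parameter from what has already been loaded."""
    (latest,) = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()
    return latest or ""  # empty string sorts before any timestamp: full pull


def extract_since(conn, watermark):
    """Step 2: pass the parameter into the extract query."""
    return conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (watermark,),
    ).fetchall()


watermark = high_water_mark(target)
new_rows = extract_since(source, watermark)
target.executemany("INSERT INTO orders VALUES (?, ?)", new_rows)
```

With an empty target the watermark is the empty string, so the same query performs a complete pull; once rows exist it only fetches those with a later timestamp. The comparison relies on ISO 8601 timestamps sorting correctly as text.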

Good to see more attention to this. AWS did a presentation on it last year.

Cool! I was not aware.

Remember when arc was a lisp that powered hackernews? Glad to read she's all grown up

It was actually named after an electric arc as it was initially developed at a large power company - but yes, naming is heavily overloaded now.

Arc as a project name on HN ?!? OP account created November 13, 2018... okay, alright.

Given just how massive a flop Arc the language was, no wonder nobody would have heard of it.
