
Mara: A lightweight ETL framework, halfway between plain scripts and Airflow - stadeschuldt
https://github.com/mara/data-integration
======
cosmie
I really like this!

I bootstrapped the ETL and data pipeline infrastructure at my last company
with a combination of Bash, Python, and Node scripts duct-taped together.
Super fragile, but effective[3]. It wasn't until about 3 years in (and 5x the
initial revenue and volume) that it started having growing pains. Every time I
tried to evaluate solutions like Airflow[1] or Luigi[2], there was just so
much involved with getting it going reliably and migrating things over that it
just wasn't worth the effort[4].

This seems like a refreshingly opinionated solution that would have fit my use
case perfectly.

[1] [https://airflow.apache.org/](https://airflow.apache.org/)

[2] [https://github.com/spotify/luigi](https://github.com/spotify/luigi)

[3] The operational complexity of real-time, distributed architectures is non-
trivial. You'd be amazed how far some basic bash scripts running on cron jobs
will take you.

[4] I was a one-man data management/analytics/BI team for the first two years,
not a dedicated ETL resource with time to spend weeks getting a PoC based on
Airflow or Luigi running. When I finally got our engineering team to spend
some time on making the data pipelines less fragile, instead of using one of
these open source solutions they took it as an opportunity to create a fancy
scalable, distributed, asynchronous data pipeline system built on ECS, AWS
Lambda, DynamoDB, and NodeJS. That system was never able to be used in
production, as my fragile duct-taped solution turned out to be more robust.

~~~
vijucat
> When I finally got our engineering team to spend some time on making the
> data pipelines less fragile, instead of using one of these open source
> solutions they took it as an opportunity to create a fancy scalable,
> distributed, asynchronous data pipeline system built on ECS, AWS Lambda,
> DynamoDB, and NodeJS. That system was never able to be used in production,
> as my fragile duct-taped solution turned out to be more robust.

"An engineer is one who, when asked to make a cup of tea, come up with a
device to boil the ocean".

Source: unknown. I think it was in Grady Booch's OO book, or some such book.

Also: "architect astronaut" and "Better is the enemy of good" come to mind...

~~~
IncRnd
I think the reference to impossible things is far older than Grady Booch.

    
    
      To talk of many things:
          Of shoes--and ships--and sealing-wax--
      Of cabbages--and kings--
      And why the sea is boiling hot--
      And whether pigs have wings
      ~ Lewis Carroll, 1832-1898

------
antoncohen
I'm sure this is right for someone, everyone has different requirements, but I
don't really want a lighter-weight Airflow. I want an Airflow that runs and
scales in the cloud, has extensive observability (monitoring, tracing), has a
full API, and maybe some clear way to test workflows.

I was looking into how Google's Cloud Composer is run, which is a managed
Airflow service. They use gcsfuse to mount a directory for logs, because
Airflow insists on writing logs to local disk with no cleanup system, even if
you configure logs to be sent to S3/GCS. To health check the scheduler they
query Stackdriver Logging to see if it has logged _anything_ in the last five
minutes, because the scheduler has no /healthz or other way to check health.
There is no built-in way to monitor workflows, so you can't easily do
something like graph failures by workflow; email on failure is about all you
get. A GUI-first app that requires local storage is not what I expect these
days.

~~~
mariusae
> I want an Airflow that runs and scales in the cloud

I'd encourage you to look at Reflow [1] which takes a different approach: it's
entirely self-managing: you run Reflow like you would a normal programming
language interpreter ("reflow run myjob.rf") and Reflow creates ephemeral
nodes that scale elastically and that tear themselves down, only for the
purpose of running the program.

> has extensive observability (monitoring, tracing)

Reflow includes a good amount of observability tools out of the box; we're
also working on integrating tracing facilities (e.g., reporting progress to
Amazon x-ray).

> has a full API, and maybe some clear way to test workflows.

Reflow's approach to testing is exactly like any other programming language:
you write modules that can either be used in a "main" program, or else be used
in tests.

[1] [https://github.com/grailbio/reflow](https://github.com/grailbio/reflow)

------
notamy
What exactly IS an "ETL framework"? I looked at both this project and Apache
Airflow, and I'm not quite sure I understand...

~~~
shadowmint
Basically software for batch processing a tonne of data.

e.g. importing an SAP feed into a database, loading a bunch of CSV files, or
processing a bunch of images...

...anything where you have to convert data from some source through a series
of steps (typically a DAG) into some useful output.

However, it's often misused.

For example, if you have a trivial amount of data or a trivial process, an ETL
framework is over-engineered cruft where a simple script would do.

So there are many (rubbish) ‘simple ETL frameworks’ which offer zero value:
technical complexity for no benefit.

Unless you _need_ multiple servers processing data through multiple steps and
you need the auditing and process control... you probably don’t need an ETL,
just a simple script.
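
To make "a simple script" concrete: the entire thing can be as small as the
following (a hypothetical Python sketch; the file, columns, and table are made
up):

    # load_orders.py -- hypothetical example: CSV in, SQLite out, no framework
    import csv
    import sqlite3

    def extract(path):
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        for row in rows:
            yield (row["order_id"], row["customer"], float(row["amount"]))

    def load(rows, db_path="warehouse.db"):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS orders "
                    "(id TEXT, customer TEXT, amount REAL)")
        con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("orders.csv")))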

~~~
cosmie
> you probably don’t need an ETL, just a simple script.

+1

> Unless you need multiple servers processing data through multiple steps and
> you need the auditing and process control

I'll stress the "multiple servers" part. You can handle a substantial number
of sequential steps, plus auditing and process control, in a simple
script. The part that adds orders of magnitude worth of complexity and
operational overhead and points of failure is being able to distribute it to
multiple servers. Distributed architectures are operationally and
architecturally expensive. And far more often than not, completely unnecessary
for a given use case.

~~~
bduerst
Some ETLs these days _are_ simple scripts - e.g. Spark, Dataflow - with their
config files.

~~~
cosmie
Being able to _define_ an ETL workload within a simple script is not the same
as your ETL system itself _being_ a simple script.

While I love both Spark and Dataflow, both of them are incredibly complex
distributed systems with very high operational costs. Someone, somewhere is
paying a lot of money to have an operational resource maintain that
complexity. Whether you have an internal devops resource doing so or you're
using a managed service, you're paying for that complexity somehow. And, for a
lot of workloads, you aren't actually getting any more value than you would
from standing up a ~$50/month standard Debian/Ubuntu server and a set of
simple scripts on it.

~~~
Mironor
You don't have to have a cluster to run Spark scripts; setting master to
`local` (and running it on one machine) is often enough for small amounts of
data.
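
For example, a local-mode PySpark job (a hypothetical sketch; the input file
and columns are made up) needs nothing beyond pip-installing pyspark:

    # local_spark.py -- hypothetical sketch: Spark on one machine, no cluster
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")   # use all local cores, no cluster manager
             .appName("tiny-etl")
             .getOrCreate())

    # Read, aggregate, and write entirely on one box.
    df = spark.read.csv("orders.csv", header=True, inferSchema=True)
    df.groupBy("customer").sum("amount").write.mode("overwrite").parquet("out/")

    spark.stop()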

------
porker
What is there in the ETL space with bi-directional sync?

I don't usually run into problems where "transfer data from X to Y" is it.
Usually it's "there's data in CRM X and data in Event system Y, merge the two
keeping X as the master source"

There's Mulesoft et al but they seem overkill for small deployments, as well
as being stupidly expensive [1].

1. I'm sure they're good value if you're an enterprise company. But if you
don't need the UI builder and only need 2-3 sources kept in sync, they are
expensive. And from my reading of the documentation, conflict resolution isn't
great either.

~~~
ptrott2017
Kettle - the open source component of Hitachi Pentaho Data Integration - is
worth looking at. It has some functionality for this (you can join sources and
insert joined data back into the master) and it's pretty easy to extend to meet
requirements. It's Apache licensed, with great commercial support if you need
it, and can be found here:

[https://github.com/pentaho/pentaho-
kettle](https://github.com/pentaho/pentaho-kettle)

We are a small shop but it is a key part of our data workflows. We also found
Spoon, the UI workflow builder, to be very helpful, since it allowed team
members who were not strong in Java to build the workflows they need. Last but
not least, and a key decider in our adoption, the community is super friendly
and very helpful. Something we thought would take months got implemented in a
weekend thanks to quick feedback. Perhaps worth a look.

~~~
karambahh
As a long time Kettle user (probably close to 10 years) I must warn potential
users that the learning curve is steep and that (as with any large body of
code) it contains code that can sometimes behave unpredictably. I got good at
diagnosing user-induced bugs in PDI transformations by reading the stack trace,
but that is not to everyone's liking.

To me, a very strong regression is the "new" UI, which switched from meaningful
icons to a blue & white scheme that makes reading/discovering transformations a
real pain: everything is a blur of blue, without the color cues you learned
over the years ("ok, this is the icon for a merge from a source file & a
database sent to an ES cluster" became "some stuff is read from blue sources
and sent to some blue output").

I recently learned about the capability to run transformations on a Spark
cluster, which replaces the original engine with a new Spark implementation and
brings obvious compute optimizations for large enough datasets, but I don't
have enough experience with it to speak of it positively or negatively.

~~~
ptrott2017
@karambahh - good to know. I've used Kettle for about 9 months in production
and so far it's been pretty solid - but we are not going that far off the
beaten path for most things. It's a big app, but at least there is
documentation and some great users who have been very helpful, and the codebase
is by and large very logically laid out.

We do use the Adaptive Execution Layer - but so far not with Spark (we use it
with our own processing engine) - it's working well for us and it's great that
we can switch engines as needed.

re: UI. I like a lot of the new look and feel, but I can see how it lost some
visual semantics, and I can imagine any long-term user would find the changes
frustrating. I guess, having come to the tool much later, this has been less of
an issue for me, and we tweak the presentation for our own workflows and
plugins anyhow.

For us Kettle/Pentaho PDI is a great open source project, but it will
definitely be interesting to see how things evolve now that Hitachi has
acquired Pentaho.

------
groodt
I'm interested in hearing thoughts from people who've used digdag
([https://www.digdag.io/](https://www.digdag.io/)) or pachyderm
([http://www.pachyderm.io/](http://www.pachyderm.io/)). Pachyderm is the most
interesting to me. It seems to be focussing on the data as well as the data
processing.

~~~
jdoliner
I'm the founder and a core developer of Pachyderm so I can weigh in on how it
compares (there is, of course, some potential bias here). I also was at Airbnb
around the time we released Airflow, so I got to see it being built up close
and used the system it was replacing quite a bit as well.

I think it's fair to say that Mara and Airflow are both in the same category
of DAG (directed acyclic graph) schedulers for Python; Python makes a ton of
sense as the language to focus on as it's the de facto lingua franca for data
science. I'd also put Luigi in that bucket, although I think Airflow has
degraded its mind-share quite a bit. All of them are targeting the data
pipeline use case, which is very well represented as a DAG, but the actual
management of the data is left up to the user. They (Mara, Airflow, or Luigi)
schedule tasks for you after all the tasks they depend on have completed,
but you have to figure out where to store your data so that downstream tasks
can find the data their upstream tasks output. At Airbnb we used HDFS as
this storage layer, often with Hive or Presto on top. Storing in S3 is also a
common pattern.

Pachyderm is also a DAG scheduler but we're a lot more prescriptive about
where you store the data, and a lot less prescriptive about what languages and
tools you use. Pachyderm ships with its own distributed filesystem (pfs) that
we use for storage; it does a few things that other storage solutions can't
do. In particular, it version-controls your data and records "provenance", i.e.
where data comes from. For example, if you train a machine learning model, then
its provenance will be the data you used to train it. In terms of processing
we're much less prescriptive, because we let users express their code as a
Docker container rather than only having bindings for one language. So you can
use anything that you can fit in a Docker container. Data is exposed to your
code via the local filesystem, so regardless of language you have a very
natural interface to your data: system calls on files.
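
To illustrate that interface, user code inside a pipeline container really is
just file I/O (a hypothetical Python sketch; the repo name and transformation
are made up, and the /pfs mount paths follow Pachyderm's convention):

    # hypothetical user code running inside a Pachyderm pipeline container;
    # input repos are mounted under /pfs/<repo>, output goes to /pfs/out
    import os

    INPUT_DIR = "/pfs/raw_logs"   # assumed input repo name
    OUTPUT_DIR = "/pfs/out"

    for name in os.listdir(INPUT_DIR):
        with open(os.path.join(INPUT_DIR, name)) as src, \
             open(os.path.join(OUTPUT_DIR, name), "w") as dst:
            for line in src:
                if "ERROR" in line:   # the "transform": keep error lines only
                    dst.write(line)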

Hope this helps understand the differences between the various systems and
thanks for your interest in Pachyderm. Swing by our users slack channel [0] if
you'd like some help getting started with it.

[0] [http://slack.pachyderm.io/](http://slack.pachyderm.io/)

~~~
Eridrus
Random question:

I want to automate some workflows on my local machine, and besides the obvious
of just writing a script, I am interested in a system where I could describe
my workflow as a DAG and then have an easy (web?) UI where I could specify
which DAG nodes have changed (e.g. my data pre-processing code) and have it
automatically run all of the nodes that (recursively) depend on it as an
input, while not executing those whose inputs have not changed.

I am passing around very little actual data between these jobs; they are
mostly writing data to a (distributed) file system, so at most I need to pass
some paths around.

Some of the stages require launching a remote job and polling to find out if
it has completed.

Is there a good system for doing this? Now that I've described it I could
probably hack it together with a command-line UI without too much difficulty,
but having a pretty UI for launching and monitoring jobs would be great.
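
The "re-run only what changed" core is small enough to sketch in plain Python
(a hypothetical outline; the node names are made up, and each "run" would be
your launch-and-poll step):

    # hypothetical sketch: given a DAG of jobs, re-run the changed nodes and
    # everything that (transitively) depends on them, in dependency order
    from graphlib import TopologicalSorter  # Python 3.9+

    # each node maps to the set of nodes it depends on
    dag = {
        "preprocess": set(),
        "train": {"preprocess"},
        "evaluate": {"train"},
        "report": {"evaluate", "preprocess"},
    }

    def affected(changed, dag):
        """Changed nodes plus all transitive downstream nodes."""
        dirty = set(changed)
        grew = True
        while grew:
            grew = False
            for node, deps in dag.items():
                if node not in dirty and deps & dirty:
                    dirty.add(node)
                    grew = True
        return dirty

    def run(changed, dag):
        dirty = affected(changed, dag)
        for node in TopologicalSorter(dag).static_order():  # upstream first
            if node in dirty:
                print("running", node)  # launch the job / poll until done

    run({"preprocess"}, dag)  # re-runs preprocess, train, evaluate, report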

------
stadeschuldt
Here is a conference talk presenting the framework and the ideas behind it:
[https://youtu.be/GdtFuOah-5c](https://youtu.be/GdtFuOah-5c)

------
samuell
Send a pull request to add it to [https://github.com/pditommaso/awesome-
pipeline/blob/master/R...](https://github.com/pditommaso/awesome-
pipeline/blob/master/README.md)

~~~
stadeschuldt
done

------
occams_chainsaw
Seems interesting, but I'll have to dismiss it for now due to the total lack of
tests.

------
endlessvoid94
In my experience one of the places where every framework breaks down is when
you must combine / reconcile multiple rows from multiple data sources to
produce one row in a fact table.

Does there exist a "framework" that lets me do this simply?

~~~
martin_loetzsch
(author here)

The mara example project [1] does exactly that. It combines PyPI download
stats with GitHub repo activity data.

[1] [https://github.com/mara/mara-example-
project](https://github.com/mara/mara-example-project)

~~~
endlessvoid94
Thanks! Just took a look.

The file directory structure is a bit confusing -- could you point me to the
file that performs this transformation?

~~~
martin_loetzsch
For example the PyPI download stats pipeline is here:
[https://github.com/mara/mara-example-
project/tree/master/app...](https://github.com/mara/mara-example-
project/tree/master/app/data_integration/pipelines/pypi)

The __init__.py contains the pipeline, and the rest are the SQL files that do
the transformations.

~~~
endlessvoid94
Thank you!

------
thalesmello
As the one who implemented Airflow at my company, I understand how
overwhelming it can be, with the DAGs, Operators, Hooks and other
terminology.

This looks like a good enough mid-term alternative. However, I have a few
questions (which I couldn't find easily in the homepage, sorry if I skipped
something):

- Do you have a way of persisting connection information? I saw an example of
how to create a connection, but it isn't clear whether that piece of code has
to be loaded every time you execute the ETL.

- How easy is it to implement new computation engines?

- Any plans to create a command-line interface to make it easier to execute
operations?

~~~
martin_loetzsch
(author here)

Connection information is configured in code through [1]; see [2] for an
example.

It's very easy to run other workloads, either by directly invoking Python
functions from tasks or by writing your own commands (operators) [3].

There is a command line. It's the interface for running from external
schedulers (Jenkins, cron); see [4] & [5].

[1] [https://github.com/mara/mara-db](https://github.com/mara/mara-db)

[2] [https://github.com/mara/mara-example-
project/blob/master/app...](https://github.com/mara/mara-example-
project/blob/master/app/local_setup.py.example)

[3] [https://github.com/mara/data-
integration/blob/master/data_in...](https://github.com/mara/data-
integration/blob/master/data_integration/pipelines.py#L47)

[4] [https://github.com/mara/data-
integration/raw/master/docs/exa...](https://github.com/mara/data-
integration/raw/master/docs/example-run-cli-1.gif)

[5] [https://github.com/mara/data-
integration/raw/master/docs/exa...](https://github.com/mara/data-
integration/raw/master/docs/example-run-cli-2.gif)

~~~
neuromantik8086
Perhaps this is addressed elsewhere, but do you have any plans to support
Common Workflow Language?

------
mooreds
I'd love this, but in Ruby. Anyone know of anything like this?

~~~
thibaut_barrere
I wrote Kiba specifically for this.

I will share a good bunch of current use cases at Kaigi 2018 if you are
interested!

------
shadowmint
...but why?

Just use airflow.

Things I want in an ETL:

[x] works at scale.

[x] simple to use.

[x] not written in python (eg. in go or rust)

[x] easy to scale (eg. in docker)

[ ] this.

~~~
busterarm
[https://github.com/thbar/kiba](https://github.com/thbar/kiba) This is still
my workhorse and has never let me down. I do a ton of ETL.

[https://github.com/thbar/kiba-ex](https://github.com/thbar/kiba-ex) This
looks interesting though.

~~~
thibaut_barrere
Kiba author here - your comment made my day, so thanks!

~~~
busterarm
:D Thank you!!!

------
mariusae
Reflow [1] is also well-suited for ETL workloads. It takes a different tack:
it presents a DSL with data-flow semantics and first-class integration with
Docker. The result is that you don't write graphs; instead you just write
programs that, due to their semantics, can be automatically parallelized and
distributed widely, all intermediate evaluations are memoized, and programs
are evaluated in a fully incremental fashion:

[1] [https://github.com/grailbio/reflow](https://github.com/grailbio/reflow)

------
leblancfg
That looks very interesting indeed, and I'd love to see some more development
in that space.

Would anyone care to explain how it differs from Airflow? I had dismissed
Airflow some months ago for my use case (Windows, and large number of
dependencies were an issue at the time), but would still like to eventually
migrate my ETL scripts to a solid framework sometime in the future.

~~~
artwr
With a little bit of work you could probably trigger tasks on a Windows
machine, in particular if the work you need to do is mostly script and/or SQL
based. The main limitation is that Gunicorn for the Airflow webserver is not
currently compatible with Windows, but the scheduler should work fine. I
believe it might be able to run in the Windows Subsystem for Linux, but I
don't think anyone has tested it as of yet. (Source: I am an Apache Airflow
committer)

~~~
leblancfg
Thanks! To be fair, I was looking at it from a single-user perspective, and
was optimizing for:

    
    
        time it takes to set up and move my ETL to Airflow + time I will save in the future by using the framework
    

vs

    
    
        time I will spend fixing and maintaining my already-existing ETL scripts
    

so I dropped the notion after I hit the first major snag. But I should set a
reminder to take another look at it in a few months!

------
frugalmail
Just thought I would add:

If you want something more serious that supports better scale and
realtime/streaming, and is written in a statically typed language, check out
[https://kylo.io/](https://kylo.io/). It's built on top of Apache NiFi (which
was developed by our friends at the NSA).

------
reinhardt
Why PostgreSQL only? The mara-DB dependency [1] claims to support more.

[1] [https://github.com/mara/mara-db](https://github.com/mara/mara-db)

~~~
martin_loetzsch
(author here)

Currently there is a hard dependency on Postgres for the bookkeeping tables of
mara. I'm working on dockerizing the example project to make the setup easier.

For ETL, MySQL, Postgres & SQL Server are supported (and it's easy to add
more).

~~~
random4369
I'm a bit confused about this. What if the target is HDFS? Why this dependency
on SQL databases for ETL?

------
samuell
I really like the interactive terminal-based menu, that's neat!

------
tjr225
This is slightly weird...but I named my dog Mara:
[https://flic.kr/p/FtdRNX](https://flic.kr/p/FtdRNX)

~~~
martin_loetzsch
Also weird: it's the name of a giant ugly guinea pig:
[https://en.wikipedia.org/wiki/Mara_(mammal)](https://en.wikipedia.org/wiki/Mara_\(mammal\))

------
soobrosa
What's the least verbose/boilerplate-heavy tool in your experience?

We couldn't make it leaner than this (works well in production at scale):
[https://github.com/wunderlist/night-
shift](https://github.com/wunderlist/night-shift). If we could get rid of Ruby
in there (super useful for scripting) and fly with only Python, I'd be the
happiest person on Earth.

Also, we started to go cloud agnostic; it handles both AWS and Azure. Do you
know of something that does AWS, Azure _and_ Google Cloud as well?

~~~
martin_loetzsch
GNU Make is indeed the least verbose/boilerplate-heavy tool, and I use it for
a lot of things.

The problem with Make is its lack of acceptance amongst younger programmers,
who always want to work with the latest technologies.

------
nikolay
How is this better than Singer [0]?

[0]: [https://www.singer.io/](https://www.singer.io/)

------
mistrial9
The inline animations in the docs add to the casual-browsing fun:
[https://github.com/mara/mara-example-project](https://github.com/mara/mara-
example-project)

------
pgwhalen
Is anyone out there doing (or considering) streaming ETL as opposed to batch?

~~~
woqe
Apache nifi ([https://nifi.apache.org/](https://nifi.apache.org/)) is a
software which handles streaming ETL processing.

From a previous post I made about it: "We have HTTP endpoints set up to
receive data from our ERP's accounting system to send data to Concur and to
update customers' Lawson punchout ordering systems with shipment information.
The 'E' is an HTTP post with an XML payload. The 'T' consists of using the
payload to query other databases to build the 'L' payload, and the 'L' is an
HTTP post to the consumer's endpoints."
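
(To make the quoted E/T/L-over-HTTP shape concrete, here is a minimal sketch of
the same pattern in plain Python with Flask and requests - not NiFi itself; the
endpoint, query, and payload fields are made up:)

    # hypothetical sketch of the E/T/L-over-HTTP pattern described above
    import requests
    import sqlite3
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/accounting", methods=["POST"])
    def handle_invoice():
        # 'E': the extract is the incoming HTTP POST payload
        invoice = request.get_json(force=True)

        # 'T': enrich the payload by querying another database
        con = sqlite3.connect("customers.db")
        row = con.execute("SELECT email FROM customers WHERE id = ?",
                          (invoice["customer_id"],)).fetchone()
        con.close()
        payload = {"invoice_id": invoice["id"],
                   "contact": row[0] if row else None}

        # 'L': the load is an HTTP POST to the consumer's endpoint
        requests.post("https://example.com/consumer/endpoint", json=payload)
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)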

~~~
chrisjc
There's often a misconception about NiFi being batch or file based and not
streaming. This probably originates as a result of data in NiFi being
represented as a FlowFile, which is kind of a misnomer. As woqe stated above,
NiFi can certainly do streaming.

------
jmartrican
Does not seem lightweight. Maybe it's light compared to the other solutions?

------
foolinaround
If this had a scheduler and an alerting mechanism for when SLAs are breached,
that would be great! Maybe future features?

~~~
martin_loetzsch
(author here)

It intentionally doesn't have a scheduler, just definition and parallel
execution of pipelines. For scheduling, use Jenkins, cron or Airflow.

Currently you can get notifications for failed runs in Slack. Alerting itself
is not really in the scope of this project, but it should be easy to implement
in your own project.
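
(For reference, the do-it-in-your-project version of a failure notification is
tiny if you use a Slack incoming webhook; a hypothetical sketch - this is the
generic pattern, not mara's built-in integration, and the webhook URL and
pipeline name are made up:)

    # hypothetical failure-notification sketch using a Slack incoming webhook
    import requests

    SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # made up

    def notify_failure(pipeline, error):
        """Post a short failure message to a Slack incoming webhook."""
        requests.post(SLACK_WEBHOOK, json={
            "text": f":red_circle: pipeline `{pipeline}` failed: {error}"
        })

    def run_pipeline(name):
        raise RuntimeError("simulated failure")  # stand-in for your real run

    try:
        run_pipeline("pypi_downloads")
    except Exception as exc:
        notify_failure("pypi_downloads", exc)
        raise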

------
larsf
Composable is amazing at ETL -
[https://composableanalytics.com](https://composableanalytics.com)

It blows things like Alteryx, NiFi, Airflow out of the water.

~~~
mattbillenstein
Please add a disclaimer if you work for these guys... I personally generally
dislike enterprise solutions because I hate talking to sales people - but
that's just me.

~~~
larsf
Yes - I'm the founder. And you don't have to talk to me if you don't want to
:)

