
Ask HN: Data pipelines and ETLs - muramira
Been working with an org that is really struggling with the reliability and maintainability of their ETLs and data pipelines. Could you share some tools and best practices in 2017?
======
PaulHoule
The backstop "best practice" is to put somebody in charge of the issue and
give them both the responsibility and authority to fix it.

Next you need to face up to the causes of the problems. There may be five or
six root causes, and if you plan to fix just 4 of them, you will pay 80% of
the costs, but get 2% or 0% of the benefits in terms of cost savings because
the other root causes will still cause chaos, and now people will start to
blame the tools and procedures that were tried (those costs will be very
visible.)

Getting people on board with a realistic plan can be a little bit like getting
an alcoholic to recognize the damage that drinking has done to their life, but
the alternative is wishful thinking.

If you want to get into more specifics, click on my HN id and send me an
email.

See

[https://www.amazon.com/Art-Getting-Your-Own-Sweet/dp/0070145156](https://www.amazon.com/Art-Getting-Your-Own-Sweet/dp/0070145156)

------
nathanscully
We are using Airflow to manage ETL jobs. Nearly all of these are SQL steps
dynamically generated via an Airflow DAG that transform transaction and event
data in our SQL warehouse into 'master' tables everyone has access to. All SQL
and DAG code is committed to GitHub, and we have a process to update Airflow
and merge changes after they're peer reviewed. Every change goes through a PR,
so we have visibility and accountability.
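
Not our actual code, but the dynamic-generation pattern looks roughly like
this; the table names and SQL template are illustrative, and in the real DAG
each rendered statement would be wrapped in an Airflow operator:

```python
# Sketch of dynamically generating SQL steps for a DAG.
# Table names and the template are made up, not a real schema.
SQL_TEMPLATE = """
CREATE TABLE IF NOT EXISTS master_{table} AS
SELECT * FROM staging_{table};
"""

TABLES = ["transactions", "events"]


def render_sql_steps(tables):
    """Return one SQL statement per source table.

    In an Airflow DAG file, you would loop over this dict and create
    one operator per statement (e.g. a PostgresOperator), which is
    what makes the DAG "dynamically generated".
    """
    return {t: SQL_TEMPLATE.format(table=t) for t in tables}


steps = render_sql_steps(TABLES)
```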

One thing we want to improve is our testing component. Curious to hear how
people manage test workflows and replicate prod before promoting new
pipelines. I.e. I want the branch to run a full test suite against a prod
replica before automatically replacing the current prod pipeline.

------
abd12
Can you give more background on what your data pipelines are like? Are they
mostly batch processes?

If so, I'd strongly recommend using a workflow tool like Luigi[0] or
Airflow[1]. In a phrase, I'd say they're like "Make for data".

[0]: [https://github.com/spotify/luigi](https://github.com/spotify/luigi)
[1]: [https://github.com/apache/incubator-airflow](https://github.com/apache/incubator-airflow)

~~~
muramira
Basically, we replaced shell-script-based ETLs with Lambdas. Now we can
barely maintain them, and when something breaks, it takes an engineer an
inordinate amount of time and energy to fix it. Since we process about
0.5 to 1.5 TB of time-series (IoT) data, I was thinking about an architecture
that combines AWS Kinesis and Airflow.
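
One practical detail if Kinesis ends up on the ingest side: the `PutRecords`
API accepts at most 500 records per request, so a producer has to batch. A
minimal chunking helper (the limit is from the Kinesis API; the helper itself
is just an illustration):

```python
# Kinesis PutRecords accepts at most 500 records per request,
# so a producer has to split its records into batches.
KINESIS_MAX_BATCH = 500


def batches(records, size=KINESIS_MAX_BATCH):
    """Yield successive fixed-size batches from a list of records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]
```

Each yielded batch would then be passed to a real client call such as
boto3's `kinesis.put_records(Records=batch, StreamName=...)`.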

