
Data Engineering Patterns with Apache Airflow [video] - yoquan
https://www.youtube.com/watch?v=23_1WlxGGM4
======
grillorafael
Apache Airflow seems like a really interesting project, but I don't know
anyone using it who can give real-life pros and cons.

Does anyone here dare to give some feedback along those lines?

PS: Why do people still use Prezi? It gives me vertigo.

~~~
glogla
We tested it, but the performance was bad. We needed a hundred workflows with
a few hundred tasks each, and Airflow would just topple over daily.

We ended up with a proprietary tool from Teradata that's basically Airflow
written in Perl - but it can handle all the work.

Other than scalability, Airflow is pretty nice.

~~~
caravel
[full disclosure, I'm the creator of Airflow]

Many environments run tens of thousands of concurrent tasks, and hundreds of
thousands of tasks daily. The list of companies using Airflow speaks for
itself [https://github.com/apache/incubator-superset#who-uses-
apache...](https://github.com/apache/incubator-superset#who-uses-apache-
superset-incubating)

But hey, it's like anything, you have to do a bit of work to get distributed
systems to run at scale. There are now hosted solutions to help with that
(Google Cloud Composer and Astronomer.io)

~~~
tedmiston
[https://github.com/apache/incubator-airflow#who-uses-
airflow](https://github.com/apache/incubator-airflow#who-uses-airflow)

:)

------
xkcd-sucks
What strategies do people use to make Airflow behave like an "event-driven"
scheduler versus a "time-driven" scheduler? Like, for example, processing data
as it is received versus processing data at set time intervals

~~~
tpaschalis
(Newbie Airflow user here). I believe one easy way to do it is by using
Airflow's 'sensors'.

Sensors are operators which _poke_ continuously with an action until it
returns True (eg. until a file exists, an API gives a specific response, a
process/query has finished).

Another way to do it would be to use 'XComs', small pieces of information
flying between DAGs, or 'Triggers', but these require some more setup, and IMO
depend more on the way you're setting up your tasks.
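
To make the "poke until True" idea concrete, here is a minimal plain-Python sketch of that loop. It is not Airflow's implementation: `poke_until` is a hypothetical helper, and its `poke_interval` and `timeout` arguments are only named after the sensor parameters, with illustrative defaults.

```python
import time
from typing import Callable


def poke_until(condition: Callable[[], bool],
               poke_interval: float = 60.0,
               timeout: float = 3600.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> int:
    """Call `condition` until it returns True and return the number of pokes.

    Raises TimeoutError if `timeout` seconds elapse before that happens.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    started = clock()
    pokes = 0
    while True:
        pokes += 1
        if condition():
            return pokes
        if clock() - started >= timeout:
            raise TimeoutError(f"condition not met after {pokes} pokes")
        sleep(poke_interval)
```

In a real DAG the condition would be something like "does this file exist yet?", and the downstream tasks only start once the loop returns.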

~~~
xkcd-sucks
Yeah, the issue is that once a sensor fires, it doesn't reset and keep firing
on new data.

My homegrown solution is a sensor at the beginning, and at the end an Airflow
API call to trigger a DAG run of the same DAG. Not DagRunOperator, because then
no DAG would ever finish due to infinite recursion.

It seems kinda sketchy so I'm considering a lower level Celery implementation
or even GenStage
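
For reference, the "API call at the end" step might look roughly like the sketch below. It is an illustration under assumptions, not the setup described above: it targets the stable REST API of Airflow 2.x (`POST /api/v1/dags/{dag_id}/dagRuns`), which differs from the experimental API available at the time of this thread, it omits authentication, and the base URL, DAG id and helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional


def build_trigger_request(base_url: str, dag_id: str,
                          conf: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) the request that asks Airflow for a new DAG run."""
    quoted_id = urllib.parse.quote(dag_id, safe="")
    url = f"{base_url.rstrip('/')}/api/v1/dags/{quoted_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def retrigger(base_url: str, dag_id: str,
              send: Callable = urllib.request.urlopen):
    """Body of the final task: request the next run of the same DAG.

    The call only asks for a new run to be created; it does not wait for
    that run, so the current run can still finish.
    """
    return send(build_trigger_request(base_url, dag_id))
```

The first task of the DAG would be the sensor, and `retrigger("http://localhost:8080", "ingest_files")` would be called from the last one (both values are placeholders).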

------
weego
I've no idea what that presentation thing is, but no. It's stuttering, it's
grinding my Mac to a halt, and it makes skimming through tedious.

~~~
tpfour
I couldn't be bothered to even pay attention to the material. I found the
presentation repugnant.

From Prezi's homepage: "Harvard researchers find Prezi more engaging,
persuasive, and effective than PowerPoint." My experience was the complete
opposite. The medium completely destroyed the message.

------
caravel
Here's the actual talk:
[https://www.youtube.com/watch?v=23_1WlxGGM4](https://www.youtube.com/watch?v=23_1WlxGGM4)

~~~
dang
Ok, since people were complaining so much about the previous url, which was
[https://prezi.com/p/adxlaplcwzho/advanced-data-
engineering-p...](https://prezi.com/p/adxlaplcwzho/advanced-data-engineering-
patterns-with-apache-airflow/), we've switched to the video. Thanks!

~~~
yoquan
Thank you for switching to the video - I was late to respond because of the
time zone difference. (I chose the slides over the video thinking people
generally don't like video. It seems the slide format was more off-putting.)

------
ckdarby
This is the type of slides that would benefit HN if they had the video with
them as well.

~~~
blobbers
[https://en.wikipedia.org/wiki/Matthew_7:7%E2%80%938](https://en.wikipedia.org/wiki/Matthew_7:7%E2%80%938)

:

[https://www.youtube.com/watch?v=23_1WlxGGM4](https://www.youtube.com/watch?v=23_1WlxGGM4)

