
My team and I tried Airflow but found it didn't fit well with our analytics workflow. For instance, you must rewrite your Jupyter notebook into an Airflow DAG, doing essentially the same work twice. We use Dask and will soon deploy dask.distributed. I have yet to figure out where Airflow actually fits in the BI/data science architecture.


Not sure what you mean by "BI/data science architecture" but Airflow is essentially a scheduler and orchestrator for data processing jobs.

These jobs are usually managed by cron or, more often, by more advanced scheduler tools (depending on the vendor), so it's quite a core part of any architecture that needs to e.g. load/reload/refresh data periodically.
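To make that concrete, a periodic reload that would otherwise be a cron entry is declared roughly like this (a toy sketch; the DAG id and callable are made up, and the operator import path differs between Airflow 1.x and 2.x):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in 2.x

    def refresh_warehouse():
        pass  # placeholder for the actual load/refresh logic

    with DAG("nightly_refresh",
             start_date=datetime(2021, 1, 1),
             schedule_interval="0 2 * * *",  # the same expression a cron entry would use
             catchup=False) as dag:
        PythonOperator(task_id="refresh_warehouse", python_callable=refresh_warehouse)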

If the requirement is simply to connect notebooks to a data lake, then the only scheduling required is to load the data lake, and something like Airflow may be overkill for this, depending on what/how the data is processed and loaded.


I mean the same thing you mean. My issue with Airflow is that it's complicated and doesn't adapt well to cloud computing. Dask runs on AWS EMR and EKS, Kubernetes, etc. Unfortunately, orchestration is a lot more complicated than it looks: parallel execution, retries, logs, status tracking, email notifications. Airflow doesn't really tackle all of that orchestration work.


Airflow Maintainer here: what you are describing is exactly what Airflow takes care of (or should).

I wonder what your issue is/was? Notebooks are supported by means of a Papermill operator (equivalent to how Netflix operationalizes notebooks), or a PythonOperator/BashOperator that simply wraps around your notebook.
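For instance, with the Papermill operator it can be as small as this (a rough sketch; the notebook paths and parameter are placeholders, and the operator's import path differs between Airflow 1.10 and the 2.x provider package):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.papermill_operator import PapermillOperator  # provider package path in 2.x

    with DAG("notebook_pipeline",
             start_date=datetime(2021, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        PapermillOperator(
            task_id="run_analysis_notebook",
            input_nb="/notebooks/analysis.ipynb",                # placeholder paths
            output_nb="/notebooks/out/analysis_{{ ds }}.ipynb",
            parameters={"run_date": "{{ ds }}"},                 # injected into the notebook
        )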

However, to parallelize tasks Airflow needs to know a bit more, hence you might have found it necessary to break up your notebook into individual tasks that combine into a DAG. Is that what you meant?
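Concretely, that usually looks something like this (a sketch only; the three functions are stand-ins for sections of a notebook):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract():
        pass  # stand-in for the notebook's extract cells

    def transform():
        pass  # stand-in for the transform cells

    def load():
        pass  # stand-in for the load cells

    with DAG("split_notebook",
             start_date=datetime(2021, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2 >> t3  # explicit dependencies are what let Airflow retry and parallelize per task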


With Dask we code the workflow in the notebook and run it in the notebook. We don't have to fiddle with operators, since every task is Python code. Dask is easy to install, which is important since each analyst has to be able to test the workflows before sending them to production. Finally, by programming our own scheduler we can build the things we need; for instance, we are able to listen to SQL tables and API changes and trigger work based on that. Anyway, I'm sure I could make Airflow work too, but it's a harder fit than Dask.
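Roughly, a workflow cell in one of our notebooks looks like this (a toy sketch; the cluster address, paths, and column names are made up):

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # placeholder address; Client() spins up a local cluster

    # Each step is ordinary pandas-style code; Dask builds the task graph behind the scenes
    events = dd.read_parquet("s3://bucket/events/*.parquet")  # placeholder path
    daily = events.groupby("event_date")["amount"].sum()
    result = daily.compute()  # executes in parallel on the cluster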


I recall Jupyter being a parallel evolution alongside Airflow+Spark+Zeppelin (or similar mashup) and I think Jupyter has become "better".


Airflow maintainer here

Jupyter doesn't do scheduling and integrates pretty well with Airflow.


A lot of folks are moving to Dask, or Dask+Prefect these days.
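For example, with Prefect's earlier (pre-2.0) API plus its Dask executor it looks roughly like this (a sketch; the scheduler address is a placeholder and the DaskExecutor import path moved between Prefect releases):

    from prefect import task, Flow
    from prefect.engine.executors import DaskExecutor  # later releases: prefect.executors

    @task
    def extract():
        return [1, 2, 3]

    @task
    def transform(data):
        return [x * 2 for x in data]

    @task
    def load(data):
        print(data)

    with Flow("etl") as flow:
        load(transform(extract()))

    # Run the flow's tasks on a Dask cluster (address is a placeholder)
    flow.run(executor=DaskExecutor(address="tcp://scheduler:8786"))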


Airflow Maintainer here

Prefect was started by an Airflow maintainer and friend of mine who also contributed the dask executor to Airflow. Hi @Jeremiah!


I used Airflow heavily for ETL (with Hive, Presto, and custom operators) for three years, but I have no experience with the Dask executor. Could you share a gentle introduction to the Dask executor?
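From the docs it looks like it's mostly a configuration switch, something like the below (untested on my side; the scheduler address is a placeholder, and the dask-scheduler/dask-worker processes run separately):

    [core]
    executor = DaskExecutor

    [dask]
    # address of an already-running dask-scheduler
    cluster_address = 127.0.0.1:8786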



