Yeah, I love this — pure Python with cron or periodic tasks (e.g., Django) works great. Celery task for parallelization, and if you pipe logs/alerts into a Slack channel, you can actually get really far without needing a "proper" orchestration layer.
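As a sketch of how far that gets you: a small stdlib-only wrapper (names hypothetical) that a crontab entry invokes, logging exit codes and leaving a hook where the Slack alert would go:

```python
import logging
import subprocess
import sys

# A crontab entry might invoke this wrapper, e.g.:
#   */15 * * * * /usr/bin/python3 /opt/jobs/run.py sync_users
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run(cmd):
    """Run one job as a subprocess, log the outcome, return the exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        logging.info("%s: ok", cmd[0])
    else:
        # This is where you would also post to a Slack webhook.
        logging.error("%s: failed (exit %d): %s",
                      cmd[0], result.returncode, result.stderr.strip())
    return result.returncode
```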
I recently took over an Airflow system from a former colleague, and in our case, it’s just overly complex for what’s really a pretty simple data flow.
Although it lacks a lot of the monitoring and advanced web UI features other platforms have (maybe because of that), Luigi is the simplest to reason about, IMHO.
For a new project that requires complex orchestration, I'd probably go with Dagster or Prefect nowadays. Dagster seems more complex and more powerful, with its data-lineage functionality, but I have very little experience with either tool.
If it's a simple project, a mix of Makefiles + GH Actions can work well.
Is there anything even more lightweight, where you don't have to write your code any differently? For instance, say I have 10 jobs that don't depend on each other, all of them pretty small.
Dagster and even Luigi feel like overkill but I'd still like to plug those into a unified interface where I can view previous runs, mainly logs and exit codes. Being able to do some light job configuration or add retries would be nice but not required. For the moment I just use a logging handler that writes to a database table and that's fine
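For anyone curious, that database-table handler needs nothing beyond the stdlib; a minimal sketch (the table and column names here are made up):

```python
import logging
import sqlite3

class SQLiteHandler(logging.Handler):
    """Append each log record to a SQLite table (hypothetical schema)."""

    def __init__(self, path):
        super().__init__()
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS run_log "
            "(ts REAL, job TEXT, level TEXT, message TEXT)"
        )

    def emit(self, record):
        self.conn.execute(
            "INSERT INTO run_log VALUES (?, ?, ?, ?)",
            (record.created, record.name, record.levelname, record.getMessage()),
        )
        self.conn.commit()

handler = SQLiteHandler(":memory:")  # use a file path in practice
log = logging.getLogger("job-3")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("finished, exit code 0")
```

Querying the table then gives you the "previous runs" view for free.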
One of the goals of Prefect's SDK is to be minimally invasive from a code standpoint (in the simplest case you only need two lines to convert a script to a `flow`). Our deployment model also makes infrastructure job config a first-class citizen, so you might have a good time trying it out. (Disclosure: I work at Prefect.)
Love Prefect! But for workflows involving concurrency, Prefect code needs to get somewhat invasive.
Prefect relies on `prefect.task()`-wrapped functions as the lowest granularity of concurrency in a program, and requires you to use the (somewhat immature) Prefect task APIs to implement that concurrency.
This is an excellent write-up, thank you for sharing! Yeah, our concurrency API needs an upgrade - coincidentally, this is going to be a theme of the next sprint or two, so I hope I can report some improvements back soon.
Straightforward programs in languages like Java, Python, etc.
The tools you describe all have the endpoint "you can't get there from here"; the only difference is whether it takes you 5 seconds, 5 minutes, 5 days, 5 weeks, or 5 months to learn that.
A few people have mentioned Dagster, and I took a look at it for some machine learning things I was playing with, but I found dvc (data version control [1]) and I think it is fantastic. I think it also has more applications than just machine learning; really anything with data. If you have a bunch of shell scripts that write to files to pass data around, dvc might be a good fit. It will do things like only rerun steps if it needs to.
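dvc decides this by hashing stage inputs and outputs; a much cruder, mtime-based version of that "only rerun if needed" check (make-style, not dvc's actual mechanism) fits in a few stdlib lines:

```python
import os

def needs_rerun(inputs, output):
    """True if `output` is missing or older than any input (mtime check)."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(p) > out_mtime for p in inputs)

def run_step(inputs, output, step):
    """Call `step()` only when its output is stale, like a tiny `make`."""
    if needs_rerun(inputs, output):
        step()
```

Content hashing, as dvc does it, also catches the case where a file is touched but unchanged, which mtimes miss.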
Also for totally non-data stuff, Prefect is great.
I used to work for an automation company that produced a product called ActiveBatch. It was such an amazing tool for just drag-and-drop automation. Its focus was on full-fledged workflow automation, not just data orchestration.
What I loved was its simplicity plus its out-of-the-box features. Setting it up took just an MS SQL DB + an installer. Bam, you are up and running an absolutely rock-solid scheduler (I've seen a million+ jobs running on it without it breaking a sweat). Then you could install (or use it to deploy) execution agents on all the servers you wanted as workers.
It also installed a robust desktop GUI with so many services built in and ready to go (anything from executing scripts all the way to performing direct actions against countless products a company might have, or against various cloud services).
There were so many pre-built actions where all you had to do was input credentials and it would enumerate the appropriate properties from that service automatically. Then you could connect things together (i.e., pull something from the cloud, process it on some other server, store it, pass it along to another service, whatever you wanted).
Only problem was this is very much a B2B application and their sales is really only interested in selling to enterprises and not end users. I really wish we had something like this that regular people could download.
Everything I've seen listed here requires extensive setup, requires coding, or lacks a robust desktop GUI, offering instead some half-baked web GUI that might require dropping back down to scripts/coding. You could set up hundreds or thousands of automated steps in ActiveBatch without writing a single line of code. I miss that product.
As someone who worked adjacent to people running thousands of jobs in ActiveBatch: that software was indeed very simple to use, and its GUI might have been awesome, but it was a double-edged sword. With hundreds of people working on it, it became a maintenance nightmare, and promoting changes between environments was non-existent, causing multiple incidents.
Mind you, it might just have been the culture at that place, but I don't think it's as good an example as you make it out to be. Sure, it was easy to get started with and made life easier at the beginning, but running it at scale was not in any way easy.
I wrote my own in half a day. Worked 24/7 for 3 years... then I quit.
Seriously, it took me much less time than setting up Airflow. It even had a web page in the end, with all the tasks, a tree view, downstream and upstream tasks (these were incremental improvements beyond the initial half-day), a CLI... The works.
I now know the points of fragility I didn't know before, but I'd do it again.
I like having containers running as CronJobs or Deployments in Kubernetes, but Argo Workflow has been a pretty reliable plugin to Kubernetes for the more advanced scenarios.
However, it’s simple only if you are already familiar with software containers and Kubernetes. But that’s perhaps better to learn than having to deal with dependency hell in Python or Java.
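The CronJob route is mostly YAML; a minimal manifest along these lines (the image and names are hypothetical) is all a small scheduled container needs:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report          # hypothetical job name
spec:
  schedule: "0 2 * * *"         # standard cron syntax: 02:00 daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/report:latest  # hypothetical image
              command: ["python", "run_report.py"]
```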
At my last startup I asked a friend to help me debug an Airflow DAG. He just pip installed Prefect, and I've never really looked back. At the time, everything else felt too hard to figure out.
I’ve been using Airflow for quite some time. Given how mature our setup is, and although I’ve tested other solutions, I don’t really see us changing things.
Unmeshed - it’s not open source. It’s a new version of Netflix Conductor. Scales really well and has a GitHub actions style agent that can be used to run commands orchestrated by the platform. It’s probably the cheapest commercial tool you can get.
If you're inside AWS, have a fully containerized workflow, and/or can run some tasks in Lambda, Step Functions is probably OK? I personally prefer Airflow, but I wouldn't say it's the 'simplest data orchestration tool'.
spaCy’s weasel package lets you put a bunch of commands meant to be run in sequence in one project.yml file, pull assets, etc. I find it to be the right level of abstraction, and I’m pretty sure it’s not trying to become a cloud-hosted do-everything tool: https://github.com/explosion/weasel