There is almost nothing SWF can do that AWS Step Functions doesn't do better, by a large margin.
> Using SWF turned out to be a lot more complicated than I expected. Having to parse all the events to figure out which activity goes next seems error prone and makes the code confusing. The documentation is also not very clear on how this should be done, so hopefully this example helps people who are interested.
That's the only point in this article you need to know. SWF was a great stepping stone towards something better, but wow I wouldn't dare build anything new on it today.
The article is pretty naive. It writes code against the low-level API, which was never intended for workflow authors, and then complains that it is too complicated. He should have used the high-level framework (https://docs.aws.amazon.com/amazonswf/latest/awsrbflowguide/...) instead.
It is like writing assembler code against raw Linux kernel system calls while Python is available, and complaining that Linux is too hard to use.
The main thing I liked about SWF was that the entire workflow was represented in my own code. We ended up cloning the parts of SWF we used when we had to deploy outside AWS. The decider/actor pattern is a great pattern and I'd definitely use it again.
However, as you say, the 'Simple' part is definitely a misnomer.
Look at temporal.io. I believe we managed to get rid of the original SWF rough edges while maintaining the "workflow as code" approach. And the best part is that it is open source and you can run it anywhere yourself.
I'm gonna go out on a limb and say any purported workflow tool that comes with a data model you have to memorize (i.e. "Before we start building a workflow, let’s learn a little about the components of an SWF") is too complex to be effective.
My problem with tools like these is that I already know the components of an "SWF" or whatever—these are the tasks I have that need to be run/managed. When a tool starts telling me what the architecture needs to look like, then it stops being a helpful tool and starts being a little know-it-all.
My favorite workflow tool is actually two pieces of software: cron and postgres. Cron schedules tasks and postgres handles shared state. It's easy enough to whip up an ACID-compliant task queue in SQL that has whatever bells and whistles you want, and all cron wants is a command to run and a schedule. No need to read a bunch of documentation about what a "task" is supposed to be vs. an "activity" vs. an "execution" or anything like that.
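For illustration, a single polling pass of such a Postgres-backed queue might look roughly like this in Java/JDBC (the table layout, connection string, and runTask helper are all made up for the sketch):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // One polling pass of a cron-launched worker against a hypothetical "tasks" table:
    //   CREATE TABLE tasks (id bigserial PRIMARY KEY, payload text,
    //                       status text NOT NULL DEFAULT 'pending');
    public class TaskQueueWorker {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/jobs", "worker", "secret")) {
                conn.setAutoCommit(false);
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT id, payload FROM tasks WHERE status = 'pending' "
                                     + "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED")) {
                    if (rs.next()) {
                        long id = rs.getLong("id");
                        runTask(rs.getString("payload")); // whatever the task actually does
                        try (PreparedStatement done = conn.prepareStatement(
                                "UPDATE tasks SET status = 'done' WHERE id = ?")) {
                            done.setLong(1, id);
                            done.executeUpdate();
                        }
                    }
                }
                conn.commit(); // roll back instead and the row becomes claimable again
            }
        }

        private static void runTask(String payload) { /* placeholder */ }
    }

FOR UPDATE SKIP LOCKED keeps concurrent workers from grabbing the same row, and the surrounding transaction is what gives you the ACID guarantees.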
Of course, what my setup does not do is provide common functionality out of the box like "just gimme a way to kick off a series of FS-dependent tasks every day and record errors/halt if anything fails." I don't mind. It's not like Apache Airflow (just to give another example) has saved me from having to think about and express my system's dependencies and failure modes—it has only put a lot of unnecessary and unhelpful constraints on how I am able to express them.
Can you clarify what you mean by Airflow introducing unnecessary and unhelpful constraints? I'm very interested.
I'm currently working on a standard format for defining such workflows [0] and my own scheduling engine, which aims to be as non-imposing as possible. It's supposed to be a "cron" for task scheduling with a dependency graph. The only added thing is that you can specify what kind of environment you want to run your tasks on.
So I would appreciate it if you could tell me what things annoyed you in particular. I use Airflow at work and I can list a million, but I don't know exactly what you meant by that sentence.
Sure! To start, just fundamentally—why assume workflows are DAG-shaped? Why no cycles? Lots of real-world processes contain unscheduled repetition that arises at "runtime."
Or what if I can only find out what the rest of the workflow looks like once I'm halfway through it? Why must workflow definitions be static? No "decision" elements as in a flowchart?
Someone might read these complaints and think I'm asking for a programming environment rather than a workflow tool, and that's kinda my point :P
The "unnecessary" side is typically project-specific, but I tend not to need a separate notion of `backfill`, or any of the `Executor` functionality for distributed execution. I suppose if I needed to run stuff on multiple nodes I would just schedule jobs on Kubernetes directly.
The concept of a cyclic workflow sounds interesting to me! I also intend mine to be acyclic, but I guess a possible infinite loop is not bad in itself. Maybe I should reconsider my underlying data structure.
Why would a workflow need to be static? I would say that a specific instance of a workflow should be static. The main reason one would want to switch to such a system is because they want their jobs to do (almost) the same thing, but at different points in time. If your task is also changing while it runs, maybe this logic should be within the task, not the workflow. But if your workflow changes over time, that is a very valid point. One which I'm also trying to incorporate with incremental versioning.
Backfills as a separate notion I also find weird.
Lastly, I view the executor part as very important. Imagine you want to run different processes across an organisation inside your scheduling engine, and some are written in a Python environment while others are compiled code. Sure, you can schedule it all directly on k8s, but then you lose the advantage of bundling all your workflows into one system built for exactly that purpose. You basically go back to the "cron" example, where you deploy directly on infrastructure, meaning you never needed a workflow engine in the first place :P
This mentality leads to systems that are a mess of callbacks, don't scale, and practically impossible to maintain.
You could make the same argument about SQL databases. Why do you need to understand their architecture, learn arcane SQL syntax, and learn to operate them? Instead, your program could write and read files directly from disk. But you still choose to use the DB as it has hundreds of man-years invested in it and gives you a higher level of abstraction.
Think about SWF and its newest incarnation in the form of temporal.io as a higher-level way to write task orchestrations. It requires some learning but lets you immediately leverage the dozens of man-years invested in the technology.
> This mentality leads to systems that are a mess of callbacks, don't scale, and practically impossible to maintain.
Why would it? I can put the same types of abstractions into my application layer in the form of a common library. The only difference is that they can be a lot fewer and simpler, because they only need to meet my exact requirements.
I often do make the same argument about SQL databases in cases where RDBMSs are not an appropriate tool for the job. In the case I mentioned, where I'm using it as a shared datastore that supports ACID transactions with concurrent access, I find Postgres (and many others, including many NoSQL stores) to be suitably placed in the abstraction spectrum to be worth using rather than rolling my own solution.
I agree in principle that there is no single tool for every possible situation.
At the same time, I've witnessed hundreds of cases where developers spent an inordinate amount of time developing "abstractions in the application layer" instead of the business logic.
Very cool projects. In general, across all projects, what is the best approach to store/checkpoint the state of the program? I'd imagine something like CRIU, but I'd love to hear your thoughts. When do you take snapshots? Do you hook into the VM's event loop being drained? How do you store and replicate these snapshots?
Temporal/Cadence/SWF operate as libraries that any application can include to implement workflow and activity logic, so hooking into a low-level VM event loop was not an option.
Instead, they rely purely on event sourcing to re-execute the program code from the beginning, assuming that the workflow code is deterministic. The library provides various API wrappers to execute multithreaded code deterministically using cooperative multithreading.
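To make that concrete, here is a rough sketch of what those wrappers look like in the Temporal Java SDK (GreetingWorkflow and its method are illustrative names, not an official sample):

    import io.temporal.workflow.Workflow;
    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;
    import java.time.Duration;
    import java.util.UUID;

    @WorkflowInterface
    interface GreetingWorkflow {
        @WorkflowMethod
        String getGreeting(String name);
    }

    // Workflow code must stay deterministic so it can be re-executed from history.
    // Instead of System.currentTimeMillis(), Thread.sleep(), or new Random(),
    // the SDK provides replay-safe equivalents:
    class GreetingWorkflowImpl implements GreetingWorkflow {
        @Override
        public String getGreeting(String name) {
            long start = Workflow.currentTimeMillis();   // comes from history on replay
            Workflow.sleep(Duration.ofHours(1));         // durable timer, not a real sleep
            // Non-deterministic results are recorded once and replayed afterwards:
            UUID requestId = Workflow.sideEffect(UUID.class, UUID::randomUUID);
            return "Hello " + name + " (request " + requestId + ", started " + start + ")";
        }
    }

If the worker process dies halfway through, another worker replays the recorded events and the code resumes at the same point with the same values.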
In the future, WebAssembly can be used as a container as determinism is one of its core features.
What considerations do you put into how to handle versioning? I'm thinking mostly of the "decider" logic.
For example, in SWF + Flow land, I maintain different versions just by keeping the old implementation and using version numbers. For Temporal, your team's recommended way is to have a single decider get the execution's version, and write conditionals into the business logic for both cases (https://docs.temporal.io/docs/java-versioning). Is the change in approach intentional, or just a matter of what's been built so far?
The change in approach is intentional. It is still possible to use the old method, but for the majority of cases, it is just too heavyweight. There are two main problems with versioning entire workflows:
1. Need to keep the workers with old code around. It might be solved for some languages like Java by dynamic code loading, but it is still pretty complex. For long-running workflows dozens of changes can happen during their lifetime, requiring dozens of versions to be present at runtime.
2. When a bug is fixed, the old versions do not get the fix, as they are inherently immutable. Thus a workflow that started yesterday and runs for a month will still hit a bug that was fixed today.
The approach that Temporal promotes is to version each piece of code independently when needed. This way, there is only one version of the entire code in production to maintain, and bug fixes can be deployed any time and apply to all the open workflows that didn't reach the code that was fixed.
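In the Java SDK that per-change versioning looks roughly like this (a fragment from inside a workflow method; the change id and the activities stub are illustrative):

    // Version this one change, not the whole workflow. Histories recorded before
    // the change replay the old branch; new executions take the new one.
    int version = Workflow.getVersion("add-retry-step", Workflow.DEFAULT_VERSION, 1);
    if (version == Workflow.DEFAULT_VERSION) {
        activities.oldStep();   // behaviour existing histories were recorded with
    } else {
        activities.newStep();   // changed/fixed behaviour for new executions
    }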
What was the reasoning behind the decision to define workflows as code versus in a configuration format like JSON (think Netflix Conductor) in Cadence and Temporal?
The reason is that workflows are inherently imperative.
Configuration formats work well for domains where declarative programming makes sense. For example, Terraform defines "what should be done" instead of how it should be done.
Some workflows can be described using declarative syntax. But these belong to some specific domain. In this context, Terraform HCL can be seen as a workflow definition language for infrastructure deployment.
Temporal/Cadence target non-domain-specific workflows, so they have to be imperative. And I strongly believe that JSON/YAML/XML/etc. are awful languages for writing imperative programs. They add no value but introduce immense complexity. And they almost always end up embedding some expression language for conditions and other evaluations.
For imperative programming, any general-purpose programming language beats any configuration-based language hands down in clarity and in features like IDE support, debugging, unit testing, and mocking. We have software systems built from millions of lines of code, and people still make sense of them. Just imagine the Linux kernel written in JSON instead of C.
Note that some workflow engines define workflows as code as well; Airflow is a notable example of this approach. But Airflow only uses code to define the DAG; the actual execution engine executes the DAG, not the code.
Temporal executes the workflow code in a general-purpose programming language directly without any intermediate conversion to AST/DAG or similar.
Then the question is: what makes Temporal workflow code a workflow? Is any code a workflow? Temporal makes workflow state, including local variables and thread stacks, fully fault-tolerant. It is like having a computer with fully durable RAM. Actually, Temporal provides even stronger reliability, as the workflow code is not tied to a single computer and is automatically migrated between computers or even different clusters when infra failures (or just new code deployments) happen.
> Temporal/Cadence target non-domain-specific workflows, so they have to be imperative.
I'm not sure that follows, although it's interesting! BPMN, for example, is a declarative way of defining processes that isn't domain specific. Did you have a look at that?
BPMN is not declarative; it is imperative. It doesn't say what should be achieved; it gives exact instructions on what should be executed and in which order. And a big chunk of those instructions is not really visible in the diagram. For example, which parameters each activity takes, or how shared state is updated, is hidden.
So it has all the issues that I pointed out above.
If you do not have all your stuff in AWS, and are not sure you want to pay high amounts for managed or "serverless" solutions, I am currently developing a standard[0] for workflows and a distributed scheduler compliant with that standard.
It's still a WIP, but I've used my fair share of managed solutions, "serverless", open source (Airflow/Luigi), and BI engines, and I keep wondering why there is no general standard just for the definition of workflows, so that moving between systems involves much less friction...
Don’t you think the big challenge is creating and managing events? Once the graph is built you can throw it at something like dask with fire and forget. The challenge is WHEN to run.
Yes and no.
So it's similar in the sense that I define a format for these workflows. This means that the workflow authoring tool is completely independent of the scheduler/execution engine, as long as both are compliant with the standard. BPMN tries to do a lot more, and I think it's overkill for most use cases.
The major difference is that OpenWorkflow defines tasks very loosely: you specify the environment you need and what to do on it. You can incorporate it into your current set of jobs without much friction.
I don't think BPMN is a good comparison, as the companies who seek that are mostly huge and have slowly changing processes. This is more for cases where the world around you might be ever-changing: incorporating APIs, shuffling files around, or moving data for analytics/model training.
Just looked at it, since I'd never heard of it before.
It looks kind of like what I'm trying to achieve, but the specification is very detailed again. It even comes with its own data model for input and output, which I don't feel should be the focus, or mandatory. It's very Java/Python heavy, which, to be fair, are the two most common languages for these use cases. But I believe you either have a very static, strong system, as in shipping a whole BPMN model onto your company and implementing SAP, or you want something lightweight that's easy to use, like e.g. Airflow. I'm more on the lightweight front, but I think systems like Airflow have the problem of being too shut in to one language. I want my workflows to be quickly definable (e.g. via Python), but I don't want the engine to actually run on Python...
I mean, worst case I'll just have gained a lot of experience once I'm done implementing. I do see a lot of similarities between what's already there and what I'm doing, but also some differences. And it's fun :D
This reminds me of a POC I did to serialize Kotlin continuation state in order to achieve something similar with Kotlin coroutines. The main issue with that approach is that the continuation context can end up closing over a huge variety of types (since it has to serialize the entire asynchronous stack). There was some discussion in this issue (https://github.com/Kotlin/kotlinx.coroutines/issues/76) about annotating suspend functions so that their continuation only closes over objects compatible with Kotlin serialization. This approach removes the requirement for all non-determinism in the code to go through some kind of framework, which seems like a plus.
I think in the future what will happen is someone will build a language with CSP style concurrency that deals with the "update-in-flight" and "persistence" problems. Essentially extending the notion of a programming language runtime from "once, here, on this computer, right now", to something much broader.
I used SWF for an auto-failover scenario many years back and found it very effective. I always liked the fact that worker machines could be on-premise if needed. I think Step Functions can also interoperate with on-premise resources/code, though I have yet to work with Step Functions.
At first I found the Java Flow framework off-putting because it uses AspectJ to modify the bytecode, but after having used it, it's kind of elegant.
I like how it re-runs through your decider each time, with the state information filling up Promises as it goes.
It helps make sure your decider is written in a deterministic fashion with respect to the workflow events.
In my experience, Flow is truly awful. It bloats build times, and prevents incremental rebuilds. You'll curse AspectJ and Flow as your codebase grows. Stack traces become atrocious (even more atrocious than standard Java).
> make sure your decider is written in a deterministic fashion
You're right, but this is easier said than done. And if it isn't, it will error at runtime.
Try temporal.io. It is just a Java library without any code generation or AspectJ. It also lets you write synchronous code, while Flow was asynchronous only.
No thanks mate, the endless `Impl` of interfaces is giving me bad flashbacks. At least it's open source I guess. Not to mention the deployment hassle that the different workflow and activity workers represent, or how you roll back such a beast. Or even what the testing strategy looks like.
I agree that Flow, being a POC, had all the above issues. Temporal listened to the users and solved all of them:
1. No code generation. You define one interface for activities and use it both for calling activities synchronously from the workflow and for implementing them. Here is how you would do it in Temporal:
    GreetingActivities activities = Workflow.newActivityStub(GreetingActivities.class);
    // This is a blocking call that returns only after the activity has completed.
    String greeting = activities.composeGreeting("Hello", name);
2. Neither SWF nor Temporal drives the deployment strategy. You can run all of the workflows and activities as a monolith in a single process or break them into multiple services. It is purely your choice.
3. It wasn't possible to run SWF locally. Temporal fully supports unit testing of long-running workflows with automatic time skipping (see the sketch below), as well as local integration testing using the service running in docker-compose.
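A rough sketch of such a time-skipping test with the Java SDK, inside a JUnit test method (GreetingWorkflow/GreetingWorkflowImpl are placeholder names and the task queue name is arbitrary):

    import io.temporal.client.WorkflowOptions;
    import io.temporal.testing.TestWorkflowEnvironment;
    import io.temporal.worker.Worker;

    // A workflow that sleeps for a month completes in milliseconds: the test
    // environment skips time whenever all workflow threads are blocked on timers.
    TestWorkflowEnvironment env = TestWorkflowEnvironment.newInstance();
    Worker worker = env.newWorker("test-task-queue");
    worker.registerWorkflowImplementationTypes(GreetingWorkflowImpl.class);
    env.start();

    GreetingWorkflow workflow = env.getWorkflowClient().newWorkflowStub(
            GreetingWorkflow.class,
            WorkflowOptions.newBuilder().setTaskQueue("test-task-queue").build());
    String greeting = workflow.getGreeting("World");   // Workflow.sleep(...) inside is skipped
    env.close();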
I've been using temporal.io, which afaik is developed by some of the people who built SWF (and Cadence at Uber), and I think it's great. I'll probably use it for a ton of stuff in the future.
We (at Banzai Cloud) use Cadence for managing infrastructure. Workflow as code is a huge thing for us, because we can use the official SDKs for cloud providers, but still have the flexibility of implementing our own logic easily and in a testable way.
Temporal is even better as it replaces Thrift with gRPC and adds tons of improvements.
We have a bunch of services which orchestrate long running workflows so temporal is quite interesting to me. I've been lurking in the temporal slack for a while (and before that in the cadence slack).
Would be interested to know what kinds of things you're using it for/what scale/which language client?
I work for an accounting company with a few hundred workers who, every month, had to submit a lot of files and collect a lot of documents from human-only websites.
About two years ago I built a custom workflow engine in python where I had a few primitives (filesystem, database access, browser (puppeteer scripts), etc), and would define a workflow using those primitives using a YAML DSL.
In a couple weeks I ported the whole thing to Temporal (Java client), where each of those primitive groups is a different activity, which runs in a separate worker. I have a browser worker, which runs in the only servers that have internet access, a filesystem worker, which is on a Windows machine serving as an entrypoint to the company's windows filesystem, a database worker, etc.
This is all very scalable, and I can run as many of each worker and the engine itself as I like.
Currently the scale is not too big, though. I launch a few "main" workflows on certain days of the month that then spawn up to 4 thousand child workflows, which run simultaneously with no problem.
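For reference, fanning out child workflows from a parent looks roughly like this in the Java SDK (SubmissionWorkflow and the clients list are illustrative names, not my actual code):

    // Inside the parent workflow method: start each child asynchronously and
    // then wait for all of them; every child is its own durable execution.
    // Uses io.temporal.workflow.Workflow, Async and Promise.
    List<Promise<Void>> children = new ArrayList<>();
    for (String client : clients) {
        SubmissionWorkflow child = Workflow.newChildWorkflowStub(SubmissionWorkflow.class);
        children.add(Async.procedure(child::processClient, client));
    }
    for (Promise<Void> p : children) {
        p.get();   // blocks (durably) until that child completes
    }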
Scale, interactivity, and durability are certainly part of this discussion. You can use Cadence/Temporal for purposes you just really couldn't use Airflow for. For instance, you could implement this entire actor-based auction example from Akka with Cadence/Temporal https://doc.akka.io/docs/akka-enhancements/current/persisten... There is absolutely no way you could do that with Airflow.
To understand why Cadence is so useful, you have to go a bit beyond the traditional ETL scheduling usages IMO.