Durable execution is best done at the level of a language implementation, not as a library.
A workflow engine I recently built provided an interpreter for a Scheme-based language that, for each blocking operation, took a snapshot of the interpreter state (heap + stack) and persisted that to a database. Each time an operation completes (which could be after hours/days/weeks), the interpreter state is restored from the database and execution proceeds from the point at which it was previously suspended. The interpreter supports concurrency, allowing multiple blocking operations to be in progress at the same time, so the work to be done after the completion of one can proceed even while others remain blocked.
The advantage of doing this at the language level is that persistence becomes transparent to the programmer. No decorators are needed; every function and expression inherently has all the properties of a "step" as described here. Deterministic execution can be provided if needed. And if there's a need to call out to external code, it is possible to expose Python functions as Scheme built-ins that can be invoked from the interpreter either synchronously or asynchronously.
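To make the shape of this concrete, here's a minimal sketch of the suspend-at-a-blocking-operation idea; the table layout and function names are made up for illustration (the real engine persists its own state format to Postgres):

```python
import json
import sqlite3

# Illustration only: a real engine would persist a richer state representation.
db = sqlite3.connect("workflows.db")
db.execute("CREATE TABLE IF NOT EXISTS snapshots (wf_id TEXT PRIMARY KEY, state TEXT)")

def suspend(wf_id: str, heap: dict, stack: list) -> None:
    """Persist the interpreter state (heap + stack) when a blocking operation starts."""
    db.execute("INSERT OR REPLACE INTO snapshots VALUES (?, ?)",
               (wf_id, json.dumps({"heap": heap, "stack": stack})))
    db.commit()

def resume(wf_id: str, result) -> dict:
    """Hours or days later: restore the state and continue from the suspension point."""
    (state_json,) = db.execute(
        "SELECT state FROM snapshots WHERE wf_id = ?", (wf_id,)).fetchone()
    state = json.loads(state_json)
    state["stack"].append(result)  # deliver the blocking operation's result
    return state                   # the evaluator resumes from here
```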
I see a lot of workflow engines released that almost get to the point of being like a traditional programming language interpreter but not quite, exposing the structure of the workflow using a DAG with explicit nodes/edges, or (in the case of DBOS) as decorators. While I think this is ok for some applications, I really believe the "workflow as a programming language" perspective deserves more attention.
There's a lot of really interesting work that's been done over the years on persistent systems, and especially orthogonal persistence, but sadly this has mostly remained confined to the research literature. Two real-world systems that do implement persistence at the language level are Ethereum and Smalltalk; also some of the older Lisp-based systems provided similar functionality. I think there's a lot more value waiting to be mined from these past efforts.
We (AutoKitteh) took a different approach to provide durable execution at the language level:
We are using Temporal as a durable execution engine, but we built a layer on top of it that provides a pure Python experience (JS coming soon).
This is done by hooking into the AST and converting functions that might have side effects into Temporal activities. Of course, it's deeper than that.
For the user, it's almost transparent: regular Python code (no decorators) that you can run as a durable workflow in our system or as regular Python.
You can look at the open-source code, and we have a blog post explaining how it's implemented.
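For anyone curious what "hooking into the AST" can look like in principle, here's a tiny, hypothetical sketch of rewriting calls to known side-effecting functions so a runtime layer can intercept them. `durable_call` and the function list are made up; this is not AutoKitteh's actual implementation:

```python
import ast

# Hypothetical: functions the runtime should treat as side-effecting.
SIDE_EFFECTING = {"requests.get", "send_email"}

class WrapSideEffects(ast.NodeTransformer):
    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)
        if ast.unparse(node.func) in SIDE_EFFECTING:
            # Rewrite f(args) -> durable_call(f, args) so the runtime can
            # journal the result and replay it on recovery.
            return ast.Call(
                func=ast.Name(id="durable_call", ctx=ast.Load()),
                args=[node.func, *node.args],
                keywords=node.keywords,
            )
        return node

source = "resp = requests.get(url)\nprint(resp.status_code)"
tree = WrapSideEffects().visit(ast.parse(source))
print(ast.unparse(ast.fix_missing_locations(tree)))
# resp = durable_call(requests.get, url)
# print(resp.status_code)
```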
I see it as a trade-off between how explicit the state persisted for a workflow execution is (rows in a database for Temporal and DBOS) vs. how natural it is to write such a workflow (like in your PL/compiler). Given that workflows are primarily used for business use cases, with a lot of non-determinism coming from interaction with third-party services or other deployments, the library implementation feels more appropriate.
Though I am assuming building durability at the language level means the whole program state must be serializable, which sounds tricky. Curious if you could share more?
There's certainly a tradeoff between the two approaches; a simpler representation (list of tasks or DAG) is easier to query and manipulate, at the cost of being less expressive, lacking features like loops, conditionals, etc.
In the workflow engine I described, state is represented as a graph of objects in memory; this includes values like integers/strings and data structures like dictionaries/lists, as well as closures, environments, and the execution stack. This graph is serialised as JSON and stored in a Postgres table. A more compact binary representation could be added in the future if performance requirements demand it, but JSON has been sufficient for our needs so far. A delta between each snapshot is also stored in an execution log, so that the complete execution history is available for auditing purposes.
The interpreter is written in such a way that all object allocation, object manipulation, and garbage collection are under its control, and all the data needed to represent execution state is stored in a manner that can be easily serialised. In particular, we avoid the use of pointers to memory locations, instead using object ids for all references. So the persistent state, when loaded, can be accessed directly: any time a reference from one object to another needs to be followed, the interpreter does so by looking up the object in the heap based on its id.
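As a rough illustration of the id-based representation (a simplified, hypothetical object layout, not the engine's actual schema):

```python
import json

# Every object lives in a heap keyed by id; objects refer to each other by id,
# never by pointer, so the whole graph round-trips through JSON.
heap = {
    "obj1": {"type": "list", "items": ["obj2", "obj3"]},              # references by id
    "obj2": {"type": "int", "value": 42},
    "obj3": {"type": "closure", "code": "obj4", "env": "obj5"},
    "obj4": {"type": "code", "source": "(lambda (x) (+ x 1))"},
    "obj5": {"type": "env", "bindings": {"x": "obj2"}},
}

def deref(obj_id: str) -> dict:
    """Follow a reference by looking the object up in the heap by id."""
    return heap[obj_id]

snapshot = json.dumps(heap)   # persist to the database
heap = json.loads(snapshot)   # restore later; the ids are still valid
print(deref(deref("obj1")["items"][1])["type"])  # -> closure
```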
Non-deterministic and blocking operations (including IPC receives) are handled outside of the evaluation cycle. This enables their results to be explicitly captured in the execution log, and allows for retries to be handled by an external mechanism under control of the user (since retrying can be unsafe if the operation is not idempotent).
The biggest win of using a proper language for expressing the workflow is the ability to add arbitrary logic between blocking operations, such as conditional tests or data structure manipulation. Any kind of logic you might want can be expressed, because the workflow language is Turing-complete.
That's really interesting! It does seem that this is semantically identical to the library approach (as the logic your interpreter adds around steps could also be added by decorators) but is completely automatic. Which is great if the interpreter always does the right thing, but problematic/overly magical if it doesn't. For example, if your problem domain has two blocking operations that really form one single step and should be retried together, a library approach lets you express that, but an interpreted approach might get it wrong.
As soon as you do that, you tie in your state-preserving storage (a database?) with your language as your "programming environment", and it becomes harder to decouple them (or the design becomes overly complex with configurable implementations of a state-database interface).
So, I don't think this should be at the language level. Potentially at a programming environment level, which includes configuration for such environment, but separation of concerns is screaming loudly in my head :)
Reminded me of the Restate idea[1], discussed here[2] recently, except they do it as a library. An excerpt:
> To persist intermediate steps (line 8), handlers use the SDK (ctx.run), which sends the event to the log and awaits the ack of the conditional append to the event’s execution journal. On retries, the SDK checks the journal whether the step’s event already exists and restores the result from there directly.
Though I think I agree with your point that it would be better to have this even more integrated than "just" a library. Perhaps something like how you can override the global memory allocator in C and similar languages, to avoid a tight coupling to the persistence layer.
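For reference, the pattern in the excerpt boils down to something like this minimal sketch (made-up table and function names, not Restate's actual SDK):

```python
import json
import sqlite3

db = sqlite3.connect("journal.db")
db.execute("""CREATE TABLE IF NOT EXISTS journal
              (wf_id TEXT, step TEXT, result TEXT, PRIMARY KEY (wf_id, step))""")

def run_step(wf_id: str, step: str, fn):
    row = db.execute("SELECT result FROM journal WHERE wf_id = ? AND step = ?",
                     (wf_id, step)).fetchone()
    if row:                      # on retry, restore the recorded result directly
        return json.loads(row[0])
    result = fn()                # first execution: actually perform the side effect
    db.execute("INSERT INTO journal VALUES (?, ?, ?)",
               (wf_id, step, json.dumps(result)))
    db.commit()                  # the step is only durable once this append is acked
    return result
```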
A contributor in this space that I always thought was under-appreciated is Amazon SWF + Flow Framework. It's an older technology and SWF itself is deprecated, but it is a weird middle ground here in that it makes language-level modifications to inject workflow state persistence and coordination into Java code (via AspectJ). The original concept was to build a "distributed CPU" (for example, see https://docs.aws.amazon.com/amazonswf/latest/awsflowguide/aw...).
SWF was the predecessor of AWS Step Functions. IIUC the lesson learned from SWF was that it was just too flexible: a more limited set of constraints was both easier for the programmer to use and reason about, and made for simpler and faster execution.
I was playing with resumable execution and I agree that language-level support would be an immense improvement. That said, I was able to create a library that takes JavaScript code and executes it line by line, storing the state after each line, but the user needs to use a state object to hold all intermediate results so they get persisted. There are still problems if the computation fails on some line: we can retry it, but if the failure corrupted the entire execution state then we have a problem.
I wonder how to achieve transaction-like (commit/rollback) behavior that works across those boundaries. Doing it at the language level is the way; the compiler/interpreter can handle all the state serialization.
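One way to get that commit/rollback behaviour across boundaries is to write the new state and the progress marker in the same database transaction, so a crash mid-line rolls both back and the retry starts from a consistent snapshot. A hypothetical sketch:

```python
import json
import sqlite3

db = sqlite3.connect("exec.db", isolation_level=None)  # autocommit; we manage transactions
db.execute("CREATE TABLE IF NOT EXISTS state (wf_id TEXT PRIMARY KEY, vars TEXT, line INTEGER)")

def execute_line(wf_id: str, line_no: int, vars_: dict, fn) -> dict:
    db.execute("BEGIN")
    try:
        new_vars = fn(dict(vars_))             # run one line against a copy of the state
        db.execute("INSERT OR REPLACE INTO state VALUES (?, ?, ?)",
                   (wf_id, json.dumps(new_vars), line_no + 1))
        db.execute("COMMIT")                   # state and progress advance together
        return new_vars
    except Exception:
        db.execute("ROLLBACK")                 # stored state is untouched; safe to retry
        raise
```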
This comment reminds me that despite having built lots of cool stuff in my career, that in reality I've just barely done anything that can be called "software engineering".
This seems like Temporal, only without as much server infrastructure and complexity. Maybe they ignore it, or it really is that simple.
Overall really cool! There are some scalability concerns raised here that I think are valid, but maybe you have a Postgres server backing every few application servers that need this kind of execution. Also, not every function should be its own step; work needs to be divided into larger chunks so that every request only generates <10 steps.
The example is overly simplified. It glosses over many of the subtle-but-important aspects of durable execution.
For example:
- Steps should be small but fallible operations, e.g. sending a request to an external service. You generally want to tailor the retry logic on steps to the specific task they are doing. Doing too much in a step can increase failure rates or cause other problems due to the at-least-once behaviour of steps.
- The article makes a big deal of "Hello" being printed 5 times in the event of a crash, but durable execution doesn't guarantee this! You can never have exactly-once guarantees for side-effectful functions like this without cooperation from the other side. For example, if the external service supports idempotency via request IDs, then you can generate an ID in a separate step and use that in your request to get exactly-once behaviour (see the sketch after this list). However, most services don't offer this. Crashes during a step will cause the step to re-run, so durable execution only gives you at-least-once behaviour.
- Triggering the workflow itself is a point of failure. In the example, the workflow decorator generates an ID for the workflow internally, but for triggering a workflow exactly once the workflow ID needs to be externally generated.
- The solution is lightweight in terms of infrastructure, but not at all lightweight in terms of performance.
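To illustrate the request-ID point from the second bullet (hypothetical endpoint and helper names; each function would be its own step):

```python
import uuid
import requests

def generate_idempotency_key() -> str:
    # Step 1: recorded by the engine the first time and replayed on retry,
    # so the same key is always sent.
    return str(uuid.uuid4())

def charge_card(amount_cents: int, key: str) -> requests.Response:
    # Step 2: safe to re-run, because the service deduplicates on the key.
    return requests.post(
        "https://api.example.com/charges",   # placeholder endpoint
        json={"amount": amount_cents},
        headers={"Idempotency-Key": key},
        timeout=10,
    )
```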
Thanks! DBOS is simpler not because it ignores complexity, but because it uses Postgres to deal with complexity. And Postgres is a very powerful tool for building reliable systems!
Temporal has the option of using postgres as the persistence backend. Presumably, the simplicity of DBOS comes from not having to spin up a webserver and workflow engine to orchestrate the functions?
This is the can of worms introduced by doing event processing. As soon as you break the tie between request and response, a billion questions come up. In a request-response scenario, the answer to "what happens to the reservation?" is just to chuck an error at the user and ask them to try again.
As soon as you accept the user's input and tell them all is well before you have processed it, you run into these kinds of problems.
I know not every single thing we do can be done without this kind of async processing, but we should treat these scenarios more seriously.
It's not that it can't ever be good, it's that it will always be complicated.
Having used several external orchestrators I can see the appeal of the simplicity of this approach, especially for smaller teams wanting to limit the amount of infrastructure to maintain. Postgres is a proven tool, and as long as you design `@step`s to each perform one non-deterministic side effect, I can see this scaling very well both in terms of performance and maintainability.
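Something like this shape (the decorators here are pass-through stand-ins, not any particular library's API):

```python
def step(fn):        # stand-in for the engine's step decorator
    return fn

def workflow(fn):    # stand-in for the engine's workflow decorator
    return fn

@step
def reserve_inventory(order_id: str) -> bool:
    print(f"reserving inventory for {order_id}")   # stands in for one external call
    return True

@step
def charge_customer(order_id: str) -> None:
    print(f"charging customer for {order_id}")     # retried independently of the step above

@workflow
def fulfill_order(order_id: str) -> None:
    if reserve_inventory(order_id):                # deterministic glue logic lives in the workflow
        charge_customer(order_id)

fulfill_order("order-42")
```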
I don’t think what you are describing as heavy is that big of a deal if an external orchestration system is required only for deployment, while the workflow can be developed and tested without a server on a laptop or notebook.
Bringing orchestration logic into the app layer means there is more code being bundled with the app, which has its own set of tradeoffs, like bringing in a different set of code dependencies that might conflict with application code.
In 2025, I would be surprised if a good workflow engine didn’t have a completely server-less development mode :)
> In some sense, external orchestration turns individual applications into distributed microservices, with all the complexity that implies.
I'd argue that durable execution is intrinsically complex and external orchestrators give you tools to manage that complexity, whereas this attempts to brush the complexity under the rug in a way that does not inspire confidence.
The embedded approach seems to ignore that most modern systems are distributed now and an operation will span multiple services. So an external system is better for that use case, I would say.
Where is the state stored? What if I have 5 copies of my service now, and 3 later? What happens to the "embedded state"? Was it on the container? Did it just evaporate? Was it on a volume? Is it stuck in limbo? Even if it was put into S3 it would be stuck in limbo.
DBOS solves this problem by storing state in Postgres, which is really good at coordinating multiple copies of the same service. Essentially, Postgres does the hard parts of external orchestration, letting you work with a simple library abstraction.
> In some sense, external orchestration turns individual applications into distributed microservices, with all the complexity that implies.
While I'm not entirely convinced by the notion that distributed microservices inherently increase complexity, I do see significant benefits in how they empower workflows that span multiple projects and teams. For instance, in Temporal, different workers can operate in various programming languages, each managing its own specific set of activities within a single workflow. This approach enhances communication and collaboration between diverse projects, allowing them to leverage their unique tech stacks while still working together seamlessly.
My reaction is "No way, not again!!" I have personally done this internal orchestration at scale at a large enterprise, spanning millions of executions, and it has scalability problems. We eventually externalized it to bring back sanity.
Love it! I built a toy library that looked very similar to this one a few months ago. How does this handle changing the workflow code? I quite like how Temporal handles it, where you use an "if has(my_feature)" check to allow in-progress workflows to be live-updated, even in the middle of loops. I also introduced an idea of "object handles", something like a file descriptor: an opaque handle to the workflow function which can be given to a step function to be unwrapped, and which can be persisted and restored via a consistent ID.
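Roughly, the "if has(my_feature)" trick works by recording the branch decision the first time a workflow reaches it, so replays of old executions keep taking the old branch. A hypothetical sketch of that idea (not Temporal's actual API):

```python
DEPLOYED_FEATURES = {"new-pricing"}   # features present in the currently deployed code

def has(journal: dict, feature: str) -> bool:
    if feature not in journal:                        # first execution decides...
        journal[feature] = feature in DEPLOYED_FEATURES
    return journal[feature]                           # ...and replays reuse the decision

journal: dict = {}   # in a real engine this is persisted with the workflow's history
if has(journal, "new-pricing"):
    print("new code path")
else:
    print("old code path")
```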
I think the example given in this blog post might need a "health warning" that steps should, generally, be doing more than just printing "hello".
I can imagine that the reads and writes to Postgres for a large number of workflows, each with a large number of small steps called in a tight loop, would cause some significant performance problems.
The examples given on their main site are a little more meaningful.
Yes, that's totally fair. Usually, a step is a meaningful unit of work, such as an API call that performs an external state modification. Because each step is a fair chunk of work, and the overhead is just one write per step, this scales well in practice--as well as Postgres scales, up to 10K+ operations/second.
Your intuition is good - having worked with something similar to this, it works great but does not scale very well. The step journaling is pretty brutal on Postgres/RDBMS, and you hit vertical scaling limits quicker than you would like.
I have never used it, but a predecessor of mine talked about Clipper a lot, and I believe it allowed remote execution blocks tied to a storage backend; in this case I'm talking about xBase languages ...
I think Rebol also supports remote execution blocks ...
The solution seems to be solving for the simplest use case (internal stateless functions) rather than the most complex use case (external state-impactful functions).
Furthermore, the words used aren't really what they should be talking about.
>> Because workflows are just Python functions, the thread can restart a workflow by simply calling the workflow function with its original inputs and ID, retrieved from Postgres.
>> For this model to work, we have to make one assumption: workflow functions must be deterministic.
Yes, "deterministic"... because all state modification is aligned to ensure that.
If instead of a single print() the function/step had 2 print()'s, the state leaks and the abstraction explodes.
The right abstraction here is probably something more functional / Rust-like, where external state-modification sections are explicitly decorated (either automatically or by a developer).
That's exactly what this model is! The @Step decorator is for external state modifications. Then @Workflows orchestrate steps. The example shows the simplest possible external state modification--a print to the terminal.
Steps can be tried multiple times (if a failure happens mid-step) but never re-execute once complete. Since idempotency can't be added externally, that's the strongest possible guarantee any orchestration system can give you (and if your step is performing an idempotent operation, which is the safest thing, you can use the workflow ID as an idempotency key). More details in the docs: https://docs.dbos.dev/python/tutorials/workflow-tutorial#rel...
You're forcing adopters to divide any state-impactful activity into its own function (because only functions can be decorated with step, no?). That's seriously inelegant when scaled to larger codebases.
Regional tagging (e.g. safe/unsafe) would be a better approach, as it would allow developers to more naturally protect code, without redefining its structure to suit your library.
You start to grok the problem here, but primarily think about it in terms of databases, which are just one (admittedly common) type of external state:
>> If you need to perform a non-deterministic operation like accessing the database, calling a third-party API, generating a random number, or getting the local time, you shouldn't do it directly in a workflow function. Instead, you should do all database operations in transactions and all other non-deterministic operations in steps.
Note: I think you should really change "all" to "each in a separate transaction/step" there, to communicate what you're recommending.
As a thought exercise: imagine a Python program that automates a third party application via the GUI. Some UI actions cannot be undone (e.g. submit). Some are repeatable without consequence (e.g. navigating between screens).
How would your framework support that?
Because if you can efficiently support the pathological leaky-state case, you can trivially support all simpler cases.
Yeah, this definitely requires splitting state-impactful activity into its own function. That's good practice anyways, though I understand it might be a pain in large codebases. Regional tags are definitely an interesting alternative!
For the UI example, I don't think you'd use durable execution for most of the UI--it's just not needed. But maybe there's one button that launches a complex asynchronous background task, and you'd use durable execution for that (with careful workflow ID management to ensure idempotency and allow you to retrieve the status of the background task).
> splitting state-impactful activity into its own function.
Each separate state-impactful activity into its own function, no?
> For the UI example...
I wasn't talking about building a GUI in a Python program: I was talking about using Python to automate an external GUI app.
Because that toy example (which also happens to be done in the real world!) encapsulates a lot of the state issues that don't seem well-handled by the current design.
Namely, that external interactions are a mix between stateless and stateful operations.