
Netflix Conductor: Open-source workflow orchestration engine - swyx
https://netflix.github.io/conductor/
======
kelnos
I set up Conductor where I work while evaluating workflow engines, and overall
wasn't too happy with it. The default datastore is this Netflix-specific thing
(Dynomite) that's built on top of Redis. It's not particularly easy to
integrate operationally into non-Netflix infrastructure, and Conductor itself
has hard dependencies on several services.

The programming model for workflows/tasks felt a little cumbersome, and after
digging into the Java SDK/Client, I wasn't impressed with the code quality.

We did have some contacts at Netflix to help us with it, but some aspects
(like Dynomite itself, and its sidecar, dynomite-manager) felt abandoned, with
unresponsive maintainers.

We've started using Temporal[0] (née Cadence) recently, and while it's not
quite production-ready, it's been great to work with, and, just as critically,
very easy to deal with operationally in our infrastructure. The Temporal folks
are mostly former Uber developers who worked on Cadence, and since they're
building a business around Temporal, they've been much more focused and
responsive.

[0] [https://temporal.io/](https://temporal.io/)

~~~
doktorhladnjak
The Temporal founders worked on AWS SWF too before building Cadence at Uber.
They have a lot of experience in this area and are making the product better
with each iteration no doubt. I enjoyed using Cadence at Uber and definitely
sad about not having it at my current company.

One of the founders, mfateev, is around elsewhere on this thread answering
questions.

------
freeqaz
Workflows and orchestration are my jam -- that's what we're trying to simplify
over at [https://refinery.io](https://refinery.io)

Conductor is a cool piece of tech, and it's a well-established player in a
rapidly growing space for workflow engines.

I used to work at Uber and that company had microservice-hell for a while.
They built the project Cadence[0] to alleviate that. It is similar to
Conductor in many ways.

One project to watch out for is Argo[1] which is a CNCF-backed project.

There are also some attempts[2] to standardize the workflow spec.

Serverless adds a whole new can of worms to what orchestration engines have to
manage, and I'm very curious to see how things evolve in the future.
Kubernetes adds a whole dimension of complexity to the problem space, as well.

If anybody is interested in chatting about microservice hell or complex state
machines for business logic, I'd be excited to chat. I'm always looking for
more real world problems to help solve (as an early stage startup founder) and
more exposure to what others are struggling with is helpful!

0: [https://github.com/uber/cadence](https://github.com/uber/cadence)

1: [https://argoproj.github.io/argo/](https://argoproj.github.io/argo/)

2:
[https://serverlessworkflow.github.io/](https://serverlessworkflow.github.io/)

~~~
jayd16
Hey, wait a second. Are these just a modern incarnation of the enterprise
service bus? Is there a significant difference?

~~~
extrapickles
Yes. The only difference from what I can tell is that each service/node on the
bus is a VM/Container rather than a bespoke built machine.

~~~
swyx
dragonwriter below you says the opposite - based on a plane-level analysis.
would be interested if that changes your mind.

~~~
extrapickles
The real answer is that it depends on what responsibilities you think an ESB
or workflow orchestration engine should have. dragonwriter is of the
reasonable opinion that an ESB's focus should be the message plane. If you're
of my opinion, ESBs generally include some of the features found in workflow
engines. devonkim is also right in that ESBs generally let the business
data/process bleed into the rest of the stack, whereas a workflow engine is
completely agnostic to the business data/process.

The most accurate answer would be from comparing various ESB products to
Workflow Orchestration products, as every vendor/product will have slightly
different opinions on where their responsibilities are.

~~~
swyx
fair enough. appreciate the thoughts!

------
theptip
Quick notes from skimming the docs:

* Conductor implements a workflow orchestration system which seems at the highest level to be similar to Airflow, with a couple of significant details.

* There are no "workers", instead tasks are executed by existing microservices.

* The Orchestrator doesn't push work to workers (e.g. Airflow triggering Operators to execute a DAG), instead the clients poll the orchestrator for tasks and execute when they find them.
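To make the pull model concrete, here is a rough sketch in Python of what a worker-side poll loop looks like. This is purely illustrative -- the in-memory orchestrator and the `poll`/`update` method names are made up for the example, not the actual Conductor client API:

```python
import time

class FakeOrchestrator:
    """In-memory stand-in for the orchestrator; a real worker would
    poll an HTTP endpoint on the Conductor server instead."""

    def __init__(self, tasks):
        self.queue = list(tasks)
        self.results = {}

    def poll(self, task_type):
        # Hand out the next pending task of this type, or None.
        for i, task in enumerate(self.queue):
            if task["type"] == task_type:
                return self.queue.pop(i)
        return None

    def update(self, task_id, output):
        # The worker reports the task result back to the orchestrator.
        self.results[task_id] = output

def worker_loop(orch, task_type, handler, max_polls=5):
    # Each microservice runs a loop like this for the task types it owns.
    for _ in range(max_polls):
        task = orch.poll(task_type)
        if task is None:
            time.sleep(0)  # a real worker would back off here
            continue
        orch.update(task["id"], handler(task["input"]))

orch = FakeOrchestrator([{"id": "t1", "type": "encode", "input": 21}])
worker_loop(orch, "encode", lambda x: x * 2)
print(orch.results)  # {'t1': 42}
```

The appeal of this inversion is that workers need no inbound network exposure; the trade-off is that stubbing out the orchestrator for local testing becomes harder.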

My hot take:

If you already have a very large mesh of collaborating microservices and want
to extract an orchestration layer on top of their tasks, this system could be
a good fit.

Most of what you're doing here can also be implemented in Airflow, using an
HTTPOperator or GRPCOperator that triggers your services to initiate their
task. You don't get things like pausing though. On the other hand, you do get
the ability to run simple/one-off tasks in an Airflow operator, instead of
having to build a service to run your simple Python function.

I'm unsure on whether push/pull is better; I think it largely depends on your
context. I'm inclined to say that for most cases, having the orchestrator push
tasks out over HTTP is a better default, since you can simply load-balance
those requests and horizontally scale your worker pool, and it's easier to
test a pipeline manually (e.g. for development environments) if the workers
respond to simple HTTP requests, instead of having to provide a stub/test
implementation of the orchestrator. (In particular I'm thinking about "running
the prod env on your local machine in k8s" -- this isn't practical at Netflix
scale though.)

~~~
dilly_li
Is there a workflow tool that’s designed with micro service in mind?

My particular use case:

* several workers process the data on the workers’ local threads

* several workers serve as relays to interface with external third party
services, hold all the necessary credentials, and conduct cron-like checking

* the ETL tool doesn’t directly provision these workers

The second point is part of the reason why we don’t want to use Airflow’s k8s
operator.

But it doesn’t seem like there is a better option in terms of common use and
robustness. So we are leaning towards writing some custom operators and
sensors to make Airflow more friendly to microservices.

Thoughts?

------
tupac_speedrap
We've used Conductor at my workplace for about a year now. The grounding is
pretty solid but the documentation is pretty pants once you dig into it. We
have to resort to digging through GitHub issues to find fairly fundamental
features that aren't really documented. I feel Conductor is something Netflix
has open-sourced and then sort of dumped on the OSS community.

For example, there aren't any examples of how to implement workers using
their Java client; we had to dig up a blog post to do that. Although it is
fairly simple, a very basic example of implementing the Worker interface
would be nice.

They also do not make clear the exact relationship between tasks and
workflows, and it's hard to find any good examples of relatively complex
workflows and task definitions on the internet other than Netflix's barebones
documentation and the kitchen-sink workflow they provide, which is broken by
default on the current API.

Also, the configuration mentions many fields that are pretty much
undocumented; for example, you can swap out your persistence layer for
something else, but I would have no idea how that works.

------
TheColorYellow
Surprised to see Camunda isn't mentioned here more.

Open-Source BPMN compliant workflow processing with a history of success.
Goldman Sachs supposedly runs their internal org with it.

Slightly different target use case, but Camunda has really shined in
microservices orchestration and I find implementing complex workflow and
managing task dependencies much easier with it.

~~~
swyx
do you have some recommended resources to learn more about BPMN? what's your
take on BPMN vs other approaches? (JSON or cadence/temporal style "workflow as
code")

------
lis
Very interesting. Looks a lot like Zeebe [0], which uses BPMN for the
workflow definition. This makes it easier to communicate the processes to the
rest of the company. I never used it in production, just played around with
it for a demo.

[0] [https://zeebe.io/](https://zeebe.io/)

~~~
theptip
I've looked at Zeebe, and Camunda too - likewise, just in a demo capacity.

Interested in folks' experiences deploying these tools, as this sounds like a
potentially very useful way of modeling business workflows that span multiple
services.

~~~
MrSaints
I've used Conductor, Zeebe, and Cadence all in close to production capacity.
This is just my personal experience.

Conductor's JSON DSL was a bit of a nightmare to work with in my opinion. But
otherwise, it did the job OK-ish. Felt more akin to Step Functions.

Arguably, Zeebe was the easiest to get started with once you get past the
initial hurdle of BPMN. Their model of job processing is very simple, and
because of that, it is very easy to write an SDK for it in any language. The
biggest downside is that it is far from production ready, and there are
ongoing complaints in their Slack about its lack of stability, and relatively
poor performance. Zeebe does not require external storage since workflows are
transient, replicated using its own RocksDB and Raft set-up. You need to
export and index the workflows if you want to keep a history of them or even
if you want to manage them. It is very eventually consistent.

With both Conductor, and Zeebe however, if you have a complex enough online
workflow, it starts getting very difficult to model them in their respective
DSLs. Especially if you have a dynamic workflow. And that complexity can
translate to bugs at an orchestration level which you do not catch unless you
run through the different scenarios.

Cadence (Temporal) handles this very well. You essentially write the workflow
in the programming language itself, with appropriate wrappers / decorators,
and helpers. There is no need to learn a new DSL per se. But, as a result,
building an SDK for it in a specific programming language is a non-trivial
exercise, and currently, the stable implementations are in Java, and Go.
Performance, and reliability wise, it is great (relies on Cassandra, but there
are SQL adapters, though, not mature yet).

We have somewhat settled on Temporal now having worked with the other two for
quite some time. We also explored Lyft's Flyte, but it seemed more appropriate
for data engineering, and offline processing.

As mentioned elsewhere here, we also use Argo, but I do not think it falls in
the same space as the workflow engines I have mentioned (which handle the
orchestration of complex business logic a lot better, as opposed to simple
pipelines for things like CI/CD or ETL).

Also worth mentioning is that we went with a workflow engine to reduce the
boilerplate, and time / effort needed to write orchestration logic / glue
code. You do this in lots of projects without knowing. We definitely feel like
we have succeeded in that goal. And I feel this is an exciting space.

~~~
theptip
Thanks for the thoughtful reply, this is very useful.

The concept of having business users able to review (or even, holy grail,
edit/author) workflows was one of the potentially appealing aspects of the
BPMN products; did you get a signal on whether there were any benefits? "the
initial hurdle of BPMN" sounds like maybe this isn't as good as it seems on
the face of it?

Also, how do you go about testing long-lived workflows? Do any of these
orchestrators have tools/environments that help with system-testing (or even
just doing isolated simulations of) your flows? I've not found anything
off-the-shelf for this yet.

~~~
MrSaints
You raised a pretty good point about being able to review the BPMN. I did not
immediately think of this, but now that you have mentioned it...

1. It was good for communicating the engine room

I remember demo'ing the workflows within my team, and to non-technical
stakeholders. It was very easy to demonstrate what was happening, and to
provide a live view into the state of things. From there, it was easy to get
conversations going, e.g. about how certain business processes can be extended
for more complex use-cases.

2. It empowered others to communicate their intent

Zeebe comes with a modeller which is simple enough even for non-technical
users to stitch together a rough workflow. The problem is, the end-result
often requires a lot of changes to be production-ready. But I have found that
this still helps communicate ideas, and intent.

You do not really need BPMN for this, but if this becomes the standard
practice, now you have a way of talking on the same wavelength. In my case, we
were productionising ML pipelines so data scientists who were not incredibly
attuned to data engineering practices, and limitations, were slowly able to
open up to them. And as a data engineer, it became clearer what the
requirements were.

On the point about testing, the test framework in Zeebe is still a bit
immature. There are quite a few tools / libraries in Java, but not really in
other languages. The way we approached it was lots of semi-auto / manual QA,
and fixing live in production (Zeebe provides several mechanisms for
essentially rescuing broken workflows).

The testing in Cadence / Temporal is definitely more mature. But you do not
have the same level of simplicity as Zeebe. That said, the way I like to see
it / compare them, you could build something like Zeebe or even Conductor on
Cadence / Temporal, but not vice versa.

------
shadykiller
Can someone explain how and where to use a Workflow Orchestration Engine ?

~~~
juancampa
These are useful for tasks that can last an arbitrarily long amount of time.
Think about the process of signing up a user and waiting for her to click on
an email verification link. This process can literally never end (the user
never clicks) but more commonly it takes a few minutes.

It's easier to implement these things if you can write the code like:

    await sendVerificationEmail();
    await waitForUserToClick();   // This could take forever
    await sendWelcomeEmail();

If you do the above in a "normal" program, said program could stay in memory
forever (consuming RAM, server can't restart, etc). The workflow engine will
take care of storing intermediate state so you can indeed write the above
code.

The other option is to implement a state machine via your database and some
state column, but the code doesn't look as pretty as the above three lines.
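For contrast, here is a rough sketch of that database-backed state machine (hypothetical state names and stubbed email functions; a plain dict stands in for the user's database row):

```python
def send_verification_email(user):  # stub standing in for a real email call
    user.setdefault("emails", []).append("verify")

def send_welcome_email(user):  # stub standing in for a real email call
    user.setdefault("emails", []).append("welcome")

def handle(user):
    # Advance the sign-up state machine one step, based on a state column.
    if user["state"] == "NEW":
        send_verification_email(user)
        user["state"] = "AWAITING_CLICK"
    elif user["state"] == "CLICKED":
        send_welcome_email(user)
        user["state"] = "DONE"
    # While in AWAITING_CLICK there is nothing to do; a click handler
    # elsewhere must flip the state to CLICKED and re-invoke us.

user = {"state": "NEW"}
handle(user)               # sends verification, now AWAITING_CLICK
user["state"] = "CLICKED"  # simulate the user clicking the link
handle(user)               # sends welcome, now DONE
print(user["emails"])  # ['verify', 'welcome']
```

The logic is the same three steps, but it is now smeared across dispatch branches and external triggers -- which is exactly the ugliness the workflow engine hides.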

Note that this particular tool seems to be more declarative than my example
above (it uses JSON to define the steps), so instead of using an `if`
statement, you'd need to declare a "Decision".

Hope this helps!

~~~
vvladymyrov
So is it like
[https://github.com/uber/cadence](https://github.com/uber/cadence), AWS SWF,
Amazon Step Functions?

~~~
ecnahc515
Yes, they all solve similar problems. This is a space with a lot of differing
requirements, so they all share some common features, but many have different
strengths for various types of tasks.

------
ramon
I prefer the Power Automate / Logic Apps interface. It would be cool if there
were an open-source Power Automate; imagine the number of plugins for that
cloud tool that would come up. It's a valuable tool and part of the O365
ecosystem, and it could be even greater. More strategy and vision in that
product would make O365 and Azure a leader in components and integrations,
which is the most valuable thing in the end.

~~~
catmanjan
There is, it's called Apache NiFi

It's ripe for someone to turn into a SaaS

~~~
ramon
Dude, awesome tip, just saw it now, thank you! Great project! Watching the
videos now... this is groundbreaking stuff with surprisingly little marketing
behind it so far... I'm impressed with NiFi.

------
ForHackernews
Does this have the same limitations as Airflow? How does it compare to
something like Prefect?[0]

[0] [https://medium.com/the-prefect-blog/why-not-
airflow-4cfa4232...](https://medium.com/the-prefect-blog/why-not-
airflow-4cfa423299c4)

~~~
thundergolfer
That blog post is a good rundown of the problems with Apache Airflow.

------
jiehong
We started using it around 2016 in the company I work for. We decided to use
it to automate the often manual setup of new clients for each product. It
grew to use our own security and rights system, and we also added support for
a different database (which we are working on open-sourcing). We also changed
the JSON API to conform to our company-wide standard.

At the time, we wanted something that we could host ourselves, that was
maintained and open source, and that worked!

Nowadays internal teams also use it to automate their own processes as well.

We’d probably go for a “push based” workflow engine next time, maybe based on
events, mainly for latency and load reasons, but it’s something we’re OK with
so far (there is a way to listen to events for some tasks, but it’s not that
easy).

If I’m not mistaken, Netflix uses it to automate video encoding for shows, but
that might be outdated.

Overall, we’re pleased with it. But here are some cons: we wish we could
split some services out (such as read-only ones, or the workflow definitions
from the executions), but the code isn’t architected for such an easy split.
For example, when a worker pushes the result of a task computation, that
triggers the current workflow to determine the next task to schedule, but it
does this internally, not through the defined interfaces. Securing the API is
not so easy either, as it’s not really modular (unlike the database
implementation, which is great). That point is being worked on though, so
there is some hope for the future.

------
f0rr0
I have been exploring workflow orchestration for some time now - specifically
Temporal. Temporal's authors don't recommend it for very high-throughput
(per-workflow) use cases, although I haven't benchmarked it myself. Also,
using it in a SaaS environment, I would prefer a serverless deployment
strategy that ideally allows scaling down to zero.

I have my eyes on Flink Stateful Functions
[http://statefun.io/](http://statefun.io/). The abstractions are quite
low-level compared to Temporal, but the overall ability to write
tasks/activities as serverless functions with access to state is quite
attractive.

Would be happy to talk to someone who has explored this further.

------
yeswecatan
For better or worse, we ended up creating our own workflow engine at my
company. Unfortunately, everyone who ends up using it hates it. We've also run
into the problem where the entire process of producing our end product is
encoded in the workflow. Downstream steps depend on earlier steps, etc. If any
part of the process changes, managing this data becomes tough.

Additionally, we have software engineers writing these workflows. Ideally we
would have tooling so that those who know the process can write these things.
The difficulty we have had though is making it easy to join/match up earlier
parts of the process with later steps. We do this now by keeping a lot of data
in the workflow and by occasionally persisting data in other places. Software
engineers, not the process people, are the ones who understand the data model
and how to munge everything together.

Have others dealt with this issue?

~~~
Ataraxy
I've been really interested in creating my own workflow engine for personal
use and I kind of had the same thought while trying to plan it out.

One approach that comes to mind to solve this sort of end-user issue would be
to explicitly define all inputs/outputs for the dataflow of a node as
"requires" and "provides". In this manner, each node would run in parallel
the moment its requirements (i.e. dependencies) are met. Additionally, since
the data each node needs and provides is explicitly defined, you could
technically wire nodes together automatically without really needing to know
the underlying data model itself.

So it just means needing to define clear unique labels for each port. In a UI
a user could just drop nodes into a space which would automatically wire up to
matching ports. You can then display which needs are not met and even have an
interface for choosing matching nodes that fit what might be missing.

In the end all you would really need to know is how to compose the pieces of
logic to get the desired outcome.
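A minimal sketch of that requires/provides wiring (all names here are hypothetical; a shared dict stands in for the dataflow, and a node runs as soon as everything it requires is present):

```python
def run_workflow(nodes, context):
    # nodes: name -> (requires, provides, fn); fn takes its requirements
    # as keyword arguments and returns one value per "provides" label.
    done = set()
    progress = True
    while progress:
        progress = False
        for name, (requires, provides, fn) in nodes.items():
            if name in done or not requires <= context.keys():
                continue  # already ran, or not ready yet
            outputs = fn(**{r: context[r] for r in requires})
            context.update(zip(provides, outputs))
            done.add(name)
            progress = True
    return context

nodes = {
    "fetch":  (set(),        ["raw"],  lambda: (["a", "b"],)),
    "count":  ({"raw"},      ["n"],    lambda raw: (len(raw),)),
    "report": ({"raw", "n"}, ["text"], lambda raw, n: (f"{n} items",)),
}
ctx = run_workflow(nodes, {})
print(ctx["text"])  # 2 items
```

A real engine would run ready nodes in parallel and persist the context between steps, but the scheduling rule -- fire when your requirements appear -- is the whole trick.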

I'm a novice at this stuff though so take that with a grain of salt.

~~~
yeswecatan
Yup, in theory it makes sense. We do define inputs/outputs for each task and
workflow, but it's somewhat crude; for instance, we just check if the
variable name will be available, but not whether a specific key is in a
dictionary. We can definitely improve this, though, with schemas (probably
JSON Schema) and validation.

Wiring this together _without any user input_ (which is our goal) though is
very, very hard. Let's say we have two parallel branches -- one that orders
boxes with holes in them and another that orders shapes that can fit in those
holes. When both orders are in, the workflow can join and we want to put the
shapes into boxes. How do we know which shape can go in each box? Maybe we
have a BoxType with a list of possible shapes. This gets very complicated
though when you have many attributes and care about different attributes at
different stages of the process. Additionally, if the process should update
the database at some point to, say, change a flag from `false` to `true` the
user would need to know the underlying data model.
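As a toy illustration of that BoxType idea (entirely hypothetical names), a greedy matcher over an allowed-shapes table might look like:

```python
def match(boxes, shapes, allowed):
    # Pair each box with the first remaining shape its type accepts.
    pairs, remaining = [], list(shapes)
    for box in boxes:
        for shape in remaining:
            if shape in allowed[box]:
                pairs.append((shape, box))
                remaining.remove(shape)
                break
    return pairs

allowed = {"round_box": {"circle"}, "square_box": {"square", "circle"}}
print(match(["round_box", "square_box"], ["square", "circle"], allowed))
# [('circle', 'round_box'), ('square', 'square_box')]
```

Even this toy version shows the problem described above: the matching rule lives in code that understands the data model, not in the workflow definition.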

~~~
Ataraxy
So if it's a long running task that's definitely a more complex problem that
would require some sort of queue/pause/resume system presumably but I imagine
the same concept could still apply.

The node that merges context would simply have a requirement/dependency on the
results "provided" by each branch. It would wait to execute until those
requirements were met.

The user still wouldn't need to know about the data model; the "merge node"
would just intrinsically wait for the results provided from the separate
branches.

Fundamentally, when each node completes, the system itself just needs to
check against the list of nodes to see if their dependencies have just been
met, and keep track of which nodes have already been executed so that they
don't end up getting triggered again when the next check happens.

There will always be a need for some sort of workflow-level state or context
management that governs all of this orchestration, which you would want to
persist to a database somewhere if this is a long-running workflow, but that
is a systems concern and the user doesn't need to know about it.

That was just a long way of me more or less saying that it doesn't matter how
many branches there are; all that matters ultimately is that a node waits to
execute until its requirements are met.

~~~
yeswecatan
I guess what I'm trying to say is that the logic in the merge node is where
the magic happens. So you have parallel branches, A and B. A orders boxes
while B orders shapes. Somewhere in A and B we created Box and Shape instances
and those are outputs of A and B, respectively. The merge node, C, waits until
A and B are completed and takes their outputs as input. The merge node needs
to know how to match up the Box and Shape instances to say shape 2 can fit in
box 1.

------
iblaine
Since this has come up several times: Airflow is an orchestration tool for
ETL jobs (long-running, complex processes), and Netflix Conductor is an
orchestration tool for microservices (short-running, simple processes).

------
jayd16
Seems neat. I guess this partially solves the problem of having some workflow
stuck/dropped.

I wonder how much overhead there is. How much latency does each task cause?

Is it feasible to complete workflows while a user/client is waiting for a
RESTful response?

------
f0rr0
There is also
[https://github.com/dapr/workflows](https://github.com/dapr/workflows) which
uses Azure Logic Apps engine on
[https://github.com/dapr/](https://github.com/dapr/)

------
dmead
> Almost no way to systematically answer “How much are we done with process
> X”?

is this a typo?

------
The_rationalist
Any difference with airflow?

------
dang
If curious see also

2016
[https://news.ycombinator.com/item?id=13174743](https://news.ycombinator.com/item?id=13174743)

------
realistcake
It's great to see corporations getting more involved in open source software;
giving back and empowering the developer community.

------
lewisjoe
can somebody ELI5, why would someone need such a workflow orchestration
engine? What problems are best solved with workflow engines?

~~~
swyx
really good answer in another comment
[https://news.ycombinator.com/item?id=24215303](https://news.ycombinator.com/item?id=24215303)

------
itpragmatik
How does this compare with Cadence/Temporal?

~~~
ttsda
You need to specify a DAG, rather than just writing regular code

------
dahfizz
I wonder what the relationship is to StackStorm[1]? StackStorm is older and
lists Netflix as a sponsor / user.

[1] [https://stackstorm.com/](https://stackstorm.com/)

------
iamAtom
Comparison with Airflow will be helpful.

