
Ask HN: How do you handle long-running workflows at your company? - superzamp
The canonical answer to this question apparently used to be ESBs, but the rise of the microservice paradigm eventually pushed them into decline and left a void I'm not sure how is currently filled.

HN, how do you handle your days-long sequences of business steps?

Some seed questions:

* Is your system more P2P or orchestrated?

* Do you leverage some existing tools or build your own?

* Are you confident in your monitoring of errored workflows?

* How do you retry errored workflows?

* If your system is more P2P, how do you keep a holistic view of what's happening? Can you be certain that you don't have any circular event chains?
======
vorpalhex
Our main bus for microservices is a RabbitMQ cluster. Most services have their
own isolated write store and read store (which might be a true read store, or
just a db replica).

Long running jobs are a rarity, so we usually spin up a new RabbitMQ cluster
and services, but tie those services back to the main write/read stores. This
allows regular operations to still occur, but we can monitor the bulk process
and commit resources to it in a more isolated fashion.

Errors end up in error queues in Rabbit, and can be dumped back in to be
reprocessed if appropriate (or just ignored if it's a side effect we don't
care about).
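The dump-errors-back-in pattern above can be sketched in a few lines of plain Python (an in-memory stand-in for the Rabbit queues; the names are illustrative, not the pika or RabbitMQ API):

```python
from collections import deque

def process(queue, error_queue, handler):
    """Drain the work queue; messages whose handler raises land in the
    error queue instead of being lost (mirroring dead-lettering)."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
        except Exception:
            error_queue.append(msg)

def reprocess(error_queue, queue):
    """Dump errored messages back into the main queue for a retry."""
    while error_queue:
        queue.append(error_queue.popleft())

def handler(msg):
    if msg == "bad":
        raise ValueError("unprocessable")

work, errors = deque(["a", "bad", "b"]), deque()
process(work, errors, handler)   # "bad" ends up in the error queue
reprocess(errors, work)          # requeue it for another attempt
```

In real RabbitMQ the same effect is usually achieved with dead-letter exchanges, but the flow is the same: failures are parked, inspected, and either requeued or ignored.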

Once it's set up and running, it works well enough. Spinning up a new Rabbit
cluster and service instances is currently manual, but since we've moved to
Kubernetes I'm hoping this can be automated almost entirely.

------
devedlee
We developed and use Argo
([https://github.com/argoproj/argo](https://github.com/argoproj/argo)), a
Kubernetes-native workflow engine. Argo is currently used by companies like
Cyrus Biotechnology, Gladly, Google, Intuit, and NVIDIA. Currently collecting
use cases and requirements on a Kubernetes-native eventing framework for Argo
([https://github.com/argoproj/argo-
events/issues/1](https://github.com/argoproj/argo-events/issues/1)) to make it
easier to kick off workflows.

~~~
stpedgwdgfhgdd
Does Argo support recovery? In the sense that if a workflow step or the
workflow engine crashes halfway, the last (idempotent) action is retried?

~~~
jessesuen
I work on Argo. The workflow-controller is very tolerant of crashes and was
designed to be this way. Workflow state is captured in the workflow CRD object
(in k8s etcd). Because step names are formulated deterministically, in the
event of a crash (say
before the created pod is persisted in etcd), when the controller restarts and
tries to schedule the pod again, it hits an AlreadyExists error and
understands how to handle this. Thus, workflows are idempotent in crash
scenarios.
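The crash-recovery idea described here is roughly the following (a pure-Python stand-in, not Argo's actual code; `FakeStore` plays the role of etcd and all names are invented):

```python
class AlreadyExists(Exception):
    pass

class FakeStore:
    """Stand-in for etcd: create() refuses a name that already exists."""
    def __init__(self):
        self.objects = {}

    def create(self, name, spec):
        if name in self.objects:
            raise AlreadyExists(name)
        self.objects[name] = spec

def schedule_step(store, workflow, step, spec):
    """Step names are formulated deterministically, so a retry after a
    crash targets the same name and a duplicate create is detectable."""
    name = f"{workflow}-{step}"
    try:
        store.create(name, spec)
    except AlreadyExists:
        pass   # a pre-crash attempt already created it: treat as a no-op
    return store.objects[name]

store = FakeStore()
schedule_step(store, "wf1", "step1", {"image": "alpine"})
# Controller restarts after a crash and retries the same step:
schedule_step(store, "wf1", "step1", {"image": "alpine"})
```

The deterministic name is what turns "create" into an idempotent operation: the second attempt cannot produce a second pod.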

------
andscoop
The only tool I have found that checks all those boxes (and then some) is
Airflow. I liked it so much that I went to work for Astronomer.io, which is
building managed and on-prem solutions for Airflow.

It's not the perfect tool, but we are striving to make it better.

~~~
kiechu
Thumbs up for Airflow. It's great for ETL tasks.

------
rch
Check out Luigi (Python --
[https://github.com/spotify/luigi](https://github.com/spotify/luigi)).

I've built (or worked on) a few bespoke systems myself, but Luigi covers
better than 80% of what I typically need.
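Luigi's core idempotency idea (a task counts as complete when its output target exists) can be sketched without the library itself. This is a hand-rolled approximation of that model, not Luigi's API:

```python
import os
import tempfile

def run_task(output_path, work):
    """Luigi-style completeness check: the task is considered done when
    its output target exists, so re-running is a no-op. Returns True if
    the task actually ran."""
    if os.path.exists(output_path):    # the complete() check
        return False
    tmp = output_path + ".tmp"
    with open(tmp, "w") as f:          # write via a temp file so a crash
        f.write(work())                # never leaves a half-written target
    os.replace(tmp, output_path)
    return True

target = os.path.join(tempfile.mkdtemp(), "report.txt")
first = run_task(target, lambda: "results")
second = run_task(target, lambda: "results")   # skipped: target exists
```

Because completion is derived from the target rather than from bookkeeping, a crashed pipeline can simply be rerun from the top and only the missing pieces execute.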

~~~
tedmiston
There was a good tutorial at PyCon this past weekend called _Workflow Engines
Up and Running_ [1] on Python workflow automation frameworks, specifically
comparing Luigi vs Airflow. The video is on YouTube as well [2].

[1]:
[https://us.pycon.org/2018/schedule/presentation/58/](https://us.pycon.org/2018/schedule/presentation/58/)

[2]: [https://youtu.be/kw0RL9LZk9s](https://youtu.be/kw0RL9LZk9s)

------
52-6F-62
In publishing/media:

Some workflows are shorter than others, but on the journalism side the
workflows tend to bottom out at a day and max out at a few months (for the
workflow itself; ultimately it depends on the weight of the story)...

Most of that is handled above the technology, mind.

The exploration for the right tool(s) is ongoing. I've been leveraged to
build one, but the status of that clandestine project is in flux, to put it
lightly. Not sure if I can elaborate on that right now.

Currently, the needs and preferences vary so much that there are many
different services used, but the company is seeking to centralize some efforts
(like content generation and management) and externalize others (like
distribution).

------
kumaranvpl
In one of my previous companies, we used Airflow (by Airbnb) to schedule and
manage workflows. Previously we used nothing but cron, which turned out to be
inefficient for retrying failed workflows and cancelling the execution of
downstream dependent jobs. Airflow was a great fit for our case. I highly
recommend checking it out.

~~~
tedmiston
We're using Apache Airflow [1] internally as well. It's pretty featureful and
addresses most concerns mentioned by OP, such as orchestration, retries, open
source code reuse, and dependency management. It has primitive monitoring and
alerting, but one needs to bring something external for that today.

Shameless plug - My startup [2] offers Airflow as a SaaS as well as an
enterprise distribution with monitoring tools to build upon core Airflow.

[1]: [https://github.com/apache/incubator-
airflow](https://github.com/apache/incubator-airflow)

[2]: [https://www.astronomer.io/](https://www.astronomer.io/)

------
inoop
Have you looked at AWS Step Functions? [https://aws.amazon.com/step-
functions](https://aws.amazon.com/step-functions)

edit: to add, I would highly recommend using a workflow engine over a
distributed messaging system. With messages it's hard to track where a given
work item is in your pipeline, and it's not always easy to do mass operations
such as just stopping all running workflows (e.g. when you have an outage) and
resuming them later, re-driving failed items from the beginning of the
workflow, etc. Workflow engines typically give you a nice dashboard where you
can do all those things, for free.

~~~
slucha
Do you have some recommendations for workflow engines?

------
Raidion
Two options we use. The first is a database used as a queue for granular
out-of-process work. If something errors, we'll get a notification for that
one record, but the rest will keep processing.

For stuff that we don't need such granularity/replay, we use Amazon's SNS
event framework to trigger different APIs.

Sometimes we do a combination of those, an SNS event triggers a lambda that
puts a record in the database queue, which gets picked up by a job engine and
raises an SNS event that hits an API that sets a record to available.
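The database-as-queue pattern with per-record error handling might look like the following sqlite sketch (the schema and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT,"
             " status TEXT DEFAULT 'pending', error TEXT)")
conn.executemany("INSERT INTO jobs (payload) VALUES (?)",
                 [("ok-1",), ("boom",), ("ok-2",)])

def work(payload):
    if payload == "boom":
        raise RuntimeError("bad record")

# Each record succeeds or fails on its own: one failure is recorded
# (and could trigger a notification) while the rest keep processing.
for job_id, payload in conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending'").fetchall():
    try:
        work(payload)
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?",
                     (job_id,))
    except Exception as exc:
        conn.execute("UPDATE jobs SET status = 'error', error = ?"
                     " WHERE id = ?", (str(exc), job_id))
```

The status column doubles as the replay mechanism: flipping an errored row back to 'pending' retries just that record.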

------
charriu
Our system is based on the Camunda process engine (in a Java EE environment).
There's a central process server (or cluster) running the process engine, with
events to start process instances.

Workflows are defined using bpmn and then executed by the engine. Errors are
reported to the process engine as "Incident", which then show up in the
management ui/apis. These can be retried any number of times.

We also have an older system based on Carnot/Stardust/IPP. This one used JMS
messages everywhere.

~~~
julienmarie
Interesting, I'm looking into Camunda right now for our processes. How would
you describe the experience in terms of adoption and results ?

~~~
seabrookmx
Not OP, but we built a product around it and though Camunda is reliable and
fast enough for our use, the developer experience is pretty gross.

The BPMN gets saved out as an XML document, but the editor doesn't do a good
job of making the format consistent. This makes changes to the BPMN basically
impossible to code review without downloading the old and new copies and
visually inspecting, which is a chore for large workflows. Especially when
variable inputs and outputs require clicking into each node.

Small code snippets in either JS or a Java plugin (jar) can be embedded and
used to massage variables and track state. These are also difficult to code
review and test as you essentially need to write a harness that mimics Camunda
to run them.

All of our new products are using simpler workflows via FaaS and queues
(RabbitMQ). If we ever needed large workflows again I'd lean towards something
like Airflow.

------
alyandon
* Orchestrated, in-house built workflow execution engine.

* Message queue based with service listeners that translate and dispatch messages to individual workers via HTTP requests.

* Workflow execution state is currently backed by RDBMS.

* Infrastructure errors with workflow executions are exceedingly rare and devops can push a button to retry a step if they failed due to a transient condition.

Now, the important bit:

* Retries due to business logic error aren't really a thing unless there is a defined recovery transition for that step in the workflow. This forces people to acknowledge their code did something unexpected (or the workflow definition itself doesn't properly handle all necessary error cases) and fix the underlying issue. Once the root cause is identified and fixed, the workflow instance can be canceled and resubmitted. However, since things that do work usually have side effects, there is sometimes manual cleanup that has to occur that falls on the development team to fix (with assistance from the devops team, if needed). No one likes doing cleanups or getting on devops bad side so there is an incentive to make sure code and workflows are well tested before being released to production.

------
ameyamk
At LinkedIn we heavily use Azkaban for this. (Open source:
[https://azkaban.github.io/](https://azkaban.github.io/)) The Azkaban API can
be used to launch offline computation jobs as necessary; Azkaban provides
monitoring, SLA alerting, restarts of failed jobs, dependency management, etc.

~~~
superzamp
Azkaban really seems to strike the right balance between simplicity and
featurefulness, I'll definitely give it a try! Plus it seems relatively simple
to deploy & maintain.

The documentation often mentions Hadoop and data jobs; have you also used it
for non-data things? Would you by chance have some workflow examples?

~~~
ameyamk
You can use this for any execution. eg. here is a job type to trigger shell
command such as ' echo "hello" '
[http://azkaban.github.io/azkaban/docs/latest/#command-
type](http://azkaban.github.io/azkaban/docs/latest/#command-type)

Note that the execution environment for such jobs is the Azkaban executor
server itself, so you have to take care of resource management (e.g. one job
taking all the RAM on the machine will affect other jobs running on the same
machine).

------
tabtab
I'm going into get-off-my-lawn mode here if you don't mind. I don't see why
this requires a new-fangled technology or buzzword. Just have one or more
status codes or indicators on a given request. The client side or requesting
service(s) can periodically check on the status using polling and/or user
status-update requests. For example, poll automatically every 2 minutes (to
avoid flooding the network), but give the user the option of clicking a button
to check current status.

Give the requester an option of a time-limit, if applicable. If the process
takes too long, the status changes to "timed-out". The client/requester can
then issue a "re-submit" request, if applicable.
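A minimal sketch of the status-plus-timeout protocol described above (field and function names are illustrative):

```python
import time

class Request:
    """A tracked request with a status indicator and a time limit."""
    def __init__(self, time_limit):
        self.status = "processing"
        self.deadline = time.monotonic() + time_limit

def check_status(req):
    """What a poll (or a 'check now' button) returns; flips the status
    to timed-out once the requester's time limit has passed."""
    if req.status == "processing" and time.monotonic() > req.deadline:
        req.status = "timed-out"
    return req.status

def resubmit(req, time_limit):
    """The client can re-submit a timed-out request."""
    if req.status == "timed-out":
        req.status = "processing"
        req.deadline = time.monotonic() + time_limit

req = Request(time_limit=0.01)
time.sleep(0.05)
status_after = check_status(req)   # the poll observes the timeout
resubmit(req, time_limit=60)
```

As the comment says, the same status-field protocol works whether the backend is an ESB, a microservice, or a stored procedure.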

The technique is pretty much the same whether using ESB, microservices, Stored
Procedures, or carrier pigeons.

~~~
tedmiston
The benefits of the frameworks become more tangible as the structure of your
workflows grows more complex, with (acyclic) dependencies and distributed
execution.

~~~
tabtab
True, but what if the "grow complex" step doesn't happen? YAGNI. If you are in
a domain that needs complex workflows, I can see selecting a workflow
framework. But I've seen some really ugly frameworks where one puts in
Cadillacs for every part when Chevys would do just fine 98% of the time. Staff
have to learn, understand, maintain, and tune complex frameworks.

~~~
tedmiston
I understand the concern. It's kind of like using a web framework — you
probably wouldn't write your own form processing and ORM code to avoid
injection attacks when it's already been built. Airflow has analogous features
and protections for tasks/workflows.

Most companies adopting Airflow already have workflow requirements like this,
even if it's just a single transform or moving data from one system to
another.

Even if you have just a two-step workflow with Task B dependent upon success
of Task A, Airflow offers protection, historical stats, email alerting, etc
over trying to schedule successive cron jobs with built-in assumptions,
hacking together a dependency system, etc.
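The guarantee being described (run Task B only after Task A succeeds, with retries) can be sketched in plain Python. This is an illustration of the concept a workflow engine enforces, not Airflow code:

```python
def run_with_retries(task, retries):
    """Run a task, retrying up to `retries` extra times on failure."""
    for _ in range(retries + 1):
        try:
            task()
            return True
        except Exception:
            continue
    return False

def run_pipeline(task_a, task_b, retries=2):
    """Task B runs only if Task A eventually succeeded; otherwise the
    downstream work is never attempted."""
    history = []
    if run_with_retries(task_a, retries):
        history.append("A:success")
        ok_b = run_with_retries(task_b, retries)
        history.append("B:success" if ok_b else "B:failed")
    else:
        history.append("A:failed")
    return history

attempts = {"n": 0}
def flaky_a():
    attempts["n"] += 1
    if attempts["n"] < 2:          # fails once, then succeeds
        raise IOError("transient failure")

result = run_pipeline(flaky_a, lambda: None)
```

Two cron jobs scheduled an hour apart only *assume* this ordering; a dependency-aware runner enforces it and records the history.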

To me, Airflow is the Honda of this domain. Overall it's a relatively small
and simple framework from the DAG author's perspective.

------
Lord_Zero
We let the web application do it and pray the web server doesn't croak
mid-job, which it usually does.

------
dalacv
Check out Pipefy.com - It is like Trello + Customizable workflow:

[https://d2qfyj0q2n9d96.cloudfront.net/uploads/2017/08/email-...](https://d2qfyj0q2n9d96.cloudfront.net/uploads/2017/08/email-
messaging.gif)

[https://downloads.intercomcdn.com/i/o/55498996/caa3b5f8a6334...](https://downloads.intercomcdn.com/i/o/55498996/caa3b5f8a6334e2abad60849/email-
inbox-1.gif)

[https://downloads.intercomcdn.com/i/o/58246710/bf1485442ffb1...](https://downloads.intercomcdn.com/i/o/58246710/bf1485442ffb1b8c93387d77/phase-
settings-00.gif)

~~~
JohnnyConatus
Have you tried integrating home-grown services into their workflows? Curious
as otherwise it looks pretty good.

------
crispyporkbites
We basically email stuff around and then when it gets stuck somewhere follow
up with another email / conference call to move it forward again. If it keeps
getting stuck or doesn't move it's obviously not an important process so it
falls out of the loop.

------
spapas82
In my previous job (banking) we were using Appian for all our workflows. It
was a strange beast of a Java UI application with a K/kdb core and database.

Its UI was really good, much better than Activiti and similar BPM systems;
you could create a rather complex workflow with almost no code, just by
creating your BPMN flow through the built-in editor. Also, the editor and the
rules system were built into the web UI, so you didn't have to use external,
Eclipse-ish tools unless you wanted to write custom BPMN nodes, mainly for
integration with external systems. Errors and retries were handled through
BPMN.

The main problem was the K core: because nobody knew how to write K we relied
on the Java API for access to the kdb database (actually, messing with the kdb
directly was not even supported by Appian, so even if somebody was willing to
learn, the bank wouldn't let him mess with the kdbs). Because of the API's
restrictions, this resulted in having to manually edit a couple of hundred
live process instances to change a task assignee or skip a non-working custom
node... Also, because kdbs are stored in memory, we needed a very large amount
of RAM on the server, which grew proportionally with the process instances.

Even with these shortcomings, I still think it was a good product, much
better than other workflow solutions like Activiti or jBPM or Alfresco. One
last thing: Appian was way too expensive; don't consider it if you are not a
bank...

------
jodison
We use Apache Oozie ([http://oozie.apache.org/](http://oozie.apache.org/)) an
orchestration system for Hadoop. We don't run days-long workflows, but we run
some that have over a dozen steps, and I have no reason to believe Oozie
couldn't handle longer-running workflows. Oozie has facilities for handling
retries based on user-defined behaviors, and because it can run shell scripts,
Java apps, Spark jobs, and most anything in the Hadoop ecosystem, I've found
it to be pretty easy to integrate with our other tooling. My one complaint
(and it's more a complaint with YARN) is that it can be quite difficult to get
your hands on logs when your workflows fail. You can get them, but it can be a
real pain.

We were running Oozie on Cloudera, but are migrating to AWS, and I was pleased
to find that it can be installed on an EMR cluster[1] and managed with Hue[2]
which has a decent UI to administer the schedule with, and a visualization
depicting the workflow DAG.

[1]: [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-
conf...](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-
apps.html)

[1]: [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-
oozi...](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-oozie.html)

[2]: [http://gethue.com/tutorial-a-new-ui-for-
oozie/](http://gethue.com/tutorial-a-new-ui-for-oozie/)

------
isaachier
Uber wrote its own framework:
[https://github.com/uber/cadence](https://github.com/uber/cadence)

------
tdondich
I'm the CTO at ProcessMaker, so I might be a little biased. Our customers use
our ProcessMaker BPM product if the workflows require human intervention
through forms, email, or other interactions. The reporting tools assist in
monitoring and dealing with circular chains.

If you are a developer and want to develop your own system around a workflow
engine, we also have www.processmaker.io, which is a workflow engine in the
cloud. All the infrastructure hassle is taken care of for you, and you
communicate via an API to build out your workflows and execute them. That's
perhaps better described as an orchestration engine; however, it supports task
assignment to people. An approach like this works well with microservices,
since it can act as a microservice orchestration engine with a more human
workflow approach.

Both of these approaches can be long running (some customers have year long
processes running).

Let me know if you want to know more details, happy to share.

------
steven_h
If you're in AWS, SWF and StepFunctions are great for starting and monitoring
task completion / failure for long running processes, either interconnected or
single.

You can write your own code to long poll in either and do work as it's needed,
but with StepFunctions you can wrap lambdas to give a little more visibility
and error handling.

------
fredley
A custom layer built on top of Celery that allows for better monitoring and
dependency management, amongst other things. Monitoring, particularly of
failures, is pretty OK in Celery anyway.

The whole thing can generate its own graph by inspecting dependencies, and we
use dagre to draw pretty process workflows with status, interactions and
monitoring.

~~~
tedmiston
Any chance your layer is open source? I'd be curious to see how it compares to
something like Airflow with a Celery executor.

------
unit_circle
We spent a long time shopping around (ETL tools, Airflow, Luigi, etc.) and
eventually found Argo. We are in the process of migrating our home-rolled,
JS-based scientific workflows.
[https://github.com/argoproj/argo](https://github.com/argoproj/argo)

------
fpierfed
I worked with long (>> 24 hours, sometimes up to a week), complex workflows on
big (thousands of nodes) clusters. We used custom software layered on top of a
job scheduler like PBS Pro or HTCondor. The nice thing about this setup is
that it supports re-running failed jobs, has pretty good monitoring, does an
OK job at resource selection and allocation and is language agnostic. The last
point is good if your workflows have parts written in different languages.
There are a handful of conferences a year on these topics by the way. My
favorite is HTCondor Week at the University of Wisconsin in Madison. Talks are
online [1]

[1]:
[http://research.cs.wisc.edu/htcondor/HTCondorWeek2017/](http://research.cs.wisc.edu/htcondor/HTCondorWeek2017/)

------
agotterer
On some of our Ruby workflows we use Sidekiq Pro, which has scheduled and
batched jobs. The batched jobs are neat because they have a callback feature
that you can use for starting additional steps/workflows. We monitor/alert on
progress with statsd, Datadog, and the Sidekiq UI.

------
plcancel
IIS, AppFabric, Windows Workflow Foundation services (WF). We leverage them
for orchestration, persistence, error handling, etc. Considering the demise of
AppFabric, do you mind if I Ask HN: how would you handle these long-running
workflows in the long run (and keep IIS and WF)?

------
scarface74
The last time I did it maybe a year ago, I had a combination of batch jobs
that once completed, should kick off other batch jobs.

I created a poor man's fire-and-forget pub/sub model where, when one process
was finished, it would "raise an event".

Raising an event, would look in Hashicorp's Consul to see what jobs should be
run based on the event and submit a job to Nomad. There were a number of EC2
instances running Nomad agents that would kick off the subsequent jobs. Nomad
jobs could be executables or Docker containers.

I was very much a Hashicorp fanboy until I transitioned to using native AWS
services. These days I would probably use AWS Step Functions.

------
geomagilles
There is no obvious solution right now. That's why we are building Zenaton
(I'm a cofounder). It's in closed beta for now, but you can have a look at the
documentation
([https://zenaton.com/documentation](https://zenaton.com/documentation)) and
also read some use cases
([https://medium.com/zenaton](https://medium.com/zenaton)). Zenaton provides a
very simple way (in your own programming language) to orchestrate background
jobs.

~~~
dalore
No obvious solutions? Many enterprise companies have a workflow management
product. Adobe has one which it makes quite a bit of enterprise revenue from.

[https://www.adobe.com/uk/marketing-cloud/experience-
manager/...](https://www.adobe.com/uk/marketing-cloud/experience-
manager/project-workflow-management.html)

~~~
geomagilles
Indeed, but I do not think it's related to the question. The question here
is: how do I, as a developer, implement a workflow? There are numerous BPM
solutions, but they are often overly sophisticated. You have AWS SWF, which is
complicated to use; Airflow, which is Python-only; or your own implementation
using queues, databases, etc. Look at the diversity of answers: there is no
_obvious_ answer right now.

~~~
superzamp
From what I see on your website it seems your product indeed has found a sweet
spot between business-heavy and deep-tech systems.

The only concern I have is having such a critical part of my application
running in a proprietary SaaS environment. Do you have plans to consider on-
premise licensing or having an open-source community codebase with enterprise
plans?

~~~
geomagilles
Thx. I totally understand your concerns. We work hard to make developers'
lives much easier; that's also why the solution is hosted, so you do not have
to install, maintain, and scale your own system. Just to clarify (if needed):
your tasks are executed on your servers; we handle only the orchestration
itself. Pricing is a work in progress, but we will probably offer a large free
tier.

------
gaigepr
The Argo project is a workflow engine built on top of kubernetes. Workflows
are written as yaml templates and support DAGs as well as loops and
conditionals.

[https://github.com/argoproj/argo](https://github.com/argoproj/argo)

We use this at my company to stitch together various scientific software
packages each of which can take minutes to 10s of hours to run. Argo supports
retrying, resubmitting, suspending, and resuming workflows. It really is a
neat project, especially if you are already using kubernetes!

------
andyv133
I've used Redmine with the Checklists plugin for this. Each thing that needs
to be done is a redmine issue, and each issue can have a checklist. As team
members check off items on the list, the issue logs who/what/when and then the
user can assign the next person in the chain to the issue. At the time the
checklists plugin didn't include templating functionality (not sure if it does
now), so I rolled my own using the Redmine REST api and some PHP.

Hardest part was getting managerial support; they really liked paper.

------
FLUX-YOU
"State machine" was the easiest for simpler stuff. I put it in quotes because
it feels like one, but probably isn't.

Map out each state of your workflow, and have errors give the option to fix
immediately, try again, or revert to a previously known-good state. You likely
want to start with a 10,000 ft view of the workflow and then work on each of
those steps as an independent unit, adding all of their intermediate steps (on
and on until you reach the bottom of the recursion).
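One possible shape for such a "state machine" with retry/revert options (a toy sketch; the states and method names are invented):

```python
class WorkflowStateMachine:
    """Workflow steps as explicit states, with a remembered last
    known-good state so an errored step can be retried or reverted."""
    def __init__(self, states):
        self.states = states   # ordered, top-level workflow steps
        self.pos = 0
        self.last_good = 0

    @property
    def state(self):
        return self.states[self.pos]

    def advance(self, step_ok):
        """Move forward on success; on error, stay put so the caller
        can fix immediately, try again, or revert."""
        if step_ok:
            self.last_good = self.pos
            if self.pos < len(self.states) - 1:
                self.pos += 1
            return "advanced"
        return "errored"

    def revert(self):
        self.pos = self.last_good

wf = WorkflowStateMachine(["draft", "review", "approve", "publish"])
wf.advance(True)             # draft -> review
wf.advance(True)             # review -> approve
outcome = wf.advance(False)  # the approve step errored
wf.revert()                  # fall back to the last known-good state
```

Each state here is exactly the kind of "independent unit" described above; a big enough step can become its own microservice with its own nested machine.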

This gives you a good opportunity to break things up into microservices that
completely handle individual steps if they are big and detailed enough.

P2P is hardest because you will likely need to code something to determine who
should decide to move things to the next state (simple majority? one person
elected?) and keep track of consensus between all parties.

Orchestration is easier because there's usually one person, one role, or one
security claim in control at a particular step and changing who can advance
the state at each step is pretty easy as well.

All of this was mostly for the goal of really easy unit testing.

But note that whatever backing data store you use can be changed by any
developer unless you code all of the business rules there, too. Many people do
not like doing this though because it's not as easy as all of the unit testing
frameworks, debuggers, and IDEs we have for code.

The challenge is that you need to know the workflow completely, and that will
very likely involve talking to a lot of people, and the chances that you will
miss one or two edge cases are high. The counter to that challenge is that as
developers building a product that saves time/money, you can bend the workflow
to make it easier to code and sometimes eliminate those extra steps
(literally, we had someone copying and pasting stuff to 'make it work', so of
course we can automate that).

Saving known-good states can also be challenging depending on what you're
doing, but if you need change history or diff'ing in a user-consumable form,
you'll have to do that anyway. If you get this right, it can save your users a
lot of potentially lost work and headache if a bug gets past unit testing.

Once everything is modular, logging isn't too difficult either.

------
tamcap
We have developed a custom workflow system in PHP for our company (academic
text editing and related work). Back then (I was not directly involved from
the start) none of the out of the box solutions fit our criteria, and it made
more sense to just build a bespoke, custom fitted system. Workflows range from
a few days to a month+, with no technical upper limit enforced, as far as I
know.

We also don't need a huge throughput, so having something super-optimized was
not a large concern.

------
xemdetia
I had been looking at BPMN, with Camunda as a reasonable implementation, but I
never found a way of running a BPMN service that I liked in the time I had
allotted. In the workflow each item is
essentially a ticket so you end up with concurrent tickets in the state
machine. It also has timers to generate events so you can have that monthly
event start and trigger some other actions, and it also includes failure
paths.

~~~
probledo
So what is your problem using BPMN?

------
aprdm
In the past we used both Celery and RabbitMQ for some custom services...
however, we also used Qube (a render farm manager) to link long-running tasks
together in a workflow-specific dependency chain.

Sometimes a job could take days to finish (doing a water simulation on a
4k-res sequence).

Qube has resource constraints per job as well as number of tries and so on.
Those would all be configured at job (think workflow) submission time.

------
Maro
Airflow jobs in the backend.

------
tnolet
If you're on the AWS platform, Lambda with SNS messages as triggers works
really well. Nicely decoupled and mimics the ESB-like workflow a bit. You get
monitoring out of the box. Apparently, AWS is also working on having SQS
function as a trigger for Lambda steps. That would resolve some issues with
retrying and dead-lettering.

~~~
dgemm
Or, you know, SWF: [https://aws.amazon.com/swf/](https://aws.amazon.com/swf/)

------
zie
Nomad just does it for you(mostly): [https://www.nomadproject.io/docs/job-
specification/parameter...](https://www.nomadproject.io/docs/job-
specification/parameterized.html)

------
TheWiseOne
If you are using C#, you can use the Durable Task Framework
([https://github.com/Azure/durabletask](https://github.com/Azure/durabletask))
to handle some of this stuff.

------
enraged_camel
We use (as well as sell/implement/customize) an Enterprise Content Management
system that has very robust business automation capabilities.

[https://www.laserfiche.com](https://www.laserfiche.com)

------
KirinDave
Here is a very engineering-centric view of what I tend to do. These workflows
are optimized around "long" in the scope of microservices, 1-3 minutes. If you
go much longer than this, consider hard why this doesn't fit into ETL loads
before engineering more solutions.

Firstly, there is the issue of what these tasks are composed of. They tend to
start with a human-generated action, followed by several programmatic steps
interacting with internal and external services. They then tend to end with a
write to a private store, or a call to a service that arbitrates this.

For your task initiation: you're basically building a queue even if you don't
like queues. I recommend you embrace this where possible. Eventually you may
find so much programmatic traffic that a queue will be unsuitable, but that
won't change the need for a queue for human-initiated actions. Do try to write
task states to a non-durable store so you can watch tasks!

For your task executors: every aspect of them must consider first that any of
the sequential actions may fail to execute, and thus cause the entire task to
fail. You simply cannot escape the need to retry tasks. Build for this from
day one. For inspiration, a primitive but effective system is Amazon SQS. You
can achieve similar effects by rerunning blocks in Kafka, and Rabbit has its
own solutions. The more heavyweight the mechanics, the more likely the spine
of your product is to break at a critical moment. Be careful.

For your microservices, as informed by the previous information, you _must
strive for idempotency on every endpoint_. Even if you can't truly reach this
(and true provable idempotency is actually very hard), achieving a practical
notion of idempotency to accommodate modest retries is absolutely essential.
Retrofitting large systems with idempotency is even more difficult than doing
it to start with. Accept performance tradeoffs for this without hesitation.
Anyone who says that tail latency is more important than data integrity for
business logic is either in lottery-winning-rare condition or is over-
prioritizing engineering. If a human needs to act to correct bad data the cost
of recovery skyrockets and can spiral out of control.
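In practice, this "practical notion of idempotency" often comes down to a client-supplied idempotency key. A minimal sketch, with an in-memory dedup table standing in for a real store (all names are illustrative):

```python
_processed = {}   # idempotency key -> cached result

def handle_charge(request_id, amount, ledger):
    """An endpoint made practically idempotent via a client-supplied
    request id: a retried request returns the cached result instead of
    applying the side effect a second time."""
    if request_id in _processed:
        return _processed[request_id]      # retry becomes a cheap no-op
    ledger.append(amount)                  # the real side effect
    result = {"status": "ok", "balance": sum(ledger)}
    _processed[request_id] = result
    return result

ledger = []
first = handle_charge("req-42", 100, ledger)
second = handle_charge("req-42", 100, ledger)   # spurious retry
```

A production version would keep the dedup table in the service's own store (ideally written in the same transaction as the side effect), but the contract is the same: retries must not double-apply.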

For your final commit stores, remember that they're not infinite or magical
and many can't provide very useful concurrency guarantees. Prefer append-only
tables even if this obliges you to run cleanup cycles. If you are going to
update records in place, try to use stores with "upsert" operations or "test-
and-set" mechanics.

Let's loop back around with this advice and answer each of your questions in
turn:

> Is your system more P2P or orchestrated?

Orchestrated systems are easier to monitor, understand and build. They tend to
run into scaling challenges after a certain level. Write your software to be
agnostic to this. Start with orchestrated when possible.

> Do you leverage some existing tools or built your own?

Both. Bespoke workflow tools are easy. Custom, consistent state storage is
harder. Shy away from that outside of very special use cases (e.g., integrated
CRDTs or a bloom filter for whitelisting events inside a hot loop).

> Are you confident in your monitoring of errored workflows?

Personally: no. It's genuinely difficult to do this. The harder you try, the
more likely it is that your error monitoring system becomes the contention
point that breaks your system.

> How do you retry errored workflows?

We use SQS to queue workflows. They get a lot of retries by having the queue
claw back the message. In some rare cases work times out and is clawed back to
the queue spuriously. I've worked hard to make sure all the services it calls
don't care about such cases, so retries result in inexpensive no-ops.

> If your system is more P2P, how do you keep a holistic view of what's
> happening? Can you be certain that you don't have any circular event chains?

The situation is identical for all types of architectures. A good bit of
advice for the latter that I picked up is to NEVER have a workflow fork
conditionally into a prior state. Always have them flow downwards and "away"
from your event dispatch queues. If you can, use different queues for internal
traffic vs external traffic. You might also use different microservices, or
tags on microservice requests. All of this is in service of trying to avoid
feedback loops in your system.

------
h43k3r
We have a lot of long running workflows written in DTF

[https://github.com/Azure/durabletask](https://github.com/Azure/durabletask)

------
ojhughes
[https://concourse-ci.org/](https://concourse-ci.org/) is extremely flexible
and we use it for a number of complex workflows

------
NewDimension
Does anyone have recommendation for a non-software dev environment? e.g. user
task workflow management. I'm looking for a fully fledged product and/or an
API backend.

~~~
tedmiston
Does something like Zapier fit your use cases?

~~~
NewDimension
Zapier looks like an app trigger. I'm looking for more of a task management
system.

~~~
BMarkmann
Check out the BPM tools others have mentioned (Activiti, Camunda, etc...)

------
alimbada
We are currently using Activiti (a fork of jBPM) for a client project. Tooling
is pretty shoddy for changing the workflow, but it works.

------
hb3b
Samanage (funny enough, no posts yet about JIRA)

~~~
rando444
We, and likely many others, use Jira as kind of the second tier / exception
handling.

When the automated system fails, it automatically opens a Jira ticket to get
the right people to fix the automated workflow.

You can then use the Jira case history to drive process improvement.

------
chjohnst
Jenkins, but my pipelines are mostly data pipelines (taking source data to
convert to something else) nightly.

------
jononor
What kind of workflows take multiple days? I am assuming that means human
inputs are needed for (some) steps?

~~~
brudgers
Debiting and crediting a bank account is an example of a long-running
workflow... though I am not certain that is what the OP meant. Anyway, as a
workflow, the process that maintains an individual account usually runs over
many years, perhaps a century or more. The underlying architecture is one
reason why banking still (sometimes) uses COBOL... the software was written
around abstractions that address the timelines involved. For what it's worth,
Michael (not that one) Jackson's _Principles of Program Design_ is where I
picked up account balance as a long-running process.

------
dalacv
Pipefy is kinda cool. We don't use it, but I've been looking at implementing
some of our processes in it.

------
dominotw
Kafka Streams and Spark jobs.

