Ask HN: How do you handle long-running workflows at your company?

vorpalhex · on May 14, 2018

Our main bus for microservices is a RabbitMQ cluster. Most services have their own isolated write store and read store (which might be a true read store, or just a db replica).

Long running jobs are a rarity, so we usually spin up a new RabbitMQ cluster and services, but tie those services back to the main write/read stores. This allows regular operations to still occur, but we can monitor the bulk process and commit resources to it in a more isolated fashion.

Errors end up in error queues in Rabbit, and can be dumped back in to be reprocessed if appropriate (or just ignored if it's a side effect we don't care about).
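
A rough sketch of the error-queue pattern described above, using plain in-memory deques as stand-ins for the Rabbit queues. The function names `consume` and `redrive` are hypothetical and nothing here is RabbitMQ client code.

```python
from collections import deque


def consume(main_queue, error_queue, handler):
    """Drain main_queue; messages whose handler raises go to error_queue."""
    while main_queue:
        message = main_queue.popleft()
        try:
            handler(message)
        except Exception:
            # Park the failed message instead of losing it.
            error_queue.append(message)


def redrive(error_queue, main_queue, should_retry=lambda message: True):
    """Move errored messages back for reprocessing; drop the ones we ignore."""
    moved = 0
    while error_queue:
        message = error_queue.popleft()
        if should_retry(message):
            main_queue.append(message)
            moved += 1
    return moved
```

Passing a `should_retry` predicate that returns False corresponds to the "just ignored" case for side effects nobody cares about.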

Once it's set up and running, it works well enough. Spinning up a new Rabbit cluster and service instances is currently manual, but since we've moved to Kubernetes I'm hoping this can be automated almost entirely.

devedlee · on May 14, 2018

We developed and use Argo (https://github.com/argoproj/argo), a Kubernetes-native workflow engine. Argo is currently used by companies like Cyrus Biotechnology, Gladly, Google, Intuit, and NVIDIA. Currently collecting use cases and requirements on a Kubernetes-native eventing framework for Argo (https://github.com/argoproj/argo-events/issues/1) to make it easier to kick off workflows.

stpedgwdgfhgdd · on May 14, 2018

Does Argo support recovery? In the sense that if a workflow step or the workflow engine crashes halfway, the last (idempotent) action is retried?

jessesuen · on May 14, 2018

I work on argo. The workflow-controller is very tolerant to crashes and designed to be this way. Workflow state is captured in the workflow CRD object (in k8s etcd). Because step names are formulated, in the event of a crash (say before the created pod is persisted in etcd), when the controller restarts and tries to schedule the pod again, it hits an AlreadyExists error and understands how to handle this. Thus, workflows are idempotent in crash scenarios.
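
To illustrate the idea (this is not Argo's code), here is a minimal sketch of create-if-absent with deterministic names. `FakeCluster`, `pod_name`, and the hashing scheme are assumptions made up for the example; the real controller talks to the Kubernetes API and has its own naming rules.

```python
import hashlib


class AlreadyExists(Exception):
    """Raised when an object with the same name is already stored."""


class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes API."""

    def __init__(self):
        self.pods = {}

    def create_pod(self, name, spec):
        if name in self.pods:
            raise AlreadyExists(name)
        self.pods[name] = spec


def pod_name(workflow, step):
    # Deterministic: the same workflow and step always map to the same name.
    digest = hashlib.sha256(f"{workflow}/{step}".encode()).hexdigest()[:10]
    return f"{workflow}-{digest}"


def schedule_step(cluster, workflow, step, spec):
    """Create the pod for a step; a repeat after a crash counts as success."""
    name = pod_name(workflow, step)
    try:
        cluster.create_pod(name, spec)
        return "created"
    except AlreadyExists:
        # The pod was created before the crash, so there is nothing to redo.
        return "already-exists"
```

Because the name is derived from the workflow and step rather than generated randomly, a restarted controller cannot create a second pod for the same step.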

devedlee · on May 14, 2018

The latest Argo release (https://github.com/argoproj/argo/releases) supports resubmitting workflows with "memoized" steps.

andscoop · on May 14, 2018

The only tool I have found that checks all those boxes (and then some) is Airflow. I liked it so much that I went to work for Astronomer.io, which is building a managed and on-prem solution to Airflow.

It's not the perfect tool, but we are striving to make it better.

kiechu · on May 15, 2018

Thumbs up for Airflow. It's great for ETL tasks.

rch · on May 14, 2018

Check out Luigi (Python -- https://github.com/spotify/luigi).

I've built (or worked on) a few bespoke systems myself, but Luigi covers better than 80% of what I typically need.

tedmiston · on May 14, 2018

There was a good tutorial at PyCon this past weekend called Workflow Engines Up and Running [1] on Python workflow automation frameworks, specifically comparing Luigi vs Airflow. The video is on YouTube as well [2].

[1]: https://us.pycon.org/2018/schedule/presentation/58/

[2]: https://youtu.be/kw0RL9LZk9s

dfsegoat · on May 14, 2018

Also using luigi here in production since 2015. We use it to manage a multi-day pipeline that is essentially semi-automatic (requires occasional human intervention).

52-6F-62 · on May 14, 2018

In publishing/media:

Some workflows are shorter than others, but on the journalism side they tend to bottom out at a day and max out at a few months (ultimately depending on the weight of the story)...

Most of that is handled above the technology, mind.

The exploration for the right tool(s) is ongoing. I've been leveraged to build one but the status of that clandestine project is in flux to put it lightly. Not sure if I can elaborate on that right now.

Currently, the needs and preferences vary so much that there are many different services used, but the company is seeking to centralize some efforts (like content generation and management) and externalize others (like distribution).

kumaranvpl · on May 14, 2018

In one of my previous companies, they used Airflow (by Airbnb) to schedule and manage workflows. Previously they were using nothing but cron. It turned out to be inefficient for retrying failed workflows and cancelling the execution of dependent downstream jobs. Airflow turned out to be a great fit for our case. I highly recommend checking it out.

tedmiston · on May 14, 2018

We're using Apache Airflow [1] internally as well. It's pretty featureful and addresses most concerns mentioned by OP, such as orchestration, retries, open source code reuse, and dependency management. It has primitive monitoring and alerting, but one needs to bring something external for that today.

Shameless plug - My startup [2] offers Airflow as a SaaS as well as an enterprise distribution with monitoring tools to build upon core Airflow.

[1]: https://github.com/apache/incubator-airflow

[2]: https://www.astronomer.io/

inoop · on May 15, 2018

Have you looked at AWS Step Functions? https://aws.amazon.com/step-functions

edit: to add, I would highly recommend using a workflow engine over a distributed messaging system. With messages it's hard to track where a given work item is in your pipeline, and it's not always easy to do mass operations such as just stopping all running workflows (e.g. when you have an outage) and resuming them later, re-driving failed items from the beginning of the workflow, etc. Workflow engines typically give you a nice dashboard where you can do all those things, for free.

slucha · on May 15, 2018

Do you have some recommendations for workflow engines?

Raidion · on May 14, 2018

We use two options. The first is a database used as a queue for granular out-of-process work. If something errors, we get a notification for that one record, but the rest keep processing.
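
A minimal sketch of the database-as-queue option, assuming SQLite and a hypothetical `jobs` table; the schema, the status values, and the `notify` callback are all made up for illustration.

```python
import sqlite3


def make_queue():
    """Create an in-memory database with a single jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " error TEXT)"
    )
    return conn


def enqueue(conn, payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()


def process_all(conn, handler, notify):
    """Process every pending row; a failure marks only that row and moves on."""
    rows = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, payload in rows:
        try:
            handler(payload)
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        except Exception as exc:
            conn.execute(
                "UPDATE jobs SET status = 'error', error = ? WHERE id = ?",
                (str(exc), job_id),
            )
            notify(job_id, exc)  # one notification per failed record
        conn.commit()
```

Rows left in the `error` state can be replayed later by setting their status back to `pending`.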

For stuff that we don't need such granularity/replay, we use Amazon's SNS event framework to trigger different APIs.

Sometimes we do a combination of those, an SNS event triggers a lambda that puts a record in the database queue, which gets picked up by a job engine and raises an SNS event that hits an API that sets a record to available.

charriu · on May 14, 2018

Our system is based on the Camunda process engine (in a Java EE environment). There's a central process server (or cluster) running the process engine, with events to start process instances.

Workflows are defined using bpmn and then executed by the engine. Errors are reported to the process engine as "Incident", which then show up in the management ui/apis. These can be retried any number of times.

We also have an older system based on Carnot/Stardust/IPP. This one used JMS messages everywhere.

mirceal · on May 14, 2018

Multiple things: 1) Camunda w/ a Postgres RDS DB. Works for more complex stuff that’s expressed in BPMN 2) If the workflow involves mostly automated stuff and is not running for years, AWS SWF (usually coupled with an API for checkpointing state, keeping track of wflows)

julienmarie · on May 14, 2018

Interesting, I'm looking into Camunda right now for our processes. How would you describe the experience in terms of adoption and results?

seabrookmx · on May 14, 2018

Not OP, but we built a product around it and though Camunda is reliable and fast enough for our use, the developer experience is pretty gross.

The BPMN gets saved out as an XML document, but the editor doesn't do a good job of making the format consistent. This makes changes to the BPMN basically impossible to code review without downloading the old and new copies and visually inspecting, which is a chore for large workflows. Especially when variable inputs and outputs require clicking into each node.

Small code snippets in either JS or a Java plugin (jar) can be embedded and used to massage variables and track state. These are also difficult to code review and test as you essentially need to write a harness that mimics Camunda to run them.

All of our new products are using simpler workflows via FaaS and queues (RabbitMQ). If we ever needed large workflows again I'd lean towards something like Airflow.

charriu · on May 14, 2018

Process modeling requires some reading up front, I think. Integration into our application was relatively easy - activities implemented as Java classes with a reasonably good API.

As for results, we are quite happy with camunda. No issues with performance, incident handling, etc. We have about 100k new process instances/day, with 5-10 steps per process (3 different processes), some of which run over multiple days.

alyandon · on May 14, 2018

* Orchestrated, in-house built workflow execution engine.

* Message queue based with service listeners that translate and dispatch messages to individual workers via HTTP requests.

* Workflow execution state is currently backed by RDBMS.

* Infrastructure errors with workflow executions are exceedingly rare and devops can push a button to retry a step if they failed due to a transient condition.

Now, the important bit:

* Retries due to business logic errors aren't really a thing unless there is a defined recovery transition for that step in the workflow. This forces people to acknowledge their code did something unexpected (or the workflow definition itself doesn't properly handle all necessary error cases) and fix the underlying issue. Once the root cause is identified and fixed, the workflow instance can be canceled and resubmitted. However, since things that do work usually have side effects, there is sometimes manual cleanup that has to occur that falls on the development team to fix (with assistance from the devops team, if needed). No one likes doing cleanups or getting on devops' bad side, so there is an incentive to make sure code and workflows are well tested before being released to production.

ameyamk · on May 14, 2018

At LinkedIn we heavily use Azkaban for this. (Open source: https://azkaban.github.io/) Azkaban API can be used to launch offline computation jobs as necessary - Azkaban ensures monitoring, SLA alerting, failed restarts and other dependency management etc.

superzamp · on May 14, 2018

Azkaban really seems to strike the right balance between simplicity and featurefulness, I'll definitely give it a try! Plus it seems relatively simple to deploy & maintain.

The documentation often mentions Hadoop and data jobs; have you also used it for non-data things? Would you by chance have some workflow examples?

ameyamk · on May 14, 2018

You can use this for any execution. eg. here is a job type to trigger shell command such as ' echo "hello" ' http://azkaban.github.io/azkaban/docs/latest/#command-type

Note execution environment for such jobs is Azkaban executor server itself, so you have to take care of resource management (eg. one job taking all RAM on the machine will affect other jobs running on the same machine)

tabtab · on May 14, 2018

I'm going into get-off-my-lawn mode here if you don't mind. I don't see why this requires a new-fangled technology or buzzword. Just have a status code(s) or indicator(s) on a given request. The client side or requesting service(s) can periodically check on the status using polling and/or user status update requests. For example, poll automatically every 2 minutes (to avoid flooding the network), but give the user the option of clicking a button to check current status.

Give the requester an option of a time-limit, if applicable. If the process takes too long, the status changes to "timed-out". The client/requester can then issue a "re-submit" request, if applicable.

The technique is pretty much the same whether using ESB, microservices, Stored Procedures, or carrier pigeons.
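
The status-polling approach above could look roughly like this. It is only a sketch: `check_status` is a hypothetical callable that returns the current status string, and the "running" and "timed-out" values are assumptions.

```python
import time


def wait_for_status(check_status, timeout_s, poll_interval_s=120.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until it leaves 'running' or the time limit passes."""
    deadline = clock() + timeout_s
    while True:
        status = check_status()
        if status != "running":
            return status
        if clock() >= deadline:
            # The caller can now decide whether to issue a re-submit.
            return "timed-out"
        sleep(poll_interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; a "check now" button would simply call `check_status()` directly.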

tedmiston · on May 14, 2018

The benefits of the frameworks become more tangible as the structure of your workflows becomes more complex, with (acyclic) dependencies and distributed execution.

tabtab · on May 16, 2018

True, but what if the "grow complex" step doesn't happen? YAGNI. If you are in a domain that needs complex work-flows, I can see selecting a work-flow framework. But I've seen some really ugly frameworks where one put in Cadillacs for every part when Chevys would do just fine 98% of the time. Staff have to learn, understand, maintain, and tune complex frameworks.

tedmiston · on May 16, 2018

I understand the concern. It's kind of like using a web framework — you probably wouldn't write your own form processing and ORM code to avoid injection attacks when it's already been built. Airflow has analogous features and protections for tasks/workflows.

Most companies adopting Airflow already have workflow requirements like this, even if it's just a single transform or moving data from one system to another.

Even if you have just a two-step workflow with Task B dependent upon success of Task A, Airflow offers protection, historical stats, email alerting, etc over trying to schedule successive cron jobs with built-in assumptions, hacking together a dependency system, etc.

To me, Airflow is the Honda of this domain. Overall it's a relatively small and simple framework from the DAG author's perspective.
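
To make the two-step case concrete, here is a small sketch of dependency-ordered execution with retries. It is not Airflow's API; `run_workflow` and its state names are hypothetical, and it relies only on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps, max_retries=2):
    """Run callables in dependency order.

    tasks: name -> zero-argument callable
    deps:  name -> set of task names that must succeed first
           (assumed to reference only names present in tasks)
    Returns name -> 'success' | 'failed' | 'upstream_failed'.
    """
    graph = {name: deps.get(name, set()) for name in tasks}
    state = {}
    for name in TopologicalSorter(graph).static_order():
        # Skip a task when anything it depends on did not succeed.
        if any(state[d] != "success" for d in graph[name]):
            state[name] = "upstream_failed"
            continue
        for _ in range(1 + max_retries):
            try:
                tasks[name]()
                state[name] = "success"
                break
            except Exception:
                state[name] = "failed"
    return state
```

Even this toy version shows what successive cron jobs lack: Task B never starts unless Task A actually succeeded, and a transient failure in Task A is retried before giving up.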

Lord_Zero · on May 14, 2018

We let the web application do it and pray the web server doesn't croak mid job, which it usually does.

dalacv · on May 14, 2018

Check out Pipefy.com - It is like Trello + Customizable workflow:

https://d2qfyj0q2n9d96.cloudfront.net/uploads/2017/08/email-...

https://downloads.intercomcdn.com/i/o/55498996/caa3b5f8a6334...

https://downloads.intercomcdn.com/i/o/58246710/bf1485442ffb1...

jelling · on May 14, 2018

Have you tried integrating home-grown services into their workflows? Curious as otherwise it looks pretty good.

crispyporkbites · on May 14, 2018

We basically email stuff around and then when it gets stuck somewhere follow up with another email / conference call to move it forward again. If it keeps getting stuck or doesn't move it's obviously not an important process so it falls out of the loop.

spapas82 · on May 14, 2018

In my previous job (banking) we were using Appian for all our workflows. It was a strange beast of a Java UI application with a k/kdb core and database.

Its UI was really good, much better than Activiti and similar BPM systems; you could create a rather complex workflow with almost no code, just by creating your BPMN flow through the built-in editor. Also, the editor and the rules system were built into the web UI, so you didn't have to use external, Eclipse-ish tools unless you wanted to write custom BPMN nodes, mainly for integration with external systems. Errors and retries were handled through BPMN.

The main problem was the K core. Because nobody knew how to write K, we relied on the Java API for access to the kdb database (messing with the kdb directly was not even supported by Appian, so even if somebody had been willing to learn, the bank wouldn't have let them touch the kdbs). Because of the restrictions of that API, we ended up having to manually edit a couple of hundred live process instances to change a task assignee or skip a non-working custom node... Also, because kdbs are stored in memory, we needed a very large amount of RAM on the server, and it grew proportionally with the number of process instances.

Even with these shortcomings, I still think that it was a good product, much better than other workflow solutions like Activiti or jbpm or Alfresco. One last thing: Appian was way too expensive; don't consider it if you are not a bank...

jodison · on May 15, 2018

We use Apache Oozie (http://oozie.apache.org/), an orchestration system for Hadoop. We don't run days-long workflows, but we run some that have over a dozen steps, and I have no reason to believe Oozie couldn't handle longer-running workflows. Oozie has facilities for handling retries based on user-defined behaviors, and because it can run shell scripts, Java apps, Spark jobs, and most anything in the Hadoop ecosystem, I've found it to be pretty easy to integrate with our other tooling. My one complaint (and it's more a complaint with YARN) is that it can be quite difficult to get your hands on logs when your workflows fail. You can get them, but it can be a real pain.

We were running Oozie on Cloudera, but are migrating to AWS, and I was pleased to find that it can be installed on an EMR cluster[1] and managed with Hue[2] which has a decent UI to administer the schedule with, and a visualization depicting the workflow DAG.

[1]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-conf...

[1]: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-oozi...

[2]: http://gethue.com/tutorial-a-new-ui-for-oozie/

isaachier · on May 14, 2018

Uber wrote its own framework: https://github.com/uber/cadence

tdondich · on May 14, 2018

I'm the CTO at ProcessMaker and so I might be a little biased. Our customers use our ProcessMaker BPM product if the workflows require human intervention through forms/email other interactions. The reporting tools assist in monitoring and dealing with circular chains.

If you are a developer and want to develop your own system around a workflow engine, we also have www.processmaker.io which is a workflow engine in the cloud. So all the infrastructure hassle is taken care of for you and you communicate via an api to build out your workflows and execute them. I feel like that's better described as an orchestration engine however it supports task assignment to people. An approach like this works well with microservices since it can act as a microservice orchestration engine with a more human workflow approach.

Both of these approaches can be long running (some customers have year long processes running).

Let me know if you want to know more details, happy to share.

steven_h · on May 14, 2018

If you're in AWS, SWF and StepFunctions are great for starting and monitoring task completion / failure for long running processes, either interconnected or single.

You can write your own code to long poll in either and do work as it's needed, but with StepFunctions you can wrap lambdas to give a little more visibility and error handling.

fredley · on May 14, 2018

A custom layer built on top of Celery that allows for better monitoring and dependency management, amongst other things. Monitoring, particularly of failure is pretty ok in Celery anyway.

The whole thing can generate its own graph by inspecting dependencies, and we use dagre to draw pretty process workflows with status, interactions and monitoring.

tedmiston · on May 14, 2018

Any chance your layer is open source? I'd be curious to see how it compares to something like Airflow with a Celery executor.

unit_circle · on May 14, 2018

We spent a long time shopping around (ETL tools, Airflow, Luigi, etc.) and eventually found Argo. We are in the process of migrating our home-rolled scientific JS based workflows. https://github.com/argoproj/argo

fpierfed · on May 14, 2018

I worked with long (>> 24 hours, some times up to a week) complex workflows on big (thousands of nodes) clusters. We used custom software layered on top of a job scheduler like PBS Pro or HTCondor. The nice thing about this setup is that it supports re-running failed jobs, has pretty good monitoring, does an OK job at resource selection and allocation and is language agnostic. The last point is good if your workflows have parts written in different languages. There are a handful of conferences a year on these topics by the way. My favorite is HTCondor Week at the University of Wisconsin in Madison. Talks are online [1]

[1]: http://research.cs.wisc.edu/htcondor/HTCondorWeek2017/

agotterer · on May 14, 2018

On some of our Ruby workflows we use Sidekiq Pro, which has scheduled and batched jobs. The batched jobs feature is neat because it has a callback that you can use for starting additional steps / workflows. We monitor/alert on progress with statsd, datadog, and the sidekiq ui.

plcancel · on May 14, 2018

IIS, AppFabric, Windows Workflow Foundation services (WF). Leverage there for orchestration, persistence, error handling, etc. Considering the demise of AppFabric, do you mind if I Ask HN: How would you handle these long-running workflows in the long-run (and keep IIS and WF)?

scarface74 · on May 18, 2018

The last time I did it maybe a year ago, I had a combination of batch jobs that once completed, should kick off other batch jobs.

I created a poor man's fire and forget pub/sub model where, when one process was finished, it would "raise an event".

Raising an event, would look in Hashicorp's Consul to see what jobs should be run based on the event and submit a job to Nomad. There were a number of EC2 instances running Nomad agents that would kick off the subsequent jobs. Nomad jobs could be executables or Docker containers.
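
In outline, that lookup-and-submit step might look like the sketch below. A plain dict stands in for the Consul lookup and a callback stands in for the Nomad submission; `raise_event` and the event and job names are hypothetical.

```python
def raise_event(event, subscriptions, submit_job):
    """Look up the jobs registered for an event and submit each one."""
    submitted = []
    for job in subscriptions.get(event, []):
        submit_job(job)  # fire and forget: no result is awaited
        submitted.append(job)
    return submitted
```

For example, with `subscriptions = {"import.finished": ["transform", "report"]}`, raising `"import.finished"` submits both jobs, and an event nobody subscribed to submits nothing.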

I was very much a Hashicorp fanboy until I transitioned to using native AWS services. These days I would probably use AWS Step Functions.

geomagilles · on May 14, 2018

There is no obvious solution right now. That's why we are building Zenaton (I'm cofounder). It's in closed beta by now, but you can have a look at the documentation (https://zenaton.com/documentation) and also read some use cases (https://medium.com/zenaton). Zenaton provides a very simple way (in your own programming language) to orchestrate background jobs

dalore · on May 14, 2018

No obvious solutions? Many enterprise companies have a workflow management product. Adobe has one which it makes quite a bit of enterprise revenue from.

https://www.adobe.com/uk/marketing-cloud/experience-manager/...

geomagilles · on May 14, 2018

Indeed - but I do not think it's related to the question. The question here is: how do I - as a developer - implement a workflow? Still there are numerous BPM solutions, but often overly sophisticated. You have AWS SWF, but complicated to use, Airflow but in Python only, your own implementation using queues, database, etc... Look at the diversity of answers: there is no obvious answer right now.

tedmiston · on May 14, 2018

Airflow can run tasks written in languages besides Python in several ways, such as through the BashOperator, DockerOperator, dispatching a job to a Spark cluster, etc. It's common to use multiple languages.

It's really just the configuration for tasks, DAGs, etc that must be done in Python. I know some people have even automated that to pull from yaml or json instead, but I prefer to have the flexibility myself.

superzamp · on May 14, 2018

From what I see on your website it seems your product indeed has found a sweet spot between business-heavy and deep-tech systems.

The only concern I have is having such a critical part of my application running in a proprietary SaaS environment. Do you have plans to consider on-premise licensing or having an open-source community codebase with enterprise plans?

geomagilles · on May 15, 2018

Thx. I totally understand your concerns. We work hard to make developers' lives much easier. That's also why the solution is hosted, so you do not have to install, maintain, and scale your own system. Just to clarify (if needed): your tasks are executed on your servers, we handle only the orchestration itself. Pricing is a work in progress, but we will probably offer a large free usage tier.

gaigepr · on May 14, 2018

The Argo project is a workflow engine built on top of kubernetes. Workflows are written as yaml templates and support DAGs as well as loops and conditionals.

https://github.com/argoproj/argo

We use this at my company to stitch together various scientific software packages each of which can take minutes to 10s of hours to run. Argo supports retrying, resubmitting, suspending, and resuming workflows. It really is a neat project, especially if you are already using kubernetes!

andyv133 · on May 14, 2018

I've used Redmine with the Checklists plugin for this. Each thing that needs to be done is a redmine issue, and each issue can have a checklist. As team members check off items on the list, the issue logs who/what/when and then the user can assign the next person in the chain to the issue. At the time the checklists plugin didn't include templating functionality (not sure if it does now), so I rolled my own using the Redmine REST api and some PHP.

Hardest part was getting managerial support; they really liked paper.

FLUX-YOU · on May 14, 2018

"State machine" was the easiest for simpler stuff. I put it in quotes because it feels like one, but probably isn't.

Map out each state of your workflow, and have errors give the option to fix immediately, try again, or revert to a previously known-good state. You likely want to start with a 10,000ft view of the workflow and then work on each of those steps as an independent unit and add all of their intermediate steps (on and on until you reach the bottom of the recursion).
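
A minimal sketch of that kind of "state machine", with a history of known-good states to revert to. The states and transitions are invented for illustration and the `Workflow` class is hypothetical.

```python
class Workflow:
    """Tiny state machine; states and transitions are illustrative only."""

    TRANSITIONS = {
        "draft": {"review"},
        "review": {"approved"},
        "approved": {"published"},
        "published": set(),
    }

    def __init__(self, state="draft"):
        self.state = state
        self.known_good = [state]  # states reached without an error

    def advance(self, new_state, action=lambda: None):
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        action()  # may raise; the state only changes if it succeeds
        self.state = new_state
        self.known_good.append(new_state)

    def revert(self):
        """Go back to the previous known-good state."""
        if len(self.known_good) > 1:
            self.known_good.pop()
        self.state = self.known_good[-1]
        return self.state
```

A failing `action` leaves the workflow where it was, so the caller can fix the problem and retry the same transition, or call `revert()` to step back.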

This gives you a good opportunity to break things up into microservices that completely handle individual steps if they are big and detailed enough.

P2P is hardest because you will likely need to code something to determine who should decide to move things to the next state (simple majority? one person elected?) and keep track of consensus between all parties.

Orchestration is easier because there's usually one person, one role, or one security claim in control at a particular step and changing who can advance the state at each step is pretty easy as well.

All of this was mostly for the goal of really easy unit testing.

But note that whatever backing data store you use can be changed by any developer unless you code all of the business rules there, too. Many people do not like doing this though because it's not as easy as all of the unit testing frameworks, debuggers, and IDEs we have for code.

The challenge is that you need to know the workflow completely and that will very likely involve talking to a lot of people and the chances you will miss one or two edge cases is high. The counter to that challenge is that as developers building a product that saves time/money, you can bend the workflow to make it easier to code and sometimes eliminate those extra steps (literally, we had someone copying and pasting stuff to 'make it work', so of course we can automate that).

Saving known-good states can also be challenging depending on what you're doing, but if you need change history or diff'ing in a user-consumable form, you'll have to do that anyway. If you get this right, it can save your users a lot of potentially lost work and headache if a bug gets past unit testing.

Once everything is modular, logging isn't too difficult either.

tamcap · on May 14, 2018

We have developed a custom workflow system in PHP for our company (academic text editing and related work). Back then (I was not directly involved from the start) none of the out of the box solutions fit our criteria, and it made more sense to just build a bespoke, custom fitted system. Workflows range from a few days to a month+, with no technical upper limit enforced, as far as I know.

We also don't need a huge throughput, so having something super-optimized was not a large concern.

xemdetia · on May 14, 2018

I had been looking at using BPMN and an implementation of Camunda as a reasonable goal, but I never found an implementation of running a BPMN service that I liked in the time I had allotted. In the workflow each item is essentially a ticket so you end up with concurrent tickets in the state machine. It also has timers to generate events so you can have that monthly event start and trigger some other actions, and it also includes failure paths.

probledo · on May 16, 2018

So what is your problem using BPMN?

aprdm · on May 15, 2018

In the past used both Celery and RabbitMQ for some custom services... however we also used to use Qube (A render farm manager) to link long running tasks together in a dependency chain which was workflow specific.

Sometimes a job could take days to finish (doing a water simulation in a 4k res sequence).

Qube has resource constraints per job as well as number of tries and so on. Those would all be configured at job (think workflow) submission time.

Maro · on May 14, 2018

Airflow jobs in the backend.

tnolet · on May 14, 2018

If you're on the AWS platform, Lambda with SNS messages as triggers works really well. Nicely decoupled and mimics the ESB-like workflow a bit. You get monitoring out of the box. Apparently, AWS is also working on having SQS function as a trigger for Lambda steps. That would resolve some issues with retrying and deadletter boxing.

dgemm · on May 14, 2018

Or, you know, SWF: https://aws.amazon.com/swf/

zie · on May 14, 2018

Nomad just does it for you(mostly): https://www.nomadproject.io/docs/job-specification/parameter...

TheWiseOne · on May 15, 2018

If you are using C#, you can use the Durable Task Framework (https://github.com/Azure/durabletask) to handle some of this stuff.

enraged_camel · on May 14, 2018

We use (as well as sell/implement/customize) an Enterprise Content Management system that has very robust business automation capabilities.

https://www.laserfiche.com

KirinDave · on May 14, 2018

Here is a very engineering-centric view of what I tend to do. These workflows are optimized around "long" in the scope of microservices, 1-3 minutes. If you go much longer than this, consider hard why this doesn't fit into ETL loads before engineering more solutions.

Firstly, there is an issue of what these tasks are composed of. They tend to start with a human-generated action, result in several programmatic steps interacting with internal and external services. They then tend to result in a write to a private store, or a call to a service that arbitrates this.

For your task initiation: you're basically building a queue even if you don't like queues. I recommend you embrace this where possible. Eventually you may find so much programmatic traffic that a queue will be unsuitable, but that won't change the need for a queue for human-initiated actions. Do try to write task states to a non-durable store so you can watch tasks!

For your task executors: every aspect of them must consider first that any of the sequential actions may fail to execute, and thus cause the entire task to fail. You simply cannot escape the need to retry tasks. Build for this from day one. For inspiration, a primitive but effective system is Amazon SQS. You can achieve similar effects by rerunning blocks in Kafka, and Rabbit has its own solutions. The more heavyweight the mechanics, the more likely the spine of your product is to break at a critical moment. Be careful.

For your microservices, as informed by the previous points, you must strive for idempotency on every endpoint. Even if you can't truly reach this (and true provable idempotency is actually very hard), achieving a practical notion of idempotency to accommodate modest retries is absolutely essential. Retrofitting large systems with idempotency is even more difficult than doing it to start with. Accept performance tradeoffs for this without hesitation. Anyone who says that tail latency is more important than data integrity for business logic is either in lottery-winning-rare condition or is over-prioritizing engineering. If a human needs to act to correct bad data the cost of recovery skyrockets and can spiral out of control.
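
One common way to get a practical notion of idempotency is an idempotency key supplied by the caller. The sketch below assumes that technique; `PaymentService` and its fields are hypothetical and kept in memory only.

```python
class PaymentService:
    """Hypothetical service used to illustrate idempotency keys."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> result of the first call

    def credit(self, idempotency_key, amount):
        # A retried request with the same key returns the stored result
        # instead of applying the side effect a second time.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        result = {"balance": self.balance}
        self._seen[idempotency_key] = result
        return result
```

In a real service the key record and the side effect would need to be committed together, otherwise a crash between the two reintroduces the duplicate.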

For your final commit stores, remember that they're not infinite or magical and many can't provide very useful concurrency guarantees. Prefer append-only tables even if this obliges you to run cleanup cycles. If you are going to update records in place, try to use stores with "upsert" operations or "test-and-set" mechanics.
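
As an illustration of "upsert" and "test-and-set" mechanics, here is a sketch against SQLite (the `ON CONFLICT` clause needs SQLite 3.24 or newer). The `records` table, its `version` column, and both helper functions are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id TEXT PRIMARY KEY, value TEXT, version INTEGER NOT NULL)"
)


def upsert(record_id, value):
    # Insert, or update in place if the row already exists.
    conn.execute(
        "INSERT INTO records (id, value, version) VALUES (?, ?, 1) "
        "ON CONFLICT(id) DO UPDATE SET value = excluded.value, "
        "version = records.version + 1",
        (record_id, value),
    )
    conn.commit()


def compare_and_set(record_id, expected_version, new_value):
    """Update only if nobody else changed the row since we read it."""
    cur = conn.execute(
        "UPDATE records SET value = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_value, record_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1
```

A False return from `compare_and_set` means the write lost a race and the caller should re-read the row before deciding what to do.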

Let's loop back around with this advice and answer each of your questions in turn:

> Is your system more P2P or orchestrated?

Orchestrated systems are easier to monitor, understand and build. They tend to run into scaling challenges after a certain level. Write your software to be agnostic to this. Start with orchestrated when possible.

> Do you leverage some existing tools or built your own?

Both. Bespoke workflow tools are easy. Custom, consistent state storage is harder. Shy away from that outside of very special use cases (e.g., integrated CRDTs or a bloom filter for whitelisting events inside a hot loop).

> Are you confident in your monitoring of errored workflows?

Personally: no. It's genuinely difficult to do this. The harder you try, the more likely it is that your error monitoring system becomes the contention point that breaks your system.

> How do you retry errored workflows?

We use SQS to queue workflows. They get a lot of retries by having the queue claw back the message. In some rare cases work times out and is clawed back to the queue spuriously. I've worked hard to make sure all the services that it calls don't care about such cases and result in expensive nops.
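For readers who haven't used this style of queue, the claw-back behavior can be sketched with a toy in-memory model. This is not the SQS API; `VisibilityQueue` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
import itertools


class VisibilityQueue:
    """In-memory toy modeled on visibility-timeout semantics."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, invisible_until]
        self._ids = itertools.count()

    def send(self, body):
        self._messages[next(self._ids)] = [body, 0.0]

    def receive(self, now):
        """Hand out the first visible message and hide it for a while."""
        for msg_id, entry in self._messages.items():
            body, invisible_until = entry
            if invisible_until <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Acknowledge success; without this the message reappears."""
        self._messages.pop(msg_id, None)
```

A worker that crashes, or simply runs past the timeout, never calls `delete`, so the same message is handed out again. That second delivery is the spurious retry that the downstream services have to tolerate.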

> If your system is more P2P, how do you keep a holistic view of what's happening? Can you be certain that you don't have any circular event chains?

The situation is identical for all types of architectures. A good bit of advice I picked up for the latter is to NEVER have a workflow fork conditionally into a prior state. Always have them flow downwards and "away" from your event dispatch queues. If you can, use different queues for internal traffic vs external traffic. You might also use different microservices or tags on microservice requests. All of this is in service of trying to avoid feedback loops in your system.
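One cheap way to enforce the "never fork into a prior state" rule is to give every workflow state a rank and reject any transition that does not move to a higher rank. A small sketch, with made-up stage names and a hypothetical `backward_edges` helper:

```python
# Hypothetical workflow stages, listed in the order work should flow.
STAGES = ["received", "validated", "charged", "shipped", "done"]
RANK = {name: i for i, name in enumerate(STAGES)}


def backward_edges(transitions):
    """Return every transition that stays in place or moves to a prior state."""
    return [(src, dst) for src, dst in transitions if RANK[dst] <= RANK[src]]
```

Running a check like this in CI against the declared transitions would catch a loop before it reaches production; if every edge moves strictly forward, no cycle is possible.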

h43k3r · on May 16, 2018

We have a lot of long running workflows written in DTF

https://github.com/Azure/durabletask

ojhughes · on May 14, 2018

https://concourse-ci.org/ is extremely flexible and we use it for a number of complex workflows

NewDimension · on May 14, 2018

Does anyone have recommendation for a non-software dev environment? e.g. user task workflow management. I'm looking for a fully fledged product and/or an API backend.

tedmiston · on May 14, 2018

Does something like Zapier fit your use cases?

NewDimension · on May 15, 2018

Zapier looks like an app trigger. I'm looking for more of a task management system.

BMarkmann · on May 15, 2018

Check out the BPM tools others have mentioned (Activiti, Camunda, etc...)

alimbada · on May 14, 2018

We are currently using Activiti (a fork of jBPM) for a client project. Tooling is pretty shoddy for changing the workflow, but it works.

hb3b · on May 14, 2018

Samanage (funny enough, no posts yet about JIRA)

rando444 · on May 14, 2018

We, and likely many others, use Jira as kind of the second tier / exception handling.

When the automated system fails, it automatically opens a Jira ticket to get the right people to fix the automated workflow.

You can then use the Jira case history to drive process improvement.

chjohnst · on May 14, 2018

Jenkins, but my pipelines are mostly data pipelines (taking source data to convert to something else) nightly.

jononor · on May 14, 2018

What kind of workflows take multiple days? I am assuming that means human inputs are needed for (some) steps?

brudgers · on May 14, 2018

Debiting and crediting a bank account is an example of a long running workflow...though I am not certain that is what the OP meant. Anyway, as a workflow, the process that maintains an individual account usually runs over many years and perhaps a century or more. The underlying architecture is one reason why banking still (sometimes) uses COBOL...the software was written around abstractions that address the timelines involved. For what it's worth, Michael (not that one) Jackson's Principles of Program Design is where I picked up account balance as a long running process.

mabn · on May 14, 2018

In my case it was waiting until tax information became available in another system (IIRC up to several months, but usually a few days). Sometimes a person had to actually travel somewhere to get the data via paper forms. Usually human input was needed for some steps, but sometimes the workflow was able to complete automatically.

Most waiting (hours-days) happened because the work was waiting in a queue for a user to take care of it.

dominotw · on May 14, 2018

Kafka Streams and Spark jobs.