The canonical answer to this question apparently used to be ESBs, but the rise of the microservice paradigm eventually pushed them into decline and left a void that I'm not sure how teams currently fill.
HN, how do you handle your days-long sequences of business steps?
Some seed questions:
* Is your system more P2P or orchestrated?
* Do you leverage existing tools or build your own?
* Are you confident in your monitoring of errored workflows?
* How do you retry errored workflows?
* If your system is more P2P, how do you keep a holistic view of what's happening? Can you be certain that you don't have any circular event chains?
Long-running jobs are a rarity, so we usually spin up a new RabbitMQ cluster and services, but tie those services back to the main write/read stores. This lets regular operations continue as usual while we monitor the bulk process and commit resources to it in a more isolated fashion.
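For concreteness, one of those isolated bulk workers looks roughly like this. It's a minimal sketch using pika: the hostname, queue name, and commit_to_main_store() are all illustrative placeholders, not our actual code.

    import pika

    # Connect to the dedicated bulk-processing cluster, not the one
    # serving regular traffic (hostname is a placeholder).
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbitmq-bulk.internal"))
    channel = connection.channel()
    channel.queue_declare(queue="bulk-work", durable=True)

    def commit_to_main_store(payload):
        # Placeholder: this is where the worker writes back to the
        # shared write store that the regular services also use.
        pass

    def handle(ch, method, properties, body):
        commit_to_main_store(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    # One message at a time keeps the bulk run easy to throttle and monitor.
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue="bulk-work", on_message_callback=handle)
    channel.start_consuming()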
Errors end up in error queues in Rabbit, and can be dumped back in to be reprocessed if appropriate (or just ignored if it's a side effect we don't care about).
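When we decide a batch of errored messages is safe to retry, the "dump back in" step is essentially a small shovel script like the sketch below. Again, queue names are placeholders, and it assumes the work queue dead-letters failed messages into bulk-work.error.

    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbitmq-bulk.internal"))
    channel = connection.channel()

    # Drain the error queue and republish each message onto the work queue.
    while True:
        method, properties, body = channel.basic_get(queue="bulk-work.error")
        if method is None:  # error queue is empty, we're done
            break
        channel.basic_publish(exchange="",
                              routing_key="bulk-work",
                              body=body,
                              properties=properties)
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection.close()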
Once it's set up and running, it works well enough. Spinning up a new RabbitMQ cluster and service instances is currently manual, but since we've moved to Kubernetes I'm hoping this can be automated almost entirely.
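On the Kubernetes side, the rough idea (not implemented yet) would be to have a job runner create a throwaway RabbitmqCluster through the RabbitMQ Cluster Operator. A sketch with the Python kubernetes client, assuming the operator is installed and with namespace, name and size as placeholders:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    api = client.CustomObjectsApi()

    # Ask the RabbitMQ Cluster Operator for a short-lived cluster
    # dedicated to this bulk run.
    api.create_namespaced_custom_object(
        group="rabbitmq.com",
        version="v1beta1",
        namespace="bulk-jobs",
        plural="rabbitmqclusters",
        body={
            "apiVersion": "rabbitmq.com/v1beta1",
            "kind": "RabbitmqCluster",
            "metadata": {"name": "bulk-run-rabbit"},
            "spec": {"replicas": 3},
        },
    )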