We run a small SaaS where users are able to create accounts, submit billing information and upload/call ML artefacts.
In order to not reinvent the wheel we use external services where possible: Auth0 for authentication, Stripe for handling billing, etc. For this question, I am considering this a 'microservices' architecture. I am aware that this definition will spark its own discussion, but I believe that the problem generalises to a lot of the other (better, more complete, etc.) microservice definitions. So please, bear with me.
Now, in the lifecycle of a customer (createAccount, addBilling, deleteAccount, ...), we expect operations at various points to occur atomically. By that I mean (simplified) that creating a new account must also create a customer in Stripe and register the user in Auth0; if either of these subtasks fails, the whole operation (createAccount) should fail and in fact 'undo' any state changes already performed. If not, I risk high-impact bugs such as double charging customers.
In a 'conventional' setup (without external services), I would resolve a lot of this by ensuring transactional operations on a single source-of-truth database. I understand that 'idempotency' comes up a lot here, but whichever way I try to apply it, it always seems to explode into a (fragile/brittle) spaghetti of calls, error handling and subsequent calls.
Surely this has been resolved by now, but I'm having a hard time finding any good resources on how to approach this in an elegant manner. Concretely:
Do you recognise the problem of atomicity in microservices architecture?
And,
How do you deal with guaranteeing this atomicity in your microservices?
Since the services you're describing don't know about each other, distributed transactions aren't an option.
I think the only solution to this problem is idempotency. Idempotency is a distributed systems Swiss Army knife: you can tackle 90% of use cases by combining retries and idempotency and never have to worry about ACID. Yes, it adds complexity. No, you don't have a choice.
I'm also not sure why this requires a lot of complexity. Can you explain how you're implementing idempotency? The simplest approach is to generate an idempotency key in the browser and thread it through your call graph. Stripe has built-in support for idempotency keys, so in that case no additional logic is required. For providers without idempotency support, you'll need a way to track idempotency keys atomically, but this is usually trivial to implement. When a particular provider fails, you can ask users to retry.
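To make the "track idempotency keys atomically" part concrete, here is a minimal sketch of one way to do it, not a definitive implementation. It assumes a local SQLite table whose primary key is the idempotency key; the names `open_store`, `run_once` and `InProgress` are made up for illustration, and `operation` stands in for whatever call you make to a provider that has no idempotency support of its own.

```python
import sqlite3


class InProgress(Exception):
    """Raised when another attempt holds the key but has not finished yet."""


def open_store(path=":memory:"):
    # isolation_level=None puts sqlite3 in autocommit mode, so each
    # statement below is committed on its own.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idempotency_keys ("
        " key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " result TEXT)"
    )
    return conn


def run_once(conn, key, operation):
    """Run operation() at most once per key; replays get the stored result."""
    try:
        # The PRIMARY KEY constraint makes claiming the key atomic:
        # only one caller can insert a given key.
        conn.execute(
            "INSERT INTO idempotency_keys (key, status) VALUES (?, 'pending')",
            (key,),
        )
    except sqlite3.IntegrityError:
        status, result = conn.execute(
            "SELECT status, result FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if status == "done":
            return result
        raise InProgress(key)

    try:
        result = operation()  # hypothetical provider call, must return text
    except Exception:
        # Release the claim so that a retry with the same key can run again.
        conn.execute("DELETE FROM idempotency_keys WHERE key = ?", (key,))
        raise

    conn.execute(
        "UPDATE idempotency_keys SET status = 'done', result = ? WHERE key = ?",
        (result, key),
    )
    return result
```

One caveat with this sketch: if the process dies after the provider call but before the final `UPDATE`, the key stays `pending` and someone has to decide whether to retry or check the provider. That is the kind of case where a provider with native idempotency keys saves you work.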
* If you need a particular operation to succeed only if another succeeds (creating a Stripe charge, for example), make sure that it runs after its dependencies.
* If you don't like the idea of forcing users to retry, you can ensure "eventual consistency" using a durable queue + idempotency.
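The two points above (run dependent steps in order, and retry from a durable queue) can be sketched roughly as follows. This is an illustration under assumptions, not a production design: the queue is a SQLite table, `open_queue`, `enqueue` and `work_once` are hypothetical names, and each step is assumed to be an idempotent function that takes the idempotency key.

```python
import sqlite3


def open_queue(path=":memory:"):
    # Autocommit mode: each statement is persisted as soon as it runs.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " key TEXT PRIMARY KEY,"
        " next_step INTEGER NOT NULL DEFAULT 0,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def enqueue(conn, key):
    # Enqueueing is itself idempotent: a duplicate key is ignored.
    conn.execute("INSERT OR IGNORE INTO jobs (key) VALUES (?)", (key,))


def work_once(conn, steps):
    """Try every unfinished job once; return how many finished in this pass."""
    completed = 0
    rows = conn.execute(
        "SELECT key, next_step FROM jobs WHERE done = 0"
    ).fetchall()
    for key, next_step in rows:
        try:
            # Steps run strictly in order, so a step never runs before
            # the steps it depends on have succeeded.
            for i in range(next_step, len(steps)):
                steps[i](key)
                conn.execute(
                    "UPDATE jobs SET next_step = ? WHERE key = ?", (i + 1, key)
                )
        except Exception:
            # Leave the job queued; a later pass resumes at the failed step.
            continue
        conn.execute("UPDATE jobs SET done = 1 WHERE key = ?", (key,))
        completed += 1
    return completed
```

A step can run twice if the worker dies between the call and the `UPDATE` that records progress, which is why every step has to be idempotent for this to give you eventual consistency. In practice you would call `work_once` on a timer or from a worker loop, with a step list such as "register user, then create billing customer".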
I'm not a fan of HN comments that trivialize problems, but if you have to build complex distributed systems machinery to solve the problem you're describing, I feel strongly that something's going wrong.