
Ask HN: How do you deal with atomicity in microservice environments? - c89X
We run a small SaaS where users are able to create accounts, submit billing information and upload/call ML artefacts.

In order to not reinvent the wheel we use external services where possible: Auth0 for authentication, Stripe for handling billing, etc. For this question, I am considering this a 'microservices' architecture. I am aware that this definition will spark its own discussion, but I believe that the problem generalises to a lot of the other (better, more complete, etc.) microservice definitions. So please, bear with me.

Now, in the lifecycle of a customer (createAccount, addBilling, deleteAccount, ...) at various points we expect operations to occur atomically. By which I mean (simplified) that upon creating a new account, I also need to ensure a customer is created in Stripe as well as register the user in Auth0 - but if either of these subtasks fails, the operation (createAccount) should fail completely and in fact 'undo' any state changes already performed. If not, I risk high-impact bugs such as double charging customers.

Now, in a 'conventional' setup (without external services), I would resolve a lot of this by ensuring transactional operations on a single source-of-truth database. I understand that 'idempotency' comes up a lot here, but whichever way I try to apply it, it always seems to explode into a (fragile/brittle) spaghetti of calls, error handling and subsequent calls.

Surely this has been resolved by now, but I'm having a hard time finding any good resources on how to approach this in an elegant manner. Concretely:

Do you recognise the problem of atomicity in microservices architecture?

And,

How do you deal with guaranteeing this atomicity in your microservices?
======
mooted1
Having built similar applications in microservice environments, I think there
are usually simpler answers than distributed transactions. And if you do need
distributed transactions, this is often a sign that your service boundaries
are too granular.

In fact, since the services you're describing don't know about each other,
distributed transactions aren't an option.

I think the only solution to this problem is idempotency. Idempotency is a
distributed-systems Swiss Army knife - you can tackle 90% of use cases by
combining retries and idempotency, and never have to worry about ACID. Yes, it
adds complexity. No, you don't have a choice.

I'm also not sure why this requires a lot of complexity. Can you explain how
you're implementing idempotency? The simplest approach is to initialize an
idempotency key on the browser side which you thread across your call graph.
Stripe has built-in support for idempotency keys, so in that case no
additional logic is required. For providers without idempotency support,
you'll need a way to track idempotency keys atomically, but this is usually
trivial to implement. When a particular provider fails, you can ask users to
retry.
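The key-tracking idea can be sketched in a few lines. This is a minimal, in-memory sketch (a real implementation would persist seen keys atomically, e.g. with a unique-constrained database column); `IdempotentCaller`, `create_customer` and the return value are all illustrative, not a real API:

```python
import uuid

class IdempotentCaller:
    """Remembers results per idempotency key so a retried
    operation executes its side effect at most once."""

    def __init__(self):
        self._results = {}  # idempotency_key -> stored result

    def call(self, key, operation):
        # If this key has already run, return the recorded result
        # instead of re-executing the side effect.
        if key in self._results:
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result

calls = []
caller = IdempotentCaller()

def create_customer():
    calls.append(1)  # stands in for a side-effectful provider call
    return "cust_123"

key = str(uuid.uuid4())        # generated once, browser-side
first = caller.call(key, create_customer)
retry = caller.call(key, create_customer)  # same key: no duplicate charge
```

The point is that retries become safe: the caller can hammer the endpoint and the provider-facing side effect still happens exactly once per key.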

* If you need a particular operation to succeed only if another succeeds (creating a stripe charge, for example), make sure that it runs after its dependencies.

* If you don't like the idea of forcing users to retry, you can ensure "eventual consistency" using a durable queue + idempotency.

I'm not a fan of HN comments that trivialize problems, but if you have to
build complex distributed systems machinery to solve the problem you're
describing, I feel strongly that something's going wrong.

~~~
mooted1
Also, I think the answers you're getting keep telling you to build distributed
transactions because

a) they didn't read your post and are overindexing on "microservices", or

b) this isn't an atomicity problem, it's a workflow problem. When people say
atomicity, they're usually referring to operations over data. In your case, it
sounds like you need a way to coordinate execution of side-effectful
operations (like creating a charge).

~~~
LoSboccacc
But when one starts talking about idempotence and retries, the first thing
that comes to mind is that these are just user-mediated two-phase commits,
where an error dialog and user interaction drive the phases instead of a
distributed transaction system - by which point one can just cut out the
middleman and build the thing out of tested and robust components.

~~~
mooted1
What middleman do you want to cut out? My proposal doesn't add any
dependencies. It's also not possible to prevent duplicate charges if a user
POSTs a form twice without idempotency keys, so idempotency is necessary.

I'm really not sure what you're actually proposing?

~~~
LoSboccacc
the user doing the retries

------
siliconc0w
You often don't need or want cross-service atomicity, just eventual
consistency. Each microservice should be an idempotent state machine: it
expects atomic commits to its own state but never expects to be able
to conduct a transaction across the service boundary. However, services can
conduct local transactions on their own state joined against a read-only
cached copy of another service's state - you can implement this with an event
bus or a shared caching tier. This can allow you to avoid writing your own
joining logic and use standard ORMs. Ensuring the queues are flowing and
retries happen is very important though; you need to monitor queue lengths and
job errors to ensure the "eventual" part of eventual consistency is happening.

For this particular example, createAccount would make a local commit with a
state of CREATING, return an account_id and asynchronously create jobs to
complete the billing, the auth, whatever. You then have a job polling in the
background that moves the account to CREATED once all or enough of the
dependencies are successful (i.e. you may have a slow third-party provider you
don't want to block on). Your front end polls the state of the account and
displays a pretty animation to distract the user while you do the work.
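A bare-bones sketch of that CREATING -> CREATED flow (everything here - the dicts, the job names, the instantly-succeeding worker - is illustrative; real code would use a durable queue and retries):

```python
accounts = {}   # account_id -> {"state": ..., "deps": {...}}
jobs = []       # pending side-effect jobs (would be a durable queue)

def create_account(account_id):
    # Local commit first: the account exists immediately, in CREATING.
    accounts[account_id] = {"state": "CREATING",
                            "deps": {"billing": False, "auth": False}}
    # Enqueue the external work instead of blocking the request on it.
    jobs.extend([(account_id, "billing"), (account_id, "auth")])
    return account_id

def run_jobs():
    # Stand-in for a worker with retries; here every job "succeeds".
    while jobs:
        account_id, dep = jobs.pop()
        accounts[account_id]["deps"][dep] = True

def poll():
    # Background job: promote accounts once their dependencies are done.
    for acc in accounts.values():
        if acc["state"] == "CREATING" and all(acc["deps"].values()):
            acc["state"] = "CREATED"

aid = create_account("acct_1")
state_before = accounts[aid]["state"]   # still CREATING
run_jobs()
poll()
state_after = accounts[aid]["state"]    # promoted to CREATED
```

The front end only ever reads the account's state column, which is exactly what makes the "pretty animation while we work" UX possible.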

~~~
c89X
Thanks. From my (limited) understanding this sounds a lot like the Sagas that
have been suggested by others - does that match your intention?

------
jjanyan
"Sagas", or distributed transactions, are what you're looking for. These are
APIs/functions/methods that know how to complete every step of your atomic
operation and how to roll it back if any step fails. They more or less
recreate what would have been a single database transaction pre-microservice.
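The shape of a saga is simple: a list of (action, compensation) pairs, run in order, with the compensations of completed steps replayed in reverse on failure. A minimal sketch (function names are illustrative; real compensations would call the Stripe/Auth0 delete APIs, and real code must also handle a compensation itself failing):

```python
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """Run (action, compensate) pairs in order; on failure, run the
    compensations for the completed steps in reverse order."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception as exc:
            for undo in reversed(done):
                undo()  # real code must handle compensation failures too
            raise SagaFailed(str(exc)) from exc

log = []

def create_auth0_user():
    log.append("auth0 user created")

def delete_auth0_user():
    log.append("auth0 user deleted")

def create_stripe_customer():
    raise RuntimeError("stripe is down")  # simulated failure

steps = [
    (create_auth0_user, delete_auth0_user),
    (create_stripe_customer, lambda: log.append("stripe customer deleted")),
]

failed = False
try:
    run_saga(steps)
except SagaFailed:
    failed = True
# log now shows the Auth0 user was created, then compensated away
```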

~~~
c89X
That's it! Not sure how I never stumbled upon this. Thanks!

~~~
manasvi_gupta
Came here to say same thing. Saga pattern is the way to go.

Also, suggest you to carefully think about transaction boundaries, if
possible, prefer eventual consistency over transactions.

------
mh8h
The real business world is not atomic. The idea is to keep track of things
that you've done, and do compensating things to roll their effect back if
something goes wrong. See the Saga pattern for an example. I also found Gregor
Hohpe's blog post [0] titled "Starbucks Does Not Use Two-Phase Commit" very
informative.

[0]
[https://www.enterpriseintegrationpatterns.com/ramblings/18_s...](https://www.enterpriseintegrationpatterns.com/ramblings/18_starbucks.html)

~~~
voldybot
good read. ... thanks

------
perlgeek
Not everything that uses lots of services needs to be a classical
"microservice" architecture, where every service can call every other service.

You can have a component that coordinates workflows (can be a different
component for each workflow, or can be one central component, depending on
your need to scale vs. simplicity).

In the OpenStack project, they developed TaskFlow[0] for that. It lets you
build graphs of tasks, and handles parallelization, persistence of meta
information (if desired) and, importantly for your use case, rollback functions
in case of errors.

[0]:
[https://docs.openstack.org/taskflow/latest/](https://docs.openstack.org/taskflow/latest/)

------
phyrex
"Designing Data-Intensive Applications" by Martin Kleppmann touches on this
subject: [http://dataintensive.net](http://dataintensive.net)

------
AlexITC
I know two not-so-complex ways for dealing with this:

1. Use a persistent queue where you store what needs to be done. This allows
you to ensure that an operation will eventually succeed by leveraging the
powers of the queue; you'll need to keep track of the stages where the jobs
failed in order to not repeat operations.

2. Rely on the clients retrying and use idempotency; there is a nice and very
detailed blog post[0] about it. This works very well with your own
microservices, but most external services won't have idempotency and you'll
need to plan how to deal with those in order to emulate idempotency -
sometimes the queue helps.

[0]: [https://brandur.org/idempotency-keys](https://brandur.org/idempotency-keys)
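Option 1's "track the stages so retries don't repeat work" can be sketched like this (all names are illustrative; a real system would persist `completed_stages` and the queue durably):

```python
from collections import deque

queue = deque()          # stand-in for a durable job queue
completed_stages = {}    # job_id -> set of stage names that succeeded

def process(job_id, stages):
    done = completed_stages.setdefault(job_id, set())
    for name, fn in stages:
        if name in done:
            continue  # already succeeded on a previous attempt: skip
        fn()
        done.add(name)

effects = []
attempts = {"n": 0}

def flaky_billing():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient provider failure")
    effects.append("billing")

stages = [("auth", lambda: effects.append("auth")),
          ("billing", flaky_billing)]

queue.append("job-1")
while queue:
    job = queue.popleft()
    try:
        process(job, stages)
    except RuntimeError:
        queue.append(job)  # requeue and try again later
```

After the retry, "auth" has run exactly once and "billing" exactly once, even though the job was processed twice.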

~~~
c89X
thanks, useful resource!

------
namelosw
This is a hard problem to tackle.

An obvious approach is 2-phase commit (2PC) or N-phase commit. These are
generally not recommended because they sacrifice too much performance.

* Pros: It is more like a transaction, so you don't have to write rollback logic.

* Cons: It seriously sacrifices system availability, since the distributed transaction happens across networks.

A commonly recommended approach is Saga, which is basically compensation.

* Pros: Does not sacrifice system availability, since it's an asynchronous, optimistic design.

* Cons: You have to write a lot of compensation logic and call it correctly.

* Cons: There are a lot of operations that cannot be compensated. These operations should be arranged as the last steps.

* Cons: What happens if compensation fails?

* Cons: It's eventual consistency, not strong atomicity.

I have to admit, I don't try to tackle this problem unless it's really
important for the business in a microservices environment.

If your business requires atomicity in most of the places, it's highly
recommended to have a carefully designed monolith so you can easily benefit
from RDBMS, with your domains modeled in different modules instead of
services.

~~~
c89X
Thanks, really appreciate the pros and cons here. To be clear, and I
understand the confusion: I'm actually building more or less a monolith with a
RDBMS to encapsulate most of the business logic of my problem domain.

However, for various parts of my application that are high-complexity, high-
impact and not a competitive edge (such as authentication and credit card
handling), I want to outsource to external services (Auth0 and Stripe). This
brings with it the challenge of managing state and consistency across various
external services (on top of the state I manage internally).

The reason microservices are mentioned in the question is that I believe
these problems generalise from my specific case to microservices.

------
gregoryl
Just don't bother. At the scale you're working at, if something tries to run
and, e.g., the user doesn't exist, just throw an exception and email the team
for a manual fix. If (if!) it becomes too common, throw in a retry or two.

You have better things to be doing in a startup vs distributed transactions
(and honestly, microservices...)

------
jbob2000
Microservices are good for large complex systems, with many developers making
many changes. It sounds like your system is rather simple; I don’t think you
need to pursue this architecture, especially just starting out - you’ll need
to write too much infrastructure and you probably don’t need the scalability.

~~~
gigatexal
Yeah I agree. If prototyping an idea, why not start with a monolith and a
single DB before going microservices, if scaling is even needed?

~~~
c89X
So you would recommend implementing authentication and handling credit cards
from scratch? The reason I'm going with this approach is that, although I do
use a monolith and a single (hosted) DB for my business logic, my understanding
was that high-impact, high-complexity (non-competitive-advantage) systems like
authentication and credit card handling are best 'outsourced'.

This outsourcing of high-impact, high-complexity systems does come with the
disadvantage of now having to deal with external services. As far as I can
tell, the atomicity/transactional problem with these external services
generalises to microservices - hence the microservices tag.

~~~
gigatexal
No of course not.

You have a third-party thing handling something like authentication. Bake that
into your application in a simple flow: if it succeeds, the user gets added to
the database or gets access etc. to things. Then you would be able to use
transactions in the DB.

------
moondev
This may not be the solution you are seeking, but I would do this with an
asynchronous job model. Kick off a job where each stage is a dependency on the
previous stage. A failure of any stage will kick off a cleanup job. The account
creation service would be responsible for creating the job and monitoring the
status. You could run jobs with whatever you see fit. A kafka consumer would
be an interesting option. For a quick POC you could even do it in Jenkins via
the API.

------
steven2012
Calling external services "microservices" is fundamentally wrong. They are
"services". All that will do is create a lot of confusion because you are
utterly misusing terminology. You can't just decide to misuse terminology.
That's like calling "red" "blue". You don't gain information by misusing well-
defined terms, you create confusion and misinformation.

That said, your ideas about distributed systems are wrong. Any point in a workflow
can and will fail. You need to account for all of this. You can successfully
create an account on Stripe, and then when you bill that account, it could
return an error. Or even worse, it can timeout, meaning you don't know whether
or not a user was charged.

You have to take into consideration all of these failure situations. There is
no atomicity in the way that you expect. Whenever things deviate off the happy
path, you fail quickly and decisively so that everyone knows where they stand.
That gives people the option to retry or call support.

~~~
ralmeida
What’s the practical difference between having an internal microservice which
handles subscriptions/charging and using Stripe, for example?

------
quadrature
Hate to shill, but this is a problem we faced where i work and we wrote a
great article about our solution
[https://engineering.shopify.com/blogs/engineering/building-r...](https://engineering.shopify.com/blogs/engineering/building-resilient-graphql-apis-using-idempotency)

------
BurningFrog
Simplest Thing That Could Possibly Work:

I would look at keeping track of what substeps have completed, and only treat
the user as created if they're all done.

You _could_ roll back the successful ones if not everything worked, but you
could also just let them be.

------
myvoiceismypass
Google “sagas” - it is a pattern that helps with this, rollbacks across
boundaries, etc

------
z3t4
There is no correct answer that applies to all architectures. But generally
speaking you have a core that uses the services like black boxes with APIs.
The core/caller needs to handle the errors.

------
quintes
2 phase commit and transaction commit / rollback. Think about boundaries and
domains

~~~
mooted1
That's not what the OP is asking--they're using external services like Stripe.
Creating a Stripe charge isn't something you can roll back.

------
jrockway
There is no one-size-fits-all approach to atomicity in microservices.
Microservices typically want to store the most minimal amount of state
possible, and push the coordination up a level; i.e., the calling service will
send all the relevant information to the service that is going to handle this
tiny piece of the work. Eventually there is going to be one service that owns
some sort of workflow, and pushes the rest of the world in the direction of
its desired outcome.

A year or so ago I needed to write a service to rotate accounts on network
equipment every day or so. The control channel and the network hardware tended
to be flaky, so I designed a state machine to make as much progress as
possible, and be able to pick up where it left off. Each newly-created account
was a row in a transactional database, and the successful completion of the
operation was noted by changing a column from NULL to the current time. The
flow was: if a new account is needed, generate the credential (generated_at).
Find the old credentials and log into the device. Add the new account
(applied_at). Try logging in with the new account (verified_at). Wait until
account expiration (2 days later). Delete the account (deleted_at). Verify
that we can't log in with the old account (delete_verified_at).

From this data, we could see what state every account was in, and query for a
list of operations that needed to be performed. (Or if nothing needed to be
done, how long to sleep for so that we would re-run the loop at the exact
instant when work needed to be done, not that it was critical.)

I believe that your account creation and account deletion should follow a
similar workflow. Accounts that fail to be created after retrying for X length
of time should just move into the deletion workflow.

The user creation service and the day-to-day user operation service should
probably be the same thing. The "user" row in your database should be what the
state machine uses to figure out what operations need to be performed.

A user record would probably look something like:

    
    
      username, password, email, ... = <stuff your user gives you>
      created_at = <now>
      stripe_account_id = <what the stripe account creation rpc returns>
      stripe_account_verified_at = <time when you verified that the stripe account works>
      auth0_account_id = ...
      auth0_account_verified_at = ...
      delete_requested_at = <when the user clicked "delete account" or you decided you don't want to retry the above steps any more>
      stripe_account_delete_verified_at = <time that you ensured the account is gone>
      auth0_account_delete_verified_at = <ditto>
    

Now you can do queries to figure out what work needs to be done. "select
username from users where delete_requested_at is null and
stripe_account_verified_at is null" -> create stripe accounts for those users.
"select username from users where delete_requested_at is not null and
auth0_account_delete_verified_at is null" -> delete auth0 accounts for those
users.
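Those "what work needs doing" queries can be sketched concretely (sqlite in memory, the column set reduced to what the two queries touch; the data is made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE users (
        username TEXT PRIMARY KEY,
        stripe_account_verified_at TEXT,
        auth0_account_delete_verified_at TEXT,
        delete_requested_at TEXT
    )""")
db.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [("alice", None, None, None),                   # needs a stripe account
     ("bob", "2020-01-01", None, None),             # fully set up
     ("carol", "2020-01-01", None, "2020-02-01")])  # needs auth0 deletion

# Users whose stripe accounts still need to be created:
needs_stripe = [row[0] for row in db.execute(
    "SELECT username FROM users "
    "WHERE delete_requested_at IS NULL "
    "AND stripe_account_verified_at IS NULL")]

# Users whose auth0 accounts still need to be deleted:
needs_auth0_delete = [row[0] for row in db.execute(
    "SELECT username FROM users "
    "WHERE delete_requested_at IS NOT NULL "
    "AND auth0_account_delete_verified_at IS NULL")]
```

The state machine never stores "what to do next" anywhere; it is always derived from the timestamps, which is what makes the accounting exact.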

The last bit of complexity is to prevent two copies of your user service from
running the query at the exact instant and deciding to create a stripe account
twice. I would personally just run one replica of that job. This makes
deployments slightly more irritating, since no user creation can occur during
a rollout (when the number of replicas is 0), but it sure is simple. I worked
on a user-facing feature at Google that did that, and although I hated it, it
worked fine. It is also possible to add a column to lock the row; check that
stripe_account_creation_started_at is null; change it to the current time;
commit. Only one replica will be able to successfully commit that and progress
to the next step, but then you need a way of finding deadlocks (the app could
crash right after the commit) and breaking them.
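The lock-column idea is a compare-and-set: only the replica whose UPDATE actually changes the row wins the right to call Stripe. A sketch (sqlite in memory; column and function names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE users (
    username TEXT PRIMARY KEY,
    stripe_account_creation_started_at TEXT)""")
db.execute("INSERT INTO users VALUES ('alice', NULL)")
db.commit()

def try_claim(conn, username, now):
    # Set the lock column only if it is still NULL; the WHERE clause
    # makes this a compare-and-set, so exactly one caller wins.
    cur = conn.execute(
        "UPDATE users SET stripe_account_creation_started_at = ? "
        "WHERE username = ? AND stripe_account_creation_started_at IS NULL",
        (now, username))
    conn.commit()
    return cur.rowcount == 1  # True only for the first claimant

first = try_claim(db, "alice", "2020-01-01T00:00:00")
second = try_claim(db, "alice", "2020-01-01T00:00:01")  # loses the race
```

As noted above, you then still need a way to detect and break the lock if the app crashes right after claiming it (e.g. treat claims older than some timeout as dead).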

It is a little complicated but I personally would rather have exact accounting
of what is happening and why, rather than guessing when some user logs in and
doesn't have a stripe account.

Edit to add: one last bit... I like running this all in a background loop that
wakes up periodically to see what needs to be done, but your createAccount RPC
should also be able to immediately wake this loop up. That way if everything
goes well, you don't add any latency by introducing this state machine /
workflow manager. If something happens like Stripe being down... progress will
resume when the loop wakes up again. For that reason, I think you should be
explicit and provide the end user with a system that lets them request an
account and lets them check the status of the request. (Maybe not the user
directly, but your signup UI will be consuming this stuff and providing them
with appropriate feedback. You don't want the createAccount RPC to just hang
while you're talking to Stripe and Auth0 in the background. Probably. The
happy case might take a second, but the worst case could take much longer.
Design your UI to be good for the worst case, and it will be good for the
happy case too.)

~~~
c89X
I like this approach: it closely matches my initial instincts of having a
'transaction' table for each type of event, with columns for each of the
steps. The reason I'm wary of this approach is that I have done an
implementation of this in the past and it turned out to become a nightmare -
most likely due to a combination of the complexity of that particular
domain, my inexperience with this type of problem and external pressures.

I did find some pretty good resources on Sagas - do you have any thoughts on
comparing Sagas' pros/cons vis-a-vis this approach?

~~~
jrockway
I do not know much about sagas.

I like this approach because you need this user table anyway (probably!) and
no data is duplicated in your system. There is one authoritative place that
maps username to stripe id, and the logic to work on computing that (making
stripe api calls, retrying) lives inside the system that is responsible for
telling other parts of your infrastructure that ID. So you do get a consistent
view from everywhere; no system will ever have the "wrong answer".

The complexity here comes from handling every possible transition. With an
ad-hoc approach, some transitions are hidden and can be ignored until they
actually happen. With this approach, you have to think about it all up front
and write the necessary unit tests.

The first version of my password changer thing detected states I hadn't
thought about and just sent an alert for someone to manually do something.
There was a control plane outage and everything broke one day, and that forced
me to implement some of the automatic cleanup. Some other comments suggest
"don't write this, just send your ops team an email when something breaks" and
that is fine for the initial version. All you really want is a detailed record
of what's happened and what went wrong; from there you can either fix it
one-off, or finally write the code to fix it automatically. You don't need to
implement retries and rollbacks in version 0... just flag the account and go
fix it yourself until you decide you'd rather have the computer do it.

------
metapsj
Event sourcing and the notion of compensating transactions go a long way in
solving these types of problems.

~~~
throwaway40324
I did a find for 'event sourcing' in this thread and yours is the single hit
at the time of my comment. I came here to agree, and tell the OP that you
indeed would benefit from an Event Sourcing Architecture.

~~~
metapsj
thx :-) i get the impression event sourcing for the uninitiated is too
abstract a concept and is often confused with event-driven architectures.

