Hacker News
Ask HN: How do you deal with atomicity in microservice environments?
116 points by c89X on Nov 28, 2019 | 40 comments
We run a small SaaS where users are able to create accounts, submit billing information and upload/call ML artefacts.

In order to not reinvent the wheel we use external services where possible: Auth0 for authentication, Stripe for handling billing, etc. For this question, I am considering this a 'microservices' architecture. I am aware that this definition will spark its own discussion, but I believe that the problem generalises to a lot of the other (better, more complete, etc.) microservice definitions. So please, bear with me.

Now, in the lifecycle of a customer (createAccount, addBilling, deleteAccount, ...) at various points we expect operations to occur atomically. By which I mean (simplified) that upon creating a new account, I also need to ensure a customer is created in Stripe as well as register the user in Auth0 - but if either of these subtasks fails, the operation (createAccount) should fail completely and in fact 'undo' any state changes already performed. If not, I risk high-impact bugs such as double charging customers.

Now, in a 'conventional' setup (without external services), I would resolve a lot of this by ensuring transactional operations on a single source-of-truth database. I understand that 'idempotency' comes up a lot here, but whichever way I try to apply it, the result always seems to explode into a fragile, brittle spaghetti of calls, error handling and subsequent calls.

Surely this has been resolved by now, but I'm having a hard time finding any good resources on how to approach this in an elegant manner. Concretely:

Do you recognise the problem of atomicity in microservices architecture?

How do you deal with guaranteeing this atomicity in your microservices?

Having built similar applications in microservice environments, I think there are usually simpler answers than distributed transactions. And if you do need distributed transactions, this is often a sign that your service boundaries are too granular.

In fact, since the services you're describing don't know about each other, distributed transactions aren't an option.

I think the only solution to this problem is idempotency. Idempotency is a distributed systems Swiss Army knife—you can tackle 90% of use cases by combining retries and idempotency and never have to worry about ACID. Yes, it adds complexity. No, you don't have a choice.

I'm also not sure why this requires a lot of complexity. Can you explain how you're implementing idempotency? The simplest approach is to generate an idempotency key on the browser side and thread it across your call graph. Stripe has built-in support for idempotency keys, so in that case no additional logic is required. For providers without idempotency support, you'll need a way to track idempotency keys atomically, but this is usually trivial to implement. When a particular provider fails, you can ask users to retry.

* If you need a particular operation to succeed only if another succeeds (creating a Stripe charge, for example), make sure that it runs after its dependencies.

* If you don't like the idea of forcing users to retry, you can ensure "eventual consistency" using a durable queue + idempotency.

I'm not a fan of HN comments that trivialize problems, but if you have to build complex distributed systems machinery to solve the problem you're describing, I feel strongly that something's going wrong.

Also, I think the answers you're getting keep telling you to build distributed transactions because:

a) they didn't read your post and are overindexing on "microservices"

b) this isn't an atomicity problem, it's a workflow problem. When people say atomicity, they're usually referring to operations over data. In your case, it sounds like you need a way to coordinate execution of side-effectful operations (like creating a charge).

but when one starts talking about idempotence and retries, the first thing that comes to mind is that these are just user-mediated two-phase commits, where an error dialog and user interaction drive the phases instead of a distributed transaction system; by which point one can just cut out the middleman and build the thing out of tested and robust components

What middleman do you want to cut out? My proposal doesn't add any dependencies. It's also not possible to prevent duplicate charges if a user POSTs a form twice without idempotency keys, so idempotency is necessary.

I'm really not sure what you're actually proposing?

the user doing the retries

This approach makes a lot of sense. I was unaware of the concept of idempotency keys (nor did I know Stripe supported them). I'm actually trying to avoid complexity, which is why I turned to Auth0, Stripe (i.e. external services) to handle most of the logic for me - I'm sure I just need to figure out how to correctly apply those! Thanks for your input!

The way I have tried to implement (very likely, shoddily) idempotency is to have each mutation on an external service (say creating a customer in Stripe, corresponding to an account already administered in my database) first check if a customer with that ID already exists, and if not to create it - otherwise use the existing object (Stripe customer in this example).

```
user = Database.get_user(my_user_id)

if Stripe.customer_exists(user.id):
    return Stripe.get_customer(user.id)
else:
    stripe_customer = Stripe.create_customer(user.id, user.email)
    Database.set_user_property("stripe_customer_id", stripe_customer.id)
    return stripe_customer
```

The problems here are multiple, but at the very least I see the possibility of very nasty bugs if somehow the `customer_exists` call returns a false negative - this will cause the same customer to be created in Stripe twice (and thus potentially be charged twice). Another, more likely, issue is that between the `Stripe.create_customer` call and the `set_user_property` there may be an unexpected event (service/network goes down, whatever) failing to store the property - leading to a duplicate Stripe customer the next time the above code is executed. On top of that, I find it pretty difficult to reason about a code base chock full of this type of logic (perhaps that is just my personal limitation though!).
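One way to close both gaps is to stop checking for existence and instead make the create itself idempotent: persist a locally generated key on the user row *before* calling Stripe, so any retry after a crash reuses the same key and gets the same customer back. A sketch under those assumptions, with in-memory stand-ins (`FakeDB`, `FakeStripe` and all method names are hypothetical; the real Stripe API accepts an idempotency key per request):

```python
import uuid

class FakeDB:
    """Minimal in-memory stand-in for the local user store."""
    def __init__(self):
        self.users = {"u1": {"email": "a@example.com",
                             "stripe_customer_id": None,
                             "stripe_idempotency_key": None}}
    def get_user(self, uid): return self.users[uid]
    def set(self, uid, key, value): self.users[uid][key] = value

class FakeStripe:
    """Honors idempotency keys: a repeated key returns the same customer."""
    def __init__(self):
        self.by_key, self.customers = {}, {}
    def create_customer(self, email, idempotency_key):
        if idempotency_key in self.by_key:
            return self.by_key[idempotency_key]   # replay, no duplicate
        cust = {"id": "cus_" + uuid.uuid4().hex[:8], "email": email}
        self.customers[cust["id"]] = cust
        self.by_key[idempotency_key] = cust
        return cust

def ensure_stripe_customer(db, stripe, uid):
    user = db.get_user(uid)
    if user["stripe_customer_id"]:                # already linked: done
        return stripe.customers[user["stripe_customer_id"]]
    # 1. Persist the key BEFORE calling Stripe: a crash between the next
    #    two steps leaves behind a key we can safely retry with.
    if not user["stripe_idempotency_key"]:
        db.set(uid, "stripe_idempotency_key", str(uuid.uuid4()))
    # 2. Safe to rerun: the same key always yields the same customer.
    cust = stripe.create_customer(user["email"],
                                  user["stripe_idempotency_key"])
    # 3. Record the link; if we crash first, the retry repeats step 2
    #    with the same key and gets the same customer back.
    db.set(uid, "stripe_customer_id", cust["id"])
    return cust
```

With this ordering there is no check-then-create race at all: a false negative or a crash between steps just means an extra (harmless) call with the same key.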

You often don't need or want cross-service atomicity, just eventual consistency. Each microservice should be an idempotent state machine that expects to make atomic commits to its own state but never expects to be able to conduct a transaction across the service boundary. However, services can conduct local transactions on their own state joined against a read-only cached copy of another service's state - you can implement this with an event bus or a shared caching tier. This can allow you to avoid writing your own joining logic and use standard ORMs. Ensuring the queues are flowing and retries happen is very important though; you need to monitor queue lengths and job errors to ensure the eventual part of eventual consistency is happening.

For this particular example, createAccount would make a local commit with a state of CREATING, return an account_id and asynchronously create jobs to complete the billing, the auth, whatever. You then have a job polling in the background to move the account to CREATED once all or enough of the dependencies are successful (i.e. you may have a slow third-party provider you don't want to block on). Your front end polls the state of the account and displays a pretty animation to distract the user while you do the work.
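A minimal sketch of that CREATING → CREATED reconciliation loop; the provisioning stubs are hypothetical placeholders for the real Stripe/Auth0 jobs, and each sub-step must be idempotent so the whole function is safe to rerun:

```python
from enum import Enum

class AccountState(Enum):
    CREATING = "creating"
    CREATED = "created"

# Stand-ins for the real provisioning jobs (names are hypothetical).
def provision_billing(account):  # e.g. create the Stripe customer
    return True

def provision_auth(account):     # e.g. register the Auth0 user
    return True

def reconcile(account):
    """Background poller body: push a CREATING account toward CREATED.
    Idempotent: already-completed sub-steps are skipped on rerun."""
    if account["state"] is AccountState.CREATED:
        return account
    if not account["billing_done"]:
        account["billing_done"] = provision_billing(account)
    if not account["auth_done"]:
        account["auth_done"] = provision_auth(account)
    if account["billing_done"] and account["auth_done"]:
        account["state"] = AccountState.CREATED
    return account

# createAccount commits this row and returns the id immediately; the
# front end polls `state` until the background loop flips it to CREATED.
account = {"id": "acct_1", "state": AccountState.CREATING,
           "billing_done": False, "auth_done": False}
reconcile(account)
```

If one provider is down, its flag simply stays False and the next poll picks up where this one left off.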

Thanks. From my (limited) understanding this sounds a lot like the Sagas that have been suggested by others - does that match your intention?

"Sagas", or distributed transactions, are what you're looking for. These are APIs/functions/methods that know how to complete every step of your atomic operation and how to roll it back if any step fails. They more or less recreate what would have been a single database transaction pre-microservice.

That's it! Not sure how I never stumbled upon this. Thanks!

Came here to say the same thing. The Saga pattern is the way to go.

Also, I suggest you think carefully about transaction boundaries; if possible, prefer eventual consistency over transactions.

The real business world is not atomic. The idea is to keep track of things that you've done, and do compensating things to roll their effect back if something goes wrong. See the Saga pattern for an example. I also found Gregor Hohpe's blog post [0] titled "Starbucks Does Not Use Two-Phase Commit" very informative.

[0] https://www.enterpriseintegrationpatterns.com/ramblings/18_s...

good read. ... thanks

Not everything that uses lots of services needs to be a classical "microservice" architecture, where every service can call every other service.

You can have a component that coordinates workflows (can be a different component for each workflow, or can be one central component, depending on your need to scale vs. simplicity).

In the OpenStack project, they developed TaskFlow[0] for that. It allows you to build graphs of tasks, handles parallelization, persistence of meta information (if desired) and, important for your use case, rollback functions in case of errors.

[0]: https://docs.openstack.org/taskflow/latest/

"Designing Data-Intensive Applications" by Martin Kleppmann touches on this subject: http://dataintensive.net

I know two not-so-complex ways for dealing with this:

1. Use a persistent queue where you store what needs to be done. This allows you to ensure that an operation will eventually succeed by leveraging the powers of the queue; you'll need to keep track of the stages where jobs failed in order to not repeat operations.

2. Rely on clients retrying and use idempotency; there is a nice and very detailed blog post[0] about it. This works very well within your own microservices, but most external services won't have idempotency support, and you'll need to plan how to emulate idempotency for those; sometimes the queue helps.

[0]: https://brandur.org/idempotency-keys
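Option 1 can be sketched in a few lines: keep a per-job record of the last completed stage, so a retried job resumes instead of repeating finished work. This is an in-process illustration (the stage names and `handler` callback are hypothetical; in production both the queue and the `progress` table would live in durable storage):

```python
import queue

jobs = queue.Queue()   # stand-in for a durable queue
stages = ["create_stripe_customer", "create_auth0_user", "mark_active"]
progress = {}          # job_id -> index of the next stage to run

def run_job(job_id, handler):
    """Run the remaining stages of a job. `handler(stage, job_id)` does
    the real side effect and returns True on success. On failure the job
    is re-enqueued and later resumes at the failed stage."""
    for i in range(progress.get(job_id, 0), len(stages)):
        if not handler(stages[i], job_id):
            jobs.put(job_id)         # retry will restart at stage i
            return False
        progress[job_id] = i + 1     # record completion before moving on
    return True
```

Recording the stage index before advancing is what prevents a retry from repeating `create_stripe_customer` after `create_auth0_user` failed.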

thanks, useful resource!

Just don't bother. At the scale you're working at, if something tries to run and, e.g., the user doesn't exist, just throw an exception and email the team for a manual fix. If (if!) it becomes too common, throw in a retry or two.

You have better things to be doing in a startup than distributed transactions (and honestly, microservices...)

This is a hard problem to tackle.

An obvious approach is 2-phase commit or N-phase commit. 2PC and NPC are generally not recommended because they sacrifice too much performance.

* Pros: It is more like a transaction so that you don't have to write rollback logic.

* Cons: It seriously sacrifices system availability since the distributed transaction happens across networks.

A commonly recommended approach is the Saga pattern; it's basically compensation.

* Pros: Does not sacrifice system availability, since it's an asynchronous/optimistic design.

* Cons: You have to write a lot of compensation logic and call it correctly.

* Cons: There are a lot of operations that cannot be compensated. These should be arranged as the last steps.

* Cons: What happens if compensation fails?

* Cons: It's eventual consistency, not strong atomicity.

I have to admit, I don't try to tackle this problem unless it's really important for the business in a microservices environment.

If your business requires atomicity in most places, it's highly recommended to have a carefully designed monolith so you can easily benefit from an RDBMS, with your domains modeled as different modules instead of services.

Thanks, really appreciate the pros and cons here. To be clear, and I understand the confusion: I'm actually building more or less a monolith with a RDBMS to encapsulate most of the business logic of my problem domain.

However, various parts of my application that are high-complexity, high-impact and not a competitive edge (such as authentication and credit card handling) I want to outsource to external services (Auth0 and Stripe). This brings with it the challenge of managing state and consistency across various external services (on top of the state I manage internally).

The reason microservices are mentioned in the question is that I believe these problems generalise from my specific case to microservices.

Microservices are good for large complex systems, with many developers making many changes. It sounds like your system is rather simple, I don’t think you need to pursue this architecture, especially just starting out - you’ll need to write too much infrastructure and you probably don’t need the scalability.

To be clear, if you're interacting with external services that you don't control, you will run into the problem of coordinating transactions. Doesn't matter if you're running a single-server monolith.

(That's why the OP considers this a microservice)

Yeah, I agree. If prototyping an idea, why not start with a monolith and a single DB, and go microservices only if scaling is ever even needed?

So you would recommend implementing authentication and handling credit cards from scratch? The reason I'm going with this approach is that although I do use a monolith and a single (hosted) DB for my business logic, my understanding was that high-impact, high-complexity (non-competitive-advantage) systems like authentication and credit card handling are best 'outsourced'.

This outsourcing of high-impact, high-complexity systems does come with the disadvantage of now having to deal with external services. As far as I can tell, the atomicity/transactional problem with these external services generalises to microservices - hence the microservices tag.

No, of course not.

You have a third-party thing that works, like authentication. Bake that into your application in a simple flow: if it succeeds, the user gets added to the database or gets access to things, etc. Then you would be able to use transactions in the DB.

This may not be the solution you are seeking, but I would do this with an asynchronous job model. Kick off a job where each stage is a dependency on the previous stage. A failure of any stage will kick off a cleanup job. The account creation service would be responsible for creating the job and monitoring the status. You could run jobs with whatever you see fit. A Kafka consumer would be an interesting option. For a quick POC you could even do it in Jenkins via the API.

Calling external services "microservices" is fundamentally wrong. They are "services". All that will do is create a lot of confusion, because you are misusing well-defined terminology. That's like calling "red" "blue". You don't gain information by misusing terms; you create confusion and misinformation.

That said, your ideas about distributed systems are wrong. Any point in a workflow can and will fail, and you need to account for all of this. You can successfully create an account on Stripe, and then when you bill that account, it could return an error. Or even worse, it can time out, meaning you don't know whether or not the user was charged.

You have to take into consideration all of these failure situations. There is no atomicity in the way that you expect. Whenever things deviate off the happy path, you fail quickly and decisively so that everyone knows where they stand. That gives people the option to retry or call support.

What’s the practical difference between having an internal microservice which handles subscriptions/charging and using Stripe, for example?

Hate to shill, but this is a problem we faced where I work, and we wrote a great article about our solution: https://engineering.shopify.com/blogs/engineering/building-r...

Simplest Thing That Could Possibly Work:

I would look at keeping track of what substeps have completed, and only treat the user as created if they're all done.

You could roll back the successful ones if not everything worked, but you could also just let them be.

Google “sagas” - it is a pattern that helps with this, rollbacks across boundaries, etc

There is no correct answer that applies to all architectures. But generally speaking, you have a core that uses the services like black boxes with APIs. The core/caller needs to handle the errors.

Two-phase commit and transaction commit/rollback. Think about boundaries and domains.

That's not what the OP is asking--they're using external services like Stripe. Creating a Stripe charge isn't something you can roll back.

There is no one-size-fits-all approach to atomicity in microservices. Microservices typically want to store the most minimal amount of state possible, and push the coordination up a level; i.e., the calling service will send all the relevant information to the service that is going to handle this tiny piece of the work. Eventually there is going to be one service that owns some sort of workflow, and pushes the rest of the world in the direction of its desired outcome.

A year or so ago I needed to write a service to rotate accounts on network equipment every day or so. The control channel and the network hardware tended to be flaky, so I designed a state machine to make as much progress as possible, and be able to pick up where it left off. Each newly-created account was a row in a transactional database, and the successful completion of each operation was noted by changing a column from NULL to the current time. The flow was: if a new account is needed, generate the credential (generated_at). Find the old credentials and log into the device. Add the new account (applied_at). Try logging in with the new account (verified_at). Wait until account expiration (2 days later). Delete the account (deleted_at). Verify that we can't log in with the old account (delete_verified_at).

From this data, we could see what state every account was in, and query for a list of operations that needed to be performed. (Or if nothing needed to be done, how long to sleep for so that we would re-run the loop at the exact instant when work needed to be done, not that it was critical.)

I believe that your account creation and account deletion should follow a similar workflow. Accounts that fail to be created after retrying for X length of time should just move into the deletion workflow.

The user creation service and the day-to-day user operation service should probably be the same thing. The "user" row in your database should be what the state machine uses to figure out what operations need to be performed.

A user record would probably look something like:

  username, password, email, ... = <stuff your user gives you>
  created_at = <now>
  stripe_account_id = <what the stripe account creation rpc returns>
  stripe_account_verified_at = <time when you verified that the stripe account works>
  auth0_account_id = ...
  auth0_account_verified_at = ...
  delete_requested_at = <when the user clicked "delete account" or you decided you don't want to retry the above steps any more>
  stripe_account_delete_verified_at = <time that you ensured the account is gone>
  auth0_account_delete_verified_at = <ditto>
Now you can do queries to figure out what work needs to be done. "select username from users where delete_requested_at is null and stripe_account_verified_at is null" -> create stripe accounts for those users. "select username from users where delete_requested_at is not null and auth0_account_delete_verified_at is null" -> delete auth0 accounts for those users.

The last bit of complexity is to prevent two copies of your user service from running the query at the exact same instant and deciding to create a Stripe account twice. I would personally just run one replica of that job. This makes deployments slightly more irritating, since no user creation can occur during a rollout (when the number of replicas is 0), but it sure is simple. I worked on a user-facing feature at Google that did that, and although I hated it, it worked fine. It is also possible to add a column to lock the row: check that stripe_account_creation_started_at is null; change it to the current time; commit. Only one replica will be able to successfully commit that and progress to the next step, but then you need a way of finding deadlocks (the app could crash right after the commit) and breaking them.
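That lock-column idea is an optimistic compare-and-set, and its key property can be demonstrated with an in-memory sqlite3 database: an UPDATE guarded by "still NULL" matches for exactly one claimer, and the loser sees rowcount 0 and backs off. (Table and column names follow the example above; `try_claim` is a hypothetical helper.)

```python
import sqlite3

# Sketch of the optimistic row lock described above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE users (
    username TEXT PRIMARY KEY,
    stripe_account_creation_started_at TEXT)""")
db.execute("INSERT INTO users VALUES ('alice', NULL)")
db.commit()

def try_claim(conn, username, now):
    """Claim the row for Stripe account creation. Returns True only if
    this caller's UPDATE was the one that flipped NULL -> timestamp."""
    cur = conn.execute(
        """UPDATE users
           SET stripe_account_creation_started_at = ?
           WHERE username = ?
             AND stripe_account_creation_started_at IS NULL""",
        (now, username))
    conn.commit()
    return cur.rowcount == 1

first = try_claim(db, "alice", "2019-11-28T00:00:00Z")   # wins the row
second = try_claim(db, "alice", "2019-11-28T00:00:01Z")  # loses: not NULL
```

Only the winner proceeds to call Stripe; as noted, you still need a sweeper that detects and breaks stale claims if the winner crashes after committing.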

It is a little complicated but I personally would rather have exact accounting of what is happening and why, rather than guessing when some user logs in and doesn't have a stripe account.

Edit to add: one last bit... I like running this all in a background loop that wakes up periodically to see what needs to be done, but your createAccount RPC should also be able to immediately wake this loop up. That way if everything goes well, you don't add any latency by introducing this state machine / workflow manager. If something happens like Stripe being down... progress will resume when the loop wakes up again. For that reason, I think you should be explicit and provide the end user with a system that lets them request an account and lets them check the status of the request. (Maybe not the user directly, but your signup UI will be consuming this stuff and providing them with appropriate feedback. You don't want the createAccount RPC to just hang while you're talking to Stripe and Auth0 in the background. Probably. The happy case might take a second, but the worst case could take much longer. Design your UI to be good for the worst case, and it will be good for the happy case too.)

I like this approach: it closely matches my initial instinct of having a 'transaction' table for each type of event, with columns for each of the steps. The reason I'm wary of this approach is that I did an implementation of it in the past and it turned out to become a nightmare - most likely due to a combination of the complexity of that particular domain, my inexperience with this type of problem and external pressures.

I did find some pretty good resources on Sagas - do you have any thoughts comparing the pros/cons of Sagas vis-à-vis this approach?

I do not know much about sagas.

I like this approach because you need this user table anyway (probably!) and no data is duplicated in your system. There is one authoritative place that maps username to stripe id, and the logic to work on computing that (making stripe api calls, retrying) lives inside the system that is responsible for telling other parts of your infrastructure that ID. So you do get a consistent view from everywhere; no system will ever have the "wrong answer".

The complexity here comes from handling every possible transition. With an ad-hoc approach, some transitions are hidden and can be ignored until they actually happen. With this approach, you have to think about it all up front and write the necessary unit tests.

The first version of my password changer thing detected states I hadn't thought about and just sent an alert for someone to manually do something. There was a control plane outage and everything broke one day, and that forced me to implement some of the automatic cleanup. Some other comments suggest "don't write this, just send your ops team an email when something breaks", and that is fine for the initial version. All you really want is a detailed record of what's happened and what went wrong; from there you can either fix it one-off, or finally write the code to fix it automatically. You don't need to implement retries and rollbacks in version 0... just flag the account and go fix it yourself until you decide you'd rather have the computer do it.

event sourcing and the notion of compensating transactions go a long way toward solving these types of problems.

I did a find for 'event sourcing' in this thread and yours is the single hit at the time of my comment. I came here to agree, and tell the OP that you indeed would benefit from an Event Sourcing Architecture.

thx :-) i get the impression event sourcing for the uninitiated is too abstract a concept and is often confused with event-driven architectures.
