APIs, robustness, and idempotency (stripe.com)
280 points by edwinwee on Feb 22, 2017 | 50 comments

I authored this article and just wanted to leave a quick note on here that I'm more than happy to answer any questions, or debate/discuss the finer points of HTTP and API semantics ;)

An ex-colleague pointed out to me on Twitter today that there are other APIs out there that have developed a concept similar to Stripe's `Idempotency-Key` header, the "client tokens" used in EC2's API for example [1]. To my knowledge there hasn't really been a concerted effort to standardize such an idea more widely, but I might be wrong about that.

[1] http://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_In...

My team and I would love a blog post from you guys about the architecture of your idempotency tokens!

Our implementation is probably less sophisticated than you're thinking, but thanks! I'll pitch the idea to the team. To put it very simply, it's a middleware that stores a known response by an idempotency key's value and does a lookup on it on subsequent requests.
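For readers who want something concrete, a middleware of that general shape could be sketched roughly as follows. This is not Stripe's code: `Response`, `IdempotencyMiddleware`, and the handler signature are hypothetical names chosen for illustration, and the in-memory dict stands in for whatever shared store a real deployment would need. Key expiry and concurrent in-flight requests are ignored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Response:
    status: int
    body: str


# Hypothetical handler type: takes a request body, returns a Response.
Handler = Callable[[str], Response]


class IdempotencyMiddleware:
    """Wraps a handler and replays the stored response for a repeated key."""

    def __init__(self, handler: Handler) -> None:
        self._handler = handler
        # Assumption: a process-local dict; a multi-server setup would need
        # a shared store instead.
        self._responses: Dict[str, Response] = {}

    def handle(self, body: str, idempotency_key: Optional[str] = None) -> Response:
        # Requests without a key are passed straight through.
        if idempotency_key is None:
            return self._handler(body)

        # A known key means the operation already ran: replay its response.
        cached = self._responses.get(idempotency_key)
        if cached is not None:
            return cached

        # First time this key is seen: run the operation and remember the result.
        response = self._handler(body)
        self._responses[idempotency_key] = response
        return response
```

With this sketch, calling `handle` twice with the same key runs the wrapped handler once and returns the same `Response` both times.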

How would this handle a distributed system, where a request might be routed to a different backend server? Do you try to guarantee that a request is statefully resent to the same target each time? Do you force your idempotency keys to be stored in some kind of distributed database, like a redis cluster?

In principle, what you've built is quite similar to a cryptographic nonce [1], usually used to prevent replay attacks against authentication services, etc.

We use similar tokens in our API for idempotency as well, though we called them nonces. Standardizing a bit on terminology and how to implement them properly would be really helpful.

[1] https://en.m.wikipedia.org/wiki/Cryptographic_nonce

There are similarities, but given that people are pretty used to the word "nonce" being limited in use to cryptography, I'm not sure that we could say definitively that re-using it here would be less confusing overall.

From a more practical standpoint, the naming is most likely to stay unchanged just because a lot of people are used to it already.

Absolutely, I wasn't suggesting that you should change the naming. In fact I agree that nonce is likely not the best term for this.

But the concept as it applies to HTTP APIs is a great one that I think could have a broadly recognized implementation. Whether folks start adopting the "Idempotency-Key" header or some other approach, having awareness and maybe even a standard could be great.

> Whether folks start adopting the "Idempotency-Key" header or some other approach, having awareness and maybe even a standard could be great.

Oh yes, hugely agree there! Part of the reason that I wrote this is that I couldn't find much prior art around idempotency keys (there is some, but not much). I am at least partially hoping that people will remember this article (or that it'll show up in a Google search) as they're implementing their own similar concept, and re-use the naming.

Hi. Thanks for the very interesting article. Quick question regarding the following:

> "On a response failure (i.e. the operation executed successfully, but the client couldn’t get the result), the server simply replies with a cached result of the successful operation."

Have you considered having the server respond with different http-status-codes for the initial successful request as opposed to any nop-retries? This provides the client with additional information that may prove useful. A lazy client could simply choose to treat both response codes identically, as a success indicator. And a more diligent/sophisticated/paranoid client could choose to act upon this information in some other way.

From a side-effects/persisted-data perspective, returning a different status-code would have the exact same effects as what you described. But it would also give clients additional information that they can choose to act upon if desired.

How would the client act differently upon it? Sounds about as useful as telling the client how many TCP packets in the connection had to be retransmitted by the server.

To give one extreme example: it's possible that the duplicate requests are occurring due to a client-side bug. If the duplicated request produces a different status code, a diligent/paranoid client can use this as a warning-signal that there might be a bug in the client's implementation.

Have you considered using the resource identifier as an idempotency key? Basically, have the client generate the ID (UUID, but namespaced on the server side [client+date]). This eliminates the need for the client to generate a separate idempotency key, as well as the need for the server to maintain an idempotency repository.

I think that would be a fine alternative to the system we have.

I don't have a perfect history of events, but I suspect that our current design is basically the result of two things:

1. Idempotency keys as a concept were introduced quite a bit later than the API was originally conceived, so it made sense to make them an optional augmentation to existing integrations.

2. Our resource IDs have a fairly specific format that includes a prefix (e.g. `acct_` for an account or `ch_` for a charge). It would still be possible to generate these client-side, but it's a little extra trouble.

If you were to do it over again, would you have done it the same way? What would you have changed?

Idempotency keys are a simple enough concept, that I think that they're mostly okay as is (we could have done a few smarter things on the server side implementation, but luckily that can still be fixed).

There are certainly a few things around HTTP semantics that Stripe got wrong. e.g. Most updates should probably be `PATCH` instead of `POST`, but it's probably not worth changing at this point.

Is there any plan for the ability to discard previously used idempotency keys, so that a request with all the same parameters can actually be duplicated? Sometimes a payment will fail because a customer's card is declined (e.g. because of fraud detection or insufficient funds), and after the customer sorts it out, we'd like to try recharging their card, and not just receive the same error message. Our current workaround is to wait 24 hours for the idempotency key to expire, but it would be nice to be able to retry the request sooner without maintaining state on our side to generate a different idempotency key for the same payment.

This is by design. If you want to make "another attempt", you should use a new idempotency key. Think of it as "one attempt/transaction/request" == "one idempotency key".

Yes, exactly! Like you say, the idea is that an idempotency key represents a single request. It's perfectly okay to make a new request that's almost identical to the original in that it shares the same endpoint and all the same parameters, but that should be done using a freshly generated idempotency key.

Curious that they don’t mention HTTP conditional requests [1] even in passing. This mechanism is typically used for slightly different things, but you can, for example, make a PATCH request “idempotent” (in their sense) by adding an If-Match header to it. I’d say that Idempotency-Key itself may be considered a precondition and used with status codes 412 [2] and 428 [3].

By the way, WebDAV extended this mechanism with a general If header [4] for all your precondition needs. I’m kinda glad it didn’t catch on though...

[1] https://tools.ietf.org/html/rfc7232

[2] https://tools.ietf.org/html/rfc7232#section-4.2

[3] https://tools.ietf.org/html/rfc6585#section-3

[4] https://tools.ietf.org/html/rfc4918#section-10.4

(I wrote this.)

It's always a bit of a fine line as to what makes the final cut in this sort of article (I tried to stay on message without getting too off track), but HTTP conditional requests are definitely something that could have been a good fit.

I should point out though that using `ETag`/`If-Match` generally has a slightly different use on updates compared to Stripe's `Idempotency-Key`. A server sends back an `ETag` that's correlated to the current state of a resource, and clients make conditional requests using one so that they can get a guarantee that they're not changing state where they didn't expect to.

Because every HTTP request stands by itself, it's very possible for a client to fetch a resource and go to update it on a second request only to accidentally clobber changes that were made by a different client. It's this sort of "mid-air collision" that `ETag`/`If-Match` help to avoid. Mozilla's documentation on the subject is quite good:


If you interpreted 'current state of a resource' somewhat liberally, where the resource is the abstract state of a particular versioned/timestamped request, does an ETag then really become the same as an Idempotency-Key?

Or put another way, is the only difference that ETags are generally computed based on the server's stored state of a particular resource so it's possible to have multiple clients with the same ETag, while the 'resource' backing an Idempotency-Key is the entire state of a particular user that encompasses all the user's resources?

So I think that you could absolutely patch a system that's quite similar to `Idempotency-Key` into `ETag`, but you might be pushing the original concept far enough off base that you're not gaining much by doing so.

Using `If-Match` is essentially indicating that you want to make a request conditionally as long as the server's state matches a nonce that you're holding. Presumably that nonce was handed to you already by the server on an initial request that you already executed.

You can hand Stripe's API an `Idempotency-Key` on the first request that you make against it. Furthermore, you'd never say that the first request made in this way was meant to be conditional, even if subsequent retries (after an initial failure) might be.

I think it wouldn't be a problem in practice to just retrofit `ETag` to do the same thing, but doing so is (arguably) semantically wrong, and I think there's something to be said for the clarity that comes from just using your own header with an obvious name like `Idempotency-Key`.

I'm open to being persuaded though :)

If your requests are "POST to create something" requests, you can get a more REST-ful flavor of idempotency by turning the POST into a redirecting GET followed by a PUT to emulate two phase commits.

Instead of POSTing to /transactions, I GET /transactions/fresh (optionally a URL linked from /transactions to avoid assumptions about URL structure and capabilities) which generates a unique ID and redirects to /transactions/{some-unique-id}. Attempting to GET that transaction would return 404 as it doesn't exist yet, but I can PUT to it to write a transaction. Now I'm using only idempotent methods instead of POST, and I don't need to figure out how to properly construct and manipulate a token in a header, and proxies don't need to know about this special header to know requests are idempotent since the methods I'm using communicate that already.

This adds a minimum of one extra request to all two-phase resource creations. If your goal is to retry safely though, the number of retries could dwarf that overhead. It all depends on how likely it is that failures actually occur. Client-side ID generation removes the extra request but brings back the problems of clients needing to understand URL and ID formats and construction logic.
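The flow described above can be sketched with a small in-memory model. Everything here is an assumption made for illustration: the `TransactionStore` class, the 302 redirect status, and the choice that a repeated PUT returns the stored transaction rather than overwriting it.

```python
import uuid
from typing import Dict, Optional, Tuple


class TransactionStore:
    """In-memory stand-in for a server exposing /transactions/{id}."""

    def __init__(self) -> None:
        self._transactions: Dict[str, dict] = {}

    def fresh(self) -> Tuple[int, str]:
        # GET /transactions/fresh: mint an ID and redirect to its URL.
        # Nothing is stored yet, so repeating this call is harmless.
        return 302, f"/transactions/{uuid.uuid4()}"

    def get(self, path: str) -> Tuple[int, Optional[dict]]:
        # GET /transactions/{id}: 404 until a PUT has created it.
        txn = self._transactions.get(path)
        return (200, txn) if txn is not None else (404, None)

    def put(self, path: str, payload: dict) -> Tuple[int, dict]:
        # PUT /transactions/{id}: create on first call, replay on retries.
        existing = self._transactions.get(path)
        if existing is not None:
            return 200, existing
        self._transactions[path] = dict(payload)
        return 201, self._transactions[path]
```

A client that loses the response to its PUT can simply send the same PUT to the same URL again; in this sketch the second call returns the stored transaction instead of creating a new one.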

One thing about Stripe's API I have mixed feelings about is the liberal versioning. My experience with 100s of payment integrations is that they get done once and hopefully never touched again. I know most of Stripe's updates are "additive" such that they are backwards compatible if coded liberally, but it can be confusing. Same with Lob.

(I work at Stripe.)

API versioning is definitely a debatable subject, and I don't think that anyone at Stripe would claim that the current state of affairs is perfect by any means, but it's one that we think provides a good compromise between the stability of client integrations and our own ability to iterate on the API's design and make progress.

The classic problem with web APIs is that unless you have a good versioning scheme, once you've published them, you can never make a backward incompatible change (like removing a field) unless you're okay with breaking some people's integration. Especially when it comes to payments, people tend to have strong feelings about having their integrations broken, so we try to take as many precautions as possible to make sure that doesn't happen.

An approach to versioning that you'll see in many places is to do "major" API versioning where you do something like prefix your URLs with `/v1/` or send in a special `Accept` header. A problem with that approach though is that you'd need an incredibly good reason to ever build out a `/v2/` because if you ever bump that major version you're going to leave an incredible number of users behind on the original. Most people want to integrate one time and not have to worry about upgrading (ideally ever).

At Stripe, we've tried to build a compromise by introducing minor, date-based versions that include only a fairly constrained set of changes, but which we can bump more liberally. As others have mentioned here, your account gets locked into a version on its first request, and if you never want to worry about API versioning at all, you can leave that version untouched essentially indefinitely.

If we realize that we made an API design mistake somewhere, we can fix it relatively easily and keep the API's design more cohesive for new users, while also leaving current users unaffected. It's also much easier to maintain for us because we only have to build a small compatibility module instead of having to maintain two (or more) completely divergent major API versions.

Anyway, I hope that helps explain some of the thinking behind this versioning scheme :)

If you don't mind me asking, how exactly do you guys process requests with a versioned API?

Say I come in with a request for V2. How does that get directed to the V2 code path? What about services that are identical in V1 and V2. Do you have 2 copies of the same logic? Sorry for the naive question but API versioning is something that has been on my mind recently.

> If you don't mind me asking, how exactly do you guys process requests with a versioned API?

This information has been talked about publicly before, so I don't mind explaining at all.

For the most part, the core API endpoint logic is all coupled to just the latest version. For each substantial API change in each new API version, logic is encapsulated into what we call a "compatibility gate".

Before responding, a merchant's current version is looked up, and the response is passed back through a compatibility layer that applies changes for each gate until it's been walked all the way back to the target version, then the response is sent back.

I'm glossing over a few details here of course — versioning can affect request parameters and even core logic in many places, so some gates need to be embedded throughout core code. We try to keep that as clean as we can.
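To make the walk-back idea concrete, here is a rough sketch of response-side gates only. It is not Stripe's implementation: the version dates, field names, and helper functions are all hypothetical, and it relies on ISO-style date strings comparing correctly as plain strings.

```python
from typing import Callable, Dict, List, Tuple

Response = Dict[str, object]
# A gate pairs the version that introduced a change with a function that
# undoes that change for clients pinned to an older version.
Gate = Tuple[str, Callable[[Response], Response]]


def _restore_amount_field(resp: Response) -> Response:
    # Hypothetical change: "amount" was renamed to "total" in 2017-02-01.
    out = dict(resp)
    out["amount"] = out.pop("total")
    return out


def _drop_currency(resp: Response) -> Response:
    # Hypothetical change: "currency" was added in 2016-06-01.
    out = dict(resp)
    out.pop("currency", None)
    return out


# Newest first, so a response is walked back one change at a time.
GATES: List[Gate] = [
    ("2017-02-01", _restore_amount_field),
    ("2016-06-01", _drop_currency),
]


def downgrade(response: Response, target_version: str) -> Response:
    """Apply every gate newer than the target version, newest first."""
    for introduced_in, undo in GATES:
        if target_version < introduced_in:
            response = undo(response)
    return response
```

Core logic only ever builds the latest shape (`{"total": ..., "currency": ...}` in this sketch), and `downgrade` reshapes it for an account pinned to an older version.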

Funny you mention that, I can't recall ever seeing a proper move from /v1 to /v2!

OpenStack Keystone is at /v3 already. But yeah, it really took about 3-4 years until most consuming OpenStack services updated.

For every one successful major version revamp there may be a dozen that never changed :)

Even amongst those that did do a major API revision, I suspect that you're usually left with the no-win situation of either leaving users lingering on your previous version ~forever, or enforcing a deprecation schedule and annoying a lot of people (Twitter's V1 retirement for example).

How is this confusing? You get locked into the version, and never have to worry about things breaking. If Stripe releases new features, you also get the new features. For most people, you just never have to think about versioning.

I'm cool with Stripe's versioning for the most part. There have just been a few occasions where to get a non-breaking enhancement I've needed to move to a breaking version. But not a huge deal.

> My experience with 100s of payment integrations is that they get done once and hopefully never touched again.

This is basically the only way to work with any payment service integration, in my experience.

A lot of these services concentrate on documenting their current versions but not historical ones, so maintenance of integrations using older versions is unnecessarily difficult, even if you only want to use other features that were already supported at your current version.

It's often also painful to update to a more recent revision, with limited documentation or tools for identifying breaking API changes and how to convert your integration systematically to work with the newer version. You basically have to do a full rewrite against the new API from the start, and of course that probably comes with a significant risk of regressions in other areas.

In some cases, there are also confusing and/or poorly documented rules about which version you actually get both on active requests to the API and in any webhooks you get back, depending on what's pinned to what, whether you've updated a default version in a dashboard somewhere, whether you send any extra headers with your request, whether any testing/sandbox environments are linked to the same version as production, etc.

It's all very unfortunate, because every payment service we use has added potentially useful new features since we first integrated, and using those features would probably bring in more revenues for us and by extension more fees for the payment services. However, the risk of breaking such an important part of our system is just too great for us.

It sounds like you haven't looked a bit at Stripe's versioning, given your generic critiques.

Stripe's API versions come with absurdly detailed changelogs, documentation that's complete for each version and automatically shows you the right doc for the version you're on, and consistency/forethought in API design that means very, very rare breaking changes that cause real problems for an application.

I've used them for almost 3 years now in an extremely complex payments system -- we use nearly every feature they offer in several different ways -- and it's always been a pleasure.

> It sounds like you haven't looked a bit at Stripe's versioning, given your generic critiques.

I was being polite. Stripe is my go-to example for integrations with payment services sometimes becoming write-only by default as time passes and their API moves on.

> Stripe's API versions come with absurdly detailed changelogs

Stripe's API has a basic changelog. This is helpful, but also the minimal requirement to be useful at all.

It would be more useful to provide a migration guide that also shows how to change an existing integration to work with newer API versions when they make significant changes, similar to the documentation available for setting up a new integration in the first place.

> documentation that's complete for each version

As Stripe's service has grown, unfortunately their documentation has grown less reliable. There have been quite a few errors and omissions in recent years, which we've pointed out to them from time to time.

As for "each version", where have you found official documentation for any API version except the latest?

> automatically shows you the right doc for the version you're on

No, it doesn't. I've just checked this.

> consistency/forethought in API design that means very, very rare breaking changes that cause real problems for an application

Breaking changes aren't particularly unusual with Stripe. Fortunately, they are pretty reliable at supporting older API versions as well, which is to their credit.

Again, I would consider this a minimal requirement to be useful in this industry, but it's not something that everyone does as well as they do and I know it's part of the reason some businesses stay with them.

> I've used them for almost 3 years now in an extremely complex payments system -- we use nearly every feature they offer in several different ways -- and it's always been a pleasure.

Unfortunately that doesn't mean the limitations of how their system works are any less limiting.

In addition to the various points above, I'd add that although you can override your current API version on a request, you can't similarly override it temporarily on webhooks, so testing an updated integration is difficult.

There's also, as far as I've ever seen, no documentation that specifies exactly what the effects of that API version setting on the dashboard are. For example, will changing it affect both production and testing environments simultaneously, and can it be downgraded again when testing against a certain API version is finished or if a regression is found? You can only find these things out by trial and error in my experience, and it's a brave developer who'd try that on their first integration when it might compromise their production system.

It's still not possible to set up a fully automated integration test suite either. This would obviously be a very useful facility if you wanted to check whether your integration would still work properly against a new API version.

I'm happy that you're happy, but please consider that there are plenty of us who have also been working with Stripe and maintaining integrations with their systems for a lot longer than 3 years, and maybe we've simply encountered different problems than you have.

[Edit: Removed some unnecessary snark. The point of this comment was not to have a dig, because Stripe do better than many in this area. But the problem of effectively becoming version locked as a current API gets further ahead of the current integration is ubiquitous with web services, including payment services. Almost all of them, at least among the ones I've used professionally, could do more to help their users not just integrate initially but also maintain or update those integrations, possibly months or years after they were first written.]

Stripe pins your API version once you have sent your first request. You can see it in the dashboard, along with an option to update to the latest API revision.


As to how they achieve it, Amber has a good blog post over at http://amberonrails.com/move-fast-dont-break-your-api/ but it means it really is integrate once, and never touch it again :)

The way Stripe does idempotency keys has always reminded me of two-phase commit, though I know the two things are quite different. I wonder what in the distributed systems literature inspired this technique.

To me the most interesting thing that came out of this article was exponential backoff and retry. I've always used e (the base of the natural logarithm) as the base for my exponential backoff, and never used jitter (all my apps are single clients that are just trying to hit the AWS API, none run simultaneously, and I don't own the other side of it).

By looking it up I learned the concept of jitter, which is how Ethernet works, and I think it's really cool.
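For anyone else meeting jitter for the first time, a minimal sketch of exponential backoff with a randomized delay might look like this. The function names, the base of 2, the 0.5 second starting delay, and the 30 second cap are all arbitrary choices for illustration, not values taken from the article.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Pick a random delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    # Randomizing over the whole interval spreads out clients that all
    # failed at the same moment, so they don't retry in lockstep.
    return rng() * ceiling


def retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Retrying like this is only safe when the operation itself is idempotent, which is the point of pairing it with an idempotency key.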

Nice, clear article. I've always been impressed by the usefulness and clarity of Stripe's documentation. I should pay more attention to their blog.

An "idempotency key" can also just be a resource URL. If you use an idempotent HTTP verb this works wonders and is also more RESTful:

PUT example.com/orders/abc123/charged

The first PUT starts charging. If the client disconnects and retries while the original charge process is already running on the server, the server can just await this already running process. And if the order is already charged, the server can just return the outcome of the charge.

As long as there is a unique URL for every state a resource such as an order can be in, you can uniquely identify every action on it on the server-side and re-attach to running processes.

Using unique URLs for each state is the main idea behind REST (REpresentational State Transfer). REST namely means just this: representing server states as resources (URLs).
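As a rough sketch of the "re-attach to a running process" idea, the server can keep one future per URL and let retries wait on it. `ChargeEndpoint` and the `charge` callback are hypothetical names; the sketch is in-process only and, as a simplification, also remembers a failed charge forever.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class ChargeEndpoint:
    """Handles PUT on URLs such as /orders/abc123/charged."""

    def __init__(self, charge: Callable[[str], str]) -> None:
        self._charge = charge
        self._lock = threading.Lock()
        # One future per resource URL; the URL itself acts as the key.
        self._by_url: Dict[str, Future] = {}

    def put(self, url: str) -> str:
        with self._lock:
            future = self._by_url.get(url)
            is_first = future is None
            if is_first:
                future = Future()
                self._by_url[url] = future

        if is_first:
            # Only the first request for this URL actually runs the charge.
            try:
                future.set_result(self._charge(url))
            except Exception as exc:
                future.set_exception(exc)

        # Retries block here until the original charge finishes, then get
        # the same outcome.
        return future.result()
```

A second PUT to the same URL, whether it arrives during or after the charge, returns the original outcome without charging again.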

The trick with idempotency keys in practice is figuring out a good stable way to record and query them.

So you want your endpoint to only do something once. Fine, does that mean I need a table in the DB with every key I've ever seen?

The most pleasant way I've solved this has been to think of it as a rate limit where the limit is once per forever. After that, idempotency keys fit nicely into a token bucket and rate limit caching solution. This is one of the most useful things that I think https://www.ratelim.it/documentation/once_and_only_once does. (I built RateLim.it)

I'm shocked at how few HTTP libraries on GitHub properly handle exponential backoff, let alone retries.

Do the Internet a favor, and file an issue with your favorite HTTP library asking them to implement exponential backoff.

I haven't found anything in JS that does this properly though. Do people really just write apps that crap out upon the first HTTP request failure?

The best library I have come across is actually SquareUp's OkHttp (the payment processing companies seem to be the only ones getting this right).

As the author of an HTTP library (lua-http https://github.com/daurnimator/lua-http) that doesn't, I'm interested in how you'd want retries (not to mention exponential backoffs) to work:

- Should they be the default?

- What requests should be retried? (as much as we wish GETs were idempotent... they're not) see https://lists.w3.org/Archives/Public/ietf-http-wg/2017JanMar... for an intro to the complexities here

- What should get reused? (e.g. do new dns lookup? reuse connection? reuse proxy connection?)

- How to handle non-replayable pieces? (e.g. request bodies coming from a fifo)

- Usually a request has a timeout/deadline. Should it be restarted for the retry? (probably not)

I haven't implemented retries yet, as the above questions seem hard to answer without knowing application/server specific qualities (and hence should be left to the user of the http library). Please feel free to file an issue :)

That list of questions is why I prefer retry and back-off to be separate from the internals of the http library.

As a library user I need to work around the peculiarities of the particular service endpoints I'm integrating with. Backoff and retry aren't specific to the application/transport protocol through which a service is consumed.

For example, in the Java world, see the approach taken by libraries like Hystrix, guava-retrying, and Failsafe: https://github.com/jhalterman/failsafe#retries

What do you recommend for JS?

Libraries that provide similar abstractions in js do exist (https://github.com/tim-kos/node-retry and https://github.com/MathieuTurcotte/node-backoff), but I'm not qualified to recommend anything.

Most of the colleagues I've had have rolled their own one-off solutions when needed.

There's a decent Go library for various backoff algorithms. Useful if you're creating services which call webhooks: https://github.com/cenkalti/backoff

It seems to me that OkHttp doesn't in fact support a retry back off.


An "idempotency key" as this article suggests applies to WebSockets as well and is particularly useful for correlating a response to a request.
