
APIs, robustness, and idempotency - edwinwee
https://stripe.com/blog/idempotency
======
brandur
I authored this article and just wanted to leave a quick note on here that I'm
more than happy to answer any questions, or debate/discuss the finer points of
HTTP and API semantics ;)

An ex-colleague pointed out to me on Twitter today that there are other APIs
out there that have developed a concept similar to Stripe's `Idempotency-Key`
header, the "client tokens" used in EC2's API for example [1]. To my knowledge
there hasn't really been a concerted effort to standardize such an idea more
widely, but I might be wrong about that.

[1]
[http://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html#client-tokens](http://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html#client-tokens)

~~~
Johnie
Have you considered using the resource identifier as an idempotency key?
Basically, have the client generate the ID (UUID, but namespaced on the server
side [client+date]). This eliminates the need for the client to generate a
separate idempotency key, as well as the need for the server to maintain an
idempotency repository.

~~~
brandur
I think that would be a fine alternative to the system we have.

I don't have a perfect history of events, but I suspect that our current
design is basically the result of two things:

1\. Idempotency keys as a concept were introduced quite a bit later than the
API was originally conceived, so it made sense to make them an optional
augmentation to existing integrations.

2\. Our resource IDs have a fairly specific format including a prefix (e.g.
`acct_` for an account or `ch_` for a charge). It would still be possible to
generate this client-side, but it's a little extra trouble.

~~~
Johnie
If you were to do it over again, would you have done it the same way? What
would you have changed?

~~~
brandur
Idempotency keys are a simple enough concept that I think they're mostly
okay as is (we could have done a few smarter things on the server side
implementation, but luckily that can still be fixed).

There are certainly a few things around HTTP semantics that Stripe got wrong.
e.g. Most updates should probably be `PATCH` instead of `POST`, but it's
probably not worth changing at this point.

------
vfaronov
Curious that they don’t mention HTTP conditional requests [1] even in passing.
This mechanism is typically used for slightly different things, but you can,
for example, make a PATCH request “idempotent” (in their sense) by adding an
If-Match header to it. I’d say that Idempotency-Key itself may be considered a
precondition and used with status codes 412 [2] and 428 [3].

By the way, WebDAV extended this mechanism with a general If header [4] for
all your precondition needs. I’m kinda glad it didn’t catch on though...

[1] [https://tools.ietf.org/html/rfc7232](https://tools.ietf.org/html/rfc7232)

[2]
[https://tools.ietf.org/html/rfc7232#section-4.2](https://tools.ietf.org/html/rfc7232#section-4.2)

[3]
[https://tools.ietf.org/html/rfc6585#section-3](https://tools.ietf.org/html/rfc6585#section-3)

[4]
[https://tools.ietf.org/html/rfc4918#section-10.4](https://tools.ietf.org/html/rfc4918#section-10.4)

~~~
brandur
(I wrote this.)

It's always a bit of a fine line as to what makes the final cut in this sort
of article (I tried to stay on message without getting too off track), but
HTTP conditional requests are definitely something that could have been a good
fit.

I should point out though that using `ETag`/`If-Match` generally has a
slightly different use on updates compared to Stripe's `Idempotency-Key`. A
server sends back an `ETag` that's correlated to the current state of a
resource, and clients make conditional requests using one so that they can get
a guarantee that they're not changing state where they didn't expect to.

Because every HTTP request stands by itself, it's very possible for a client
to fetch a resource and go to update it on a second request only to
accidentally clobber changes that were made by a different client. It's this
sort of "mid-air collision" that `ETag`/`If-Match` help to avoid. Mozilla's
documentation on the subject is quite good:

[https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag)
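
To make the "mid-air collision" case concrete, here is a minimal in-memory sketch. The `Store` class, its method names, and the hash-based tag are all assumptions made up for illustration; a real server would expose the same idea through `ETag` response headers, `If-Match` request headers, and a 412 status.

```python
import hashlib


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 Precondition Failed response."""


class Store:
    """Toy in-memory resource store (hypothetical, not a real API)."""

    def __init__(self):
        self._docs = {}

    @staticmethod
    def _etag(body):
        # The tag is derived from the resource's current content.
        return hashlib.sha256(body.encode()).hexdigest()[:16]

    def put_initial(self, key, body):
        self._docs[key] = body

    def get(self, key):
        body = self._docs[key]
        return body, self._etag(body)

    def update(self, key, new_body, if_match):
        # Apply the write only if the caller saw the latest version.
        if self._etag(self._docs[key]) != if_match:
            raise PreconditionFailed(key)
        self._docs[key] = new_body
        return self._etag(new_body)


store = Store()
store.put_initial("customer/1", "email=a@example.com")

# Two clients fetch the same version of the resource.
_, tag_seen_by_a = store.get("customer/1")
_, tag_seen_by_b = store.get("customer/1")

# Client B writes first, which changes the tag.
store.update("customer/1", "email=b@example.com", if_match=tag_seen_by_b)

# Client A's write is now rejected instead of clobbering B's change.
try:
    store.update("customer/1", "email=c@example.com", if_match=tag_seen_by_a)
    collided = False
except PreconditionFailed:
    collided = True
```

Note that the tag here comes from the server's state, which is the key difference from an idempotency key that the client invents before its first request.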

~~~
smallnamespace
If you interpreted 'current state of a resource' somewhat liberally, where the
resource is the abstract state of a particular versioned/timestamped request,
does an ETag then really become the same as an Idempotency-Key?

Or put another way, is the only difference that ETags are generally computed
based on the server's stored state of a _particular_ resource so it's possible
to have multiple clients with the same ETag, while the 'resource' backing an
Idempotency-Key is the entire state of a particular user that encompasses all
the user's resources?

~~~
brandur
So I think that you could absolutely patch a system that's quite similar to
`Idempotency-Key` into `ETag`, but you might be pushing the original concept
far enough off base that you're not gaining much by doing so.

Using `If-Match` is essentially indicating that you want to make a request
conditionally as long as the server's state matches a nonce that you're
holding. Presumably that nonce was handed to you already by the server on an
initial request that you already executed.

You can hand Stripe's API an `Idempotency-Key` on the first request that you
make against it. Furthermore, you'd never say that the first request made in
this way was meant to be conditional, even if subsequent retries (after an
initial failure) might be.

I think it wouldn't be a problem in practice to just retrofit `ETag` to do the
same thing, but doing so is (arguably) semantically wrong, and I think there's
something to be said for the clarity that comes from just using your own
header with an obvious name like `Idempotency-Key`.

I'm open to being persuaded though :)

------
notjack
If your requests are "POST to create something" requests, you can get a more
REST-ful flavor of idempotency by turning the POST into a redirecting GET
followed by a PUT to emulate two phase commits.

Instead of POSTing to /transactions, I GET /transactions/fresh (optionally a
URL linked from /transactions to avoid assumptions about URL structure and
capabilities) which generates a unique ID and redirects to
/transactions/{some-unique-id}. Attempting to GET that transaction would
return 404 as it doesn't exist yet, but I can PUT to it to write a
transaction. Now I'm using only idempotent methods instead of POST, and I
don't need to figure out how to properly construct and manipulate a token in a
header, _and_ proxies don't need to know about this special header to know
requests are idempotent since the methods I'm using communicate that already.

This adds a minimum of one extra request to all two-phase resource creations.
If your goal is to retry safely though, the number of retries could dwarf that
overhead. It all depends on how likely it is that failures actually occur.
Client-side ID generation removes the extra request but brings back the
problems of clients needing to understand URL and ID formats and construction
logic.
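
A rough sketch of that flow, with the server modeled as a plain in-memory object rather than real HTTP. `TransactionService` and its methods are hypothetical names; `fresh()` plays the role of `GET /transactions/fresh` and returns the redirect target.

```python
import uuid


class TransactionService:
    """Toy stand-in for the server side of the flow (hypothetical)."""

    def __init__(self):
        self._reserved = set()
        self._transactions = {}

    def fresh(self):
        # GET /transactions/fresh: reserve an ID and return the redirect target.
        tx_id = str(uuid.uuid4())
        self._reserved.add(tx_id)
        return f"/transactions/{tx_id}"

    def get(self, path):
        tx_id = path.rsplit("/", 1)[-1]
        if tx_id not in self._transactions:
            return 404, None  # reserved but not written yet
        return 200, self._transactions[tx_id]

    def put(self, path, body):
        tx_id = path.rsplit("/", 1)[-1]
        if tx_id not in self._reserved:
            return 404, None  # the server never handed out this ID
        created = tx_id not in self._transactions
        # PUT sets the full state, so replaying it leaves the same result.
        self._transactions[tx_id] = dict(body)
        return (201 if created else 200), self._transactions[tx_id]


service = TransactionService()
location = service.fresh()
status_before, _ = service.get(location)
first = service.put(location, {"amount": 100})
retry = service.put(location, {"amount": 100})  # safe to repeat after a timeout
```

The retry returns 200 rather than 201 in this sketch, but the stored transaction is identical and nothing is created twice.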

------
pbreit
One thing about Stripe's API I have mixed feelings about is the liberal
versioning. My experience with 100s of payment integrations is that they get
done once and hopefully never touched again. I know most of Stripe's updates
are "additive" such that they are backwards compatible if coded liberally, but
it can be confusing. Same with Lob.

~~~
brandur
(I work at Stripe.)

API versioning is definitely a debatable subject, and I don't think that
anyone at Stripe would claim that the current state of affairs is perfect by
any means, but it's one that we think provides a good compromise between the
stability of client integrations and our own ability to iterate on the API's
design and make progress.

The classic problem with web APIs is that unless you have a good versioning
scheme, once you've published them, you can never make a backward incompatible
change (like removing a field) unless you're okay with breaking some people's
integration. Especially when it comes to payments, people tend to have strong
feelings about having their integrations broken, so we try to take as many
precautions as possible to make sure that doesn't happen.

An approach to versioning that you'll see in many places is to do "major" API
versioning where you do something like prefix your URLs with `/v1/` or send in
a special `Accept` header. A problem with that approach though is that you'd
need an incredibly good reason to ever build out a `/v2/` because if you ever
bump that major version you're going to leave an incredible number of users
behind on the original. Most people want to integrate one time and not have to
worry about upgrading (ideally ever).

At Stripe, we've tried to build a compromise by introducing minor, date-based
versions that include only a fairly constrained set of changes, but which we
can bump more liberally. As others have mentioned here, your account gets
locked into a version on its first request, and if you never want to worry about
API versioning at all, you can leave that version untouched essentially
indefinitely.

If we realize that we made an API design mistake somewhere, we can fix it
relatively easily and keep the API's design more cohesive for new users, while
also leaving current users unaffected. It's also much easier to maintain for
us because we only have to build a small compatibility module instead of
having to maintain two (or more) completely divergent major API versions.

Anyway, I hope that helps explain some of the thinking behind this versioning
scheme :)

~~~
malsheikh
If you don't mind me asking, how exactly do you guys process requests with a
versioned API?

Say I come in with a request for V2. How does that get directed to the V2 code
path? What about services that are identical in V1 and V2. Do you have 2
copies of the same logic? Sorry for the naive question but API versioning is
something that has been on my mind recently.

~~~
brandur
> If you don't mind me asking, how exactly do you guys process requests with a
> versioned API?

This information has been talked about publicly before, so I don't mind
explaining at all.

For the most part, the core API endpoint logic is all coupled to just the
latest version. For each substantial API change in each new API version, logic
is encapsulated into what we call a "compatibility gate".

Before responding, a merchant's current version is looked up, and the response
is passed back through a compatibility layer that applies changes for each
gate until it's been walked all the way back to the target version, then the
response is sent back.

I'm glossing over a few details here of course — versioning can affect request
parameters and even core logic in many places, so some gates need to be
embedded throughout core code. We try to keep that as clean as we can.
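
As a very rough illustration of the response side only, the walk-back can be sketched like this. Every gate, field name, and version date below is invented for the example and is not taken from Stripe's actual code or changelog.

```python
def flatten_amount(resp):
    # Hypothetical gate: undo "amount became an object" for older versions.
    out = dict(resp)
    amount = out.pop("amount")
    out["amount"] = amount["value"]
    out["currency"] = amount["currency"]
    return out


def source_to_card(resp):
    # Hypothetical gate: undo "card was renamed to source".
    out = dict(resp)
    out["card"] = out.pop("source")
    return out


# (version that introduced the change, function that undoes it), newest first.
GATES = [
    ("2017-03-01", flatten_amount),
    ("2017-01-15", source_to_card),
]


def render(response, pinned_version):
    """Walk a latest-version response back to the merchant's pinned version."""
    out = dict(response)
    for introduced, downgrade in GATES:
        # ISO dates compare correctly as strings.
        if introduced > pinned_version:
            out = downgrade(out)
    return out


# Core logic only ever produces the latest shape.
LATEST = {
    "id": "ch_123",
    "amount": {"value": 500, "currency": "usd"},
    "source": "card_1",
}

current = render(LATEST, "2017-03-01")  # pinned at latest: unchanged
middle = render(LATEST, "2017-02-01")   # only the newest gate applies
oldest = render(LATEST, "2017-01-01")   # both gates apply
```

The appeal of this shape is that each gate is small and self-contained, so supporting one more old version costs one more function rather than a second copy of the endpoint.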

------
smg
The way Stripe does idempotency keys has always reminded me of two-phase
commit, but I know that the two things are quite different. I wonder what in
the distributed systems literature inspired this technique.

------
officelineback
To me the most interesting thing that came out of this article was exponential
backoff-retry. I've always used e (the base of the natural log) as the base for
my exponential backoff, and never used jitter (all my apps are single clients
that are just trying to hit the AWS API, and none run simultaneously and I
don't own the other side of it).

By looking it up I learned the concept of jitter, which is how Ethernet works,
and I think it's really cool.
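
For anyone else looking it up, here is a small sketch of exponential backoff with "full jitter", where each delay is drawn uniformly between zero and the exponential ceiling. The function names, defaults, and the choice to retry only on `ConnectionError` are assumptions for the example; pass `factor=math.e` to get the base-e variant.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=5, rng=random):
    """Yield jittered delays: uniform in [0, min(cap, base * factor**n)]."""
    for n in range(attempts):
        ceiling = min(cap, base * factor ** n)
        yield rng.uniform(0, ceiling)


def call_with_retries(fn, retries=4, sleep=time.sleep, **backoff):
    """Call fn once, then retry up to `retries` times on ConnectionError."""
    delays = backoff_delays(attempts=retries, **backoff)
    while True:
        try:
            return fn()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)
```

Without the random component, clients that failed at the same moment would all retry at the same moments too, so the jitter mostly matters when many clients share one server.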

------
josephorjoe
Nice, clear article. I've always been impressed by the usefulness and clarity
of Stripe's documentation. I should pay more attention to their blog.

------
strictfp
An "idempotency key" can also just be a resource URL. If you use an idempotent
HTTP verb this works wonders and is also more RESTful:

PUT example.com/orders/abc123/charged

The first PUT starts charging. If the client disconnects and retries while the
original charge process is already running on the server, the server can just
await this already running process. And if the order is already charged, the
server can just return the outcome of the charge.

As long as there is a unique URL for every state a resource such as an order
can be in, you can uniquely identify every action on it on the server-side and
re-attach to running processes.

Using unique URLs for each state is the main idea behind REST
(REpresentational State Transfer). REST namely means just this: representing
server states as resources (URLs).
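
A toy sketch of the "re-attach to a running process" part, assuming a single server process that can keep in-flight work in memory. `ChargeServer` and `charge` are hypothetical names, and a real deployment would need shared, durable state instead of a dict.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ChargeServer:
    """Toy server: one in-flight or finished charge per state URL."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn
        self._lock = threading.Lock()
        self._by_url = {}  # URL -> Future started by the first PUT
        self._pool = ThreadPoolExecutor(max_workers=4)

    def put(self, url):
        with self._lock:
            future = self._by_url.get(url)
            if future is None:
                # First PUT for this URL: start the charge exactly once.
                future = self._pool.submit(self._charge_fn, url)
                self._by_url[url] = future
        # Retries wait on the running charge, or read its stored outcome.
        return future.result()

    def close(self):
        self._pool.shutdown(wait=True)


calls = []


def charge(url):
    calls.append(url)
    return {"url": url, "status": "charged"}


server = ChargeServer(charge)
first = server.put("/orders/abc123/charged")
retry = server.put("/orders/abc123/charged")  # same outcome, no second charge
server.close()
```

One caveat of this sketch: a failed charge is remembered as well, so a retry would see the same exception again rather than a fresh attempt.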

------
jdwyah
The trick with idempotency keys in practice is figuring out a good stable way
to record and query them.

So you want your endpoint to only do something once. Fine, does that mean I
need a table in the DB with every key I've ever seen?

The most pleasant way I've solved this has been to think of this as rate
limits where the limit is once per forever. After that they fit nicely in a
token bucket and rate limit caching solution. This is one of the most useful
things that I think
[https://www.ratelim.it/documentation/once_and_only_once](https://www.ratelim.it/documentation/once_and_only_once)
does. (I built RateLim.it)
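
For comparison, the plain "table of keys" version can be sketched in a few lines if keys are allowed to expire. This is not how RateLim.it works internally; `IdempotencyStore`, the 24-hour default, and the injectable clock are assumptions for the example.

```python
import time


class IdempotencyStore:
    """Remember each key's response until it expires (in-memory sketch)."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # key -> (expires_at, saved response)

    def run_once(self, key, operation):
        now = self._clock()
        entry = self._seen.get(key)
        if entry is not None and entry[0] > now:
            # Key seen recently: replay the saved response, do no new work.
            return entry[1]
        response = operation()
        self._seen[key] = (now + self._ttl, response)
        return response
```

This ignores the hard parts (concurrent requests with the same key, and a crash between running the operation and saving its response), which is where a database transaction or an atomic cache operation earns its keep.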

------
Hydraulix989
I'm shocked at how few HTTP libraries on GitHub properly handle exponential
backoff, let alone retries.

Do the Internet a favor, and file an issue with your favorite HTTP library
asking them to implement exponential backoff.

I haven't found anything in JS that does this properly though. Do people
really just write apps that crap out upon the first HTTP request failure?

The best library I have come across is actually SquareUp's OkHttp (the payment
processing companies seem to be the only ones getting this right).

~~~
daurnimator
As the author of an HTTP library (lua-http
[https://github.com/daurnimator/lua-http](https://github.com/daurnimator/lua-http))
that doesn't, I'm interested in how you'd want retries (not to mention
exponential backoffs) to work:

\- Should they be the default?

\- What requests should be retried? (as much as we wish GETs were
idempotent... they're not) see
[https://lists.w3.org/Archives/Public/ietf-http-wg/2017JanMar/0194.html](https://lists.w3.org/Archives/Public/ietf-http-wg/2017JanMar/0194.html)
for an intro to the complexities here

\- What should get reused? (e.g. do new dns lookup? reuse connection? reuse
proxy connection?)

\- How to handle non-replayable pieces? (e.g. request bodies coming from a
fifo)

\- Usually a request has a timeout/deadline. should it be restarted for the
retry? (probably not)

I haven't implemented retries yet, as the above questions seem hard to answer
without knowing application/server specific qualities (and hence should be
left to the user of the http library). Please feel free to file an issue :)

~~~
nebiyu
That list of questions is why I prefer retry and back-off to be separate from
the internals of the http library.

As a library user I need to work around the peculiarities of the particular
service endpoints I'm integrating with. Backoff and retry aren't specific to
the application/transport protocol through which a service is consumed.

For example, in the Java world, this is the approach taken by libraries like
Hystrix, guava-retrying, and failsafe:
[https://github.com/jhalterman/failsafe#retries](https://github.com/jhalterman/failsafe#retries)

~~~
Hydraulix989
What do you recommend for JS?

~~~
nebiyu
Libraries that provide similar abstractions in JS do exist
([https://github.com/tim-kos/node-retry](https://github.com/tim-kos/node-retry)
and
[https://github.com/MathieuTurcotte/node-backoff](https://github.com/MathieuTurcotte/node-backoff)),
but I'm not qualified to recommend anything.

Most of the colleagues I've had have rolled their own one off solutions when
needed.

------
daliwali
An "idempotency key" as this article suggests applies to WebSockets as well
and is particularly useful for correlating a response to a request.

