Implementing Stripe-Like Idempotency Keys in Postgres (brandur.org)
182 points by craigkerstiens on Oct 27, 2017 | 41 comments



Craig, attended your talk at ATO earlier this week--good talk. I was the guy who asked about performance considerations of joins in Postgres using UUIDs.

This post reads like it's vaguely implementing what should be in a queue backend. There's a locked_at field in the schema; furthermore, shouldn't this be performed via ... FOR UPDATE?


Care to share the answer to your question about uuid join performance? I've had a hard time finding much about that on the internet.


Sure, there is definitely a little extra overhead on join performance, though in my experience I've seen so many other issues become the bigger bottleneck before joins on UUIDs do. We regularly used UUIDs as identifiers at Heroku Postgres and use them at Citus as well, and they work extremely well for us.

It is of note that we're actually using the UUID datatype though and not just generating a UUID and throwing it into a text field.


> It is of note that we're actually using the UUID datatype though and not just generating a UUID and throwing it into a text field.

I was thinking that a UUID datatype implemented as a series of ints could have fairly good join performance, since you can effectively treat it as a series of separate smaller int indices that you join across, and I imagine that's a well understood and optimized problem for years now. A text UUID field though, ugh, that just seems so wasteful even before you get to optimization techniques.


Reading your comment, it's not clear to me whether you're aware that the UUID datatype in PostgreSQL is a 128-bit value as opposed to a text field; or if you're just speaking of applications that use a database where native UUID is not available. If the latter, feel free to ignore the rest of this :)

For comparison in storage:

    select pg_column_size('DDB5E9ED-60B6-4405-A6A8-18E339C7B172'::uuid) AS uuid_bytes,
           pg_column_size('DDB5E9ED-60B6-4405-A6A8-18E339C7B172'::text) AS text_bytes;
     uuid_bytes | text_bytes 
    ------------+------------
             16 |         40
    (1 row)
As 'craigkerstiens points out, the index performance of the native UUID datatype is very good: it's not a string comparison.

If you're using the UUID type to encode information that you may want to access subfields of, you could create function indexes to do so. Otherwise, I think indexing on the native UUID type is going to be better than a collection of narrower fields and indexes.

Edit: Spell 'craigkerstiens correctly.


> Reading your comment, it's not clear to me whether you're aware that the UUID datatype in PostgreSQL is a 128-bit value as opposed to a text field

I was speaking to the idea that some people throw UUIDs into text fields, as implied by the parent's clarification that they were not doing that. For example, in years past when that type may not have existed yet. People have been talking about "just use a UUID for the primary key" for at least around a decade now.

> As 'craigkerstiens points out, the index performance of the native UUID datatype is very good: it's not a string comparison.

I would assume so! Any database that implemented an actual 128-bit numeric type (no matter if it's converted to hexadecimal notation for display; ints aren't stored in decimal either) as a text field would deserve quite a bit of ridicule, in my opinion.

> If you're using the UUID type to encode information that you may want to access subfields of

And that answers a question I raised somewhat implicitly in my other comment, which is: why use a UUID in the first place? Because you can encode specific subtype information into the parts. Thanks!

Edit: Forgot some words in my paragraph that mentioned ridicule, so I added them to make sure it was obvious what and why I thought they would deserve some ridicule (which is actually a bit stronger than how I generally like to come across).


We use a combination of both - the app is old enough that UUIDs weren't an option when it was started - and like the GP said, even in the case of text fields there's a bunch of other things that degrade perf before those joins become an issue. That being said, we are hopeful that the new PG10 hash indexes will be a good fit for these (we already can't do range lookups on a UUID) and that we will get a speed improvement with them when we get to that point.


I'm curious, did you ever do testing on whether converting a UUID to a pair of 64-bit ints and using a composite primary key worked well? I imagine it's annoying to deal with at both the schema and application level, but it is space efficient. Depending on how optimized numeric/decimal types are, those might be a good fit as well. You said other things degrade perf before that matters, so it may be moot, but wasting space in what's going to be an index seems likely to affect other things subtly (such as ballooning your index memory usage).

If we're storing UUIDs in canonical form, that's 36 bytes of information (32 + 4 for separators, and if you're stripping separators why not convert to a numeric form anyway...). Even if we assume we're only storing 32 characters, that's 256 bits, so we're doubling the storage requirements and memory usage for an index. Assuming you're using UUIDs because you have a lot of records and want to keep them unique, that space presumably matters (especially if you're using it in multiple places because there are relations). If you don't have a huge amount of records, a sufficiently large auto-incrementing int field that increments by a random amount between 1 and 255, plus a randomized bit shuffle codified into a reversible function (extend the int to 128 bits, use a previously decided bit shuffle to rearrange the bits; fully reversible), would solve that need and make life easier for everyone who has to deal with the DB. Just convert the ints to and from UUID representations between your view and controller.
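
A toy sketch of that reversible shuffle (hypothetical Python; the seeded RNG stands in for a permutation chosen once and kept secret):

```python
import random
import uuid

# A fixed permutation of the 128 bit positions maps an integer id to an
# opaque UUID-shaped value and back again; nothing is lost, so it's fully
# reversible. The fixed seed stands in for a permutation decided once.
random.seed(42)
PERMUTATION = random.sample(range(128), 128)
INVERSE = [0] * 128
for dst, src in enumerate(PERMUTATION):
    INVERSE[src] = dst

def shuffle_bits(value, perm):
    # Output bit `dst` is taken from input bit `perm[dst]`.
    out = 0
    for dst, src in enumerate(perm):
        if (value >> src) & 1:
            out |= 1 << dst
    return out

def id_to_uuid(record_id):
    return uuid.UUID(int=shuffle_bits(record_id, PERMUTATION))

def uuid_to_id(u):
    return shuffle_bits(u.int, INVERSE)
```

Round-tripping any id returns it unchanged, while the UUID form hides the underlying sequence from casual inspection.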

I've never had to actually deal with UUIDs at the DB level, so there could be some major downsides to these that I'm not thinking of. At this point, I'm just not sure what storing a UUID in a database gets you.


The two biggest things it gets us are an identifier that is unique across the database and something that the application layer can generate and then provide to the database - instead of inserting a record and finding out what the id is after the fact. Secondarily, it means that someone cannot guess what the next record is by trying to increment or decrement an integer key, but like you say, there are ways around that.

We do plan to eventually convert the text versions into real UUIDs, but our performance/size isn't an issue at the moment and it's a lot of risky engineering work to do.


> it gets us is an identifier that is unique across the database and also something that the application layer can generate and then provide to the database - instead of inserting a record and finding out what the id is after the fact

That's a good point, and makes sense. Also, as gzrm noted in another comment, you can make use of the different parts of the UUID to encode specific information, which may be helpful (but at the loss of uniqueness bits).


With respect to uniqueness, you can avoid collisions using, for example, some kind of sharding scheme, similar to what Instagram did:

https://engineering.instagram.com/sharding-ids-at-instagram-...
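
The linked post's layout can be sketched roughly like this (hypothetical Python; the epoch below is an arbitrary example value, not necessarily the one Instagram uses):

```python
import time

# Rough sketch of the Instagram-style 64-bit id layout from the linked post:
# 41 bits of milliseconds since a custom epoch, then 13 bits of shard id,
# then 10 bits of a per-shard sequence (mod 1024).
EPOCH_MS = 1293840000000  # example custom epoch: 2011-01-01 00:00:00 UTC

def make_id(shard_id, seq, now_ms=None):
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return ((now_ms - EPOCH_MS) << 23) | ((shard_id & 0x1FFF) << 10) | (seq % 1024)

def unpack(the_id):
    # Recover (timestamp_ms, shard_id, sequence) from a packed id.
    return ((the_id >> 23) + EPOCH_MS, (the_id >> 10) & 0x1FFF, the_id & 0x3FF)
```

Ids generated this way sort roughly by creation time while still encoding which shard minted them.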


It's been quite a while since I've needed to, but when I was setting up HA databases, I just set the auto increment amount to some number more than the number of databases, and set the starting offset for each database to sequential ints (1,2,3...). This results in databases that can't generate duplicate ids. As a bonus, you know what database was used to create each record.
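
To make the arithmetic concrete, a toy illustration (hypothetical Python, not tied to any particular database) of why the offset scheme can't collide:

```python
# With an increment at least as large as the number of databases and a
# distinct starting offset per database, the id streams never overlap,
# and (id % INCREMENT) tells you which database minted a given id.
INCREMENT = 10  # any value >= the number of databases works

def id_stream(db_offset, count):
    return [db_offset + i * INCREMENT for i in range(count)]

print(id_stream(1, 3))  # db 1 -> [1, 11, 21]
print(id_stream(2, 3))  # db 2 -> [2, 12, 22]
```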


That's one way if you're using sequences to generate ids. If you're encoding information in UUIDs, you also have to take into account how many bits you want to allocate to the database (shard) id. One method I've used is to hard-code the shard id as the return value in a function: effectively a constant. (Edit to add: This serves the same purpose as setting the starting offset of the sequence.) You do have a bit of custom code on each database instance in this case, which is its own tradeoff.


Snowflake id is a BIGINT. How does it differ from UUID, other than joining faster?


Trivially, a UUID is wider (128 bits as opposed to 64). You can choose to generate your UUIDs using an Instagram[0]-like strategy so you can, for example, benefit from temporal/spatial locality in the indexes. There may be performance consequences of using the wider datatype, but I suspect this could be overwhelmed by other application concerns if you're dealing with data that benefits from the wider type.

[0] https://engineering.instagram.com/sharding-ids-at-instagram-...
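
For illustration, a minimal sketch of one such strategy (hypothetical Python; the real post shards ids, this just front-loads a timestamp into the 128 bits):

```python
import os
import time
import uuid

# Hypothetical time-prefixed UUID: the top 48 bits hold milliseconds since
# the Unix epoch and the remaining 80 bits are random, so freshly generated
# ids land near each other in a btree index (temporal locality).
def time_ordered_uuid():
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    return uuid.UUID(int=(ms << 80) | rand)
```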

Edit: Update to correct snowflake to Instagram's strategy. I make this mistake often enough that it's now self-reinforcing :/


So you are presumably agreeing that snowflake makes sense?


Oh, of course! Depending on your application needs, it certainly can work. The benefit of PostgreSQL UUID is that you effectively have twice the bits, so you have more room to encode data and satisfy your uniqueness guarantees.


Hi online :)

I'll wait a bit to see if Brandur (the author) chimes in first, otherwise happy to pile in some thoughts.


> This post reads like it's vaguely implementing what should be in a queue backend. There's a locked_at field in the schema, furthermore should this not be performed via ...FOR UPDATE?

Yeah, I'm relying on transactional semantics to guarantee no contention during the lock, but the part that I'd intended as important here is really just the idea of locking the key — not the particular implementation. I think alternatively using a `FOR UPDATE` is totally plausible and likely even to be more performant than what I've written here.

Try to imagine this as more of an "ideas guide" rather than a dictation of a precise implementation.


This is a really great article. I especially like that the author pointed out some of the pitfalls with non-ACID systems.

I started wondering at the beginning of the article how many people were going to read it, try to implement it in mongo, and faceplant.

The only thing that strikes me as odd about the hypothetical example is that I can’t imagine a universe where you would want a request for a ride to complete 72 hours after it initially failed.

I get the point of keeping the information around for 3 days in case of errors. And perhaps my imagination is lacking, but most of the cases where I would want to implement this model are time-sensitive. I need a ride now. I need a message guaranteed to deliver exactly once.

Actually, the message queue aspect is how I think of this in general. It's an application/database layer that takes an at-least-once MQ system and forces it to be at-most-once. The combination then seems to imply exactly-once.
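
As a toy sketch of that combination (illustrative Python, nothing from the article): the broker may redeliver, but a store of already-processed keys makes the side effect happen at most once.

```python
# At-least-once delivery plus an idempotency-key store: duplicates get
# delivered, but the side effect is applied only once per key.
processed = set()
side_effects = []

def handle(message_id, payload):
    if message_id in processed:  # duplicate delivery: skip the side effect
        return
    side_effects.append(payload)
    processed.add(message_id)

# "k1" is delivered twice, as an at-least-once broker is allowed to do.
for mid, body in [("k1", "request ride"), ("k1", "request ride"), ("k2", "charge card")]:
    handle(mid, body)

print(side_effects)  # -> ['request ride', 'charge card']
```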

Actually, after writing that out, I can now think of a really great use case for this that isn't time sensitive: ETL systems.

Let’s say you are streaming live updates from some third party system to some staging area and triggering processes from there through a broker to transform and load data into a warehouse.

Your team pushes an update that breaks something on Friday night. Nobody works that much on the weekend, so no one cares. The dev lead gets notifications, but gets to take it easy until Monday morning because they (trigger warning: intentional singular they) know that everything can be replayed and resolved after the bug is fixed. Even in the case of frequently changing API data, you have the entire mutation history stored in your table and can accurately replay it.

I can get on board with this.

And yeah, sorry for thinking out loud. Now I'm thinking of all kinds of places in my web apps this can be applied, as the article suggests. I think I need to write a Python wrapper around this concept and make it dead simple to include in projects.

Thank you again to the author for this very clear and excellent article.


Why not use Twitter's snowflake id? Instagram blogged about a Postgres function that generates them

http://rob.conery.io/2014/05/28/a-better-id-generator-for-po...


Which of the problems covered in this post would be solved by Snowflake?


This is a really great article that lays out a proper implementation of idempotent requests, something that many people (even very experienced developers) regularly mess up. Thanks for sharing


I'm using MemSQL and it doesn't seem to have a UUID type, but after a quick search on MySQL and UUID, it looks like the binary type is typically used.

Can this be used with MemSQL with the binary type, and will I get the same benefits?

Thanks


Yes, I've seen UUIDs in databases in both binary and varchar(36) forms. Interacting with them through your tools/ORM won't be as nice as the native type, but it's definitely doable and you get most of the benefits.


I don't use an ORM, but otherwise thanks for the clarification!


I always thoroughly enjoy reading Brandur's articles.


How do you guys implement atomic_phase? I'm not a Ruby guy, so I'm not able to figure out the definition.

It looks like an atomic check-and-update, but more sophisticated.


This is a fantastic article.


"Idempotency keys", sheesh. Just use standard terminology: these are just futures/promises.


Wait...what? I'm all for "standard terminology" but they're for sure just not the same thing. At all.


Sure they are. The idempotency key is simply a representation of a promise. The E language and the Waterken server implement promise semantics exactly like this: bind the results of a local or remote computation to a durable representation, and that representation is used to store the result upon completion. All future requests return that stored result, i.e. idempotency.

In case of network partition where the client doesn't receive a reply, it can safely and simply retry the operation on the promise, which returns the stored result.

The E language popularized futures/promises, and it's still the gold standard.
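
For what it's worth, the claimed equivalence can be sketched in a few lines (illustrative Python): a durable key-to-result map behaves like a resolved promise, so a retried request re-reads the stored value instead of re-running the computation.

```python
# A stored result keyed by request id: compute once, then every retry
# "resolves" to the same value, which is the idempotency guarantee.
results = {}
calls = 0

def fulfill(key, compute):
    global calls
    if key not in results:  # first request: run the computation exactly once
        calls += 1
        results[key] = compute()
    return results[key]

assert fulfill("req-1", lambda: 40 + 2) == 42
assert fulfill("req-1", lambda: 40 + 2) == 42  # retry after a dropped reply
assert calls == 1
```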


I think you are arguing a very niche perspective here.

The concept of idempotence in mathematics predates futures/promises by about a century and is part of the general background understood by most people exposed to any CS concepts. It has been prevalent across a wide range of communities discussing networking protocol and message-passing designs since the inception of the Internet. It appears in countless RFCs and other standards documents which are read by people building web-based systems.

The concept of futures/promises you are promoting is honestly a niche part of computer science from the 1970s which persists in an intersection of distributed systems research and programming language research. It was never the prevalent formalism shared by general CS or network engineering communities. Meanwhile, a different aspect of the same 1970s concept has had a rebirth as an asynchronous programming syntax and appears in widely consumed languages. This different camp of PL researchers has essentially won the popularity contest which defines the new "standard" meaning of futures and promises for the general technical audience.


I'm not arguing against idempotency, or attempting to redefine it; I'm pointing out that "idempotency key" as used in this article precisely denotes a future. You might as well write an article on a "text-based client/server GET/PUT/POST protocol" without labelling it HTTP.

It's stupid and unnecessarily obfuscated. This abstraction has a well-known technical name, so use it, don't invent a new term.

And I frankly find it bizarre that you call futures/promises "niche" when they're now part of every web browser in the world and millions of programmers use them every day. The fact that these futures aren't durable isn't relevant since the browser environment itself is transient.


I find it bizarre for you to invoke the niche E programming language as if it is obviously relevant to any use of unique identifiers for detecting replay of messages. Why don't we criticize the author for his lack of Petri nets, or his glaring omission of a denotational semantics in the appendix to his article?

At this point in the discussion, I am unsure that you understand promises in distributed systems formalism, because you seem to be conflating it with the ones recently en vogue in Javascript and other popular languages. Those futures/promises in every web browser in the world are a purely local asynchronous programming idiom, or what I like to think of as CPS unconversion. I didn't suggest that browser programming is a niche concept, but it also has nothing to do with organizing idempotent remote state changes over a lossy message channel.

There is a vast literature using many different forms of explicit identifier tied to some form of idempotence. Do you run around the IETF and W3C telling them to stop saying "Message-ID" when they really mean futures/promises? Or security researchers with their nonces? How about those pesky IPv4 and IPv6 designers with their silly ports, SYN cookies, and sequence numbers?


> Why don't we criticize the author for his lack of Petri nets, or his glaring omission of a denotational semantics in the appendix to his article?

Oh, was I asleep when Petri nets and denotational semantics started being used by millions of programmers around the world?

> Those futures/promises in every web browser in the world are a purely local asynchronous programming idiom, or what I like to think of as CPS unconversion. I didn't suggest that browser programming is a niche concept, but it also has nothing to do with organizing idempotent remote state changes over a lossy message channel.

Of course it does, and I already made this point so I don't know why you're ignoring it. Add durability to browser promises, and there you go. The durability is irrelevant to the semantics.

> Do you run around the IETF and W3C telling them to stop saying "Message-ID" when they really mean futures/promises? Or security researchers with their nonces?

Nonces and Message-IDs have different semantics than promises. It's like you're not even listening.

Idempotency keys have the same semantics as promises. Ergo, they are promises. Promises are idempotent, but not every idempotent abstraction is a promise. It's simple, I'm not sure where the confusion lies.


You seem fixated on establishing the primacy of the E programming language as the inventor or rightful heir to the use of keys to correlate messages and establish idempotent operations. So, I casually mentioned "prior art": keys used to manage message retry, deduplication, and replay avoidance.

For more complete examples, consider SMTP, NNTP, and their UUCP predecessors which provided idempotent message transfer and eventually-consistent replication using client-generated, globally distinct IDs. I absolutely fail to see how these are different than a tweet ID or any similar solution for the web. It is such standard practice, we take it for granted like the use of 2's complement arithmetic.

The concept's reappearance in Twitter, Instagram, countless bespoke CRUD applications, or the PostgreSQL example in this article are not "actually promises" just because they're on the web. They are just the umpteenth instance of source-labeled resources being tracked in a distributed system. That is the web abstraction: a resource is created and we do not care whether it is an event, object, message, computational job, bank transaction, or packet of curry being delivered to my doorstep next Tuesday.

A Javascript promise is a local transform for asynchronous control flow. It has zero intrinsic connection to any IO performed in that flow, nor in the remote computation or storage that might occur. The programming language stops where the message leaves the language's runtime and enters the real world of distributed systems. It is not the control flow syntax which makes my communication idempotent, it is me and the recipient agreeing to overlay idempotence semantics on some data we agree to encode into our messages.

Message formats have nothing to do with programming languages. This is how we understand Internet architecture. The local programming language and programming paradigm of each node in the distributed system is immaterial to the messaging semantics and interop standards and their concepts.


But it doesn't "precisely denote a future". Not even remotely close to the same thing.


What a great argument. Idempotency keys are single-assignment/idempotent variables. Promises are single-assignment/idempotent variables. They're totally not the same thing!


Futures/promises means nothing to me.

But I always thought an idempotency key was just sending in a time stamp or something else unique.


Time stamps are not unique; you definitely need something unique, like a GUID, to distinguish a token from those bound to other client requests.

And you'd be better served learning about the more general abstraction rather than learning a single domain-specific trick. This solution is obvious once you know about promises.



