UUID v7 (postgresql.org)
231 points by Recursing 15 days ago | hide | past | favorite | 193 comments



Can someone ELI5 what that "UUID v7 support" actually means in the title?

I don't know how to navigate commitfest (nor would I probably understand the source code to begin with), but the reason I'm confused is that you can already use all the proposed draft UUID implementations in Postgres (as long as you generate the ID application-side). In fact, PG will happily accept any 128 bit ID to be inserted into a UUID column, as long as it's hex encoded – even the dashes are optional.


I'm amazed at how "we" have managed to turn such a simple idea as "128 bits is a large enough address space for uncoordinated generation to be essentially collision free" into such a "heavy" concept with 7 different versions

If you want to do something smart like encoding your node ID within the value, or prefixing a timestamp for sortability, then sure, do that in your application. No one else really needs to care how you produced your 16 bytes. Just do some napkin math to make sure you're keeping sufficient entropy

I'm not sure "UUID" even needed to be a column type, versus an "INT16" and some string formatting/parsing functions for the conventional representation (should you choose to use that in your application). You could also put IPv6 addresses in the same type. Though I guess this depends on how much you think the database should encode intention versus raw storage in types


> No one else really needs to care how you produced your 16 bytes

UUID isn't about how it's done, it's about what it is.

Instead of everyone doing something differently, everyone can just comply with UUID.

Instead of having to repeat it across your docs that the IDs of this entity are sortable, you can just say they are UUIDv7. If someone wants to extract the timestamp from your ID, they don't need to figure out which (32, 48, 50?) bits are the timestamp nor what resolution the timestamp has because you can tell them UUIDv7.
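As a sketch of what that standardization buys you: per the draft RFC, the most significant 48 bits of a UUIDv7 are a big-endian Unix timestamp in milliseconds, so extraction is just a shift. A minimal Python illustration (the constructed UUID below is a synthetic example, not a real generator):

```python
import datetime
import uuid

def uuid7_timestamp(u: uuid.UUID) -> datetime.datetime:
    # Draft RFC 4122bis: the top 48 bits of a UUIDv7 hold the
    # Unix timestamp in milliseconds.
    ms = u.int >> 80
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)

# Build a synthetic UUIDv7 whose timestamp field is 2023-01-01T00:00:00Z
# (version nibble = 7 at bits 76..79, variant bits = 10 at bits 62..63).
ms = int(datetime.datetime(2023, 1, 1, tzinfo=datetime.timezone.utc).timestamp() * 1000)
u = uuid.UUID(int=(ms << 80) | (0x7 << 76) | (0x2 << 62))
print(uuid7_timestamp(u))  # 2023-01-01 00:00:00+00:00
```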

You don't have to write your own validation functions because you can tell the database that this hex string is a UUID and it can do it for you.

You're probably making the case for SQLite here, which is very minimal, but with something more full-featured like Postgres, I prefer these conveniences. I can tell because whenever I use SQLite in a case where I could've used Postgres, I regret it!


But I feel the point is: None of that is a relevant concern IDs should take on.

Functionality like embedding the record creation time within the ID is one of those "that's cool, but I've never seen anyone do it" kind of things. If you need to sort records by when they were created, there are probably three or four happened_at fields on the record you'd use (created_at in this case). If you need the exact time, those are there for that.

Counter-argument: Well, you can save a few bytes on every record by getting rid of the created_at field and just using a UUIDv7. Maybe, but I've never seen anyone do it. What if you need to change the time the record was created? Are you planning to explain to all your integration providers the process of extracting a timestamp from a UUIDv7? What if you need to run complex SQL timestamp functions on created_at? Etc. It's cool, but it never actually happens.

Once we enter the domain of "using the node id or timestamp or something to reduce the probability of ID collision", that's a totally reasonable responsibility within an ID's set of concerns. But, that's a very different need.

> You don't have to write your own validation functions

Why are we validating IDs?

> but with something more full-featured like Postgres, I prefer these conveniences.

Agreed. I am a vocal UUID hyper-hater. UUIDs should be destroyed, and humanity would be (oh so slightly) better off if they had never existed. But, they're still a thing, and I think it's cool that databases have hyper-specific types like this.

My wish is Postgres would have other more sane automatic ID types and gen capability, in addition to uuid & autoincrement.


The point of the timestamp in UUIDv7 is not to encode creation time, it is to provide some (coarse-grained) chronological sortability.

Random primary keys are bad, but exposing incremental indexes to the public is also bad, and hacking on a separate unique UUID for public use is also bad. UUIDs are over-engineered for historical reasons, and UUIDv7 as raw 128 bits without the version encoding would be nicer.

But, to the end-user it's just a few lost bits in a 128-bit ID with an odd standard for hyphenation. The standardization means you know what to expect as developer, instead of every DB rolling their own unique 128-bit ID system with its own guarantees and weirdnesses.


But my point is: When is that standardization actually leveraged? Literally, tactically, what does "you know what to expect as a developer" mean? When is this standardization used in a fashion that enables more capability than just "the ID is a string don't worry about it"?

The realistic answer is: it isn't, because pre-UUIDv7 there was literally nothing about the UUID spec that conferred more capability than just a random string. And, truly; people used them as "just gimme a random string" all the flipping time. The pipes of the internet are filled with JSON that contains UUIDs-in-a-string-field, 4 bytes wasted to hyphens, 1 byte wasted to a version number, none of that is in service to anyone or anything.


1. The other UUID versions are actually used. However, the expectation is in what the developer gets when generating it. Even "random ID" can be messed up if the author tries to be smart - e.g., rolling their own secret chronological sortability hack for their database but not telling you how much entropy and collision resistance you have left, or hacking in multi-server collision resistance by making some bits a static server ID.

People have reason to do those things, and oh boy do you want to know that it's happening. With UUID, over-engineered as it may be, you know what you're asking for and can see what you're getting - truly random, server namespaced, or chronologically sortable.

2. Being upset over 4 bytes wasted to hyphens but not being upset about JSON itself seems hypocritical. JSON is extremely wasteful on the wire, and if you switch to something more efficient you also get to just send the UUID as 16 bytes. That's a lot more than 4 bytes saved.

Over JSON you can still base64 encode the UUID if it's not meant to be user-facing.


> hacking on a separate unique UUID for public use is also bad.

Is it bad just because of the extra bytes used, or something else?


You need to maintain a UNIQUE index and have two IDs, with the one set as primary key solely existing for on-disk ordering. It's a nasty hack.


Postgres doesn't use the primary key for on disk ordering. MySQL does.


Ah, good point. The concern about on-disk ordering then only applies to other databases.

I suppose the index order may still be relevant for PostgreSQL?


what do you even mean by "Why are we validating ids"? zzzzyyyy-zyzy-zyzy-zzyyzzyyzzyy does this look like a valid ID? I could totally store this in the database if there was no validation involved


From GP (and my) perspective, the useful part of UUID is that it's 16 bytes. This is usually formatted as 32 hex digits with dashes in specific places.

The version/variant bits are the pointless part. Of course if you put the 16 bytes on the wire you would still have some encoding (perhaps 22 base64 characters?) that requires decoding/validation, but in memory and in your DB it's just 16 bytes of opaque data.
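For what it's worth, the 22-character base64 form mentioned above is easy to sketch in Python (an illustration of the round-trip, not a standard wire format):

```python
import base64
import uuid

def uuid_to_b64(u: uuid.UUID) -> str:
    # 16 bytes encode to 24 base64url chars, the last two being '=' padding.
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode()

def b64_to_uuid(s: str) -> uuid.UUID:
    # Restore the stripped padding before decoding.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))

u = uuid.uuid4()
s = uuid_to_b64(u)
print(len(s))  # 22
assert b64_to_uuid(s) == u
```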


> Instead of everyone doing something differently, everyone can just comply with UUID.

With which UUID ? UUID v1 ? UUID v2 ? UUID v3 ? ... UUID v7 ?


UUID has the version number and variant number encoded in the generated UUID. You can just examine the UUID to determine which version it is.
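In Python, for instance, the standard library exposes both fields directly (a quick illustration):

```python
import uuid

u = uuid.uuid4()
# The version nibble is visible as the 13th hex digit of the canonical form.
print(u.version)        # 4
print(u.variant)        # specified in RFC 4122
assert str(u)[14] == "4"  # e.g. 'xxxxxxxx-xxxx-4xxx-...'
```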


The UUID specs are still confusing (or at least were to me lol) because the words "version" and "variant" both just say that something changes, not what is changing or why it's changing.

version from Latin vertere "to turn, turn back, be turned; convert, transform, translate; be changed"

variant from Latin variare "change, alter, make different,"

In my own UUID/GUID code I have taken to calling them "behavior" and "layout", respectively: https://github.com/okeeblow/DistorteD/blob/NEW%E2%80%85SENSA...


> 4.1.1 The variant field determines the layout of the UUID. That is, the interpretation of all other bits in the UUID depends on the setting of the bits in the variant field. As such, it could more accurately be called a type field; we retain the original term for compatibility.

> 4.1.3 The version number is in the most significant 4 bits of the time stamp (bits 4 through 7 of the time_hi_and_version field). The following table lists the currently-defined versions for this UUID variant. The version is more accurately a sub-type; again, we retain the term for compatibility.

It's recognized in the RFC, and all you've done is break compatibility for fashion.


Naming != compatibility. Nobody should be looking at my code except for me and it helps me remember it better so I'm not sorry 8)


Unless the random bits don’t follow those conventions.


It isn't a UUID if it doesn't contain those flag bits in their expected positions. A "random" UUID has only 122 bits of randomness for this reason.
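Concretely, generating a v4 UUID amounts to taking 128 random bits and overwriting 6 of them. A sketch in Python (real code would just call uuid.uuid4(); this shows where the 122 bits come from):

```python
import secrets
import uuid

def make_uuid4() -> uuid.UUID:
    n = secrets.randbits(128)               # 128 random bits...
    n = (n & ~(0xF << 76)) | (0x4 << 76)    # ...force the version nibble to 4
    n = (n & ~(0x3 << 62)) | (0x2 << 62)    # ...force the variant bits to 10
    return uuid.UUID(int=n)                 # 122 bits of entropy remain

u = make_uuid4()
print(u.version)  # 4
```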


In practice, UUIDs are treated as an opaque 128-bit field. In any sufficiently complex system, there is no practical way to standardize on a single blessed version. Furthermore, all of the standardized UUIDs are deficient in various ways for some use cases, so there are a large number of UUID-like types used in large enterprises that are not "standard" UUIDs but which are better fit for purpose. This is deemed okay because there is no way to even standardize on a single UUID version among the official ones. Furthermore, there are environments where some subset of the UUID standard types (including each of v3/v4/v5 in various contexts) are strictly forbidden for valid security reasons.

The practical necessity of mixing UUID versions, along with other 128-bit UUID-like values, means that the collision probabilities are far higher in many non-trivial systems than in the ideal case of having a single type of 128-bit identifier. There is a whole separate element of UUID-like type engineering that happens around trying to mitigate collision probabilities when using 128-bit identifiers from different sources, some of which you may not control.

Having 128-bits is the only common thread across these identifiers which everyone seems to agree on.


I think you're drastically underestimating the purpose and management of UUIDs in large scale systems.

If you're building for a single application or data type, sure do your thing, have at it. If you're trying to coordinate UUID spaces and generation across thousands of different applications and data types, like large data pipelines, then this matters a lot.

Also, having native database support (like indexing, filtering, etc.) improves efficiency for these types of workloads.


Because it turns out that trying to index/sort things by UUID doesn't work great. A UUID, at least after version 1, isn't just some large number. Different parts of the field have different meanings depending on the specification.


The patch is adding functions for generating UUIDv7 in the database. Which you can already do, for example with the PL/pgSQL function from [1]. Or you can just generate them in your application code. As you mentioned everything else about UUIDv7 already works and didn't require any changes.

It's really more about convenience, and maybe a bit of speed.

1: https://gist.github.com/fabiolimace/515a0440e3e40efeb234e126...


Thank you!


Postgres can now auto-generate uuidv7, so you can define it in your schema and don't have to add the ID on the client side when doing an insert


Yes, my expectation is that Postgres can now do that for you. It's still easier to have it autogenerated than doing it externally


In a distributed application, your application might need proper time synchronization for the time component. Not when it is generated DB-side.


I've said this before on HN, but we're talking about UUID v7 again, so it bears repeating: prefer UUID v4 for unique keying, of the sort that's transferable between databases. For timestamps, use timestamps, to sort by insertion, use an autoincrementing primary key. Disk space is not so expensive that we can't afford all three of these things.

Conflating timestamps and uniqueness is a mistake. These concerns are best kept separate. If you put a timestamp in your UUID, your UUID now has a timestamp. You can't remove it if you don't want the timestamp to be part of the uniqueness any more.


UUIDv7 offers several key advantages:

  - sortable by insertion
  - less vulnerable to sequence prediction attack
  - allows partitioning of tables in the future
Sequence prediction attacks are a problem when you want to use identifiers in a public API. For example, a user or competitor can iterate through your product catalogue by incrementing the key. Sorting by insertion is a common use case as well. You can achieve this with an auto-incrementing primary key; however, this will create issues when you need to partition the table. Of course, you may never need this functionality, but there is a reason UUIDv7 has been added, as it is very useful in certain scenarios.


I get the public API argument, but it certainly feels backwards to structure the internals of your application around that specific use case. Generating and mapping a UUID to your identifier when you need it is pretty straightforward and allows you to avoid exposing the internal identifier at all.

Which leaves partitioning, something that very few applications will ever need, even if plenty of developers hope they will. A slightly less weird use case, but perfectly doable up to a point using regular sequences.

I see a solution looking for problems.


I have built many successful systems without ever using a GUID. However, at a certain scale they become very handy, and then it definitely helps that you can also sort them.


Just add a verifier: /item?id=123;vf=ixkdS1Xb

Then select * from item where id=123 and vf='ixkdS1Xb'
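One way to sketch that idea: the comment implies a stored random verifier column, but an HMAC over the id (illustrated below in Python with a made-up key name) gets the same anti-enumeration effect without storing anything extra. This is an assumed variant, not what the commenter necessarily had in mind:

```python
import hmac

SECRET_KEY = b"server-side-secret"  # assumption: a key kept server-side only

def verifier(item_id: int) -> str:
    # Derive a short, unguessable check value from the sequential id.
    return hmac.new(SECRET_KEY, str(item_id).encode(), "sha256").hexdigest()[:8]

def check(item_id: int, vf: str) -> bool:
    # Constant-time comparison to avoid timing leaks.
    return hmac.compare_digest(verifier(item_id), vf)

url_params = {"id": 123, "vf": verifier(123)}
assert check(url_params["id"], url_params["vf"])
assert not check(124, url_params["vf"])  # incrementing the id fails the check
```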


As a DBRE, I care mostly about DB performance, and not getting paged / blamed for things out of my control.

I fully agree and support that auto-incrementing integers should be used whenever possible. My preference for UUIDv7 over UUIDv4 is solely that they’re less likely to wreak havoc on the DB, if devs insist on having a UUID PK.


In most ORM apps you need to pre-allocate the integers to build the references in the object graph before actually executing these inserts. Doesn't this carry a performance penalty?


I know Django gives you an auto-incrementing integer field named id even if you don’t add one to your model. I have no idea about others, but I would hope they aren’t doing work the DB is already doing for you.


> but I would hope they aren’t doing work the DB is already doing for you.

Sadly this is very much a thing that ORMs do all the time.


(1)

    |------------------------------|-----------------------------|------------|----
    | primaryKey (bigInt-internal) | publicKey (uuidv4-external) | created_at | ...  
    |------------------------------|-----------------------------|------------|----
(2)

    |------------------------------|-----------------------------|------------|----
    | primaryKey (uuidv4-internal) | publicKey (uuidv4-external) | created_at | ...
    |------------------------------|-----------------------------|------------|----
(3)

    |------------------------------|-----------------------------|----
    | primaryKey (uuidv7-internal) | publicKey (uuidv4-external) | ...
    |------------------------------|-----------------------------|----
(4)

    (not recommended)
    |---------------------|----
    | primaryKey (uuidv7) | ...
    |---------------------|----
----

  (1) [X] sortable by insertion, [X] timestamp
  (2) All of (1) and [X] transferable between databases
  (3) Use UUIDv7 as a primary key for internal and UUIDv4 for external. App or SELECT statement will need to extract the timestamp from UUIDv7 if you need to use it. Also, if you're using a DB Client you can't just view the 'created_at' column to get an idea of when a row was created.
  (4) Use UUIDv7 as a primary key for internal & external use.


I don't think you've provided any actual reasons to not use uuidv7 or addressed any of the disadvantages of having multiple keys.



Word to the wise: be very careful about adding semantics to unique ids that aren't inherent to the identity of the thing being identified.

Over time conflicts between the id's primary job (uniquely identifying something) and the extra semantics can arise, and the solutions tend to get pretty messy.

Here we have a unique id that embeds a timestamp. The classic conflict here is with privacy/security. A UUIDv7 user id tells you when the user was created. A UUIDv7 of a medical record tells you when some medical event occurred.

There are things whose identity is inherently time-based and not private, so I'm not giving a blanket recommendation to not use these. Just understand what you are signing up for.

For a database, you can use bigints for primary ids but only internally. Then you also have an external random (v4) uuid... and a timestamp if you want, for that matter -- now that it's a separate column, you can expose/hide it on a case-by-case basis, depending on need. So this gets you the benefits of a uuidv7 but maintains flexibility, though at the cost of some complexity and extra bytes/record.

Other conflicts can arise too, and they can be hard to always foresee, so generally be careful about extra semantics in unique ids.


It looks like you're explaining this well, but I still don't understand what you're saying.


At my company we had a competitor scrape our API for various businesses. One of the fields was a BSON ObjectId that represented when the customer entered our system. This unique identifier encodes a timestamp of its creation.

Our competitor was able to ascertain, based on that timestamp, when our customers contract was up and was (briefly) able to poach some customers by underbidding us until we corrected this.


Wow. I'm amazed by stories like this.

I'm so naively the "take the high road" kind of guy, that I just assume everyone (or every company) should just do the right thing. Stealing customers in this way from a competitor, I have no ability to rationalize such an action.

And this kind of makes me scared, like if I were to ever own a business, I just know I'm swimming with sharks with no ability to defend.

I believe your story, but it's just crazy to me. Go earn a customer's business in a legit way, not by stealing data from a competitor.


Giving someone an offer when their existing contract is running out isn't that shady. Lots of companies ask for that. It shouldn't be hard to rationalize!


Uhh ... that's entirely what business is about. Some shady methods maybe, but also some pretty good customer acquisition methods


Businesses do not have to act unethically. Doing so limits who you can hire and collaborate with.


Honestly I didn't even think of it when I first wrote the endpoint. Our founder asked me "Is there anything we're sending over the API that might clue someone in when someone signs up? Are we sending a createdAt field or something?" and I said "No, but we do have a timestamp in one of the IDs..." -- well, we removed the field and this behavior stopped soon after.

Anyway, the arc of the universe bends toward justice: this (former) competitor got sold for parts a few years later.


Isn't the real issue here that your competitors had the authorization to see the contracts you had with other customers?


They didn't. We had something like:

    "locationId": "53c24146ef0b601b77974fcd"
They took the first four bytes (53c24146) which is a timestamp that represents 1405239622 seconds since Unix epoch. Our website clearly stated we work off annual contracts (a norm for our industry) - it wasn't secret information. So from this timestamp they could ballpark when a customer's contract was up.
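The extraction they did is trivial. A Python sketch using the id from the comment:

```python
import datetime

oid = "53c24146ef0b601b77974fcd"
# The first 4 bytes of a BSON ObjectId are seconds since the Unix epoch.
ts = int(oid[:8], 16)
print(ts)  # 1405239622
print(datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc))
```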


How did you find out that they did?


The event timestamps embedded in the UUID can be correlated with external event streams, or even with other events within the same dataset, to de-anonymize the context of the event associated with the UUID. This is a common class of de-anonymization attack. Anything that allows temporal correlations to be inferred potentially leaks quite a lot of info about the data underlying the unique ids.


They're saying that in certain circumstances, if your API exposes the PK publicly, it may leak information you don't want leaked (the precise datetime something occurred, in the case of UUIDv7).

If that's an issue for you, you can get around this in a variety of ways, as they mention: you could use an associative table that maps the externally-exposed random ID to an internal-only ID.


Be careful about combining 2 pieces of information in to 1 column. From the above examples, you may want something to uniquely identify a record in your db and you may want something that tells you when the record was created. If you combine these two things, you then have a problem if you want to give an untrusted party that unique reference without telling them when it was created.


Don't use UUIDv7 if you want to keep the creation time of the event/id/entry private.


> A UUIDv7 of a medical record tells you when some medical event occurred.

It tells you when the event was documented. If the event didn't contain a date time stamp itself, I would be highly surprised, because what other value is there in documenting it?

The security problem here is inherent in the practice and your choice of primary key isn't a material factor at all.

Do you imagine there's a public CRUD database with simple Rails style accessors that can drill all the way down to individual event records inside my health information? And that, somehow the leak of a primary key in a URL might give away the fact that _something_ happened to me, medically, 12 days ago?


>And that, somehow the leak of a primary key in a URL might give away the fact that _something_ happened to me, medically, 12 days ago

Just as a side note: it may also leak the location, not just the time. And that is enough e.g. for disproving an alibi or leaking an important commercial secret (if you are in the same location as a competitor's HQ, for example).


UUIDv1 is the only one that specifies a MAC address to be included in the UUID. All others specify timestamp data, pseudo random values, or MD5/SHA1 string hashes.

I cannot rightly apprehend the scenario you are describing. A medical provider might generate a UUIDv1 and add it to a record of mine, and this will somehow destroy my ability to have an alibi in court with respect to corporate espionage?

I'm not in a bond movie, I just need to keep track of events and have them sort in chronological order


I'm just pointing out that an identifier with a timestamp can, in certain situations, leak more than just the timestamp. E.g. an identifier of a medical record in a hospital whose location is known. Just this. The exact scenarios of how it can be leaked and how this information can be used are left to your imagination. The possibility of such an attack depends on what's at stake.


For good security, you don't leak internal IDs at all, sure. It is rare to find people doing that. And random primary keys (or any key really, and even more so for clustered indexes PK) really trash a db cache and locking systems.


IDs should just be IDs. You should be able to hand out your IDs on the street corner without compromising security. If knowing IDs has a negative impact on security, you've designed your system improperly.


I couldn't agree more!


No.

While it no longer ranks in the top 10 web vulnerabilities, gaining internal insight into systems is one of the first things you do when infiltrating.

But people are messy and lazy. Nowadays, you ask for GDPR data and people give you CSVs with all their real table and column names.

Sometimes when you are just a little inside, figuring out an id is like figuring out a password (particularly with uuid as opposed to a sequence). Real nice if it leaks easily.


Again, if knowing an ID allows someone to infiltrate your systems, you've designed things poorly. IDs are not keys, and they should never be treated as such (looking at you, US Social Security numbers).

An ID is like a social security number. When you give it away on the street corner, who knows what other people will do with it.

IMHO, leaking an ID always imposes a risk. It always has a negative impact if leaked, no matter how perfect your system is.


That's hyperbolic. If your table ID poses security risks like a social security number, you've designed your system horribly wrong.


It's like one small bug that lets you move sideways so you can use a bigger and better one.


You're almost always going to have to leak some sort of ID in an API, otherwise your API is going to be exceptionally hard to work with. You could choose to have a separate external ID, but provided that knowledge of an internal ID doesn't convey any information or additional privilege, it's not that big a deal.


The only reason why people can do dangerous things with an SSN is because it's used for authorization - i.e. as a secret - not just as ID. That's broken by design, but there's no reason to repeat that design.

> For good security, you don't leak internal IDs at all, sure. It is rare to find people doing that.

We must live in a different universe. I'd wager to say that over 90% of all backends leak their primary key when speaking to the front-facing client.


I think the statement was "It is rare to find people doing good security", not "It is rare to find people leaking internal IDs"


I read it as '[…] you don't leak internal IDs at all […]. It is rare to find people doing that [i.e. not leaking internal IDs]'


Yeah, sorry not clear. Typing on iPhone never really gets the message across as clearly as using a PC


99.9%


> And random primary keys (or any key really, and even more so for clustered indexes PK) really trash a db cache and locking systems.

That sounds like a problem which should be solved by making database engines not assume keys have some sane ordering, not by putting timestamps in UUIDs.


This is mostly correct but primary keys have to co-exist with the existing data infrastructure that is unlikely to be replaced for decades.

It is quite possible to do cluster-style indexing on UUIDs through disk on a single server at rates of tens of millions per second, I do it every day, just not with your typical ordered-tree architectures. Many popular database engines are not designed to make this particular scenario perform well.


I'm not sure why that'd inherently be a problem? Knowledge that some record exists about a medical event at a particular time is not too problematic, compared to _who_ it happened to


very rare diseases may be an exception, especially in known geographic locations. that can become identifiable (and is governed by HIPAA in the US)


Usually when you want this property, it is also a benefit that your events, when ordered by primary key, get a rough ordering by time - no need for a secondary index


If you like this (I do very much), you might also like pg_idkit[0] which is a little extension with a bunch of other kinds of IDs that you can generate inside PG, thanks to the seriously awesome pgrx[1] and Rust.

[0]: https://github.com/VADOSWARE/pg_idkit

[1]: https://github.com/pgcentralfoundation/pgrx


Is the headline true at all? It was punted to the next commitfest on 2/1. It could easily keep getting punted like it has been for a ~year now.


Is it that hard to implement? Supporting an additional UUID version in PostgreSQL sounds like the most trivial change to implement (compared to anything that touches core backend, table management, replication, query schedulling, and so on).


The UUID spec update has not been finalized yet.

It would be quite unfortunate to end up with a UUID v7 in PostgreSQL that’s not quite the standardized one because the patch got merged too quickly.

EDIT: here is the IETF working group page https://datatracker.ietf.org/wg/uuidrev/about/

Their milestone seems to submit the final proposal by March.


> It would be quite unfortunate to end up with a UUID v7 in PostgreSQL that’s not quite the standardized one because the patch got merged too quickly.

The chances of that seem extremely low at this point. The contents of a version 7 UUID have not changed since work started on RFC 4122 bis in October 2022: https://author-tools.ietf.org/iddiff?url1=draft-ietf-uuidrev...


The chances may be low, but either it's a draft or a final version.

There's clearly little pressure to rush this, considering it's not difficult to add a custom function generating UUIDv7 ...


The patch is already written, it's on that page. The bottleneck in Postgres is reviewer bandwidth which is why it's been moved out of several commitfests.


I don't think reviewer bandwidth is the main issue for this patch. It's a 200-line change (considering C code, there's more in docs/tests), and the code is not overly complicated / sensitive (in the sense that it's very isolated and unlikely to break random stuff).

For me the main challenge was that it's still considered a draft (AFAIK). It may be unlikely to change, but if it does I'd rather not have to deal with persistent UUIDv7 data generated per some previous spec.

Also, if I really want/need UUIDv7, it's not that hard to create an extension that generates UUID in arbitrary ways, including the proposed v7.


+1

there's already well maintained extension https://github.com/fboulnois/pg_uuidv7

It's slightly different from the recommendations in the draft RFC (there's no counter), but fully within spec requirements. From a practical point of view, there's no difference at all.



What's a use-case for a v7 UUID vs a v4 UUID, which is entirely random? Is it just that v7 includes a timestamp for better sorting?


> Is it just that v7 includes a timestamp for better sorting?

Correct. The sortable nature of UUIDv7 improves database performance and index locality by helping the index be more efficient since rows are inserted in a predictable order instead of scattered randomly.


If nothing else (and there is plenty of else) it saves a lot of time knowing the order something was created when debugging.


Yes. UUIDv7 can be sorted by default in order of creation, which helps reduce fragmentation problems of randomness over time.


It would be nice to have either fully integrated support for returning the UUIDv7 encoded in Crockford's base32, or else to ship Crockford's base32 encode/decode functionality in the same release, so that this can be compatible with ULID (given that it's one of the primary open-source existing works UUIDv7 was modeled after).


looks like v7 is basically v1 updated. v6 looks interesting with the db optimisation in mind.

UUIDv7 spec: https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...


FYI that's a link to a now very old internet draft.

The current draft can be found here: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122b...


> Systems that do not involve legacy UUIDv1 SHOULD consider using UUIDv7 instead.

The optimization relevant to v6 also applies to v7. The difference is that v7 UUIDs are not directly compatible with v1 UUIDs.


No thread about UUID is complete without a plug for NanoID! https://github.com/ai/nanoid/blob/main/README.md


That's a lot of words to say: “encode 126 random bits in base64 (url-safe variant)”.

It seems the algorithm is equivalent to:

    function nanoid(alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_") { 
      return Array.from(crypto.getRandomValues(new Uint8Array(21)), i => alphabet.charAt(i & 63)).join('');
    }
(Maybe String.fromCharCode() is slightly faster than Array.join(), but I doubt it matters much.)

On node.js it's even easier, since url-safe base-64 encoding is supported natively:

    Buffer.from(crypto.getRandomValues(new Uint8Array(16))).toString('base64url').substring(0, 21)
Do we really need an entire Github project dedicated to a 1-liner?


Agreed, and I actually share the same view about "UUID" as a named concept itself. I wrote more in another comment, but in summary

* Using uncoordinated random generation in a large address space for IDs is often a very useful idea

* 128 bits is a good rule of thumb for "will never ever collide"

* Making this into a heavyweight "UUID" concept, with its own bespoke string format, and 7 different standard ways to generate them, feels like a ridiculous waste of cognitive effort that makes such a simple concept appear opaque and magic. If you want to encode other data (timestamp, node ID) in 16 bytes you can still do that of course. There's no need for anyone else to even know. Just do some quick calculations to ensure you haven't eliminated too much entropy
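The napkin math mentioned above is a one-liner via the birthday bound (a sketch; 122 is the number of random bits a v4 UUID actually carries after version/variant bits):

```python
from math import expm1

def collision_prob(n, random_bits=122):
    """Birthday bound: P(any collision among n ids) ≈ 1 - exp(-n(n-1) / 2^(b+1))."""
    return -expm1(-n * (n - 1) / 2 ** (random_bits + 1))
```

Even at a billion IDs, the collision probability is far below 1e-18; you only approach coin-flip odds around 2^61 IDs.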


> * 128 bits is a good rule of thumb for "will never ever collide"

It depends.

True random numbers are slow. Fast PRNGs are prone to collisions. That's why we have different specs and versions of UUIDs.


> True random numbers are slow. Fast PRNGs are prone to collisions.

Not in this decade. You can slam out a million securely random 128 bit numbers per second per core. For numbers you will store, the effort to store them is orders of magnitude greater than the effort to generate securely.
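You can sanity-check that claim on your own machine with a few lines (numbers are machine dependent; this is just a sketch using the OS CSPRNG):

```python
import os
import time

N = 100_000
start = time.perf_counter()
ids = [os.urandom(16) for _ in range(N)]  # 128 securely random bits each
elapsed = time.perf_counter() - start
rate_per_sec = N / elapsed  # typically well over a million on current hardware
```

On anything modern, `os.urandom` (backed by getrandom/CryptGenRandom) is nowhere near the bottleneck compared to storing the values.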


Not sure it's too relevant for postgres, where uuid is 16 bytes in db & can be generated by db


Also not comparable to UUIDv7 because it isn't sortable


Does this stay compatible with the old uuid column type, as uuidv7() uses the same 128 bit format as gen_random_uuid()?

That would mean it's easy to update old apps, as we would only have to change the default value for the columns.


Yes, internally it's the same as every other UUID (16 bytes, passed by reference). There's no reason to store / represent it differently.


Yes, we have already updated our in-app UUID generation to use v7 UUIDs and are storing them in regular postgres UUID columns (postgres 14). Works great!


Note that, as jmull says in https://news.ycombinator.com/item?id=39262286 , embedding timestamps in every uuid can potentially expose private information.


Yes, you can check UUIDs to see if they’re v7, and extract the timestamp if so. This seems to me less problematic in most cases than being able to guess the next ID (as is the case with numeric IDs). At least for us, anybody with access to the ID also has access to the time the record was created, so there’s no new information being exposed.

It’s a good thing to keep in mind though for sure.
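Extracting the timestamp is mechanical, since it's just the leading 48 bits: decode them as big-endian milliseconds. A sketch per the draft layout (the helper name is mine):

```python
import datetime
import uuid

def uuid7_timestamp(u: uuid.UUID):
    """If u is a v7 UUID, decode the leading 48-bit Unix-ms timestamp; else None."""
    if u.version != 7:
        return None
    ms = int.from_bytes(u.bytes[:6], "big")
    return datetime.datetime.fromtimestamp(ms / 1000, tz=datetime.timezone.utc)
```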


Yes, you can use v7 today with uuid columns, you just need to add a custom function (or the more performant pg_uuidv7 extension) to generate them.


If you need a uuid v7 right now you can get one here: https://uuidv7.app/


Thank the Gods. I was almost out.


Always glad to help.


Whoosh


When will this be available? (Thanks in advance)


I wonder why the size of the table with uuid7 is smaller than the one with uuid4? Both are using the uuid data type.


As a sibling comment guessed, btree indexes in Postgres can store ordered data much more efficiently than random data. Inserting randomly ordered data leads to fragmentation, with lots of empty space on your index pages.


The main data is the same, it's the primary key which is smaller, presumably a result of the structure of the uuid7 being more regular.


You probably should not use UUIDs in your database to start with, at least not as an ID. UUIDv7 aims to solve some of the issues that make UUIDv4 even less suitable for databases. 99% of the time, using a BigInt for an ID is better.


There are some nice features of UUIDs compared to ints. It's been written about before; a few off the top of my head: client-side generation of IDs, and no risk of faulty joins (joining two tables on the wrong IDs can never get any hits with UUIDs; it can with ints).

Those two suck for us right now (we're planning to move to UUIDs).


Uniqueness aside, UUIDs for public-facing IDs also prevent enumeration attacks and leaking business information other than timestamps.


> No risk of faulty joins

Wouldn't Snowflake IDs also solve that problem? A Snowflake ID will fit within a signed 64-bit int.

https://en.wikipedia.org/wiki/Snowflake_ID

The nice thing about a Snowflake ID is that you can encode it into 11 characters in base 62, whereas a UUID needs 22. Maybe that doesn't really matter, given that even 11 characters isn't something someone will want to type anyway, and Snowflake IDs do require a bit of extra caution to avoid collisions (since the number you can make per second is limited by how big your sequence field is).
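The 11-character figure checks out, since 62^11 ≈ 5.2e19 > 2^63. A quick base62 encoder sketch for illustration:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(n: int) -> str:
    """Encode a non-negative integer in base 62 (digits, upper, lower)."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))
```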


The same idea, but this is an IETF standard.


The "faulty joins" can be solved by having a shared sequence for all tables. A bigint column should be enough for most use cases.


A shared sequence for all tables can't be parallelized, though.


Sorry but that’s terrible advice. I’ve worked on projects that started with integer ids and it caused nothing but problems.


What kind of problems did you encounter?


Not OP, but I can answer this:

Integers don't scale, because you need a central server to keep track of the next integer in the sequence. UUIDs and other random IDs can be generated in a distributed fashion. There are many examples, but the first that comes to mind is Twitter writing their own custom unique-ID implementation to scale tweets [0]

[0]: https://blog.twitter.com/engineering/en_us/a/2010/announcing...
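Twitter's scheme packs a timestamp, worker ID, and per-worker sequence into one 64-bit int, so each worker can mint IDs without coordination. A sketch of that layout (41/10/12 bits and the epoch constant are taken from the linked announcement; treat the details as illustrative):

```python
import time

EPOCH_MS = 1288834974657  # Twitter's custom epoch, per their write-up

def snowflake(worker_id: int, sequence: int, ts_ms=None) -> int:
    """41 bits of ms-since-epoch | 10 bits worker id | 12 bits sequence."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    return ((ts_ms - EPOCH_MS) << 22) | ((worker_id & 0x3FF) << 12) | (sequence & 0xFFF)
```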


> Integers don't scale because you need a central server to keep track of the next integer in the sequence.

They most assuredly do scale. [0]

Also, Slack is built on MySQL + Vitess [1], the same system behind PlanetScale, which internally uses integer IDs [2].

[0]: https://www.enterprisedb.com/docs/pgd/latest/sequences/#glob...

[1]: https://slack.engineering/scaling-datastores-at-slack-with-v...

[2]: https://github.com/planetscale/discussion/discussions/366


I get what you’re saying but this feels like a premature optimization that only becomes necessary at scale.

It reminds me a bit of the microservices trend. People tried to mimic big tech companies but the community slowly realized that it’s not necessary for most companies and adds a lot of complexity.

I’ve worked at a variety of companies from small to medium-large and I can’t remember a single instance where we wish we used integer ids. It’s always been the opposite where we have to work around conflicts and auto incrementing.


In the same vein, distributed DBs are not required for most companies (from a technical standpoint; data locality for things like GDPR is another story). You can vertically scale _a lot_ before you even get close to the limits of a modern RDBMS. Like hundreds of thousands of QPS.

I've personally run MySQL in RDS on a mid-level instance, nowhere near maxing out RAM or IOPS, and it handled 120K QPS just fine. Notably, this was with a lot of UUIDv4 PKs.

I'd wager with intelligent schema design, good queries, and careful tuning, you could surpass 1 million QPS on a single instance.


Auto-incrementing integers mean you're always dependent on a central server. UUIDs break that dependency, so you can scale writes up to multiple databases in parallel.

If you're using MySQL maybe integer ids make sense, because it scales differently than PostgreSQL.


If the DB fails to assign an ID, it's probably broken, so having an external ID won't help you.

If you're referring to avoiding conflicts between distributed nodes, that's a solved problem as well – distribute chunked ranges of size N to each node.
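The chunked-range idea amounts to one round trip per block rather than per ID. A toy sketch of that allocator (names and block size are mine, for illustration):

```python
CHUNK = 1000  # ids reserved per round trip (small for illustration)

class IdAllocator:
    """Reserve a chunk of ids from a central sequence once, then allocate
    locally with no per-id coordination."""
    def __init__(self, reserve_block):
        self._reserve = reserve_block  # callable returning a fresh block's first id
        self._next = 0
        self._limit = 0

    def next_id(self):
        if self._next >= self._limit:  # block exhausted: one coordinated call
            self._next = self._reserve()
            self._limit = self._next + CHUNK
        n = self._next
        self._next += 1
        return n
```

Gaps between blocks are fine; uniqueness is what matters.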


Whatever is distributing the chunks is still a point of central coordination.


Yes, and?

If you can't manage minor levels of coordination because your database is on fire, the problem is that your database is on fire.


Fewer points of coordination is always better.

In general you shouldn't need to make a roundtrip to produce an ID.


> Fewer points of coordination is always better.

The distributed database needs a coordination system anyway, so it's not an additional point.

> In general you shouldn't need to make a roundtrip to produce an ID.

Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.


> The distributed database needs a coordination system anyway, so it's not an additional point.

Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.

> Did you forget the context over the last week? We're already talking about reserving big chunks to remove the need to make a roundtrip to produce an ID. There would instead be something like one roundtrip per million IDs.

OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.


> Nope! Distributed databases do not necessarily need a "coordination system" in this sense. Most wide-scale distributed databases actually cannot rely on this kind of coordination.

I'm assuming a system that tracks nodes and checks for quorum(s), because if you let isolated servers be authoritative then your data integrity goes to hell. If you have that system, you can use it for low-bandwidth coordinated decisions like reserving blocks of ids.

Am I wrong to think that most distributed databases have systems like that?

> OK, it's very clear that you're speaking from a context which is a very narrow subset of distributed systems as a whole. That's fine, just please understand your experience isn't broadly representative.

Sure, but the first thing you said in this conversation was "Whatever is distributing the chunks is still a point of central coordination." which is equally narrow, so I wasn't expecting you to suddenly broaden when I asked why that mattered.


The way to solve that is giving each server its own range of IDs.


Yes, but with PostgreSQL (and any other SQL server I'm aware of) you already have a central server that can do that. If you have multiple SQL servers this won't work, obviously, unless you pair it with a unique server ID.


I recently worked on a data import project and because we used UUIDs I was able to generate all the ids offline. And because they’re randomly generated there was no risk of conflict.

This was nice because if the script failed half way through I could easily lookup which ids were already imported and continue where I left off.

The point is, this property of UUIDs occasionally comes in handy and it’s a life saver.


    postgres=# CREATE TABLE foo(id INT, bar TEXT);
    CREATE TABLE
    postgres=# INSERT INTO foo (id, bar) VALUES (1, 'Hello, world');
    INSERT 0 1
    postgres=# ALTER TABLE foo ALTER id SET NOT NULL, ALTER id ADD GENERATED 
               ALWAYS AS IDENTITY (START WITH 2);
    ALTER TABLE
    postgres=# INSERT INTO foo (bar) VALUES ('ACK');
    INSERT 0 1
    postgres=# TABLE foo;
     id |     bar
    ----+--------------
      1 | Hello, world
      2 | ACK
    (2 rows)


I don’t understand what you’re getting at. This was a pre-existing Postgres db in production.

I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.


You said data import, so I assumed it was pulling rows into an empty table. The example I posted was a way to create a table with a static integer PK that you could rapidly generate in a loop, and then later convert it to auto-incrementing.

> I’m sure there’s a way to get it to work with integer ids but it would have been a pain. With UUID’s it was very simple to generate.

IME, if something is easy with RDBMS in prod, it usually means you’re paying for it later. This is definitely the case with UUIDv4 PKs.


No I mean an active prod table with people adding new rows all the time. It's just so much easier not having to worry about integer conflicts and auto-incrementing shenanigans.

But I get you like integers so whatever works for you, I just don't think they're the right tradeoff for most projects.


This doesn’t really help you in this case, because the patch is to generate the UUIDs in the database?


Now you can use PG to generate the UUIDv7 in the beginning, then easily switch to generating in the client if you need to in the future; but I think OP was talking about UUIDs vs auto-incrementing integers in general, not specific to Postgres.


IMO it’s always been easier to generate them in the client. Every major platform has had libraries since forever.


They also leak information.


I encountered this once: If you use integer IDs, try to scale horizontally, and do not generate the IDs in the database, you'll get in deep trouble. The solution for us was to let the DB handle ID generation.


Yes, but the only sane way to generate integer IDs is in the database.


Here are some reasons for using UUIDs; they won't all apply to every business:

- client-side generation (e.g. can reduce complexity when doing complex creation of data on the client side, and then some time later actually inserting it into to your db)

- sequential ids leak competitive information: https://en.wikipedia.org/wiki/German_tank_problem

- Global identification (being able to look up an unknown thing by just an id - very useful in log searching / admin dashboards / customer support tools)
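The German tank problem linked above is concrete: given k observed sequential IDs with maximum m, a competitor can estimate your total population as m + m/k - 1. A sketch of the classic estimator:

```python
def german_tank_estimate(observed_ids):
    """Minimum-variance unbiased estimator: N̂ = m + m/k - 1,
    where m is the max observed serial and k the sample size."""
    m, k = max(observed_ids), len(observed_ids)
    return m + m / k - 1
```

So four observed order IDs are enough to roughly size a business's order volume; random UUIDs leak nothing comparable.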


It's also much easier to merge data from different sources when they all use UUIDs for row identification.

I would never advise this. I use UUIDv4 for basically everything. It adds minimal overhead to small systems and adds HUGE benefits if/when you need to scale. If you need to sort by creation date use a "created" column (or UUIDv7 if appropriate).

If your system ever becomes distributed you will sing the praises of whoever chose UUIDs over int IDs, and if it never becomes distributed, UUIDs won't hurt you.

Note: this is for web systems. If it's embedded systems then the overhead starts to matter and the usefulness of UUID is probably nil.


It is worth mentioning that the reason UUIDv4 is strictly forbidden in some large decentralized systems is the myriad cases of collisions where the "random number" wasn't quite as random as people thought. Far too many cases of people not using a cryptographically strong RNG, whether unwittingly or out of ignorance that they need to.

Less of an issue if you have total control of the operational environment and code base, but that is not always the case.


How does this happen? Are people implementing UUIDv4 themselves using rand() or equivalent? Or have widely used UUIDv4 libraries had such bugs?


It comes in a couple common flavors. Most commonly it is people just rolling their own implementation and using a PRNG or similar. Not every environment has a ready-made UUIDv4 implementation, and not all UUIDv4 implementations in the wild are strict. A rarer horror story I've heard a couple times is discovering that the strong RNG provided by their environment is broken in some way. Both of these cases are particularly problematic because they are difficult to detect operationally until something goes horribly wrong.

The main reason non-probabilistic UUID-like types are used for high-reliability environments is that it is easy to verify the correctness of the operational implementation. It isn't that difficult to deterministically generate globally unique keys in a distributed system unless you have extremely unusual requirements.
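One cheap check along those lines: the fixed version/variant bits a strict RFC 4122 v4 UUID must carry are easy to verify. This obviously can't detect a weak RNG, only malformed layouts from hand-rolled generators (a sketch; the function name is mine):

```python
import uuid

def looks_like_strict_v4(u: uuid.UUID) -> bool:
    """Check the fixed bits of a strict RFC 4122 v4 UUID:
    version nibble == 4 and variant bits == 10."""
    b = u.bytes
    return (b[6] >> 4) == 4 and (b[8] >> 6) == 0b10
```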


It adds a lot of overhead at any scale, it’s just that the overhead is hidden due to the absurd speed of modern hardware.

I’ll again point out (I said this elsewhere in a post today on UUIDs) that PlanetScale uses int PKs internally. [0] That is a MASSIVE distributed system, working flawlessly with integers as keys. They absolutely can scale, it just requires more thoughtful data modeling and queries.

[0]: https://github.com/planetscale/discussion/discussions/366


GitHub also uses int PKs and has over 100,000,000 users.


One reason you might not want to use integers for things like user IDs is that you may leak information about the magnitude of your user base.


Huh, why would using uuidv4 be a problem? Collisions?


My understanding is that they cause a lot of page fragmentation, which leads to excessive writes to the WAL


You can't use them as cursors, because they are not inherently ordered like integer ids.


IDs are one thing, cursor-able fields/columns are a different thing.

You cursor on timestamps, or serial numbers, or etc., not IDs.


I have never wanted to use database cursors and I predict that I never will.


Sure, if you don't offer pagination or only have small tables, you can get away with offsets. I tend to go for cursors as a default because I like to build applications with performance in mind and it’s the same effort.


We may be talking about different things. I thought you were referring specifically to [database cursors](https://en.wikipedia.org/wiki/Cursor_(databases)), so that's what I was talking about. If you're talking about something else, like the concept of so-called "cursor-based pagination" in general, then that is still an option even with randomly generated primary keys, so long as there are other attributes that can establish an order (which attributes need not be visible to the user).


I have offered pagination over large tables, without database cursors or non-random keys, without offsets, while keeping performance in mind, with little effort.


How about the rest of us?

I don't have a resource off the top of my head to present to you, but at the least, keyset pagination is superior to offset pagination because it does not get invalidated by new inserts.
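The key point is that keyset pagination filters on a composite key rather than skipping rows, so concurrent inserts never shift a page. A pure-Python model of `WHERE (created_at, id) > last_key ORDER BY created_at, id LIMIT n` (a sketch; any total order works, including over random IDs as the tiebreaker):

```python
def keyset_page(rows, last_key=None, size=2):
    """Model of keyset pagination over (created_at, id)."""
    ordered = sorted(rows, key=lambda r: (r["created_at"], r["id"]))
    if last_key is not None:
        # Resume strictly after the last row of the previous page
        ordered = [r for r in ordered if (r["created_at"], r["id"]) > last_key]
    return ordered[:size]
```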


The rest of you also don't need database cursors or non-random keys to have keyset pagination.


When working with distributed systems, or when compiling IDs from different systems, it is helpful to be sure the ID is unique.


That's especially true if you care about performance and do a lot of joins, the hit can be over 10%.


Would you recommend starting a new project with UUID v7 ?


If the included timestamp doesn't expose sensitive data, then UUIDv7 is a good default: it has performance advantages for database operations, and items can be sorted by ID in a meaningful way, which is sometimes desired (even if I've never used it).


If you need some of its attributes.

Some comments: https://news.ycombinator.com/item?id=39261469

https://news.ycombinator.com/item?id=36433481

And have a look at this post from last year: https://news.ycombinator.com/item?id=36438367


Nice, but somewhat pointless, since the UUID (or any other type of ID representing an external entity) is better generated by the actor creating that entity, i.e. the FE client, or in the worst case the BE/application server. If it's a database-server-generated ID, it might as well be a serial autoincrement bigint ;)

EDIT:

For those who misunderstood: I'm very much pro-UUID, and against serial autoincrement server-generated IDs, except when you need to heavily optimize for speed and/or index storage space. And even then there are hybrid solutions, like using UUIDs externally and serial IDs internally.

https://news.ycombinator.com/item?id=39261435


There are many reasons for not using a bigserial ID, such as avoiding the data leak associated with publicly visible sequential database IDs. These reasons apply regardless of where an ID is generated.

Generating IDs away from the database is mainly done due to design restrictions demanding it, not because it is beneficial.


A serial auto-increment bigint is not globally unique. A traditional UUID is not sequential. This particular UUID is both globally unique and increasing, and can be used as a primary key. It simplifies certain schemas and architectures.


Trusting your clients to generate your database IDs... What could go wrong?


1. you're not trusting anyone

2. the UUID by itself doesn't authenicate or authorize anything

3. there is a small chance of collision, and it can be handled in the backend/persistence/DB layer, i.e. return an error to the client in case of collision and ask it to generate a new UUID

4. many non-trivial and/or CQRS/ES apps work like this

5. if you are really paranoid, you can push down UUID generation logic to BFF (Backend-For-Frontend) layer

6. Lots of DistSys problems can be solved with client-generated IDs. But most people mistakenly think that DistSys applies to backend only, and exclude clients from the picture.

7. Scaling RDBMSs (especially Postgres) is hard. UUID generation is slower than serial bigint, so it's best to keep it outside the DB layer for this reason as well.

8. Client-generated UUIDs help to make client requests idempotent, and enable error handling with retries (although it's better to add an additional layer of request idempotency with IDEMPOTENCY_KEY HTTP header, or GraphQL Relay's clientMutationID).
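Point 8 is the crux: if the client supplies the ID, a retried create is detectable as a duplicate rather than producing a second row. A toy model of that server-side handling (names are mine; in Postgres this would be `INSERT ... ON CONFLICT` territory):

```python
class Store:
    """Sketch: make creates with client-supplied ids safely retryable,
    while rejecting a reused id carrying different data."""
    def __init__(self):
        self.rows = {}

    def create(self, entity_id, payload):
        existing = self.rows.get(entity_id)
        if existing is not None:
            if existing == payload:
                return "already-created"  # idempotent retry: no-op
            raise ValueError("id collision with different payload")
        self.rows[entity_id] = payload
        return "created"
```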


I think "small chance of collision" is a misleading way to think about it when any malicious client can submit 100% collisions if it wants to. Dealing with this isn't a big deal, but if you dismiss it as unlikely/paranoid in a hacked-together app, you risk letting someone do something like take ownership of an existing record by pretending to create it.


P_collision = 1 / 2^122

The malicious actor will need to somehow get an authorized session first, then get a real UUID which was already inserted into the DB.

And now she will be able to DoS by replaying create_entity requests with an already existing UUID.

But ... the same scheme equally applies to server-generated IDs also when used for updates instead of insert/create (even if there is a translation layer between internal and external IDs) ;)

In my implementations this create_entity will also have an idempotency key, so after first successful request, the replayed/retried requests will hit the cache only, and skip the database.

In case the attacker also changes idempotency_key for each replayed request, then the only remedy is to monitor for this scenario, or reject requests with the same {sessionID, requestName, entityUUID} and different idempotencyKeys over short time intervals.

But again, this equally applies to any type of server-generated ID.


Suppose we're talking about an app where users can post events. Each public event has a page at /event/<uuid>. So any malicious user can trivially find the id of an existing event. Suppose there's also an api where you POST a JSON body to /api/v1/event to create an event. This uses a client generated ID so the body contains the title, location, etc and the supposedly newly generated id. A malicious client could submit an existing uuid instead of generating a new one, and the server would need to reject this.

I'm not saying this is a big issue, but saying "P_collision = 1 / 2^122" is misleading and gives a false sense of security. P_collision is 100% if a malicious user can specify any id they want and wants to specify a colliding one.

I don't understand why you're bringing in idempotency keys. The way to fix this is to reject client generated ids that already exist.


> Suppose we're talking about an app where users can post events. Each public event has a page at /event/<uuid>. So any malicious user can trivially find the id of an existing event.

This is bad UX and bad design; usually a human-readable event slug will be shown in the URL.

But I agree that, since UUIDs are not encrypted, they can be acquired by the attacker (though this equally applies to any other unencrypted data handled by the client).

> Suppose there's also an api where you POST a JSON body to /api/v1/event to create an event. This uses a client generated ID so the body contains the title, location, etc and the supposedly newly generated id. A malicious client could submit an existing uuid instead of generating a new one, and the server would need to reject this.

Again, this is bad design, and the attacker will also need to either hijack the session or steal a JWT token.

> I'm not saying this is a big issue, but saying "P_collision = 1 / 2^122" is misleading and gives a false sense of security. P_collision is 100% if a malicious user can specify any id they want and wants to specify a colliding one.

I agree with you, but the attacker can copy any ID, including server-generated ones, so I don't understand the problem. Any external data needs to be validated, be it client-generated or server-generated. Do you claim that validating UUIDs for uniqueness in the DB layer is somehow more expensive than any other validation of external data?

> I don't understand why you're bringing in idempotency keys. The way to fix this is to reject client generated ids that already exist.

Please read again: yes, all external data needs to be validated, and this equally applies to server-generated IDs (only they will need to be validated on UPDATEs and DELETEs instead of INSERTs).

Idempotency keys just reduce the load on the hard-to-scale RDBMS database layer, so it will not be hit on every retry or replayed malicious requests.


> 1. you're not trusting anyone

Sure you are. You're trusting the client to send you valid data.

> 2. the UUID by itself doesn't authenicate or authorize anything

Okay, better hope no one ever considers the UUID to be a unique randomly generated token.

Besides that, users might be tempted to submit "vanity" UUIDs if they get to decide their own identifiers, breaking assumptions about the system.

> 4. many non-trivial and/or CQRS/ES apps work like this

Cool, if your friends jump off a bridge, you gonna follow them?


Who said anything about trust? Server-side validation still applies; you don't just DELETE /user/{id} without verifying ownership, regardless of where the id comes from.

But client-generated IDs make idempotency easier and remove whole classes of errors. They're typically a huge win.


Why not both? Have the column autogenerate a UUID if the insert doesn't include one, and use the supplied one if it does. This provides added flexibility for the scenarios mentioned in the comments.


I do this (i.e. if a UUID wasn't supplied in the API call, the backend will generate one), but for me it's only useful for debugging or manual API testing.

Reason: I'm not a machine, so I dislike generating UUIDs by myself ;)


Clients have/supply meaningful data. An ID isn't meaningful data.


Parents have/supply meaningful data. A name of their newborn child isn’t meaningful data ;)


If that name is expected to be a unique identifier, then certainly not!

First name is a locally unique identifier for some period of time (usually a lifetime, otherwise at least a decade) until a child changes it for some reason (e.g. immigration, conversion to a different religion, gender re-assignment, a recommendation from a personal psychic, or simply not liking it).

But yes, it's not a true semantic key, since the first name usually doesn't come from immutable properties of the child. The last name, however, can be a semantic key, derived from properties of the family (e.g. geographic origin, profession, hair color, looks, etc.)



