If you encounter a URL like https://fancy.page/users/15, chances are that 15 is a numeric ID and 1 to 14 also exist. The lower numbers tend to be admin accounts, as those are usually created first. An attacker might use this to extract data or to gain access to something internal. One could argue that using UUIDs only hides a security hole in this case, but that's better than nothing, I guess.
They make life hell for database clustering, merges and migrations.
In addition, on a more minor level, in a client-centric world (apps, browser JS etc.), the use of incremental numbers is an unnecessary pain point. If you use UUIDs, the client can generate its own without needing to call back to the API (unless necessary in context, obviously).
Frankly, IMHO in the 21st century, the use of incremental numbers for IDs in databases thoroughly deserves to be consigned to the history books. The desperate clutching at straws arguments that went before (storage space, indexing etc.) are no longer applicable in the modern database and modern computing environment.
Also, sequence generators are a non-problem in competent architectures, since you can trivially permute the number such that it is both collision-free and non-reversible (or only reversible with a key).
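A minimal sketch of that permutation idea, assuming a Python service and 64-bit sequence values: multiplying by a secret odd constant modulo 2^64 is a bijection, so distinct sequence values can never collide after permutation, and the constant acts as the key. (This linear scheme is easy to reverse if an attacker collects enough IDs; a keyed block cipher or Feistel network is the stronger version of the same idea.)

```python
# Permute sequential IDs into non-obvious public IDs without losing
# uniqueness. Multiplication by an odd constant mod 2**64 is invertible,
# so the mapping is collision-free; treat MULTIPLIER as the secret key.
MASK = (1 << 64) - 1
MULTIPLIER = 0x9E3779B97F4A7C15          # any odd 64-bit constant works
INVERSE = pow(MULTIPLIER, -1, 1 << 64)   # modular inverse (Python 3.8+)

def obscure(seq_id: int) -> int:
    return (seq_id * MULTIPLIER) & MASK

def reveal(public_id: int) -> int:
    return (public_id * INVERSE) & MASK

assert reveal(obscure(15)) == 15
assert obscure(14) != obscure(15)
```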
It is still common to use structured 128-bit keys in very large scale databases, and it is good practice in some cases, but these are not UUIDs because the standardized versions all have problems that can't be ignored. This leads to the case where there are myriad non-standard UUID-like identifiers (in that they are 128-bit) but not interoperable.
As I said in my later comment, let's put the "don't use UUID because high scale" argument to one side, shall we?
Because, per that later comment:
1) Vast majority of database implementations are not remotely "high scale". Most database implementations could use UUIDs and nobody would ever notice.
2) "High scale" brings specific environment concerns, not only related to databases
I would say yes, with the options we have today with modern compute.
We live in a world where compute is powerful enough to enable Let's Encrypt to issue SSL certificates for 235 million websites every 90 days off the back of a single MySQL server.
For high scale environments there are also other options such as async queues and Redis middleware.
Database technology itself is also evolving, and the degree of measurable downside is less than it might have been 10 years ago.
I would still argue that for the vast majority of people, UUIDs are the way to go. I would certainly urge caution against the premature optimisation involved in the "but high scale" argument. Sure, things MIGHT be noticeable at high scale, but I think it's fair to say most people are not operating at anywhere near enough scale for that, and should probably just use UUIDs and cross the "high scale" bridge if/when they ever come to it.
Finally, it's also worth pointing out that all the hyperscalers use UUIDs or other unique identifiers widely in their infrastructure and APIs, all of which must inevitably be tied into a database backend.
For dimensions, UUIDs are usually fine since writes are infrequent. For facts or timeseries data, ordered IDs are more efficient.
One fun instance I worked with directly: Microsoft's SQL Server makes some interesting assumptions based on UUIDv1 and sorts the last six bytes first. In UUIDv1 those would have been the MAC address, and clustering by originating machine first makes some sort of sense in terms of ordered writes. The ULID timestamp is coincidentally also six bytes (48 bits), so (ignoring the endianness issues of the other "fields" in the UUID) you can get Microsoft's SQL Server to order UUIDs in mostly the same way as their ULID representation by just transposing the first six bytes to be the last six bytes.
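The transposition described above can be sketched like this (hypothetical helper names; it deliberately ignores the endianness of the other internal fields, as noted):

```python
# Move a ULID's leading six timestamp bytes to the end, so that a sort
# order which compares the last six bytes first (as described for SQL
# Server's uniqueidentifier) orders values roughly chronologically.

def ulid_bytes_to_sqlserver_guid(ulid: bytes) -> bytes:
    assert len(ulid) == 16
    return ulid[6:] + ulid[:6]   # timestamp bytes moved to the tail

def sqlserver_guid_to_ulid_bytes(guid: bytes) -> bytes:
    assert len(guid) == 16
    return guid[10:] + guid[:10]  # inverse transposition

raw = bytes.fromhex("0184a7e8f3a1aabbccddeeff001122ff")
assert sqlserver_guid_to_ulid_bytes(ulid_bytes_to_sqlserver_guid(raw)) == raw
```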
Unfortunately UUID v6+ won't sort well in Microsoft SQL Server's sort order today.
Other databases will vary on what you need to do to get sortable UUIDs.
A key reference for me down this whole rabbit hole was Raymond Chen's blog post on the many GUID/UUID sort orders just within Microsoft products: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
(With the fun punchline at the bottom being the link to the Java sort order. My sympathies to anyone trying to sort UUIDs in an Oracle database.)
The road of databases is paved with many such bodies.
Whether it is developers treating databases like some black-box dumping ground, or designing generic "portable" schemas, or people who don't know SQL writing weird, long, convoluted queries.
Many people are quick to blame "the database", but 99% of the time it's the fault of those who designed the schema and/or the queries that run on it.
I think your statement "UUID for PKs is not a good idea out of the box" is unfair and paints with too broad a brush. Without knowing the exact details of every bit of your environment (from database hardware upwards), it's not possible to accept such a generic statement as fact.
Will see if I can add a section about security implications. There's a similar time-based argument to be made for ULIDs as well: in some cases you don't want to inadvertently expose a timestamp.
UUIDs could have prevented the leak even if they still managed to completely disregard any authentication logic on the backend.
Guessing them is not as easy as guessing incremental IDs, without doubt, but it's worth correcting the idea that (most) UUIDs are designed to provide security in this situation, beyond maybe a quarter-layer of defence in depth. In fact, the RFC explicitly says:
> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example.
I think people may be misled by the fact that UUIDs are frequently hex-encoded prior to being sent over the wire (or even, stupidly, in the database). It looks like a hash, but it's very much not one.
Edit: This is all referring to RFC 4122, to be clear. It's entirely possible that there are some other UUID schemes out there which do hash their contents.
Apparently there are some proposals to make official UUID variants with this sort of composition too, which some threads in this discussion go into more detail on.
Going to the spec ... Yeah, that's weird. The spec calls those 80 bits "randomness", and apparently you are meant to generate a random number for the first use within a particular millisecond, but on second and subsequent uses you increment that random number instead of generating a new one?
Very odd. I don't entirely understand the design constraints that led to a section still called "randomness" in the spec even though, after the first use, it isn't random.
In other cases, a 128-bit key is simply encrypted (conveniently being the same block size as AES), which allows you to put arbitrary structure inside the exported key.
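A stdlib-only sketch of that idea: a small HMAC-based Feistel network stands in for AES here (in practice you would use AES itself, whose 128-bit block size matches exactly), showing how arbitrary structure can be hidden inside, and recovered from, an opaque 128-bit key. Not production crypto, just a demonstration of invertibility.

```python
# Encrypt a structured 128-bit ID under a key, so the exported ID looks
# opaque but the holder of the key can recover the internal structure.
import hmac
import hashlib

KEY = b"demo-key-not-for-production"

def _round(half: bytes, i: int) -> int:
    # Keyed round function: 64 bits derived from one 64-bit half.
    digest = hmac.new(KEY + bytes([i]), half, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def encrypt128(block: int) -> int:
    left, right = block >> 64, block & ((1 << 64) - 1)
    for i in range(4):
        left, right = right, left ^ _round(right.to_bytes(8, "big"), i)
    return (left << 64) | right

def decrypt128(block: int) -> int:
    left, right = block >> 64, block & ((1 << 64) - 1)
    for i in reversed(range(4)):
        left, right = right ^ _round(left.to_bytes(8, "big"), i), left
    return (left << 64) | right

structured = (42 << 64) | 1001          # e.g. shard 42, sequence 1001
assert decrypt128(encrypt128(structured)) == structured
```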
Do you have a favorite link for more information on this?
* UUIDv6 - sortable, with a layout matching UUIDv1 for backward compatibility, except the time chunks have been reordered so the uuid sorts chronologically
* UUIDv7 - sortable, based on nanoseconds since the Unix epoch. Simpler layout than UUIDv6 and more flexibility about the number of bits allocated to the time part versus sequence and randomness. The nice aspect here is the uuids sort chronologically even when created by systems using different numbers of time bits.
* UUIDv8 - more flexibility for layout. Should only be used if UUIDv6/7 aren't suitable. Which of course makes them specific to that one application which knows how to encode/decode them.
UUIDv7 is thus the better choice in general.
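For illustration, a minimal UUIDv7-style generator along the lines of the draft's millisecond layout. Field sizes shifted between draft revisions, so treat this as a sketch rather than the final spec:

```python
# UUIDv7 sketch: 48-bit Unix millisecond timestamp, then version and
# variant bits, then randomness. Values sort roughly chronologically.
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    ms = time.time_ns() // 1_000_000
    b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
    b[6] = (b[6] & 0x0F) | 0x70   # set version nibble to 7
    b[8] = (b[8] & 0x3F) | 0x80   # set RFC variant bits
    return uuid.UUID(bytes=bytes(b))

u = uuid7()
assert u.version == 7
```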
(I recently wrote Python and C# implementations - https://github.com/stevesimmons/uuid7 and https://github.com/stevesimmons/uuid7-csharp)
Kyzer Davis has since submitted two further revisions -01 and -02 in April and October 2021. See history in .
The current -02 draft is due to expire in April 2022. Presumably Kyzer Davis will try to get it discussed before then.
The GitHub repo tracking these drafts is https://github.com/uuid6/uuid6-ietf-draft/.
> If you suddenly have a million people who want to buy things on your store, you can't ask them to wait because your sequence generator can't number their order line items fast enough. And because a sequence must store each number to disk before giving it out, your entire system is bottle-necked by the speed of rewriting a number on one SSD or hard disk — no matter how many servers you have.
There’s maybe a handful of apps in the world that see so much traffic that this would be a problem. Unless you expect to reach Amazon-scale anytime soon, or need distributed ID generation (like generating them in mobile apps or SPAs), just starting with a simple BIGSERIAL (or rather, BIGINT GENERATED BY DEFAULT AS IDENTITY as the state of the art is) will be good enough to get started.
You can always add complexity to your app later. Taking it away once added is much more difficult.
Besides, UUIDs have fragmentation issues. I'd use ULID if I needed to generate IDs in a distributed fashion.
"Handful" is wrong. Any major system will start to run into this as soon as you start saying the word "scale" in design meetings.
UUIDs can also provide room for encoding other information, like the type of the object or where it was created, since the MAC address is often embedded in the UUID.
And if you want “other information”, good database design would put that other information in a column of its own.
Distributed databases have their place. But the tradeoffs they bring are often not worth it for your 1.0/MVP app.
Here hotspotting is the aim, since it lets you efficiently prune query plans from index scans to direct reads of the right chunk.
If you generate 70 trillion UUIDs, the odds of two of these being a duplicate is approximately the same as the chance of one person being hit by a meteorite this year.
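That claim can be sanity-checked with the standard birthday approximation, p ≈ n²/(2d), where d = 2^122 for the random bits in a v4 UUID:

```python
# Birthday approximation for the probability of at least one duplicate
# among n IDs drawn uniformly from d possibilities.
n = 70e12          # 70 trillion UUIDs
d = 2.0 ** 122     # random bits in a version-4 UUID
p = n * n / (2 * d)
print(f"{p:.2e}")  # ≈ 4.6e-10, i.e. about 1 in 2 billion
```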
To put it bluntly, adding structures like time, MAC addresses or domain IDs to UUIDs to avoid collisions is really not useful, and considering the downside that it leaks that information, it's a bad idea.
UUID wasn’t either but I at least knew that.
128-bit strongly generated payload (instead of 80 bits for ULIDs).
Only 32-bit time precision, but that's wall-clock time anyway.
A year ago or so we had to store a reference to one of our entities into a legacy third-party system, which used char(20) as the column size and of course couldn't be changed.
Since Base85 encodes a UUID as exactly 20 ASCII characters, it saved me from having to add an extra indirection. (Also, and to be honest mainly, from giving ammo to our CEO who had never liked UUIDs)
Of the various 85-character encodings, I thought Z85 was the best one. It's not URL-safe, but it's safe for copy-pasting into queries, source code, XML, JSON, CSV, etc.
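The length arithmetic is easy to check with the stdlib, which ships the RFC 1924 base-85 alphabet as base64.b85encode. Z85 uses a different alphabet but the same 4-bytes-to-5-chars ratio:

```python
# A UUID's 16 bytes encode to exactly 20 ASCII characters in base 85,
# since 85**20 > 2**128 and each 4-byte group maps to 5 characters.
import base64
import uuid

u = uuid.uuid4()
encoded = base64.b85encode(u.bytes)
assert len(encoded) == 20
assert uuid.UUID(bytes=base64.b85decode(encoded)) == u
```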
Should probably add base58 as well; Bitcoin uses it.
Do you know if lex62 has any performance disadvantage versus bases that are powers of two (32, 64, etc.)?
I always assumed the conversion to 2^n bases could be done more efficiently.
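Roughly, yes: for a power-of-two base each output character is a fixed bit slice, so encoding is just shifts and masks, while base 62 needs an arbitrary-precision division per character. A sketch of the two loops:

```python
# Base 32 (a power of two) encodes via bit masking; base 62 must use
# repeated divmod because 62 doesn't divide the bit length evenly.
ALPHABET62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
ALPHABET32 = "0123456789abcdefghijklmnopqrstuv"

def encode_base32(n: int) -> str:
    out = []
    while True:
        out.append(ALPHABET32[n & 31])   # bottom 5 bits -> one character
        n >>= 5
        if n == 0:
            return "".join(reversed(out))

def encode_base62(n: int) -> str:
    out = []
    while True:
        n, rem = divmod(n, 62)           # full-width division each step
        out.append(ALPHABET62[rem])
        if n == 0:
            return "".join(reversed(out))

assert encode_base32(31) == "v" and encode_base62(61) == "z"
```

For a single 128-bit ID the difference is negligible in practice; it would only matter when encoding at very high volume.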
Let's say a UUID comes back with an error message. This could be used to figure out how long it took to generate the error. That could tell you if a particular resource is cached, even if you don't have access to that resource.
Timing attacks are usually pretty creative. It's hard to predict how extra timing information could be misused.
Lol this is an odd bit of conjecture to interject.
"When using a numeric primary key, you need to be sure the size of key you're using is big enough" - as the article itself notes, 64 bits should be enough for anyone.
"That number that you first pulled out and didn't use is lost forever. This is a common error with sequences — you can never assume that the highest number you see as an ID is implicitly the count of the number of items or rows." - true, so always treat numeric IDs as opaque, like UUIDs. The fact that they are actually sequential is an implementation detail.
"You can copy a table over to a new database and forget to copy over the sequence generator object and its state, setting yourself up for an unexpected blowup." - doing weird manual operations on your database offers a wide range of ways to screw up, far beyond this. Just don't do this? When you copy a database around, you need to copy the whole thing, to preserve its integrity. If you're creating a frankenbase, then of course you need to exercise caution. If you're really worried, on app startup, check that the sequence's next value is higher than any existing ID, and crash if it isn't.
"Having a single place where identifiers are generated means that you can add data only as fast as your sequence generator can reliably generate IDs." - this is a real problem, but it's easily overcome by batching. Rather than hitting the database for a new ID every time you need one, the application can occasionally hit the database to acquire a range of IDs, keep that range in memory, and use them as needed. You might be able to build batching on top of the database's built-in sequence machinery, or you might not, or you might prefer not to even though you can. At worst, it means adding a table to the database to track sequence values. Scaling is then accommodated by tuning the batch size and scope (per instance, per thread, etc).
"On a scaling-related note, numeric IDs limit your sharding options" - the approach I have seen is to use batched sequences, and move the sequence machinery out of the database that is being sharded and into a separate service, or its own database. Application instances can all pull batches of IDs from the shared service or database, which ensures that they are non-overlapping.
The nice thing about numeric IDs is that you can start with the simple and easy approach, a standard database sequence, and then migrate to more scalable generation strategies as your database grows, without having to change your data model. The problem of generation is nicely encapsulated.
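The batched-sequence approach can be sketched like this; `_reserve_range` stands in for a real `UPDATE ... RETURNING` against a sequence table, faked here with an in-memory counter:

```python
# Batched ID allocation: reserve a range of IDs in one "database" round
# trip, then hand them out from memory until the range runs dry.
import threading

class IdAllocator:
    """Hands out IDs from an in-memory range, refilling in batches."""

    def __init__(self, batch_size: int = 1000):
        self.batch_size = batch_size
        self._lock = threading.Lock()
        self._next = self._limit = 0
        self._db_high_water = 0   # stands in for a row in a sequence table

    def _reserve_range(self) -> int:
        # Real version: one UPDATE ... RETURNING reserving batch_size IDs.
        self._db_high_water += self.batch_size
        return self._db_high_water

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._limit:
                self._limit = self._reserve_range()
                self._next = self._limit - self.batch_size
            self._next += 1
            return self._next

alloc = IdAllocator(batch_size=3)
assert [alloc.next_id() for _ in range(5)] == [1, 2, 3, 4, 5]
```

Crashing an instance burns the unused tail of its reserved range, which is exactly the "gaps in the sequence" property already discussed above.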
> The nice thing about numeric IDs is that you can start with the simple and easy approach, a standard database sequence, and then migrate to more scalable generation strategies as your database grows, without having to change your data model.
But with UUIDs you don't have to "migrate to a more scalable generation strategy" as you grow, you've started with a simple and easy approach that just keeps working as you grow, no? It would be odd to suggest that's an advantage to sequential IDs.
Or is the suggestion that UUIDs aren't as simple and easy as sequential IDs? I'd say they are just as simple and easy to implement (most DBs will do it for you with no more trouble than a sequence); but they are, it's true, a bit more inconvenient to use as a human-friendly ID, whether in developer debugging or URLs. That is, I'd agree, their main downside.
I do agree that serially incrementing numbers may someday be relegated to the history books, although this is such a baked-in function that, lacking any other prebuilt functions, old-time SQL developers simply reach for it.
Can anyone here provide some hints as to how you can verify that the randomness is actually random?
If I was to create some blackbox and I claimed it generated 100% random numbers, is it possible to disprove my claims?
This question arises with hardware RNGs - could some of them just output random looking data that the dark-powers know the seed of?
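Strictly, no: you can only fail to disprove the claim. Statistical tests reject obviously biased output, but a PRNG seeded with a value only the "dark powers" know will pass every such test, which is exactly the concern. A toy monobit test as a sketch:

```python
# Monobit test: in n fair coin flips, the count of ones should be near
# n/2; the standard deviation of that count is sqrt(n)/2. A large
# z-score rejects the randomness claim; a small one proves nothing.
import secrets

def monobit_ok(bits: str, z_max: float = 4.0) -> bool:
    n = len(bits)
    ones = bits.count("1")
    z = abs(ones - n / 2) / (n / 4) ** 0.5
    return z < z_max

sample = bin(secrets.randbits(100_000))[2:].zfill(100_000)
assert monobit_ok(sample)             # passes for a good generator...
assert not monobit_ok("1" * 100_000)  # ...and fails an obvious fake
```

Real test batteries (NIST SP 800-22, Dieharder, etc.) run dozens of such checks, but the asymmetry is the same: they can only reject, never certify.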
FYI no commitment is made to that by YouTube.
The API merely defines it as a "string", with no further commitment as to length, format or characters.
As for the "how?", there is an unsubstantiated answer based on reverse-engineering posted on SO.
This is similar to C rand(), which you wouldn't want to use in production but is useful when generating the same test data for the same seed every time.
“Great work Geoff! One question: what’s the probability of two transactions having the same ID?”
“It is very low”
“Hmmm. But it’s not zero?”
“It’s so low that it practically is zero.”
“But it’s not technically zero? This company wasn’t built on taking chances, son! Come back when your product complies with our corporate zero-risk policy.”
"Okay boss, we _could_ do that, but do you know what that would mean?"
And then you tell them about cosmic rays, bitflips and redundant computing and what that would mean for the cost of IT at your company.
"... or, we could just use UUIDs like nearly everybody else. I will spend a few days thinking about what would happen in case of a UUID collision and create a mechanism that averts the worst consequences, if you want. That should be enough in my judgement; we could also ask $colleague whether they agree with that conclusion."
On a side note: bosses who think they just need to be convincing enough in order to change physics are the worst. Some bosses expect NASA-level solutions at little or no resource cost.
But there are a lot more for whom "look, Azure uses UUIDs for their VMs, and it's good enough for them" is somehow more convincing.
Would love to hear experts chime in on the tradeoffs here:
In terms of collision resistance, how much does adding a client fingerprint component really help? 80 bits of randomness in ULID already sounds pretty collision resistant to me, since that's a 50% chance of collision only after generating around 2^40 IDs. It kind of feels like the risk of collision in the fingerprinting mechanism itself (here it's described as 2 chars from the PID and 2 chars from the hostname, which honestly sounds a little bit shaky to me), combined with the reduced bits of randomness, could undermine any potential gains in collision resistance from client fingerprinting.
Do folks know of examples where collision resistance through 80 bits or more randomness has failed in practice and generated collisions? Would love to see more reading material on this kind of stuff.