Be Careful with UUID or GUID as Primary Keys (tomharrisonjr.com)
593 points by bkudria on June 9, 2017 | 292 comments

One of the post's points is that UUIDs will scatter your writes across the database, and that for this reason you want a (more or less) sequential key as your primary key. This crucially depends on both your database technology and your query patterns.

In a single-node database or even a manually-sharded one, this post's advice is good (For Friendfeed, we used a variation of the "Integers Internal, UUIDs External" strategy on sharded mysql: https://backchannel.org/blog/friendfeed-schemaless-mysql).

But in a distributed database like CockroachDB (Disclosure: I'm the co-founder and CTO of Cockroach Labs) or Google Cloud Spanner, it's usually better to get the random scattering of a UUID primary key, because that spreads the workload across all the nodes in the cluster. Sometimes query patterns benefit enough from an ordered PK to overcome this advantage, but usually it's better to use randomly-distributed PKs by default.

For CockroachDB, my general recommendation for schema design would be to use UUIDs as the primary keys of tables that make up the top level of an interleaved table hierarchy, and SERIAL keys for tables that are interleaved into another. (Google's recommendations for Spanner are similar: https://cloud.google.com/spanner/docs/schema-design#choosing...)

You make a good point about knowing your database technology. For me the takeaway from the article is "know your database technology and your data" especially if you plan to get big.

The original post refers to RDBMS and seems to be largely based on MySQL/InnoDB experience. InnoDB clusters on the primary key and copies primary key values into secondary indexes, so scattered writes and exploding secondary indexes are a well-known problem if you use UUIDs as the primary key on MySQL. Another well-known issue is that UUIDs can lead to random scattering of reads, which chews up space in the buffer pool.

That said, other RDBMS types don't necessarily have the same problems. SQL Server can cluster on non-primary-key columns or not cluster at all (heap organization). In the latter case secondary indexes on the table use row IDs (RIDs) to refer to rows from the index. Using a UUID as a primary key does not inexorably entail performance problems, though it might if you made bad choices about indexing.

It seems to me people sometimes make a bigger deal of UUIDs and DBMS than is really called for. Badly chosen surrogate keys or time-series clustering can lead to performance problems that are just as severe depending on your access patterns. There are perfectly reasonable workarounds for UUID issues like using integer surrogate keys, which RDBMS support quite well. To use them effectively though you need to spend time to understand your DBMS technology and implementation choices thoroughly. A lot of people skip this step.

With DynamoDB you also want to be scattering your reads. As you scale up the table (either data or read/writes) it will be partitioned and your total read/write capacity is split evenly amongst the partitions.

If you don't have an even distribution of requests across the table, you're going to keep running into throttle limits.

Dynamo and Cassandra use consistent hashing of the primary key; ordered numbers are already hashed randomly.

Thanks for this clarification, it led me to do a little searching.


Has some helpful tidbits, like how primary keys get implicitly copied into all other indexes, making UUIDs quite expensive memory-wise if you have a lot of referencing tables and a lot of indexes.

One of the recommendations I'm not sure on is to compress and reorder the uuid for time sequence.

I get the compression part (remove dashes, convert to bin with unhex), but I'm concerned that reordering the UUID so the time sections are together (to order your writes better) might actually decrease entropy and make collisions more likely?

Does anyone here know much about uuid generation and whether the time sections reordered would make a difference across a table of around 100m~?

I think I might just go with internal incremental and external UUID. Solves the knowing-the-key-before-insert problem with few downsides.

The time-swap thing is now implemented in MySQL 8 (broken link in the MariaDB post)


The probability of a collision can be estimated using the "Birthday Paradox" equation: http://planetmath.org/approximatingthebirthdayproblem

Say you can tolerate a 1 in 1 Billion probability of collision:

With completely random 128-bit UUIDs, you would need slightly over 800 Trillion Rows to have a 1 in 1 Billion chance of a collision. Equation: sqrt(2(2^128)ln(1/(1-1/1000000000)))

Adapting the UUID to contain the first 8 bytes as the number of milliseconds since the 1970 epoch would mean 64 bits of random data. You would need to insert over 192,000 rows IN A GIVEN MILLISECOND to have a 1 in 1 Billion chance of collision. Equation: sqrt(2(2^64)ln(1/(1-1/1000000000)))
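Both estimates follow from the same birthday-bound formula; here's a sketch reproducing them (the function name is made up for illustration):

```python
import math

def rows_for_collision_probability(random_bits: int, p: float) -> float:
    """Approximate number of random keys you can generate before the
    chance of at least one collision reaches p (birthday-paradox bound):
    n ~= sqrt(2 * 2^bits * ln(1/(1-p)))."""
    return math.sqrt(2 * (2 ** random_bits) * math.log(1 / (1 - p)))

p = 1e-9  # tolerate a 1-in-1-billion collision probability

# Fully random 128-bit UUIDs: a bit over 800 trillion rows
print(rows_for_collision_probability(128, p))

# 64 random bits (after a 64-bit millisecond-timestamp prefix):
# roughly 192,000 rows generated within a single millisecond
print(rows_for_collision_probability(64, p))
```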

> might actually decrease entropy? and make collisions likely.

It's rearranging in a deterministic manner; there's no loss of information, it's just a permutation. There is no impact on entropy nor on the likelihood of a collision.
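A quick sketch of why: the swap is just a byte permutation with an exact inverse. This mimics the MySQL 8 UUID_TO_BIN(uuid, 1)-style reordering; the exact slice positions are an assumption based on the standard UUIDv1 field layout (time_low, time_mid, time_hi):

```python
import uuid

def uuid_to_ordered_bytes(u: uuid.UUID) -> bytes:
    """Rearrange a version-1 UUID so its timestamp fields lead
    (time_hi, time_mid, time_low, then the rest), so values generated
    close in time sort close together. A pure permutation of 16 bytes."""
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]

def ordered_bytes_to_uuid(b: bytes) -> uuid.UUID:
    """Invert the permutation: no information was lost."""
    return uuid.UUID(bytes=b[4:8] + b[2:4] + b[0:2] + b[8:16])

u = uuid.uuid1()
assert ordered_bytes_to_uuid(uuid_to_ordered_bytes(u)) == u  # round-trips
```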

The UUID time-swapping trick seems like a bad idea outside of special cases. Overloading the UUID in this way makes you dependent on generating UUIDs correctly so that the ordering works. You can get insertion ordering by using an auto-increment integer primary key which also results in a relatively short key for secondary indexes. This is much easier to understand and cannot be messed up by application errors.

I've been on the wrong-side of this unfortunately.

Rather than implementing a sane archive/delete strategy, data was left there to collect, so there was pretty much no hope for a particular join not to generate a page miss.

Eventually we settled on a NoSQL DB with denormalized data in partitions. A few years later we realized we made a poor choice in having too few and too big partitions, and basically just repeated the mistake with a different database technology (made even worse because now the things on disk we needed to fetch were even bigger!)

Could you retain simple ints as the primary key and use a hash of it for sharding instead? Then you'd get the best of both worlds.
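Something like this sketch (shard count and hashing scheme are hypothetical) keeps sequential integer primary keys while scattering rows evenly across shards:

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count

def shard_for(pk: int) -> int:
    """Route a sequential integer primary key to a shard by hashing it,
    so inserts spread evenly even though the keys themselves are ordered.
    A stable cryptographic hash keeps the mapping consistent across
    processes and restarts (unlike Python's built-in hash())."""
    digest = hashlib.sha256(pk.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Consecutive keys land on scattered shards:
print([shard_for(pk) for pk in range(1, 9)])
```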

>One of the post's points is that UUIDs will scatter your writes across the database

And will wreak havoc on any indexes it is included in [if you do a lot of writes and don't have sufficient padding]... page split city!

Some UUID and Guids can be made sequential.

I can't speak for other databases, but SQL Server's NEWSEQUENTIALID is only guaranteed until the server reboots. After a reboot the next UUID value can be less than the previous one, and you're back to splitting pages.

Normally your application would generate them, not the DB

I didn't provide enough context.

There's nothing wrong with using UUID's, but I would avoid them for clustered indexes. And I would say typically you wouldn't care about whether or not UUID's were sequential unless you WERE putting them into a clustered index.

I'm sure there are use cases where you would worry about the sequential nature for reasons not related to clustered indexes, but in general what I said is true.

Your application generates sequential GUIDs before placing them in a database with a clustered index.

Thus solving the NEWSEQUENTIALID and clustered index issue.

Just for other people reading this thread, as I've posted before[1], a word of caution if you're trying to generate sequential UUIDs client-side for SQL Server: SQL Server does not sort UUIDs in the order you'd expect. The bytes are ordered 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1-0.
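If you need to match that ordering client-side, here's a sketch of a sort key built from the byte order stated above (assuming that list is most-significant byte first):

```python
import uuid

# Byte significance order from the parent comment, most significant first
SQL_SERVER_BYTE_ORDER = [10, 11, 12, 13, 14, 15, 8, 9, 7, 6, 5, 4, 3, 2, 1, 0]

def sql_server_sort_key(u: uuid.UUID) -> bytes:
    """Permute a UUID's bytes so that a plain byte-wise sort in Python
    matches the order SQL Server would sort these uniqueidentifiers in."""
    b = u.bytes
    return bytes(b[i] for i in SQL_SERVER_BYTE_ORDER)

ids = [uuid.uuid4() for _ in range(5)]
ids.sort(key=sql_server_sort_key)  # now in (claimed) SQL Server order
```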

[1] https://news.ycombinator.com/item?id=13120522

I'm not sure what that buys you that you can't get by just using an identity column.

Places where the user request is detached from the actual insert. For example, if you are using a queuing system for user requests but you still need to return an ID to the user.

Fairly common when building a distributed application.

why do you need it to be sequential then?

Because it will eventually go into a clustered index once it's been processed.

I'm not trying to be hardheaded, and I understand I don't see all of the nuance because I'm not familiar with the problem you were solving.

But having said that, I have to go back to asking why you didn't just use an identity of some sort.

I cannot see any reason outside of convenience why you would insist on sequential UUID's for that, and I would argue against the convenience seeing as how UUID's are not naturally sequential and any work you did to guarantee them being sequential would also have been useful for generating sequential integers in the software as well (even 128 bit integers).

You could have literally created a 128-bit sequential integer, waved your hands, and then announced to the world that it's a UUID (you wouldn't even need to set the bit to determine which version of UUID unless you were using it outside of your environment for some reason).

Unless there is some other requirement that I'm not aware of, I don't really understand what it buys you.

COMB GUIDs are normally only partly sequential.

It means they can be generated on any machine, returned to the user, and the request placed on a queue.

Without completely messing up a clustered index.

I don't think you can guarantee the generation of anything sequential across multiple machines without some sort of authority. A single machine, sure as long as the resolution you're using is small enough and/or you have a single point of authority on that machine.

But across multiple machines?

Can you explain to me how you're doing that across multiple machines without any sort of authority? If there's a way to do that I'm genuinely curious.

Most simply use the system clock. Now, if I were building a mission-critical distributed system I would advise against using the system clock and instead use some sort of logical clock. But in this case the clock doesn't need to be super accurate, it just needs to be accurate enough that the row gets placed near the end of the table.

But the worst case in this instance is your insert is slow, and fragments your index. I trust infrastructure to make sure the clocks are accurate.
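A minimal sketch of the idea (the layout is hypothetical; real COMB implementations vary, and for SQL Server you'd also need to account for its unusual GUID byte ordering):

```python
import os
import time
import uuid

def comb_guid() -> uuid.UUID:
    """Sketch of a COMB-style GUID: the leading 6 bytes are the current
    wall-clock time in milliseconds, the remaining 10 bytes are random.
    Values generated around the same time on any machine sort near each
    other, limiting clustered-index fragmentation; uniqueness rests on
    the random tail, and worst case a skewed clock just means a slow,
    fragmenting insert."""
    millis = int(time.time() * 1000)  # fits in 6 bytes until far future
    return uuid.UUID(bytes=millis.to_bytes(6, "big") + os.urandom(10))

a, b = comb_guid(), comb_guid()
```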

The one point I went looking for and didn't see is that UUID's make for very poor clustered indexes. SQL Server tried to mitigate this with the 'NEWSEQUENTIALID', but it essentially 'resets' when the server is rebooted.

If your data is small enough that it doesn't matter then fine, but in general UUID's are better as a surrogate key than as a primary key.

Very few people generate the guids database side. They normally generate them before they even insert it into the database.

Yes, that's in the OP, exactly what you said.

It was, I meant I went looking for it in the article and didn't see it :)

Sometimes, you don't want a workload spread across all the nodes of a cluster.

> secondary primary key

This is called a "candidate key" in existing literature. Much has been written about such things.

Both UUID's and auto ID's are "surrogate keys" because they are arbitrary with respect to the data.

Lastly, "natural keys" are combinations of columns that consist of the business data.

> > secondary primary key

> This is called a "candidate key" in existing literature. much has been written about such things.

To be picky, a "primary primary" key is also a candidate key. A superkey is any set of columns the values of which are guaranteed to be unique in any row in a given table; a candidate key is a minimal superkey (i.e., if you remove any column from the key, you destroy its uniqueness). (Relational purists will want to read "attribute" for "column", "tuple" for "row", and "relation" for "table".)

From a logical perspective, the designation of a candidate key as the primary key is more or less arbitrary; as per this article, that may well not be the case from a physical perspective.

This is super helpful for a mediocre devops guy like me. I had thought through all the issues in the article but didn't know if it was prudent to follow these principles. Now I know what to look for in the literature.

I've been using natural keys in a project recently without even knowing they were called that, or even that it was a pattern.

Is there a book or any other reading you would recommend to be more familiar with these sorts of details? The longest text I could find about this was the c2 wiki. I'd like to know what else I don't know.

Anything more recent by CJ Date. He was at IBM with EF Codd at the beginning of the development of the relational model. Try "Database in Depth: Relational Theory for Practitioners". It's good, when reading his stuff, to remember that SQL is a compromise to fit with a pure relational model, so you see things that make sense, but are not practical. Like choices for candidate keys, I suppose.

Very rarely have I found natural keys that end up working long term. I always default to a single column surrogate and use natural keys as exceptions.

As to your question, any introduction to RDBMSs book will teach all these concepts.

The ultimate problem with natural keys is that their uniqueness is a property of business rules. Unless you live in some magical fairytale land where business rules are stable, sensible, or both, depending on them for uniqueness is asking to have problems.

There is an argument that if your data does not have natural keys -- that if it depends on the addition of arbitrary system IDs to prevent duplication -- then it is not truly relational.

Sure, and I can see that, but my argument is that natural keys which depend on business logic are subject to definitions that can change given a long enough time frame.

Interesting that you would say that. It does seem to me that if your rows have these arbitrary IDs, they are more like objects than relational purists would like to admit. (I seem to recall that the relational people used to beat up on OODBs for lacking this relational purity. But OODBs are pretty much dead anyway, I guess.)

Interesting indeed. I'd say that a primary thing that makes an object is its identity - and the identity is never a part of object data, nor it could ever be. In programming languages, objects' basic identities are determined by however the language implements references; in databases, a natural solution would be a surrogate unique key...

That's exactly it.

And furthermore, objects in real world also have identities, which are distinct from their attributes. You aren't you because you have a certain name and date of birth, or a certain SSN. You're you because you're you. And we all intuitively know and understand this concept of identity. So any software that deals with real world entities has to reflect that concept, which is exactly what synthetic primary keys do.

And it might not be purely relational; but if purely relational model cannot handle this, then it's simply not a particularly useful model for most practical scenarios.

BTW, a pedantic argument could be made here that, ultimately, identity of physical objects is also data - you just need to go low-level enough (ultimately, the identity of any entity is its wavefunction) to see that. Of course, that ignores the practical aspects.

> You're you because you're you.

Are you sure that 'you' today is the same as 'you' yesterday? The notion of identity, however intuitive it may seem to some, is far from being universally obvious.

If you get all the way down to where philosophy mixes with quantum magic, then sure, I can close my eyes, open them again and not be sure if I haven't died in the meantime.

But there is an intuitive concept of identity that every human after some age has. A lot of philosophy goes to point out that this concept of identity doesn't really work well if you stress it enough - as evidenced e.g. by this identity being constant under ship of Theseus transformation. But under this level of scrutiny, a lot of other things break too, including the meaning of words themselves ("what do you mean by 'is'?").

As it is, this concept of identity is natural to us, it's the way we see the world and organize our societies, and it works fine for all practical purposes.

Actually, I was far from being philosophical. Even physics has nothing to do with this. Thinking otherwise would be reductionist. In fact, the notion of (a person's) identity depends on the social context (including law). There is even a process by which a person can have their identity changed.

It's universally obvious in a sense that every human culture and every human language has it (or at least I'm not aware of examples of the contrary).

And it may well be fuzzy, but it's rather telling that this fuzziness nevertheless remains confined to philosophy and metaphysics. Asharites, for example, believed that the world is recreated by God "at every moment" (and that this is what makes the notions of time, motion etc possible), so in that sense, every person is also re-created. But their societies did not derive any peculiar social customs from that belief - they still treated any person at any given moment as if they possessed stable identity over time in practice.

At odds with Reactocese, who suggested that the world at each moment is diffed against a 'shadow world' to apply the minimal set of changes that render the new reality

Identity means that:

1. We have some EQ function such that if EQ(X, Y) reports true, then X and Y are the same object; and

2. There doesn't exist any function which is more discerning than EQ; if EQ concludes that X and Y are the same, then no other function can find a difference.

In programming, the current object may be different from that object moments ago because of mutations. However, inside the virtual machine, we are forced to conclude it's the same object. The reason is that within the machine, we do not have a function which can distinguish the two objects. And the reason for that is that under mutation, we do not simultaneously have the previous object and the new object. Therefore, we cannot apply both of them to a function.

We can discuss differences between yesterday's you and today's you, and find differences. But the discussions of you are not you; they are ideas of you.

Likewise in the machine, we can snapshot the state of an object into some serialized format or whatever, and then two snapshots at different times are different, informing us how the object changed. The two snapshots are not EQ, but they are not the object.

This actually only underscores the usefulness of surrogate keys.

Let's say you are building a cloud-based HR software of some type. You have an employees table, and store employees.

If someone gets married and changes their name, or gets a promotion and changes title, they're still the same person (and the same database entry).

However, if that someone leaves that job, and then gets a new job at a totally separate company -- that also happens to use your service -- they are a new database entry.

I'm not sure I get your point? In the scenario you describe, the natural key for the employees table could be, for example, legal company number + employee's payroll number.

Or are you saying that by using a surrogate key, we only ever have to create one record for an employee, and just update their company details etc. when they change jobs?

I suppose you could derive a key like that, but then is it still a "natural" key? What's the benefit over just having a completely arbitrary key?

I'd argue for a surrogate key in this case because I wouldn't want someone to be parsing my id, and having to worry about changes and backwards compatibility.

E.g., the key might be 2360108 where 236 is company number. Then as we grow eventually you get an id like 10910249, which is ambiguous (company 109 grew to over 10,000 or company 1091?)

I deliberately choose a short prefix for illustration, but this could be an issue with any scheme (you'll never perfectly predict all the future requirements), which using surrogate keys completely avoids.

To be clear, I'm not proposing that the company and employee numbers be concatenated into a single column: I'm saying you could have a composite key, comprising the company number column and the employee payroll number column. This would still be a natural key: each component has a meaning outside the system (company numbers appear on company filings and other formal documentation, employee payroll numbers appear on payslips and are communicated to the tax authorities), and the two in combination have a clear business meaning.

Of course, you may still want to use a surrogate key for performance or security or other technical reasons, as the article describes.

I just wasn't clear on how the specific scenario you laid out made any additional case for their use.

Another good food for thought regarding identity: If you take a complex object like a jumbo jet and replace every single component in it one after another, at the end, is it still the same jumbo jet?

Less hypothetically, most of the atoms in a complex living organism are replaced every few months. (One objection could be that atoms of the same sort are all identical, but that makes the issue of 'identity' even more complex.)

> nor it could ever be

Well, never say never... In a world of immutable objects, an object's identity (reference) can be derived from its content. Look, for example, at how objects are stored in Git.

And of course, by now everyone has seen https://stackoverflow.com/questions/3475648/sha1-collision-d... . It's an interesting problem space, trying to generate a unique ID for the contents of a thing where the size of that unique ID is less than the size of the thing. If you consider that the thing itself (a deep representation of the object) is the most obvious way to uniquely identify the thing, but then figure you can do better, it would seem that it would always be technically possible, outside of lossless compression algorithms (like gzip), to generate something with the same hash. I'm sure that there are tons of theory on this and that my terminology isn't correct here. Can someone steer me in the right direction regarding what this type of thing is called?

EDIT: important detail I'd forgotten: hash functions take arbitrary-length data in and spit out fixed-length hashes. I learned about https://en.wikipedia.org/wiki/Perfect_hash_function and https://crypto.stackexchange.com/questions/29264/why-cant-we... is relevant. Relevant quote: """ Mathematically speaking, there is no such thing as a collision-free hash. Practically speaking, there is. """

A hash function that takes arbitrary data and generates fixed-length hashes will always have collisions, because there are simply more possible inputs than possible outputs. Also known as the Pigeonhole principle[0] - if you have 8 pigeons and 5 holes for them, and you want to put all pigeons in the holes, then at least 3 of them will have to go to an already occupied one.

The practical aspect of making hash function "collision-free" is by using trickery and dark magic to reduce the probability of collision down to somewhere below "Earth just got hit by an asteroid, we all have other problems". You want to go really, really low because hashes are mostly used to quickly verify data, and malicious actors will be happy to throw compute at trying to generate collisions so they can alter the data.

As for how the trickery and dark magic is performed, I'm a wizard of too low level to explain these details to you; all that I recall is that you try and make the hash maximally sensitive to every bit of the input data, so that a random bitflip anywhere has a huge probability of greatly changing the hash.


[0] - https://en.wikipedia.org/wiki/Pigeonhole_principle
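The pigeonhole argument is easy to demonstrate with a deliberately tiny hash (an 8-bit truncation of SHA-256, chosen here purely for illustration):

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """A deliberately tiny 8-bit 'hash': the first byte of SHA-256.
    Only 256 possible outputs, so by the pigeonhole principle any set
    of 257 distinct inputs must contain a collision."""
    return hashlib.sha256(data).digest()[0]

def first_collision():
    """Scan inputs until two of them share a tiny_hash value."""
    seen = {}
    for i in range(257):  # 257 inputs > 256 outputs: collision guaranteed
        h = tiny_hash(str(i).encode())
        if h in seen:
            return seen[h], i, h
        seen[h] = i

a, b, h = first_collision()
print(f"{a} and {b} both hash to {h}")
```

A real 256-bit hash works the same way in principle; the output space is just so vast that finding a collision is computationally out of reach.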

But, to state the obvious, anonymous items in computing (e.g. anonymous functions, or untitled/unaddressed draft emails etc.) generally need to be assigned an id for the software to keep track. The practice of avoiding natural keys and assigning unique ids is analogous. No paradigm is implied.

In programs, anonymous objects always have a unique ID automatically assigned - by the very virtue of being separate entities. Usually those IDs are what we call "references" in programming.

I'm talking about the level of e.g. programming language implementation. A lambda needs to be assigned an id, for example, but it's not a reference that the user can manipulate. It's an implementation detail, as are surrogate keys.

You don't manipulate the reference directly (i.e. you shouldn't learn its numeric value (though you can - useful for debugging), and you can't change it) - but you do use it indirectly all the times with ==, eq, or however the referential equality operator is called in a given programming language.

This is what, given (excuse the Java):

    String a = new String("choo choo");
    String b = a;
    String c = new String("choo choo");
allows you to observe that:

    a == b;
    a != c;
    b != c;
even though:

    a.equals(b) && a.equals(c) && b.equals(c);
under string comparison.

This is exactly like surrogate keys. An implementation detail that's crucial to the meaning of the whole thing.

> Unless you live in some magical fairytale land where business rules are stable, sensible, or both, depending on them for uniqueness is asking to have problems.

Thankfully mine are legally enforced, so it works well :)

It works until it doesn't :). A lot depends on how well they're enforced. SSNs are a go-to example, but in other countries which have actual ID numbers for citizens[0], duplicates do happen every now and then because of e.g. clerical errors. The real world is messy, and if something can be trivially fucked up, it will be.


[0] - as opposed to the way US handles the SSN, where the number was explicitly forbidden from being used for identification, but people use it for it anyway.

There are not only duplicates, but sometimes the documents can be reissued with another number or ID, sometimes there are people without the document that your system expects everybody to have.

Here in Brazil, we have the CPF (somewhat equivalent to the US SSN), a "unique" number assigned by the government for identification. It's used as a key by almost every business I have ever known. Not always a primary key, but people certainly treat it as unique.

Except they are not unique: there have been cases where people from different states had the same number, because each state generated this "random" number without actually checking with the others whether it was really unique. It is legally enforced that no two people should ever have the same number, but they do.

The only way to have uniqueness is if you know the algorithm AND can control that it behaves as expected. Anything else is just kidding yourself.

Even that is not a forever guarantee. Laws can and do change.

And people act outside the bounds of the law. SSN seems like a good natural key candidate, until you discover some of the people your business interacts with are illegal aliens.

Or legal aliens, like tourists. Or foreigners, who aren't even in the country. Or really anyone else who doesn't work in the country, and doesn't have to report taxes.

Ironically, many illegal immigrants actually do report taxes. In which case they use an ITIN, which is compatible with the SSN format-wise (###-##-####)... but it is not unique over time.

Definitely true, but at a certain point you need to draw a line. If these things were to change, the entire product would have to change, not just what I use as my keys, so I think natural keys are a safe bet in my case.

> Thankfully mine are legally enforced, so it works well

What, you think laws never change?

I strongly agree. Natural keys are an ivory-tower pattern, and a software-development anti-pattern.

Why not change the definition of natural keys then?

Pretty much any undergraduate database course will discuss this early on. One of the first MOOCs #dbclass covered this in depth.

My database courses were pretty terrible, everything besides basic SQL syntax and some implementation specific stuff I've had to teach myself (some classes were pretty much an ad for MS SQL). And judging by many people I've worked with, I think this may be true more often than not.

I think I've found the #dbclass series on YouTube, is it the one from Stanford? And thanks for the help.

Yes, you want the course taught by Jennifer Widom of Stanford offered on Lagunita.

Links below:



Check out the Data Warehouse Toolkit. You might want to look into natural vs surrogate keys.


Found a pdf copy (just to browse, I promise!) and it looks great. Ordered a used copy, thanks.

Edit: no instances of candidate key when I ctrl+f though, still looks pretty good.

Fundamentals of Database Systems (which is an academic textbook, check Abebooks) covers a lot of this along with underlying design. I'm going through it now for a DB class.

> Botnets will just keep guessing until they find one.

Why does your security rely on primary key obscurity? This seems like you're doing something horribly wrong, put some authentication on that or something.

And no, no they won't. Hitting a collision is very hard if you're using cryptographic strength random UUIDs, you wouldn't even be able to bruteforce 64 bits over the internet in a reasonable timeframe.

Go ahead, try the math on that. The only reason small keys are vulnerable to local attack is that you can perform an enormous number of attempts per second, often billions of attempts per second, and keep at it for as long as you want. The database server won't let you query anywhere near that fast. You will never get anything like that for network-based attacks, as you're limited by bandwidth, latency, and of course the other side, who will notice if you even try to do this for any significant period of time and will likely block your attempts or limit them greatly.

> Why does your security rely on primary key obscurity? This seems like you're doing something horribly wrong, put some authentication on that or something.

You can't put authentication on everything. Some things are inherently vulnerable to guessing because you can't toss an access control on them (the thing is the access control), which is what the high bit-lengths and rate-limiting are for. I wouldn't do this with a customer's UUID because it's static, but sooner or later you have to rely on security by obscurity somewhere (for example, password reset tokens).

As a theoretical matter a UUID shouldn't be the sole distinguishing data for authorized actions, but in practice it's probably best not to expose them in case you have an errant insecure direct object reference (which approximately all modern applications do, somewhere or other).

I'm just nitpicking your use of the phrase, "security by obscurity" here. That terminology is levied as a criticism against a lot of modern security design but it's really fine (and sometimes necessary) to use it for defense in depth (unlike, say, a cryptographic algorithm, which does not generally benefit from security by obscurity, though particulars of the peripheral infrastructure can).

You can't put authentication on everything... but you can put an HMAC on almost everything, e.g. /receipt?id=1&hmac=hmacsha256(1, somesecretkey), which decouples your enumeration prevention from your ID system.
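A sketch of that scheme (the key and parameter names are made up for illustration):

```python
import hmac
import hashlib

SECRET_KEY = b"somesecretkey"  # hypothetical; keep out of source control

def sign(record_id: int) -> str:
    """Tag an ID with an HMAC so only URLs we issued verify."""
    return hmac.new(SECRET_KEY, str(record_id).encode(),
                    hashlib.sha256).hexdigest()

def receipt_url(record_id: int) -> str:
    return f"/receipt?id={record_id}&hmac={sign(record_id)}"

def verify(record_id: int, tag: str) -> bool:
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(sign(record_id), tag)

print(receipt_url(1))
```

An attacker who increments `id` still has to forge a matching tag, so sequential IDs stop being enumerable without changing the ID scheme itself.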

True, but you've just shifted the goalpost a little. Instead of brute-forcing a random token I'm now brute-forcing a MAC, or a secret key. You don't really need to use a MAC if you're already implementing rate-limiting and time-based expiry (in addition to use-based). That's just a performance hit. You might as well send a token generated by a good PRNG and skip the integrity checks.

It's unrealistic to brute-force one well designed token in the absence of authentication (if it's long enough you could go without any other security controls, including expiry or rate-limiting, but that's also a performance hit at scale). Adding one with an integrity check is effectively rate-limiting anyway.

Funny story though - I was once on an assessment in which a company used a scheme just like that. The UUID was HMAC'd for a password reset request, and UUIDs could be found by messaging another user. Unfortunately, they used the same secret key for password reset HMACs as the one embedded in their mobile app for HMAC-signing requests. Worse still, password resets didn't need to be activated, so you could effectively take over any account by retrieving the HMAC secret key from the app and running the victim's UUID through it.

A performance hit on your web servers might very well be worth it if it means your database can cope with your data better, though obviously, test that. It tends to be a little more difficult to scale up your database server than your web server.

The other side of it is that safety from enumeration attempts is the sort of thing that gets added on well after the software's actually in use, unfortunately, and changing existing keys might not be practical.

A social network I use is in the position where they don't really want people enumerating their users by ID... but all their users are identified in the URL by sequential ID. An HMAC would help them, and isn't difficult to implement.

And yeah, bad implementations are bad, but I wouldn't trust someone who can't use an HMAC securely to generate and use unguessable IDs properly either.

The "performance hit" is the basis of most modern security [1].

[1] https://en.wikipedia.org/wiki/Computational_hardness_assumpt...

I'm well aware of the CHA (this is my primary research focus :). But that doesn't mean you should strive to make things computationally difficult if you don't need to (for example, Rijndael vs Serpent for AES). After a certain point there's no need for further computational cost.

I'm not saying a long random number cannot be secured or that using one is security by obscurity in the bad way, just that it's far too easy to leak a primary key by accident.

That quote refers to sequential IDs:

The original issue with simple auto-incrementing values is that they are easily guessable as I noted above. Botnets will just keep guessing until they find one.

This is why I'm starting to loathe SQL. The theory is great, but when the theory meets the practice and everything falls apart, the perfect kernel of relational beauty turns into a trash fire and I just want to get my freaking graph of objects out of the database. If I use numbers for keys, I deal with disaster when I try to merge from disparate sources. If I use guids as keys, I get terrible performance. Or I can just use a goddamned document store of Json or XML and have related objects get stored right next to their parents and tell the beautiful mathematics of relational algebra to shove it.

I'm tired of hearing "you don't have to say how to get the data, you have to tell the database what you want and it will get that in the most efficient manner" and then deal with an encyclopedia of byzantine rules to get it to do the aforementioned "efficient manner" with anything approaching decent performance. I can see the art, but the practicality mars it beyond recognition. It's like Venus de Milo sculpted out of duct-tape and bubble gum.

Sorry for the rant, I'm just getting frustrated with performance problems in small data sets. I've taken the courses, I've read Date and Darwen, and I'm just starting to get terribly disillusioned.

I get this feeling from developers from time to time, and occasionally I'll get to see them on a project where they use a document store for long enough that they start to appreciate why we started using RDBMSes in the first place.

You trade complex retrieval of subgraphs for complex retrieval of aggregate measures. Sometimes it is worth it, but I usually find that if you need both (and that is a big if) it's easier to scale relational databases than it is to scale document stores.

This is for many reasons, but an underappreciated one is that it is easier to cache individual objects than it is to cache aggregate measures. So if you need a subgraph, just store it as JSON in a table full of simple key-value mappings, with some tables that handle how to invalidate the cache. But because aggregate measures span so many business objects, it's harder to achieve this. Furthermore, it is more likely to be possible to enumerate subgraphs than it is to enumerate aggregations.

For example, imagine a tool that estimates the number of companies that meet certain criteria. "Count the companies by country that have at least one employee who earns at least $100k a year and who has had at least 2 previous positions." The number of permutations is endless, and while these types of queries aren't trivially fast on RDBMSes, it is at least possible to handle them.

I'm not saying any of this is really easy, but humans don't live in an easy world. I used to be a structural engineer (EIT), and what amazed me was that despite the fact that we've built thousands of skyscrapers, each one was its own hell. Concrete trucks would go bad, water would gush out of places in the ground it wasn't supposed to, people would die. But we continue to push forward because we want to succeed.

SQL is the best when it comes to reasoning about information as tables. If you want to reason about tables as graphs, SQL becomes impractical for many applications.

Performance characteristics of different RDBMSs vary by orders of magnitude depending on workload.

Keys are also a flexible concept. They can be formalized to ensure referential integrity (domain), or you can join on whatever you want.

Recursive queries can handle some reasoning on graphs, but they quickly become impractical.

Even so: The flexibility SQL offers when you want to change which data you think should be related is tremendous - on HPC database machines :)

I recall reading something about this on the PostgreSQL mailing list, in a message written in 2016 that may still be relevant:


“There's no substance to these claims. Chasing the links around we finally find this article: http://www.sqlskills.com/blogs/kimberly/guids-as-primary-key... which makes the reasonable argument that random primary keys can cause performance robbing fragmentation on clustered indexes.

But Postgres doesn't _have_ clustered indexes, so that article doesn't apply at all. The other authors appear to have missed this important point.

One could make the argument that the index itself becoming fragmented could cause some performance degradation, but I've yet to see any convincing evidence that index fragmentation produces any measurable performance issues (my own experiments have been inconclusive).”

MySQL's innodb has a clustered index. On DB2, SqlServer, Oracle, etc, they are configurable. PostgreSQL is more like MySQL myisam in that the data and index are separate. In this case, the random nature only screws up the index (assuming btree), not the data.

It can be a bigger issue if you proxy SQL queries and shard to 1000 DB servers, and thus rely on part of the query to indicate which shard the row is on. You can do this with a type of UUID that embeds the shard, or you have to include additional parts in the query beyond the PK to get the row (it could be a column that the sharding is based on, or a comment). And yes, be careful with joins from a root object's table to other tables that use UUIDs unless you can guarantee they are on the same node.

"Things got really bad in one company where they had decided to use Latin-1 character set. When we converted to UTF-8 several of the compound-key indexes were not big enough to contain the larger strings."

This shouldn't be right. UTF-8 encoding uses the same 8 bits for each valid UUID character that Latin-1 would. Unless someone put invalid characters in the UUID field, I would guess that the new encoding was actually UTF-16 or something.

In MySQL for example, using UTF-8 charset will cause it to use 3 or 4 bytes per character.

All the bytes transmitted on the wire will be UTF-8 strings, but this is disconnected from the underlying storage engine.

I don't know if this is the case for other databases, but when I read this my first thought was "This must have been MySQL"

W... What? Why?

The whole point of UTF-8 is that it's identical to Latin-1 (edit: ASCII) except in cases where it can't be. In those cases, it uses a very efficient encoding.

I understand that people (for whatever reasons) don't want to store text in UTF-8, but that's incredible that such a popular piece of software would take that notion to the extreme. "Let's make UTF-8 almost equivalent to UTF-32 when the data is at rest! Yes!"

Are you sure that info is correct? And if it's correct, are there any valid engineering reasons for that decision? It's probably best to be charitable and assume there must be a reason.

It is beyond correct. MySQL's "utf8" only stores characters that encode to 1-3 bytes.


You have to use utf8mb4 to get proper UTF-8 support. Yes, it's stupid; this is the main reason I prefer not using MySQL for anything. Too many gotchas lurking about.

Exactly, but in any case even MySQL's version of UTF-8 should store a UUID the same way as ASCII.

In dealing with conversions or characterset handling the UUID pitfall is the hyphen. This is an extremely vulnerable character when dealing with multiple data handling and feed sources (to say nothing of data conversions).

The hyphen has a handful of imposters, and one of the more troublesome ones is in extended ASCII as "en dash" (there's another as "em dash"). These do not map back to the ASCII hyphen under Unicode conversion, as they are not "hyphens", but they sure do look like them to humans in a hurry. It will throw your meticulously thought-out and implemented keys (or any other well-measured column) into a tailspin. Dealing with extended ASCII implementations has been the most hazardous area of character-set handling, at least for me.

Although probably unlikely, the UUID is not guaranteed a direct ASCII-to-Unicode conversion. I'd advise awareness and perhaps even some level of caution if you have a complex data flow (or, more apt, a seemingly ridiculously simple and bulletproof one).

Extended ASCII sucks so if you use it and don't have one or both feet nailed to the floor then get off of it, sooner.

Store, yes. But indexes don't do variable length (think about the index being a btree of binary values), so it gets expanded. UTF8 (utf8mb4) will thus expand to 4 bytes per character. And MySQL (pre 5.7) will stop you from making such stupid indices.

> The whole point of UTF-8 is that it's identical to Latin-1 except in cases where it can't be.

It's only compatible with ASCII, not Latin-1. You are maybe confusing UTF-8 with the Unicode codepoint space, of which the first 256 codepoints are the same as Latin-1.

Just in case there's any confusion:

- 7-bit ASCII encodes 128 characters; latin-1 encodes 256 characters; UTF-8 encodes all unicode characters

- latin-1 and UTF-8 both extend 7-bit ASCII; the three encodings are equivalent for ordinals 0-127

- UUID characters are part of 7-bit ASCII, so they have the same encoding in ASCII, latin-1 and UTF-8
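The three bullets above are easy to verify directly. A quick Python check, using the UUID string that appears in the LENGTH() examples elsewhere in this thread (any UUID behaves the same):

```python
# A UUID rendered as a string: 32 hex digits plus 4 hyphens, all below
# code point 128, so ASCII, Latin-1 and UTF-8 encode it byte-identically.
u = "15c915fc-4d50-11e7-9bc7-50b7c307c620"

assert u.encode("ascii") == u.encode("latin-1") == u.encode("utf-8")
assert len(u.encode("utf-8")) == 36  # one byte per character, hyphens included
assert all(ord(c) < 128 for c in u)
```

So any size change after a Latin-1 to UTF-8 conversion has to come from how the database budgets column and index widths, not from the UUID bytes themselves.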

100% correct - UUID strings are ASCII < 128 .. but there are imposters for character code 45 ("-"). Assumption based approaches to conversions or field handling should not be assumed to be safe ones just because UUID is "technically" ASCII. Only the keeper of the field can truly enforce that constraint.

Aha, thanks for calling me out. I was indeed thinking of ASCII.

I think this is correct for CHAR and incorrect for VARCHAR.

A CHAR column always allocates the maximum size, so a utf8 (MySQL's weird "UTF-8, but only for the BMP" encoding) CHAR(100) needs to reserve 300 bytes per row, and a utf8mb4 (MySQL's actual UTF-8 encoding) CHAR(100) needs to reserve 400 bytes per row.

But a VARCHAR is dynamically allocated, so, yes, a VARCHAR(100) that stores 100 lower-ASCII characters is only going to use 102 bytes of storage (using 2 bytes for the string length).

See, e.g., the "tip" section at the end of https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8...

Author mentions compound keys, so issue was probably with the index key prefix length, which is by default limited at 767 bytes. That limits you to a maximum of having just first 255 characters of string data indexed with utf8, but in latin1 (one byte per character) you can have 767 characters long prefix. So with latin1 it's perfectly ok to have compound index on 2 varchar(255) columns, but you can't do that with utf8. Converting from latin1 to utf8 will break all those indices.

The 767 limit is with the default row format. In 5.7 the row format changes, and allows 3000+

With innodb it is variable storage regardless of varchar or char when UTF8.

It wouldn't be the only utf8 quirk on mysql: utf8 isn't even utf8. For that you need utf8mb4.

(In David Attenborough's voice): And here we have the gentle Sillysaurus in its natural environment, grazing on a field of failweed in the tidal shallows. Notice, however, that his eyes remain just above the surface of the water, ever vigilant for the dreaded Snarkasaurus Rex.

A flock of Correctosaurs glide by overhead, scanning the muck with uncanny vision for the telltale motion of their Suttlebug prey. It is their rare, mid-air collisions that give the Suttlebug away, despite their otherwise excellent camouflage.

Hi! I am the product manager for the MySQL server. I can confirm that for InnoDB, this is not the case - storage is variable length.

You might be thinking about the memory storage engine (and temp tables).

I believe everyone is referencing the index key size length in MySQL. In that case you can get "ERROR 1071 (42000): Specified key was too long; max key length is xxxx bytes" errors.

This happens even in InnoDB -- it's taking the maximum byte size of all the key fields. For a single VARCHAR(255) in MySQL utf8 that ends up being 3*255=765 for just that one field. If you add other fields, it's really easy to hit the max key length and have the create or alter fail.

See my reply to parent on why the max must be calculated (no false positives on duplicate key in edge case).

The default row format changes in 5.7, so key size is max is now much larger.

Oh yeah, that makes sense. At least for CHAR(36), but I'm not sure how VARCHAR(36) as given in the article would behave. (Never mind I just looked it up and it depends on the storage engine - e.g. MEMORY will always allocate the max size.)

You'd be insane to put the primary key in a varchar. Mysql is much faster with static width tables. Your main user table should be static width for all the fields.

As opposed to the insanity of storing UUIDs as strings at all? Once you start going into the rabbit hole, you better be prepared to go all the way down.

No doubt. You can store them as INTs! Tho it does get a little hairy dealing with the conversion with MySQL.

Is this true for all storage engines?

It is not true for innodb (default). See my other comments in this thread.

Storage, and index are different. You expand the UTF8 chars into bytes that are then part of a binary tree. It actually gets more interesting when you consider the collations.

Really? That sounds like a tremendously stupid design.

But even with that design, you'd expect VARCHAR(36) to have space for 36 ASCII characters, not 9.

It does - it allows for 36 code points using your current charset, which with UTF-8 can mean up to 144 bytes.

Of course, that is the worst case, but for index size limits, MySQL has to assume the worst case, which means you quickly bump into the default index key limit of 767 bytes with large string keys.

That can't be true. The article's anecdote is converting from LATIN-1 to UTF-8 causing the indexes to not be able to contain the "larger strings", but UUIDs are ASCII (when represented as strings), so the index size shouldn't change unless the storage engine is storing it internally as something other than UTF-8.

And the parent comment says "In MySQL for example, using UTF-8 charset will cause it to use 3 or 4 bytes per character.", and if it's really using 3 or 4 bytes per character regardless of what that character is, then it's probably storing it internally as a sequence of unicode scalars instead of UTF-8.

Granted, upon re-reading this, neither source explicitly says that a CHAR(36) column won't actually be able to hold 36 characters, but if it can then I don't know why the indexes would stop working.

EDIT: It occurs to me that a CHAR(36) column would require 4 bytes per character to be allocated, since it's a fixed-width column, though it should still only actually use 1 byte for an ASCII character. But 144 bytes is still well under the 767 byte limit you mentioned.

It's true.

The problem isn't so much the content itself, but the column and index definitions: keeping the same length in characters in the definition will use more bytes in utf8 or utf8mb4, at least against the storage engine index limits. I'm not sure whether this is because MySQL is just lazily/smartly computing the worst case, or if it's actually storing using multiple bytes per character regardless. I assume it's the former.

The documentation's example of the problem with InnoDB goes like this:

This definition is legal, a 255-character index at 3 bytes per char coming in under the 767-byte limit:

    col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))
But to use utf8mb4 instead, which allows for characters beyond the BMP and can use up to 4 bytes per character, the index must be smaller, only covering the first 191 characters:

    col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))
It comes up often when people have used a VARCHAR(255) as a default "kinda big" string column, are indexing the whole thing, and want to switch to utf8mb4.

In the case of the UUIDs in the article, I believe it mentioned it was a multiple-column index, so the increase of 3x (or 4x) from latin1 could easily push them over the limit if there's another big column or columns in the index.
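The arithmetic behind those limits is easy to sanity-check. A small sketch (the 767-byte figure is the pre-5.7 default InnoDB index key limit discussed above; the function name is just for illustration):

```python
# InnoDB (pre-5.7 default row format) caps index keys at 767 bytes and
# must budget the worst case: bytes-per-character times declared length.
INDEX_LIMIT = 767

def worst_case_index_bytes(chars: int, bytes_per_char: int) -> int:
    """Worst-case byte width MySQL reserves for an index prefix."""
    return chars * bytes_per_char

assert worst_case_index_bytes(255, 3) == 765   # utf8: just fits under 767
assert worst_case_index_bytes(255, 4) == 1020  # utf8mb4: over the limit
assert worst_case_index_bytes(191, 4) == 764   # hence the common 191 prefix
assert worst_case_index_bytes(36, 4) <= INDEX_LIMIT  # a lone CHAR(36) UUID is fine
```

This is also why a compound key of two utf8 VARCHAR(255) columns blows past the limit even though the same definition was legal in latin1.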

You must be confusing UTF-8 with 4-byte Unicode. UTF-8 by design will use variable length encoding, otherwise it's not UTF-8.

I can confirm both these commands return 36 in my mysql installation:

SELECT LENGTH('15c915fc-4d50-11e7-9bc7-50b7c307c620' COLLATE utf8_general_ci);

SELECT LENGTH('15c915fc-4d50-11e7-9bc7-50b7c307c620' COLLATE utf8mb4_general_ci);

Nope, tveita is right.

MySQL's storage has some space limits defined in terms of bytes, and to express that in terms of characters it has to allow for the worst case.

If it allowed 255-byte space to be used as VARCHAR(255) in UTF-8, then an insert of 255-character long string with emojis would fail.

The name VARCHAR is confusing since it's not properly defined what a character is. 1 byte? Something on the unicode table? If it'd be named VARBYTE(255) it would be pretty obvious that a 255-emoji-string insert would fail.


That's not a real issue. I can understand this may bite some people that don't read the docs or don't understand encoding, but it's more misinformation than a technical problem.

UTF-8 is variable length anywhere, not just inside mysql. If you don't want this behavior, you can either use UCS-2 or UTF-32. UTF-16 and UTF-8 are variable length encodings, period.

Most databases that use char lengths for Unicode actually work with UCS-2 and use 2 bytes per char, like MS SQL Server.

Not sure what deserved a downvote here... Is there anything incorrect?

It's not incorrect, but a strawman (I didn't downvote you btw).

The question was why UTF-8 in MySQL has weird limits. It wasn't about what MySQL should have done in an alternative universe.

Yeah, but I didn't point out what MySQL should have done in an alternative universe. MySQL DOES support fixed-length Unicode; just use the correct encoding: ucs2 or utf32. What it does with UTF-8 is what any system that supports UTF-8 must do.

You can specify encodings on a per-column basis, at least with ISAM tables, so you can have a UTF-8 database with Latin-1 key values. MySQL 5.7 has some functions to aid string wrangling of guids, so you can store and index them as bin(8) and address them as string values.

More fun MySQL Unicode facts... MySQL's utf16 budgets 32 bits per character to ensure it can handle supplementary characters. utf8 only handles characters up to 3 bytes; you've gotta use utf8mb4 if you expect to handle 4-byte values. You can create prefix indexes for Unicode string values that exceed the maximum index size.

That's remarkably bizarre. Who implements "UTF-8" but restricts it to only the BMP?

Remember this is for in-table storage, so it makes a certain amount of sense - this saves a byte over UTF-16 with support beyond the BMP. You have a hard limit on the byte size of the table - how do you determine a priori how much storage a 20 character UTF8 field will consume? The alternatives are to store the value in a clob field or set a hard byte count on the field and let the application or user be surprised when 20 print characters are rejected. I actually don't know how other providers handle in-table Unicode fields, MySQL made some poor choices on naming things at the least.

At a bare minimum you'd expect it to be something like utf-8bmp for 3-byte storage and utf-8 for 4-byte.

It's historical. When it was implemented 4 byte unicode did not exist.

Now that it does they can not change the names anymore.

It's the same as why windows is stuck with utf-16, because when they implemented it unicode was 2 bytes.

> When it was implemented 4 byte unicode did not exist.

Incorrect. When UTF-8 was invented, it was actually variable up to 6 bytes in length, being capable of representing code points up to U+7FFFFFFF. It was only shortened to 4 bytes in 2003. There is no point in history where UTF-8 was only limited to 3 bytes.

That doesn't seem right.

The 1998 version of MySQL didn't support unicode at all yet.

Unicode 2.0 introduced UTF-16 in 1996, making the need for non-BMP characters very explicit.

And UTF-8 at the time supported 31-bit code points.

Sure, the VARCHAR(36) column would have enough room, but if you have a 36-byte index on that column, then suddenly it won't be indexing the whole column.

The reason is not that the indexes became larger; it's that MySQL has to protect against all-4-byte characters, since the standard defines the n in VARCHAR(n) as characters, not bytes. This is obviously exceedingly rare, but if it instead truncated somewhere in the index you could get a duplicate key error for something that isn't a duplicate.

In MySQL 5.7+ the default row format allows 3000+ potential bytes (up from 767), so this becomes a non-issue for most cases.

sounds like the author thinks "uuids are a pain" and wants their benefits but with a smaller representation, yet doesn't provide any reasons why uuids are a pain other than not being able to remember them or say them out loud. these are not things anyone does with primary keys!

you'll never say this out loud : 7383929. you may be able to remember it, maybe. in a uuid you'll match the last few and first few letters just as fast in your head

uuids are fine. sorting is an issue but at scale (the entire point of this article) how often do you need to sort your entire space of objects by primary key? you'll have another column to sort on

hiding primary keys and having 2 keys seems like a great way to make all queries and debugging 2x as complicated

I've started using ULIDs when I need IDs that are time-sortable while still retaining the advantages of a UUID. Read more here:

The ultra-quick summary: 48 bits of timestamp followed by 80 bits of randomness. Also, when representing as strings, they use Crockford Base32 encoding, meaning 26 characters instead of a UUID's 36. (I'd argue that UUIDs should use this encoding as well.)
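A simplified sketch of that layout in Python (not the reference implementation; the real spec also covers things like monotonicity within the same millisecond, which this omits):

```python
import os
import time

# Crockford base32: no I, L, O, U, to avoid misreads.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    """48-bit millisecond timestamp followed by 80 bits of randomness."""
    ts = int(time.time() * 1000) & (2**48 - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    n = (ts << 80) | rand
    # 128 bits fit in 26 base32 characters (26 * 5 = 130 bits).
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[n & 31])
        n >>= 5
    return "".join(reversed(chars))

a = ulid()
time.sleep(0.01)
b = ulid()
assert len(a) == 26
assert a < b  # fixed-width encoding makes later timestamps sort later
```

Because the timestamp occupies the high bits and the encoding is fixed-width, plain lexicographic string comparison matches creation order, which is exactly what a clustered index wants.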

I would think it would be better to keep the timestamp in a separate column. In essence you're duplicating the timestamp across FKs, which is space-wasteful - and a little abstract, as one field is holding multiple pieces of data.

Isn't that the point of Snowflakes anyway? Sortable UUID style keys?


Completely agree. I have been using UUIDs for a lot of stuff for a while, and you get used to it just as you get used to any other thing. Performance is usually indistinguishable. Most of the issues in the article are related to practices, some are related to specific platforms. Every piece of code may bite you if you don't know what you're doing.

Now, if the developer doesn't know when it makes sense to use a UUID, there are certainly more problems in the application to worry about. Which is 99% of the cases in most business applications.

In MySQL/InnoDB, using UUIDs as your primary key will eventually lead to pretty substantial fragmentation and use way more disk space than necessary.

You can see this visualized with the innodb_ruby project here:


This is true, but I wasn't able to find a real issue with this. This issue is likely to appear in benchmarks, but unless you're trying to squeeze that last bit from the database, not using any caching anywhere in your app, not using any index, not having enough free memory, etc you're likely to never notice the difference.

> you'll match the last few and first few letters just as fast in your head

Definitely true, in fact I tend to prefer hex for this purpose. But one minor note - assuming the whole thing is cryptographic-strength random, there's no need to compare first and last, just do first only or last only (last is usually preferred for UUIDs due to them sometimes containing a timestamp at the start), the odds of an accidental match is the same and the visual processing time of switching from one end to the other is lower.

You'll never say a phone number like 738-3929 out loud? Short identifiers are useful in a lot of places. I work in IOT, and it's nice to have a short device id when we need to troubleshoot something.

Where I come from, our tables go towards a billion rows, so there's copying of keys regardless of whether it's uuid or sequential id.

for event store databases, you'll need to have some sortable sequence ID-- you can't always even use timestamps, since on occasion you'll get multiple events happening at exactly the same time, and you still need to sequence based on received order.

Well, event stores are a very specific type of database that exists solely for the purpose of knowing exactly the order of events. It's not fair to say UUIDs are a problem because one very specific case shouldn't use it.

Definitely not! In fact, I use UUIDs everywhere else but the event store. I was just pointing out one case where having an int (well, bigint) primary key is actively useful, and UUIDs aren't.

(edit: spelling)

Maintaining UUIDs is much easier than maintaining id/int lookups that may be autonumbered (mssql, mysql, pg) or sequenced (oracle), even if using them internally and UUIDs externally. This especially comes into play when syncing across dev, staging and production environments and when clustering and servicing out parts of your app.

The moment any db starts to grow to these areas, UUIDs lead to far less issues than incrementing ids everytime.

Most RDBMS now have optimizations and native types (uniqueid) for UUIDs/GUIDs and this is really a moot point at this point, most UUIDs are no longer strings in DBs unless legacy from the time before native UUID types.

UUIDs are right for most projects but not all and as typical in any system, the environment and needs of your project will dictate whether it makes sense to use them.

UUIDs eliminating the round trip and the need to deal with autonumbering/sequencing is a massive benefit; the only real con of UUIDs is the extra 8 bytes, but they make up for it by reducing the need for lookups at runtime when creating new data or associating data with it.

It is like that line from jurassic park 2...

(The problem with UUID is that the two first letters are a cruel joke. But hey: people need to make their own mistakes because some things you can't just tell people: some rakes they need to step on until they learn or are unable to reproduce)

Universally Unique, if there is ever a conflict you can easily use a constraint or check upon creation on the app/server/service side, no different than checking if an int id or slug/public code is already used. No roundtrip is needed for every new item and if there ever was a conflict you just have an api error stating as much. You might want to also have it email you to play the lotto that week because it is virtually impossible.

Try syncing autonumbering across different database types where some use autonumbering and others use sequences. That is a nightmare scenario but there would be zero issues with UUID keys.

Primary keys as ints/longs will be constant rake stepping as you grow and maintain an app.

UUIDs emerged from a need that int keys weren't filling especially in a distributed/serviced world. I have used them in most DBs for over a decade and no rake steppage.

Side note off topic: If you were a piece of data would you want to be a universally unique piece of data or an int and merely a clone? All my data is a unique piece of data, I am a good Bender.

I've worked on quite a few distributed systems where you need to generate unique IDs and I can tell you that UUID is not a solution in quite a few of them, because you actually need to understand what problem you are solving rather than taking the promises of others for granted or shrugging and gambling on things sorting themselves out.

In many cases you have two problems: network partitions and/or high ID-assignment rates. So statements like "if there ever was a conflict you'd have an API error stating as much" simply pass the buck to some piece of magic while actively not knowing what goes on. It is a bullshit statement.

In my experience, this is the kind of problem where people tend to fail because they don't grasp that it is actually kinda hard. Especially if you have additional constraints (like the size of IDs). People tend to make all manner of assumptions and then not test those assumptions.

The arguments are familiar though. I hear a re-run of them about every 2 years, when someone has a system in production and the thing starts to generate duplicate IDs. Although both code and notes are available on previous solutions within the company, people tend to ignore that and just plow ahead.

Can confirm: using MySQL and for reasons... everything in the DB gets a primary key set by taking a random UUID, stripping the dashes, and then doing an `UNHEX(id)` in the stored procedures. Those IDs are both the primary keys and the keys used in the service's APIs.
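For anyone curious what that dash-stripping plus `UNHEX()` step amounts to, here's the equivalent round trip sketched in Python (the variable names are arbitrary):

```python
import uuid

u = uuid.uuid4()
text = str(u)                     # 36-char hyphenated form, e.g. from an API
stripped = text.replace("-", "")  # 32 hex chars, what UNHEX() expects
raw = bytes.fromhex(stripped)     # the 16-byte binary value, like UNHEX(id)

assert len(raw) == 16
assert raw == u.bytes             # identical to the library's binary form
assert uuid.UUID(bytes=raw) == u  # round-trips losslessly
```

Storing the 16-byte form instead of the 36-character string cuts the key size by more than half, though as noted below, whether that actually helps depends heavily on the engine and schema.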

One of our Ops guys did an experiment where they put a uniqueness constraint on the ID column and added an auto-incrementing primary key column that's never exposed to the code driving the thing. It apparently sped up our DB performance by orders of magnitude.

It also turns out that MySQL would perform faster just by leaving those values as strings instead of converting them to binary values. We've got some outside pressure to use Oracle instead of MySQL, and apparently it performs much better than MySQL with our current schema, so we aren't going to do anything to improve the MySQL performance or change any of this behaviour.

Shameless plug: Anyone bothered about the wasted space in UUID string representation (and using Ruby) can check out https://github.com/sudhirj/shortuuid - it re-encodes your UUID into any alphabet you choose, with a Base62 default (I find that to be a sweet spot that gives both URL safety and efficiency).

Let me know if you want ports in any other languages - the algorithm is really just to treat the UUID as a hexadecimal number (which is what it actually is) and re-encode it into any other alphabet of choice.

That said, always use native UUID types in datastores - they'll convert to bytes / numbers internally and will always be the most efficient. For other situations, remember that they're just numbers, so you can write them in binary, ternary, octal, decimal, hexadecimal, vowels, baseXX or really any other alphabet you want. The bigger your alphabet (as long as the encoding remains efficient, like ASCII under UTF-8), the better your gains will be.
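For non-Rubyists, here's a minimal Python sketch of the same idea (the function names and alphabet ordering are illustrative, not the gem's exact API): treat the UUID as a 128-bit integer and rewrite it in a bigger alphabet.

```python
import string
import uuid

# Base62 alphabet: URL-safe, no padding characters needed
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode(u, alphabet=ALPHABET):
    """Re-encode a UUID's 128-bit integer value into the given alphabet."""
    n, base = u.int, len(alphabet)
    digits = []
    while n > 0:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits)) or alphabet[0]

def decode(s, alphabet=ALPHABET):
    """Inverse: turn the short string back into a UUID."""
    n, base = 0, len(alphabet)
    for ch in s:
        n = n * base + alphabet.index(ch)
    return uuid.UUID(int=n)

u = uuid.uuid4()
short = encode(u)  # ~22 chars in Base62 instead of 36 with hex-and-dashes
assert decode(short) == u
```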

Yep, glad to see this posted. In the python world, this is why we have UUID.int (https://docs.python.org/3/library/uuid.html#uuid.UUID.int), though the native postgres UUID type with uuid-ossp works well too if you need them auto-generated in the DB rather than in application code.
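For illustration, a quick sketch of what `UUID.int` gives you:

```python
import uuid

u = uuid.UUID('70e2e8de-500e-4630-b3cb-166131d35c21')

# .int is the full 128-bit value -- what a native database UUID column
# effectively stores, versus the 36-character hex-and-dashes form.
print(u.int)                      # one big 128-bit integer
print(len(str(u)))                # 36 characters as text
assert uuid.UUID(int=u.int) == u  # lossless round trip
```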

Is there an auto-incrementing version of this? I'm trying to implement a time-series database and can't find a suitable method of generating incremental UUIDs.

You may try ULIDs - "Universally Unique Lexicographically Sortable Identifiers" posted here last year. https://news.ycombinator.com/item?id=12205158

You can generate v1 UUIDs. v4 are random by design. You can also create your own version of UUID, as long as you set the version bits correctly, although I wouldn't recommend it.

doesn't uuid-ossp store the data as binary?

The reason I don't like the internal-int-external-UUID strategy is that all of your queries now require an extra join. It's no longer "select microblog.* where userid = ?" now it's "select microblog.*,user.id from microblog,user where microblog.userid = user.id and user.uuid = ?".

This may be practical from a storage standpoint but string-based indexes on an SSD are pretty damned efficient.

The benefit of an external guid is that it obfuscates counts.

I've seen cases where it's possible for a business to track a competitor's growth by watching a public facing user id count increase.

I do that too.

Ask for an invoice from an HN startup. See the reference number at the top: "2017-0000438".

They had 438 potential customers this year :D

That's why you typically add caching (like cookies) that store the UUID with some user context so it can be provided directly in queries.

This article is so poorly written it's hard to take it seriously. The entire paragraph about the size of a UUID takes three or four readings before you can actually understand what the author means...

In what context would a primary key change, even when sharding? In my entire career I have yet to see it. Also any sane person would never sort random values. If you need sorting in your table, provide some kind of indexed timestamp.

> In what context would a primary key change, even when sharding?

The only time I've seen that happen is when a DBA evangelized for just using 'natural' keys, which in this case were slugs based on the item name. The problem came when users decided they wanted to rename things and then wondered why the IDs they were seeing in URLs didn't match the new names.

I'm not a believer in natural keys anymore.

> If you need sorting in your table, provide some kind of indexed timestamp.

A v4 UUID contains a timestamp. So it seems a bit wasteful to add the separate timestamp column. Why not just use the timestamp built into the ID? This is what MongoDB's internally-generated IDs do. (Though they're not quite UUIDs.)

There's a catch with v4 UUIDs though: the timestamp portion of the ID is not the most significant digits, so a simple ORDER BY doesn't do what you want.

I like the solution developed by Instagram, which Rob Conery discusses at the link below. It uses a pretty simple plpgsql function to generate a 64-bit integer based on the current time, a shard ID, and a per-shard sequence. This lets you generate 1024 IDs per millisecond per shard. It also puts the timestamp portion of the ID first, so if all the shards' data ends up in Hadoop or something you can order consistently across the whole set.
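A rough Python translation of that scheme (the original is a plpgsql function; the epoch constant and class name here are made up for the sketch): high bits carry the millisecond timestamp, then the shard ID, then the per-shard sequence.

```python
import threading
import time

EPOCH_MS = 1_293_840_000_000  # custom epoch (2011-01-01 UTC), an assumption

class ShardedIdGenerator:
    """Instagram-style 64-bit IDs: ~41 bits of ms timestamp,
    13 bits of shard ID, 10 bits of per-shard sequence."""

    def __init__(self, shard_id):
        assert 0 <= shard_id < 2**13
        self.shard_id = shard_id
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now_ms = int(time.time() * 1000) - EPOCH_MS
            self.seq = (self.seq + 1) % 1024
            # Timestamp in the high bits keeps IDs roughly time-ordered.
            return (now_ms << 23) | (self.shard_id << 10) | self.seq

gen = ShardedIdGenerator(shard_id=5)
a, b = gen.next_id(), gen.next_id()
assert a < b  # monotonic within a shard (barring sequence wraparound)
```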


An item name was never suitable as a key... It's not stable. There are natural keys, they are just very rare. And the thing is, even when you think you've got one, it is often better to err conservatively. A good anti-example is SSN, often used in textbooks. They do change. Other than DNA sequences I can't think of a good person key. Just use an internal surrogate and be done with it.

> Other than DNA sequences I can't think of a good person key.

In theory, one could use 3D geoposition at time of birth, time of birth, and sibling order (for twins / triplets / etc delivered surgically), where said order is dictated by the parent(s) or ob/gyn present. Of course, the main problem with this in practice is that not everybody has this information.

I would have said you could do something via retinal imagery, but not everybody has eyes. If we had non-invasive neural imagery, would it maybe be possible to derive a key from a simplification of a person's physical brain topology?

Not to mention all the identical twins out there.

In some countries national ID numbers are good primary keys, but that's only when they're designed (and used internally) as such by governments, e.g. Finnish, French, or Israeli ID numbers. Even then, though, using them limits you to people who are in the system - no dealing with international customers, for example.

At least in Finland, the national ID can change in some cases, e.g. when you change your legal gender or if it's repeatedly misused by someone else (as in an identity theft scenario).

I mixed up uuid versions above. v1 are the timestamp ones, and v4 purely random.

It's a DBA rant.

Unfortunately, the author doesn't really get UUIDs -- the whole point of generating them and using them as a primary key is that you get a unique entry that can be used between systems. That is a magical thing that gives you superpowers.

In a past life, we would keep the customers basic data in the production database forever, but denormalize and export most historical records out of the transaction system to a data warehouse after 18 months of account activity. We also dumped login/audit data to a big flat file for compliance purposes.

The beauty of using the UUID as the key in that scenario is that the data became fungible between the systems. We didn't care about the audit logs, but had to keep them -- if we had to process them, we could do so with text tools (e.g. Perl/Awk) and do whatever querying we needed to do with a subset of data.

Relational databases are shitty and expensive when used as file systems. If you don't need the relational magic, you don't need the cost and operational overhead of a RDBMS. Literally billions of dollars are wasted keeping unused data in databases.

Another wonderful reason to use UUIDs across systems: Consider what happens when one system's database has an accident, gets restored from a backup, and then starts generating numeric IDs which -- from the perspective of the other systems -- are duplicates.

Best case scenario, things break cleanly and you have to fix up your auto-incrementing counters. Worse is when you start getting data corruption because some remote system has its information about "Customer 11492" (who registered yesterday) overwritten by details for "Customer 11492" who registered today.


Also consider that you commonly have systems that you cannot control. What's the internal representation of a user or action in a SaaS app?

Don't know, don't care. I slurp in data, with UUID as an attribute!

> Also any sane person would never sort random values.

I think sorting by an arbitrary, unique value as the last sorting dimension is useful to ensure you always get the same ordering when the first dimensions are equal. For instance, sorting users by (last_name, first_name, user_id) ensures homonymous users don't move around in the list from request to request.

This is quite a sane thing to do, don't you think?
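A quick illustration of the idea in Python (the names and truncated IDs are made up):

```python
# A unique ID as the last sort dimension makes the ordering deterministic:
# rows with identical names keep a fixed relative order on every request.
users = [
    ("Smith", "Alex", "c3d1..."),  # (last_name, first_name, user_id)
    ("Smith", "Alex", "a9f2..."),
    ("Jones", "Pat",  "b471..."),
]
ordered = sorted(users, key=lambda u: (u[0], u[1], u[2]))
# The two "Alex Smith" rows always appear in the same order, regardless
# of how the input list happened to be arranged.
```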

Yes, I do that all the time. Even for tables that have a sequence column to indicate the desired display order, I'll add the primary key on the end of the order by to ensure consistency just on the off chance of two rows having the same seq value.

Have you never needed to merge datasets, that grew separately, in parallel? It is a royal pain when the rows have consecutive sequential primary IDs. Something has to make way for the other.

IMO, this is one thing Mongo got right: keys, by default, are a timestamp + some random string. This way, they still sort nicely, but don't collide with their cousins in a different instance of the DB.

Even if you come across a situation where your primary key has to change, it seems like you could "simply" migrate your initial primary key to the "additional column" concept referred to in the article, and add your new primary key as needed.

I can't imagine that a primary key migration would be easy even if you'd followed the author's suggestions around having an internal key and an external key, and I feel like if you're optimizing for a future primary key migration you might be focusing on the wrong part of your product.

> In what context would a primary key change, even when sharding?

If you're using natural keys, and one of them is displayable text, and then the business says "We decided to rename the 8th sales district. Can you change it for us?"

Was this a good design? Maybe not. But legacy happens.

We're dealing with a vendor right now that has references to look-up tables based on display text. They should have used surrogate keys of some kind. Or at the least, code columns so that the database is mostly unaffected when values shown to the users get renamed. (Also useful for I18N work)

> In what context would a primary key change, even when sharding?

When the model underlying the data is significantly incorrect, and needs to change. I.e., the relation you modeled does not map to the real-world concept like you thought it did. (Or, it did, but no longer does.)

I feel like this most often happens when two concepts end up being modeled by a singular table. You find out later that the table really represents separate, distinct things, and needs splitting. The split off things can — depending on how you do PKs / your migration — get new PKs.

I recently wrote about changing the Primary Key as a way to organize/cluster the data on disk for better query performance (MySQL, unsharded environment). http://techtime.getharvest.com/blog/improving-report-perform...

Some databases will pre-sort a stream of primary keys in order to read a table sequentially.

Lots of talk about performance, but no numbers cited. I did my own benchmarks before using sequential ("COMB") GUIDs as 'PRIMARY KEY' (yes, they're surrogate keys) and found no material performance difference. I didn't keep the results, but someone else has made their numbers public here: https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/gui...

Little rant on UUIDs:

Notice how the author assumes UUID v4[1] in the conversation. There are very few reasons to use the other versions, but we are still paying the price for them in code complexity all the time.

Look at this UUID parsing code: https://github.com/sporkmonger/uuidtools/blob/master/lib/uui...

What it really should be is `[uuid_string.gsub('-', '')].pack('H*')` (for non-rubyists: remove the dashes, decode the hex back to binary).

Their representation is also not that good since hex encoding is not very compact.

I guess what I'm trying to say is that UUIDs are often used as default unique identifiers, but they are actually not that good.

[1]: https://en.wikipedia.org/wiki/Universally_unique_identifier#...

Postgresql has a UUID type which should store them as a 16-byte number. If you use time-based UUIDs – for instance based on the unix time stamp in hex, like CouchDB – then you also get sortable primary keys, which conceptually might or might not be useful to your application, but it probably speeds up indexes. I've done exactly this for two different projects, and it works well.

On top of that you get IDs that are impractical to guess, which, while it wouldn't replace other security measures, would still give you some collision resistance and probably avoid some bugs because of the unlikeliness of accidentally picking the same key for two different entities.

I'm sure there are pathological cases for UUIDs as primary keys in certain scenarios, like perhaps a very high number of small records, but I've not come across them myself. You obviously have to know your own data and database if you have some very specific requirements.

I work on an Entity Framework Core Implementation for MySQL and we recently added sequential GUID generation for primary keys that are of type Guid. The first 8 bytes of the GUID are the current UTC timestamp in ticks and the last 8 bytes are cryptographically random.

One interesting thing we ran into when implementing is that C#'s binary format and string format must be different to be sequential. So we have to detect whether the GUID is stored as a string or binary and put the timestamp in the correct place to ensure it is actually sequential.

Here's the PR for the feature for anyone interested: https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCo...
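A rough Python sketch of the same layout (using nanoseconds rather than .NET ticks, and the standard library's CSPRNG; the function name is illustrative): the timestamp goes in whichever bytes sort first for the storage format in use.

```python
import os
import time
import uuid

def sequential_guid():
    # First 8 bytes: big-endian UTC timestamp (nanoseconds here, rather
    # than .NET ticks). Last 8 bytes: cryptographically random.
    ts = time.time_ns().to_bytes(8, "big")
    return uuid.UUID(bytes=ts + os.urandom(8))

a = sequential_guid()
time.sleep(0.002)  # ensure a later timestamp for the demo
b = sequential_guid()
assert a.bytes < b.bytes  # byte-wise order follows creation order
```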

Datomic [0] uses SQUUIDs [1] (Semi sequential UUIDs) to work around this:

    Many UUID generators produce data that is particularly
    difficult to index, which can cause performance issues 
    creating indexes. To address this, Datomic includes a 
    semi-sequential UUID generator, Peer.squuid. Squuids are 
    valid UUIDs, but unlike purely random UUIDs, they 
    include both a random component and a time component.
[0] http://www.datomic.com/

[1] http://docs.datomic.com/identity.html#sec-6

edit: formatting
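For illustration, a Python sketch of a squuid-style generator (not Datomic's actual implementation): replace the top 32 bits of a random v4 UUID with the current Unix time in seconds.

```python
import time
import uuid

def squuid():
    # The version nibble and variant bits both live in the low 96 bits
    # of a UUID, so overwriting the top 32 bits with a timestamp still
    # leaves a valid v4 UUID -- just one that sorts roughly by time.
    secs = int(time.time()) & 0xFFFFFFFF
    low96 = uuid.uuid4().int & ((1 << 96) - 1)
    return uuid.UUID(int=(secs << 96) | low96)

u = squuid()
assert u.version == 4  # still a valid v4 UUID
```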

> Aside from the 9x cost in size, strings don’t sort as fast as numbers because they rely on collation rules.

Why would you sort these to begin with? What ordering of essentially random values (randomness being part of the point) makes sense?

I believe UUIDv1 is time-based, so you can sort the keys and get the order in which the UUIDs were generated.

For when everything else is equal: you need something that gives you consistent sorting in your output.

For example, if your page size is N and you have N+1 things with the same timestamp, you need a backup sort key or your pagination breaks down

Yep the Seek Method for pagination as suggested there would cause you to be sorting by UUID.

    SELECT uuid FROM t where filterable_field > $whatever and uuid > $lastSeen limit $pageSize
with a covering index on (filterable_field, uuid)

"UUIDs do not reveal information about your data" - this is a false statement; in sensitive environments you need to be aware that some UUID versions can leak MAC addresses, timestamps, hashes of your data, etc. - sometimes just enough to abuse this information.

Yes, UUIDs are guaranteed to be unique, not random. You shouldn't use them for things like Single Sign-on tokens.

That being said, the author does talk about not exposing the UUIDs either. I get the feeling it's slightly less bad if a UUID gets exposed accidentally than an increment sequence number.

UUIDv4 is random. UUIDv3 and v5 are just namespace+name hashes (where the namespace is often the Nil UUID, adding to the problem). Those hash-based versions are, in general, not good candidates for PKs on large resources because they mess up the index, which likes monotonic increments on new values.

The bottom line is UUIDs are not bad, you just need to be aware of couple of things when working with them.

Why should it matter if you can guess IDs? Presumably records are locked in such a way that simply knowing a URL doesn't allow you to bypass security.

> Presumably records are locked in such a way that simply knowing a URL doesn't allow you to bypass security.

In case this doesn't happen: if there is a leak, with sequential IDs you run the risk of leaking the whole table. With UUIDs, the leak can be more limited.

Also, a sequential ID may reveal some information you may not want to share. If someone's user ID is 1234, that may be an indication the service has at least 1234 users.

> Also, a sequential ID may reveal some information you may not want to share. If someone's user ID is 1234, that may be an indication the service has at least 1234 users.

You'd just start at a million or so, +/- some random amount. That's not a new thing. You should be able to spin all of the auto-number types in all the databases to whatever you want it to start at.

Timing based UUIDs (type 1) or ULIDs can leak (surprise!) timing information. If you can see them then you can infer the rate at which some resource is being created. For example the rate at which a service is creating users or business transactions. This can be valuable competitive information you don't want to share with the world.

I was responding to the article saying integers as IDs were bad because you could increment them to guess at the next record.

I mean, maybe, although usually they aren't a perfect sequence. I just have a hard time imagining a scenario where that information is so private you're worried about leaking it.

That depends on how the application is built and how it handles authentication - recently Steam had an issue where it was possible to approve your own apps and bypass the Greenlight process altogether, in part, by guessing that low value IDs belonged to Valve employees[0].


If you're trusting the client's version of the user ID you've got bigger problems.

The "internal int, external UUID" strategy can be simplified if you use encryption and a hypermedia API. It's possible to encrypt the int plus some additional information and format it as a UUID v4 (random). For external users that know the natural keys of some objects, it's then possible to discover the rest of the objects by navigating the API, where the UUIDs are just pseudo-random parts of the URIs.

I was wondering about a similar implementation. I encrypt/decrypt database IDs to make them opaque in the API, for primary entities but not enums etc. I separated the API from Core/Database, so I wasn't too fond of adapting the entities to add a sort-of public ID column; instead I create it at run-time.

I can share my code on Github, it will be interesting to find some commons. Probably, this approach deserves more attention.

Why would you format it as a UUIDv4 instead of just leaving all bits intact?

To take advantage of native UUID support in a database, e.g. PostgreSQL stores them in 128 bits. Also, every UUID carries a version number, so minimal formatting needs to be performed to comply with the standard.

So you're saying Postgres actually wastes cycles verifying the version information and doesn't treat UUID just as a 16 byte opaque structure?

Well, there's the "input format" (or rather two), which gets parsed into the internal representation. The input format can be either hex text, which allows dashes in some positions, or binary, in which it's just a bytearray. There's no verification otherwise, unless you treat generation of UUIDs, which allows to specify the type, as such.

Second point: Verifying version information would hardly matter cyclewise.

I'm not sure what you mean exactly. The textual representation of an UUID HAS all the bits intact. UUID formatting is not lossy compression.

UUID v4 "spec" leaves 122 bits for randomness and has 6 fixed bits for "version" information. I'm suggesting that following the spec on such stuff is a total waste and serves no point, and you're better off with a random 128-bit number.

Well, then it's not an UUID. It's just 128 random bits. If you just want to be the most random possible, you can go 256, 512, 1024, etc. That may be what you want, but you can't say UUID v4 is a "total waste" or "serves no point".

Its purpose is to never collide with another UUID v4, not to be the most unique value ever.

UUIDv4 is a 122 bit random number. Having another 6 random bits won't make them collide.

Also, if there are UUIDs in your system colliding with 128-bit random numbers, then either the RNG is broken or your UUID generator would be great at cracking encryption keys.
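For reference, the 6 fixed bits under discussion are easy to see in practice: the version nibble and the variant bits always land in the same positions of the textual form.

```python
import uuid

# uuid4() is 122 random bits plus 6 fixed bits: the version nibble (0x4)
# and the two variant bits. Every v4 UUID shows this structure.
for _ in range(1000):
    s = str(uuid.uuid4())
    assert s[14] == "4"     # version nibble, always '4'
    assert s[19] in "89ab"  # the two fixed variant bits constrain this digit
```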

Well, that's possible, but I'd say it's generally better to keep close to the spec and v4 is the closest option here.

But what's the value in the "spec" here? I see people mentioning this and I'm wondering what the rationale is.

The rationale is that you can use them along with those other versions of UUID that contain nonrandom data.

Assume Bob uses one such not-entirely-random UUID variety and actually makes use of the embedded information. Charlie uses random UUID. Bob and Charlie merge their data.

If Charlie had filled the version bits with random garbage, the merge would poison Bob's use of the embedded information.

Basically, UUIDs are a discriminated union type.

Thanks, that's the best answer.

By violating the spec you're assuming you'll never want to use a tool that strictly adheres to the spec. Postel's Law works best because it accommodates some people not following it.

PostgreSql has a dedicated UUID column type. Those are fast and the storage difference is insignificant.

One huge benefit of UUID is how you can safely create them while being offline, and then sync them at a later stage without conflicts.

You can still have conflicts though, sometimes more than you think due to the UUID generator used. It's rare but it seems like an edge case that should be handled.

Doesn't invalidate your point about being able to make them offline though. Another option I considered to create PKs offline was a userid/timestamp pair.

Obviously you should be using UUIDv4 with a good secure pseudorandom number generator if you want to generate them offline. And typically no one handles collisions in large pseudorandom values.

I don't agree with many points.

1. Store UUIDs in a uuid field. Why start the article with the trivial observation that a text field is not optimal?

2. Use sequential uuids.

3. Several benchmarks have shown that the performance hit is minimal.

4. The only way to communicate with ids is to copy and paste them. Never try to memorize, talk about them or type them.

Excellent post, write ups like this are the reason I keep coming here.

What about the hi/lo algorithm as a middle ground?


In short, and I hope I don't oversimplify: each "shard" or "cluster" in the database gets a "block" of IDs it can then go and assign on its own; the sequential "atomic" increase happens only once per hi "block", lowering the contention.

This gives you nice integers, incremental-ish most of the time.
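A minimal Python sketch of the hi/lo idea (the `next_hi_fn` stand-in here replaces what would be a database sequence fetch; the class name is made up):

```python
import threading

class HiLoGenerator:
    """Each client reserves a "hi" block from a shared sequence, then
    assigns "lo" values locally -- one shared-sequence round trip per
    block instead of one per ID."""

    def __init__(self, next_hi_fn, block_size=1000):
        self.next_hi_fn = next_hi_fn  # e.g. a DB sequence fetch
        self.block_size = block_size
        self.hi = None
        self.lo = block_size  # forces a block fetch on first use
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            if self.lo >= self.block_size:
                self.hi = self.next_hi_fn()  # contention only here
                self.lo = 0
            self.lo += 1
            return self.hi * self.block_size + self.lo - 1

# Stand-in for a database sequence:
counter = iter(range(1, 10**6))
gen = HiLoGenerator(lambda: next(counter), block_size=100)
ids = [gen.next_id() for _ in range(250)]
assert len(set(ids)) == 250  # unique
assert ids == sorted(ids)    # incremental-ish, as the comment says
```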

I like the notion of integers internally and UUIDs (stored as integers, of course! I would never have saved one as a varchar, I swear! OK, I was a noob... I deserve to be shamed)

Great post all in all!

"Then add a column populated with a UUID (perhaps as a trigger on insert). Within the scope of the database itself, relationships can be managed using the real PKs and FKs." That would mean doing lookups by UUID, which is /really/ bad for performance. UUIDs are evenly distributed, so index caches are rendered nearly useless. With sequential keys, and access patterns that touch mostly new data, all you need to find the row is likely to already be in RAM, no matter how many rows you have. With UUIDs, you'd end up doing random I/O. Might not sound like that big deal to some, but we got a 3x overall throughput increase in one of our apps by switching from UUIDs to sequential ints.

I've found many have made the reverse argument, I assume for write-heavy loads. With incoming new UUIDs likely being on different pages, any locking within the DB is likely to be uncontended with a UUID key. Most people's RDBMS use cases are read heavy anyway, though, so it's usually better to have sequential (clustered) keys.

I've heard that case made for certain write-heavy workloads, too, but I would expect that to be highly dependent on the details of the db engine's implementation and your particular workload. While UUIDs, as you said, would generally be written to different pages on insert, you'd also experience a higher page split rate when UUIDs arrive that need to be written to pages that are already full, which could be expensive. Without knowing the details of your db engine and/or workload really well, I'd expect you'd have to do some testing of your workload to see which one would come out ahead. In a series of tests that I did on an application with a 95% write, 5% read workload, integer IDs were substantially higher performing than GUID IDs.

Not all UUIDs are distributed evenly. This is mentioned in the article.

This was on Oracle, which lacks MS SQL's special handling of UUIDs (which speeds up indexes on UUID columns by using the bytes for the time stamp in UUIDv1/v2 as the high-order bytes - clever hack!). But sure, if you need UUIDs and your database has support for speeding them up, by all means, use that. I'd still avoid them given a choice, though.

Another alternative to avoid guessing is to use randomized 64-bit integer keys. You still risk collisions over sharding/replication, but only if you truly have a lot of data. You potentially lose some index performance but it shouldn't be any worse than with guids. If you really need the full size of a guid, just use them for the key. I don't get the rest of his argument for hiding internal surrogate keys.

I am sorry but that is bad advice.

A GUID has much better collision resistance than a bare random integer.

The GUID includes a random part (like the int), a time component (prevents collisions between IDs generated at different microseconds), and network information (prevents collisions across hosts).

In a time-series datastore, you may have to replace a set of invalid/corrupt events within an index. With IDs that are in some way deterministic from the source data, you are able to replace the invalid documents by ID by simply re-indexing that time period with your patch applied. This is the simplest and least risky solution, with minimal downtime.

If the IDs are UUID, then the easiest way to fix the values is to drop the index and re-create it, making all of the other data in the index unavailable as it's being recreated.

The less-easy way with UUIDs is to select just the broken events, create new patched events, delete the old events, and insert the new ones in the right index. But you'd have to branch off of your regular indexing logic to do this, probably writing a separate script. Of course if you make a mistake, you may end up with either duplicate documents or loss of data, compounding the original problem.

So I agree, have IDs that are deterministic (that they can be recreated using some known formula and source data, for example: documenttype_externalid_timestamp).

Using UUID for external means you've just forced all the problems with UUIDs on your users.

I'm dealing with that from several vendors atm.

> A naive use of a UUID, which might look like 70E2E8DE-500E-4630-B3CB-166131D35C21, would be to treat as a string, e.g. varchar(36) — don’t do that!! “Oh, pshaw”, you say, “no one would ever do such a thing.”

> Think twice — in two cases of very large databases I have inherited at relatively large companies, this was exactly the implementation. Aside from the 9x cost in size, strings don’t sort as fast as numbers because they rely on collation rules.

Eh, I've done that before because it made some interaction with Entity Framework easier (don't recall what now). Hasn't really mattered. The space for storing GUIDs has never been a meaningful constraint for anything I've ever worked on (9x is also nuts and assumes that your database uses 4 bytes per character). Sorting UUIDs is also generally uninteresting since they aren't meaningful by themselves. Maybe if you're doing lots of joins you might care about this.

It bothers me a bit that the author never mentions the right data type to use instead of varchar(36).

I had to google it to learn that there is a data type for GUIDs called "uniqueidentifier" (MS SQL) or "uuid" (PostgreSQL). As for MySQL, there is no such equivalent and the suggestion is to use CHAR(38) instead.

Does that mean MySQL will have to live with the 9x cost in size issue?

binary(16) or two 64-bit integers are also options.
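For example, a sketch of the two-64-bit-integers option in Python:

```python
import uuid

u = uuid.UUID('70e2e8de-500e-4630-b3cb-166131d35c21')

# Split the 128-bit value into two unsigned 64-bit halves -- one way to
# store a UUID in engines without a native 128-bit / uuid type.
hi = u.int >> 64
lo = u.int & ((1 << 64) - 1)
assert hi < 2**64 and lo < 2**64
assert uuid.UUID(int=(hi << 64) | lo) == u  # lossless round trip
```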

Thanks. Good to know that too.

Assuming you're using MS SQL, I thought it had a special GUID data type?

It does, but there was some odd interaction with EF that resulted in us using a varchar(36) instead of the GUID type. I actually don't recall what the issue was at this point, though.

This is just opinion, and it looks like UUIDs are bad for the particular case the author is working on.

We have multiple components on different stacks, and IDs can be generated anywhere in those components. We had to choose between building separate per-table unique-ID infrastructure or using UUIDs. UUID works perfectly, and with PostgreSQL it's just awesome.
