
Be Careful with UUID or GUID as Primary Keys - bkudria
https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
======
bdarnell
One of the post's points is that UUIDs will scatter your writes across the
database, and that for this reason you want a (more or less) sequential key as
your primary key. This crucially depends on both your database technology and
your query patterns.

In a single-node database or even a manually-sharded one, this post's advice
is good (For Friendfeed, we used a variation of the "Integers Internal, UUIDs
External" strategy on sharded mysql: [https://backchannel.org/blog/friendfeed-
schemaless-mysql](https://backchannel.org/blog/friendfeed-schemaless-mysql)).

But in a distributed database like CockroachDB (Disclosure: I'm the co-founder
and CTO of Cockroach Labs) or Google Cloud Spanner, it's usually better to get
the random scattering of a UUID primary key, because that spreads the workload
across all the nodes in the cluster. Sometimes query patterns benefit enough
from an ordered PK to overcome this advantage, but usually it's better to use
randomly-distributed PKs by default.

For CockroachDB, my general recommendation for schema design would be to use
UUIDs as the primary keys of tables that make up the top level of an
interleaved table hierarchy, and SERIAL keys for tables that are interleaved
into another. (Google's recommendations for Spanner are similar:
[https://cloud.google.com/spanner/docs/schema-
design#choosing...](https://cloud.google.com/spanner/docs/schema-
design#choosing_a_primary_key))

~~~
lugg
Thanks for this clarification, it led me to do a little searching.

[https://mariadb.com/kb/en/mariadb/guiduuid-
performance/](https://mariadb.com/kb/en/mariadb/guiduuid-performance/)

Has some helpful tidbits, like how PKs get implicitly copied into all other
indexes, making UUIDs quite expensive memory-wise if you have a lot of
referencing tables and a lot of indexes.

One of the recommendations I'm not sure on is to compress and reorder the uuid
for time sequence.

I get the compression part (remove dashes, convert to bin with unhex), but I'm
concerned that reordering the uuid so the time sections are together (to order
your writes better) might actually decrease entropy and make collisions more
likely.

Does anyone here know much about uuid generation and whether the time sections
reordered would make a difference across a table of around 100m~?

I think I might just go with internal incremental and external uuid. It solves
the knowing-the-key-before-insert problem with few downsides.

The time swap thing is now implemented in mysql8 (broken link in the mariadb
post)

[https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-
functi...](https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-
functions.html#function_uuid-to-bin)
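For what it's worth, the time swap doesn't touch entropy: it's a pure permutation of the same 16 bytes, so collision probability is unchanged. A rough Python sketch of the byte swap that MySQL 8's `UUID_TO_BIN(uuid, 1)` documents for a v1 UUID (this mimics the documented reordering, not MySQL's internals):

```python
import uuid

def uuid_to_bin_swapped(u):
    """Reorder a v1 UUID time-high-first, like MySQL 8's UUID_TO_BIN(u, 1).
    Layout of u.bytes: time_low (4) | time_mid (2) | time_hi (2) | rest (8)."""
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:]

def bin_to_uuid_swapped(b):
    """Inverse swap, like BIN_TO_UUID(b, 1)."""
    return uuid.UUID(bytes=b[4:8] + b[2:4] + b[0:2] + b[8:])

u = uuid.uuid1()
assert bin_to_uuid_swapped(uuid_to_bin_swapped(u)) == u  # lossless round trip
```

Since the swap is reversible, every input UUID still maps to a distinct 16-byte value; only the index locality changes.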

~~~
caleblloyd
The probability of a collision can be estimated using the "Birthday Paradox"
equation:
[http://planetmath.org/approximatingthebirthdayproblem](http://planetmath.org/approximatingthebirthdayproblem)

Say you can tolerate a 1 in 1 Billion probability of collision:

With completely random 128-bit UUIDs, you would need slightly over 800
trillion rows to have a 1 in 1 Billion chance of a collision. Equation:
sqrt(2 * (2^128) * ln(1/(1-1/1000000000)))

Adapting the UUID to contain the first 8 bytes as the number of milliseconds
since the 1970 epoch would mean 64 bits of random data. You would need to
insert over 192,000 rows IN A GIVEN MILLISECOND to have a 1 in 1 Billion
chance of collision. Equation: sqrt(2 * (2^64) * ln(1/(1-1/1000000000)))
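Both figures can be checked directly with the approximation (a quick sketch; `birthday_bound` is just a name for the equation above):

```python
import math

def birthday_bound(bits, p):
    # Rows needed for collision probability p when drawing uniformly
    # at random from a space of 2**bits values (birthday approximation).
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

print(birthday_bound(128, 1e-9))  # ~8.25e14: slightly over 800 trillion
print(birthday_bound(64, 1e-9))   # ~1.92e5: just over 192,000
```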

------
platz
> secondary primary key

This is called a "candidate key" in the existing literature. Much has been
written about such things.

Both UUIDs and auto IDs are "surrogate keys" because they are arbitrary with
respect to the data.

Lastly, "natural keys" are combinations of columns that consist of the
business data.

~~~
squeaky-clean
I've been using natural keys in a project recently without even knowing they
were called that, or even that it was a pattern.

Is there a book or any other reading you would recommend to be more familiar
with these sorts of details? The longest text I could find about this was the
c2 wiki. I'd like to know what else I don't know.

~~~
matwood
Very rarely have I found natural keys that end up working long term. I always
default to a single column surrogate and use natural keys as exceptions.

As to your question, any introduction to RDBMSs book will teach all these
concepts.

~~~
awj
The ultimate problem with natural keys is that their uniqueness is a property
of _business rules_. Unless you live in some magical fairytale land where
business rules are stable, sensible, or both, depending on them for uniqueness
is _asking_ to have problems.

~~~
squeaky-clean
> Unless you live in some magical fairytale land where business rules are
> stable, sensible, or both, depending on them for uniqueness is asking to
> have problems.

Thankfully mine are legally enforced, so it works well :)

~~~
ksenzee
Even that is not a forever guarantee. Laws can and do change.

~~~
dbenhur
And people act outside the bounds of the law. SSN seems like a good natural
key candidate, until you discover some of the people your business interacts
with are illegal aliens.

~~~
int_19h
Or legal aliens, like tourists. Or foreigners, who aren't even in the country.
Or really anyone else who doesn't work in the country, and doesn't have to
report taxes.

Ironically, many illegal immigrants actually do report taxes. In that case
they use an ITIN, which is compatible with the SSN format (###-##-####)... but
it is _not_ unique over time.

------
problems
> Botnets will just keep guessing until they find one.

Why does your security rely on primary key obscurity? This seems like you're
doing something horribly wrong, put some authentication on that or something.

And no, no they won't. Hitting a collision is very hard if you're using
cryptographic strength random UUIDs, you wouldn't even be able to bruteforce
64 bits over the internet in a reasonable timeframe.

Go ahead, try the math on that, the only reason small keys are vulnerable to
local attack is because you can perform an enormous number of attempts per
second, often in thousands of millions of attempts per second and they can
keep at it for as long as they want. The database server won't let you query
anywhere near that fast. You will never get anything like that for network
based attacks as you're limited by bandwidth, latency and of course, the other
side who will notice if you even try to do this for any significant period of
time and likely block your attempts or limit them greatly.

~~~
dsacco
_> Why does your security rely on primary key obscurity? This seems like
you're doing something horribly wrong, put some authentication on that or
something._

You can't put authentication on everything. Some things are inherently
vulnerable to guessing because you can't toss an access control on them (the
thing _is_ the access control), which is what the high bit-lengths and rate-
limiting are for. I wouldn't do this with a customer's UUID because it's
static, but sooner or later you have to rely on security by obscurity
somewhere (for example, password reset tokens).

As a theoretical matter a UUID shouldn't be the sole distinguishing data for
authorized actions, but in practice it's probably best not to expose them in
case you have an errant insecure direct object reference (which approximately
all modern applications do, somewhere or other).

I'm just nitpicking your use of the phrase, "security by obscurity" here. That
terminology is levied as a criticism against a lot of modern security design
but it's really fine (and sometimes necessary) to use it for defense in depth
(unlike, say, a cryptographic algorithm, which does not generally benefit from
security by obscurity, though particulars of the peripheral infrastructure
can).

~~~
vertex-four
You can't put authentication on everything... but you can put an HMAC on
almost everything, e.g. /receipt?id=1&hmac=hmacsha256(1, somesecretkey), which
decouples your enumeration prevention from your ID system.
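A minimal Python sketch of that idea (the key and function names are illustrative, not from any particular framework):

```python
import hmac
import hashlib

SECRET_KEY = b"somesecretkey"  # hypothetical server-side secret

def sign_id(resource_id):
    """Tag for URLs like /receipt?id=1&hmac=..."""
    return hmac.new(SECRET_KEY, str(resource_id).encode(),
                    hashlib.sha256).hexdigest()

def verify_id(resource_id, tag):
    # compare_digest avoids leaking a matching prefix through timing
    return hmac.compare_digest(sign_id(resource_id), tag)

tag = sign_id(1)
assert verify_id(1, tag)
assert not verify_id(2, tag)  # enumerating id=2 fails without the key
```

The IDs themselves can stay small and sequential; only requests carrying a valid tag get through.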

~~~
dsacco
True, but you've just shifted the goalpost a little. Instead of brute-forcing
a random token I'm now brute-forcing a MAC, or a secret key. You don't really
need to use a MAC if you're already implementing rate-limiting and time-based
expiry (in addition to use-based). That's just a performance hit. You might as
well send a token generated by a good PRNG and skip the integrity checks.

It's unrealistic to brute-force one well designed token in the absence of
authentication (if it's long enough you could go without any other security
controls, including expiry or rate-limiting, but that's also a performance hit
at scale). Adding one with an integrity check is effectively rate-limiting
anyway.

Funny story though - I was once on an assessment in which a company used a
scheme just like that. The UUID was HMAC'd for a password reset request, and
UUIDs could be found by messaging another user. Unfortunately, they used the
same secret key for password reset HMACs as the one embedded in their mobile
app for HMAC signing requests. Worse still, password resets didn't need to be
activated, so you could effectively take over any account by retrieving the
HMAC secret key from the app and running the victim's UUID through it.

~~~
greenleafjacob
The "performance hit" is the basis of most modern security [1].

[1]
[https://en.wikipedia.org/wiki/Computational_hardness_assumpt...](https://en.wikipedia.org/wiki/Computational_hardness_assumption)

~~~
dsacco
I'm well aware of the CHA (this is my primary research focus :). But that
doesn't mean you should strive to make things computationally difficult if you
don't need to (for example, Rijndael vs Serpent for AES). After a certain
point there's no need for further computational cost.

------
Pxtl
This is why I'm starting to loathe SQL. The theory is great, but when the
theory meets the practice and everything falls apart, the perfect kernel of
relational beauty turns into a trash fire and I just want to get my freaking
graph of objects out of the database. If I use numbers for keys, I deal with
disaster when I try to merge from disparate sources. If I use guids as keys, I
get terrible performance. Or I can just use a goddamned document store of Json
or XML and have related objects get stored right next to their parents and
tell the beautiful mathematics of relational algebra to shove it.

I'm tired of hearing "you don't have to say how to get the data, you have to
tell the database what you want and it will get that in the most efficient
manner" and then deal with an encyclopedia of byzantine rules to get it to do
the aforementioned "efficient manner" with anything approaching decent
performance. I can see the art, but the practicality mars it beyond
recognition. It's like Venus de Milo sculpted out of duct-tape and bubble gum.

Sorry for the rant, I'm just getting frustrated with performance problems in
_small_ data sets. I've taken the courses, I've read Date and Darwen, and I'm
just starting to get terribly disillusioned.

~~~
3pt14159
I get this feeling from developers from time to time, and occasionally I'll
get to see them on a project where they use a document store for long enough
that they start to appreciate why we started using RDBMSes in the first place.

You trade complex retrieval of subgraphs for complex retrieval of aggregate
measures. Sometimes it is worth it, but I usually find that if you need both
(and that is a big if) it's easier to scale relational databases than it is to
scale document stores.

This is for many reasons, but an underappreciated one is that it is easier to
cache individual objects than it is to cache aggregate measures. So if you
need a subgraph, just store it as JSON in a table full of simple key-value
mappings, with some tables that handle how to invalidate the cache. But
because aggregate measures span so many business objects, it's harder to
achieve this. Furthermore, it is more likely to be _possible_ to enumerate
subgraphs than it is to enumerate aggregations.

For example, imagine a tool that estimates the number of companies that meet
certain criteria. "Count the companies by country that have at least one
employee that earns at least $100k a year and who has had at least 2 previous
positions." The number of permutations is endless, and while these types of
queries aren't trivially fast on RDBMSes, it is at least possible to handle
them.

I'm not saying any of this is really easy, but humans don't live in an easy
world. I used to be a structural engineer (EIT), and what amazed me was that
despite the fact that we've built thousands of skyscrapers, each one was its
own hell. Concrete trucks would go bad, water would gush out of places in the
ground it wasn't supposed to, people would die. But we continue to push
forward because we want to succeed.

------
evadne
I recall reading something about this on the PostgreSQL mailing list; the
message was written in 2015 but may still be relevant:

[https://www.postgresql.org/message-
id/20151222124018.bee10b6...](https://www.postgresql.org/message-
id/20151222124018.bee10b60b3d9b58d7b3a1839@potentialtech.com)

“There's no substance to these claims. Chasing the links around we finally
find this article: [http://www.sqlskills.com/blogs/kimberly/guids-as-primary-
key...](http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-
the-clustering-key/) which makes the reasonable argument that random primary
keys can cause performance robbing fragmentation on clustered indexes.

But Postgres doesn't _have_ clustered indexes, so that article doesn't apply
at all. The other authors appear to have missed this important point.

One could make the argument that the index itself becoming fragmented could
cause some performance degradation, but I've yet to see any convincing
evidence that index fragmentation produces any measurable performance issues
(my own experiments have been inconclusive).”

~~~
sroussey
MySQL's InnoDB has a clustered index. On DB2, SQL Server, Oracle, etc., it is
configurable. PostgreSQL is more like MySQL's MyISAM in that the data and
index are separate. In this case, the random nature only screws up the index
(assuming btree), not the data.

It can be a bigger issue if you proxy SQL queries and shard to 1000 DB servers
and thus rely on part of the query to indicate which shard the row is on. You
can do this with a type of uuid that embeds the shard, or you have to include
additional parts in the query beyond the PK to get the row (it could be a
column that the sharding is based on, or a comment). And yes, be careful with
joins from a root object's table to other tables that use uuids, unless you
can guarantee that they are on the same node.

------
sp332
"Things got really bad in one company where they had decided to use Latin-1
character set. When we converted to UTF-8 several of the compound-key indexes
were not big enough to contain the larger strings."

This shouldn't be right. UTF-8 encoding uses the same 8 bits for each valid
UUID character that Latin-1 would. Unless someone put invalid characters in
the UUID field, I would guess that the new encoding was actually UTF-16 or
something.

~~~
treve
In MySQL for example, using UTF-8 charset will cause it to use 3 or 4 bytes
per character.

All the bytes transmitted on the wire will be UTF-8 strings, but this is
disconnected from the underlying storage engine.

I don't know if this is the case for other databases, but when I read this my
first thought was "This must have been MySQL"

~~~
sillysaurus3
W... What? Why?

The whole point of UTF-8 is that it's identical to Latin-1 (edit: ASCII)
except in cases where it can't be. In those cases, it uses a very efficient
encoding.

I understand that people (for whatever reasons) don't want to store text in
UTF-8, but that's incredible that such a popular piece of software would take
that notion to the extreme. "Let's make UTF-8 almost equivalent to UTF-32 when
the data is at rest! Yes!"

Are you sure that info is correct? And if it's correct, are there any valid
engineering reasons for that decision? It's probably best to be charitable and
assume there must be a reason.

~~~
geofft
I think this is correct for CHAR and incorrect for VARCHAR.

A CHAR column always allocates the maximum size, so a utf8 (MySQL's weird
"UTF-8, but only for the BMP" encoding) CHAR(100) needs to reserve 300 bytes
per row, and a utf8mb4 (MySQL's actual UTF-8 encoding) CHAR(100) needs to
reserve 400 bytes per row.

But a VARCHAR is dynamically allocated, so, yes, a VARCHAR(100) that stores
100 lower-ASCII characters is only going to use 102 bytes of storage (using 2
bytes for the string length).

See, e.g., the "tip" section at the end of
[https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-
utf8...](https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html)

~~~
ivanhoe
The author mentions compound keys, so the issue was probably with the index
key prefix length, which is limited to 767 bytes by default. That limits you
to indexing only the first 255 characters of string data with utf8, but with
latin1 (one byte per character) you can have a 767-character prefix. So with
latin1 it's perfectly OK to have a compound index on two varchar(255)
columns, but you can't do that with utf8. Converting from latin1 to utf8 will
break all those indexes.

~~~
morgo
The 767-byte limit is with the default row format. In 5.7 the row format
changes and allows 3000+.

------
foolfoolz
sounds like the author thinks "uuids are a pain" and wants the benefits of
them but with a smaller representation. but doesn't provide any reasons why
uuids are a pain other than not being able to remember them or say them out
loud. these are not things anyone does with primary keys!

you'll never say this out loud : 7383929. you may be able to remember it,
maybe. in a uuid you'll match the last few and first few letters just as fast
in your head

uuids are fine. sorting is an issue but at scale (the entire point of this
article) how often do you need to sort your entire space of objects by primary
key? you'll have another column to sort on

hiding primary keys and having 2 keys seems like a great way to make all
queries and debugging 2x as complicated

~~~
trothamel
You'll never say a phone number like 738-3929 out loud? Short identifiers are
useful in a lot of places. I work in IOT, and it's nice to have a short device
id when we need to troubleshoot something.

~~~
edejong
Where I come from, our tables go towards a billion rows, so there's copying of
keys regardless of whether it's uuid or sequential id.

------
drawkbox
Maintaining UUIDs is much easier than maintaining id/int lookups that may be
autonumbered (mssql, mysql, pg) or sequenced (oracle), even if using them
internally and UUIDs externally. This especially comes into play when syncing
across dev, staging and production environments and when clustering and
servicing out parts of your app.

The moment any db starts to grow into these areas, UUIDs lead to far fewer
issues than incrementing ids every time.

Most RDBMS now have optimizations and native types (uniqueid) for UUIDs/GUIDs
and this is really a moot point at this point, most UUIDs are no longer
strings in DBs unless legacy from the time before native UUID types.

UUIDs are right for most projects but not all and as typical in any system,
the environment and needs of your project will dictate whether it makes sense
to use them.

UUIDs eliminating the round trip and the need to deal with
autonumbering/sequencing is a massive benefit. The only real con of UUIDs is
the extra 8 bytes, but they make up for it with less need for lookups at
runtime when creating new data or associating data with them.

~~~
bborud
It is like that line from Jurassic Park 2...

(The problem with UUID is that the first two letters are a cruel joke. But
hey: people need to make their own mistakes, because some things you can't
just tell people: some rakes they need to step on until they learn or are
unable to reproduce)

~~~
drawkbox
Universally Unique: if there is ever a conflict you can easily use a
constraint or check upon creation on the app/server/service side, no different
than checking whether an int id or slug/public code is already used. No
roundtrip is needed for every new item, and if there ever were a conflict you
would just get an api error stating as much. You might also want it to email
you to play the lotto that week, because it is virtually impossible.

Try syncing autonumbering across different database types where some use
autonumbering and others use sequences. That is a nightmare scenario but there
would be zero issues with UUID keys.

Primary keys as ints/longs will be constant rake stepping as you grow and
maintain an app.

UUIDs emerged from a need that int keys weren't filling especially in a
distributed/serviced world. I have used them in most DBs for over a decade and
no rake steppage.

Side note off topic: If you were a piece of data would you want to be a
universally unique piece of data or an int and merely a clone? All my data is
a unique piece of data, I am a good Bender.

~~~
bborud
I've worked on quite a few distributed systems where you need to generate
unique IDs, and I can tell you that UUID is not a solution in quite a few of
them, because you actually need to understand what problem you are solving
rather than taking the promises of others for granted or shrugging and
gambling on things sorting themselves out.

In many cases you have two problems: network partitions and/or high ID
assignment rates. So a statement like "if there ever was a conflict you'd have
an API stating as much" is simply passing the buck to some piece of magic and
actively not knowing what goes on. It is a bullshit statement.

In my experience, this is the kind of problem where people tend to fail
because they don't grasp that it is actually kinda hard. Especially if you
have additional constraints (like the size of IDs). People tend to make all
manner of assumptions and then not test those assumptions.

The arguments are familiar though. I hear a re-run of them about every 2
years, when someone has a system in production and the thing starts to
generate duplicate IDs. Although both code and notes are available on previous
solutions within the company, people tend to ignore that and just plow ahead.

------
MithrilTuxedo
Can confirm: using MySQL and for reasons... everything in the DB gets a
primary key set by taking a random UUID, stripping the dashes, and then doing
an `UNHEX(id)` in the stored procedures. Those IDs are both the primary keys
and the keys used in the service's APIs.

One of our Ops guys did an experiment where they put a uniqueness constraint
on the ID column and added an auto-incrementing primary key column that's
never exposed to the code driving the thing. It apparently sped up our DB
performance by orders of magnitude.

It also turns out that MySQL would perform faster just by leaving those values
as strings instead of converting them to binary values. We've got some outside
pressure to use Oracle instead of MySQL, and apparently it performs much
better than MySQL with our current schema so we apparently aren't going to do
anything to improve the MySQL performance or change any of this behaviour.

------
sudhirj
Shameless plug: Anyone bothered by the wasted space in the UUID string
representation (and using Ruby) can check out
[https://github.com/sudhirj/shortuuid](https://github.com/sudhirj/shortuuid)
- it re-encodes your UUID into any alphabet you choose, with a Base62 default
(I find that to be a sweet spot that gives both URL safety and efficiency).

Let me know if you want ports in any other languages - the algorithm is
really just to treat the UUID as a hexadecimal number (that's actually what it
is) and re-encode it into any other alphabet of choice.

That said, always use native UUID types in datastores - they'll convert to
bytes / numbers internally and will always be the most efficient. For other
situations, remember that they're just numbers, so you can write them in
binary, ternary, octal, decimal, hexadecimal, vowels, baseXX or really any
other alphabet you want. The bigger your alphabet (as long encoding remains
efficient, like ASCII under UTF-8), the better your gains will be.
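The re-encoding is straightforward to sketch in Python as plain base conversion on `UUID.int` (this follows the idea described above, not shortuuid's exact code):

```python
import uuid

# Base62: digits, uppercase, lowercase - URL-safe, no padding needed
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def encode(u, alphabet=ALPHABET):
    n, base, out = u.int, len(alphabet), []
    while n:
        n, r = divmod(n, base)
        out.append(alphabet[r])
    return "".join(reversed(out)) or alphabet[0]

def decode(s, alphabet=ALPHABET):
    n = 0
    for ch in s:
        n = n * len(alphabet) + alphabet.index(ch)
    return uuid.UUID(int=n)

u = uuid.uuid4()
assert decode(encode(u)) == u
assert len(encode(u)) <= 22  # vs 36 characters for the dashed hex form
```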

------
makmanalp
Yep, glad to see this posted. In the python world, this is why we have
UUID.int
([https://docs.python.org/3/library/uuid.html#uuid.UUID.int](https://docs.python.org/3/library/uuid.html#uuid.UUID.int)),
though the native postgres UUID type with uuid-ossp works well too if you need
them auto-generated in the DB rather than in application code.

~~~
aphextron
Is there an auto-incrementing version of this? I'm trying to implement a time-
series database and can't find a suitable method of generating incremental
UUIDs.

~~~
imhoguy
You may try ULIDs - "Universally Unique Lexicographically Sortable
Identifiers" posted here last year.
[https://news.ycombinator.com/item?id=12205158](https://news.ycombinator.com/item?id=12205158)

------
dimgl
This article is so poorly written it's hard to take it seriously. The entire
paragraph about the size of a UUID takes three or four readings before you can
actually understand what the author means...

In what context would a primary key change, even when sharding? In my entire
career I have yet to see it. Also any sane person would never sort random
values. If you need sorting in your table, provide some kind of indexed
timestamp.

~~~
bjt
> In what context would a primary key change, even when sharding?

The only time I've seen that happen is when a DBA evangelized for just using
'natural' keys, which in this case were slugs based on the item name. The
problem came when users decided they wanted to rename things and then wondered
why the IDs they were seeing in URLs didn't match the new names.

I'm not a believer in natural keys anymore.

> If you need sorting in your table, provide some kind of indexed timestamp.

A v1 UUID contains a timestamp, so it seems a bit wasteful to add a separate
timestamp column. Why not just use the timestamp built into the ID? This is
what MongoDB's internally-generated IDs do. (Though they're not quite UUIDs.)

There's a catch with v1 UUIDs though: the timestamp portion of the ID is not
in the most significant digits, so a simple ORDER BY doesn't do what you want.

I like the solution developed by Instagram, which Rob Conery discusses at the
link below. It uses a pretty simple plpgsql function to generate a 64 bit
integer based on the current time, a shard ID, and a per-shard sequence. This
lets you generate 1024 IDs per millisecond per shard. It also puts the timestamp
portion of the ID first, so if all the shards' data ends up in Hadoop or
something you can order consistently across the whole set.

[http://rob.conery.io/2014/05/29/a-better-id-generator-for-
po...](http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/)
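A sketch of that generator in Python, using the layout from the linked post (41 bits of milliseconds since a custom epoch, 13 bits of shard ID, 10 bits of a per-shard sequence; the epoch constant comes from the post):

```python
import time

EPOCH_MS = 1314220021721  # custom epoch used in the linked post

def next_id(shard_id, seq, now_ms=None):
    """Pack time | shard | sequence into a time-ordered 64-bit integer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    millis = now_ms - EPOCH_MS
    return (millis << 23) | ((shard_id & 0x1FFF) << 10) | (seq & 0x3FF)

# IDs from a later millisecond always sort after earlier ones
assert next_id(5, 0, EPOCH_MS + 1) < next_id(5, 1023, EPOCH_MS + 2)
```

In the original scheme `seq` comes from a per-shard database sequence taken mod 1024, which is what caps it at 1024 IDs per millisecond per shard.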

~~~
kpxxx3
Item name was never suitable as a key... it's not stable. There are natural
keys; they are just very rare. And the thing is, even when you think you've
got one, it is often better to err conservatively. A good anti-example is SSN,
often used in textbooks. They do change. Other than DNA sequences I can't
think of a good person key. Just use an internal surrogate and be done with
it.

~~~
thinkloop
Even DNA changes!

[http://www.sciencemuseum.org.uk/whoami/findoutmore/yourgenes...](http://www.sciencemuseum.org.uk/whoami/findoutmore/yourgenes/wheredidyourgenescomefrom/whydopeoplevary/doesyourdnachange)

~~~
Turing_Machine
Not to mention all the identical twins out there.

------
rikkus
Lots of talk about performance, but no numbers cited. I did my own benchmarks
before using sequential ("COMB") GUIDs as 'PRIMARY KEY' (yes, they're
surrogate keys) and found no material performance difference. I didn't keep
the results, but someone else has made their numbers public here:
[https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/gui...](https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/guid-
vs-int-debate/)

------
zimbatm
Little rant on UUIDs:

Notice how the author assumes UUID v4[1] throughout the conversation. There
are very few reasons to use the other versions, but we are still paying their
price in code complexity all the time.

Look at this UUID parsing code:
[https://github.com/sporkmonger/uuidtools/blob/master/lib/uui...](https://github.com/sporkmonger/uuidtools/blob/master/lib/uuidtools.rb#L154-L172)

What it really should be is `[uuid_string.gsub('-', '')].pack('H*')` (for non-
rubyists: remove the dashes, decode the hex back to binary).

Their representation is also not that good since hex encoding is not very
compact.

I guess what I'm trying to say is that UUIDs are often used as default
unique identifiers, but they are actually not that good.

[1]:
[https://en.wikipedia.org/wiki/Universally_unique_identifier#...](https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_.28random.29)

------
wvh
PostgreSQL has a UUID type which should store them as a 16-byte number. If you
use time-based UUIDs – for instance based on the Unix timestamp in hex, like
CouchDB – then you also get sortable primary keys, which conceptually might or
might not be useful to your application, but it probably speeds up indexes.
I've done exactly this for two different projects, and it works well.

On top of that you get IDs that are impractical to guess, which, while it
wouldn't replace other security measures, would still give you some collision
resistance and probably avoid some bugs, given the unlikeliness of
accidentally picking the same key for two different entities.

I'm sure there are pathological cases for UUIDs as primary keys in certain
scenarios, like perhaps a very high number of small records, but I've not come
across them myself. You obviously have to know your own data and database if
you have some very specific requirements.

------
caleblloyd
I work on an Entity Framework Core Implementation for MySQL and we recently
added sequential GUID generation for primary keys that are of type Guid. The
first 8 bytes of the GUID are the current UTC timestamp in ticks and the last
8 bytes are cryptographically random.

One interesting thing we ran into when implementing is that C#'s binary format
and string format must be different to be sequential. So we have to detect
whether the GUID is stored as a string or binary and put the timestamp in the
correct place to ensure it is actually sequential.

Here's the PR for the feature for anyone interested:
[https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCo...](https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCore.MySql/pull/260/files)
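The layout the comment describes can be mimicked in Python like so (a sketch of the scheme, not the Pomelo code; the tick constant is the standard .NET tick count at 1970-01-01):

```python
import os
import struct
import time
import uuid

UNIX_EPOCH_TICKS = 621_355_968_000_000_000  # .NET ticks (100 ns) at 1970-01-01

def sequential_guid(now_ticks=None):
    """First 8 bytes: UTC timestamp in ticks (big-endian, so the binary
    form sorts by time); last 8 bytes: cryptographically random."""
    if now_ticks is None:
        now_ticks = UNIX_EPOCH_TICKS + int(time.time() * 10_000_000)
    return uuid.UUID(bytes=struct.pack(">Q", now_ticks) + os.urandom(8))

g1 = sequential_guid(UNIX_EPOCH_TICKS)
g2 = sequential_guid(UNIX_EPOCH_TICKS + 1)
assert g1.bytes < g2.bytes  # binary order follows the timestamp
```

The binary-vs-string ordering issue from the comment shows up here too: the bytes sort by time, but the canonical string form of a `uuid.UUID` groups bytes differently, so sequential-by-bytes is not automatically sequential-by-string.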

------
mark242
The reason I don't like the internal-int-external-UUID strategy is that _all_
of your queries now require an extra join. It's no longer "select microblog.*
where userid = ?" now it's "select microblog.*,user.id from microblog,user
where microblog.userid = user.id and user.uuid = ?".

This may be practical from a storage standpoint but string-based indexes on an
SSD are pretty damned efficient.

~~~
_kyran
The benefit of an external guid is that it obfuscates counts.

I've seen cases where it's possible for a business to track a competitor's
growth by watching a public facing user id count increase.

~~~
user5994461
I do that too.

Ask for an invoice from an HN startup. See the reference number at the top:
"2017-0000438".

They had 438 potential customers this year :D

------
vkrm
Datomic [0] uses SQUUIDs [1] (Semi sequential UUIDs) to work around this:

    
    
        Many UUID generators produce data that is particularly
        difficult to index, which can cause performance issues 
        creating indexes. To address this, Datomic includes a 
        semi-sequential UUID generator, Peer.squuid. Squuids are 
        valid UUIDs, but unlike purely random UUIDs, they 
        include both a random component and a time component.
    

[0] [http://www.datomic.com/](http://www.datomic.com/)

[1]
[http://docs.datomic.com/identity.html#sec-6](http://docs.datomic.com/identity.html#sec-6)

edit: formatting
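The same idea can be sketched outside Datomic. This is an illustrative approximation, not Datomic's exact Peer.squuid algorithm: overwrite the high 32 bits of a random v4 UUID with the current Unix time in seconds, so values generated close together land close together in an index:

```python
import time
import uuid

def squuid() -> uuid.UUID:
    # Start from a fully random v4 UUID, then stamp the current Unix
    # seconds into the leading 4 bytes. The version/variant bits live in
    # bytes 6 and 8, so the result is still a valid v4 UUID.
    b = bytearray(uuid.uuid4().bytes)
    b[0:4] = int(time.time()).to_bytes(4, "big")
    return uuid.UUID(bytes=bytes(b))
```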

------
michaelcampbell
> Aside from the 9x cost in size, strings don’t sort as fast as numbers
> because they rely on collation rules.

Why would you sort these to begin with; what ordering of essentially
randomness (part of the point) makes sense?

~~~
fooey
For when everything else is equal: you need something that lets you get
consistent sorting in your output.

For example, if your page size is N and you have N+1 things with the same
timestamp, you need a backup sort key or your pagination breaks down

~~~
ahachete
You should really read this:
[https://www.slideshare.net/MarkusWinand/p2d2-pagination-
done...](https://www.slideshare.net/MarkusWinand/p2d2-pagination-done-the-
postgresql-way)

~~~
whateveracct
Yep, the Seek Method for pagination suggested there would have you sorting by
UUID as a tie-breaker:

    
    
        SELECT uuid FROM t
        WHERE (filterable_field, uuid) > ($whatever, $lastSeen)
        ORDER BY filterable_field, uuid
        LIMIT $pageSize
    

with a covering index on (filterable_field, uuid)

------
mirekrusin
"UUIDs do not reveal information about your data" \- this is a false
statement; in sensitive environments you need to be aware that some UUID
versions can leak MAC addresses, timestamps, hashes of your data, etc. -
sometimes just enough to abuse this information.

~~~
djsumdog
Yes, UUIDs are guaranteed to be unique, not random. You shouldn't use them for
things like Single Sign-on tokens.

That being said, the author does talk about not exposing the UUIDs either. I
get the feeling it's slightly less bad if a UUID gets exposed accidentally
than an incrementing sequence number.

~~~
mirekrusin
UUIDv4 is random. UUIDv3 and v5 are just namespace+name hashes (where the
namespace is often the Nil UUID, adding to the problem). Those hashes are, in
general, not good candidates for PKs on large resources because they mess up
the index, which likes monotonic increments on new values.

The bottom line is UUIDs are not bad, you just need to be aware of a couple of
things when working with them.

------
emodendroket
Why should it matter if you can guess IDs? Presumably records are locked in
such a way that simply knowing a URL doesn't allow you to bypass security.

~~~
peterjlee
> Presumably records are locked in such a way that simply knowing a URL
doesn't allow you to bypass security.

In case this doesn't happen: if there is a leak, with sequential IDs you run
the risk of leaking the whole table. With UUIDs, the leak can be more limited.

Also, a sequential ID may reveal information you may not want to share. If
someone's user ID is 1234, that may be an indication the service has at least
1234 users.

~~~
programd
Timing based UUIDs (type 1) or ULIDs can leak (surprise!) timing information.
If you can see them then you can infer the rate at which some resource is
being created. For example the rate at which a service is creating users or
business transactions. This can be valuable competitive information you don't
want to share with the world.

~~~
emodendroket
I was responding to the article saying integers as IDs were bad because you
could increment them to guess at the next record.

------
ivan_gammel
The "internal int, external UUID" strategy can be simplified if you use
encryption and a hypermedia API. It's possible to encrypt the int plus some
additional information and format the result as a UUIDv4 (random). For
external users who know natural keys of some objects, it's then possible to
discover the rest of the objects by navigation via the API, where UUIDs are
just pseudo-random parts of the URIs.

~~~
MichaelGG
Why would you format it as a UUIDv4 instead of just leaving all bits intact?

~~~
imhoguy
To take advantage of native UUID support in a database, e.g. PostgreSQL stores
them in 128 bits. Also, every UUID carries a version number, so minimal
formatting needs to be performed to comply with the standard.
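The "minimal formatting" amounts to stamping the RFC 4122 version and variant bits onto the payload. A hedged sketch (the function name and 16-byte-payload interface are assumptions for illustration; note that 6 payload bits are sacrificed to the stamped fields):

```python
import uuid

def as_uuid4(payload: bytes) -> uuid.UUID:
    # Force the version nibble (byte 6) to 4 and the variant bits
    # (byte 8) to the RFC 4122 pattern, so a native uuid column accepts
    # the value as a well-formed, random-looking UUID.
    b = bytearray(payload)
    b[6] = (b[6] & 0x0F) | 0x40
    b[8] = (b[8] & 0x3F) | 0x80
    return uuid.UUID(bytes=bytes(b))
```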

~~~
MichaelGG
So you're saying Postgres actually wastes cycles verifying the version
information and doesn't treat UUID just as a 16 byte opaque structure?

~~~
anarazel
Well, there's the "input format" (or rather two), which gets parsed into the
internal representation. The input format can be either hex text, which allows
dashes in some positions, or binary, in which case it's just a byte array.
There's no verification otherwise, unless you count UUID generation, which
lets you specify the version.

------
harel
PostgreSQL has a dedicated UUID column type. It's fast, and the storage
difference is insignificant.

------
d0m
One huge benefit of UUID is how you can safely create them while being
offline, and then sync them at a later stage without conflicts.

~~~
lacampbell
You _can_ still have conflicts though, sometimes more than you think due to
the UUID generator used. It's rare but it seems like an edge case that should
be handled.

Doesn't invalidate your point about being able to make them offline though.
Another option I considered to create PKs offline was a userid/timestamp pair.

~~~
floatboth
Obviously you should be using UUIDv4 with a good secure pseudorandom number
generator if you want to generate them offline. And typically no one handles
collisions in large pseudorandom values.

------
JTenerife
I don't agree with many points.

1\. Store UUIDs in a UUID field. Why start the article with such a trivial
finding as "a text field is not optimal"?

2\. Use sequential UUIDs.

3\. Several benchmarks have shown that the performance hit is minimal.

4\. The only way to communicate IDs is to copy and paste them. Never try
to memorize them, talk about them, or type them.

------
eranation
Excellent post, write ups like this are the reason I keep coming here.

What about the hi/lo algorithm as a middle ground?

[https://vladmihalcea.com/2014/06/23/the-hilo-
algorithm/](https://vladmihalcea.com/2014/06/23/the-hilo-algorithm/)

In short, and I hope I don't oversimplify: each "shard" or "cluster" in the
database gets a "block" of ids it can then assign on its own; the sequential
"atomic" increase happens only once per hi "block", lowering the contention.

This gives you nice integers, incremental-ish most of the time.
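As a rough illustration of the mechanism (an in-memory sketch; in practice the hi value would come from a database sequence or a row updated in a transaction, and the names here are invented):

```python
import itertools
import threading

class HiLoGenerator:
    # Each "hi" value reserved from the shared sequence grants a block of
    # block_size ids; only fetching a new "hi" touches shared state.
    def __init__(self, hi_sequence, block_size=1000):
        self._hi_sequence = hi_sequence  # stands in for a DB sequence
        self._block_size = block_size
        self._lock = threading.Lock()
        self._next = 0
        self._limit = 0

    def next_id(self) -> int:
        with self._lock:
            if self._next >= self._limit:
                hi = next(self._hi_sequence)
                self._next = hi * self._block_size
                self._limit = self._next + self._block_size
            self._next += 1
            return self._next - 1

# A shared counter stands in for the database-side "hi" sequence.
gen = HiLoGenerator(itertools.count(1), block_size=100)
```

Each client holding its own generator hands out 100 ids per round trip to the shared sequence, which is where the contention saving comes from.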

I like the notion of integers internally and UUIDs externally (stored as
integers of course! I would never have saved one as a varchar, I swear! OK, I
was a noob... I deserve to be shamed).

Great post all in all!

------
mreftel
"Then add a column populated with a UUID (perhaps as a trigger on insert).
Within the scope of the database itself, relationships can be managed using
the real PKs and FKs." That would mean doing lookups by UUID, which is
/really/ bad for performance. UUIDs are evenly distributed, so index caches
are rendered nearly useless. With sequential keys, and access patterns that
touch mostly new data, all you need to find the row is likely to already be in
RAM, no matter how many rows you have. With UUIDs, you'd end up doing random
I/O. It might not sound like that big a deal to some, but we got a 3x overall
throughput increase in one of our apps by switching from UUIDs to sequential
ints.

~~~
firebird84
I've found many have made the reverse argument, I assume for write-heavy
loads. With incoming new UUIDs likely being on different pages, any locking
within the DB is likely to be uncontended with a UUID key. Most people's RDBMS
use cases are read heavy anyway, though, so it's usually better to have
sequential (clustered) keys.

~~~
dhd415
I've heard that case made for certain write-heavy workloads, too, but I would
expect that to be highly dependent on the details of the db engine's
implementation and your particular workload. While UUIDs, as you said, would
generally be written to different pages on insert, you'd also experience a
higher page split rate when UUIDs arrive that need to be written to pages that
are already full which could be expensive. Without knowing the details of your
db engine and/or workload really well, I'd expect you'd have to do some
testing of your workload to see which one would come out ahead. In a series of
tests that I did on an application with a 95% write, 5% read workload, integer
IDs were substantially higher performing than GUID IDs.

------
flatline
Another alternative to avoid guessing is to use randomized 64-bit integer
keys. You still risk collisions over sharding/replication, but only if you
truly have a lot of data. You potentially lose some index performance but it
shouldn't be any worse than with guids. If you really need the full size of a
guid, just use them for the key. I don't get the rest of his argument for
hiding internal surrogate keys.

~~~
user5994461
I am sorry but that is bad advice.

A GUID has much better collision resistance than a bare random integer.

A version-1 GUID includes a random-ish part (like the int), a time component
(preventing collisions between values generated at different microseconds) and
network information (preventing collisions across hosts).

------
krisdol
In a time-series datastore, you may have to replace a set of invalid/corrupt
events within an index. With IDs that are in some way deterministic from the
source data, you can replace the invalid documents by ID simply by re-indexing
that time period with your patch applied. This is the simplest and least risky
solution, with minimal downtime.

If the IDs are UUID, then the easiest way to fix the values is to drop the
index and re-create it, making all of the other data in the index unavailable
as it's being recreated.

The less-easy way with UUIDs is to select just the broken events, create new
patched events, delete the old events, and insert the new ones in the right
index. But you'd have to branch off of your regular indexing logic to do this,
probably writing a separate script. Of course if you make a mistake, you may
end up with either duplicate documents or loss of data, compounding the
original problem.

So I agree: have IDs that are deterministic (meaning they can be recreated
using some known formula and the source data, for example:
documenttype_externalid_timestamp).
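That formula is trivial to express in code; a sketch (the names are illustrative):

```python
def deterministic_id(doc_type: str, external_id: str, timestamp: int) -> str:
    # The id can be rebuilt from the source data alone, so re-indexing a
    # patched time window regenerates the same ids and overwrites the
    # broken documents in place.
    return f"{doc_type}_{external_id}_{timestamp}"
```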

------
njharman
Using UUIDs externally means you've just forced all the problems with UUIDs
on your users.

I'm dealing with that from several vendors atm.

------
dpark
> _A naive use of a UUID, which might look like
> 70E2E8DE-500E-4630-B3CB-166131D35C21, would be to treat as a string, e.g.
> varchar(36) — don’t do that!! “Oh, pshaw”, you say, “no one would ever do
> such a thing.”_

> _Think twice — in two cases of very large databases I have inherited at
> relatively large companies, this was exactly the implementation. Aside from
> the 9x cost in size, strings don’t sort as fast as numbers because they rely
> on collation rules._

Eh, I've done that before because it made some interaction with Entity
Framework easier (don't recall what now). Hasn't really mattered. The space
for storing GUIDs has never been a meaningful constraint for anything I've
ever worked on (9x is also nuts and assumes that your database uses 4 bytes
per character). Sorting UUIDs is also generally uninteresting since they
aren't meaningful by themselves. Maybe if you're doing lots of joins you might
care about this.

~~~
max23_
It bothers me a bit why the author never mentioned what is the right data type
to use instead of varchar(36).

I had to google it up to know there is a data type for GUID called
"uniqueidentifier" (MS SQL) or "uuid" (PostgreSQL). As for MySQL, there is no
such equivalent and suggestion is to use CHAR(38) instead.

Does that mean MySQL will have to live with the 9x cost in size issue?

~~~
dpark
binary(16) or two 64-bit integers are also options.
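For example, splitting a UUID into two unsigned 64-bit halves and back is a simple round trip (a sketch; a signed BIGINT column would additionally need an offset or reinterpretation of the high bit):

```python
import uuid

def uuid_to_int_pair(u):
    # Big-endian split of the 128-bit value into two unsigned 64-bit ints.
    return int.from_bytes(u.bytes[:8], "big"), int.from_bytes(u.bytes[8:], "big")

def int_pair_to_uuid(hi, lo):
    return uuid.UUID(bytes=hi.to_bytes(8, "big") + lo.to_bytes(8, "big"))
```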

~~~
max23_
Thanks. Good to know that too.

------
paragarora
This is just an opinion; it looks like UUIDs are bad for the particular case
the author is working on.

We have multiple components over different stacks, and an id could be
generated anywhere in the components. We had to choose between building
separate per-table unique-id infrastructure or UUIDs. UUIDs work perfectly,
and with PostgreSQL it's just awesome.

------
stollercyrus
I found this post super helpful. For anyone doing rails development, I wrote a
gem to make this really simple. I'd love feedback.

[https://github.com/cyrusstoller/public_primary_key](https://github.com/cyrusstoller/public_primary_key)

------
masklinn
> Another problem is fragmentation — because UUIDs are random

UUID-4 is random; UUID-3 and UUID-5 are hash-based, so effectively unordered as well.

UUID-1 is time-based with the time leading, and you can often control the
_sequence_ (14 bits) and _nodeid_ (48 bits) fields to be used as whatever you
want to avoid collisions.
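Python's standard library exposes exactly those knobs, for what it's worth:

```python
import uuid

# uuid1 lets the caller pin the 48-bit node and 14-bit clock_seq fields
# instead of taking the MAC address and a random sequence.
u = uuid.uuid1(node=0x123456789ABC, clock_seq=0x155)
```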

~~~
jonas21
With UUIDv1, the low bits of the time come _before_ the high bits, so they
still don't sort in time order.

~~~
Someone
That depends on how your database sorts UUIDs. SQL server, for example,
doesn't sort them in the obvious way
([https://blogs.msdn.microsoft.com/sqlprogrammability/2006/11/...](https://blogs.msdn.microsoft.com/sqlprogrammability/2006/11/06/how-
are-guids-compared-in-sql-server-2005/))

It even is different from how C# orders _System.GUID_ instances. I think that
could be something that can be put to good use if there ever is an
_underhanded C# contest_ mimicking the underhanded C contest
([http://www.underhanded-c.org/](http://www.underhanded-c.org/))

------
tsechin
At a previous company, we got burned using UUIDs as MySQL PKs. Turns out MySQL
keeps data on disk sorted in PK order, so even a moderate INSERT workload
would lead to lots of IO and disk thrashing as pages kept needing to be
rewritten.

Fun times...

------
einrealist
My advice (and daily practice): If IDs are exposed, expose them as strings. If
that ID is a compound key of a database, serialize it into a single string. If
the ID is exposed via a web service, use URIs. In an entity provided by a
web service (e.g. a JSON-LD document via HTTP), use URLs or URNs. If possible,
provide both and a translation service that translates URNs to URLs. URNs
should be used for long term storage, URLs for transient use.

If I follow my advice, the type of an ID is an implementation detail of the
persistence layer and/or service endpoint.

------
scandox
I use the internal int and external uuid strategy mentioned at the end. It
does make for somewhat confusing code for newcomers. I still don't love it.

------
jondubois
The reason given for not exposing UUIDs publicly (migration) doesn't apply to
most NoSQL databases because they let you set the ID yourself so you can just
copy each document as-is. Maybe the author was referring to databases which
automatically (and forcefully) generate the ID on insertion... Even in this
case, isn't there a way to tweak this temporarily just for the migration?

~~~
floatboth
SQL databases of course allow you to generate the UUID on the client. (In fact
not all database servers even _can_ generate UUIDs themselves. Postgres can,
if you enable an extension.)

------
manigandham
Use the hi/lo mechanism to generate IDs on the client. You can use a simple
transaction to reliably reserve a range of numbers and then easily have
incrementing numbers. Use longs and you can reserve a billion IDs per second
and never run out.

This solves basically all the problems and we use it in production to number
several tables with billions of events per day.

------
phamilton
I'm surprised the author calls out that knowing the pk before insertion is
useful, but doesn't once mention idempotence as a key benefit.

If you are building mobile apps that sync state, UUIDs make your life so much
easier. Optimistically perform writes locally, then perform writes remotely
and retry on exponential backoff in case of a network error.

------
tehlike
When I was a developer on NHibernate, one of my favorite ID generators was
something called HiLo.

Each client reserves a chunk of Lo numbers and increments the Hi
number. Basically, they would pre-allocate a chunk of id ranges, and this
allowed good distributed id allocation performance, while somewhat keeping
local ordering.

Client-generated ids are very useful.

~~~
flavio81
You were a developer on NHibernate? Thank you very much, it was a pleasure
using NHibernate (many years ago).

It was/is a very good ORM! And I used it extensively.

~~~
tehlike
It was the best years of my career. I learned so much more than i did through
my formal CS education, I was still a student at the time.

------
clairity
for ruby on rails, acts_as_having_string_id [0] is a nice gem for not exposing
sequential int primary keys:

it's nicer than using UUIDs because the strings are much shorter.

[0]:
[https://github.com/hult/acts_as_having_string_id](https://github.com/hult/acts_as_having_string_id)

------
iask
I was at a new client's the other day and noticed that for all their tables in
SQL SERVER, they use an IDENTITY column for primary keys, obviously seeded by
SQL SERVER. What I found strange is that they allow deletes of records,
leaving gaps in the sequence.

Is that normal practice? Their DBA insisted that it's normal.

~~~
dethswatch
No, it's not.

Also- wrt SQL Server ignore anything they say about GUID's as indexes. Just
don't cluster them.

Also- enjoy the benefit of not having to ever ask the DB what the pk of those
records you want to insert is going to be.

------
sfeng
Using a better encoding than hex for the GUID would fix many of the storage
and memory issues he cites: [http://eager.io/blog/how-long-does-an-id-need-to-
be/](http://eager.io/blog/how-long-does-an-id-need-to-be/)

------
russdpale
This is what hash keys are great for. After getting the hash, convert it to a
BIGINT. Works great for me. You still get everything you get with a UUID, but
as a bigint, so comparisons are much quicker, and it's 8 bytes.
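A sketch of that conversion (assuming SHA-256 and truncation to the first 8 bytes; the commenter doesn't specify the hash):

```python
import hashlib

def hash_key(source: str) -> int:
    # First 8 bytes of the digest as an unsigned 64-bit int. Subtract
    # 2**63 if the BIGINT column is signed.
    digest = hashlib.sha256(source.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

The trade-off versus a UUID is a smaller keyspace (64 bits), so collision probability grows correspondingly faster.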

------
scotty79
In my current project, in ms sql server, I have guid PK with unclustered index
and clustered index on another field filled with current time stamp on insert.

What do you think about such setup?

------
rickmode
I've yet to see anyone mention storing UUIDs in a BINARY(16) column. Use
exactly 128 bits to store 128 bits. We'd still have the random sort problem
though.

~~~
floatboth
Use a 'uuid' column if your DBMS was designed by people who knew what they're
doing :-)

------
brlewis
Shouldn't the title be appended with (2015)?

------
org3432
He missed one of the biggest issues: in most implementations they are slow to
generate, due to their complexity and the need for a PRNG.

------
kazinator
> _Best of Both: Integers Internal, UUIDs External_

Database coder reinvents interned atoms.

------
tuxt
We use unixtime + server number + random as pk.

Works fine.

(10 million new rows every day)
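One plausible way to pack such a key into 64 bits (the field widths here are assumptions for illustration, not the commenter's actual layout):

```python
import os
import time

def make_pk(server_id: int) -> int:
    # 32 bits of Unix seconds | 10 bits of server id | 22 random bits.
    seconds = int(time.time()) & 0xFFFFFFFF
    rand = int.from_bytes(os.urandom(3), "big") & 0x3FFFFF
    return (seconds << 32) | ((server_id & 0x3FF) << 22) | rand
```

With 22 random bits, collisions are only possible between rows created on the same server within the same second, which the birthday bound makes worth checking against your insert rate.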

------
arrty88
How big a deal is this on Postgres?

~~~
Androider
Not at all. There's a native Postgres UUID column type, stored as a 128-bit
number as opposed to a string. Postgres indexes are also not clustered, so
they will not suffer like MySQL's when using a UUID as a PK.

Using a UUID is not performance-optimal, but it is developer-optimal and
mitigates many of the other problems listed in this thread (offline sync
scenarios, merging partitions or datasets, tracking across systems, an opaque
external representation perfect for APIs, etc.), so unless you're really
squeezing the last drops of performance from a system I would simply use a
UUID nowadays.

------
burntrelish1273
Yes, internal-use integer PKs are a really good idea, but difficult to scale
securely on user-facing sites.

Often, though, you need an automatic, user-friendly unique id to give users,
short of asking the user for one or generating a text slug.

Here's one construction:

friendly id = HMAC-SHA256(plaintext = "#{table-object-or-model}#{primary-
key}", nonce = app-wide-unique-secret-token)

Then in the server app, keep a limited MRU cache (Hash table) of digests back
to PKs.

Out-of-scope: authentication, authorization, CSRF/XSS and replay attacks.
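The construction maps directly onto a stdlib HMAC call; a sketch (the secret value and names are placeholders):

```python
import hashlib
import hmac

SECRET = b"app-wide-unique-secret-token"  # assumption: stored server-side

def friendly_id(model: str, pk: int) -> str:
    # HMAC-SHA256 over "<model><pk>" keyed with the app secret, per the
    # construction above. The digest is one-way, so the server maps it
    # back to the pk via a bounded MRU cache (or recomputes candidates).
    return hmac.new(SECRET, f"{model}{pk}".encode(), hashlib.sha256).hexdigest()
```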

------
wcummings
>The original issue with simple auto-incrementing values is that they are
easily guessable as I noted above.

I don't think this is a real problem. If you're relying on your ID's being
"unguessable" (and introducing engineering complexity to that end) for
security you've already failed.

~~~
jasonshaev
Auto-incrementing IDs also leak information. For example, if a service uses an
auto-incrementing customer_id column and I'm assigned customer_id 12345, I
know there are a maximum of 12344 other customers (making a few assumptions).
This information is potentially useful for competitors.

~~~
merb
What if I start at a random int? Consider your first customer to be 546472.

~~~
jasonshaev
Sure, not a bad idea :)

That being said, I can still extract information. For example, I can create a
user today, and then another user a week from today. From that information I
can extract how many new users a service adds in a week.

To avoid that, I can increment by a value other than one, but that might make
sharding harder unless I pick a number like 10, 1000, etc. (of course
depending on sharding strategy), which is probably easy to figure out.

I guess my point is that it's harder to avoid leaking information with auto-
incrementing IDs. This may or may not matter depending on the use case.

------
cynoclast
This article sort of assumes you're using a relational database.

Most of the drawbacks discussed don't exist if you're using a key value store.

~~~
__jal
Right, and none of the drawbacks are relevant to toaster-oven or parachute
use.

