
SQL Keys in Depth - chmaynard
https://begriffs.com/posts/2018-01-01-sql-keys-in-depth.html
======
manigandham
After a decade of large systems relying on RDBMSes, we now use 64-bit integers
for all primary keys with a global Hi/Lo id generation system (the app reserves
a range of numbers on startup to assign to records automatically).

This means plenty of ID space, maintains rough numeric ordering, allows ID
creation without a roundtrip for every insert, is easily portable across
different databases, and produces unique IDs for every row in the database
which greatly simplifies everything from caching to replication.
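
The core of the Hi/Lo idea can be sketched with a plain PostgreSQL sequence
(the sequence name and block size here are purely illustrative, not our exact
implementation):

    -- Each nextval() reserves a block of 1000 ids for one application instance.
    CREATE SEQUENCE id_block_seq START WITH 1 INCREMENT BY 1000;

    -- One round-trip on startup (or whenever the current block runs out):
    SELECT nextval('id_block_seq') AS block_start;   -- e.g. 1, then 1001, then 2001, ...

    -- The app then assigns block_start, block_start + 1, ... block_start + 999
    -- to new rows locally, with no per-insert round-trip.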

~~~
walshemj
It does seem to me that the author is a hobbyist or a junior developer with
little real-world experience.

Literally the first column of almost every SQL table I have written has been
an auto-incrementing column called id.

~~~
sparkie
The trouble with "natural keys" is that they're rarely actually unique. The
barcode is a typical example. A naive developer might use a barcode as a
primary key, but will soon be in for a world of pain when he realizes that
products often use the same barcodes for different configurations (packaging
etc), which usually need different SKUs. The same product from different
origins may have the same barcode, which often matters.

~~~
pasta
And a product might get multiple barcodes.

I was a fan of natural keys but it is just too much trouble. For example, you
have to URL-encode everything when a key is used in a URL.

But sometimes I still use natural keys for tables based on ISO standards, for
example country codes.

~~~
wvenable
I feel like an ISO country code is more of a surrogate key anyway; it's a
mnemonic made-up value specifically designed to be used as a key.

~~~
dragonwriter
An ISO country code isn't a surrogate key; it's not db-specific, and it has
domain meaning (assuming you are using it for its intended purpose). When
used, it's usually a representation chosen for natural domain information, so
if it's a key, it's a natural key.

------
simonw
"For instance, a database of hobbyist club members could include uniqueness on
the two columns first_name, last_name of members. Duplicates are likely
unintentional at this scale, and if necessary the constraint can be dropped.
Until an actual conflict happens the key is a reasonable precaution."

Absolutely do not do this.

People have names that are duplicates. A situation where someone is unable to
join a club because their name clashes with an existing member is not OK.

Expecting a club administrator to be able to drop a uniqueness key from their
database in order to resolve this situation is not a reasonable solution!

~~~
MarHoff
I agree with you, but I think you did overthink his metaphorical example a
little.

I think the point was that for a 40-person club the odds are very very poor
that two people would have the same name. And even if so, a club manager could
still differentiate by adding a middle name or a nickname (in a 40-person
scenario).

Of course, if your neighborhood club expands and you need to manage a lot of
people, you'll have to switch to a better technique. His point was that you
must fit the solution to your business case. (But hey, some people change
their name when they marry, so problems can happen fast; that's a mutability
issue though, not a uniqueness one.)

Overall a very good article, but keeping a critical mind as you did is needed.

~~~
olavk
I'm not convinced that the odds are "very very poor". Some names (like Robert
Smith or Maria Garcia) are quite common. And a hobby club is not a random
selection of people, members would typically tend towards a specific
geographic area, social class or culture depending on the subject. Some
cultures have a very limited pool of names, depending on tradition.

~~~
bastih
Just from looking at a 200-member club I'm part of, I can see at least two
name collisions in our current club roster. So I absolutely agree with your
assessment.

------
foreigner
At what scale does all this stuff start to actually matter? I have a database
with ~100 tables and ~500M rows driving a medium-traffic web app and various
back-end systems. We use auto-incrementing integers as primary keys and try
not to expose them externally. Indexes are added as necessary to enable
specific queries. We don't enforce any other constraints (e.g. not-null or
foreign keys) at the database level.

... and it all works and performs just fine? The considerations the author
mentions all make some sense to me in theory, but when do they actually matter
in practice in a modern system?

~~~
adambatkin
I disagree that UUIDs are generally preferable over integers. For one, they
take up more space (on disk and in memory). And for something like a key, it
is likely that there will be multiple copies of that value stored, since it
will exist in the table itself, at least one index (possibly more) and foreign
keys. More space means fewer records per page on disk, more I/O and more
memory usage (potentially leading to more I/O). I would wager that for most
users (including this case), the cost of this additional I/O is far greater
than some theoretical scalability limitation on generating IDs.

Most database vendors make sequence generation (whether through explicit
SEQUENCE objects or auto-incrementing columns) performant by making a few
compromises:

* Numbers may not always be sequential (you might get 1, 3, 2 - in that order)

* There may be gaps (you might get 1, 2, 5, 6)

But since these are supposed to be opaque identifiers, neither of these
compromises should be a concern for most users. In exchange, the sequences can
live outside of a transaction (you might grab an ID, roll back, and that ID is
gone) and systems with multiple nodes can be allocated a "block" of numbers
from which they can quickly pull new values, without needing to coordinate
with a master node.
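
A tiny PostgreSQL illustration of both compromises (the sequence name is made
up):

    CREATE SEQUENCE order_id_seq CACHE 50;   -- each session pre-allocates 50 values

    BEGIN;
    SELECT nextval('order_id_seq');          -- returns 1
    ROLLBACK;                                -- the sequence itself does not roll back

    SELECT nextval('order_id_seq');          -- returns 2: a gap where 1 would have been
    -- a second session's first nextval() returns 51, from its own cached block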

~~~
extrapickles
UUIDs are the best choice when developer time is more important than space
usage. Also, they can be generated and used by the client when the connection
to the db is frequently down (e.g. clients store/query data locally in SQLite
and replicate to the master).

~~~
RobertRoberts
I am not a db expert, which is why I like to read these kinds of comment threads.
But what about this?

[https://en.wikipedia.org/wiki/Universally_unique_identifier#...](https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions)

"The random nature of standard version 3, 4, and 5 UUIDs and the ordering of
the fields within standard version 1 and 2 UUIDs may create problems with
database locality or performance when UUIDs are used as primary keys. For
example, in 2002 Jimmy Nilsson reported a significant improvement in
performance with Microsoft SQL Server when the version 4 UUIDs being used as
keys were modified to include a non-random suffix based on system time. This
so-called "COMB" (combined time-GUID) approach made the UUIDs non-standard and
significantly more likely to be duplicated, as Nilsson acknowledged, but
Nilsson only required uniqueness within the application."

~~~
WorldMaker
The article touches on this question. Database locality is a complex scaling
question that may not be a problem depending on how read/write heavy your
operations are. On the one hand, randomly distributed UUIDs may be a bad idea
on a single SQL server trying to balance a single B-Tree index (and thrashing
that index with a lot of incoming data); on the other hand it can be somewhat
ideal for partition sharding across multiple servers. As with any trade-offs
in database design, your mileage will vary with your application needs and
resource availability.

Also, there are other options for time-ordered GUID/UUID alternatives. ULID is
the one I've been heavily using in projects lately:
[https://github.com/ulid](https://github.com/ulid)

(ULID uses a timestamp prefix and random suffix for reasonable time-ordered
database locality; an interesting compromise between V1 and V4 UUIDs, though
not directly compatible with either. The L stands for "lexicographic", in that
it's meant to be sortable in string indices as well, which can be important
for database locality in many document/NoSQL databases.)
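
For a rough sense of the layout, here's a sketch of generating a ULID-shaped
value directly in PostgreSQL (the function name is made up, gen_random_bytes
needs the pgcrypto extension, and the result is stored as a uuid rather than
the spec's Crockford base32 string):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    -- 48-bit millisecond timestamp prefix || 80 random bits, packed into a uuid.
    CREATE FUNCTION gen_ulid_as_uuid() RETURNS uuid AS $$
      SELECT encode(
               substring(int8send((extract(epoch FROM clock_timestamp()) * 1000)::bigint)
                         FROM 3 FOR 6)        -- low 6 bytes of the ms timestamp
               || gen_random_bytes(10),       -- 10 random bytes
               'hex')::uuid;
    $$ LANGUAGE sql VOLATILE;

    SELECT gen_ulid_as_uuid();  -- values generated later sort later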

------
lifepillar
The most important property of surrogates, which I never see mentioned, is
that two tuples, _anywhere in the database_, have the same surrogate key _if
and only if_ they are believed to refer to the same real-world entity. This is
a crucial difference with respect to auto-incrementing or randomly generated
values that are independent in different relations (which is a common
practice). Among other things, with surrogates defined as above you may freely
join relations on surrogate keys, being sure that you are joining related
data.

In his landmark 1979 paper, Codd defined surrogates pretty clearly, and that
part of his work was based on an earlier paper entitled On Entities and
Relations (I don't recall the authors right now), so this is not exactly news.

Unfortunately, current DBMSs provide little to no support for properly
implemented surrogates.

~~~
dragonwriter
> Unfortunately, current DBMSs provide little to no support for properly
> implemented surrogates.

UUIDs instead of per-table autoincrement keys seem to be a solution which
covers the part of the problem that involves DB support as opposed to data
model design, and several DBs have adequate support for UUIDs.
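
In PostgreSQL, for example, this is roughly all it takes (gen_random_uuid() is
built in since version 13; older versions can get it from the pgcrypto
extension; table names are illustrative):

    CREATE TABLE account (
        id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        name text NOT NULL
    );

    CREATE TABLE invoice (
        id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        account_id uuid NOT NULL REFERENCES account (id)
    );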

~~~
ProblemFactory
UUIDs (and other mostly random IDs) also have the important advantage that
they can be generated offline - in mobile apps, client-side javascript or
elsewhere.

With autoincrement keys, you need to either contact the database server for
each data object you create, or implement complex logic to replace temporary
IDs with database IDs later.

~~~
dullgiulio
Nope, you can't. You still have to check for uniqueness on the server side
before committing them, so you are back where you started.

Always keep in mind that anything coming from the client cannot be trusted.

~~~
ProblemFactory
You can't blindly overwrite existing data based on the UUID, sure.

But I find that the security checks against "stealing UUIDs" are much simpler
to implement than a system of composite primary keys (deviceid+autoincrement
or userid+autoincrement) or some sort of temporary primary keys. If you see
suspicious UUIDs, you can just discard the data.

Having the client generate the "final" IDs for objects is particularly
convenient when the client can produce a complex graph of objects offline, for
example an order + order items + comments + more. If using temporary IDs until
first contact with server, all these links would have to be patched up when
the objects get real server-side-assigned IDs.

------
chmaynard
A review article that combines original research, excellent writing, and
useful advice. Impressive!

~~~
Ericson2314
Yeah, not the usual poorly argued hearsay. Good job, author.

------
stickfigure
On Postgres I prefer to use a single sequence to generate ids across all
tables. This reduces the chance of accidents (eg accidentally deleting the
wrong thing with id #123) and reduces information leakage ("oh, I see I'm
customer #5, you must only have 4 other customers").

~~~
eropple
That's not a bad option, but I think I'd find locally-unique integer IDs
spread across tables to be a little confusing. In that case I think I'd lean
towards UUIDs, which may be a little easier overall, or use separate sequences
but expose IDs via Hashids in my API/frontend:
[http://hashids.org](http://hashids.org)

~~~
int_19h
In Postgres, having multiple tables share a single ID sequence is trivial
thanks to SEQUENCE/NEXTVAL.
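
Something along these lines (all names illustrative):

    CREATE SEQUENCE global_id_seq;

    CREATE TABLE customer (
        id   bigint PRIMARY KEY DEFAULT nextval('global_id_seq'),
        name text NOT NULL
    );

    CREATE TABLE invoice (
        id          bigint PRIMARY KEY DEFAULT nextval('global_id_seq'),
        customer_id bigint NOT NULL REFERENCES customer (id)
    );

    -- ids never repeat across the two tables, so id 123 identifies at most
    -- one row in the whole database.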

~~~
WorldMaker
Sure, but Postgres doesn't to my knowledge support a cross-table UNIQUE index
to ensure your sequences are followed in all cases. I imagine that all it
takes is one confused junior developer adding a bunch of manually-IDed data by
hand to create some weird and confusing corner cases in this scheme.

~~~
int_19h
You can define the field as GENERATED ALWAYS, though, in which case specifying
the value explicitly in INSERT won't work (there's an escape hatch, but it has
to be explicitly and very verbosely invoked).
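
A quick sketch of how that looks (PostgreSQL 10+ identity columns; table name
made up):

    CREATE TABLE item (
        id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text NOT NULL
    );

    INSERT INTO item (id, name) VALUES (42, 'widget');
    -- rejected: explicit values are not accepted for a GENERATED ALWAYS column

    INSERT INTO item (id, name) OVERRIDING SYSTEM VALUE VALUES (42, 'widget');
    -- the deliberately verbose escape hatch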

------
dizzystar
It's an interesting article with interesting ideas.

I'm squarely in the camp of using natural PKs until there is good reason to
use surrogate PKs.

I generally disagree with the notion of using a combination of surrogate and
natural keys. In SQL, a PK isn't just another unique key: a PK is an important
piece of communication.

As a rule of thumb, when a PK is attached to a semantic value, it is saying
that this is the identifier of the table. If a PK is on a surrogate key, it is
saying that there is no good unique identifying value in the table. When you
are dealing with larger data sets, this distinction isn't minor, as it helps
you understand the intent of the data when working with it. PKs serve as
guideposts in your design, and as guideposts for the person who has to
maintain (or fix) your database later on.

I know some disagree with me on that, but there are many undeniably good
reasons to use a natural key. A good place is a check-constraint table where
you, say, have a list of US states and you want to ensure that "New York" and
not "New Yoerk" gets inserted into the state column of an address table. Put a
PK on valid_state_names and an FK to that PK from addresses.
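
Concretely, something along these lines (names illustrative):

    CREATE TABLE valid_state_names (
        state_name text PRIMARY KEY
    );

    CREATE TABLE addresses (
        street text NOT NULL,
        city   text NOT NULL,
        state  text NOT NULL REFERENCES valid_state_names (state_name)
    );

    INSERT INTO addresses (street, city, state)
    VALUES ('1 Main St', 'Albany', 'New Yoerk');
    -- rejected by the foreign key: 'New Yoerk' is not in valid_state_names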

~~~
wvenable
> If a PK is on a surrogate key, it is saying that there is no good unique
> identifying value in the table.

Almost always there is no good unique _unchanging_ identifying value on a
table. Take something as simple as "Person" -- there is literally no unique,
unchangeable value for such an entity. And that's the rule rather than the
exception.

I take exactly the opposite approach; nearly all entities should be identified
with a surrogate key. You pretty much cannot go wrong with this approach.

In the US, state names are pretty stable, but in other countries
state/province names have changed, so even that value is a poor choice for a
primary key.

~~~
dizzystar
Yes, each set of data should reflect the data you want to use. That's why I
said I default to natural keys unless there is good reason to use a surrogate
key.

If I compare databases I've worked on over the years, the ones that are all
surrogate keys have many more data errors than the ones that are predominantly
natural keys. I could write a book on why that happens, but if you don't know
relational theory cold (not saying you don't), you should just use natural
keys. The short answer is that using all surrogate keys is equivalent to not
using unique constraints at all.

Data is bound to change at some point in the future. The entire point of
having constraints is to ensure that your data keeps its integrity, now and
later on, no matter how the data needs to be updated.

~~~
wvenable
I'm working with a 3rd-party database right now that uses more than a few
natural keys, and it's awful. The primary key for "users" is username, so we
can't rename our users to improve our company's username security policy. And
that's just _one_ of the problems. We'll be offline for a full 24 hours in
April to handle changing the value (for necessary business reasons) of the
natural primary key in the client table.

I'm all for constraints, especially uniqueness constraints, where appropriate.
You can use unique indexes; you don't have to use natural keys as primary
keys. Data is bound to change, so make that change possible.
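
That is, something like this sketch (names made up): uniqueness still lives in
the database, but a rename touches exactly one row:

    CREATE TABLE users (
        id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        username text NOT NULL UNIQUE
    );

    CREATE TABLE login_audit (
        user_id  bigint NOT NULL REFERENCES users (id),
        login_at timestamptz NOT NULL DEFAULT now()
    );

    -- Renaming a user is a single-row update; no foreign keys need rewriting.
    UPDATE users SET username = 'new_name' WHERE id = 123;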

~~~
dizzystar
A badly built database isn't the reason to toss out good practices.

It's actually quite obnoxious to see how bad many databases are, and this
often causes a blow-back of tossing the baby out with the bathwater. It's a
trap.

I'm not really sure what all the problems are, but if changing username to
something else is causing a 24-hour shutdown, there are many deeper issues
than a PK on username.

~~~
wvenable
> A badly built database isn't the reason to toss out good practices.

It's a good reason to toss out bad practices like, for example, using natural
keys as primary keys. Which is, almost universally, considered a bad practice.
It is, in fact, the bad practice that causes the most pain in this bad
database.

> if changing username to something else is causing a 24 shutdown, there are
> many deeper issues than a PK on username.

This is true; the software is so old it pre-dates most modern RDBMS features,
and that is a factor in the database design. And it contains a few decades'
worth of data. And the process of going through _all_ the records and re-doing
every foreign and primary key value is expensive (it goes through the app
server) but is a well-tested path. None of this would be necessary, however,
if they had used surrogate keys for these tables. They did use surrogate keys
for other tables, so it wasn't a technical limitation.

~~~
brlewis
I'm curious what database this is. Even MySQL supports ON UPDATE CASCADE back
to 5.5 (or earlier?).
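
For reference, this is the kind of declaration meant (illustrative names);
with it the database itself rewrites referencing rows when a natural key value
changes:

    CREATE TABLE clients (
        client_code text PRIMARY KEY
    );

    CREATE TABLE orders (
        order_no    bigint PRIMARY KEY,
        client_code text NOT NULL
                    REFERENCES clients (client_code) ON UPDATE CASCADE
    );

    UPDATE clients SET client_code = 'ACME-2' WHERE client_code = 'ACME';
    -- matching orders.client_code values are updated automatically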

------
erpellan
The very first example implies its own counterexample. In most decks of cards
there are 2 jokers.

~~~
dannyfraser
One red, one black.

------
B-Con
> One thing to avoid in publicly exposed numbers is sequential order. It
> allows people to probe for resources (/videos/1.mpeg, /videos/2.mpeg etc)
> and also leaks cardinality information. Add a Feistel cipher on top of a
> sequence. This hides the ordering while maintaining uniqueness.

> The PostgreSQL pseudo encrypt wiki gives an example cipher function:
> [...]

I get that they want to leave the key value in the schema, but practically
this kind of thing feels worth pulling out into the storage engine. Just
generate a random string as the ID upon row insertion and force uniqueness on
that column. If insertion fails, generate another random string. (And it
should pretty much never fail, because otherwise you're still susceptible to
the enumeration attack you're trying to prevent.)

Although given that Postgres can read from a file, could the above be done in
the schema by reading from /dev/urandom? If so, that seems like the better
approach.
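
A minimal sketch of the random-identifier idea in PostgreSQL (assumes the
pgcrypto extension for gen_random_bytes; table and column names are made up):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    CREATE TABLE videos (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,   -- internal key
        public_id text NOT NULL UNIQUE
                  DEFAULT encode(gen_random_bytes(12), 'hex'),       -- exposed key
        title     text NOT NULL
    );

    -- 96 random bits make a collision (and hence an insert that needs retrying)
    -- vanishingly unlikely, and the value leaks no ordering or cardinality.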

------
slg
>Modern SQL is supposed to abstract from the physical representation. Tables
model relations, and should not expose an implicit order in their rows.
However even today SQL Server creates a clustered index by default for primary
keys, physically ordering rows in the old tradition.

Which is why the UUID (or GUID in SQL Server speak) can have other drawbacks
there in comparison to auto incrementing bigints, namely delays due to the
data being ordered randomly on disk. There are obviously ways to counter that
like adding extra non-clustered indexes or changing the clustered index to
something besides the primary key, but at that point the extra time and
overhead might defeat the benefits you gained from going with UUIDs in the
first place.

~~~
jeremy_wiebe
Isn’t that exactly what the NEWSEQUENTIALID function was created for?

[https://docs.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql](https://docs.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql)

~~~
slg
In theory, yes; it solves some of the problems and introduces others. It isn't
usable everywhere NEWID is usable, so if you assign UUIDs as part of a query
you are out of luck. It also assumes you are generating all your UUIDs on the
same machine without restarting, which once again negates some of the benefits
of UUIDs.

~~~
tatersolid
ULIDs solve these issues.

Basically 48 bits of millisecond timestamp then 80 bits of randomness.

The technique has been used in MSSQL apps since about 2003 (although people
called them COMBs instead of ULIDs back then).

[https://github.com/oklog/ulid](https://github.com/oklog/ulid)

~~~
slg
Nice, I hadn't seen that before. It looks pretty good on first glance.

------
Ericson2314
This is well done. And yet many people get this stuff wrong. As a language
person, I think about how we could make these choices more natural by
rewriting SQL; make them the path of least resistance rather than requiring
much pondering and wisdom.

~~~
hug
Let's be honest: If _you_ could make these choices more natural then you would
have done so by now. If you could make SQL better than it is, you already
would have done that as well. In fact, if most people who are interested in
SQL could make SQL better than SQL is, we wouldn't be using SQL anymore.

We're not using it because we have some weird tradition we enforce. We're not
using it because we like the way it looks, the way it writes, or the way it
handles. We're not using it because we've tried to use ORMs and those ORMs
have proved conclusively better across use cases than SQL.

The reason we use it is because we don't yet have a better solution. If you
can write a better solution than SQL I would love to hear about it: I
absolutely loathe SQL, but I use it because out of everything I've seen, it's
the simplest and most elegant solution to the problem that I've ever come
across.

So yeah, while we could possibly and should definitely make SQL better if we
can, I don't think that suggesting that we just re-do it from scratch is the
way to go. That way lies madness, and possibly javascript frameworks.

~~~
Terr_
I think you're discounting network-effects and historical inertia.

Even things which are functionally very simple to implement won't necessarily
catch on, ex: "If _you_ could make keyboard layouts more practical than
QWERTY, then you would have done so by now."

~~~
hug
I mean, maybe I am, maybe I'm not. How many other languages from the '70s
haven't been replaced in a domain where they should have been?

And, I mean, maybe your example isn't really the best example you could have
given. Name a layout quote-unquote more practical than QWERTY that should have
caught on but didn't. Dvorak? No faster. Colemak? Nah. Any, y'know, studied
benefits? None I can find, at least not with a P-value worth mentioning.

I'm not saying SQL is the definitive end of database query languages, of
course, because that would be silly. There is, almost certainly, a more
objective truth. What I'm saying is: good luck finding it. It's only
semi-cynically that I say I wish you luck finding a better language than SQL. I
honestly, with all my heart, wish that SQL could be replaced with something
better, because I hate SQL.

When some guy comes along and is like "we should do a better SQL!" the answer
is yes, we should, but I'm not exactly going to hold my breath.
~~~
tome
> And, I mean, maybe your example isn't really the best example you could have
> given. Name a layout quote-unquote more practical than QWERTY that should
> have caught on but didn't. Dvorak? No faster. Colemak? Nah. Any, y'know,
> studied benefits? None I can find, at least not with a P-value worth
> mentioning.

Do you believe that all keyboard layouts are of indistinguishable
practicality? That seems to be the inescapable conclusion of your claim.

~~~
EdwardDiego
> Do you believe that all keyboard layouts are of indistinguishable
> practicality?

If you have evidence to the contrary, a lot of us would be keen to read it.

~~~
tome
I'm happy to take hug's claim that there _is_ no such evidence at face value.
What I'm really interested in, however, is whether the absence of such
evidence causes him/her to believe that _there is no difference_.

------
rusbus
One thing to keep in mind with Postgres is that unique key indices can lead to
deadlock and stall other transactions that insert the same key.[1]

I wouldn't say "unique constraints considered harmful" but I would definitely
say they can be surprising and must be used with care.

[1]: [https://rcoh.me/posts/postgres-unique-constraints-deadlock/](https://rcoh.me/posts/postgres-unique-constraints-deadlock/)

~~~
lobster_johnson
The relatively new "INSERT ... ON CONFLICT" mechanism is a great way to
circumvent this issue. Using that feature, you can tell it to ignore
conflicting dupes, or you can have it transform the insert into an update.
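
For example (assuming a unique constraint on the conflicting column; table
names illustrative):

    -- Ignore the conflicting duplicate:
    INSERT INTO tags (name) VALUES ('postgres')
    ON CONFLICT (name) DO NOTHING;

    -- Or turn the insert into an update ("upsert"):
    INSERT INTO counters (name, hits) VALUES ('home', 1)
    ON CONFLICT (name) DO UPDATE SET hits = counters.hits + EXCLUDED.hits;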

------
navium
"There’s no need to manually create indexes on columns already declared
unique; doing so would just duplicate the automatically-created index." Is the
last part real? Makes me feel like forking postgesql just to save the world
from accidental duplicate expensive indexes.

~~~
jimktrains2
Yes, in PG at least an index is created; otherwise the unique check would
become prohibitively expensive.
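
Easy to check for yourself (illustrative):

    CREATE TABLE emails (
        address text UNIQUE
    );

    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE tablename = 'emails';
    -- shows the implicitly created unique index, e.g. emails_address_key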

~~~
cheapsteak
I think the question is more about whether a duplicate index would be created,
instead of having the unique constraint and the primary key designation on the
same column share an index.

------
danielovichdk
Funny example in regard to a card deck. How would you have unique keys for the
3 jokers in the deck?

~~~
LeonM
A joker does not have a suit, nor a number, so it would not fit in the example
table anyway.

~~~
WorldMaker
Arguably Joker is the number, and just as Ace is 1, you might use 0 or 14
(depending on math preferences) as a surrogate number, or stick to the
character designations like A, K, Q, J.

Decks with four jokers typically have suited jokers. Decks with two are
typically (though not always) red/black and could be given arbitrary suits
(say hearts/spades).

------
Dowwie
Programmers who are intrigued by the recommended pg_hashids extension have a
wide variety of alternatives to using a pg extension for the same key
obfuscation logic: [http://hashids.org/](http://hashids.org/)

------
txutxu
Great article.

But a MAC address cannot be considered unique as soon as software like
keepalived is running.

~~~
daenney
OK. But how does that relate to this article? If you're using keepalived,
what you're doing is moving the connection from one instance to another if the
current one is detected as dead. Even if only the MAC were used to generate
the UUID (which it is not; the timestamp is factored in too), I struggle to
find a scenario in which this would be a problem. Even if you time it down to
the nanosecond, if keepalived is used as intended you'd never generate a UUID
at the same time on a number of machines sharing the MAC.

And from a privacy standpoint, if the MAC cannot be considered unique, that's
only a bonus?

~~~
txutxu
The article lists the MAC address as a _natural unique key_:

    Here are some values that often work as natural keys:

    [...]
    mac address on a network
    [...]

that was the reason for my comment.

~~~
daenney
Ah, I missed that this was what you were commenting on. I thought it was about
the use of the MAC in the UUID generation.

Thanks!

------
mikst
Hey, is it me being far from an expert, or:

why use UUIDv1 if you can just use a timestamp?

Wouldn't a time-based solution be prone to time misconfiguration across
servers? Or does that little shuffle not really matter?

------
epx
The read was worth it because of the curse-generator anecdote.

------
bullen
There are two things this article could point out:

The difference between centralized, distributed and decentralized primary
keys.

And how public/private key encryption should be a natural part of every
primary key.

The problem is that all SQL databases use synchronous clients, which makes
them completely useless in any distributed or decentralized setting.

The real tension here is that there is no guarantee that large UUIDs are
unique, and we have not embraced the random occurrence of a collision.

