
On Primary Keys - dstroyallmodels
http://orchestrate.io/blog/2014/05/22/the-primary-key/
======
outworlder
"In fact, even in RDBMS’s developers should be using a natural key as their
primary key instead of an auto-incremented ID. "

No, they shouldn't. The fact that your dataset may contain a piece of
information that is usually unique its completely unrelated to the need of
RDMSs to have identifiers to implement table relationships and to identify
records in a table.

Not to mention that it can bring lots of headaches down the line. The classic
example is using social security numbers (or another country's equivalent,
such as CPF in Brazil) to identify individuals. Down the line you get a
requirement to support international customers. That is now a major
undertaking.

You are also offloading the responsibility of maintaining the uniqueness
property to a third party, in effect trusting their database.

~~~
phkahler
The noob never saw a person get married and change both their last name and
email address. There is no such thing as a "natural key" that can never
change, that's why auto-incrementing integers are used. Also, users should not
rely on the keys being monotonic.

------
tbrownaw
DO NOT USE USER-SUPPLIED DATA AS KEYS

Database keys are more-or-less equivalent to pointers. In the case of natural
keys, they're pointers that (a) do not have ASLR, and (b) are exposed to the
user.

As the article mostly ignores, you _will_ guess wrong about what can be
reasonably considered permanent or unique. Email addresses change (and aren't
always unique), names change (and aren't unique), etc. This is much harder to
fix if you used that guess as the basis for your entire data model.

The problem with proper keys was given as "auto_increment doesn't work on
distributed systems". This is rather silly, as there are plenty of
alternatives (GUID, UUID, the mentioned snowflake keys).

It's also interesting that proper keys are a "crutch" for designing data
models and are best replaced with cleverness, and apparently are only useful
to dictate record order (I know that Oracle at least doesn't work that way).
This brings to mind "debugging code is twice as hard as writing code; if you
write code as cleverly as possible, you are by definition not clever enough to
debug it". DON'T BE (OVERLY) CLEVER. Do the least-fragile thing that doesn't
add complexity (that is, doesn't add complexity _visible to you_.). Only do
clever things that add fragility if you have an actual measured reason to do
so.

------
buckbova
"In fact, even in RDBMS’s developers should be using a natural key as their
primary key instead of an auto-incremented ID. This will lead to better
performance when the natural key is the most commonly used identifier, but is
often not considered out of a habit."

As a database architect I always use auto-incrementing as the primary key.
This ID is actually a surrogate key for the real primary composite key, which
I assign to a unique key.

So, with the unique key in place, the performance will be nearly as good as
you'd expect for direct looks by key attributes. However, the performance gain
using the surrogate ID within join tables and as foreign keys is much greater.

"It’s also worth noting that over their years of use in SQL databases they’ve
become somewhat of a crutch when designing the data model for a system."

They are not a crutch. They are a wonderful feature. One could model with
GUIDs if necessary or something similar for your surrogate key but nothing
beats the simplicity of an auto incrementing ID.

Even a NoSQL database like MongoDB creates a unique _id for you for each
document.

------
Jedd
An article promising to be quite technical demonstrates early on the author's
inability to get their head around the spelling of the possessive form of
"its". (Perhaps it's just me that finds this an unsettling juxtaposition. The
About page wasn't useful in identifying native writer status.)

Then we dive into some mistruths about emails being unique per user (I know
plenty of people who _still_ share email addresses, both personal and also in
shared commercial accounts). I worked at a place a few years ago that had
designed their entire accounts system around the idea of email addresses being
immutable. They'd subsequently had to introduce a secondary column for
_actual_ email addresses -- customers would login with their historical email
address as their ID. The idea that a uid is a 'natural key' just seems
perverse (or at least demonstrates a profound misunderstanding of what a
natural key is).

A sideways reference - without actually using the phrase - to CAP theorem.
Composite keys are suggested as a solution to a vaguely phrased problem
without acknowledging the performance issues, and subsequent design
constraints, they introduce. Ultimately there seems to be a misunderstanding
about the differences between a (primary) key and an index.

------
al2o3cr
The biggest problem with "natural" keys, as referenced in the article: they
have a nasty habit of _changing_. In the article's example, making "username"
or "email" the PK is going to be messy if you have foreign keys referencing
the value when it updates.

~~~
userbinator
Using email I can see being a problem, but username? If this is for
identifying accounts, then that shouldn't be one that ever changes.

~~~
avemg
Jane Doe gets married and changes her name Jane Smith. She no longer wishes to
log into your system as janedoe and would prefer to login as janesmith.
Certainly one could make the argument that you should just tell her that the
name can't be changed and she'll have to just live with it. Depending on your
business, that may or not be acceptable.

~~~
emeraldd
Even worse, Jane Smith decides she hates that _bleep_ John and they get
divorced. She really won't want to see Smith as part of her username anymore.

------
mobiuscog
In a recent large project, I moved to using GUIDs. Following advice from this
article: [http://www.codeproject.com/Articles/388157/GUIDs-as-fast-
pri...](http://www.codeproject.com/Articles/388157/GUIDs-as-fast-primary-keys-
under-multiple-database) it was fairly straightforward and performant.

Obviously YMMV (as may your definition of 'large' and 'performant').

~~~
deathanatos
UUIDs are wonderful: they're fairly small (16 bytes), and can be generated on
multiple servers independently with no coordination. If you use version 4 or 5
UUIDs, the first couple of bytes are random, so short-circuiting comparisons
should in theory bail quickly… but I've never had a need to measure.

Perhaps the only gotcha I've run into is that JS can't reliably generate them.
(You need a decent RNG, and while JS has an RNG, no guarantees are made about
it.) (Trivially work-aroundable with an AJAX endpoint on the server, of
course.)

~~~
AaronBBrown
Strictly speaking they are 16 bytes when stored as binary, which sounds grat,
but when represented as strings (as they usually are) they are 36 characters
which is 36-bytes in ASCII, but 108(!) bytes in 3-byte unicode.

So, in practice they are terrible primary keys in systems that do not have a
native UUID type. In that case, an 8-byte bigint generated externally by a
ticket service like Snowflake is often far superior.

~~~
deathanatos
> they are 36 characters which is 36-bytes in ASCII, but 108(!) bytes in
> 3-byte unicode.

I'm going to argue that that's not the norm: "3-byte unicode" is kinda WTF,
since it doesn't really exist. If you're in UTF-8, it's a 36 byte comparison.

That sounds, however, a lot like MySQL, which has perhaps one of the more
braindead "UTF-8" implementations. That said, I'm not sure that it uses 3
bytes for code points that don't require it. (At least, that it spaces them
that way: there may be nulls past the data, sure, but those won't count in a
comparison.)

------
bcoughlan
"In fact, even in RDBMS’s developers should be using a natural key as their
primary key instead of an auto-incremented ID. This will lead to better
performance when the natural key is the most commonly used identifier, but is
often not considered out of a habit."

Primary/unique keys are indexed. Creating an index of strings instead of
integers is surely going to result in a huge index and slow performance?

~~~
Amezarak
Well, like everything, it depends on your database, your data and how its
accessed. I've had a lot of varchar indexes in my time.

The article makes it seem like your primary key and main index have to be the
same, but that's not necessarily the case. Most DBs have some support for
physically ordered indexes.

So take username, for example. Let's say username has to be unique, but it can
change. (It doesn't even matter too much if it's not totally unique, just
mostly unique.) Let's say you're going to be doing most of your lookups in the
user table by username.

If performance is an issue, keep your auto-incrementing guaranteed-unique
primary key and add a clustered index on user name. The data is now physically
stored by its natural key, but you still have a nice unique, unchanging
primary key. Lookups are very fast, though depending on the implementation,
inserts/updates might not be so much.

Postgres supports this with the CLUSTER command, although it has to be run
manually - it doesn't keep up as updates and inserts come in. SQL Server calls
it a clustered index and enforces the physical ordering on inserts and
updates. Oracle has indexed-ordered tables which can achieve a similar effect,
though IIRC the index has to be the PK (nothing to stop you from having a
unique constraint on a sequence-generated column, though.)

~~~
batbomb
In oracle, the index on an index ordered table only needs be a unique. So, for
example:

    
    
          1234 "John Doe" null null null
          5824 "John Doe" "Sale 1" "item 1" $5
          4321 "John Doe" "Sale 1" "item 2" $50
          4382 "John Doe" "Sale 2" "item 1" $4
          
          
          CREATE UNIQUE INDEX (name, sale, item);

------
cratermoon
We've been having this argument about primary keys for three decades at least.
If the natural key vs. auto-generated ID theories can't be reduced to
practice, is there any home for software development as a profession?

------
aaronem
> An email address is a piece of information that is guaranteed to be unique
> to each user because everybody has their own email address.

I once let myself in for a lot of headaches by designing a system around this
unwarranted assumption.

------
bcbrown
It looks like the purpose of the article is to explain why it's not a big deal
that the company's database solution doesn't support auto-incrementing primary
keys.

------
dpeck
Instagram had a nice post on how they did their PK generation/sharding a
couple years back. Still a good read, [http://instagram-
engineering.tumblr.com/post/10853187575/sha...](http://instagram-
engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram).

Previously discuss here,
[http://news.ycombinator.com/item?id=3058327](http://news.ycombinator.com/item?id=3058327)

------
dbrower
There's nothing that requires most auto id keys to be monotonically
sequential, which is the bottleneck when distributed. If uniqueness is all
that's required, blocks of id's can be handed to different hosts and
replenished on demand.

Gap-free, monotonically increasing id's should only be used for things that
have regulatory compliance issues. For example, there shall be no gaps in
invoice id's, and voided ones shall be marked.

------
brandtg
Small project that may be of interest to those involved in this discussion:
[https://github.com/brandtg/alicia](https://github.com/brandtg/alicia)

Kind of "eventually monotonic" distributed auto-increment keys. Your mileage
may vary.

------
campbellsoup
Could someone point out to me how filtering and selecting records over some
parts of the primary key (ex {sensorID} over
{deviceID}_{sensorID}_{timestamp}) is better than just having regular indexed
fields (ex {sensorID}) containing those pieces of information?

~~~
couchand
I believe that technique is intended for nosql document storage systems that
otherwise don't have indexed fields.

------
TheLoneWolfling
Slightly off-topic:

With JS disabled, the page header is white text on a transparent background.

With JS enabled, I see no way to remove the header. I have little enough
vertical screen real estate, adding an extra chunk of stuff on the top doesn't
help. I ended up removing it entirely.

