> UUID primary keys? Horrible advice that's only applicable at Google/Facebook scale
UUID primary keys aren't needed for big keyspaces, they are needed when you need multiple processes (especially a flexible number of multiple processes) to generate unique IDs independently of each other and the central database, which will eventually get stored in the central database. This can be important at much smaller than Google/Facebook scale, depending on the use case.
Oh yes! We ran into precisely this problem when we had a number of machines that mostly ran independently but needed to communicate with a central server. Integer IDs don't work well for that, especially since MySQL reuses them in certain circumstances. I couldn't do anything about MySQL being on the machines, but used Postgres on the server, of course.
The second generation of machines, where I did have a say in things, used UUIDs and Postgres.
With 64-bit integers (in Postgres, "bigint") you can have over 9 quintillion rows before you run out of numbers.
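For context, the exact ceiling is easy to check (assuming Postgres's signed 64-bit "bigint"):

```python
# Postgres "bigint" is a signed 64-bit integer, so the maximum value is 2^63 - 1.
BIGINT_MAX = 2**63 - 1
print(BIGINT_MAX)  # 9223372036854775807, i.e. over 9.2 quintillion
```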
UUIDs are for when you have more than one database server that people can write to, and those databases must not share IDs. In many cases this is not a design requirement but a design after too little research.
Which is fine for circumstances where a database round trip is acceptable at that point; there are situations where you want to assign IDs to items without any database interaction at all.
You can also do the same in certain versions of SQL Server: change the ID field from auto-increment enabled to disabled, add and delete a few rows, then re-enable auto-increment. Now, normally the server should start new IDs at the last position. Normally being the operative word.
Am I the only one who thinks that primary keys should be derived from the actual data? That way it's impossible for two processes to accidentally create the same conceptual piece of data (which is still possible with uuids). It also makes it much easier to recover from situations when you have to quickly promote a slave to a master role without first verifying that the slave is up to date. The main bonus though is that the database is much more comprehensible, e.g. foreign keys are legible without having to join back to the primary table. The relational model for data is pretty cool and breaks down when e.g. UNION doesn't work if 2 rows differ only in an arbitrary integer primary key.
The problem is there's very few examples of real-world data that actually matches a primary key - i.e. is guaranteed to be unique and never changes. I've been burned by this so many times. In reality, everything needs to at least have some capability to change.
Unique and never changes can be two things concatenated, not one thing.
If that concatenation is too long you can ram anything thru a hash to get a constant-length, smoothly distributed key. It's also fun to use "weak" hashes for this because it trolls wannabe security types who don't understand the application.
Given the above, some important things concatenated and hashed works. Can always add an application ID or importer ID or process ID or a timestamp of high enough resolution.
Obviously needs are different if you're trying to create a bank user database vs deduplicating sampled engineering data.
So something that's unique is NOW() (assuming low enough sample rate LOL) and something that never changes is a laser serial number, concatenate those and ram thru a hash to make it small and fit.
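A minimal sketch of that scheme in Python (the serial number is a made-up example; hashing is just one way to get a fixed-length key out of the concatenation):

```python
import hashlib
import time

def make_key(serial_number: str) -> str:
    """Concatenate a high-resolution timestamp with a fixed device serial
    number, then hash to get a constant-length, evenly distributed key."""
    raw = f"{serial_number}:{time.time_ns()}"
    return hashlib.sha256(raw.encode()).hexdigest()

key = make_key("LASER-00423")  # hypothetical serial number
print(len(key))  # 64 hex characters, regardless of input length
```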
Concatenating helps you with uniqueness, but not with whether things change (if anything that's part of a composite key changes, it can no longer reliably be part of a key). And it's the changing part that's not realistic with natural keys.
Once you get into application ids / timestamps / whatever you're no longer really using a natural key in my mind, you're just making your own algorithm for a surrogate key.
> So something that's unique is NOW() (assuming low enough sample rate LOL) and something that never changes is a laser serial number, concatenate those and ram thru a hash to make it small and fit.
That sounds very similar to UUID version 1 except that you replaced MAC address with some other identifier and added a hashing step.
> Unique and never changes can be two things concatenated,
No, for a good primary key, the "unique" parts can be due to concatenation, but the whole key (and thus all the elements) needs to be unchanging. [0]
[0] Modern databases can actually deal with changeable PKs, though with a potentially serious performance hit, but in many use cases where data travels outside the database for some kind of interaction where the results need to get reentered into the database, this is still a problem, since you can't (in any general way) cascade updates to things (which may not even be online systems, e.g., paper records feeding human processes) outside of the DB.
> So something that's unique is NOW() (assuming low enough sample rate LOL)
We can all invent algorithms for nice unique values when we include the caveat "apart from the edge cases". It's those edge cases that bugger everything up.
"Hash of the data defining the entity identified" is how version 3 and 5 UUIDs work (plus a namespace). If you really want a weak hash use version 3 - it uses MD5.
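Python's standard library shows the difference directly (the example name is arbitrary):

```python
import uuid

# Version 5 (SHA-1) and version 3 (MD5) both hash a namespace plus a name,
# so the same input always yields the same UUID.
name = "laser-serial-00423"  # arbitrary example identifier
u5 = uuid.uuid5(uuid.NAMESPACE_DNS, name)
u3 = uuid.uuid3(uuid.NAMESPACE_DNS, name)
print(u5.version, u3.version)  # 5 3
print(u5 == uuid.uuid5(uuid.NAMESPACE_DNS, name))  # True: deterministic
```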
The problem is usually with the assumption that natural fields are unique, or that they don't change over time. One day you realize they can actually change, or they're not unique. Changing the data model at that point to support a different primary key can be very difficult.
> Am I the only one who thinks that primary keys should be derived from the actual data?
Where a valid natural key exists, it absolutely should be used. Surrogate keys should only be used where there isn't a natural attribute (or composite of such attributes) that corresponds to the unique identity of a tracked entity.
Okay, so you run a school and want to make a students database. What do you use as a PK?
1) Social security #? Fails when you have international students.
2) Last Name, First Name, Middle Name? Fails when you have a repeated name.
3) Last Name, First Name, Middle Name, Home Town, Start Year? I guess this works most of the time. Now you need to join to this table from the classes table, so you need all 5 columns duplicated in the classes table to do the join.
Simple integer primary keys "suck" in that you are adding bogus data to your database that has no intrinsic value... but man, they sure solve a LOT of problems. You need to think fairly hard before you get rid of them. In some cases you totally can. But making your default data model include an integer PK solves a LOT of problems.
I have actually done this for a classroom management app I made mostly for fun when I taught high school. At first I wanted to use the district-issued ID number as the primary key for each student. Except some students came and started classes, and needed attendance records, before the district got around to issuing them their official numbers. So I fell back to an auto-increment int as the primary key, with their "district_id" nullable, default null, but unique-constrained: just another field, no more special than their first_name.
The problem is that a lot of people end up screwing that up and picking something that doesn't work well as a primary key. Also, just using an integer is often a great way to get started without thinking about things so much. That works for a lot of web apps.
Version 3 and 5 UUIDs are derived from the data. They can be a good fit some places, but you have to be sure the only (pre-hash) collisions you'll have are the collisions you want.
Including timestamp may be appropriate for log data, where otherwise identical records differing by time should be considered different entities. Otherwise it loses the property asked for upthread that our primary key detects/eliminates duplication.
A special case where UUID keys are extremely helpful is when you have to convert your data from one system to another - customer "105" can be different in different databases, but customer "7C823BE3-CB5B-4E34-BE8D-4B5A71945D3F" is truly unique. This takes dealing with foreign keys from "nightmare" to "non-issue".
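A toy illustration of that migration problem (customer names invented; random v4 UUIDs stand in for keys, where collisions are astronomically unlikely rather than strictly impossible):

```python
import uuid

# Two systems that each assigned customer #105 independently:
system_a = {105: "Alice GmbH"}
system_b = {105: "Bob Ltd"}  # same integer ID, different customer!

# Naive merge on integer keys silently loses a customer:
clobbered = {**system_a, **system_b}
print(len(clobbered))  # 1 - "Alice GmbH" was overwritten

# With UUID keys the merge is a non-issue:
merged = {uuid.uuid4(): name
          for name in (*system_a.values(), *system_b.values())}
print(len(merged))  # 2 - both customers survive
```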
UUIDs absolutely have their place - both the problems of distributed ID generation (mostly in remote clients / apps) and avoiding information leakage (-> German Tank Problem) are very real world examples where using them is a sensible thing.
However, my point was that using them as a default choice has a lot of drawbacks. All over our data model, we've got like 60-70 entities with serial IDs and about 2 with UUID columns.
Also, don't underestimate how fast PostgreSQL can deal IDs out from a sequence via nextval(), should you really go down that route. But yes, if the database is more than a millisecond away, UUIDs definitely have their time and place.
Depending on your configuration and needs, generating primary keys in a way unique to each system might still yield better performance. For example, in MySQL you can use auto_increment_increment and auto_increment_offset to force one server to insert using even integers and the other odd, or any combination you need. It does eat bits from the key size, though. Changing it later can be a pain, but it's possible if you think it through. I don't doubt something like this is possible in Postgres as well.
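A sketch of the interleaving those two MySQL settings produce, simulated in Python (the real configuration would be auto_increment_increment=2 with offsets 1 and 2 on the respective servers):

```python
def id_sequence(offset: int, increment: int):
    """Mimic MySQL's auto_increment_offset / auto_increment_increment:
    each server hands out IDs from its own disjoint arithmetic sequence."""
    n = offset
    while True:
        yield n
        n += increment

server_1 = id_sequence(offset=1, increment=2)  # odd IDs: 1, 3, 5, ...
server_2 = id_sequence(offset=2, increment=2)  # even IDs: 2, 4, 6, ...
print([next(server_1) for _ in range(4)])  # [1, 3, 5, 7]
print([next(server_2) for _ in range(4)])  # [2, 4, 6, 8]
```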
Indeed. I recently built a todo app inspired by a particular paper-based methodology (FVP) to scratch a personal itch, and it was a web app that uses localStorage. It was much easier to use UUID keys for the todos than to store/retrieve the max ID every time the app was opened.
That's basically the 'hello world' of web apps, where UUIDs already make sense.
If you have a lot of data to synchronize, you probably still want some kind of 'version vector' thing, which in the simple case with a centralized server, is an integer 'change counter' (I forget the exact term for it) that will let you fetch only the updated/new items. You'd still use a UUID for your ID though.
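A minimal sketch of that 'change counter' idea (all class and method names here are made up for illustration):

```python
import uuid

class SyncStore:
    """Central store: every write bumps a global counter, so a client can
    ask for 'everything changed since counter N' instead of re-fetching all."""
    def __init__(self):
        self.counter = 0
        self.items = {}  # uuid -> (change_counter, payload)

    def put(self, item_id, payload):
        self.counter += 1
        self.items[item_id] = (self.counter, payload)

    def changes_since(self, last_seen):
        return {k: v for k, (c, v) in self.items.items() if c > last_seen}

store = SyncStore()
a, b = uuid.uuid4(), uuid.uuid4()  # items keep UUID IDs regardless
store.put(a, "buy milk")
seen = store.counter            # client syncs here
store.put(b, "write report")
print(store.changes_since(seen))  # only the second todo comes back
```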
Wouldn't filtering by a date field be good enough, generally?
Either way, once I realized that I use my own little todo app constantly, a server-based implementation became necessary. I'm working on a Horizon/RethinkDB version right now, which uses UUIDs by default. But if that were not the case I would've probably gone for simple incrementing IDs.
> UUID primary keys aren't needed for big keyspaces
Agreed. UUIDs are great for distributed/eventually consistent DBs. However, the primary reason the article gives to prefer UUIDs is specifically key exhaustion.
The main problem I have seen with UUIDs from a performance standpoint has to do with indexing and ordering.
To start with ordering: it's inherently more expensive for UUIDs, since comparing and sorting a 16-byte identifier costs more than doing the same with an integer, and randomly generated UUIDs don't arrive in any useful order.
This is also why UUIDs tend to increase fragmentation with indexes, since indexing can be so closely tied to ordering.
SQL Server has made some improvements here with the newsequentialid() function, so maybe it's less of a problem moving forward.
To contradict myself for a second, perhaps the move to more distributed systems and service models over the long term increases the need for the move from INTs to UUIDs. Better to architect for the future now and save yourself some pain down the road. It will be interesting to see how that plays out.
To be fair, most of my experience with UUID/GUID is with SQL Server, and in that case it's stored as a 16-byte binary.
Character string format: '6F9619FF-8B86-D011-B42D-00C04FC964FF'
Binary format: 0xff19966f868b11d0b42d00c04fc964ff
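Python makes both the size difference and the byte-order quirk easy to see: SQL Server stores the first three UUID fields little-endian, which is why the binary form above starts with ff19966f rather than 6f9619ff.

```python
import uuid

u = uuid.UUID("6F9619FF-8B86-D011-B42D-00C04FC964FF")
print(len(str(u)))       # 36 characters as text
print(len(u.bytes))      # 16 bytes in binary
# bytes_le matches SQL Server's mixed-endian on-disk layout:
print(u.bytes_le.hex())  # ff19966f868b11d0b42d00c04fc964ff
```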