I would strongly say strings are actually the _only_ type you should use for IDs. Prevents the vast majority of buggy client behaviour and gives you good flexibility to change how you do things over time.
For us, a big part of it was usability with cursor selection etc - in addition to it being immediately obvious what the ID was for.
Once we finally had that rolled out everywhere, we ended up collecting up every other ID we'd ever used and mapped it to its KSUID resource equivalent, so now all our IDs work standalone without type/context info, even across environments (and thankfully we'd never had any collisions on the old IDs)
---
Going back to the typing - the most difficult part of migrating our IDs actually was converting them all to string types. With Postgres this is a little slow but ultimately fine, but with Mongo you have to actually remove and reinsert every document - you cannot (or at least could not) update IDs in place.
Strings have their own issues. One place I worked had a bug where users could take over accounts because of missing/inconsistent unicode canonicalization. Case can be a problem, as can special characters.
There's something to be said for strings, though. Prefixed IDs that mark the type can be nice to work with when it's an otherwise opaque ID, but they're a pain to handle internally.
If you're storing third-party IDs, you probably want strings...unless their clever JSON API returns 1.3E6 as a number. Or the string "null;" that's always an adventure.
IDs should really be their own type: string operations don’t really make sense on them. If you use strings, there should probably be a constraint on the acceptable characters and lengths: e.g. only digits, exactly 10 or something like that.
I personally like the idea of using urls using a domain you control + a path that reflects the type of data it’s an ID for.
If the main benefit of prefixed IDs is debugging then a dev tool can be created that would just look up the ID in all tables and report what object it corresponds to.
> I would strongly say strings are actually the _only_ type you should use for IDs.
Just got that in the face at work. In a large custom integration we read orders, they have unique order numbers, nice 5 digit things. So in the database, they became integer primary keys, with lots of child tables.
Fast forward 5 years, customer switches ERP system and they ask "hey, the order numbers, you do support 10+ digits right?"
Changing the database is relatively easy, changing the code is a chore, but worst part will be going over all the queries and their parameters, especially in the reports.
I have an honest question. Why would anybody, ever, make an order number that only has 5 digits. Even if it's just a home-hobby project, the cost of changing to 7-10 digits is so small a d negligible that I can't see any reason for choosing anything lower. Like the 640k that was once "enough for the long term future" in DOS. I understand that hindsight is 2020, but can't wrap my head around not starting bigger when there's no extra cost (nearly).
Since 10 digits puts us in billions, you're really asking "why would someone do 'CREATE TABLE ... id INT AUTO_INCREMENT' when they could use a BIGINT?" These days, there's rarely a reason not to use a BIGINT, but I also have a little trouble faulting someone for thinking 2B would be enough when they're currently at 10k.
Please don't use BigInt. It is the same size (usually 16 bytes) as GUID/UUID in binary but has lots of drawbacks and no real upside other than they are easier to type in when they are still small.
1) The ids are clustered around the starting point so an invalid join will return data when it shouldn't. MIN_ID exists for almost every table/object type. With a well designed ID this shouldn't return any data.
2) BigInts are different sizes on different platforms. I've worked at a place 1 order of magnitude (base2) from overflowing JS Number and breaking most client code. This was due to bug that consuming a Seq ids quickly but there was no going back smaller.
3) Will often be coded client side as a Int and no one will notice until you grow large enough to overflow INT
Really, just avoid any sequential, numeric IDs and you will be good, IMO.
Bigint is a 64-but (8 byte) signed integer. It’s half the size of a UUID and using a sequence rather than random ids leads to significantly less WAL ok Postgres when full page writes are enabled.
First off, I had a brainlapse, their order numbers were 6 digits. I'm not entirely sure what they max was in their system, it was just that's where they were at in the series.
And they didn't have that many orders per year, less than 10k, as one order could be for say five containers of goods. So it was not like they'd exceed 9 digits in the foreseeable future.
Obvious answer: because they started at 1 and have had less than 100,000 orders.
I know of a company that had enough trouble with vendors after their PO numbers rolled over that they modified their EDI transmissions to add 1,000,000,000 to every order number.
> The GUID vs Int problem is incidental to the real problem: poorly formatting logs.
This.
The author's arguments boil down to "UUIDs are bad because the way I generate my logs is an unusable mess".
The ints vs UUIDs problem is also specious because a UUID is nothing more than a 16 byte int that's generated in a clever way. It might typically be presented in a particular string representation but the text string is not the value of the UUID.
The Python package `shortuuid` makes working with UUIDs a little easier by encoding them as strings:
https://github.com/skorokithakis/shortuuid#usage
(uses base-57 encoding, with alphabet consisting of A-Za-z0-9 with potentially confusing symbols skipped)
The string representation is what you show to users, but under the hood it's still a UUID and compatible/interoperable with any other system that needs UUID-shaped identifiers.
The coolest part is youcan even truncate the string encodings to get shorter IDs, which correspond to UUIDs with lots of leading zeros.
> The Python package `shortuuid` makes working with UUIDs a little easier by encoding them as strings:
This. In my opinion the only issue with UUIDs is that their standard textual representation can be very verbose and cumbersome in some scenarios.
A single ID takes over around 40 string characters in a line. That's half the width of a default terminal's width.
Thus the solution is obviously to use other textual representations. Instead of base16 then let's ramp up the base to shorten up the text dump, and while we are at it let's pick a readable format.
GUID v6 is pretty nice when you need monotonously increasing numbers that are globally unique.
On another note, I worked somewhere that prefixed GUIDs with the environment the app was running on. All of production, staging, even dev machines all used the same connection string.
There was even a stored procedure to copy user accounts, etc of prod for your machine. It was hands-down the best debugging experience when a customer had an issue.
> Sometimes, we want a zero inconsistency approach to storing objects, so it might make sense to make the identifier (or part of it) the checksum of the content that is to be stored. This guarantees that the underlying content has not been modified.
No, it does not. I come across this false statement again and again. It seems that a lot of developers do not understand what checksums or hashes guarantee and what they absolutely do not guarantee.
Let's set this once and for all:
1. Differing checksums or hashes guarantee that the content is different.
2. Identical checksums or hashes do not guarantee anything. The content could be identical or not.
The main point of the article is that even using a UUID for database objects, when other functions in the backend code query multiple classes of objects, those function signatures are poorly written, ex:
ban(id: uuid):
pass
There's no way for the reader to know what object the ID is referring to, and also the wrong object ID can be used in the function. Their solution is to wrap the ID in a type, which is correct, but did not continue writing that the function signatures in the backend code should be updated as well. Something like this should close the loop in the blog:
ban(id: UserId):
pass
Where the compiler can now check that a Team.id is not being passed in the method, only a User.id can be passed into the method.
The article mentions KSUID which has a very similar spec, but with a wider random segment.
For comparison: KSUID is 160-bit wide versus ULID sticks to 128-bit wide (which is somewhat more compatible with UUID-like database column types, for instance, so long as they don't try things like UUID version checks). KSUID uses Base62 encoding versus ULID uses a much simpler to (lexicographicly) sort Base32 encoding.
(I've found ULID useful in some of my own projects.)
This is one thing I like about type systems where you can declare a primitive type as MyImportantThing. This ensures the string or what have you is explicitly defined as MyImportantThing. Rust does this pretty well. C/C++ AFAIK will let you pass in the raw string, and so will Java if you "extend String".
I tried many things for making sync work across devices. I tried GUIDs, and partitioning ranges of ints, and several versions of it.
But what worked amazing?
Use NATURAL keys (or their hash) + version field. That is all you need most of the cases. It make sync far easier, easier to trace stuff (thanks to version), immune to problems of timestamps (some computers have their cloks wrong). In short:
struct Order {
code: String, //natural key
version:usize
}
struct Location {
code: Hash //hash of city + country
city:String,
country:String,
version:usize
}
Natural keys are global if well defined. In some places where it is not obvious, hashing the whole row and put a nice encode is the same.
This also will reveal when something TRULY need a guid or similar. For example, for invoices in my country the law demand partition of ranges with certain characteristics (ie: INV-1-XXX in machine 1, INV-2-XXX in machine 2).
Add another id:i64 become redundant most of the time. If your Order.code is duplicated or whatever it will be the same problem with or without an extra id:i64, so is better to deal with the problems of the ACTUAL data when is need and not mask it with other stuff.
The downside is that the key become repeated in JOINS (like in InvoiceLine) but honestly all rdbms handle triggers, and it actually become very nice to see the Order.code in the child relations (far easier to correlate).
Isn't the obvious difficulty here that such natural keys tend to change? Eg names and boundaries of cities & countries change - relatively - all the time. You don't want your IDs to be changing alongside
I work for ERP/eCommerce, so this issue happens. But for real, I need to store historical facts anyway ie: In a invoice/order I must store the ship info at the time of the transaction.
A artificial ID is useless in a lot of cases where "and if the key change, what?" because if that is an issue, is mostly because the data must retain history, and then, you need to store that anyway.
The best answer is the humble integer. The only reasonable arguments I have ever seen against using integer keys universally are as follows:
#1 Integer keys have finite range.
#2 Integer keys betray the identity of other sensitive resources when exposed as a public identity.
#3 Integer keys are "difficult" to sequence in the face of multiple networked participants.
My resolutions and counter-arguments are as follows:
For many systems, #1 is not a concern, because the number of expected entities is well-bounded by a 64 bit integer. For others, #1 can be resolved by usage of more complex types such as BigInteger (C#). If utilized carefully, these can be treated just like normal integers, and quickly converted to/from byte arrays of appropriate length to satisfy the required range. In virtually all SQL implementations, blob columns containing these values can be indexed with the exact same semantics as with a 64-bit integer column. Whether this performs better or worse than GUID keys probably depends on if you can provoke a >120 bit BigInteger representation. This is quite unlikely, even for Google.
#2 is trivially solved by simply applying encryption to sensitive keys as they traverse the boundary between your system and the outside world. AES256 would do the trick here. You could also generate entirely separate keys of any appropriate type for public consumption (i.e. maybe some YT-style identifier format).
#3 is solved by anticipating the maximum possible # of nodes in your system, and then producing a key space in which identities are sharded out by a simple constant factor of that max quantity. This would certainly produce concern regarding all of the skipped identities (assuming you start with a small number of hosts on day 1), but the proposed resolution above for #1 (BigInteger) alleviates these concerns with a practically infinite range of keys. Skipping 10k identities is a non-event when you have all of infinity to pull from.
There are also other considerations with this. GUID keys are a pain to communicate. Integers, even of massive range, are easy for most humans to communicate verbally when appropriate digit grouping and other reasonable measures are undertaken.
Also consider a situation in which you decide to use 1 global integer range to key every single entity in your system. This allows for interesting database structures in which foreign keys are all referring to the same keyspace, so the specific type of a thing is no longer a hard constraint in a relational sense. Some would probably take substantial offense to this proposal, but I have found in many cases this allows for powerful optimizations. Anything can be used irresponsibly.
> Integers, even of massive range, are easy for most humans to communicate verbally when appropriate digit grouping and other reasonable measures are undertaken.
Presumably this is after not taking your own advice of performing AES256 encryption on the key before sending it to the user?
Your three easy steps together seem a lot more complex than just using guids.
Not all communications around keys are of a sensitive nature. Virtually daily, we will communicate keys internally in order to clarify details regarding some activity. Applying AES to keys internally would only serve to hinder our operations & support efforts.
How often are you verbally communicating primary keys/id's internally within an organization?
I've never understood arguments against guid's, except for table performance with databases that might take a performance hit because of either inserts, or because data won't necessarily be partitioned in the "correct" order (e.g. MySQL writes to buffer pools).
...and then you proceed by suggesting ID code generation methods that replicate UUIDs.
It sounds like you are trying to criticize UUIDs while being totally unaware of what UUIDs are.
I mean, UUIDs are nothing more than integers that are generated following one of half a dozen different methods so that they can be probably unique without relying on a central authority.
One of the UUID versions is a pure random number. Another is a MAC address concatenated with a timestamp complemented with a local counter. There are literally two UUID methods that consist of hashing data.
Your supposed solutions to the alleged problem posed by UUIDs is a reinvention of what UUIDs have always been. But UUIDs have been though through and are standard and ubiquitous.
Absolutely everything in this article is true, and I wish the folks who wrote stuff on Azure would read this... But if you actually need an opaque identifier for an arbitrary resource, what actually works better?
I also tried to use custom types for IDs, but in my opinion, it has more cons than pros. I have to write custom serializers and model binders for them, explicitly convert them to other types, write separate validators etc... At the end I finished with even more bugs than before.
I found serialization of typesafe IDs trivial in Kotlin with Jackson. We have a single generic supertype that has the correct annotation to tell Jackson to use the `value` field as the serialized value, and defines helpful methods like equals. Each new ID type is defined as a single line of code, simply inheriting from this base ID type and providing the actual underlying type (int, string, uuid).
Validators likewise are a non-issue. Jackson handles that automatically during the underlying type conversion, and Java won't allow you to construct a UUID that is invalid.
I see explicit conversion between types as a pro, not a con. If I'm going to take one ID and try to use it as an ID for a different type of entity, I'd better have a very good reason.
Again, I did this with a specific set of tools, but the functionality I used should be available in pretty much any language and serialization framework.
I agree, the author is mixing up presentation with the data type itself, and then pinning the blame on the poor presentation on the way he chose to present the data.
If the author's problem is not being able to grep logs easily with UUIDs because they miss context info then the solution to his problem is to output logs that are searchable by providing additional context.
His arguments regarding searchability are also specious. Take for example
> For example, a task ID would look like: `job-123-task-1`. This also helps in ad-hoc database queries to find relevant rows without complex JOINs.
Well, if he's already running SQL queries then running an exact match on two attributes is far easier and elegant and less error prone than running a string pattern search on a derived attribute.
Typescript doesn't have functionality for non-equivalent types with identical primitives or interfaces (e.g. you can't have string-equivalent type A ≠ string-equivalent type B), though there's ongoing discussion around adding it (https://github.com/Microsoft/TypeScript/issues/202).
---
My company ended up with a simple KSUID implementation of our own - https://www.cuvva.com/product-updates/showing-off-our-fancy-... (having originally used UUIDs and Mongo ObjectIDs)
For us, a big part of it was usability with cursor selection etc - in addition to it being immediately obvious what the ID was for.
Once we finally had that rolled out everywhere, we ended up collecting up every other ID we'd ever used and mapped it to its KSUID resource equivalent, so now all our IDs work standalone without type/context info, even across environments (and thankfully we'd never had any collisions on the old IDs)
---
Going back to the typing - the most difficult part of migrating our IDs actually was converting them all to string types. With Postgres this is a little slow but ultimately fine, but with Mongo you have to actually remove and reinsert every document - you cannot (or at least could not) update IDs in place.