Versions make sense in a lot of areas where stability is needed. But they should be seen as issuing binding contracts to your users. You should spend a lot of time thinking about what the terms and conditions are before doing so.
> With each change, we explicitly assign a distinct version to serializers.
> We do this independent of source code or build versioning. We also store the serializer version with the serialized data or in the metadata. Older serializer versions continue to function in the new software. We find it’s usually helpful to emit a metric for the version of data written or read. It provides operators with visibility and troubleshooting information if there are errors. All of this applies to RPC and API versions, too.
Backwards compatibility and the ability to roll out new changes (that can co-exist with current data) are the primary drivers.
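To make the quoted practice concrete, here is a minimal sketch of a versioned serializer. The envelope shape (`v` and `body`), the field names, and the `read_counts` dictionary standing in for a metrics client are all made up for illustration; the quoted text doesn't prescribe any of them.

```python
import json

# Hypothetical format: every payload carries the serializer version that wrote it,
# independent of the software's own build version.
CURRENT_VERSION = 2


def _decode_v1(body):
    # v1 stored a single "name" field; split it so callers always see the v2 shape.
    first, _, last = body["name"].partition(" ")
    return {"first_name": first, "last_name": last}


def _decode_v2(body):
    return {"first_name": body["first_name"], "last_name": body["last_name"]}


# Old decoders stay registered so data written by older software remains readable.
DECODERS = {1: _decode_v1, 2: _decode_v2}

# Stand-in for a real metrics client: counts reads per serializer version.
read_counts = {}


def serialize(user):
    return json.dumps({"v": CURRENT_VERSION, "body": user})


def deserialize(raw):
    envelope = json.loads(raw)
    version = envelope["v"]
    read_counts[version] = read_counts.get(version, 0) + 1
    if version not in DECODERS:
        raise ValueError(f"unknown serializer version: {version}")
    return DECODERS[version](envelope["body"])
```

The point of the registry is that adding version 3 means adding one decoder and bumping `CURRENT_VERSION`; nothing already on disk has to be rewritten.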
See a prior book I wrote on HN regarding this: https://news.ycombinator.com/item?id=25309248
The current domain model I am working with utilizes a global integer sequence to key all entities. This implicitly eliminates the class of bugs where the same keys of different types overlap and would otherwise mask exceptions. It also enables powerful domain modeling techniques in which the identities of things are themselves to be thought of as first class entities and referred to as a common class of thing. This is a little mind-bending at first, but it enables some really powerful abstractions that would otherwise be infeasible if we had to switch over all possible types of keys.
The benefits of an integer key vs a guid key are quite profound when you get into the academics of information theory. They provide implicit creation order of things, whereas GUIDs cannot. They are deterministic in that there will never be a collision. Their range can be made to be infinite. Integers are perfectly efficient, even if the computer representation isn't necessarily so - BigInteger types scale gracefully.
We had this requirement, but a semi-UUID solution is desirable in a distributed setting. Ref prior art by Instagram engineering: https://archive.is/Dydln
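For anyone who doesn't want to click through, a rough sketch of that kind of ID. The bit layout (41 bits of milliseconds, 13 bits of shard ID, 10 bits of sequence) is as I recall it from the linked post; the epoch constant and function names are my own placeholders, and the real thing runs inside the database rather than in application code.

```python
import time

# Assumed layout: 41 bits of milliseconds since a custom epoch,
# 13 bits of shard ID, 10 bits of per-shard sequence.
CUSTOM_EPOCH_MS = 1_300_000_000_000  # hypothetical epoch, pick your own
SHARD_BITS = 13
SEQ_BITS = 10


def make_id(shard_id, sequence, now_ms=None):
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    elapsed = now_ms - CUSTOM_EPOCH_MS
    # Time goes in the high bits, so IDs sort roughly by creation time.
    return (
        (elapsed << (SHARD_BITS + SEQ_BITS))
        | (shard_id << SEQ_BITS)
        | (sequence % (1 << SEQ_BITS))
    )


def split_id(value):
    sequence = value & ((1 << SEQ_BITS) - 1)
    shard_id = (value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed = value >> (SHARD_BITS + SEQ_BITS)
    return elapsed + CUSTOM_EPOCH_MS, shard_id, sequence
```

The result still fits in a signed 64-bit column, which is the main attraction over a 128-bit UUID.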
(from your linked comment)
> If you are worried about security (i.e. someone hitting sequential keys in your URLs), then this is arguably an application problem. You should probably generate an additional column that stores some more obfuscated representation of the primary key and index it separately.
Yep, see: https://github.com/ai/nanoid and https://hashids.org/
And the odds of a GUID collision are extremely low, which is acceptable for most applications. Having worked with petabytes of data, I've found GUID performance isn't really an issue, as there are more important factors to worry about.
At which specific integer does the scaling start to slow down?
UUIDs can be generated on many machines with no awareness of each other and merged later.
If you know beforehand the maximum number of participants in your system, you can divide the keyspace across that quantity. If you are using BigInteger or equivalent, you have an infinite number of these things to work with, so it doesn't really matter if you wind up skipping trillions of identities at first. The original article even advocates for this as its first point, but without as much practical justification.
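One way to divide the keyspace, sketched with striding rather than contiguous ranges (the function name and the choice of striding are mine, not from the article): participant k of N hands out k, k + N, k + 2N, and so on.

```python
from itertools import count, islice


def id_stream(participant, total_participants):
    """Yield participant, participant + N, participant + 2N, ... with no overlap between peers."""
    if not 0 <= participant < total_participants:
        raise ValueError("participant index out of range")
    return count(start=participant, step=total_participants)


# Three participants, each generating IDs without talking to the others.
node0 = list(islice(id_stream(0, 3), 4))
node1 = list(islice(id_stream(1, 3), 4))
print(node0, node1)
```

The catch is the one stated above: the participant count has to be fixed up front, since changing N later changes which IDs belong to whom.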
A UUID is just a 128-bit integer, with creation algorithms designed to partition that space by things that already have enough entropy to need no further synchronization.
What you're proposing sounds like roll-your-own-UUID, which might be similarly inadvisable as roll-your-own-crypto.
It's not about the stakes. It's the idea that in rolling your own, you're going to get it wrong, or otherwise do worse than existing ways that have already solved the same problem.
UUIDs are not always a good solution.
If you order things by id (often useful for pagination, since incrementing ids are usually ordered by creation date), UUIDs are of no help.
You would need to sort by created_at and this would require an additional index.
Not to speak of pagination by id ranges or whatever is used sometimes.
UUIDs are impossible to remember and thus somewhat cumbersome to use too.
The weird id mixed up with index problem from the article is something which never happened to me ever in over 10 years of web development.
I stay away from PHP as much as possible so that may be the reason ;)
The use of ordering by a surrogate key is to have a stable ordering with no semantic meaning, so UUIDs work just as well as IDs there.
> You would need to sort by created_at and this would require an additional index.
Yes, if you want to sort by a semantically important data element in a table with surrogate primary key, you'll probably want an index on the data element. So?
> Not to speak of pagination by id ranges or whatever is used sometimes.
Assuming a serial column is creation-ordered is not a good idea, but it's usually not too wrong; assuming it's dense, as well, is going to be wrong more often than it is right, except where data is never deleted. Assuming it is dense over the range of a query is even less reliable.
Paginating by serial ID ranges is, almost without exception, a horrible idea.
Yes, you should be sorting things by created_at. Having another index is not a bad thing.
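A small SQLite sketch of the difference, with a made-up `posts` table: once rows are deleted, an ID-range "page" comes back short, while keyset pagination over an indexed `(created_at, id)` pair still returns full pages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)")
# The "additional index" being argued about; id is included as a tie-breaker.
conn.execute("CREATE INDEX posts_created_at ON posts (created_at, id)")
conn.executemany(
    "INSERT INTO posts (id, created_at) VALUES (?, ?)",
    [(i, f"2021-01-{i:02d}") for i in range(1, 11)],
)
conn.execute("DELETE FROM posts WHERE id IN (2, 3, 4)")  # deletions leave gaps


def page_by_id_range(start, size):
    # Assumes ids are dense: returns whatever survives in [start, start + size).
    return conn.execute(
        "SELECT id FROM posts WHERE id >= ? AND id < ? ORDER BY id",
        (start, start + size),
    ).fetchall()


def page_by_keyset(after, size):
    # `after` is the (created_at, id) of the last row already seen, or None.
    if after is None:
        return conn.execute(
            "SELECT created_at, id FROM posts ORDER BY created_at, id LIMIT ?",
            (size,),
        ).fetchall()
    created_at, last_id = after
    return conn.execute(
        "SELECT created_at, id FROM posts "
        "WHERE created_at > ? OR (created_at = ? AND id > ?) "
        "ORDER BY created_at, id LIMIT ?",
        (created_at, created_at, last_id, size),
    ).fetchall()
```

Here `page_by_id_range(1, 5)` returns only two rows, whereas `page_by_keyset(None, 5)` returns five and its last row can be passed back in to fetch the next page.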
> The weird id mixed up with index problem from the article is something which never happened to me ever in over 10 years of web development.
How do you know? It's a remarkably difficult bug to detect.
All of this points to using serial numeric IDs as both a unique key and the clustering key. UUIDs do nothing to help with any of those requirements, and generally hurt all of them. They do that because of certain characteristics they have that are specifically there in order to solve one of the few problems we didn't have to worry about.
Tangentially, in software systems design, I've long since realized that the word "always", when left to roam around freely, unchaperoned by any qualifiers to limit its universality, is an indicator of limited breadth of experience. So, when you encounter it, it's useful to mentally insert "in my experience" as a stand-in qualifier. Having done so, the next conundrum is that "in my experience" advice is only actionable to the extent that you know what experience the advice giver has to draw on.
You are very wrong and wildly misunderstanding the problem that UUIDs solve.
Personally I've never experienced the "whole class of bugs" that starting with a big integer is supposed to solve. I'm not using PHP so maybe that's why?
However if you want any sort of efficient lookup on the external key (UUID), your database still needs an index on the UUID, and you are back at square one.
I choose to forgo the integer PK and just use the UUID, since I have to create an index over it anyway.
Yes and no. It depends on whether your database treats primary keys differently than other indexes. For example, in InnoDB primary keys are always clustered indexes: the row data is directly stored in a btree arranged by the PK; secondary indexes just store PK values in their leaf nodes, so that they can do a lookup on the clustered index.
As a result, in InnoDB smaller PKs are preferable. So performance is generally better when using an incremental ID as PK and then UUID as a secondary index, as opposed to the reverse. Assuming you have multiple secondary indexes, the total table size will also be smaller in InnoDB with integer PK than with UUID PK.
Well. Many systems already do a form of this. They store a session ID (the browser cookie) in something like redis which maps to an internal database id which is often incremented. The internal database ids are never seen outside the DB.
In this case it's fine because the external IDs are ephemeral (relatively) and centralization is a hard criterion (you typically can't have two people creating an account with the same user name, or having one email address linked to multiple record IDs, etc.).
This is really why these discussions are pointless without specifics and a concrete system.
That is, unless you decide to put the same interface into your database and your API, which is not rare for OOM-only programmers to do, but always ends in tears.
A common disadvantage of integer keys is that programs will have bugs and use a foo_id as a bar_id. If most ids are small integers then it is likely that a valid foo_id may be a valid bar_id, whereas uuids probably won’t collide. This can be somewhat mitigated with a sufficiently strong type system. Even in a dynamic language like lisp you can represent your ids as e.g. (foo . <id>), and only need to get the tags right on the boundary.
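The tagging idea carries over to most languages. A minimal Python sketch, with `FooId`, `BarId`, and `load_bar` as invented names: wrapping the raw integer in a small type makes a mixed-up ID fail loudly instead of quietly matching the wrong row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FooId:
    value: int


@dataclass(frozen=True)
class BarId:
    value: int


# Hypothetical storage keyed by the tagged ID rather than the bare integer.
BARS = {BarId(1): "first bar"}


def load_bar(bar_id):
    # Checked at the boundary; a static type checker can catch the same mistake earlier.
    if not isinstance(bar_id, BarId):
        raise TypeError(f"expected BarId, got {type(bar_id).__name__}")
    return BARS[bar_id]
```

`FooId(1)` and `BarId(1)` wrap the same integer but compare unequal, so `load_bar(FooId(1))` raises instead of returning "first bar".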
An advantage of integer ids is density: if your ids are likely close together, there are probably some better or more compressed data structures you can use.
Assuming you can stomach the latency hit, the best solution is usually a lookup table where you take all the friendly URL keys and map them to internal identifiers. So if you're making a multiplayer game site and want to create a page where folks can find their friends, then you might support yourgame.net/user/username, yourgame.net/character/charactername, yourgame.net/steam/steamlogin, yourgame.net/xbox/xboxgamertag, etc. Internally you have an inverted index that maps [type, string] to the internal ID for the player, then proceed normally.
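In its simplest form that inverted index is just a map keyed by the pair. A sketch with invented data (the player IDs and names below are placeholders, and a real site would back this with a database table rather than a dict):

```python
# Hypothetical inverted index: (key type, external string) -> internal player ID.
EXTERNAL_KEYS = {
    ("user", "alice"): 1001,
    ("character", "Alyx the Bold"): 1001,
    ("steam", "alice_steam"): 1001,
    ("xbox", "AliceGT"): 1002,
}


def resolve(path):
    """Map a path such as '/steam/alice_steam' to an internal ID, or None if unknown."""
    parts = path.strip("/").split("/", 1)
    if len(parts) != 2:
        return None
    key_type, name = parts
    return EXTERNAL_KEYS.get((key_type, name))
```

Several external keys can point at the same internal ID, and the internal ID never appears in a URL.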
the "big integer" stops this happening for other integers that might be used in your code and that you might accidentally send to the database as a user_id.
Unless you might also use big integers in your code. "32 bits is big enough for all numbers-used-as-numbers-instead-of-ids" is... a risky assumption.
A UUID is only the size of two 64-bit integers anyway. And either identifier likely will be a tiny fraction of the data in a particular row. So I'd doubt this is any real performance problem in the vast majority of applications.
UUIDs look big and scary in hex notation, but underneath it's a compact and fast binary format, just a 128-bit integer.
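You can see this directly with Python's standard `uuid` module:

```python
import uuid

u = uuid.uuid4()
as_int = u.int  # the same value viewed as a plain integer

assert as_int.bit_length() <= 128
assert len(u.bytes) == 16           # 16 bytes = 128 bits
assert uuid.UUID(int=as_int) == u   # round-trips losslessly

print(u)       # the 36-character hex form people find scary
print(as_int)  # the integer underneath
```

The hex-and-dashes string is only a display format; whether your database stores it as 16 bytes or as 36 characters of text is a separate question worth checking.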
An ID should be:
1.) Actually an identifier, i.e. it needs to identify objects uniquely and must be immutable throughout the lifetime of the object.
2.) Assigned at object-creation time; you can't have an object without an ID (in persistent storage), lest you'll have no way to refer to that object later. This presents particular problems for a lot of workflows where you want to introduce the concept of a user early, before soliciting personal data. Think of storing browsing histories for guest users, or persisting shopping carts before checkout.
3.) Integral to the storage system. IDs will usually be the ways that you lookup and join objects; the performance characteristics of your database can influence the type of data you choose for an ID.
4.) Because of #2 & #3, oftentimes a significant concurrency bottleneck for object creation. This is the downfall of many auto-increment integer schemes.
5.) The foreign key for other objects. This has space implications that you often have to trade-off against latency implications. If you store useful information within the ID, you can use that info without needing to make a separate query or join against the DB. However, you need to carefully consider whether that'll run afoul with #1, and the more information you put in the ID, the bigger the size bloat for other objects that reference that ID.
6.) Oftentimes a security & PII risk. Because IDs are the primary means for lookup and are guaranteed to be unique, it's awfully tempting to put them in your public APIs (like HTTP URLs). But then anyone who has a URL has all the information included in the ID. If you use sequential IDs, they also have the ability to scan your entire database; this was the downfall of Parler.
Integer keys do really well for #1, #2, #3, and #5, but are very problematic for #4 and #6. But then, there are fairly easy workarounds for those: #4 is often solved by hashing a unique natural key of the object (also solving #6), while #6 can be solved by never exposing internal IDs to the outside world and instead using a lookup table on some friendly URL scheme, which is better for UX anyway (at the cost of #3). GUIDs do well for #1, #2, #4, and #6, but perform worse for #3 and #5.
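As a sketch of the "hash a unique natural key" workaround for #4: any writer can compute the ID locally, with no shared counter. The function name, the SHA-256 choice, and the 64-bit truncation are my assumptions for illustration, not a recommendation for any particular system.

```python
import hashlib


def id_from_natural_key(*parts):
    """Derive a 64-bit ID from a natural key, with no coordination between writers."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return int.from_bytes(h.digest()[:8], "big")


user_id = id_from_natural_key("user", "alice@example.com")
print(user_id)
```

Truncating to 64 bits trades collision resistance for a smaller foreign key, which is exactly the #5 trade-off; and the scheme only works if the natural key really is immutable, or it runs afoul of #1.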
Natural keys (where you make the identifier some combination of the actual data) are also frequently underrated. The most obvious use for these are relation tables, where each row just indicates that two entities have the relationship that the table describes; you wouldn't normally put an auto-increment ID on that. But think also of something like a search refinement: the most natural key for that is [query, language] => [list of suggestions], and that uniquely identifies each set of refinements, and it has other useful properties where the rest of your search engine doesn't need to know that refinements exist (it already has the query), and you don't need a lookup call from query to some search_refinement_id or vice versa, and you can easily enumerate the set of languages that a given query has data for, and you can follow a chain of refinements without any intermediate lookups. If you were to suggest either integers or UUIDs for this problem I'd say that you're overengineering. And if the problem domain changes such that the key-set changes (for example, you want to include past search context, or you want to personalize refinements to each user) then I would recommend you start a different system from scratch and eventually replace or merge in the current one, because those problems have sufficiently different requirements that your whole data pipelines are going to be different. (In particular, personalized search requires PII handling, a lot of care in logging, encryption of the data when at rest, knowledge of a lot more entities in the system, etc.)
This is where an attacker is able to leak information from your system just by guessing ids. If you used a sequential id then you can cycle from 1 to X and probably easily find which are valid users. You can then likely see how many valid users/ids there are and potentially pull their data if the authentication has been implemented incorrectly.
By using a UUID, the key space is so large that you cannot reasonably guess user ids. So you can't use those same brute force techniques, and it makes extracting data much harder.
Also, and this is the main one for me - if I mess up my code and accidentally use a document_id instead of a user_id, I'll just get "not found" instead of someone else's data.
And yes, every DB backend has solved collision. But as I said, it's useful to be able to generate IDs in code sometimes. This is possible with UUIDs and not possible with integers.
UUID is a loosely defined concept. There are many implementations. Some of them have no chance of collision at all, some have an astronomically small chance, and some have very real odds that you'll receive a call at 3AM during the new year's celebration.
In the types of systems that need UUIDs there is probably no easy way to check for collisions. The prospect of mystery data corruption with no ability to trace it down frightens the hell out of me.
The only reason that issue was reported is because someone was actually doing the collision checking. That's not going to be the norm in UUID systems. Think about it.
It's a common bug to get different ID types mixed up, and a gigantic offset will do nothing to help you with that.
In Ada it's standard practice. In C++ you really need a library to do it easily, but there's a good one out there ready to go:
They’d missed the part that UUID_SHORT() returns an unsigned integer and created the column as the default signed integer. MySQL uses an algorithm where the top n bits are based on the server ID, which worked on the old server which had id=1 and never returned a number where the first bit was 1. The new cluster fortunately always did so the problem was immediately identified, but it was confused by one of those bonus MySQL data-destruction features – the way it silently truncated data meant that it was silently truncating new IDs to the same value but the logged value wasn’t in the database at all.
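A toy model of that failure, assuming the column was a signed BIGINT and the server was in non-strict mode, where out-of-range values are clipped to the column's limit instead of being rejected. The generated values below are invented; only the clamping behaviour is the point.

```python
SIGNED_BIGINT_MAX = 2**63 - 1


def store_in_signed_bigint(value):
    """Model of non-strict clamping: values above the column maximum become the maximum."""
    return min(value, SIGNED_BIGINT_MAX)


# Hypothetical unsigned 64-bit IDs whose top bit is set.
generated = [2**63 + 1, 2**63 + 2, 2**63 + 3]
stored = [store_in_signed_bigint(v) for v in generated]

print(stored)  # three distinct IDs collapse into one stored value
```

Every new ID lands on the same stored value, and none of the values the application logged exist in the table, which matches the symptoms described.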
See for example https://pypi.org/project/shortuuid/
That would break with any webserver that is serving files from a case-sensitive filesystem. Which is most of them.
There's a reason why Apple got away with only showing the hostname of a URL: It doesn't matter to ordinary people. :)
In a recent project I arrived at, I started seeing 612 in random places in the code. It was, naturally, the ID of a very specific user. Having an easy ID to remember, it is a temptation for sloppy programmers to just hardcode certain checks against a particular ID instead of following proper procedures. It was a bad project and a bad team, and after some 10 years or so, the code was now flooded with 612 and a couple of other IDs.
Sure, you could avoid such a thing with code revisions and such, but then again it's better to simply not put the temptation in front of the programmers, isn't it?
In a process that joined two tables the match counts were off. Somehow, on one side of the join the id was being converted to an integer and then back to a string, which stripped the zero padding. Meaning that `00238974` was failing to match `238974`.
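The whole bug fits in a few lines. The width of 8 in `normalize` is an assumption taken from the example value; the second ID is made up.

```python
left_ids = ["00238974", "00000012"]

# The accidental round trip: string -> int -> string drops the leading zeros.
right_ids = [str(int(x)) for x in left_ids]

naive_matches = set(left_ids) & set(right_ids)


def normalize(key, width=8):
    """Re-pad to the fixed width so both sides of the join agree."""
    return key.zfill(width)


fixed_matches = {normalize(k) for k in left_ids} & {normalize(k) for k in right_ids}
print(naive_matches, fixed_matches)
```

Nothing errors out; the join just quietly matches fewer rows, which is why it showed up as a count discrepancy.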
Linux could learn something about filenames here.
Personally I would also disallow anything below 32 to avoid having filenames contain escape sequences.
I have absolutely no need for a filename that contains an escape character, and would see this as a major bug, like his description of SQL injection. Better to fail fast.
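A fail-fast check along those lines might look like this. The function name is mine, and the rule is just "nothing below code point 32, plus the `/` that filesystems reserve anyway"; a real policy would probably want more.

```python
def is_safe_filename(name):
    """Reject empty names, '/', and any control character below code point 32 (NUL included)."""
    if not name or "/" in name:
        return False
    return all(ord(ch) >= 32 for ch in name)


print(is_safe_filename("report.txt"))
print(is_safe_filename("evil\x1b[2Jname"))  # contains ESC (code point 27)
```

Rejecting at creation time means every tool downstream can stop worrying about newlines and escape sequences in `ls` output.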
It never hurts to make people believe you are bigger and better than you really are.
If you believe that having "black=bad, white=good" connotations built into our jargon is harmful in some way, then yes, you should strive to use alternate terms.
If you don't think that's harmful, and you're forceful in defending "black=bad, white=good" as being too established and too inconvenient to change, that's when you're likely to get pushback.
Personally I've been trying to use "deny-list" and "allow-list" (with partial success – changing your jargon is hard). They feel a bit clunky, but I suspect that clunkiness will fade with time.
Regardless, this article is from 2011, so it predates the discussion.
But I have another extension that just modifies the format of some pages to make them easier on my eyes. It took experimentation for me to realize that items on the whitelist are modified, while those on the blacklist are not. My intuition told me whitelists are for sites that are good the way they are, and sites that are hard on the eyes should be put on the blacklist.
It's not the worst mix-up in the world, and perhaps many smarter than myself would have no such confusion, but I think it could be avoided with words that mean what they do. There's plenty of room in the extension configuration to put "apply to these sites" or "transformed", as well as "allow original format" or "unchanged."
Yes, there was never some contract that we all signed, or something a majority of us formally voted on, that says "we're not supposed to call it a blacklist/whitelist anymore."
Just what some random groups decided and enforced at their own domains (companies, orgs, etc.).
It's also based on an American preoccupation with race issues, seeing everything through its own guilt-ridden history, concerns not relevant to other parts of the world (where code is written and English is also spoken, as first or second language for IT).
The connotations of black/white as terms have nothing to do with slavery or blacks. The term blacklist was first used (recorded) in an English theater play, as the list of the enemies of the king (black alluding to shady, dark motives, etc., not to skin color), and its common colloquial use in the 20th century was also not about blacks and had nothing to do with slavery: it was the list where employers put union members, strikers, etc. not to hire.
It's better for people in the US to concentrate on fixing actual racial issues (from incarceration rates and cop shootings, to school funding, redlining and loan access) than to play with words to pat themselves on the back.
People all over the world have used black/white to mean certain things (sometimes the inverse too, e.g. in some Asian cultures white is associated with death), and it has nothing to do with the US practicing slavery, segregation, and racism against blacks.
We used those terms with some connotations for centuries before blacks were sold as slaves to pick your cotton, even at times when slaves were whites working for other whites (as in Ancient Persia, Greece, the Roman Empire, feudal times, and so on).
I can imagine how companies will be taxed extra for this somewhere in the (probably not so near) future.
The longer you wait, the more difficult it is to solve.