You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.
Identity columns don't, and avoid the issue altogether.
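For illustration (table names are made up), the two forms look like this; with SERIAL the hidden sequence is a separate object whose ownership and permissions you have to manage, while an identity column keeps the sequence managed as part of the column:

    -- old style: implicitly creates a separate sequence (accounts_serial_id_seq)
    -- owned by whoever ran the DDL
    CREATE TABLE accounts_serial (
        id serial PRIMARY KEY
    );

    -- standard identity column (Postgres 10+): the sequence is managed with the column,
    -- so dumps, restores and ownership changes don't leave a stray sequence behind
    CREATE TABLE accounts_identity (
        id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    );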
I'll stop short of giving a recommendation or using the word "should", but I'll give encouragement to consider using UUIDs for keys. I have used them in several systems, have never had any issues with them, and they solve so many problems. The ability to generate a key on the client, on the server, or in the database is great, for one. And the fact that keys are unique not only within a table but across the system (or many systems) is a huge advantage.
- Avoids people being able to just iterate through records, or to discover roughly how many records of a thing you have
- Allows you to generate the key before the row is saved
I think I default to auto-increment IDs due to:
- Familiarity bias
- They have a temporal aspect to them (i.e., I know the row with ID 225 was created before the row with ID 392, and approximately when each was created)
- Easier to read (when you have fewer than a million or so rows in a table)
I agree, and I think you're right that UUIDs are probably a better default.
Though you can never find a "definitive" guide or rule online or in the docs, unfortunately.
I reviewed existing UUIDv7 implementations, and many were incorrect or had subtle timing bugs.
We ended up implementing UUIDv7 in our ID generation library https://github.com/MatrixAI/js-id. And we have a number of tests ensuring that it is truly monotonic even across process restarts.
Came here to also hype ULID. It's just like an INT, but 128-bit, with embedded timestamps, implementations in every language, and moving data around is trivial because no ID ever collides with anything else, ever. Makes finding crap in my logs easier. (But UUID has that feature too.) It's the sorting that does it for me.
Something I always liked but have never done -- start auto-increment IDs at some number other than 1. I worked on a system where account IDs started with 1000000, invoice IDs started with 2000000, etc. That way, if you saw a random number lying around like "1001234", you knew it was an account ID. "2000123", an invoice! I don't remember how often it helped things, but it was more than 0 times, and it always "felt" good. I never loved the implicit upper bound, but it was also never a problem.
(And remember, the values are in different tables, so the only disadvantage is that your millionth user has ID 2000000. To the trained eye that looks like an invoice; but there is no way for the system itself to treat it as an invoice, so it's only confusing to humans. If you use auto-increment keys that start at 1, you have the same problem: Account 1 and Invoice 1 are obviously different things.)
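In Postgres that's just a START value on the identity column (or on the underlying sequence); table names here are only illustrative:

    CREATE TABLE accounts (
        id bigint GENERATED ALWAYS AS IDENTITY (START WITH 1000000) PRIMARY KEY
    );

    CREATE TABLE invoices (
        id bigint GENERATED ALWAYS AS IDENTITY (START WITH 2000000) PRIMARY KEY
    );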
> but there is no way for the system itself to treat that as an invoice, so it's only confusing to humans.
I agree in principle, but then you have Windows skipping version 9 because of all of the `version.hasPrefix("9")` out in the world that were trying to cleverly handle both 95 and 98 at once. If a feature of data is exploitable, there's a strong chance it will be exploited.
I use this for internal items, as in: real customer IDs start at 10,000, so we have 9,999 possible users/invoices/whatever for our own usage in the database. That makes it easy to filter out your sales employees' demo accounts, or production test accounts, in queries: just filter on the offset.
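Concretely, that could look like the following sketch (names and cutoff illustrative); BY DEFAULT lets you insert the internal accounts with explicit low IDs:

    CREATE TABLE customers (
        id   bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 10000) PRIMARY KEY,
        name text NOT NULL
    );

    -- internal/demo accounts go into the reserved range explicitly
    INSERT INTO customers (id, name) VALUES (1, 'sales demo account');

    -- real-customer reporting just filters on the cutoff
    SELECT count(*) FROM customers WHERE id >= 10000;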
Re: the upper bound: if you reach a million customers, you have lots of other nice problems, like how to spend all the money they earn you :-)
I worked on an existing database once that stored geographic locations and used the sign bit for namespacing.
Positive IDs were home-grown, negative ones were Unique Feature Identifiers from some old GNS system import (from some ancient US Government/Military dataset).
Then when something had to be edited/replaced/created, it would possibly change a previously negative ID to positive. Fun times.
I’ve seen systems that had to track only two types of things - one started at half the value and went up; the other started just below that and went down. Think 500000+ vs 499999-
I prefer UUIDs as well but one other benefit of an integer type key is in the index. UUIDs, especially v4, make for bloated indexes. If there’s a very specific performance reason I’ll use int, otherwise uuid.
One of the hidden pros is that UUIDs have no gaps visible to a product owner or manager. More than once in my life I have had to convince people that a gap in a sequence caused by a transaction rollback is OK. Several times we ended up "fixing the data" in the DB manually.
This is what I do as well. Most schema structure is on tables that are not very large, and it's nice, especially when dealing with multiple environments, for preventing errors, or for producing schemas with some polymorphism or joins between several tables without a chain of id-resolving joins in between.
There are UUID variants that can work well with indices, which shrinks the case for big integers yet further, down to situational micro-optimizations.
The person you're replying to didn't even mention those advantages though! The ones they did mention are more significant to me.
Of your disadvantages... if I wanted to know when rows were created, I'd just add a created timestamp column.
But "easier to read" is for real -- it's easy when debugging something to have notes referencing rows 12455 and 98923823 or in some cases even keep them in your head. But UUIDs are right out.
And you definitely don't want to put UUIDs in a user-facing URL -- which is actually how you get around "avoids people being able to just iterate through records". Really, whether you have UUIDs or numeric PKs, I think putting internal PKs in URLs or anything else publicly exposed is a bad idea; just keep your PKs purely internal. Once you commit to that, the comparison between UUIDs and numeric values as PKs changes again.
I mean, the links are going to be a template either way, so you can change the identifier part to change the item; it's just a question of whether the identifier portion is the internal rdbms PK, or something else.
It's considered undesirable because it's basically an "implementation detail": it's good to be able to let the internal PK change without having to change the public-facing URLs, or sometimes vice versa, when there are business reasons to do so.
And then there are the at least possible security implications (mainly "enumeration attacks") and/or "business intelligence to your competitors" implications too.
But it's true that plenty of sites seem to expose the internal PK too; there's no complete consensus. It's just kind of a principle of separating internal implementation from public interface.
Random keys have the positive aspect of not creating any hotspots, and they have the negative aspect of not creating any hotspots.
So if you have concurrent updates which are correlated with row insertion time, random keys can be a win. On the other hand, if your lookups are correlated with row insertion time, then the relevant key index pages are less likely to be hot in memory, and depending on how large the table is, you may have thrashing of index pages (this problem would be worse with MySQL, where rows are stored in the PK index).
(UUID doesn't need to mean random any more though as pointed out below. My commentary is specific to random keys and is actually scar tissue from using MD5 as a lookup into multi-billion row tables.)
I generally use both - a serial for internal RDBMS storage to coerce writes to be in order, and a secondary GUID that is app-specific and generated by the application. The app will never care about the serial, and the RDBMS doesn't care about the GUID other than for secondary index look-ups (since they're rarely used for range scans).
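A sketch of that layout (names illustrative; gen_random_uuid() is built in from Postgres 13, otherwise pgcrypto, and you could just as well have the app supply the value):

    CREATE TABLE documents (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal, insert-ordered
        public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,   -- what the app/API hands out
        title     text NOT NULL
    );

    -- the app only ever looks things up through the secondary unique index
    SELECT id, title FROM documents WHERE public_id = $1;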
Be wary when using GUIDs as the PK in clustered indices, in databases that support them; see for example https://stackoverflow.com/questions/11938044/what-are-the-be...
Sadly, PostgreSQL doesn't maintain clustering on the fly like MS SQL does, so it is a less-used feature.
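In Postgres, clustering is a one-off maintenance command rather than a property the table keeps, e.g. (table name illustrative):

    -- rewrites the table once in index order; it is not maintained for later writes,
    -- and it takes an ACCESS EXCLUSIVE lock while it runs
    CLUSTER documents USING documents_pkey;

    -- re-clustering later reuses the previously chosen index
    CLUSTER documents;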
That is true, but it has never been a bottleneck for me. We have a table with several hundred million rows and 20-ish UUID indexes (yes, it did get out of hand), and insert time isn't an issue.
Of course, your app's performance requirements might vary, and it is objectively true that UUIDs aren't ideal for b-tree indexes.
Then you are lucky. I once saw insert performance drop by as much as 5x once a table reached 500,000 rows or so, with b-trees and UUIDs in particular. The problem is: UUIDs often scale well on the app side, especially with multiple writers, so the app wants to use them.
Unless you are using a time-sorted UUID and you only do inserts into the table (never updates), avoid any feature that creates a b-tree on those fields, IMO. Given the MVCC architecture of Postgres, time-sorted UUIDs are often not enough if you do a lot of updates, as updates are really just inserts, which again create randomness in the index. I've been on a project where, to avoid a refactor (and given that Postgres usage was convenient), they just decided to remove constraints and anything that creates a b-tree index implicitly or explicitly.
It makes me wish hash indexes could be used to back constraints. In my previous testing under newer versions of Postgres, they often use less memory these days, and they scale a lot better despite having had less engineering effort put into them. In other databases where updates happen in place, so the order of rows doesn't change (unlike Postgres's MVCC), a BRIN-like index on a time-ordered UUID would often be fantastic for memory usage. ZHeap seems to have died.
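For plain equality lookups you can at least use a hash index today (crash-safe since Postgres 10); it just can't be declared UNIQUE or back a primary key (names illustrative):

    -- equality-only; no UNIQUE support, but no b-tree ordering work on random UUIDs either
    CREATE INDEX documents_public_id_hash ON documents USING hash (public_id);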
Sadly, this is something people should be aware of in advance. Otherwise it will probably bite you later, when you have large write volumes and are therefore least able to make changes to the DB as performance drastically goes down (e.g. under heavy customer traffic). This is amplified because writes don't scale in Postgres/most SQL databases.
Some databases don't give you a choice, and your table data is actually stored in a clustered index sorted by primary key. This means random inserts when using a UUID (without embedded time), and more page splitting.
Technically there are many advantages, but operationally, in systems where an int PK is auto-generated, I find it extremely useful as a pretty good proxy for relative creation time compared to other records, with built-in indexing that avoids having to index an actual created-datetime column for a lot of less precise lookup purposes. The backend frameworks I use around data access make the security benefits of UUIDs pretty moot.
What will really be nice is when UUID V7 has wide support, because in many ways it is the best of all possible worlds:
1. The UUIDs are generated in increasing order, so none of the b-tree issues others have mentioned with fully random UUIDs.
2. They're true UUIDs, so migrating between DBs is easy, ensuring uniqueness across DBs is easy, etc.
3. They also have the benefit of having a significant amount of randomness, so if you have a bug that doesn't do an appropriate access check somewhere they are more resistant to someone trying to guess the ID from a previous one.
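Until native support lands, here is a rough sketch of the widely shared SQL approach (overlay a millisecond timestamp onto a random v4 and flip the version nibble; assumes gen_random_uuid() is available, and note it is not strictly monotonic within the same millisecond):

    create or replace function uuid_generate_v7() returns uuid as $$
      -- start from a random v4 uuid (variant bits already correct),
      -- overwrite the first 48 bits with the unix epoch in milliseconds,
      -- then set the version nibble from 0100 (v4) to 0111 (v7)
      select encode(
        set_bit(
          set_bit(
            overlay(uuid_send(gen_random_uuid())
                    placing substring(int8send(floor(extract(epoch from clock_timestamp()) * 1000)::bigint) from 3)
                    from 1 for 6),
            52, 1),
          53, 1),
        'hex')::uuid;
    $$ language sql volatile;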
A friend of mine recently recommended CUIDs, which I'd never heard of before. Any thoughts on those? Neither usecuid.org nor https://www.npmjs.com/package/cuid give me much confidence, though (the former is not https, and the latter has not been updated in 2 years... which might mean it's done, or might mean it's abandoned).
One huge factor for me: assuming you stick to v4, you can use UUIDs directly as identifiers in public contexts without leaking any business-related info and without having predictable keys.
"A collision is possible but the total number of unique keys generated is so large that the possibility of a collision is almost zero. As per Wikipedia, the number of UUIDs generated to have atleast 1 collision is 2.71 quintillion. This is equivalent to generating around 1 billion UUIDs per second for about 85 years."
Part of the point of UUIDs is that you don't have to. A collision is unlikely enough that your time would be more wisely spent worrying about the sun exploding, unless you generate an absolutely absurd amount of data.
Unless you are writing a nuclear warhead management system, writing UUID collision handling is a waste of time. The client can retry at the top level if the request failed.
My worries were more about access control. It is sort of fine if a customer experiences data loss because an insert fails and the application doesn't retry; it is less fine if a collision causes a user's documents to be swapped with another user's documents and they end up showing kinky porn at a live press conference.
Sort of like the distinction between unspecified behaviour and undefined behaviour in C.
The application shouldn't report the insert as successful if it actually failed. That way, the user doesn't go around thinking the insert succeeded, and there is no data loss (if it didn't succeed, it must be retried, by the app or manually).
You need some UX like an error message in a red rectangle or something.
There would be data loss if the insert/update hit a collision and either decided to overwrite the record or reported the operation as already completed (the latter is what I imagine git would do in a collision scenario).
The solution might as well just be not to care about this case (no sarcasm).
But if we're talking about a database insert and there's a collision (and the uuid column has a unique constraint -- as it should, since it's meant to be a primary key), then the insert will not be successful.
Your application might not care about treating this error, but the DB will report it.
Another favor you should do yourself is use bigint keys (unless using uuid as suggested elsewhere, or you're very, very sure you'll never chew through 2 billion ids).
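Retrofitting bigint later is the painful part; on a large table something like this rewrites the whole table and blocks writes while it runs (names illustrative):

    -- widening an existing int key: full table rewrite under an ACCESS EXCLUSIVE lock
    ALTER TABLE orders ALTER COLUMN id TYPE bigint;

    -- every foreign key column referencing it needs the same treatment
    ALTER TABLE order_items ALTER COLUMN order_id TYPE bigint;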
I often move ID generation into the application layer (this also helps avoid things like enumeration attacks), and actually quite a lot of cool Postgres features blur that line a little bit. It's interesting to think about sequences and other computational mechanisms in a DB, and whether they make architecting applications easier or harder. I don't have a strong opinion either way, but I'm interested in HN's opinion.
One often hears the counterargument 'but using DB-specific features makes your application less portable!', to which I like to respond: When was the last time you moved an application from one SQL database engine to another? Follow-up question: When was it ever 'easy' if you did?
If you start from the basic premise that the database engine and the application are intertwined and are not loosely coupled, using Postgres-specific features feels much less icky from an architectural point of view. They are essentially part of your application given that you use something like Flyway for migrations and you don't manually run SQL against your production to install functions and triggers and such.
So, I did an easy migration of an analytics app from elasticsearch to ClickHouse: what made it easy was (a) we could replay an event stream to regenerate the database and (b) we had interfaces around the database defined in terms of the questions we needed the database to answer, and not in terms of ES-specific concepts.
But, what makes this sort of design so nice is that you can use DB-specific stuff behind the interface because you’re not trying to write all your queries in the minimally supported subset of SQL or something.
I've given up on believing that there is such a thing as portable SQL. If you use the lowest common denominator, you're likely to have queries that perform suboptimally. Hell, even temp tables aren't portable because of the different syntax between databases. I've worked with tons of queries that would take minutes to run without temp tables while running in milliseconds with them. People may as well take advantage of all the nice things their database offers instead of fearing a situation that is unlikely to happen and which will be a tough change anyway.
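Even something as small as a temp table differs between engines, e.g. Postgres vs SQL Server (the query itself is just illustrative):

    -- PostgreSQL
    CREATE TEMP TABLE recent_ids AS
        SELECT id FROM orders WHERE created_at > now() - interval '1 day';

    -- SQL Server
    SELECT id INTO #recent_ids
        FROM orders WHERE created_at > DATEADD(day, -1, SYSDATETIME());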
Adding on: even when you are trying to make a portable application, you tend to want it to work on either X or Y, and you can still rule out people switching their existing data from one to the other.
I've used software that at least tried to be portable, so you could install it with whatever database you had available.
I did not cite portability for exactly this reason, and generally I agree with you. Also, some of the non-portable extensions of Postgres, like PostGIS, are practically indispensable for certain applications.
A more reasonable argument for application layer ID generation is flexibility. It's likely that I want some specific primary key generation (e.g. by email, or username, or username prepended with role, and so on).
Done it multiple times. Most recently we got an MVP off the ground using only ElasticSearch as a backend, since the primary use case we wanted to validate and demonstrate was discoverability.
As we added more features, the lack of transactions and relational structure started to slow us down, so we dropped in Postgres as a backend, and having application-generated UUID4s as primary keys was a big part of making that move fairly painless.
We're currently in the process of moving from MongoDB to CockroachDB. It's not been easy. Certain persistence needs to be duplicated for both of them. Certain queries need to be compared between the two, and we need to make a conscious decision when we pick from which DB.
Having said that, moving ID-generation logic to the application because it's otherwise less portable is an odd reason.
> One often hears the counterargument 'but using DB-specific features makes your application less portable!'
OK, sorry for the probably stupid question: Isn't it just a matter of, for each sequence, selecting its current value and then creating the new one in the target database to start from there? Should be, if perhaps not easily, still reasonably scriptable... Or what am I missing?
For backends, I typically do ID generation in the application layer (the backend), and by ID generation I mean the HiLo generator from Hibernate (or NHibernate). By default the HiLo generator will still produce a monotonically increasing number, possibly with gaps. (Either way, as long as you expose some form of monotonically increasing number in your API, enumeration attacks could work. The correct way is to simply not expose the original id if that is an issue: you could expose some kind of opaque token in the API instead of a literal id.) The advantage of the HiLo generator is that the ORM knows the generated id up front and does not need a round trip to the database to determine it, so it can generate more performant queries. Also, contrary to UUIDs, it does not have the disadvantages regarding indices. The only reason I have ever used UUIDs was a requirement that external clients needed to generate the ids.
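The database side of HiLo is tiny: just a sequence handing out block numbers, with the ORM turning each block into many ids locally without further round trips (a sketch with an illustrative block size; Hibernate's exact formula depends on the configured optimizer):

    -- one sequence value per block of ids
    CREATE SEQUENCE customer_hilo_seq;

    -- the ORM fetches a "hi" value once...
    SELECT nextval('customer_hilo_seq');   -- say it returns 42

    -- ...and then assigns id = hi * 1000 + lo for lo in 0..999 in memory,
    -- so it already knows each row's id before the INSERT is ever sent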
>You might think this is trivial -- but SERIAL creates an "owned" (by a certain user) sequence behind the scenes, and so you run into massive headaches if you try to move things around.
Maybe it is because I'm too old, but making IDs grow via a sequence is how things 'ought' to be done in the old-school DB admin way. Sequences are great: they allow the DB to maintain two or more sets of incremental IDs, which comes in very handy when you're keeping track of, say, invoices that need to carry certain incremental numbers. Exposing that in the CREATE statement of the table brings transparency, instead of some magical black-box IDENTITY.
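For example (names illustrative), a dedicated sequence for a business-facing invoice number alongside whatever the primary key is:

    CREATE SEQUENCE invoice_number_seq START WITH 1000;

    CREATE TABLE invoices (
        id             bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        invoice_number bigint NOT NULL DEFAULT nextval('invoice_number_seq')
    );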
However, it is totally understandable from a developer's perspective that getting to know sequences is just an unneeded headache. ;-)
Why? Concrete ownership is a good thing. I prefer more rigid structures and I like designing my tables around that. It was the Wild West when I did database design around MySQL 15-ish years ago. Of course, I didn't know nearly as much back then as I do now.
https://hakibenita.com/postgresql-unknown-features#prevent-s...
https://www.2ndquadrant.com/en/blog/postgresql-10-identity-c...