So yeah, a fifty-fifty chance of a collision after only five billion values. You’re at a 10% chance before even two billion, and 1% after 609 million. I wouldn’t care to play this random game with even a million keys; the 64-bit key space is just not large enough to pick IDs at random. UUIDs are 128 bits, which is large enough that you can reasonably assume in most fields that no collisions will occur.
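(If you want to sanity-check those figures, the standard birthday approximation p ≈ 1 - exp(-n²/(2N)) gets you there; a quick Python sketch:)

    import math

    # Birthday approximation: probability of at least one collision after
    # drawing n values uniformly at random from a space of size N.
    def collision_probability(n: float, N: float = 2**64) -> float:
        return 1 - math.exp(-n * n / (2 * N))

    print(collision_probability(5.1e9))   # ~0.50
    print(collision_probability(1.97e9))  # ~0.10
    print(collision_probability(6.09e8))  # ~0.01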
Storing a string is also inefficient, wasting more than three bytes, and probably eight or more (I’m not certain of the implementation details in PostgreSQL), growing index sizes and making comparisons slower. It’s more efficient to store it as a number and convert to and from the string form only when needed.
I'd go a step further than that, honestly. Even at 80 bits I'd say you can be quite sure it will never happen. Mind that 2^64 is already quite large, and every +1 to that exponent doubles the space, halving the chance. Anything above 80 bits is quite safe even when someone purposefully tries to find your input, let alone by random chance. The 128 bits include a decent security margin.
Plain MD5 is pretty much the fastest thing you can find; a decent GPU can do 25 GH/s on it. With a $1,000,000 hardware budget, it would take you about 150 years to have a 50/50 chance of cracking a 2^80 secret with Hashcat. (Ballpark numbers: they can be off by an order of magnitude, but not below 10 years for that budget with today's hardware and today's Hashcat.)
We haven't even cracked the 3DES key used for Adobe's user database backup, which I'm sure a lot of people would be interested in. Not $1M interested maybe, but still.
2^128 is really gigantic and no collision will ever occur there unless you have a quantum computer. It's more than a "reasonable assumption in most fields".
Of course, I agree that 2^64 is too small to just generate a token and assume it's unique without checking if it exists. I'm just trying to prevent someone from assuming that we need 512-bit hashing or something and creating crypto soup (as a security dude, I see this happen all too often).
2¹²⁸ is indeed overkill for most situations, but it’s about trying to make sure choosing IDs randomly is safe. (I was probably playing it a little too safe in saying “most fields”.)
Faster than BLAKE3?
You can find some numbers at its site here:
The table on the page you linked also says in the notes whether an algorithm was meant for cryptographic purposes, and indeed all of those are at the bottom by quite a margin.
The trouble is that it’s generating random IDs, so you hit the birthday problem. If you went for sequential IDs instead (whether directly, from AAAAAAAAAAA to ___________; or permuted so that sequential numbers yield visually-random outputs), then yeah, 11 base64 characters nets you 2⁶⁶ possible IDs. Even 8 base64 characters would be ample for almost every purpose, 2⁴⁸ being around 281 trillion.
If the above leads to any serious application problems, then you've got a lurking bug already: you never know when there may be a network outage, for example, and that would have the same effect.
- https://github.com/uuidjs/uuid/issues/303 (also mentioned: https://gist.github.com/1ma/c837e6f30c0869cfc1222f67ac8b1c52)
I personally use v1 UUIDs as my main identifier and just take the performance hit. KSUIDs look interesting, but I don't think there's any need for them when v1 UUIDs are good enough.
I put data about fragmentation when using uuid there.
- I don't use ORMs for the most part (query builders for basic CRUD; my personal favorite ORM is TypeORM, precisely because you don't have to use it as an ORM)
- I don't use .NET (I'm bad at it, and the last time I used it was when EF was in the middle of the transition to core/standard and it was hell)
- HiLo is the thing that you all recommended, but there's almost no information about it
- The post basically says HiLo > Comb > UUIDv4, but with very little actual testing
After looking around for what HiLo was, I found an article about its generation for an ORM in PG. But I'm somewhat confused -- it's actually perfectly possible to pre-generate UUIDv1s; I do it in my applications before the query. I'm not sure what HiLo actually improves on beyond that.
So basically that leaves COMB, and as I've said -- I don't feel the need to use COMB if UUIDv1 is good enough.
1. ID generation is a generic problem. You can go two ways about it: let your DB handle/generate the IDs, or handle it at the application level. Your ORM may be able to do it, or it may not. TypeORM/Sequelize are primitive ORMs compared to Hibernate/NHibernate, but the ORM could be a place where you handle ID generation, or you can handle it in your business code.
2. .NET has nothing to do with it; if you want to extract information out of this, it applies to any language -- the concepts are programming concepts.
3. At the time I suggested HiLo > Comb > UUIDv4, and I actually gave my rationale. These days I choose sequential UUID/COMB instead, mostly because I don't care about readable ids.
If you want mostly sequential, human-readable numbers, instead of using a DB-generated identity it's smarter to use a simple algorithm like the HiLo algorithm. Hi/Lo stands for High + Low, referring to bucketization of the numbers: Hi goes from 0 to infinity, Lo goes from 0 to bucketSize-1.
The basic idea is, if you have an application (or a distributed system for that matter), you preallocate id ranges to each node instance in your application.
Bucket size: 128
Instance 1: Hi=0, Lo=0
Instance 2: Hi=1, Lo=0
Instances assign ids based on the following formula: Hi * bucketSize + (Lo++);
This means the instances can assign ids from 0...127 and 128...255 on their own, without hitting the database. When an instance runs out of its range, it makes a new query and increments the Hi value stored in the database.
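A minimal sketch of that allocation logic in Python (names are illustrative; `fetch_next_hi` stands in for the single query that bumps the stored Hi value):

    class HiLoGenerator:
        def __init__(self, fetch_next_hi, bucket_size: int = 128):
            self._fetch_next_hi = fetch_next_hi   # one round trip reserves a whole bucket
            self._bucket_size = bucket_size
            self._hi = fetch_next_hi()
            self._lo = 0

        def next_id(self) -> int:
            if self._lo >= self._bucket_size:
                # bucket exhausted: one more database query for a fresh Hi
                self._hi = self._fetch_next_hi()
                self._lo = 0
            value = self._hi * self._bucket_size + self._lo  # Hi * bucketSize + (Lo++)
            self._lo += 1
            return value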
Applications assigning ids without hitting the database is important. Think of the following scenario:
If you have Order + Order Details, you can send all the inserts in one go if the order id is preassigned. Otherwise, you have to insert the order, get the id, then insert the order details with the order id you just obtained.
As for COMB: it creates less fragmentation in the database. I put a small image there that simply shows what happens if you insert thousands of records with id=uuid4 vs. COMB: one of the tables will be much more fragmented, the other significantly less.
UUIDv1 is structured as [low bits of timestamp, mid bits of timestamp, high bits of timestamp, ...], which means the first bits of the UUID change faster than the middle/higher bits -- very problematic for fragmentation.
Feel free to ping me on email if you want to talk more:
I was a contributor on NHibernate around 2008.
Edit: Found it https://www.nuget.org/packages/RT.Comb/
CREATE EXTENSION IF NOT EXISTS "pgcrypto";
-- 8 bytes gives a collision p = .5 after 5.1 x 10^9 values
gkey := encode(gen_random_bytes(8), 'base64');
gkey := replace(gkey, '/', '_'); -- url safe replacement
gkey := replace(gkey, '+', '-'); -- url safe replacement
key := rtrim(gkey, '='); -- cut off padding
I would specifically want to know if a library I was using changed how it did its main thing in a potentially meaningful way (e.g. dependencies, formulas, character set, output length, etc.). A README is a good place to document these points.
Might as well make it 12 characters instead of 11 and encode a full 9 bytes.
Although it seems like YouTube's done the same thing.
I guess if you really want to keep it to 11 characters, it would be better to generate 9 bytes of random data, base64 it (which fills the full 12 characters, with no padding), and truncate to 11. You're throwing away 6 bits of random, but now all 66 bits in your 11-character key are useful (instead of only 64).
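Something like this, as a standard-library-only sketch:

    import base64
    import secrets

    # 9 random bytes -> 12 unpadded url-safe base64 chars; drop the last char so
    # every one of the 11 remaining characters carries 6 fully random bits.
    def short_key() -> str:
        raw = secrets.token_bytes(9)                         # 72 random bits
        return base64.urlsafe_b64encode(raw).decode()[:11]   # keep 11 chars = 66 bits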
'=' is URI (path) safe 
'=' is not query-param safe 
Indeed, from hashids.org:
> Do you have a question or comment that involves "security" and "hashids" in the same sentence? Don't use Hashids.
and then they link to a blog post with a detailed analysis demonstrating that it's weak (which makes it look like characterizing it as just converting a number through a shuffled alphabet doesn't capture all the details).
Instead just use a sequential integer (or UUID). When you need to expose it publicly, use something like hashids ^.
The main downside to straight UUIDs as the primary key is that the B-tree used to store the primary key is larger, since UUIDs are 16 bytes vs. 8 for longs. In this case, though, the IDs are 8 bytes, so that point is moot. Also, the data is stored randomly on disk, which will make things like range queries slower if querying by ID (but you probably want a different index for querying anyway).
Thx a lot for this. It works great.
Today I'm working on a custom `ShortUUIDField` (a subclass of Django's UUIDField) so I can stick to standard building blocks internally, but show shortuuids to users.
I'm making some choices very specific to my use case (truncated to length 7, and use of a prefix), so not directly PR-able, but can followup in January if you find this useful.
(Going to have to figure out how to test with all Djangos and DB backends to be sure it works)
a bigint generated with this: 2457081803026990508
base64 encoded: MjQ1NzA4MTgwMzAyNjk5MDUwOA==
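(For contrast -- this is just an illustration, not part of the parent's setup -- encoding the raw 8 bytes of that bigint rather than its decimal string gets you back to an 11-character key:)

    import base64
    import struct

    pk = 2457081803026990508
    # base64 of the 19-digit decimal string: 28 characters
    print(base64.b64encode(str(pk).encode()).decode())   # MjQ1NzA4MTgwMzAyNjk5MDUwOA==
    # base64 of the raw 8 bytes: 11 characters after trimming padding
    print(base64.urlsafe_b64encode(struct.pack(">Q", pk)).rstrip(b"=").decode())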
> [A library] that generates short, unique, non-sequential ids from numbers.
> It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.
> You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.
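Usage via the Python port looks roughly like this (assuming the standard `hashids` package; the salt and min_length here are illustrative, and, as below, this is obfuscation rather than security):

    from hashids import Hashids

    hashids = Hashids(salt="not-a-secret", min_length=8)
    public_id = hashids.encode(347)           # short obfuscated string
    (original,) = hashids.decode(public_id)   # back to 347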
Of course, you shouldn't use this as a means of any real security. But neither should you rely on hiding ids through randomness as security: if someone gets their hands on an id that doesn't belong to them, they still should not be able to do anything with it.
With that in mind, you'll get even better performance with a simple url-safe base64 scheme.
Right, this whole discussion is kind of confusing to me. I get the allure of "unlisted" without needing to log in but for public videos what is the benefit to keeping people from enumerating them?
The best option in general is probably to divide the key space across some constant number that is higher than the most nodes you will ever need to add to the cluster. That said, pre-allocating numeric ranges in anticipation of future growth could introduce some concerns WRT cardinality. If every insert into a single-node cluster consumes a bunch of primary key candidates (which are reserved for future nodes), you may worry that 64 bits is not enough range. This is something that has bothered me enough to try to find some options.

If you use C#, there is a type called BigInteger which can be very conveniently converted to and from byte arrays. This type can handle an arbitrary range of integers, so it is an excellent fit for this problem domain. All you need to do is define a BLOB instead of an INTEGER for your PK column and it should compare these using memcmp, providing similar semantics to an integer key. Granted, the performance is not going to be as good during lookups with this approach, but it probably won't be noticeable in most applications without some targeted profiling.
Generally, I find primary keys based on entropy games to be questionable at best. I don't want to spend a lot of time worrying if I have enough bits in my hash functions to cover my ass. Deterministic, sequential integers are much more confidence inspiring if we are talking about the fundamental uniqueness of a thing.
If you are worried about security (i.e. someone hitting sequential keys in your URLs), then this is arguably an application problem. You should probably generate an additional column that stores some more obfuscated representation of the primary key and index it separately. This has the added benefit of being decoupled from the primary keyspace and is easily adjustable in the future. The consequence of a collision between public video id surrogate keys is probably much less dire than the collision between actual primary keys in the backend representing those videos.
The issues with this are
1) as more IDs are generated, the probability of collision increases
2) a non-integer primary key is slow
For a single database instance, it's far more performant to leave it as an autoincrementing integer and when it needs to be exposed, encrypt it in the backend before sending it out. Why not hashids.org? It's insecure: https://carnage.github.io/2015/08/cryptanalysis-of-hashids
That's quite a bold claim, isn't it? I also don't see how it's an alternative to OP considering it doesn't even mention how to integrate the code with any database, let alone Postgres.
As an aside, how does your approach deal with collisions? Why would the chance of collisions be any lower than with the random approach if you're using what seems to come down to a cryptographically secure hash function?
As for collisions, a hash function maps an input of any length to an output of fixed length, so collisions will happen eventually. What I'm using is encryption, which cannot collide, otherwise decryption would be impossible https://crypto.stackexchange.com/questions/60473/can-collisi...
I've used twang's 64 bit mix in places where the domain is just as big as the hash space - 64 bit to 8 byte hash to 12 byte Y64 strings.
That said, the results are almost always ugly; if I had to do it again, I'd do it the way gfycat does (adjective, adjective, animal).
Left the PK as an int or UUID, but added another column, with a trigger to auto-populate it, that created a base-64 kind of thing. The trigger also detected collisions and tried again, up to 3 times, before giving up and erroring.
There's no rule that says what you expose to the public has to be the pk at all. It seemed to me a good idea for it to not be.
If you look up the resource by some other field, that means that you now need to support two indexes - one for the primary key, and one for the "public" primary key. This obviously requires more storage and comes with the performance overhead of keeping the second index updated on modifications to the table. Additionally, for something like DynamoDB where you pay per index, it could be cost prohibitive.
A better pattern is to simply encrypt/decrypt the primary key before exposing it publicly, such as in a URL. This requires no additional database overhead.
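A rough sketch of that idea, using the `cryptography` package (the key handling and the choice to pad the 64-bit id into a single AES block are assumptions for illustration; the result is a 22-character url-safe token):

    import base64
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    SECRET_KEY = bytes(16)  # placeholder -- use a real, secret 128-bit key

    def encode_id(pk: int) -> str:
        block = pk.to_bytes(16, "big")  # pad the 8-byte id into one AES block
        ct = Cipher(algorithms.AES(SECRET_KEY), modes.ECB()).encryptor().update(block)
        return base64.urlsafe_b64encode(ct).rstrip(b"=").decode()

    def decode_id(token: str) -> int:
        ct = base64.urlsafe_b64decode(token + "==")
        pt = Cipher(algorithms.AES(SECRET_KEY), modes.ECB()).decryptor().update(ct)
        return int.from_bytes(pt, "big")

ECB is normally a red flag, but here each plaintext block is a unique id, so a deterministic mapping is the point; a 64-bit block cipher (as discussed elsewhere in the thread) would keep the token shorter at the cost of more care in choosing the cipher.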
Xid uses the Mongo Object ID algorithm to generate globally unique ids with a different serialization (base64) to make it shorter when transported as a string: https://docs.mongodb.org/manual/reference/object-id/
- 4-byte value representing the seconds since the Unix epoch,
- 3-byte machine identifier,
- 2-byte process id, and
- 3-byte counter, starting with a random value.
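A toy sketch of packing that layout (the hostname-hash machine id and the url-safe base64 serialization are assumptions here, just to make it concrete):

    import base64
    import hashlib
    import os
    import socket
    import time

    _machine = hashlib.md5(socket.gethostname().encode()).digest()[:3]
    _counter = int.from_bytes(os.urandom(3), "big")  # counter starts at a random value

    def new_id() -> str:
        global _counter
        _counter = (_counter + 1) % (1 << 24)
        raw = (
            int(time.time()).to_bytes(4, "big")          # 4-byte Unix timestamp
            + _machine                                   # 3-byte machine identifier
            + (os.getpid() & 0xFFFF).to_bytes(2, "big")  # 2-byte process id
            + _counter.to_bytes(3, "big")                # 3-byte counter
        )
        return base64.urlsafe_b64encode(raw).decode()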
Is this actually true in Postgres? Unless things have changed, Postgres doesn't keep clustered indexes, so index fragmentation is not an issue.
That being said, in PostgreSQL you are correct -- having a UUID (or something similar) as a PK is usually fine, assuming you understand the implications. However, I would absolutely avoid it in a DB like MySQL where PKs are clustered.
Source: worked there.
You could also maybe rotate the key once in a while, but only for videos created after the rotation, and indicate the "key version" with a third field in the video ID.
w.r.t encryption cost, I think AES is very cheap on modern hardware compared to the cost of SQL or the overhead of writing your app in an interpreted language.
If your encrypted value is not unique, then your encryption algorithm is broken. You have to be able to turn the encrypted value back into the unencrypted value. If you have two distinct unencrypted values that result in the same encrypted value, then that essentially makes decryption impossible.
Wrong. Flipping a single plaintext bit flips each ciphertext bit with 50% probability.
>Because you aren't guaranteed uniqueness
Wrong. Encryption is bijective.
>And because the overhead for doing it that way may be too high
Wrong. Symmetric ciphers are extremely fast.
Here's my use case: I wrote this because I was (and am) working on a music streaming service, and I needed a youtube-like shortkey for public links to artists, albums etc. For my scale (~10,000 artists, ~120,000 tracks), I haven't run into any problems.
It's extremely easy to use UUIDs as the primary key: make the column a `uuid` type and the `uuid-ossp` extension will generate them. Your SQL would look like:
CREATE TABLE posts (
    post_id uuid PRIMARY KEY DEFAULT uuid_generate_v4()
    -- ... other columns ...
);
As to whether or not primary keys should even ever be exposed in URLs, different story.
What do I mean by application-generated?
1. UUID -> Bad for index, but for some use cases it's ok.
2. Sequential UUID -> Good with index, works most of the time.
3. HiLo (a server gets a chunk of the id range from the database and assigns ids from it to entities on its own).
This allows for more performant applications especially if you are designing somewhat complex data model.
I've done something similar for weakly-secured spendable tokens in the past, packing some orderly lookup state into 64 bits and then exposing the Speck64/32 ciphertext. That application was actually backed by an object-array sort of storage with numeric indices, so I stored something like object ID, timestamp, and sequence number, and then looked up the right property on the object at validation time. (If I needed strongly-in-the-crypto-sense unguessable IDs for security, I wouldn't do this without doing more research/analysis, since a database-wide key feels like it could be a weak point; compare how peppering is used for password storage in conjunction with salting but not as a replacement. Someone else probably already knows that answer.)
(Edit to add cross-reference: ah, looks like tmpyt in another comment claims that YouTube used to do this but then switched away because of the “knowing the small key lets you enumerate everything” loophole.)
(I understand there'll still be an index for the public key -- the argument would be that you don't bear the cost for that for most queries.)
Create a separate table mapping integers to unique IDs, and join with this. You can generate that in advance, so can avoid gradually building the index.
Your INSERT becomes even more expensive because you have to update 2 indexes, and your lookup doesn't change.
I get the benefit if you use the id as a FK, or if the database must have data on disk in PK order like MySQL or MSSQL (?). On PostgreSQL, if you just insert & read on that key only, it seems objectively worse.
Whether this matters entirely depends on the type of data you're storing. If you're storing unrelated data which is always random access anyway, non-sequential ids don't really matter. If you're storing related data which is mostly accessed sequentially, e.g. Impressions filtered by a creation date range, it's definitely better to have sequential ids for non-fragmented storage. UUIDv6 isn't as commonplace in libraries as v1 or v4 but offers a good compromise for this.
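For what it's worth, the UUIDv6 idea is just a reshuffle of the v1 timestamp fields so values sort roughly in creation order; a sketch using the standard library (field positions follow the draft UUIDv6 layout, so treat this as illustrative rather than a library):

    import uuid

    def uuid1_to_v6(u: uuid.UUID) -> uuid.UUID:
        ts = u.time                                   # 60-bit v1 timestamp
        time_high = (ts >> 28) & 0xFFFFFFFF           # high timestamp bits now come first
        time_mid = (ts >> 12) & 0xFFFF
        time_low_and_version = (ts & 0x0FFF) | (6 << 12)
        rest = u.int & ((1 << 64) - 1)                # clock_seq + node, unchanged
        return uuid.UUID(int=(time_high << 96) | (time_mid << 80)
                             | (time_low_and_version << 64) | rest)

    print(uuid1_to_v6(uuid.uuid1()))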
(edit, at least, some common mysql table types do)
(further edit -- You don't actually need a primary key in postgres. You might want one, and depending on how you do it, it's generally good design, but there's no actual requirement for them.)
PostgreSQL stores the actual records separately from the primary key index so the actual records won't fragment but the primary key index will. The records themselves will roughly be stored in insertion order. A random primary key in PostgreSQL mainly means that inserts into the primary key index will be more expensive and that it will bloat the primary key index to some extent.
1. Random value vs. string Hash (e.g. git SHA-1)
2. If Hash, string(s) and/or Salt used for the Hash
3. Fixed Length Key vs. Variable Length URLs/IDs
4. DB Storage: INTEGER, CHAR, VARCHAR, BINARY, VARBINARY
5. Scope of the Single Source of Truth for collisions (e.g. single PostgreSQL instance scoped by its WAL or globally unique ID generating service backed by Cloud Spanner and TrueTime)
- alphanumeric, case insensitive
- the above combined with a date or topic
Just about anything is better than case-sensitive mixed alphanumeric plus dashes and underscores, which is a challenge even for techies.
Just imagine trying to tell someone a YouTube id out loud.
b) the benefit, as stated above, is human readability and shareability and typeinability and writedownability
I've found ULID to be great for my purposes. Wonder if anyone else has tried using them too?
Just skip to a Snowflake-like ID generator from the beginning, or encrypt/decrypt your auto-incrementing bigserial for external purposes.
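A classic Snowflake-style id is just bits packed into a signed 64-bit integer; a sketch (the epoch and field widths follow Twitter's original design, but treat them as assumptions to adjust):

    import time

    EPOCH_MS = 1_288_834_974_657  # Twitter's custom epoch, in milliseconds

    def snowflake_id(worker_id: int, sequence: int) -> int:
        # 41 bits of millisecond timestamp | 10 bits of worker id | 12 bits of sequence
        ms = int(time.time() * 1000) - EPOCH_MS
        return (ms << 22) | ((worker_id & 0x3FF) << 12) | (sequence & 0xFFF)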
And what does it do about collisions? Does it detect and recover?
"Don’t be naive": that's just talking about storing the UUID wrong. You could design your schema badly and store your integer primary key as a string type too, but the post doesn't mention that.
"UUIDs are a pain": so just don't memorize primary keys?
"Planning for real scaling": ok, each record uses 16 bytes instead of 8. But their proposed solution, "Best of Both", uses both 8 and 16 byte records.
"Primary keys get around in normalized databases": finally a real problem.
"It’s really hard to sort random numbers": also potentially a problem—I'm not familiar enough with clustering so I can't offer any thoughts.
"My keys will never change (until they do)": they never explain why they needed to change PKs when migrating DBs
Not a migration scenario, but thinking of other kinds of consolidations, like folding together separate Employee tables after a company merger.
So the question holds: when do you change a PK?
The author even concludes (confusingly) with:
> So, in reality, the right solution is probably: use UUIDs for keys, and don’t ever expose them.
Can someone shed light on this?
> Can someone shed light on this?
Some of the answers over at https://softwareengineering.stackexchange.com/questions/2183... cover the reasoning that I had been taught in the mid-nineties: it's an implementation detail and there are better choices to expose outside the database.
When it comes to UUIDs, these often have a usability penalty when exposed to users. There may be something better to expose to users depending on the context, but not always. It's never an absolute. Since few people now start by building their database first with relatively rigorous modeling, it's hard to avoid the outside exposure.
You should consider the situation though. In the early-2000s, there was a raft of web apps built in the .NET space that exposed GUIDs just because they were cool (I've literally heard them described as that), or devs didn't think about the implications. As with many ASP.NET abstractions at the time, these were problematic in the contexts used: Web Forms relied on POSTing everything and now your URLs weren't "hackable".
Retrofitting UUIDs is possible if you are using other keys. If you are merging databases you can introduce these to get the data merged, as long as you are OK with some level of fix-ups to the original (integer?) primary key that you've exposed to the application. Ultimately, it becomes an application problem if you have collisions in the original IDs.
One nice thing about using random IDs is that in a multi master setup you are significantly less likely to run into issues with two nodes generating the same ID for two different rows. Yeah you can set up different offsets and increments but that’s fragile.
Also, under what circumstances would you want to change the values of your primary keys?
Also half the things in that post seem like they have nothing to do specifically with UUIDs at all (scaling, foreign keys, exposing primary keys)
> But there’s a far more compelling reason not to use any kind of PK in a public context: if you ever need to change keys, all your external references are broken. Think “404 Page Not Found”.
> When would you need to change keys? As it happens, we’re doing a data migration this week, because who knew in 2003 when the company started that we would now have 13 massive SQL Server databases and growing fast?
The drawbacks are clearly stated; it feels like you have not read the project page.
Plus, most of the objections from your article are IMO invalid.