Hacker News new | past | comments | ask | show | jobs | submit login
Pg-shortkey: YouTube-Like Short IDs as Postgres Primary Keys (github.com/turbo)
162 points by tosh 50 days ago | hide | past | favorite | 135 comments



> -- 8 bytes gives a collision p = .5 after 5.1 x 10^9 values

So yeah, a fifty-fifty chance of a collision after only five billion values. You’re at 10% chance before even two billion, and 1% after 609 million. I wouldn’t care to play this random game with even a million keys, the 64-bit key space is just not large enough to pick IDs at random. UUIDs are 128 bits; that’s large enough that you can reasonably assume in most fields that no collisions will occur.

Storing a string is also inefficient, wasting more than three, and probably eight or more bytes (I’m not certain of the implementation details in PostgreSQL), growing index sizes and making comparisons slower. It’s more efficient to store it as a number and convert to and from the string form only when needed.


> UUIDs are 128 bits; that’s large enough that you can reasonably assume in most fields that no collisions will occur.

I'd go a step further than that honestly. Even at 80 bits I'd say you can really be sure that it will never happen. Mind that 2^64 is quite large, and with every single +1 of that power you halve the chance. Anything above 80 bits is quite safe even when someone purposefully tries to find your input, let alone by random chance. The 128 bits includes a decent security margin.

Plain MD5 is pretty much the fastest thing you can find. A decent GPU can do 25GH/s on that. If you can have a $1 000 000 hardware budget, it takes you 150 years to have a 50/50 chance of cracking a 2^80 secret with Hashcat. (Ballpark numbers: can be off by an order of magnitude, but not <10 years for that budget if you use today's hardware and today's hashcat.)

We haven't even cracked the 3DES key used for Adobe's user database backup, which I'm sure a lot of people would be interested in. Not $1M interested maybe, but still.

2^128 is really gigantic and no collision will ever occur there unless you have a quantum computer. It's more than a "reasonable assumption in most fields".

Of course, I agree that 2^64 is too small to just generate a token and assume it's unique without checking if it exists. I'm just trying to prevent someone from assuming that we need 512-bit hashing or something and creating crypto soup (as a security dude, this happens all too often).


We’re still talking about two rather different things: you’re talking about the ability to find a particular collision; I’m talking about the probability of any collision occurring, which is the birthday problem. An 80-bit hash has collision resistance of only 2⁴⁰, a mere trillion. This is ample for many applications, but nowhere near enough for many other applications if you consider collisions a bad thing.

2¹²⁸ is indeed overkill for most situations, but it’s about trying to make sure choosing IDs randomly is safe. (I was probably playing it a little too safe in saying “most fields”.)


> Plain MD5 is pretty much the fastest thing you can find.

Faster than BLAKE3?


As always, it depends on the platform and the size of the input :) There's some good data here: https://bench.cr.yp.to/results-hash.html


Again, ballpark numbers, but yes you might be right. From the sibling comment's benchmarks, it seems though that MD5 is indeed even faster than BLAKE3 (5.4 cycles/byte for blake3 vs 4.95 for md5).


I've been comparing other hashes to MD5 during the last few months to compare and identify large photos and movie files. My experience is that xxhash is much faster than good old MD5.

You can find some numbers at it's site here: https://github.com/Cyan4973/xxHash


Those are different kinds of algorithms. Something like CRC32, siphash, etc. are much faster than a hashing function meant to be cryptographically secure. They just don't have the same constraints. This xxHash claims it "successfully completes the SMHasher test suite" which is good in the same way that a PRNG passing randomness tests is good but not necessarily cryptographically secure.

The table on the page you linked also says in the notes whether an algorithm was meant for cryptographic purposes, and indeed all of those are at the bottom by quite a margin.


And this is why YouTube uses 11 characters in base 64, not 8. This creates a space of ~7.38E19 possible IDs.


The keys are the same length. pg-shortkey is encoding 8 bytes (64 bits) as base64, which means 11 characters (at 6 bits per character; two bits are wasted).

The trouble is that it’s generating random IDs, so you hit the birthday problem. If you went for sequential IDs instead (whether directly, from AAAAAAAAAAA to ___________; or permuted so that sequential numbers yield visually-random outputs), then yeah, 11 base64 characters nets you 2⁶⁶ possible IDs. Even 8 base64 characters would be ample for almost every purpose, 2⁴⁸ being around 281 trillion.


64-bits is plenty as long as you code it properly. The ids should be generated as part of the INSERT statement itself. This means in the rare case which has a duplicate id generated, you will get a primary key error and the transaction will fail. In most cases this means user gets an error message and retries.

If the above leads to any serious application problems, then you've got a lurking bug already, because you never know when there may be a network outage for example, and this would have the same effect.


A couple resources with rigorous examination of UUIDs and the tradeoffs involved with using them:

- https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...

- https://github.com/uuidjs/uuid/issues/303 (also mentioned: https://gist.github.com/1ma/c837e6f30c0869cfc1222f67ac8b1c52)

I personally use v1 UUIDs as my main identifier and just take the performance hit. ksuids look interesting but don't think there's any need to use that when v1 uuids are good enough.


Check my post from 10 years ago

https://nhibernate.info/blog/2009/03/19/nhibernate-poid-gene...

I put data about fragmentation when using uuid there.


Hey thanks for linking that post -- I have to say that I didn't find much value in it and here's why (most of which has nothing to do with you so please don't take it personally, just thought it was worth mentioning):

- I don't use ORMs for the most part (query builders for basic CRUD, my personal favorite ORM is typeorm[0] precisely because you don't have to use it as an ORM)

- I don't use .NET (I'm bad at it, and the last time I used it was when EF was in the middle of the transition to core/standard and it was hell)

- HiLo is the thing that you all recommended, but there's almost no information about it

- The post basically says HiLo > Comb > UUIDv4, but with very little actual testing

After looking around for what HiLo was, I found an article about the generation for an ORM in PG[1]. But I'm somewhat confused -- it's actually perfectly possible to pre-generate UUIDv1s, I do it in my applications before the query. I'm not sure what HiLo actually improves upon other than that.

So basically that leaves COMB, and as I've said -- I don't feel the need to use COMB if UUIDv1 is good enough.

[0]: https://github.com/typeorm/typeorm

[1]: https://www.npgsql.org/efcore/modeling/generated-properties....


The blog mentions ORM, but it actually has nothing to do with orms, if you read it closer.

1. ID generation is generic problem. You can go two ways about it, one is letting your DB handle/generate it, and one handling it at application level. Your ORM can do it, or they may not. TypeORM/Sequelize are primitive ORMs compared to hibernate/nhibernate, but ORM could be a place where you can handle ID generation, or you can handle it in your business code.

2. .NET has nothing to do with it, if you want to extract information out of it, it applies to any language, concepts are programming concepts.

3. At the time i suggested HiLo > Comb > UUIDv4 and i actually gave my rationale. These days i choose sequential uuid/comb instead, mostly because i don't care about readable ids.

On HiLo: If you want mostly sequential human readable numbers, instead of using DB generated identity, it's smarter to use a simple algorithm like HiLo algorithm. Hi/Lo is High + Lo referring to bucketization of the numbers. Hi goes from 0...infinity, Lo goes from 0 to bucketsize-1.

The basic idea is, if you have an application (or a distributed system for that matter), you preallocate id ranges to each node instance in your application. Bucket size: 128 Instance 1: Hi=0, Lo=0 Instance 2: Hi=1, Lo=0

Instances assign ids based on the following formula: Hi * bucketSize + (Lo++);

This means instances can assign ids from 0...127 and and 128 ...255 on their own, without hitting database. When they run out of their range, they make a new query and increment Hi parameter stored in database.

Applications assigning ids without hitting database is important. Think of the following scenario: If you have Order + Order Details, you can send all the inserts in one go if the order id is preassigned. Otherwise, you have to insert the order, get the id, then insert the order details with order id you just obtained.

On Comb creates less fragmentation in the database. I actually put a small bit of image there that simply shows what happens if you insert thousands of records with id=uuid4 vs comb. One of the tables will be more fragmented. the other will be significantly less..

UUIDv1 is structured as [low bits of timestamp, mid bits of timestamp, high bits of timestamp....] which means first bits of uuid moves faster than middle/higher bits, which is very problematic for the fragmentation.

Feel free to ping me on email if you want to talk more: tehlike@gmail.com

I was a contributor on NHibernate around 2008.


Anyone looking for a c# implementation of Comb guid can check NHibernate codebase.

https://github.com/nhibernate/nhibernate-core/blob/master/sr...



Nice. One note for anyone going for this on Postgres, ordering is different so requires a different generator. There's a nuget package for that but i can't remember the name right now.

Edit: Found it https://www.nuget.org/packages/RT.Comb/


you can use application generated sequential ids.


Additional approaches in this space:

* https://hashids.org/

* https://github.com/ulid/spec


Why don't README's state how the main thing is done?

  CREATE EXTENSION IF NOT EXISTS "pgcrypto";

  LOOP
    ...
      -- 8 bytes gives a collision p = .5 after 5.1 x 10^9 values
      gkey := encode(gen_random_bytes(8), 'base64');
      gkey := replace(gkey, '/', '_');  -- url safe replacement
      gkey := replace(gkey, '+', '-');  -- url safe replacement
      key := rtrim(gkey, '=');          -- cut off padding
    ...
  END LOOP


Generally so you don't have to update the README every time you update the source.


I wasn't suggesting that the actual code be in the README--that was laziness on my part--more in terms of what's used and expected output characteristics.

I would specifically want to know if I was using a library with a certain implementation, changed how it did its main thing in a potentially meaningful way (e.g. dependencies, formulas, character set, output length, etc). A README is a good place to document these points.


Fair enough. If it's bugged you enough to think through what you're looking for in a README, that'd be a good post for HN.


Weird choice to pick 8 bytes then cut off the padding, you're effectively not using 2 bits in your last character that way.

Might as well make it 12 characters instead of 11 and encode a full 9 bytes.

Although seems like Youtube's done the same thing.


Well you do save a byte in the actual key.

I guess if you really want to keep it to 11 bytes, it would be better to generate 9 bytes of random data, base64 it (which uses the full 12 characters, with no padding), and truncate to 11. You're throwing away 6 bits of random, but now all 88 bits in your 11-character key are useful (instead of only 86 bits).


the '=' isn't url safe.


That's not the point. encoding 9 bytes doesn't give any '=', because 3 bytes encode to 4 b64 characters.


AFAIK:

'=' is URI (path) safe [0]

'=' is not query-param safe [1]

Ex:

[0] https://golang.org/pkg/net/url/#PathEscape

[1] https://golang.org/pkg/net/url/#QueryEscape


Well you just save me a ton of time. Thank you


Looks similar to https://hashids.org/ which instead uses a numerical seed to generate a String representation, which lets you use sequential IDs in your database.


I briefly read about it, it says the method is similar to just converting numbers to hex (modulo + lookup in an 'alphabet', then shift, repeat until you are at zero), but with a shuffled alphabet. That doesn't sound secure _at all_ -- if an attacker can cause you to generate a bunch of IDs very quickly, it's likely that some of them will be sequential, and from there it should be easy to figure out what the shuffled alphabet was.

Indeed, from hashids.org:

> Do you have a question or comment that involves "security" and "hashids" in the same sentence? Don't use Hashids.

and then they link to a blog post with a detailed analysis (which makes it look like their characterization as just converting a number to a shuffled alphabet doesn't capture all the details) demonstrating that it's weak.


I was going to recommend this. No reason to use a special data type for the ID in your database unless you have some sort of actual constraints (like a 64 bit limit or something).

Instead just use a sequential integer (or UUID). When you need to expose it publicly, use something like hashids ^.

The main downside to straight UUIDs as primary key, is the B-tree used to store the primary key is larger if using UUIDs which are 16 bytes vs 8 for longs. In this case though the IDs are 8 bytes so this point is moot. Also the data is stored randomly on disk, which will make things like range queries slower if querying by ID (but you probably want a different index for querying anyway).


I needed this a few (9, wow) years back and wrote this Python library:

https://pypi.org/project/shortuuid/


I came here to post about this!

Thx a lot for this. It works great.

Today I'm working on a custom `ShortUUIDField` (a subclass of Django's UUIDField) so I can stick to standard building blocks internally, but show sortuuids to users.


I've been thinking of adding that to ShortUUID, do you mind showing me when you're done (or issuing a PR)?


@StavrosK the code still very much WIP, but you can take a look at the links I posted on this issue: https://github.com/skorokithakis/shortuuid/issues/50

I'm making some choices very specific to my use case (truncated to length 7, and use of a prefix), so not directly PR-able, but can followup in January if you find this useful.

(Going to have to figure out how to test with all Djangos and DB backends to be sure it works)


glad you did! <3 me some shortuuid :)


<3


I use nanoid (1) and the length control it gives makes it easy to generate appropriate visible ids. Simple, fast, small and works well.

(1) https://github.com/ai/nanoid


i just started using ksuid[0] as my postgres (and in general) primary keys

[0] https://github.com/segmentio/ksuid


Just came to say the same thing. While not 100% bulletproof as neither go nor pgsql internals are areas of experience, I’ve managed to cobble enough together so ksuid is a valid pgsql type with the ability to auto-create during a record insert.


I’ve always liked instagrams id_generator() for sortable bigints without an external point of failure.

http://www.livinginthepast.org/2016/02/24/sharding-into-bigi...


me too - and I Base64 encode it to make it human readable


not sure I understand the base 64 encoding:

a bigint generated with this: 2457081803026990508

base64 encoded: MjQ1NzA4MTgwMzAyNjk5MDUwOA==

why?


Sounds like they mean they're just hashid'ing the bigint id, likely stored in a separate slug column with the output


the bigint is just 64 bits so I b64 encode the binary representation which is about half the size.


This is what Hashids was built for - https://hashids.org/

> [A library] that generates short, unique, non-sequential ids from numbers.

> It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.

> You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.

Of course, you shouldn't use this as a means of any real security. However, neither should you rely on hiding ids through randomness as security either. If someone gets their hands on an id that doesn't belong to them - they still should not be able to do anything with it.

With that in mind, you'll get even better performance with a simple url-safe base64 scheme.


>However, neither should you rely on hiding ids through randomness as security either.

Right, this whole discussion is kind of confusing to me. I get the allure of "unlisted" without needing to log in but for public videos what is the benefit to keeping people from enumerating them?


On youtube, uploaders can switch their videos to be public/private/unlisted anytime, you can't rely on this status to be set in stone at upload.


I find that 64 bit integers are by far the best key material. The only compelling argument I've ever seen for using GUIDs vs int is horizontal scalability, but you can usually get around this with some math tricks.

The best option in general is probably to divide the key space across some constant number that is higher than the most nodes you will ever need to add to the cluster. That said, pre-allocating numeric ranges in anticipation of future growth could introduce some concerns WRT cardinality. If every insert into a single node cluster consumes a bunch of primary key candidates (which are reserved for future nodes), you may worry that 64 bits is not enough range. This is something that has bothered me enough to try to find some options. If you use C#, there is a type called BigInteger which can be very conveniently converted to and from byte arrays. This type can handle an infinite range of integers, so is an excellent fit for this problem domain. All you need to do is define a BLOB instead of an INTEGER for your PK column and it should compare these using memcmp, providing similar semantics to an integer key. Granted, the performance is not going to be as good during lookups with this approach, but it probably wont be noticeable in most applications without some targeted profiling.

Generally, I find primary keys based on entropy games to be questionable at best. I don't want to spend a lot of time worrying if I have enough bits in my hash functions to cover my ass. Deterministic, sequential integers are much more confidence inspiring if we are talking about the fundamental uniqueness of a thing.

If you are worried about security (i.e. someone hitting sequential keys in your URLs), then this is arguably an application problem. You should probably generate an additional column that stores some more obfuscated representation of the primary key and index it separately. This has the added benefit of being decoupled from the primary keyspace and is easily adjustable in the future. The consequence of a collision between public video id surrogate keys is probably much less dire than the collision between actual primary keys in the backend representing those videos.


I built something better that's actually secure and performant here: https://github.com/boogerlad/crypt-ids/

The issues with this are

1) as more IDs are generated, the probability of collision increases

2) a non integer primary key is slow

For a single database instance, it's far more performant to leave it as an autoincrementing integer and when it needs to be exposed, encrypt it in the backend before sending it out. Why not hashids.org? It's insecure: https://carnage.github.io/2015/08/cryptanalysis-of-hashids


> I built something better that's actually secure and performant here

That's quite a bold claim, isn't it? I also don't see how it's an alternative to OP considering it doesn't even mention how to integrate the code with any database, let alone Postgres.

As an aside, how does your approach deal with collisions? Why would the chance of collisions be any lower than with the random approach if you're using what seems to come down to a cryptographically secure hash function?


Technically, it can be integrated with Postgres with https://github.com/wasmerio/wasmer-postgres. That said, it makes little sense to integrate this at the database level since it'll bloat up the column size, unless just the encrypted integer is stored (without converting to a base58 string).

As for collisions, a hash function maps an input of any length to an output of fixed length, so collisions will happen eventually. What I'm using is encryption, which cannot collide, otherwise decryption would be impossible https://crypto.stackexchange.com/questions/60473/can-collisi...



There are reversible bit hash functions too.

I've used twang's 64 bit mix in places where the domain is just as big as the hash space - 64 bit to 8 byte hash to 12 byte Y64 strings.

That said, the results are almost always ugly, if I had to do it again, I'd do it the way gfycat does (adjective, adjective, animal).


Thanks for pointing me to the concept of bit mix functions, very interesting.


The hashids documentation is pretty explicit about it not being secure. It says verbatim:

> Do you have a question or comment that involves "security" and "hashids" in the same sentence? Don't use Hashids.


It only states that near the end of the page. What is the purpose of it if it's not secure? It only prevents an attacker from performing naive attacks (+1 to get the next id, for example), which might be good enough... But anyone really determined can crack this.


I've done something like this but not as the PK.

Left the PK as an int or UUID, but added another column, with a trigger to auto-populate, that created a base-64 kind of thing. The trigger also detected colisions and tried again, up to 3 times before giving up and erroring.

There's no rule that says what you expose to the public has to be the pk at all. It seemed to me a good idea for it to not be.


In a database like MySQL where a clustered index exists for the primary key, or even a NoSQL DB like DynamoDB, it often makes sense to expose some version of the PK to the public if your lookup pattern for that resource is going to be by PK --- versus having some other "public" primary key field like you describe.

If you look up the resource by some other field, that means that you now need to support two indexes - one for the primary key, and one for the "public" primary key. This obviously requires more storage and comes with the performance overhead of keeping the second index updated on modifications to the table. Additionally, for something like DynamoDB where you pay per index, it could be cost prohibitive.

A better pattern is to simply encrypt/decrypt the primary key before exposing it publicly, such as in a URL. This requires no additional database overhead.


See also u/poitrus' https://github.com/rs/xid

Xid uses the Mongo Object ID algorithm to generate globally unique ids with a different serialization (base64) to make it shorter when transported as a string: https://docs.mongodb.org/manual/reference/object-id/

- 4-byte value representing the seconds since the Unix epoch,

- 3-byte machine identifier,

- 2-byte process id, and

- 3-byte counter, starting with a random value.


Ah, I wish I would've known about format-preserving encryption when I did something similar a few years back! I ended up using IDEA, which (at the time) was still in most crypto libraries and has a 64 bit block size.


FPE is complex and slow. I'd prefer a 64-bit blockcipher when it fits the application.


> 2) a non integer primary key is slow

Is this actually true in Postgres? Unless things have changed Postgres doesn't keep clustered indexes so index fragmentation is not an issue.


It is somewhat slower if you are looking up records by key. You still need an index to do that, and non-integer PKs such as UUIDs can be twice as large as the integer alternative, taking longer to search while requiring more memory.

That being said, in PostgreSQL, you are correct --- having a UUID (or something similar) as a PK is usually fine assuming you understand the implications. However I would absolutely avoid it in a DB like MySQL where PKs are clustered.


Why not use ints and encrypt them when returning them?


That's exactly what youtube originally did - video id is just an encrypted (3DES, iirc) 64-bit sequential primary key straight from mysql database. You can convert it back and forth. Problem: if you know the encryption key (it was sitting right there in the code for all devs to see, easy to steal) you could enumerate all the YT videos, including all unlisted videos. At least until a few years ago - they've since switched to random database keys instead of autoincrement to fix this exact hole.

Source: worked there.


One of my hobbies when working at YouTube was to periodically search the monorepo to find new copies of the key, and try to refactor code so there was just one implementation in each of C++, Java, Python, Go


Maybe you could split the ID into two parts: the first is an encrypted integer representing the DB row, the second is a "password" for the video. Then you get fast indexing but you still can't enumerate videos without the ID.


In this scheme, how do you look up a row in the DB by the resulting encrypted ID?


You store the key in the source code as in the other poster's description of Youtube 1.0. So employees/contractors/etc could still see the number of total videos and sequential order of uploads, but at least they can't enumerate them.

You could also maybe rotate the key once in a while, but only for videos created after the rotation, and indicate the "key version" with a third field in the video ID


I would recommend a computed index (given that scheme, if you liked it; I don't, but it doesn't matter).


Youtube and their python code :) keystore would have helped solve this.


I'm not an expert, but here's some conserns I'd have: Because the returned payload isn't necessarily uniform. Because you aren't guaranteed uniqueness (or that guarantee is mathematically lower). And because the overhead for doing it that way may be too high (we want to achieve it in a way that is computationally light, and most encryption isn't).


I am not sure what your concern ab ou it uniqueness is: encrypti on has to be invertible, therefore if each non encrypted integer is unique, it means that each encrypted representation is also unique.

w.r.t encryption cost, I think AES is very cheap on modern hardware compared to the cost of SQL or the overhead of writing your app in an interpreted language.


> Because you aren't guaranteed uniqueness

If your encrypted value is not unique, then your encryption algorithm is broken. You have to be able to turn the encrypted value back into the unencrypted value. if you have two unique unencrypted values that result in the same encrypted value, then that essentially makes it impossible to unencrypt.


>Because the returned payload isn't necessarily uniform

Wrong. Flipping a single plaintext bit flips all ciphertext bits with 50% probability.

>Because you aren't guaranteed uniqueness

Wrong. Encryption is bijective.

>And because the overhead for doing it that way may be too high

Wrong. Symmetric ciphers are extremely fast.


Something to be considered when assigning random-letter IDs to public things: the random combination of letters can include offensive words in them. You should keep a ban list of words in your application and check your assigned ID against the list each time you get an ID.


That's part of the fun


Better to just exclude vowels and numbers that look like vowels.


fckr.


Hello, creator here. This is (I hope obviously) just a short snippet you can and should use as a starting point to hack on. For smaller data sets, this is perfectly fine, but if you're at the point where you need this, you will certainly have business-specific requirements that I don't really want to bake into this from the beginning.

Here's my use case: I wrote this because I was (and am) working on a music streaming service, and I needed a youtube-like shortkey for public links to artists, albums etc. For my scale (~10,000 artists, ~120,000 tracks), I haven't run into any problems.


If you do this in your application code, you can just store it as a 64 bit integer in postgres and optimize your index significantly, instead of using a varchar as the backing data type. These ids pack just right into 64 bits.


Could it be backed by a 64bit integer while still exposing it as a well formatted code without relying on application layer intervention.


Curious as to why not simply use UUIDs instead?

Its extremely easy to use uuids as the primary key; make the column a `uuid` type and the `uuid-ossp` extension will generate them. Your sql would look like:

CREATE TABLE posts (

  post_id uuid DEFAULT uuid_generate_v4(),

  PRIMARY_KEY(post_id)
);


I'd guess the idea would be to have more succinct URLs.

As to whether or not primary keys should even ever be exposed in URLs, different story.


If you use Postgresql 13 version you don't need to install `uuid-ossp` extension. There's already built-in function to generate UUID which is `gen_random_uuid ()`.


I came to the conclusion that IDs should be application-assisted, if not application generated. Offloading that to DB brings some unnecessary work that needs to be handled on the application side otherwise, and tends to o be bad for performance.

What do i mean by application generated? 1. UUID -> Bad for index, but for some use cases it's ok. 2. Sequential UUID -> Good with index, works most of the time. 3. HiLo (a server gets a chunk of id range from database, and assign that to entities on their own).

This allows for more performant applications especially if you are designing somewhat complex data model.



Why is this better than picking an appropriate number of random characters from [A-Za-z0-9]? The base64 encoding seems weird and unnecessary. Plus you get dashes and underscores which can screw up copypaste.


I wrote this specifically to match youtube's format (see my other comment in this thread).


So... add dash and underscore to your character set? The contortions seem pointless.


You need to make sure the last character only has 16 possibilities so that it can round-trip to a 64 bit int. But once you're embedding two lists of characters you might as well just do base64 then a punctuation swap to be simpler and easier to understand.


Instead of doing random bytes, you can also just use a 64 or 128 bit permutation ("block cipher") on standard primary keys, which doesn't rip and tear your indices.


Won't the resulting data look random too? In that case I see little benefit of your approach over an UUID as a PK or using a sequential integer as a PK and adding a UUID field with a btree index.


You don't store the scrambled key, you store the sequential-looking key and expose the scrambled key externally. That avoids things like easy key-sequence probing/inference from the outside, but means you don't take the hit of an index with random-looking keys.

I've done something similar for weakly-secured spendable tokens in the past, packing some orderly lookup state into 64 bits and then exposing the Speck64/32. That application was actually backed by an object-array sort of storage with numeric indices, so I stored something like object ID, timestamp, and sequence number, and then looked up the right property on the object at validation time. (If I needed strongly-in-the-crypto-sense unguessable IDs for security, I wouldn't do this without doing more research/analysis, since a database-wide key feels like it could be a weak point; compare how peppering is used for password storage in conjunction with salting but not as a replacement. Someone else probably already knows that answer.)

(Edit to add cross-reference: ah, looks like tmpyt in another comment claims that YouTube used to do this but then switched away because of the “knowing the small key lets you enumerate everything” loophole.)


Second additional thought: if you need strong unguessability, you could probably tack it onto that scheme by adding a second column that isn't indexed, but only used as a bearer-token-like add-on to the ID. So the index lookup only happens on the (unscrambled) first part, but the access is aborted if the second part doesn't match.


So using random GUIDs is equally harmful to indices? And how so?


Because inserting at the end of a btree (the data structured used by most databases for indexes) is much cheaper than doing random inserts, and as a bonus inserting at the end does not create index bloat since you can mostly avoid page splits. So you will get faster inserts and smaller indexes.


But is there an alternative if you need these hard-to-guess public identifiers? They will always have some randomness and you'll need to index them so it seems unavoidable.


Use integers as primary keys and don't expose them, and add a (unique) random ID column that's used to interface with the rest of the world.

(I understand there'll still be an index for the public key -- the argument would be that you don't bear the cost for that for most queries.)


This stackoverflow answer to a related question gives a better suggestion https://stackoverflow.com/a/6955055/3403369:

Create a separate table mapping integers to unique IDs, and join with this. You can generate that in advance, so can avoid gradually building the index.


> the argument would be that you don't bear the cost for that for most queries

Your INSERT becomes even more expensive because you have 2 update 2 indexes, and your lookup doesn't change.

I get the benefit if you use the id as a FK or if the database must have data on disk in PK order like MySQL or MSSQL (?). On PostgreSQL if you just insert & read on that key only it seems objectively worse.


Encrypt them on the backend before exposing them to the public. Decrypt in the backed on access. See https://hashids.org/



Postgres stores records by primary id sequence. If you have completely random IDs, records inserted at the same time are fragmented across the entire set, which makes accessing them slower if they're meant to be sequential.

Whether this matters entirely depends on the type of data you're storing. If you're storing unrelated data which is always random access anyway, non-sequential ids don't really matter. If you're storing related data which is mostly accessed sequentially, e.g. Impressions filtered by a creation date range, it's definitely better to have sequential ids for non-fragmented storage. UUIDv6 isn't as commonplace in libraries as v1 or v4 but offers a good compromise for this.


No, Mysql does. Postgres doesn't.

(edit, at least, some common mysql table types do)

(further edit -- You don't actually need a primary key in postgres. You might want one, and depending on how you do it, it's generally good design, but there's no actual requirement for them.)


No, I do not think this is a good explanation. What you describe is how InnoDB works.

PostgreSQL stores the actual records separately from the primary key index so the actual records won't fragment but the primary key index will. The records themselves will roughly be stored in insertion order. A random primary key in PostgreSQL mainly means that inserts into the primary key index will be more expensive and that it will bloat the primary key index to some extent.


Note that this can actually be beneficial if you are sharding the table as now new inserts will be put on different shards. (Of course ideally you would be inserting sequentially but on different shards)


There are many ways to skin a cat. Some considerations:

1. Random value vs. string Hash (e.g. git SHA-1)

2. If Hash, string(s) and/or Salt used for the Hash

3. Fixed Length Key vs. Variable Length URLs/IDs

4. DB Storage: INTEGER, CHAR, VARCHAR, BINARY, VARBINARY

5. Scope of the Single Source of Truth for collisions (e.g. single PostgreSQL instance scoped by its WAL or globally unique ID generating service backed by Cloud Spanner and TrueTime)


A pattern you see often is people combining a human readable slug with an ID tacked on. This gives you a more readable slug and an easy way to reference the entry.

https://wellfire.co/learn/fast-and-beautiful-urls-with-djang...


YouTube ids are not human-friendly and undermine URL use. Not a great idea to copy all around, imo.


What do you suggest otherwise?


There are many options...

alphanumeric case insensitive

numeric only

keyword based

above combined with date or topic

just about anything is better than case sensitive mixed alphanumeric plus dashes and underscores, which is a challenge even to techies.

Just imagine trying to tell someone a youtube id over voice


And for each of those, how do you avoid risk of collision? A numeric only ID with the same risk of collision as a UUIDV4 would be 39 digits long, which is longer than a UUIDV4. So is there really a benefit? If it's keyword based, how long is the URI allowed to be? Given the need to support [probably well] over a billion records, I don't think your suggestions meet the design constraints.


Collisions are a separate issue from encoding. Since we're treating youtube IDs as acceptable, let's assume 64 bits is enough. It's 11 characters with this encoding. With something like Base32, designed so that you can read it out loud and ignore the case, it would take 13 characters. That sounds like a worthy trade-off to me. It's not visually much bigger, and saying it out loud or transcribing it is actually much faster because you don't have to specify case or squint to determine I/l/1 or O/0 etc.


a) it's similar for avoiding collisions, and depends more on your hash algo than the display format.

b) the benefit, as stated above, is human readability and shareability and typeinability and writedownability


I've been using ULID as an alternative to UUIDs, e.g.: https://github.com/ulid/spec

I've found ULID to be great for my purposes. Wonder if anyone else has tried using them too?


Nah.

Just skip to a Snowflake-like ID generator from the beginning, or encrypt/decrypt your auto incrementing bigserial for external purposes.


It would be good to put the width (total number of possible keys) in the README.

And what does it do for collisions, does it detect/recover?


It will keep generating keys until a non-colliding key is found. Birthday problem collision rates are explained in the code's comments.


This is a really bad idea as explained eloquently in this blog:

https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-ca...


I've always disagreed with that post, and I've seen others with the same thoughts when this was mentioned on HN before. Specifically:

"Don’t be naive": that's just talking about storing the UUID wrong. You could design your schema badly and store your integer primary key as a string type too, but the post doesn't mention that.

"UUIDs are a pain": so just don't memorize primary keys?

"Planning for real scaling": ok, each record uses 16 bytes instead of 8. But their proposed solution, "Best of Both", uses both 8 and 16 byte records.

"Primary keys get around in normalized databases": finally a real problem.

"It’s really hard to sort random numbers": also potentially a problem—I'm not familiar enough with clustering so I can't offer any thoughts.

"My keys will never change (until they do)": they never explain why they needed to change PKs when migrating DBs


> they never explain why they needed to change PKs when migrating DBs

Not a migration scenario, but thinking of other kinds of consolidations, like folding together separate Employee tables after a company merger.


Wouldn't you have a lot more issues than just ids ? It very improbable that folding 1:1 would be even possible. You'll much more likely have to do a migration merging relevant information into a new table.

So the question holds, when do you change a PK ?


Database migrations


Maybe I just don't know enough about this field, but I didn't actually understand the objections in that post. They were either orthogonal to UUID use (don't store them as string, don't share them publicly) or seemed to not actually come to a conclusion. The "UUIDs are a pain" point makes sense as does the "Primary keys get around" section. But the section "Planning for real scale" seems to stop short of saying anything against UUIDs (it only talks about ints versus bigints?).

The author even concludes (confusingly) with:

> So, in reality, the right solution is probably: use UUIDs for keys, and don’t ever expose them.

Can someone shed light on this?


>> So, in reality, the right solution is probably: use UUIDs for keys, and don’t ever expose them.

> Can someone shed light on this?

Some of the answers over at https://softwareengineering.stackexchange.com/questions/2183... cover the reasoning that I had been taught in the mid-nineties: it's an implementation detail and there are better choices to expose outside the database.

When it comes to UUIDs, these often have a usability penalty when exposed to users. There may be something better to expose to users depending on the context, but not always. It's never an absolute. Since few people now start by building their database first with relatively rigorous modeling, it's hard to avoid the outside exposure.

You should consider the situation though. In the early-2000s, there was a raft of web apps built in the .NET space that exposed GUIDs just because they were cool (I've literally heard them described as that), or devs didn't think about the implications. As with many ASP.NET abstractions at the time, these were problematic in the contexts used: Web Forms relied on POSTing everything and now your URLs weren't "hackable".

Retrofitting UUIDs is possible if you are using other keys. If you are merging databases you can introduce these to get the data merged, as long as you are OK with some level of fix-ups to the original (integer?) primary key that you've exposed to the application. Ultimately, it becomes an application problem if you have collisions in the original IDs.


Sorry, but my question is about whether UUIDs are or are not recommended for use as primary keys. The article seems to be saying both No and Yes to that question.


I can buy some of those arguments for databases where your indexes are so huge that making them 4x the size would make things a lot more expensive. But outside of that I think you would be hard pressed to find a real world difference.

One nice thing about using random IDs is that in a multi master setup you are significantly less likely to run into issues with two nodes generating the same ID for two different rows. Yeah you can set up different offsets and increments but that’s fragile.

Also, under what circumstances would you want to change the values of your primary keys?


It seems like the biggest arguments against UUIDs in that post don't apply here, and the other arguments are explicitly acknowledged in the "caveats" section of OP.

Also half the things in that post seem like they have nothing to do specifically with UUIDs at all (scaling, foreign keys, exposing primary keys)


The gist of it:

> But there’s a far more compelling reason not to use any kind of PK in a public context: if you ever need to change keys, all your external references are broken. Think “404 Page Not Found”.

> When would you need to change keys? As it happens, we’re doing a data migration this week, because who knew in 2003 when the company started that we would now have 13 massive SQL Server databases and growing fast?


Amazon just launched Babelfish in case you're moving to postgres


RTFM ?

Drawbacks are clearly stated, feels like you have not read the project page.

Plus, most of the objections from your article are IMO invalid.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: