
They also use "_" and "-" according to Tom Scott.

https://www.youtube.com/watch?v=gocwRvLhDf8






Which would bring it up to a nice 64 choices, making it exactly 6 bits per character.

It's a URL-friendly form of base64.

11 chars encode 66 bits, but actually 2 bits are likely not used and it's simply an int64 encoded to base64.
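To make the arithmetic concrete, here is a minimal sketch of packing a 64-bit integer into 11 URL-safe base64 characters. The helper names `encode_id` and `decode_id` are made up for illustration, and the big-endian byte order is an assumption; nothing here is confirmed to be what YouTube does.

```python
import base64
import struct


def encode_id(n: int) -> str:
    """Pack an unsigned 64-bit int (big-endian, an assumption) into URL-safe base64."""
    raw = struct.pack(">Q", n)  # 8 bytes = 64 bits
    # 8 bytes encode to 12 base64 characters, the last of which is "=" padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_id(s: str) -> int:
    """Inverse of encode_id: restore the stripped padding, then unpack."""
    padded = s + "=" * (-len(s) % 4)
    return struct.unpack(">Q", base64.urlsafe_b64decode(padded))[0]


if __name__ == "__main__":
    vid = encode_id(2**64 - 1)
    print(vid, len(vid))
```

The last character only carries 4 of its 6 bits, which is where the 2 unused bits go.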

Given everyone and their grandma is pushing 128-bit UUID for distributed entity PK, it's interesting to see YouTube keep it short and sweet.

Int64 is my go-to PK as well. If I have to, I make it hierarchical to distribute it, but I don't do UUID.


> Given everyone and their grandma is pushing 128-bit UUID for distributed entity PK, it's interesting to see YouTube keep it short and sweet.

The trade-off you make when using short IDs is that you can't generate them at random. With 128-bit IDs, you can't realistically have collisions, but with 64-bit ones, because of the birthday paradox, collisions become likely once you have around 2^32 elements (roughly a 39% chance at 2^32, and rising quickly after that).
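For anyone who wants to check the birthday numbers, a rough sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/2N), assuming IDs are drawn uniformly at random. The function name `collision_probability` is just for illustration.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n uniform random IDs."""
    space = 2.0 ** bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


if __name__ == "__main__":
    for bits in (64, 128):
        for n in (2**24, 2**32, 2**33):
            print(f"{bits}-bit IDs, n=2^{n.bit_length() - 1}: "
                  f"{collision_probability(n, bits):.3g}")
```

With 64-bit IDs the probability is about 0.39 at 2^32 elements, while with 128-bit IDs it stays negligible at the same count.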


YouTube video IDs used to be just the base64 of a 3DES-encrypted MySQL primary key, a sequential 64-bit int, so collisions are of zero concern there. By the birthday paradox, that is about as good as a 128-bit UUID generated without a centralized component like a database's row counter (random 128-bit IDs start colliding at around 2^64 items), which is the case where you have to care about collisions.

However, theft of the encryption key is a concern, since you can't rotate it and it just sat there in the code. Nowadays they do something a bit smarter to ensure ex-employees can't enumerate all unlisted videos.


> You seem to know about their architecture. What do they do now?

Random 64-bit primary keys in MySQL for newer videos. These may sometimes collide, but then I suppose you could have the database reject the insert and retry with a different ID.
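A sketch of that reject-and-retry idea, using an in-memory SQLite database as a stand-in for MySQL. The `videos` table, the function names, and the `new_id` hook are all hypothetical, and the IDs are 63-bit here only because SQLite integers are signed.

```python
import secrets
import sqlite3


def make_db() -> sqlite3.Connection:
    """Hypothetical schema: the primary key constraint is what rejects duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT)")
    return conn


def insert_with_random_id(conn, title, max_attempts=5, new_id=None):
    """Pick a random ID, let the database reject collisions, retry with a new one."""
    # 63 bits so the value fits SQLite's signed 64-bit INTEGER (an assumption
    # of this sketch, not of the original design).
    new_id = new_id or (lambda: secrets.randbits(63))
    for _ in range(max_attempts):
        candidate = new_id()
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO videos (id, title) VALUES (?, ?)",
                    (candidate, title),
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # ID already taken: try another one
    raise RuntimeError("could not find a free ID")


if __name__ == "__main__":
    db = make_db()
    print(insert_with_random_id(db, "first upload"))
```

`secrets` draws from the OS CSPRNG, so any frontend could generate candidates on its own; the database stays the only place that decides whether an ID is accepted.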


So a single cluster produces those keys? I thought it’s more decentralized.

With random database keys I would think they can just be generated at random by any frontend server running anywhere. Ultimately, a request to insert that key would come to the database, which is the centralized gatekeeper in this design and can accept or reject it. But with replication, sharding, and caching, even SQL databases scale extremely well. Just avoid expensive operations like joins.


The reason why we want ids to be purely random is so we don't have to do the work of coordinating distributed id generation. But if you don't mind coordinating, then none of this matters.

Surely if it was a great chore for YouTube to have random-looking int64 ids, they would switch to int128. But they haven't.

I'm a big fan of the "works 99.99999999% of the time" mentality, but if anything happens to your PRNGs, you risk countless collisions slipping by you in production before you realize what happened. It's good to design your identity system in a way that would catch that, regardless of how unlikely it seems in the abstract.

The concept of hierarchical ids is undervalued. You can have a machine give "namespaces" to others, and they can generate locally and check for collisions locally in a very basic way.
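One way the namespace idea could look, as a sketch only: a coordinator hands each machine a prefix once, and the machine fills in the rest locally. The 16/48 bit split and the class names are assumptions for illustration, and a local counter is used instead of a random value plus a local check, which makes local collisions impossible by construction.

```python
import itertools

NAMESPACE_BITS = 16  # assumed split: 16-bit namespace + 48-bit local part = 64 bits
LOCAL_BITS = 48


class Coordinator:
    """The only centralized step: hand out each namespace prefix exactly once."""

    def __init__(self):
        self._next = itertools.count()

    def allocate_namespace(self) -> int:
        ns = next(self._next)
        if ns >= 1 << NAMESPACE_BITS:
            raise OverflowError("out of namespaces")
        return ns


class Node:
    """Generates IDs locally inside its own namespace, with no further coordination."""

    def __init__(self, namespace: int):
        self.namespace = namespace
        self._counter = itertools.count()

    def new_id(self) -> int:
        local = next(self._counter)
        if local >= 1 << LOCAL_BITS:
            raise OverflowError("namespace exhausted, ask for a new one")
        return (self.namespace << LOCAL_BITS) | local


if __name__ == "__main__":
    coordinator = Coordinator()
    a = Node(coordinator.allocate_namespace())
    b = Node(coordinator.allocate_namespace())
    print(a.new_id(), a.new_id(), b.new_id())
```

Two nodes can never produce the same ID because their high bits differ, so the coordinator is only contacted when a node starts up or runs out of local space.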


> but if anything happens to your PRNGs, you risk countless collisions to slip up by you in production before you realize what happened.

UUID generation basically has to use a CSPRNG to avoid collisions (or at least a very large-state insecure PRNG).

Because of the low volume, simply using /dev/urandom on each node makes the most sense. If /dev/urandom is broken, so is your TLS stack and a host of other security-critical things; at that point worrying about video ID collisions seems silly.


I worry about state corrupting problems, because they tend to linger long after you have a fix.

Is the extra 64 bits simply used to lower the risk of collision?


