Plan B for UUIDs: double AES-128 (pvk.ca)
88 points by ingve 2 days ago | 59 comments





I really like ULID for this problem (e.g: https://github.com/ahawker/ulid)

- similar characteristics to UUID: same number of bytes, equivalent collision resistance, hard to guess, etc.

- starts with a timestamp, is monotonic, and is lexicographically sortable: not only does it have great locality, but sorting is super easy and most code gets it right by default.

- can be created from an existing date or uuid, and exported to a uuid, so there is a migration path.

- you get the creation date for free, which is always nice. As a bonus, you can now use the uuid as a base for dispatching, bracketing, sharding, etc.
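
For reference, a rough sketch of the 128-bit layout that makes all of this work (not the reference implementation; std::random_device stands in for a real CSPRNG):

    #include <array>
    #include <chrono>
    #include <cstdint>
    #include <random>

    // ULID layout sketch: 48-bit Unix-millisecond timestamp followed by 80 random
    // bits, 128 bits total, so byte-wise (and Crockford base32) comparison sorts
    // by creation time first.
    std::array<uint8_t, 16> make_ulid_bytes() {
        using namespace std::chrono;
        const uint64_t ms = duration_cast<milliseconds>(
            system_clock::now().time_since_epoch()).count();

        std::array<uint8_t, 16> id{};
        for (int i = 0; i < 6; ++i)              // big-endian timestamp -> sortable prefix
            id[i] = (ms >> (8 * (5 - i))) & 0xff;
        std::random_device rd;                   // stand-in for a real CSPRNG
        for (int i = 6; i < 16; ++i)             // 80 bits of randomness
            id[i] = static_cast<uint8_t>(rd());
        return id;
    }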


> you get the creation date for free, which is always nice

Leaking information is not “always nice”. For example, using that as an ID for a medical record would reveal the specific date of diagnosis or treatment.


Sure. And an extra $100,000 in your bank account isn't "always nice". It may get you flagged by the IRS and make you more likely to be audited.

There are always exceptions.

"always nice" is a matter of speaking.

There are very few things that are always true, and I assume readers know that.


While I don't disagree with this statement, it also seems to make sense to me that you were corrected. Words mean things and if you use a word in a way that someone disagrees with, there you go.

Probably just don't take the correction so personally. I think quite often it is the case that a person "corrects" in this manner (remember: this is a public online forum) because they want the thing they say to be visible alongside the thing they responded to.

Not you, but instead the next reader, will read what you wrote and then build their medical/finance/legal application thinking that the availability of additional metadata can't possibly have any downsides. But (perhaps) now that reader will read the response and not make said mistake.


> "always nice" is a matter of speaking.

Incorrectly. It's a matter of speaking incorrectly.


You remind me of the TV bits in the movie "The Invention of Lying".

They can't lie, so they are perfectly correct.


Also a fan of this format, though I wish it were native to Postgres so we don't have to convert to UUID for storage (maybe a custom type would do it).

For reference here is my plpgsql implementation: https://gist.github.com/Sytten/8fa334e8a5de5d49924c794811a80...


Isn't UUIDv7 basically ULID? So you can just use UUIDv7 for the same effect.

UUIDv7 is still a draft, but it seems relatively uncontroversial, so it will likely be accepted. Still, a lot of libraries and systems with UUID support are hesitant to add v7 support until it gets finalised, so for now ULID has more widespread support.

They're basically the same, except UUIDv7 has a few bits less randomness because of the UUID version and variant fields, so I'm sure in 5-10 years the answer will be "just use v7", but for the time being, for those use cases, I'd stick to ULID.
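
Roughly, the draft v7 layout looks like this (a sketch, not a conforming implementation; std::random_device stands in for a real CSPRNG):

    #include <array>
    #include <chrono>
    #include <cstdint>
    #include <random>

    // Draft UUIDv7 sketch: 48-bit Unix-ms timestamp, then random bits, with the
    // 4-bit version and 2-bit variant fields overwriting six of the random bits --
    // the "few bits less randomness" compared to ULID.
    std::array<uint8_t, 16> make_uuidv7_bytes() {
        using namespace std::chrono;
        const uint64_t ms = duration_cast<milliseconds>(
            system_clock::now().time_since_epoch()).count();

        std::array<uint8_t, 16> u{};
        for (int i = 0; i < 6; ++i)                // sortable timestamp prefix
            u[i] = (ms >> (8 * (5 - i))) & 0xff;
        std::random_device rd;                     // stand-in for a real CSPRNG
        for (int i = 6; i < 16; ++i)
            u[i] = static_cast<uint8_t>(rd());
        u[6] = (u[6] & 0x0f) | 0x70;               // version = 7
        u[8] = (u[8] & 0x3f) | 0x80;               // variant = 10xx
        return u;
    }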


We evaluated ULID, but we wanted something future-proof for our decentralised secret sharing system. So we implemented UUIDv7 in TypeScript, called `IdSortable`: https://github.com/MatrixAI/js-id

It allows strict monotonic IDs when you provide it the previously generated ID. The resulting data structure is a Uint8Array making it easy to put into binary structures. Can also be used as a key inside any POJO record.


Same here. I am really, really hopeful there is a drop-in replacement for ULIDs if they ever get abandoned.

I expect the new UUID formats (https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...) to fill that need once they get standardized and made available in standard libraries or popular packages.

Especially UUIDv7: if it becomes standard, ULID won't be necessary anymore, unless you want a shorter text representation.

But ULID exists right now; UUIDv7 is still in the making.


The last draft expired without any consensus, so I don't have high hopes for this one. We will see.

It's not a very complicated algorithm; it can't really be abandoned, since you can always reimplement it and keep using it.

> where the fundamental problem is the tension between nicely structured primary keys that tend to improve spatial locality in the storage engine, and unique but otherwise opaque identifiers that avoid running into Hyrum’s law when communicating with external entities and generally prevent unintentional information leakage.

I tend to use an integer for the database PK and internal associations, then an external_id for data exchanges with external parties.


Imma just leave this here... https://github.com/dragontamer/AESRand/blob/master/AESRand_L...

Algorithm:

    #include <immintrin.h>   // AES-NI / SSE intrinsics
    #include <array>

    // 128-bit constant: both the 2x64-bit odd increment and the AES round key
    const __m128i increment = _mm_set_epi8(0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13,
                                           0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01);

    void AESRand_increment(__m128i& state) {
        state = _mm_add_epi64(state, increment);   // two parallel 64-bit adds of an odd constant
    }

    std::array<__m128i, 2> AESRand_rand(const __m128i state) {
        __m128i penultimate = _mm_aesenc_si128(state, increment);   // one AES round
        return {_mm_aesenc_si128(penultimate, increment),           // two more rounds -> 256 bits out
                _mm_aesdec_si128(penultimate, increment)};
    }
TL;DR with respect to the original article: the AESENC instruction generates significant entropy as long as all four uint32 lanes change. Instead of adding +1 to the 128-bit pre-encryption state, add " __m128i increment = _mm_set_epi8(0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13, 0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01); "

Because this is a 64-bit add of an odd number, it will repeat after 2^64 iterations, which should be good enough for most people's purposes.

Once you have that kind of add happening, you end up with more than sufficient entropy from AESENC. For those unaware, AESENC performs roughly a tenth of a full AES; it's the hardware AES instruction on x86. Repeating AESENC over the expanded key 10 times results in a proper AES encryption (or decryption).

---------

Because modern CPUs have 2x parallel AES pipelines (2 AES instructions per clock tick), my RNG supports a 256-bit output every 4 clock cycles. I accomplish this by having AESenc on the first argument, and then AESenc / AESdec on the results of that (encode being the first 128-bit result, and decode being the 2nd 128-bit result).

--------

If you took my RNG output (which can be generated deterministically every 4 clock cycles) and then AES-encrypted it, you would probably get cryptographically random numbers that are guaranteed not to repeat.


Many years ago people (including me [1]) started using the Skipjack cipher for the purpose of obfuscating sequential numbers when they needed to be externalized (e.g. putting auto-incremented ids in urls). Since it was originally designed to encrypt media in realtime on 90s era silicon it's quite low overhead. I've often wondered if/when a database vendor would get around to building similar functionality directly into their engine since that would make the approach much more practical to adopt.

[1] https://www.npmjs.com/package/skip32
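
The shape of the trick, for anyone who hasn't seen it: any 32-bit block cipher is a bijection on 32-bit values, so each auto-incremented ID maps to a unique opaque number and back. A toy balanced Feistel to illustrate (explicitly not Skipjack/Skip32, and not for deployment; the round function and keys here are made up):

    #include <cstdint>

    // Toy 32-bit balanced Feistel network. Any Feistel is invertible regardless of
    // the round function, so each sequential ID maps to one opaque 32-bit value.
    // round_fn and the 8-word key are placeholders -- use a vetted construction
    // (Skip32, Speck32/64, ...) for real obfuscation.
    static uint16_t round_fn(uint16_t half, uint32_t key, int round) {
        uint32_t x = (half ^ key) * 0x9E3779B1u + static_cast<uint32_t>(round);
        return static_cast<uint16_t>(x ^ (x >> 16));
    }

    uint32_t obfuscate_id(uint32_t id, const uint32_t keys[8]) {
        uint16_t left = id >> 16, right = id & 0xffff;
        for (int r = 0; r < 8; ++r) {
            uint16_t tmp = left ^ round_fn(right, keys[r], r);
            left = right;                          // classic Feistel swap
            right = tmp;
        }
        return (uint32_t(left) << 16) | right;     // invert by running the rounds in reverse
    }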


It seems to me that this is (albeit in a somewhat convoluted way) "solving" one problem but opening a whole can of (cryptographic) worms with another.

Namely, the old chestnut of performing decryption on untrusted data. Since basic AES-128 is not an AEAD cipher, you are essentially in the realm of trusting untrusted input.

Personally my preference would be to maintain UUID (or possibly KSUID-type implementations depending on application). And then if you need to partition/shard/whatever you do that by some other means (i.e. by or in conjunction with a different database column).


What’s the threat model exactly? Here AES is just a bijection between 128bit numbers.

It’s not like the AES cipher is going to blow up when you try to decrypt a special harmful bit pattern! It’s just a bunch of ones and zeros…

If someone gives you a bogus ID, then it’s gonna decrypt to a non-existent ID. Which can also happen with plaintext IDs…


> Rather than try to pinpoint the exact level of dedication we’re trying to foil, from curious power user to nation state actor, let’s aim for something that’s hopefully as hard to break as our transport (e.g., HTTPS). AES should be helpful.

AES is in that range for the short term. Not over months or years.

Anyone considering this should keep in mind youtube's mass privating of older unlisted videos, which was needed because they used to use an encrypted counter for video IDs. That key was in so many places, and it got out.


Have you any source for that being the reason? https://support.google.com/youtube/thread/114633828/changes-... is all I find, and it says this:

> In 2017, we rolled out a security update to the system that generates new Unlisted video links. This update included security enhancements that make the links for your Unlisted videos even harder for someone to discover if you haven’t shared the link with them.

That doesn’t imply a leaked key to me (I’d consider that wording dishonest if it actually is describing a leaked key), but rather an increase in the sparsity of the ID space, presuming they were using encrypted sequential IDs.


> Have you any source for that being the reason?

Not a particularly good one.

> That doesn’t imply a leaked key to me (I’d consider that wording dishonest if it actually is describing a leaked key)

I'm not sure exactly what you mean. Even if they did announce a leaked key, I don't think it would go in that paragraph anyway. So that paragraph wouldn't be dishonest no matter what.

> but rather an increase in the sparsity of the ID space, presuming they were using encrypted sequential IDs.

The reason you do that is because you're worried about key leaks, isn't it? I can't really think of another reason.

And then they felt the need to pull the emergency lever and hide all videos on the old system.


Do you have any source? Is this something you heard, or an interpretation of your own? (I’m not being hostile, I’d just like to know.)

For that matter, have you any definite knowledge that YouTube uses encrypted sequential identifiers for its URLs? Looking from the outside, there’s no visible difference between this approach and something based on randomness, save that something random is likely to be longer because that approach is normally paired with statistical collision avoidance rather than checked collision avoidance.

The reason you increase the sparsity of an ID space is because you have discovered a weakness in the generation scheme (which allows faster guessing of valid IDs), faster testing has become possible than you planned for, or occupation has exceeded what you deem to be acceptable probabilities (which could be that occupation has been increasing, or that it was constant but you’ve changed your opinion on acceptable probabilities). If you’re worried about key leaks, merely changing keys gets you only a brief respite for an already-leaked key.

In this instance, they gave one month’s notice of a change, and the underlying ID scheme change had clearly occurred about six months earlier. This suggests to me that they did not know of any immediate threat, but were being cautious in just the way they described, for the second or third reason.

If they did know of a leaked key, I would expect them to state the consequences more forthrightly in the announcement, and have made the change rather more rapidly than they did.


> Do you have any source? Is this something you heard, or an interpretation of your own? (I’m not being hostile, I’d just like to know.)

I don't remember where I first heard it, and I've seen two people who claim to have direct knowledge say so on HN. One specified it was a 3DES key.

> For that matter, have you any definite knowledge that YouTube uses encrypted sequential identifiers for its URLs?

I mean the whole point is that they don't anymore, but that doesn't change the old video IDs. I have no proof but I'll keep an eye open to see if the supposed key ever becomes more available.

> Looking from the outside, there’s no visible difference between this approach and something based on randomness, save that something random is likely to be longer because that approach is normally paired with statistical collision avoidance rather than checked collision avoidance.

YouTube can easily afford checked avoidance. They're at around a hundred video uploads a second. And they have good reasons to keep video IDs the same short size.

> The reason you increase the sparsity of an ID space is because you have discovered a weakness in the generation scheme (which allows faster guessing of valid IDs), faster testing has become possible than you planned for, or occupation has exceeded what you deem to be acceptable probabilities (which could be that occupation has been increasing, or that it was constant but you’ve changed your opinion on acceptable probabilities). If you’re worried about key leaks, merely changing keys gets you only a brief respite for an already-leaked key.

Well, yes, the claim is that they changed the generation scheme because of a weakness. And they didn't change keys; they got rid of the key for new videos.

> In this instance, they gave one month’s notice of a change, and the underlying ID scheme change had clearly occurred about six months earlier. This suggests to me that they did not know of any immediate threat, but were being cautious in just the way they described, for the second or third reason.

> If they did know of a leaked key, I would expect them to state the consequences more forthrightly in the announcement, and have made the change rather more rapidly than they did.

They seem to have acted fast enough to avoid any worries, so why act faster? And why directly invite bad press? And if it comes up they could still talk about how unlikely it is to find any specific video, and guessing was always going to be possible, etc.


> AES is in that range for the short term. Not over months or years.

Why do you say so?


Oh, I guess my second paragraph is not a good enough explanation?

The key leaked, which means anyone that has the key can generate all the IDs from videos uploaded before 2017. If your key ever gets out, you face the same problem. The secrecy of all existing IDs is gone.

With HTTPS, you can rotate keys often, older connections would have to have been packet-captured to get decrypted, and newer connections mandate perfect forward secrecy.


Oh, I somehow managed to understand those two paragraphs as two separate observations. Sorry.

Yes, the key can leak and there's no forward secrecy in this scheme. That's a valid point.


I read this as Plan B for IUDs and was confused why we would want to encrypt birth control.

Same. Had to take a sip of coffee and do a double take.

Maybe I'm way off base here, but this seems like a really really niche solution for a situation you probably(?) won't be in for long.

You have a single DB running on a single machine. But you also care so much about write performance that you need sequential IDs. How much longer is that single DB going to work for you, if write performance has become an issue? I wonder. Because soon you might need to shard, at which point you will lose sequential IDs and the benefit of spatial locality, right? There seems to be a narrow band of operation where you're seeing the performance benefit of sequential IDs but can still function on a single DB instance. I feel like you may be painting yourself into an architectural corner and a headache rather than just doing the easy thing first and using UUIDs.


Most projects can reasonably and usefully stick with central ID allocation forever.

And even if you want to decentralise ID allocation, so long as you still have control over the nodes, you can still use roughly sequential identifiers if you choose, by multiplying by some standard value and offsetting the sequence by a node ID. You could reserve the ten least significant bits for a node ID, for example, allowing up to 1024 nodes to generate IDs. So node 0 would allocate IDs {0, 1024, 2048, …}, node 1 {1, 1025, 2049, …}, &c. This requires more planning and possibly more maintenance, but has the advantage of producing shorter IDs, which can be highly desirable.
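
A sketch of that allocation (names are illustrative; next_local_seq would be each node's own counter):

    #include <cstdint>

    // Reserve the ten least significant bits for a node ID (up to 1024 nodes);
    // node 0 hands out 0, 1024, 2048, ..., node 1 hands out 1, 1025, 2049, ...
    constexpr uint64_t kNodeBits = 10;
    constexpr uint64_t kStride   = 1ull << kNodeBits;   // 1024

    uint64_t allocate_id(uint64_t node_id, uint64_t next_local_seq) {
        return next_local_seq * kStride + node_id;       // roughly sequential, no cross-node collisions
    }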

Random IDs do have a place, but they’re massively overrated. The length required to control the collision risk is very significant.


> You have a single DB running on a single machine. But you also care so much about write performance that you need sequential IDs. How much longer is that single DB going to work for you, if write performance has become an issue?

Potentially a long time. Anecdotally: I helped a PG user in ~2012 (+- 2 years) move to some sequential-ish uuid-like thing because they were facing outages, and as of last year they were still using a single instance for that service. And they grew by a lot.

> Because soon you might need to shard, at which point you will lose sequential IDs and the benefit of spatial locality, right?

You don't need (or even want) perfect spatial locality. It's OK as long as it's not uniformly random.


I remember years ago rejecting this approach (encrypting internal keys) because the key size I was using wasn't the same as the block size of any standard block cypher, and I didn't know of any variable-block-size cypher.

If the target is 128 bit identifiers though you can use AES.


Convert the identifier to a multiple of the block size, encrypt it there and back; the AES round trip is lossless.

I like this because once you decrypt you can have all the nice properties of integer keys, like ordering.
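
A minimal sketch of that round trip using OpenSSL's EVP interface (zero-padding a 64-bit internal ID to one AES block; key management and error handling omitted, and the helper names are mine):

    #include <openssl/evp.h>
    #include <array>
    #include <cstdint>
    #include <cstring>

    using Block = std::array<uint8_t, 16>;

    // Map a 64-bit internal ID to an opaque 128-bit external ID. One AES-128 block
    // in ECB mode is fine here because every plaintext block is unique by design.
    Block encrypt_id(const uint8_t key[16], uint64_t id) {
        Block in{}, out{};
        std::memcpy(in.data(), &id, sizeof id);       // zero-pad the ID to 128 bits
        EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
        EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), nullptr, key, nullptr);
        EVP_CIPHER_CTX_set_padding(ctx, 0);           // exactly one block, no padding
        int len = 0;
        EVP_EncryptUpdate(ctx, out.data(), &len, in.data(), 16);
        EVP_CIPHER_CTX_free(ctx);
        return out;                                   // opaque external ID
    }

    // The round trip is lossless: decrypting recovers the original integer.
    uint64_t decrypt_id(const uint8_t key[16], const Block& ext) {
        Block out{};
        EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
        EVP_DecryptInit_ex(ctx, EVP_aes_128_ecb(), nullptr, key, nullptr);
        EVP_CIPHER_CTX_set_padding(ctx, 0);
        int len = 0;
        EVP_DecryptUpdate(ctx, out.data(), &len, ext.data(), 16);
        EVP_CIPHER_CTX_free(ctx);
        uint64_t id;
        std::memcpy(&id, out.data(), sizeof id);
        return id;
    }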

This is an approach that has been around for a long time, mostly with 32-bit and 64-bit block sizes. (e.g. if you search around, you’ll hear about Skip32 as a common choice Back In The Day, a variant of Skipjack crafted for the purpose.)

I’ve been working on this problem for a few months from the perspective of yielding human-friendly IDs, and reached what I consider a finished state a few weeks ago, but haven’t published it because I was thinking about other things. I’ll probably publish it in the middle of next week. If anyone’s interested in a sneak peek (and final review!), they can email me and I’ll give a link.

My initial just-about-ready-to-publish version encrypted with Speck32/64 (for 32-bit IDs) and Speck64/128 (for 64-bit IDs), and did base-58 stringification (ASCII alphanumeric minus 0, O, I, l). This produced about four billion 6-character IDs, then 11-character IDs after that.

I was also toying with things with different block sizes, since I wasn’t enthusiastic about the 6-/11-character ID length. After feedback from interested people I became more determined to do so, because the feedback convinced me that case-sensitivity was sufficiently undesirable, and going down to base-36 or lower took it up to 7/13 characters.

I tried Hasty Pudding, but it was too big and complex and unknown for my liking, though I did implement it and it worked… though I couldn’t even find the official test vectors, since they’re on a dead FTP site and the author didn’t respond to an email. I looked into FF1 and FF3-1 too, but patent encumbrance decided me against them. I ended up daring to create a variant of Speck with multiples-of-five word sizes. This actually ended up making my code smaller: the JavaScript implementation of my entire scheme is 1680 bytes minified, 915 gzipped with zopfli (and shedding a bit of functionality it can easily be taken under 1200/700).

So now it’s using my custom Speck variant and a base-32 alphabet (lowercase letters and numbers minus 0, o, 1, l). [0, 2²⁰) becomes four characters, [2²⁰, 2³⁰) six, [2³⁰, 2⁴⁰) eight, &c. up to [2⁹⁰, 2¹⁰⁰) twenty. I also add one more step to allow for making valid IDs less guessable and the inclusion of a type discriminant if desired (and these are why I had the cipher go up to a 100-bit blocksize): id → sparsity × id + discriminant.

Performance: my Rust implementation encodes all four-character IDs (about a million of them) in about 50ms on my AMD Ryzen 5800HS laptop, single-threaded, which at about 4.4GHz is a bit over 200 cycles each; but that’s actually 99% stringification: a benchmark of just encrypting all 20-bit values takes about 475,000ns, which looks like it might be exactly 2 cycles each, while still being Real Encryption. (I also have Python and JavaScript implementations, but I haven’t benchmarked them.)


This is a pretty clever idea, but I have to wonder how much of a concern an "ExternalId" column is today.

Assuming you are really paranoid, and each is 256 bits of the best CSPRNG output available, then for a table containing 1 million records you are only talking about an additional ~32 megabytes of overhead.

With modern CPUs and most applications, this extra amount of data is not going to move the needle at all.


The two approaches can be combined!

You can use AES to generate and store the ExternalId column. As you pointed out, this saves you from having to encrypt/decrypt it on the fly every time, at the cost of an extra lookup and some storage (you're going to want an index on that column as well).

If it ever comes up that you prefer to do it on the fly, you can just drop the column. All your external references will still be valid since the ExternalId was generated deterministically.


Yup, I once tried using AES to create UUIDs, but then ended up with a page that would send a thousand results to the client, each one needing a UUID generated. Multiply that by a thousand users and the CPU time started to add up; it obviously wasn't going to be as scalable as just storing the UUID.

Possibly only doing one round of AES or skipping the key expansion steps would make it more scalable, but I didn't have the knowledge to do that securely.


Plus there's the advantage of being able to change the way you generate your externalIds. As long as they are stored, the external links to said records stay stable, which is typically important for integrations.

I've been collecting all my notes on UUIDs here https://github.com/sw-yx/brain/blob/master/R%20-%20Dev%20Not... in case it is helpful to anyone

There is one thing which troubles me here: Postgres does not support clustered indexes, so the "spatial locality of the storage engine" is non-existent; only the indexes should be impacted.

Why would you have "UUIDs writing to pages all over the place"? Or is that, and the WAL amplification, solely about the updates to random btree leaves?


(Not a DB Engineer but poke at enough different DBs to armchair semi-confidently)

In the case of Postgres I'd expect to see it indeed be mostly about random updates to leaves.

I suppose Partitioning could come into play but I'm not sure how you'd set it up in a way that was beneficial and dynamic enough for the situation in the first place.

But it's worth remembering that most other DBs (Oracle, MSSQL, MySQL, etc.) do have clustered indexes, and in some cases you can't just flip over to Postgres. (Even if you want to!)


Both MSSQL and PostgreSQL support fill factor as well, which nobody seems to mention. Setting a lower fill factor leaves more room in the pages to avoid page splits with random writes.

Also, random writes can be beneficial in that you are not hot-spotting the last page as you are with sequential IDs.

It's not black and white; there are some good advantages to random IDs, but you need to manage them properly.


I struggle to think of use cases where involving encryption in creating, parsing, evaluating, or otherwise using primary keys is anything but a mistake. But I'm sure I'm insufficiently creative.

Encrypting sequential identifiers is an excellent way of avoiding leaking the ID sequence, which is regularly at least mildly sensitive data, generally more sensitive than you expect, and preventing ID enumeration (either completely, if the ID space is sparsely occupied, or at least sequentially if it’s fullish).

In what circumstances does the value of that encryption outweigh its costs? Particularly in high-RPS (e.g. 1M+ RPS) applications?

My intuition is that UUIDs, ULIDs, etc. would be orders of magnitude faster to {generate, parse} than anything involving an {encryption, decryption} step; am I off base?


Internally you can work completely with sequential int64 IDs, and at the API edge you have to spend ~30 clock cycles (about the cost of an L3 cache hit) on each id to encrypt/decrypt it. For any workload that can benefit from better data locality from sequential IDs this could well be a performance win over UUIDs, especially at scale.

Of course ULIDs offer something very similar. But they leak more information, which makes them less desirable for some use cases.


Encryption is much, much faster than you think. In my carefully-optimised Rust implementation of my scheme (see another comment in this thread), which uses a variant of Speck128 for encryption and simple base-conversion into a 20-character array (no allocations!), the base conversion takes 100× as long as the encryption, which, in large batches, takes two cycles. Now AES (the topic here) is slower than Speck (which is one of the very fastest “real encryption” schemes out there), so the base-conversion wouldn’t be as much slower relative to it (certainly nowhere near 100×), but it still goes to show something.

I would not expect you to be able to even measure the difference on general workloads. On massive batch workloads, I think there’s a good chance that the encryption approach with my scheme at least (AES would be a little more iffy) and a 64-bit ID would be just a teensy bit faster than a UUID primary key because of dealing with less data in the database, network and app before the encryption is done—in fact, assuming equivalent stringification I have decided to confidently assert my scheme would be faster, since the only thing that can slow it down is the added encryption code and round keys, which at scale are negligible, and my scheme will save around 60 cycles on stringification due to needing only 20 characters rather than 26. (And you could reduce it another six characters and sixty cycles if you limited it further to 70-bit IDs.) But even there I think you’d be straining to measure it.


Came here for this comment. Why in the hell are people combining encryption with UUIDs? If you are putting any encryption on a primary key, you have failed in some other aspect of your security design. This whole article just screams red flag to me.

UUIDs have always seemed to be an over-engineered solution to a minor problem. In the cases I've seen them used, folks are trying to get away from auto-incremented identifying serial numbers. Ostensibly, this is because it makes guessing other IDs in the database (or whatever) fairly simple, and that could be a source of information leakage.

To me this is just security through obscurity. Just because I know an identifier doesn't mean I should be able to do _anything_ with it. If there are contexts where you really can't share the ID you should salt and hash it and use that value.


> Just because I know an identifier doesn't mean I should be able to do _anything_ with it.

Depends. If it’s a sequential ID, then sure, but if it’s sufficiently random and sparse, then it can be fine to use it as authentication.

But it’s not always about authentication and security. The information leak itself is regularly a problem. See https://en.wikipedia.org/wiki/German_tank_problem for one practical example, though you normally don’t even need anything that fancy. Businesses do this where they can to analyse competitors, figure out how big they are, how they should price things, &c.
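
(Concretely: if a competitor uses sequential order IDs and the largest one you've seen after sampling k of them is m, the classic German-tank estimator m + m/k - 1 already gives a decent read on their total order count.)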


If it were easier to have scoped IDs it wouldn't be so bad. For example, instead of one OrderID for all orders in the order table, have customer-specific order numbers that start at 1 and are only unique in tandem with a customer ID.

It's a pain in the butt to implement this in most databases, though.


At a minimum, sequentially generated IDs trivially expose creation rate, which is almost always significant information.

> If there are contexts where you really can't share the ID you should salt and hash it and use that value.

That's UUIDv5.


I've actually been using this scheme for several years now, but AES itself is not authenticated, so I added a truncated SHA-256 to the ID to make it more robust.

What exactly would you be authenticating? The scheme discussed here is just using AES to generate the equivalent of a UUID.

A maximal length LFSR will give sequential IDs without the problem of known zero bits.
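
Something along these lines, presumably (Galois form with a commonly cited maximal 64-bit tap set; it visits every nonzero 64-bit value exactly once, so IDs don't carry long runs of predictable zero bits, though this is scrambling, not encryption):

    #include <cstdint>

    // 64-bit Galois LFSR, taps 64,63,61,60 (a commonly cited maximal-length
    // polynomial). Seed with any nonzero value; the sequence has period 2^64 - 1.
    uint64_t lfsr_next(uint64_t state) {
        uint64_t lsb = state & 1u;
        state >>= 1;
        if (lsb) state ^= 0xD800000000000000ull;   // feedback mask for the taps above
        return state;
    }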



