Analyzing New Unique Identifier Formats (UUIDv6, UUIDv7, and UUIDv8) (scaledcode.com)
82 points by futurecat 52 days ago | 66 comments



I am a big fan of the new uuid v7 format.

It has the advantage of being a drop-in replacement in most places where v4 is used today. It also has an advantage over other specs such as ULID: the timestamp can be parsed easily even in languages and databases with no libraries, because you just need an obvious substr, replace, and from_hex to extract it. Other specs typically use some custom lexically sortable base64 or similar encoding that always needs a library.
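A minimal sketch of that extraction, assuming the standard v7 layout where the first 48 bits (12 hex digits) are the Unix time in milliseconds. The function name and the sample value are made up for illustration.

```python
from datetime import datetime, timezone

def uuid7_timestamp_ms(uuid_text: str) -> int:
    """Return the Unix timestamp in milliseconds stored in a UUIDv7 string."""
    hex_digits = uuid_text.replace("-", "")   # the "replace" step
    return int(hex_digits[:12], 16)           # "substr" + "from_hex": first 48 bits

# Hand-built sample value, for illustration only.
sample = "018f4e3a-7b2c-7abc-8def-0123456789ab"
ts_ms = uuid7_timestamp_ms(sample)
created_at = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
```

The same three steps translate directly to SQL string functions in most databases, which is the point being made above.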

Early drafts of the spec included a few bits to increment when several IDs were generated locally in the same millisecond, for sequencing. This was a good fit for lots of use cases, like using the new IDs for events generated in normal client apps. Even though it didn’t make the final spec, I think it is worth implementing, as it doesn’t break compatibility.
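A rough sketch of what such a local sequence could look like, assuming the 12 bits right after the version nibble are used as the counter and the rest stays random. The class name and the overflow handling (borrowing the next millisecond) are assumptions for illustration, not something taken from the spec text.

```python
import os
import threading
import time
import uuid

class SequencedUuid7:
    """UUIDv7-layout generator that counts up within a millisecond (sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_ms = -1
        self._counter = 0

    def next(self) -> uuid.UUID:
        with self._lock:
            now_ms = time.time_ns() // 1_000_000
            if now_ms > self._last_ms:
                self._last_ms = now_ms
                self._counter = 0
            else:
                # Same millisecond (or the clock stepped back): bump the counter.
                self._counter += 1
                if self._counter > 0xFFF:
                    # Counter exhausted: borrow the next millisecond.
                    self._last_ms += 1
                    self._counter = 0
            ts_ms, seq = self._last_ms, self._counter

        rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
        value = (
            (ts_ms & ((1 << 48) - 1)) << 80   # 48-bit millisecond timestamp
            | 0x7 << 76                        # version 7
            | seq << 64                        # 12-bit local sequence counter
            | 0b10 << 62                       # RFC variant bits
            | rand_b                           # 62 random bits
        )
        return uuid.UUID(int=value)
```

IDs from one generator instance sort in creation order, and anything that only checks the version and variant bits still sees an ordinary v7 value. The lock is local to the process; nothing is coordinated across machines.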


There’s already a 72-bit random part. That should be sufficient to address conflicts.

Incrementing a sequence completely kills the purpose of a UUID, and requires serialization/synchronization semantics. If you need that, just use a long integer.


There is utility in knowing that event A comes before event B in the same local system even if both are generated in the same millisecond. I have found this useful, e.g., when UI latency gets so low that you can have a user interaction and a menu opening in the same millisecond. Being able to plot them on a timeline without any kind of joins is nice.

Anyway, as I said, it was dropped from the spec


If you have a billion users and they each generate 64 random 72-bit numbers, you have a ~63% chance of a collision.
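A rough check of that figure, using the standard birthday approximation p ~= 1 - exp(-n^2 / 2N) with n = 64 billion IDs and N = 2^72 possible values. The function name is made up for illustration, and the model assumes every ID lands in the same millisecond with uniformly random bits.

```python
import math

def collision_probability(samples: int, bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2 ** bits
    # expm1 keeps precision when the exponent is close to zero.
    return -math.expm1(-(samples * samples) / (2 * space))

total_ids = 1_000_000_000 * 64      # a billion users, 64 IDs each
p = collision_probability(total_ids, 72)
```

Under these assumptions the result comes out at roughly 35%. A figure near 63% (1 - 1/e) falls out if 64 billion is rounded up to 2^36 and the factor of 2 in the exponent is dropped.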


If they all did that in the same millisecond?


Today we're at 768 threads on the latest AMD system. Sub-millisecond performance is possible (I don't know about this algorithm specifically).

If you've got a spare 50k kicking around, we could set up a test system and find out how likely it is to happen...


Billion in one millisecond...


If you have a thousand cores running at 2GHz, that's two billion cycles a millisecond.

An RNG like ChaCha8 can run at about 2 cycles per byte, and extrapolating, 18 cycles for 72 bits. Not far off.


Ah, yeah, ignore me.


What do you consider the purpose of a UUID?


Asynchronous unique ID generation.


You can have both asynchrony and sequence by encoding thread ID in the UUID too, and make the sequence a thread local state.



Then, just use thread ID and integer sequence pairs instead of trying to stuff them into an arbitrary binary format.


Thread IDs get repeated across reboots. Integer sequence may also repeat in a distributed scenario, unless you want a massive bottleneck. You do need other stuff (timestamps, random number, etc.).


Yes, but that’s what parent proposed? :shrug:


I have recently wondered why Ruby on Rails uses a full-length SHA256 for its ETag fingerprinting (64 characters) when a UUID at 36 chars would probably be entirely enough to prevent collisions and be more readable at the same time. Esbuild, on the other hand, seems to use just 32 bits (8 chars) for its content hash.


Isn’t it because you can generate the same content two different times and hash it and come to the same ETag value?

Using a UUID wouldn’t help here because you don’t want different identifiers for the same content. Time-based UUID versions would negate the point of ETag, and otherwise if you use UUIDv8 and simply put a hash value in there, all you’re doing is reducing the bit depth of the hash and changing its formatting, for limited benefit.


I would assume that you would only create a new UUID if the content of the tagged file changed serverside.

Benefits are readability and a reduced amount of data to be transferred. A UUID is reasonably safe to be unique for the ETag use case (I think 64 bits actually would be enough).


The point of the content hash is to make it trivial to verify that the content hasn’t changed from when its hash was made. If you just make a uuid that has nothing to do with the file’s contents, you could easily forget to update the UUID when you do change its content, leading to invalid caches (or generate a new UUID even though the content hasn’t changed, leading to wasteful invalidation.)

Having the filename be a simple hash of the content guarantees that you don’t make the mistakes above, and makes it trivial to verify.

For example, if my css files are compiled from a build script, and a caching proxy sits in front of my web server, I can set content-hashed files to infinite lifetime on the caching proxy and not worry about invalidating anything. Even if I clean my build output and rebuild, if the resulting css file is identical, it will get the same hash again, automatically. If I used UUID’s and blew away my output folder and rebuilt, suddenly all files have new UUID’s even though their contents are identical, which is wasteful.
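A small sketch of the difference, assuming a build step that names each file after a truncated SHA-256 of its bytes. The helper name, the 16-character truncation, and the sample CSS are all made up for illustration.

```python
import hashlib
import uuid

def hashed_name(stem: str, suffix: str, content: bytes, digest_chars: int = 16) -> str:
    """Name a build artifact after a truncated SHA-256 of its content."""
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    return f"{stem}.{digest}{suffix}"

css = b"body { margin: 0; }\n"
first_build = hashed_name("site", ".css", css)
rebuild = hashed_name("site", ".css", css)   # same bytes, so the same name
edited = hashed_name("site", ".css", css + b"p { color: red; }\n")

# A random UUID changes on every build even when the bytes do not.
uuid_first, uuid_rebuild = uuid.uuid4(), uuid.uuid4()
```

Here `first_build` and `rebuild` are identical, so a clean rebuild invalidates nothing, while `edited` gets a new name automatically. The two UUIDs differ even though nothing about the content changed.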


SHA256 has the benefit that you can generate the ETag deterministically without needing to maintain a database (i.e. content-based hashing). That way you also don’t need to track whether the content changes, which reduces bugs that might creep in with UUIDs. Also, if you typically update only a subset of all files, then aside from not needing to keep track of assigned UUIDs per file, you can do a partial update. The reasons to do content-based hashing are not invalidated by a new UUID format.


UUIDs and hashes are not the same.

For example, hashes are often taken over untrusted data, which could be manipulated to produce a collision.

UUIDs aren't meant to protect against that.

I'm sure RoR just did the straightforward thing, didn't get cute, and called it a day.


For the same reason that git blobs are identified by their SHA and not a synthetic identifier. It’s a content hash.


I don’t understand the part where monotonicity of UUIDs is discussed. UUIDs should never be assumed monotonic, or in a specific format per se. If you strictly need monotonicity, just use an integer counter. Let UUIDs be black boxes, and assume that v7 is just a better black box that deals with DB indexes better.


The nice thing about them is you don’t have to assume, though, because the version is baked into an octet. Does the 3rd field start with a 4? v4. 7? v7. Etc.
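As a quick sketch of that check: the version is the first hex digit of the third dash-separated field, and Python's standard `uuid` module exposes the same value as `.version`. The helper name and the v7 sample are made up for illustration.

```python
import uuid

def version_from_text(uuid_text: str) -> int:
    """Read the version digit: first hex character of the third field."""
    return int(uuid_text.split("-")[2][0], 16)

sample_v4 = str(uuid.uuid4())
sample_v7 = "018f4e3a-7b2c-7abc-8def-0123456789ab"  # hand-built illustrative value

v4_version = version_from_text(sample_v4)
v7_version = version_from_text(sample_v7)
same_via_stdlib = uuid.UUID(sample_v7).version      # parsed from the same bits
```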

Re: monotonicity, as I view it, v7 is the best compromise I can make with devs as a DBRE where the DB isn’t destroyed, and I don’t have to try to make them redesign huge swaths of their app.


The part I'm talking about proposes "counters" in UUID, not just date/time.


The monotonicity can be useful in multiple contexts: colocating database data by time, providing "sooner than" comparisons.

Integers are monotonic but can't be distributed like UUIDs.

Unless you make them 128 bits ;)

As usual, most people are not dumb most of the time, even if it seems that way.


> [integers] can’t be distributed like UUIDs

They can, to an extent. The use of integers as a primary key has been a solved problem for quite some time, usually by either interleaving distribution among servers, or a coordinator handing chunks out.
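A sketch of the interleaving approach, assuming a fixed number of servers that each own one residue class of the integers (the same idea as MySQL's `auto_increment_increment` and `auto_increment_offset` settings). The function name and the three-server setup are made up for illustration.

```python
from itertools import count, islice

def interleaved_ids(server_index: int, server_count: int, start: int = 1):
    """Return an iterator over one server's share of the integer ID space."""
    if not 0 <= server_index < server_count:
        raise ValueError("server_index must be in range(server_count)")
    # Step by the number of servers so no two servers ever emit the same ID.
    return count(start + server_index, server_count)

server_a = list(islice(interleaved_ids(0, 3), 4))   # [1, 4, 7, 10]
server_b = list(islice(interleaved_ids(1, 3), 4))   # [2, 5, 8, 11]
server_c = list(islice(interleaved_ids(2, 3), 4))   # [3, 6, 9, 12]
```

Each server's IDs are monotonic on their own and never collide with another server's, but the server count has to be agreed up front, which is the "to an extent" caveat.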

If you mean enabling the ability to do joins across physical databases, my counter to that is it’s an unsupported method by any RDBMS, and should be discouraged. You can’t have foreign key constraints across DBs, and without those, I in no way trust the application to consistently do the right thing and maintain referential integrity. I’ve seen too many instances of it going wrong.

The only way I can see it working is something involving Postgres’ FDW, but I’m still not convinced that could maintain atomic updates on its own; maybe with a Pub/Sub in addition? This rapidly gets hideously complicated. Just design good, normalized schema that can maintain performance at scale. Then when/if it doesn’t, shard with something that handles the logic for you and is tested, like Vitess or Citus.


For example, imagine a client that can generate a UUID and at a later time save that to remote database.

Or imagine two separate databases that get merged.


> For example, imagine a client that can generate a UUID and at a later time save that to remote database.

DBs can return inserted data to you; Postgres and SQLite can return the entire row, and MySQL can return the last generated auto-increment ID.

> Or imagine two separate databases that get merged.

This is sometimes a legitimate need, yes, but it does make me smirk a bit since it goes against the concept of microservices owning their own domain (which I never thought was a great idea for most). However, it’s also quite possible to merge DBs that used integers. Depending on the amount of tables and their relationships (or rather, lack of formally defined ones) it may be challenging, but nothing that can’t be handled.

I mostly just question the desire to dramatically harm DB performance (in the case of UUIDv4) for the sake of undoing future problems more easily.


An example of the latter is when I worked with healthcare systems.

It was not uncommon for systems to merge datasets, either due to literal M&A or to share records and coordinate care.

A globally unique ID was important, despite not having a globally centralized system.


> DBs can return inserted data to you; Postgres and SQLite can return the entire row, and MySQL can return the last generated auto-increment ID.

Assuming you can+want to talk to a database right then.

The useful part of UUIDs is that they can be generated anywhere, locally, remotely, same DB, separate DB, online, offline, and never change.


> colocating database data by time, providing "sooner than" comparisons.

If you need to perform date/time related operations, use date/time related data types, not an unrelated type that happens to have some arbitrary timestamp embedded in its binary layout.

> Integers are monotonic but can't be distributed like UUIDs.

Yes, use UUIDs if you need distribution, use integers if you need monotonicity. If you need "monotonic and distributed", you need an external authority for proper distribution of those IDs. Then, an integer would still work.


> use date/time...not...timestamp

:/


And if you have a clustered index like in MS SQL Server, a non-monotonic UUID results in inserting the data in the middle of the table (bad performance) rather than appending to the end.


For the Postgres fans out there, it also kills performance on that side of the fence. You get things like WAL amplification from using UUID v4 (random prefix). I think v7 should greatly help with that.


It also hurts query performance in some circumstances even in Postgres, due to the Visibility Map.


Integer counters are a problem because they leak information. In most cases I've encountered that's not acceptable.


So don’t expose them in the URL. Or have separate internal and external IDs. So many options that don’t destroy B+trees.


UUIDv7, for example, leaks the timestamp

I’ve met more than one architect who hand-waves that fact away during a “leaking integers is bad!” campaign.


Monotonic UUIDs leak information too.


For v7, the last chunk of bits (rand_b) can be "pseudorandom OR serial". There is no flag bit that must indicate which approach was used.

Therefore, given a compliant UUIDv7 sample, it is impossible to interpret those bits. You can't say whether they are random or serial without knowing the implementation or statistically analyzing consecutive samples. It's a black box.

The standard would be improved if it just said those bits MUST be uniquely generated for a particular timestamp (e.g. with PRNG or atomic counter).

Logically, that's what it already means, and it opens up interesting v8-style application-specific usages of those bits (like encoding type metadata in a small subset, leaving the rest random), while also complying with the otherwise excellent v7 standard.


Serial is just a terrible idea for UUID. UUIDs shouldn’t require synchronization to be generated.


v7 is really helpful for meaningful UX improvements.

ex. I'm loading your documents on startup.

Eventually, we're going to display them as a list on your home screen, newest to oldest.

Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUID v7 that's in the filename.

Is it perfect? No, e.g. we could have a really old doc that's the most recently modified, and the doc ID is only a proxy for the creation date.

But it's much better than the status quo of "we're parsing 1000+ docs at ~random at startup, please wait 5 seconds for the list to stop updating over and over."


Tho presumably the uuid would give you the creation date but not the modified date. Still very useful.


Or just use the file date.


> Or just use the file date.

Your parent says they don't want to wait for the file system:

> Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUID v7 that's in the filename.


> Your parent says they don't want to wait for the file system:

That information comes for free when you're iterating files in a directory. There's no extra waiting than the file name itself because file dates are kept in the same structure that keeps the file names.


This doesn't seem to be true in the standard directory implementations for iterating a directory in Python or Rust.


You're right. It's free on Windows, but isn't on Unix apparently. https://doc.rust-lang.org/std/fs/struct.DirEntry.html#method...

Still, I find it weird to justify relying on the particular binary layout of an ID format this way. Just use dates in the filenames if you truly need such a mechanism.


What if I need an ID in the filename and I get the date sorting for free?

You know, like described in the OP.

Is it okay if that's useful?


Yes. I agree with the sentiment that UUIDv7 being chronological can be useful. But, in this specific example, I think it’s a design smell to design your feature around the format of the filename and the UUID generation algorithm. I’d say, wait for the FS if you have to instead of creating failure-prone dependencies like that.


What part is failure prone?

What is it being relied upon for?

Alternatively, more explicitly, let's look at it from this angle:

Let's follow exactly what you're recommending: parse it from the file.

Then add a fault-tolerant layer in front that parses a UUID-v7 from the filename.

What do you think of that?


The assumption that the app files would always use the same format and the same UUID algorithm in that format is a totally unnecessary tight coupling for a “loading UI”. The potential future cost isn’t worth it.

Adding layers, etc. Again, it’s a loading UI.

Obviously, we’re talking about a fantasy app here. I’m weighing options based on my understanding of it.


Gotcha, a better name for it is fantasy app.

Let's have the fantasy app do exactly as you're recommending.

Now, the fantasy app also happens to store its file using this filename format: {uuid}.json

What objections are there to parsing the uuid from the filename and using it to sort?

Assuming you again mention that the filename might not be a valid UUID:

Is it possible to account for that and fall back to the safe behavior? :)
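A sketch of what that fallback could look like, assuming files named `{uuid}.json` and a slower source of truth that is only consulted when the name is not a v7 UUID. Both function names and the `slow_timestamp_ms` callback are hypothetical.

```python
import uuid
from pathlib import Path

def uuid7_sort_key(path: Path):
    """Return the embedded millisecond timestamp, or None if not a UUIDv7 name."""
    try:
        parsed = uuid.UUID(path.stem)
    except ValueError:
        return None                  # not a UUID at all
    if parsed.version != 7:
        return None                  # a UUID, but without a v7 timestamp prefix
    return parsed.int >> 80          # top 48 bits: Unix time in milliseconds

def newest_first(paths, slow_timestamp_ms):
    """Sort newest first, calling slow_timestamp_ms(path) only when needed."""
    def key(path):
        fast = uuid7_sort_key(path)
        return fast if fast is not None else slow_timestamp_ms(path)
    return sorted(paths, key=key, reverse=True)
```

The slow path could be something like `lambda p: p.stat().st_mtime_ns // 1_000_000`, or a full parse of the document. The version check is what detects a changed ID scheme, as long as the new scheme does not also mark itself as v7.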


No, because you wouldn’t know if the UUID algorithm had changed. It’s a completely unnecessary coupling, like tying your shoelaces together before running.


Reductio ad absurdum: same argument applies to any persisted UUID.

Do you understand? On second read, could be too short and unnecessarily Latin-y. :)


This is fine-ish till O(10^3)


I'm having trouble understanding the use of v8. It can be pretty much any bits as long as it has 1000 in the right spot? It strikes me as too minimal to be useful. I must be missing something


The useful part is you can do anything you want with the other bits and have it still be a valid UUID.


Being able to do anything with the remaining bits is very useful.

You can use any scheme that suits your individual feature's needs, and it will still be a valid UUID.

This also means future schemes can be implemented right now without having to get a formal UUID version.

You could use the first few bits to indicate production vs qa vs dev data.

Or a subtle hint at what it might be for (e.g. is this UUID a product identifier or a user identifier or a post or a comment etc). Similar to how AWS etc prefix IDs with their type.
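A sketch of the environment-tag idea, assuming a 2-bit tag in the two most significant bits and random data everywhere else except the version and variant fields. The tag values and both function names are made up for illustration.

```python
import os
import uuid

ENVIRONMENTS = {"prod": 0, "qa": 1, "dev": 2}   # hypothetical 2-bit tags

def tagged_uuid8(environment: str) -> uuid.UUID:
    """Build a UUIDv8-layout value whose top two bits carry an environment tag."""
    tag = ENVIRONMENTS[environment]
    value = int.from_bytes(os.urandom(16), "big")
    value &= ~(0b11 << 126)            # clear the two tag bits
    value |= tag << 126
    value &= ~(0xF << 76)              # clear the version nibble
    value |= 0x8 << 76                 # version 8
    value &= ~(0b11 << 62)             # clear the variant bits
    value |= 0b10 << 62                # RFC variant
    return uuid.UUID(int=value)

def environment_of(value: uuid.UUID) -> str:
    """Recover the tag from an ID produced by tagged_uuid8."""
    tag = value.int >> 126
    return next(name for name, t in ENVIRONMENTS.items() if t == tag)
```

With the tag in the top bits, prod IDs start with hex 0-3, qa with 4-7, and dev with 8-b, so the environment is visible at a glance. That only means something to code that knows this particular scheme; to everything else it is just a v8 UUID.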


But it kind of defeats the purpose of encoding which of a fixed set of generation methods you are using in the ID, which is presumably to avoid having to check that none of the O(N^2) pair-wise combinations of N methods produce collisions.


[flagged]


The cool thing about the various versions of UUID is that they're all compatible. The differences almost all come down to database locality (and therefore performance).

The exception is if you're extracting the time portion of a time-based UUID and using it for purposes other than as a unique key, but in my experience this is typically considered bad practice and time is usually stored in a separate column for cases where it matters for business purposes.


Well, technically this is all about different versions of the same standard.


But no one ever said UUID v_ replaces all the others.

They aren't "versions" so much as variants.


It's not necessarily that it's dismissive; it's more that it's a fuzzy pattern-matching comment that's incorrect and is just a wordless link. Trivial to make, nontrivial to respond to: "funny" in the way in-group cultural references usually are, so responding means you're taking it too seriously. Yet it's incorrect enough that it'll misinform anyone who isn't diligently reading the full article and doesn't understand the historical context. Noise that's likely to generate noise. Trolling, just missing the active intent to derail.



