It has the advantage of being a drop-in replacement in most places where everyone uses v4 today. It also has an advantage over other specs like ULID in that it can be parsed easily even in languages and databases with no libraries, because you just need an obvious substring replace and from_hex to extract the timestamp. Other specs typically used some custom lexically sortable base64 or something that always needed a library.
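To illustrate, here's a minimal sketch of that extraction in Python (the sample UUID below is made up): the first 48 bits, i.e. the first 12 hex digits once dashes are dropped, are the Unix timestamp in milliseconds.

    from datetime import datetime, timezone

    def uuid7_timestamp(u: str) -> datetime:
        # First 48 bits (12 hex chars after removing dashes) are the
        # Unix timestamp in milliseconds.
        ms = int(u.replace("-", "")[:12], 16)
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    print(uuid7_timestamp("0190163d-8694-7d4a-a2a5-4bbcebfd7f57"))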
Early drafts of the spec included a few bits to increment if there were local IDs generated in the same millisecond, for sequencing. This was a good fit for lots of use cases, like using the new IDs for events generated in normal client apps. Even though it didn’t make the final spec, I think it’s worth implementing, as it doesn’t break compatibility.
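For the curious, here's a rough sketch of that early-draft idea (my own approximation, not the spec's wording): reuse the timestamp and bump a small counter in the rand_a bits when two IDs land in the same millisecond, so local ordering survives.

    import os, time

    _last_ms, _seq = 0, 0

    def uuid7_with_counter() -> str:
        global _last_ms, _seq
        ms = time.time_ns() // 1_000_000
        if ms == _last_ms:
            _seq += 1            # same millisecond: increment the counter
        else:
            _last_ms, _seq = ms, 0
        rand_b = int.from_bytes(os.urandom(8), "big") >> 2   # 62 random bits
        # 48-bit timestamp | version 7 | 12-bit counter | variant | 62 random bits
        value = (ms << 80) | (0x7 << 76) | (_seq << 64) | (0b10 << 62) | rand_b
        # (a real implementation would guard against counter overflow and races)
        h = f"{value:032x}"
        return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"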
There’s already a 72-bit random part. That should be sufficient to address conflicts.
Incrementing a sequence completely kills the purpose of a UUID, and requires serialization/synchronization semantics. If you need that, just use a long integer.
There is utility in knowing that event A comes before event B in the same local system, even if both are generated in the same millisecond. I have found this useful, e.g., when UI latency gets so low that you can have a user interaction and a menu opening in the same millisecond. Being able to plot them on a timeline without any kind of joins is nice.
Thread IDs get repeated across reboots. Integer sequence may also repeat in a distributed scenario, unless you want a massive bottleneck. You do need other stuff (timestamps, random number, etc.).
I have recently wondered why Ruby on Rails uses a full-length SHA-256 for its ETag fingerprinting (64 characters) when a UUID at 36 chars would probably be entirely enough to prevent collisions and be more readable at the same time. Esbuild, on the other hand, seems to use just 32 bits (8 chars) for its content hash.
Isn’t it because you can generate the same content two different times and hash it and come to the same ETag value?
Using a UUID here wouldn’t help, because you don’t want different identifiers for the same content. Time-based UUID versions would negate the point of ETag, and otherwise, if you use UUIDv8 and simply put a hash value in there, all you’re doing is reducing the bit depth of the hash and changing its formatting, for limited benefit.
I would assume that you would only create a new UUID if the content of the tagged file changed serverside.
Benefits are readability and a reduced amount of data to be transferred. A UUID is reasonably safe to be unique for the ETag use case (I think 64 bits would actually be enough).
The point of the content hash is to make it trivial to verify that the content hasn’t changed from when its hash was made. If you just make a UUID that has nothing to do with the file’s contents, you could easily forget to update the UUID when you do change its content, leading to invalid caches (or generate a new UUID even though the content hasn’t changed, leading to wasteful invalidation).
Having the filename be a simple hash of the content guarantees that you don’t make the mistakes above, and makes it trivial to verify.
For example, if my CSS files are compiled from a build script, and a caching proxy sits in front of my web server, I can set content-hashed files to an infinite lifetime on the caching proxy and not worry about invalidating anything. Even if I clean my build output and rebuild, if the resulting CSS file is identical, it will get the same hash again, automatically. If I used UUIDs and blew away my output folder and rebuilt, suddenly all files would have new UUIDs even though their contents are identical, which is wasteful.
SHA-256 has the benefit that you can generate the ETag deterministically without needing to maintain a database (i.e., content-based hashing). That way you also don’t need to track whether the content changes, which reduces bugs that might creep in with UUIDs. Also, if you typically only update a subset of all files, then aside from not needing to keep track of assigned UUIDs per file, you can do a partial update. The reasons to do content-based hashing are not invalidated by a new UUID format.
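As a concrete sketch (the filename below is hypothetical), deterministic content-based fingerprinting is just a hash of the bytes; the same content always yields the same tag, with no ID bookkeeping:

    import hashlib
    from pathlib import Path

    def content_etag(path: Path) -> str:
        # Same bytes in, same tag out; no database of assigned IDs needed.
        return hashlib.sha256(path.read_bytes()).hexdigest()

    # e.g. name the build output "app.<etag>.css" so a caching proxy
    # can safely cache it forever.
    print(content_etag(Path("app.css")))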
I don’t understand the part where the monotonicity of UUIDs is discussed. UUIDs should never be assumed to be monotonic, or to be in a specific format per se. If you strictly need monotonicity, just use an integer counter. Let UUIDs be black boxes, and assume that v7 is just a better black box that deals with DB indexes better.
The nice thing about them is you don’t have to assume, though, because the version is baked into an octet. Does the 3rd field start with a 4? v4. 7? v7. Etc.
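For example, a quick way to check (a tiny sketch; the sample value is made up) is to read the first hex digit of the third dash-separated field:

    def uuid_version(u: str) -> int:
        # First hex digit of the third field is the version nibble.
        return int(u.split("-")[2][0], 16)

    print(uuid_version("0190163d-8694-7d4a-a2a5-4bbcebfd7f57"))  # -> 7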
Re: monotonicity, as I view it, v7 is the best compromise I can make with devs as a DBRE where the DB isn’t destroyed, and I don’t have to try to make them redesign huge swaths of their app.
They can, to an extent. The use of integers as a primary key has been a solved problem for quite some time, usually by either interleaving distribution among servers, or a coordinator handing chunks out.
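A minimal sketch of the interleaving approach (the names here are mine, not a standard API): each of N servers hands out IDs from its own residue class, so the sequences never collide.

    class InterleavedSequence:
        def __init__(self, server_index: int, num_servers: int):
            self.next_id = server_index   # e.g. server 2 of 4 yields 2, 6, 10...
            self.step = num_servers

        def allocate(self) -> int:
            current = self.next_id
            self.next_id += self.step
            return current

    seq = InterleavedSequence(server_index=2, num_servers=4)
    print([seq.allocate() for _ in range(3)])   # [2, 6, 10]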
If you mean enabling the ability to do joins across physical databases, my counter to that is that no RDBMS supports it, and it should be discouraged. You can’t have foreign key constraints across DBs, and without those, I in no way trust the application to consistently do the right thing and maintain referential integrity. I’ve seen too many instances of it going wrong.
The only way I can see it working is something involving Postgres’ FDW, but I’m still not convinced that could maintain atomic updates on its own; maybe with a Pub/Sub in addition? This rapidly gets hideously complicated. Just design a good, normalized schema that can maintain performance at scale. Then, when/if it doesn’t, shard with something that handles the logic for you and is tested, like Vitess or Citus.
> For example, imagine a client that can generate a UUID and at a later time save that to remote database.
DBs can return inserted data to you; Postgres and SQLite can return the entire row, and MySQL can return the last generated auto-increment ID.
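For instance, a small sketch using SQLite's RETURNING clause (available in SQLite 3.35+, which recent Python builds bundle):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    # The insert hands the generated ID straight back; no second query.
    row = conn.execute(
        "INSERT INTO docs (body) VALUES (?) RETURNING id, body", ("hello",)
    ).fetchone()
    print(row)   # (1, 'hello')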
> Or imagine two separate databases that get merged.
This is sometimes a legitimate need, yes, but it does make me smirk a bit, since it goes against the concept of microservices owning their own domain (which I never thought was a great idea for most). However, it’s also quite possible to merge DBs that used integers. Depending on the number of tables and their relationships (or rather, the lack of formally defined ones), it may be challenging, but it’s nothing that can’t be handled.
I mostly just question the desire to dramatically harm DB performance (in the case of UUIDv4) for the sake of undoing future problems more easily.
> colocating database data by time, providing "sooner than" comparisons.
If you need to perform date/time related operations, use date/time related data types, not an unrelated type that happens to have some arbitrary timestamp embedded in its binary layout.
> Integers are monotonic but can't be distributed like UUIDs.
Yes, use UUIDs if you need distribution, use integers if you need monotonicity. If you need "monotonic and distributed", you need an external authority for proper distribution of those IDs. Then, an integer would still work.
And if you have a clustered index, like in MS SQL Server, a non-monotonic UUID results in inserting the data in the middle of the table (bad performance) rather than appending to the end.
For the Postgres fans out there, it also kills performance on that side of the fence. You get things like WAL amplification from using something like UUIDv4 (random prefix). I think v7 should greatly help with that.
For v7, the last chunk of bits (rand_b) can be "pseudorandom OR serial". There is no flag bit that must indicate which approach was used.
Therefore, given a compliant UUIDv7 sample, it is impossible to interpret those bits. You can’t say whether they are random or serial without knowing the implementation or doing stochastic analysis of consecutive samples. It’s a black box.
The standard would be improved if it just said those bits MUST be uniquely generated for a particular timestamp (e.g., with a PRNG or an atomic counter).
Logically, that's what it already means, and it opens up interesting v8-style application-specific usages of those bits (like encoding type metadata in a small subset, leaving the rest random), while also complying with the otherwise excellent v7 standard.
v7 is really helpful for meaningful UX improvements.
ex. I'm loading your documents on startup.
Eventually, we're going to display them as a list on your home screen, newest to oldest.
Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUIDv7 that's in the filename.
Is it perfect? No, ex. we could have a really old doc that's the most recently modified, and the doc ID is a proxy for the creation date.
But it's much better than the status quo of "we're parsing 1000+ docs at ~random at startup, please wait 5 seconds for the list to stop updating over and over."
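A tiny sketch of what that buys you (the filenames below are invented): because the v7 timestamp is the hex prefix, newest-first is just a reverse lexicographic sort of the names.

    # Hypothetical document filenames prefixed with their UUIDv7.
    filenames = [
        "018f00aa-1111-7abc-9def-001122334455.doc",
        "0190163d-8694-7d4a-a2a5-4bbcebfd7f57.doc",
    ]
    # Hex sorts lexicographically in timestamp order, so no parsing needed.
    newest_first = sorted(filenames, reverse=True)
    print(newest_first[0])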
Your parent says they don't want to wait for the file system:
> Now, instead of having to parse the entire document to get the modified date, or wait for the file system, I can just sort by the UUIDv7 that's in the filename.
> Your parent says they don't want to wait for the file system:
That information comes for free when you're iterating files in a directory. There's no more waiting than for the file name itself, because file dates are kept in the same structure that keeps the file names.
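For instance, a small sketch of that (the directory name is hypothetical; note that DirEntry.stat() is cached without an extra system call on some platforms, e.g. Windows):

    import os

    # Newest-first listing straight from directory iteration.
    entries = sorted(
        os.scandir("docs"),
        key=lambda e: e.stat().st_mtime,
        reverse=True,
    )
    for e in entries:
        print(e.name, e.stat().st_mtime)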
Still, I find that justification, relying on a certain binary format of an ID, weird. Just use dates in the filenames if you truly need such a mechanism.
Yes. I agree with the sentiment that UUIDv7 being chronological can be useful. But in this specific example, I think it’s a design smell to design your feature around the format of the filename and the UUID generation algorithm. I’d say wait for the FS if you have to, instead of creating failure-prone dependencies like that.
The assumption that the app files would always use the same format and the same UUID algorithm in that format is a totally unnecessary tight coupling for a “loading UI”. The potential future costs aren’t worth it.
Adding layers, etc. Again, it’s a loading UI.
Obviously, we’re talking about a fantasy app here. I’m weighing options based on my understanding of it.
No, because you wouldn’t know if the UUID algorithm was changed. It’s a completely unnecessary coupling, like tying your shoelaces together before running.
I'm having trouble understanding the use of v8. It can be pretty much any bits as long as it has 1000 in the right spot? It strikes me as too minimal to be useful. I must be missing something.
Being able to do anything with the remaining bits is very useful.
You can do any scheme that suits your individual features needs, and it will be a valid UUID still.
This also means future schemes can be implemented right now without having to get a formal UUID version.
You could use the first few bits to indicate production vs qa vs dev data.
Or a subtle hint to what it might be for (e.g., is this UUID a product identifier or a user identifier or a post or a comment, etc.). Similar to how AWS and others prefix IDs with their type.
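As a hedged sketch of that idea (the tag values and layout here are entirely made up, not any standard), you could spend the first hex digit of a UUIDv8 on an application-defined type tag:

    import os

    TYPE_TAGS = {"user": 0x1, "product": 0x2, "post": 0x3}   # invented values

    def uuid8_tagged(kind: str) -> str:
        tag = TYPE_TAGS[kind]
        rand = int.from_bytes(os.urandom(16), "big")
        value = (tag << 124) | (rand & ((1 << 124) - 1))     # tag in top nibble
        value = (value & ~(0xF << 76)) | (0x8 << 76)         # version = 8
        value = (value & ~(0x3 << 62)) | (0x2 << 62)         # variant = 10
        h = f"{value:032x}"
        return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

    print(uuid8_tagged("user"))   # first hex digit is the type tag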
But it kind of defeats the purpose of encoding which of a fixed set of generation methods you are using in the ID, which is presumably to avoid having to check that none of the O(N^2) pair-wise combinations of N methods produce collisions.
The cool thing about the various versions of UUID is that they're all compatible. The differences almost all come down to database locality (and therefore performance).
The exception is if you're extracting the time portion of a time-based UUID and using it for purposes other than as a unique key, but in my experience this is typically considered bad practice and time is usually stored in a separate column for cases where it matters for business purposes.
It's not necessarily that it's dismissive, more that it's a fuzzy pattern-matching comment that's incorrect, and just a wordless link. Trivial to make, nontrivial to respond to: "funny", in the way in-group cultural references usually are; responding means you're taking it too seriously. Yet incorrect enough that it'll misinform anyone who isn't diligently reading the full article and doesn't understand the historical context. Noise that's likely to generate more noise. Trolling, just missing the active intent to derail.