RFC 9562: Universally Unique IDentifiers (May 2024) (rfc-editor.org)
48 points by htunnicliff 15 days ago | 43 comments



> UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds

This just seems to be a way of creating a huge class of subtle bugs. Now, when two things happen to be created in the same millisecond, they may or may not be monotonically increasing.

Plenty of systems will end up accidentally depending on the ordering of the UUIDs being the same order the UUIDs were generated in. And that will hold true until the system hits production and suddenly there is enough load for that not to be true for a handful of records, and the whole system fails.


Monotonicity is addressed in section 6.2, but it's optional.
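
For illustration, here's a minimal sketch (not the RFC's sample code) of one of the Section 6.2 techniques, a fixed-length dedicated counter in the rand_a bits, which keeps UUIDv7s monotonic within a single generator process:

```python
import os
import time

# Sketch of RFC 9562 Section 6.2's "Fixed-Length Dedicated Counter" method:
# a 12-bit counter occupies the rand_a field and breaks ties within one millisecond.
_last_ms = 0
_counter = 0

def uuid7_monotonic() -> int:
    """Return a UUIDv7 as a 128-bit int, monotonic within this process."""
    global _last_ms, _counter
    ms = time.time_ns() // 1_000_000
    if ms <= _last_ms:
        # Same (or rewound) millisecond: bump the counter instead of trusting the clock.
        ms = _last_ms
        _counter += 1
        if _counter > 0xFFF:  # counter exhausted: borrow the next millisecond
            ms += 1
            _counter = 0
    else:
        _counter = 0
    _last_ms = ms

    value = (ms & ((1 << 48) - 1)) << 80               # 48-bit Unix millisecond timestamp
    value |= 0x7 << 76                                 # version = 7
    value |= (_counter & 0xFFF) << 64                  # 12-bit counter in rand_a
    value |= 0b10 << 62                                # variant = 0b10 (RFC 9562)
    value |= int.from_bytes(os.urandom(8), "big") >> 2 # 62 random bits in rand_b
    return value
```

Of course, this only guarantees ordering within one generator; two machines generating in the same millisecond can still interleave, which is the parent's point.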


I collect a list of UUID implementations and concerns to think through here: https://github.com/swyxio/brain/blob/master/R%20-%20Dev%20No...


TL;DR: Several new UUID versions have been standardized

UUIDv5 is meant for generating UUIDs from "names" that are drawn from, and unique within, some "namespace" as per Section 6.5.

UUIDv6 is a field-compatible version of UUIDv1 (Section 5.1), reordered for improved DB locality. It is expected that UUIDv6 will primarily be implemented in contexts where UUIDv1 is used.

UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded. Generally, UUIDv7 has improved entropy characteristics over UUIDv1 (Section 5.1) or UUIDv6 (Section 5.6).

UUIDv8 provides a format for experimental or vendor-specific use cases. The only requirement is that the variant and version bits MUST be set as defined in Sections 4.1 and 4.2. UUIDv8's uniqueness will be implementation specific and MUST NOT be assumed.

The only explicitly defined bits are those of the version and variant fields, leaving 122 bits for implementation-specific UUIDs. To be clear, UUIDv8 is not a replacement for UUIDv4 (Section 5.4) where all 122 extra bits are filled with random data.
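
As a sketch of what that means in practice, here is a hypothetical UUIDv8 that packs 16 arbitrary vendor bytes and forces only the version/variant bits; the hashing scheme and function name are made up for illustration:

```python
import hashlib
import uuid

def uuidv8_from_bytes(data: bytes) -> uuid.UUID:
    """Pack 16 arbitrary bytes into a UUIDv8, overriding only the
    version and variant bits as RFC 9562 Sections 4.1/4.2 require."""
    if len(data) != 16:
        raise ValueError("need exactly 16 bytes")
    b = bytearray(data)
    b[6] = (b[6] & 0x0F) | 0x80   # version nibble = 8
    b[8] = (b[8] & 0x3F) | 0x80   # variant bits = 0b10
    return uuid.UUID(bytes=bytes(b))

# Hypothetical vendor-specific scheme: derive the payload from a hash.
u = uuidv8_from_bytes(hashlib.sha256(b"my-vendor-payload").digest()[:16])
```

Per the RFC, nothing about this construction makes the result unique; that burden is entirely on the vendor's scheme.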

Background for the changes:

Many things have changed in the time since UUIDs were originally created. Modern applications have a need to create and utilize UUIDs as the primary identifier for a variety of different items in complex computational systems, including but not limited to database keys, file names, machine or system names, and identifiers for event-driven transactions.


I'm curious why they specify the UUID must have dashes in string format. It makes the UUID difficult to select with a double click.


As with IP addresses, UX/DX is not the primary concern


Try a triple-click.


probably because the dashes have semantic meaning


You do understand that they existed way before the mouse and button became the norm?


I think widespread mouse usage and early UUID usage were similar in time, 1980s to early 1990s.

Not sure when the "double-click to select" UI paradigm became common though.


> Some UUID implementations, such as those found in Python and Microsoft, will output UUID with the string format, including dashes, enclosed in curly braces.

No … Python doesn't emit them enclosed in curly braces?

  >>> str(uuid.uuid4())
  '593a2ffb-eafc-484a-9a90-93bc91805651'


> UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded.

That seems like a rather vague way of addressing leap seconds for UUIDv7. For positive leap seconds, an 'exclusion' of that second would suggest that the millisecond counter is halted until the leap second is over, which doesn't seem ideal for monotonicity. And an 'exclusion' of a negative leap second hardly makes any conventional sense at all, with regard to the millisecond counter.

Contrast with the timestamp of UUIDv1/v6, where positive leap seconds can just be handled by incrementing the clock sequence.


That’s the normal way IETF RFCs describe unix seconds since the epoch, though there ought to be a normative reference to https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...


The problem with "seconds since the Epoch" is that if you naively add milliseconds, it is no longer monotonic.

Since 2016-12-31T23:59:60Z is 1483228800 seconds since the Epoch, and 2017-01-01T00:00:00Z is also 1483228800 seconds since the Epoch, that means that 2016-12-31T23:59:60.xxxZ would have the same timestamp as 2017-01-01T00:00:00.xxxZ, for all xxx.

This corresponds to the counter jumping backward 1000 milliseconds at 2017-01-01T00:00:00.000Z.


I would say that the instant (or second-long interval) in time that we name 2016-12-31T23:59:60Z is 1483228836 seconds after 1970-01-01T00:00:00Z, and that 2017-01-01T00:00:00Z is 1483228837 seconds after 1970-01-01T00:00:00Z [0].

The key here is that I use "seconds" to mean a fixed duration of time, like how long it takes light to travel 299792458 meters in a vacuum[1], and this version of seconds is independent of Earth orbiting the Sun, or the Earth spinning or anything like that[2]. If I understand you correctly, you use "seconds" more akin to how I use "days in a year": Most years have 365 days, but when certain dates starts to drift too far from some astronomical phenomenon we like to be aligned with (e.g. that the Northern Hemisphere has Summer Solstice around the 21st of June) we insert an extra day in some years (about every 4th year).

I haven't read RFC 9562 in detail, but if you use my version of "seconds" then "seconds since the Epoch" is a meaningful and monotonically increasing sequence. I suspect that some of the other commenters in this thread use this version of "seconds" and that some of the confusion/disagreement stems from this difference in definition.

The paragraph in Section 6.1 titled "Altering, Fuzzing, or Smearing" also seems relevant:

    > Implementations MAY alter the actual timestamp. Some examples include ..., 2) handle leap seconds ...
    > This specification makes no requirement or guarantee about how close the clock value needs to be to the actual time.
[0] Please forgive any off-by-one errors I might have made.

[1] I know that the SI definition between meters and seconds is the other way around, but I think my point is clearer this way.

[2] I ignore relativity as I don't think it is relevant here.


When I used that term in my last comment, I specifically meant the timescale formally defined by POSIX and linked above, which it misleadingly calls "seconds since the Epoch". This is the timescale people usually mean by "Unix time", and it's what UUIDv7 aspires to align to: in its own words, it's "derived from the widely implemented and well-known Unix Epoch timestamp source".

But Unix time isn't a count of SI seconds, as you might wish it to be. Instead, it's effectively "the number of whole days between 1970-01-01 and UTC-today, times 86400, plus the number of SI seconds since the last UTC-midnight." It smashes each UTC day into 86400 'seconds', regardless of whether it is truly longer or shorter due to leap seconds. This makes Unix time non-monotonic at the end of a leap second, since it rolls back that second at the following midnight.
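
That definition can be written out directly. A small sketch of POSIX "seconds since the Epoch" computed from a UTC civil time (the `unix_time` helper is illustrative, not a library function):

```python
from datetime import date

def unix_time(y, mo, d, hh, mm, ss):
    """POSIX "seconds since the Epoch": whole days since 1970-01-01 times
    86400, plus seconds since the last UTC midnight. A leap second
    (ss == 60) collapses onto the following midnight, which is exactly
    the rollback described above."""
    days = (date(y, mo, d) - date(1970, 1, 1)).days
    return days * 86400 + hh * 3600 + mm * 60 + ss

# The 2016 leap second and the moment right after it get the same timestamp:
leap = unix_time(2016, 12, 31, 23, 59, 60)
after = unix_time(2017, 1, 1, 0, 0, 0)
```

Both come out to 1483228800, matching the numbers upthread.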

It's nice that the RFC mentions leap seconds at all, but it's really playing with fire to leave it so vague, when monotonicity is an especially important property for these UUIDs. Its brief definition of the UUIDv7 time source as "the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded" especially doesn't help here, because the discrepancies in UTC are even worse than leap seconds:

> I would say that the instant (or second-long interval) in time that we name 2016-12-31T23:59:60Z is 1483228836 seconds after 1970-01-01T00:00:00Z, and that 2017-01-01T00:00:00Z is 1483228837 seconds after 1970-01-01T00:00:00Z [0].

Before 1972, the length of a UTC second was shorter than the length of an SI second. If you account for that, 2016-12-31T23:59:60 UTC is ~1483228827.999918 SI seconds after 1970-01-01T00:00:00 UTC; and 2017-01-01T00:00:00 UTC is ~1483228828.999918 SI seconds after 1970-01-01T00:00:00 UTC. Meanwhile, the "Unix time" (POSIX's "seconds since the Epoch") is 1483228800 for both seconds. (At least, these SI-second intervals are based on the widespread TAI − UTC table [0]. Beware that this table is also inaccurate, in that it was constructed from an older EAL − UTC table using a constant offset, but EAL and TAI continue to run at a different rate due to relativistic effects [1]. Since 1977, EAL − TAI has grown to over 0.00108 SI seconds.)

> I suspect that some of the other commenters in this thread use this version of "seconds" and that some of the confusion/disagreement stems from this difference in definition.

I wish. But the RFC says "leap seconds excluded", and it's designed to align with "Unix time" (which jumps back after every leap second), so clearly it isn't a count of physical SI milliseconds. There's a trilemma here: you can't (a) lie about ("exclude") leap seconds, (b) keep monotonicity, and (c) maintain a constant length of the second, all at the same time; you have to pick two. You yourself would prefer (b) and (c), and POSIX mandates (a) and (c) for "Unix time". But UUIDv7, to align with "Unix time", really wants (a) and (b), which requires smearing, or halting the time scale, or some other dedicated mechanism. Yet the RFC is nearly silent on this.

[0] https://hpiers.obspm.fr/eop-pc/earthor/utc/TAI-UTC_tab.html

[1] https://webtai.bipm.org/ftp/pub/tai/other-products/ealtai/fe...


There will not be any leap seconds after 2035, and very likely there will never be any negative leap seconds.


That's plenty of time for the CGPM to change its mind, or to implement some other mechanism to bound the UT1 − UTC difference. It will eventually be an issue in any case, since it's not like they decided to let the difference grow without bound.


I interpreted it to mean the timer is monotonic and ignores leap seconds completely. It does make it easy to implement wrong if your most convenient time API does implement leap seconds. (I don’t see why this would have anything to do with the millisecond timer? Leap seconds happen on the second.)


Unix timestamps are not monotonic when a positive leap second is applied: the next day must always start at a multiple of 86400 seconds, even if the UTC day is 86401 seconds long. Unless some part of the day is smeared, the timestamp must be set back at some point. So either the UUIDv7 timer is not monotonic, or it does not align with Unix timestamps.

As for the millisecond timer, recall that a positive leap second lasts for 1000 milliseconds. So to 'exclude' the leap second, by one interpretation, would be to exclude each of those milliseconds individually as they arise; in other words, to halt the timer during the leap second.


The way I read it, they don't claim to align with Unix timestamps. They claim being aligned with the same source time.


As I read it, the value is specifically aligned with "the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded". (And they really must not have been considering the rubber seconds in UTC up to 1972!)

Consider a day ending in a positive leap second. Suppose that at 23:59:59.500...Z, the millisecond counter is at T − 500. By the start of the leap second (23:59:59.999...Z), the millisecond counter is at T. Then, at the end of the leap second (00:00:00.000...Z), the counter must be at T, since the leap second must be excluded from the counter by definition. By 00:00:00.500...Z, it's at T + 500, and so on.

The question is, what is the value of the counter between 23:59:59.999...Z (when it is at T) and 00:00:00.000...Z (when it is at T), during the course of the leap second? The definition doesn't make this clear.


What is "source time", to you?

Like, I have a timestamp, in the format YYYY-MM-DD HH:MM:SS.ffff Z. What rules do I use to translate that into/from a set of bits? Whatever answer you give here, it seems like it must run afoul of the problems the parent poster is pointing out!


Count the number of seconds that have elapsed from the Unix epoch time until that moment, excluding leap seconds. This increases monotonically and is consistent with the Unix epoch source time.


At 2016-12-31T23:59:59.999Z, 1483228799.999 seconds had elapsed from the epoch, excluding leap seconds, according to "Unix epoch source time".

At 2017-01-01T00:00:00.000Z, 1483228800.000 seconds had elapsed from the epoch, excluding leap seconds, according to "Unix epoch source time".

Now, at 2016-12-31T23:59:60.500Z, how many seconds had elapsed from the epoch, excluding leap seconds? What about 2016-12-31T23:59:60.000Z, or 2016-12-31T23:59:60.750Z? The only monotonic solution is for all of these to have the exact same timestamp of 1483228800.000 seconds. But then that runs into a thousandfold increase in collision probability.


You can use whatever solution you want - hold the timestamp, smear time, it's up to you. It's still monotonic, and uses the same epoch.

You've still got 74 random bits.


> You can use whatever solution you want - hold the timestamp, smear time, it's up to you. It's still monotonic, and uses the same epoch.

"Whatever solution" I want? So… let's assume the programmer thinks "it's just POSIX time", they call their language of choice's "gimme POSIX time" function. This function, under default circumstances, is probably going to say "the leap second in POSIX time is also the last second of the day", i.e., it repeats the timestamp. Thus, it isn't monotonic.

(Which is the point of the parent comment…)


Surprising we're using 128 bits - some back of the napkin math tells me that may not be enough to avoid collisions...


Depends on your problem domain. You can be Twitter/Discord sized and get away with 64 bits. When you start dedicating parts of your UUID to a timestamp, the possibility of collisions does go way up, since now a significant chunk of the UUID will be the same for everyone. But when you deploy this variant you aren't trying to make globally unique ids anymore; you're trying to make application-unique ids. You are still very likely to also have a globally unique id, because 128 bits gives a lot of room to play around.
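
For reference, a rough sketch of that Twitter/Discord-style 64-bit scheme (Snowflake-like layout: 41-bit millisecond timestamp, 10-bit machine id, 12-bit sequence; the epoch value and class name here are illustrative, not any vendor's exact spec):

```python
import threading
import time

class SnowflakeSketch:
    """Rough sketch of a Snowflake-style 64-bit id:
    41-bit ms timestamp (custom epoch) | 10-bit machine id | 12-bit sequence."""

    def __init__(self, machine_id: int, epoch_ms: int = 1_288_834_974_657):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._epoch_ms = epoch_ms
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            ms = time.time_ns() // 1_000_000 - self._epoch_ms
            if ms <= self._last_ms:
                # Same or rewound millisecond: advance the sequence instead.
                ms = self._last_ms
                self._seq = (self._seq + 1) & 0xFFF
                if self._seq == 0:  # 4096 ids in one ms: borrow the next tick
                    ms += 1
            else:
                self._seq = 0
            self._last_ms = ms
            return (ms << 22) | (self._machine_id << 12) | self._seq
```

Note the trade-off this thread is circling: the scheme only works because coordination (assigning machine ids) replaces randomness.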


For hash functions, maybe not anymore, given the birthday paradox/pigeon-hole principle and other math problems in bucketing inputs versus the attack patterns for breaking hash functions and causing intentional collisions. For mostly purely random entropy in uses like UUID (and IPv6) the classic answer is that it is still more overall space than "atoms in the visible universe".


Care to share your math? My understanding of the birthday paradox is that it is astoundingly unlikely.


It's just about on the cusp. We would need to generate 1.1774 × sqrt(2^128) UUIDs before getting a collision with 51% probability. That's about 2.17 × 10^19 total UUIDs.

The real question is how many UUIDs are generated per second around the world. This RFC suggests using them for automated processes, transactions, etc and generally seems to view them as an inexhaustible resource. If humanity collectively generates 1 trillion per second we can expect to see a 51% chance of collision in 8 months; if it's 100 billion it'd be 10 years, and if it's only 10 billion it'd be 100 years. I would expect even just one single computer with a modest GPU could get in the ballpark of these numbers if it wanted to just spawn UUIDs all day, let alone a huge server farm using them as part of some automated process.
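
The figures above can be checked directly (this uses the full 2^128 space as the comment does; with only ~122 free bits the thresholds shrink by about 8x):

```python
import math

N = 2 ** 128
# Birthday bound: ~even collision odds after sqrt(2 * ln 2 * N) ≈ 1.1774 * sqrt(N) draws.
n_half = math.sqrt(2 * math.log(2) * N)   # ≈ 2.17e19 UUIDs

for rate, label in ((1e12, "1 trillion/s"), (1e11, "100 billion/s"), (1e10, "10 billion/s")):
    months = n_half / rate / (86400 * 30)
    print(f"{label}: ~{months:,.0f} months to ~50% collision odds")
```

At 1 trillion UUIDs per second this gives roughly 8 months, as the comment says.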


But not all UUIDs are going into the same "pool". There is no problem if your GPU generates a collision with one of my database identifiers, since I only care about my identifiers being unique in my system.

For comparison, before UUIDs most databases were using auto increment integers. That means nearly everyone had id 1, 2 etc in use. Still not a problem.


True, as universally unique identifiers, 128 (less a few) bits is not enough. You're talking about humanity generating 505 exabytes per year of just UUIDs. That won't happen any time soon.


Of course it will:

"The UUID generation algorithm described here supports very high allocation rates of 10 million per second per machine or more, if necessary, so that they could even be used as transaction IDs."

This is the use case they had in mind when building this algorithm, and it would only take about 10k machines worldwide to reach the above levels.


"Only" 10k machines producing a combined 100 billion transactions per second is pretty hard to imagine, least of all that would all be producing transactions that are part of the same namespace. Virtually all UUIDs are meaningless outside of a particular system in which they were created.

There is a solution that doesn't require extending UUIDs (which has a storage cost everyone pays), which is to use a URI/URN instead of a UUID to provide a namespace. In practice this already occurs, except the namespace (scheme, path) containing the UUID is implicit, as it hasn't been named.


The time field ensures that collisions cannot occur until at minimum the time field rolls over.


Having been badly bitten by the timestamp equivalent of eating gum you scraped off the bottom of a desk just let me add that there is an implicit "if your time source is good/non-adversarial" assumption here.

Friends don't let friends use malicious clocks!


HNGPT, please summarize the important changes?


They seem better geared for usage in databases as primary keys, specifically UUID versions 6 and onwards:

> Motivation. One area in which UUIDs have gained popularity is database keys. This stems from the increasingly distributed nature of modern applications. In such cases, "auto-increment" schemes that are often used by databases do not work well: the effort required to coordinate sequential numeric identifiers across a network can easily become a burden. The fact that UUIDs can be used to create unique, reasonably short values in distributed systems without requiring coordination makes them a good alternative, but UUID versions 1-5, which were originally defined by [RFC4122], lack certain other desirable characteristics [...]

> UUIDv6 is a field-compatible version of UUIDv1 (Section 5.1), reordered for improved DB locality. It is expected that UUIDv6 will primarily be implemented in contexts where UUIDv1 is used. Systems that do not involve legacy UUIDv1 SHOULD use UUIDv7 (Section 5.7) instead.

> Instead of splitting the timestamp into the low, mid, and high sections from UUIDv1, UUIDv6 changes this sequence so timestamp bytes are stored from most to least significant. That is, given a 60-bit timestamp value as specified for UUIDv1 in Section 5.1, for UUIDv6 the first 48 most significant bits are stored first, followed by the 4-bit version (same position), followed by the remaining 12 bits of the original 60-bit timestamp. [...]

> UUIDv7 features a time-ordered value field derived from the widely implemented and well-known Unix Epoch timestamp source, the number of milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded. Generally, UUIDv7 has improved entropy characteristics over UUIDv1 (Section 5.1) or UUIDv6 (Section 5.6).
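
The UUIDv6 field reordering quoted above can be sketched as a converter. A rough, hypothetical helper using the stdlib `uuid` module (whose `UUID.time` property reassembles the 60-bit v1 timestamp for us):

```python
import uuid

def uuid1_to_uuid6(u1: uuid.UUID) -> uuid.UUID:
    """Sketch of the reordering quoted above: store the 60-bit Gregorian
    timestamp most-significant-bits first, keep clock_seq and node as-is."""
    if u1.version != 1:
        raise ValueError("expected a UUIDv1")
    ts = u1.time                       # the full 60-bit timestamp
    value = (ts >> 12) << 80           # top 48 timestamp bits come first
    value |= 0x6 << 76                 # version = 6 (same nibble position)
    value |= (ts & 0xFFF) << 64        # remaining 12 timestamp bits
    value |= u1.int & ((1 << 64) - 1)  # variant, clock_seq, and node unchanged
    return uuid.UUID(int=value)

u6 = uuid1_to_uuid6(uuid.uuid1())
```

Because the high timestamp bits now lead, integer (and lexicographic) order of UUIDv6 values matches creation order, which is the DB-locality win being described.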


[flagged]


The problem that this standard solves isn't a math problem. It's an engineering problem of defining (adding) UUID formats that are suitable for use in database keys (and some other things). Previous proposals had disadvantages for the use-case.

This is discussed in the "Update Motivation" section of the document: https://www.rfc-editor.org/rfc/rfc9562.html#name-update-moti...


> but we cant come up with a decent UUID scheme

maybe because we can’t come up with an unambiguous definition of “decent.”


We do and we don't frequently agree on what's "decent".

Most routers implement a set of security standards/protocols for VPNs that are "decent" and make them play nicely with each other.

The "Redis protocol" gets re-implemented frequently because it's "decent" and useful to many vendors.

I can't speak for "encryption", but there have to be numerous implementations of various algorithms.

And this is true for many other protocols.

UUID seems mathematically "provable" or "verifiable", so why are we wasting time on needless "wrong" / "non-decent" implementations?



