Hacker News new | past | comments | ask | show | jobs | submit login
Sortable Collision-Free UUIDs (github.com/kpdemetriou)
94 points by kpdemetriou 11 days ago | hide | past | favorite | 62 comments





I'd advise UUID v6 over this which is at least an RFC 4122 extension. As coded, this isn't UUID compatible other than being 128 bits.

http://gh.peabody.io/uuidv6/

Also some recent similar submissions:

Timeflake is a 128-bit, roughly-ordered, URL-safe UUID.

https://news.ycombinator.com/item?id=25870482

https://github.com/anthonynsimon/timeflake

ULIDs:

https://news.ycombinator.com/item?id=18768909

Sonyflake:

https://news.ycombinator.com/item?id=25592325

KSUIDs (can't find any discussion here):

https://github.com/segmentio/ksuid

This comment lists other prior art:

https://github.com/bradleypeabody/gouuidv6/issues/3


Is there a reason or need to be UUID compatible? I honestly don't know.

I use them in databases and know they're pretty safe to use when integrating data across multiple databases because collisions are astronomically unlikely, if implemented properly.


So here's a real-world use case that having an RFC 4122 UUID was useful for me.

I have a server which accepts reports and stores them in S3. For each new report, a v4 UUID is generated that that is used as the base of the S3 object name. This UUID becomes the report ID.

An entire system has been built around the report ID, expecting a hex UUID. Recently, I needed to change how the objects are stored in S3 in order to partition them into multiple S3 prefixes. Instead of storing every object in a single place in the S3 bucket, I needed to do something like:

    "reports/%s/%s" % (report_id[:2], report_id[2:])
The issue was that I had one component that was writing the reports, and a second component reading the reports. And somehow, the reader needed to know whether a report was stored using the old path layout:

    "reports/%s" % (report_id,)
Or the new path layout. I took advantage of the fact that RFC 4122 UUIDs have four bits set aside for version. After generating a v4 UUID, I update its version to 5. The reader can then check the report UUID version to know which path layout to use. Once all the reports stored using the old path layout expire, I can undo this hack.

Of course, I could have made the reader try the new path layout, then fall back to the old path layout. Or I could have updated the entire system to have a better way of communicating the path layout, but that would've been less efficient or have meant touching a lot more code.

I guess the moral of the story is: you never know when you're going to need to change something in a backwards compatible way, and having a few bits set aside even for something as simple as an object ID can be useful. I'm fortunate the UUID designers thought of that.


The version bits are for the UUID version though, not for your own versioning use. It doesn't sound like your v5 UUID was actually a UUID v5 (with namespace), so are your new UUID's even RFC 4122 compliant anymore?

At this point, you could just use one of these not-a-UUID replacements and include a few versioning bits of your own. Of course, you'd have to plan it in advance, my point is you used a lucky hack, not a feature of UUID's.


If I hadn't been using RFC 4122 UUIDs, I wouldn't have had any reserved bits to work with. There are a bunch of places in this system that restricted the report ID to be a hex string of length 32 that would have needed updating if I switched to a different format.

There's also no functional difference between a v4 UUID and a v5 UUID with fixed namespace and random bits for the name portion, so I disagree what I'm doing is no longer RFC 4122 compliant.

It's lucky I used RFC 4122 UUIDs in the first place, that I could take advantage of the designers foresight to set aside some extra bits.


That was very interesting, thank you. I didn't realize some uuids were anything other than... unique. A version seems very useful.

In broader industry these days, UUID effectively means “128-bit unique identifier” with no other standardization implied. I’ve seen dozens of custom “UUID” designs with no public description in the wild used at massive scales. A major reason for this is that the classic standard UUIDs v1-v5 have a broken design for some use cases so companies invent their own equivalents.

There is nothing wrong with designing a custom pseudo-UUID, and in fact there are often real advantages. The caveat is that you will want to use the same scheme consistently.


I have stored a wide manner of arbitrary data in UUIDs, and had great success with that technique.

First, for something like this, the details matter a lot. How many bits of randomness is this, how many bits used for the timestamp, what's the format, why does it yield the advantages claimed, and most of all, how does it provide collision resistance? Is it using a MAC address (like UUID v1/v6) or a custom namespace (like UUID v5), or what?

In this case...looking at the code.... It exposes two formats.

Format 1: 4 bytes to store the number of seconds since a custom epoch of... Sep 13 2020 12:26:40, and 12 bytes of randomness. Puzzling choice.

Format 2: 8 bytes to store the number of nanoseconds since the same custom epoch, and 8 bytes of randomness.

And the collision resistance...doesn't exist at all; it's just relying on 12 (or 8) bytes of randomness and a time prefix. Which seems like it'll be totally fine in practice, but if you actually care about collisions (and in my experience, almost everyone who thinks they do really doesn't), this is unlikely to be an optimal choice.

Functionally, I think this is closest to KSUIDs (https://github.com/segmentio/ksuid) which are also the result of combining a timestamp (with a custom epoch) and some randomness, and then concatenating them in a way which is not a valid UUID, will work efficiently as a DB index, and is extremely unlikely to collide. But KSUIDs are much more widely adopted and better documented.


Umm.. 12 bytes of randomness has 7*10^28 unique uuid per second. We generally take order of square root of this due to birthday paradox which means that if you generate less than something like 10^10 UUID per second you couldn't get collision.

Well, yes. As I said, it'll be "totally fine". :)

The problem is, the submission said "collision-free". This isn't "collision-free", it's "collisions are so unlikely you don't need to worry even under extremely conservative assumptions, assuming you have a decent source of randomness". And that's good enough for me, absolutely.

But...if that's good enough, then really, any of the common UUID and UUID-like schemes will be good enough. I'm taking the submission title to mean either "guaranteed collision-free" or, failing that, as "more collision resistant than some alternatives", or maybe "guaranteed collision-free in some specific use cases". But this doesn't seem to be any of those; it's just a timestamp and enough randomness that you'll be safe under reasonable assumptions.

In short, I'm not saying "don't use this because you'll get collisions", I'm saying "even though as a practical matter you won't see any collisions, I don't think the collision free label is appropriate".


Let's do some math. Assume you generate a million UUID per second. The probability of collision in a second is c=1-e^((-10^6)^2/(2*7*10^28))=-7*10^-18. The probability it will collide once in next 10 years is 1 in a billion(seconds in 10 year * c).

> you couldn't get collision.

You mean probably. With which you mean there's a low chance of this happening. For that one second. But there are many more seconds after that. Plus you have to assume your PRNG never has any weird hiccups - such as when someone does VM shenanigans.

This isn't "guaranteed" to be collision free at all.

I'm not playing dice when I want implement reliable systems.


The collision risk depends on the application and use case. In sufficiently large real systems, a probabilistic pseudo-UUID with only 96 bits of entropy has a collision probability that is very small but not so small that you can treat it as effectively zero if avoiding collisions is critical. I’ve seen multiple systems that generate unique identifiers at rates > 10^9 per second. The 128-bit size has a lot of advantages, but probabilistic identifiers are expressly disallowed for some applications and contracts because of the collision tail risk. Making the identifier larger than 128-bits has other significant costs so that generally is not an option either.

That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.


There's also the risk of bad randomness sources and/or bugs.

One popular UUID library got a bug report stating: "We are generating about 1M UUID4 a day, and we are getting several hundred collisions a day". And so they were; turned out to be a bug/weird interaction between the OpenSSL library they were using for randomness and forking. (Details here, although it was all fixed years ago of course: https://github.com/ramsey/uuid/issues/80)

On paper, you should never, ever, ever see a collision when generating a mere million v4 UUIDs a day, much less hundreds of collisions. But that doesn't mean it can't happen!

This is also an interesting bit of analysis; comes from a company that processed a lot of UUIDs generated in browsers, checked, and discovered about 5 collisions per million UUIDs. Again, not what you'd naively expect! (Turned out to be mostly driven by misbehaving crawlers.) https://medium.com/teads-engineering/generating-uuids-at-sca...


> That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.

Are you sure? Fast cryptographic hashes are around one cycle per byte on a recent processor, and slower methods aren't that much slower.


I'm curious, what systems were those that generate >10e9 UUIDs/s?

Many types of telemetry and sensor data models; vast networks of machines can generate a stupendous quantity of events. UUIDs are not good design choice in these cases but it is a hard requirement for some users for compatibility with legacy systems which require a 128-bit unique identifier for everything.

Another issue with probabilistic UUIDs, possibly more important than collisions, is that they don’t compress very well when stored.


POV: You're AWS and logging every API call.

Just as an example where you could easily reach high ballparks, though I doubt it's that many.

Maybe some some distributed computing applications on server farms otherwise? Certainly not anything that you'd store in a regular database.


laptop% date -d 'Sep 13 2020 12:26:40Z' +'%s' 1600000000

Yes, the offset is a round decimal number, but I'm not sure why that makes it a very good epoch. The code actually has it as "16 * 10 * 8", but it seems like an obvious optimization would be to hard code it as, yes, 1600000000. But if you're hard coding that, you could hard code 1609459200 (for a round date) or 1620073869 (for when the initial commit was made) or whatever instead.

Any choice will inevitably be arbitrary, but this feels like it may have been chosen for odd reasons to me.


Looks similar to ULID[0] (I am the author of a popular python implementation[1]).

It appears to have a similar constraint that two ID's generated within the same timestamp (ms, ns) have no strong guarantee of ordering. That might not be a deal breaker depending on your use case but something to consider.

* https://github.com/ulid/spec

* https://github.com/ahawker/ulid


I think the issue with a lib like this is that people might use it, thinking they don't need the IDs to be secure... until they need to be. But by then plenty would have been generated, and a lot of code would rely on this sortable property.

In fact, I don't see the point of a library like this, it's trying to encode two pieces of information into one string. Why is that? Why not encode geolocation or IP while they're at it. If some data is important, like the order, then a separate database column can easily be used and that would make for a much more durable and reliable solution.


Secure? These fuuids are supposed to replace regular uuids, which aren't specified to be secure either. UUIDs are supposed to be quasi-unique (very low probability of collisions), not secure. They are, in my humble opinion, the wrong tool if you want secure.

>Why not encode geolocation or IP while they're at it.

Version 1 (and 2) of UUIDs encoded the MAC-Address + ns timestamp. So there you go.

Version 3 and 5 just hash a namespace string and a name string with md5 and sha1 respectively (same inputs generate same output)

Version 4, which these days is used by almost everything, just uses whatever pseudo-random numbers the implementation chose to use. If you can figure out what PRNG (pseudo-random number generator) was used (if any) and the internal state of that PRNG, then you may be able to guess/forge UUIDs. The spec does not mandate using a cryptographically-secure PRNG, although a lot of implementations out there today will probably use one.

Anyway, thinking UUIDs would be secure is wishful thinking, unless you did the legwork to make sure you control the generator, and the generator you use is using Version 4 with a cryptographically-secure PRNG, and you can guarantee that that will not change in the future; then you get 122 bits of secure random numbers (+ 6 static bits for version and variant).

By comparison, this implementation of "fuuids" gives you 96 bits of secure-enough random numbers or 64 bits in the nanosecond case (it uses os.urandom(12) and os.urandom(8), compared to os.urandom(16) that the standard lib in python uses)


Version 4 would still be more secure in the sense that it doesn’t leak any extra info and as you note it can be generated using a a cryptographically-secure PRNG.

With sortable IDs, you could potentially know when some item was generated, which may or may not be useful info, but why take the risk? I think it makes sense to minimise the data that the app or service exposes. With a regular uuid you can control whether you expose or not the creation time, while with sortable ids you can’t.


Sortable IDs have pleasant properties on insertion in traditional btree indexes as all the new values are on one edge of the tree. Truly random IDs end up with random I/O on the index tree.

With a database like postgres with full page writes enabled, it can blow out quite quickly at scale.


Though something to point out that this project misses is that UUID sorting is its whole own weird can of worms. I just went through this exercise trying to get ULID to GUID conversion/round-tripping that matched MS SQL Server's GUID sort order in hopes for the benefits of those small, pleasant properties.

In short, because SQL Server's sort order is based on GUID/UUID v1, the "group" that matters most significantly for sort order (in SQL Server, but other GUID tools are different) is the tail 6 bytes (not the head bytes as these FUUIDs are using which would allow them to sort in unix sort in GUID form). (As the tail 6 bytes are the "machine ID" in v1 UUIDs.)

For anyone curious how deep (and which endian) the rabbit hole goes in GUID sorting, I kept referring to Raymond Chen's somewhat comprehensive description of GUID sort orders across Microsoft products in my efforts: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...


On the other hand, doesn’t that mean that a workload that inserts new sortable uuids would experience lots of contention, while one that inserts random uuids would better spread load? (This is probably more significant in a distributed database sharded by uuid, where this means all the write load is going to one shard at a time.)

It’s worse with clustered indexes — all of the data experiences this issue, not just the index.

At least in Postgres you have the advantage/disadvantage that real-time clustered tables aren't a feature on offer - so that re-clustering cost is eaten on a periodic basis whenever you invoke a re-cluster operation on the table in question.

Additionally re-reading things closely I believe this package wouldn't benefit databases at all since the "sortable" form appears to be the common string representation while postgres will internally store UUIDs (at least in a UUID column) as their binary values - leading to indexing still being out of order unless you specify a ::TEXT casting.


This seems to be one of those things that you can never convince people of. Everyone thinks that their database is going to have so much data in it that having a sortable ID is important for "performance", and that sorting by another field is silly because of "performance".

ID's are there to identify the row. You can sort it by whatever you like, preferably something that makes sense from a user perspective so the users don't get handed lists of stuff in what seems to them to be a stupid order.


I agree, to an extent, with your comment. But having sortable/chronological IDs on a single column is a really nice feature to have.

Obviously, if you rely on the IDs to be unguessable/brute-forceable as part of your access control, then it is certainly bad to use such IDs. But I have not yet seen a case where I need both sortable and secure IDs. Being collision-free is what we want in a database context, and if it can hold some more information for the order, all the better.


Hi everyone, OP here. Let me offer some background. fuuids are designed to be sortable and collision-free (for most practical purposes, details below) IDs within a 16-byte footprint making them interpretable as UUIDs.

Lazare very helpfully detailed the internals of the two available formats but I'll summarize them here and I'm happy to answer any questions:

Format A (fuuid) consists of a 4-byte second timestamp concatenated with a 12-byte random tag.

Format B (fuuid_ns) consists of an 8-byte nanosecond timestamp concatenated with an 8-byte random tag.

As far as collision resistance is concerned, here's a list of probabilities at various production rates for Format A. Collisions for Format B are only relevant when the production rate begins to approach 10^9 IDs per second (assuming IDs are produced uniformly in time).

- 2^-90.51 at 10 IDs/second

- 2^-83.73 at 100 IDs/second

- 2^-77.07 at 1,000 IDs/second

- 2^-70.42 at 10,000 IDs/second

- 2^-63.78 at 100,000 IDs/second

- 2^-57.14 at 1,000,000 IDs/second

Hope this helps!


What is the difference between this and ULID? The latter is implemented in dozens of languages already, and many ULID libraries support conversion into UUID format.

https://github.com/ulid/spec

"Universally Unique Lexicographically Sortable Identifier"


Are they sortable as a string or as binary? Once you deploy something like this, that becomes important because you'll end up storing binary and string representations in different spots, unless you're _really_ careful (and DBs can insert way faster if you're always doing so near the end of the primary key sorting). And unless you use a totally custom string function, you can't have both, since base64 has letters before numbers, and ASCII is vice versa. It can be a real pain.

The examples use both letters and numbers in the ID composition so I would guess not - in all honesty though your question needs some clarification because the string representation of UUIDs is pretty fluid - there's the generally expressed version like `01324332-f66a-054a-76e4-fbdc7f772cd1` which appears to be what this generator favors - but for indexing UUIDs are usually considered to be their binary values so it's highly unlikely that a DB would see anything beyond a coincidental marginal benefit from using this package.

It's hard to tell since the package says "sortable using UNIX sort" so it's not a format agnostic sorter - but it's also not conclusive whether the author meant the common human-readable strings can be sorted using UNIX sort or the binary strings can be sorted using UNIX sort.


I think if you store as hex, which is the traditional string representation of a UUID, then the string and binary representation will sort identically.

Regular old v1 UUIDs contain 60 bits of the system time and are sortable by it, all you need is a uuid library to extract that information from them. Eg, in haskell:

    ghci> sortOn Data.UUID.Util.extractTime (map (fromJust . Data.UUID.fromString) ["8fca290c-ac63-11eb-9e74-79cdba6ee3eb", "8d543a32-ac63-11eb-9e74-79cdba6ee3eb"])
    [8d543a32-ac63-11eb-9e74-79cdba6ee3eb,8fca290c-ac63-11eb-9e74-79cdba6ee3eb]

Not having the leading bits be the timestamp limits the utility, though, because then you don't get sorting "for free" just by having an index on the ID.

At least for me, that is the primary appeal of a time-base UUID.


I always just define a custom comparison operator, which is easy enough in both C++ and PostgreSQL.

OT: Github READMEs have got to the point where they don't say what it actually does, how it does it, things you can depend on or not. They are now QUICKSTART.md. With the dependency insecurities going around, why would anyone think this is okay?

i have a loosely curated list of "State of the Art of UUIDs" here: https://github.com/sw-yx/uuid-list/

just added FUUIDs, thanks OP


I've used something very similar in the past, called SimpleFlake[0], which is essentially a 64 bit version with the same principles. I've used it in Lisp, C, C++, Clojure, Python, and Rust. It's conceptually simple, and fits in a 64bit int, which is natively available in a lot of databases.

[0] SimpleFlake - https://github.com/SawdustSoftware/simpleflake/blob/f2b51f76...


I started looking into TSID/KSUID/ULID in order to support cursor based pagination schemes in GraphQL against non-integer based unique id fields (such as uuids or unique string ids).

A couple of notable Java libs:

https://github.com/f4b6a3/ulid-creator

https://github.com/f4b6a3/tsid-creator

https://github.com/akhawaja/ksuid


Please don't use the akhawaja ksuid generator for Java unchanged. We used it at my last job until we realized 1. the Base62 code is not thread-safe (it uses a static StringBuilder) and more importantly 2. The epoch used is different from the standard, so the generated ksuid aren't portable.

Use https://github.com/ksuid/ksuid insteads


Good to know, thanks.

4 byte timestamp is brave.

    int(time() - 16 * 10 ** 8)

Python's integers are unbounded.

The format only reserves for bytes

Is this a time swapped version of UUID v1? Like what MySQL will do when converting v1 uuid to binary for use in an index (when using the swap flag)?

The Python-code behind this:

  ts = int(time() - 16 \* 10 \*\* 8)
  return ts.to_bytes(4, "big") + os.urandom(12)

There's also deterministic Guids ( .net) . Useful in some cases when you want to make them consistent.

Eg. https://github.com/Informatievlaanderen/deterministic-guid-g...


I don't get why anyone would need to sort them. Just store a timestamp if you want to sort records by order of creation.

If this was about efficient indexing or something I would get it. But the only justification is "you can run it through sort".


Cool. I don’t see the readme warning about this, so:

Use wisely and with full awareness. Revealing creation sequence can leak information which can be an issue in scenarios where such information might compromise security.


Given that time-ordered UUIDs are hardly a new concept, it would be nice if the author did a comparison with the current state of the art rather than let such a bland README do the talking.


This really needs an explanation of how the “UUID” is generated. What properties does it have? How secure is it against attackers?

Hold on, aren't UUIDv1 already generated based on a timestamp? What's the point of this?

Here we go again. Why do we invent sortable uuids every single year?

also this https://news.ycombinator.com/item?id=27004098

w3c distributed id spec/primer


putting meaning into the on-purpose meaningless bit sequence of UUID sounds IMO like a recipe for desaster. The result mimics a UUID but isn't.

Why UUIDs were designed in such a way that collides?



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: