Also some recent similar submissions:
Timeflake is a 128-bit, roughly-ordered, URL-safe UUID.
KSUIDs (can't find any discussion here):
This comment lists other prior art:
I use them in databases and know they're pretty safe to use when integrating data across multiple databases because collisions are astronomically unlikely, if implemented properly.
I have a server which accepts reports and stores them in S3. For each new report, a v4 UUID is generated that that is used as the base of the S3 object name. This UUID becomes the report ID.
An entire system has been built around the report ID, expecting a hex UUID. Recently, I needed to change how the objects are stored in S3 in order to partition them into multiple S3 prefixes. Instead of storing every object in a single place in the S3 bucket, I needed to do something like:
"reports/%s/%s" % (report_id[:2], report_id[2:])
"reports/%s" % (report_id,)
Of course, I could have made the reader try the new path layout, then fall back to the old path layout. Or I could have updated the entire system to have a better way of communicating the path layout, but that would've been less efficient or have meant touching a lot more code.
I guess the moral of the story is: you never know when you're going to need to change something in a backwards compatible way, and having a few bits set aside even for something as simple as an object ID can be useful. I'm fortunate the UUID designers thought of that.
At this point, you could just use one of these not-a-UUID replacements and include a few versioning bits of your own. Of course, you'd have to plan it in advance, my point is you used a lucky hack, not a feature of UUID's.
There's also no functional difference between a v4 UUID and a v5 UUID with fixed namespace and random bits for the name portion, so I disagree what I'm doing is no longer RFC 4122 compliant.
It's lucky I used RFC 4122 UUIDs in the first place, that I could take advantage of the designers foresight to set aside some extra bits.
There is nothing wrong with designing a custom pseudo-UUID, and in fact there are often real advantages. The caveat is that you will want to use the same scheme consistently.
In this case...looking at the code.... It exposes two formats.
Format 1: 4 bytes to store the number of seconds since a custom epoch of... Sep 13 2020 12:26:40, and 12 bytes of randomness. Puzzling choice.
Format 2: 8 bytes to store the number of nanoseconds since the same custom epoch, and 8 bytes of randomness.
And the collision resistance...doesn't exist at all; it's just relying on 12 (or 8) bytes of randomness and a time prefix. Which seems like it'll be totally fine in practice, but if you actually care about collisions (and in my experience, almost everyone who thinks they do really doesn't), this is unlikely to be an optimal choice.
Functionally, I think this is closest to KSUIDs (https://github.com/segmentio/ksuid) which are also the result of combining a timestamp (with a custom epoch) and some randomness, and then concatenating them in a way which is not a valid UUID, will work efficiently as a DB index, and is extremely unlikely to collide. But KSUIDs are much more widely adopted and better documented.
The problem is, the submission said "collision-free". This isn't "collision-free", it's "collisions are so unlikely you don't need to worry even under extremely conservative assumptions, assuming you have a decent source of randomness". And that's good enough for me, absolutely.
But...if that's good enough, then really, any of the common UUID and UUID-like schemes will be good enough. I'm taking the submission title to mean either "guaranteed collision-free" or, failing that, as "more collision resistant than some alternatives", or maybe "guaranteed collision-free in some specific use cases". But this doesn't seem to be any of those; it's just a timestamp and enough randomness that you'll be safe under reasonable assumptions.
In short, I'm not saying "don't use this because you'll get collisions", I'm saying "even though as a practical matter you won't see any collisions, I don't think the collision free label is appropriate".
You mean probably. With which you mean there's a low chance of this happening. For that one second. But there are many more seconds after that. Plus you have to assume your PRNG never has any weird hiccups - such as when someone does VM shenanigans.
This isn't "guaranteed" to be collision free at all.
I'm not playing dice when I want implement reliable systems.
That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.
One popular UUID library got a bug report stating: "We are generating about 1M UUID4 a day, and we are getting several hundred collisions a day". And so they were; turned out to be a bug/weird interaction between the OpenSSL library they were using for randomness and forking. (Details here, although it was all fixed years ago of course: https://github.com/ramsey/uuid/issues/80)
On paper, you should never, ever, ever see a collision when generating a mere million v4 UUIDs a day, much less hundreds of collisions. But that doesn't mean it can't happen!
This is also an interesting bit of analysis; comes from a company that processed a lot of UUIDs generated in browsers, checked, and discovered about 5 collisions per million UUIDs. Again, not what you'd naively expect! (Turned out to be mostly driven by misbehaving crawlers.) https://medium.com/teads-engineering/generating-uuids-at-sca...
Are you sure? Fast cryptographic hashes are around one cycle per byte on a recent processor, and slower methods aren't that much slower.
Another issue with probabilistic UUIDs, possibly more important than collisions, is that they don’t compress very well when stored.
Just as an example where you could easily reach high ballparks, though I doubt it's that many.
Maybe some some distributed computing applications on server farms otherwise? Certainly not anything that you'd store in a regular database.
Any choice will inevitably be arbitrary, but this feels like it may have been chosen for odd reasons to me.
It appears to have a similar constraint that two ID's generated within the same timestamp (ms, ns) have no strong guarantee of ordering. That might not be a deal breaker depending on your use case but something to consider.
In fact, I don't see the point of a library like this, it's trying to encode two pieces of information into one string. Why is that? Why not encode geolocation or IP while they're at it. If some data is important, like the order, then a separate database column can easily be used and that would make for a much more durable and reliable solution.
>Why not encode geolocation or IP while they're at it.
Version 1 (and 2) of UUIDs encoded the MAC-Address + ns timestamp. So there you go.
Version 3 and 5 just hash a namespace string and a name string with md5 and sha1 respectively (same inputs generate same output)
Version 4, which these days is used by almost everything, just uses whatever pseudo-random numbers the implementation chose to use. If you can figure out what PRNG (pseudo-random number generator) was used (if any) and the internal state of that PRNG, then you may be able to guess/forge UUIDs. The spec does not mandate using a cryptographically-secure PRNG, although a lot of implementations out there today will probably use one.
Anyway, thinking UUIDs would be secure is wishful thinking, unless you did the legwork to make sure you control the generator, and the generator you use is using Version 4 with a cryptographically-secure PRNG, and you can guarantee that that will not change in the future; then you get 122 bits of secure random numbers (+ 6 static bits for version and variant).
By comparison, this implementation of "fuuids" gives you 96 bits of secure-enough random numbers or 64 bits in the nanosecond case (it uses os.urandom(12) and os.urandom(8), compared to os.urandom(16) that the standard lib in python uses)
With sortable IDs, you could potentially know when some item was generated, which may or may not be useful info, but why take the risk? I think it makes sense to minimise the data that the app or service exposes. With a regular uuid you can control whether you expose or not the creation time, while with sortable ids you can’t.
With a database like postgres with full page writes enabled, it can blow out quite quickly at scale.
In short, because SQL Server's sort order is based on GUID/UUID v1, the "group" that matters most significantly for sort order (in SQL Server, but other GUID tools are different) is the tail 6 bytes (not the head bytes as these FUUIDs are using which would allow them to sort in unix sort in GUID form). (As the tail 6 bytes are the "machine ID" in v1 UUIDs.)
For anyone curious how deep (and which endian) the rabbit hole goes in GUID sorting, I kept referring to Raymond Chen's somewhat comprehensive description of GUID sort orders across Microsoft products in my efforts: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...
Additionally re-reading things closely I believe this package wouldn't benefit databases at all since the "sortable" form appears to be the common string representation while postgres will internally store UUIDs (at least in a UUID column) as their binary values - leading to indexing still being out of order unless you specify a ::TEXT casting.
ID's are there to identify the row. You can sort it by whatever you like, preferably something that makes sense from a user perspective so the users don't get handed lists of stuff in what seems to them to be a stupid order.
Obviously, if you rely on the IDs to be unguessable/brute-forceable as part of your access control, then it is certainly bad to use such IDs. But I have not yet seen a case where I need both sortable and secure IDs. Being collision-free is what we want in a database context, and if it can hold some more information for the order, all the better.
Lazare very helpfully detailed the internals of the two available formats but I'll summarize them here and I'm happy to answer any questions:
Format A (fuuid) consists of a 4-byte second timestamp concatenated with a 12-byte random tag.
Format B (fuuid_ns) consists of an 8-byte nanosecond timestamp concatenated with an 8-byte random tag.
As far as collision resistance is concerned, here's a list of probabilities at various production rates for Format A. Collisions for Format B are only relevant when the production rate begins to approach 10^9 IDs per second (assuming IDs are produced uniformly in time).
- 2^-90.51 at 10 IDs/second
- 2^-83.73 at 100 IDs/second
- 2^-77.07 at 1,000 IDs/second
- 2^-70.42 at 10,000 IDs/second
- 2^-63.78 at 100,000 IDs/second
- 2^-57.14 at 1,000,000 IDs/second
Hope this helps!
"Universally Unique Lexicographically Sortable Identifier"
It's hard to tell since the package says "sortable using UNIX sort" so it's not a format agnostic sorter - but it's also not conclusive whether the author meant the common human-readable strings can be sorted using UNIX sort or the binary strings can be sorted using UNIX sort.
ghci> sortOn Data.UUID.Util.extractTime (map (fromJust . Data.UUID.fromString) ["8fca290c-ac63-11eb-9e74-79cdba6ee3eb", "8d543a32-ac63-11eb-9e74-79cdba6ee3eb"])
At least for me, that is the primary appeal of a time-base UUID.
just added FUUIDs, thanks OP
 SimpleFlake - https://github.com/SawdustSoftware/simpleflake/blob/f2b51f76...
A couple of notable Java libs:
Use https://github.com/ksuid/ksuid insteads
int(time() - 16 * 10 ** 8)
ts = int(time() - 16 \* 10 \*\* 8)
return ts.to_bytes(4, "big") + os.urandom(12)
If this was about efficient indexing or something I would get it. But the only justification is "you can run it through sort".
Use wisely and with full
awareness. Revealing creation sequence can leak information which can be an issue in scenarios where such information might
w3c distributed id spec/primer