Out of curiosity, I just wrote a PHP ulid() function. [1]
It took me a little more than 5 minutes, so it wasn't that hard after all. It appears to run at only about 35% of the speed of the Go version, but hopefully there's still room for a little improvement.
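For anyone curious what's involved, here's a minimal sketch of the same idea in Go: 48 bits of Unix milliseconds followed by 80 random bits, encoded with the Crockford base32 alphabet the ULID spec uses. This is my own illustration, not the PHP function from [1] or the reference Go implementation.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"time"
)

// Crockford base32 alphabet from the ULID spec (no I, L, O, U).
const alphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// ulid builds a 26-character ULID: 48 bits of Unix milliseconds
// followed by 80 bits of randomness, base32 encoded.
func ulid() string {
	var b [16]byte

	// First 6 bytes: big-endian millisecond timestamp.
	ms := uint64(time.Now().UnixMilli())
	for i := 5; i >= 0; i-- {
		b[i] = byte(ms)
		ms >>= 8
	}

	// Remaining 10 bytes: 80 bits of randomness.
	if _, err := rand.Read(b[6:]); err != nil {
		panic(err)
	}

	// Encode the 128-bit value into 26 base32 characters, draining
	// the least significant bits first; 26 characters hold 130 bits,
	// so the first character only ever encodes the top 3 bits (0-7).
	out := make([]byte, 26)
	var acc uint64
	bits, pos := 0, 25
	for i := 15; i >= 0; i-- {
		acc |= uint64(b[i]) << bits
		bits += 8
		for bits >= 5 {
			out[pos] = alphabet[acc&31]
			acc >>= 5
			bits -= 5
			pos--
		}
	}
	out[0] = alphabet[acc&31] // remaining 3 high bits
	return string(out)
}

func main() {
	fmt.Println(ulid())
}
```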
I wonder why you would ever want it to be case-insensitive? Surely that increases the possibility of clashes, and violates the very good principle of least surprise (pretty much every other ID in the world is case-sensitive).
I'd be surprised if many developers these days actually compared UUIDs case-insensitively, though, even if they're intended to be. Microsoft, which has relied on GUIDs for years going back to its version of DCE/RPC, probably does it right.
If it's part of a spec, then it's simpler, but it's still a potential point of surprise.
- A ULID is "sortable". But the whole point of a UUID is a random unique ID, non-guessable. Sortable is not a feature you normally want from a UUID. And UUIDs are still sortable anyway; the order just doesn't have any meaning.
- A ULID also encodes time; a UUID doesn't.
- A UUID has less chance of clashing: it's 128 bits vs 80 bits for a ULID.
- A UUID is also URL safe.
- A ULID is case insensitive. I don't see how this is an advantage.
If you sort a list of ULIDs, they are sorted by time of creation, with random order only among ULIDs created in the same millisecond. As for guessability, you still have 80 random bits. Though I'll admit I'm not qualified to comment on whether the difference between 80 and 128 bits is significant when you're trying to make sure your identifiers are not guessable.
As for less chance of clashing, 128 vs 80 bits: the 80 bits only have to be unique within the same millisecond, while the 128 bits have to be unique over the lifetime of your application. Based on my basic understanding of the birthday paradox, there is a 50% chance of collision if you have about 2^40 ULIDs generated during the same millisecond, and a 50% chance of collision if you have about 2^64 UUIDs generated during the lifetime of the application.
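For what it's worth, those figures check out under the usual birthday approximation p ≈ 1 − e^(−n²/2N); at exactly n = √N it gives about 39%, so 2^40 and 2^64 land right around the 50% region. A quick back-of-the-envelope sketch (my own, not from either spec):

```go
package main

import (
	"fmt"
	"math"
)

// birthday approximates the probability of at least one collision when
// drawing n random values from a space of 2^bits possibilities.
func birthday(n, bits float64) float64 {
	return 1 - math.Exp(-n*n/(2*math.Pow(2, bits)))
}

func main() {
	// 2^40 ULIDs generated within a single millisecond (80 random bits).
	fmt.Printf("ULID, 2^40 in one ms: %.2f\n", birthday(math.Pow(2, 40), 80))
	// 2^64 UUIDs over the application's lifetime (128 bits, as above;
	// note a v4 UUID actually has 122 random bits).
	fmt.Printf("UUID, 2^64 lifetime:  %.2f\n", birthday(math.Pow(2, 64), 128))
}
```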
Case insensitivity and the chosen character set are an advantage if you want to use this ID as a filename without having to worry about the limits of the filesystem.
Sure. I just don't even understand why it's a point in a discussion about UUID (hex) vs ULID (base-32), since both are case-insensitive... Which I think is why TeeWEE was stating that it's not really an advantage for ULID.
Which has a similar list, but presents them as features/facts, not advantages.
Edit: Perhaps it was noted to specifically call out that it maintains case insensitivity, even though it's choosing a more dense approach than just hex.
> But the whole point of an UUID is a random unique ID. Non guessable. Sortable is not a feature you normally want from an uuid.
Why so? As the name says they just need to be universally unique. Even relying purely on randomness seems bad, because you are at the mercy of the entropy source. Mess that up and you can end up with nasty problems throughout your system.
I've been working on a closely related problem of universal identification in distributed computing for a few years now. It's now in standards review and hopefully publishable soon.
We came to the conclusion that universal ids should represent identity only, and explicitly not carry 'metadata'. What is the use of 48 bits of time? It reduces the overall entropy, for what? If time is important, then why not make the id literally time (i.e. UnixNano); if it isn't, then why not make all bits random?
Also, while I think speakability is actually very important (many disagree), I'd assert that capitalization is better addressed from the opposite direction (i.e. the UI). Instead of removing capital letters from the ID itself, which affects all cases, address the human factors in the few cases where it comes up.
I think this is good advice in general:
When spoken aloud we suggest that you don't indicate capitalization at first. In many cases, such as search, human validation etc, this is more than enough precision, then one can add capitalization for disambiguation as needed.
For example:
"ab2Cd3Ef1g"
Spoken becomes: "a b two c d three e f one g. Capitalize 4 and 7, c and e"
With the capitalization part optional depending on context.
> What is the use of 48 bits of time? It reduces the overall entropy, for what? If time is important, then why not make the id literally time (i.e. UnixNano); if it isn't, then why not make all bits random?
For some designs it's useful to have identifiers have other properties than uniqueness. In this case, this property is relative lexicographical (and binary) order based on time so that you can leverage the order between the things the identifiers identify without looking at the things. The entropy is there to satisfy the uniqueness property (with some acceptable degree of collision, application dependent). The time is there to satisfy the ordering property.
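As a tiny illustration (with made-up byte values): when the first bytes are a big-endian timestamp, plain byte-wise comparison orders the identifiers by creation time, whatever the random tail contains.

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Two hypothetical IDs, truncated to 8 bytes for readability:
	// a 6-byte big-endian millisecond timestamp plus a random tail.
	earlier := []byte{0x01, 0x8e, 0x00, 0x00, 0x00, 0x01, 0xff, 0xff} // earlier ms, "large" tail
	later := []byte{0x01, 0x8e, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00}   // later ms, "small" tail

	// The timestamp prefix dominates the comparison.
	fmt.Println(bytes.Compare(earlier, later) < 0) // true
}
```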
Yes, very good point. However, as @danbruc rightly points out, this raises all sorts of other concerns. A user of these IDs may not realize that the reliability of the ordering can be substantially reduced depending on where the IDs are generated.
Some applications may be able to tolerate inconsistencies in ordering, others may not. Are IDs being generated on multiple machines? Are they in sync? What happens if the system clock is adjusted, or a container/VM is restarted on different hardware?
This design implies that these IDs are being generated in different locations, but that usage leads to the least reliable time. How many bits of approximate time does one really need? Not 48, surely.
On the other hand if you generate the IDs in the most reliable model, a single host with persistent storage to prevent regression, you've basically made an unnecessarily complicated vector clock. A simple incremental counter would work at least as well, and be far simpler.
The point being, these distinctions lead to the conclusion that this identifier isn't 'generally' useful, and even under optimal conditions its utility is questionable. For example, whether extra precision on an approximate value is useful isn't application dependent at all: the low-order bits of the time component have no actionable meaning, yet they imply a sort order. That's the kind of subtle error in reasoning that is really easy to make here. I think there are too many land mines hidden here to make this useful.
You are letting the perfect be the enemy of the good. Perfect general applicability to all problem domains is not a requirement of utility. Engineering is tradeoff analysis.
I'm not sure how I feel about time, but I feel like case sensitivity is asking for trouble if you're expecting anything to be vocalized/shorthand-written, particularly without knowing individual use cases. It also seems like the sort of thing that might occur so infrequently nobody remembers to cover the "fringe" case, particularly for things a couple layers removed, assuming the set size is as huge as it seems intuitively.
Having also spent some time on the topic, let me share my thoughts.
As you said, don't include information beyond identity, for the simple reason that nothing is guaranteed to never change; any information you put in there may become outdated. One may think you can surely put the birth country of the customer into its customer number, but what if the customer made a mistake when signing up?
It may be useful to have some information in identifiers, for example a prefix C or O to indicate a customer or order number, but that can be implicit most of the time and just be added in the user interface or when printing labels.
To guarantee global uniqueness one has essentially two choices: enough randomness, or unique identifiers for space and time. The first option requires no coordination, no clock synchronization and no need to ensure every process has a unique identifier; the second option allows you to have locality in identifiers generated at similar times or places.
Locality can be useful when identifiers are used as keys in databases or other data structures because it may avoid having every insert happen at a random place. On the other hand, this may easily leak information, be it the number of your customers, your orders per month or the number of servers you are using, especially if you are using counters; see further down.
One can use identifiers with locality internally but encrypt them before exposing them to the outside, but this comes with an entirely new set of problems. One may have to include initialization vectors in the identifiers, possibly making them longer. One may have to think about key rotation, at least for the case that keys get compromised, possibly making identifiers even longer by having to include key identifiers. One will have to do key management anyway, and encryption of course costs clock cycles.
If one wants to include a time component, one can choose between actual time and counters. If one uses actual time, one has to ensure the clock never goes backwards and may have to account for limited resolution, possibly falling back to counters or randomness when identifiers are produced faster than the clock ticks. If one uses counters, one has to ensure they never run backwards, for example by persisting them across application restarts including crashes.
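A minimal sketch of that fallback, assuming millisecond wall-clock time and an in-process counter (the names and layout are mine, not from any implementation discussed here):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// gen hands out (millisecond, sequence) pairs that never go backwards:
// if the clock stalls or regresses, it reuses the last timestamp and
// increments a counter instead.
type gen struct {
	mu   sync.Mutex
	last uint64 // last millisecond timestamp handed out
	seq  uint16 // sequence number within that millisecond
}

func (g *gen) next() (ms uint64, seq uint16) {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := uint64(time.Now().UnixMilli())
	if now <= g.last {
		// Clock did not advance (or went backwards): fall back to
		// the counter. A real implementation also has to handle the
		// counter overflowing within a single millisecond.
		g.seq++
		return g.last, g.seq
	}
	g.last, g.seq = now, 0
	return now, 0
}

func main() {
	var g gen
	for i := 0; i < 3; i++ {
		fmt.Println(g.next())
	}
}
```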
If several processes share counters, synchronization overhead between cores may become an issue in high performance scenarios. Modern processors offer high frequency hardware timestamp counters that can be synchronized across cores. This may be especially interesting if the timestamps are also used for purposes like operation or transaction ordering.
Identifiers should be readable if only for debugging by developers. Similar characters like O, 0, 1, I, l, U and V should be avoided. One can consider a checksum or even error correcting codes for user facing identifiers.
Capitalization is a tradeoff between length and ease of use, but I like your idea of dealing with it behind the scenes. One probably loses less than one bit per character on average when ignoring capitalization (62 case-sensitive alphanumeric symbols are about 5.95 bits per character, 36 case-insensitive ones about 5.17), so for sufficiently long identifiers collisions are still very improbable, at least if one does not use identifiers with a lot of locality. But I fear users will just tend to include the capitalization because they are not aware it can probably be ignored. It surely adds complexity to the application, if only to the user interface, because you always have to be able to handle collisions.
Word of caution to anybody looking to use a similar method working with SQL Server: the SQL Server uniqueidentifier type is stored differently on disk than binary(16). Instead of being ordered by bytes 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15, uniqueidentifier values are ordered by bytes 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1-0.
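A small helper (my own sketch, based on the byte order described above; verify against your server before relying on it) for reproducing that comparison in application code:

```go
// sqlServerSortKey permutes a 16-byte GUID into the order SQL Server
// uses when comparing uniqueidentifier values, so sorting the keys
// byte-wise matches the server's sort.
func sqlServerSortKey(g [16]byte) [16]byte {
	order := [16]int{10, 11, 12, 13, 14, 15, 8, 9, 7, 6, 5, 4, 3, 2, 1, 0}
	var k [16]byte
	for dst, src := range order {
		k[dst] = g[src]
	}
	return k
}
```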
Yes, but NEWSEQUENTIALID isn't globally ordered, it's only ordered for the computer on which it was generated. For that matter, on a single computer the ordering isn't guaranteed[1]. This makes NEWSEQUENTIALID fairly useless if you're looking to merge records from disparate sources and sort them independently of origin.
Not the author, but presumably because 1, I and l are visually hard to distinguish in some fonts. Same goes for O and 0.
As for U, I don't know about that, but I recently stumbled upon the issue that Google's OCR software seems to incorrectly recognize U as II relatively often.
This is so bizarre that I don't even know where to start to try and understand it.
Is it a joke? Tongue-in-cheek? It must be. I can't understand how it could possibly be a legitimate concern. What about other languages? As adults, who doesn't use 'fuck' conversationally? If it were a real concern, then surely drawing attention to it in the design of the ID is more obvious than the very rare occasion of 'dirty words' (by some definition of dirty) appearing in otherwise random strings.
So I have to conclude it's a joke, but then what the fuck?
This type of thing puts noise into engineering details that just ends up adding to confusion and ambiguity. If there's no (important) reason for 'U' to be in the omission set, then just say it's an arbitrary choice; if there is an OCR or legibility reason, then say that.
Half of my brain says it's funny, and half my brain says it's fucked up.
I see where you're coming from. On the other hand, since they were after base32, they were able to choose an additional letter to omit. By omitting the U, they happen to exclude the most popular English-language expletives.
These things can and do end up in, for example, urls. Think something like an ecommerce store order id. While not likely to happen, the below could happen in an email, and might invite unneeded controversy:
Ah yes ok I can understand that. I can see from a marketing point of view that you might want to simply and quietly avoid the (roughly) 1 in 1,000,000 ids which start with 4 specific characters.
Still, it's a bit culturally specific. It gives the impression of solving a problem that it doesn't. Depending on your market/language a different character might be a better choice. Plus there are quite a few 'bad' words that can still appear.
Filtering and rejecting proposed IDs based on a list of objectionable words seems like a far more realistic solution than baking a very specific exclusion into the plumbing.
Yes if you use a good source of entropy. In Go, the easiest and most secure is the `crypto/rand.Reader`. But it's also slow. If you don't need the security properties, an instance of `math/rand.Rand` should suffice. Otherwise, you can implement your own. It's just an io.Reader!
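For illustration, a sketch of swapping entropy sources, assuming the github.com/oklog/ulid package (whose constructors accept any io.Reader); check the package docs for the exact API:

```go
package main

import (
	crand "crypto/rand"
	"fmt"
	mrand "math/rand"
	"time"

	"github.com/oklog/ulid/v2"
)

func main() {
	t := time.Now()

	// Cryptographically secure but slower.
	secure := ulid.MustNew(ulid.Timestamp(t), crand.Reader)

	// Faster; fine when the IDs don't need to be unguessable.
	// *math/rand.Rand also implements io.Reader.
	fast := ulid.MustNew(ulid.Timestamp(t), mrand.New(mrand.NewSource(t.UnixNano())))

	fmt.Println(secure, fast)
}
```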
Doesn't that mean that it will usually — but not guaranteed always — be unique? That is, won't there always be a (minute and negligible depending on the intended use) chance of collision?
.. and it appears to be working