Universally Unique Lexicographically Sortable Identifier in Go (github.com/oklog)
89 points by tsenart on Dec 6, 2016 | hide | past | favorite | 52 comments

Seems that someone ported the JavaScript ulid library to Python: https://github.com/mdipierro/ulid

.. and it appears to be working

Out of curiosity, I just wrote a PHP ulid() function. [1]

It took me a little more than 5 minutes, so it wasn't that hard after all. It appears to run at only about 35% of the speed of the Go version, but hopefully there's still room for some improvement.

[1] https://github.com/phptools/ulid

Fellow PHP implementation: https://github.com/Lewiscowles1986/ulid

I wonder why you would ever want it to be case-insensitive? Surely that increases the possibility of clashes, and violates the very good principle of least surprise (pretty much every other ID in the world is case-sensitive).

The hex representation of GUIDs/UUIDs is case insensitive in most interpretations, I'd say that counts enough.

I'd be surprised if many developers these days actually compare UUIDs case-insensitively, though, even if they're intended to be. Microsoft, which has relied on GUIDs for years going back to its version of DCE/RPC, probably does it right.

If it's part of a spec, then it's simpler, but it's still a potential point of surprise.

Note: you can't compare ULIDs with UUIDs.

  - A ULID is "sortable", but the whole point of a UUID is a random, unique, non-guessable ID. Sortability is not a feature
    you normally want from a UUID. UUIDs are still sortable too, but the order doesn't mean anything.
  - A ULID also encodes time; a UUID doesn't.
  - A UUID has less chance of clashing: it's 128 bits vs 80 bits for a ULID.
  - A UUID is also URL-safe.
  - A ULID is case insensitive. I don't see how this is an advantage.

You just compared UUIDs with ULIDs.

Maybe he means programmatically?

I think he's joking.

And nailed it.

That's contrasting, not comparing.

If you sort a list of ULIDs, they sort by time of creation, with random order only among ULIDs created in the same millisecond. As far as guessability goes, you still have 80 random bits. Though I'll admit I'm not qualified to comment on whether the difference between 80 and 128 bits is significant when you're trying to make sure your identifiers are not guessable.

As for the chance of clashing, 128 vs 80 bits: the 80 bits only need to be unique within the same millisecond, while the 128 bits must be unique for the lifetime of your application. Based on my basic understanding of the birthday paradox, there is a 50% chance of collision once about 2^40 ULIDs are generated in the same millisecond, and a 50% chance of collision once about 2^64 UUIDs are generated over the lifetime of the application.

Case insensitivity and the chosen character set is an advantage if you want to use this ID as a filename without having to worry about the limits of the filesystem.

> A ULID is case insensitive. I don't see how this is an advantage.

It's helpful if you've ever had to read something like a ULID over the phone. Base64 isn't fun in that situation.

I've never seen a UUID in base-64 instead of hex...

Hex is also case insensitive. I was trying to answer why case insensitivity might be an advantage.

Sure. I just don't even understand why it's a point in a discussion about UUID (hex) vs ULID (base-32), since both are case-insensitive... Which I think is why TeeWEE was stating that it's not really an advantage for ULID.

My guess is that the README was mostly copied from here: https://github.com/alizain/ulid

Which has a similar list, but presents them as features/facts, not advantages.

Edit: Perhaps it was noted to specifically call out that it maintains case insensitivity, even though it's choosing a more dense approach than just hex.

> A UUID has less chance of clashing: it's 128 bits vs 80 bits for a ULID.

Actually... that's 80 bits for the "random" part of the ULID, which only disambiguates within the same millisecond.

ULIDs are meant to be compared with Time UUIDs in particular. See: https://www.famkruithof.net/guid-uuid-timebased.html

> But the whole point of a UUID is a random unique ID. Non-guessable. Sortable is not a feature you normally want from a UUID.

Why so? As the name says they just need to be universally unique. Even relying purely on randomness seems bad, because you are at the mercy of the entropy source. Mess that up and you can end up with nasty problems throughout your system.

I've been working on a closely related problem of universal identification in distributed computing for a few years now. It's now in standards review and hopefully publishable soon.

We came to the conclusion that universal IDs should represent identity only, and explicitly not carry 'metadata'. What is the use of 48 bits of time? It reduces the overall entropy, for what? If time is important, then why not make the ID literally a time (i.e. UnixNano); if it isn't, then why not make all the bits random?

Also, while I think speak-ability is actually very important (many disagree), I'd assert that capitalization is better addressed from the opposite direction (i.e. in the UI). Instead of removing capital letters from the ID itself, affecting all cases, address the human factors in the few cases where it comes up.

I think this is good advice in general:

When spoken aloud, we suggest that you don't indicate capitalization at first. In many cases, such as search, human validation, etc., this is more than enough precision; one can then add capitalization for disambiguation as needed.

For example:


Spoken becomes: "a b two c d three e f one g. Capitalize 4 and 7, c and e"

With the capitalization part optional depending on context.
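That scheme is easy to mechanize: read the ID caselessly, and keep the capitalized positions on the side. A sketch in Go; the ID string here is hypothetical, chosen only to match the spoken example above:

```go
package main

import (
	"fmt"
	"strings"
)

// spokenForm splits an ID into a caseless reading plus the 1-based
// positions that are capitalized, so capitalization can be given
// separately, and only when disambiguation is needed.
func spokenForm(id string) (string, []int) {
	var caps []int
	for i, r := range id {
		if r >= 'A' && r <= 'Z' {
			caps = append(caps, i+1)
		}
	}
	return strings.ToLower(id), caps
}

func main() {
	lower, caps := spokenForm("ab2Cd3Ef1g") // hypothetical ID
	fmt.Println(lower, caps)                // ab2cd3ef1g [4 7]
}
```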

> What is the use of 48 bits of time? It reduces the overall entropy, for what? If time is important then why not make the ID literally a time (i.e. UnixNano); if it isn't, then why not make all the bits random?

For some designs it's useful to have identifiers have other properties than uniqueness. In this case, this property is relative lexicographical (and binary) order based on time so that you can leverage the order between the things the identifiers identify without looking at the things. The entropy is there to satisfy the uniqueness property (with some acceptable degree of collision, application dependent). The time is there to satisfy the ordering property.

Yes, very good point. However, as @danbruc rightly points out, this raises all sorts of other concerns. A user of these IDs may not realize that the reliability of the ordering can be substantially reduced depending on where the IDs are generated.

Some applications may be able to tolerate inconsistencies in ordering, others may not. Are IDs being generated on multiple machines? Are they in sync? What happens if the system clock is adjusted, or a container/VM is restarted on different hardware?

This design implies that these IDs are being generated in different locations, but that usage leads to the least reliable time. How many bits of approximate time does one really need? Not 48, surely.

On the other hand if you generate the IDs in the most reliable model, a single host with persistent storage to prevent regression, you've basically made an unnecessarily complicated vector clock. A simple incremental counter would work at least as well, and be far simpler.

> A user of these IDs may not realize that the reliability of the ordering can be substantially reduced depending on where the IDs are generated.

That can only be addressed with improved documentation and shared understanding of the subtleties and pitfalls of distributed time synchronisation.

> Some applications may be able to tolerate inconsistencies in ordering, others may not.

Indeed. Proper thought must be put into this sort of thing. ULIDs aren't an exception, nor a silver bullet.

> How many bits of approximate time does one really need?

Entirely application dependent.


The point is that these distinctions lead to the conclusion that this identifier isn't 'generally' useful, and even under optimal conditions its utility is questionable. For example, extra precision for an approximate value is not application dependent at all. The low-order bits of the time component have no actionable meaning, though they imply a sort order. That's the kind of subtle error in reasoning that is really easy to make here. I think there are too many land mines hidden here for this to be useful.

You are letting the perfect be the enemy of the good. Perfect general applicability to all problem domains is not a requirement of utility. Engineering is tradeoff analysis.

I'm not sure how I feel about the time component, but case sensitivity seems like asking for trouble if you expect anything to be vocalized or written in shorthand, particularly without knowing individual use cases. It also seems like the sort of thing that occurs so infrequently that nobody remembers to cover the "fringe" case, particularly for things a couple of layers removed, assuming the set size is as huge as it intuitively seems.

Having also spent some time on the topic, let me share my thoughts.

As you said, don't include information beyond identity, for the simple reason that nothing is guaranteed to never change; any information you put in there may become outdated. One might think you can surely put the birth country of a customer into their customer number, but what if the customer made a mistake when signing up?

It may be useful to have some information in identifiers, for example a prefix C or O to indicate a customer or order number, but that can be implicit most of the time and just be added in the user interface or when printing labels.

To guarantee global uniqueness one has essentially two choices: enough randomness, or unique identifiers for space and time. The first option requires no coordination, neither synchronizing clocks nor making sure every process has a unique identifier; the second option allows you to have locality in identifiers generated at similar times or places.

Locality can be useful when identifiers are used as keys in databases or other data structures because it may avoid having every insert happen at a random place. On the other hand this may easily leak information, whether the number of your customers or orders per month or the number of servers you are using, especially if you are using counters, see further down.

One can use identifiers with locality internally but encrypt them before exposing them to the outside, but this comes with an entire new set of problems. One may have to include initialization vectors in the identifiers possibly making them longer. One may have to think about key rotations, at least for the case keys get compromised, possibly making identifiers even longer having to include key identifiers. One will have to do key management anyway and encryption of course costs clock cycles.

If one wants to include a time component, one can choose between actual time and counters. If one uses actual time, one has to ensure the clock never goes backwards and may have to account for limited resolution, possibly falling back to counters or randomness when identifiers are produced faster than the clock ticks. If one uses counters, one has to ensure the counter never runs backwards, for example by persisting it across application restarts, including crashes.
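For the actual-time case, a minimal sketch of a guard that never hands out a smaller timestamp than it already has, falling back to a counter when the system clock stalls or steps backwards (the names here are mine, not from any library):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// monotonicMillis hands out wall-clock milliseconds, but never a value
// smaller than or equal to one it has already returned: if the system
// clock regresses or stalls, it increments from the last value instead.
type monotonicMillis struct {
	mu   sync.Mutex
	last uint64
}

func (m *monotonicMillis) Now() uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	now := uint64(time.Now().UnixNano() / int64(time.Millisecond))
	if now <= m.last {
		m.last++ // clock stalled or went backwards: counter fallback
		return m.last
	}
	m.last = now
	return now
}

func main() {
	var clk monotonicMillis
	a, b := clk.Now(), clk.Now()
	fmt.Println(b > a) // true: values are strictly increasing
}
```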

If several processes share counters, synchronization overhead between cores may become an issue in high performance scenarios. Modern processors offer high frequency hardware timestamp counters that can be synchronized across cores. This may be especially interesting if the timestamps are also used for purposes like operation or transaction ordering.

Identifiers should be readable if only for debugging by developers. Similar characters like O, 0, 1, I, l, U and V should be avoided. One can consider a checksum or even error correcting codes for user facing identifiers.

Capitalization is a tradeoff between length and ease of use, but I like your idea of dealing with it behind the scenes. One probably loses on average less than one bit per character when ignoring capitalization, so for sufficiently long identifiers collisions are still very improbable, at least if one does not use identifiers with a lot of locality. But I fear users will just tend to include the capitalization because they are not aware it can probably be ignored. It surely adds complexity to the application, if only to the user interface, because you always have to be able to handle collisions.

Word of caution to anybody looking to use a similar method with SQL Server: the SQL Server uniqueidentifier type is stored differently on disk than binary(16). Instead of being ordered by bytes 0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15, uniqueidentifier values are ordered by bytes 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1-0.
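If you need ULID-style keys to sort correctly as binary despite this, one option is to permute the bytes into comparison order before storing them. A sketch, assuming the ordering quoted in the comment (its list stops at byte 1; the remaining byte 0 presumably comes last, since all 16 bytes must appear):

```go
package main

import "fmt"

// sortOrder lists a GUID's byte indices from most to least significant
// under SQL Server's uniqueidentifier comparison, per the ordering
// described above.
var sortOrder = [16]int{10, 11, 12, 13, 14, 15, 8, 9, 7, 6, 5, 4, 3, 2, 1, 0}

// sqlServerSortKey rearranges the 16 bytes so that sorting the result
// as plain binary matches uniqueidentifier sort order.
func sqlServerSortKey(guid [16]byte) [16]byte {
	var key [16]byte
	for i, src := range sortOrder {
		key[i] = guid[src]
	}
	return key
}

func main() {
	var g [16]byte
	for i := range g {
		g[i] = byte(i)
	}
	fmt.Println(sqlServerSortKey(g)) // [10 11 12 13 14 15 8 9 7 6 5 4 3 2 1 0]
}
```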

MSSQL does have a similar identifier, though: NEWSEQUENTIALID.

Yes, but NEWSEQUENTIALID isn't globally ordered, it's only ordered for the computer on which it was generated. For that matter, on a single computer the ordering isn't guaranteed[1]. This makes NEWSEQUENTIALID fairly useless if you're looking to merge records from disparate sources and sort them independently of origin.

[1] https://connect.microsoft.com/SQLServer/feedback/details/475...

Ah right, I stand corrected.

> This alphabet excludes the letters I, L, O, and U to avoid confusion and abuse.

Can anyone please elaborate? I don't get the confusion or abuse potential.

Douglas Crockford explains why he chose those 4 letters on this page: http://www.crockford.com/wrmg/base32.html

I, L, and O can be confused with the digits 1, 1, and 0, respectively.

He omits U to avoid "accidental obscenity."
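Crockford's design also has decoders accept the confusable letters as aliases for the digits they resemble. A sketch of that normalization, assuming the standard Crockford base-32 rules (U has no alias; it is simply never emitted):

```go
package main

import (
	"fmt"
	"strings"
)

// crockfordAlphabet is the 32-character encoding set: the digits 0-9
// plus the uppercase letters with I, L, O, and U removed.
const crockfordAlphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// normalize folds case and maps the easily confused letters onto the
// digits they look like: I and L decode as 1, O decodes as 0.
func normalize(s string) string {
	s = strings.ToUpper(s)
	return strings.NewReplacer("I", "1", "L", "1", "O", "0").Replace(s)
}

func main() {
	fmt.Println(normalize("oil")) // 011
}
```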

Not the author, but presumably because 1, I and l are visually hard to distinguish in some fonts. Same goes for O and 0.

As for U, I don't know about that, but I recently stumbled upon the issue that Google's OCR software seems to incorrectly recognize U as II relatively often.

I'm guessing the following pairs can be confused visually: O/0, I/l, U/V.

Depending on the font used, 'I' and 'L' can be easily confused by humans. 'O' can be read as '0' too. This is meant to prevent that.

As for the 'U', I'm not sure why the original ULID spec left it out. Thanks for raising this. I'll investigate.

Typically, 'U' is omitted to avoid accidentally putting the word FUCK into the generated output. No, not kidding.

See http://www.crockford.com/wrmg/base32.html, Ctrl-F, search for obscenity.

Edit: Interestingly, since it already omits the letter I, adding U covers all of George Carlin's 7 Dirty Words: https://en.wikipedia.org/wiki/Seven_dirty_words

This is so bizarre that I don't even know where to start to try and understand it.

Is it a joke? Tongue-in-cheek? It must be. I can't understand how it could possibly be a legitimate concern. What about other languages? What adult doesn't use 'fuck' conversationally? If it were a real concern, then surely drawing attention to it in the design of the ID is more conspicuous than the very rare occasion of 'dirty words' (by some definition of dirty) appearing in otherwise random strings.

So I have to conclude it's a joke, but then what the fuck?

This type of thing puts noise into engineering details that just ends up adding to confusion and ambiguity. If there's no (important) reason for 'U' to be in the omission set, then just say it's an arbitrary choice; if there is an OCR or legibility reason, then say that.

Half of my brain says it's funny, and half my brain says it's fucked up.

I see where you're coming from. On the other hand, since they were after base 32, they were able to choose an additional letter to omit. By omitting the U, they happen to exclude the most popular English-language expletives.

These things can and do end up in, for example, URLs. Think of something like an e-commerce store order ID. While not likely to happen, the below could appear in an email and might invite unneeded controversy:

"Dear Mr Customer, Please Click Here to see your order status: http://example.com/status/FUCKUP6433432334234 "

Ah, yes, OK, I can understand that. I can see from a marketing point of view that you might want to simply and quietly avoid the (roughly) 1 in 1,000,000 IDs which start with 4 specific characters.

Still, it's a bit culturally specific. It gives the impression of solving a problem that it doesn't. Depending on your market/language, a different character might be a better choice. Plus, there are quite a few 'bad' words that can still appear.

Filtering and rejecting proposed IDs against a list of objectionable words seems like a far more realistic solution than baking a very specific exclusion into the plumbing.

Omitting U is great bang for your buck.

When I wrote a simple password generator, as well as omitting u and U, I also omitted all of 1iIlL and oO0.

Then I thought - the hell with it, I'll leave out all vowels to reduce the chance of offending someone.

I had to increase the length to compensate. By how much I'll leave as an exercise for the reader.

Well, you need to leave 4 characters out to get down to 32, so unless there is a better character, the selection is arbitrary.

I think this is a problem that was demonstrated quite well to anyone who ever had to type in passwords for video games.

Sibling posts explain why better. This isn't something that you forget once it has bitten you.

Will the ulid be unique even if it is generated from 2 different machines at the same millisecond?

Yes, if your source of entropy is worth its salt.

That was bad and you should feel bad. :)

Yes, if you use a good source of entropy. In Go, the easiest and most secure is `crypto/rand.Reader`. But it's also slow. If you don't need the security properties, an instance of `math/rand.Rand` should suffice. Otherwise, you can implement your own. It's just an io.Reader!
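For illustration, here is a self-contained sketch of the idea, not the oklog/ulid implementation itself: 48 bits of millisecond timestamp followed by 80 bits from crypto/rand, encoded as 26 Crockford base-32 characters (26 x 5 = 130 bits, so the first character carries only 2 bits):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"time"
)

const alphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// newULID builds a 26-character ULID-style string: a big-endian 48-bit
// millisecond timestamp in the first 6 bytes, then 10 random bytes.
func newULID() (string, error) {
	var b [16]byte
	ms := uint64(time.Now().UnixNano() / int64(time.Millisecond))
	b[0] = byte(ms >> 40)
	b[1] = byte(ms >> 32)
	b[2] = byte(ms >> 24)
	b[3] = byte(ms >> 16)
	b[4] = byte(ms >> 8)
	b[5] = byte(ms)
	if _, err := rand.Read(b[6:]); err != nil {
		return "", err
	}
	return encodeBase32(b), nil
}

// encodeBase32 maps 128 bits onto 26 base-32 characters by walking
// 5-bit windows over a virtual 130-bit value whose top 2 bits are zero.
func encodeBase32(b [16]byte) string {
	out := make([]byte, 26)
	for i := 0; i < 26; i++ {
		start := i*5 - 2 // bit offset into the data; negative = padding
		var v uint
		for j := 0; j < 5; j++ {
			bit := start + j
			if bit < 0 {
				continue // zero padding in the first character
			}
			if b[bit/8]&(1<<(7-uint(bit%8))) != 0 {
				v |= 1 << uint(4-j)
			}
		}
		out[i] = alphabet[v]
	}
	return string(out)
}

func main() {
	id, err := newULID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id, len(id)) // 26-character ID
}
```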

Doesn't that mean that it will usually — but not guaranteed always — be unique? That is, won't there always be a (minute and negligible depending on the intended use) chance of collision?

Yes, there's always a negligible chance of collision.
