
Universally Unique Lexicographically Sortable Identifier in Go - tsenart
https://github.com/oklog/ulid
======
Dowwie
Seems that someone ported the JavaScript ulid library to Python:
[https://github.com/mdipierro/ulid](https://github.com/mdipierro/ulid)

.. and it appears to be working

~~~
aruggirello
Out of curiosity, I just wrote a PHP ulid() function. [1]

It took me a little more than 5 minutes, so it wasn't that hard after all. It
appears to run at only about 35% the speed of the Go version, but hopefully
there's still room for some little improvement.

[1] [https://github.com/phptools/ulid](https://github.com/phptools/ulid)

~~~
tsenart
Fellow PHP implementation:
[https://github.com/Lewiscowles1986/ulid](https://github.com/Lewiscowles1986/ulid)

------
lobster_johnson
I wonder why you would ever want it to be case-insensitive? Surely that
increases the possibility of clashes, and violates the very good principle of
least surprise (pretty much every other ID in the world is case-sensitive).

~~~
quink
The hex representation of GUIDs/UUIDs is case insensitive in most
interpretations, I'd say that counts enough.

~~~
lobster_johnson
I'd be surprised if many developers these days actually _compared_ UUIDs case-
insensitively, though, even if they're intended to be. Microsoft, which have
relied on GUIDs for years going back to their version of DCE/RPC, probably
does it right.

If it's part of a spec, then it's simpler, but it's still a potential point of
surprise.
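
A minimal Python sketch of the difference between comparing UUID strings as text and comparing them by value. The helper name `same_uuid` and the sample UUID are made up for illustration; only the standard `uuid` module is used.

```python
import uuid


def same_uuid(a: str, b: str) -> bool:
    """Compare two textual UUIDs by value rather than by spelling."""
    # uuid.UUID parses the hex digits, so letter case no longer matters.
    return uuid.UUID(a) == uuid.UUID(b)


upper = "123E4567-E89B-12D3-A456-426614174000"  # arbitrary sample value
lower = upper.lower()

print(upper == lower)           # plain string comparison sees two different IDs
print(same_uuid(upper, lower))  # value comparison treats them as the same ID
```

Code that stores or compares the raw strings gets the first behaviour unless it normalizes case on the way in.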

------
TeeWEE
Note: you can't compare ULIDs with UUIDs.

      - A ULID is "sortable". But the whole point of a UUID is to be a random, unique, non-guessable ID. Sortability is not a
        feature you normally want from a UUID. UUIDs can still be sorted, but the resulting order has no meaning.
      - A ULID also encodes time; a UUID doesn't.
      - A UUID has less chance of clashing: it's 128 bits vs. 80 bits for a ULID.
      - A UUID is also URL-safe.
      - A ULID is case-insensitive. I don't see how this is an advantage.

~~~
_ak
You just compared UUIDs with ULIDs.

~~~
cloudhead
Maybe he means programmatically?

~~~
icholy
I think he's joking.

~~~
throwaway98237
And nailed it.

------
joshuak
I've been working on a closely related problem of universal identification in
distributed computing for a few years now. It's now in standards review and
hopefully publishable soon.

We came to the conclusion that universal ids should represent identity only,
and explicitly not have 'metadata'. What is the use of 48 bits of time? It
reduces the overall entropy, for what? If time is important then why not make
the id literally time (i.e. UnixNano), if it isn't then why not make all bits
rand?

Also, while I think speak-ability is actually very important (many disagree),
I'd assert that capitalization is better addressed from the opposite direction
(i.e. UI). Instead of removing capital letters from the ID itself affecting
all cases, address the human factors in the few cases it comes up.

I think this is good advice in general:

When spoken aloud we suggest that you don't indicate capitalization at first.
In many cases, such as search, human validation etc, this is more than enough
precision, then one can add capitalization for disambiguation as needed.

For example:

"ab2Cd3Ef1g"

Spoken becomes: "a b two c d three e f one g. Capitalize 4 and 7, c and e"

With the capitalization part optional depending on context.
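
The speaking rule above can be sketched in Python. The function name `speak` and the digit table are assumptions made for illustration; positions are counted from 1, as in the example.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def speak(identifier: str) -> str:
    """Spell out an ID without case first, then list the capitals."""
    # Digits become words; letters are read out in lower case.
    words = [DIGIT_WORDS.get(ch, ch.lower()) for ch in identifier]
    # Remember the 1-based position and letter of every capital.
    caps = [(i, ch.lower()) for i, ch in enumerate(identifier, start=1)
            if ch.isupper()]
    spoken = " ".join(words)
    if not caps:
        return spoken
    positions = " and ".join(str(i) for i, _ in caps)
    letters = " and ".join(ch for _, ch in caps)
    return f"{spoken}. Capitalize {positions}, {letters}"


print(speak("ab2Cd3Ef1g"))
# a b two c d three e f one g. Capitalize 4 and 7, c and e
```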

~~~
tsenart
> What is the use of 48 bits of time? It reduces the overall entropy, for
> what? If time is important then why not make the id literally time (i.e.
> UnixNano), if it isn't then why not make all bits rand?

For some designs it's useful to have identifiers have other properties than
uniqueness. In this case, this property is relative lexicographical (and
binary) order based on time so that you can leverage the order between the
things the identifiers identify without looking at the things. The entropy is
there to satisfy the uniqueness property (with some acceptable degree of
collision, application dependent). The time is there to satisfy the ordering
property.
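
To make the two properties concrete, here is a rough Python sketch of the layout discussed in this thread: a 48-bit millisecond timestamp followed by 80 bits of entropy, written as 26 Crockford base32 characters. `make_ulid` is a hypothetical name, not the API of oklog/ulid or any other library, and the sketch ignores concerns such as monotonicity within one millisecond.

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U


def make_ulid(ms=None, entropy=None):
    """Pack a 48-bit timestamp and 80 random bits into 26 characters."""
    if ms is None:
        ms = time.time_ns() // 1_000_000   # Unix time in milliseconds
    if entropy is None:
        entropy = os.urandom(10)           # 80 bits from the OS
    if not 0 <= ms < 1 << 48:
        raise ValueError("timestamp does not fit in 48 bits")
    if len(entropy) != 10:
        raise ValueError("entropy must be exactly 10 bytes")
    # Time occupies the high bits, so it dominates any comparison.
    value = (ms << 80) | int.from_bytes(entropy, "big")
    chars = []
    for _ in range(26):                    # 26 * 5 = 130 bits >= 128
        chars.append(CROCKFORD[value & 31])
        value >>= 5
    return "".join(reversed(chars))


earlier = make_ulid(ms=1000, entropy=b"\xff" * 10)
later = make_ulid(ms=1001, entropy=b"\x00" * 10)
print(earlier < later)  # string order follows the timestamp, not the entropy
```

Because the alphabet is in ascending ASCII order and every ID has the same length, plain string comparison gives the same result as comparing the underlying 128-bit values.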

~~~
joshuak
Yes very good point. However, as @danbruc rightly points out this raises all
sorts of other concerns. A user of these IDs may not realize that the
reliability of the ordering can be substantially reduced depending on where
the IDs are generated.

Some applications may be able to tolerate inconsistencies in ordering, others
may not. Are IDs being generated on multiple machines? Are they in sync? What
happens if the system clock is adjusted, or a container/VM is restarted on
different hardware?

This design implies that these IDs are being generated in different locations,
but this usage leads to the least reliable time. How many bits of approximate
time does one really need? Not 48 surely.

On the other hand if you generate the IDs in the most reliable model, a single
host with persistent storage to prevent regression, you've basically made an
unnecessarily complicated vector clock. A simple incremental counter would
work at least as well, and be far simpler.

~~~
tsenart
> A user of these IDs may not realize that the reliability of the ordering can
> be substantially reduced depending on where the IDs are generated.

That can only be addressed with improved documentation and shared
understanding of the subtleties and pitfalls of distributed time
synchronisation.

> Some applications may be able to tolerate inconsistencies in ordering,
> others may not.

Indeed. Proper thought must be put into this sort of thing. ULIDs aren't an
exception nor a silver bullet.

> How many bits of approximate time does one really need?

Entirely application dependent.

~~~
joshuak
Agreed.

The point being these distinctions lead to the conclusion that this identifier
isn't 'generally' useful, and even under optimal conditions its utility is
questionable. For example extra precision for an approximate value is not
application dependent at all. The low order bits of the time component have no
actionable meaning, though they imply sort order. That's the kind of subtle
error in reasoning that is really easy to make here. I think there are too
many land mines hidden here to make this useful.

~~~
sagichmal
You are letting the perfect be the enemy of the good. Perfect general
applicability to all problem domains is not a requirement of utility.
Engineering _is_ tradeoff analysis.

------
lazulicurio
Word of caution to anybody looking to use a similar method working with SQL
Server: the SQL Server uniqueidentifier type is stored differently on disk
than binary(16). Instead of being ordered by bytes
0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15, uniqueidentifier values are ordered by
bytes 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1-0.
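
A small Python sketch of what that means for sorting, assuming the byte order described above with byte 0 compared last. The names `SQL_SERVER_ORDER` and `uniqueidentifier_sort_key` are made up here, and the order should be checked against SQL Server's own documentation before relying on it.

```python
# Assumed comparison order, most significant byte first.
SQL_SERVER_ORDER = [10, 11, 12, 13, 14, 15, 8, 9, 7, 6, 5, 4, 3, 2, 1, 0]


def uniqueidentifier_sort_key(raw: bytes) -> bytes:
    """Rearrange 16 raw bytes so plain byte order mimics the assumed order."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    return bytes(raw[i] for i in SQL_SERVER_ORDER)


a = bytes([1] + [0] * 15)             # only byte 0 is set
b = bytes([0] * 10 + [1] + [0] * 5)   # only byte 10 is set

print(sorted([a, b]) == [b, a])                                  # binary(16) order
print(sorted([a, b], key=uniqueidentifier_sort_key) == [a, b])   # reordered
```

A time-prefixed value stored as binary(16) sorts by its leading bytes, while under this ordering the leading bytes are the last ones consulted.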

~~~
zamalek
MSSQL does have a similar identifier, though - NEWSEQUENTIALID.

~~~
lazulicurio
Yes, but NEWSEQUENTIALID isn't globally ordered, it's only ordered for the
computer on which it was generated. For that matter, on a single computer the
ordering isn't guaranteed[1]. This makes NEWSEQUENTIALID fairly useless if
you're looking to merge records from disparate sources and sort them
independently of origin.

[1]
[https://connect.microsoft.com/SQLServer/feedback/details/475...](https://connect.microsoft.com/SQLServer/feedback/details/475131/newsequentialid-is-not-sequential)

~~~
zamalek
Ah right, I stand corrected.

------
cfv
> This alphabet excludes the letters I, L, O, and U to avoid confusion and
> abuse.

Can anyone please elaborate? I don't get the confusion or abuse potential.

~~~
tsenart
Depending on the font used, 'I' and 'L' can be easily confused by humans. 'O'
can be read as '0' too. This is meant to prevent that.

As for the 'U', I'm not sure why the original ULID spec left it out. Thanks
for raising this. I'll investigate.
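
For illustration, a Python sketch of how a decoder can absorb the look-alike characters, following Crockford's base32 description, where I and L are read as 1 and O as 0. The function names are hypothetical, and hyphens and check symbols from that description are left out.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # 32 symbols, no I, L, O, U
LOOKALIKES = str.maketrans("ILO", "110")        # I -> 1, L -> 1, O -> 0


def normalize(text: str) -> str:
    """Upper-case the input and fold look-alike letters into digits."""
    return text.upper().translate(LOOKALIKES)


def decode(text: str) -> int:
    """Decode a base32 string to an integer; raises ValueError on 'U'."""
    value = 0
    for ch in normalize(text):
        value = value * 32 + ALPHABET.index(ch)
    return value


print(decode("O1") == decode("0l") == 1)  # misread characters decode the same
```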

~~~
tyingq
Typically, 'U' is omitted to avoid accidentally putting the word FUCK into the
generated output. No, not kidding.

See
[http://www.crockford.com/wrmg/base32.html](http://www.crockford.com/wrmg/base32.html),
Ctrl-F, search for obscenity.

Edit: Interestingly, since it already omits the letter I, adding U covers all
of George Carlin's _7 Dirty Words_:
[https://en.wikipedia.org/wiki/Seven_dirty_words](https://en.wikipedia.org/wiki/Seven_dirty_words)

~~~
joshuak
This is so bizarre that I don't even know where to start to try and understand
it.

Is it a joke? Tongue-in-cheek? It must be. I can't understand how it could
possibly be a legitimate concern. What about other languages? As adults, who
_doesn't_ use 'fuck' conversationally? If it were a real concern then surely
drawing attention to it in the design of the ID is more obvious than the
very rare occasion of 'dirty words' (by some definition of dirty) appearing in
otherwise random strings.

So I have to conclude it's a joke, but then what the fuck?

This type of thing puts noise into engineering details that just ends up adding
to confusion and ambiguity. If there's no (important) reason for 'U' to be in
the omission set, then just say it's an arbitrary choice, if there is an OCR,
or legibility reason then say that.

Half of my brain says it's funny, and half my brain says it's fucked up.

~~~
tyingq
I see where you're coming from. On the other hand, since they were after
base32, they were able to choose an additional letter to omit. By omitting the
U, they happen to exclude the most popular English language expletives.

These things can and do end up in, for example, urls. Think something like an
ecommerce store order id. While not likely to happen, the below could happen
in an email, and might invite unneeded controversy:

_"Dear Mr Customer, Please Click Here to see your order status:
[http://example.com/status/FUCKUP6433432334234](http://example.com/status/FUCKUP6433432334234)"_

~~~
joshuak
Ah yes ok I can understand that. I can see from a marketing point of view that
you might want to simply and quietly avoid the (roughly) 1 in 1,000,000 ids
which start with 4 specific characters.

Still, it's a bit culturally specific. It gives the impression of solving a
problem that it doesn't. Depending on your market/language a different
character might be a better choice. Plus there are quite a few 'bad' words
that can still appear.

Filtering and rejecting proposed IDs based on a list of objectionable words
seems like a far more realistic solution than baking a very specific
exclusion into the plumbing.
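
A rough Python sketch of that filter-and-reject approach. The alphabet, the ID length, and the placeholder blocklist are all assumptions; a real list would be chosen per market and language.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_uppercase  # assumed 36-symbol alphabet
BLOCKLIST = {"DARN", "HECK"}                        # hypothetical placeholder words


def is_acceptable(candidate: str, blocklist=BLOCKLIST) -> bool:
    """True when no blocked word appears anywhere in the candidate."""
    return not any(word in candidate for word in blocklist)


def generate_id(length: int = 16, blocklist=BLOCKLIST) -> str:
    """Draw random IDs until one passes the blocklist check."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if is_acceptable(candidate, blocklist):
            return candidate


print(is_acceptable("12HECK34"))  # rejected: contains a blocked word
print(generate_id())
```

Rejections are rare for short lists, so the loop almost always returns on the first draw, and the list can change without touching the encoding.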

~~~
emmelaich
Omitting U is great bang for your buck.

When I wrote a simple password generator as well as omitting u or U I also
omitted all of 1iIlL and oO0.

Then I thought - the hell with it, I'll leave out all vowels to reduce the
chance of offending someone.

I had to increase the length to compensate. By how much I'll leave as an
exercise for the reader.
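
One way to work the exercise, as a Python sketch. It assumes the starting alphabet was the 62 ASCII letters and digits and that the target is a fixed number of bits of entropy; neither is stated above.

```python
import math
import string

FULL = string.digits + string.ascii_letters   # assumed starting alphabet: 62
DROPPED = set("1iIlLoO0uUaAeE")                # look-alikes plus all vowels
REDUCED = [ch for ch in FULL if ch not in DROPPED]


def length_for(bits: int, alphabet_size: int) -> int:
    """Characters needed to carry at least `bits` bits of entropy."""
    return math.ceil(bits / math.log2(alphabet_size))


print(len(FULL), length_for(64, len(FULL)))        # 62 11
print(len(REDUCED), length_for(64, len(REDUCED)))  # 48 12
```

Under these assumptions a 64-bit password grows by a single character, since each symbol drops from about 5.95 to about 5.58 bits.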

------
pritambarhate
Will the ulid be unique even if it is generated from 2 different machines at
the same millisecond?

~~~
drfuchs
Yes, if your source of entropy is worth its salt.
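
As a back-of-the-envelope check, the usual birthday approximation can be sketched in Python. It assumes the 80 entropy bits cited in this thread are uniformly random and independent across machines; the function name is made up here.

```python
import math

RANDOM_BITS = 80  # entropy bits per ULID, as cited in this thread


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Approximate chance that at least two of n random values collide."""
    pairs = n * (n - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed via expm1 to keep tiny values.
    return -math.expm1(-pairs / 2.0 ** bits)


for n in (2, 1_000, 1_000_000):
    print(n, collision_probability(n))
```

Only IDs minted in the same millisecond compete with each other, so `n` is the number generated in that millisecond across all machines, not the total ever issued.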

~~~
vlowther
That was bad and you should feel bad. :)

