
ULID: Universally Unique Lexicographically Sortable Identifier - brunoluiz
https://github.com/ulid/spec
======
zackmorris
See also Firebase push IDs:

[https://firebase.googleblog.com/2015/02/the-2120-ways-to-ens...](https://firebase.googleblog.com/2015/02/the-2120-ways-to-ensure-unique_68.html)

Lexicographically sortable identifiers are critical for any distributed data
store if you want anything close to consistency. I've run into the issue of
not having them and having to settle for some kind of <autoincrement_id><UUID>
key and it's a huge PITA. How this wasn't considered database 101 decades ago
just blows my mind.

I'd like to see a spec included in this for synchronizing clocks or using
RAFT/Paxos for generating ULIDs with strong guarantees on sort order.

Also a minor gripe - I wish the ULID spec checked for collisions at microsecond
rather than millisecond granularity, because that would be more useful for
realtime networked gaming and simulations.

~~~
joatmon-snoo
> Lexicographically sortable identifiers are critical for any distributed data
> store if you want anything close to consistency.

wat?

This doesn't make any sense to me. Requiring that events be partially ordered
if not totally ordered is a consistency requirement, sure, but I'm failing to
understand how a database's consistency depends on the data you're putting in
the database.

~~~
cryptonector
And now you need every host to have an atomic clock too.

Maybe CRDTs would be better.

~~~
edoceo
I use ULID in a distributed system where some clocks are off by a bit; it still
works well until the skew gets really bad. The time can be extracted from the
ULID for verification before INSERT.

~~~
riffraff
Won't removing the time bits increase the chance of collisions? UUIDs rely on a
lot of randomness to avoid the issue, but in ULID this is much more limited,
and the timestamp fills in for some of it, afaiu.

~~~
edoceo
We don't remove the time bits; we inspect them for reasonable values. E.g.:
never older than 2014-01-01, and anything more than 24h off from our clock gets
extra logging + metadata.
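
For illustration, a sketch of that kind of pre-INSERT check in Python (the 10-character/48-bit timestamp layout is from the spec; the helper names and thresholds are made up to mirror the description above):

    # Sketch: decode the 48-bit millisecond timestamp from a ULID string and
    # sanity-check it before INSERT. Alphabet per Crockford base32.
    CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
    LOOKUP = {c: i for i, c in enumerate(CROCKFORD)}

    def ulid_timestamp_ms(ulid: str) -> int:
        """The first 10 characters of a ULID encode 48 bits of Unix time in ms."""
        value = 0
        for char in ulid[:10].upper():
            value = value * 32 + LOOKUP[char]
        return value

    def looks_reasonable(ulid: str, now_ms: int) -> bool:
        ts = ulid_timestamp_ms(ulid)
        floor_ms = 1388534400000  # 2014-01-01T00:00:00Z, per the comment above
        return ts >= floor_ms and abs(now_ms - ts) <= 24 * 3600 * 1000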

------
brdd
We have used ULIDs in production for over a year now, and have generated
millions of them.

First, the main benefit of ULID is that you can generate the IDs within your
own software rather than rely on the database. We can queue them or even
reference them before they land in the database. The traditional roundtrip has
been eliminated.

Secondly, being able to sort ULIDs is a nice plus, although not that big of a
deal. It makes it relatively easy to shard or partition databases, and it
provides a convenient sort if you're not looking for extreme accuracy.

ULIDs are also shorter and slightly more user friendly than UUIDs.

In some circumstances we found the actual implementations to be slightly
lacking. For example, the JS library for ULID once returned a 25 character
string rather than the standard 26 characters, causing a big ruckus that we
had to manually resolve.
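
For context, app-side generation is roughly the following (a sketch following the spec's 48-bit timestamp + 80-bit randomness layout, not any particular library's code):

    import os, time

    CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

    def new_ulid() -> str:
        """48-bit ms timestamp + 80 random bits, base32-encoded to exactly 26 chars."""
        ts = int(time.time() * 1000) & ((1 << 48) - 1)
        rand = int.from_bytes(os.urandom(10), "big")   # 80 bits of randomness
        value = (ts << 80) | rand                      # 128 bits total
        chars = []
        for _ in range(26):                            # 26 * 5 = 130 bits; top 2 are zero
            chars.append(CROCKFORD[value & 0x1F])
            value >>= 5
        return "".join(reversed(chars))

The fixed-width loop matters: always emitting 26 characters, rather than trimming leading zeros, avoids the kind of 25-character surprise described above.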

~~~
munk-a
> First, the main benefit of ULID is that you can generate the IDs within your
> own software rather than rely on the database. We can queue them or even
> reference them before they land in the database. The traditional roundtrip
> has been eliminated.

No you cannot, unless you're running a single-threaded server process on a
single machine. What you can do is _gamble_ that you probably won't have a
collision, which is the same thing you could do with regular UUIDs, and you'd
be (nearly) guaranteed to never hit a conflict with UUID4 (or probably UUID1/2
if you trusted your MAC address uniqueness).

You may find this gamble acceptable, and many people do, but you should be
aware that pre-generation of UUIDs on independent systems without coordination
is not a solvable problem - all attempts to do so rely on the extreme
unlikelihood of a collision to feel good about it, or use some coordinated
information (like a guaranteed-unique MAC address).

(Again, if it's good enough for you, right on - but it isn't theoretically
safe)

~~~
bearmcbearsly
> No you cannot, unless you're running a single threaded server process on a
> single machine. What you can do is _gamble_ that you probably won't have a
> collision

This seems like a pointless distinction.

If I did the math right, you can generate 1,000,000 ULIDs per second (1000 per
millisecond) for around 50 million years before you can expect to hit your
first collision.

I don't know about you, but I'm pretty sure any system I build won't be
running 50 million years from now. Not to mention that the timestamp portion
of the ULID will overflow in a mere 9000 years.
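
A back-of-the-envelope check of that figure, assuming perfect entropy (see the objection below): collisions can only happen between IDs that share a millisecond, so with 1,000 IDs per millisecond the birthday approximation gives:

    from math import comb

    ids_per_ms = 1_000
    pairs = comb(ids_per_ms, 2)              # ~5.0e5 colliding pairs per millisecond
    p_per_ms = pairs / 2**80                 # ~4.1e-19 chance of a collision per ms
    ms_until_expected = 1 / p_per_ms
    years = ms_until_expected / (1000 * 60 * 60 * 24 * 365)
    print(f"{years:.1e} years")              # ~7.7e+07, i.e. tens of millions of years

which lands in the same tens-of-millions-of-years ballpark as the estimate above.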

~~~
yen223
Does your math rely on perfect entropy on machines you don't control?

~~~
ngrilly
Yes. I did the math and got the same result, relying on a perfect entropy
hypothesis.

------
erik_seaberg
There are definitely systems out there whose clocks are not accurate to the
millisecond. It's not healthy for systems to encourage false assumptions
(e.g., that ids monotonically increase).

~~~
nixpulvis
I was surprised to find nothing about clock synchronization or quality in the
README...

> Monotonic sort order (correctly detects and handles the same millisecond)

EDIT: None of these concerns are directly related to the data format, but
would be something I'd explain before users make false assumptions.
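
For reference, the behaviour that README bullet describes is roughly the following (a sketch, not the spec's reference implementation): within one generator, IDs created in the same millisecond reuse the previous 80-bit random value incremented by one.

    import os, time

    class MonotonicULIDParts:
        """Sketch of ULID monotonic mode: same-millisecond IDs reuse the last
        random value plus one, so they still sort in generation order."""

        def __init__(self):
            self.last_ts = -1
            self.last_rand = 0

        def next_parts(self):
            ts = int(time.time() * 1000)
            if ts == self.last_ts:
                self.last_rand += 1        # per the spec, overflowing here is an error
            else:
                self.last_ts = ts
                self.last_rand = int.from_bytes(os.urandom(10), "big")
            return ts, self.last_rand

Note this only guarantees ordering within a single generator; it says nothing about clock quality across machines, which is the concern raised above.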

------
Dylan16807
The failure on overflow is weird.

Since the counter starts at a random point and isn't allowed to wrap, even if
you generate a small number of IDs per millisecond you have a constant ~1/2^79
chance of failing. The chance is small but reachable for a large network.
(Bitcoin does 2^88 hashes a day.)

It could have just wrapped with no problem, because no single node could
possibly generate 2^80 IDs by itself. And if you have multiple nodes, refusing
to wrap doesn't help there either.

~~~
daveFNbuck
One of the constraints is that lexicographic sorting must order the IDs by
generation time. This precludes wrapping. It's a little weird that they didn't
just reserve or add some bits for sorting within milliseconds, before or
instead of having to increment the randomness.

~~~
Dylan16807
> One of the constraints is that lexicographic sorting must order the IDs by
> generation time. This precludes wrapping.

I suppose, if you're particularly worried about the one-node situation.

> It's a little weird that they didn't just reserve or add some bits for
> sorting within milliseconds before or instead of having to increment the
> randomness.

There's no particular reason to make the fields separate. If you mask out the
top bit of the random number then they can share and also make overflow
effectively impossible.

It's also worth noting that there are two bits that go completely unused. They
could eat overflow if the layout was slightly rearranged.

~~~
daveFNbuck
> I suppose, if you're particularly worried about the one-node situation

This algorithm doesn't work if you have multiple nodes, so I don't think any
other situation is relevant here.

> There's no particular reason to make the fields separate. If you mask out
> the top bit of the random number then they can share and also make overflow
> effectively impossible

Yes, this is what I meant by reserving some bits.

> It's also worth noting that there are two bits that go completely unused.
> They could eat overflow if the layout was slightly rearranged.

Yeah, it's super weird they didn't do that. I guess they just really like
having a power of two number of bits.

~~~
Dylan16807
> This algorithm doesn't work if you have multiple nodes

Then what's the random part for?

> Yes, this is what I meant by reserving some bits.

What I'm saying is, if you separate it out into an increment-field and a
random-field, then you need a _lot_ of bits and you need to fundamentally
change how it works.

If you merely make sure your random number starts below some threshold, you
only need 1 bit, or a small fraction of a bit. _You would still increment the
random number_ , but you wouldn't have to worry about hitting the max value if
you always start between 0 and 2^79.9, for example.
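
In code, the threshold idea being described looks roughly like this (a sketch of the proposal in this comment, not anything in the spec):

    import os

    RANDOM_BITS = 80

    def initial_random(reserve_bits: int = 1) -> int:
        """Start the per-millisecond value below 2**(80 - reserve_bits) (e.g. by
        masking the top bit), so incrementing within one millisecond can't hit
        the maximum unless a single node generates ~2**79 IDs in that millisecond."""
        value = int.from_bytes(os.urandom(RANDOM_BITS // 8), "big")
        return value & ((1 << (RANDOM_BITS - reserve_bits)) - 1)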

~~~
daveFNbuck
> Then what's the random part for?

It's for multiple nodes, which is why this algorithm doesn't make any sense
for their use case.

------
xucheng
It’s worth noting that, since IDs generated in the same millisecond are
incremented, ULIDs may be vulnerable to enumeration attacks. So they should not
be used to generate one-time tokens or object IDs exposed in URL addresses.

------
BugsJustFindMe
> _Uses Crockford 's base32 for better efficiency and readability_

Ugh. Crockford's base32 character set doesn't actually solve any of the
problems it sets out to solve. Using it suggests to me some uncritical
thinking.

It[0] says things like L is excluded, because "[uppercase] L Can be confused
with 1". Ignoring the part where that is wildly inaccurate for any font that
I've _ever_ seen, why not then also remove G, 6, B, 8, Z, 2, S, or 5?

Reducing 1/I/i/L/l to just 1 does little to resolve visual ambiguity for
users: a user could just as easily read l or I instead of 1, or O instead of 0,
because users don't know your made-up rules. That causes real problems, because
you often don't control both sides of the channel.

[0] -
[https://www.crockford.com/wrmg/base32.html](https://www.crockford.com/wrmg/base32.html)

~~~
grzm
Crockford Base32 folds those ambiguous characters on reading: l, L, I, and i
are treated as 1; o and O are treated as 0. The user doesn’t need to know the
rules.

From the page you quote:

> _”When decoding, upper and lower case letters are accepted, and i and l will
> be treated as 1 and o will be treated as 0. When encoding, only upper case
> letters are used.”_
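
The folding on decode looks roughly like this (a sketch; real implementations also handle hyphens and the optional check symbol):

    CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

    DECODE = {c: i for i, c in enumerate(CROCKFORD)}
    DECODE.update({c.lower(): i for c, i in list(DECODE.items())})
    # Fold the ambiguous characters the encoder never emits:
    DECODE.update({"O": 0, "o": 0, "I": 1, "i": 1, "L": 1, "l": 1})

    def crockford_decode(s: str) -> int:
        value = 0
        for char in s:
            value = value * 32 + DECODE[char]
        return value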

~~~
BugsJustFindMe
G/6, B/8, Z/2, and S/5 are much more likely to actually be visually ambiguous
than L/1.

~~~
ivan_gammel
1/l is visually ambiguous. Since the encoding is case-insensitive, upper-case
L must also be excluded. For the same reason lower-case L is not included in
Base58.

~~~
BugsJustFindMe
All codes are encoded only to uppercase in crockford32. If you actually want
to get rid of the ambiguity, get rid of 1 and keep L.

~~~
ivan_gammel
>All codes are encoded only to uppercase.

False. ULID specification clearly says that these identifiers are case-
insensitive. There's no "uppercase" requirement anywhere.

~~~
jacobr1
Crockford base32 accepts any case, but the canonical output is uppercase.

------
gregwebs
cuid removes more of the randomness and adds a counter and a fingerprint:
[https://github.com/ericelliott/cuid](https://github.com/ericelliott/cuid)

The default id in MongoDB does about the same. I always thought the MongoDB
identifiers worked well for a lot of use cases.

It's also worth mentioning that incrementing integer IDs can scale just fine if
you reserve them in large blocks, though they are then no longer guaranteed to
match insertion order, e.g.: [https://github.com/pingcap/docs/blob/master/sql/mysql-compat...](https://github.com/pingcap/docs/blob/master/sql/mysql-compatibility.md#auto-increment-id)
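
The block-reservation idea is simple enough to sketch (hypothetical interface; TiDB and friends do something like this internally): each node claims a range from a shared counter once, then hands out IDs locally until the range runs out.

    class BlockAllocator:
        """Sketch: reserve auto-increment IDs in large blocks so each node only
        coordinates once per block. IDs stay unique but need not match insert order."""

        def __init__(self, reserve_block, block_size=30_000):
            self._reserve = reserve_block    # e.g. one atomic fetch-and-add in the DB
            self._size = block_size
            self._next = self._end = 0

        def next_id(self) -> int:
            if self._next >= self._end:
                self._next = self._reserve(self._size)  # start of a fresh block
                self._end = self._next + self._size
            self._next += 1
            return self._next - 1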

------
mehrdadn
I like the idea; I just also feel that 1.21e+24 unique ULIDs per millisecond
seems kind of defeated by the millisecond accuracy. This means there are
effectively two tolerance values for time at play in the design of this spec,
and they conflict with each other. If we want users to be able to generate
ULIDs on such a short timescale (implying it's a realistic use case), then it
seems they should also be able to get comparable accuracy on the timestamp
itself.

------
krupan
It would sure be nice if git commit IDs could use something like this. It
would be really convenient if you could look at two commit IDs and know which
one is older.

------
Kip9000
What problem does this solve? Why is it necessary to sort a unique id?

~~~
taeric
It can help operations if you know when an id came from: when you get a request
id from a user, you don't also have to ask for a "when".

~~~
mcbits
> when you get a request id from a user

Just a public service reminder that "never trust the client" still applies, in
case you were imagining user agents generating their own ULIDs to relieve
servers of the duty or something. Nothing prevents them from sending duplicate
or out-of-order IDs.

~~~
taeric
I've always been intrigued by having clients provide the id. I think I
understand why that choice is made, but it does seem unusually error-prone.

------
marknadal
We have been doing something similar for a long time, works out great. Glad to
see more industry adoption around this!

We also wrote a decentralized clock sync algorithm that can be used where NTP
fails, check out
[https://github.com/amark/gun/blob/master/nts.js](https://github.com/amark/gun/blob/master/nts.js)
!

I find it a little odd they didn't use a separator symbol, so that it wouldn't
have to overflow after a certain year. Also, you could then have microsecond
precision or beyond where it is supported.

Overall good progress getting people onboard with this! Solves a lot of
problems before they even start.

------
dfox
My experience is that you don't want human-readable, user-visible IDs to be
sortable; you want them to have as unique a prefix (and, when you have
experienced workers, also a suffix) as possible. So this is certainly useful,
but specifying a human-readable representation is somewhat redundant.

Another issue is that there are cases where you want to represent the ID as a
barcode of reasonable size and readability, which invariably leads to decimal-
only Code128 with at most ~30 digits.

------
otterley
See also
[https://github.com/segmentio/ksuid](https://github.com/segmentio/ksuid)

~~~
swah
And [https://github.com/rs/xid](https://github.com/rs/xid)

What would be more interesting to me is a benchmark of how using those as
primary keys in Postgres would affect performance.

~~~
wvh
If the libraries produce 128-bit values, then they could use PostgreSQL's UUID
type. In fact, PostgreSQL should accept anything as a UUID in text mode as
long as the value is in hex with a length of either 32 or 36 (32 + 4 dashes),
though using binary mode would probably be faster if the UUID library and your
driver support that.

We've been using sortable epoch-based UUIDs as primary keys for two different
software products, storing them in PostgreSQL with the builtin UUID type,
utilising binary mode. Performance is good.
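
Concretely, getting a 128-bit ULID into a PostgreSQL uuid column is just a repackaging step (a sketch using only the standard library; driver specifics omitted):

    import uuid

    def ulid_bytes_to_uuid(raw: bytes) -> uuid.UUID:
        """Wrap the 16 ULID bytes in a uuid.UUID so a driver can bind it to a
        uuid column. Byte order is preserved, so the column still sorts by the
        ULID timestamp."""
        assert len(raw) == 16
        return uuid.UUID(bytes=raw)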

~~~
swah
Yeah - xid, ksuid are all smaller than that. Maybe I could pad with zeros and
use the UUID type?
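
If you did pad, you'd likely want the ID's own bytes first and the zero padding at the end, so the timestamp stays in the most significant position (a sketch; only applies to IDs of 16 bytes or less, such as the 12-byte xid):

    import uuid

    def padded_to_uuid(short_id: bytes) -> uuid.UUID:
        """Right-pad a shorter binary ID (e.g. a 12-byte xid) to 16 bytes.
        Constant padding preserves the relative order of the original IDs."""
        assert len(short_id) <= 16
        return uuid.UUID(bytes=short_id + b"\x00" * (16 - len(short_id)))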

------
Solar19
Forgive my ignorance. I'm more of a social scientist than a programmer.
Questions:

1\. Why not go for 16-character strings (instead of 26 or 36), with each
character representing 8 bits?

Sure, you'd need 256 possible characters, but it's almost 2019 and Unicode has
been with us for decades now. Surely we could be more cosmopolitan than
Americentric ASCII and curate 256 characters for an 8-bit encoding?

With a 16-byte string, we could compare and process strings much faster,
particularly with SIMD instructions like Intel/AMD's SSE 4.2 string comparison
instructions. They're optimized for 16-byte strings and were introduced many
years ago in the Nehalem architecture. That's a couple of generations before
Sandy Bridge, so any server today is going to support it.

2\. What does it mean to be "user-friendly" when it comes to these sorts of
IDs? What are some scenarios where users interact with them or communicate or
share them with someone or some authority? Crockford wanted his 32 character
set to be easy to convey on a telephone, which seems like an expiring use case
today. It seems like we should be able to use all sorts of non-ASCII
characters now, without resorting to the Unicode Klingon or Tengwar blocks. Do
we really need to be able to pronounce them all like Crockford anticipated?

NOTE: Unicode characters beyond the Basic Latin block take two or more bytes
each, so we wouldn't be able to use them encoded as Unicode. What I'm
advocating is a 256 character set with each character encoded in one byte,
strictly for the purposes of generating these sorts of unique IDs represented
by compact 16-character strings. Call it Duarte's Base256. All these other
BaseN systems seem orthogonal to character encodings, or they just assume
ASCII. I guess my idea would require both a character set and an encoding
scheme. The latter would be similar to ISO/IEC 8859-15 and Windows 1252, but
more complete with 256 printable characters. A lot of them could probably be
emoji.

How good or terrible is this idea?

~~~
zeroimpl
The string representation is for display purposes/information exchange only.
I'm sure most implementations would internally store the data in a 16-byte
form (eg the C implementation uses __uint128_t), where the data is essentially
a 128-bit number.

Given that, inventing a new character set seems pointless, since you'd compare
the data using 128-bit binary operations anyway (as opposed to lexicographical
string comparisons). Which leads to the question: how is ULID different from
UUID in practice?
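
That equivalence is easy to demonstrate (sketch): because the Crockford alphabet is in ascending ASCII order and the encoding is fixed-width, comparing the 26-character strings and comparing the underlying 128-bit integers give the same ordering.

    CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

    def encode26(value: int) -> str:
        """Fixed-width Crockford base32, 26 characters."""
        out = []
        for _ in range(26):
            out.append(CROCKFORD[value & 0x1F])
            value >>= 5
        return "".join(reversed(out))

    a, b = 12345678901234567890, 98765432109876543210
    assert (encode26(a) < encode26(b)) == (a < b)   # string order == numeric order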

~~~
dragonwriter
The difference is that the time is at the front, so if you need to sort by the
millisecond the ID was created, you can.

------
pspeter3
How does this compare to Twitter's Snowflake?
[https://blog.twitter.com/engineering/en_us/a/2010/announcing...](https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html)

~~~
BugsJustFindMe
There's a trade-off between assigning IDs up front or randomly generating IDs
inside a large space. Randomly generating IDs can be done without a central
arbiter, but doesn't provide any real guarantee against collisions. People
punt on that problem by making their random IDs larger and hoping for the
best.

~~~
erik_seaberg
The usual argument: the probability of a collision only needs to be as low as
the probability of a single-bit error anywhere on the path. That's the best
you could possibly do.

~~~
giornogiovanna
How do you usually estimate the probability of a single-bit error?

~~~
erik_seaberg
Most errors are detected by checksums or hashes, so measure that across all
your hardware (client requests, network hops, server RAM) and estimate how
often your checksum should have collided and let an error slip by.

Granted, it's pretty rare to work on a big enough system to have solid data on
this.

------
cryptonector
So you need good clocks (and timesync) and good entropy. Especially for the
lexicographic sorting part, you'll need really good clocks. That's fine, if
you can get them. But it's not enough, since you get no origin ID, 1ms is a
very long time, and you can't sort events occurring in the same ms.

------
Too
It's not always a good idea to expose timestamps to third parties. You would
then need to obfuscate the url with an API-id, and in that case all the
properties of readability and url-compatibility are mostly redundant as the
ULID only circulates internally in your cluster.

------
ivan_gammel
What does 128-bit compatibility with UUID mean? I would expect it to mean
interoperability at the binary level (e.g. allowing ULIDs to be stored in
database columns of UUID type), but UUID has type information encoded in it -
how can ULID address this requirement?

~~~
BugsJustFindMe
edit: you're right. my bad. <strike>UUID4 is just a 128 bit random
number.</strike>

That just means they also generate 128 bits.

~~~
ivan_gammel
You are wrong; please check the specification. UUID4 is a 122-bit random number
+ 6 bits for variant and version.

------
wgj
This seems like it could be useful, but

> Cryptographically secure source of randomness, if possible

I don't think this should be a goal. If this is for IDs, you usually want to
optimize for speed of ID generation and evenness of distribution (aside from
merely reducing collisions). The top answer at the link below has a good list
of hash algorithms. None of them are cryptographic.

[https://softwareengineering.stackexchange.com/questions/4955...](https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed)

~~~
exyi
IMHO, in most cases you need neither performance nor "security". If you're
simply generating IDs for documents in a DB, generating the ID is totally
negligible compared with storing them on disk. If it ever becomes a performance
bottleneck you'll quickly find out; it's easy to get a flame graph and spot the
crypto generator in it. But if you need security, people are quite likely to
miss that until something goes wrong...

It's certainly a compromise, but I like their choice.

~~~
wgj
What are you securing? It's a cryptographic hash of what?

~~~
exyi
It's not a hash; you probably want to prevent bad guys from causing ID
collisions for you.

------
Aeolun
We use a similar approach at my company, with a few extra bits reserved for a
machine identifier.
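
A sketch of that kind of layout (field widths made up, not the ULID spec): carve some of the random bits out for a machine identifier, so cross-node collisions become impossible rather than merely improbable.

    import os, time

    def node_scoped_id(machine_id: int, machine_bits: int = 16) -> int:
        """48-bit ms timestamp | machine id | remaining bits of randomness."""
        random_bits = 80 - machine_bits
        ts = int(time.time() * 1000) & ((1 << 48) - 1)
        rand = int.from_bytes(os.urandom(10), "big") & ((1 << random_bits) - 1)
        return (ts << 80) | (machine_id << random_bits) | rand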

------
sharpercoder
If you need lexicographically sortable uuids, you have a very different
problem.

~~~
exyi
Well, in a lot of cases you don't actually need it, but it's a nice-to-have
feature. The typical case is just giving IDs to items you store in a database:
you can either give them UUIDs and enjoy being able to generate the ID in the
application, without having to wait for the database before creating related
entities. Or you choose sequential IDs generated by the database, and you get
IDs that also carry some meaning - for example, you can sort by ID in your
queries by default, as it's roughly equivalent to creation date.

~~~
int0x80
To me this is 'just' an optimization. If sorting by ID means 'sort by creation
time', then you are just optimizing away a timestamp in an entity. Overloading
the ID, IOW. As for 'create the ID in the app': yes, but that is orthogonal to
ULID, AFAIK.

------
RcouF1uZ4gsC
> Each component is encoded with the Most Significant Byte first (network byte
> order).

This seems a surprising choice. Even PowerPC now supports little-endian. I
would guess that 95%+ of all software runs on little-endian systems, and that
any software that would use ULID is going to run on a little-endian system.
Other than for historical compatibility, I don't think there is any reason to
use big-endian today, and definitely not for greenfield protocols.

~~~
adambrenecki
Surely if it were the other way around, the timestamp being LE would mean that
the resulting strings no longer sort lexicographically in timestamp order?
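
A quick demonstration of that point (sketch): the same pair of timestamps keeps its order under big-endian byte comparison and loses it under little-endian, because lexicographic comparison starts at the first byte.

    t1, t2 = 0x0100, 0x00FF  # t1 > t2

    big1, big2 = t1.to_bytes(6, "big"), t2.to_bytes(6, "big")
    lit1, lit2 = t1.to_bytes(6, "little"), t2.to_bytes(6, "little")

    assert (big1 > big2) == (t1 > t2)   # big-endian preserves numeric order
    assert (lit1 > lit2) != (t1 > t2)   # little-endian does not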

