Hacker News new | past | comments | ask | show | jobs | submit login
Why Choose ULIDs over Traditional UUIDs or IDs for Database Identification? (sotergreco.com)
66 points by thunderbong 14 days ago | hide | past | favorite | 74 comments



The main driver behind lexically sortable identifiers is that you generally insert into databases in time order so if your ids are sorted by time then you are appending to the end of the table. If, on the other hand, your ids are random (e.g. the prevalent UUIDv4) then your database is spending all it's time shuffling everything around to insert your new rows in the middle of everything.

Once you have time-based identifiers as primary keys you can do fancy things like using some id you know was generated at time x as a where clause on some other table to dramatically cut down the search space. This kind of dirty trick can really improve performance sometimes when it really matters.

However, the main contender for lexically sorted IDs is ... UUIDs! The new UUID specs have ULID versions of UUIDs.

The new UUID spec includes lexically sortable variants, of which UUIDv7 is gaining widespread support. https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122b...

That spec lists 16 prior lexically-sortable identifiers, and has good motivation for adding time sorting to the UUID standard.

Two things I particularly like about the UUIDv7 standard is that it is easy to swap into existing codebases that are using UUIDv4, and that it is easy to extract the timestamp part in systems that don't know about it and are treating it as a string e.g. in lots of databases you can do some sql like TIMESTAMP_MILLIS(CAST(CONCAT("0x", LEFT(REPLACE(id, '-', ''), 12)) AS INT64)). This is a massive advantage over systems that use non-hex encodings.

So these days, use UUIDv7 if you want to generate your ID outside of a database and the time the id was created is not a secret; else use UUIDv4. Your database will thank you.


> your database is spending all it's time shuffling everything around to insert your new rows in the middle of everything

Is it? The order of data on disk tends to be the insertion order. What might get shuffled around is the index for the primary key, but indexes can handle that quite well.


And the answer is "depends on the database tech".

Here are some nice benchmarks: https://www.toomanyafterthoughts.com/uuids-are-bad-for-datab...

I learned the lexically-sortable-id trick from wizened DBAs in the 90s. It's an old trick. I've applied it myself to speed things up and have used both the ULID types described in the article and UUIDv7 etc.

I even advocate using UUID7 on BigQuery where there are no indexes. Having sortable keys is nice even if it doesn't add speed.


You should definitely bench your storage to confirm that it likes hotspots (some don't!)


You should, but B-tree is the most common data structure used for database indexes, and sequential order performs better than random order for index keys.


You often can't bottleneck the entire workload on the capacity of one cluster node and one storage device.


I can't really understand what you said, could you elaborate? :)

If one machine can only handle a hundred transactions per second, and you might ever receive more than that, you have to spread the work across several machines, probably by assigning primary keys that aren't (mostly) sequential.

usually in sharding schemes its a few bits of a hash of the id that are used to determine shard; the system or library you are using will take care of this transparently for you.

You don't need a total order for the locality properties. E.g. a 32bit ms timestamp that overflows every ~50 days + 96bit of entropy will give you the same locality with a much lower collision probability.

That being said, a lot of index types don't care about this kind of locality and might even benefit from higher entropy (randomized algorithms).


Depending on your read and write patterns, overflowing even every few days might be appropriate, and also obfuscate the creation time [1] more.

[1] https://news.ycombinator.com/item?id=40272446


Great point, using a u16 is probably the best choice then. As it give locality within 65.53600 seconds (roughly a minute) without exposing sensitive timing information, and should cover most real world write scenarios.


A small additional improvement might be to also create a random 16bit salt at startup and then xor the timestamp with that, to leak even less information. That way entries will still have locality without leaking ms timestamp info.


at this point, why not use the creation date (with all that you describe on top) as the primary component of the primary key?

you only got 128bit, you want most of that to be randomnes, otherwise collisions are much more likely

But if you for instance use MS SQL and the uniqueidentifier type, sorted UUIDv7 is not sorted im the database because MS SQL swapsnthe bytes around. So ULID (stored as binary(16)) is an advantage in that case to avoid making that mistake.

I really like that ULID looks differentt from UUIDv4. UUIDv7 is something sorted that on the surface looks like it will be random and is mixed with random IDs etc..


It depends per storage system. I think we evaluated this on Postgres b-tree and weren’t super jazzed about it. Some storage systems don’t like hotspots. They can write two far-apart keys in parallel, but adjacent keys need to be serialized. Sometimes this happens with sharding systems.


What drives me nuts is people storing uuids in databases as hex strings. They're bytes, and if you do it right, even the type 1's are lexigraphically sortable.


Hex strings may require more storage, but they remove the need to convert back and forth and make it easier to manually query the database when debugging. The best type is your database's uuid type if it has one.


There’s no reason to use hex though. Base32/36/62/64 should be considered.


there's no reason to not use hex, though.

In my post this is replying to, I mentioned "in lots of databases you can do some sql like TIMESTAMP_MILLIS(CAST(CONCAT("0x", LEFT(REPLACE(id, '-', ''), 12)) AS INT64)). This is a massive advantage over systems that use non-hex encodings."

I've had struggles with binary and base62 encodings and things and just want everyone to use hex.

Typical databases compress rows or columns anyway these days; or if they don't, then the storage does. So often the question is entropy not character count. For example storage I work with these days takes as an identical number of bytes to store a hex string as to store the same id as a byte array.


> Typical databases compress rows or columns anyway these days; or if they don't, then the storage does. So often the question is entropy not character count. For example storage I work with these days takes as an identical number of bytes to store a hex string as to store the same id as a byte array.mo

Storage tends to be cheap, so it's less about that, and more about memory consumption and index efficiency. The cost of transforming back to and from binary is also pretty trivial on modern systems.


Stuff gets "converted" between you and the database anyway. If you really want to see it as though they were strings all the time, you can always make a view.


Did the article provide any reason for ULID over UUID?

- it’s slightly more complex than UUID, but not enough to be a problem

- it’s sortable (in time?) which UUID can also be (but usually recommended not to be)

- it can produce slightly more ids per second, but not enough to make a difference.

So, it’s a tie, a tie and a tie. Why would you switch?


There is a draft for a UUID variant called v7 which is basically the same like ulid

It is absolutely valid

In particular when you want ids to be in mostly ascending order but where you can’t use an auto incrementing int

It’s sortable is a key feature because it can help make searching tables more efficient in situations where the data is naturally tied to the time at which the entry was made and your queries relate to time also

For example, if you append rows with ulid or uuid v7, then you can binary search the records in the file without first sorting or making an extra index of the ids as they are already for the most part in ascending order


Cassandra seemed to have no problem using type1 UUIDs for this.


That was left as an exercise for the reader :-/

> I am not going to cover all the reasons why Ulid is better, but in the end, it's up to your personal preference.


It jumps straight to “I’m not going to list all the reasons”. Ok, maybe just the biggest one or two reasons? Is lexicographer sorting that helpful?


> it’s sortable (in time?) which UUID can also be (but usually recommended not to be)

It gives you the sortable parts without the "recommended not to" parts.


If your team hasn't engineered in monotonically increasing distributed clocks, the results of sorting are only partly correct. I'd rather not promote an option we can't rely on.


They’re only partially correct no matter what. The timestamp in a ULID has only has millisecond resolution, so two IDs generated in the same millisecond, even within the same process, will be randomly ordered.

In practice though there are a lot of advantages to having approximately-time-ordered IDs, and I’ve found the pitfalls easy enough to avoid.


The ULID spec has the weird "first generate a random number and all successive calls within the same millisecond just increment the previous number by 1", which tries to solve this somewhat at the expense of now needing to lock, making ULID generation sequential within the same process.


That probably isn't something good for everyone, it makes trying to guess ids on a busy system so much easier than 60ish bits of random data (just get an id, try to add or remove 1, and if someone else happened to generate another id the same millisecond you've just found it?)

Whether the id is secret or not is a debatable choice, but there's no reason to make it easily guessable; at least that increment should be random with a somewhat large stride...

But in general anyway in a distributed system network latency etc will mean things are just "mostly sorted" which is good enough, keeping the last few inserts in order is much easier than reshuffling the whole db/index all the time


You’re right, good catch, I completely missed that.

For many use cases, partially correct sorting is fine.

Eg messaging, tweets etc: if two messages are sent at pretty much the same time, it does not matter which order they show in.


At this point I think that ideal setup is to use numerical id for primary and foreign keys and maintaining a separate uuid field for everything else.

The reason being that index size matters a lot (for caches and other things) and index size depends on underlying field size, obviously.

Whether to use UUID or ULID is depends on tooling. While it's not hard to write ascending UUID generator and I did it myself few times, using standard APIs might be better approach despite the shortcomings.


Or just use ints and encrypt them when you want to return them to the user.

https://sqids.org/


I like the general idea, but the page you linked explicitly states that it isn't meant to be used to hide (user) IDs, or actually encrypt data.


Touché. The project used to be called hashids, but I guess they renamed it when they weren’t actually hashes

NB: may be more accurate to call this hashing, versus encryption


Actually, it's not hashing either, it's just encoding. Anyone who knows the alphabet can easily decode them.


thank you for clarifying

To make this secure, some secret salt value needs to be added before hashing.


My link didn’t link to there library I thought it was. We encrypted ids at Yelp, not hashed them.

Or you can just store the UUIDs as bytes like the almighty intended.


UUID is 128 bits.

Even for huge tables 64 bits are enough. And for many tables 32 bits is more than enough.


Oh, 32-bits is enough for a lot of tables assuming you have a transactional ID allocator. The whole reason to have a UUID is so you don't have to serialize all your ID allocation. Agreed though that for some lower volume systems, 64-bits is likely enough to make collisions statistically unlikely.


May be I'm spoiled by postgres, but transactional ID allocator sounds like a problem solved in every database. Is there issues somewhere?

Hehe. It very much is a problem solved in every ACID database... as long as you want to serialize all your transactions.

would you still keep a uniqueness constraint on the uuid/ulid column?


I wouldn’t, it’s redundant for all intents and purposes unless you’re Google, which you are not. UUIDs/ULIDs provide strong uniqueness guarantees on their own.


Yes, I'd keep a uniqueness constraint, because it prevents bugs from inserting the same value more than once. From concurrency issues to broken id generation, I've seen it happen before. So unless there are other considerations, having the database enforce uniqueness is a good thing in my book.


Depends on how many new U(U/L)IDs you insert, but when it's part of an index anyway then I personally would, as the performance penalty would be small (much smaller than anything else you do).


Of course. Index for this column is inevitable and with index, uniqueness basically free and protects from programmer errors.


The GH repo for ULIDs provides better and more relevant reasons for use. https://github.com/ulid/spec


Related. Others?

ULIDs and Primary Keys (2022) - https://news.ycombinator.com/item?id=40016413 - April 2024 (33 comments)

Ulid: Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=39878319 - March 2024 (6 comments)

ULID – Sortable Unique Identifier - https://news.ycombinator.com/item?id=34281969 - Jan 2023 (23 comments)

Using ULIDs at Incident.io - https://news.ycombinator.com/item?id=34230652 - Jan 2023 (1 comment)

Understanding UUIDs, ULIDs and string representations - https://news.ycombinator.com/item?id=29794186 - Jan 2022 (100 comments)

Going Deep on UUIDs and ULIDs - https://news.ycombinator.com/item?id=28948815 - Oct 2021 (2 comments)

Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=23160641 - May 2020 (2 comments)

ULID: Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=18768909 - Dec 2018 (129 comments)

Universally Unique Lexicographically Sortable Identifier in Go - https://news.ycombinator.com/item?id=13116308 - Dec 2016 (52 comments)

ULID: Universally Unique Lexicographically Sortable Identifier - https://news.ycombinator.com/item?id=12205158 - Aug 2016 (3 comments)


This topic resurfaces quite frequently, I've noticed. One thing I find quite frustrating is that most of the time this is discussed, people don't remember that exposing the creation time of a database row to the public is not acceptable in many systems. This makes v7 UUIDs, ULIDs, KSUIDs, etc. not appropriate for the task.

In any application where the user is able to first create a draft of something (a video upload, a news article, a blog post, ...) and then release it some day in the future, we probably don't want the public to be aware of when the author created the initial draft. For instance, news websites often start drafting articles about people that are speculated to pass away soon.

The problem also affects numeric/auto-incremented IDs. The "Finnish BBC" YLE uses numeric IDs in their news article URLs, making it possible to ballpark when the article was first started. Here's [1] an article about the death of former president Martti Ahtisaari from 2023/10/16 and here's [2] the next integer ID dated 2023/5/5 (5 months before).

[1] https://yle.fi/a/74-20030335

[2] https://yle.fi/a/74-20030336


How about TSIDs? Numerical, half the size, time sorted... https://vladmihalcea.com/uuid-database-primary-key/


TSIDs are basically type1 UUID's without the MAC address, no?


My preference: using traditional incremental numeric IDs and obfuscate them with Hashids (https://hashids.org) when exposed publicly


This looks amazing.

I'm wondering about the security of using a random alphabet with this instead of the default one. In my mind this amounts to a form of cryptography, but I have no idea how to analyse how much security it gives.

EDIT: Reading the faq I see that they insist that sqids cannot provide any encryption. This does not fit with my understanding of the word. Using unique random alphabet is probably the oldest form of encryption. Whether or not it is secure enough depends on your threat model. What I want and what I need is a way to calculate the security provided by a random alphabet.


I've been doing something that includes a type character as the first character - followed by 11 random base58 digits. This allows > 2^64 possible ids and a type in ~12 bytes. I occasionally wish they were ordered, but most of my tables usually have a 'created_at' field anyway should ordering matter.

I feel like uuid / ulid are just overkill for most situations - and they're long and kinda ugly imo.


Isn't that basically a type 4 UUID that you truncated the last 4 bytes from?


The encoding is completely different? And given that encoding I don't go to any great means to store them as a special type in the db - just text. uuid being hexidecimal, databases often want to store them as binary since that's half the bytes...


You could encode UUIDs as base58, though I'm with you that that doesn't seem to make a lot of sense.


Thanks, chatgpt.


Yeah that code example is odd to say the least.


> https://github.com/f4b6a3

> f4b6a3: a meaningless random package name

...well, I guess that's one way to name your library!


I'm using KSUID (https://sotergreco.com/why-choose-ulids-over-traditional-uui...) for a while and totally happen with it One tiny note about using sortable id is you should not expose it to public use without authentication because it can be scan easily


I worked on a huge backend system that used ULIDs for everything. I liked them slightly better than UUIDs in general, but nobody coming in or using our system knew what ULIDs were, so we constantly had to explain what they were. If we used UUIDs there would have been far less confusion. In conclusion: Just use UUIDs.


Just use which UUID version?


Not much contents in this article. I thought I'd learn more about ULIDs.


I am using ksuid and it's been great.


"Why Choose ULIDs Over Traditional UUIDs or IDs for Database Identification?" "Is it worth it ?

I am not going to cover all the reasons why Ulid is better"

Well, this article was definitely worth it /s




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: