[flagged] UUIDs are obsolete in the age of Docker (shevtsov.me)
20 points by lshevtsov on May 18, 2023 | hide | past | favorite | 38 comments



> this is only correct about UUID version 1. However, it is what most applications use.

This is a bold claim and doesn't match my experience at all. UUIDv4 is all I see, everywhere, every day.

That's also a big enough caveat to put in the title: if you have a beef with UUIDv1, say UUIDv1 is obsolete.


Yeah, I agree that this isn't true. Every popular UUID package uses v4 by default, and all the places I've worked used v4 when they were using UUIDs.


Third-ing this opinion. Can’t remember the last time I saw v1 used in a design.


As the article points out, this is only an issue with UUIDv1. They claim "However, it is what most applications use." but I have no idea how true this is. I was under the impression that the vast majority of UUID generators were v4 by default. For example:

Postgres only offers random uuid generation (https://www.postgresql.org/docs/15/functions-uuid.html).

The `uuidgen` CLI tool, at least for modern versions (I have not checked historically), says (from https://man7.org/linux/man-pages/man1/uuidgen.1.html): "By default uuidgen will generate a random-based UUID if a high-quality random number generator is present." (later it lists /dev/random as such a generator, present on almost all systems)

What's an example of a system that generates v1 uuids by default?
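One way to check what a given generator emits is to read the version field of its output. A minimal sketch using Python's standard `uuid` module; the helper name `uuid_version` is made up for illustration:

```python
import uuid
from typing import Optional


def uuid_version(text: str) -> Optional[int]:
    """Return the version number encoded in a UUID string.

    Python reports None when the variant bits are not the RFC 4122 ones.
    """
    return uuid.UUID(text).version


random_id = uuid.uuid4()  # random-based, what most libraries default to
time_id = uuid.uuid1()    # timestamp + node based

print(random_id, "-> version", uuid_version(str(random_id)))
print(time_id, "-> version", uuid_version(str(time_id)))
```

In the canonical text form, the version is the first hex digit of the third group, so it can also be spotted by eye.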


Isn't the string comparison claim also wrong?

Although plenty of UUIDs are passed as strings in e.g. JSON, I was under the impression that where performance really matters (like DB indexes) they were stored and compared as 128-bit fields. To be fair, the points about word sizes and ordering make sense.


> What's an example of a system that generates v1 uuids by default?

MySQL. One of many reasons to avoid it.


Didn’t know this - true as of MySQL 8.X. I would say “shocking” if it wasn’t in keeping with a number of dubious decisions made over the years. We are using MySQL, but thankfully are generating UUIDs in our app layer.

Kludgy, non-cryptographically-safe UUID4 implementation in MySQL: https://stackoverflow.com/a/32965744


1. Nobody uses UUIDv1. Why use UUIDv1 as a straw man argument?

2. UUID strings are awful for storage -- don't use them. Yes, there are databases that support UUIDs natively, so why is it relevant whether or not a UUID fits into a machine word? You use UUIDs for their other properties that 64-bit integers cannot offer. KSUIDs are touted as fixing all the aforementioned issues but they're even bigger than UUIDs.

3. Both KSUIDs and UUIDs are hard for humans to read compared to 64-bit integers.

4. You don't have to encode UUIDs as hexadecimal numbers plus dashes. You can choose any binary encoding you want, I am partial to Crockford Base32 because of how general-purpose it is (no vulgarities, case insensitive so it works on Windows filesystems).
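As a rough illustration of point 4, here is a sketch of encoding the 128 bits of a UUID with the Crockford Base32 alphabet instead of hex plus dashes. The function names are hypothetical, and the decoder only handles the look-alike substitutions (O, I, L), not Crockford's optional check symbols or hyphens:

```python
import uuid

# Crockford Base32 alphabet: digits plus letters, excluding I, L, O and U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_uuid(u: uuid.UUID) -> str:
    """Encode a UUID as 26 Crockford Base32 characters (26 * 5 = 130 bits >= 128)."""
    n = u.int
    chars = []
    for _ in range(26):
        chars.append(ALPHABET[n & 0x1F])  # take the lowest 5 bits
        n >>= 5
    return "".join(reversed(chars))


def decode_uuid(text: str) -> uuid.UUID:
    """Decode case-insensitively, mapping the look-alikes O->0 and I/L->1."""
    normalized = text.upper().replace("O", "0").replace("I", "1").replace("L", "1")
    n = 0
    for ch in normalized:
        n = (n << 5) | ALPHABET.index(ch)
    return uuid.UUID(int=n)


original = uuid.uuid4()
encoded = encode_uuid(original)
print(original, "->", encoded)
```

The result is 26 characters instead of 36, and it survives lowercasing because decoding is case-insensitive.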

5. I still consider time-sortable UUID alternatives (like ULID) to be UUIDs. This article should have explicitly mentioned UUIDv1 and UUIDv4 in the title and it wouldn't have been so flamebait.


64 bit integers are easier just because we also end up using low numbers

9,223,372,036,854,775,807 is as nasty as a UUID to remember and type


> If you require a globally unique string ID, consider URIs

Is my knee-jerk judgement that this advice borders on nonsense, unwarranted?


> Is my knee-jerk judgement that this advice borders on nonsense, unwarranted?

No, the advice is nonsense. URIs in what scheme?

I mean, since URN is a URI scheme and UUID is a URN namespace, urn:uuid:<uuid-value> is a URI, so “use a URI” is not really a mutually exclusive alternative to using a UUID; it's just much less specific.
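To make the urn:uuid: point concrete, here is a small sketch with Python's standard library. The UUID value is an arbitrary example, not taken from the article:

```python
import uuid
from urllib.parse import urlparse

u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary example value

as_uri = u.urn               # the UUID written as a URN, which is a URI
parsed = urlparse(as_uri)    # a generic URI parser accepts it
round_trip = uuid.UUID(as_uri)  # the constructor accepts the urn:uuid: prefix

print(as_uri)
print(parsed.scheme)
```

So a UUID already is a URI whenever one is needed; the two are not competing choices.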


Is anyone even still using non-random UUIDs? Every application I've ever seen them use is using v4.


Yeah, everything I’ve seen (and written) which uses UUIDs uses UUIDv4, and the main alternative for similar use seems to be ULID, not any of the other UUID versions.


Similar to other comments, I've only encountered v4 in my career. Is there a large domain where v1 is the norm that dominates the statistic, and most people happen to not work in that domain? If the author knows, I wish they'd say.


> They are awful as keys – being strings, comparisons are dramatically slower than with integers. And even if your database has a UUID type, it’s still worse because the identifier doesn’t fit into a machine word.

I’m just a bit confused, a UUID is made up of hexadecimal digits, so why would it be stored as a string? It’s also 128 bits long, so it should fit into two words, excluding whatever overhead the DBMS puts on the data type, which is really their problem to worry about.


This is a not uncommon mistake. People, such as the author, seem to think that the UUID is a hex string containing dashes, since that is the format that they most frequently see them represented in.

You are correct that a UUID is a 128-bit identifier, and so, fits in 128 bits.
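A short sketch of the difference between the text form and the underlying value, using Python's `uuid` and `struct` modules. The UUID is an arbitrary example:

```python
import struct
import uuid

u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary example value

text_form = str(u)      # canonical hex-and-dashes representation
binary_form = u.bytes   # the actual 128-bit value, big-endian

# Split the 16 bytes into two unsigned 64-bit words.
high, low = struct.unpack(">QQ", binary_form)

print(len(text_form), "characters as text")
print(len(binary_form), "bytes as binary")
print(hex(high), hex(low))
```

Comparing two such values is at most two 64-bit integer comparisons, rather than a 36-character string comparison.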


Apparently Stripe stores its keys as strings, though not the usual UUID string representation.


I'm like 90% sure Stripe started on MongoDB (may still be using Mongo?). That might have something to do with it.


A company which handles money as its primary business is using MongoDB? Hopefully the actual dollars are always handled in a “realSQL” DB with proper transaction isolation.


I recalled that specifically because I remember having the same reaction when I found it out (I think? I have a distinct memory of learning this) back in 2012 or something.


I've had a similar issue with MongoDB's ObjectIDs. They are generated using a combination of process id, UNIX timestamp and a counter that is randomly initialized during process creation. The issue when Docker comes into the mix is that the root process id of every container is 1, so a decent chunk of entropy is removed from the ObjectID. Add to that the fact that the timestamp only has one-second resolution, and the only thing saving you is the hope that the counters of your processes never overlap during the same second.

It's unlikely to happen but still possible, and it has brought down part of our parallel worker pool, because once you have a collision, you are bound to keep generating the same id sequence until you restart your whole process to randomize the counter again.
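The failure mode can be sketched with a hand-rolled re-creation of the legacy 12-byte ObjectID layout (4-byte timestamp in seconds, 3-byte machine field, 2-byte pid, 3-byte counter). This is not driver code: the function names are hypothetical, and it additionally assumes the machine field is identical across containers, which the comment above does not state.

```python
import struct


def legacy_object_id(timestamp_s: int, machine_id: bytes, pid: int, counter: int) -> bytes:
    """Build a 12-byte id in the legacy layout (illustrative, not a real driver)."""
    return (
        struct.pack(">I", timestamp_s)             # 4 bytes: seconds since the epoch
        + machine_id[:3]                           # 3 bytes: machine field
        + struct.pack(">H", pid & 0xFFFF)          # 2 bytes: process id
        + (counter & 0xFFFFFF).to_bytes(3, "big")  # 3 bytes: counter
    )


def worker_ids(start_counter: int, count: int,
               timestamp_s: int = 1_684_400_000,
               machine_id: bytes = b"\x0a\x0b\x0c",
               pid: int = 1) -> list:
    """Ids one worker would emit within a single second, starting at start_counter."""
    return [
        legacy_object_id(timestamp_s, machine_id, pid, start_counter + i)
        for i in range(count)
    ]


# Two containers: both pid 1, same second, and the random counters happen to match.
worker_a = worker_ids(start_counter=5000, count=3)
worker_b = worker_ids(start_counter=5000, count=3)
print(worker_a == worker_b)  # every id collides, not just the first one
```

Because the counters advance in lockstep, the overlap continues for as long as both workers produce ids at the same rate.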


MongoDB eventually switched away from doing that and now just generates random numbers for the pid and machineid fields of ObjectIDs. The timestamp is still there because people rely on being able to sort on that (which is a bad idea for various reasons), but it's at least 5 bytes (40 bits) of randomness now.


When did this change? Last I checked the PHP driver still relied on the pid.


I've never thought UUIDv1 was useful in any virtualized context, and I hope it should be obvious, but maybe it's worth stating in the UUID generation docs. It is already explained somewhat well what the versions are in Python docs.

However, with all the things already supporting UUID, I also don't see any reason to switch from UUIDv4 to anything else. I don't see how UUID in general is obsolete, given the support it has from different libraries and databases.


What about ulid as an alternative?


The ULID spec encourages people to assume they can be sorted in create time order, but it does not handle clock skew. I wouldn’t use it except in a system that can rely on having a single monotonic clock, because I worry about things that are almost true.
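A sketch of the clock-skew problem, using the ULID binary layout (48-bit millisecond timestamp followed by 80 random bits). The helper name, the timestamps and the 250 ms skew are all made-up values for illustration:

```python
import os


def ulid_like(timestamp_ms: int) -> bytes:
    """16-byte id: 48-bit big-endian millisecond timestamp, then 80 random bits."""
    return timestamp_ms.to_bytes(6, "big") + os.urandom(10)


true_time_ms = 1_684_400_000_000  # hypothetical wall-clock time
skew_ms = 250                     # host B's clock runs 250 ms behind

# Host A creates an id first; host B creates one 100 ms later in real time,
# but stamps it with its own (slow) clock.
first = ulid_like(true_time_ms)
second = ulid_like(true_time_ms + 100 - skew_ms)

print(sorted([first, second]) == [second, first])  # sort order contradicts creation order
```

The ids still sort deterministically, just not in the order the events actually happened.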


One great benefit of UUIDs I have found is inability to join a wrong row.

If you use incremental numbers, every table has 1, 2, 3.
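A small sketch of that benefit with SQLite from Python's standard library. The schema and the deliberately buggy join are made up for illustration, and ids are stored as TEXT only to keep the example short:

```python
import itertools
import sqlite3
import uuid


def sequential_ids():
    """A fresh per-table sequence: '1', '2', '3', ..."""
    return (str(n) for n in itertools.count(1))


def random_uuid_ids():
    """A fresh per-table source of random UUID strings."""
    return (str(uuid.uuid4()) for _ in itertools.count())


def count_wrong_join_matches(new_id_source) -> int:
    """Insert 3 users and 3 orders, then run a join with a typo in the ON clause."""
    user_ids = new_id_source()
    order_ids = new_id_source()
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, user_id TEXT)")
    for _ in range(3):
        uid = next(user_ids)
        db.execute("INSERT INTO users VALUES (?)", (uid,))
        db.execute("INSERT INTO orders VALUES (?, ?)", (next(order_ids), uid))
    # Bug on purpose: joins on orders.id instead of orders.user_id.
    (matches,) = db.execute(
        "SELECT COUNT(*) FROM users JOIN orders ON users.id = orders.id"
    ).fetchone()
    db.close()
    return matches


print("sequential ids:", count_wrong_join_matches(sequential_ids))
print("random UUIDs:  ", count_wrong_join_matches(random_uuid_ids))
```

With sequential ids the wrong join quietly returns plausible-looking rows; with random UUIDs it returns nothing, which tends to get noticed.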


I was confused by this title because I only use uuid v4...the author covers that in the article, but I'm surprised that so many people use uuid v1. I thought v4 was the most popular, but that's probably just because I mostly work with my own code


Is there any reason to use anything except completely random UUIDs? I vaguely remember reading about problems with MAC-based UUIDs decades ago, my impression was that they have been discouraged for a long time already.


> Note: this is only correct about UUID version 1. However, it is what most applications use.

Okay, so, not all UUIDs, just v1. And, for some anecdata, I've actually only interacted with UUID v4 in my entire career; I don't know what the actual norm is, but I'm surprised to hear that it might still be v1.

> The only other practical option is version 4 – the random UUID – but random is intuitively worse, right? Read on to find out.

Oh… how is it worse?

> * They are awful as keys – being strings, comparisons are dramatically slower than with integers. And even if your database has a UUID type, it’s still worse because the identifier doesn’t fit into a machine word.

> * They are excessively long – each character of a UUID only encodes 3.5 bits of information if you count the dashes. That’s twice as less compared to 6 bits of Base64.

Sorry, UUIDs are not strings, they're 128-bit integers. They have a standardized string representation, but if you're storing a UUID as a string, you're either being required to because your language/db/tools/etc. don't support UUIDs correctly, or you're doing it wrong.
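The arithmetic behind the quoted bullet, and what it looks like once the UUID is treated as 128 bits rather than as text, can be worked through in a few lines. This is only a back-of-the-envelope sketch:

```python
import base64
import math
import uuid

UUID_BITS = 128
CANONICAL_CHARS = 36  # 32 hex digits + 4 dashes

bits_per_char_with_dashes = UUID_BITS / CANONICAL_CHARS  # roughly 3.56
bits_per_char_hex_only = UUID_BITS / 32                  # exactly 4
base64_chars = math.ceil(UUID_BITS / 6)                  # 6 bits per Base64 character
raw_bytes = UUID_BITS // 8

# Check the Base64 length against a real value (padding stripped).
u = uuid.uuid4()
unpadded = base64.urlsafe_b64encode(u.bytes).rstrip(b"=")

print(round(bits_per_char_with_dashes, 2), bits_per_char_hex_only)
print(base64_chars, len(unpadded), raw_bytes)
```

The text encoding is a choice made at the edges of the system; the stored value stays at 16 bytes either way.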

> * They are not time-ordered – despite containing a timestamp, its bits are mixed up within the UUID: the top bytes of the UUID contain the bottom bytes of the timestamp. Databases do not like an unordered primary key – it means that freshly inserted rows can go anywhere in the index. And you can’t use UUIDs for ad-hoc time sorting by time, either.

This is definitely a drawback when using a UUID as a primary key, and there are alternatives for this specific use-case. However, I think the best solution I've seen to this is to use a typical 64-bit integer for the primary key, but a UUID for a user-visible ID (so that you don't leak information about the primary keys to users); this makes joins and indexes fast, but avoids the leak to the end-user.

> * They are bad for human comprehension – UUIDs tend to look alike, and it’s hard to visually seek and compare them. This comes from experience.

This is exactly why they shouldn't be used as an Id anywhere that a human needs to interact with one. In the above solution I mentioned, the most common ID for which you'd want to use a UUID is the user's id—the user specifically has no reason to ever refer to their or anyone else's id; they'll use the human-readable username/handle equivalent instead. And developers don't need to care about UUIDs ever because inside the db, you'd have the integer primary key that you use for joins. This seems to solve all the problems?
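A minimal sketch of that layout using SQLite from Python's standard library: an integer primary key for joins, plus a random UUID exposed to the outside. Table, column and function names are hypothetical:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE users (
        id        INTEGER PRIMARY KEY,   -- internal key, used for joins
        public_id BLOB NOT NULL UNIQUE,  -- 16-byte UUID, the only id users see
        handle    TEXT NOT NULL
    )
    """
)
db.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL REFERENCES users(id), body TEXT)"
)


def create_user(conn: sqlite3.Connection, handle: str):
    """Insert a user and return (internal integer id, public UUID)."""
    public_id = uuid.uuid4()
    cur = conn.execute(
        "INSERT INTO users (public_id, handle) VALUES (?, ?)",
        (public_id.bytes, handle),
    )
    return cur.lastrowid, public_id


def find_handle(conn: sqlite3.Connection, public_id: uuid.UUID):
    """Resolve an externally supplied UUID; return the handle or None."""
    row = conn.execute(
        "SELECT handle FROM users WHERE public_id = ?", (public_id.bytes,)
    ).fetchone()
    return row[0] if row else None


internal_id, public_id = create_user(db, "hg")
print(internal_id, public_id, find_handle(db, public_id))
```

The integer never leaves the database, so row counts and insertion order are not leaked through the public identifier.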

> I kindly suggest that UUIDs are never the right answer.

Honestly, I think you've only convinced me that UUID v1 is never the right answer… and I think that's mostly been true since v4 came about.

All the best,

-HG


The database argument is also specific to certain databases and table types. It should only matter for clustered tables. So if e.g. I use Postgres, this doesn't matter as it doesn't have clustered tables.


Distributed databases rely on spread across possible keys to balance workload. Mostly-ordered keys create hotspots.
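A toy illustration of the hotspot effect, assuming a store that range-partitions its keyspace into eight equal contiguous shards. The shard count and the partitioning rule are assumptions for this sketch, not a description of any particular database:

```python
import uuid
from collections import Counter

SHARDS = 8


def shard_for(key: int, key_bits: int) -> int:
    """Range partitioning: map a key to one of SHARDS equal, contiguous ranges."""
    return (key * SHARDS) >> key_bits


# 10,000 inserts with sequential 64-bit keys versus random 128-bit UUIDs.
sequential_load = Counter(shard_for(n, 64) for n in range(1, 10_001))
random_load = Counter(shard_for(uuid.uuid4().int, 128) for _ in range(10_000))

print("sequential:", dict(sequential_load))
print("random:    ", dict(sorted(random_load.items())))
```

Sequential keys all land in the first range, while random keys spread roughly evenly. Hash partitioning would change the picture for sequential keys, at the cost of range scans.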


Obligatory read about UUIDs derived from MAC addresses: https://devblogs.microsoft.com/oldnewthing/20040211-00/?p=40...

TLDR on the article: don't use UUIDv1.

Lastly, even with the best and most randomized generation, it still doesn't protect you from copy pasting: https://news.ycombinator.com/item?id=22354449


Sometimes, I am amazed about what gets on the front page of ycombinator.

TLDR: Don't use UUID v1 if your cloud provider is generating the same MAC addresses for all your containers, since its entropy is based on the MAC address.
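To see why a shared MAC address matters, here is a sketch that assembles a version 1 UUID from its parts. The MAC address, timestamp and clock sequence are made-up values, and `v1_from_parts` is a hypothetical helper, not a library function:

```python
import uuid


def v1_from_parts(timestamp_100ns: int, clock_seq: int, node: int) -> uuid.UUID:
    """Assemble a version 1 UUID from a 60-bit timestamp, 14-bit clock sequence and 48-bit node."""
    time_low = timestamp_100ns & 0xFFFFFFFF  # LOW timestamp bits come first
    time_mid = (timestamp_100ns >> 32) & 0xFFFF
    time_hi_version = ((timestamp_100ns >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80        # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


SHARED_MAC = 0x0242AC110002   # hypothetical MAC handed to every container
SAME_TICK = 0x1EDF5A1B2C3D4E5  # hypothetical 100 ns timestamp value

container_a = v1_from_parts(SAME_TICK, clock_seq=0, node=SHARED_MAC)
container_b = v1_from_parts(SAME_TICK, clock_seq=0, node=SHARED_MAC)

print(container_a)
print(container_a == container_b)  # same inputs, same UUID
```

With the node fixed, uniqueness rests entirely on the timestamp and the 14-bit clock sequence. The layout also shows why v1 values do not sort by time: the low timestamp bits lead the string.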

Saying not to use UUIDs makes no sense. Use UUIDv7, and use them in Postgres with https://github.com/fboulnois/pg_uuidv7. Have fun :)
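For readers who have not met UUIDv7, here is a hand-rolled sketch of its layout: a 48-bit Unix millisecond timestamp, the version and variant bits, and random bits everywhere else. `uuid7_sketch` is a hypothetical helper for illustration and leaves out the optional monotonicity handling within a single millisecond:

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUID with the version 7 layout (illustrative only)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 of them used
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80      # bits 127..80: timestamp
    value |= 0x7 << 76                            # bits 79..76: version 7
    value |= ((rand >> 62) & 0xFFF) << 64         # bits 75..64: 12 random bits
    value |= 0b10 << 62                           # bits 63..62: RFC variant
    value |= rand & 0x3FFFFFFFFFFFFFFF            # bits 61..0: 62 random bits
    return uuid.UUID(int=value)


earlier = uuid7_sketch(1_684_400_000_000)
later = uuid7_sketch(1_684_400_000_001)
print(earlier)
print(later)
print(earlier < later)  # ordered by the leading timestamp
```

Because the timestamp leads, values created in different milliseconds sort by creation time, subject to the same clock caveats as any timestamp-based id.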


In practice, I generate UUIDs entirely using entropy from /dev/random. The probability of a collision is really low for most use cases (although not if you are Google and need something unique across all database rows in your company or something similar).
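A rough birthday-bound estimate puts numbers on "really low". The sketch assumes 122 random bits, as in UUIDv4; a raw 128-bit read would do slightly better:

```python
import math

RANDOM_BITS = 122  # random bits in a version 4 UUID (assumption for this estimate)


def collision_probability(n: int) -> float:
    """Birthday approximation: p is about 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = n * (n - 1) / (2 * 2**RANDOM_BITS)
    return -math.expm1(-exponent)  # expm1 keeps precision for tiny probabilities


for n in (10**6, 10**9, 10**12, 2**61):
    print(f"{n:>22,d} ids -> p = {collision_probability(n):.3e}")
```

The probability only becomes substantial around 2^61 ids, which is about 2.3 quintillion.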


Hopefully you're setting the appropriate bits. A UUID is a bit-packed struct/union at heart, really; if you're just reading 128 bits of random data from /dev/random, that's not a UUID; passing it off as such would be needlessly confusing.

(It's fine to make a new format / it's not a terrible approach for making a random ident, though you might want to peek into, e.g., ksuid from the OP for some interesting points about why you might not want to do that, plus some advice about getrandom() over /dev/random.)
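"Setting the appropriate bits" amounts to overwriting six of the 128 random bits. A sketch in Python, where `stamp_v4` is a hypothetical helper shown next to the standard library's own way of doing the same thing:

```python
import os
import uuid


def stamp_v4(raw: bytes) -> uuid.UUID:
    """Turn 16 random bytes into a well-formed version 4 UUID."""
    b = bytearray(raw)
    b[6] = (b[6] & 0x0F) | 0x40  # high nibble of byte 6: version 4
    b[8] = (b[8] & 0x3F) | 0x80  # top two bits of byte 8: RFC 4122 variant
    return uuid.UUID(bytes=bytes(b))


raw = os.urandom(16)

by_hand = stamp_v4(raw)
by_stdlib = uuid.UUID(bytes=raw, version=4)  # the constructor sets the same bits
unstamped = uuid.UUID(bytes=raw)             # parses, but is not necessarily a valid v4

print(by_hand, by_hand.version)
print(by_hand == by_stdlib)
```

The remaining 122 bits are left untouched, so nothing meaningful is lost by doing this properly.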


The systemd developers made the same mistake initially. They later corrected it, and forced the non-random fields to the correct values.


If you treat them as opaque strings and don't use their binary representation, you'll be fine.



