Hacker News new | past | comments | ask | show | jobs | submit login
UUIDs in MySQL are really not random (waleson.com)
86 points by jtwaleson 42 days ago | hide | past | web | favorite | 67 comments



The reason they are not uniform is that the first part is essentially a time stamp. From the docs:

UUID() returns a value that conforms to UUID version 1 as described in RFC 4122. The value is a 128-bit number represented as a utf8 string of five hexadecimal numbers in aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee format:

The first three numbers are generated from the low, middle, and high parts of a timestamp. The high part also includes the UUID version number.

The fourth number preserves temporal uniqueness in case the timestamp value loses monotonicity (for example, due to daylight saving time).

The fifth number is an IEEE 802 node number that provides spatial uniqueness. A random number is substituted if the latter is not available (for example, because the host device has no Ethernet card, or it is unknown how to find the hardware address of an interface on the host operating system). In this case, spatial uniqueness cannot be guaranteed. Nevertheless, a collision should have very low probability.

The MAC address of an interface is taken into account only on FreeBSD and Linux. On other operating systems, MySQL uses a randomly generated 48-bit number.


Aha, UUID v1! The author really wanted/expected a UUIDv4 there probably. It's confusing that we talk about "UUIDs", when there are a variety of different kinds -- called "versions" but they're not really "versions" in the sense of updated/replaced algorithms, just different approaches.

I don't know if MySQL has built-in UUIDv4 though. pg does.


UUIDs have the same problem RAID levels do; they use numbers to identify different types, but the types aren't related to each other in the way numbers are related to each other (some sort of sequential operation where "thing 3" has less of whatever it is than "thing 4" does, be it age, quality, benefits and costs like certification levels, whatever). I can never keep straight which level of RAID is which either, since I only brush them every couple of years, but if they were named for what they are instead of by number I wouldn't have this problem. I would know what "RAID Full Mirroring" or "RAID Striped" were much more easily.

We really ought to use names for them, not version numbers. If someone wrote "Timestamp-based UUIDs Are Not Fully Random", well, by the time you've typed out the title of your blog post you'd have figured out what the problem was.


even if using numbers, in retrospect they should have maybe both been called "types" UUID type 1 or 4. RAID type 1 or 3. To avoid the association that both "version" and "level" have, that the numbers are incrementing the functionality in some way.


Words always get truncated or omitted, though. It's technically UUID "version 1" (I think), but see how everybody just says "UUID 1". Teaching people to say "type" instead of "version" wouldn't do much good because most people aren't saying any word there anyway.

I'd prefer to use letters in cases like this. Letters have a much weaker association with ordering, there's no implication that the set is of infinite size, and there's no affordance for doing anything arithmetic with them except enumerating them.

For example, a plain old USB cable has an "A" and "B" ends, radiation can be alpha/beta/gamma, and power outlets can be type A, B, C, ... (it looks like they're up to "N" now). Nobody would think that a Danish "type K" plug is 9 versions improved on an American "type B" plug. It's just different.


>see how everybody just says "UUID 1"

Are you sure? If they have the number I always see a "v" in front of it.


Yeah, letters are much more used in situations where they do not denote some sort of order. They do, at least in English, but it's a much weaker association. It seems on some level we intuit that letter orders are arbitrary, whereas in some sense integers are literally nothing but their order.


At least with RAID it's enumeration that'd solve that. With UUID, I'm not sure. I wonder where IPv4 vs IPv6 falls under in this mess too.


IPv6 actually _is_ a newer version of the Internet Protocol. I think that is a plain old version. Sure, we don't know what happened to IPv5 (unless we google it), but that's not that weird, that happens with other things and their versions, versions get abandoned.


Is it built in to pg now? Last time I set up pg I had to add an extension to get UUID support.

It was relatively painless, sure, but I seem to remember having to add it.


The pgcrypto module ships with pg but you have to enable it. https://www.postgresql.org/docs/10/pgcrypto.html#id-1.11.7.3...


I think "version" is the right word if you take it in the traditional sense, not the common technical meaning.


That was my first thought looking at them. And depending on how the binary values are indexed, probably much more efficient for inserting into a primary key than a UUIDv4.


That's because the UUIDs returned by mysql are documented to be version 1 which are based on timestamp and node (mac address often enough). Of course they are not random.


> If you are even more like me, you will think that this is like the difference between /dev/urandom and /dev/random

/dev/urandom does not produce such non-random results; the comparison is rather far-fetched. /dev/urandom is suitable for virtually all tasks, including cryptography[1]. It's the same as /dev/random on various BSDs, even.

[1] https://www.2uo.de/myths-about-urandom/


>So, the lesson for today is: take warnings in documentation seriously. If you used UUID() for data that is supposed to be secret (like passwords), you have a serious problem, as these secrets can now be easily guessed.

No, the lesson here is to not assume a "random" function is cryptographically secure unless stated otherwise. The same article could be made about Math.random or rand.


But the main problem is that he thinks UUIDs are supposed to be random when they're not - their only guarantee is uniqueness. MySQL uses (IIRC) V1 UUIDs, which combine a MAC address with time, while only V4 UUIDs are random.


The MAC address is a worse problem, honestly, than that they're not random. That means your database is leaking information about which host generated the UUID, which would be useful to someone trying to exploit a particular host.

Or if you have an application installed on personal machines, it's leaking identifying information about those machines.


This is true. I wish I could find the article but years ago I remember malicious hackers that were identified by law enforcement because their virus included some windows guids, which at the time included the Mac address.


Probably thinking about the Melissa Virus: http://archive.arstechnica.com/wankerdesk/2q99/melissa-1.htm...


Because 3 and 5 are based on hashes they will be random as well unless you know what the input value was.


I agree with the author, the lesson really is to take the documentation seriously. I think a lot of people might not be in the habit of doing that, because they find the documentation of many projects lacking, or because it’s easier just to read some walkthrough where the work is already done and they can just copy paste the commands. It’s important to fully understand what you are doing in all cases, not just ones touching cryptography.


> If you used UUID() for data that is supposed to be secret (like passwords)...

Since when someone on earth has recommended something like that?

UUIDs are time/mac address sensitive. They have version info within. They make an awful, guessable secret. It's just that: An unique identifier across the globe and not just your database.


A V4 UUID has 122 bits of randomness. Assuming your RNG generator has sufficient entropy, it would be perfectly valid to use the resulting UUID as a secret.

https://en.wikipedia.org/wiki/Universally_unique_identifier#...


> Assuming your RNG generator has sufficient entropy

Why would you ever assume that? UUIDs could be formed from a low-entropy LCGRNG and still match the specifications.

Any secret should be generated with cryptographically secure random number generators. Calling "UUID" instead of like, BCrypt or something would be a major red flag to me. Heck, a modern cryptographically secure secret can be generated from Intel's "RDSEED" assembly instruction, which provides a hardware-random source of entropy, as well as entropy guarantees. (You'll probably use this assembly instruction if you are using /dev/urandom)


bcrypt is not a password generation (or password stretching) algorithm.

As always, it's the programmer's duty to investigate the RNG before relying on the entropy of some function's result.

UUIDs based on a CSPRNG and used as a password/token have some nice features--they're of fixed length, easy to validate for well-formedness, easy to serialize or deserialize in a variety of conventions.


Version 4 UUIDs pretty much have to be generated from reasonable(-ish) CSRNG. The assumption that the resulting 122bits are universally unique requires that they are truly chosen at random, CS(P)RNG is the most sane way to accomplish that.


> The assumption that the resulting 122bits are universally unique requires that they are truly chosen at random, CS(P)RNG is the most sane way to accomplish that.

Certainly not.

Take a Version 1 UUID (time-based), and spin it through a 128-bit LCGRNG. Do the AND / OR bitmasks needed to properly tag it a Version4 UUID and you're done. Bam, universally unique but not cryptographically random.

This would be very fast (cryptographic entropy is very slow!), and would be ideal for databases or most purposes.


Please don't ever roll your own security. Use a language feature (if available) or a third party library... though do a bit of due diligence on the third party library to make sure it's endorsed by a trusted source.


This reminds me of the day I noticed that some records in our database had UUIDs with uppercase letters while others had lowercase letters. It didn't require much digging to find out that it was dependent on their origin. One kind was generated by the database the others by Java code.

In our case it wasn't a problem, but I couldn't help thinking that this was a nice information leak where you could find out some property of a dataset that was never intended to be retained. Just imagine that you mobile sign-ups had one kind of UUID and your web sign-ups the other kind. Though you might not explicitly store the information how someone signed up you inadvertently really do.

And now if you are tempted to ask why we stored our UUIDs as strings - just don't...


They don't seem to have the same type (according to uuidparse from util-linux):

  UUID                                  VARIANT TYPE       TIME
  4f95de28-0fd1-48db-ad2e-34ecd169c483  DCE     random     
  4331cb9e-1d91-11e9-be2c-45923c63e8a2  DCE     time-based 2019-01-21 17:28:54,631107+02:00
  4331cc4c-1d91-11e9-be2c-45923c63e8a2  DCE     time-based 2019-01-21 17:28:54,631124+02:00
  c7e2f124-f6ba-4434-843f-89958a7436ec  DCE     random     
  3e18f8dd-b1d3-4e16-8a81-4bdceac91772  DCE     random
Explanation of the output:

  Variants
   DCE   The Open Software Foundation's (OSF) Distributed Computing Environment UUIDs.

  Types
   time-based   The DCE time based.
   random       RFC 4122 random.


They're not RUUIDs, after all. The article states that it's doing pretty much what the docs say it's doing.


Well it is a universally unique identifier not secret. There's no ensurance of it being random unless you're generating v4 UUIDs.


And even if you are generating v4 UUIDs, they are not necessarily cryptographically secure. So it might be possible, if lets say, a user gets a bunch of UUIDs that your system generated, it might be possible for they to figure out other UUIDs that the same system generated. Which is totally fine for what they were built for.


> If you used UUID() for data that is supposed to be secret (like passwords), you have a serious problem, as these secrets can now be easily guessed.

I've never used UUIDs for secrets, but I've often dependended on UUIDs to be uniform - as the primary identifier when passing data around inside distributed systems. If the chance of collision is implementation dependent, that could end really badly.


UUIDs (of all versions) are guaranteed to be unique. They are not guaranteed to be uniformly distributed.


If you have two systems generating UUIDs, and their probability distributions are non uniform and overlap in any significant fashion, then your odds of collision are no longer on the order of 'heat death of the universe'.

TIL only version 4 is generated using a random source.


The purpose of UUIDs is to guarantee uniqueness without coordination within the system of computers over long enough time and within the certain rate of generating them. Hence all the timestamps and node identifiers in them, coordination on which already happened. If it doesn't guarantee uniqueness for your use case - you should change the algorithm to make sure that it does, otherwise there is no point having them if you still have to do coordination to avoid collisions.


But for the other versions the key word is overlap. Since a part of the UUID is some location based number (like the MAC), there would be no overlap, even if the distribution is completely non uniform.


The article's gripe with UUIDs is with uniformity and not randomness. A low discrepancy sequence [0] is an example for something uniform but not random in an sense.

[0] https://en.wikipedia.org/wiki/Low-discrepancy_sequence


If you want UUIDv4 in MySQL, this Stack Overflow thread is pretty useful: https://stackoverflow.com/q/32965743


Note that this will probably result in a more sparse index and longer inserts if using UUIDv4 for a primary key where UUIDv1 is available in the dbms.


What’s the use case for a random UUID? I’ve never had a need for them to be random just unique.


A uniformly random ID is really useful when you're implementing a distributed system and you want to evenly distribute the load. You can make certain useful assumptions when your IDs are uniformly random -- which is basically how hyerloglog works and how the article's author was able to approximate the progress of his jobs.


Assigning random identifiers for things. Benefit being that this can provide additional layer of security since somebody can’t just guess the ids for things he is not supposed to access.

Thing for example getDocById.php?id=...

Of course it does not have to be UUID, but this can have some benefits as you DB for example may have a nice datatype for it.


There isn't one for them, really.* The reason they exist is that non-random UUIDs rely on unique information about the host. Since UUIDs are used heavily in consumer applications (ubiquitous on Windows) and are transmitted without encryption, they're leaking identifying information about the host that generates them.

* One that is better served with a UUID than a random binary blob.


So supposing you have multiple clients you might have shared access to an image or client-data database. You might not want someone to see an identifier and guess other "close" identifiers.

For example, a company I used to work for named IntegriShield is essentially a for-hire private Internet police company. When I got there, we would take an automatic screenshot of a web site and then send it to the people who were doing something "bad" (for whatever the definition of "bad" was) and you would essentially visit some sort of link like

    https://api.integrishield.com/(api route)/12345.png
Of course if these folks were to increment the identifier they would see other screenshots we were capturing -- though at the time that was not very helpful as we would routinely capture every web site that we crawled, so most of them didn't have anything interesting on them.

One of my jobs there was to improve this process: reduce image storage costs and increase the productivity of the humans who had to look at these web sites to determine what had gone wrong. With my contribution, we could intelligently flag language that might be "bad" and then capture enough metadata that we could later reconstruct screenshots which highlighted those "bad" things, after a human being had looked at them and said "no, these things are okay but those things really are bad" and deleted half of our highlights. So the screenshots were now only generated after we were done and clearly highlighted what was wrong, and then it becomes a serious problem if folks from one of these companies can see what their peers are doing.

We did not switch to UUIDs but rather just encrypted the counter with a server key, but the point is still that you would now visit

    https://api.integrishield.com/(other-api-route)/2121a46a4c24c512965671f8fb269f0b.png
and that would give you access to this image, but now if you just change that to a random identifier there is only a 2^-64 chance that you would successfully alight upon a valid image, and if you don't we can increment a counter in the database and if that gets too large too fast we can send out emails saying "warning, lots of 404s are happening".

UUIDs v4 (chosen securely randomly) have the same properties from the very start. You can easily use them in cases where the security model demands that some non-client must be able to access some sort of information simply by knowing the identifier.

We could have chosen something more complicated, of course, like creating a many-to-many table relating screenshots to email addresses and then embedding the email address somehow in the URL,

    https://api.integrishield.com/(more-api)/12345.png?email=someone%40example.com
    https://api.integrishield.com/(more-api)/12345.png?emailBatch=67890
Of those the first is worrisome -- there's no reason that one company couldn't have access to another company's email addresses for their legal compliance department and indeed in many cases they should. And if we try to obfuscate that with an auto-incrementing ID like in the second case then we're actually legit broken: remember, I told you that the screenshots are generated shortly before the emails get sent out, so they both would tend to increment in lockstep, you might have to try only 5-10 screenshot/batch pairs before alighting upon one you weren't meant to see. So the batch needs to be protected by, what, UUIDv4? And you have the same problem that you had before.


I appreciate the response. A legacy system at my company actually exposed data this way recently. A user guessed values and was able to see a small bit of data from other customers.


Some developers use them in the place of random password generation for example. In this article, we discovered why it's not a good idea.

In other words, you shouldn't use a non-cryptographic RNG for security purposes.


Yay, you're one of today's lucky 10,000 [0]!

There are different types of UUIDs [1]. MySQL generates version 1 UUIDs - you can tell by the first digit of the third block. It looks like something else is also generating UUIDs, using the v4 format - so you have two different types of UUID in your database:

4331cb9e-1d91-11e9-be2c-45923c63e8a2 # v1

c7e2f124-f6ba-4434-843f-89958a7436ec # v4

v1 UUIDs are typically based on the MAC address of your computer and the current time. Online tools exist [2] that decode the uuid and tell you it was created at 2019-01-21 10:28:55 (UTC)

v4 UUIDs are random.

[0] https://xkcd.com/1053/

[1] https://en.wikipedia.org/wiki/Universally_unique_identifier

[2] https://realityripple.com/Tools/UnUUID/


So basically UUID V1 instead of V4. Misinformation presented as a lesson.


The UUID specification only has a small part of it random.


Ehm, this is just a fundamental misunderstanding. UUIDs are not and are not supposed to be random. They are not even unique. They are just reasonably likely to be unique enough for most purposes.



Nope. The V4 RFC does specify and recommend a method for generating UUIDs that uses randomness, but it also provides other methods, and explicitly warns against relying on the UUID being random:

"Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example."


> Do not assume

You can't assume, but you can certainly check for a given implementation. Plenty of implementations do provide secure randomness, e.g. Ruby's SecureRandom[0].

https://ruby-doc.org/stdlib-1.9.3/libdoc/securerandom/rdoc/S...


Sure, if you are really determined to do it, nothing can stop you from finding a way to shoot yourself in the foot. I still maintain that depending on randomness in UUIDs is conceptually misuse, even if technically achievable.


How are you shooting yourself in the foot. Regardless of what approach you take, you're placing your trust in some random generator. Whether this trusted generator is formatting its output in the form of raw 0s and 1s, or an alphanumeric string, or UUIDs, is absolutely irrelevant.


You are shooting yourself in the foot by calling your random number an UUID or thinking that randomness is a property of UUIDs. This is going to come back and bite you in some way, one example would be where later changes to the code (by someone who knows about UUIDs but not the assumption that they should be random) mixes in a v1 UUID...


You're attacking a strawman. No one is saying that all UUIDs are generated by CSPRNGs. Only that some are.

Regardless of what method you use to generate secrets, someone could later change the code to generate the same secret using a non-cryptographically-secure RNG. By that logic, alphanumeric strings should not be used as secrets either.


And MySQL docs clearly state they're using v1.


If you parse out the UUIDs presented (e.g. 4331cc4c-1d91-11e9-be2c-45923c63e8a2), you'll see there are a lot of version 1 UUIDs.


That's a bug/feature of the specific MySQL implementation. If you used a different implementation like the following, you can generate V4 cryptographically strong random UUIDs

https://www.tutorialspoint.com/java/util/uuid_randomuuid.htm


So... guy shoots himself in foot, writes article about being surprised by how fast and sharp bullets are. This one deserves the Picard facepalm.


Sure, but it could be a pitfall that someone else could fall into. I don't really ever use MySQL, but I didn't know about the UUID stuff.

There are a million things that get posted on Hacker News that are "beginner level". How many tutorials have we gotten about Rust on here? I don't think that's a bad thing.


This is quite funny. Anyone who has used UUIDs for a while realizes that they tend to cluster. Relying on them for randomness is effective as the old xkcd joke: https://xkcd.com/221/.

They are certainly useful for creating unique id's, but for randomness? No.


Anyone who has used UUIDs should be aware that there are different UUIDs. In fact, UUID3,4 and 5 are all generally random. UUID4 depending on the quality of your RNG and 3 and 5 depending on the quality of the input data.




Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: