
UUIDs in MySQL are really not random - jtwaleson
https://blog.waleson.com/2019/02/uuids-in-mysql-are-really-not-random.html
======
lialuna
The reason they are not uniform is that the first part is essentially a time
stamp. From the docs:

UUID() returns a value that conforms to UUID version 1 as described in RFC
4122. The value is a 128-bit number represented as a utf8 string of five
hexadecimal numbers in aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee format:

The first three numbers are generated from the low, middle, and high parts of
a timestamp. The high part also includes the UUID version number.

The fourth number preserves temporal uniqueness in case the timestamp value
loses monotonicity (for example, due to daylight saving time).

The fifth number is an IEEE 802 node number that provides spatial uniqueness.
A random number is substituted if the latter is not available (for example,
because the host device has no Ethernet card, or it is unknown how to find the
hardware address of an interface on the host operating system). In this case,
spatial uniqueness cannot be guaranteed. Nevertheless, a collision should have
very low probability.

The MAC address of an interface is taken into account only on FreeBSD and
Linux. On other operating systems, MySQL uses a randomly generated 48-bit
number.

~~~
jrochkind1
Aha, UUID v1! The author really wanted/expected a UUIDv4 there probably. It's
confusing that we talk about "UUIDs", when there are a variety of different
kinds -- called "versions" but they're not really "versions" in the sense of
updated/replaced algorithms, just different approaches.

I don't know if MySQL has built-in UUIDv4 though. pg does.

~~~
jerf
UUIDs have the same problem RAID levels do; they use numbers to identify
different types, but the types aren't related to each other in the way numbers
are related to each other (some sort of sequential operation where "thing 3"
has less of whatever it is than "thing 4" does, be it age, quality, benefits
and costs like certification levels, whatever). I can never keep straight
which level of RAID is which either, since I only brush them every couple of
years, but if they were named for what they _are_ instead of by number I
wouldn't have this problem. I would know what "RAID Full Mirroring" or "RAID
Striped" were much more easily.

We really ought to use names for them, not version numbers. If someone wrote
"Timestamp-based UUIDs Are Not Fully Random", well, by the time you've typed
out the title of your blog post you'd have figured out what the problem was.

~~~
jrochkind1
even if using numbers, in retrospect they should have maybe both been called
"types" UUID type 1 or 4. RAID type 1 or 3. To avoid the association that both
"version" and "level" have, that the numbers are incrementing the
functionality in some way.

~~~
ken
Words always get truncated or omitted, though. It's technically UUID "version
1" (I think), but see how everybody just says "UUID 1". Teaching people to say
"type" instead of "version" wouldn't do much good because most people aren't
saying any word there anyway.

I'd prefer to use letters in cases like this. Letters have a much weaker
association with ordering, there's no implication that the set is of infinite
size, and there's no affordance for doing anything arithmetic with them except
enumerating them.

For example, a plain old USB cable has an "A" and "B" ends, radiation can be
alpha/beta/gamma, and power outlets can be type A, B, C, ... (it looks like
they're up to "N" now). Nobody would think that a Danish "type K" plug is 9
versions improved on an American "type B" plug. It's just different.

~~~
Dylan16807
>see how everybody just says "UUID 1"

Are you sure? If they have the number I always see a "v" in front of it.

------
the_mitsuhiko
That's because the UUIDs returned by mysql are documented to be version 1
which are based on timestamp and node (mac address often enough). Of course
they are not random.

------
beefhash
> If you are even more like me, you will think that this is like the
> difference between /dev/urandom and /dev/random

/dev/urandom does _not_ produce such non-random results; the comparison is
rather far-fetched. /dev/urandom is suitable for virtually all tasks,
including cryptography[1]. It's the same as /dev/random on various BSDs, even.

[1] [https://www.2uo.de/myths-about-urandom/](https://www.2uo.de/myths-about-
urandom/)

------
gruez
>So, the lesson for today is: take warnings in documentation seriously. If you
used UUID() for data that is supposed to be secret (like passwords), you have
a serious problem, as these secrets can now be easily guessed.

No, the lesson here is to not assume a "random" function is cryptographically
secure unless stated otherwise. The same article could be made about
Math.random or rand.

~~~
hn_throwaway_99
But the main problem is that he thinks UUIDs are supposed to be random when
they're not - their only guarantee is uniqueness. MySQL uses (IIRC) V1 UUIDs,
which combine a MAC address with time, while only V4 UUIDs are random.

~~~
ben509
The MAC address is a worse problem, honestly, than that they're not random.
That means your database is leaking information about which host generated the
UUID, which would be useful to someone trying to exploit a particular host.

Or if you have an application installed on personal machines, it's leaking
identifying information about those machines.

~~~
hn_throwaway_99
This is true. I wish I could find the article but years ago I remember
malicious hackers that were identified by law enforcement because their virus
included some windows guids, which at the time included the Mac address.

~~~
sgerenser
Probably thinking about the Melissa Virus:
[http://archive.arstechnica.com/wankerdesk/2q99/melissa-1.htm...](http://archive.arstechnica.com/wankerdesk/2q99/melissa-1.html)

------
jve
> If you used UUID() for data that is supposed to be secret (like
> passwords)...

Since when someone on earth has recommended something like that?

UUIDs are time/mac address sensitive. They have version info within. They make
an awful, guessable secret. It's just that: An unique identifier across the
globe and not just your database.

~~~
whack
A V4 UUID has 122 bits of randomness. Assuming your RNG generator has
sufficient entropy, it would be perfectly valid to use the resulting UUID as a
secret.

[https://en.wikipedia.org/wiki/Universally_unique_identifier#...](https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_\(random\))

~~~
dragontamer
> Assuming your RNG generator has sufficient entropy

Why would you ever assume that? UUIDs could be formed from a low-entropy
LCGRNG and still match the specifications.

Any secret should be generated with cryptographically secure random number
generators. Calling "UUID" instead of like, BCrypt or something would be a
major red flag to me. Heck, a modern cryptographically secure secret can be
generated from Intel's "RDSEED" assembly instruction, which provides a
hardware-random source of entropy, as well as entropy guarantees. (You'll
probably use this assembly instruction if you are using /dev/urandom)

~~~
dfox
Version 4 UUIDs pretty much have to be generated from reasonable(-ish) CSRNG.
The assumption that the resulting 122bits are universally unique requires that
they are truly chosen at random, CS(P)RNG is the most sane way to accomplish
that.

~~~
dragontamer
> The assumption that the resulting 122bits are universally unique requires
> that they are truly chosen at random, CS(P)RNG is the most sane way to
> accomplish that.

Certainly not.

Take a Version 1 UUID (time-based), and spin it through a 128-bit LCGRNG. Do
the AND / OR bitmasks needed to properly tag it a Version4 UUID and you're
done. Bam, universally unique but not cryptographically random.

This would be very fast (cryptographic entropy is very slow!), and would be
ideal for databases or most purposes.

------
weinzierl
This reminds me of the day I noticed that some records in our database had
UUIDs with uppercase letters while others had lowercase letters. It didn't
require much digging to find out that it was dependent on their origin. One
kind was generated by the database the others by Java code.

In our case it wasn't a problem, but I couldn't help thinking that this was a
nice information leak where you could find out some property of a dataset that
was never intended to be retained. Just imagine that you mobile sign-ups had
one kind of UUID and your web sign-ups the other kind. Though you might not
explicitly store the information how someone signed up you inadvertently
really do.

And now if you are tempted to ask why we stored our UUIDs as strings - just
don't...

------
ciupicri
They don't seem to have the same type (according to uuidparse from util-
linux):

    
    
      UUID                                  VARIANT TYPE       TIME
      4f95de28-0fd1-48db-ad2e-34ecd169c483  DCE     random     
      4331cb9e-1d91-11e9-be2c-45923c63e8a2  DCE     time-based 2019-01-21 17:28:54,631107+02:00
      4331cc4c-1d91-11e9-be2c-45923c63e8a2  DCE     time-based 2019-01-21 17:28:54,631124+02:00
      c7e2f124-f6ba-4434-843f-89958a7436ec  DCE     random     
      3e18f8dd-b1d3-4e16-8a81-4bdceac91772  DCE     random
    

Explanation of the output:

    
    
      Variants
       DCE   The Open Software Foundation's (OSF) Distributed Computing Environment UUIDs.
    
      Types
       time-based   The DCE time based.
       random       RFC 4122 random.

------
michaelcampbell
They're not RUUIDs, after all. The article states that it's doing pretty much
what the docs say it's doing.

------
cpburns2009
Well it is a universally unique _identifier_ not _secret_. There's no
ensurance of it being random unless you're generating v4 UUIDs.

~~~
greiskul
And even if you are generating v4 UUIDs, they are not necessarily
cryptographically secure. So it might be possible, if lets say, a user gets a
bunch of UUIDs that your system generated, it might be possible for they to
figure out other UUIDs that the same system generated. Which is totally fine
for what they were built for.

------
agentx3r
> If you used UUID() for data that is supposed to be secret (like passwords),
> you have a serious problem, as these secrets can now be easily guessed.

I've never used UUIDs for secrets, but I've often dependended on UUIDs to be
uniform - as the primary identifier when passing data around inside
distributed systems. If the chance of collision is implementation dependent,
that could end really badly.

~~~
hn_throwaway_99
UUIDs (of all versions) are guaranteed to be _unique_. They are not guaranteed
to be uniformly distributed.

~~~
agentx3r
If you have two systems generating UUIDs, and their probability distributions
are non uniform and overlap in any significant fashion, then your odds of
collision are no longer on the order of 'heat death of the universe'.

TIL only version 4 is generated using a random source.

~~~
zzzcpan
The purpose of UUIDs is to guarantee uniqueness without coordination within
the system of computers over long enough time and within the certain rate of
generating them. Hence all the timestamps and node identifiers in them,
coordination on which already happened. If it doesn't guarantee uniqueness for
your use case - you should change the algorithm to make sure that it does,
otherwise there is no point having them if you still have to do coordination
to avoid collisions.

------
leni536
The article's gripe with UUIDs is with _uniformity_ and not randomness. A low
discrepancy sequence [0] is an example for something uniform but not random in
an sense.

[0] [https://en.wikipedia.org/wiki/Low-
discrepancy_sequence](https://en.wikipedia.org/wiki/Low-discrepancy_sequence)

------
iooi
If you want UUIDv4 in MySQL, this Stack Overflow thread is pretty useful:
[https://stackoverflow.com/q/32965743](https://stackoverflow.com/q/32965743)

~~~
tracker1
Note that this will probably result in a more sparse index and longer inserts
if using UUIDv4 for a primary key where UUIDv1 is available in the dbms.

------
wil421
What’s the use case for a random UUID? I’ve never had a need for them to be
random just unique.

~~~
crdrost
So supposing you have multiple clients you might have shared access to an
image or client-data database. You might not want someone to see an identifier
and guess other "close" identifiers.

For example, a company I used to work for named IntegriShield is essentially a
for-hire private Internet police company. When I got there, we would take an
automatic screenshot of a web site and then send it to the people who were
doing something "bad" (for whatever the definition of "bad" was) and you would
essentially visit some sort of link like

    
    
        https://api.integrishield.com/(api route)/12345.png
    

Of course if these folks were to increment the identifier they would see other
screenshots we were capturing -- though at the time that was not very helpful
as we would routinely capture _every_ web site that we crawled, so most of
them didn't have anything interesting on them.

One of my jobs there was to improve this process: reduce image storage costs
and increase the productivity of the humans who had to look at these web sites
to determine what had gone wrong. With my contribution, we could intelligently
flag language that might be "bad" and then capture enough metadata that we
could later reconstruct screenshots which highlighted those "bad" things,
after a human being had looked at them and said "no, these things are okay but
those things really are bad" and deleted half of our highlights. So the
screenshots were now only generated _after_ we were done and _clearly
highlighted_ what was wrong, and then it becomes a serious problem if folks
from one of these companies can see what their peers are doing.

We did not switch to UUIDs but rather just encrypted the counter with a server
key, but the point is still that you would now visit

    
    
        https://api.integrishield.com/(other-api-route)/2121a46a4c24c512965671f8fb269f0b.png
    

and that would give you access to this image, but now if you just change that
to a random identifier there is only a 2^-64 chance that you would
successfully alight upon a valid image, and if you don't we can increment a
counter in the database and if that gets too large too fast we can send out
emails saying "warning, lots of 404s are happening".

UUIDs v4 (chosen securely randomly) have the same properties from the very
start. You can easily use them in cases where the security model _demands_
that some non-client must be able to access some sort of information simply by
knowing the identifier.

We could have chosen something more complicated, of course, like creating a
many-to-many table relating screenshots to email addresses and then embedding
the email address somehow in the URL,

    
    
        https://api.integrishield.com/(more-api)/12345.png?email=someone%40example.com
        https://api.integrishield.com/(more-api)/12345.png?emailBatch=67890
    

Of those the first is worrisome -- there's no reason that one company couldn't
have access to another company's email addresses for their legal compliance
department and indeed in many cases they should. And if we try to obfuscate
that with an auto-incrementing ID like in the second case then we're actually
legit broken: remember, I told you that the screenshots are generated shortly
before the emails get sent out, so they both would tend to increment in
lockstep, you might have to try only 5-10 screenshot/batch pairs before
alighting upon one you weren't meant to see. So the batch needs to be
protected by, what, UUIDv4? And you have the same problem that you had before.

~~~
wil421
I appreciate the response. A legacy system at my company actually exposed data
this way recently. A user guessed values and was able to see a small bit of
data from other customers.

------
sarabob
Yay, you're one of today's lucky 10,000 [0]!

There are different types of UUIDs [1]. MySQL generates version 1 UUIDs - you
can tell by the first digit of the third block. It looks like something else
is also generating UUIDs, using the v4 format - so you have two different
types of UUID in your database:

4331cb9e-1d91-11e9-be2c-45923c63e8a2 # v1

c7e2f124-f6ba-4434-843f-89958a7436ec # v4

v1 UUIDs are typically based on the MAC address of your computer and the
current time. Online tools exist [2] that decode the uuid and tell you it was
created at 2019-01-21 10:28:55 (UTC)

v4 UUIDs are random.

[0] [https://xkcd.com/1053/](https://xkcd.com/1053/)

[1]
[https://en.wikipedia.org/wiki/Universally_unique_identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier)

[2]
[https://realityripple.com/Tools/UnUUID/](https://realityripple.com/Tools/UnUUID/)

------
mavdi
So basically UUID V1 instead of V4. Misinformation presented as a lesson.

------
anth_anm
The UUID specification only has a small part of it random.

------
krig
Ehm, this is just a fundamental misunderstanding. UUIDs are not and are not
supposed to be random. They are not even unique. They are just reasonably
likely to be unique enough for most purposes.

~~~
whack
[https://en.wikipedia.org/wiki/Universally_unique_identifier#...](https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_\(random\))

V4 UUIDs _are_ supposed to be random.

~~~
krig
Nope. The V4 RFC does specify and recommend a method for generating UUIDs that
uses randomness, but it also provides other methods, and explicitly warns
against relying on the UUID being random:

"Do not assume that UUIDs are hard to guess; they should not be used as
security capabilities (identifiers whose mere possession grants access), for
example."

~~~
bpicolo
> Do not assume

You can't assume, but you can certainly check for a given implementation.
Plenty of implementations do provide secure randomness, e.g. Ruby's
SecureRandom[0].

[https://ruby-
doc.org/stdlib-1.9.3/libdoc/securerandom/rdoc/S...](https://ruby-
doc.org/stdlib-1.9.3/libdoc/securerandom/rdoc/SecureRandom.html#method-c-uuid)

~~~
krig
Sure, if you are really determined to do it, nothing can stop you from finding
a way to shoot yourself in the foot. I still maintain that depending on
randomness in UUIDs is conceptually misuse, even if technically achievable.

~~~
whack
How are you shooting yourself in the foot. Regardless of what approach you
take, you're placing your trust in some random generator. Whether this trusted
generator is formatting its output in the form of raw 0s and 1s, or an
alphanumeric string, or UUIDs, is absolutely irrelevant.

~~~
krig
You are shooting yourself in the foot by calling your random number an UUID or
thinking that randomness is a property of UUIDs. This is going to come back
and bite you in some way, one example would be where later changes to the code
(by someone who knows about UUIDs but not the assumption that they should be
random) mixes in a v1 UUID...

~~~
whack
You're attacking a strawman. No one is saying that _all_ UUIDs are generated
by CSPRNGs. Only that some are.

Regardless of what method you use to generate secrets, someone could later
change the code to generate the same secret using a non-cryptographically-
secure RNG. By that logic, alphanumeric strings should not be used as secrets
either.

------
foobarian
So... guy shoots himself in foot, writes article about being surprised by how
fast and sharp bullets are. This one deserves the Picard facepalm.

~~~
tombert
Sure, but it could be a pitfall that someone else could fall into. I don't
really ever use MySQL, but I didn't know about the UUID stuff.

There are a million things that get posted on Hacker News that are "beginner
level". How many tutorials have we gotten about Rust on here? I don't think
that's a bad thing.

------
chapium
This is quite funny. Anyone who has used UUIDs for a while realizes that they
tend to cluster. Relying on them for randomness is effective as the old xkcd
joke: [https://xkcd.com/221/](https://xkcd.com/221/).

They are certainly useful for creating unique id's, but for randomness? No.

~~~
the_mitsuhiko
Anyone who has used UUIDs should be aware that there are different UUIDs. In
fact, UUID3,4 and 5 are all generally random. UUID4 depending on the quality
of your RNG and 3 and 5 depending on the quality of the input data.

