
Probability of GUID collisions with different versions - yanowitz
https://blogs.msdn.microsoft.com/oldnewthing/20160114-00/?p=92851
======
TazeTSchnitzel
UUIDs and GUIDs are far too complicated; personally, I don't like using them.
There are multiple "versions" (really, generation algorithms) of UUID and
GUID, each with their own problems:

* Some types of UUID uniquely identify the machine they were generated on (one version contains the MAC address + current time, another contains the POSIX UID/GID + domain name) - this got Microsoft into hot water in the 1990s when Word added GUIDs to documents, which meant you could trace documents Stasi-style

* Some types of UUID are based on insecure hashing algorithms (MD5 and SHA1)

* Some types of UUID are namespaced, because everything needs namespaces, obviously

* There's a specially reserved type for Microsoft to use for special COM objects

There's only one mode you should actually use, which is the random bits.

UUIDs and GUIDs also have a weird spacing of dashes. You'd expect three
dashes, splitting it into a sequence of 4-byte chunks, but no: it's split into
4-2-2-2-6, for some reason. And which chunk it is matters, because different
chunks _have different endianness_. Some of them are considered numbers, some
of them bytes, even though they all look the same. Some of them have special
significance (it contains two different version numbers!), with no especially
obvious rhyme or reason to their placement. Oh, and the endianness is
implementation-defined: GUIDs are partly "native" endian (usually little-
endian, then), partly big-endian, whereas UUIDs are _typically_ big-endian.
How do you tell them apart? Well, GUIDs are _usually_ written in capitals, and
UUIDs are _usually_ written in lowercase.
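
To make the endianness point concrete, here is a small sketch using Python's standard `uuid` module (my choice of language for illustration; the example value is made up). `bytes` gives the big-endian RFC 4122 layout, while `bytes_le` gives the layout with the first three fields byte-swapped, as Microsoft-style GUID structures store them:

```python
import uuid

# An arbitrary example value, chosen so the byte swaps are easy to see.
u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# RFC 4122 layout: every field is stored big-endian.
print(u.bytes.hex())     # 00112233445566778899aabbccddeeff

# GUID-style layout: the 4-2-2 fields are little-endian, the last 8 bytes are unchanged.
print(u.bytes_le.hex())  # 33221100554477668899aabbccddeeff

# The same 16 bytes mean different things depending on which layout you assume.
assert uuid.UUID(bytes_le=u.bytes_le) == u
assert uuid.UUID(bytes=u.bytes_le) != u
```

Both layouts print as the same dashed string, which is why you cannot tell them apart from the text form alone.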

I just use 16 random bytes encoded in hexadecimal, separated by three dashes
at 4-byte increments. No hashing algorithms, versions, endianness issues,
namespacing, severe privacy problems, just random bytes. It's not only
simpler, it has more bits of entropy, and is easier to generate.
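
A minimal sketch of that scheme in Python, assuming the standard `secrets` module as the randomness source; the function name `random_id` is just an illustrative choice:

```python
import secrets

def random_id() -> str:
    """Return 16 random bytes as hex, with a dash after every 4 bytes."""
    hex_digits = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    # Split into four 8-character (4-byte) chunks joined by three dashes.
    return "-".join(hex_digits[i:i + 8] for i in range(0, 32, 8))

print(random_id())
```

All 128 bits are random here, versus 122 in a version 4 UUID, at the cost of not being parseable as a UUID.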

~~~
lmm
The great advantage of being standardized is that other people have thought
about these issues and dealt with them in libraries. Yes there are many bad
ways to generate them, but your library should offer the right way to generate
them. Yes the string format is weird, but why would you write that by hand?
Let the library print it, let the library parse it. Want to store it in a
database? There's a column type for that. Want to send it over the network?
There's probably a datatype for it in whatever protocol you're using (whereas
with your approach you immediately have endianness issues).
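
As one illustration of letting the library do the work, here is a sketch with Python's standard `uuid` module (other languages have their own equivalents, which I'm not assuming anything about):

```python
import uuid

u = uuid.uuid4()          # version 4: generated from random bits
text = str(u)             # library prints the canonical dashed hex form
parsed = uuid.UUID(text)  # library parses it back

assert parsed == u
assert u.version == 4     # the version field is set for you
assert len(u.bytes) == 16 # raw 16-byte form for storage or the wire

print(text)
```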

~~~
TazeTSchnitzel
> The great advantage of being standardized is that other people have thought
> about these issues and dealt with them in libraries.

Issues that don't exist if you don't use a hideously complex format like UUID
in the first place.

> Yes there are many bad ways to generate them, but your library should offer
> the right way to generate them.

Yes, a UUID library will have a way to generate the random kind of UUID, but
it's telling that you need a library for it.

You can generate 16 random bytes without a library, by using fopen() and
fread().

> Yes the string format is weird, but why would you write that by hand? Let
> the library print it, let the library parse it.

You shouldn't need a library to handle your identifiers. If you do, your
identifiers are too complicated.

> Want to store it in a database? There's a column type for that. Want to send
> it over the network? There's probably a datatype for it in whatever protocol
> you're using (whereas with your approach you immediately have endianness
> issues).

With the approach of random bytes, there's at least a consistent endianness
throughout. And you can solve the endianness issue by treating it as a byte
string. Job done.

~~~
GFK_of_xmaspast
> You can generate 16 random bytes without a library, by using fopen() and
> fread().

Yeah but then good luck getting the right /dev/random vs /dev/urandom on the
right platform, and also good luck getting it to work on Windows and iOS.

~~~
dalke
And good luck if you are in a chroot'ed jail, or out of file handles. (Hence
why OpenBSD added arc4random(), now based on ChaCha20.)
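
A rough Python sketch of the two approaches being contrasted. The function names are hypothetical; the first mirrors the fopen()/fread() idea, the second asks the operating system through `os.urandom`, which on recent Linux can use the getrandom() system call instead of opening a device file:

```python
import os

def random_bytes_from_device(n: int = 16) -> bytes:
    # Mirrors fopen()/fread(): needs a visible /dev/urandom and a free
    # file descriptor, so it can fail in a chroot or on non-POSIX systems.
    with open("/dev/urandom", "rb") as f:
        return f.read(n)

def random_bytes_portable(n: int = 16) -> bytes:
    # Lets the runtime pick the platform's CSPRNG interface.
    return os.urandom(n)

print(random_bytes_portable().hex())
```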

------
gus_massa
> _GUID generation algorithm 4 fills the GUID with 122 random bits. The odds
> of two GUIDs colliding are therefore one in 2^122, which is a phenomenally
> small number._

Another thing to consider is that, due to the birthday paradox, once you have
generated 2.7 * 10^18 GUIDs, the probability of at least one collision is
greater than 50%. And 2.7 * 10^18 is only about 2^61.2.
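
Those figures can be checked with the usual birthday approximation, n ≈ sqrt(2 * N * ln(1 / (1 - p))) with N = 2^122. A quick Python sketch (the function name is just illustrative, and this is an approximation rather than an exact count):

```python
import math

def guids_for_collision(p: float, random_bits: int = 122) -> float:
    """Approximate number of random IDs at which the chance of at least
    one collision reaches p, using the birthday approximation."""
    space = 2.0 ** random_bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))

n = guids_for_collision(0.5)
print(f"{n:.2e}")             # 2.71e+18
print(f"{math.log2(n):.1f}")  # 61.2
```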

~~~
chronial
If you generated 2.7 * 10^18 GUIDs (and obviously stored them all, otherwise
the birthday paradox is not relevant), you also used up 43 exabytes (1
exabyte = 1 million terabytes) of storage. I wonder which problem you will
encounter first...
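
The arithmetic behind that figure, as a short Python check. It assumes 16 bytes per GUID, decimal units, and no index or storage overhead:

```python
GUID_BYTES = 16   # a GUID is 128 bits
count = 2.7e18    # GUIDs needed for a ~50% collision chance

total_bytes = count * GUID_BYTES
exabytes = total_bytes / 1e18
million_terabytes = total_bytes / 1e12 / 1e6

print(f"{exabytes:.1f} EB = {million_terabytes:.1f} million TB")  # 43.2 EB = 43.2 million TB
```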

~~~
rycfan
True, but that's for a 50% chance of a collision. In a single system, even a
1% chance of a collision is bad news.

~~~
skrebbel
I can't do the math on my phone, but getting to a 1% chance _still_ needs an
enormous unfathomable bucketload of uuids.
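
For a rough number, the standard birthday approximation with 122 random bits and p = 0.01 works out as follows (an approximation, not an exact count):

```latex
n \approx \sqrt{2 \cdot 2^{122} \cdot \ln\frac{1}{1-p}},
\qquad
p = 0.01 \;\Rightarrow\; n \approx 3.3 \times 10^{17} \approx 2^{58.2}
```

At 16 bytes each, that is still roughly 5 exabytes of GUIDs.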

------
dspillett
> _If you use the European scale._

Be careful there. Parts of Europe including here (the UK) officially use short
scale like the US.

When looking at historical figures it is important to be extra careful as
scale use has flipped back and forth over time in places. "Historical" doesn't
go as far back as you think either: short scale becoming the standard number
naming convention in the UK happened in 1974 so there are still people alive
who use long scale and remember it being the most common form.

To _really_ confuse things some places use a half-way house of "short scale
with milliard"...

It is safer to stick with "scientific" prefixes (kilo, mega, giga, tera, ...)
as they are consistently interpreted the same way except where someone is
being deliberately difficult (for "deliberately difficult" read "just plain
wrong"). It sometimes sounds odd referring to things like "giga pounds"
instead of "billions of pounds" (or "thousand millions of pounds" or
"milliards of pounds") but it reduces the risk of misinterpretation and anyone
who doesn't understand probably wouldn't truly understand any of the above
terms without explanation.

For more see
[https://en.wikipedia.org/wiki/Long_and_short_scales](https://en.wikipedia.org/wiki/Long_and_short_scales)

------
dblohm7
I'm amazed that in all of these discussions nobody ever references the RFC, so
here you go:
[http://www.ietf.org/rfc/rfc4122.txt](http://www.ietf.org/rfc/rfc4122.txt)

------
bhouston
I can deal with a machine crashing, I cannot deal with an invalid database
caused by GUIDs intersecting. Thus I actually want to avoid GUID intersections
more than I care about a single machine's valid RAM.

------
cek
GUID == UUID. Annoys me to this day that MS uses the term GUID. Windows
includes API functions using both names (e.g. CoCreateGuid and UuidCreate).

~~~
TazeTSchnitzel
GUIDs are a variant of UUIDs, and not a Microsoft-exclusive thing.

~~~
cek
It's far more complicated (and simple) than that. This S/O question will help
you understand.

[http://stackoverflow.com/questions/246930/is-there-any-diffe...](http://stackoverflow.com/questions/246930/is-there-any-difference-between-a-guid-and-a-uuid)

A UUID generated or used by any of Microsoft's APIs or tools named with "Guid"
is a 100% standard UUID. It is not possible to create a UUID with these APIs
or tools that does not conform to the standard.

~~~
TazeTSchnitzel
GUIDs are a particular implementation of UUID, yes. The big difference is that
GUID tends to be upper-case and little-endian (except for the big-endian bit),
whereas UUIDs tend to be lower-case and big-endian.

