Hacker News
A Brief History of the UUID (segment.com)
304 points by mrbbk on June 7, 2017 | 66 comments



I can actually fill in some of the details about the history of UUID's. Paul Leach was an architect who worked at Apollo, OSF, and later Microsoft. I met Paul in the mid-90's when he was an architect at OSF, and I was the tech lead for Kerberos v5 development at MIT. OSF DCE was going to use Kerberos for authentication, and was going to use Apollo RPC as its RPC layer.

It was from talking to Paul that I learned about UUID's, and I added libuuid into e2fsprogs 1.05, released September 7, 1996. UUID's were used in Linux in the ext2 superblock, and later on, GNOME picked it up and used it extensively, which meant among other things that if you wanted to run GNOME on FreeBSD or NetBSD or Solaris, you had to compile e2fsprogs to get libuuid. :-)

Later on Paul went on to Microsoft, and I'm fairly certain that it was due to Paul that Microsoft adopted the OSF DCE RPC layer for its internal use, and UUID's started being used extensively inside Microsoft. UUID's also got used in Intel's EFI specification for the GPT partition table, although somewhere along the way they got renamed "Globally Unique ID's" --- it's the same spec, though.

While Paul was at Microsoft, the specs for UUID's finally got standardized by the IETF as RFC 4122, so you no longer needed to go find dated copies of the OSF DCE specification (or download e2fsprogs, since I had an early version of the UUID Internet Draft in the sources long before it finally squirted out the other end of the RFC publication pipeline).

As far as uuidd is concerned, the reason why it exists is because a certain very large Enterprise Resource Planning system was using libuuid to generate UUID's for its objects, and it needed to create them very, very quickly so it could initialize the customer's ERP database in finite time. They were also using the time-based UUID's, with the UUID stored in the database with the bytes cleverly rearranged so the time bits would be stored in the MSB, and the Ethernet MAC address would be in the LSB, so that a database using a B-tree (plus prefix key compression) for its indexing would be able to very efficiently index the UUID's. This is similar to the k-ordering trick that Flake was using, but this very large enterprise planning company was doing it in 2007, five years before the team at Boundary came up with Flake, and they were doing it using standard UUID's, simply storing the time-based UUID bytes in a different order. (I believe they were also simply storing the ID in binary form, instead of base-62 encoding, since if you're going to have jillions of objects in your ERP database, you want them to be as efficient as possible.)
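The byte-rearranging trick is easy to sketch (a toy Python illustration of the idea, not the vendor's actual code): move a version-1 UUID's high-order time bits to the front, and sequentially generated IDs become keys that sort in generation order.

```python
import uuid

def rearranged_key(u: uuid.UUID) -> bytes:
    """Reorder a version-1 UUID so the high-order time bits come first."""
    # RFC 4122 byte order: time_low (4) | time_mid (2) | time_hi_and_version (2) | rest (8)
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:]

u1, u2 = uuid.uuid1(), uuid.uuid1()
assert rearranged_key(u1) < rearranged_key(u2)  # keys now sort in generation order
```

Stored this way, consecutive inserts land next to each other in a B-tree index instead of scattering across it.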

Anyway, a certain Linux distribution company contacted me on behalf of this very large Enterprise Resource Planning company, and we came up with a scheme where the uuidd daemon could issue blocks of time-based UUID's to clients, so we could amortize the UUID generation over blocks of 50 or 100 UUID's at a time. (This ERP was generating a huge number of UUID's.) I did it as a freebie, because I was tickled pink that libuuid was such a critical part of a large ERP system, and it wasn't that hard to implement the uuidd extension to libuuid.


I can confirm Paul helped drive MS's adoption of OSF DCE RPC.

Once he convinced me of the uniqueness of correctly generated UUIDs I coined the phrase "the likelihood of a UUID collision is the same as an avocado spontaneously turning into a grapefruit."

A fun tidbit: At one point I was the maintainer of the list of static UUIDs with the Microsoft bit set. It was a flat text file checked into the Windows source. I reserved a chunk of them for my own projects because having all those zeros was useful in debugging. E.g. "00000000-0000-0000-c000-000000000046" (the c000 indicates MS reserved).
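That c000 is visible to standard tooling: the top bits of that octet (110) mark the Microsoft-reserved variant, which e.g. Python's uuid module will report for the GUID quoted above:

```python
import uuid

# The all-zeros GUID from the comment above, with the Microsoft-reserved octet.
u = uuid.UUID("00000000-0000-0000-c000-000000000046")

# The 0xc0 octet's top three bits (110) put it in the Microsoft-reserved variant.
assert u.variant == uuid.RESERVED_MICROSOFT
```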

In '97 I wrote the internet-draft [1] that Paul & Rich Salz finally turned into RFC 4122 in 2005 [2].

[1] https://www.ietf.org/archive/id/draft-kindel-uuid-uri-00.txt [2] https://datatracker.ietf.org/doc/rfc4122/


I am not sure how other people feel, but despite having used the Internet for a good 24 years now, I am still astonished at the connection possibilities it gives to us little people. Here I am, a man of little acclaim, communicating with people who wrote e2fsprogs and the UUID RFC and whatnot. This is still a wonder, and I doubt it'll ever become an everyday feeling.


If a UUIDv4 is used, then yes, I'm convinced it will be random to within the amount of entropy the system provides.

If another UUID version is used, I'm extremely wary. MAC addresses are explicitly not unique. Worse, there are plenty of bargain manufacturers that make cards all with the same MAC. There's a story of a local college that ordered several hundred cards for workstations and discovered every single one had the exact same hardware MAC. Large orgs will spot this easily, but there could be hundreds or thousands of individuals who never tried connecting two of the same bad cards on the same network segment, and so never noticed.

In practice, you can usually depend on it to be truly unique. But there are several failure modes that you absolutely need to be aware of if you plan on using this in production.


Paul is definitely a very smart guy and something of a programming hero of mine. I did have the good fortune to have dinner with him once, although I think I was too shy to really ask him anything.

The DCE library “surprisingly distributed by Apple” is used by their CIFS server and NetLogon daemon. From memory, it originally came from the OSF DCE RPC source code drop, which I then slightly modernised for use in XAD (the first Active Directory-compatible domain controller replacement, which became Domain Services for Windows at Novell), and then found its way to Apple via Likewise.


Even in 2007 the system csrng should be able to generate many millions of uuids/sec on one core. (Say, 10M/sec easily?) At that point isn't it just a matter of keeping a timestamp updated and swizzling it in?

How many millions/sec did they need?


The issue is that they had multiple threads generating objects, and they wanted to use time-based UUID's because they wanted to efficiently fill a B-tree-indexed database. If the most significant bytes are constant, then all of the new objects will land in the same leaf node, and you can use certain heuristics such as prefix compression, and filling a leaf node completely before adding a new node (instead of splitting the node equally), because you know that you are inserting keys into the B-tree in monotonically increasing sort order.

If you use a random number generator to generate random UUID's, then you will be inserting the objects all over the B-tree, and once the tree is too big to fit into memory, performance goes down the toilet.
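A quick way to see the difference, as a toy Python sketch (assumes CPython's standard uuid module): time-ordered keys share a long common prefix, so inserts cluster into the same leaves, while random v4 keys scatter across the whole keyspace.

```python
import uuid

def time_ordered(b: bytes) -> bytes:
    # Move the high time bits (time_hi, time_mid) to the front of the key.
    return b[6:8] + b[4:6] + b[0:4] + b[8:]

v1 = [time_ordered(uuid.uuid1().bytes) for _ in range(1000)]
v4 = [uuid.uuid4().bytes for _ in range(1000)]

# The time-ordered keys all start with the same two bytes (the slowly
# changing high time bits); the random keys hit hundreds of distinct prefixes.
assert len({k[:2] for k in v1}) <= 2
assert len({k[:2] for k in v4}) > 500
```

With random keys, every leaf of the index is a potential insertion target, which is exactly the cache-thrashing behavior described above.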


In the early days of network interfaces, unique IDs were a problem. It was once suggested that each network interface have a $1 bill attached, with the serial number of the bill being the adapter ID.

This was a real problem in the early days of low-cost Ethernet controllers. Some manufacturers didn't buy their own address space [1], but reused that of some major vendor (usually 3Com). This resulted in occasional real-world duplicates.

[1] https://regauth.standards.ieee.org/standards-ra-web/pub/view...


Somewhere I have two network cards with the same MAC address burned into them.

That was a lot of fun to figure out when I was younger and things started going haywire!


Just like the sibling post, I had the same thing happen on cheap NE2K nics. Imagine my surprise when pinging one machine and getting two replies back!

Edit: it was on a hub too, so nothing kept the two machines from seeing the packet.


> It was once suggested that each network interface have a $1 bill attached

I'm assuming this was a joke, but was it actually serious?

Because things would've gotten hilarious at the first replacement banknote [1]

[1]: https://en.wikipedia.org/wiki/Replacement_banknote


That very article says that the serial numbers of replacement banknotes are distinguishable by an asterisk.


In the late nineties, when I was in college working at the helpdesk, we had a problem during freshman orientation where all of the NE2000 clones in the desktops from HP or Packard Bell (it definitely had Packard in the name) had the same MAC set. It was a lot of fun reinstalling everyone's drivers via sneakernet.


I really like the ideas behind ksuid (near the end of the article). However, two quotes:

> Those concerned with UUID collision in a properly-configured system would find their time better spent pondering far more probable events like solar flares, thermonuclear war, and alien invasion on their systems.

And then further down:

> A “custom” epoch is used that ensures >100 years of useful life.

Wait, so the last 128 bits of a KSUID won't get me in trouble before the sun explodes, but the first 32 bits (the timestamp) will cause trouble well before my grandkids die?

I really wonder why they didn't reserve some more bits for the timestamp, if necessary at the cost of a little less randomness. Could've made this stuff last for millennia at no extra collision risk, in practice.


Yeah, which is why I'm toying with using ulid identifiers, which won't face timestamp issues until around 10900 AD. Source here: https://github.com/alizain/ulid/



The repo in the comment parent to yours has a list that includes the one you mentioned and several others in various languages :)

https://github.com/alizain/ulid#implementations-in-other-lan...


Yeah. Not sure why they didn't just go straight for a 64-bit timestamp (maybe 10^-8 second granularity ~= 5.8k years) and 64 bits of random data.

I also wonder if base58 would have been a bit nicer. Base62 is of course slightly more compact, but base58 is nice in that it reduces visual character ambiguity.


The original UUID's (from Apollo, as used in OSF DCE, e2fsprogs, Microsoft, etc.) used a 64-bit timestamp, with 100ns granularity. It uses as the start of its Epoch the beginning of the Gregorian Calendar (00:00:00.00, 15 October 1582), so it's good up to around 3400 A.D. before it hits its version of the 2038 problem. :-)
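You can check that epoch offset yourself. A quick Python sketch (the magic constant is the standard count of 100ns intervals between 1582-10-15 and the Unix epoch):

```python
import time
import uuid

# 100ns ticks between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
GREGORIAN_OFFSET = 0x01B21DD213814000

u = uuid.uuid1()
# u.time is the 60-bit timestamp in 100ns units since the Gregorian epoch.
unix_secs = (u.time - GREGORIAN_OFFSET) / 1e7

# The recovered wall-clock time matches the system clock.
assert abs(unix_secs - time.time()) < 5
```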

Note that UUID's, like IPv6 addresses, are sufficiently long that if users are needing to interact with them directly, You're Doing Something Wrong. So the whole base-62 versus using hyphens versus base58 discussion misses the point, in my view. Computers will generally be exchanging them in 128-bit binary format, and they should only really be dumped out for debugging reasons.


[Author here: Hey there Theodore -- first of all thank you so much for that other comment with first-hand knowledge about Apollo => UUID. This is the kind of priceless nugget that makes HN such an interesting place.]

From a systems perspective I can definitely see why one could reach the conclusion that if users are interacting with them directly, they're doing something wrong.

What we've seen is that IDs are routinely copied between different systems for debugging purposes. For example, at Segment it is common to have IDs pasted into support tickets, and the search engine that indexes these will tokenize on the "-" in a UUID. The difference between finding the solution in minutes vs. hours can come down to whether two systems can be correlated by searching for the UUID. Leaving the hyphens out is a minor tweak that pays dividends over time.

For anything human-exposed (and in JSON or URLs) we use the 27-byte base62 encoding. In the database we store KSUIDs in their 20-byte binary (base256?) encoding.


  > Note that UUID's, like IPv6 addresses, are sufficiently 
  > long that if users are needing to interact with them 
  > directly, You're Doing Something Wrong
One would think so, but you would be surprised how many times someone has typed out to me (or copy/pasted) a UUID for debugging purposes, or how often I have had to eyeball UUIDs in production logs.


We eyeball, compare and copy/paste git commit IDs all the time, and they're longer than UUIDs. I don't see any problem with it as long as there's good tooling to help us interact with these IDs (e.g. git allows us to use the first few characters of a commit ID if there's no collision).


The universe is a few billion years old, and in all likelihood will last at least a few billion more.

I wonder how people manage to keep a straight face while calling something "universally unique" that can't even cover a time period corresponding to the recorded history of our own species. Perhaps it never occurred to them that archeologists and paleontologists might also find time-based UUIDs useful?

Space is cheap now. We don't have to fit everything in 128 bits. We could just use good ol' 64 bit time_t, add another 32 bits for nanosecond granularity, throw in a decent amount of random data, and still get identifiers somewhere between the size of a git commit ID (160 bits) and a bitcoin transaction ID (256 bits). Nobody writes these things by hand anyway.
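That construction is easy to sketch (my own toy rendering of what's described, in Python; the 96 bits of randomness is an arbitrary choice that lands the total at 192 bits, between a git commit ID and a bitcoin txid):

```python
import os
import time

def big_id() -> bytes:
    """64-bit seconds + 32-bit nanoseconds + 96 random bits = 24 bytes.

    Big-endian timestamp first, so byte order equals creation order.
    """
    secs, nanos = divmod(time.time_ns(), 1_000_000_000)
    return secs.to_bytes(8, "big") + nanos.to_bytes(4, "big") + os.urandom(12)

a = big_id()
time.sleep(0.001)
b = big_id()
assert len(a) == 24 and a < b  # sortable by creation time
```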


Archeologists actually have their own versions of UUIDs and massive books/databases cataloguing them. Since dates are constantly subject to revision though, they don't generally use timestamps.


The very fact that scientists have had to come up with their own identifiers is testimony to the woeful inadequacy of what we're calling "universal" identifiers.

Archeologists of the year 8017 might also thank us if we were kind enough to label each piece of our data in a way that makes sense to them, including precise timestamps. Heck, they might still be using the same format if we did come up with something truly universal!


We've been using roughly the same formats for a couple hundred years already, well before computers were a thing. Problem is, that also means every government has its own slightly different databases that no one wants to integrate.

I've rarely seen dates encoded because they don't provide much useful information. Sites themselves rarely have firm dates and individual excavations get their own (infinite) ID blocks. Each feature and artifact within that excavation is tagged with the block, alongside a bunch of metadata and paperwork. Every artifact carries with it a stack of paperwork documenting where it came from and how it was buried, making shipping the things expensive.

These collections eventually get bundled into larger site or regional packages alongside computational models and research papers, adding yet another hierarchical layer.

The whole thing is very bureaucratic and a bit crusty at times, but still very scalable for a system designed before computers were even imagined. As with most things, the friction comes from trying to move things between the physical and digital worlds, not the IDs themselves (which are just guaranteed unique tags).


That's because no matter how many times we're told to simply trust UUIDs, we will always want to add a timestamp just in case :)

Maybe we should just admit that no amount of poetic hyperbole like "until an alien invasion" will quell our innate fear of a UUID collision.


I have heard people laugh at the risk of UUID collisions and thought we were safe, but two years ago I heard that a consulting company I worked with had seen real issues with UUID collisions on a production system.


I doubt many people would argue with the math behind modern uuid generation functions. Those of us that have been around for a while know the problem can come from the implementation and the storage for the UUID.

For example: Bugs in code that may generate a unique ID once and then use it multiple places.

Imaged OS installations that may cause a UUID to be duplicated

Spoofing another ID for malicious reasons

Accidental database update statement

People tend to give UUIDs magical properties, but they need to be treated like any unique integer.


If you are using UUID generation based on the MAC address and a timestamp, then you can have problems with multiple processes generating UUIDs on the same system. There are other UUID generation algorithms, though.


The 100 years thing is a peeve of mine too. Especially if they want to see this widely adopted.

That is how things like IPv4 address exhaustion and Y2K problem happen. It's 2017, add another byte to your timestamps.


Are UUIDs secure for creating a unique user session identifier?


It depends. I wrote a blogpost about the security implications of UUIDs back in February (https://blog.silentsignal.eu/2017/02/17/not-so-unique-snowfl...) and it shows that not all standard libraries make secure-by-default easy for the developer. I developed a MIT licensed plugin for Burp Suite (https://github.com/silentsignal/burp-uuid) that can help pentesters and security minded developers to detect insecure UUID versions, and even in that plugin I described version 4 UUIDs as "randomly generated, although [their] entropy should be checked".


No, UUIDs aren't meant to be cryptographically unpredictable. Session identifiers should be generated by your friendly neighbourhood crypto RNG.


If you're not specifically requesting a time based ID, any proper uuid lib should just return 16 bytes of randomness from the system rng.


You need to double-check the library you're using. For example, older versions of the Python standard UUID library would silently fall back to an insecure RNG.

It's easier to just read 16 bytes from /dev/urandom and encode it in hex/base64 yourself for a random token.
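That really is all it takes; a minimal sketch:

```python
import base64
import os

raw = os.urandom(16)  # 128 bits straight from the system CSPRNG

token_hex = raw.hex()                                            # 32 chars
token_b64 = base64.urlsafe_b64encode(raw).rstrip(b"=").decode()  # 22 chars

assert len(token_hex) == 32 and len(token_b64) == 22
```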


Getting pedantic, but a proper uuid lib should be returning 122 bits of randomness, and six bits of fixed information that specify the uuid version and generation method ("random").
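Concretely, "proper" v4 generation is 16 random bytes with 6 bits overwritten. A sketch of what standard libraries do under the hood:

```python
import os
import uuid

def make_uuid4() -> uuid.UUID:
    b = bytearray(os.urandom(16))  # 128 random bits...
    b[6] = (b[6] & 0x0F) | 0x40    # ...minus 4: version nibble set to 4
    b[8] = (b[8] & 0x3F) | 0x80    # ...minus 2: variant bits set to 10
    return uuid.UUID(bytes=bytes(b))

u = make_uuid4()
assert u.version == 4 and u.variant == uuid.RFC_4122
```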


For what purpose? Where's the real world reason and interop? Just because an RFC says to set some bits? Is there any code you're likely to hit in your app that actually cares?


I guess the purpose is to be able to put different UUID generation schemes into separate kind-of "namespaces".

That way, a UUIDv4 (random) will never collide with a UUIDv1 (timestamp+mac+...), which again will never collide with UUIDv5 (SHA-1).

So if one of the UUID generation schemes is flawed, these are easily filtered out, and moreover, will never interfere with any other UUID version/scheme.


Not quite, most generate v4 uuids which have a few fixed bits.


The ones most modern UUID libraries generate (v4 UUIDs, variants one or two) have 122 random bits from the OS's csprng and 6 fixed bits.


I'm banking on ULID and surprised they didn't adopt it; it's a much more approachable solution, and battle-tested.


There's also "ULID" in this neo-UUID space: https://github.com/alizain/ulid

48-bit timestamp plus 80 bits randomness, base32 encoding (no hyphens), and lexicographic sort order.
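A minimal ULID generator is short enough to sketch here (a toy version of the spec linked above: Crockford base32, 26 output characters covering the 48+80 = 128 bits, with the alphabet in ASCII order so string sort tracks generation time):

```python
import os
import time

# Crockford's base32 alphabet (no I, L, O, U), in ASCII order.
B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    """48-bit millisecond timestamp + 80 random bits, as 26 base32 chars."""
    n = (int(time.time() * 1000) << 80) | int.from_bytes(os.urandom(10), "big")
    out = []
    for _ in range(26):
        n, r = divmod(n, 32)
        out.append(B32[r])
    return "".join(reversed(out))

a = ulid()
time.sleep(0.002)
b = ulid()
assert len(a) == 26 and a < b  # lexicographic order tracks generation time
```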


That does seem pretty nice. Thanks for the link.


Why would you ever want to use UUID format, which only has 122 bits, versus just making a random 128 bit number? In which realistic scenario would simply reading 16 bytes from urandom not be fine and actually cause issues that removing 6 of those bits to identify the UUID type help?

Also, 32 bit timestamp + 128 random? I guess, but that sounds sort of overkill-ish - if you're going to go to 20 bytes (and thus not fit in a DB's UUID type, require more than 2 registers, etc.), why not make it 24 or 32 bytes and have a proper timestamp? Or if 32-bit timestamp is really acceptable, are you sure that 96-bits of randomness are not?


One of the things I've found strange about UUIDs is that they serialize to hex when displayed as a string, yet I've seen real-life projects with little technical debt store them as strings in a database. This is obviously the fault of the programmer, but you have to look at that and think: if you're serializing to a string (whether in a database or over the network), there are so many better options.


There will be a lot of these. Microsoft's ASP.NET Core Identity, for example, by default stores the user ID as a GUID string in the database:

https://github.com/aspnet/Identity/blob/dev/src/Microsoft.Ex...


AFAICT the only reason is that you need it for compatibility. If your variant and version fields aren't valid, you don't have a UUID anymore -- you have something else.


Compatibility with what though? What's an example of a real world system that takes UUIDs and uses the standard to do stuff with the fields? Does any popular software treat them as anything beyond a 128bit number? Genuinely curious what people are giving up 6 bits for.


The point of the timestamp is to make UUIDs somewhat sortable. It is an interesting property for debugging or optimizing.

But yeah 20 bytes is an odd compromise.


Sortability.


And then two machines have the same seed and bam collision.


If two machines have the same csrng seed and can't fix that, then they have the same time, too, so that doesn't help at all.


Vaguely related story: I inherited a system which generates SSCCs to identify each box being shipped from a warehouse. These are supposed to be globally unique, and are generated from a company's GS1 number (the same used for UPCs) plus another number which the company is supposed to make sure is unique. This particular system generated them on the client-side, based on the current timestamp in microseconds, with a random pause to prevent two computers from generating identical runs if they both printed labels at the same time. In a fairly small warehouse, this generated collisions about every six months or so. I instead changed the program to use an (already existing!) auto-increment column on the shared database, thus precluding any possibility of collision and making the program a lot faster since it was no longer delaying for a quarter-second per label (on shipments of 200+ boxes).


Having 32 bits of 1-second resolution time and 128 bits of random payload makes the idea that these are "semi-sortable" a bit odd. Consider:

1. Let's be super-lenient and say that we'll consider an average-size bucket of up to 64k (2^16) equivalent entries to be "semi-sortable".

2. If you generate any more than 2^48 (2^32 * 2^16) IDs over the full 100-ish year lifetime of the ID, then you're giving up on even that super-lenient definition of "semi-sortable".

3. If you're only ever going to generate 2^48 IDs, then 128 bits of random payload (in addition to the 32 bits of timestamp!) seems like absurd overkill.

Given the amount of thought that obviously went into this, I'm guessing that there's probably a good reason that they decided to go with 32 bit timestamps (I can certainly think of many, SHA1 length assumptions being a likely component), but if it's in the article, I missed it.


It's a little buried, but here's the primary tldr of their KSUID library:

> Thus KSUID was born. KSUID is an abbreviation for K-Sortable Unique IDentifier. It combines the simplicity and security of UUID Version 4 with the lexicographic k-ordering properties of Flake. KSUID makes some trade-offs to achieve these goals, but we believe these to be reasonable for both our use cases and many others out there.


Related tangent:

Anyone have any Domain/OS stories or resources they want to share?

This system always seemed like an interesting one, but details are fairly scarce..


> However, on a mobile device, almost anything goes: mobile devices cannot be trusted. While most of these are just as good as what’s available in the scenario above, it’s routine that the PRNG source on these devices isn’t very random at all. Given that there’s no way to certify the quality of these, it’s a big gamble to bet on mobile PRNGs. ID generation on low-trust mobile devices is an interesting and active area of academic research[1].

This is true for highly specialized systems like sensor nodes maybe. For what is generally understood as a "mobile device", i.e. mobile phones or tablets, it is bollocks.


> It borrows core ideas from the ubiquitous UUID standard, adding time-based ordering.

Isn't time-based ordering bad, since it might allow hackers to predict UUID generation and use it to compromise security systems based on UUID?


The phone number was hardly the "first unique identifier in a network", and switchboards worked just fine before phone numbers; phone numbers were first added in the 1890s because of the Strowger switch.

The first UUIDs in networks were probably titles (nobility or job titles in a byzantine empire like China, Russia or, to a lesser extent, the Ottoman Empire). "Chief Assistant to the Assistant Chief of Shipbuilding" is a unique node identifier (it doesn't identify a person, but then again phone numbers are reused too).


/dev/urandom is a major liability on any machine with low uptime booted for the first time from a widely-used image (i.e. a VPS), and on embedded systems which have few sources of entropy & do not (or cannot) save/restore it across boots. As a result there can be a much higher than pure-random chance of a collision in the "random" portion of a UUID.


I created something similar called "bronze". It tackles the problem of creating unique identifiers from a slightly different angle, while allowing high collision resistance:

https://github.com/AltusAero/bronze


Since I am a big fan of everything Base62, I started to work on a PHP implementation of KSUID.

https://github.com/tuupola/ksuid


Is 996238e1-28d1-4b53-b81b-beae25f8edde working for anyone?



At first I read this as "A Brief History of the IUD" - that's not the same thing at all :-)



