
How long do GUIDs really need to be? - adamschwartz
https://eager.io/blog/how-long-does-an-id-need-to-be/?hn
======
xenadu02
This seems like useless optimization that has a high probability of biting you
in the ass later on.

Using the standard UUID generation facilities in your OS of choice there's
zero chance you get something wrong and screw yourself.

UUIDs are great because we can pretty much guarantee global uniqueness.
Acquire a company, decide to integrate with someone, need to merge a database,
etc? No problem, zero chance of record collisions no matter _what_ happens in
the future. (It also means zero chance of accidentally interpreting record
#58274 as type A when you meant type C).

Furthermore, a 1 in 1 million chance of collision is far too frequent for my
liking, but even if it were acceptable what happens when your service/product
becomes far more popular than you imagined and you blow through your initial
estimates?

~~~
bkirwi
This is absolutely right. I suspect some people do this because Twitter went
out of their way to get 64-bit ids[0] -- but Twitter did this in large part
because they made the mistake of baking that bit-length into the protocol back
when they were tiny, and since they don't control all clients, it's very
difficult to change.

For comparison, here's an eager.io URL:

[https://eager.io/app/ZYBle8qUhKFJ](https://eager.io/app/ZYBle8qUhKFJ)

And here's an equivalent with UUID in base-64:

[https://eager.io/app/b8tRS7h4TJ2Vt43Dp85v2A](https://eager.io/app/b8tRS7h4TJ2Vt43Dp85v2A)

It's a rare application for which those differences matter. (NB: I don't know
anything about eager.io... it might be important for them!)

[0] [https://blog.twitter.com/2010/announcing-snowflake](https://blog.twitter.com/2010/announcing-snowflake)
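
For reference, the 22-character form is what you get by base64-encoding the 16 raw bytes of a UUID and dropping the padding. A minimal sketch, assuming Node.js (for the global `Buffer`); the function names are made up for illustration:

```javascript
// Encode a canonical UUID string as unpadded, URL-safe base64 (22 characters).
// Assumes Node.js, where Buffer is available as a global.
function uuidToBase64Url(uuid) {
  const hex = uuid.replace(/-/g, "");
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) {
    throw new TypeError("expected a UUID with 32 hex digits");
  }
  return Buffer.from(hex, "hex")
    .toString("base64")
    .replace(/\+/g, "-") // switch to the URL-safe alphabet
    .replace(/\//g, "_")
    .replace(/=+$/, ""); // drop the trailing "==" padding
}

// Decode the short form back to the canonical 8-4-4-4-12 layout.
function base64UrlToUuid(id) {
  const b64 = id.replace(/-/g, "+").replace(/_/g, "/");
  const hex = Buffer.from(b64, "base64").toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}
```

Decoding the ID in the second URL this way gives `6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8`, which has the version-4 nibble in the expected place.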

~~~
zackbloom
It becomes a stylistic choice, but I can say we weren't happy with the
experience the longer ids gave users. It made most of the url a random hash of
characters, not anything significant. The URL is as much a part of the UX as
anything else.

~~~
al2o3cr
"The URL is as much a part of the UX as anything else."

Given what we've seen from the browser vendors lately (hiding bits of the
address bar, etc), I'd say there's a lot of momentum to disagree with that
statement.

~~~
xahrepap
I think it depends on who your customer is / what your product is. When I'm
developing against an api, I end up writing API calls in my terminal to test
things. I prefer cleaner URLs for APIs.

------
mumrah
If you want sequential IDs with no chance of collisions, read up on "flake"
IDs

* [https://blog.twitter.com/2010/announcing-snowflake](https://blog.twitter.com/2010/announcing-snowflake)

* [http://engineering.custommade.com/simpleflake-distributed-id...](http://engineering.custommade.com/simpleflake-distributed-id-generation-for-the-lazy/)

* [http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-...](http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang/)

* [https://github.com/mumrah/flake-java](https://github.com/mumrah/flake-java)
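
A rough sketch of the general idea, not the code from any of the projects linked above. The 41/10/12 bit split follows the layout usually described for Snowflake; the epoch, the names, and the injectable clock are my own assumptions. Uniqueness here depends on every generator having a distinct worker id and on the clock never running backwards:

```javascript
// Sketch of a Snowflake-style 63-bit ID:
//   41 bits: milliseconds since a custom epoch
//   10 bits: worker id (must be unique per generator)
//   12 bits: sequence number within one millisecond
const CUSTOM_EPOCH_MS = Date.UTC(2014, 0, 1); // hypothetical epoch
const WORKER_BITS = 10n;
const SEQUENCE_BITS = 12n;
const MAX_SEQUENCE = (1 << 12) - 1;

function createFlakeGenerator(workerId, now = Date.now) {
  if (!Number.isInteger(workerId) || workerId < 0 || workerId >= 1 << 10) {
    throw new RangeError("workerId must fit in 10 bits");
  }
  let lastMs = -1;
  let sequence = 0;

  return function nextId() {
    let ms = now();
    if (ms < lastMs) {
      throw new Error("clock moved backwards; refusing to issue an ID");
    }
    if (ms === lastMs) {
      sequence = (sequence + 1) & MAX_SEQUENCE;
      if (sequence === 0) {
        // 4096 IDs already issued this millisecond: wait for the next one.
        while ((ms = now()) <= lastMs) { /* spin */ }
      }
    } else {
      sequence = 0;
    }
    lastMs = ms;
    return (BigInt(ms - CUSTOM_EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
      | (BigInt(workerId) << SEQUENCE_BITS)
      | BigInt(sequence);
  };
}
```

IDs from one generator are strictly increasing, and IDs from different workers sort roughly by time, which is where the index locality comes from.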

~~~
thedufer
"no chance of collisions" can't be applicable to any finite-length ID. Low
chance, perhaps.

That said, this appears to be exactly the scheme MongoDB uses (except Mongo
IDs are 96 bits).

~~~
Goopplesoft
Actually it can, as long as they aren't randomly generated. UUID4s are random,
but lots of schemes, including MongoDB's, use a timestamp/MAC address as the
first n bits of the GUID. Not sure whether they have "no chance of collisions",
but it is very possible.

~~~
gohrt
"no chance" means making a claim that there will never be 2^N devices
generating IDs simultaneously (for N somewhere around 30-60)

2 machines can have the same MAC addresses (they are reprogrammable) and can
operate at the same microsecond.

"aren't randomly generated" is not a practical constraint in a high-speed
distributed system (where you don't have time for synchronization overhead).

~~~
Goopplesoft
It's an infinitesimally small chance. You'd need a MAC collision, generation
at the same time, and a UUID collision. Even at the largest scales
(Google et al.) the probability of that happening is effectively 0.

------
ColinWright
This is interesting, but it plucks from nowhere the equation for the chances
of collision. Here's my write-up of where that comes from:

[http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN_201...](http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN_20141031)

It's intended to be gentle, but a few people have said it's a bit quick in
places. I'd appreciate any feedback.

_Added in edit: I've submitted it as a separate item - it's been a few
months since it was discussed here._

[https://news.ycombinator.com/item?id=8540220](https://news.ycombinator.com/item?id=8540220)
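
For anyone who just wants to plug in numbers, the usual birthday approximation can be sketched as below. This is my own illustration of the standard formula P ≈ 1 - exp(-n(n-1)/(2d)), not code from the article, and it assumes IDs are drawn uniformly and independently:

```javascript
// Birthday approximation: chance of at least one collision when n values
// are drawn uniformly at random from d equally likely possibilities.
function collisionProbability(n, d) {
  const exponent = -(n * (n - 1)) / (2 * d);
  // expm1 keeps precision when the probability is extremely small.
  return -Math.expm1(exponent);
}

// The classic case: 23 people, 365 birthdays, about a 50% chance.
console.log(collisionProbability(23, 365).toFixed(3));

// 100 million fully random 122-bit values (the random part of a v4 UUID).
console.log(collisionProbability(1e8, Math.pow(2, 122)));
```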

------
perlgeek
You can generalize this idea to: if you are willing to exercise control over
some parameters that go into your UID, you need fewer random bits.

For example you could encode a number that identifies the host (like, the last
byte or last two bytes of the public IP address) and the process id of the
process generating the ID, and as a result you need less entropy for avoiding
collisions.

But you risk that somebody who doesn't know the UID algorithm screws things up.
For example if you use the last byte of the IP address, and some network
administrator decides to give each host an IPv6 net, the last byte of the IP
might very well be one for each host. (OK, that's a bit of a contrived
example; maybe PID namespaces are a better one?).

Or things outside of your control. Your company gets acquired by a much bigger
one, and for some reason they decide to use your system for the whole company.
Or for a huge customer. And now you're facing a factor 1000 more records than
you ever thought possible. Or a factor 10000. History is full of software
systems that have been used way beyond what they were planned for originally,
and of course nobody revisited all relevant design decisions.

Second point to consider: by making parts of your UIDs deterministic, you also
leak information. Like when a dataset was created, and on what host. Which
might be relevant for timing attacks, or other kinds of security nastiness
that you don't even think about right now.

~~~
jacques_chester
> _For example you could encode a number that identifies the host (like, the
> last byte or last two bytes of the public IP address) and the process id of
> the process generating the ID, and as a result you need less entropy for
> avoiding collisions._

UUID v1 does this by encoding the MAC address as part of the UUID. v3 and v5
use a scheme that encodes information from other namespaces, e.g. FQDNs.
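
To make the name-based variant concrete, here is a small sketch of v5 as I understand RFC 4122: SHA-1 over the namespace bytes followed by the name, truncated to 16 bytes, with the version and variant bits overwritten. It assumes Node.js (`crypto`, `Buffer`); `uuidV5` is just an illustrative name:

```javascript
const crypto = require("crypto");

// RFC 4122 predefined namespace for fully-qualified domain names.
const NAMESPACE_DNS = "6ba7b810-9dad-11d1-80b4-00c04fd430c8";

function uuidV5(name, namespace = NAMESPACE_DNS) {
  const namespaceBytes = Buffer.from(namespace.replace(/-/g, ""), "hex");
  const hash = crypto
    .createHash("sha1")
    .update(namespaceBytes)
    .update(name, "utf8")
    .digest();
  const bytes = hash.subarray(0, 16); // SHA-1 yields 20 bytes; keep the first 16
  bytes[6] = (bytes[6] & 0x0f) | 0x50; // set the version nibble to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set the RFC 4122 variant bits
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}
```

The same name in the same namespace always yields the same UUID, so no randomness or coordination is involved at all.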

------
jacques_chester
UUIDs have the advantage that they are well-understood and widely supported.
If you really need to shave a few bytes here and there, developing your own
coding scheme is useful. But for the most part, I don't see the win.

Locality is definitely important, but I must be missing something -- if lookup
by date, machine ID etc is required, why not create indices on those fields?
Why rely on coincidental locality?

------
stith
I had a similar issue with an app I'm writing now. I wanted short IDs so my
URLs wouldn't be fugly, but with a low chance of collisions. The solution I
went with (in javascript) is:

    
    
        // Make a "pretty unique" ID for this session.
        // Since RethinkDB doesn't have a way for us to guarantee a _short_
        // random unique value (short of trying the insert and regenerating if it
        // doesn't save), we'll just have to rely on the unlikeliness of a collision
        // with both this time-based ID and the title-based slug.
        // I'm sure this will never ever cause any problems ʘ‿ʘ
        var alphabet = "0123456789abcdefghijklmnopqrstuvwxyz";
        var id = new Date().getTime().toString()
            .match(/.{1,2}/g)
            .map(function (val) { return alphabet[val % alphabet.length]; })
            .join('');
        var slugPart = slug((this.title || "").substring(0, 60).toLowerCase());
        this.url_slug = id + "/" + slugPart;
    

That is, get a current timestamp (in milliseconds), and use every group of 2
digits to pull a letter out of an alphabet string. Then append
"/title-of-the-thing-made-url-safe". This results in strings that look like
"ee7zrm9/something-goes-here", which is then used as the primary key for the
document. It's not perfect by any means, but it gets the job done, and I think
appending the title makes collisions extremely rare.
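
One caveat worth spelling out: the pair-to-letter mapping folds 100 possible two-digit values onto 36 symbols, so distinct timestamps can produce the same prefix. The sketch below restates the mapping with the timestamp passed in, and shows a base-36 rendering as one possible lossless alternative (the function names are mine):

```javascript
const ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";

// The mapping described above: each group of two digits picks one symbol.
function pairId(ms) {
  return String(ms)
    .match(/.{1,2}/g)
    .map((pair) => ALPHABET[Number(pair) % ALPHABET.length])
    .join("");
}

// Alternative: write the same timestamp in base 36, which discards nothing.
function base36Id(ms) {
  return ms.toString(36);
}

console.log(pairId(1414077163589)); // "ee7zrm9"
console.log(pairId(1414437163589)); // also "ee7zrm9": "07" and "43" both map to "7"
console.log(base36Id(1414077163589) === base36Id(1414437163589)); // false
```

The base-36 form is one character longer (8 instead of 7 for current timestamps) but can be turned back into the original millisecond value.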

------
chris_va
As a warning, don't try to be too clever with your ID system. You can get
collision bugs that aren't usually visible in testing.

I had a catastrophic bug (à la private data going to the wrong person) from
96-bit (32 bit segment number, 64 bit random local docid) ID collisions when
the caching code decided it was going to use docid as the cache key without
realizing it was missing a bunch of bits.
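
A toy reconstruction of that class of bug, with made-up field names and values (the real system's structures are not shown here). It assumes the docid is only unique within its segment:

```javascript
// Hypothetical shape of the 96-bit ID: a 32-bit segment number plus a
// 64-bit docid that is only unique within its segment.
function fullCacheKey(id) {
  return id.segment + ":" + id.docid.toString(16);
}

// The buggy variant: drops the segment, so 32 of the 96 bits are ignored.
function truncatedCacheKey(id) {
  return id.docid.toString(16);
}

const aliceDoc = { segment: 1, docid: 0x9f3a5c0012345678n };
const bobDoc = { segment: 2, docid: 0x9f3a5c0012345678n };

console.log(truncatedCacheKey(aliceDoc) === truncatedCacheKey(bobDoc)); // true: same cache slot
console.log(fullCacheKey(aliceDoc) === fullCacheKey(bobDoc)); // false
```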

------
lobster_johnson
The article mentions "friendly" URLs as being a driving factor. That makes it
a presentation issue; i.e., it's part of the content, and it is wise to
consider whether you can derive it from the content.

For a blog post, for example, there is a title. The classic way of adding a
readable date to the URL is useful, if you're reading the URL in the first
place. This particular blog post uses that approach:
[https://eager.io/blog/how-long-does-an-id-need-to-be/](https://eager.io/blog/how-long-does-an-id-need-to-be/).

For other objects there might still be useful data. Instead of
/invitation/3jdix8jAJm you might have /invitation/myblog/bob@example.com/u7pW,
the last part being an auto-generated random component. The benefit is that
the ID becomes self-explanatory (self-describing) and very nice for tracing
through logs and the like. Of course, one has to be careful about not exposing
anything exploitable.

------
chacham15
There is a bit of possible misunderstanding/misinformation here: there is a
difference between a primary key and a rowid. The reason that I point out this
distinction is that rows are stored on disk by rowid, meaning that an insert
will still usually insert to the end. On the flip side, yes, the index will
have this problem, but the index shouldn't be very large relative to the table,
meaning that it shouldn't be as expensive as the OP is thinking. Note: often
the database will optimize and use the auto increment primary key as the
rowid, but it won't for a uuid primary key.

~~~
curun1r
When we tested on MySQL, the update to the index really was that bad, once you
reached a threshold of 10m-20m rows. And it got progressively worse to the
point where we were never able to get a table with 100m rows using type-4
UUIDs.

It's a moot issue anyways since type-1 UUIDs don't have this problem. They're
monotonic, which allows index updates to only need to append.

~~~
marcosdumay
It really shouldn't. B-trees have O(log n) insertion time, they shouldn't take
much longer when your dataset grows.

But then, it's MySQL. I was surprised by your experience, but not that much.

------
StavrosK
I didn't like long UUIDs either, so I wrote this small Python library to
re-encode them using a more varied character set:

[https://github.com/stochastic-technologies/shortuuid](https://github.com/stochastic-technologies/shortuuid)
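
The general technique, sketched in JavaScript for illustration. This is not the library's actual code or alphabet; the 62-symbol alphabet and function names here are assumptions. The UUID is treated as one 128-bit integer and written in a larger base:

```javascript
const ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
const BASE = BigInt(ALPHABET.length);
const WIDTH = 22; // 62^22 > 2^128, so 22 symbols always suffice

// Treat the UUID as a 128-bit integer and write it in base 62.
function shortenUuid(uuid) {
  let n = BigInt("0x" + uuid.replace(/-/g, ""));
  let out = "";
  while (n > 0n) {
    out = ALPHABET[Number(n % BASE)] + out;
    n /= BASE;
  }
  return out.padStart(WIDTH, ALPHABET[0]);
}

// Reverse the encoding and restore the canonical 8-4-4-4-12 layout.
function expandUuid(short) {
  let n = 0n;
  for (const ch of short) {
    const digit = ALPHABET.indexOf(ch);
    if (digit < 0) throw new TypeError("unexpected character: " + ch);
    n = n * BASE + BigInt(digit);
  }
  const hex = n.toString(16).padStart(32, "0");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}
```

That takes the usual 36-character form down to 22 characters without losing any bits.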

------
pbhjpbhj
> _When your IDs are random however, each new record has to be placed at a
> random position. This means a lot more thrashing of data to and from disk._
> //

Just use the GUID externally and have a sequential primary key as the
table index?

~~~
pan69
Exactly. I never expose my database IDs to the outside world but generate a
random token for each of my objects that has to be exposed. This token is then
translated to a database ID which is then used from then on. This allows for
the greatest flexibility.

~~~
zackbloom
That's not going to be the most efficient option, as it means all your indexes
need to include both the primary key, and this secondary id. I would encourage
you to think about just using the random token as the primary key.

------
amelius
The short answer: if the probability of a collision is smaller than the
probability of a meteor landing on your head, then you're fine.

------
gbrits
This may be a bit naive but are UUIDv4s completely random from 'head-to-tail'?
I mean, given the birthday paradox calculation, couldn't you just take the head
of the uuidv4 (i.e.: the first x characters) to arrive at the
collision/space-consumption tradeoff you want?

~~~
lmm
There are a few fixed bits for the version and variant, but other than that, yes.

> I mean, given the birthday paradox calculation, couldn't you just take head
> of the uuidv4 (i.e.: the first x characters) to arrive at the
> collision/space-consumption tradeoff you want?

Yes you could, but is that actually any easier than "generate x random bytes"?
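
For comparison, "generate x random bytes" is only a few lines. A minimal sketch, assuming Node.js and its `crypto.randomBytes`; `randomId` is a made-up name:

```javascript
const crypto = require("crypto");

// Generate numBytes of CSPRNG output and render it as URL-safe base64.
function randomId(numBytes) {
  return crypto
    .randomBytes(numBytes)
    .toString("base64")
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// 9 bytes = 72 random bits, which encodes to exactly 12 characters.
console.log(randomId(9));
```

Every bit is random here, so there are no version or variant positions to step around when choosing a length.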

------
lectrick
Twitter came up with a different scheme called Snowflake
([https://github.com/twitter/snowflake](https://github.com/twitter/snowflake)).

------
whitten
I appreciate this article. The mention of the Birthday problem made the
calculation reasonable, and the trick to ensure time-locality as a fixed bit
pattern was enlightening.

~~~
dokimorning
It seems if he's planning to make 100 million records with a probability of 1
in 1 million of a collision, he's going to end up with ~100 collisions. I
think I would plan to make the collision probability of N records at least <
1/N. Plus, one million records is really not so big.

I do like the point that UUIDs are generally stored as strings whereas they
represent a 122 bit value. Seems encoding the UUIDs as binary would offer much
greater efficiency in storage space as well as indexes.
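
A back-of-the-envelope sketch for sizing, using the birthday approximation p ≈ n² / 2^(bits+1) and assuming uniformly random IDs. One caveat on my own numbers above: if the article's 1-in-a-million figure is the chance of any collision across the whole data set (I'm not certain that is how it is meant), then the expected number of collisions is about that same small value rather than ~100.

```javascript
// Expected number of colliding pairs among n uniformly random IDs of `bits` bits.
function expectedCollisions(n, bits) {
  return (n * (n - 1)) / 2 / Math.pow(2, bits);
}

// Smallest bit length that keeps the chance of any collision among n IDs
// below p, using the approximation p ≈ n^2 / 2^(bits + 1).
function bitsNeeded(n, p) {
  return Math.ceil(Math.log2((n * n) / (2 * p)));
}

console.log(bitsNeeded(1e8, 1e-6)); // 73
console.log(bitsNeeded(1e8, 1e-8)); // 79, for the stricter "below 1/N" target
console.log(expectedCollisions(1e8, 73) < 1e-6); // true
```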

~~~
lectrick
I would never store a UUID as a string for exactly that reason. Either a
binary or a native UUID datatype. Any developer IMHO who stores a UUID as a
string is suspect.

~~~
dozenal
I've always seen UUIDs stored as strings. What's the suspect part? Favoring
human readability over optimal machine storage utilization?

~~~
bartonfink
Human readability is a concern of a client and is independent of the storage
mechanism. Every modern database stores integers in binary format, for
instance, but clients display them as a decimal string of characters as
opposed to a binary or hexadecimal representation. Timestamps are similarly
stored in binary fashion, but often formatted for human readability in the
client.

~~~
lmm
If you're using an SQL database you're presumably doing so so that humans can
run ad-hoc reports (otherwise there are better datastores). So the UX they get
for that is important. And in mysql (yes, not the best choice these days, but
a reasonable one when the decision was made), if you store UUIDs as binary
(there's no native UUID type) then you do not provide a good UX.

~~~
lectrick
MySQL itself should not be the main interface; there should be some kind of
model layer on top of that which does the translation of things like that,
such as (for example) ActiveRecord if you were on a Rails stack. That gets you
the best of both worlds.

OR you could store the uuid twice, once "natively" and once as a computed
column. Searches on the native field would be faster vs. an index on a string
column.

> you're presumably doing so so that humans can run ad-hoc reports (otherwise
> there are better datastores)

oh dear, someone has drunk the NoSQL punch... Storing data relationally is NOT
something only suitable for ad-hoc queries by end-users! :O

------
SeoxyS
I do something very similar: 64-bit IDs start with a timestamp with a custom
epoch, and fill in the rest with random data. I store these in bigints in
Postgres.
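
Roughly like this, as a sketch rather than my production code. The epoch and the 41/22 bit split are placeholders picked for the example, and it assumes Node.js for `crypto.randomBytes`:

```javascript
const crypto = require("crypto");

const CUSTOM_EPOCH_MS = Date.UTC(2014, 0, 1); // placeholder epoch
const RANDOM_BITS = 22n; // assumption: 41 time bits + 22 random bits = 63 bits

// Returns a BigInt that fits in a signed 64-bit column while the elapsed
// time stays under 2^41 ms (about 69 years after the epoch).
function timePrefixedId(nowMs = Date.now()) {
  const elapsed = BigInt(nowMs - CUSTOM_EPOCH_MS);
  // Read 3 random bytes (24 bits) and keep only the low RANDOM_BITS bits.
  const randomPart =
    BigInt(crypto.randomBytes(3).readUIntBE(0, 3)) & ((1n << RANDOM_BITS) - 1n);
  return (elapsed << RANDOM_BITS) | randomPart;
}
```

IDs created in later milliseconds always sort after earlier ones; within the same millisecond the order is random and uniqueness rests on the 22 random bits.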

------
JoshTheGeek
Couldn't you generate a UUID and then check the database that it is unique
before using it? You could then repeat until you got a unique identifier.
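
Something like the loop below, where a `Set` stands in for the database's unique index and `generate` is whatever ID function you use (both are placeholders for the sketch). In a real database the check and the insert would need to be one atomic step, for example a unique constraint plus a retry on violation:

```javascript
// Try candidate IDs until one is not already taken, up to maxAttempts.
// `existing` is a Set standing in for a unique index.
function insertWithRetry(existing, generate, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const id = generate();
    if (!existing.has(id)) {
      existing.add(id);
      return id;
    }
  }
  throw new Error("no free ID found after " + maxAttempts + " attempts");
}
```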

~~~
jacques_chester
UUIDs have the advantage that no coordination is required. With a database you
will create coordination problems as you scale up your system.

I have either a high or total assurance that any UUID I self-assign has never
been used anywhere, ever, for anything else, by anyone else, in any system.

In fact, the odds that I will have a collision with another UUID are lower
than a cosmic ray striking a computer at the moment when it is performing the
uniqueness check.

------
erik14th
I don't get GUIDs, couldn't you just prefix the sequential IDs for each node
producing IDs?

~~~
jacques_chester
UUIDs come in different versions. v1 UUIDs are built out of MAC addresses and
a timestamp, which is similar to what you're proposing.

Unfortunately, or fortunately, depending on your requirements, it is easily
predictable.

------
mark-r
This seems overly complicated. Why not generate an ID out of two numbers, one
a server or thread ID and another that auto-increments? Assigning a unique
number to each entity that can generate IDs seems like a tractable problem,
and the odds of generating a collision can be reduced to zero if you negotiate
a new number when the counter wraps around.

~~~
JoeAltmaier
UUID generation collision is essentially zero too. Any scheme you can contrive
has the vulnerability that somebody else may choose the same roots as you,
including yourself if you restart unexpectedly.

~~~
mark-r
Certainly I agree that UUIDs aren't going to collide. But UUIDs were created
to solve a problem most people don't have - how to ensure unique IDs when
there is absolutely no communication between the parties generating the IDs.
The article was suggesting one alternative to the UUID, and I suggested
another.

~~~
JoeAltmaier
Sure but why? We have UUIDs and good generators. Run with it.

UUIDs solve a problem many, many people do have - identifying things over time
and space - without having to even consider communication/coordination. That's
a very powerful property.

~~~
mark-r
Again, I'm not arguing that UUIDs aren't a powerful tool. But the article
points out two problems with them - they're inconveniently long when you
expose them to the user, and they lead to a fragmented database because
they're not generated in order. If those problems matter to you, then you look
for an alternative. If they don't, then the UUID is the go-to solution.

~~~
JoeAltmaier
Very true. Some help: Inside a self-consistent database you might index the
UUIDs and use the index internally to link them in other places. Or something
like that. It's work. If more use were made of UUIDs, databases would handle
them better and that's probably going to happen.

------
Kenji
He assumes that the UUIDs are independent (selected independently uniformly at
random). They are not. Trust me, it's very, very hard to get this kind of
'pure' randomness on computers. No matter what randomness you use, you will
have some correlation and your odds to get different UUIDs drop rapidly.

~~~
lmm
On the contrary, there is such a thing as a CSPRNG, and many of them exist.

