
Generating unique IDs: an easy and reliable way - antirez
http://antirez.com/news/99
======
geofft
Using SHA-1 in this way, with a random seed and a counter, is just building a
(perfectly sound) CSPRNG with, I believe, an 80-bit security level. If you
trust the source of the random seed, e.g. /dev/urandom, you may as well just
use /dev/urandom itself. If you don't, you're already in trouble.
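
A minimal sketch of that seed-plus-counter construction in Python (the names
and encoding details are mine, not the article's; the seed is assumed to come
from the OS CSPRNG):

    import hashlib, itertools, os

    seed = os.urandom(20)  # 160-bit secret seed from the OS CSPRNG

    def ids():
        # SHA1(seed || counter): the hash-with-counter generator under discussion
        for counter in itertools.count():
            yield hashlib.sha1(seed + str(counter).encode()).hexdigest()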

I'd like to see actual benchmarks of /dev/urandom vs. a userspace PRNG before
believing the argument that it's too slow. Some people (I seem to recall
Chromium does this) have done the benchmarks, _and_ use enough random data
that they implement their own userspace PRNG. Most applications don't need it.

It's also possible that getentropy() on OpenBSD and GNU/Linux recovers enough
of the performance. The system call path is extremely fast, and the kernel
isn't doing random number generation any faster than you are -- unless you're
doing a bad job of it.

And if you somehow need a userspace PRNG, the usual advice about not rolling
your own crypto unless you know what you're doing applies. (Especially for
database IDs, the risk of collisions should be considered a security problem,
ergo this should be considered crypto, until proven otherwise.) In this case,
using BLAKE2 instead of SHA-1 would get you a higher security level _and_
faster hashing.

Or, in tptacek's words: [http://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/](http://sockpuppet.org/blog/2014/02/25/safely-generate-random-numbers/)

~~~
MichaelGG
On Windows at least I had no trouble getting many hundreds of MB/s out of the
default CSPRNG (which uses AES-CTR or something like that) - even with small
calls. (A customer insisted that generating 16-byte IDs this way was too slow
for production apps and was shown to be incorrect.) It'd be surprising if
other OSes weren't around the same level.

~~~
geofft
On Linux it's easy to benchmark this via e.g. `dd if=/dev/urandom of=/dev/null
bs=1M`. On my somewhat old server (with Intel Core i5s from 2012), I'm getting
17 MB/s.

Supposing we blame slowness on syscall overhead: at a block size of 16 bytes
instead of one megabyte, it slows down to all of 11.6 MB/s, that is, some
700,000 128-bit IDs per second. (And since we're not using SHA-1, all 128 bits
are meaningful for collision resistance.) For comparison, Twitter gets about
1% that many tweets per second, globally.

If anyone has plans of scaling to 100 times the size of Twitter on a single
2012-era Core i5, and has figured everything else out already, I'll be happy
to audit your userspace PRNG then.

~~~
kristianp
What about the case of using Linux in a VM? Isn't there a possible problem
with the amount of entropy available?

~~~
MichaelGG
You only need 16-32 bytes to fully seed a CSPRNG real good. It's true that
some VM setups have had issues, yes, but that shouldn't generally be the case.
But a user mode generator isn't gonna do any better.

------
Tharkun
If you don't need consecutive IDs, use a UUID or a Random UUID if you can get
away with it.

If you do need consecutive IDs, use atomic operations on a counter (compare-
and-swap style increments, AtomicInteger in Java, etc.). Need to preserve them
after a restart? Reserve a block of size N; when you get close to N, stop
generating until you've persisted the reservation for the next N identifiers.
Atomic increments ensure you never hand out the same number twice, and
reserving a block before you persist ensures that you'll be safe even after a
power outage or whatever.
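
A rough sketch of that reservation scheme in Python (the `store` object is a
hypothetical durable key-value store; a lock stands in for the atomic
increment):

    import threading

    class BlockCounter:
        def __init__(self, store, n=10000):
            self.n = n
            self.store = store                    # hypothetical persistence
            self.next_id = store.get('reserved', 0)
            self.limit = self.next_id + n
            store.put('reserved', self.limit)     # persist before handing out
            self.lock = threading.Lock()

        def next(self):
            with self.lock:                       # atomic increment
                if self.next_id >= self.limit:    # block exhausted: reserve more
                    self.limit += self.n
                    self.store.put('reserved', self.limit)
                i = self.next_id
                self.next_id += 1
                return i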

Why bother with incredibly long random numbers? It can go wrong in so many
ways. The probability of duplicates being low _does not mean_ that you won't
get duplicates.

~~~
userbinator
_The probability of duplicates being low does not mean that you won't get
duplicates._

This is true for cryptographic hash functions, yet in practice no one has a
problem with using them for identifying files etc., because their output space
is so large as to make the chance of a collision sufficiently low. I'm almost
willing to bet that even for ones which are cryptographically insecure like
MD5, and for which colliding pairs can be easily deliberately created, there
hasn't been a truly accidental collision.

(Fun experiment to try: hash all the files you own, and then compare byte-by-
byte the ones with the same hash. Try a few different cryptographic hash
functions. If you do find accidental collisions, you've just experienced an
event far more improbable than winning the lottery; and if it's SHA1, SHA256,
or bigger, it's worth publicising this event.)
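
The experiment, sketched in Python (it reads whole files into memory, so only
a toy; swap in other hash functions as you like):

    import collections, hashlib, os

    by_hash = collections.defaultdict(list)
    for root, _, names in os.walk(os.path.expanduser('~')):
        for name in names:
            path = os.path.join(root, name)
            try:
                with open(path, 'rb') as f:
                    by_hash[hashlib.sha1(f.read()).hexdigest()].append(path)
            except OSError:
                pass

    # any bucket holding two files with different bytes is an accidental collision
    for digest, paths in by_hash.items():
        if len(paths) > 1:
            print(digest, paths)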

~~~
ricardobeat
I have seen an md5 collision happen at work. Having low probability doesn't
mean it won't happen, you just don't know when - people win the lottery around
the world daily.

~~~
deathanatos
Did you see an MD5 collision that happened randomly, or was someone
intentionally trying to create one?

_Intentionally_ creating a collision with MD5 is fairly easy; having one
happen at random would require creating something like 10^14 _billion_
hashes in order to have the same chance of a collision as winning the lottery.
People win the lottery, yes, but they don't happen across MD5 collisions by
accident.

~~~
ricardobeat
Random! I don't remember the details, but we spent a good half hour looking up
probabilities and marvelling at it. I think at the time nobody realized just
_how_ unlikely it was, should have saved the inputs :(

------
sandstrom
This is an interesting topic -- and I'm super-excited that Salvatore is
working on Disque!

Below are a few methods I've seen for UIDs.

Setup used by MongoDB[1]:

    
    
        [ time, machine_id, process_id, counter ]
    

Twitter Snowflake[2] uses something similar to the above.

Cassandra[3] uses something like this:

    
    
        [ time, version, variant, counter, node_id ]
    

Flickr have written about their ticket DB[4], and Instagram about their
setup[5].

---

[1] [https://github.com/mongodb/bson-ruby/blob/master/lib/bson/object_id.rb#L372](https://github.com/mongodb/bson-ruby/blob/master/lib/bson/object_id.rb#L372)

[2] [https://blog.twitter.com/2010/announcing-snowflake](https://blog.twitter.com/2010/announcing-snowflake)

[3] [http://wiki.apache.org/cassandra/TimeBaseUUIDNotes](http://wiki.apache.org/cassandra/TimeBaseUUIDNotes)

[4] [http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/](http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/)

[5] [http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram](http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram)

------
mrb
The author says SHA1_HMAC would make IDs "very hard" to predict compared to
SHA1, but this is not true at all. An attacker _can_ bruteforce the seed
either way. In practice the double-hash construction used internally by HMAC
makes bruteforcing roughly twice as slow, but that is a minor difference. If
you want to double the difficulty of bruteforcing the seed, a better way is to
just add one bit to it.

The easiest and most secure way to generate unique IDs is to just read 256
bits from getrandom() or /dev/urandom. It is so simple that most programmers
won't botch an attempt to implement it. 256-bit IDs give you a 128-bit
security level, which is a good default to aim for.
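
In Python, for instance, that is a one-liner:

    import os

    new_id = os.urandom(32).hex()  # 256 bits straight from the OS CSPRNG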

~~~
antirez
1) It's assumed that the seed you get from /dev/urandom is unguessable. This
is a very solid assumption to make. The reason why it's a good idea to use an
HMAC is entirely different: it guards against the following class of attacks,
and is not a matter of bruteforcing costs (bruteforcing is impossible because
of the search space, without some clue about the content or state of
/dev/urandom). For example, by mistake the code could be written as
SHA1(seed || counter). In this case you can "re-open" the hash function and
perform a "length extension attack": you append "0" at the end and have
successfully predicted one of the successive IDs. HMACs are constructs used to
make sure the secret and plaintext are mixed in a more cryptographically
robust way, basically (a sketch follows below).

2) If you don't trust /dev/urandom to seed the generator, you have no reason
to trust that /dev/urandom cannot be predicted by observing some part of the
old stream of bytes.

More information about the attack I described is available here:
[https://en.wikipedia.org/wiki/Length_extension_attack](https://en.wikipedia.org/wiki/Length_extension_attack)
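
For reference, the HMAC variant described above is only a couple of lines in
Python:

    import hashlib, hmac, os

    seed = os.urandom(20)

    def next_id(counter):
        # HMAC-SHA1 keyed with the seed, instead of a bare SHA1(seed || counter)
        return hmac.new(seed, str(counter).encode(), hashlib.sha1).hexdigest()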

~~~
mrb
The construct described in your blog, H(counter+seed), is _not_ vulnerable to
length extension attacks, so you gain nothing by using an HMAC here.

The only thing I thought you could be theoretically referring to when you
wrote "very hard to predict" was bruteforcing, so I talked about bruteforcing
on a theoretical level. I agree of course that a 160-bit seed from urandom can
be trusted and is not bruteforceable in practice.

~~~
antirez
To keep our discussion simple I used the length extension attack as an
example, but message||secret has other, more subtle issues (there is extensive
documentation about it, and many papers detailing the attacks; a more robust
construction that is still weak compared to HMAC is secret||message||secret).
So you do gain security by using an HMAC. Given all the above you may want to
update the sentence "The author says SHA1_HMAC would make IDs 'very hard' to
predict compared to SHA1, but this is not true at all", for the sake of other
readers not following the whole discussion.

~~~
mrb
Again, your H(counter+seed) solution for generating unique IDs is perfectly
fine. It is clean, neat and fast. There are no subtle issues whatsoever. An
HMAC will not give you any extra security in this case. (The issues with
H(message+secret) only exist when the attacker has some control over
"message", but he does not in your case. I am familiar with this; I do crypto
reviews for a living.)

I cannot update my comment (HN only lets you edit posts < 2 hours old), but I
would not want to update it anyway, since what I say is true :) However, I
think you should edit your blog post and mention that the order of
concatenation of the counter and seed is important. If you want, say something
along the lines of "HMAC is traditionally the best practice, but in this case
H(counter+seed) is fine".

------
jpollock
You could always just use UUIDs [1], or build your own isolation and
uniqueness.

Personally, instead of randomness, it's easier to just make your keys
multi-part, giving each process an independent portion of the address space:

{ip/mac, pid, thread_id, start_timestamp, counter}

128 + 32 + 32 + 32 + 32 bits.

Increment the per-thread counter; because it's per-thread, no atomic
operations are required.

Weakness: pid reuse within the resolution of start_timestamp.
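
A sketch of packing that tuple in Python (a shared counter here rather than a
per-thread one, for brevity; uuid.getnode() returns the 48-bit MAC, padded
into the 128-bit slot):

    import itertools, os, threading, time, uuid

    _counter = itertools.count()

    def multipart_id():
        parts = (uuid.getnode(), os.getpid(), threading.get_ident(),
                 int(time.time()), next(_counter))
        widths = (128, 32, 32, 32, 32)
        n = 0
        for part, width in zip(parts, widths):
            # mask each part to its declared width, then shift it in
            n = (n << width) | (part & ((1 << width) - 1))
        return n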

[1] [https://tools.ietf.org/html/rfc4122](https://tools.ietf.org/html/rfc4122)

~~~
antirez
This method leaks internal info (ip/mac), and is not as collision resistant as
the one I propose. Imagine you have an embedded system that after a restart
has its internal clock reset because of a clock battery failure. Pid and
thread_id are the same after every restart (or may be the same) because the OS
initialization steps are simple and deterministic. Now suppose your system is
in a "reboot loop" for some reason, rebooting every few minutes. Sooner or
later you'll generate a repeated ID. Given that the method is also more
complex, why use it?

------
mabbo
I've always been particular to pre-pending the current time (epoch, date,
whatever) to the start of a UUID of some sort.

Mathematically, you're actually more likely to generate collisions this way,
but we're talking 0.00001% in a million years vs 0.001% in a million years. It
won't happen either way, practically speaking.

And the benefit? It's usually easier to explain to other people. It's more
intuitively safe because there would need to be two colliding keys generated
_the same second_.
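
A sketch, assuming a plain UUIDv4 with the epoch second prepended:

    import time, uuid

    # sortable-ish by creation second, with 122 random bits behind it
    key = "%d-%s" % (int(time.time()), uuid.uuid4())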

~~~
twoodfin
However many bits you're using to store the timestamp would better guarantee
uniqueness if they were random. Whether it's intuitively safer or not, it's
empirically less safe.

This is exactly why the UUID spec evolved into a pure random value.

~~~
emcq
Agreed, but it can still be a useful technique if you ever need to sort your
values by time, in which case it's logically a tuple masquerading as a single
physical id.

------
e12e
Is this any better than reading 160 random bits from /dev/random and
iterating? If unpredictability is a requirement, it is - but it shouldn't do
any better against collisions?

It might even be a little worse than counting straight up?

~~~
antirez
The problem is that /dev/random, and /dev/urandom on certain systems (it may
even be just a symbolic link to /dev/random in the extreme case), depletes
quickly and blocks until the OS can collect new entropy. Even when
/dev/urandom can be used without ever blocking, as on Linux, it is slower than
just computing SHA1(secret||counter).

For the goal of just getting the initial seed, /dev/random is good enough. So
basically the method described works almost everywhere there is _some_, even
slow, source of unguessable bits.

~~~
e12e
I meant reading 160 bits, then counting up:

    
    
      start = int.from_bytes(open('/dev/random', 'rb').read(20), 'big')  # 160 bits
      next_id = (start + 1) % 2**160
    

Using a hash function would _seem_ to be more random, but you're effectively
walking a path that is no more random than enumerating all 160-bit numbers. I
don't think there's any guarantee that "walking" from 1 with sha1(1+inc) is
any less likely to converge on sha1(bigint+inc) than "walking" from 1 and
bigint by simply increasing each by 1.

On the other hand, if you had some sort of communication between nodes, it'd
be trivial to predict if two instances would overlap, if you just kept
incrementing "big random number 1" and "big random number 2".

ed: If sha1 is a perfect hash function (I don't know if it has been proven to
be that?), then sha1sum(seed+inc) shouldn't be _more_ likely to collide than
seed+inc % 2^160. But if you have one such "random" identifier, you can of
course much more easily guess previous/next identifiers than if you were using
a secure hash function.

ed2: I suppose it should be obvious that given a pair seed1+inc1 = seed2+inc2,
both plain iteration and sha1sum(seed+inc) will collide. I'm not sure if there
are cases where seed1+inc1, seed2+inc2 < 2^160 [ed3: and seed1+inc1 !=
seed2+inc2] can collide after running them through sha1.

------
revelation
This is building another CSPRNG from an already existing one. I'd suggest you
leave the CSPRNG part to the OS.

------
numbsafari
Why not just use version 4 UUIDs? In Python, at least, they are generated
directly from /dev/urandom.

~~~
antirez
Using /dev/urandom to generate every new ID is not optimal for a few reasons:

1) It's not better than the method I suggested, but it requires I/O for every
generated ID, so without gaining anything you lose performance. In fact you
can think of /dev/urandom as a more complex form of a hash function in
"counter mode". Given that for this application cryptographic properties are
not needed, only good distribution, the two methods are equivalent, but one is
very fast and the other is slower.

2) /dev/urandom on certain systems, notably OSX, may at some point block for
some time; it is not guaranteed to always be able to generate a stream of
random numbers. Certain operating systems may only produce a given amount of
/dev/urandom output from a given amount of entropy in the pool. If the seed
cannot be renewed with new entropy, the OS may block readers of /dev/urandom.
I think this implementation does not make much sense and /dev/urandom should
_always_ provide output as long as the original seed had enough entropy, but
implementations vary, and you cannot ignore OSX.

~~~
geofft
> _/dev/urandom in certain systems, notably OSX, at some point may block for
> some time_

This is not true. OS X has the same behavior as the BSDs: /dev/urandom is a
compatibility symlink to /dev/random, and neither device blocks.

[https://developer.apple.com/library/mac/documentation/Darwin...](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man4/random.4.html)

"Yarrow is a fairly resilient algorithm, and is believed to be resistant to
non-root. The quality of its output is however dependent on regular addition
of appropriate entropy. If the SecurityServer daemon fails for any reason,
output quality will suffer over time without any explicit indication from the
random device itself."

The only device node that's broken in this way is Linux's /dev/random.

~~~
antirez
I had to switch the implementation to use SHA1 in counter mode in the actual
code, because on OSX /dev/urandom blocked. Making it block did not require a
huge amount of data.

~~~
geofft
Huh. Well that's a fairly serious bug that someone should track down and
document widely. I'll see if I can reproduce it on my OS X machine.

------
cgtyoder
I am surprised this hasn't been brought up already: the easiest and most
reliable way to generate GUIDs is to use the existing APIs available on nearly
every platform out there. These blocks of code have been around for years if
not decades and have been well vetted and optimized. Rolling your own is just
a poor use of time and invites error. Unless you're doing this for homework or
have some free time and are curious about what's under the proverbial hood,
don't reinvent the wheel.

------
jasonwatkinspdx
Please just use 32 bytes from /dev/random by default. If it shows up in
profiling data, then think about inventing something else. It likely won't.

------
inglor
If you need the IDs to be unique per machine, just use a unique-per-machine
identifier - like the MAC address of the network card. Then use a counter that
increments by 1 for each generated ID. If you need to be resilient to
failures, just checkpoint every 10K (or 100K) numbers and add 10K to the
starting number when you reboot after a failure.
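
A sketch of that scheme in Python (load_checkpoint and save_checkpoint are
hypothetical persistence calls standing in for whatever durable storage you
have):

    import itertools, uuid

    CHECKPOINT = 10000
    node = uuid.getnode()                 # 48-bit MAC address
    # after a restart, skip a whole block past the last persisted value
    counter = itertools.count(load_checkpoint() + CHECKPOINT)

    def next_id():
        n = next(counter)
        if n % CHECKPOINT == 0:
            save_checkpoint(n)            # hypothetical: persist the checkpoint
        return (node << 64) | n           # MAC in the high bits, counter below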

No collisions ever. Seriously - I don't understand why people would overthink
it that much.

Not to mention the OP talks about MD5's and SHA1's security issues - well,
guess what? MD5 was never meant to be used for crypto - just validation.
Saying it's bad at being used for cryptography like that is like saying cats
are a bad wartime weapon - true but meaningless.

~~~
robrenaud
What if your MAC addresses aren't spread that far between each other?

~~~
inglor
What about it? Remember, we're no longer using a hash function on anything -
we don't want the numbers to be uniformly spread on any range - just unique.

------
jph
[https://github.com/sixarm/sixarm_ruby_xid](https://github.com/sixarm/sixarm_ruby_xid)

XID is our solution for when collisions do not matter much.

XID definition: a 128-bit completely random number, generated from a
non-blocking source of high-security randomness. Examples of sources: the OS
via /dev/urandom, the language's secure random class, etc.
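
In Python, for instance, that could be as simple as:

    import os

    xid = os.urandom(16).hex()  # 128 fully random bits, no version/variant bits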

XID is compatible with UUID-sized fields in Postgres and any other 128-bit
storage system, if you want to save them. The XID idea is quite easy to
implement on any system because it always calls good built-in random tools,
rather than trying to home-grow some kind of hash or SHA or masking.

The primary advantage is that naming something an XID tells programmers that
it's important to use it as intended. Calling a field an XID means: do not use
fewer bits, do not use a less secure randomness source, do not use a UUID
library that sets any of the bytes, etc.

If you would like to contribute XID code for any language, we welcome commits.

~~~
deathanatos
> for when collisions do not matter much.

Collisions, in that scheme, are effectively impossible. It's mostly the same
as a UUIDv4. (The link notes this, and perhaps those extra bits are nice; I'm
not sure why you'd use a poor PRNG to generate a UUIDv4, as that would
effectively be a broken implementation. If you're doing that, what would
prevent you from doing the same with XIDs?)

~~~
jph
Yes, you're correct. In practice we've seen many codebases that generate
UUIDv4 with a poor PRNG or a codebase-seeded PRNG. For example we've seen bash
`$RANDOM % 10` instead of /dev/urandom, Java Random instead of SecureRandom,
misinformed home-grown hashing attempts as described in the article, and a
constant seed at the top of the code.

The UUID spec allows these; the XID spec prohibits these.

You asked: what would prevent you from doing a broken implementation of an
XID? In a case such as that, an advantage of an XID is that a correct
implementation is easier to see during code review, easier to verify during
continuous integration, and easier to discuss with teams that aren't as
familiar with PRNGs, seeds, crypto, and UUID v1-v5. These are people-pragmatic
benefits, not information-theoretic benefits.

------
molyss
I would be very, very careful about using a PRNG to generate unique IDs.

If the ID indeed needs to be unique, I would stay as far away as possible from
PRNGs. That a PRNG is cyclic, with a cycle size larger than the number of
atoms in the universe, does not prevent you from running into repeated
sequences with a much higher probability, especially if you're reducing the
space of returned values one way or another.

To some extent these are even desirable properties of a good PRNG. If you're
using a PRNG whose cycle is larger than the space of distinct values it
returns, there _will_ be repeated sequences of length > 1.

If your application cannot handle duplicate IDs, then use an ever-incrementing
counter concatenated with a serverId; add a checksum for the good cause if you
want (it won't add any entropy or reduce the risk of collision).

If you really want to use a PRNG to generate unique IDs, use an
ever-incrementing counter + a serverId + a random number generator. You can
probably concatenate them any way you want, as long as the ever-incrementing
counter is stored in full, but the minute you truncate it, I wouldn't expect
any uniqueness anymore.

If you simply concatenate partial values coming from a PRNG, I would expect
the probability of a collision to be much higher than 1/cycle_length.

Using a hash function on a counter will still increase the risk of collision
compared to simply using the counter. And using a hash function on a PRNG will
also increase the risk of collision compared to the pure PRNG approach.

If your app is fine with duplicate IDs, then go ahead, use a PRNG. It's
completely OK. Alternatively, if you have a PRNG whose space of values is
bigger than the space of UUIDs you're generating, I would suspect you should
be fine. Same if you have a (non-pseudo) RNG at hand.

EDIT: added the hash function paragraph.

~~~
geofft
Or use a cryptographically-secure PRNG (CSPRNG), which doesn't have the flaw
of a cycle size that will be reached within our lifetimes. (Usually they
return 128-bit or more outputs, and 2^128 is a very large number; by
comparison, the standard Mersenne twister returns 32-bit outputs.)

For most applications, CSPRNGs are way, way more than fast enough. They're
also readily available and well-tested on all current OSes and languages.

------
seba_dos1
What benefits does this method offer? The key thing that makes it better than
what Mike was doing is to use /dev/urandom instead of Math.random, which isn't
always possible (for instance, in the browser). With V8's Math.random, it's
going to fail the same way.

~~~
geofft
In the browser you have window.crypto.getRandomValues(), which is a CSPRNG
just like /dev/urandom and available in every current browser other than Opera
Mini:
[http://caniuse.com/#feat=getrandomvalues](http://caniuse.com/#feat=getrandomvalues)

(But, in either the browser or in native code, you should use your platform's
CSPRNG directly instead of writing an ad-hoc CSPRNG on top of it.)

------
kungfooguru
Nothing new. Just use the right variant for your use case:
[https://en.wikipedia.org/wiki/Universally_unique_identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier)

------
jack9
Wasn't this method already popularized (basically) by Twitter's
simpleflake/snowflake?

[https://news.ycombinator.com/item?id=7764254](https://news.ycombinator.com/item?id=7764254)

~~~
geofft
Twitter Snowflake has the additional requirements of being mostly
chronologically sorted (events at least 1s apart, and preferably 10ms apart,
should have IDs in the right sort order) and using 64-bit IDs, which make the
problem harder.

If you're happy to use random, 128-bit IDs, /dev/urandom solves the problem of
unique IDs, and very easily fits the performance numbers Snowflake promises
without even trying (they want 10k IDs per second and 2ms latency;
/dev/urandom on my three-year-old machine does 700k IDs per second and 14µs
latency in read()).

