Hacker News
Generating unique IDs: an easy and reliable way (antirez.com)
74 points by antirez on Nov 21, 2015 | 64 comments

Using SHA-1 in this way, with a random seed and a counter, is just building a (perfectly sound) CSPRNG with, I believe, an 80-bit security level. If you trust the source of the random seed, e.g. /dev/urandom, you may as well just use /dev/urandom itself. If you don't, you're already in trouble.
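The construction being discussed can be sketched in a few lines of Python; this is an illustrative version (the function names are mine, not from the article):

```python
import hashlib
import os

def make_id_generator(seed=None):
    """Hash a random seed together with an incrementing counter.
    With a 160-bit seed from the OS, this behaves as a simple CSPRNG."""
    if seed is None:
        seed = os.urandom(20)  # 160-bit seed from the OS CSPRNG
    counter = 0

    def next_id():
        nonlocal counter
        digest = hashlib.sha1(seed + counter.to_bytes(8, "big")).hexdigest()
        counter += 1
        return digest

    return next_id

gen = make_id_generator()
assert gen() != gen()  # distinct counter values yield distinct digests
```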

I'd like to see actual benchmarks of /dev/urandom vs. a userspace PRNG before believing the argument that it's too slow. Some people (I seem to recall Chromium does this) have done the benchmarks, and use enough random data that they implement their own userspace PRNG. Most applications don't need it.

It's also possible that getentropy() on OpenBSD and GNU/Linux recovers enough of the performance. The system call path is extremely fast, and the kernel isn't doing random number generation any faster than you are -- unless you're doing a bad job of it.

And if you somehow need a userspace PRNG, the usual advice about not rolling your own crypto unless you know what you're doing applies. (Especially for database IDs, the risk of collisions should be considered a security problem, ergo this should be considered crypto, until proven otherwise.) In this case, using BLAKE2 instead of SHA-1 would get you a higher security level and faster hashing.

Or, in tptacek's words: http://sockpuppet.org/blog/2014/02/25/safely-generate-random...

On Windows at least I had no trouble getting many hundreds of MBs/sec out of the default CSPRNG (which uses AES CTR or something like that) - even with small calls. (A customer was insisting that generating 16-byte IDs this way was too slow for production apps and was shown to be incorrect.) It'd be surprising if other OSes weren't around the same level.

On Linux it's easy to benchmark this via e.g. `dd if=/dev/urandom of=/dev/null bs=1M`. On my somewhat old server (with Intel Core i5s from 2012), I'm getting 17 MB/s.

Supposing we blame slowness on syscall overhead: at a block size of 16 bytes instead of one megabyte, it slows down to all of 11.6 MB/s, that is, some 700,000 128-bit IDs per second. (And since we're not using SHA-1, all 128 bits are meaningful for collision resistance.) For comparison, Twitter gets about 1% that many tweets per second, globally.

If anyone has plans of scaling to 100 times the size of Twitter on a single 2012-era Core i5, and has figured everything else out already, I'll be happy to audit your userspace PRNG then.

What about the case of using Linux in a VM? Isn't there a possible problem with the amount of entropy available?

You only need 16-32 bytes to fully seed a CSPRNG real good. It's true that some VM setups have had issues, yes, but that shouldn't generally be the case. But a user mode generator isn't gonna do any better.

If you don't need consecutive IDs, use a UUID or a Random UUID if you can get away with it.

If you do need consecutive IDs, use atomic operations on a counter (compare-and-swap increments, AtomicInteger in Java, etc). Need to preserve them after a restart? Reserve a block of size N; when you get close to N, stop generating until you've persisted the reservation for the next N identifiers. Atomic increments will ensure you never hand out the same number twice, and reserving a block before you persist will ensure that you'll be safe even after a power outage or whatever.
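A minimal sketch of that block-reservation scheme; the `persist` hook stands in for a real durable write, and the names are illustrative:

```python
import threading

class BlockReservedCounter:
    """Atomically increment in memory; persist a reservation of the next
    block before handing out any of its numbers, so a crash can skip
    IDs but never repeat them."""
    def __init__(self, block_size=10_000, persist=lambda n: None):
        self._lock = threading.Lock()
        self._next = 0
        self._reserved_up_to = 0
        self._block = block_size
        self._persist = persist

    def next_id(self):
        with self._lock:
            if self._next >= self._reserved_up_to:
                self._reserved_up_to += self._block
                self._persist(self._reserved_up_to)  # must be durable first
            n = self._next
            self._next += 1
            return n

c = BlockReservedCounter(block_size=3)
assert [c.next_id() for _ in range(5)] == [0, 1, 2, 3, 4]
```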

Why bother with incredibly long random numbers? It can go wrong in so many ways. The probability of duplicates being low does not mean that you won't get duplicates.

This is a common misconception. The actual real-world effect is that you'll never, ever run into a collision, because it is much more probable that some other catastrophic event will happen in your computer, or that bits will flip at random in your processor, and so forth. Note that:

1) Economic transactions, signatures, files, are often verified using "incredibly long random numbers".

2) Rsync transfers files by hashing blocks with hash functions, and avoids transferring blocks whose checksums match.

You'll never see a collision except on purpose, in the case where the hash function has a vulnerability that allows an attacker to craft a plaintext P2 so that HASH(P2) == HASH(P1), or at least allows an attacker to select two messages P1 and P2 so that HASH(P1) == HASH(P2).

If you don't trust crypto hash functions to be collision resistant in the explained use case, where even the above attacks are irrelevant since we are just using the distribution properties, you can't trust most modern computing at all.

If you generate a sufficiently long random number then the probability of duplicates easily becomes lower than the probability of duplicates in a counter scheme due to a bug or hardware fault. And "sufficiently" long doesn't require a novel. 128 bits is sufficient for just about everything, and 256 bits is enough for everything.

Why bother with a complicated atomic ID operation (and that's going to be fun to scale, too) when a relatively short string of random bits will do the job?
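For reference, that "relatively short string of random bits" is essentially one call to the OS CSPRNG; a sketch:

```python
import os

def random_id(bits=128):
    """`bits` random bits from the OS CSPRNG, hex-encoded."""
    assert bits % 8 == 0
    return os.urandom(bits // 8).hex()

assert random_id() != random_id()   # ~2^-128 collision odds per pair
assert len(random_id(256)) == 64    # 256-bit variant, 64 hex chars
```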

All a UUID really is (at least v4) is an incredibly long random number. I'm not sure exactly what you're advocating here.

Agree. And if you only need unique (or increasing) IDs, and not consecutive ones, then it's easier. Long random/GUID ids are only needed when the id should not be guessable or predictable. Which is sometimes important, sometimes not.

The probability of duplicates being low does not mean that you won't get duplicates.

This is true for cryptographic hash functions, yet in practice no one has a problem with using them for identifying files etc., because their output space is so large as to make the chance of a collision sufficiently low. I'm almost willing to bet that even for ones which are cryptographically insecure like MD5, and for which colliding pairs can be easily deliberately created, there hasn't been a truly accidental collision.

(Fun experiment to try: hash all the files you own, and then compare byte-by-byte the ones with the same hash. Try a few different cryptographic hash functions. If you do find accidental collisions, you've just experienced an event far more improbable than winning the lottery; and if it's SHA1, SHA256, or bigger, it's worth publicising this event.)
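That experiment can be scripted roughly like this (a sketch; it keeps file contents in memory to do the byte-by-byte comparison, so point it at a modest directory):

```python
import hashlib
import os

def find_hash_collisions(root, algo="sha256"):
    """Hash every file under `root` and byte-compare files sharing a
    digest. Returns true collisions only (same digest, different
    bytes); identical files don't count."""
    by_digest = {}
    collisions = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except OSError:
                continue
            digest = hashlib.new(algo, data).hexdigest()
            for other_path, other_data in by_digest.get(digest, []):
                if other_data != data:  # same hash, different contents
                    collisions.append((other_path, path))
            by_digest.setdefault(digest, []).append((path, data))
    return collisions
```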

I have seen an md5 collision happen at work. Having low probability doesn't mean it won't happen, you just don't know when - people win the lottery around the world daily.

Did you see a MD5 collision that happened randomly, or was someone intentionally trying to create one?

Intentionally creating a collision with MD5 is fairly easy; having one happen at random would require creating something like 10 ^ 14 billion hashes in order to have the same chance of a collision as winning the lottery. People win the lottery, yes, but they don't happen across MD5 collisions by accident.

Random! I don't remember the details, but we spent a good half hour looking up probabilities and marvelling at it. I think at the time nobody realized just how unlikely it was, should have saved the inputs :(

That sounds like an interesting story. Consider yourself very lucky! Were the data being hashed related in some way, or user input (whereby someone could relatively easily feed in a collision pair)?

If I came across an MD5 collision by accident, I'd certainly try to save the input pair, because such events are exceedingly rare. Owning a pair of entirely naturally occurring data with colliding MD5s is itself article-worthy, I think.

This is an interesting topic -- and I'm super-excited that Salvatore is working on Disque!

Below are a few methods I've seen for UIDs.

Setup used by MongoDB[1]:

    [ time, machine_id, process_id, counter ]
Twitter Snowflake[2] uses something similar to the above.

Cassandra[3] uses something like this:

    [ time, version, variant, counter, node_id ]
Flickr describes their ticket-DB approach[4], and Instagram their own setup[5].


[1] https://github.com/mongodb/bson-ruby/blob/master/lib/bson/ob...

[2] https://blog.twitter.com/2010/announcing-snowflake

[3] http://wiki.apache.org/cassandra/TimeBaseUUIDNotes

[4] http://code.flickr.net/2010/02/08/ticket-servers-distributed...

[5] http://instagram-engineering.tumblr.com/post/10853187575/sha...
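For illustration, a MongoDB-ObjectId-style 12-byte ID following the [time, machine_id, process_id, counter] layout quoted above might look like this; the real BSON implementations differ in details such as how the machine ID is derived:

```python
import os
import struct
import threading
import time

_counter = int.from_bytes(os.urandom(3), "big")  # random 3-byte start
_machine = os.urandom(3)                         # stand-in for a hashed hostname
_lock = threading.Lock()

def object_id_like():
    """12 bytes: 4-byte time, 3-byte machine, 2-byte pid, 3-byte counter."""
    global _counter
    with _lock:
        _counter = (_counter + 1) % 0xFFFFFF
        count = _counter
    return (struct.pack(">I", int(time.time()))
            + _machine
            + struct.pack(">H", os.getpid() % 0xFFFF)
            + count.to_bytes(3, "big"))

assert len(object_id_like()) == 12
```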

The author says SHA1_HMAC would make IDs "very hard" to predict compared to SHA1, but this is not true at all. An attacker can bruteforce the seed either way. In practice the double hash construction used internally by HMAC will make bruteforcing ~twice slower, but that is a minor difference. If you want to double the difficulty of bruteforcing the seed, a better way is to just add 1 bit to it.

The easiest and most secure way to generate unique IDs is to just read 256 bits from getrandom() or /dev/urandom. It is so simple that most programmers won't botch an attempt to implement it. 256-bit IDs give you a 128-bit security level, which is a good default to aim for.

1. It's assumed that the seed you get from /dev/urandom is unguessable. This is a very solid assumption to make. The reason why it's a good idea to use an HMAC is entirely different: it's about the following classes of attacks, not a matter of bruteforcing costs (which is impossible anyway because of the search space, without having some clue about the content or state of /dev/urandom). For example, by mistake the code could be written as SHA1(seed || counter). In this case you can "re-open" the hash function and perform a "length extension attack": you add "0" at the end and have successfully predicted one of the successive IDs. HMACs are constructs used to make sure the secret and the plaintext are mixed in a more cryptographically robust way, basically.

2. If you don't trust /dev/urandom to seed the generator, you have no reason to trust that /dev/urandom cannot be predicted by observing some part of the old stream of bytes.

More information about the attack I described is available here: https://en.wikipedia.org/wiki/Length_extension_attack
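For readers comparing the two constructions, here is what each looks like in Python. This only shows the shapes being debated, not the length extension attack itself:

```python
import hashlib
import hmac
import os

seed = os.urandom(20)  # 160-bit unguessable seed, as in the article

def sha1_id(counter):
    # Plain hash of counter-then-seed: the H(counter || seed) shape
    # from the blog post.
    return hashlib.sha1(str(counter).encode() + seed).hexdigest()

def hmac_id(counter):
    # HMAC-SHA1 keyed with the seed: mixes the secret into the hash in
    # a way that is robust regardless of concatenation order.
    return hmac.new(seed, str(counter).encode(), hashlib.sha1).hexdigest()

assert sha1_id(1) != sha1_id(2)
assert hmac_id(1) != hmac_id(2)
```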

The construct described in your blog, H(counter+seed), is not vulnerable to length extension attacks, so you gain nothing by using an HMAC here.

The only thing I thought you could be theoretically referring to when you wrote "very hard to predict" was bruteforcing, so I talked about bruteforcing on a theoretical level. I agree of course that a 160-bit seed from urandom can be trusted and is not bruteforceable in practice.

To make things simpler for our discussion, I made as example the length extension attack, but message||secret has other more subtle issues (there is extensive documentation about it and many papers detailing the attacks. A more robust way that is still weak compared to HMACs is secret||message||secret). So you gain security by using an HMAC. Given all the above you may want to update the sentence "The author says SHA1_HMAC would make IDs "very hard" to predict compared to SHA1, but this is not true at all", for the sake of other readers not following the whole discussion.

Again, your H(counter+seed) solution for generating unique IDs is perfectly fine. This is clean, neat and fast. There are no subtle issues whatsoever. An HMAC will not give you any extra security in this case. (The issues with H(message+secret) only exist when the attacker has some control on "message", but he does not in your case --I am familiar with this, I do crypto reviews for a living.)

I cannot update my comment (HN only lets you edit posts < 2 hours old), but I would not want to update it anyway since what I say is true :) However I think you should edit your blog post and mention that the order of concatenation of the counter and seed is important. If you want say something along the lines of "HMAC is traditionally the best practice but in this case H(counter+seed) is fine".

You could always just use UUIDs [1], or generate your own isolation and uniqueness.

Personally, instead of randomness, I find it easier to just make the keys multi-part, giving each process an independent portion of the address space.

{ip/mac, pid, thread_id, start_timestamp, counter}

128 + 32 + 32 + 32 + 32.

Increment the per-thread counter; because it's per-thread, no atomic operations are required.

Weaknesses : pid re-use inside of the resolution of start_timestamp.

[1] https://tools.ietf.org/html/rfc4122
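A sketch of that multi-part layout in Python, with `uuid.getnode()` standing in for a real MAC/IP lookup (field widths follow the 128+32+32+32+32 scheme above):

```python
import os
import struct
import threading
import time
import uuid

_START = int(time.time())
_local = threading.local()

def multipart_id():
    """{mac, pid, thread_id, start_timestamp, counter} packed as
    128+32+32+32+32 bits."""
    counter = getattr(_local, "counter", 0) + 1
    _local.counter = counter  # per-thread, so no atomics needed
    return (uuid.getnode().to_bytes(16, "big")  # 48-bit MAC padded to 128 bits
            + struct.pack(">IIII",
                          os.getpid() & 0xFFFFFFFF,
                          threading.get_ident() & 0xFFFFFFFF,
                          _START & 0xFFFFFFFF,
                          counter & 0xFFFFFFFF))

assert len(multipart_id()) == 32  # 256 bits total
```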

This method leaks internal info (ip/mac), and is not as collision resistant as the one I propose. Imagine you have an embedded system that after a restart has the internal clock reset because of a clock battery failure. Pid and thread_id are the same after every restart (or may be the same) because the OS initialization steps are simple and deterministic. You have your system in a "reboot loop" for some reason. It reboots every few minutes. Sooner or later you'll generate a repeated ID. Given the method is also more complex, why use it?

With Version 1 UUIDs you expose the MAC/IP and it may also not be desirable if you don't want your ids to be predictable (which IIRC, may have been a requirement of the original article).

I've always been particular to pre-pending the current time (epoch, date, whatever) to the start of a UUID of some sort.

Mathematically, you're actually more likely to generate collisions this way, but we're talking 0.00001% in a million years vs 0.001% in a million years. It won't happen either way, practically speaking.

And the benefit? It's usually easier to explain to other people. It's more intuitively safe because there would need to be two colliding keys generated the same second.
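Prepending a timestamp is a one-liner; a sketch (zero-padding the epoch second keeps lexicographic and chronological order aligned):

```python
import time
import uuid

def time_prefixed_id():
    """Epoch second prepended to a random UUID, so IDs sort roughly
    by creation time."""
    return f"{int(time.time()):010d}-{uuid.uuid4()}"

a, b = time_prefixed_id(), time_prefixed_id()
assert a != b                        # UUID part differs even within a second
assert a.split("-", 1)[0].isdigit()  # sortable time prefix
```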

However many bits you're using to store the timestamp would better guarantee uniqueness if they were random. Whether it's intuitively safer or not, it's empirically less safe.

This is exactly why the UUID spec evolved into a pure random value.

Agreed, but it can still be a useful technique if you ever need to sort your values by time, in which case it's logically a tuple masquerading as a physical single id.

Yup! I came here to post exactly this. Like you said, mathematically it isn't perfect but in reality it's basically rock solid.

If space is a concern, this time stamp could also be converted into ascii or the like.

How would pre-pending the date to a UUID increase the chance of collision?

Is this any better than reading 160 random bits from /dev/random and iterating? If unpredictability is a requirement it is - but it shouldn't do any better vs collisions?

It might even be a little worse than counting straight up?

The problem is that /dev/random, and /dev/urandom on certain systems (it may be just a symbolic link to /dev/random in the extreme case), depletes quickly and blocks until the OS can collect new entropy. Even when /dev/urandom can be used without ever blocking, like on Linux, it is slower than just computing SHA1(secret||counter).

For the goal of just getting the initial seed, /dev/random is just fine. So basically the method described works almost everywhere there is even a slow source of unguessable bits.

I meant reading 160 bits, then counting up:

  start = 160bits from /dev/random
  next_id = start+1
Using a hash function would seem to be more random, but you're effectively walking a path that is no more random than enumerating all 160-bit numbers. I don't think there's any guarantee that "walking" from 1 with sha1(1+inc) is any less likely to converge on sha1(bigint+inc) than "walking" from 1 and bigint by simply increasing by 1.

On the other hand, if you had some sort of communication between nodes, it'd be trivial to predict if two instances would overlap, if you just kept incrementing "big random number 1" and "big random number 2".

ed: If sha1 is a perfect hash function (I don't know if it has been proven to be that?), then sha1sum(seed+inc) shouldn't be more likely to collide than seed+inc % 2^160. But if you have one such "random" identifier, you can of course much more easily guess previous/next identifiers than if using a secure hash function.

ed2: I suppose it should be obvious that given a pair of seed1+inc1=seed2+inc2, both plain iteration and sha1sum(seed+inc) will collide. I'm not sure if there are cases where (seed1+inc1, seed2+inc2)<2^160 [ed3: and seed1+inc1 != seed2+inc2 ] can collide after running them through sha1.
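For concreteness, the plain seed-then-increment scheme being compared against looks like this (a sketch):

```python
import os

class RandomStartCounter:
    """Seed a 160-bit integer from the OS CSPRNG once, then just
    increment modulo 2^160."""
    def __init__(self):
        self.value = int.from_bytes(os.urandom(20), "big")  # 160-bit start

    def next_id(self):
        self.value = (self.value + 1) % (1 << 160)
        return self.value

c = RandomStartCounter()
a, b = c.next_id(), c.next_id()
assert b == (a + 1) % (1 << 160)
```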

This is building another CSPRNG from an already existing one. I'd suggest you leave the CSPRNG part to the OS.

Why not just use version 4 UUIDs? In Python, at least, they are generated directly from /dev/urandom.

Using /dev/urandom to generate every new ID is not optimal for a few reasons:

1) It's not better than the method I suggested, but requires I/O for every generated ID. So without gaining anything you lose performance. In fact you could imagine /dev/urandom as a more complex form of a hash function in "counter mode". Given that for this application cryptographic properties are not needed, only good distribution, the two methods are identical, but one is very fast and one is slower.

2) /dev/urandom on certain systems, notably OSX, may at some point block for some time; it is not guaranteed to always be able to generate a stream of random numbers. Certain operating systems may generate only a given amount of stream from /dev/urandom, starting with a given amount of entropy they have in the pool. If the seed cannot be renewed with new entropy, the OS may block readers of /dev/urandom. I think this implementation does not make much sense and /dev/urandom should always provide output as long as the original seed had enough entropy, but implementations vary, and you cannot ignore OSX.

> /dev/urandom in certain systems, notably OSX, at some point may block for some time

This is not true. OS X has the same behavior as the BSDs: /dev/urandom is a compatibility symlink to /dev/random, and neither device blocks.


"Yarrow is a fairly resilient algorithm, and is believed to be resistant to non-root attackers. The quality of its output is however dependent on regular addition of appropriate entropy. If the SecurityServer daemon fails for any reason, output quality will suffer over time without any explicit indication from the random device itself."

The only device node that's broken in this way is Linux's /dev/random.

I had to switch the implementation to use SHA1 in counter mode in actual code, because on OSX /dev/urandom blocked. Making it block did not require a huge amount of data.

Huh. Well that's a fairly serious bug that someone should track down and document widely. I'll see if I can reproduce it on my OS X machine.

> It's not better than the method I suggested, but requires to use I/O for every generated ID. So without gaining anything you lose performances.

I highly doubt unique ID generation is going to be the bottleneck in your app. Even if it is, while your method is faster, it is not that much faster; for me in Python:

  In [15]: %timeit hashlib.sha1(b'100'+seed).digest()
  1000000 loops, best of 3: 632 ns per loop

  In [17]: %timeit dev_urand.read(20)
  100000 loops, best of 3: 1.08 µs per loop

  In [22]: %timeit uuid.uuid4()
  10000 loops, best of 3: 15.7 µs per loop

In certain systems (OSX, certain *BSD), /dev/urandom is an alias for /dev/random so the generator will completely block waiting for entropy.

On FreeBSD it is the other way around: /dev/random will never block once seeded. /dev/urandom as a symlink will give the same behavior.

I haven't tried OS X, but given that it also uses a PRNG (Yarrow) for /dev/random, I would expect it to behave similarly. Are you able to reproduce the described behavior?

Do you have measurements that show that urandom is slower than a SHA1 hash on Linux? I don't have a good sense for how fast/slow the read call is (obviously you can cache the fd across reads).

Also, I can't find references to OS X or any other OS blocking on urandom reads. Can you point me to one?

Can you explain why you 'cannot ignore OSX'? I thought you were talking about server-side generation here. I... might be a bit harsh here, but I don't think anyone would _want_ to run a server on OSX? Why can't you just ignore OSX for these use cases?

Couldn't you request N blocks of bits from every call to read() and amortize the syscall cost?

Primarily it's good to understand how these things work in the first place, which is why this blog post is interesting. `/dev/urandom` on most machines works in a very similar manner to what is described here.

In the browser, if you generate UUIDv4 using data from Math.random, it's going to fail the same way on V8 as it did for Mike.

However, the same applies to the method from this article.

This is what I do and it's worked well for me so far.

I am surprised this hasn't been brought up already: the easiest and most reliable way to generate GUIDs is to use existing APIs, available on near-every platform out there. These blocks of code have been around for years if not decades and have been well-vetted and optimized. Rolling your own is just a poor use of time and invites error. Unless you're doing this for homework or have some free time and are curious about what's under the proverbial hood, don't reinvent the wheel.

Please just use 32 bytes from /dev/random by default. If it shows up in profiling data, then think about inventing something else. It likely won't.

If you need the IDs to be unique per machine just use a unique-per-machine identifier - like the MAC address of the network card. Then use a counter that increments by 1 for each generated ID. If you need to be resilient to failures - just checkpoint every 10K (or 100K) numbers and add 10K to the starting number when you reboot after failure.

No collisions ever. Seriously - I don't understand why people would overthink it that much.

Not to mention OP talks about MD5 and SHA1's security issues - well guess what? MD5 was never meant to be used for crypto - just validation. Saying it's bad at being used for cryptography like that is like saying cats are a bad war time weapon - true but meaningless.
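A sketch of that MAC-plus-checkpointed-counter scheme; the `load`/`save` hooks stand in for durable storage and the names are illustrative:

```python
import uuid

class MacCounterIds:
    """Unique-per-machine prefix (the MAC via uuid.getnode()) plus a
    counter checkpointed every `block` IDs. On restart, resume from the
    last checkpoint plus one block, so no number is issued twice."""
    def __init__(self, block=10_000, load=lambda: 0, save=lambda n: None):
        self._prefix = uuid.getnode() << 64
        self._block = block
        self._save = save
        self._n = load() + block   # skip past anything possibly issued
        self._save(self._n)

    def next_id(self):
        self._n += 1
        if self._n % self._block == 0:
            self._save(self._n)    # checkpoint before returning this ID
        return self._prefix | self._n

ids = MacCounterIds(block=10)
assert ids.next_id() != ids.next_id()
```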

No collisions ever, unless your MAC addresses collide. This could happen due to using VMs, or if somebody reconfigures the MAC to a colliding value, or if it turns out that your NIC manufacturer was lazy/incompetent and gave you a non-unique MAC.

I agree, I don't understand why people would overthink it this much. Grab N bits from /dev/urandom (where N=128 is fine, and N=256 will quiet any paranoia) and call it done.

In the unlikely event that generating unique IDs becomes a performance bottleneck, then you can look into something slightly more complex. Reading N*M bytes (where M is maybe 10-100) in one call and caching the surplus should suffice.
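That batching idea can be sketched as a small buffered reader (names are mine):

```python
import os

class BufferedRandomIds:
    """Read id_bytes * batch random bytes per os.urandom() call and
    slice individual IDs out of the buffer, amortizing syscall cost."""
    def __init__(self, id_bytes=16, batch=100):
        self._id_bytes = id_bytes
        self._batch = batch
        self._buf = b""
        self._pos = 0

    def next_id(self):
        if self._pos >= len(self._buf):
            self._buf = os.urandom(self._id_bytes * self._batch)
            self._pos = 0
        chunk = self._buf[self._pos:self._pos + self._id_bytes]
        self._pos += self._id_bytes
        return chunk

gen = BufferedRandomIds()
assert len(gen.next_id()) == 16
```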

> MD5 was never meant to be used for crypto - just validation.

I'm not sure what you mean by this. Quoting from the author of MD5 in the RFC specifying it (https://tools.ietf.org/html/rfc1321):

"The MD5 algorithm is intended for digital signature applications, where a large file must be 'compressed' in a secure manner before being encrypted with a private (secret) key under a public-key cryptosystem such as RSA."

What if your MAC addresses aren't spread that far between each other?

What about it? Remember, we're no longer using a hash function on anything - we don't want the numbers to be uniformly spread on any range - just unique.


XID is our solution for when collisions do not matter much.

XID definition: a 128-bit completely random number, generated from a non-blocking source of high-security randomness. Example sources: the OS via /dev/urandom, the language's secure random class, etc.

XID is compatible with UUID-size-fields in Postgres and any other 128-bit storage system, if you want to save them. The XID idea is quite easy to implement on any system because it's always calling good built-in random tools, rather than trying to home-grow some kind of hash or SHA or masking.

The primary advantage is that naming something an XID tells programmers that it's important to use it as intended. Calling a field an XID means: do not use fewer bits, do not use a less-secure randomness source, do not use a UUID library that sets any of the bytes, etc.

If you would like to contribute XID code for any language, we welcome commits.
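In Python, a generator matching that description reduces to one call to the standard library's secure random source (a sketch):

```python
import secrets

def xid():
    """128 fully random bits from a CSPRNG-backed source; unlike
    UUIDv4, no version/variant bits are reserved."""
    return secrets.token_hex(16)

a = xid()
assert len(a) == 32  # 16 bytes, hex-encoded
assert a != xid()
```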

> for when collisions do not matter much.

Collisions, in that scheme, are effectively impossible. It's mostly the same as a UUIDv4. (The link notes this, and perhaps those extra bits are nice; I'm not sure why you'd use a poor PRNG to generate a UUIDv4, as that would effectively be a broken implementation. If you're doing that, what would prevent you from doing the same with XIDs?)

Yes you're correct. In practice we've seen many codebases that generate UUIDv4 by a poor PRNG or a codebase-seeded PRNG. For example we've seen bash `$RANDOM % 10` instead of /dev/urandom, Java Random instead of SecureRandom, or misinformed home-grown hashing attempts as described in the article, or a constant seed at the top of the code.

The UUID spec allows these; the XID spec prohibits these.

You asked: what would prevent you from doing a broken implementation of an XID? In a case such as that, an advantage of an XID is a correct implementation is easier to see during code review, easier to verify during continuous integration, and easier to discuss with teams that aren't as familiar with PRNGs, seeds, crypto, and UUID v1-v5. These are people-pragmatic benefits not information-theoretic benefits.

I would be very, very, careful about using PRNG to generate unique IDs.

If the ID indeed needs to be unique, I would stay as far away as possible from a PRNG. That a PRNG is cyclic with some cycle size larger than the number of atoms in the universe does not prevent you from running into repeated sequences with much higher probability, especially if you're reducing the space of returned values one way or another.

To some extent these are even desired properties of a good PRNG. If you're using a PRNG whose cycle is larger than the space of distinct values it returns, there _will_ be repeated sequences of length > 1.

If your application cannot handle duplicate IDs, then use an ever-incrementing counter concatenated with a serverId; add a checksum for good measure if you want (it won't add any entropy or reduce the risk of collision).

If you really want to use a PRNG to generate unique ids, use an ever incrementing counter + a serverId + a random number generator. You can probably concatenate them any way you want, as long as the ever incrementing counter is stored in full, but the minute you truncate that one, I wouldn't expect any uniqueness anymore.

If you simply concatenate partial values coming from a PRNG, I would expect the probability of a collision to be much higher than the cycle length would suggest.

Using a hash function on a counter will still increase the risk of collision compared to simply using the counter. And using a hash function on a PRNG will also increase the risk of collision compared to the pure PRNG approach.

If your app is fine with duplicate IDs, then go ahead, use a PRNG. It's completely OK. Alternatively, if you have a PRNG whose space of values is bigger than the space of UUIDs you're generating, I suspect you should be fine. Same if you have a (non-pseudo) RNG at hand.

EDIT : added the hash function paragraph.

Or use a cryptographically-secure PRNG (CSPRNG), which doesn't have the flaw of a cycle size that will be reached within our lifetimes. (Usually they return 128-bit or more outputs, and 2^128 is a very large number; by comparison, the standard Mersenne twister returns 32-bit outputs.)

For most applications, CSPRNGs are way, way more than fast enough. They're also readily available and well-tested on all current OSes and languages.

What benefits does this method offer? The key thing that makes it better than what Mike was doing is to use /dev/urandom instead of Math.random, which isn't always possible (for instance, in the browser). With V8's Math.random, it's going to fail the same way.

In the browser you have window.crypto.getRandomValues(), which is a CSPRNG just like /dev/urandom and available in every current browser other than Opera Mini: http://caniuse.com/#feat=getrandomvalues

(But, in either the browser or in native code, you should use your platform's CSPRNG directly instead of writing an ad-hoc CSPRNG on top of it.)

Nothing new. Just use the right variant for your use case: https://en.wikipedia.org/wiki/Universally_unique_identifier

Wasn't this method already popularized (basically) by twitter's simpleflake/snowflake?


Twitter Snowflake has the additional requirements of being mostly chronologically sorted (events at least 1s apart, and preferably 10ms apart, should have IDs in the right sort order) and using 64-bit IDs, which make the problem harder.

If you're happy to use random, 128-bit IDs, /dev/urandom solves the problem of unique IDs, and very easily fits the performance numbers Snowflake promises without even trying (they want 10k IDs per second and 2ms latency; /dev/urandom on my three-year-old machine does 700k IDs per second and 14µs latency in read()).
