
Flake: A Decentralized, K-Ordered Unique ID Generator in Erlang - d2fn
http://blog.boundary.com/2012/01/12/Flake-A-decentralized-k-ordered-id-generation-service-in-Erlang.html
======
mjb
A couple of questions about this:

- Why [time, node id, seq] and not [time, seq, node id]? That would improve
ordering if you have approximately equal load on each box.

- Isn't a 16-bit seq number overkill? Handing some of those bits to the
unique ID would have made unique ID assignment easier. Duplicate MACs can and
do exist (especially if you buy a lot of hardware from the same vendors).

- The quality of the ordering is going to be restricted by the quality of
time synchronization within the cluster. Relying on NTP for this is OK, but
experience suggests that a secondary monitoring system will be needed.
Similarly, relying on monotonic time needs some care in system administration,
care that could potentially be avoided with a different unique host ID
assignment scheme.
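
To make the two layouts in the first question concrete, here is a minimal Python sketch. It is not Flake's code: the field widths (64-bit millisecond timestamp, 48-bit node id, 16-bit seq) are an assumption chosen to add up to 128 bits, and both function names are made up for illustration.

```python
# Assumed field widths; 64 + 48 + 16 = 128 bits in total.
TIME_BITS, NODE_BITS, SEQ_BITS = 64, 48, 16


def _check(value, bits, name):
    """Reject values that would overflow into a neighbouring field."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")


def pack_time_node_seq(ms, node, seq):
    """Layout [time, node id, seq]: seq occupies the lowest bits."""
    _check(ms, TIME_BITS, "ms")
    _check(node, NODE_BITS, "node")
    _check(seq, SEQ_BITS, "seq")
    return (ms << (NODE_BITS + SEQ_BITS)) | (node << SEQ_BITS) | seq


def pack_time_seq_node(ms, node, seq):
    """Layout [time, seq, node id]: node id occupies the lowest bits."""
    _check(ms, TIME_BITS, "ms")
    _check(node, NODE_BITS, "node")
    _check(seq, SEQ_BITS, "seq")
    return (ms << (SEQ_BITS + NODE_BITS)) | (seq << NODE_BITS) | node
```

Within one millisecond the first layout groups IDs by node and the second by sequence number. Across milliseconds both sort by time.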

~~~
rdtsc
> Why [time, node id, seq] and not [time, seq, node id]

Would you rely on sub-millisecond synchronization between nodes _and_ an
almost exact load amount?

In other words for a particular millisecond seq order is only relevant on that
particular node, so node should come first.

~~~
mjb
> In other words for a particular millisecond seq order is only relevant on
> that particular node, so node should come first.

It still seems to me that T_0,0,A (issued by node A at T_0 with seq number 0)
would tend to come before T_0,1,B so putting the sequence number first does
add some value. On the other hand, the ordering T_0,A,0 < T_0,A,1 < T_0,B,0 <
T_0,B,1 is arbitrary.

> Would you rely on sub-millisecond synchronization between nodes _and_ an
> almost exact load amount?

Not rely on. Absolutely not. Maybe it's better to go further, and make it more
explicit that the IDs are K ordered. Say you could synchronize your host
clocks reliably to a maximum delta of 512 ms, then you could choose a scheme
like:

[ milliseconds >> 9, host id, milliseconds & 0x1ff, seq id ]

The value here is that you make it much harder for consumers of the IDs to make
incorrect assumptions about the precision of their ordering. Basically, you
take away the temptation to make statements about their ordering with false
precision, by making a simple sort provide a meaningful ordering only within
the real available precision.
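
A rough Python sketch of that layout, for illustration only. The 9-bit shift and the 0x1ff mask come from the scheme above; the 48-bit host id and 16-bit seq widths, and the name pack_k_ordered, are assumptions.

```python
BUCKET_SHIFT = 9                         # 2**9 = 512 ms per coarse time bucket
OFFSET_MASK = (1 << BUCKET_SHIFT) - 1    # 0x1ff, the low 9 bits of the timestamp
HOST_BITS, SEQ_BITS = 48, 16             # assumed widths, not given in the thread


def pack_k_ordered(ms, host, seq):
    """Pack [ms >> 9, host id, ms & 0x1ff, seq id] into one integer."""
    if not 0 <= host < (1 << HOST_BITS):
        raise ValueError("host id does not fit in HOST_BITS")
    if not 0 <= seq < (1 << SEQ_BITS):
        raise ValueError("seq does not fit in SEQ_BITS")
    value = ms >> BUCKET_SHIFT                          # coarse time bucket first
    value = (value << HOST_BITS) | host                 # then the host id
    value = (value << BUCKET_SHIFT) | (ms & OFFSET_MASK)  # fine time within bucket
    value = (value << SEQ_BITS) | seq                   # finally the sequence number
    return value
```

Two IDs from different hosts in the same 512 ms bucket sort by host id, regardless of which was issued first. The fine-grained time only orders IDs from the same host.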

~~~
rdtsc
> On the other hand, the ordering T_0,A,0 < T_0,A,1 < T_0,B,0 < T_0,B,1 is
> arbitrary.

Still don't see it. We are already assuming that the millisecond part is the
same, T_0. So seq depends strictly on the count of IDs previously issued by
that node only (would you agree with that?). That is why I said it doesn't
make sense to compare a seq number from A (0) with a seq number from B (1)
before considering the identity of A and B. You would effectively be ordering
by the relative load across all your machines at that particular time, and at
that time scale the load is determined by things like disk access, cache
state and so on.

EDIT:

In addition, the ordering T_0,A,0 < T_0,A,1 < T_0,B,0 < T_0,B,1 is at least
stable, since the MAC addresses of machines A and B don't change. For any
particular millisecond you get all the results for a particular machine
together, sorted by sequence number, so you even get some kind of relative
load measurement.

Let's look at a longer example:

T_0,A,0

T_0,A,1

T_0,B,0

T_0,B,1

T_0,B,2

T_0,B,3

T_0,C,0

T_0,C,1

T_0,C,2

T_0,C,3

T_0,D,0

I think that is a much better order than

T_0,0,A

T_0,0,B

T_0,0,C

T_0,0,D

T_0,1,A

T_0,1,B

T_0,1,C

T_0,2,B

T_0,2,C

T_0,3,B

T_0,3,C
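
For anyone who wants to reproduce the two listings, a small Python sketch. The per-node counts are read off the second listing (A issued 2 IDs, B and C issued 4, D issued 1); the variable names are only illustrative.

```python
# Number of IDs each node issued during the same millisecond T_0.
issued = {"A": 2, "B": 4, "C": 4, "D": 1}

# Every (node, seq) pair generated in that millisecond.
ids = [(node, seq) for node, count in issued.items() for seq in range(count)]

# [time, node id, seq]: group by node, then by sequence number.
node_first = sorted(ids, key=lambda pair: (pair[0], pair[1]))

# [time, seq, node id]: group by sequence number, then by node.
seq_first = sorted(ids, key=lambda pair: (pair[1], pair[0]))

for node, seq in node_first:
    print(f"T_0,{node},{seq}")
print()
for node, seq in seq_first:
    print(f"T_0,{seq},{node}")
```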

------
rarrrrrr
There's also Instagram's implementation in PostgreSQL:

<http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram>

------
d2fn
Thanks for the interest and the feedback. I've updated the readme and added a
roadmap as well as a faq after getting some good input over the last couple of
days.

<https://github.com/boundary/flake>

------
google-1
This is quite similar to the BSON ObjectId specification used by MongoDB:

<http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification>

------
alpb
For those interested, there is also Snowflake by Twitter on GitHub
<https://github.com/twitter/snowflake>

~~~
thezilch
Snowflake is referenced in the article "Credits" with a distinction between
the products -- though, the product name is a bit dubious. _It [Flake] differs
primarily in that we allow ourselves a wider address space in which to fit ids
which means we don’t have to think about timestamp truncation or coordination
of worker ids._

~~~
foobarbazetc
Yeah. But the hard part is the 64-bit representation. If you have 128 bits,
the ID can no longer be a standard int. And you can just use UUIDs if you
don't care about k-ordering.

------
rdtsc
Why not use uuid1 and then reorder its bits?

------
Raphael
Is it Erlang day already?

------
coolrhymes
As mjb said, MAC addresses can be cloned, so I'm not sure what that buys you.
We are using uuid1, which does more or less what these guys are doing...

~~~
jpitz
UUID1s don't sort by their generation time.
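
A quick way to see this with Python's standard uuid module. The sketch below builds version 1 UUIDs from chosen timestamps so the result is deterministic; make_uuid1 and time_ordered_key are hypothetical helpers and the node value is arbitrary. The low 32 bits of the timestamp come first in the UUID, so a plain sort follows those bits rather than the full timestamp.

```python
import uuid


def make_uuid1(timestamp, clock_seq=0, node=0x0242AC110002):
    """Build a version 1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def time_ordered_key(u):
    """Sort key with the fields reordered: full timestamp first."""
    return (u.time, u.clock_seq, u.node)


earlier = make_uuid1(0x1FFFFFFFF)   # time_low = 0xFFFFFFFF, time_mid = 1
later = make_uuid1(0x200000000)     # time_low = 0x00000000, time_mid = 2

naive = sorted([earlier, later])                          # compares raw 128-bit value
by_time = sorted([later, earlier], key=time_ordered_key)  # compares timestamp first
```

Here naive puts later ahead of earlier, while by_time restores generation order. That is the kind of bit reordering suggested elsewhere in this thread.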

