

Simpleflake - Distributed ID generation for the lazy - makmanalp
http://engineering.custommade.com/simpleflake-distributed-id-generation-for-the-lazy/

======
Canausa
Though this is a great idea (similar to one I created in the past), there
is a major problem I think users should be aware of: concurrency.

An ID-generating system is usually used when data needs to be shared across
machines or when asking a database for an ID is not an option. It is not
typically used in small environments.

I think this solution has a ceiling of effectiveness that is just too low
for the problem it is trying to solve. A pseudo-random number generator
(Python's random module) is too vulnerable to producing numbers that aren't
really "random", so many collisions can occur at high concurrency.

Most pseudo-random number generators are seeded with the time on the host
system. As stated in the docs, the best way to use simpleflake is to set up
NTP so all the servers' times are in sync, which means the random module on
all the servers/processes could end up seeded with the same time (like after
a code push). If all the servers get seeded with the same time and the
objects being operated on take a uniform amount of time, the probability of
collisions goes up, slowing down all the processes that have to regenerate
IDs when they collide. And as new IDs are generated on retry, you lose the
time ordering that you originally wanted from including the time in the ID.

I do not mean this to be insulting. I am speaking from experience, having
debugged a problem caused by the ID generator at the company I formerly
worked for. We used a similar generation technique to identify messages
passing through the system, but we included more metadata in our IDs, such
as the sender and recipient of the message and even the server name. Even
with all of that, when we ran 150 processes on a 4-core server, there was a
0.01% collision rate, but that jumped significantly once the servers were
upgraded to 16-core machines, and at that point in the process there was no
way to regenerate a new ID; the data was simply dropped out of existence.

~~~
makmanalp
Hey, this is pretty insightful and not insulting at all - I always love
hearing about the experiences of others.

The RNG being used is not the random.random() Mersenne Twister one, it's the
OS-provided cryptographically secure one:

[https://github.com/SawdustSoftware/simpleflake/blob/master/s...](https://github.com/SawdustSoftware/simpleflake/blob/master/simpleflake.py#L48)

But in any case, the reason you need NTPd sync is not to seed the RNG, but
because the IDs themselves are prefixed with millisecond timestamps. This is
to ensure that you're not using the same sub-keyspace for that millisecond
over and over again from multiple processes.
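To make the layout concrete, here's a minimal sketch of that kind of scheme,
assuming a 41-bit millisecond timestamp in the high bits and 23 random bits
in the low bits (the exact split and the epoch here are illustrative
assumptions; see the linked source for simpleflake's actual constants):

```python
import time
import random

# OS-provided CSPRNG (backed by os.urandom), not the Mersenne
# Twister behind random.random().
_rng = random.SystemRandom()

RANDOM_BITS = 23  # assumed split: 41 bits of timestamp + 23 random bits
EPOCH_MS = 0      # illustrative epoch offset, in milliseconds

def make_id():
    # Millisecond timestamp in the high bits keeps IDs roughly
    # time-sorted; random low bits avoid per-process coordination.
    timestamp_ms = int(time.time() * 1000) - EPOCH_MS
    return (timestamp_ms << RANDOM_BITS) | _rng.getrandbits(RANDOM_BITS)
```

With this layout, two processes in the same millisecond only collide if they
also draw the same 23 random bits, which is why clock sync matters more than
RNG seeding.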

If you're seeing that many collisions, obviously, it's time to change methods
like you mention. Data being dropped out of existence sounds like an
application code issue, and not an inherent problem with your IDing scheme.
That sounds like a place where ops should get a big honking error message
("tried to insert twice, collided both times"). In any case, the contingency
plan for that for us is to just switch to snowflake, since with this specific
scheme it should take a while before we hit that barrier, as calculated in the
article.

Adding more metadata is definitely an option, but I'm not a fan of it since
data schemas (and data itself) have a nasty habit of changing over time,
meaning you have to change your IDing scheme. Completely surrogate IDs don't
have this issue.

------
chrisfarms
The arguments against UUID just don't seem strong enough to me.

Even with huge datasets, queries against UUID keys/indices have never been
any kind of bottleneck or major performance hit compared with other more
complex queries within the application.

Add to that the fact that DBs like Postgres have native support for UUID
types and stable extensions for UUID1-4 generation, and I just could not see
myself wanting to sacrifice such a well-known solution for one that offers
such slight gains.

~~~
makmanalp
Thanks for commenting! I think DB support is great. I'm especially jealous of
Postgres, which, like you mention, has native UUID types. UUID-OSSP is
especially great: [http://www.postgresql.org/docs/devel/static/uuid-ossp.html](http://www.postgresql.org/docs/devel/static/uuid-ossp.html)

Postgres handles all the gunk for you in terms of efficient storage and
display, AND you can cluster by a different field than the UUID one.

For those of us unfortunate enough to be using a DB without such niceties,
things are different :)

As for performance, I think this will always depend on what your queries
look like. For random reads or for reads when your dataset is not larger than
RAM, you won't see any degradation because of caching. But if you need to make
tons of range queries on a dataset that doesn't fit in RAM (or even worse, the
_index_ doesn't fit in RAM), things can get a bit crazy.
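As a rough illustration of why a time prefix helps here (a sketch assuming
the timestamp-in-high-bits layout described earlier, not simpleflake's
documented API): a time window maps directly to a contiguous ID range, so a
range query becomes a sequential index scan instead of the scattered reads
you get with random UUIDs.

```python
RANDOM_BITS = 23  # assumed layout: timestamp in high bits, random low bits

def id_bounds(start_ms, end_ms):
    # Smallest and largest possible IDs for the given time window.
    low = start_ms << RANDOM_BITS
    high = ((end_ms + 1) << RANDOM_BITS) - 1
    return low, high

# e.g. SELECT * FROM events WHERE id BETWEEN %s AND %s
low, high = id_bounds(1376000000000, 1376003600000)
```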

I wish I had time to also do a benchmark supporting this, but I've hit this
issue before. Other people also seem to have hit it here:

[http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql](http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql) (comments and answers)

or here:

[https://news.ycombinator.com/item?id=5310662](https://news.ycombinator.com/item?id=5310662)

or here:

[http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/](http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/)

So, it really depends on your use case, I guess. Hope this helps!

------
veesahni
MongoDB ObjectIDs are designed with a similar goal in mind - time-based,
distributed, uncoordinated, "probably" unique:
[http://docs.mongodb.org/manual/reference/object-id/](http://docs.mongodb.org/manual/reference/object-id/)

I say "probably" unique because they depend on the 'machine identifier' being
unique, which the ruby driver implements by grabbing the first 3 bytes of the
hostname's md5 hash.
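For reference, that derivation is simple to reproduce; in Python it would
look roughly like this (a sketch of what's described above, not code from
the Ruby driver itself):

```python
import hashlib
import socket

# First 3 bytes of the MD5 of the hostname - only "probably" unique,
# since distinct hostnames can share the same 3-byte hash prefix.
machine_id = hashlib.md5(socket.gethostname().encode()).digest()[:3]
```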

~~~
makmanalp
Yep! In essence, this is not too different. I think the main difference is
that ObjectIDs are 96 bits vs simpleflake / snowflake's 64 bits, which fit
into standard DB types better. (You don't have to muck around with BINARY
and such.)

------
noise
Thanks for this, I had been looking for a good/simple ID generator in Python
and was about to resort to building it myself.

~~~
makmanalp
Thank you! I loved your article on partially applied functions in C, by the
way (that was you, right?)

~~~
noise
Nope, that wasn't me.

------
wbl
What about simply having one do odd and the other even?

~~~
makmanalp
Hmm, fair point! I think this would definitely work, but only in the case
where you have just two things (or in the more general case, a fixed number of
things based on modulo n) generating IDs.

You're essentially dividing your keyspace in two from the beginning. So
then, to add another generator, you'd have to have a scheme that
transparently updates all of your generator nodes from mod 2 to mod 3 to mod
4 and so on as you add nodes.
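Concretely, the partitioning might look like this minimal sketch (node_index
and node_count are hypothetical names for illustration, not anything from
simpleflake):

```python
def next_id(counter, node_index, node_count):
    # Node k of n only ever emits IDs congruent to k mod n, so two
    # nodes can never collide - at the cost of fixing n up front.
    return counter * node_count + node_index

# With two nodes: node 0 emits the evens, node 1 emits the odds.
ids_node0 = [next_id(c, 0, 2) for c in range(3)]  # [0, 2, 4]
ids_node1 = [next_id(c, 1, 2) for c in range(3)]  # [1, 3, 5]
```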

It definitely could work though, if that's what you need.

