

God: Scalable in Memory Data Structure Server in Go - joshbaptiste
http://zond.github.com/god/architecture.html

======
Whitespace
I haven't read the article, but the claim of MurmurHash3 having low collision
rates is false: you can generate an arbitrary quantity of collisions according
to [0].

When this was released last year, Ruby [1], JRuby [2] and Rubinius [3] all
switched to SipHash [4]. Reference [5] also mentions Tomcat, .NET, PHP, etc.
all switching away from MurmurHash.

(I'm not saying one would HashDoS one's own database, I'm merely pointing out
that MurmurHash3 wasn't _designed_ with low collision rates in mind [siphash
is, though])

[0] <http://emboss.github.com/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/>

[1] <http://www.ruby-lang.org/en/news/2012/11/09/ruby19-hashdos-cve-2012-5371/>

[2] <http://jruby.org/2012/12/03/jruby-1-7-1.html>

[3]
[https://github.com/rubinius/rubinius/commit/a9a40fc6a1256bcf...](https://github.com/rubinius/rubinius/commit/a9a40fc6a1256bcf6382631b710430105c5dd868)

[4] <https://131002.net/siphash/>

[5]
[http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2011-481...](http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2011-4815)

[edit: formatting]

~~~
tveita
> MurmurHash3 wasn't _designed_ with low collision rates in mind

There is a difference between "low collision rate" and "collision resistance".

Real-world data often contain regular patterns that can cause "bad" hash
functions to give very non-random output, e.g. numbers at fixed intervals,
pointers all aligned to 4k boundaries, or long strings that differ only by a
short suffix.

For instance, early versions of Java would sample only a few of the characters
in a long string to calculate the hash code, which caused horrendous
performance if you stored a bunch of related file paths in a hash map.

Most hash functions try to avoid bad behaviour on this kind of good-natured
input, but some are better at it than others, due to careful design for
random-like distribution, and low collision rates.

<http://www.strchr.com/hash_functions> has examples of collision rates for
different hashes, all non-cryptographic.
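
A minimal Go sketch of the point about sampling hashes. `sampledHash` is a made-up function that looks at only a few evenly spaced bytes; it is an assumption for illustration, not the historical Java algorithm. `relatedPaths` and `distinct` are likewise hypothetical helpers. It compares the number of distinct hash values against FNV-1a from the standard library on paths that differ only by a short numeric suffix.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampledHash looks at only a handful of evenly spaced bytes of s.
// Hypothetical, for illustration only.
func sampledHash(s string) uint32 {
	step := len(s)/8 + 1
	var h uint32
	for i := 0; i < len(s); i += step {
		h = h*31 + uint32(s[i])
	}
	return h
}

// fullHash hashes every byte of s with FNV-1a.
func fullHash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// distinct counts how many different hash values f produces for keys.
func distinct(keys []string, f func(string) uint32) int {
	seen := make(map[uint32]struct{})
	for _, k := range keys {
		seen[f(k)] = struct{}{}
	}
	return len(seen)
}

// relatedPaths builds n paths that differ only in a four-digit suffix.
func relatedPaths(n int) []string {
	paths := make([]string, n)
	for i := range paths {
		paths[i] = fmt.Sprintf("/home/user/projects/example/src/main/file%04d.txt", i)
	}
	return paths
}

func main() {
	keys := relatedPaths(1000)
	fmt.Println("distinct values, sampled hash:", distinct(keys, sampledHash))
	fmt.Println("distinct values, full hash:   ", distinct(keys, fullHash))
}
```

With these particular paths the sampled hash reads only one of the four varying digits, so it cannot produce more than ten distinct values for the thousand keys, and a hash map keyed on it would degrade badly.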

~~~
wisty
Python's hash, for example, will generate fewer collisions than a "good" hash
function. 'file1', 'file2', and 'file3' will not collide, but end up in
sequential buckets, which is very efficient. Of course, you can send Python web
apps a "cookie of death", in which all cookies collide. There's a patch, but I
think it's disabled by default.

~~~
kibwen
Hash randomization was actually enabled by default in Python 3.3.

<http://docs.python.org/3/whatsnew/3.3.html#summary-release-highlights>

~~~
slurgfest
Which means this is not available in PyPy? :(

~~~
wisty
python -R should enable it in 2.X (in a patched version; 2.7.3 is fine).
PyPy does not include hash randomization (see
<http://doc.pypy.org/en/latest/cpython_differences.html>).

------
fusiongyro
Not to be confused with god the Ruby process monitor:

<http://godrb.com/>

Also, great choice for searchability: "go god." You guys are so clever!

~~~
LeafStorm
And that's not even getting into the Third Commandment issues.

~~~
jlgreco
Remember the sabbath, and keep it holy?

~~~
riffraff
Ten commandments are wildly inconsistent

[http://www.biblicalheritage.org/bible%20studies/10%20command...](http://www.biblicalheritage.org/bible%20studies/10%20commandments.htm)

~~~
jlgreco
Interesting, apparently as a former Lutheran I learned the Catholic version.
Seems it is even worse than that link suggests though, as there are two texts
which are considered to be the 10 commandments:
[http://en.wikipedia.org/wiki/Ten_Commandments#Two_texts_with...](http://en.wikipedia.org/wiki/Ten_Commandments#Two_texts_with_numbering_schemes)

How was I not aware of this...

~~~
fusiongyro
Because Christians are generally speaking quite ignorant of the Old Testament,
and not infrequently the New as well. Every religious Jew on earth is well
aware of this, and if you read the two versions side-by-side you'll see the
differences are stylistic and not substantive--the shade of difference between
"honor" and "love" or "remember" and "observe" is just not that great, though
our practices often are explained as owing to these distinctions.

If you are looking for something about the Bible to be upset about, you can
certainly do a lot better than this.

~~~
jlgreco
"The 10 Commandments" and/or the Bible itself upset me for other reasons. In
this particular case I am only upset by my own ignorance.

------
shin_lao
The Chord algorithm sounds like a very good choice, however I'm more skeptical
about the radix tree approach. I fear you might get a huge performance
penalty.

~~~
zond
I would have preferred other algorithms, but there was a strict need for 1)
sorted data and 2) two trees with identical content having the same structure
(for the Merkle element). Not many structures were left to choose from, and
radix seemed to work well enough.

~~~
shin_lao
My concern is that you're mixing two different concepts.

Do you really need data to be ordered? Why do you care about having "close"
data on the same node?

~~~
zond
The synchronization uses Merkle trees, and they require ordered data since
they hash contiguous data into a tree of hashes.

And to avoid having a separate structure for the Merkle trees I just hash all
nodes in the main tree, and compare the hashes to find differences.

Thus the same content must have the same structure, or the comparisons won't
work.
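
A rough Go sketch of that idea, under loud assumptions: it uses a plain byte-wise trie rather than god's actual radix tree, recomputes hashes on every call instead of storing them in the nodes, and all names (`node`, `put`, `hash`, `diff`) are made up for illustration. It shows that the same content inserted in a different order yields the same root hash, and that comparing hashes top-down lets the sync skip identical subtrees.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// node is one node of a plain byte-wise trie (hypothetical, not god's type).
type node struct {
	value    []byte
	hasValue bool
	children map[byte]*node
}

func newNode() *node { return &node{children: make(map[byte]*node)} }

// put stores value under key, creating intermediate nodes as needed.
func (n *node) put(key, value []byte) {
	cur := n
	for _, b := range key {
		next, ok := cur.children[b]
		if !ok {
			next = newNode()
			cur.children[b] = next
		}
		cur = next
	}
	cur.value, cur.hasValue = value, true
}

// sortedEdges returns the child edge bytes in ascending order.
func (n *node) sortedEdges() []byte {
	edges := make([]byte, 0, len(n.children))
	for b := range n.children {
		edges = append(edges, b)
	}
	sort.Slice(edges, func(i, j int) bool { return edges[i] < edges[j] })
	return edges
}

// hash covers the node's own value and, recursively, its children in key order.
func (n *node) hash() [32]byte {
	h := sha256.New()
	if n.hasValue {
		vh := sha256.Sum256(n.value)
		h.Write([]byte{1})
		h.Write(vh[:])
	} else {
		h.Write([]byte{0})
	}
	for _, e := range n.sortedEdges() {
		ch := n.children[e].hash()
		h.Write([]byte{e})
		h.Write(ch[:])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// diff appends to out every key whose value differs between a and b,
// descending only into subtrees whose hashes differ.
func diff(a, b *node, prefix []byte, out *[]string) {
	if a != nil && b != nil && a.hash() == b.hash() {
		return // identical subtrees, nothing to sync
	}
	if a == nil {
		a = newNode()
	}
	if b == nil {
		b = newNode()
	}
	if a.hasValue != b.hasValue || !bytes.Equal(a.value, b.value) {
		*out = append(*out, string(prefix))
	}
	merged := newNode() // used only to collect the union of edge bytes
	for e := range a.children {
		merged.children[e] = nil
	}
	for e := range b.children {
		merged.children[e] = nil
	}
	for _, e := range merged.sortedEdges() {
		diff(a.children[e], b.children[e], append(prefix[:len(prefix):len(prefix)], e), out)
	}
}

func main() {
	left, right := newNode(), newNode()
	for _, k := range []string{"apple", "apricot", "banana"} {
		left.put([]byte(k), []byte("v1"))
	}
	for _, k := range []string{"banana", "apple", "apricot"} {
		right.put([]byte(k), []byte("v1"))
	}
	fmt.Println("same root hash:", left.hash() == right.hash())

	right.put([]byte("apricot"), []byte("v2"))
	right.put([]byte("cherry"), []byte("v1"))
	var changed []string
	diff(left, right, nil, &changed)
	fmt.Println("keys to sync:", changed)
}
```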

~~~
shin_lao
I think I didn't make myself clear:

- You say _However, since it could be very useful for users of a database to
store ordered data, or to wilfully concentrate certain data on certain parts
of the cluster, god does not force the user to hash the keys._ -> why do you
care about how the data is actually stored?

- _To map keys to values, a mapping structure is needed. For infrastructural
reasons (synchronization and cleaning) as well as for functionality of
different kinds, we need a sorted mapping, and it has to be deterministically
structured._ -> why?

~~~
zond
> why do you care about how the data is actually stored?

I just said I don't. 'god does not force the user to hash the keys'.

> > To map keys to values, a mapping structure is needed. For infrastructural
> reasons (synchronization and cleaning) as well as for functionality of
> different kinds, we need a sorted mapping, and it has to be
> deterministically structured.

> why?

Functionality: To be able to return the first or the n'th entry it has to be
ordered.

Synchronization/cleaning: To be able to hash elements 0000-000f we need an
efficient way to fetch a segment of elements, thus it again has to be ordered.

To optimize the hashing so that I don't have to keep two separate data
structures I keep the hashes in the nodes of the sorted data structure. Thus
the structure has to be deterministically structured or the hashes won't be
equal even if the trees contain the same data.
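
To see why the structure has to be deterministic, here is a small Go sketch using a naive unbalanced binary search tree as the counterexample. This is not god's data structure; `bst`, `insert`, `structuralHash`, `inOrder` and `build` are hypothetical names. The shape of such a tree depends on insertion order, so hashes stored in its nodes differ even when the sorted contents are identical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// bst is a naive unbalanced binary search tree node (illustration only).
type bst struct {
	key         string
	left, right *bst
}

// insert adds key to the tree; duplicates are ignored.
func insert(t *bst, key string) *bst {
	if t == nil {
		return &bst{key: key}
	}
	switch {
	case key < t.key:
		t.left = insert(t.left, key)
	case key > t.key:
		t.right = insert(t.right, key)
	}
	return t
}

// structuralHash combines a node's key with the hashes of its children,
// so the result depends on the shape of the tree, not just its contents.
func structuralHash(t *bst) [32]byte {
	if t == nil {
		return [32]byte{}
	}
	l, r := structuralHash(t.left), structuralHash(t.right)
	h := sha256.New()
	h.Write(l[:])
	h.Write([]byte(t.key))
	h.Write(r[:])
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// inOrder returns the keys in sorted order.
func inOrder(t *bst, out []string) []string {
	if t == nil {
		return out
	}
	out = inOrder(t.left, out)
	out = append(out, t.key)
	return inOrder(t.right, out)
}

// build inserts keys in the given order into an empty tree.
func build(keys ...string) *bst {
	var t *bst
	for _, k := range keys {
		t = insert(t, k)
	}
	return t
}

func main() {
	a := build("b", "a", "c") // b at the root, a and c as children
	b := build("a", "b", "c") // degenerates into a right-leaning chain
	fmt.Println("contents:", inOrder(a, nil), inOrder(b, nil))
	fmt.Println("hashes equal:", structuralHash(a) == structuralHash(b))
}
```

A trie or radix tree avoids this because its shape is determined by the set of keys alone, which is what makes per-node hashes comparable across replicas.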

~~~
shin_lao
Thank you for your answers.

------
lobster_johnson
Bad name. Already taken by a well-known Ruby tool (<http://godrb.com/>).

Why not call it Heaven? It's got clouds.

~~~
codygman
I didn't know of this tool. I'm sure lots of programs use the name God TBH.

~~~
lobster_johnson
No, looks like it's just the one. And it's in Debian/Ubuntu [1][2] as "god",
so this tool is already at a disadvantage, namewise:

    $ apt-cache search god | grep "^god"
    god - Fully configurable process monitoring

[1]
[http://packages.debian.org/search?keywords=god&searchon=...](http://packages.debian.org/search?keywords=god&searchon=names&suite=stable&section=all)

[2]
[http://packages.ubuntu.com/search?keywords=god&searchon=...](http://packages.ubuntu.com/search?keywords=god&searchon=names&suite=quantal&section=all)

~~~
dsl
I wonder what the Debian/Ubuntu policy is on package name reuse? When Ruby is
sent out to pasture, do we never get to use 'god' again?

~~~
dschulz
wasn't "pasture" a slackware thing?

------
JulianMorrison
Can it be run in-process from Go as a library?

~~~
JulianMorrison
And also, can it be run in diskless mode (no logging, no snapshotting)?

~~~
zond
Yes, it isn't documented (for some reason, I must have forgotten) but
<https://github.com/zond/god/blob/master/dhash/dhash.go#L109> shows how the
empty string as persistence directory will avoid any persistence.

------
vph
Obvious question: how does this compare to Redis?

~~~
zond
Hard to know. It seems to perform comparably, anyway, at single node level.

I have yet to find a bunch of equally powerful machines to perform a proper
scalability benchmark :/

------
clumsybull
Arg, should not have chosen that name: <http://godrb.com/>

------
rartichoke
I don't want to come off as a negative Nancy but why didn't you just stick to
using Redis? What problems does your lib solve that Redis does not?

~~~
zond
If you run out of RAM or CPU on a single machine you start running into
operationally problematic situations with Redis.

~~~
rartichoke
I've never had this issue pop up (I am not involved in any huge scale sites)
but I have to believe this really isn't too common?

Also, five seconds of Googling shows there are ways to set Redis up to stop
writing but continue reading if you're running out of memory. If your memory
needs really are that large, then you're also going to run out of memory with
this new lib on 1 machine.

It seems like Redis has more than enough options to prevent real problems from
occurring once you do surpass your hardware requirements.

~~~
zond
I guess it's not for you, then.

------
pknerd
Interesting to see how technology is being _religionized_ by hackers. _Smirk_

