God: Scalable in Memory Data Structure Server in Go (zond.github.com)
150 points by joshbaptiste on Feb 5, 2013 | 63 comments



I haven't read the article, but the claim that MurmurHash3 has low collision rates is false: you can generate an arbitrary quantity of collisions, per [0].

When that attack was published last year, Ruby [1], JRuby [2] and Rubinius [3] all switched to SipHash [4]. [5] mentions Tomcat, .NET, PHP, etc. all switching away from MurmurHash as well.

(I'm not saying one would HashDoS one's own database, I'm merely pointing out that MurmurHash3 wasn't _designed_ with low collision rates in mind [siphash is, though])

[0] http://emboss.github.com/blog/2012/12/14/breaking-murmur-has...

[1] http://www.ruby-lang.org/en/news/2012/11/09/ruby19-hashdos-c...

[2] http://jruby.org/2012/12/03/jruby-1-7-1.html

[3] https://github.com/rubinius/rubinius/commit/a9a40fc6a1256bcf...

[4] https://131002.net/siphash/

[5] http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2011-481...

[edit: formatting]


> MurmurHash3 wasn't _designed_ with low collision rates in mind

There is a difference between "low collision rate" and "collision resistance".

Real-world data often contain regular patterns that can cause "bad" hash functions to give very non-random output, e.g. numbers at fixed intervals, pointers all aligned to 4k boundaries, or long strings that differ only by a short suffix.

For instance, early versions of Java would sample only a few of the characters in a long string to calculate the hash code, which caused horrendous performance if you stored a bunch of related file paths in a hash map.

Most hash functions try to avoid bad behaviour on this kind of good-natured input, but some are better at it than others, thanks to careful design for random-like distribution and low collision rates.
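
A toy Go sketch of the failure mode (hypothetical hash and paths, not the actual Java algorithm): sampling only some bytes collapses many related keys into a handful of hash values:

    package main

    import "fmt"

    // toyHash samples only some bytes of long strings, loosely in the
    // spirit of early Java's String.hashCode (a hypothetical
    // simplification, not the real algorithm).
    func toyHash(s string) uint32 {
        var h uint32
        skip := 1 + len(s)/16 // long strings: most bytes are never read
        for i := 0; i < len(s); i += skip {
            h = h*37 + uint32(s[i])
        }
        return h
    }

    func main() {
        seen := map[uint32]bool{}
        for i := 0; i < 1000; i++ {
            // 1000 distinct "paths" differing mostly in bytes the
            // sampler skips over.
            p := fmt.Sprintf("/var/log/app/2013/02/05/f%03d.log", i)
            seen[toyHash(p)] = true
        }
        // Prints a number far smaller than 1000: each surviving hash
        // value fronts a long collision chain.
        fmt.Println("distinct hashes for 1000 keys:", len(seen))
    }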

http://www.strchr.com/hash_functions has examples of collision rates for different hashes, all non-cryptographic.


Python's hash, for example, will generate fewer collisions than a "good" hash function: 'file1', 'file2', and 'file3' will not collide, but will end up in sequential buckets, which is very efficient. Of course, you can send Python web apps a "cookie of death" in which all cookies collide. There's a patch, but I think it's disabled by default.
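
For the curious, a from-memory Go transcription of the old, pre-randomization CPython string hash (width and special cases simplified):

    package main

    import "fmt"

    // py2Hash is a from-memory sketch of CPython 2's pre-randomization
    // string hash (width and special cases simplified).
    func py2Hash(s string) uint64 {
        if len(s) == 0 {
            return 0
        }
        x := uint64(s[0]) << 7
        for i := 0; i < len(s); i++ {
            x = (1000003 * x) ^ uint64(s[i])
        }
        return x ^ uint64(len(s))
    }

    func main() {
        for _, k := range []string{"file1", "file2", "file3"} {
            fmt.Printf("%s -> bucket %d\n", k, py2Hash(k)%8)
        }
    }

Keys that differ only in the last character differ only in the low bits of the hash, so in a small table they fill distinct, neighbouring buckets instead of colliding.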


Hash randomization was actually enabled by default in Python 3.3.

http://docs.python.org/3/whatsnew/3.3.html#summary-release-h...


Which means this is not available in PyPy? :(


python -R should enable it in 2.X (in the patched version, 2.7.3 is fine). PyPy does not include hash randomization (see http://doc.pypy.org/en/latest/cpython_differences.html).


To clarify: with MurmurHash{2,3} or CityHash, it's easy to generate arbitrary numbers of reasonably small keys which all hash to the same value regardless of what seed is used. This is much worse than most people initially assume when they hear that "collisions in murmurhash have been found".

That said, it's a fine hash function if you're not worried about malicious keys.


[disclaimer]I wrote god[/disclaimer]

As someone in the thread below wrote, Murmur is good if you don't worry about malicious keys (it is not cryptographic, and doesn't claim to be).

I don't worry about malicious keys :)


Murmur's main issue (in an attacker-controlled context) is how it incorporates the seed into the hash, not unexpected collisions on non-malicious data.

Further, _any_ hash, even a cryptographic one, will not protect you if it is not used with a random seed unknown and unpredictable to the attacker.
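
The pattern looks roughly like this in Go (illustration only, using hash/maphash; any keyed hash, e.g. SipHash with a secret key, works the same way):

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    func main() {
        // One random seed per process, generated at startup and never
        // exposed to clients.
        seed := maphash.MakeSeed()

        bucket := func(key string, nbuckets uint64) uint64 {
            var h maphash.Hash
            h.SetSeed(seed)
            h.WriteString(key)
            return h.Sum64() % nbuckets
        }

        // An attacker who cannot predict the seed cannot precompute a
        // set of keys that all land in the same bucket.
        fmt.Println(bucket("user:42", 1024))
    }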


You have a practical way to generate collisions in, say, SHA-256?


You only need to generate collisions for the keyspace used. Brute forcing 20-30 bit collisions in any hash function is trivial.
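
For example, a birthday search against SHA-256 truncated to 24 bits (a realistic hash-table size) needs only a few thousand attempts. A quick Go sketch:

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
    )

    func main() {
        // Birthday search for a collision in SHA-256 truncated to 24
        // bits: ~2^12 attempts on average, instant on any machine.
        const bits = 24
        seen := map[uint32]uint64{}
        for i := uint64(0); ; i++ {
            var buf [8]byte
            binary.LittleEndian.PutUint64(buf[:], i)
            sum := sha256.Sum256(buf[:])
            h := binary.BigEndian.Uint32(sum[:4]) >> (32 - bits)
            if j, ok := seen[h]; ok {
                fmt.Printf("inputs %d and %d collide in a %d-bit truncation after %d hashes\n",
                    j, i, bits, i+1)
                return
            }
            seen[h] = i
        }
    }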


Not to be confused with god the Ruby process monitor:

http://godrb.com/

Also, great choice for searchability: "go god." You guys are so clever!


Searching for [golang foo] has been the recommended strategy for almost as long as Go has been around.


Yet another terrible name for a software package.


And that's not even getting into the Third Commandment issues.


And interestingly enough, your account is 666 days old today.


Remember the sabbath, and keep it holy?



Interesting; apparently, as a former Lutheran, I learned the Catholic version. It seems it is even worse than that link suggests, though, as there are two texts which are considered to be the Ten Commandments: http://en.wikipedia.org/wiki/Ten_Commandments#Two_texts_with...

How was I not aware of this...


Because Christians are generally speaking quite ignorant of the Old Testament, and not infrequently the New as well. Every religious Jew on earth is well aware of this, and if you read the two versions side-by-side you'll see the differences are stylistic and not substantive--the shade of difference between "honor" and "love" or "remember" and "observe" is just not that great, though our practices often are explained as owing to these distinctions.

If you are looking for something about the Bible to be upset about, you can certainly do a lot better than this.


"The 10 Commandments" and/or the Bible itself upset me for other reasons. In this particular case I am only upset by my own ignorance.


That's not what any reasonable person would call "wildly inconsistent," though it is certainly enough to make the author's point.


To be honest, that is the first result I found on Google for "ten commandments jews" [0].

Sure, as another (apparently deleted?) comment said, if you take them as a whole, you can more or less match 80% of them across a few major branches.

But they are inconsistent enough that if I cite one by number, there's an 80% chance you won't understand which one I mean unless you know my religion, which is the case I replied to.

[0] I first found out that what the Roman Catholic catechism taught me (in Rome :) as VI, "do not commit impure acts", is different for Jews.


In that case, fair enough.


Yup, I am not happy about the name.

It has its roots in 'Go database', and began as a working name that I never had time to replace...


You could call it Ghord


Sounds gristly :O


Sounds totally metal. Ghord the conqueror.


Can't be harder to look up than "Go".


Try looking up anything for R


First hit when googling "R api" is "Writing R Extensions". Googling for R looks pretty easy to me.


Actually, that would be a neat search engine idea: targeted at programmers, you could specify the language the library/question/tutorial/etc. should be written in, and it would search only websites related to it.


It's not a search problem, as much as it's an indexing one, no?


True. Should basically be doable as a filter on Google.


The Chord algorithm sounds like a very good choice; however, I'm more skeptical about the radix tree approach. I fear you might incur a huge performance penalty.


I'm no data structures expert (far from it), but via hearsay I've only heard good things about radix trees. Care to elaborate?


Radix trees are great, with the right workload. The main competitors (for strings) are the ternary search tree (for sparse keyspaces with rare insertions) and the B-tree (which is nominally disk-optimized, but that optimization also tends to help with the L3 cache).

None of those are best for all workloads. There are situations where using a radix-tree is actually faster than a hashtable with a good hash function, and situations where it is slower than a red-black tree.


I would have preferred other algorithms, but there was a strict need for 1) sorted data and 2) two identical trees having the same structure (for the Merkle element). Not many structures were left to choose from, and radix trees seemed to work well enough.


My concern is that you're mixing two different concepts.

Do you really need data to be ordered? Why do you care about having "close" data on the same node?


The synchronization uses Merkle trees, and they require ordered data since they hash contiguous data into a tree of hashes.

And to avoid having a separate structure for the Merkle trees I just hash all nodes in the main tree, and compare the hashes to find differences.

Thus the same content must have the same structure, or the comparisons won't work.
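
A toy sketch of the idea in Go (not god's actual code): hash each entry of the ordered key space, then hash contiguous pairs upward until one root remains:

    package main

    import (
        "crypto/sha256"
        "fmt"
        "sort"
    )

    // merkleRoot hashes each entry of an ordered key space, then hashes
    // contiguous pairs upward until one root remains.
    func merkleRoot(keys []string) [32]byte {
        sort.Strings(keys)
        level := make([][32]byte, len(keys))
        for i, k := range keys {
            level[i] = sha256.Sum256([]byte(k))
        }
        for len(level) > 1 {
            var next [][32]byte
            for i := 0; i < len(level); i += 2 {
                if i+1 == len(level) { // odd element carries up
                    next = append(next, level[i])
                    continue
                }
                pair := append(level[i][:], level[i+1][:]...)
                next = append(next, sha256.Sum256(pair))
            }
            level = next
        }
        if len(level) == 0 {
            return sha256.Sum256(nil)
        }
        return level[0]
    }

    func main() {
        a := merkleRoot([]string{"b", "a", "c", "d"})
        b := merkleRoot([]string{"d", "c", "b", "a"})
        fmt.Println(a == b) // true: same contents, same root
    }

Replicas with the same sorted contents get the same root, and a mismatch at any level narrows the search to the differing segment.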


I think I didn't make myself clear:

You say:

> However, since it could be very useful for users of a database to store ordered data, or to wilfully concentrate certain data on certain parts of the cluster, god does not force the user to hash the keys.

Why do you care about how the data is actually stored?

> To map keys to values, a mapping structure is needed. For infrastructural reasons (synchronization and cleaning) as well as for functionality of different kinds, we need a sorted mapping, and it has to be deterministically structured.

Why?


> why do you care about how the data is actually stored?

I just said I don't. 'god does not force the user to hash the keys'.

> > To map keys to values, a mapping structure is needed. For infrastructural reasons (synchronization and cleaning) as well as for functionality of different kinds, we need a sorted mapping, and it has to be deterministically structured.

> why?

Functionality: To be able to return the first or the nth entry, it has to be ordered.

Synchronization/cleaning: To be able to hash elements 0000-000f, we need an efficient way to fetch a segment of elements, so again it has to be ordered.

To optimize the hashing so that I don't have to keep two separate data structures, I keep the hashes in the nodes of the sorted data structure. That structure therefore has to be deterministic, or the hashes won't be equal even when the trees contain the same data.
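
In types, the idea is roughly this (a hypothetical simplification, not god's actual radix node):

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // node piggy-backs a Merkle hash on each node of the sorted tree.
    type node struct {
        key, value []byte
        children   []*node // kept sorted by key
        hash       [32]byte
    }

    // rehash recomputes subtree hashes bottom-up. A radix tree's shape
    // depends only on the set of keys, not on insertion order, so two
    // replicas with identical contents end up with identical subtree
    // hashes and can compare them to localize any difference.
    func (n *node) rehash() [32]byte {
        h := sha256.New()
        h.Write(n.key)
        h.Write(n.value)
        for _, c := range n.children {
            s := c.rehash()
            h.Write(s[:])
        }
        copy(n.hash[:], h.Sum(nil))
        return n.hash
    }

    func main() {
        root := &node{key: []byte("a"), children: []*node{
            {key: []byte("ab"), value: []byte("1")},
            {key: []byte("ac"), value: []byte("2")},
        }}
        fmt.Printf("root hash: %x\n", root.rehash())
    }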


Thank you for your answers.


One other comment: timestamps will not work as you expect if you have a network partition while using your own time synchronization mechanism. Use NTP instead.


I don't think I understand what you mean. Or maybe you don't understand how my timestamps and time synchronization work?


Bad name. Already taken by a well-known Ruby tool (http://godrb.com/).

Why not call it Heaven? It's got clouds.


I didn't know of this tool. I'm sure lots of programs use the name God TBH.


No, looks like it's just the one. And it's in Debian/Ubuntu [1][2] as "god", so this tool is already at a disadvantage, namewise:

    $ apt-cache search god | grep "^god"
    god - Fully configurable process monitoring
[1] http://packages.debian.org/search?keywords=god&searchon=...

[2] http://packages.ubuntu.com/search?keywords=god&searchon=...


I wonder what the Debian/Ubuntu policy is on package name reuse? When Ruby is sent out to pasture, do we never get to use 'god' again?


wasn't "pasture" a slackware thing?


Can it be run in-process from Go as a library?


Yes, look at https://github.com/zond/commendable which is a real-life project I am basing on god (as an exercise, showcase and test run).


And also, can it be run in diskless mode (no logging, no snapshotting)?


Yes. It isn't documented (for some reason; I must have forgotten), but https://github.com/zond/god/blob/master/dhash/dhash.go#L109 shows how passing the empty string as the persistence directory avoids any persistence.


Obvious question: how does this compare to Redis?


Hard to know. It seems to perform comparably at the single-node level, anyway.

I have yet to find a bunch of equally powerful machines to perform a proper scalability benchmark :/


Arg, should not have chosen that name: http://godrb.com/


I don't want to come off as a negative Nancy but why didn't you just stick to using Redis? What problems does your lib solve that Redis does not?


Distributed hash table algorithms were originally created for things like Gnutella and are optimal for cases where you expect nodes to pop up and disappear on a regular (but unpredictable) basis.


If you run out of RAM or CPU on a single machine you start running into operationally problematic situations with Redis.


I've never had this issue pop up (I am not involved in any huge scale sites) but I have to believe this really isn't too common?

Also, five seconds of Googling shows there are ways to set Redis up to stop writing but continue reading if you're running out of memory. If your memory needs really are that large, then you're going to run out of memory with this new lib on one machine as well.

It seems like Redis has more than enough options to prevent real problems from occurring once you do surpass your hardware requirements.


I guess it's not for you, then.


Interesting to see how technology is being religionized by hackers. Smirk



