I haven't read the article, but the claim that MurmurHash3 has low collision rates is false: according to [0], you can generate an arbitrary number of collisions.
When this was released last year, Ruby[1], JRuby[2] and Rubinius[3] all switched to SipHash[4]. The advisory at [5] mentions Tomcat, .NET, PHP, etc. moving away from vulnerable hash functions as well.
(I'm not saying one would HashDoS one's own database, I'm merely pointing out that MurmurHash3 wasn't _designed_ with low collision rates in mind [siphash is, though])
> MurmurHash3 wasn't _designed_ with low collision rates in mind
There is a difference between "low collision rate" and "collision resistance".
Real-world data often contain regular patterns that can cause "bad" hash functions to give very non-random output, e.g. numbers at fixed intervals, pointers all aligned to 4k boundaries, or long strings that differ only by a short suffix.
For instance, early versions of Java would sample only a few of the characters in a long string to calculate the hash code, which caused horrendous performance if you stored a bunch of related file paths in a hash map.
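To make that concrete, here's a toy sketch in Python. This is not the old Java code, just a caricature of the "sample a few characters" idea, with made-up paths, showing how related keys can collapse onto a handful of hash values:

    # Toy "sampling" hash: looks at only a few evenly spaced characters,
    # loosely in the spirit of the old Java String.hashCode (not its real code).
    def sampling_hash(s, samples=8):
        step = max(1, len(s) // samples)
        h = 0
        for i in range(0, len(s), step):
            h = h * 37 + ord(s[i])
        return h

    paths = ["/var/www/site/assets/img/photo_%03d.jpg" % i for i in range(100)]
    print(len(set(sampling_hash(p) for p in paths)))  # ~10 distinct hashes for 100 paths

A hash that mixes in every character spreads those same paths out just fine.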
Most hash functions try to avoid bad behaviour on this kind of benign input, but some are better at it than others, thanks to careful design for random-like distribution and low collision rates.
Python's hash, for example, will generate fewer collisions than a "good" (random-looking) hash function on this kind of input: 'file1', 'file2', and 'file3' will not collide, and they end up in adjacent buckets, which is very cache-friendly. Of course, you can send Python web apps a "cookie of death" in which all the keys collide. There's a hash-randomization patch, but I think it's disabled by default.
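A quick way to see the "adjacent buckets" effect (string hashing is randomized by default in Python 3.3+, so the 'file1'/'file2' behaviour no longer reproduces there, but small integers still hash to themselves in CPython):

    # CPython: hash(n) == n for small ints, so consecutive keys land in
    # consecutive slots of a power-of-two table (slot = hash & mask).
    table_size = 32                                   # hypothetical table size
    slots = [hash(k) & (table_size - 1) for k in range(16)]
    print(slots)                                      # [0, 1, 2, ..., 15]: zero collisions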
To clarify: with MurmurHash{2,3} or CityHash, it's easy to generate arbitrary numbers of reasonably small keys which all hash to the same value regardless of what seed is used. This is much worse than most people initially assume when they hear that "collisions in murmurhash have been found".
That said, it's a fine hash function if you're not worried about malicious keys.
Murmur's main issue (in an attacker-controlled context) is how it incorporates the seed into the hash, not unexpected collisions on non-malicious data.
Further, _any_ hash, even a cryptographic one, will not protect you unless it is used with a random seed that is unknown and unpredictable to the attacker.
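If you need a keyed hash in application code, one standard-library option (Python 3.6+) is blake2 with a key; SipHash itself, which newer CPython versions use internally for str/bytes, isn't exposed directly. A minimal sketch:

    import os, hashlib

    SECRET = os.urandom(16)   # fresh, unpredictable key per process

    def keyed_hash(data):
        # An attacker who doesn't know SECRET can't precompute colliding keys.
        digest = hashlib.blake2b(data, key=SECRET, digest_size=8).digest()
        return int.from_bytes(digest, "little")

    print(keyed_hash(b"file1") % 64)   # bucket index; differs between runs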
Interesting: apparently, as a former Lutheran, I learned the Catholic version. It seems it is even worse than that link suggests, though, as there are two texts which are considered to be the Ten Commandments: http://en.wikipedia.org/wiki/Ten_Commandments#Two_texts_with...
Because Christians are generally speaking quite ignorant of the Old Testament, and not infrequently the New as well. Every religious Jew on earth is well aware of this, and if you read the two versions side-by-side you'll see the differences are stylistic and not substantive--the shade of difference between "honor" and "love" or "remember" and "observe" is just not that great, though our practices often are explained as owing to these distinctions.
If you are looking for something about the Bible to be upset about, you can certainly do a lot better than this.
To be honest, that is the first result I found when searching Google for "ten commandments jews"[0].
Sure, as another (apparently deleted?) comment said, if you take them as a whole, you can more or less match 80% of them across a few major branches.
But the numbering is inconsistent enough that if I cite one by number, there's an 80% chance you won't understand which one I mean unless you know my religion, which is the case I replied to.
[0] That's how I first found out that what the Roman Catholic catechism taught me (in Rome :) as number VI, "do not commit impure acts", is different for Jews.
Actually, that would be a neat search-engine idea: targeted at programmers, you specify the programming language the library/question/tutorial/etc. should cover, and it searches only websites related to that language.
The Chord algorithm sounds like a very good choice; however, I'm more skeptical about the radix-tree approach. I fear you might take a big performance hit.
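(For anyone who hasn't met Chord-style partitioning: here is a toy consistent-hashing ring in Python. It leaves out finger tables and successor lists, and the node names are made up; it only shows the key-to-node mapping.)

    import bisect, hashlib

    def ring_pos(x):
        # Position on a 2^32 ring.
        return int.from_bytes(hashlib.sha1(x.encode()).digest()[:4], "big")

    nodes = ["node-a", "node-b", "node-c"]
    ring = sorted((ring_pos(n), n) for n in nodes)

    def owner(key):
        # The first node clockwise from the key's position owns it; adding or
        # removing a node only remaps the keys on the neighbouring arc.
        i = bisect.bisect(ring, (ring_pos(key),)) % len(ring)
        return ring[i][1]

    print(owner("user:42"))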
Radix trees are great, with the right workload. The main competitors (for strings) are the ternary search tree (for sparse keyspaces with rare insertions) and the B-tree (which is nominally disk-optimized, but that optimization also tends to help with the L3 cache).
None of those are best for all workloads. There are situations where using a radix-tree is actually faster than a hashtable with a good hash function, and situations where it is slower than a red-black tree.
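For anyone weighing these options, a bare-bones (uncompressed) trie sketch shows the two properties people buy radix trees for: O(key length) lookups and free sorted iteration. This is just an illustration, not any library's implementation:

    class Node:
        def __init__(self):
            self.children = {}   # char -> Node
            self.value = None
            self.present = False

    class Trie:
        def __init__(self):
            self.root = Node()

        def insert(self, key, value):
            node = self.root
            for ch in key:
                node = node.children.setdefault(ch, Node())
            node.value, node.present = value, True

        def get(self, key):
            node = self.root
            for ch in key:                      # cost is O(len(key)),
                node = node.children.get(ch)    # independent of tree size
                if node is None:
                    return None
            return node.value if node.present else None

        def items(self, node=None, prefix=""):
            node = node or self.root
            if node.present:
                yield prefix, node.value
            for ch in sorted(node.children):    # sorted iteration for free
                for kv in self.items(node.children[ch], prefix + ch):
                    yield kv

    t = Trie()
    for k in ["file2", "file1", "file10"]:
        t.insert(k, k.upper())
    print(list(t.items()))   # keys come back in sorted order

A real radix tree additionally collapses single-child chains into one node, which is what keeps memory use and pointer chasing under control.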
I would have preferred other algorithms, but there was a strict requirement for 1) sorted data and 2) two identical trees having the same structure (for the Merkle element). Not many structures were left to choose from, and radix trees seemed to work well enough.
- You say: "However, since it could be very useful for users of a database to store ordered data, or to wilfully concentrate certain data on certain parts of the cluster, god does not force the user to hash the keys." -> why do you care about how the data is actually stored?
- "To map keys to values, a mapping structure is needed. For infrastructural reasons (synchronization and cleaning) as well as for functionality of different kinds, we need a sorted mapping, and it has to be deterministically structured." -> why?
> why do you care about how the data is actually stored?
I just said I don't. 'god does not force the user to hash the keys'.
> > To map keys to values, a mapping structure is needed. For infrastructural reasons (synchronization and cleaning) as well as for functionality of different kinds, we need a sorted mapping, and it has to be deterministically structured.
> why?
Functionality: to be able to return the first or the n-th entry, the mapping has to be ordered.
Synchronization/cleaning: to be able to hash elements 0000-000f we need an efficient way to fetch a segment of elements, so again it has to be ordered.
To optimize the hashing, and so that I don't have to keep two separate data structures, I keep the hashes in the nodes of the sorted data structure. The tree therefore has to be deterministically structured, or the hashes won't be equal even when two trees contain the same data.
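A rough sketch of what the ordering buys you (not god's actual code; the key format and SHA-1 are just placeholders): with sorted keys you can slice out a segment like 0000-000f and hash it, and two nodes holding the same data compute the same digest.

    import bisect, hashlib

    keys = sorted("%04x" % i for i in range(200))
    store = dict((k, ("value-" + k).encode()) for k in keys)

    def segment_hash(lo, hi):
        i = bisect.bisect_left(keys, lo)
        j = bisect.bisect_right(keys, hi)
        h = hashlib.sha1()
        for k in keys[i:j]:          # contiguous slice, thanks to the ordering
            h.update(k.encode())
            h.update(store[k])
        return h.hexdigest()

    print(segment_hash("0000", "000f"))   # compare digests across nodes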
Other comments: timestamps will not work as you expect if you have a network partition and you are using your own time-synchronization mechanism. Use NTP instead.
Distributed hash table algorithms were originally created for things like Gnutella and are optimal for cases where you expect nodes to pop up and disappear on a regular (but unpredictable) basis.
I've never had this issue pop up (I am not involved in any huge scale sites) but I have to believe this really isn't too common?
Also, five seconds of Googling shows there are ways to set Redis up to stop accepting writes but keep serving reads when it's running out of memory. If your memory needs really are that large, then you're going to run out of memory with this new library on one machine too.
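For reference, the relevant knobs look something like this (values are illustrative):

    # redis.conf
    maxmemory 2gb
    maxmemory-policy noeviction   # writes error out at the limit, reads keep working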
It seems like Redis has more than enough options to prevent real problems from occurring once you do outgrow your hardware.
[0] http://emboss.github.com/blog/2012/12/14/breaking-murmur-has...
[1] http://www.ruby-lang.org/en/news/2012/11/09/ruby19-hashdos-c...
[2] http://jruby.org/2012/12/03/jruby-1-7-1.html
[3] https://github.com/rubinius/rubinius/commit/a9a40fc6a1256bcf...
[4] https://131002.net/siphash/
[5] http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2011-481...