

HyperDex: Super-charging your NoSQL with a Strongly Consistent Datastore - ziyadb
http://micrypt.com/2012/08/10/Getting-started-with-HyperDex.html

======
sbhat7
Previous and relevant discussions:

<http://news.ycombinator.com/item?id=3622059>

<https://groups.google.com/forum/?fromgroups#!topic/redis-db/N4oy3lCngsU%5B1-25%5D>

------
karterk
I am in the middle of evaluating HyperDex. It's still in its early stages, but
it's very promising when you want to slice and dice your data across multiple
dimensions.

Having said that, I had trouble stopping and/or restarting the cluster in a
clean way. To make me even consider using it in production, it should also
offer ways to back up data, as well as convince me that future upgrades will
be somewhat smooth.

~~~
egs
Not sure when you last looked at it, but we added the ability to cleanly shut
down and restart the cluster two releases ago. We are committed to
providing a smooth upgrade path as well; that is not to say we will always be
binary-compatible, as the next release involving the disk layer will change
the on-disk representation, but we will always provide automatic upgrade
scripts. So, if you last tested 0.29b or so, do check out the latest code in
the repo.

------
petercooper
Content aside, I love the idea of having a suggested soundtrack for a blog
post, along with the link to actually hear it :-)

------
cmancini
I was wondering how long it would take for HyperDex to get some traction on
HN. While I haven't had immediate need to use it myself, I have several
friends who attest that it is a lovely system. The article doesn't mention it,
but it's wicked fast, outperforming Redis in many metrics. Clearly the
consistency is the big kicker though. So many systems are saying eventual
consistency is "good enough." Clearly that can be fine for a SoLoMo app, but
probably not for a medical system.

~~~
nicolast
The lack of consistency guarantees in most openly available systems is exactly
why we created Arakoon: our usage scenarios (safekeeping metadata for
large-scale storage systems, among others) simply don't allow for 'eventual
consistency'. It's also available as free software; check it out at
<http://arakoon.org>.

------
jperras
Neat. From a very cursory glance at the HyperDex paper, it looks like a
distributed kd-tree with the addition of a single-dimensional key subspace (to
use their terminology). The really clever part, I believe, is the value-
dependent chaining for the deterministic propagation of changes and deletions
of objects.

Very cool.

~~~
egs
Thanks! Unlike kd-tree or B-tree variants, HyperDex does not build an
auxiliary data structure. It turns out that keeping auxiliary data structures
in sync with the data is very difficult if you want to provide strong
consistency guarantees. Hyperspace hashing is purely a mapping trick, not a
distributed data-structure trick.
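
To make the mapping concrete, here is a toy Python sketch of hyperspace
hashing; the per-attribute bucketing and the grid-to-server placement are my
own illustrative assumptions, not HyperDex's actual scheme:

    import hashlib

    def coord(value, buckets=4):
        # Hash one attribute value onto a bounded axis: an integer in [0, buckets).
        digest = hashlib.md5(str(value).encode()).hexdigest()
        return int(digest, 16) % buckets

    def hyperspace_point(obj, attributes, buckets=4):
        # The object's location in the hyperspace: one hashed coordinate per attribute.
        return tuple(coord(obj[attr], buckets) for attr in attributes)

    def responsible_server(point, servers, buckets=4):
        # Flatten the grid cell containing the point into an index and pick a server.
        # (A real deployment maps contiguous regions to servers; this is just a sketch.)
        index = 0
        for c in point:
            index = index * buckets + c
        return servers[index % len(servers)]

    servers = ["node-a", "node-b", "node-c"]
    user = {"username": "alice", "first": "Alice", "last": "Liddell"}
    point = hyperspace_point(user, ["username", "first", "last"])
    print(point, "->", responsible_server(point, servers))

    # A search that fixes only last == "Liddell" pins one coordinate, so it only
    # needs to contact the servers whose regions intersect that hyperplane.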

Agreed with you fully that value-dependent chains are neat. They allow the
system to replicate and relocate data, without any need for background
processes. VDCs are the key to HyperDex's strong consistency guarantees.
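
Here is a rough, self-contained sketch of how a value-dependent chain might be
ordered; the placement function and the chain length are simplifications of my
own, not the paper's protocol:

    def region_server(point, servers):
        # Toy placement: derive an owner from the point's coordinates.
        return servers[sum(point) % len(servers)]

    def value_dependent_chain(old_point, new_point, servers):
        # Chain the servers responsible for the object's old region and its new
        # region in a fixed order, dropping duplicates, so every replica applies
        # the update in the same deterministic sequence.
        chain = []
        for point in (old_point, new_point):
            owner = region_server(point, servers)
            if owner not in chain:
                chain.append(owner)
        return chain

    servers = ["node-a", "node-b", "node-c"]
    old_point, new_point = (1, 0, 3), (1, 0, 2)   # the update changed one attribute
    for hop in value_dependent_chain(old_point, new_point, servers):
        print("forward update to", hop)
    # Only the final hop acknowledges the write, so a read routed to the tail
    # never observes a half-propagated value.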

------
nateberkopec
NoSQL/SQL trolling aside, doing data lookup in a hyperspace is such a great
idea. Being able to slice my data on hyperplanes is such a useful and cool
feature.

~~~
jandrewrogers
In fact, database theory was originally conceived as a hyper-dimensional
metric space in which both data and selections could be mapped to
hyper-rectangles in the space. A hyper-plane or a point in the metric space is
just a degenerate hyper-rectangle with the same dimensionality as the metric
space. It has always been the most elegant way to represent a database.

One of the best-written expositions of databases properly expressed as
operations over hyper-dimensional spaces was produced by Rudolf Bayer, the guy
who invented the B-Tree. In the late 1990s he invented a multidimensional
indexing structure based on space-filling curves, called the UB-Tree, and
wrote at length about how various database operations are implemented using
that representation. It is generally informative if you are unfamiliar with
this aspect of database theory, not just in the context of the UB-Tree about
which it was written.
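
To make the space-filling-curve idea concrete, here is a toy Z-order (Morton)
encoding in Python; the bit width and the example columns are illustrative,
not the UB-Tree's actual key format:

    def morton_key(coords, bits=8):
        # Interleave the bits of each coordinate into a single scalar key.
        # Points close together in the multidimensional space tend to land on
        # nearby keys, so a rectangle query decomposes into a small number of
        # one-dimensional key ranges.
        key = 0
        for bit in reversed(range(bits)):
            for c in coords:
                key = (key << 1) | ((c >> bit) & 1)
        return key

    # A 2-D "table" keyed on (age, salary_bucket): sorting rows by the curve
    # clusters rows that a two-column range predicate would select together.
    rows = [(20, 3), (21, 3), (20, 4), (90, 7)]
    for row in sorted(rows, key=morton_key):
        print(row, "->", morton_key(row))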

If it is such a great idea, then why does no major database implement things
this way? Despite several attempts by companies like Oracle, IBM, and
Microsoft, no one has described a generalized data structure for databases
with hyper-rectangle operands as its primitives. There are dozens of narrow
algorithmic solutions known, both published and unpublished, but none that you
could legitimately use in a commercial database system, because they all have
limitations that will adversely affect real-world applications.

HyperDex is a conventional algorithm from the standpoint of indexing hyper-
dimensional spaces. They are not doing anything there that has not been done
before. However, the value-dependent chaining element of it is actually pretty
neat.

~~~
egs
I need to add two points to this insightful discussion:

There is a big difference between space-filling curves and hyperspace hashing.
Space-filling curves map a multidimensional space to a single path through
that space that is then mapped to nodes. In the process, they do not retain
locality. To our knowledge, hyperspace hashing is a direct intellectual
descendant of consistent hashing and has not been done before. If you have
pointers to work where data is mapped to nodes in a cluster using a
multidimensional hash, please send them to us!

And one major reason why multidimensional databases failed to take off is a
problem known as "the curse of dimensionality." If you implement a
multidimensional representation naively, highly dimensional data (say, an
object with 10-20 attributes) will require a large number of nodes to be
efficient. HyperDex solves this through something called space partitioning (I
think the paper calls it "data partitioning," but we've changed the name to be
a bit more descriptive). The resulting subspaces are kind of analogous to
materialized views, very loosely speaking.
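
A back-of-the-envelope illustration of the effect, assuming each axis is split
into four buckets (the numbers are mine, purely for scale, not from the
paper):

    BUCKETS = 4   # partitions per attribute axis (illustrative)

    def regions(num_attributes, buckets=BUCKETS):
        # Number of distinct grid cells a naive hyperspace would need to
        # spread across the cluster.
        return buckets ** num_attributes

    # One hyperspace over 16 attributes: ~4.3 billion regions, far more than
    # any realistic number of nodes.
    print(regions(16))                      # 4294967296

    # The same 16 attributes split into four 4-attribute subspaces: each
    # subspace needs only 256 regions, and a query is routed to whichever
    # subspace indexes the attributes it mentions.
    print([regions(4) for _ in range(4)])   # [256, 256, 256, 256]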

Agreed completely that hyperspace hashing comes into its own when coupled with
value-dependent chaining!

~~~
jandrewrogers
While there is a distinction between space-filling curves and what you are
calling "hyperspace hashing", they are theoretical siblings, and the
conceptual gap between the two is very small; they are arguably differing
interpretations of the same abstract construct. The "hash" interpretation was
well known when the space-filling curve indexing literature of the late 1980s
was written, though the authors were primarily interested in access methods
for sequential storage.

With respect to "hyperspace hashing", it has been the standard way of
interacting with large space decompositions for decades. The mathematics is
from the 1960s (Morton et al) and the practical computer science was mature by
the 1980s. When I was writing parallel codes for supercomputers, virtually all
of our data structures were built this way. It is also the algorithm used for
some large-scale geospatial systems; some of the Google Earth backends used it
back in the day. While I've never personally worked on it, I've heard some
graphics processing pipelines also use the algorithm. I first learned about it
ten years ago because I ran into a couple systems using it.

I had never heard it called "hashing" until recently, largely because,
technically, the algorithms are not hashes, even if they are used as such in
many implementations. This, and the fact that most of the literature on the
idea significantly pre-dates electronic publishing, probably explains why
you've seen so few references to hyper-dimensional hashing.

As you point out, the curse of dimensionality is a classic problem for these
structures. The partitioning technique in the paper works for some use cases
but is not used as much anymore. There have been some clever and fairly
general attacks on the dimensionality problem over the last few years by a
couple companies but I have not seen anything published.

All that said, this general approach is overdue for large-scale distributed
systems. It has been proven out in niche markets like supercomputing for
decades and is superior to distributed hash tables for many purposes. However,
most of the recent research does not seem to be published.

------
peterwwillis
<https://www.youtube.com/watch?v=kTf-iAWfvS4>

I think it's weird when people believe there's a tool that will do their job
for them, like a hammer that builds a roof by itself.

I'm sure HyperDex is totally useful for some cases, but it has clear
disadvantages when you try to use it for something it wasn't intended for
(like global HP databases). All of a sudden you find yourself building glue to
make it fit with your hybrid architecture. Instead, you could take something
simple, customize it, and build a huge, successful business off of it, like
the biggest sites in the world currently do with various tools that weren't
engineered to solve simple problems like the number of round trips needed to
look up an object.

