

HyperDex: A Searchable Distributed Key-Value Store - lpgauth
http://hyperdex.org/performance/

======
ericmoritz
From the FAQ <http://hyperdex.org/faq/>:

"So, the CAP Theorem says that you can only have one of C, A, and P. Which are
you sacrificing?

HyperDex is designed to operate within a single datacenter. The CAP Theorem
holds only for asynchronous environments, and well-administered datacenters
enable us to sidestep this tradeoff entirely."

I'd like to see how they pull that off when a node goes down. I guess in
"well-administered" data centers, nodes don't go down.

Sounds like they're sacrificing "A" to me because they're doing synchronous
replication.

~~~
snewman
It's a bit of a middle ground. Yes, the replication is synchronous, which
impacts availability. However, the master can remove a failed replica from the
chain fairly quickly. In principle, with proper tuning, a node failure would
merely cause a brief hiccup. This would feel more like a period of increased
latency than a full-blown outage. So there really needn't be much sacrifice of
availability.

However, there's also a sacrifice of partition tolerance. If the master is
unable to communicate with any replica, the system can't serve requests. Also,
the master is implemented as a collection of Paxos nodes; if these nodes are
partitioned from one another, the entire system would grind to a halt.

Since this is intended for intra-datacenter use, one could argue that a full
network partition might be unlikely. (Depending on what sort of data center
you hang out in.) But in CAP terms, it's possible, of course.

(I base all this on the value-dependent chaining paper cited below.)
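The failover argument can be made concrete with a toy model. Nothing below is HyperDex's actual code; it only sketches why a node failure under chain replication can look like a brief hiccup rather than an outage:

```python
# Toy model of chain replication with failure handling. Writes traverse
# the chain; the master excises dead replicas so traffic can resume.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True

class Chain:
    def __init__(self, replicas):
        self.replicas = replicas

    def remove_failed(self):
        # The master's job: drop dead replicas so the chain keeps moving.
        self.replicas = [r for r in self.replicas if r.alive]

    def put(self, key, value):
        # Synchronous replication: acknowledged only after every replica
        # in the current chain has applied the write.
        for r in self.replicas:
            if not r.alive:
                raise RuntimeError("chain stalled on " + r.name)
            r.store[key] = value

    def get(self, key):
        # Reads go to the tail, which holds only fully-replicated writes.
        return self.replicas[-1].store[key]

chain = Chain([Replica("head"), Replica("mid"), Replica("tail")])
chain.put("k", "v1")
chain.replicas[1].alive = False   # a node fails...
chain.remove_failed()             # ...the master reconfigures the chain...
chain.put("k", "v2")              # ...and writes resume: a brief "hiccup"
```

Until `remove_failed` runs, writes stall; that stall is the window of increased latency described above.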

~~~
nicktelford
This sounds like a CP system to me. There's nothing wrong with that, btw; I
don't know why people are so reluctant to admit this.

AP systems have some useful properties, but they're also (typically) more
difficult to reason about.

The "hiccups" you describe are periods of unavailability. The increased
latency is caused by an element of the system waiting for the data to become
available again, a totally valid strategy for coping with transient
failures/partitions.

Your argument about intra-datacenter partitions being unlikely is true, but
they do happen. You also make a good point about such partitions affecting
client applications. Both of these are indicative of CP systems and, like I
said: there's nothing wrong with that.

Personally, I think both AP and CP distributed systems are equally
interesting. What I consider a red flag is attempting to rationalize how a
system "beats CAP".

~~~
snewman
To be precise, I would argue this is actually a CA system.

"C" because there are consistency guarantees, which are upheld even in the
face of failures / partition.

"A" because the system will continue making progress even after a node
failure. What I called a "hiccup" can be made arbitrarily short, in principle
at least. The system can work around failed nodes, it does not need to wait
for them to be repaired.

Not "P", because if the network falls apart sufficiently, the system will
generally not be able to progress.

~~~
rbranson
There is no CA.

You cannot guarantee consistency and availability simultaneously in the face
of network partitions. Once the line of communications is cut or overloaded
(slow enough = a partition), you have to pick one or the other. It's basic
physics.

If two entities can't communicate, they can't synchronize state, so one (or
both) of them have to quit acting like they have a consistent view of the
data.

~~~
the_paul
It's not exactly clear from the paper what CA should mean.

I've seen people claim it means "you guarantee both consistency and
availability, as long as there are _no_ network partitions (you don't have to
handle those because you haven't chosen P)". That's a supportable claim. So
you can do that, but it's kind of a useless choice, because as long as there
are no network partitions, both CP and AP systems can _also_ guarantee full
consistency and availability.

I lay the CAP family out thus:

* CP: on network partition, lose availability

* AP: on network partition, lose consistency

* CA: on network partition, lose both

It seems to be a common thread among distributed systems engineers who claim
to have beaten CAP: "network partitions don't matter for whatever reason, so
therefore I can always guarantee both availability and consistency and so CAP
must be wrong yaaay!!"

Sorry, no, nice try.

------
jandrewrogers
This is a pretty nice implementation of an old design concept that has merit
for modern distributed problems. It is worth studying as a model for
distributed systems.

I am aware of a couple commercial distributed systems that use it (not this
implementation but the underlying algorithm). Several organizations seem to
have reinvented it over the last five years.

Organizing data in this way was studied in the 1980s but was poorly suited to
the computing systems of the time. Relatively little was written about it
because it was viewed as a dead end and modern literature has all but
forgotten about it. Most of the research was done by companies rather than
academics. Designs like this have fallen into the black hole of "if it is not
on the Internet then it doesn't exist". Back when I was studying these models
I found more crusty old patents related to these types of models than relevant
papers on the Internet. It will be valuable to have some modern literature
pertaining to these designs.

------
stevelosh
For those that want to take a quick look at the source code without physically
cloning it, there's a GitHub link buried somewhere in the site (I forget where
I found it): <https://github.com/rescrv/HyperDex>

Their Python client seems to be using Cython for extra speed:
<https://github.com/rescrv/HyperDex/blob/master/hyperclient/python/hyperclient.pyx>

------
ypcx
I'm looking for a good paper or a book detailing patterns or strategies of
storing complex data structures and relations, or designing complex schemas on
top of key/value stores, possibly with real-world use cases / success stories.
A bonus would be the same on top of key/value/range(search) stores, just like
HyperDex is. Any suggestions HN?

Technically, it should even be possible to implement SQL on top of a store like
HyperDex (there's the Tenzing paper from Google). But I'm not looking
specifically at SQL; I'm after a more general scope.
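One widely used pattern for this kind of schema-on-KV design: store the object under its primary key and maintain index entries under derived keys by hand. A sketch, with a plain dict standing in for the KV store (key formats and field names are illustrative, not from any particular system):

```python
# Hand-rolled secondary index on a plain key/value store.
import json

kv = {}  # stand-in for the remote KV store

def put_user(user_id, user):
    old = kv.get("user:" + user_id)
    if old is not None:
        # Drop the stale index entry before overwriting the object.
        kv.pop("idx:email:" + json.loads(old)["email"], None)
    kv["user:" + user_id] = json.dumps(user)
    kv["idx:email:" + user["email"]] = user_id  # the "index" entry

def get_user_by_email(email):
    user_id = kv.get("idx:email:" + email)
    if user_id is None:
        return None
    return json.loads(kv["user:" + user_id])

put_user("42", {"name": "Ann", "email": "ann@example.com"})
```

Note that without transactions the object and its index entry can diverge if a client crashes between the two writes; that is exactly the kind of anomaly a natively searchable store like HyperDex avoids.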

------
jdf
It looks like they trade the ability to scan ranges of keys for the ability to
get single objects via multiple attributes. The value-dependent stuff is also
a neat way of solving the consistency issues with multiple node updates.
Interesting stuff.

If you were to swap this in for your Cassandra cluster, you'd lose
multi-datacenter replication. Although partitions within a data center are
pretty rare, you'd also lose some availability there as well. However,
Cassandra is usually hash-partitioned, so it needs to do a broadcast for a scan
(AFAICT, it even needs to broadcast for a lookup on a single secondary
attribute), so you'd probably gain quite a bit of performance with HyperDex.
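The intuition behind that gain, heavily simplified from the paper (the grid size and attribute names here are made up): each attribute is hashed onto its own axis of a hyperspace, so a query that fixes one attribute only visits one row of the grid instead of broadcasting to every node.

```python
# Two attributes, a 4x4 grid of "nodes".
BUCKETS = 4

def cell(first, city):
    # The cell (node) responsible for an object with these attributes.
    return (hash(first) % BUCKETS, hash(city) % BUCKETS)

def nodes_for_search(first=None, city=None):
    # Unspecified attributes leave their axis unconstrained.
    xs = [hash(first) % BUCKETS] if first is not None else range(BUCKETS)
    ys = [hash(city) % BUCKETS] if city is not None else range(BUCKETS)
    return {(x, y) for x in xs for y in ys}

# Fixing one of two attributes restricts the search to 4 of 16 nodes;
# a purely hash-partitioned store would have to ask all 16.
```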

I can't tell if it's possible to dynamically change the set of secondary
attributes being indexed without rebuilding the entire data set. Or how value-
chaining works with missing attributes.

Also, apparently consistency has some... gaps... when you search via a
secondary attribute:

"The searches are not strongly consistent with concurrently modified objects
because there is a small window of time during which a client may observe
inconsistency."

------
jfager
I'm reading through the paper, and I'm curious if I understand the
implications of key subspacing and value dependent chaining correctly - do all
reads and writes for a given key get forced to a single node? I understand how
the replication that's described allows for failover when the point-leader
fails, but does it also allow for scaling writes and key lookups?

~~~
rescrv
Reads and writes for a given key always go to a single node. As you add more
machines with the same workload, you are handling fewer keys per machine. The
flip side is, if the number of keys per machine stays constant, more machines
directly translates into the ability to handle more keys.

HyperDex scales linearly as you can see in our scalability graph.
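The scaling claim is easy to picture with a toy key-to-node mapping (plain modular hashing here, purely illustrative; real systems use a smarter, stabler mapping):

```python
# Every key has exactly one owner, so adding machines shrinks the
# per-machine key count roughly proportionally.
from collections import Counter

def owner(key, num_nodes):
    return hash(key) % num_nodes

def heaviest_node(num_keys, num_nodes):
    # Count keys per node and return the busiest node's load.
    counts = Counter(owner("key-%d" % i, num_nodes) for i in range(num_keys))
    return max(counts.values())

# heaviest_node(100000, 8) is roughly half of heaviest_node(100000, 4)
```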

~~~
jfager
I think you're missing the point of the question. I'm not asking how you scale
in general, I'm asking whether your replication scheme is only for failover or
if it contributes to your scaling story - i.e., if a given key goes hot, does
everything have to run through the point leader, or can each replica take
writes/serve requests? It sounds like the former, which is fine, just wanted
to clarify.

~~~
rescrv
It's the former. I'd rather not sacrifice our strong consistency by using
replicas to serve GET requests.

------
NOD507
How does it compare to elasticsearch?

~~~
k-a-r
Elasticsearch is a full-text search solution, not a K/V database.

~~~
erickt
Technically yes, but you can use elasticsearch as a distributed key-value
store. I use it in production and it works amazingly well for me. Yes,
searches may return stale data (one second old by default) but that is not
much of a problem in my app.

~~~
gregholmberg
That sounds both interesting and non-intuitive. Can you go into more detail
about how you have been using ElasticSearch as a k/v store?

EDIT: Color me surprised. I didn't see it mentioned at all in the
ElasticSearch docs or in their Github repo, but apparently this is a fairly
popular thing to do.

"Elastic search as a database?"
<http://stackoverflow.com/questions/6636508/elastic-search-as-a-database>

"ElasticSearch, datastore for user and social data?"
<http://stackoverflow.com/questions/8190288/elasticsearch-datastore-for-user-and-social-data>

"Elasticsearch and NoSql database"
<http://stackoverflow.com/questions/8026880/elasticsearch-and-nosql-database>

------
xxqs
Did you compare the performance with (local) BerkeleyDB?

Also, is there a locking mechanism?

I'm the author of Torrus (torrus.org), and BerkeleyDB stability and non-
network nature are quite painful. But I'm relying on its speed, concurrent
locking, and some ways to acquire an exclusive lock on a table. It would be
interesting to offer an alternative backend for torrus.

~~~
rescrv
I'm not sure exactly what you're looking for, but Berkeley DB is embedded in
your application on a single host, while HyperDex is distributed. It sounds
like that's exactly what you want.

What do you use exclusive locks for?

~~~
xxqs
I use exclusive locks where multiple processes compete for data access, and
these locks guarantee data consistency.

For example, the GUI engine may eventually start cleaning its old cache, and
the exclusive lock prevents other GUI processes (which are mod_perl or fcgi
processes) from doing the same.

Another example: when the database is big enough and multiple collector
processes start at the same time, I use the exclusive lock to let them
initialize and slurp the data sequentially; this prevents excessive disk
seeks.

But as you mentioned in another thread, your solution is locked to a single
architecture, and I cannot afford that.

------
simonw
Is it open source? I browsed around the site and FAQ and didn't spot a mention
of the license.

~~~
jandrewrogers
The code is open source. However, they applied for patents on the algorithm:

<http://www.wipo.int/patentscope/search/en/WO2011137189>

~~~
ypcx
While waiting for the answer, what does this generally mean if the patent is
granted? Since the code is open-sourced under BSD, can the algorithm be used
in other open-source BSD-licensed products, and can corporations then use
these products to store data, without needing to pay the royalties on the
patent?

~~~
bryanlarsen
The BSD license does not contain a patent grant, so it's technically possible
that you'll have to pay to use this software in the future even though it's
under the BSD license. I'd really like to see a switch to the Apache license,
which does contain a patent grant.

~~~
ypcx
So if I release my code under Apache license, and later it is found that it
happens to infringe on a patent, am I liable to pay the damages on behalf of
anyone using my code and getting sued based on the patent infringement?

Or, to put it another way - what is the best "free" license for a developer to
release his/her code under? I don't really even care about mentioning my name
in derivative works.

~~~
bryanlarsen
I'm no expert, but as far as I'm aware the Apache license grants the user a
license for your relevant patents. The GPL3 goes a step further and grants the
user a license for any patents you have a license for, whether it's your own or
one you licensed from somebody else. Since neither situation likely applies in
your case, the BSD and Apache licenses are basically functionally identical.
However, by choosing the Apache license you are telling potential users that
you don't have any "sleeper" patents, so you might get better uptake.

If somebody later asserts a patent against your code, both you and your users
are screwed no matter what license you choose. But the choice of license might
influence who might help you out if that happens...

------
DEinspanjer
I think maybe only certain versions of Ubuntu are supported? It won't install
because I don't have a repo that provides libgoogle-glog0

~~~
rescrv
The packages are built for Ubuntu 11.10. What version of Ubuntu are you
running?

~~~
DEinspanjer
Thanks for letting me know the target version. I am running Ubuntu 10.04.4 LTS.

------
PanMan
Since they compare it with redis, I wonder if this can handle data bigger than
memory. The other properties seem nice :)

~~~
rescrv
We compared it to Redis because it is one of the many key-value/document-database
systems providing the high-throughput, low-latency properties that make it
comparable to HyperDex. We are expanding the set of systems we benchmark
against, aiming to compare with as many as possible.

HyperDex is designed for bigger-than-memory data.

~~~
antirez
The comparison with Redis is a bit strange at first glance. What I mean is: if
in Redis I replace the actual fetching of data with "return 'foo'" at every
query, it still returns 150k requests per second for every operation.
Memcached will do the same _per single core_. So either your code has some
good trick to improve speed in the networking layer, or you are comparing
multiple cores/nodes against a single core/node.

About benchmark "E", this is a primitive that Redis lacks, so it is in my
opinion a bit misleading.

Another question: in the benchmarks with Redis, does the data set also fit
inside RAM (or the OS cache) for HyperDex? In that case you are comparing two
in-memory databases in practical terms, yet at the same time you say that
HyperDex is suitable for datasets bigger than RAM (in a portion of text _near_
the Redis benchmark). The numbers users will see in actual production setups
will be different if the data set does not fit in RAM.

Finally, the benchmark does not make clear that Redis can perform a lot of work
in a single operation: for example, a variadic LPUSH can insert 10 elements
into a list in roughly the same time a plain LPUSH inserts 1 element. If you
multiply this by N cores in a standalone Linux box you get something like a
billion operations per second, but I would not be comfortable writing that on
the Redis home page. I mean, if you don't specify the methodology _very_ well,
this unfortunately is just marketing material without real interest for the
field.

~~~
rescrv
Both systems were single process per host.

I agree that Redis does not have good support for "E", so anyone looking at
the results should look at A-D,F.

Both systems performed entirely out of main memory to avoid touching disk.

The Redis benchmark results only came in last night. We'll be writing up the
methodology soon. We used the YCSB Redis binding from branch master. I'd be
interested in your opinion on how well the bindings map to what is truly best
for Redis.

~~~
antirez
Redis uses a single thread, not just a single process, so if HyperDex is
multi-threaded you are comparing a single core against multiple cores. As you
can see with memcached, which is able to use multiple cores (a feature Redis is
going to implement soon), this leads to a big performance improvement in this
kind of benchmark.

EDIT: (I checked that HyperDex actually uses threads and multiple cores.) If
you want a fair comparison you should run HyperDex on a single core as well, or
run N instances of Redis (one per core) and write the benchmark so that it uses
all the instances.

I'll check the YCSB Redis bindings, I never looked at them before. Thanks for
the reply.

EDIT2: there are also problems with the YCSB Redis bindings:

1) They basically force an object-store data model on Redis: hashes store the
objects, and every time an object is stored or deleted a sorted set is updated
as well. This is a possible use case of Redis but not a very idiomatic /
representative one.

2) Even for benchmarks not involving searching, the sorted set is updated
anyway.

3) No pipelining is used to alter the object and update the sorted set, so
every operation pays 2x the round-trip time in the Redis bindings.

4) Even worse, there is no pipelining in the "search" operation, so you may pay
the RTT many times over when you do a scan operation with these bindings.
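The cost of points 3 and 4 is easy to quantify with a stand-in client that just counts round trips (this is not real redis-py code, only an illustration):

```python
# A fake connection that counts network round trips.
class FakeConnection:
    def __init__(self):
        self.round_trips = 0

    def execute(self, *commands):
        # One round trip, no matter how many commands are batched into it.
        self.round_trips += 1
        return ["OK"] * len(commands)

conn = FakeConnection()

# As in the YCSB binding: HSET, then ZADD, each paying its own round trip.
for _ in range(100):
    conn.execute(("HSET", "user:1", "f", "v"))
    conn.execute(("ZADD", "users", 1, "user:1"))
unpipelined = conn.round_trips

conn.round_trips = 0
# Pipelined: both commands ride in a single round trip.
for _ in range(100):
    conn.execute(("HSET", "user:1", "f", "v"),
                 ("ZADD", "users", 1, "user:1"))
pipelined = conn.round_trips
```

With a 0.5 ms RTT, that halving of round trips alone doubles the attainable per-client throughput.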

The minimal change to the YCSB bindings is to use pipelining where possible
(almost always, actually). Still, I think an intermediate layer that turns
Redis into an automatically-indexing object store does not make sense. Another
big problem is that you are comparing multi-core vs single-core.

So if you really are interested in a comparison between HyperDex and Redis, you
should pick a use case and model it with the best tools of both databases.

What I would recommend is the following set of use cases.

1) Populate the two DBs with 10 million hashes, then write a benchmark where 50
clients simultaneously get and set specific fields.

2) Like "1", but increment a field by 10 at every write query.

3) Simulate a leaderboard where 50 clients simultaneously update the scores of
different "players" in one operation, and ask for the top 10 users in the
leaderboard with another operation.

4) Simulate a capped collection where you always add the latest news on a web
site and can fetch the top 10 to show on the home page. Each of the 50 clients
should write a single item and fetch the top 10 items.

And so forth. Always use 10 million objects and mixed reads and writes with 50
clients at the same time. Write the best code for both DBs (I can help with
Redis).

This time you are truly comparing the two DBs in a real world scenario.
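For reference, use case 3 maps onto a single ZINCRBY per update and one ZREVRANGE for the read in real Redis; a plain dict models the sorted set below so the sketch runs anywhere (player names and scores are made up):

```python
# In-memory model of the leaderboard use case.
scores = {}

def zincrby(player, delta):
    # Redis equivalent: ZINCRBY leaderboard <delta> <player>
    scores[player] = scores.get(player, 0) + delta

def top(n):
    # Redis equivalent: ZREVRANGE leaderboard 0 n-1 WITHSCORES
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]

# 50 simulated updates spread across 7 players.
for i in range(50):
    zincrby("player-%d" % (i % 7), i)
```

The point of modeling it this way is that each update and each top-10 read is one server-side operation, which is the idiomatic Redis shape rather than the object-store shape the YCSB binding imposes.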

~~~
icanhazdata
I think YCSB is meant so that experts in each technology can tweak it for
their system. So, it looks to me like you're in a great place to make that
happen, and I'd love to see it!

Also, it's extremely difficult to get a 4x speedup on 4 cores, so I'll believe
that argument when I see it. It seems to me that with the current
implementation of Redis, you'd run into some serious problems with memory
management if you ran 4 Redis nodes on one 4-core machine.

Also, you say: "This is a possible use case of Redis but not a very idiomatic
/ representative one." Can you elaborate? What is idiomatic? From the front
page of the Redis site: "Redis is an open source, advanced key-value store."

Also, on a slightly different topic, I can't seem to find any real
documentation about the consistency guarantees of Redis, so I thought you
might be able to point me in the right direction. It appears that the
master/slave replication scheme in Redis just backs up onto the slaves
eventually, but the master immediately returns. Is that true? If a master goes
down (disk and all), could a write have been confirmed to the client which is
missing on a slave? What replication protocol does it use? I find the
documentation lacking in this regard and it'd be great if you could point me
to some actual technical specifications. Thanks!

------
pixelcort
This reminds me of <http://xanadu.com/zigzag/> .

~~~
rescrv
HyperDex is different. ZigZag focuses on data visualization. HyperDex provides
you the same key-value interface of other systems, while also providing an
efficient search primitive.

Check out <http://hyperdex.org/tutorial/> for examples of the Python API.

------
jcapote
Is this better than hadoop?

------
xxqs
The FAQ mentions that you target the x86_64 architecture. Does the server work
on i386?

Also, I hope it does proper memory alignment and is endianness-independent?
(Because Mongo sucks at that.)

~~~
rescrv
We use many lock-free data structures which rely heavily upon the x86_64
architecture. Further, the expanded virtual address space enables us to mmap
everything.

All network traffic is packed, and in network byte order.
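For readers unfamiliar with the term: network byte order just means big-endian on the wire. An illustration with Python's struct module (the message layout is invented for illustration, not HyperDex's wire format):

```python
# struct's "!" prefix packs big-endian regardless of the host CPU.
import struct

msg = struct.pack("!IHH", 0xDEADBEEF, 7, 42)   # uint32 + two uint16s
magic, opcode, length = struct.unpack("!IHH", msg)
# The most significant byte of 0xDEADBEEF travels first, so the message
# starts with the bytes de ad be ef on any architecture.
```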

~~~
otterley
> the expanded virtual address space enables us to mmap everything

This sounds like a disaster waiting to happen when a node's working set is
larger than RAM. Have you considered the impact on performance due to
excessive I/O resulting from this kind of overcommit?

Redis performance, for example, suffers tremendously when its database size
exceeds the available RAM, which is why the authors advise implementors to cap
its database size. The only real difference I can see here is that Redis
allocates its storage anonymously, while Hyperdex uses file-backed pages; in
either case, performance under memory pressure will largely be governed by how
the VM chooses to move pages in and out of its backing store -- behavior the
application has no control over.

Optimizing the performance of working sets larger than RAM is _hard_. Redis
had a (thankfully) aborted attempt to do so (the VM is gone in the most recent
stable release); and the InnoDB buffer pool in MySQL has been refined for many
years and is subject to quite a bit of tuning for specific workloads. (See
also <http://blog.kennejima.com/post/1226487020/thoughts-on-redis> for some
thoughtful discussion on the subject.)

You claim on the one hand that Hyperdex was designed to work on data sets
larger than RAM, but on the other hand you admit that all the benchmarks were
performed with a working set smaller than RAM. I'd like to see comparative
benchmarks where the working set is larger than RAM to be convinced.

~~~
lobster_johnson
> This sounds like a disaster waiting to happen

Well, define "disaster". A memory-mapped dataset will cause pathological
performance only when there is thrashing: if you are accessing a small
percentage of the entire dataset, then unused data will be paged out and remain
paged out. If the bulk of data being accessed exceeds the amount of available
physical RAM, then you will get I/O thrashing.

Memory-mapping is a good alternative to static allocation or a home-grown
paging system because it lets the kernel handle the dynamics of allocation,
letting your application transparently and gracefully handle RAM-tension
situations by relinquishing memory space to other apps. Kernels (including
Linux and Windows) and CPUs are extremely efficient at paging I/O, much more
efficient than a hand-written paging system, because there is no need for the
application's code to check whether a page is in physical memory; that is
handled by the CPU itself.

Of course, any I/O incurred by a too-large dataset will drastically reduce
performance compared to in-memory speed. But paging in itself does not
necessarily lead to "disaster".
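A minimal illustration of file-backed mapping in Python (nothing HyperDex-specific): the kernel faults pages in on access and writes dirty pages back, with no explicit paging logic in the application.

```python
# Create a one-page file, map it, write through the mapping, read back.
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)   # one page of backing file
m = mmap.mmap(fd, 4096)        # file-backed (not anonymous) mapping
m[0:5] = b"hello"              # touching the page faults it in
m.flush()                      # dirty page written back to the file

with open(path, "rb") as f:
    data = f.read(5)

m.close()
os.close(fd)
os.unlink(path)
```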

~~~
otterley
Did you read my entire comment? I specifically said, "when the working set is
larger than RAM."

Your observations, while true in the abstract, don't reflect real-world
behavior under these conditions.

~~~
icanhazdata
A working set larger than RAM doesn't imply thrashing. It all depends on the
page replacement algorithm employed by the OS and the applications that are
running.

------
nirvana
If you're building a distributed KV store, you should benchmark against other
distributed KV stores. Mongo and Cassandra aren't really. But Riak is.

Plus, since it's "distributed", here's the benchmark I'd like to see:

1. Set up a cluster of 8 nodes. Set data replication to 3.

2. Load 3TB of data into the cluster, across 1M documents (or some data set of
that order).

3. Run your tests. Optimize each of the DBs for the best way to access them
(e.g. link walking vs. map reduce on Riak if that's faster, or secondary
indexes if that's faster, or Riak Search if that's faster; there are many ways
to search Riak).

4. Throw out the results of #3.

5. Pull the plug on 2 nodes, i.e. shut them down completely, no longer on the
network at all. Poof, gone. Pick the two nodes by rolling the dice. If you have
SPOFs, and a bad roll of the dice would have brought the whole cluster down,
remove the word "distributed" from your marketing.

6. Let the cluster sit for 30 minutes.

7. Run your benchmarks.

The benchmarks developed in #7 are the ones I want to see.

I have no clue how HyperDex would perform in this situation. It could kick
Riak's butt (but then, small-cluster performance is just one of the criteria
that is important to me). I just wish people did benchmarks like this (though I
know it's a PITA to do it this way).

~~~
foobarbazetc
Why would a competitor build this Riak benchmark? How are they supposed to
know which Riak configuration performs best?

That's the point of YCSB. Each vendor can submit the optimal configuration for
their system, and they all run the same benchmark.

At the end of the day, each vendor is going to publish benchmarks that show
their system performing better than all others.

It's your job, not theirs, to verify those benchmarks for your particular work
load.

Edit: Also, while Mongo wouldn't classify as a distributed store, Cassandra
definitely would. In fact, it's more "distributed" than Riak, since you have
to pay Basho for multiple DC support (unless that's in the open source
version?), whereas Cassandra is completely free.

~~~
3pt14159
If things haven't changed in the past year, you are right about not being able
to do replicated multi-DC support in the open source version.

------
jsavimbi
Has it been proven in a pr0n environment? If not, I'll wait and see.

------
jbverschoor
Interesting.. keep it coming.

We need a unified nosql language.. Basically what SQL is.

~~~
bickfordb
What's wrong with SQL?

~~~
wickedchicken
* COBOL-inspired syntax that is neither human- nor machine-parseable (easily).

* Strange nested query rules

* Language encourages Cartesian products for virtually every non-trivial calculation. This makes join order crucial for reasoning about performance.

* Humans don't have the brain capacity to reason about these joins, forcing the logic onto the query planner.

* Humans don't have the brain capacity to reason about the query planner reasoning about the joins.

* End result: bizarro-English syntax that's supremely hacked up to tweak the query planner into doing your bidding.

~~~
Craiggybear
What? SQL is virtually _plain English_ and is efficient across a range of
database architectures for moderate-to-complex join situations. It's just an
implementation of relational algebra with a thin abstraction layer that uses
words like SELECT, JOIN, FROM, WHERE, etc.

If you have performance issues, it's invariably because the structure of your
data is less than optimal, you have many linked external tables, or you are
_really_ bad at efficient query design. Even highly experienced SQL-ers look at
old queries and realise they could be three times faster and half as big; if
there is a performance issue, that's the first thing they'll review.

Humans have been 'understanding' SQL very well for nearly forty years --
without modern refinements like graphical query designers, either.

Not all data is a suitable candidate for an RDBMS solution, I am the first to
agree, but there's not actually much wrong with SQL when what you need _is_ a
relational table store.

There are a few bad implementations of it, though, it is true, and a lot of
people who can't seem to grasp SQL for what it is.

Edit: I am myself a _huge_ fan of NoSQL solutions and always seek to rationally
justify a non-relational datastore where _practical_ or _possible_. But NoSQL
isn't a replacement for SQL (with all its imperfections) for the kind of
slicing and dicing where a relational store _is needed_.

~~~
lobster_johnson
SQL is a very, very poor implementation of relational algebra. For example,
the output of an SQL query is not even a valid relation variable (in E. F.
Codd's terminology).

If you want to look at how SQL might have looked if it had been relational,
take a look at "Tutorial D", the language proposed in a book called _The Third
Manifesto_ , by Hugh Darwen and C. J. Date: <http://thethirdmanifesto.com/>.
There are several experimental implementations.

