

RethinkDB performance data - coffeemug
http://www.rethinkdb.com/blog/2009/08/rethinkdb-performance-data/

======
mattj
So what are the numbers for selects while writes are happening? It's really
suspicious to show 'selects with no writers.' In terms of transparency, it's
nice to have benchmarks that don't show you winning everything (makes me trust
that you actually know what you're doing, rather than are just optimizing a
few benchmark-able cases).

Also, include the actual queries you are executing. Inserting into 2 columns
with only a primary key index is a completely different beast than 30 columns
with uniq key constraints and multiple multi-key indexes.
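mattj's point can be sketched with SQLite as a stand-in (purely illustrative - the thread's actual MySQL schemas and queries aren't shown in the post): a two-column table with only a primary key maintains one btree per insert, while a 30-column table with a unique constraint and secondary indexes pays for every index it maintains.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Narrow table: two columns, primary key only - one btree to update per insert.
cur.execute("CREATE TABLE narrow (id INTEGER PRIMARY KEY, val TEXT)")

# Wide table: 30 extra columns, a unique constraint, and two secondary indexes.
cols = ", ".join(f"c{i} TEXT" for i in range(30))
cur.execute(f"CREATE TABLE wide (id INTEGER PRIMARY KEY, {cols}, UNIQUE(c0, c1))")
cur.execute("CREATE INDEX idx_a ON wide (c2, c3)")
cur.execute("CREATE INDEX idx_b ON wide (c4)")

N = 10_000

start = time.perf_counter()
cur.executemany("INSERT INTO narrow VALUES (?, ?)",
                ((i, f"v{i}") for i in range(N)))
narrow_t = time.perf_counter() - start

placeholders = ", ".join("?" * 31)  # id plus 30 columns
start = time.perf_counter()
cur.executemany(f"INSERT INTO wide VALUES ({placeholders})",
                ((i, *[f"v{i}-{j}" for j in range(30)]) for i in range(N)))
wide_t = time.perf_counter() - start

print(f"narrow: {N / narrow_t:,.0f} rows/s, wide: {N / wide_t:,.0f} rows/s")
```

The absolute numbers are meaningless here; the point is that the two workloads are not comparable benchmarks.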

~~~
coffeemug
There are a lot of interesting benchmarks to run - updates, deletes, different
concurrent loads, different indexing schemes, a variety of row sizes and
variable column sizes, etc. Right now it's only three of us, and it's only
been a little over two months. We're really looking forward to running all of
these benchmarks (along with TPC and iiBench), but there is only so much we
can do with a limited amount of time.

One thing we're finding is that people are skeptical by default. In
retrospect, we probably should have expected this given the amount of false
advertising, magic bullets, and vaporware out there. We're not offering
magic bullets - we're simply optimizing for modern hardware and modern
workloads as opposed to 1970s hardware and database usage patterns. We'll be
posting more benchmark results and detailed explanations of how RethinkDB
works to win your confidence. We're not trying to trick you. What's the point?
Eventually everyone would find out and the house of cards would collapse
anyway. We just have 24 hours in a day (btw, if anyone has a device to
solve this problem, we'd love to beta test it!).

------
Maro
These numbers are 4x as slow as Keyspace[1] with fast spinning drives as
Keyspace can do 100,000 inserts/sec. Keyspace currently uses BDB as the
backend, so this is basically a statement about BDB performance.

I'm curious: where does MySQL/RethinkDB take this performance hit, or is
this just a totally different benchmark?

Keyspace is a KV-store, but this shouldn't be a big deal, since a rowstore
also inserts into a single btree as long as there are no secondary indices (a
KV store inserts a K+V whereas a rowstore inserts C1+C2..Cn). Keyspace groups
inserts into transactions of size ~1M in the benchmark I'm referring to. Are
you inserting one row per transaction/sync here, or what's the deal?

[1] <http://scalien.com/keyspace>

~~~
coffeemug
The biggest performance hit right now is getting the data into MySQL (the
network stack). We're inserting one row at a time, so most of the time is
spent on the wire. Effectively, it's a different benchmark.

BTW, an average drive rotates at 250 revolutions per second. If you flush on
every transaction, this is roughly how many transactions you'll be able to
sustain. Anything more than that, and you have to batch transactions in
memory.
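For context, 250 revolutions per second corresponds to a 15,000 RPM enterprise drive; a commodity 7,200 RPM drive spins at 120 rev/sec. The bound coffeemug describes is just the rotation rate expressed per second:

```python
def max_synced_tx_per_sec(rpm: int) -> float:
    """Upper bound on flush-per-transaction throughput for a rotational
    drive: each synced commit waits for the platter to come around again
    (ignoring write caches and command queuing)."""
    return rpm / 60.0  # revolutions per minute -> revolutions per second

print(max_synced_tx_per_sec(15_000))  # 250.0 - the drive in the comment above
print(max_synced_tx_per_sec(7_200))   # 120.0 - a commodity desktop drive
```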

~~~
Maro
So it's one network roundtrip + commit per INSERT?

~~~
coffeemug
Yes, that's correct. The benchmark was run on localhost, so it might be going
through loopback, but most of the time is spent in the network stack.

~~~
rarrrrrr
Any difference connecting through a local unix domain socket instead of
localhost (to avoid TCP overhead?)

~~~
leif
We'll be looking at many ways to cut networking overhead over the coming
weeks, including socket communication (though I don't know if MySQL supports
this).

------
hendler
Congrats on the benchmarks.

A couple of questions:

How has RethinkDB been affected by recent SSDs that ship improved firmware
for garbage collection?

Unrelatedly, does RethinkDB use the same clustering strategies as MySQL?

~~~
leif
The new garbage collecting firmware is probably only effective for consumer
use, not enterprise, as these firmware chips (we think) need some downtime to
operate properly. We haven't gotten our hands on them, but we don't think
they'll solve the problems we're solving.

We don't support clustered indexes. Clustered indexes are good on rotational
drives because they allow you to read data as soon as you find an interesting
node in the index tree. Because SSDs don't have slow random access, this is
not necessary, so we don't do it.

------
jhancock
This will be great as soon as hosting providers give the option of "disk space
using SSD". Until then, a solution like this binds you to buying your own
servers. Maybe I'm misreading their web site, but they do list SSD as their
first "feature".

~~~
mmcgrana
SoftLayer offers SSD options on their monthly dedicated servers, in particular
32 & 64GB "Intel SLC SSD" drives.

------
simonista
_times only across the mysql_stmt_execute() call_

I hope this means timing many, many calls and then timing the loop overhead,
subtracting, and dividing by the number of calls made. Clocks just aren't good
enough to time individual calls.
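The methodology simonista describes - time many calls, subtract the measured overhead of an empty loop, divide by the call count - can be sketched in Python (illustrative; the actual tool was an Objective-C program timing mysql_stmt_execute()):

```python
import time

def time_per_call(fn, n=1_000_000):
    """Average per-call time of fn, net of the loop's own overhead."""
    def noop():
        pass

    # Measure the cost of the loop plus an empty call...
    start = time.perf_counter()
    for _ in range(n):
        noop()
    overhead = time.perf_counter() - start

    # ...then the loop around the call we care about.
    start = time.perf_counter()
    for _ in range(n):
        fn()
    total = time.perf_counter() - start

    return (total - overhead) / n

per_call = time_per_call(lambda: sum((1, 2, 3)))
print(f"~{per_call * 1e9:.0f} ns per call")
```

Timing each call individually with a clock read before and after would drown a sub-microsecond operation in clock resolution and syscall overhead, which is the concern raised above.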

------
sethg
I click the link and get "504 Gateway Time-Out".

~~~
coffeemug
Err, all the advanced database work, and we screwed up PHP. We redirected it
to a mirror for the time being.

------
jganetsk
Objective-C? Why?

~~~
bkudria
"We wrote our original benchmarking tool in Python, but during our latest
benchmarks, we noticed that it was taking about as much time as the engine
itself, hiding our real performance numbers. We now have a very small
Objective-C program (<900 lines) that uses prepared statements in a tight
loop, and times only across the mysql_stmt_execute() call."

What's wrong with Objective-C?

~~~
jganetsk
Seems a very arbitrary choice.

Shouldn't they be using standard benchmarks, anyway? It's easy to write your
own benchmark to make something look good.

~~~
coffeemug
We used Objective-C because it has a very simple and efficient threading
model, and we're on a tight schedule. Python wasn't fast enough, and coding
threads in C or C++ would have taken a few hours longer than we had, so we
used Objective-C.

You're absolutely right about standard benchmarks. We haven't engineered our
benchmarking tools to make RethinkDB look good, we simply wrote code to run
the most obvious operations - as many inserts as possible, as many selects as
possible, etc. We will be moving to more commonly accepted benchmarks (such as
TPC and iiBench) soon, but in the meantime it helps to have a simple
benchmarking tool that shows off what RethinkDB can do on common workloads.

~~~
jganetsk
I'm curious. This is an interesting new company, and I'd like to follow it.
I'm very curious about the tools you use.

What is particularly efficient about the threading model? I know very little
about Objective-C. Are you guys very comfortable with it? Does someone
write Mac applications there?

~~~
leif
I wrote the benchmark, and I have never developed on a Mac or with the Cocoa
framework. I started with a C application because I was familiar with the
MySQL C API (specifically, its prepared statements), and that was very fast.
Adding threading in C is difficult, so I took the opportunity to learn how to
use NSThread, and really liked it. The benchmark is mostly C for the nitty-
gritty details, with some Objective-C to tie everything together. I wouldn't
say I'm extremely comfortable with it, but it's very easy to drop straight
down to C for the interesting bits.

------
brandon272
Looks great! Can't wait until it's ready for a production setting.

------
rythie
What about ALTER performance?

