
RethinkDB 1.5 released: secondary indexes, soft durability, perf improvements - josegonzalez
http://www.rethinkdb.com/blog/1.5-release/
======
LeafStorm
> Early in the development of RethinkDB, we made the design decision to be
> extremely conservative about durability and safety of users' data.

I'm glad that the Rethink engineers are taking this approach to development
instead of "let's win all the benchmarks!" Overall, it's apparent that they
are really thinking through the consequences of their decisions and the
tradeoffs in database systems, and that they want to build something that will
last instead of something that will just become popular in the short term.

~~~
jdoliner
Thanks. If we can succeed in being Mongo without the gotchas we'll be very
happy.

------
hendzen
Seems a bit unconventional to manually specify which index you'd like to use
when writing queries. This would normally be something handled by a query
optimizer.

Of course, many DBMSs provide some facility for specifying indices manually
through query hinting, but for non-trivial queries the optimizer can often do
a better job than a programmer at picking the most efficient query plan.

~~~
jdoliner
Hi, I implemented secondary indexes in RethinkDB.

This was an intentional choice we made for a couple of reasons. The most
pragmatic one is that we don't have an optimizer yet, and we thought secondary
indexes could still be useful to a lot of people without one. We've also found
it's always better to ship early and get feedback sooner rather than later.

Another reason is that we're not totally sold on the state of the art in
query optimizers. From what we've seen, they can introduce a number of
non-intuitive behaviors, and it can be hard to tell which operations will use
which index. Query optimizers are nice when they work correctly, but they
don't always work correctly, so for the first release of secondary indexes we
opted to build something with very clear, precise semantics and then build on
it. We will eventually have a query optimizer, but it's a big problem, and we
don't want to just implement what others have because we think we can do
better.
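
Concretely, the flow looks roughly like this with the Python driver (a rough
sketch; the table and field names here are made up):

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # Build a secondary index on the (made-up) 'last_name' field.
    r.table('users').index_create('last_name').run(conn)

    # Queries name the index explicitly -- no optimizer guessing.
    r.table('users').get_all('Smith', index='last_name').run(conn)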

~~~
sergei
A decent cost-based optimizer will be right most of the time, and you have
hints for the cases where it falters.

Your approach simply hides the query optimizer's choices inside the app,
making it more fragile.

Let's say I have two fields: a and b, both of which have an index. And then I
have a query with predicates across both a and b. For example, a = 10 and b =
20.

One of the core facilities within an optimizer is selectivity estimation. By
looking at the statistics, the optimizer will see that a = 10 might look at
10,000 rows while b = 20 might only look at 10. So the optimal and desired
choice would be to use the index on b.

However, the exact same query construct with different parameters (e.g. a = 50
and b = 3) might flip the index selection.

Now let's imagine I had to implement this inside my app. Every time I have
such a constraint, the app has to become aware of selectivity in order to know
which index to use based on the input parameters.
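
To make that concrete, here's a toy sketch (plain Python, with made-up
row-count estimates) of the kind of logic that ends up living in the app when
the database won't pick an index for you:

    # Hypothetical hand-rolled "optimizer" inside the application.
    # These row-count estimates are invented and go stale as data changes.
    ESTIMATED_MATCHES = {
        ('a', 10): 10000, ('a', 50): 5,
        ('b', 20): 10,    ('b', 3):  8000,
    }

    def pick_index(a_value, b_value):
        # Use whichever single-column index we guess is more selective.
        cost_a = ESTIMATED_MATCHES.get(('a', a_value), 1000)
        cost_b = ESTIMATED_MATCHES.get(('b', b_value), 1000)
        return 'a' if cost_a <= cost_b else 'b'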

~~~
btilly
You are underestimating the necessary design tension. If you expose hints,
then you've constrained how your optimizer works because it has to work with
those hints. (Oracle, for example, has been fighting this battle for ages.)
Once you've built an optimizer, people come to depend on it so it is critical
that you get it right.

They will need an optimizer eventually, and they know it. But I'm glad that
they are not implementing it before they are ready.

Secondly, your selectivity estimation point can go either way, and on the
whole I don't like it. I've personally experienced the situation where a
database recomputes statistics, the CBO decides that a common query needs a
new plan, picks a poor one, and the first any human hears about it is when the
site goes down.

The problem is that, for the application, the risk profile of trying to be
smart here is completely backwards. In general, as long as a query is not a
bottleneck, I don't care about making it faster. Oh, you made it 2x faster? I
didn't mind before and I'm unlikely to even know that you did so. But if you
JUST ONCE switch to a bad plan on production without warning, your users WILL
notice, they WILL care, and they WILL NOT be happy.

As a developer, I don't care that you make the right choice 95% of the time. I
want you to make a predictable choice. Because if you're making a bad choice
in development, I've got a window of opportunity to notice and do something
about it where nobody cares. But if you randomly make changes on production,
every single mistake counts.

Oh, but you say that this just means that you need stored query plans? I
agree, and this is an example of why the behavior of the optimizer has to be
thought through very carefully. Otherwise you throw something out there,
people come to depend on it, and then you realize you've put up barriers to
evolving it the way you want.

~~~
sixbrx
This accords with our experience with large Oracle databases as well.

We've had a few panics caused by a DBA updating table statistics while trying
to optimize some query. This would occasionally cause sudden, massive changes
in how other, unrelated queries were executed; those queries would not finish
and sometimes brought down the database.

These experiences forced us to change our procedures and keep the test
database closely in sync with the production data, so we'd know how newly
gathered statistics would affect queries. The database is large enough that
keeping the test environment a full and recent copy of production is pretty
painful. Oracle has since introduced features in 11g that allow pinning query
plans, though we've yet to try them.

~~~
btilly
I wish that databases did "speculative" queries.

In the simplest form, you could say, "I have 2 reasonable plans, let's try A,
and if it takes above time X, then start B in parallel and go with whatever
finishes first."
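
A toy sketch of the simple form (plain Python; run_plan_a and run_plan_b are
stand-ins for executing the two candidate plans, and a real engine would do
this server-side and cancel the loser):

    import concurrent.futures as cf

    def race_plans(run_plan_a, run_plan_b, patience=0.5):
        """Run plan A; if it hasn't finished within `patience` seconds,
        start plan B too and return whichever result arrives first."""
        with cf.ThreadPoolExecutor(max_workers=2) as pool:
            fut_a = pool.submit(run_plan_a)
            try:
                return fut_a.result(timeout=patience)
            except cf.TimeoutError:
                fut_b = pool.submit(run_plan_b)
                done, _ = cf.wait([fut_a, fut_b],
                                  return_when=cf.FIRST_COMPLETED)
                # Note: exiting the pool still waits for the loser to
                # finish; a real implementation would cancel it.
                return done.pop().result()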

You could ramp up the idea to handle changing query plans based on updated
statistics by sending some fraction of queries to one plan, and some to
another. Then keep stats on how that worked out for you.

Basically, never simply flip the switch in production to try a new, unproven
query plan.

Incidentally, Oracle 12c advertises that it will actually collect statistics
during query execution and use them to validate the query plan. If that works,
this problem should get better. But of course that comes with overhead and is
likely to be a complex piece of code, so you tell me whether you trust them to
get it right right away.

~~~
hendzen
I'm sure you'll appreciate the irony of this, but what you described is pretty
much the exact mechanism of the MongoDB query optimizer [0].

[0] - <http://docs.mongodb.org/manual/core/read-operations/#query-optimization>

~~~
jeffdavis
A cost-based optimizer also needs a way to handle the combinatorial explosion
of possible plans. For MongoDB, maybe they can be exhaustive or use a simple
strategy. But for a SQL DBMS, the search space is way too huge, and you need a
way to navigate it intelligently.
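
As a back-of-the-envelope illustration (plain Python): even counting only
left-deep join orders, n tables already give n! candidate plans, before you
consider bushy trees or access-path choices:

    from math import factorial

    # Number of left-deep join orders for n tables.
    for n in (3, 5, 8, 12):
        print(n, factorial(n))   # 6, 120, 40320, 479001600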

This "try the plan out" idea has come up many times, and it's a reasonable
idea, but it's nothing new and not a complete solution. You need a way to
evaluate whether a plan has a reasonable chance of being the best one before
the whole plan is even constructed, much less executed.

Combinatorial explosion is also one of the reasons that humans generally
aren't great at coming up with non-trivial plans, either.

A good optimizer is a combination of many good ideas, so I don't mean to be
discouraging -- just that one idea is not the full story.

If you are worried about the risk of a changing plan, you are between a rock
and a hard place. If you lock the plan down (or hard-code it in the
application), and the data characteristics change, that could be disastrous.
If you don't lock it down, that's a leap of faith that the optimizer will
adapt gracefully.

~~~
hendzen
I understand the issues you've brought up in your post, as well as the
traditional methods of pruning the search space of query plans, such as
left-deep join trees and histograms for estimating selectivity.

My top-level point was that there is no way humans will come up with the
optimal query plan by hand, and like you said even if they do - the data will
change.

As an aside, the reason Mongo can use such a simple but effective strategy is
that it doesn't support joins, which makes calculating the optimal plan much,
much easier. RethinkDB does support joins, and as such a CBO is that much
more important.

------
rdudekul
I love the RethinkDB API. It makes joins possible without going through mapreduce.
Though I am using MongoDB currently, I plan on switching to RethinkDB as soon
as it is more stable and provides better performance. Excited to see the team
making progress on all fronts.

~~~
dude_abides
_It makes joins possible without going through mapreduce._

This is pure gold <https://twitter.com/shit_hn_says> material :)

~~~
RyanZAG
I'm not sure what you think is pure gold here - the statement is correct. When
dealing with big data, joins are the first thing to go as you need to shard
your data. Whether RethinkDB will really solve joins across sharded data is up
in the air, though - ultimately you still need to do filtering per shard and
then reduce the result somewhere, and filtering across two attributes means a
lot of data transfer - but RethinkDB is about the only database I've seen
really making progress in this area.

~~~
wmf
Clustered databases are actually older than mapreduce, hence the hilarity.

------
hoodoof
I like it a lot, but even simple queries are hideously complex. It doesn't
need to be this hard. The RethinkDB developers are exceptionally helpful, but
goodness. Try to do a search returning all items where an array contains a
value. In SimpleDB it's just select * where field='value', even if field is an
array. In RethinkDB I had to ask for developer help and ended up writing a
deep, sophisticated query. Querying has to be dramatically simpler.

~~~
coffeemug
We're actually working on adding commands to ReQL to make queries _a lot_
simpler. (It turns out that just a few commands that are very easy for us to
implement can make an enormous difference.)

If you can, would you mind giving examples of things you'd like to do that are
hard? We'd love to make it easy, and we think we've got most common things
covered in the next release, but if there's something specific you have in
mind we'd really appreciate it.

~~~
hoodoof
The basic stuff needed to let a beginner developer write a web app should be
super, super simple. REALLY simple. No map, no reduce, nothing hard.

Here are some of the basic examples of typical things people might need to do
when writing a web app:

* How do I select records based on a value in an array?

* How would I find all records whose name begins with "star", e.g. 'Star
Trek', 'Star Base', 'Star Wars', 'Star Anise'?

* I have records which contain an array of email addresses. How would I get
all documents which contain one or more of a list of email addresses?

* How would I add a where clause to my queries to ensure, for example, that
the "status" field in my document equals "open"?

* How would I implement paging through a result set?

* How would I implement a count to control the number of records returned
during paging?

* What is the right way to store dates in RethinkDB so that they can be
queried?

* How would I select all documents in a date range?

* How do I do queries on fields that are nested within other fields in my
JSON structure?

* How do I do wildcard searches?

* How do I define which fields are returned from my query, or does it always
return the entire document?

* How do I make sure that I do not add documents to the database that are
missing required fields, thereby breaking all queries that reference those
required fields?

It should be incredibly easy (and well documented) to do all the above.

I wanted to work with ReThinkDB but chose a different database because I was
so intimidated by the query writing.

~~~
coffeemug
Thank you, this is phenomenal feedback!! Most of these are already very
simple, and the ones that aren't will become very simple soon. Michel
(@neumino on github) is working on improving the docs right now, and I will
pass this list on to him. We'll get this documented ASAP.
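
To give a rough sense, a couple of the items above already look something like
this with the Python driver (just a sketch -- the table and field names are
made up, and exact command names may shift as we clean up ReQL):

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # "Where status = 'open'", with paging and a projection:
    (r.table('tickets')
      .filter({'status': 'open'})
      .pluck('id', 'subject')
      .skip(20).limit(10)
      .run(conn))

    # Documents whose 'emails' array contains a given address:
    r.table('users').filter(
        lambda user: user['emails'].contains('alice@example.com')).run(conn)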

Thank you again, this is incredibly helpful!

~~~
hoodoof
I look forward to returning to RethinkDB. Great product and the developers are
so friendly and helpful. Good attitude.

------
troyk
Am I the only one that hardly ever needs a case-sensitive query/index on
string data? This is the one feature I miss from the MySQL glory days. With
postgres, I use the citext type (case-insensitive text) but I guess Mongo and
RethinkDB expect you to either store the data as all lower or upper, or
duplicate the field (for indexing).

~~~
coffeemug
The reason we (and probably the Mongo folks) do it this way is that JSON is
case-sensitive, so making indexes case-insensitive wouldn't work very well.

EDIT: it looks like I misread your comment. I was talking about index/field
names, while I think you were talking about field values. In that case Sam's
comment below is correct -- once we add string.toLower() (which is very easy),
you'll be able to do case-insensitive indexing if you like, because the
indexing system supports arbitrary ReQL expressions. If you're using RethinkDB
and this is important to you, let me know (slava@rethinkdb.com) and we'll
prioritize this.
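
Roughly, once a lower-casing command lands it would look something like this
with the Python driver (just a sketch; downcase() is the not-yet-implemented
piece, and the table is made up):

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # Index on a ReQL expression. user['name'].downcase() is hypothetical
    # until the string lower-casing command actually ships.
    r.table('users').index_create(
        'name_ci', lambda user: user['name'].downcase()).run(conn)

    # Normalize the search term the same way at query time.
    r.table('users').get_all('ada lovelace', index='name_ci').run(conn)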

~~~
pixelcort
Are the indices based on reql expressions updated incrementally?

How about making indices based on map-reduce? Would that work? Would they be
updated incrementally?

~~~
coffeemug
Yes, indices are updated incrementally. Currently an index entry for a given
row can only be computed from that row itself, not from other rows. The
reasons for this are pretty deep -- if you start evaluating a row's index
entries based on other rows, the result becomes non-deterministic and can't
easily be replicated across the cluster. Once we introduce value-based
replication (as opposed to the evaluation-based replication we have now) this
will become possible.

------
manacit
I'm really excited to see this project take off - I played with it a few
months ago and it seemed like it was getting close to prime time. The two
issues I see now are:

* More packages for more distros - It's great that I can install it on my OS X machine for testing, but I can only roll it out on an Ubuntu machine, and last time I checked the packages were a bit iffy.

* More languages - It looks like you guys are working hard on this one, which is great.

Hopefully it'll be production ready soon! Great work guys.

~~~
mglukhovsky
Mike @ RethinkDB here. Great feedback-- while we've been a bit behind in
rolling out releases to other distros and languages, we've been delighted to
see the community pick up our slack:

- Our new build system has people successfully compiling from source and
building community packages for CentOS, Slackware, Gentoo, Arch, and lots of
other distros.

- One intrepid soul is porting RethinkDB to FreeBSD, where it builds
successfully (almost ready to merge in):
<https://github.com/rethinkdb/rethinkdb/pull/688>

- We've seen an enormous number of client drivers for different languages pop
up in a short time, with some impressive collaboration:

    
    
    * rethinkdb-net (C#/.NET): https://github.com/mfenniak/rethinkdb-net
    * php-rql (PHP): https://github.com/danielmewes/php-rql
    * rethinkgo (Go): https://github.com/christopherhesse/rethinkgo
    * cl-rethinkdb (Common Lisp): https://github.com/orthecreedence/cl-rethinkdb
    * haskell-rethinkdb (Haskell): https://github.com/atnnn/haskell-rethinkdb
    * lethink (Erlang): https://github.com/taybin/lethink
    * librethinkdb (C): https://github.com/unbit/librethinkdb
    * Lots of emerging efforts for Java, Perl, and other languages.
    

It's great to see people passionate about bringing our project to other
platforms / distros, and we're doing whatever we can to support them. If
anyone's interested in a similar project, shoot me an email at mike [at]
rethinkdb.com

------
dkhenry
Looks like one by one the features are being checked off to make rethink a
real contender. Keep up the good work guys.

~~~
jdoliner
Thank you!!

------
hcarvalhoalves
RethinkDB is looking great. The feature gap is closing, now I'm itching to use
it in a real project.

------
themgt
We've updated our demo to 1.5 as well: <http://rethinkdb-demo.pogoapp.com/>

------
agildehaus
Build fails on a "brew upgrade" for me.

<http://pastebin.com/3GWetUTE>

~~~
craigyk
It's not always 100%, but who else thinks homebrew is awesome?

I miss it when I go to my linux machines.

~~~
nickpresta
I like Homebrew as well and I don't want to detract from how awesome it is,
but dpkg + aptitude is much better in my opinion and I find I actually miss
aptitude on OSX.

------
psykotic
Did anyone notice the name of the example database in the video demo? The
RethinkDB crew must be Dota fans.

------
Ecio78
Congrats on the release; RethinkDB seems very interesting, and I really
appreciate how coffeemug and all the other team members genuinely and clearly
reply to questions here.

Is there already a list of sites/services that use RethinkDB in production?

~~~
coffeemug
Hi :)

There isn't a list we can share yet, but we're starting to work very closely
with a small group of early customers -- shoot me an email at
slava@rethinkdb.com if you'd like to see if we can work together!

------
anatari
So is this production ready? I'm excited about this, but I don't have time at
the moment to dabble in things I can't use, and the last comment regarding
this is a post on Quora from the founder saying that it is not production
ready yet. A 1.0+ version number makes you think this isn't beta software
anymore, but that doesn't seem to be the case.

~~~
coffeemug
We'll bump the version to 2.0 when the product is production ready (the 1.x
versioning scheme is an accident that unfortunately we can't do much about).

------
dennis82
This is a good update, and not to rain on their parade, but doesn't it seem
redundant in the market, like reinventing the wheel?

~~~
coffeemug
RethinkDB is definitely late to the party, so this is a great question. Here's
how we think about it.

1. Technically, Rethink already does very useful things leading NoSQL
contenders don't do: an extremely expressive query language, massive query
parallelization, distributed joins, etc. You can't get that anywhere else, and
we have lots of features in the pipeline to keep raising the bar.

2. I believe we have a unique take on usability and design that is very
valuable to users but isn't present in other products. Rethink is _pleasant_
to set up and (hopefully) _pleasant_ to use. To us, design matters, and we put
an inordinate amount of effort into making the product beautiful. It turns out
users care about this _a lot_.

3. I feel like the leading NoSQL contenders are somewhat stagnant. There is
an enormous amount of innovation that could be done, needs to be done, and
isn't being done. We'll continue releasing really useful features that nobody
has, and (I think) nobody expects. I think we bring a unique philosophy to
product design that's very valuable.

4. Details matter (which is why Rethink is late to the party). Once you
outgrow the ten-minute blog stage, a lot of underlying architectural decisions
start to _really_ matter, and we took the time to do them right.

TL;DR: Rethink is already really good for building apps on top of it and
offers things nobody else does. It will continue getting better. We'd be
honored if you took it for a spin!

------
dscrd
Does RethinkDB do multiple atomic updates?

~~~
hendzen
Nope, operations on a single document are atomic, but not on multiple
documents. MongoDB works the same way.
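
For example (a sketch with the RethinkDB Python driver; the table and keys are
made up), a read-modify-write confined to one document is atomic, but moving
value between two documents is just two independent writes:

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # Atomic: a read-modify-write on a single document.
    r.table('accounts').get('alice').update(
        lambda acct: {'balance': acct['balance'] - 10}).run(conn)

    # Not atomic as a pair: the debit above and this credit are separate
    # writes, with no multi-document transaction tying them together.
    r.table('accounts').get('bob').update(
        lambda acct: {'balance': acct['balance'] + 10}).run(conn)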

AFAIK, the only NoSQL DB with support for multi-doc transactions is HyperDex
[0].

[0] - <http://hyperdex.org/>

~~~
prodigal_erik
Also <http://www.rethinkdb.com/docs/advanced-faq/>. RethinkDB started out as a
MySQL driver that was lock-free and did schema updates non-stupidly, but it
sounds like they've abandoned ACID and schema enforcement entirely like almost
all NoSQL systems (as if what I wanted is faster wrong answers).

~~~
tomjen3
I see this 'tude among DB guys quite often -- unless the database
unconditionally guarantees that it will always give the right answer and that
no data is ever lost, it is worse than useless and must be nuked from orbit.

Quite frankly, that is pretty arrogant. ACID may be required for a bank's
transactions, but in the vast majority of cases it isn't (yes, even in your
business) -- nothing of any consequence is going to happen if your A/B testing
DB loses a couple hundred entries because they weren't flushed to disk yet.

As for the lack of schema enforcement, well, you just have to add a few if
statements to your database model class -- and that is it.

~~~
JulianMorrison
Banks do their transactions with reorderable increments and decrements, and
reconciliation after the fact - they are "eventually consistent".

A more realistic example of something actually mattering: you sold a product
and took the money. People will be annoyed if they don't get their goods.

~~~
rescrv
And each of those individual operations must be preserved. The account
credit/debit example is used not because it reflects reality 100%, but because
it's a simple-to-understand illustration of the power of transactions. Most
developers should be able to look beyond that simplification and see how they
can put transactions to work (or see that they are unnecessary).

------
jonthepirate
Very cool.. I'm going to try this.

------
papsosouid
Wow, I thought this was a joke press release making fun of "web scale" fad
craziness. They seriously gave a 1.0 release number to a "database" that you
couldn't search?

~~~
apendleton
No, you've always been able to execute queries. They just weren't indexed
before.

~~~
papsosouid
I didn't say "execute queries", I said search. Unless you have a tiny amount
of data, no indexes means searching is not feasible.

~~~
JulianMorrison
The way you search on a key-value store with no secondary indexes is to
create the secondary indexes manually as separate tables.
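
A toy sketch of the pattern (plain Python dicts standing in for the two
tables):

    users = {}           # primary table: user_id -> document
    users_by_email = {}  # hand-maintained "index" table: email -> user_ids

    def insert_user(user_id, doc):
        users[user_id] = doc
        # Every write has to keep the manual index table in sync.
        users_by_email.setdefault(doc['email'], set()).add(user_id)

    def find_by_email(email):
        return [users[uid] for uid in users_by_email.get(email, ())]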

~~~
papsosouid
Or just use a database instead of creating problems for no reason.

