RethinkDB 1.9: Indexing and query runtime performance

jonpaul · on Sept 11, 2013

I've been experimenting a bit with RethinkDB within the last few days. Its stability is definitely not there (to be fair, they've made this clear), so it's crashed when I needed it most. But here are the finer points:

1) Their team is extremely responsive. In fact, the most responsive team that I've dealt with on GitHub. I've posted at least 4 GitHub issues at various times of the day and evening and I've gotten a response in less than 10 minutes. Most of the time, more than one member responded.

2) They are not overselling their product like MongoDB did. They are very clear about what their product does and doesn't do.

3) Clustering is stupidly simple. You can have one setup within minutes and have your servers geographically located throughout the world.

4) The web UI makes administering the databases easy and fun.

After they get the kinks (stability & docs) fixed, this database will definitely have a bright future.

stavros · on Sept 12, 2013

Regarding #1, at some point I told them I could use hard timeouts on JS and they said "sure, we can add this", I said "no no, this was more of a general wish for the future, there's no need to spend time on it" and they were like "well too late, we already implemented it".

I hope I get a reason to use Rethink soon again, the team has just been great and the product itself is ticking all the right boxes with me.

coffeemug · on Sept 11, 2013

slava @ rethink here. Could you link to the crashing bug you're referring to here? (just so other people have the benefit of learning about it)

EDIT: also, thanks for the kind words!

jonpaul · on Sept 11, 2013

https://github.com/rethinkdb/rethinkdb/issues/1419 :)

coffeemug · on Sept 11, 2013

Ah, thanks. These OOM crashes have been lurking for a while. @srh is working on this now and it should be gone once and for all after 1.10. Sorry you ran into this!

stavros · on Sept 12, 2013

I've asked this before and you replied, but it was lies, all lies! I will ask again: What have you done with Marc and why is he never on Steam? We lost a perfectly good DOTA2 player and all-around good guy!

mglukhovsky · on Sept 12, 2013

Haha, I mentioned it to Marc the other day-- he's out of the office now, but I'll be sure to ask again for you. :)

stavros · on Sept 12, 2013

Alright, thanks :)

andrewmunsell · on Sept 11, 2013

It's great to see an amazing product like RethinkDB move so quickly and improve so much. Since I've been experimenting with it, the performance has improved dramatically.

RethinkDB + Docker is a great setup for me and has made upgrading between versions of RethinkDB less painful, and when combined with tinc, clustering is really easy as well (DigitalOcean, where I'm experimenting with RethinkDB, doesn't support Amazon VPC-like functionality and only recently added support for a private network in one region).

corresation · on Sept 11, 2013

"It's great to see an amazing product like RethinkDB move so quickly and improve so much"

The counter-argument is that they can move so quickly and improve so much because so much is so poorly implemented. While it may seem cynical, I have formed a habit of eliminating projects that still see order-of-magnitude performance improvements from such rudimentary activities. Once they get to the point where a yearly release is a couple of syntax improvements and some minor speedups, it is more likely to be production ready.

coffeemug · on Sept 11, 2013

Overall, you're right, but I'd point out that this is by design.

In the past two years our top two goals (hopefully like any high-tech startup) were to build a product that electrifies people and has a solid architecture. The reasoning was that if people don't want the product nothing else matters, and if people do want the product but the architecture doesn't allow rapid changes, we'll lose.

Really good implementations of specific features were priority number three, and while we did a lot of that too, it sometimes took a backseat to the first two. That's ok though -- now that we have a rapidly growing user base and a solid technical architecture we can go back and replace placeholder implementations with better ones according to user demand. We're not entirely happy with all our choices, but no one ever is. I think that the overall direction worked out really really well for RethinkDB and its users.

EDIT: gruseom and mrkurt beat me to it.

mrkurt · on Sept 11, 2013

Rethink is more deliberately scoped small than it is poorly implemented. Their roadmap has all sorts of known performance improvements on it, and I greatly respect their decision to keep things as simple as possible and deliberately defer some areas of complexity.

gruseom · on Sept 11, 2013

You have a point in general, because that is how these things usually go, but these guys spent like three years working on the foundation for what they're doing now. This level of rapid improvement may be the visible payoff of a lot of careful design and engineering.

andrewflnr · on Sept 12, 2013

Speaking of which, for the rethink people on the thread, how much of the old "key-value store for SSDs" code is still active? Is the document store at all a layer on top of that?

SamReidHughes · on Sept 12, 2013

Most of that code still exists, extensively modified. Here are some things that don't:

- Talking to files with libaio -- we now only support use of blocking I/O calls in a thread pool.

- Use of block devices instead of files, and code optimized for older SSDs that spread writes to different parts of the disk.

- A fsck utility and a utility for extracting data from a corrupted file.

Most other stuff has been rewritten and refactored, though -- RethinkDB was once implemented entirely using callbacks on an epoll loop. Then we introduced coroutines (cooperative green threads that sit atop an epoll loop) into the codebase. Nowadays almost everything is implemented in terms of coroutines -- there aren't many APIs left that take a callback and say it'll get called sometime later. There's even still a secret memcache interface.
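To illustrate the callback-to-coroutine shift described above, here is a minimal sketch in Python's asyncio. It is only an analogy: RethinkDB's coroutines are implemented in its own C++ codebase, and the names `read_value_cb` and `read_value` are hypothetical stand-ins for a non-blocking read, not anything from RethinkDB.

```python
import asyncio


def read_value_cb(loop, key, on_done):
    """Callback style: returns at once; on_done(value) is called later by the loop."""
    loop.call_later(0.01, on_done, f"value-for-{key}")


async def read_value(key):
    """Coroutine style: the caller suspends at 'await' and resumes with the value."""
    await asyncio.sleep(0.01)  # stands in for a non-blocking read
    return f"value-for-{key}"


async def fetch_two_callback_style():
    # Two sequential reads written as nested callbacks.
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    results = []

    def on_second(value):
        results.append(value)
        done.set_result(results)

    def on_first(value):
        results.append(value)
        read_value_cb(loop, "b", on_second)  # the second read starts inside the first callback

    read_value_cb(loop, "a", on_first)
    return await done


async def fetch_two_coroutine_style():
    # The same two sequential reads, written as straight-line code.
    first = await read_value("a")
    second = await read_value("b")
    return [first, second]


def run_both():
    """Run both variants on a fresh event loop each and return their results."""
    return (
        asyncio.run(fetch_two_callback_style()),
        asyncio.run(fetch_two_coroutine_style()),
    )
```

Both variants run on the same kind of event loop and yield the same result; the coroutine version simply keeps the control flow in one place instead of spreading it across callbacks.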

The storage engine itself does a bit more now than it used to. Support for secondary indexes, MVCC, efficiently bringing back out-of-date replicas, and better on-disk storage are the main ones I can think of at this hour.

vosper · on Sept 11, 2013

I'm currently using a columnar SQL database with a denormalised schema of about 1000 columns that contains several billion rows. 99% of our queries are SUM over some of these columns, with a few simple filters: WHERE order_id IN [...] AND date > xyz

It performs really well (sub-second queries over millions of rows) but it's licensed by data volume and (more importantly) it's a single-server solution - at some point we're going to need a distributed database.

Is RethinkDB suitable for analytics workloads like this?

coffeemug · on Sept 11, 2013

For your workload a columnar store will always outperform a row-based store (like RethinkDB). Rethink does work for analytics, but it can never compete with columnar stores (like Vertica, etc.) on the workloads those systems are good at.
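A rough sketch of why the layout matters for a filtered SUM like the one in the question, using plain Python lists rather than any real storage engine. The column name `revenue` and the sample values are made up for illustration; `order_id` and `date` come from the query above.

```python
# Row layout: each record is stored together, so a scan walks whole records
# even though the query only needs three of the fields.
rows = [
    {"order_id": 1, "date": 20130901, "revenue": 10.0, "cost": 4.0},
    {"order_id": 2, "date": 20130905, "revenue": 20.0, "cost": 9.0},
    {"order_id": 1, "date": 20130910, "revenue": 5.0, "cost": 2.0},
    {"order_id": 3, "date": 20130912, "revenue": 7.5, "cost": 3.0},
]

# Column layout: one contiguous sequence per column, built from the same data.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def sum_rows(rows, column, order_ids, min_date):
    """SUM(column) WHERE order_id IN order_ids AND date > min_date, row by row."""
    return sum(
        r[column]
        for r in rows
        if r["order_id"] in order_ids and r["date"] > min_date
    )


def sum_columns(columns, column, order_ids, min_date):
    """Same query, but only the three columns involved are ever read."""
    ids, dates, values = columns["order_id"], columns["date"], columns[column]
    return sum(
        values[i]
        for i in range(len(values))
        if ids[i] in order_ids and dates[i] > min_date
    )
```

With about 1000 columns per row, the row scan has to move every record through memory, while the columnar scan reads three columns and skips the rest. This toy version shows only the access pattern, not the compression and vectorized execution that real columnar stores add on top.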

coolsunglasses · on Sept 12, 2013

Your honesty is much appreciated by the rest of us. Don't ever lose that please.

vosper · on Sept 11, 2013

That makes sense - thanks for the info.

gadamc · on Sept 11, 2013

If your queries are almost always the same and you need to update your sums as new data is added, you could look at the incremental MapReduce of CouchDB/Cloudant. A single CouchDB might hold your data, but Cloudant would scale it out for you. For example, I have 10 million documents, each with about 100 key-value pairs, where each value is a number. I have a MapReduce function that calculates the statistics of those values. With Cloudant, those statistics are computed in the Reduce step and updated incrementally, so I always have the latest information. Additionally, I can select date ranges for the statistics and the results are already pre-calculated for me. The response is very fast.

For example https://edelweiss.cloudant.com/automat/_design/cryo_2/_view/...

These are the stats for a particular measured value between two dates. Change the dates and you'll see that the return is pretty quick.

Change startkey to "T_Bolo", 1378940038 and you'll get the statistics of that value for the last ~hour. You get sum, but also average and standard deviation.
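The idea behind an incremental reduce like this can be sketched in a few lines of Python. This is not CouchDB's or Cloudant's implementation, only an illustration of the principle: keep a small summary (count, sum, sum of squares) that can be merged, so new data updates the result without rescanning old data. The `Stats` class and `reduce_values` function are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """A mergeable summary: enough to derive sum, average and standard deviation."""
    count: int = 0
    total: float = 0.0
    sum_sq: float = 0.0

    def add(self, x):
        """Fold one new value into the summary."""
        return Stats(self.count + 1, self.total + x, self.sum_sq + x * x)

    def merge(self, other):
        """Combine two summaries (the 're-reduce' step over partial results)."""
        return Stats(
            self.count + other.count,
            self.total + other.total,
            self.sum_sq + other.sum_sq,
        )

    @property
    def mean(self):
        return self.total / self.count  # assumes count > 0

    @property
    def stddev(self):
        # Population standard deviation; clamp tiny negative rounding errors to 0.
        return math.sqrt(max(self.sum_sq / self.count - self.mean ** 2, 0.0))


def reduce_values(values):
    """Reduce a batch of numbers to a single Stats summary."""
    summary = Stats()
    for v in values:
        summary = summary.add(v)
    return summary
```

For example, a summary stored for older documents can be merged with a summary of newly added ones, `reduce_values(old).merge(reduce_values(new))`, and gives the same statistics as reducing everything from scratch. Precomputed summaries per key range are what make the date-range queries above fast.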

full disclosure: I work at Cloudant.

cheers, Adam

JulianMorrison · on Sept 11, 2013

You might want to look at that "precog" thing that was mentioned here on HN yesterday. It's for analytics on columnar data.

JulianMorrison · on Sept 11, 2013

http://precog.com/

homerowilson · on Sept 11, 2013

Maybe try http://SciDB.org. It's a free GPL array database for analytics

vosper · on Sept 11, 2013

Thanks for the suggestion - I hadn't thought to look at solutions from the scientific-computing realm. Have you used SciDB yourself?

tonywe64 · on Sept 12, 2013

You should try Vertica; they have a 3-node community edition.

trebor · on Sept 11, 2013

I have to say, RethinkDB has piqued my interest. And I'm a pretty staunch MySQL/SQLite-based developer.

wicknicks · on Sept 11, 2013

You should give it a try. RethinkDB, imo, embraces a lot of good ideas from the SQL world. Given a few years, I feel they will have a great DB with good ideas from both the SQL and NoSQL worlds.

nickstinemates · on Sept 11, 2013

You can give it a try at http://tryrethink.info

cbsmith · on Sept 11, 2013

My condolences.

siddhant · on Sept 11, 2013

cbsmith · on Sept 13, 2013

Because... MySQL.

dkhenry · on Sept 11, 2013

Awesome, no driver changes.

coffeemug · on Sept 11, 2013

For all the driver devs out there, there should be no more driver protocol changes until we hit an LTS release (unless we find some critical API bugs that need to be fixed).

TylerE · on Sept 11, 2013

With that being the case are there plans to spend a bit more time on driver support? At the very least a high-quality 1st party Java driver would go a long way. Even better if you provide Scala/Clojure/Groovy bindings.

dabeeeenster · on Sept 11, 2013

Can anyone comment on the quality of the 3rd party Java driver (https://github.com/dkhenry/rethinkjava)?

dkhenry · on Sept 12, 2013

I am biased, but it is a very straightforward implementation building directly on the protobuf interface. It lacks some of the niceties of the interpreted drivers, but it is as fast as (or faster in some cases) than the official drivers (as best I can tell).

The author stopped working on it for a little because he was tired of refactoring his code as the protocol changed and figured he would just add all the missing bits once they finished the protocol.

tigeba · on Sept 12, 2013

I was eager to give these a whirl but from my perspective the AGPL licensing is rather onerous considering that the official drivers are APL. Not sure if this was intentional or accidental.

mglukhovsky · on Sept 12, 2013

The official RethinkDB drivers are licensed as Apache 2 (see https://github.com/rethinkdb/rethinkdb/blob/next/drivers/COP...), whereas the server is licensed under AGPL.

We strongly recommend all community drivers use a less restrictive license like Apache 2 (or other license of choice).

Also: dkhenry, thanks for all your hard work in helping bring RethinkDB to the Java community!

dkhenry · on Sept 12, 2013

Well, color me surprised. You're right. I just used the same license that RethinkDB is licensed under; I didn't see that they had the drivers under a different license.

By tomorrow I will have them updated to APL (luckily I can do this since no one else has really supplied a pull request :-( )

tigeba · on Sept 12, 2013

Excellent! I will check them out and send off a pull request if I run into anything.

coffeemug · on Sept 11, 2013

Our goal is to get to an LTS release without increasing the surface area of the project. Once the LTS release is out, we'll likely take on critical community projects under our roof and offer first-class support for them. This should start happening in the next couple of months. (Unfortunately until we get to LTS we don't quite have the resources to support new drivers)

hnnnnng · on Sept 12, 2013

I really like the idea behind rethink and the concept as well. I've also looked at documentation and it seems very simple to use.

Also, I understand that 'benchmarking' databases is extremely situational and generally inaccurate. However, without reasonable ways to measure an increase in performance, it's really hard for me to make the decision to switch from mongo to rethink.

What I'm trying to ask is, does anyone have any information that can help me decide where, how, when and why I should make the decision to switch from mongo to rethink? Not just for me but also so that I can show others in my team to get a consensus to switch.

coffeemug · on Sept 12, 2013

Is performance the main motivation for your team? If so, we'll be publishing some results soon. In the meantime, there are some non-performance related comparisons with Mongo that you might want to take a look at:

* Technical comparison: http://rethinkdb.com/docs/comparison-tables/

* A slightly more biased comparison: http://rethinkdb.com/docs/rethinkdb-vs-mongodb/

hnnnnng · on Sept 12, 2013

Performance would grant us the best reason to switch. Because otherwise we are quite satisfied with mongo. I have seen the technical comparison. I will keep an eye out for the performance results. Thanks for letting me know.

leif · on Sept 12, 2013

If you're on mongo and interested in performance, take a look at tokumx. It's a drop-in replacement for mongodb with a better storage engine. http://www.tokutek.com/2013/09/tokumx-vs-mongodb-in-memory-s...