We have used RethinkDB in production for a handful of months now: 100M docs, 250 GB of data spread across two servers.
We added it to the mix because it got increasingly difficult to tune SQL queries involved in building API responses, especially for endpoints that needed to pull data from many tables.
Our limited experience with MySQL operations was also a factor. We're on 5.5 and couldn't do some table operations that seemed promising without service disruptions. There were solutions for performing those operations without downtime, but they scared us a bit. We also looked into upgrading to 5.6 or MariaDB, but that seemed like it would take a long time and require extensive testing, with no guarantee that we would see performance gains.
We looked for alternative solutions and found RethinkDB. We reused the parts that serialize data for the API and put the resulting documents in RethinkDB. Then we had our API request handlers pull data from there instead of from MySQL and added indexes to support various kinds of filtering, pagination, and so on. We built this for our most problematic endpoint and got the two-server cluster up and running in about a week, tried it out on employees for another week, and then enabled it for everyone (with the option to quickly fall back to pulling data from MySQL).
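For illustration, the RethinkDB side looks roughly like this with the Python driver (a minimal sketch; the table and index names are made up):

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

# One-time setup: a secondary index to support sorting and pagination
# over the pre-serialized API documents.
r.table('api_docs').index_create('created_at').run(conn)
r.table('api_docs').index_wait('created_at').run(conn)

# Request handler: serve a page straight from RethinkDB instead of
# rebuilding the response from many MySQL tables.
page = (r.table('api_docs')
        .order_by(index=r.desc('created_at'))
        .slice(0, 50)
        .run(conn))
```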
This turned out to work well and we saw good response times, so we did the same thing for other endpoints.
There's some complexity involved in keeping the RethinkDB docs up to date with MySQL (where writes still go), but nothing extreme, and we haven't had many sync issues.
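The sync path is roughly this (a sketch; `serialize_for_api` stands in for our existing serialization code):

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

def after_mysql_write(record_id):
    # MySQL remains the source of truth; after each write we re-serialize
    # the affected record and upsert the resulting document.
    doc = serialize_for_api(record_id)  # stand-in for the real serializer
    r.table('api_docs').insert(doc, conflict='replace').run(conn)
```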
RethinkDB has been rock solid and it's a joy to operate.
> We added it to the mix because it got increasingly difficult to tune SQL queries involved in building API responses, especially for endpoints that needed to pull data from many tables.
Had you looked into PostgreSQL's materialized views? You can add indexes to the view, with the additional bonus that the view hides those joins from client code.
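For example (a minimal sketch through psycopg2; the tables and columns are made up):

```python
import psycopg2

conn = psycopg2.connect('dbname=app')
conn.autocommit = True  # REFRESH ... CONCURRENTLY can't run inside a transaction
cur = conn.cursor()

# Precompute the join once and index the result.
cur.execute("""
    CREATE MATERIALIZED VIEW api_orders AS
    SELECT o.id, o.created_at, c.name AS customer_name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
""")
cur.execute("CREATE UNIQUE INDEX ON api_orders (id)")

# Refresh on whatever schedule your staleness budget allows; CONCURRENTLY
# (which requires the unique index above) avoids blocking readers.
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY api_orders")
```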
> We're on 5.5 and couldn't do some table operations that seemed promising without service disruptions.
Everyone has this problem. But it's been largely solved in practice by performing the schema changes on the slaves and then promoting a slave to master.
Also, if you're just using RethinkDB as a delayed (and almost certainly inconsistent) secondary storage system, why not use ElasticSearch instead?
BTW, 250GB fits in memory on any decent-sized box. You're not really going to see how things scale until you get into the terabytes.
An R720 from Dell, or a similar model, with two 600GB Intel S3500 DC SSDs, 20 cores, and 256GB of RAM will go for 5k-7k USD. You can bump that to 384GB of RAM without going above 10k.
When I changed the country to Japan, the sticker price jumped from 2,000 USD to the equivalent of 15,000 USD for a very basic system. I'm at a loss as to what could explain this disparity. Guess I'll have to call up my vendor to get a comparable quote.
My tip is always to get in contact with a couple of resellers and play them off against each other on price.
If you are looking at larger purchases (50k+ USD), then you should talk directly with Dell, HP, or a comparable vendor and put them into a playoff for your business :)
Rethink's 'ungroup' method lets you chain multiple reductions, which is incredibly powerful for building aggregation queries. Elasticsearch doesn't have that capability, so its aggregations are far more limited.
For example, with Rethink it's very easy to compute a metric from metrics computed in a previous reduction. You can't do that with Elasticsearch, since its DSL only allows metrics to be computed from fields in the raw document, not from other aggregation metrics.
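Concretely, something like this (Python driver; table and field names are made up) averages per-user totals, i.e. computes a metric from the output of the first reduction:

```python
import rethinkdb as r

conn = r.connect('localhost', 28015)

avg_total = (r.table('events')
             .group('user_id').sum('amount')  # first reduction: total per user
             .ungroup()  # -> [{'group': user_id, 'reduction': total}, ...]
             .avg('reduction')  # second reduction over the first's output
             .run(conn))
```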
Splunk has an eventstats command which computes metrics and assigns them to fields of documents so you can process them further. Is that something similar? (Apart from the fact that Splunk's invoices are known to cause cardiac arrest?)