I have been using RethinkDB for a while now (although mostly for small side projects, plus maintaining the Go driver https://github.com/dancannon/gorethink) and have really enjoyed it; having a functional query language is quite refreshing. The recent introduction of change feeds is also really cool: building a realtime app with websockets was surprisingly easy.
Honestly, I would recommend RethinkDB to anybody looking to start a new (small- to medium-sized) project. While there are some small performance issues, this is to be expected for a project at this early stage, and after seeing how the RethinkDB team works I am confident that these will be sorted out pretty quickly.
We have used RethinkDB in production for a handful of months now: 100M docs, 250 GB of data spread across two servers.
We added it to the mix because it got increasingly difficult to tune SQL queries involved in building API responses, especially for endpoints that needed to pull data from many tables.
Our limited experience with MySQL operations was also a factor. We're on 5.5 and couldn't do some table operations that seemed promising without service disruptions. There were solutions for performing the actions we wanted without downtime, but they scared us a bit. We also looked into upgrading to 5.6 or MariaDB, but that seemed like it would take a long time and require a lot of testing, with no guarantee that we would see performance gains.
We looked for alternative solutions and found RethinkDB. We reused the parts that serialize data for the API and put the resulting documents in RethinkDB. Then we had our API request handlers pull data from there instead of from MySQL and added indexes to support various kinds of filtering, pagination, and so on. We built this for our most problematic endpoint and got the two-server cluster up and running in about a week, tried it out on employees for another week, and then enabled it for everyone (with the option to quickly fall back to pulling data from MySQL).
This turned out to work well and we saw good response times, so we did the same thing for other endpoints.
There's some complexity involved in keeping RethinkDB docs up to date with MySQL (where writes still go) but nothing extreme and we haven't had many sync issues.
RethinkDB has been rock solid and it's a joy to operate.
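For concreteness, here's a rough sketch of that setup using the Python driver -- the table, index, and field names are all made up, but it's the shape of what we do: serialized API documents upserted into RethinkDB on every MySQL write, with a secondary index backing sorting and pagination.

    import rethinkdb as r  # official Python driver

    conn = r.connect("localhost", 28015, db="api")  # hypothetical database name

    # One-time setup: a secondary index on the field used for sorting/pagination.
    r.table("feed_items").index_create("published_at").run(conn)
    r.table("feed_items").index_wait("published_at").run(conn)

    # Called after every MySQL write: re-serialize the affected document and
    # upsert it so the RethinkDB copy stays in sync with the source of truth.
    def sync_document(doc):
        r.table("feed_items").insert(doc, conflict="replace").run(conn)

    # API request handler: serve a newest-first page straight from the index.
    def list_items(page_size=50):
        return list(
            r.table("feed_items")
             .order_by(index=r.desc("published_at"))
             .limit(page_size)
             .run(conn)
        )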
> We added it to the mix because it got increasingly difficult to tune SQL queries involved in building API responses, especially for endpoints that needed to pull data from many tables.
Had you looked into using PostgreSQL's materialized views? You can add indexes to the view, with the additional bonus that the view hides those joins from client code.
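A minimal sketch of that approach (the posts/users schema and connection string are hypothetical; Python with psycopg2 just to keep it concrete):

    import psycopg2

    conn = psycopg2.connect("dbname=api")
    conn.autocommit = True
    cur = conn.cursor()

    # Materialize the expensive join once, then index the view like a regular table.
    cur.execute("""
        CREATE MATERIALIZED VIEW api_feed AS
        SELECT p.id, p.title, u.name AS author, p.published_at
        FROM posts p JOIN users u ON u.id = p.author_id
    """)
    cur.execute("CREATE UNIQUE INDEX ON api_feed (id)")
    cur.execute("CREATE INDEX ON api_feed (published_at)")

    # Refresh on whatever schedule fits; CONCURRENTLY (which needs the unique
    # index above) lets readers keep querying the old contents during a refresh.
    cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY api_feed")

Client code then just selects from api_feed as if it were a table.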
> We're on 5.5 and couldn't do some table operations that seemed promising without service disruptions.
Everyone has this problem. But it's been largely solved in practice by performing the schema changes on slaves, and then promoting the slaves to master.
Also, if you're just using RethinkDB as a delayed (and almost certainly inconsistent) secondary storage system, why not use ElasticSearch instead?
BTW, 250GB fits in memory on any decent-sized box. You're not really going to see how things scale until you get into the terabytes.
An R720 from Dell, or a similar model, with two 600GB Intel DC S3500 SSDs, 20 cores, and 256GB of RAM will go for $5k-7k. You can bump this to 386GB of RAM without going above $10k.
When I changed the country to Japan, the sticker price jumped from 2,000 USD to the equivalent of 15,000 USD for a very basic system. I am just at a loss as to what could explain this disparity. Guess I will have to call up my vendor to get a comparable quote.
My tip is always to get in contact with a couple of resellers and play them off against each other on price.
If you are looking at larger purchases (50k+ USD), then you should talk directly with Dell, HP, or a comparable vendor and put them into the same play-off over who you choose :)
Rethink's 'ungroup' method lets you chain multiple reductions, which is incredibly powerful for building aggregation queries. Elasticsearch doesn't have that capability, and as a result its aggregation support is severely limited.
For example, with Rethink it's very easy to compute a metric from metrics computed in a previous reduction. You can't do that with Elasticsearch, since its DSL allows metrics to be computed only from fields in the raw document, not from other aggregation metrics.
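Rough example in the Python driver (hypothetical `orders` table): per-customer revenue in a first reduction, then the average of those totals in a second one.

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    avg_revenue_per_customer = (
        r.table("orders")
         .group("customer_id")
         .sum("amount")      # first reduction, still grouped
         .ungroup()          # back to a plain sequence of {group, reduction} docs
         .map(lambda row: row["reduction"])
         .avg()              # second reduction, computed from the first one's results
         .run(conn)
    )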
Splunk has an eventstats command which computes metrics and assigns them to fields of documents so you can process them further. Is that something similar? (Apart from the fact that Splunk's invoices are known to cause cardiac arrest?)
- first-class support for JS bindings, unlike Mongoose, which wraps the very low-level MongoDB JS library into something palatable but crashes in a horribly undebuggable way.
- server-side joins
- a nice web UI for monitoring and running queries packaged up with the service
- public docker images that are super simple to run
From the RethinkDB docs [1], I am still a bit confused about how this locking system works for reads/writes, and also a bit skeptical regarding their claim that 'in most cases writes can be performed essentially lock-free'.
I am using MongoDB and didn't have many issues when my databases had 120,000 documents either; the problems began when we hit the millions... The combination of write locks and our need for dynamic queries (meaning we can't index) made the database by far the worst performance bottleneck in our system. Although I must be honest that we haven't yet tried MongoDB's new 3.0 version, which promises a boost in performance [2] and also has 'document-level locking and compression' [3].
Is anybody aware of any benchmarks that perform random writes (inserts/updates) and non-indexed reads for RethinkDB? (Is that even a common use scenario, anyway?)
> I am still a bit confused about how this locking system works for reads/writes, and also a bit skeptical regarding their claim that 'in most cases writes can be performed essentially lock-free'.
Hi, Slava at RethinkDB here.
RethinkDB uses MVCC to do locking. Essentially, when we lock down a block for a write, we make a copy of the block. If another query comes along that wants to read, it reads from the copy. When the write completes, the old copies are destroyed.
There are lots of details I'm glossing over -- optimizations to avoid copying too much, copying entire subbranches of the btree to have a consistent view of the shard, etc. All this stuff isn't unique to RethinkDB -- it's pretty standard database internals stuff, and we haven't done anything new in that department. It's just an implementation of standard database architectures (as far as the caching/query engine/storage engine are concerned).
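If it helps, here's a toy sketch of the copy-on-write idea (purely an illustration of the general technique, not RethinkDB's actual implementation):

    import threading

    class Block:
        """Toy copy-on-write block: readers always see a consistent snapshot."""
        def __init__(self, data):
            self._current = dict(data)          # the committed version readers see
            self._write_lock = threading.Lock()

        def read(self):
            # Readers just grab a reference to the committed copy; no lock needed,
            # because writers never mutate that copy in place.
            return self._current

        def write(self, updates):
            with self._write_lock:              # writers serialize against each other
                draft = dict(self._current)     # "lock down" the block by copying it
                draft.update(updates)
                self._current = draft           # atomic swap; the old copy gets GC'd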
FYI, with MongoDB, just because you can't and shouldn't index everything doesn't mean you can't have any indexes... indexes on your most commonly queried fields can still bring query times down considerably, so they're still pretty helpful.
I actually really like where RethinkDB is headed, and within the year most of my issues should be resolved.
Another couple of databases to consider, depending on your needs, would be ElasticSearch and Cassandra... it really depends on your use case.
I'm not involved with RethinkDB, but I lurk on their GitHub issues and I'm pretty confident that automagic failover depends on them getting (their own implementation of) Raft integrated with everything. Looks like it's getting close, as a whole slew of issues relating to the Raft work were opened just the other day.
We do not plan to implement secondary unique indexes. The philosophy behind RethinkDB is that if a feature cannot be efficiently scaled across multiple nodes we don't add it, and unfortunately unique secondary indexes are one such feature.
AFAIK most NoSQL databases don't implement it, and the few that do take one of two approaches -- forbid it across multiple shards, or just take a massive performance hit during sharding.
We chose to keep the feature out of the database. This way the application can be architected to account for it, so it remains fast as the database scales up.
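One common workaround (a sketch, not an official recommendation -- the table and field names are made up) is to lean on the primary key of a dedicated table, since primary keys are unique:

    import rethinkdb as r

    conn = r.connect("localhost", 28015, db="app")

    # Claim the value by inserting it as the primary key of a side table.
    # A duplicate primary key makes the insert error out, which the
    # application can treat as "already taken".
    def claim_username(username, user_id):
        result = r.table("usernames").insert(
            {"id": username, "user_id": user_id}
        ).run(conn)
        return result["errors"] == 0

The trade-off is that the application now owns the invariant (e.g. cleaning up the claim if user creation fails), which is exactly the "architect the application to account for it" part.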
This is good, for what it's worth. Having different feature sets for sharded vs. unsharded setups is just utterly confusing, and something MongoDB got really wrong.
We had a similar issue with queries that didn't match an index... later pages would time out, etc., so we limited our results for that class of queries. The next generation will use Cassandra with some custom searching/caching, which will work a bit differently.
The way the business logic in the system is currently designed, there are no 'common ones', except for some IDs (already indexed) used in other processes rather than in this dynamic filter.
Based on that, we also evaluated the approach of using something else like Druid [1][2], which is built for read performance, but I am still studying the possibilities and have no idea about the impact and problems a change like that would impose.
+1 to using RethinkDB! I'm also using RethinkDB in production, and I love it! The only issue is that you have to set up persistent filters via iptables in addition to having an authKey. They do have a guide[0] for that; however, they do not provide any instructions for ensuring that the iptables filters stay up, or for how to restore them if they are temporarily wiped out :/
I've been using/following RethinkDB since I started as Lavaboom's CTO. It's been a smooth ride so far, and the occasional perf improvements are always welcome. Some aspects of the database are especially lovely, like the web admin or painless deployments of new nodes, especially if you're using Docker.
We use RethinkDB in production and our main frustration lies around the lack of automatic failover. We're looking forward to 2.0, which is supposed to bring automatic failover (using Raft for consensus) to RethinkDB.
Unfortunately automatic failover won't be a part of 2.0, but it will happen very quickly after that. Please hang in there, we expect to ship this feature some time in May.
I just saw a demo of the failover feature yesterday from Tim Maxwell (the lead engineer on this), and it's really impressive! Another side benefit of this feature is live reshards -- you'll be able to reshard/rebalance data without any availability loss on the cluster.
The code is there and just needs a bit more polish and a lot of testing. I'm very excited to get this out, it's probably the last part of RethinkDB that I'm not 100% proud of yet (but will be in a month or two).
You guys are killing it. Wish I had a product I could write around Rethink... currently at the day job our stuff is mostly Mongo, all layered under django-nonrel with lots of Mongo crud, so a port wouldn't really be an option, I don't think.
As a person who agrees -- maybe you could write a port/adapter for django-nonrel to Rethink?
Also, why not start a new greenfield project to test out Rethink? A something-something-realtime-something-geospatial-something app should be a fantastic way to kick the tires, since that's one of the things that Rethink does really well out of the box (as of 1.15) compared to other databases (relational or not).
I was brought on well after the system was originally developed.
The websites are mostly our internal admin tools anyway.
Most of the real work is run through cronjobs or task queues (Celery).
The biggest annoyance is that the Django version the stable django-nonrel is based on is ancient (1.3). There are non-stable branches for newer Djangos (1.5, I think?). When I investigated, there were some issues with them, so we're still on 1.3.
Some of our biggest users (to be announced in a few weeks for 2.0) use RethinkDB this way. You can't really do deep analytics/machine learning as RethinkDB wasn't designed for that, but if you want to store a lot of data, and then run lightweight aggregation or map-reduce queries on that data, Rethink turns out to be a really good product for it.
One issue I see with this path is that if your queries ever get a lot more complex, you'd have to migrate off of RethinkDB onto Hadoop (which is a pain). I think that if you know for certain you just want lightweight querying capabilities RethinkDB can be really wonderful, but if there is a good chance you might need something deeper, it might be worth the effort to set up Hadoop early on.
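To give a feel for the "lightweight aggregation" case, something like this (hypothetical `events` table, Python driver) is the sweet spot -- a small grouped map-reduce over one table:

    import rethinkdb as r

    conn = r.connect("localhost", 28015, db="analytics")

    # Daily event counts per event type.
    daily_counts = (
        r.table("events")
         .group(lambda e: [e["type"], e["timestamp"].date()])
         .count()
         .run(conn)
    )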
Have you thought about a "read at timestamp" construct in RethinkDB?
It's not really an MVCC thing, and you can work around it in the data model, but for lots of reports (say, running in a cronjob) I want to run a query "as" the database saw things at midnight UTC, even if I start running it at 2am. It would also make reports more reproducible... but maybe this is really a data-model problem. When I read the Google Spanner papers, I felt this was a potentially very useful feature for read-only queries.
Currently you'd have to do it in the data model as you would in any other database. It's a pretty cool idea, though -- I'll think about what we can do (though admittedly, this is a bit removed from the current direction).
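For what it's worth, the usual data-model workaround is append-only versions with a timestamp. A rough sketch (hypothetical table/field names; `cutoff` should be a timezone-aware datetime or a ReQL time term):

    import rethinkdb as r

    conn = r.connect("localhost", 28015, db="app")

    # Writes append a new version instead of updating the document in place.
    def save_version(doc_id, body, now):
        r.table("doc_versions").insert(
            {"doc_id": doc_id, "valid_from": now, "body": body}
        ).run(conn)

    # "Read as of midnight": the newest version of each document whose
    # valid_from is at or before the cutoff.
    def read_as_of(cutoff):
        return (
            r.table("doc_versions")
             .filter(r.row["valid_from"] <= cutoff)
             .group("doc_id")
             .max("valid_from")                  # newest qualifying version per group
             .ungroup()
             .map(lambda g: g["reduction"])
             .run(conn)
        )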
That sounds like a job for the changes feed: pre-digest data with a query and then pipe its changes feed into Hadoop's storage. (How fast can change feeds run? Would that end up being a bottleneck?)
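Something along these lines (a sketch; `ship_downstream` is a placeholder for whatever writes into your Hadoop/queue side):

    import rethinkdb as r

    conn = r.connect("localhost", 28015, db="analytics")

    # Pre-digest with a query (here: only one event type), then follow the
    # changefeed and forward each new document downstream.
    feed = (
        r.table("events")
         .filter(r.row["type"] == "purchase")
         .changes()
         .run(conn)
    )

    for change in feed:            # blocks, yielding one change at a time
        doc = change["new_val"]
        if doc is not None:        # None means the document was deleted
            ship_downstream(doc)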
Most of your developers will use them without understanding how they work. In many cases, the only way to understand how they work is to read the source code.
They have magic features that are advertised as convenient but when they inevitably do something you don't want them to do you'll tear your hair out trying to circumvent them.
They put lots of complicated weird stuff in your stack traces so when something goes wrong with "the database stuff," which is probably going to happen every day, you will feel confused and overwhelmed.
ORM was famously referred to as "the Vietnam of computer science" by Ted Neward. [2]
There's a point that I think is even more important than the unruly and bewildering complexity of ORM, but I'm not sure I know how to formulate this point.
One way to formulate it would be to point out that your dichotomy of two choices is missing an alternative, so I present:
3) Code your data access in a separate module exposing query & save functions that make sense within your domain model.
In a reasonably complex system, this module might consist of fifty functions that concatenate SQL strings or whatever. In most cases, I'd bet money that rewriting this module to support some other data storage—especially if there are integration tests—would be easier and more pleasant than switching your ORM and then dealing with the random problems that will inevitably occur.
And when some query fails or is slow, the developer assigned to fix it will just go into the file, find the query, and change it. It's simpler: there's less obscure technology to worry about and fewer things to get angry at.
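To make option 3 concrete, a sketch of such a module (sqlite3 only so it's self-contained; any driver works the same way):

    # repositories.py -- plain query & save functions that make sense in the domain.
    import sqlite3

    def _conn():
        return sqlite3.connect("app.db")

    def find_active_users(limit=50):
        with _conn() as conn:
            rows = conn.execute(
                "SELECT id, name, email FROM users WHERE active = 1 LIMIT ?",
                (limit,),
            ).fetchall()
        return [{"id": r[0], "name": r[1], "email": r[2]} for r in rows]

    def save_user(user):
        with _conn() as conn:
            conn.execute(
                "INSERT INTO users (name, email, active) VALUES (?, ?, 1)",
                (user["name"], user["email"]),
            )

Swapping the storage means rewriting this one file; callers never notice.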
Preprocessed prepared queries that are shipped along with whatever package is using them is a far easier solution than quirky ORM tools that can do the simple things, but have a tendency to break on the harder things or encourage an authoring style that destroys performance.
That said, I think the world could use a CoffeeScript-esque transpiler for targeting SQL. Preferably with some kind of frontend/IL/backend separation, so that everyone can take a crack at replacing the awful SQL syntax.
Alternatively.. wrap each domain's data as a separate micro-service, with a convenient API... Though it really depends on how you can break down the boundaries of your application's data. Then you can persist/represent that data however you like.
Well, here's where the difference between query builders and ORMs comes in.
Query builders (often integrated with ORMs) are usually used by people who don't want to write any SQL. SQL is very performant and powerful, but not that easy to understand or write. This is especially so when you think about the context switch between programming languages and SQL.
The advantage of some of the NoSQL databases (MongoDB and RethinkDB, for example) is that you have the luxury of using an ORM only when it makes sense to use it, instead of relying on ORMs as a crutch for not knowing SQL.
The second approach seems better, but obviously SQL (power, performance, and prevalence) cannot be ignored.
I disagree. I use ORMs/Query Builders for two reasons (even if deeply familiar with SQL syntax and semantics).
1. Dynamic queries for reporting purposes. Is it possible to hand-roll this with SQL string concatenation for every query? Yes. Some of us value time and correctness (see the sketch below).
2. Type safety, refactoring, and change management. Sure, even a query builder/ORM model doesn't know at compile time whether it matches the production database. But when your database does change and you need to update the model to match it, it is a lot easier to do a refactoring on the record type than to manually update every SQL query string where that table is involved and hope there isn't a missing test or an invalid query out there which will now be broken.
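For point 1, a sketch of what a query builder buys you (SQLAlchemy Core, 1.4+ style, with a made-up `orders` table): filters compose without any hand-rolled string concatenation, and columns are referenced as checked names rather than substrings.

    from sqlalchemy import Column, Integer, MetaData, String, Table, select

    metadata = MetaData()
    orders = Table(                     # hypothetical reporting table
        "orders", metadata,
        Column("id", Integer, primary_key=True),
        Column("status", String),
        Column("region", String),
        Column("total", Integer),
    )

    def build_report_query(status=None, region=None, min_total=None):
        """Compose optional filters without concatenating SQL strings by hand."""
        stmt = select(orders)
        if status is not None:
            stmt = stmt.where(orders.c.status == status)
        if region is not None:
            stmt = stmt.where(orders.c.region == region)
        if min_total is not None:
            stmt = stmt.where(orders.c.total >= min_total)
        return stmt                     # emits parameterized SQL when executed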
I see him here and on Reddit every once in a while, but there is a cool client-side-encrypted note-taking app, Turtl, using Common Lisp and RethinkDB server-side, and what was node-webkit client-side. Very cool; everyone should check it out.
Not a dedicated user, but I have been playing with this dude's CL work and I like his approach and attitude. I thought people might want to see a self-hosted RethinkDB project.
I'm also working on a noSQL database. What I'm struggling with is the abstraction for searches/filters. For example, if you want to get all books with "beginner" in the title, in SQL it would look something like:
"SELECT * FROM books WHERE title LIKE %beginner%"
whereas in NoSQL it would look something like
book.filter({title: ["like", "beginner"]});
Any ideas on how to abstract the filtering in a more clear way?
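For what it's worth, ReQL's answer is to express the predicate in the host language instead of a string or a nested-dict mini-language; in the Python driver the same query reads:

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    beginner_books = (
        r.table("books")
         .filter(lambda book: book["title"].match("(?i)beginner"))  # RE2 regex, case-insensitive
         .run(conn)
    )

The lambda is compiled into the query that's sent to the server, so the filter stays declarative while still reading like ordinary code.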
I'm a relative newcomer to the NoSQL scene and have been using RethinkDB for a couple of side projects. The IRC channel (#rethinkdb on Freenode) is really second to none -- the people on there are incredibly friendly and patient when answering what are probably obvious questions.
Like it is also trendy to have an over-sized picture of young people working on wooden desks in an industrial-chic office taking up 70% of the screen space.
Goes nicely with the Bootstrap template, Lobster font, Circular cropped photos of founders, Ping pong tables, Ruby on Rails. Etc. Etc.
It's a way to imply that all of the fiddly little details don't suck.
Everyone's run into that library, SaaS product, or other bit of software where all of the features sound awesome, it appears to do exactly what you need ... and you'd rather debug a plugged in blender than actually use it.