Hacker News
RethinkDB 2.0 is amazing (conery.io)
348 points by TheMissingPiece on Apr 17, 2015 | 203 comments

MongoDB has just done so much to erode my trust in novel databases. My knee jerk reaction is always "NOPE stickin with Postgres!". So I'm going to hold off on checking this one out, even though it seems from the comments that it's avoided many of Mongo's horrible design flaws.

Just my 2 cents of Mongo hate :-)

You are absolutely right. However, bear in mind that this thinking is a slippery slope. Most good CIOs, architects, technical leads, etc. shy away from betting the house on new or novel technology precisely because experience has shown them first-hand the risks of jumping on the shiny, new, exciting, well-marketed technology. At a certain point, however, you will find that this pushes you behind the technology curve...

The engineering skill here is the ability to trade off risk vs. benefit... I will tell you from my own experience that the best-designed software systems I have personally dealt with tend to use components somewhat behind the curve.

Well, you might want to tell Facebook, Twitter, Netflix, Yahoo, Spotify, eBay, etc. that they don't know how to design software systems, because all of them have a long history (check their GitHubs) of creating and adopting pretty cutting-edge technologies.

For me the best software systems are those that are well architected and use the best available technology. This doesn't mean we all should be writing Tomcat, Oracle, Apache stacks just because they are less shiny.

Yes, please do look at them, because Google, Facebook, Twitter, Yahoo, and eBay use MySQL pretty heavily for many of their core storage needs (look at the contributor list for WebScaleSQL). Spotify uses Postgres for the same. All of them also use other things as well, but only for cases where they are willing to make performance or reliability trade-offs like for colder or analytical data. Of course they also use things like Sherpa, Cassandra, HBase, etc... but they lose consistent low latency or consistency or availability when they do so.

The point is, if you are going to bet your business on a technology, it helps if it has been tested with production workloads in many different conditions and at many scales. You want to know about as many shortcomings as you can. For many of the use cases that people use things like Cassandra for, they will be tolerant of 30ms+ reads and potential read inversions. Redis is used pretty heavily, but it is relatively simple code and you can trace through the entire write path pretty easily and get a sense of its limitations (being single-threaded is a blessing and a curse; you really need to be careful about bad tenants, because a single slow query will cause an availability event for everyone). HBase is used in a few places, but usually only for cold data, after they expect it to be read only occasionally and they don't want to use up space on their MySQL PCI flash devices for it anymore. There are a bunch more, but they all have some latency, consistency, or availability downsides compared to a traditional sharded, B+-tree-backed transactional store.

All the big guys do have a history of creating new/novel infrastructure pieces. They create them because they don't (can't, really) trust any brand-new infrastructure software that they didn't have at least a major hand in creating. You'll notice that if they use a new thing, it's after extensive testing, patching, and contributions.

As a small startup, you might not have the time for the extensive testing and patching of hot new technologies that Facebook, Twitter, and Netflix do.

For the big guys it's not about trust. They just hit the limits of the current tried and true before anyone else does and as a result they have to forge new ground.

If you don't run at the same scale as those guys, you won't hit those same limits. But if you do reach their scale, you will find that your needs are suddenly very much a unique snowflake, requiring you either to create something new or to heavily tweak something that already exists.

Plenty of time for that when you reach the scale that justifies it though.

This is simply not true. Many companies in the world (including on this list) do use brand-new software that they had no hand in creating. A classic example is the Big Data space: there were plenty of very early adopters of most of the Hadoop stack, e.g. Spark. Or look how many companies started using Nginx or Go even though less-shiny solutions existed.

And I'm not sure if you've worked for a large company, but they largely comprise lots of little startup-sized teams. The same principles apply regardless, e.g. spiking technologies out, managing risk, etc.

What I have an issue with is these stupid generalisations: less shiny = good, shiny = bad. The merits of the architecture and technology seem to be completely ignored.

The common rule of thumb seems to be: Less shiny = battle tested (hey, if it has bullet holes even better), Shiny = not gone even through pre-flight testing.

On the flip side. Less shiny = more accumulated technical debt.

Which then translates to buggier software as you add more and more new features.

There is a reason we rewrite codebases, no ?

Analogies. Fail me every time :)

With "less shiny" I mean that it's old, not that it's crummy quality-wise. Old code is not like wine: if it was crap then, it will be crap now. Old code is like an old house: if it's well made, tended to, and built on solid principles, it can last generations.

Extensible domain logic is something that generally does not age well. Old utility libraries with well defined interfaces, on the other hand, are invaluable in technical computing.

It looks to me like they use bleeding edge solutions or develop their own when "standard" tech isn't doing the job well enough. And they start slow and use it in non-critical pieces of software first.

Couchbase is well established and reliable, with a track record going back to memcached and CouchDB. And they're moving very fast with good features (though the way they do it, their paid enterprise edition gets the new features while the community edition lags a bit, which is fine, because it means the community edition is really damn reliable and high quality).

It is also really heavyweight. I run RethinkDB on a system one third the size of the minimal Couchbase system and still get excellent performance. The system requirements for Couchbase were a turn-off for me.

When did you last use Mongo? We've been using it in production for 3+ years, and while there were certainly some issues early on, we've had nothing but success with it (especially over the last few major versions).

You could have inconsistencies in your data[1] and not even realize it.

[1] https://aphyr.com/posts/284-call-me-maybe-mongodb

While the call-me-maybe series is definitely informative, it's worth noting that they've called out flaws in every distributed system they've tested. What it comes down to is: are those flaws fatal in practice? The truth is, it depends. If you lose a comment on a social media site, no big deal. If you lose part of a transaction for a multi-million-dollar stock trade, very big deal.

No software system is perfect, but there are definitely practical balances to be made. Especially when you are beyond what a single database/server can offer in terms of write throughput. The fact is, when your traffic needs exceed what a single database can keep up with in terms of writes, you have to give up some level of reliability.

He wasn't able to find issues with Zookeeper and Postgres.

Granted, you can only prove that a system is vulnerable, not the reverse, but if there is a vulnerability, it is much harder to trigger.

In general, strongly consistent distributed datastores like ZooKeeper tend to hold up well (cf. Consul and etcd too)... But Postgres was not tested as a distributed database: not sharded, not replicated, and without any form of failover. The difference is: kill a ZooKeeper node and you will not notice; kill Postgres and your app is dead.

Postgres is a good DB, but since it's not distributed, it's not very useful to compare it to distributed databases. Yes it's consistent, but it's only as reliable as the single node where it is installed.

This is a common misconception about the CAP theorem. A significant number of people don't realize that a distributed system also includes the clients; it's not just communication between servers.

I suspect he did not cover replication because failover support in Postgres is still technically DIY, although he should have. There are two replication methods I would like to see tested:

- asynchronous: this one is fast, but it would most likely have issues similar to the other databases
- synchronous: the master makes sure data is replicated before returning to the user, so this should in theory always be consistent

You would typically have two nodes in same location replicating synchronously and use asynchronous replication to different data centers. On a failure, you simply fail over to another synchronously replicating server.
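For concreteness, the synchronous setup described above maps to a handful of settings on the primary. This is only a sketch of the relevant postgresql.conf knobs for the PostgreSQL 9.x era, not a complete HA configuration, and the standby name here is hypothetical (it must match the standby's application_name):

```
# postgresql.conf on the primary
wal_level = hot_standby
max_wal_senders = 3

# Commits block until the named standby confirms the write
synchronous_commit = on
synchronous_standby_names = 'standby_local'
```

Any standby not listed in synchronous_standby_names (e.g. the remote data-center replica) streams the same WAL asynchronously, which is exactly the local-sync/remote-async split described above.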

Regarding Consul/etcd: those technologies actually did not do well in his tests, but the authors appear motivated to fix the issues.

I agree with your point about the client, but there's always a client... What's missing in the PostgreSQL test is high availability, redundancy, and partition tolerance... Similarly, any in-process DB would beat the competition :)

That's why I said it's unfair to say "postgresql did well".

Call me maybe is supposed to test all the difficult problems of CAP, which have not been tested at all with Postgresql.

95% of the people I hear about using NoSQL databases are using them on a single node.

That's a real problem. There's little need for a nosql database (except redis maybe, because it's so fast) if you're not trying to overcome partitions and ensure HA...

I don't even know how they upgrade their servers... With mongodb, it's a breeze if you have a replica set. Same goes with backups.

MongoDB doesn't even scale well horizontally[1]. I would normally link to the paper where they benchmarked Cassandra and HBase against MongoDB 2, but it looks like they redid their tests with MongoDB 3.0 and included Couchbase as well.

[1] http://www.datastax.com/wp-content/themes/datastax-2014-08/f...

I've mostly used MongoDB in mostly-read, and in a replica set... that said, if I needed to support pure scale, I'd be more inclined to reach for Cassandra. If I only needed mid-range replication, I'm more inclined to look at RethinkDB or ElasticSearch at this point. In fact the project I'm working on now is using ElasticSearch.

All of that said, you have to take a research paper funded by a database company (DataStax is backing Cassandra) with a grain of salt. Not to mention that most people reach for MongoDB because it has some flexibility and is a natural fit for many programming models. Beyond this, setting up a replica set with MongoDB was far easier than with any other database I've had to do the same with... Though I'd say getting set up with RethinkDB is nicer, but there's no automated failover option yet.

The results are so vastly apart that I don't think there's enough salt you can add to make MongoDB look good here.

They were also quite generous, comparing load using non-durable writes for CouchDB, HBase, and MongoDB against Cassandra's durable writes.

From my personal experience, many of the scaling problems you have with MongoDB become laughable once you switch even to a relational database that can't scale out.

Postgresql doesn't even try to scale horizontally.

Regarding MongoDB, all I'll say is that I switched from MySQL to MongoDB 2 years ago, and I've never looked back. YMMV.

I'm also a user of ElasticSearch and Redis, and looking to add Couchbase to the lot. One size doesn't fit all. mysql and postgres certainly don't fit all either.

Postgres does one thing and does it well: keeping your data safe. The whole NoSQL movement was about sacrificing ACID in exchange for speed and scalability, of which MongoDB has neither. You effectively get a database that not only performs slower on a single instance[1], but also can't even scale horizontally.

Also, saying that Postgres can't scale horizontally is not entirely true; it in fact can[2], it's just currently more complicated. I learned something when I was investigating how our applications would behave with a Postgres backend: it turns out that for every instance where we had Mongo, we could run Postgres on a much smaller instance. In one case the data was so small that you could just run Postgres on the same node that was running the app.

The point is that even if you think you need to scale out, unless you're Google, Facebook, or a similar company, chances are you don't.

[1] http://www.enterprisedb.com/postgres-plus-edb-blog/marc-lins...

[2] http://www.slideshare.net/mason_s/postgres-xl-scaling

Came here thinking just the same thing.

Can anyone explain (SQL pun not intended) to me the advantages / disadvantages between rethinkDB and say PostgreSQL?

Here's a comparison between RethinkDB and MongoDB (written by RethinkDB) http://rethinkdb.com/docs/rethinkdb-vs-mongodb/ and the FAQ for "When is RethinkDB not a good choice": http://rethinkdb.com/faq/#when-is-rethinkdb-not-a-good-choic...

A lot of these points would also apply to PostgreSQL.

I realize it is popular these days to hate on Mongo, but the FUD is getting a bit old.

Really, the MongoDB haters are pathological at this point. 10gen must have really pissed off the HN crowd early on (I missed that cycle), but today, in the real world, Mongo is a useful tool for what it does best. It's time to move on from early impressions and angst about their hype fails, and give them a few crumbs of credit for being the de facto leader in a new space and working through the inevitable rough patches.

I don't think you understand how badly 10gen overmarketed Mongo as something it isn't. I personally believe the people at 10gen deserve legal repercussions (I expect we'll see this in the coming years), even incarceration, for the pure lies they spouted that cost companies so much. MongoDB is a trojan horse, a trap, completely incapable of serving any real-world need. If someone thinks MongoDB is working for them, they just don't realize yet how it's subtly broken.

Sorry, have you just woken up and missed the last 20 years?

Almost every single IT company exaggerates claims, says their product is "the best" and "amazing" and suitable for every use case under the sun. Oracle did it. Microsoft did it. Mongo did it. And thousands of companies will do it in the future. It's called marketing.

And I think you should speak to all these customers (http://www.mongodb.com/who-uses-mongodb) and tell them they don't serve any real world need. I would imagine a few would ask the same question of you.

I think "ORGANIZATIONS CREATING APPLICATIONS NEVER BEFORE POSSIBLE" about sums up their "who uses" page... There's nothing in particular that Mongo does better than its competitors.

Thanks for making my case for me. This is possibly one of the most ridiculous replies ever posted to this forum.

The day we start jailing people for their marketing hype is... well, it wouldn't be a good day.

I'm confused: what horrible design flaws did MongoDB have, apart from collection-level locking?

And I can't imagine any of this is relevant today given that MongoDB allows for pluggable storage engines.

The way it handles failures? You may be interested in this (oldish) article: http://hackingdistributed.com/2013/01/29/mongo-ft/

Another worth read: https://aphyr.com/posts/284-call-me-maybe-mongodb

Although to be fair, it's not just MongoDB that is performing poorly.

How is that a design fault? It was purely a poorly chosen configuration default, which reflects the fact that Mongo was originally not a general-purpose database. And it was never even an issue for 99% of people, because all the drivers at the time used the safer settings.

I always find it amusing when people bring these issues up because it's like a giant sticker on their forehead that says "I've never actually seriously spent time with MongoDB before". I always go through the configuration of the database I use to make sure it meets my needs. Only seems sensible.

>It was purely a poorly chosen configuration setting which reflects the fact that Mongo was originally not a general purpose database.

That's dubious. 1) When MongoDB was released, none of the drivers used the "safer" settings. 2) 10gen, at the time, released benchmarks with the "unsafe" settings comparing it to MySQL and boasted that MongoDB was much faster (ignoring the fact that it wasn't acknowledging your writes).

Hmm, are you able to point us to any of these "benchmarks" with unsafe settings?

AFAIK, until recently (i.e. the last month), there weren't any such benchmarks released by MongoDB - and then, only for 3.x.

I'd be very surprised if any such benchmarks exist, as you claim.

Disclaimer - I work for MongoDB Inc.

MongoDB is advertised as a general-purpose database, though, and the majority of people use it that way.

Sure. And that's why the configuration was changed years ago.

The article actually talks about the configuration change. The issues are still there.

I'm not sure we've read the same article.

Quick -- give an example of something else unrelated to the original article expressing your disgust with it.

I don't really have a strong opinion in the game, but mind you, MongoDB is pretty darn fast: http://www.peterbe.com/plog/fastestdb At least when you're not at scale.

And it's pretty cool that you can make a choice between fast writes or safe writes. You can't have both but at least you can have the choice.

Having said all that, I generally prefer Postgres in almost every possible case.

However, this RethinkDB project looks sexy, with great potential.

/dev/null is pretty darn fast, too. https://www.youtube.com/watch?v=b2F-DItXtZs

You know this benchmark is bullshit because it says memcached is slower than Redis (at editing), Mongo (at creating and deleting), and Motor (at creating).

'When you query with ReQL you tack on functions and “compose” your data, with SQL you tell the query engine the steps necessary (aka prescribe) to return your data:'

... what? ReQL and SQL are both declarative query languages; I don't really see what the author is getting at. Is there an implication that SQL isn't declarative?

The only real difference is that the API is based around chaining function calls rather than expressing what is needed as a string - there are many SQL query builder APIs that will let you build SQL queries by chaining together function calls.
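To illustrate the "chaining function calls" point, here is a deliberately minimal, hypothetical query builder (the `Query` class and its methods are made up for this sketch, not any particular library's API). Each chained call accumulates state; nothing is rendered to SQL until the end, and the values stay separate as bound parameters:

```python
# A toy SQL query builder: chained calls accumulate conditions and
# parameters, and the SQL string is only rendered at the end.
class Query:
    def __init__(self, table):
        self.table = table
        self.conditions = []
        self.params = []

    def where(self, clause, *args):
        self.conditions.append(clause)
        self.params.extend(args)
        return self  # returning self is what makes chaining work

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql, self.params

q = Query("users").where("age > ?", 20).where("active = ?", True)
sql, params = q.to_sql()
print(sql)     # SELECT * FROM users WHERE age > ? AND active = ?
print(params)  # [20, True]
```

Real builders of this style (jOOQ, SQLAlchemy Core, knex.js, ActiveRecord's relation API) work the same way: the query is a native object you can pass around and extend before it is ever serialized.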

One real difference is that you get most of the benefits of an ORM framework right in the driver, without depending on other packages or frameworks. And the API is very consistent across different languages.

The biggest problem with composing SQL strings is that you have to be very very careful about SQL injections, and if you deal with that in a slightly sophisticated, reusable manner you are half-way to an ORM already. As far as I can determine, the ReQL drivers make injection attacks very difficult.

The biggest problem with a different query language for every project is that when they get around to implementing some of the more esoteric (but extremely useful) SQL features, it may not match well with what they've designed so far. It may be hard for the devs to implement, hard for users to conceptualize, or just plain weirdly tacked on, causing a cognitive mismatch in how it's used (some side channel, for instance).

Using a query builder (or ORM) of some sort still allows the escape hatch of raw SQL to do those really crazy things that are sometimes needed for performance, or just because what you are trying to do is rather weird. SQL is a very mature language, it's unlikely you are going to run into something someone else hasn't before.

Datomic uses Datalog but also exposes lower levels of the data-access API. This allows people with special needs to drop down and do their own thing, while others can use Datalog.

It seems like a good idea to allow different layers of data access.

Raw SQL still needs to be parsed and converted into some data structure. The difference, I think, is that with ReQL you are just building the data structure on the client side and then sending it to the server.

Another point of ORMs is to make queries reusable. In Django for example, you can reuse the same queryset in views, forms, templating and the admin. You can process and edit the queryset much easier than composed SQL strings.

This should also be possible to build on top of ReQL, though I don't know any examples.

Composing SQL strings at runtime should be the very last resort in my opinion.

This is what stored procedures and parameterized queries are for. Even if I am going to do dynamic SQL, I do it in a stored procedure if I can.

Then you are one of the rare SQL-enlightened beings.

I still don't see how you can pass user input from, say, a python string into a stored procedure call without worrying about injections. Or converting between your app's data structures and whatever string is necessary for your stored procedure.

Your driver should be able to handle parameterized queries for you.

    query('SELECT * FROM users WHERE id = ANY ($1::int[])', [1, 2, 3]);
    query('SELECT * FROM users WHERE lower(uname) = lower($1)', 'foo');
Where's the injection vulnerability?

So I may not be an SQL expert, but why would it be difficult to produce an injection string for $1? Of course, if you supply it "guaranteed" integers, then you can't. Injections normally happen with user inputs, not constants.

Because SQL query compilers generally don't execute the parameters. They do not just concatenate the given parameter strings into the query template and then run that. Instead, parameters are always treated as parameters, the query template is compiled and the parameters are passed into that compiled representation of the query where they are simply regarded as variables, not eval'd.
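The "parameters are never eval'd" point is easy to demonstrate with Python's stdlib sqlite3 driver (just one example driver; the same holds for parameterized queries in any mainstream SQL driver). The same injection payload matches every row when concatenated into the query, but matches nothing when passed as a bound parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, uname TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# A classic injection payload, as it might arrive from user input.
evil = "alice' OR '1'='1"

# String concatenation: the payload becomes part of the SQL text,
# the OR clause is executed, and every row matches.
rows_concat = conn.execute(
    "SELECT * FROM users WHERE uname = '" + evil + "'"
).fetchall()

# Parameter binding: the payload is treated as an opaque value,
# compared literally against uname, so nothing matches.
rows_bound = conn.execute(
    "SELECT * FROM users WHERE uname = ?", (evil,)
).fetchall()

print(len(rows_concat), len(rows_bound))  # 1 0
```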

Look up prepared statements.

> The only real difference is that the API is based around chaining function calls rather than expressing what is needed as a string

The query in RethinkDB is very much an expression. In the JavaScript driver you build this expression with function calls. There are other drivers which let you build the expression in a much more declarative way (like my Haskell driver).

Let me give you an arbitrary, unknown sql string in a variable `s`, and I'll do the same with a ReQL query.

The challenge is to make sure the column/field `age` is more than 20.

My code is:

   query.filter(r.row['age'] > 20)
What's yours? (Hint: start by writing a compliant SQL parser)

Let us compare this newfangled "auto-mobile" invention to my favorite form of transportation, the horse: Where would you mount your favorite riding saddle on an auto-mobile?

(Hint: start by learning metalworking)

This is nice, and as others have noted there are similar APIs that can be used with SQL. But I find it bad practice to extend arbitrary queries with additional conditions. This is very likely to lead to poor performance and perhaps correctness problems. In practice you need to have more semantic understanding of your query, so having an opaque 'query' object is no more helpful than a SQL string.

using ActiveRecord:

    query.where('age > ?', 20)
If I'm understanding you correctly.

That's moving the goalposts. The original challenge was take an arbitrary SQL query stored in a string.

I believe army's point was that when using an SQL query builder API, one does not start with a string, but something which allows them to do a similar check that you showed.

I'm also not sure how your comment replies to army's point. The point, as I understood it, is that it is not accurate to characterize SQL queries as steps that tell the engine what to do. SQL is declarative, and leaves the execution plan up to the database itself. army's comments about the API and strings were trying to point out the only perceived difference, which is not relevant to the question of declarative versus imperative.

Wait what? Not sure what you are asking for

"SELECT * FROM table WHERE age > 20"?

You already have an existing SQL query in a string variable. You need to add the age > 20 condition to that query.


I think that will work for the specific case but won't generalize.


"You already have an existing ReQL query in a string variable. You need to add the age > 20 condition to that query."

Same problem: comparing apples and oranges, strings and some "live" code. If you put ReQL and SQL into the same category (either as a string, or as a thing that represents some "live" running code that you can manipulate at runtime), then it is difficult for me, at least, to really grasp what the differences are between them. SQL is certainly not considered an imperative language, eh?


EDIT to respond to the comments below from TylerE and pests:

Oh, but you do have ReQL as a string: when you type it into the editor, when it lives on disk as a file of source code. At some point that code becomes live and you can interact with it. The exact same basic transformation happens whether the syntax is ReQL or SQL, just in different ways and at different times, depending on how you choose to run it, not what syntax it's in. The issues are orthogonal, and it is certainly fair to demand that we compare the right things.

If you want to say that ReQL is a better syntax than SQL, well, I don't see it (yet.)

If you want to say that the product in question provides a nice way to run ReQL syntax queries in some fashion that is fundamentally better than the way that some other product allows you to run SQL queries, that is a whole different issue (and NOT the one I am addressing in my comment above.)

I hope that makes sense. ;-) Cheers!

No, it's not, because there is no such thing as a ReQL query in a string. It's an object, usually built by chaining methods. There is no textual representation.

Edit in response to your edit: it seems you are fundamentally not getting it. ReQL is live code in your native programming environment. That means you can inspect it and manipulate it. SQL doesn't get interpreted (or whatever; it's a black box) until it hits the server.

Imagine you're in a world where there are no XML parsing libraries. SQL is a string containing XML. ReQL is a DOM object.

One is much more useful than the other.

Your argument is so silly.

You are comparing a language (SQL is independent of the language you're programming in) to an API.

RethinkDB has APIs available for three languages: JavaScript, Python, and Ruby. If you look carefully, while it tries to be consistent across them, there are still parts specific to each language. If you wanted to use RethinkDB from a completely different kind of language (for example, a functional one), assuming RethinkDB supported it, the interaction with the DB would be guaranteed to be completely different, whereas you could still use the same SQL language[1].

If you want to compare RethinkDB's API to something similar you should compare it with something like JOOQ[2].

Just to preemptively respond to the argument about translating a DSL to SQL: modern drivers communicate with the database using a binary protocol, and the SQL is compiled on the client side. You could actually skip SQL altogether, but then you would lose the flexibility of being able to support many other databases.

[1] http://pgocaml.forge.ocamlcore.org/

[2] https://en.wikipedia.org/wiki/Java_Object_Oriented_Querying

Try to understand the context in which what I am saying makes sense, it will blow your mind and make you a better programmer when you do.

(Expanding upon that: We are both correct, but not in the same context. There's a context in which what you are saying is true and sensible, and there is another context wherein what I am saying is true and sensible. I can switch between the contexts, so I am not trying to disagree with you; I am trying to give you data to help you understand this other context and switch between them too. Additionally, this other context is of a higher "logical level" in the mathematical sense than the one we already have in common, and so when you do grok it, I can confidently predict both that it will blow your mind and that it will improve your ability to write software.)

You would never have a ReQL query in a string, though. I don't think it's fair to ask about that case.

Does a string really have the filter method?

A ReQL query does. That's the point, it's a native object you can chain stuff off of.

Then it's not a string; it's a ReQL query object, the same way a Django model instance isn't a string. Your comparison seems so artificially slanted toward Rethink that it discredits itself.

Imagine me saying "here's a SQL string, let's see which database can execute it more easily, Rethink or Postgres. Hint: Start with writing a parser to convert it to ReQL".

Why would you ever do this?

There is some FUD in this blogpost as well as in the comments in this thread.

I think the current status quo for databases is canned software. And this isn't necessarily bad, because none of the three databases mentioned hide their specs or default settings; all three have very good docs and a community willing to help, in addition to companies offering commercial support. What's your excuse for misusing these products?

RethinkDB writes your data to disk before acknowledging the write but, on the other hand, can't elect a new primary in case of failure: two completely different features/limitations that might work for some and not for others. Is that hard to understand? Did the Mongo documentation lie to you at some point?

Accept that you are "buying" a general purpose product, the designers thought that their users will need those features, deal with it.

Otherwise, build your own database. I know this might sound very hard, but I guess in the future we will see smaller building blocks that let you build something that handles your needs, like this:


I was just thinking about this issue in the last few days. I'm working on a side project, due to the nature of the data model, converting back and forth to fit into a relational database is kind of annoying, so I was looking around for other databases.

Right now, the situation with databases is that we have to convert our internal data structures into a representation that fits the data model of the database we're using (i.e. rows for relational, documents/keys for the NoSQL group). I can see why the data model has to be that way for scaling, distribution, etc. But if I'm happy to scale my database up, and would prefer the database to store the data as close to the in-memory data structure as possible (similar to object databases, albeit with a broader definition of "object"), is there any database that could do that?

Otherwise, is there any suggestion on how I could get started to build one?

Redis is good at storing various data structures: sets, sorted sets, hashes, etc. As long as "scaling up" means adding memory, it's good.

> RethinkDB writes your data to disk before acknowledging the write

This is the default, but also note that durability is configurable on an operation-by-operation basis.


Which is also my point: both MongoDB and RethinkDB have the same feature in this regard, just with different defaults. Does that make one worse than the other?

Yes, because lots of people end up using software with its default values. Can you blame them? Perhaps... in the way you can blame a user who accidentally clicks a "delete account" button that doesn't have a confirmation step.

Good defaults are expected in quality software, and are just as important as any other part of software interface, CLI or GUI.

It's pretty amusing that "it writes your data to disk before acknowledging the write" is something that has to be described as "impressive" and written in bold text when talking about a database. MongoDB really has lowered the bar.

That said, I love the look of Rethink and I can't wait to give it a try.

Everything you need to understand about persisting data to a physical medium has been written by Richard Hipp et al. You can't cheat. You can easily issue async write calls in your own app to a synchronized storage engine and not assume the cost. When you have to keep written logs that are accessed by government agencies, you need to think about the atomicity of what a transaction is in your business. Is it when it enters the building as an electrical signal, or when you flip the bits on a spinning disk?

Methinks the world has forgotten that high throughput systems existed long before the web of recent years. Most of what the web world thinks is high throughput is hilariously slow. The ability to run up another instance to scale sideways has ruined people. It doesn't scale in a linear fashion.

OP here - many NoSQL/document DBs will trade off write acks for eventual consistency. I really liked their approach to pushing toward durability by default - that in particular was the thing that impressed me, which I should have been more clear about.

Does it write to a disk or to a log? If it writes to disk, it may still not be completely fail-safe unless it also writes to a log before writing to disk. Postgres for example has a write ahead log and it writes to disk before acknowledging the write.

RethinkDB has a log-structured storage engine. There is no separate log like in Postgres, the log is implicitly integrated into the storage engine. You don't have to write data twice (like you would with a traditional log), but you're still guaranteed safety in case of power failure. The design is roughly based on this paper: http://www.cs.berkeley.edu/~brewer/cs262/LFS.pdf.

So does postgres. That alone isn't enough because you can get situations like torn pages where part of the database page is old data and part is new data and nothing makes sense anymore. A log fixes that by first writing to the log so that if the database page gets messed up you have a secondary source you can use to restore it from.

>Does it write to a disk or to a log? If it writes to disk, it may still not be completely fail-safe unless it also writes to a log before writing to disk.

why are we still discussing this on a tech forum in the 21st century in Silicon Valley? Shouldn't it (together with isolation, ACID, CAP, etc...) be base knowledge taught in elementary school? Like: you can't expect Daddy to buy you the firetruck that Mommy promised if Mommy hasn't been able to talk to Daddy yet... though until Mommy talks to Daddy you probably can convince Daddy to buy you a railroad...

Yeah, that should be table stakes.

Except it wasn't mongodb that did that. People had the same knock against MySQL with MyISAM.

Right, I'd be a lot more forgiving of MongoDB if they had been bringing the product to market 10-15 years earlier.

Why? It was stupid and unsafe 10-15 years ago when MySQL was doing it, too, and all the devs who had been using more mature DBs (Oracle, DB2, etc.) complained about how bad it was.

I was nodding in agreement right up until the word "Oracle". Essentially any history of databases will say that for years, Oracle was not an RDBMS even by non-strict definitions (the claim is that Ellison didn't originally understand the concept correctly), and it certainly did not offer ACID guarantees.

Possibly Oracle had fixed 100% of that by the time MySQL came out, but then we're just talking about the timing of adding in safety, again -- and both IBM and Stonebraker's Ingres project (the Postgres predecessor) had RDBMSs with ACID in the late 1970s, and advertised the fact, so it wasn't a secret.

Except in the early DOS/Windows world, where customers hadn't learned of the importance of reliability in hardware and software, and were more concerned simply with price.

Oracle originally catered to that. MySQL did too, in some sense.

In very recent years, it appears to me that people are re-learning the same lessons from scratch all over again, ignoring history, with certain kinds of recently popular databases.

I am curious as to why. The underlying systems have only gotten more reliable and faster than they were 10-15 years ago. 10-15 years ago writing to disk was actually _more_ of a challenge than it is now with SSDs that have zero seek time.

I don't think it's gotten any easier to verify that something was actually persisted to disk though.

The hard part has always been verifying that the data is actually persisted to the hardware. And the number of layers between you and the physical storage has increased not decreased. And the number of those layers with a tendency to lie to you has increased not decreased.

For some systems it's not considered to be persisted until it's been written to n+1 physical media for exactly these reasons. The os could be lying to you by buffering the write, the driver software for the disk could be lying to you as well by buffering the data. Even the physical hardware could be lying to you by buffering the write.

In many ways writing may have gotten more reliable but verifying the write has gotten way harder.
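To make the layer-by-layer problem above concrete, here is a minimal sketch (in Python, not from any particular database) of what an application has to do on POSIX just to ask every OS-controlled layer to flush. Note that even this doesn't help if the drive's own volatile cache lies about the write:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and ask every layer the OS controls to flush it.

    Even after fsync returns, a disk with a volatile write cache can
    still lose the data on power failure -- this only covers the
    userspace and kernel layers.
    """
    # Write to a temp file first so readers never observe a torn file.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()              # flush Python's userspace buffer
        os.fsync(f.fileno())   # ask the kernel to flush to the device
    os.rename(tmp, path)       # atomic replace on POSIX
    # fsync the directory so the rename itself is durable too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The directory fsync is the step most applications forget: without it, the file's contents may survive a crash while the directory entry pointing at them does not.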

There's a lot of FUD going around when it comes to MongoDB write durability. Please read the manual.

Mongo lets the user decide whether or not to wait for fsync when writing to an individual node. This is not the default configuration. If you want it, you can enable it. You may complain that Mongo has bad defaults for your particular use case. It continues to have bad defaults to this day. Saying Mongodb is unable to acknowledge writes to disk is pure FUD.

Let the downvotes ensue.

It's like if MySQL shipped with libeatmydata configured by default. The defaults should be safe. Mongo made not just a bad choice, but a really idiotic decision to make their default configuration non durable.

> The defaults should be safe.

That's one opinion fitting one set of use cases. There are plenty of use cases where speed is more important than durability.

Hell, Redis default configs don't enable the append-only log, but you don't see the HN hate train jumping all over Redis. This is because Redis use cases typically don't require that level of durability.

edit for source: cmd+f for "appendonly" https://raw.githubusercontent.com/antirez/redis/2.8/redis.co...
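For reference, the relevant directives in the stock redis.conf look like this (defaults as shipped in 2.8):

```conf
# Durability is opt-in: the append-only file is off by default.
appendonly no

# If you turn it on, appendfsync controls how often Redis fsyncs:
#   always   - fsync on every write (safest, slowest)
#   everysec - fsync once per second (default; can lose ~1s of writes)
#   no       - let the OS decide when to flush
appendfsync everysec
```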

Redis doesn't market itself as a general purpose database, more as an advanced memcached, which is why those make sense. People who value performance over durability can of course change the setting, being aware of what they are getting into. That's very different from someone who doesn't realize that just because his database seems to save his data doesn't mean it won't eat it tomorrow because he didn't know to change the configuration. I stand by my judgment of hopelessly stupid.

Well, MySQL does ship like that. Only instead of not saving your data, it will mangle it in an effort to insert it in the database somehow.

It loses data even with WriteConcern.MAJORITY[1].

Emin Gün Sirer summarized[2] it best:

> WriteConcern is at least well-named: it corresponds to "how concerned would you be if we lost your data?" and the potential answers are "not at all!", "be my guest", and "well, look like you made an effort, but it's ok if you drop it."

[1] https://aphyr.com/posts/284-call-me-maybe-mongodb

[2] http://hackingdistributed.com/2013/01/29/mongo-ft/

>Let the downvotes ensue.

Can we not have this reddit-ism take hold?

> There's a lot of FUD going around when it comes to MongoDB write durability. Please read the manual. Mongo lets the user decide whether or not to wait for fsync when writing to an individual node. [...] It continues to have bad defaults to this day. Saying Mongodb is unable to acknowledge writes to disk is pure FUD.

> Let the downvotes ensue.

There's a lot of FUD going around when it comes to Ford Model X car not having brakes enabled. Please read the manual. Ford Model X lets the user decide whether or not to enable brakes or not. [...] It continues to have bad defaults to this day. Saying Ford Model X is unable to brake is pure FUD.

Let the downvotes ensue.

Oh great, an analogy. Now we can start debating its subtleties instead of discussing the matter at hand.

You are just being childish.

Firstly, MongoDB's write durability was set to use the safest option in all of the drivers at the time, so your point makes no sense. And secondly, we aren't ignorant users of the system; we are highly technical, and as such your analogy again makes no sense.

For some reason I had it in my head that most databases don't actually block until fsync() returns- instead, the guarantee you get is that:

# if execution continues, everything agrees on the state of the transaction

# if execution halts, because of a crash or whatever, you'll come back online at a consistent state from the past

Typically when you COMMIT the changes will be written to the transaction log, which is sequential, then later written asynchronously to the data files. So you get the performance of sequential writes and the flexibility of random writes, which is nice. But once something is COMMIT'd it is permanent, it will survive any crash after COMMIT returns. If it has not yet be written to the datafiles, the recovery process will do that.
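The commit-to-sequential-log, apply-to-data-files-later pattern described above can be sketched in a few lines. This is a toy write-ahead log for illustration, not any particular database's implementation:

```python
import json
import os

class ToyWAL:
    """Toy write-ahead log: commit() only returns after the log record
    is flushed; the "data files" are updated lazily and can always be
    rebuilt by replaying the log after a crash."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}    # stands in for the data files
        self._replay()    # crash recovery: rebuild state from the log

    def commit(self, key: str, value) -> None:
        record = json.dumps({"k": key, "v": value})
        with open(self.log_path, "a") as log:
            log.write(record + "\n")   # sequential append, fast
            log.flush()
            os.fsync(log.fileno())     # durable before commit returns
        self.data[key] = value         # data-file update can lag behind

    def _replay(self) -> None:
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                rec = json.loads(line)
                self.data[rec["k"]] = rec["v"]
```

If the process dies after commit() returns but before the data files are written, `_replay()` reconstructs the committed state from the log, which is exactly the guarantee described above.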

Haha very true!! My #1 complain against MongoDB was the silent data loss scenario. Anyways I am curious what RethinkDB has to offer.

Writing to memory before writing to disk can be safe if you do it right. You need to deploy multiple instances in a replica set with a quorum threshold to guarantee safety. This is what Cassandra provides out of the box. I don't think MongoDB made it clear at the start to its users that you should never work with a single database instance if you don't want to lose data.
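The majority-quorum idea is simple enough to simulate in a few lines (a generic sketch of the technique, not Cassandra's actual protocol): a write is acknowledged only once more than half the replicas have accepted it, so any later majority read is guaranteed to see at least one replica with the write.

```python
def quorum_write(replicas, key, value, down=frozenset()):
    """Acknowledge a write only if a majority of replicas accepted it.

    `replicas` is a list of dicts standing in for replica nodes;
    `down` is a set of replica indices simulating crashed or
    unreachable nodes.
    """
    needed = len(replicas) // 2 + 1   # majority threshold
    acks = 0
    for i, replica in enumerate(replicas):
        if i in down:
            continue            # this replica never responds
        replica[key] = value    # replica accepts and stores the write
        acks += 1
    return acks >= needed
```

With 3 replicas, the write survives one failure but is refused (not acknowledged) when two are down, which is the point: the client is never told "saved" while the data lives in only one place.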

There's a very marked difference between "safe" and "low probability of failure". With uncommitted writes, even with a quorum, there's still a chance that you lose the write.

I have lost plenty of data on three separate occasions with MongoDB, never running it by itself, always with at least a 3-member replica set. (This was 1-2 years ago; I'm sure it's improved.) But it's not accurate to blame the data loss issues only on documentation.

The API syntax isn't that special. The ArangoDB Query Builder (aqb on npm) for example is even more straightforward (no functions needed):

      a.eq("album.details.media_type_id", 2)
Or in plain AQL (ArangoDB's query language):

    FOR album IN catalog
    FILTER album.details.media_type_id == 2
    RETURN album
The "map" in the second example is simpler, too:

   ….return({artist: 'album.vendor.name'})
Or in plain AQL:

   … RETURN {artist: album.vendor.name}
Also, it doesn't really need drivers because the DB uses a REST API that works with any HTTP client.

That said, the change feeds are pretty neat and RethinkDB is still a pretty exciting project to follow.

(Full disclosure: I wrote the ArangoDB Query Builder without any prior exposure to ReQL, so I may be biased)

You don't actually need to use functions in ReQL either (although you can). For example

  r.table('users').filter(function(row) {
    return row('age').gt(30);
  });
Could be expressed as:

  r.table('users').filter(r.row('age').gt(30));

That being said aqb looks pretty cool and quite similar to ReQL.

Neat. I actually prefer the "infix" style for operators (i.e. having the methods on the values instead of on the helper) and I'll see whether I can adjust AQB to support that.

I've implemented the infix/ReQL style operators and published them in the latest release:


Postgres 9.3+ is fairly straight-forward too. Here is go + github.com/mgutz/dat

    // one trip to database using subqueries and Postgres' JSON functions
    con.SelectDoc("id", "user_name", "avatar").
        HasMany("recent_comments", `SELECT id, title FROM comments WHERE id = users.id LIMIT 10`).
        HasMany("recent_posts", `SELECT id, title FROM posts WHERE author_id = users.id LIMIT 10`).
        HasOne("account", `SELECT balance FROM accounts WHERE user_id = users.id`).
        Where("id = $1", 4).
        QueryStruct(&obj) // obj must be agreeable with json.Unmarshal()
results in

    {
        "id": 4,
        "user_name": "mario",
        "avatar": "https://imgur.com/a23x.jpg",
        "recent_comments": [{"id": 1, "title": "..."}],
        "recent_posts": [{"id": 1, "title": "..."}],
        "account": {
            "balance": 42.00
        }
    }
But have you seen the documentation, and the breadth of things you can string together?

I've found the documentation a joy to use.

I can echo this post. I've been using Rethink for a few years now on side projects (as well as maintaining a driver, even if I skip a few server versions here and there...but these guys move fast).

The composable queries alone are enough to make any developer happy. You can pretty much treat your data as if it's in-memory because the drivers integrate so well with the language. The relational model works really well.

Things I have not tried yet are clustering and the real-time support (still need to build this into the lisp driver) but I'm trying to slot some time to do this. One of the projects I'm working on (https://turtl.it) is going through a nice upgrade to mobile right now, and this will include some server changes...I'm looking forward to implementing changefeeds to solidify the collaboration aspect.

Overall I've been really impressed with Rethink over the years, and can't express how excited I am they hit production ready. On top of the DB being great, the team is really nice to work with. They are incredibly responsive on github and were really helpful when I was first starting to build out my driver.

Great post, and congrats to the Rethink team!

Have used RethinkDB with a few projects now, mainly in Go with dancannon's Go driver[1], and have found it very functional and robust; it has become my go-to schemaless database.

My largest project has been running a couple of years now and has accumulated a significant amount of data, and RethinkDB hasn't had any trouble at all scaling with my data growth. I'm running it on servers below the recommended requirements too (512MB DO instances) and have been really impressed with how it handles constrained resources.

[1] https://github.com/dancannon/gorethink

"RethinkDB is interesting" - absolutely

"RethinkDB is amazing" - TBD

I don't even think Slava would call RethinkDB "amazing" yet. I have no idea how to make a database, but I know there's a lot of work - and even more trial and error - that goes into making one "amazing."

This is certainly a big step for RethinkDB. But I'd be careful about putting petabytes of data across 200 nodes sharded 500 ways each.

There's a common form of such posts by developers: "I've found database X and I LOVE it, we're migrating all our systems." Then a year later: "Things didn't work out, here's how we migrated from database X to database Y."

Someone should make an index of such blog posts.

Interesting that Wikipedia has no entry for RethinkDB.

I do see that RethinkDB has some "Overview" and "FAQ" links on its website. However, when I encounter a new technology, I like to read its Wikipedia entry first. Wikipedia is usually more impartial, informative, and actually makes it easier to get a high-level sense of a technology than the tech's own website in most cases. This has grown more and more true over the past five or so years, as even developer-facing websites have devolved into "startup-y" marketing nonsense.

I wonder if there WAS a Wikipedia entry, but it's been deleted by some moderator with an axe to grind? I personally haven't contributed in years due to how unpleasant it is to add new content through all of the Wiki-lawyering. I've also noticed that 5 years ago, when you did a Google search you could rely on the Wikipedia entry being one of the top 2 or 3 results. Lately I see more and more instances where I have to scroll to the second or third pages of results to see a Wikipedia link.

Anyhoo... apologies for the tangential aside. I'm just wondering whether the lack of a Wikipedia entry says more about RethinkDB or about Wikipedia?

Slava @ Rethink here. There used to be a few Wikipedia articles that kept getting deleted. I don't really understand the Wikipedia guidelines on this, but I don't worry about it too much. As RethinkDB grows the article will get added back, and it will get harder to make an argument that it should be deleted.

There should be zero trouble getting a RethinkDB article written now, because it takes all of 5 seconds with NEWS.GOOGLE.COM to find reliable sources to back a notability claim.

I can't find evidence of a deleted Rethink article in Wikipedia, but didn't look hard.

It's listed right on the article's page:


Reason: https://en.wikipedia.org/wiki/Wikipedia:Criteria_for_speedy_... (G11. Unambiguous advertising or promotion)

Weird, I tried searching and didn't get it (or the warning not to create a new page for it).

Is there an archived copy of the original page? It's probably best to start with a stub page that contains no advocacy for Rethink and minimal information, and then grow it over time.


> 17:41, 16 August 2013 Alex Shih (talk | contribs) deleted page RethinkDB (G11: Unambiguous advertising or promotion)

I've added a (very) brief article, as a starting point. Would be great if others can add additional information!

I guess there is a higher chance of it not getting deleted if there are a bunch of edits/additions made by different people/accounts. So please, add to it :)


There was an article in the past. Well, that's just how Wikipedia is. The Node.js article, for example, was only undeleted by an admin because someone mentioned it in some "Wikipedia is lame" thread on Reddit.

> Not earth-shattering, but with ReQL all I need to do is attach a function

Thing is, some SQL databases have columnar storage, and there, selecting everything and then filtering with an attached function would eliminate the performance benefit of not selecting all of the fields.

This is why SELECT looks like it does. Not to mention it's much shorter than attaching a function for the purpose.

The author also himself acknowledges that:

> The downside is that your queries end up quite long and, for some, rather intimidating.

Ok, so they're "quite long" and have less potential for performance optimization. Amazing?

His example of creating specific indexes and views is also not new to SQL.

> There are 3 official drivers: Python, Ruby and Node.


> The query itself didn’t change at all – I could copy and paste it right in. I had to wrap it with connection info and a run() function, but that’s it.

So just like an SQL query, except I can connect to an SQL RDBMS from virtually any language I can think of, and not just a narrow selection of 3 scripting languages.

I sympathize with the author's excitement, but from all his examples SQL feels like it has quite an edge, both in availability and in terms of design and fit for the domain, over a bunch of JS functions composed together (as much as I like composing functions together in JS).

I realize how much hard work the folks at RethinkDB have put into creating their product. But technology adoption is not driven by pity, it's driven by benefits. For a new type of DB to not be a flash in the pan it needs a lot more than being "stable and fast". It needs to offer significant additional benefits when compared to existing DBs. And I ain't seeing it.

ReQL is declarative. The queries are compiled into an AST in the client drivers, the AST is shipped to the database, and is then analyzed and executed on the cluster. There is no information loss, and nothing runs on the client. The query language looks operational (as is, just run these commands in order), but is also declarative in the same way as SQL is.

This blog post explains how this works: http://rethinkdb.com/blog/lambda-functions/
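The mechanics are easy to see with a toy query builder (a sketch of the general technique, not RethinkDB's actual term names or wire format): each chained call just appends a node to a tree on the client, and only run() would serialize the whole tree and ship it to the server, where it can be analyzed and optimized.

```python
class Term:
    """Toy ReQL-style term: method calls build an AST instead of
    executing anything on the client."""

    def __init__(self, name, args):
        self.name = name
        self.args = args

    def filter(self, predicate):
        # No filtering happens here -- we only record the intent.
        return Term("filter", [self, predicate])

    def pluck(self, *fields):
        return Term("pluck", [self, list(fields)])

    def build(self):
        """Serialize the AST (stands in for what run() ships to the server)."""
        args = [a.build() if isinstance(a, Term) else a for a in self.args]
        return [self.name, args]

def table(name):
    return Term("table", [name])

# The chained "query" is really just a data structure:
query = table("users").filter({"age": 30}).pluck("name")
```

Because the server receives the full tree rather than opaque client code, nothing is lost to the client language, which is why the restrictions on what you can do inside query lambdas exist.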

> So just like an SQL query, except I can connect to an SQL RDBMS from virtually any language I can think of

Well, sure, because (1) major SQL databases have DB-specific drivers for many languages (often third-party), and (2) SQL uses a well-established, common model for which generic connectivity tools exist (ODBC, JDBC, etc.) so even minor SQL-based databases can go pretty far if they've got just ODBC and JDBC drivers.

But while RethinkDB may only have the three languages with official drivers, there are lots of third-party drivers, and there is documentation on the protocol and process for writing third-party drivers. Obviously, it kind of loses out where ODBC/JDBC and similar technologies are concerned (you could probably build drivers for Rethink using them, but you'd probably have to lose lots of Rethink's unique features -- particularly the push feed one -- in doing so).

> I realize how much hard work the folks at RethinkDB have put into creating their product. But technology adoption is not driven by pity, it's driven by benefits. For a new type of DB to not be a flash in the pan it needs a lot more than being "stable and fast". It needs to offer significant additional benefits when compared to existing DBs. And I ain't seeing it.

The key additional benefit compared to most better-established storage technologies seems to be ability to simply set up push feeds from queries. I'd say the demand (or lack thereof) from that is likely to be the determining factor in whether the resources get devoted (first- and third-party) to the RethinkDB ecosystem to bring the kind of conveniences that are seen with more established DBs.

Do you know for sure that RethinkDB can't work out what columns are being filtered by a function? In the examples it would certainly be possible with some analysis.

It can't work them out because you compose the query in a third party scripting language.

RethinkDB has no access to the structure of the source in order to analyze it statically and work out an optimal I/O read plan. It interacts with the language runtime by providing an API and receiving callbacks to the API from the runtime.

SQL is parsed & analyzed statically at the server, a plan is created based on that analysis and executed. So with SQL it is possible to do so.

With RethinkDB you compose your query in the script, basically, and all of the optimization opportunities end with the exposed API (no function source analysis).

It's not impossible to redesign the API to provide or even mandate static details like requested fields to RethinkDB, and it has a bit of that, but it allows freely mixing in client-side logic, and even the OP is confused about what it means to have a client-side mapping function.

If they allowed complex expressions to run on the server, it'd become quite verbose to compose them via an API in an introspective way, to the point it'd warrant a DSL in a string... and we're back to SQL again.

> RethinkDB has no access to the structure of the source in order to analyze it statically and work out an optimal I/O read plan.

Actually this isn't true. One of the really cool things about RethinkDB is that despite the fact that queries are specified in third party scripting languages they actually get compiled to an intermediate language that RethinkDB can understand.

That being said AFAIK RethinkDB doesn't optimize selects the way columnar databases do. I believe it can only read from disk at a per document granularity. But it does have the ability to optimize this in the future.

I don't think that's true. From what I perused of the driver implementations, I think that as calls are made, the driver basically builds an AST up, and then when you call run() it compacts it and sends it over to the DB. ie, when you call filter() you aren't actually filtering, you're adding a filter operation to the AST.

I would think that would allow Rethink to analyze the structure of the query and perform appropriate optimizations.

I'm talking about map(), and you're talking about filter().

Here's the code in question:

    return {artist : album("vendor")("name")}
If this is simply adding a node to an AST, it could be expressed without a function:

  .map({artist : ['album','vendor','name']})
Using a function for this would be quite superfluous.

You can express it both ways in RethinkDB, and they'd both do the same thing -- add a node to the AST. The function is just a convenience syntax.

> It can't work them out because you compose the query in a third party scripting language.

The restrictions on what language features you can use in lambdas inside queries exist because the query isn't executed on the client, the query in the client language is parsed into a client-language-independent query description which is shipped back to the server and executed on the server. So all the information about the query is available to the server (how much it actually uses for optimization, I don't know, but the query is not opaque to the server; what is composed in the scripting language has the same relation to what the server sees as when you use an SQL abstraction layer that builds SQL and sends it back to the server with an SQL DB.)

> So just like an SQL query, except I can connect to an SQL RDBMS from virtually any language I can think of, and not just a narrow selection of 3 script languages.

Along with all of this, the team behind RethinkDB is really very helpful. The team was always patient when I was new. They always helped. This is one of the very few projects where I felt right at home. RethinkDB had (during the days I was using IRC) at least one core developer available on IRC every day from Monday to Friday. Congratulations team!

So I have a question for you all. But first a little background. I have been playing with PouchDB, an in-the-browser implementation of CouchDB that is mostly compatible with CouchDB. What I really like is its sync function; this sets up no-brainer, practically real-time sync between my local (offline) PouchDB database and a remote CouchDB. It is very cool!

I am very impressed by RethinkDB's cluster management, etc, so I would like to explore it as an option, but is there an easy way to sync my offline (browser based) localstorage-like database to rethink and back again? PouchDB makes this dead easy.

There isn't a good way to do that yet. We've been playing with some ideas, but offline sync like this is a surprisingly challenging problem -- it's easy to make something that works, but dramatically harder to make something that works at scale.

I can imagine that. I have seen some sync implementations for Backbone.js, for instance, a while back, but nothing that really worked well. Again, I am very surprised by how well PouchDB performs in this. Additionally, you get all (well, most) of the features of CouchDB in the browser; you can even use the design documents that you create on the server.

Excuse my curiosity unrelated to the current topic, but I'm about to deploy a small-scale hybrid (desktop & web) application using PouchDB. I have strongly unreliable networks (3G networks with entire days where the app is offline). The nodes own their own chunks of data, so there is no risk of conflicts whatsoever; the main goal is to sync as soon as they can. PouchDB/CouchDB seemed clearly the best fit for this kind of unusual application. Did you encounter any problems with it? Or if you can, share your opinion on this technology after using it.

Hi. First a disclaimer: I have no production app running this setup yet, but I have been working with PouchDB and recently CouchDB for a few months now, experimenting with a Phonegap/web app in Angular (I would recommend React instead, I think, but that is irrelevant to the Pouch & Couch story).

I have a similar use case to yours in mind for my PouchDB setup: being offline for a while and then syncing many things at once. So far I have been testing mostly with high-performance networks (including good 3G, however). I have also tested going into airplane mode on my phone, making changes, and then going online again after a while. All changes came through nicely. So you can do what you ask: sync as soon as there is a connection.

Indeed when there are no issues regarding conflicts that you have to resolve it seems that Pouch and Couch are a very very good way to have offline <-> online sync.

I can't imagine using RethinkDB until the query language ensures type safety. The article complains about SQL being full of strings, but ironically there are fully typed query builders for SQL and not for RethinkDB. All DB tables have a schema even if it's not written down anywhere, and schemaless databases can't have a fully typed query language either.

Many core RethinkDB developers are huge fans of type safety. Check out the RethinkDB Haskell driver (https://github.com/atnnn/haskell-rethinkdb), and the .NET driver (https://github.com/mfenniak/rethinkdb-net). ReQL works really nicely with type safety (and will work even better when we let people declare optional schemas).

Check out my Haskell driver (https://github.com/wereHamster/rethinkdb-client-driver). I think the first paragraph of the readme describes the differences between my driver and atnnn's quite nicely:

> It differs from the other driver (rethinkdb) in that it uses advanced Haskell magic to properly type the terms, queries and responses.

For example the driver knows that this query returns a number, and tries to parse it as such:

    call2 (lift (+)) (lift 1) (lift 2)
Here are a few more examples from my application:

    -- | The primary key in all our documents is the default "id".
    primaryKeyField :: Text
    primaryKeyField = "id"

    -- | Expression which represents the primary key field.
    primaryKeyFieldE :: Exp Text
    primaryKeyFieldE = lift primaryKeyField

    -- | Expression which represents the value of a field inside of an Object.
    objectFieldE :: (IsDatum a) => Text -> Exp Object -> Exp a
    objectFieldE field obj = GetField (lift field) obj

    -- | True if the object field matches the given value.
    objectFieldEqE :: (ToDatum a) => Text -> a -> Exp Object -> Exp Bool
    objectFieldEqE field value obj = Eq
        (objectFieldE field obj :: Exp Datum)
        (lift $ toDatum value)

    -- | True if the object's primary key matches the given string.
    primaryKeyEqE :: Text -> Exp Object -> Exp Bool
    primaryKeyEqE = objectFieldEqE primaryKeyField

My driver doesn't include all commands of the query language, just those which I need in my product. And I haven't tested it with RethinkDB 2.0 yet.

Still, you are supplying the record fields to match on as T.Text. Check out [1] for a type-safer approach. I've been working on a typed expression builder for just this purpose, but due to lack of time haven't been able to finish it yet [2].

[1] http://hackage.haskell.org/package/structured-mongoDB-0.3 [2] https://github.com/pkamenarsky/typesafe-query

Oh neat, thanks for pointing me to this!

Is there any interest in blessing the Haskell driver as "official"?

Thinky can strictly enforce types.

I'm sure there are use cases where both RethinkDB and PostgreSQL can equally provide a solution to some problem, but is it fair to compare them so blindly when one has ACID guarantees and the other does not?

OP here - I wanted to offer a comparison of the SQL vs. the ReQL query. Indeed if ACID is something you need, then yes a horizontally-scaling DB is probably something that deserves longer thought.

This is a broader discussion to be sure, and it's been had. Given that PG now supports jsonb, it does mean that yes, we get to have these discussions more.

Realtime update notifications, official node.js API...sounds like it'd be perfect as a second back end for Meteor.

There's been some talk of it but they haven't committed to anything.


It's a bit worrisome that there are no official Java drivers and the only unofficial Java project has been abandoned by its only author because he no longer uses Rethink.

Daniel @ RethinkDB here. We're going to ship an official Java driver very soon. We decided to focus on a small number of core drivers first while the protocol and query language were still undergoing rapid extension. Now that the protocol is stable, we're going to expand our official driver support step by step. The Java one will be first.

You can follow the progress on https://github.com/rethinkdb/rethinkdb/issues/3930

That's great news. I love RethinkDB, but not having a Java driver made it a no-go for a couple of projects.

Hey there, I am literally working on the official Java driver right now, fear not!

What do you guys think of Aerospike (http://www.aerospike.com/)? I heard of a few cases that used it successfully. Any issues with it?

Aerospike, a distributed key-value store, has a somewhat different use case than RethinkDB, a distributed JSON document database. Basically the only valid use I have seen for Aerospike, and the one they advertise, is a distributed key-value profile store for ad-tech or marketing companies (http://www.aerospike.com/overview/). Keep in mind that to actually use all of Aerospike's features, especially the one they are really proud of, cross-datacenter replication, you need the commercial license.

So... I suggest you figure out what your requirements are and then use the best tool for the job.

CTO at Aerospike here.

Aerospike is battle-tested in large deployments --- ad-tech, marketing-tech, a few new ones in telecom and fin-serv. Pushing huge load with very, very little downtime. That's what we're the most proud of --- and I'm proud that we're able to offer this killer codebase as open source, after being closed source for the first few years of the company.

Most applications have a huge core of key-value --- twitter, for example --- and need a fast and scalable key-value component. You can start with a single server (on your laptop with Vagrant) and scale up later.

We're adding more types, more cool operations, more indexes this year.

The fact that Aerospike has a basic query system, type safety, flash optimization (Amazon has switched over to being very SSD/Flash centric) support for every language under the sun (three contributed Scala layers --- and we see a lot of Go use as well as the usual Java / Node / Python / PHP / HHVM), Hadoop integration....

Here is OP's TekpubTV video on RethinkDB (2 years old): https://vimeo.com/60697270

Rob Conery always blows my mind. I came across him years ago when he used to evangelise .NET stuff for MS. He wrote a great ORM with DB migrations that was completely novel for me at the time. He seems to be a complete polyglot and seems to be able to switch between so many different platforms, languages and technologies without any problems. Much respect!

Note that the 9ms performance claim in this article is mistaken. The author has issued a correction: http://rob.conery.io/2015/04/17/rethinkdb-2-0-is-amazing/

I added a RethinkDB benchmark when used with the Python Tornado web server: http://www.peterbe.com/plog/fastestdb Scroll down for the update.

Note that (according to the linked commits), RethinkDB is using what we call "hard" durability in this comparison. This is our default to ensure maximum data safety.

Hard durability means that every individual write will wait for the data to be written to disk before the next one is run (in this benchmark, since it only does one at a time).

I don't think any of the other databases in this test is using a similarly strict requirement, are they?

You'd have to run with the currently commented line "rethinkdb.db('talks').table_create('talks', durability='soft').run(conn)" to get more comparable results.

(Edit for clarification: `durability='soft'` is comparable to the `safe` flag in many of the MongoDB drivers. It means that the server will acknowledge each write when it has been applied, but not wait for disk writes to complete.)
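The performance gap between the two modes comes down to whether each write blocks on an fsync. A standalone Python sketch (not RethinkDB code, just illustrating the per-write syscall cost that "hard" durability pays):

```python
import os
import tempfile
import time

def timed_writes(n: int, fsync: bool) -> float:
    """Append n small records to a temp file, optionally fsync-ing each one."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n):
            os.write(fd, b"x" * 64)
            if fsync:
                os.fsync(fd)  # "hard" durability: block until the OS confirms the disk write
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)

soft = timed_writes(200, fsync=False)
hard = timed_writes(200, fsync=True)
print(f"soft: {soft * 1000:.2f} ms, hard: {hard * 1000:.2f} ms")
```

On a spinning disk the gap is orders of magnitude; the benchmark above was effectively measuring this, not query-engine speed.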

I was interested in RethinkDB when I read about the 2.0 release. But after looking at the mess he had to go through to write a simple query, no thanks, I'll stick with MongoDB.

   db.catalog.find({ 'details.media_type_id': 2 })

In RethinkDB you'd express it like this:

  r.table('catalog').filter({ details: { media_type: 2}})
Or like this:

  r.table('catalog').filter(r.row('details')('media_type').eq(2))

For most queries MongoDB syntax and RethinkDB syntax are effectively interchangeable.

What do you do when a document has a period in its key?

Restrictions on Field Names: "Field names cannot contain dots (i.e. .) or null characters, and they must not start with a dollar sign."


SQL is not "a set of arbitrary commands put together into a string". It is a bit ungainly and irregular in parts. It needs improvements[1].

Haskell (or F# or any FPL) are, of course and obviously, a perfect fit as front end for a relational system.

[1]: http://www.thethirdmanifesto.com

Does Rethink store keys with every record, or does it use a more efficient mechanism?

While you could technically devise a more efficient storage format for JSON, I think it's safe to assume they store the keys with every record.


I can't believe new databases are still using this model. Some of my data storage would be 90% keys and 10% data as JSON.
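The overhead is easy to quantify: in a document store, every record carries its own copy of every field name. A quick back-of-the-envelope sketch in plain Python (no database involved, with deliberately verbose hypothetical field names):

```python
import json

# A record whose field names are longer than its values --
# the worst case for per-document key storage.
doc = {
    "transaction_timestamp": 1,
    "originating_account_id": 2,
    "destination_account_id": 3,
    "transaction_amount_cents": 4,
}

serialized = json.dumps(doc, separators=(",", ":"))
key_bytes = sum(len(k) + 3 for k in doc)  # +3 for the two quotes and the colon
total = len(serialized)
print(f"{key_bytes}/{total} bytes ({100 * key_bytes // total}%) spent on keys")
# 101/110 bytes (91%) spent on keys
```

Column stores and schemaful tables pay for each field name once per table instead of once per row, which is where the "90% keys" complaint comes from.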

The concern I have is that RethinkDB announced their death and vanished for 9+ months, leaving everyone in the lurch, before resurfacing. The way they abandoned their users was deplorable, so why trust them again? If you are into S&M, just buy Oracle licenses...

Slava @ Rethink here.

I'm not entirely sure what you're referring to, but the closest I can think of is the move from the memcached interface in 1.1 to the ReQL interface in 1.2. If that's what you mean, I don't think it's fair to say we abandoned our users at all.

The memcached interface had very few people using it (literally single digits). We tried really hard to make it work, but there just wasn't any demand, so we decided to add a full query language, clustering, and rebuild with the realtime architecture. We supported the binary for a while, and helped most of the users migrate from the memcached interface to ReQL (which was fairly easy).

We also helped people migrate to other memcached alternatives if they chose not to use ReQL. In almost all cases people could quite literally pick another compatible product without changing any of their code.

We took our time, helped people migrate (either to the newer version of RethinkDB, or to other products), and integrated the original architecture and as much of the code as we could into the new and improved RethinkDB. All of this was completely free of charge.

So respectfully, I really don't think you're being fair to us. I'm sorry if this inconvenienced your company, but given the dire circumstances at the time we really did the best we could (and arguably, much, much more than most companies do in those circumstances).

May I know when that happened? I'm curious about using RethinkDB after reading Rob's post

I imagine the One Direction reference will be lost on most readers.

All of the ReQL examples are completely unreadable relative to SQL.

Call it declarative, format it however you want - a purpose-built DSL like SQL is always going to be easier to grok than a Javascript-inspired functional language (for me at least).

It's not a "javascript inspired language". It's the language you're using. If you're using Rethink from Python, you write your queries in Python, etc.

See, for instance, http://rethinkdb.com/api/python/

(You can easily translate most doc pages to the language of your choice using the links at the top of each page.)

So, they're just shipping supported ORMs in multiple languages.

It's not an "ORM" though, and ReQL _is the language_.

How can you say it's "the language" if it's different from one language to another?

Depends on your perspective I guess. Having struggled with building easily composable SQL queries for almost 20 years now, I'll take the functional language approach thank you very much.

Thank you, I updated my statement to reflect that.

However, I am imagining a contest where SQL people write SQL and functional people write functional queries... I would bet money that SQL people could identify basic facts about SQL queries faster than functional people could identify the same facts about functional queries.

Could be a fun programming game.

All of the ReQL examples are completely unreadable relative to SQL.

Because SQL is very familiar and ReQL is not.

I think it's more because the javascript-with-callbacks is chatty and ugly. But you can also write it in a more stream-like style. I have no idea why so many examples out there use the nastier style. An example stolen from another commenter:


> I think it's more because the javascript-with-callbacks is chatty and ugly.

We did the best we could with JavaScript. IMO the Ruby implementation of ReQL is dramatically more beautiful. Ruby's blocks fit so well that we don't even provide the `r.row` shortcut in ruby:

  r.table('users').filter{|row| row['age'] > 30}

With ES6:

  r.table('users').filter(row => row('age').gt(30))

You only write javascript if you're using javascript ;)

If you use Python, you use Python's native constructs, which you're naturally familiar with.

    r.table('users').filter(r.row['age'] > 30)

> All of the ReQL examples are completely unreadable relative to SQL.

I don't actually agree with this, but one solution would be -- as Google has done with several of its not-traditional-db storage products -- to build a library that takes strings in an SQL-like language (SQL with additions/deletions to align with RethinkDB's features and capabilities) and builds ReQL objects from them. But, while it might have utility, it's probably not as high a priority for the RethinkDB team as getting the core features right.

Might be an interesting third-party add-on, though.
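As a toy illustration of what such an add-on would do, here is a deliberately minimal Python sketch (hypothetical, handles exactly one query shape) that maps an SQL string onto a ReQL-style method chain:

```python
import re

def sql_to_reql(sql: str) -> str:
    """Translate a trivial 'SELECT * FROM t WHERE col > n' into a
    ReQL-style chain. A toy sketch -- real SQL parsing is far more involved."""
    m = re.fullmatch(
        r"SELECT \* FROM (\w+) WHERE (\w+) > (\d+)", sql.strip(), re.I)
    if not m:
        raise ValueError("unsupported query shape")
    table, col, val = m.groups()
    return f"r.table('{table}').filter(r.row('{col}').gt({val}))"

print(sql_to_reql("SELECT * FROM users WHERE age > 30"))
# r.table('users').filter(r.row('age').gt(30))
```

The hard part a real library faces is not this mapping but SQL's joins, aggregates, and subqueries, which is presumably why nobody has shipped one.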

I seem to recall them saying a few years ago that they were not going that way for a reason. That they are opting for an intelligent protocol/driver approach instead so that there is more control over how data is passed and error handling is done... but I don't remember the details


I know you speak at least one more language than I do.
