We use RethinkDB (workshape.io)
139 points by hiphipjorge 777 days ago | hide | past | web | 73 comments | favorite

I have been using RethinkDB for a while now (although mostly for small side projects + maintaining the Go driver https://github.com/dancannon/gorethink) and have really enjoyed it; having a functional query language is quite refreshing. The recent introduction of change feeds is also really cool, and building a realtime app with websockets was surprisingly easy.

Honestly, I would recommend RethinkDB to anybody looking to start a new (small- to medium-sized) project. While there are some small performance issues, this is to be expected for a project at this early stage, and after seeing how the RethinkDB team works I am confident that these will be sorted out pretty quickly.

We have used RethinkDB in production for a handful of months now. 100M docs, 250 GB data spread out on two servers.

We added it to the mix because it got increasingly difficult to tune SQL queries involved in building API responses, especially for endpoints that needed to pull data from many tables.

Our limited experience of MySQL operations was also a factor. We're on 5.5 and couldn't do some table operations that seemed promising without service disruptions. There were solutions to perform the actions we wanted without downtime but they scared us a bit. We also looked into upgrading to 5.6 or MariaDB but that seemed like it would take a long time and need much testing, while there were no guarantees that we would see performance gains.

We looked for alternative solutions and found RethinkDB. We reused the parts that serialize data for the API and put the resulting documents in RethinkDB. Then we had our API request handlers pull data from there instead of from MySQL and added indexes to support various kinds of filtering, pagination, and so on. We built this for our most problematic endpoint and got the two-server cluster up and running in about a week, tried it out on employees for another week, and then enabled it for everyone (with the option to quickly fall back to pulling data from MySQL).

This turned out to work well and we saw good response times, so we did the same thing for other endpoints.

There's some complexity involved in keeping RethinkDB docs up to date with MySQL (where writes still go) but nothing extreme and we haven't had many sync issues.

RethinkDB has been rock solid and it's a joy to operate.

> We added it to the mix because it got increasingly difficult to tune SQL queries involved in building API responses, especially for endpoints that needed to pull data from many tables.

Had you looked into using PostgreSQL's materialized views? You can add indexes to the view, with the additional bonus of the view hiding those joins from client code.

> We're on 5.5 and couldn't do some table operations that seemed promising without service disruptions.

Everyone has this problem. But it's been largely solved in practice by performing the schema changes on slaves, and then promoting the slaves to master.

Also, if you're just using RethinkDB as a delayed (and almost certainly inconsistent) secondary storage system, why not use ElasticSearch instead?

BTW, 250GB fits in memory on any decent size box. You're not really going to see how things scale till you get into the terabytes.

I don't think 250GB will fit in memory on any reasonable sized box. What world do you live in?

A world where people use real servers.

All of our physical boxes have 256gb+ of RAM.

They run VM's via XenServer for most of our uses, but our production DB's run on bare metal as they need the RAM and disk performance.

An R720 from Dell, or a similar model, with two 600GB Intel S3500 DC SSDs, 20 cores, and 256GB of RAM will go for $5k-7k. You can bump this to 384GB of RAM without going above $10k.

And up to 768GB if you go with 32GB LRDIMMs for about $14k (intel oem chassis).

When I changed the country to Japan, the sticker price jumped from 2000 USD to 15,000 USD eq. for a very basic system. I am just at a loss as to what can explain this disparity. Guess I will have to call up my vendor to get a comparable quote.

My tip is always to try to get in contact with a couple of resellers and play them off against each other in the price department.

If you are looking at larger purchases, 50k+ USD, then you should talk directly with Dell, HP, or a comparable vendor and put them into the playoff for who you choose :)

I always do that. I have worked with Strategic Sourcing for a while...so... ;-)

re: Elasticsearch

Rethink's 'ungroup' method lets you chain multiple reductions, which is incredibly powerful for building aggregation queries. Elasticsearch doesn't have that capability, and its aggregations are severely limited as a result.

For example, with Rethink, it's very easy to compute a metric from metrics computed in a previous reduction. You can't do that with Elasticsearch, since its dsl allows metrics to only be computed from fields in the raw document, but not from other aggregation metrics.
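For readers who haven't used ReQL, here is a plain-JavaScript sketch of what that two-stage pipeline does semantically. The data and field names are made up for illustration; in RethinkDB this would be a single chained `group`/`ungroup` query rather than hand-rolled loops.

```javascript
// Stage 1: group orders by customer and sum amounts,
// like ReQL's .group('customer').sum('amount').
const orders = [
  { customer: 'a', amount: 10 },
  { customer: 'a', amount: 20 },
  { customer: 'b', amount: 50 },
];

const grouped = new Map();
for (const { customer, amount } of orders) {
  grouped.set(customer, (grouped.get(customer) || 0) + amount);
}

// "ungroup": turn the grouped result back into an ordinary sequence of
// { group, reduction } rows that later pipeline stages can consume.
const ungrouped = [...grouped].map(([group, reduction]) => ({ group, reduction }));

// Stage 2: a metric computed from the previous reduction --
// the average of the per-customer sums, not of the raw amounts.
const avgPerCustomer =
  ungrouped.reduce((sum, row) => sum + row.reduction, 0) / ungrouped.length;

console.log(avgPerCustomer); // 40, i.e. (30 + 50) / 2
```

The key move is that `ungrouped` is an ordinary sequence again, so the second reduction can read the output of the first — exactly the chaining that a metrics-on-raw-fields-only DSL can't express.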

Splunk has an eventstats command which computes metrics and assigns them to fields of documents so you can process them. Is that something similar? (Except for the fact that Splunk's invoices are known to cause cardiac arrest?)

besides the price, it's an extra moving part in the system. Also, can it add the fields to the aggregated result or just the raw documents?

Splunk can do both: stats aggregates; eventstats adds aggregate fields to raw documents.

What is the hardware of your servers and how many queries are you performing per second? How is the performance and latency?

Things I love about RethinkDB as a NodeJS dev:

- first class support for JS bindings, unlike mongoose which wraps the super low level mongodb js library into something palatable but crashes in a horribly undebuggable way.

- server-side joins

- a nice web UI for monitoring and running queries packaged up with the service

- public docker images that are super simple to run

- easy clustering

From the RethinkDB docs [1], I am still a bit confused how this locking system works for reads/writes, and I'm also a bit skeptical regarding their claim that 'in most cases writes can be performed essentially lock-free'.

I am using MongoDB and didn't have many issues when my databases had 120,000 documents either, the problem began when we hit the millions... The combination of write locks and our need for dynamic queries (meaning: we can't index) made the database the worst performance bottleneck in our system by far. Although I must be honest that we haven't yet tried MongoDB's new 3.0 version that promises a boost in performance [2] and also has 'document-level locking and compression' [3]

Is anybody aware of any benchmark that perform random writes (inserts/updates) and non-indexed reads for RethinkDB? (Is it even a common use scenario, anyways?)

[1] http://rethinkdb.com/docs/architecture/#how-does-rethinkdb-e...

[2] http://www.mongodb.com/mongodb-3.0#performance

[3] http://docs.mongodb.org/manual/release-notes/3.0/#wiredtiger...

> I am still a bit confused how this locking system works for read/write and also a bit skeptical regarding their claim that 'in most cases writes can be performed essentially lock-free'.

Hi, Slava at RethinkDB here.

RethinkDB uses MVCC to do locking. Essentially, when we lock down a block for a write, we make a copy of the block. If another query comes along that wants to read, it reads from the copy. When the write completes, the old copies are destroyed.

There are lots of details I'm glossing over -- optimizations to avoid copying too much, copying entire subbranches of the btree to have a consistent view of the shard, etc. All this stuff isn't unique to RethinkDB -- it's pretty standard database internals stuff, and we haven't done anything new in that department. It's just an implementation of standard database architectures (as far as the caching/query engine/storage engine are concerned).
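The copy-on-write idea described above can be sketched in a few lines of JavaScript. This is a toy model, not RethinkDB's actual implementation, which operates on B-tree blocks with the optimizations Slava mentions:

```javascript
// A writer prepares a private copy of a block while readers keep seeing
// the committed version; the new copy only becomes visible on commit.
class MvccBlock {
  constructor(data) {
    this.committed = data; // version visible to readers
    this.pending = null;   // in-flight writer's copy
  }
  read() { return this.committed; }         // readers never wait on writers
  beginWrite() {
    this.pending = { ...this.committed };   // copy the block for the write
    return this.pending;
  }
  commit() {
    this.committed = this.pending;          // swap in the new version
    this.pending = null;                    // old copy can now be dropped
  }
}

const block = new MvccBlock({ balance: 100 });
const draft = block.beginWrite();
draft.balance = 150;
console.log(block.read().balance); // 100: a concurrent read sees the old version
block.commit();
console.log(block.read().balance); // 150: the write is now visible
```

This is why reads can proceed "essentially lock-free" during a write: the reader and writer never touch the same copy of the data.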

FYI, with MongoDB, just because you can't and shouldn't index everything doesn't mean you can't have any indexes... if your most common fields are bringing your queries down, indexes on them are still pretty helpful.

I actually really like where RethinkDB is headed, and within the year most of my issues should be resolved.

Another couple of databases to consider, depending on your needs, would be Elasticsearch and Cassandra... it really depends on your use case.

Could you list your issues with Rethink? It would really help for product prioritization.

As mentioned in another thread, automagic failover is the one still pending, along with geospatial indexes/searches (now in the product).

It's wild how many options there are that tailor themselves to all kinds of data out there.

I'm not involved with RethinkDB but I lurk on their github issues and I'm pretty confident that automagic failover is dependent on them getting (their own implementation of) Raft integrated with everything. Looks like it's getting close, as a whole slew of issues were opened just the other day relating to Raft work.

Are unique indexes planned for any of the upcoming releases?

The primary index for the table is always unique.

We do not plan to implement secondary unique indexes. The philosophy behind RethinkDB is that if a feature cannot be efficiently scaled across multiple nodes we don't add it, and unfortunately unique secondary indexes are one such feature.

AFAIK most NoSQL databases don't implement it, and a few that do take two approaches -- forbid it on multiple shards, or just take a massive performance hit during sharding.

We chose to keep the feature out of the database. This way the application can be architected to account for it, so it remains fast as the database scales up.
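One common application-level pattern for this (an assumption on my part, not an official RethinkDB feature) is to lean on the fact that primary keys *are* unique: keep an auxiliary table whose primary key is the field you want to be unique, and treat a duplicate-key conflict on insert as a uniqueness violation. Sketched here with a Map standing in for that auxiliary table:

```javascript
const emailIndex = new Map(); // stand-in for a table whose primary key is the email

// Try to claim an email for a user; mirrors an insert that reports a
// duplicate-primary-key conflict as { errors: 1 }.
function claimEmail(email, userId) {
  if (emailIndex.has(email)) return { errors: 1 }; // key already taken
  emailIndex.set(email, userId);
  return { errors: 0 };
}

console.log(claimEmail('a@example.com', 1)); // { errors: 0 }
console.log(claimEmail('a@example.com', 2)); // { errors: 1 } -- already claimed
```

Because the uniqueness check rides on the primary index, this pattern keeps working when the auxiliary table is sharded by that key, which is exactly the scaling property the comment above is describing.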

This is good, for what it's worth. Having different feature sets for sharded vs unsharded is just utterly confusing, and something MongoDB got really wrong.

Oh yes, agreed! and we do have some indexes, but unfortunately it isn't enough :(

And we are actually considering ElasticSearch to deal with this querying performance issue.

We had a similar issue with queries that didn't match an index... later pages would time out, etc... we limited our results for that class of queries. The next generation will use Cassandra with some custom searching/caching... that will work a bit differently.

No database can be fast without indexes. If the queries you get are defined by your users, you can still create indexes for the common ones.

The way business logic in the system is currently designed, there are no 'common ones', except for some ids (already indexed) used in other processes rather than in this dynamic filter.

Based on that we also evaluated the approach of using something else like druid [1] [2] that is built for reading performance, but I am still studying possibilities and have no idea about the impact and problems a change like that would impose.

[1] http://druid.io/

[2] https://metamarkets.com/2014/building-a-data-pipeline-that-h...

Another thing to note is that Mongo is simply very slow. Without indexes, something like Postgres is much more likely to be fast.

+1 to using RethinkDB! I'm also using RethinkDB in production, and I love it! The only issue is that you have to set up persistent filters via iptables in addition to having an authKey. They do have a guide[0] for that, however they do not provide any instructions for ensuring that the filters on iptables stay up, or how to restore them if they are temporarily wiped out :/

[0] http://rethinkdb.com/docs/security/

I created an issue for this on our docs repo[0]. Thanks for the feedback!


It's a good point. Though I think people tend to run these on non-public machines. Even most of AWS is on VPC nowadays.

I've been using/following RethinkDB since I started as Lavaboom's CTO. It's been a smooth ride so far, and the occasional perf improvements are always welcome. Some aspects of the database are especially lovely, like the web admin or painless deployments of new nodes, especially if you're using Docker.

Shameless plug (we've just open sourced most of our services): https://www.lavaboom.com/

We use RethinkDB in production and our main frustration lies around the lack of automatic failover. We're looking forward to 2.0, which is supposed to bring automatic failover (using Raft for consensus) to RethinkDB.

Slava @ RethinkDB here.

Unfortunately automatic failover won't be a part of 2.0, but it will happen very quickly after that. Please hang in there, we expect to ship this feature some time in May.

I just saw a demo of the failover feature yesterday from Tim Maxwell (the lead engineer on this), and it's really impressive! Another side benefit of this feature is live reshards -- you'll be able to reshard/rebalance data without any availability loss on the cluster.

The code is there and just needs a bit more polish and a lot of testing. I'm very excited to get this out, it's probably the last part of RethinkDB that I'm not 100% proud of yet (but will be in a month or two).

You guys are killing it. Wish I had a product I could write around rethink....currently at the day job our stuff is mostly Mongo....all layered under django-nonrel with lots of mongo crud so a port wouldn't really be an option I don't think.

As a person who agrees -- maybe you could write a port/adapter for django-nonrel to Rethink?

Also, why not start a new greenfield project to test out Rethink? A something-something-realtime-something-geospatial-something app should be a fantastic way to kick the tires, since that's one of the things that Rethink does really well out of the box (as of 1.15) compared to other databases (relational or not).

Plenty busy at the moment and I don't generally code in my free time - I work from home so I really try to maintain that work/home life separation.

As the ecosystem engineer at Rethink, I would love it if someone adapted Django nonrel for rdb. It's a big job.

I am rather horrified to hear of your setup, how has it worked out for you in practice?

It works, I guess?

I was brought on well after the system was originally developed.

The websites are mostly our internal admin tools anyway.

Most of the real work is run through cronjobs or task queues (Celery).

The biggest annoyance is that the Django version the stable django-nonrel is based on is ancient (1.3). There are non-stable branches to newer Djangos (1.5, I think?). When I investigated there were some issues with them so we're still on 1.3.

That was one of my two bigger issues earlier on... the other one being geographic based searches (which is now implemented iirc).

Yep, geospatial support has been in since 1.15.

Is anyone using RethinkDB as a "lightweight" Business Intelligence / Analytics / Data Warehouse store? (Maybe for use cases like Amazon Redshift?)

It seems like there could be a sweet spot of their nice query language, schemaless, and easy scaling for moderate size data sets?

I'm kinda tired of having to go all in on a big hadoop-ecosystem just to figure out average X to Y in a dataset....

Slava @ RethinkDB here.

Some of our biggest users (to be announced in a few weeks for 2.0) use RethinkDB this way. You can't really do deep analytics/machine learning as RethinkDB wasn't designed for that, but if you want to store a lot of data, and then run lightweight aggregation or map-reduce queries on that data, Rethink turns out to be a really good product for it.

One issue I see with this path is that if your queries ever get a lot more complex, you'd have to migrate off of RethinkDB onto Hadoop (which is a pain). I think that if you know for certain you just want lightweight querying capabilities RethinkDB can be really wonderful, but if there is a good chance you might need something deeper, it might be worth the effort to set up Hadoop early on.

Have you thought about a "read at timestamp" construct in RethinkDB?

It's not really an MVCC thing, and you can work around it in the data model, but for lots of reports (say, running in a cronjob), I want to run a query 'as' the database saw things at midnight UTC, even if I start running it at 2am. It would also make reports more reproducible... but maybe this is really a data-model problem. I felt when I read the Google Spanner papers that it was a potentially very useful feature for read-only queries.

I frequently use Datomic for this and it's awesome.

Currently you'd have to do it in the data model as you would in any other database. It's a pretty cool idea, though -- I'll think about what we can do (though admittedly, this is a bit removed from the current direction).
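The data-model workaround usually looks something like the following sketch (field names are illustrative): store every version of a row with a timestamp, and have a "read as of T" pick the newest version at or before T.

```javascript
// Versioned rows: each write appends a new version instead of updating in place.
const versions = [
  { id: 'acct1', balance: 100, at: Date.parse('2015-03-01T00:00:00Z') },
  { id: 'acct1', balance: 150, at: Date.parse('2015-03-02T01:30:00Z') },
];

// "Read at timestamp": the newest version of the row visible at ts.
function readAsOf(rows, id, ts) {
  return rows
    .filter(r => r.id === id && r.at <= ts)
    .sort((a, b) => b.at - a.at)[0];
}

// A report "as of midnight" sees the old balance even when run at 2am.
const midnight = Date.parse('2015-03-02T00:00:00Z');
console.log(readAsOf(versions, 'acct1', midnight).balance); // 100
```

With a secondary index on `[id, at]` the filter-and-sort becomes an index range query, so this stays cheap even as the version history grows.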

That sounds like a job for the changes feed: pre-digest data with a query and then pipe its changes feed into Hadoop's storage. (How fast can change feeds run? Would that end up being a bottleneck?)

I've always wondered what's the best way to integrate a database engine with the application.

1) Use a middleware/ORM/Whatever which abstracts away the query-lang of the db, and provides a pluggable multi-db support

2) Just use native db query language with all exclusive features of the engine.

Why do companies like workshape.io prefer the latter?

Abstractions are leaky. [1]

ORM systems are highly complicated abstractions.

Most of your developers will use them without understanding how they work. In many cases, the only way to understand how they work is to read the source code.

They have magic features that are advertised as convenient but when they inevitably do something you don't want them to do you'll tear your hair out trying to circumvent them.

They put lots of complicated weird stuff in your stack traces so when something goes wrong with "the database stuff," which is probably going to happen every day, you will feel confused and overwhelmed.

ORM was famously referred to as "the Vietnam war of computer science" by Ted Neward. [2]

There's a point that I think is even more important than the unruly and bewildering complexity of ORM, but I'm not sure I know how to formulate this point.

One way to formulate it would be to point out that your dichotomy of two choices is missing an alternative, so I present:

3) Code your data access in a separate module exposing query & save functions that make sense within your domain model.

In a reasonably complex system, this module might consist of fifty functions that concatenate SQL strings or whatever. In most cases, I'd bet money that rewriting this module to support some other data storage—especially if there are integration tests—would be easier and more pleasant than switching your ORM and then dealing with the random problems that will inevitably occur.

And when some query fails or is slow, the developer assigned to fix it will just go into the file, find the query, and change it. It's simpler: there's less obscure technology to worry about and fewer things to get angry at.
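Option 3 in miniature might look like this sketch. The names and the `run` driver shim are illustrative, not any particular library; the point is that every query lives behind a function that speaks the domain's language.

```javascript
// A small data-access module: all SQL strings live here, behind
// domain-level query & save functions. `run(sql, params)` stands in for
// whatever database driver call the application actually uses.
function findBooksByTitle(run, fragment) {
  return run('SELECT * FROM books WHERE title LIKE ?', [`%${fragment}%`]);
}

function saveBook(run, book) {
  return run('INSERT INTO books (title, author) VALUES (?, ?)',
             [book.title, book.author]);
}

// Swapping storage later means rewriting these bodies, not every call site.
// A fake driver also makes the module trivial to test:
const issued = [];
const fakeRun = (sql, params) => { issued.push({ sql, params }); return []; };

findBooksByTitle(fakeRun, 'beginner');
console.log(issued[0].params); // [ '%beginner%' ]
```

Replacing MySQL with RethinkDB (or anything else) then means reimplementing these fifty-odd function bodies against integration tests, with no ORM behavior to reverse-engineer.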

[1]: http://www.joelonsoftware.com/articles/LeakyAbstractions.htm...

[2]: http://blog.codinghorror.com/object-relational-mapping-is-th...

Preprocessed prepared queries that are shipped along with whatever package is using them are a far easier solution than quirky ORM tools that can do the simple things but have a tendency to break on the harder things, or encourage an authoring style that destroys performance.

That said, I think the world could use a CoffeeScript-esque transpiler for targeting SQL. Preferably with some kind of frontend/IL/backend separation, so that everyone can take a crack at replacing the awful SQL syntax.

You should check out thinky.

Because of the chainable query language you don't have to build new abstractions. You can just use the same one.

Meaning that there is nothing (or very little) to learn to switch from the driver to thinky.

Alternatively.. wrap each domain's data as a separate micro-service, with a convenient API... Though it really depends on how you can break down the boundaries of your application's data. Then you can persist/represent that data however you like.

Well, here's where the difference between query builders and ORMs comes in.

Query builders (usually integrated with ORMs) are usually used by people who don't want to write any SQL. SQL is very performant and powerful, but not that easy to understand or write. This is especially so when you think about the context switch between programming languages and SQL.

The advantage of some of the NoSQL databases (MongoDB and RethinkDB, for example) is that you have the luxury of using an ORM only when it makes sense to use it, instead of relying on ORMs as a crutch for not knowing SQL.

The second approach seems better, but obviously SQL (power, performance, and prevalence) cannot be ignored.

I disagree. I use ORMs/Query Builders for two reasons (even if deeply familiar with SQL syntax and semantics).

1. Dynamic queries for reporting purposes. Is it possible to hand roll this with SQL string concat for every query? Yes. Some of us value time and correctness.

2. Type safety/refactoring and change management. Sure, even a query builder/ORM model doesn't know if it matches the production database at compile time. But when your database does change and you need to update the model to match it, it is a lot easier to do a refactoring on the record than to manually update every SQL query string where that table is involved and hope there isn't a missing test or an invalid query out there which will now be broken.

I see him here and on Reddit every once in a while, but there is a cool client-side encrypted note-taking app, Turtl, using Common Lisp and RethinkDB server-side, and what was node-webkit client-side. Very cool; everyone should check it out.



Not a dedicated user, but I have been playing with this dude's CL work and I like his approach and attitude. Maybe thought people would want to see a self-hosted RethinkDB proj.

I'm also working on a noSQL database. What I'm struggling with is the abstraction for searches/filters. For example, if you want to get all books with "beginner" in the title, in SQL it would look something like:

  "SELECT * FROM books WHERE title LIKE '%beginner%'"
Whereas in NoSQL it would look like

  book.filter({title: ["like", "beginner"]});

Any ideas on how to abstract the filtering in a more clear way?

In RethinkDB the query looks like this:

    r.table('books').filter(function(doc) {
      return doc('title').match("beginner")
    })

Or if you have a secondary index set up correctly, this could work:

    r.table('books').getAll("beginner", {index: "title"})

Interesting to hear - would be interested to hear someone utilizing more of the selling points - distributed queries, etc.

I'm a relative newcomer to the NoSQL scene and have been using RethinkDB for a couple of side projects. The IRC channel (#rethinkdb on FreeNode) is really second to none - the people on there are incredibly friendly and patient when answering what are probably obvious questions.

Is there a Windows version on the roadmap?

Great post, would love to see RethinkDB with horizontal scalability in future releases.

Why is it so trendy these days to say that everything was built "with love"?

It's just a meme.

Like it is also trendy to have an over-sized picture of young people working on wooden desks in an industrial-chic office taking up 70% of the screen space.

Goes nicely with the Bootstrap template, Lobster font, Circular cropped photos of founders, Ping pong tables, Ruby on Rails. Etc. Etc.

Upvoted for sheer cynical accuracy ;-)

That's a good point! Though in RethinkDB's case I'd like to make an exception.

~30,000 commits and long discussions such as this one: https://github.com/rethinkdb/rethinkdb/issues/281

It's a way to imply that all of the fiddly little details don't suck.

Everyone's run into that library, SaaS product, or other bit of software where all of the features sound awesome, it appears to do exactly what you need ... and you'd rather debug a plugged in blender than actually use it.

Would you much rather it was not built with love?

I really fail to see how saying "built with love" matters at all.

I would kinda like a DB built with cold-hearted ruthless efficiency and precision, actually.
