If you're an ambitious engineer thinking about joining a start-up, this is your chance to be smart about it.
The RethinkDB team is the nicest group of people one could hope to work with in the Bay Area. They have a great combination of often mutually exclusive traits: a hacker-friendly business model (no ads, tech for cash) and aggressive, tech-savvy founders who are nonetheless humble, honest, and nice.
And the product works, is well liked, and serves an exploding market, so the probability of failure is quite low by typical start-up standards.
We are aggressively hiring. Most job descriptions aren't up yet, but will be posted in the next few days. Here is a quick list of high priority positions:
* Reliability engineer
- Good working knowledge of Bash and Python
- Good working knowledge of C and/or C++
- Automate testing and benchmarking infrastructure
- Make sure long tests/benchmarks reliably run
- Get to the bottom of stability/performance problems, and
work with the engineering team to fix them
More info: http://rethinkdb.com/jobs/reliability/
* Technical writer/developer advocate
- Good command of the English language, style, tone, etc.
- Ability to explain complex ideas in simple ways in written form
- Improve API documentation, guides, and tutorials
- Help support users and bring their concerns back into the product
- Deep understanding of one of the above stacks (you should
be able to hack on core Rails/Django/Node, know the
conventions, and respect the community)
- Write software to make RethinkDB the absolute best possible
experience for your respective stack
- Then tell the community about it
* Visual designer
- Produce beautiful web designs and illustrations
- Mastery of Adobe Creative Suite
- Mastery with a tablet and stylus
- Exhibit creative taste
Responsibilities (we take visual design and user experience extremely seriously):
- Design user interfaces
- Help rebrand our website for commercial / community versions
- Incorporate illustration and a unique visual style into our brand
- Creative one-off projects (logo, t-shirt designs, marketing
  materials, postcards, packaging; generally make things)
Email email@example.com -- we'd love to hear from you.
I don't seem like a good fit for any of these positions (Java/C++ engineer >.<), but I'd love to learn more about database internals, how things work, and how to make them better. Is there much to do in that area, or is it pretty much complete? Should I just pick a bug and try to fix it, or is there some process to follow? >.<
We're hiring C++ engineers. You don't have to know anything about database internals, but you do have to know C or C++ well and understand all the standard software engineering stuff. (I didn't list this position because it's a less hair-on-fire sort of thing.) Please e-mail me -- firstname.lastname@example.org.
As far as getting started on the codebase, we unfortunately don't have a guide/mentorship program yet for hacking on the internals. The best way is probably to pick a really simple bug and try to fix it. I'll see if we can work on making getting started with core contributions easier.
I just recently had the opportunity to speak with Slava and Michael after one of my posts hit the front page of HackerNews. They are definitely nice people, I enjoyed speaking with them.
I am no expert on databases, so I don't have much to say on that front, but the admin UI for RethinkDB is definitely beyond anything I've seen for databases before. I know that's also something they value tremendously and I'm very happy to see them doing well.
Rethinkdb is a really well-designed system. I've been using it (not in production currently) as a better-designed MongoDB with a proper query language. I would recommend checking it out for new projects where a document datastore is appropriate, and to migrate away from a troublesome Mongo.
Three words: JOIN, JOIN, JOIN. For any non-trivial domain, you'll need to join different entity types. For example, a collection of users is fine for user names, (hashed) passwords, last logins, etc. A collection of documents is fine for text, fonts, and URLs. Now what happens if you have multiple users collaborating on multiple documents? Are you going to denormalize the shared documents into user objects, or are you only keeping document IDs in them? If you choose the former, you'll run into inconsistency very quickly. If you choose the latter, you'll need join support, or you'll have to write the code to do a poor man's join, which can be very inefficient.
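To make the trade-off concrete, here is a minimal sketch of the "poor man's join" you end up writing when you only store document IDs. This is plain Python with hypothetical in-memory data standing in for a document store; the names and records are made up for illustration.

```python
# Hypothetical collections: users reference documents by ID only.
users = [
    {"id": 1, "name": "alice", "doc_ids": [10, 11]},
    {"id": 2, "name": "bob", "doc_ids": [11]},
]
documents = {
    10: {"id": 10, "title": "Spec"},
    11: {"id": 11, "title": "Notes"},
}

def user_with_documents(user_id):
    """Poor man's join: look up the user, then resolve each document ID.

    Without server-side join support, every referenced document costs an
    extra round trip to the database (simulated here by a dict lookup).
    """
    user = next(u for u in users if u["id"] == user_id)
    return {**user, "documents": [documents[d] for d in user["doc_ids"]]}

print(user_with_documents(2))
```

The application code now owns the join logic, which is exactly the consistency/efficiency burden a server-side join would lift.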
I actually have been running into this a lot with my documents in my project. Thinking about making the switch to Rethink. The only thing I'm bummed about is having to migrate all my Mongo code over. Anyone here end up doing something similar?
So, if I have 30 nodes and my data is sharded between them, how is that better than having app/client side join, where I have more control what to fetch and what not? I could potentially cache results.
I mean, you realize that if data is distributed at large scale, it may take a while to fetch the data from all the nodes and join it...
If you let the DB do the joins, it can handle them more efficiently. For example, it could distribute the join to those 30 partitions of the main table and then merge the results, so the heavy computation is distributed and there are fewer bits to move around the network.
Now, in cases where you can optimize the joins yourself, you still have the option of doing it in your code with RethinkDB/CouchDB. I've done that too, usually when I know for sure that I can prune a big collection to a very small subset more efficiently than by using an index.
I would still argue that the client app is not the right level of abstraction for data joins, though, unless it is a big performance gain for very little extra complication.
If the join is "compute all f(s, t) for s in S and t in T" then you'll save bandwidth (having O(|S| + |T|) bandwidth over the network instead of O(|S||T|)) by doing it on the client. Of course you could just run `rethinkdb proxy` if you want to save ethernet bandwidth and run the query on RethinkDB while connecting to the local cluster node.
Ah, I see, that makes sense. Usually pulling both tables in full (or even a single table in full) to the client is not an option (and nobody does cross products in real-time systems). So people end up pulling a subset of table A, and then for each document in the subset issue a separate get to the db for table B (which is obviously worse than having the db do it).
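The pattern described above is often called the N+1 query problem. A rough sketch (plain Python, with a counter standing in for network round trips; the table contents are hypothetical) shows why per-document gets are so much worse than one server-side join:

```python
# Simulated table B, keyed by ID; each get() counts as one round trip.
table_b = {i: {"id": i, "val": i * 10} for i in range(100)}
round_trips = 0

def get(doc_id):
    """Simulated single-document fetch; increments the round-trip counter."""
    global round_trips
    round_trips += 1
    return table_b[doc_id]

# Client-side "join": one query for a subset of A, then one get per row.
subset_a = [{"id": n, "b_id": n % 100} for n in range(50)]
joined = [{**row, "b": get(row["b_id"])} for row in subset_a]

# 1 query for the subset of A plus 50 individual gets = 51 round trips,
# versus a single round trip if the database joins server-side.
print(round_trips + 1)  # → 51
```

The absolute numbers are made up; the point is that the client-side version scales linearly in round trips with the size of the subset.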
This was brought up in a big way in the recent firestorm around Mongo, but the lack of joins precludes it from any use as a relational database (which is fine, since it's not intended as one). However, I've never been able to use Mongo as a large-scale store as a result.
It seems like the addition of joins lends some relational capabilities which in my eyes is very impressive. I'll be watching the advance of this one with interest.
I was talking to a guy from Mongo at a conference and he said that RethinkDB's sharding is as bad as Mongo's because it can't be done in real time. Is this true? Do you think he meant that it would just be slow to move to sharding for a big database?
In practice today, Rethink's sharding is very similar to Mongo's sharding. This won't remain the case for long -- you'll soon be able to add and remove shards live if that's a requirement in your application. The architecture's already in place, but there are some loose ends we have to tie up to expose this functionality to users.
In addition to what others have posted, CouchDB is one of the few NoSQL solutions that will give you master-master replication.
Mongo and Rethink are both single-master/multi-slave solutions.
One is not better than the other, just depends what you need.
And as someone else mentioned, if you want complexity out of your CouchDB queries, you must write map-reduce functions to provide those "views" for you to query. RethinkDB you can treat much like a SQL data store and just execute queries against it.
Couch has a much shorter list of operations. Really only get, put, delete, and whatever you can cook up yourself with mapreduce. Rethink has ad-hoc live queries and joins. I don't think Rethink has a tailable update log in the way Couch does.
Imagine an RDBMS table where many columns are nullable and many rows contain nulls in practice. Dealing with these is really painful in traditional database systems (you don't know how painful until you try a document store).
A document store flips the default. It makes dealing with data that has lots of nullable columns much, much easier. (It also makes dealing with hierarchical data a breeze)
There are lots of details, but this is the gist of it.
How is it better than an XML field in SQL Server, which allows indexing, schemas (if you want), and full querying inside the document? I think Postgres also has similar functionality with JSON, now, too.
Certainly a lot of applications would benefit from having a full RDBMS they can opt-in to document-style data when they feel like it?
Built-in horizontal scaling is one selling point for non-RDBMS stores, but large systems seem to just shard on top of RDBMSes anyways, right?
> How is it better than an XML field in SQL Server
It changes the default, which results in a drastically different programming experience. The difference is difficult to describe in the same way a dynamically typed programming language is difficult to describe to someone who's never tried one.
I'd encourage you to try a document store (Mongo, Rethink, whatever) for a throw-away project. A ten minute tutorial walkthrough is worth a thousand HN comments when it comes to stuff like this :)
OK, I will try it out for something and see how it feels different.
Related: The comparison to a dynamically typed language makes me suspicious. I spent a bit of time trying to find any examples of dynamic code that actually provided any benefit. Even read "Metaprogramming Ruby" and was dismayed to see examples of reading a CSV - big deal if I save a few quotation marks. The others were just places where the static type system wasn't good enough (duck typing), or dynamic code was a pain to get going (poor reflection/codegen APIs).
Both document databases and dynamic typing are at their best when you don't understand your problem domain. They let you express what you do know about your problem domain concisely, and then fill in the blanks later on. So in a document database, when you find that you want to record a new bit of data - just add it as a field to newly-created XML/JSON documents, and only display it in the UI if it's present. Or pick a default value if you need to perform computations with it. Don't bother with data migrations, don't bother with schemas, don't bother trying to backfill previous data. Try out your idea and see if it works first, because chances are, it doesn't.
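The "fill in the blanks later" workflow described above looks roughly like this in practice (plain Python, with hypothetical documents): old records simply lack the new field, and readers supply a default instead of running a migration.

```python
# Documents written at different times; "tags" was added later, so older
# records simply don't have it. No migration, no schema change.
articles = [
    {"id": 1, "title": "Launch post"},                 # old document
    {"id": 2, "title": "Roadmap", "tags": ["plans"]},  # new document
]

def tags_for(article):
    # Readers pick a default for the missing field instead of backfilling.
    return article.get("tags", [])

print([tags_for(a) for a in articles])  # → [[], ['plans']]
```

If the feature doesn't pan out, you stop writing the field and delete three lines of code; no schema was ever touched.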
If you always work on projects where the requirements are handed to you, specs are complete, and the problem domain is understood, this will seem terribly irresponsible to you. And it is - if you understand your problem domain, you should capture as much of that knowledge in the software system you build to understand it.
But if you are working in startups, or in consumer web, where you absolutely have to be on the leading edge or die and the only opportunities that haven't been picked over yet are the ones that nobody understands - being able to try things out without having to flesh out all your assumptions is crucial. You will run circles around the people who spend time defining their data model and speccing out their objects. And then when consumer tastes change - which happens quite regularly - you can adapt to them immediately instead of throwing out all the work you did under the old assumptions.
The other bit of context I'll toss in is to get in the mindset of solving a problem that you don't know how to solve and assume that your first 10 solutions won't work. For example, if you're reading a CSV - everybody knows how to do that, dynamic typing doesn't really help there. If you're cloning Stack Overflow, you can probably figure out what your database schema should be. But what if you're trying to figure out a new way for people to socialize over mobile phones? Where do you start there? That's the use case for dynamic languages and document DBs. The problems where technology is a tool for understanding & manipulating vaguely-defined social behaviors.
Thank you for the explanation. I can understand the logic.
Databases make it more cumbersome, and it takes more than one line to start using a new field. I totally sympathize with the flexibility issue there. Even with a document type, most syntax I've seen doesn't have truly first-class querying support (not as easy as a column, anyway). And it feels ugly to have some fields defined in the schema and some in a document. But that seems like a minor tooling issue -- there's no fundamental reason SQL can't let me do "WHERE x.SomeDoc.SomeField.OtherField > 5" (perhaps with some minor scope-resolution issues to ensure I'm not referring to some other multi-part name).
I'm using it to collect stats from various systems, events and alerts too. It makes it easy to collect new data points and add fields to new events and query things in a consistent way across the entire data set.
When the schema is not rigid and likely to change on a daily basis I prefer a document database over an SQL one. There are probably other use cases as well, this is just my favourite.
One example is a CMS. An article these days is a title, a body, 5-10 comments, and whatever other metadata you want to present on a page. Don't update the corresponding NoSQL record until there are writes in the RDBMS. Serve from the NoSQL system, with at least another caching layer in front. You get the best of both worlds and just one query instead of many.
I've been looking at RethinkDB for an upcoming project that has a decent amount of relational data. RethinkDB seems a lot different from MongoDB in that it's partially designed to accommodate relational queries. Has that been your experience?
Also, does Rethinkdb have any kind of transactional capabilities?
> Does Rethinkdb have any kind of transactional capabilities
You have full ACID on a single document, but not across multiple documents. In this way Rethink is similar to other NoSQL systems (except you can do almost any operation imaginable atomically on a single document in RethinkDB).
Nice, can't wait to try it out! Though in my case (single-node installations) the missing hot-backup feature is a deal breaker, so I'll have to wait till RethinkDB supports it. The current workaround (replicate, stop the slave, back up, start the slave) is just not good enough for me...
I'm looking forward to the LTS release so I can feel more comfortable using it in a production app.
A slight aside, but I spotted this (currently broken) integration of RethinkDB and Meteor the other day and wanted to share. It does away with the long poll Meteor is doing on Mongo. (I have no involvement in this project at all.)
Meteor core dev here! We are super excited about Rethink and Slava and I have been talking for a while about an official integration. We scoped multidatabase support out of the upcoming Meteor 1.0 release just to get it out the door faster -- we need to support people using Meteor in production with frozen APIs, and we need to do it yesterday -- but support for Rethink is something I'm very interested in exploring in 2014.
As for the polling on Mongo, you'll love what Meteor is shipping this week. Meteor now by default connects to Mongo as a replication slave and slurps up the replication log to drive your realtime queries.
Oh, I was just reading an article about this (http://www.hackreactor.com/blog/Building-a-RethinkDB-Module-...) and I'm sad to hear it's currently broken. I really hope the Meteor guys add support for something better than Mongo. I've already run into scaling issues on a Meteor app I use to gather analytics data. You really need to implement sharding very early on with MongoDB in order to keep memory use and locking issues under control. To me that means a large investment in hardware and infrastructure just for a personal project.
Used to think the same way about most investing till I figured out better.
I do not know the specifics of this particular deal, nor have I used the product, but I hope these points are going to be of some use.
Let us look at it this way. Assume there are 200 funds out there who can do a series A of this size. That would naturally mean that not every fund is going to be either a leader or someone who spots new trends (there are not enough trends out there). Naturally, a lot of them have to invest in deals in other companies in a hot sector.
A lot of investment is momentum-driven and momentum is often driven by the narrative. You have to remember that as long as a successful exit happens, the fund winds up with a good deal irrespective of whether the public (IPO) or the acquiring company (M&A) eventually profits from it. NoSQL has that momentum at the moment.
A healthy start-up ecosystem can easily support more than a handful of companies in a single domain. Once the narrative for the domain really picks up, even the not-so-great ones (again, I have no clue about RethinkDB) stand a good chance of being acquired as long as there is decent enough traction and the sector is so hot that there is pressure on the GPs to make a play in it.
The later they get into the game, the pricier the ticket becomes, but you get less risk too.
This is a really good breakdown. I can't read our investors' minds, but I'm pretty sure this would be a worst-case scenario for them. It's certainly not why we're doing Rethink -- if we thought it would be a #5 company in the space, we'd pack up and do something else (life's too short).
The NoSQL market is reminiscent of "horseless carriages" -- as long as you define a technology by an absence of something, you know you're early in the game. Databases are a fundamental part of the technology stack, and they tend to easily stick around for 20-30 years. We think we can build a long-term open source company that will stick around for that long (incidentally, that's why we take conventions in ReQL so seriously -- we imagine millions of programmers fifteen years from now cursing at us for a stupid naming convention).
It's not hard to imagine groundbreaking features in NoSQL products that nobody is shipping. That's why RethinkDB exists, and we think we won't be a niche product for long.
Slava, I love the "horseless carriage" analogy. I'll have to use that as I try to raise money in the NoSQL space :) Seriously though, it also points to a future where the NoSQL name is going to make less and less sense. Anyone have suggestions for the NoSQL database equivalent of the word 'car'?
Like my sibling commenter said, I prefer names that are more descriptive. I prefer "schemaless" databases, but it depends on what you do. Redis is called NoSQL but it's not a document database, it's more a key-value store with lots of slicing and dicing features.
First up, congrats. Whatever the back story is, getting funded (save runaway revenue/profits/margins) is always a crucial inflection point in a company/product's lifecycle as an enabler for bigger and better things. Whether those will eventually happen or not, nobody knows. But there are a lot of things that only money can accomplish and investment is a key enabler for a start-up that needs capital to scale/grow. If you find an investor who is aligned extremely well, it is a massive bonus.
Historically, a lot of good has also happened from a combination of events that may not exactly be awesome. Outcomes always trump everything else. So don't sweat the mind-reading angle much!
Can't comment much on the technical aspects of the product as I am not even remotely qualified to do something like that.
I'm excited for your team and I have immense respect for anyone who builds an OSS company. There is much that the world owes to numerous companies and individuals releasing code like this and don't get enough credit for it. So, thank you and hopefully it will come together very well for everyone involved :)
RethinkDB is a credible (perhaps major) competitor to MongoDB, which is valued at $1.2B. So this funding seems very reasonable.
There is actually a real shortage of innovative databases. For example, Redis was released just a few years ago, but all the ingredients had been there for decades.
It takes years to develop, quality demands are very high, and it takes a long time to build the reputation needed to get enterprise users. Also, experienced people are scarce and get 'hired away' by the big guys. A very hard field for a start-up.
RethinkDB is awesome. I have a stealth project which uses RethinkDB in the backend. I moved it over from Mongo over the weekend. It will be revealed soon. But I'm working with about 100 million records. Currently testing it with Node, slamming it with thousands of concurrent requests.
I'm doing the same thing, launching a service built on RethinkDB and Flask within the next week. I recommend giving Rethink a go if you're contemplating an easy to setup and easy to deploy (AWS supports them) database solution. It also helps that the community behind Rethink are a friendly bunch and the documentation is steadily growing.
I've barely scratched the surface of Rethink in terms of functionality and features; but I definitely see a market for it as a competent and approachable database for people who want something that 'just works'.
Second, the administration UI checks for version updates, which gives us some information about how many developers are using RethinkDB and how long they stuck around.
There is no perfect way to know because there are many sources of imprecise information, but we're quite confident about the overall conclusions of the data.
> I hope we'll see soon more official drivers...
We've been trying to keep the surface area of the project low for now. Which specific drivers are you interested in? (we probably won't be able to do anything about it for the next ~6 months or so, but having the info helps enormously)
Been trying RethinkDB on and off and really like the query language. Does anybody know if there are any RethinkDB hosting/DBaaS services out there? Sysadmin/devops is not my forte, and with LTS coming, I hope companies like MongoHQ/MongoLab/IrisCouch for RethinkDB start to pop up.
I have been using RethinkDB over the last month in a new project. If you know that a document store is the right solution for you, take a look at RethinkDB. I evaluated it against some of its competitors, and I must say that I was really amazed at the deep engineering thinking that is going into RethinkDB. The ease and power of its programming model (the use of ASTs/lambda functions and similar abstractions is awesome) and the attention to ease of deployment and manageability (great UI!) are unparalleled in like products. RethinkDB is a young product for sure, but one with a very bright future. In addition, being well funded should help alleviate fears and hopefully help it further gain traction.
I wonder if the low limits on the number of databases and tables will be fixed in 1.12.
I've seen issue 1648 closed for 1.12 and issue 97 to be completed. Are these two enough to fix the limits?
Will it be possible to have, say, 100K tables in a single database? Or in 1000 databases? Is it possible to have 10K databases?
I've read @coffeemug's explanation that a table is a heavyweight object requiring a few megs of disk space. But that's just a few TBs for 100K tables, which is perfectly fine for a 32-node cluster.
Also, it would be great if you could make a page like Mongo's "Limits and Thresholds". I understand that you have lots of other things to do, but that one is key in making a decision to use Rethink vs. other options.
This is so awesome. I have been excited about RethinkDB as a real MongoDB alternative and was just hesitating based on RethinkDB's ability to last. But now I know you will last. Just awesome. I'm going to start using RethinkDB on my next project.
There are two things that bother me about the Node.js driver. The first is that it doesn't have its own repository, you have to go to https://github.com/rethinkdb/rethinkdb/tree/next/drivers/jav...
The second one is that it's written in CoffeeScript. That may have been a good idea at the beginning, but if you want to have more traction and more developers looking into the source code I think you should 'translate' it into raw JS.
Read this article, installed and started messing w/it.. Got any stats or insight how this holds up in a real prod environment?
A very crude measurement -- I just threw it on a box that I'm 70ms away from and I'm getting insert responses back in 90ms. On Mongo, which I HATED (the "old" query language I was trying to understand... a while back it was thought for just a moment that we could get by without relational algebra), I was at about 200ms. So far so good, but how much can I hammer a particular node?
Looks to be using ~20MB of RAM on the server process, works for me..
Just curious, how many indexes were on the structure you were inserting? Also, I am assuming 90ms per structure; if that is correct, then that is equal to 11 inserts per second, which is unfortunately very slow.
The latter isn't subject to network round trips and can batch disk writes, drastically increasing performance relative to multiple individual writes. You can see a similar effect in most other database systems.
Assuming your 70ms in transit is accurate (although you have not provided details on how you measured that), 20ms of server processing time is only 50 inserts per second, which is simply untenable for a high-performing Rethink.
In effect, if I wanted to do 10,000 single writes per second, typical of most interactive systems, I would need 200 nodes to pull that off.
By default Rethink requires an fsync from disk to guarantee data safety. So a typical default query goes like this: client request -> network trip -> wait on disk fsync -> network trip -> ack. That's going to be really slow in any system, and makes 10k writes/sec pretty much impossible on any rotational disk.
In Rethink you can turn on soft durability, which allows buffering writes in memory. That drastically speeds up write performance. Another option is to use concurrent clients. Rethink is optimized for throughput and concurrency, so with multiple concurrent clients the aggregate throughput scales up without per-request latency growing linearly.
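The arithmetic behind the concurrency point can be sketched directly. This is plain Python using the numbers from this thread as assumptions (~20ms of server time per durable write, a 10,000 writes/sec target); real latencies will obviously vary.

```python
# Assumed numbers from the thread: ~20 ms per durable write (dominated
# by the disk fsync), and a target of 10,000 writes per second.
latency_ms = 20
target_writes_per_s = 10_000

# One serial client: each write waits for the previous one to finish,
# so throughput is capped at 1000 / latency.
serial_throughput = 1000 / latency_ms  # 50.0 writes/sec

# Concurrent clients overlap their waits on the fsync, so aggregate
# throughput grows with concurrency even though each request still
# takes ~20 ms end to end.
clients_needed = target_writes_per_s / serial_throughput

print(serial_throughput, clients_needed)  # → 50.0 200.0
```

This is why batched writes and concurrent connections change the picture so much: the fsync wait is idle time that independent requests can share.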
RethinkDB is much easier to set up and operate, has a much more powerful query language, and is much friendlier for application developers.
Cassandra supports high write availability in case of network partitions. RethinkDB does not. The flipside of that is that you (as an app developer) have to deal with conflicts in Cassandra which makes writing applications much more difficult.
If you need high write availability in case of network partitioning, go with Cassandra or Riak. If you don't, go with RethinkDB or MongoDB.
Much easier to set up? What can be much easier than unpacking an archive and running a script to start a node?
IMHO the main difference is the data model: RethinkDB is a document store, Cassandra is a wide-column store. As for the query language, I agree, but that's because Cassandra has never included anything that would not scale out.
Slava @ rethink here. There are hundreds of really cool production use cases of RethinkDB that we know of. We'll be publishing a "who's using it" page soon (we have tons of work and haven't gotten around to this bit yet, but will soon).
Note that Rethink is still in beta. Lots of companies already use it in production, but we advise people to test carefully until we ship a long term support (LTS) release. We'll also offer commercial support options then.
I tried to find the restrictions/performance costs of storing and accessing medium to large blobs of text in rethinkdb store (say, for a simple web crawler), but couldn't find docs related to that. Is there a size restriction or some other insights from you rethinkdb users or developers out there?
A blob is generally restricted to 10MB. We haven't published performance data thus far as we're still working on improving performance. If you try Rethink and run into problems, please let us know -- we'd love to fix them!
Build database, sell support. Like any other open-source DB. Or Riak, for instance, charges for site-to-site replication. Generally if you have an open-source project many people are using, you can make money with it.