

RethinkDB 1.10: multi-indexing - coffeemug
http://rethinkdb.com/blog/1.10-release/

======
Cyranix
I kicked off a new project just a few days ago with RethinkDB -- absolutely
loving it so far. This new release addresses a minor pain point and gives me
confidence in my choice of data store.

In case `coffeemug is still checking in here, I'd like to put in a request to
update the documentation page on drivers. The listed Elixir driver has been
broken for a few months; a working one can be found at azukiapp/elixir-
rethinkdb. Overall the page could be organized a bit better: putting the
languages in alphabetical order and not duplicating the list (which pushes
contributor names below the fold) seem like straightforward improvements.
Thanks for all the hard work -- hope to be sharing my project and showing off
some RethinkDB swag before too long!

~~~
jdoliner
These sound like very good ideas. I'll see if I can get them implemented.
Thanks for the suggestions.

~~~
coffeemug
A message has been dispatched to @mglukhovsky.

------
elithrar
Constantly impressed by the release speed and feature list. Thinking back to
1.4/1.5 (when I started using RethinkDB) and I wouldn't have believed myself
had I seen the current features.

------
Justsignedup
[http://rethinkdb.com/faq/](http://rethinkdb.com/faq/)

WHY DOES EVERYONE RECOMMEND MYSQL? Seriously, its 2013, and postgres is right
there in the land of amazing databases. Easy to set up. Easy to get started.
And works so beautifully. Why do people mention MySQL. There is zero benefit
to using it. Less than zero. It's harder to use than pg.

~~~
xd
I've been using MySQL for over 12 years. It's easy to set up, easy to get
started with, and works very beautifully.

Zero benefit, really, do you have to be so insulting?

~~~
justin_vanw
I can't tell you how many MySQL users I've worked with where 50% of their dev
work was doing complex caching and denormalization to avoid joins.

Now, as a MySQL user, you're saying to yourself "joins are slow". Yes. In
MySQL, joins are very, very slow.

I mean, you could write a book on all the things MySQL does in a fundamentally
broken way. It's a very long list. Generally, MySQL fans have only ever used a
single database system, MySQL, and they extrapolate that it's completely brain
damaged behavior is somehow a characteristic of relational databases. This is
not so.

To be blunt, if you think MySQL works beautifully, I mean, will you at least
admit that at least 95% of your experience is with MySQL? Anyone who had ever
used any of Oracle, Postgres or even SQLServer could ever say that with a
straight face.

~~~
xd
I develop data management systems and one area of my job is to work with
existing solutions and develop import/export routines, so I can assure you
have I have a lot of experience across the entire spectrum of RDBMS's.

The MOST interesting thing I've found with the systems I've worked with is the
insane use of SQL. Quite frankly, most developers have little to no real idea
what they are doing when it comes to SQL, which is evident when you look at
the rise of "NoSQL".

If you are a good software developer you will develop good software no matter
the language. If you are a good DBA you will write good SQL no matter the
underlying database system.

~~~
justin_vanw
MySQL prevents you from writing good SQL..

If you have a nice normalized schema, you will need to do joins, sometimes
involving multiple tables and joining to sub selects. With MySQL, this is just
a disaster, so you can't have a nice normalized schema, or if you do, you have
to stash data in a denormalized way somewhere.

~~~
xd
I've written just as complicated joins with MySQL as I have Oracle,
PostgreSQL, SQL Server etc etc all with very similar performance.

Can you provide/email me with some examples of what you are finding a
"disaster" and I'll help find the flaws in the SQL or Schema, whichever it may
be.

------
continuations
How is multi-indexing implemented? What kind of query performance can it
attain?

This seems somewhat similar to hyperdex's search function. It is implemented
through what is called hyperspace hashing. How would you compare yourself to
hyperdex?

~~~
coffeemug
An index in RethinkDB is an additional BTree that stores a record keyed by the
value of the index, not by the value of a primary key. When you query by an
index, RethinkDB looks the records up by looking at the index BTree instead of
the main BTree. This is similar to how most RDBMSes implement indexes.

Multi-indexing is implemented by inserting multiple rows into the index
B-Tree. So if your index for a document returns [1, 2], Rethink stores two
records into the index BTree -- one keyed by 1, the other keyed by 2.

This is different from hyperdex -- we don't do any algorithmic tricks with
spaces to make this more efficient. The model we used fit extremely well into
our architecture, so @jdoliner just went ahead and implemented it.

Hope this makes sense -- great questions! (Also, love your username)

~~~
continuations
How do you handle distribution of documents among nodes? Say document X has 3
tags "A", "B", "C". Do you place all documents with tag "A" in the same node,
all docs with tag "B" in the same (but potentially different) node, and all
docs with tag "C"...? In that case you'd need to put document X in potentially
3 different nodes. Or do you not do that kind of "pre-sorting" and just send
queries to every node asking for match for tag "A"? (I like my username too. I
was surprised it was still available)

~~~
coffeemug
Documents are distributed across nodes based on a primary key. Currently we
use range-based sharding, but will be moving towards hash-based sharding soon.
So, for any given document, we look at the primary key and determine where a
document should be.

We store a secondary index for a shard on the same node where the master for
the shard resides. So if you're storing users whose last names are between A-M
on node A, all secondary indexes for users with last names between A and M
will also be on node A. That means that for any secondary index query we have
to contact all nodes that have shards for a table, but we do a number of
systems tricks to make this really efficient.

~~~
continuations
> but we do a number of systems tricks to make this really efficient.

Can you talk about these tricks? I'd love to learn more about it.

------
Xeoncross
Brand new, written in C++, looks beautiful, nice client libraries, trying to
steal all the best parts of other NoSQL databases? I'm going to give it a
spin.

------
justin_vanw
Multi-indexing: index a single document by multiple values

So.. This is an inverted index. Not exactly a new concept, it certainly
doesn't justify a new name. Unless, maybe, the people making RethinkDB didn't
know.. no, that's impossible right?

Could a company commercially selling a database not realize they just
reinvented a basic data structure?

Then again, the google query "site:rethinkdb.com inverted index" has no
relevant results.

Maybe they think the name inverted index is more confusing than just inventing
a new name?

~~~
susi22
They settled for the name since MongoDB calles it "Multi-key" index.

Most people would probably not be familiar with the name "inverted index".

------
jonny_eh
How do you guys release such polished and worthwhile releases on such a
regular schedule? It's rather impressive.

~~~
coffeemug
Thanks. I'm one of the founders @ Rethink, and project/engineering management
is my job. The dev team here is simply amazing. It's not false modesty -- I've
been busy with some administrative aspects of the company for the past month
and haven't been able to put any time at all into project management. The guys
just took the issue tracker and ran with it -- I woke up this morning to
discover that we've got a new release.

So, my hat is off to all the DB Rethinkers. It's an honor to lead such an
amazing team.

EDIT: I do hire people and occasionally set the tone, so I deserve some credit
:)

~~~
justin_vanw
I strongly suspect that nobody thinks that when someone brags about the
company they own it's any form of modesty, false or otherwise.

------
dkhenry
Can secondary indexes be compound ?

~~~
coffeemug
Yes!

    
    
      r.table('users').index_create('full_name',
                                    [r.row['first_name'], r.row['last_name']])
    

This indexes users both on first name and last name.

------
alecco
Is this similar to multi-column indexes in SQL?

~~~
coffeemug
No -- multicolumn indexes are equivalent to compound indexes in RethinkDB.

RDBMSes don't need multi-indexing because they don't allow storing arrays of
values in any meaningful way. Imagine having a row in a database that contains
an array of values. For example, suppose that for every user you store an
array of cities they've ever lived in. RDBMSes typically can't do that, but
document stores can. If you'd like to efficiently find every user that has
ever lived in Mountain View, multi-indexing lets you do that.

~~~
corresation
>For example, suppose that for every user you store an array of cities they've
ever lived in. RDBMSes typically can't do that, but document stores can.

?? What? Are you talking about a junction table?

~~~
Nate75Sanders
That's not storing a collection in a table. That's approximating the storage
of a collection in a table by using standard RDBMS features.

If you think these are the same somehow, I encourage you to think about the
number of disk seeks required for recovering the full collection in both
cases.

~~~
vidarh
If all you need is the indexed values, then most mature RDBMs's supports using
covering indexes in queries and satisfying the query entirely from the index.

Since the point of B-tree's is to minimize disk seeks and packing values with
the same key close together, the number of disk seeks can often be made just
as small as if the collection had been stored in the "owning" record.

Secondly, whether or not it _is_ is an implementation detail. There's nothing
inherently preventing RDBMSs from supporting "inlining" of collections from
related tables. But in _practice_ most RDBMSs choose to support that exactly
through the much more generic optimization of supporting queries using
covering indexes...

In other words: It only takes a lot of disk seeks if you use a RDBMS that is
lacking in features and/or haven't added the proper indexes.

