
ToroDB – Document-oriented JSON database on top of PostgreSQL - davidgl
https://github.com/torodb/torodb
======
e1ven
Another project worth mentioning is mongolike - it uses PLV8 to implement the
MongoDB interface on top of Postgres.

[https://github.com/JerrySievert/mongolike](https://github.com/JerrySievert/mongolike)

~~~
jeffdavis
And don't forget Mongres:
[https://github.com/umitanuki/mongres](https://github.com/umitanuki/mongres)

~~~
tracker1
This is really cool, thanks for pointing it out... I'd personally prefer that
some of these implementations offered both Mongo wire protocol compatibility
and a similar interface within plv8... that would be awesome.

------
ahachete
There is now more detailed information on the project's wiki
([https://github.com/torodb/torodb/wiki/How-ToroDB-stores-
json...](https://github.com/torodb/torodb/wiki/How-ToroDB-stores-json-
documents-internally)) about how ToroDB stores JSON documents in tables.
Hope this information helps :)

------
michaelchisari
As someone who has been coding web apps for a very long time, it feels odd to
say this, but the PostgreSQL community is doing some very interesting things.
I'm excited to see where things go.

~~~
jsherer
Genuinely curious...why does it feel odd to say that?

~~~
chaostheory
I second that. Postgres has been doing interesting things for years now.

~~~
michaelchisari
When I first started web development (mid/late 90's), Postgres was the very
conservative open source choice. MySQL was fast, good enough, and broke a few
big rules.

That hasn't been the case for a while, but with the speed improvements of the
past few years, prompted by NoSQL, and MySQL's troubles under Oracle,
Postgres has become the open source SQL frontrunner, and everybody's taking it
in really interesting directions.

~~~
tracker1
I'm hoping the JSON & plv8 support becomes much more in-the-box, and that the
replication setup becomes more baked in as well.

Right now, I'm migrating from MS-SQL to a MongoDB replica set for most of our
core data. Mainly because the failover works better than most, and the
licensing costs for even MS-SQL with replication and failover are budget
blowing. Most of the PostgreSQL options for replication are really bolted on
with some serious drawbacks, and automatic failover to a new master is another
issue.

I've always liked PostgreSQL (and Firebird, for what that's worth)... I do
hope that some of these features become more of a checkbox item during
installation, and less of a bolted-on affair where you have to compile and
dive into the deep end, and even then only get something half baked.

------
ocdnix
"JSON documents are stored relationally, not as a blob/jsonb."

How is the transformation designed to go from a structured document to a flat
set of tables, akin to what an object-relational mapper would do?

~~~
ahachete
The transformation is performed at the ToroDB layer. Each document is analyzed
and several steps are performed:

\- The document is received in BSON format (as per the MongoDB wire protocol)
and transformed into a KVDocument. KVDocument is an internal implementation,
an abstraction of the concrete representation of a JSON document (i.e.,
hierarchical, nested sets of key-value pairs).

\- Then, KVDocuments are split by level (each level is called a sub-document).

\- Each subdocument is further split into a subdocument type and the data. The
subdocument type is basically an ordered set of the data types of that
subdocument.

\- Subdocuments are matched 1:1 to tables. If there is an existing table for
the given subdocument type, the document is stored directly there. If there
isn't, a new table with that type is created. This also means there is a 1:1
mapping between the attribute names (columns) and key names, which makes it
very readable from a SQL user's perspective.

\- There is a table called structure that is basically a representation of the
JSON object but without the (scalar) data. Think of the JSON object as only
the braces and square brackets, plus all the keys (or entries in arrays) that
contain objects. Per level, there is a key in this structure that contains the
name of the table where the data for that object is stored. This table uses a
jsonb field to store the structure, but note that there's no actual data in
this jsonb field.

\- Finally, there's a root table which matches each structure with its
documents. This exists because structures are frequently reused across many
documents, and it is one of the biggest factors in significantly reducing the
storage required compared to, for example, MongoDB, as the "common"
information of each "type of document" is stored only once.

This information and more will shortly be added to the project's wiki.
However, it's very easy to see for yourself if you run ToroDB and look at the
created tables :)
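The level-by-level split described above can be sketched roughly in Python.
All names and details here are invented for illustration; this is not
ToroDB's actual code:

```python
# Rough sketch of the level-by-level split described above.
# Names are invented for illustration; this is not ToroDB code.

def split_document(doc, level=0, out=None):
    """Split a nested dict into per-level "subdocuments".

    Returns a list of (level, subdoc_type, row) tuples, where
    subdoc_type -- the ordered set of (key, type name) pairs --
    is what would identify the backing table.
    """
    if out is None:
        out = []
    row = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            split_document(value, level + 1, out)
            row[key] = None  # the child lives in its own subdocument/table
        else:
            row[key] = value
    subdoc_type = tuple(sorted((k, type(v).__name__) for k, v in row.items()))
    out.append((level, subdoc_type, row))
    return out

parts = split_document({"name": "toro", "stats": {"count": 3}})
# One subdocument for the root level, one for the nested "stats" object;
# documents sharing the same subdoc_type would share a table.
```

Documents with the same shape produce the same subdocument types, which is
why the per-type tables (and the shared structure entries) amortize so well.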

Note: I'm one of the authors of ToroDB

~~~
yvsong
Is it possible to quantify the savings of storage and IO?

~~~
yangyang
I'd be very interested to know more about this too. There is considerable
overhead for each row in PostgreSQL (24 byte header, which must be aligned).
Obviously you'll save a bit if there are repeated keys higher up, but you'll
pay for the foreign key references to those rows in the children.
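A back-of-envelope sketch of that trade-off; every number below is an
illustrative assumption, not a measurement:

```python
# Back-of-envelope arithmetic for the trade-off discussed above.
# Every number is an illustrative assumption, not a measurement.

N_DOCS = 1_000_000
KEY_BYTES = 40     # key names repeated inside each stored document
VALUE_BYTES = 100  # scalar payload per document
ROW_HEADER = 24    # PostgreSQL per-row (tuple) header
FK_BYTES = 4       # int4 reference from a child row to its parent

# Document store: every document carries its own copy of the key names.
blob_total = N_DOCS * (KEY_BYTES + VALUE_BYTES)

# Relational layout: key names live once in the table definition,
# but each row pays the header and child rows pay a foreign key.
relational_total = N_DOCS * (VALUE_BYTES + ROW_HEADER + FK_BYTES)

savings = blob_total - relational_total  # positive when KEY_BYTES > 28
```

Under these assumptions the relational layout only wins when the repeated
key names outweigh the per-row header plus the foreign-key column, which is
exactly the concern raised above.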

~~~
ahachete
It is true that PostgreSQL has a high per-row overhead. But the storage
savings are significant anyway. Please see the presentation pointed out
earlier.

------
platform
As far as I understand it, shredding hierarchical documents into a relational
store is an optional feature of Oracle XMLDB:

[http://www.oracle.com/technetwork/database-
features/xmldb/xm...](http://www.oracle.com/technetwork/database-
features/xmldb/xmlchoosestorage-v1-132078.pdf)

It is called XMLDB structured storage (vs. binary storage, which actually
stores the hierarchical XML natively).

The XMLDB structured storage has, I think, been available since 2003 (but I
might be wrong there).

Oracle's XMLDB structured storage requires an XSD declaration. Within that
declaration, XMLDB can use 'hints' to indicate whether a given set of
attributes is related across documents. If so, it will shred the docs in a
way that stores the related attributes together.

The advantage, as the author of ToroDB noted below, is a) space savings and
b) the ability to use relational joins with disk-optimized access strategies.

The disadvantage (at least with Oracle XMLDB's structured option) is the need
to declare the model ahead of time, and the joins required for deeply/complexly
structured documents.

It looks like ToroDB 'senses' the model of each document on the way in. I
think there are definitely use cases for this approach that tolerate the
trade-off between ingestion speed and storage control. Plus, using a
relational engine underneath allows for ACID properties (e.g., multi-object
rollbacks/commits), which native Mongo does not provide.

~~~
ahachete
platform, the XMLDB info is very interesting. As you point out, ToroDB does
not require any model declaration ahead of time. It's perfectly "schema-less",
as it 'senses' the model of each document.

Regarding ingestion speed, it is very high, even compared to MongoDB's. There
are some benchmarks in this presentation:
[http://www.slideshare.net/8kdata/toro-db-
pgconfeu2014](http://www.slideshare.net/8kdata/toro-db-pgconfeu2014)

~~~
cmkrnl
I don't really see why ToroDB can't be implemented on top of Oracle :) There
could be some demand for it there.

I needed this for a side project at work a while back, but didn't get the time
to implement it.

Oracle's cumbersome enough already; I for one have no desire to drag XSDs into
it to use XMLDB.

~~~
ahachete
I'm not saying it's not gonna happen...

------
pvh
Cool concept. I'm not a huge fan of normalizing out the underlying data into
pseudorelational tables.

Reconstructing trees in Postgres is a little gnarly on larger tables but it's
definitely neat to see more people trying to combine the user experience of
Mongo with the technology advantages of Postgres.

What inspired the creation of ToroDB? Are you using it in production
yourselves in a limited way?

~~~
ahachete
Thanks!

Documents are split into chunks before hitting the database, so there is no
need to reconstruct trees in PostgreSQL. Indeed, many queries don't need to
reconstruct the (whole) tree, just a part of it or even just one level. In
doing so, ToroDB is able to offer queries that only need to scan a small
subset of the database (compared to the whole database) to query your data.

ToroDB was inspired by the DRY principle: relational databases like PostgreSQL
are already good enough that, with some tweaking, they may work perfectly well
as a document store.
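The "scan only a small subset" point can be illustrated with a toy example
(using sqlite3 for brevity; the table and column names are invented, not
ToroDB's actual schema):

```python
# Sketch of the "query only one level" idea: with one table per
# subdocument level, a query on that level never touches the others.
# Table/column names are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE root (did INTEGER, name TEXT)")
con.execute("CREATE TABLE subdoc_stats (did INTEGER, count INTEGER)")
con.executemany("INSERT INTO root VALUES (?, ?)", [(1, "a"), (2, "b")])
con.executemany("INSERT INTO subdoc_stats VALUES (?, ?)", [(1, 3), (2, 7)])

# A filter on the nested "stats" level scans only its own table;
# the root table is joined in only if the query needs it.
rows = con.execute("SELECT did FROM subdoc_stats WHERE count > 5").fetchall()
```

The planner only ever sees the one (narrow) table the query touches, which
is where the "small subset of the database" claim comes from.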

------
jchamberlain
Very interesting approach. One question I have is around the licensing. I see
that it's licensed using AGPLv3. I would think that would require any code
that connects to it to be open source. Am I reading that right? If so, then I
am guessing that there will be a commercial offering w/o the restrictions?

Thanks

~~~
ahachete
Yes and yes :)

~~~
kofejnik
Wait. Does AGPL really require my Python code to be released as open source if
I want to connect to your server? As I understand AGPL, if you modify the
code, you must make it available to people connecting over the network; that's
all.

~~~
ahachete
Errr yes, you're right, my "yes and yes" was too quick. Sorry for that!

------
kxo
I'd really like to write a similar, mongo-compatible layer for FoundationDB.

This is great work.

------
jakozaur
Super interesting, though I would be interested in how it works. Is it using
JSON/hstore underneath?

Right now reading the code seems to be the only option to analyze it.

~~~
ahachete
No, it's not using json/jsonb/hstore (to store the actual JSON data). It uses
a bit of jsonb, but only for a side part of the information.

And this is the true power of it: the data is stored in normal, relational
tables. Please see a comment above explaining this in more detail :)

~~~
drawkbox
Does all data have to be stored relationally or can you pick and choose?

The reason I ask: for example, with something like a game session blob from a
multiplayer game, I might not want that merged with the relational data, and
I'd just throw it away after a certain amount of time/days/weeks, so blobs are
nice there, but other data would be better off relational.

Yes there are app/cache level ways around this but might be cool if there was
a way to choose auto relational or blobby. I guess ultimately you could just
have multiple stores for live and archival data but something to think about.

~~~
ahachete
Data is laid out transparently to the user. Indeed, there should be no
(visible) difference in how it is stored. Data within a single "level" is
stored in a single table. If you want to keep blobs, you can definitely do
that; they will be stored as such (bytea in Postgres).

------
jmasonherr
So can I use it with meteor?

~~~
justinsb
I would be surprised if ToroDB implemented the oplog which Meteor uses for
efficient live-updating, but I guess it _should_ work with the polling mode.

SQL support (i.e. direct Postgres support) is on the Meteor roadmap for
post-1.0.

~~~
ahachete
ToroDB is definitely going to have MongoDB's oplog, so it should work as is
for live-updating.

SQL support for ToroDB... there's surely room for it in the future ;)

------
arthursilva
To the author: you should probably use VODKA index.

~~~
jeltz
VODKA indexes are still experimental and exist only as a branch in Alexander's
own repo
([https://github.com/akorotkov/postgres/tree/vodka](https://github.com/akorotkov/postgres/tree/vodka)).
I do not think they are anywhere near production ready, and even if they were,
they are not an extension.

Alexander is working on allowing new index access methods (the type of index:
e.g. hash, btree, gin, gist) to be added in extensions.

~~~
arthursilva
Yep. And that's why I believe it would be a great way to test/improve VODKA.

PS: now I'm not sure if it applies, considering that the data is actually
normalized inside Postgres.

