
Run SQL on MongoDB - ozgune
http://www.citusdata.com/blog/51-run-sql-on-mongodb
======
nviennot
This abstraction will definitely leak (or performance will be horrible, pick
your poison):

* A big benefit of SQL is transactions; I have a hard time seeing those done efficiently with MongoDB.

* Additionally, joins cannot be made consistent under the MongoDB model (shards, reading from secondaries, ...).

While this idea is charming, I don't see this reaching any serious production
environment.

But it's cool that Citus implemented a somewhat-working prototype which is
useful for some use cases. Kudos! :)

~~~
ozgune
Hi there, we are mostly targeting analytic use-cases with this. SQL already
has a lot of analytics functionality built into the language, so we thought it
might be interesting to enable it.

* We find that in the context of analytics, transactions aren't as important (they are still relevant, but people are OK if results are a bit off).

* For the MongoDB sharding model, CitusDB already keeps similar metadata and moves around shards accordingly. By syncing MongoDB's config server metadata to the CitusDB master, we can ensure that shards are moved around properly. We can also run queries across shards using this metadata.

And thanks a bunch for the encouragement. As you noted, we think of this as
addressing some interesting use-cases, and are looking to get the community's
feedback.

~~~
dotborg
Your idea is great, but you will end up with the average Joe using it for
other cases, and MongoDB will become a nontransactional database engine with a
SQL interface and sharding, which will be much slower and have fewer features
than any RDBMS with proper replication functionality.

------
archon
Why? If you want to run SQL queries (with joins, etc.), maybe you should run a
traditional RDBMS.

Is there some sort of advantage to this that I'm missing? It just seems like
reinventing the wheel.

~~~
metdos
Why not? It may be useful to people who already have their data in MongoDB and
don't want to migrate to an RDBMS.

~~~
nviennot
Migrating data can be automatic: <https://github.com/crowdtap/promiscuous/>

We use this tool to replicate data between our main MongoDB database and a
PostgreSQL one (eventually consistent).

We haven't written a comprehensive README yet. The only documentation we have
so far are slides: <http://viennot.biz/promiscuous>

Stay tuned :)

~~~
Tobu
Those slides gave me a headache, so I'll just point out for the benefit of
everyone that there is a README, and that it's some form of replication via
AMQP queues.

~~~
rhizome
Thanks for the tip on AMQP.

------
jcromartie
Hypothesis: RDBMS are better document-oriented databases than document-
oriented databases.

~~~
w0utert
Define 'better'

~~~
jcromartie
Faster, less resource-intensive, more durable, fewer application errors, lower
TCO, etc. etc. ...

~~~
taligent
Faster ? NO. In my use case I have one domain object with 30 embedded
relationships. MongoDB: one request. SQL: 30 joins.

Durable ? NO. It is trivial to distribute data between multiple nodes using
MongoDB and have writes be on some or all of them. I have yet to see a
clustering SQL system that makes the process as simple as any NoSQL database.

Fewer application errors ? Well this just makes no sense. Why would a database
be causing application errors ? Either the fetch/store request fails or it
doesn't.

Lower TCO ? NO. MongoDB has an order of magnitude lower TCO than any database
I've used. Period. It is trivial to cluster, shard, deploy, and manage, and
being schema-less there are no migrations to worry about. Ease of use is one
of the big reasons why MongoDB is popular right now.

~~~
jeremyjh
I don't think you understood him. There would not be many joins. You can use
an RDBMS as a document database by storing each document in a single column,
with its key in another column; for example, storing JSON in a text field
(Postgres also has a native json type now, I believe). You'd also want an
indexes table. Another approach that works well is an entity-attribute-value
model: a single table with columns like (key, name, value) holds all your
data. Reddit uses something like this approach, based on previous Hacker News
discussions.
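To make the single-column approach concrete, here's a minimal sketch in Postgres; the table and column names are made up, and the JSON operator assumes a newer Postgres version:

    -- One row per document; the body is stored as JSON.
    CREATE TABLE documents (
        key  text PRIMARY KEY,
        body json NOT NULL
    );

    INSERT INTO documents (key, body)
    VALUES ('user:42', '{"name": "Joe", "city": "NYC"}');

    -- Postgres's JSON operators can reach into the document:
    SELECT body ->> 'city' FROM documents WHERE key = 'user:42';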

~~~
taligent
But you are talking more about a key/value model, which I see as a different
use case from the document model.

And PostgreSQL's HSTORE seems like a bit of a hack to me. It has a
ridiculously complex syntax, and I'm still not sure how it handles querying
other data types, e.g. dates.

~~~
zrail
HSTORE stores values as strings, so in your query you'd have to cast to a
date, or whatever type you're wanting to get out of it.
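For example, the cast looks something like this (the table and key names are hypothetical):

    -- events.attrs is an hstore column; every value comes back as text,
    -- so comparing against a date means casting it first.
    SELECT *
    FROM events
    WHERE (attrs -> 'occurred_on')::date > DATE '2013-01-01';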

The thing is, an RDBMS really is a superset of most of the other data models
floating around out there: entity-attribute-value, document, plain key-value,
whatever. Your everyday ordinary RDBMS can do it, with transactions,
durability, and ease of administration. When you use a NoSQL data storage
system you're explicitly trading off one or more of the capabilities of an
RDBMS to get some other benefit. The one thing that Postgres isn't good at,
as far as I know, is graph-like data structures.

------
craigkerstiens
In addition to calling Mongo from within Postgres, there's a variety of other
Foreign Data Wrappers, including ones for MySQL, Oracle, Redis, and CouchDB.
You can find the list of FDWs at
<http://pgxn.org/tag/foreign%20data%20wrapper/>

It would be great to have the Mongo one listed there as well.

------
eranation
This is a very interesting notion. I wonder what the performance constraints
are for such SQL-to-MongoDB queries; has anyone measured them?

~~~
ozgune
We measured performance across several SQL queries, and found that it depends.
Assuming that the working set fits into memory:

* If the query goes over a single table and doesn't have filter clauses, then the cost of converting BSON data to PostgreSQL tuples becomes the bottleneck.

* If the query goes over a single table, doesn't have filter clauses and touches only a few columns, then the cost of reading the data from MongoDB (over TCP) becomes the bottleneck.

* If the query joins several tables or has complex sub-selects, then the lack of accurate data statistics leads to bad execution plans, which become the bottleneck.

For most queries though, we found that the query performance was reasonable.
PostgreSQL processed around 200K-400K documents per second per CPU core.

------
malkia
This comes at the right time. While I was reading through the Postgres docs,
a colleague of mine spent time doing the same for MongoDB. Our goal is to
have, on each computer at the studio, a local database with most of our files
from the game (3d models, assets, etc.), and we are looking at what is going
to suit us. He was able to model all our level maps and put them in Mongo; we
already have the asset information in SQLite, and all our shaders are cached
there.

So this comes as a good surprise. I've been excited reading about Postgres,
and now this seems like a very good fit for us.

------
hcarvalhoalves
The question that remains is: if the documents are well defined enough that
you can map them to a schema, why use a document store at all?

------
krsunny
Have we reached the pinnacle of "using technology as a sex doll"?

------
Maro
Could be useful for going in and quickly repairing data.

    -- find bad data
    SELECT * FROM table WHERE ...
    -- repair
    UPDATE table SET ... WHERE ...

~~~
e1ven
Couldn't you do that just as easily with database.update({'data': 'bad'},
{$set: {'data': 'good'}}, false, true) ?

------
arunoda
Is there any possibility to translate SQL to Mongo's JS Query API (or may be
to map/reduce)?

~~~
ozgune
Behind the covers, we convert filtering expressions in the WHERE clause to
Mongo's query API, and send that query over.

We found that translating the rest of SQL amounted to writing a database query
executor, and instead opted for leaving that to Postgres.
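As an illustration (not the wrapper's actual output), a filter pushdown might look like this; the table and field names are invented:

    -- A filter like this on the foreign table...
    SELECT customer_id, total
    FROM orders
    WHERE status = 'shipped' AND total > 100;

    -- ...could be pushed down as a MongoDB query roughly like:
    --   db.orders.find({ status: 'shipped', total: { $gt: 100 } },
    --                  { customer_id: 1, total: 1 })
    -- while the rest of the SQL (joins, aggregates, sorting) runs in
    -- Postgres.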

For these translations, what use cases did you have in mind?

~~~
arunoda
I hope ORDER BY, LIMIT, and SELECT projections can also be easily converted
to Mongo's query API.

JOIN is the hardest part, I think.
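Those first ones do have direct counterparts in Mongo's query API; a rough sketch (table and field names are made up):

    SELECT name, score FROM players ORDER BY score DESC LIMIT 10;
    -- roughly:  db.players.find({}, { name: 1, score: 1 })
    --                      .sort({ score: -1 }).limit(10)
    -- JOIN has no counterpart, so it would have to run on the SQL side.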

------
zemo
Except the example document is flat, and the collection has to be homogeneous
for this to work. Isn't that... kinda missing the point of Mongo?

~~~
ozgune
Hi there - the collection doesn't need to be homogeneous; if the query
references a field that's missing from a document, the wrapper interprets
that value as null.
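So, per those semantics, a heterogeneous collection is still queryable (collection and field names here are hypothetical):

    -- Two documents in the same collection:
    --   { "name": "a", "age": 30 }
    --   { "name": "b" }              <- no "age" field
    -- The wrapper reports the missing field as NULL, so this works:
    SELECT name FROM people WHERE age IS NULL;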

For nested documents, we have some ideas on how to make that accessible, and
talk about it in our README:
<https://github.com/citusdata/mongo_fwd/blob/master/README>

