Hacker News new | past | comments | ask | show | jobs | submit login
Run SQL on MongoDB (citusdata.com)
103 points by ozgune on Oct 17, 2012 | hide | past | favorite | 34 comments

This abstraction will definitely leak (or performance will be horrible, pick your poison):

* A big benefit for SQL are transactions, I have a hard time seeing anything done efficiently with MongoDB.

* Additionally, joins cannot be consistent with the MongoDB model (shards, reading from secondaries, ...).

While this idea is charming, I don't see this reaching any serious production environment.

But it's cool, Citus implemented a somewhat working prototype which is useful for some use cases. Kudos ! :)

Hi there, we are mostly targeting analytic use-cases with this. SQL already has a lot of analytics functionality built into the language, so we thought it might be interesting to enable it.

* We find that in the context of analytics, transactions aren't as important (they are still relevant, but people are OK if results are a bit off).

* For the MongoDB sharding model, CitusDB already keeps similar metadata and moves around shards accordingly. By syncing MongoDB's config server metadata to the CitusDB master, we can ensure that shards are moved around properly. We can also run queries across shards using this metadata.

And thanks a bunch for the encouragement. As you noted, we think of this as addressing some interesting use-cases, and are looking to get the community's feedback.

Your idea is great, but you will end up with average Joe using it for other cases thus MongoDB will be nontransactional database engine with SQL interface and sharding feature, which will be much slower and will have less features than any other RDBMS with proper replication funcionality.

Why? If you want to run SQL queries (with joins, etc.), maybe you should run a traditional RDBMS.

Is there some sort of advantage to this that I'm missing? It just seems like reinventing the wheel.

I have to wonder if this is a full-circle hack to deal with a poor architecture decision in committing to MongoDB.

Maybe they want to recycle complex analytic queries and visualization tools for rdbms-databases, while still enjoying the advantages of mongodb. They mention this in the first paragraph.

"This way, you can run complex analytic queries, join different collections, and benefit from SQL data visualization tools."

"In the following, we show a rather simple query, but complex queries that involve sub-selects, SQL window functions, and collection joins are also possible. In fact, you can even join a MongoDB collection with a PostgreSQL table."

"And the really exciting part for us is that CitusDB can easily enable analytic queries for sharded MongoDB installations, as it is built on PostgreSQL and can already distribute SQL queries."

This company builds CitusDB, "a distributed database that lets you run SQL queries over very large data sets. Designed for analytical queries, CitusDB enables real-time responsiveness." This apparently allows their software to work with MongoDB.

Why not? It may be useful to people who already have their data in MongoDB and don't want to migrate their data to an RDBMS.

Migrating data can be automatic: https://github.com/crowdtap/promiscuous/

We use this tool to replicate data across our main MongoDB database, and a PostgreSQL one (eventually consistent).

We haven't wrote a comprehensive readme yet. The only documentation we have so far are slides: http://viennot.biz/promiscuous

Stay tuned :)

Those slides gave me a headache, so I'll just point out for the benefit of everyone that there is a README, and that it's some form of replication via AMQP queues.

Thanks for the tip on AMQP.

You can potentially use existing tooling that expects a RDBMS against a mongo database.

Hypothesis: RDBMS are better document-oriented databases than document-oriented databases.

Define 'better'

Faster, less resource-intensive, more durable, fewer application errors, lower TCO, etc. etc. ...

Well, that's related to the particular implementations, not theory. If PostgreSQL works faster and is less resource-intensive than MongoDB, it says nothing about their theoretical underpinnings. The whole point of the NoSQL movement is that one approach (relational databases with SQL) to work with data and its structure cannot be the best answer to all the possibilities in that space. Not because of particularities in the implementation (eg. speed), but because it gives you an easier time working with it; there's no mismatch.

I'd say that if you are working with diverse documents with a few or no relationships, a document oriented database is necessarily a better fit to the task at hand than a relational one. It's a tautology.

If you are needing many JOINS and your data is homogeneous, there's a mismatch. You probably should have chosen a relational database in the first place, as it better fits that model.

Agreed. It's hilarious to watch people who keep trying to setup this RDBMS versus NoSQL situation. As though every data model can be perfectly handled by a RDBMS if only I learnt enough SQL.

Faster ? NO. In my use case I have one domain object with 30 embedded relationships. MongoDB: one request. SQL: 30 joins.

Durable ? NO. It is trivial to distribute data between multiple nodes using MongoDB and have writes be on some or all of them. I have yet to see a clustering SQL system that makes the process as simple as any NoSQL database.

Fewer application errors ? Well this just makes no sense. Why would a database be causing application errors ? Either the fetch/store request fails or it doesn't.

Lower TCO ? NO. MongoDB has an order of magnitude less TCO than any database I've used. Period. It is trivial to cluster, shard, deploy, manage and being schema-less there are no migrations to worry about. Ease of use is one of the big reasons why MongoDB is popular right now.

I don't think you understood him. There would not be many joins. You can use an RDBMS as a document database by storing documents in a single column, with key values in another column. For example, storing JSON in a text field (Postgres also has a native json type now as well I belive). You'd also want an Indexes table. Another approach that will also work well is to use a document model, you might have three tables, one of which has columns like (key, name, value) - all your data is in that table. Reddit uses something like this approach based on previous Hack News discussions.

But you are talking more about key/value model. Which I see as a different use case to document model.

And PostgreSQL HSTORE seems like a bit of a hack to me. It has a ridiculously complex syntax and still not sure how it handles querying other data types e.g. dates.

HSTORE stores values as strings, so in your query you'd have to cast to a date, or whatever type you're wanting to get out of it.

The thing is, an RDBMS really is a superset of most of the other data models floating out there.. Entity-attribute-value, Document, plain key-value, whatever. Your every day ordinary RDBMS can do it, with transactions, durability, ease of administration, whatever. When you use a NoSQL data storage system you're explicitly trading off one or more of the capabilities of an RDBMS to get some other benefit. The one thing that Postgres isn't good at, as far as I know, is a graph-like data structure.

In addition to calling Mongo from within Postgres theres a variety of other Foreign Data Wrappers that exist, including the ability to call MySQL, Oracle, Redis, and CouchDB. You can find the list of FDWs at http://pgxn.org/tag/foreign%20data%20wrapper/

It would be great to have the Mongo one listed there as well.

This is a very interesting notion, I wonder what are the performance constraints for such a SQL to MongoDB query, has anyone attempted?

We measured performance across several SQL queries, and found that it depends. Assuming that the working set fits into memory:

* If the query goes over a single table and doesn't have filter clauses, then the cost of converting BSON data to PostgreSQL tuples becomes the bottleneck.

* If the query goes over a single table, doesn't have filter clauses and touches only a few columns, then the cost of reading the data from MongoDB (over TCP) becomes the bottleneck.

* If the query joins several tables or has complex sub-selects, then the lack of accurate data statistics forces choosing bad execution plans and becomes the bottleneck.

For most queries though, we found that the query performance was reasonable. PostgreSQL processed around 200K-400K documents per second per CPU core.

That comes at the right time. While I was reading through the docs of Postgres, a colleague of mine spent time doing the same for mongodb. Our goal is to have on each computer at the studio local database with most of our files from the game - 3d models, assets, etc - and we are looking what is going to suit us. He was able to model and put in mongo all our level maps, we have the asset information already in sqlite and all our shaders are cached there.

So this comes as a good surprise. I've been excited reading about PostGres, and now this seems like a very good fit for us.

The question that remains is: if the documents are well defined enough that you can map them to a schema, why use a document store at all?

Have we reached the pinnacle of "using technology as a sex doll"?

Could be useful for going in and quickly repairing data.

    // find bad data
    SELECT * FROM table WHERE ...
    // repair
    UPDATE table SET ... WHERE ...

Couldn't you do that just as easily with a database.update({'data':'bad'}, {$set:{'data':'good'},false,true}) ?

Is there any possibility to translate SQL to Mongo's JS Query API (or may be to map/reduce)?

Behind the covers, we convert filtering expressions in the WHERE clause to Mongo's query API, and send that query over.

We found that translating the rest of SQL amounted to writing a database query executor, and instead opted for leaving that to Postgres.

For these translations, what use cases did you have in mind?

I hope ORDER BY, LIMIT and SELECT also can be easily converted to mongo's query API.

JOIN is the hardest part, I think.

except the example document is flat and the collection has to be homogenous for this to work. Isn't that... kinda missing the point of mongo?

Hi there - the collection doesn't need to be homogenous; if the query references a missing field in a document, the wrapper interprets that value as null.

For nested documents, we have some ideas on how to make that accessible, and talk about it in our README: https://github.com/citusdata/mongo_fwd/blob/master/README

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact