I'm not saying don't have a rigid schema, just don't enforce it at the storage layer. If you're doing an ad-hoc report then you wouldn't have indexes in place for it in the SQL case, so I don't see how it's any worse or harder in mongodb.
>If you're doing an ad-hoc report then you wouldn't have indexes in place for it in the SQL case, so I don't see how it's any worse or harder in mongodb.
Only true in a case where you have to index everything you might want to search on, like with InnoDB. In PostgreSQL all you really need are your foreign key indexes and a couple of criteria indexes (if that) and you are good. That's more of an InnoDB limitation than a relational limitation. Basically an InnoDB table is a primary key index, and it can only be traversed in key order, not physical order, so sequential scans are painful...
>"If that level of indexing is sufficient I don't see why you can't just do the same thing in mongodb."
What level of indexing is sufficient depends a great deal on the specifics of the database layout on disk. In InnoDB, for example, sequential scans are very costly and primary key lookups are very cheap. This is because the table is more or less contained in the primary key index, which must be traversed in key order since physical-order traversal is not supported. So a sequential scan of a table turns into lots of random disk I/O, and OS prefetching is useless.
To address this you end up indexing everything you want to search on later. Note that non-PK indexes are a little slower in InnoDB because you have to traverse the secondary index to find the primary key value, then traverse the primary key index to retrieve the row data.
In PostgreSQL things work differently. The table is a series of pages on disk and rows are allocated from these as a heap. You can scan a table, but not an index, in physical order, so PostgreSQL sequential scans on tables are typically a lot faster than MySQL's because they are sequential rather than random page reads. Indexes point at tuple IDs, which store the page number and row offset within the page, so an index scan is a tree traversal followed by fetching the pages indicated by those tuple IDs.
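To make the lookup paths concrete, here's a toy model of the two layouts in Python. Everything is invented for illustration (dicts standing in for B-trees, lists standing in for pages); real storage engines obviously do far more than this.

```python
# "InnoDB-style": the table *is* the primary-key index, and a secondary
# index maps its key to the primary key, so a secondary-index lookup means
# two tree traversals.
innodb_table = {1: {"id": 1, "email": "a@x"}, 2: {"id": 2, "email": "b@x"}}
innodb_email_idx = {"a@x": 1, "b@x": 2}

def innodb_lookup_by_email(email):
    pk = innodb_email_idx[email]   # first traversal: secondary index -> PK
    return innodb_table[pk]        # second traversal: PK index -> row

# "PostgreSQL-style": the table is a heap of pages, and the index maps its
# key straight to a tuple ID (page number, offset within the page).
heap_pages = [
    [{"id": 1, "email": "a@x"}, {"id": 2, "email": "b@x"}],  # page 0
]
pg_email_idx = {"a@x": (0, 0), "b@x": (0, 1)}

def pg_lookup_by_email(email):
    page_no, offset = pg_email_idx[email]  # one traversal: index -> TID
    return heap_pages[page_no][offset]     # then fetch that page/slot

def pg_seq_scan():
    # Physical-order scan: just walk the pages as they sit on disk.
    for page in heap_pages:
        for row in page:
            yield row
```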
This leads to a bunch of interesting things. Adding indexes is usually a performance win with InnoDB. PostgreSQL, however, will look at the indexes it has and weigh index scans against sequential scans of the table, and unlike with InnoDB, sequential scans sometimes win out planner-wise, especially on small tables.
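You can watch the planner make that call with EXPLAIN. A quick sketch, assuming a local PostgreSQL with a hypothetical orders table that has an index on customer_id (connection details, table and column names are all made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=reports user=report")
cur = conn.cursor()

queries = {
    # Selective predicate: the planner will usually prefer the index.
    "selective": "SELECT * FROM orders WHERE customer_id = 42",
    # Predicate matching most rows (or a small table): a sequential scan
    # of the heap is often cheaper than an index traversal.
    "broad": "SELECT * FROM orders WHERE total >= 0",
}

for label, sql in queries.items():
    cur.execute("EXPLAIN " + sql)
    print(label)
    for (line,) in cur.fetchall():
        print("  " + line)

cur.close()
conn.close()
```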
So what indexes you need depends quite heavily on how things are organized on disk.
That's all pretty interesting, but I still don't see what you get with postgres that you don't get with mongodb. Your database won't enforce your schema for you, but I don't see how that means "ad hoc reporting is impossible".
I have read through the MongoDB query docs and it does look like you can do some ad hoc retrieval queries, and some aggregation. But in the SQL world that's not really the same thing as ad hoc reporting.
I suppose "can't do" is too strong, assuming your reporting needs match what the query API supports. However these things look a lot simpler to do in SQL than with Mongo's approach, and I don't see how you can reliably transform data on output if you don't have a guaranteed schema on input. I also don't really understand how you would do that kind of transformation with Mongo's API. I suppose you could, but it looks painful to my (admittedly uninitiated) eyes.
How many lines of code are required to express a 50 line SQL query doing 5 joins, complex filters and group-by transformations, etc?
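For a (much smaller) hypothetical point of comparison, with the caveat that my Mongo side may well be naive: a little revenue-by-region report expressed both ways. The schema is invented, and in the Mongo version I assume the customer's region has been denormalized onto each order document, which sidesteps the join entirely.

```python
from datetime import datetime
from pymongo import MongoClient

# SQL version, shown for comparison only (not executed here).
sql = """
SELECT c.region, SUM(o.total) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'shipped'
  AND o.ordered_at >= DATE '2012-01-01'
GROUP BY c.region
ORDER BY revenue DESC
"""

# Rough MongoDB equivalent, assuming denormalized order documents with
# region, status, total and ordered_at (stored as a BSON date).
db = MongoClient().reports
pipeline = [
    {"$match": {"status": "shipped",
                "ordered_at": {"$gte": datetime(2012, 1, 1)}}},
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
    {"$sort": {"revenue": -1}},
]
report = list(db.orders.aggregate(pipeline))
```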
Obviously if your report makes assumptions about your data which aren't true then you might get invalid data out. I absolutely agree with having a single point through which all writes to the data store must pass, and which enforces business-level constraints on the data. I just don't find SQL a convenient form for expressing those constraints. My experience has been that any given business domain will have some constraints that are too complex to express in the SQL model, forcing you to resort to e.g. triggers - at which point the constraint isn't integrated with the rest of your data model anyway, it's just a piece of code that runs to check your inserts, which you could write equally well in any language. I'd rather do it in a "normal" programming language.
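A sketch of what I mean by a single write path in a normal language; the Order type, the rules, and the pymongo-style insert are all invented for illustration.

```python
from dataclasses import asdict, dataclass

@dataclass
class Order:
    customer_id: int
    total: float
    discount: float = 0.0
    discount_reason: str = ""

def validate(order: Order) -> None:
    # Business-level constraints, enforced in one place in application code.
    if order.total < 0:
        raise ValueError("total must be non-negative")
    if order.discount and not order.discount_reason:
        raise ValueError("a discount requires a reason")
    if order.discount > 0.5 * order.total:
        raise ValueError("discount may not exceed half the order total")

def save_order(db, order: Order) -> None:
    # The one path through which order writes are allowed to go.
    validate(order)
    db.orders.insert_one(asdict(order))  # pymongo-style; swap for your store
```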
I see what you're getting at with reporting now: you're talking about doing actual calculations on the data? For mongodb I'd probably use its map-reduce API, at which point you're writing javascript and can do anything, and performance should be fine. Though honestly, other than for performance, I've never had a problem with just fetching the rows I need and doing whatever transformation/aggregation logic in e.g. python. SQL has never struck me as particularly typesafe, or as gaining much advantage from being close to the data; its set-oriented way of thinking can be helpful in some places, but it's not the only language that can do that.
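For instance, the "fetch the rows and do it in python" version of a small revenue-per-region report, against a hypothetical orders collection (field names invented, and it assumes the region is stored on each order):

```python
from collections import defaultdict
from pymongo import MongoClient

db = MongoClient().reports

# Fetch just the fields we need...
orders = db.orders.find({"status": "shipped"}, {"region": 1, "total": 1})

# ...and do the grouping and summing in plain Python instead of GROUP BY.
revenue = defaultdict(float)
for doc in orders:
    revenue[doc["region"]] += float(doc["total"])

for region, amount in sorted(revenue.items(), key=lambda kv: -kv[1]):
    print(f"{region}: {amount:.2f}")
```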
If you like SQL as a language for writing logic in I can see why a traditional database would appeal. But even then I feel like input validation, data storage and query processing should be separate layers (and I see some movement towards that with e.g. MySQL offering an interface to do key/value lookups directly). If SQL is a good way of doing transform/aggregation logic then it should stand alone, and I'd expect to see SQL-like query layers on top of key-value stores.