Hacker News new | past | comments | ask | show | jobs | submit login
SQL and NoSQL are two sides of the same coin, mathematically (acm.org)
83 points by 6ren on Dec 5, 2011 | hide | past | favorite | 18 comments

> Referential integrity implies a closed-world assumption where transactions on the database are serialized by (conceptually) suspending the world synchronously, applying the required changes, and resuming the world again when referential integrity has been restored successfully, rolling back any changes otherwise. Assuming a closed world is, we claim, both a strength and a weakness of the relational model. [...] The closed-world assumption, however, is in direct contradiction with distribution and scale-out. The bigger the world, the harder it is to keep closed.

This is the first time I have heard the expression "Closed World Assumption" used to refer to transactional atomicity.

For example, the Terry Halpin textbook "Information Modeling and Relational Databases" describes the CWA thus:

> When modeling a business domain, one may take different positions with respect to the completeness of the information captured in the model. The Closed World Assumption (CWA) is the assumption that all relevant facts are known (i.e, appear in the model, either as asserted facts or derived facts) and that all the relevant business rules are known.)

I find this discrepancy in academic terminology difficult to reconcile.

I am not sure what you are calling a discrepancy. For a Database to guaranty the assumption in Halpin's definition, a database must guaranty all business rules and relevant facts are known, and followed.

Say we are going to insert a row in a database that has a foreign key, for Halpin's assumption to hold the database must verify that the foreign key is present and valid in the data set. If it doesn't we have violated the assumption. Same for the transaction quote you mention.

It is an assumption you make can make because it is a guaranty your database provides.

The discrepancy is that Halpin's use of the phrase relates to the correspondence between the real world and the database - to the modelling relationship itself, whereas Meijer and Bierman's use solely relates to the valid states of the database - separate from any meaning that might be taken.

You highlighted the element of referential integrity within their description, whereas I highlighted atomicity. But both are just operational characteristics of the database.

Halpin's CWA comes into play even within a database that is not being updated and has only one relation. Eg. Consider a relation P that models the fact type that person with NAME was born on DOB. P = { {NAME: Bob, DOB: 2/2/22} }

CWA answers the query "Was Mary born on 2/2/22?" with False.

I would argue that it is also difficult to reconcile the idea that these worlds are flip sides of a coin with the idea that there are fundamentally different assumptions made about the data.

Indeed I would further argue that the the awkward handling of NULLs in SQL is in part because it assumes too open of a world.

SQL is not relational algebra. I think the more mathematic ideas in the article should be viewed in the context of relational alebra and not in the context of SQL.

So, the problems with NULLs is an SQL problem not a relational algebra problem. The 'flip side' is really the flip side of relational algebra and not SQL even though the article is not very clear in keeping the two separate.

SQL is not relational algebra but rather an attempt to bridge a relational algebra/programming gap. The null issue arises in SQL due to it being used in a relational algebra context in one case and a software development context in another. The point though is that these are treated as equivalents because of an open world assumption.

"A more skeptical view of O/R mappers is that they are undoing the damage caused by normalizing the original object model into multiple tables. "

Somehow inconsistent with the idea that SQL and NoSQL are two sides of the same coin, mathematically.....

How is it inconsistent?

That SQL/NoSQL are "two sides of the same coin" does not mean, as you seem to imply, that they are the same SIDE. Two sides means that they are opposite ends. Thus, each has different (actually complimentary) pros/cons.

Which is exactly what the phrase you quote implies: that normalization incurs some damage that we have to undo (via O/R mapping in the example given), if we want to get the same pros of a NoSQL solution from a SQL datastore.

(Also note that O/R mappers != NoSQL)

Phenomenal quote:

Just as Codd's discovery of relational algebra as a formal basis for SQL shifted the database industry from a monopolistically competitive market to an oligopoly and thus propelled a billion-dollar industry around SQL and foreign-/primary-key stores, we believe that our categorical data-model formalization model and monadic query language will allow the same economic growth to occur for coSQL key-value stores.

In other words: "this article will move mountains".

Let's wait and see!

Disclaimer: I did not dive into their mathematical adventures, but am rather fascinated by their underlying economical foundation: If a market offers choice (the "monopolistically competitive market") it will not deliver profit in the long run. If it's a oligopoly on the other hand, said oligopoly can retain profits for a long time to come.

What is this? Tautological? Suicide of the market? At least it's spelled out that the user is the loser in that system.

One of the authors is Erik Meijer, who was key to getting LINQ into Microsoft's C# and VB products. It'd be quite interesting if Microsoft seriously worked on alternative data stores for SQL Server.

But what about many to many relations? How do those map to nosql?

Instead of:

    [person_id,person_name]  [team_id,team_name] [person_id, team_id]
You get:

    [person_id, person_name, [team_id, team_id, ...]]
    [team_id, team_name, [person_id, person_id, ...]]

I think there are three physical ways to do this. With nosql they are all 'in model', including the relational one, with elements with both other keys as a key.

But the other two with a list of keys as part of the values is available too.

Brief discussion earlier this year with input from antirez defending Redis:


I'd like to see a definitive response by someone without full knowledge of the entire spectrum of NoSQL whether the mathematical insights described in this article cover the entire space of NoSQL databases. That is: document store, graph, and key-value store (in all their many guises). http://en.wikipedia.org/wiki/NoSQL Thanks in advance!

I think what the author is suggesting is that you can bridge some of the gap by a sort of non-SQL, quasi-relational data store. The major ways this is differentiated from the pure relational model is the general hostility towards representing information in well normalized relations and favoring a fairly object-oriented store instead.

This is not a new idea. PostgreSQL was originally designed with this sort of workflow in mind and now is pretty relational in outlook. In theory it is perfectly possible to transform one model into the other, provided adequate mappings, but at the same time one cannot reasonably derive one model from the other, which is why I disagree with the author's view that these are flip sides of the same coin.

Now, on to what is gained and what is lost. There are cases where the sorts of representation she is talking about are helpful (an in fact with recent PostgreSQL instances you can do them already).

On the other hand, the article isn't really clear about what the costs of this sort of representation are. The closest you get is the last line of his comparison table which contrasts query optimizer with developer/pattern. So what this implies is that such a system is optimized for developer/transaction workload and not for flexible reporting later. In other words, the developer is expected to know in advance all the uses the data will have so that the database can be properly optimized for this. Add new uses and this element becomes hard.

Similarly the ACID/BASE comparison also is telling in the sense that he is contrasting a system which is designed to give you accurate, consistent information when you ask for it to a system which is designed to tend towards a state of consistency, at least for older data. The "eventually consistent" part of his BASE definition ought to scare the pants off anyone interested in looking at how data in such a system could be put to decision support tasks.

I'd rather suggest to rename 'NoSql' to 'NSQL' meaning 'New Structured Query Language'.

CoSql doesn't stand up the test of aggregation, as KV nodes are distance apart and sometimes partitioned too. Map/Reduce is kind of a substitute for aggregation and this fact can be somehow reflected in NSQL.

Also, NSQL should reflect the fact that more advanced filtering queries cannot be run against KV stores, like when referencing secondary indexes in other nodes.

All-in-all, there is little remaining of original SQL in KV-stores and NoSql currently makes sense. But as the products continue evolving there is a definitive need to have a ground level logical model for data retrieval and then some sort of NSQL will be born.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact