Similarly, using an ORM doesn't help you. All that does is hide the details from you. It still uses SQL underneath, but limits what you can do with it.
Let's talk schemas. You have one whether your database understands it or not. Trying to pretend you don't have one doesn't make managing your schema any easier. It doesn't make building indicies on fields any easier.
These are hard problem and pretending they don't exist by rolling your own implementation or hiding them under an ORM won't make them go away.
It's also not about it being a hard problem (which it incidentally is but that's another story). It's about controlling the scale of the problem as time goes on and sufficiently abstracting it without adding complexity. That's not hiding it - that's making it a non issue. Permanently.
I am only advocating using a minimal subset of the functionality to aid migration in the future between engines, platforms, versions and storage paradigms.
An ORM (in our case NHibernate) is a clean abstraction over the top which allows us to be completely mobile between vendors and versions. There is no NIH in it.
The world changes pretty quick. Those who adapt quickly stay around.
An ORM does not replace proper database design. And letting the ORM do all your work for you w/r/t to mobility means accepting that you will not have good access to anything but the least common denominator among databases. Which is approximately the SQL-92 standard.
Another choice, which doesn't involve cutting off one's nose to spite one's face, is to couple the ORM to a public interface which is defined in terms of a specific set of views and stored procedures. Then you can very easily write schema which conforms to this interface whenever you need to switch to a new DBMS. The new schema is then free to take advantage of whatever features that DBMS provides in order to achieve good performance.
And the potential performance gains can be huge. There are plenty of examples out there of replacing 1-hour basic SQL + in-app processing procedures with 100ms common table expressions and the like. If you've got someone who knows what they're doing in charge of it, the latter solution will often take quite a bit less time to write and debug, too.
Ah, therein lies the rub. Spot on.
The last decade has seen a gradual weening off of reliance on the database stack for processing any form of "logic". This has been a evolutionary disaster. When companies like Sun pushed out application platforms such as Java, Microsoft followed suite to remain in the game. That's where the action was and that's where their revenues came from for the good part of 1998-2008 (still does). As a result, the newcomer programmers find it unfathomable to position any amount of processing in the hands of RDBMSs. These newlings choke at the idea of stored procedures or triggers - citing that these constructs are a step in the WRONG direction. Set-based SQL operations remain an anathema for a large majority of the new programming stock.
This is a pity because the side-effect of this attention diversion has been poorer data modeling skills and gross underutilisation of new and often improved RDBMS concepts & tools.
In the end, these new-school graduates of denormalisers and NoSQLers will have to contend with the one and only critical challenge that RDBMSs set out to fix: data correctness.
Performance & scalability mean shyte if you can't guarantee correctness. Bring to the table ACID transactions and MVCC with your new concepts of scalability and performance - then we talk ... kinda thing.
Believe me, we still need ACID/MVCC and by no means do I want to break the consistency part of CAP theorem if possible.
Also set operations break down under the following conditions:
1. when you have to aggregate a LOT of data together they don't scale. If it scaled well we wouldn't need OLAP solutions and could do everything with live, perfectly normalised data. That is just not the way it is.
2. They really don't work for temporal or complex non-uniformly structured data at all well.
You say you still need ACID, but you advocate manually writing joins on top of an ORM layer which does not give you ACID and introduces massive latency. Using an ORM on top of an a key value store doesn't absolve you of ACID or having joins. You still do joins and transactions, but you need to write them manually and test correctness manually. Both are very expensive to do correctly and running some unit tests is not sufficient to show correctness at this level. You are writing these things without recognizing what they are.
The whole NoSQL movement is fundamentally about trading correctness for performance, and usually read performance at that. There is nothing else new about NoSQL databases and ironically many even use a variant of SQL for queries! SQL is just a language, not an implementation.
As for aggregates, yes they are hard. By doing it yourself you are not making the problem less hard. Rather, you are possibly implementing solutions that RDBMS's can't yet do. I suspect that with a reasonable relational design you could probably do whatever aggregates you need to in a high performance manner while still taking advantage of the work done in an RDBMS.
It sounds like you don't understand the requirements of your application. If you need full ACID, your solution with nosql + ORM + manually written joins and queries is failing you. Unless, of course, if you are also writing your own global lock or using some form of lock-free algorithm.
NHibernate, SQLAlchemy, Hibernate are where you need to look and compare.
The ORM does a few joins where a performance gain is possible but we avoid them though as they are expensive on IO and transfer.
Most of our tables match the UI 1:1 and we populate these from the domain model when it is mutated or when we can afford it to be done (either directly or via message queues). Effectively they are materialized views which are updated when data changes or before it is required. Google CQRS.
I'm not suggesting NoSQL - just caution when using RDBMS features. I'm not an advocate of NoSQL.
In all those cases, the devil remains in the details. For example if you're just looking to add one or two columns to the result set, an INNER or OUTER JOIN usually reduces the amount of data that has to be shoved through the pipe. That's because you avoid transmitting the keys that are needed to join the data in-app. On the other hand, if you're joining a small number of larger values to a table with lots of rows, perhaps the join duplicates those values enough to really bulk up the result set. But it's effectively impossible to know whether that will be the case without measuring the performance for a couple different options.
Here's an alternative consideration: ORMs tend to encourage a style of coding wherein every column in the table is always grabbed all the time, regardless of how much of that information is actually needed for the task at hand. This is done in order to be able to generate full-fledged objects like the ORM wants to, and to reduce the amount of custom mapping that needs to be maintained. Now if you're looking for a needless waste of bandwidth, there's a needless waste of bandwidth.
If you're consistently having performance problems on joins, it's likely that the real culprit is poor configuration of keys and indexes. Or if you're not watching how the ORM chooses to join the data too closely, perhaps you've got some N+1 queries sneaking in there.
I take it you are aware of Brewer's CAP theorem? (http://www.julianbrowne.com/article/viewer/brewers-cap-theor...) If not, you should be.
Whoah. This is a type of sociopathic reasoning applied to data modeling. The premise is based on accepting compromise knowing fully the consequences of data corruption and yet failing to address the actual rider: data shall not be left in an inconsistent state. An argument turned on its head.
Much like asking the question “is it ethical to be deliberately inaccurate on your IRS returns” is answered with “this is not a question about ethics”.
Is it not a gigantic assumption to claim:
1. The compromise has “worked” for others (“most successful websites on the planet” [sic]) without providing any real evidence supporting this claim? What does “worked” mean? Is Twitter “ok” with providing suggestions for Followers which contain people already in the user’s follow list? What were the consequences of data inconsistencies in matters related to financial transactions? Was every requirement able to simulate a catch-all “place a back-order” solution? These are real questions that have to be answered before evangelising the compromise.
2. If it works for others (meaning: if you are willing to accept compromises about data consistency) this compromise ought to be brought to your garden-variety corporate environment? Gasp! As it is, volumes or scalability issues aside, corporate environments suffer from data inconsistencies, poor or no transactional control, ineffective referential integrity etc. There’s very little positive impact of encouraging this habit as an overt practise under the false generalised promise of “you will eventually need to scale to millions”.
They can keep their CAP theorem or BASE principles for the likes of social toys like Twitter and Facebook. When credit card companies and banks start accepting compromises on their transactions, I’ll start paying due attention.
Most of these large -- and particularly, global -- systems you speak of are eventually consistent also; they use the relational database as a potent backend member of a distributed system, able to check and deal with a great many constraint problems that can be localized.
If you look at NHibernate for example, if you do paging on SQL 2005+ it will use CTE yet on MySQL it will use LIMIT clauses underneath.
Some enterprise products have a really crappy schema with an ORM thrown over the top - that's where the pain is.