

High scalability: SQL and computational complexity - coffeemug
http://www.rethinkdb.com/blog/2010/03/high-scalability-sql-and-computational-complexity/

======
justinsb
You can't rule out queries to suit your implementation; if the data is needed
and can't be computed within the constraints, then the user will just run
multiple queries to get that data.

For example, a key-value store can run simple lookups fast - O(1) - but if you
want to run a complex analytical query you'll be running lots of those queries
- maybe O(N) because you have to fetch all the data. You'd be better off
simply allowing one O(N) (or better) query, rather than dealing with the
overhead of N O(1) queries.

Disallowing some fraction of queries doesn't magically make your database
scalable, it's just an accountancy trick.

~~~
coffeemug
You don't allow N O(1) queries, you allow k O(1) queries, where k is a small
constant independent from N. It is simply a physical reality that a massively
scalable realtime system cannot run anything other than a small number of
log(N) queries. You can't run N O(1) queries or one O(N) query - there is just
no way to evaluate it in real time. If you need to present the result of such
a query to users, you can run it as an analytical query and cache it, but you
will not be able to run it on every request.

~~~
justinsb
But the users are going to issue those N queries while you're still at the
whiteboard talking about big-O notation.

Exposing a full query language to the database system means that you have the
opportunity to optimize the query fully. Limiting the query language only
limits your ability to optimize, so you won't even be able to explain to the
user that there's an alternative way to get their answer without bringing down
the database.

Are there limitations of databases such that it is possible to come up with
better query execution strategies than the database can figure out
automatically - of course there are! But in that case, fix the query optimizer
or fix the database. Don't try to solve the problem by redefining it.

~~~
coffeemug
If the query language is designed with this in mind, the compiler would detect
that a non-realtime query is passed to a database marked as realtime and would
throw an error. Of course the user could issue N realtime queries from within
a host language, but that would involve really trying to work against the
database. A good analogy might be CORBA/DCOM vs. SOAP. In DCOM days it was
easy to call many different functions while being oblivious that each call
requires a network roundtrip. Most programmers made this mistake most of the
time. With SOAP, you can still work this way, but it makes the likelihood of
this happening in practice significantly lower because it exposes the issue to
the programmer.

Regarding your argument of a fuller optimization, presumably a well designed
query language would be flexible enough to allow the user to express _every_
possible realtime query (assuming this is theoretically possible), so the user
doesn't need to combine smaller queries via host-language trickery. People
have been trying to fix the optimizer for decades, but most production
optimizers still have plenty of edge cases. Perhaps redefining the problem is
exactly what the doctor ordered?

~~~
justinsb
Nice theory, but consider that the N+1 problem is so widespread that it
actually has a name, and I think you'll realize that people are a much harder
problem than you think. And this isn't even a place where SQL is at all
problematic! Perhaps this supports your DCOM argument though, in that it's the
abstraction layers that really cause the problem.

That said, it's great that you're re-examining the problem; that's how
progress is made.

------
briansmith
If you're willing to pay Oracle, you can get a SQL database that runs queries
using the same execution plan every time. I don't remember exactly how to do
it, but basically you run the query wrapped in a command that says "fix the
execution plan of this query." The execution plan gets saved in some table
keyed by the query text. Whenever Oracle sees that query, it commits itself to
the same execution plan. This feature is like a decade old.

A lot of the NoSQL arguments are really "No-MySQL" or "NoMoney" arguments. If
you've never used Oracle (or DB2, or even SQL Server), you really don't have a
full understanding of the capabilities and limitations of SQL databases.
AFAICT, there's almost nothing in MySQL that wasn't old and boring in Oracle
over a decade ago.

~~~
rythie
Isn't that the same as doing a "STRAIGHT JOIN" with a "USE INDEX()" in MySQL?

~~~
coffeemug
Well, USE INDEX is just a hint, and MySQL is free to ignore it. You probably
mean FORCE INDEX, but I've still encountered situations where the index was
ignored because the optimizer (incorrectly) thought there was no way to use
it. (That was probably a bug, though)

------
madair
"Oh, and by the way, RethinkDB is hiring right now to create the next
generation of products that will fix all these problems with SQL."

It's an interesting article. But isn't he just saying, _P=NP is a problem, so
let's ban P=NP_? Is this programmer newspeak?

~~~
coffeemug
No. What I'm saying is that P=NP is a problem, so don't put O(2^N) algorithms
into your production software. People seem to be offended by the fact that you
can't run certain queries in realtime the moment you start talking about
scale. It isn't _my_ arbitrary definition, it's a fundamental limit in the
physical universe.

~~~
madair
I get it. I just wanted to point out that's what we're talking about, because
it's also still a P=NP problem to even know if something is a P=NP problem, so
all we can do, it would seem, is put up guards against the likely suspects.
And that's all great. I'm trying not to be a negative nancy, but its
statements like these that just seem so sloppy:

 _"If this holds true, even if we can trivially parallelize each query, an
exponential function (the amount of information) divided by a polynomial
(number of machines) still dominates the logarithmic function we defined
earlier as acceptable. This means that given modern trends, if a given query
isn’t scalable vertically, it also isn’t scalable horizontally, which makes
SQL fundamentally unscalable, period."_

Cut out the fluff and all that's saying is "Given modern trends, SQL is
fundamentally unscalable, period." And why? Because anyone can write a bad
query. And what's the answer? Complexity guards? Well, it certainly sounds
good, more guards in SQL, or let's not call it SQL if there are organizational
issues (i.e. non-modular design) and syntactical paradigms to improve. And
lets insert some other [undescribed] computational theory to improve it, but
_the paradigms don't change the algorithmic reality_ , they can only hope to
strategize better.

Having said all that, I don't want to be misunderstood, it sounds like there
may be some very interesting ideas behind all of this. It's hard to know in a
case like this whether to comment on something, considering that much of the
comment is negative. What motivates me is that in bullet points the article
seems to imply something quite different than it sounds like when taken at
face value. And I honestly appreciate push back if I'm just wrong about that.
It's all about filters required to find the good stuff in all the the
hyperbole that surrounds us.

------
jchrisa
"Given modern trends, if a given query isn’t scalable vertically, it also
isn’t scalable horizontally, which makes SQL fundamentally unscalable,
period."

~~~
barrkel
That was the line that jumped out at me too. The justification: information
grows exponentially, machines grow polynomially, therefore horizontal scaling
doesn't help - the information growth is still exponential - and therefore
vertical scaling is the same as horizontal scaling.

Seems rather unrealistic and academic to me.

The point about a query language that provides a complexity guarantee is more
interesting.

~~~
wmf
I agree. That quote applies to all possible queries, especially the worst-case
ones that touch every record in the database. Obviously such queries don't
scale in any paradigm, and thus they aren't useful for comparing systems.
Small queries are more interesting IMO, because they scale on some systems but
not others.

