SQL only hangs around because it was the best thing available at the time when DBs first exploded into common use. It’s pretty declarative, which is a point in its favor, but it’s not very amenable to query construction since it has AppleScript levels of “being like English.” Have fun writing something that can generate it while handling highly-adaptive and/or customizable requirements.
I’m not going to call Datalog the definitive answer, but it’s a step in the right direction.
Honestly, this really confuses me. On the surface, what you said seems like it should be true: how can a programming language that emulates natural language be good?
However, in practice, SQL is surprisingly easy to write and expressive. I do prefer using ORMs for simpler queries because of intellisense and typing, but for complex queries, SQL is just better. I really don't understand why.
When you're dealing with SQL you're dealing with items in sets, and really all you're doing is joining, slicing and filtering sets of data to obtain the results you want, perhaps with some extra sugar like aggregation, but in the main it's groups of objects and that's quite intuitive.
On the opposite side is object-orientation where you're dealing with instances of objects. Very intuitive when dealing with properties of an instance (perhaps columns in a table for a particular instance) but not when dealing with multiple instances. Instances of objects don't really gel with sets of objects for various reasons - this class of problems is known as object-relational impedance mismatch (Ireland et al, 2009; Colley et al, 2020). This is why ORMs are often used for development, as a smooth translation layer between object-orientation and set-orientation. Each paradigm is perfectly good within its own context, but they are not particularly compatible approaches.
References if you're interested:
Ireland, C., Bowers, D., Newton, M. and Waugh, K., 2009, March. A classification of object-relational impedance mismatch. In 2009 First International Conference on Advances in Databases, Knowledge, and Data Applications (pp. 36-43). IEEE. https://ieeexplore.ieee.org/abstract/document/5071809/
Colley, D., Stanier, C. and Asaduzzaman, M., 2020. Investigating the effects of object-relational impedance mismatch on the efficiency of object-relational mapping frameworks. Journal of Database Management (JDM), 31(4), pp.1-23. https://www.igi-global.com/article/investigating-the-effects...
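A concrete sketch of that mismatch: the object-oriented habit issues one query per instance (the classic "N+1" pattern), while the set-oriented habit does one join. A minimal example using Python's built-in sqlite3 and a made-up customers/orders schema:

```python
import sqlite3

# Hypothetical schema: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 20.0);
""")

# Instance-oriented habit: one query per object (the "N+1" pattern).
totals_oo = {}
for cid, name in conn.execute("SELECT id, name FROM customers").fetchall():
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?", (cid,)
    ).fetchone()
    totals_oo[name] = total

# Set-oriented habit: one join plus aggregation over whole sets.
totals_set = dict(conn.execute("""
    SELECT c.name, COALESCE(SUM(o.total), 0)
    FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""))

# Both yield {'Ada': 25.0, 'Grace': 20.0}; the second is one round-trip.
```

ORMs paper over exactly this gap, often by quietly generating the N+1 version.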
Being English-like is just appearance. SQL is really just relational algebra (the traditional querying bit of it, at least) with some technical limitations. But in essence it's relational algebra, and it works well because the topic was studied a lot by many smart people and the language reflects those developments.
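To make "it's really relational algebra" concrete, here is a tiny Python sketch of the three core operators (selection, projection, natural join) over relations-as-lists; the function names and data are made up for illustration:

```python
# Relations as lists of dicts; the three core operators of the algebra.

def select_(rel, pred):        # selection: keep rows matching a predicate
    return [r for r in rel if pred(r)]

def project(rel, attrs):       # projection: keep only the named columns
    return [{a: r[a] for a in attrs} for r in rel]

def natural_join(r, s):        # join: combine rows agreeing on shared columns
    shared = set(r[0]) & set(s[0]) if r and s else set()
    return [{**x, **y} for x in r for y in s
            if all(x[a] == y[a] for a in shared)]

emp = [{'name': 'Ada', 'dept': 1}, {'name': 'Grace', 'dept': 2}]
dept = [{'dept': 1, 'dname': 'R&D'}, {'dept': 2, 'dname': 'Ops'}]

# Equivalent to: SELECT name FROM emp NATURAL JOIN dept WHERE dname = 'R&D'
result = project(select_(natural_join(emp, dept),
                         lambda r: r['dname'] == 'R&D'), ['name'])
# result == [{'name': 'Ada'}]
```

Every SELECT ... FROM ... WHERE query is some composition of these (plus aggregation and friends); the English-looking keywords are surface syntax over that algebra.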
You sound like you mostly write ORM-like OLTP data retrieval. As somebody who's been around analytical dbs for the last 6-7 years I can tell you that SQL falls apart almost immediately on anything longer than 20-30 lines.
I wonder what the problem is, i.e. how exactly it is falling apart: a) you need to write a lot of "clean up" sub-queries and fragments, or b) the question you ask is more complicated and does not fit into the algebra of SQL.
If it's more the first: if your dataset were perfectly cleaned up (say, 3NF, with all data values cleaned up), how much longer would SQL go for you without falling apart?
As I write this comment I realize that it could be worded as "is the problem with the logic or the data"?
Thanks. I've had to help some of our data scientists debug their queries after we migrated to Hive 3, and it's mostly about the way large queries are structured. You often want to add a single derived column that further columns then reference (e.g. convert date of birth to age, then build age buckets and an is_minor flag from it, or calculate the age at which they opened their first account).
CTEs are the best way to do this but get clunky and repetitive, as you have to give each one a name. CROSS JOIN LATERAL is completely illegible. Something like LET column = expression would have improved that:
```
FROM customers
LET age = EXTRACT(YEAR FROM current_date() - dob)
LET age_bucket = CASE ... END
LET date_1st_ac = (SELECT MIN ...)
...
SELECT ...
```
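For comparison, the workaround available today is chaining CTEs: each derived column gets its own named layer that the next one reads. A runnable sketch via Python's sqlite3, with a hypothetical customers table and a fixed reference date so the result is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, dob TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ada', '1990-06-01'), ('Bo', '2010-06-01')")

# Each derived column needs its own named CTE so the next step can reference it.
rows = conn.execute("""
    WITH with_age AS (
        SELECT name,
               CAST(strftime('%Y', '2024-01-01') AS INTEGER)
             - CAST(strftime('%Y', dob) AS INTEGER) AS age
        FROM customers
    ),
    with_bucket AS (
        SELECT *, CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END AS age_bucket
        FROM with_age
    )
    SELECT name, age, age_bucket FROM with_bucket ORDER BY name
""").fetchall()
# rows == [('Ada', 34, 'adult'), ('Bo', 14, 'minor')]
```

Two derived columns already cost two named layers; each additional one adds another, which is exactly the clunkiness being complained about.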
The other problem lies in the fact that an SQL query is one huge query. You have to take it apart to sanity-check the intermediate result: okay, let me quickly turn this CTE into the main query and run it, no, let me try again with another CTE, okay, found it, now I have to convert everything back to run the whole query again.
In comparison, Spark dataframes are all made equal, I can name any one of them, examine their schema, run them to look at the intermediate result, all without introducing optimization barriers.
1. All the little things, inconsistencies and annoyances. You know, NULL behaviour, identifier usage, things that you just have to memorise (everything is non-orthogonal).
2. Language-level. No real abstraction, no code reuse, weird sublanguages for trivial stuff, no libraries/modules/packages.
3. Standard-level. I mean, this is not a standard anybody can use. It is unreadable! I can go a very long way if necessary - I'm used to reading the original C++ and C standards. But ANSI SQL... a bottomless pit. Databases do implement subsets of something ANSI-ish, but these vary a lot.
All of this results in templated queries, thousands of lines of queries that are only slightly different, and no chance of a truly db-agnostic language. I mean... have you seen write-once 3000 LOC queries? This is what I call "falling apart"!
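Point 1's NULL behaviour in particular trips everyone up: comparisons are three-valued, so NULLs propagate in ways you just have to memorise. A few lines of Python's sqlite3 show the classic traps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL is neither true nor false; the comparison itself yields NULL.
(eq,) = conn.execute("SELECT NULL = NULL").fetchone()

# Three-valued logic silently empties NOT IN when the list contains a NULL:
# 2 <> NULL is NULL, so the whole predicate is never true.
hits = conn.execute("SELECT 1 WHERE 2 NOT IN (1, NULL)").fetchall()

# IS NULL / IS NOT NULL is the only reliable test.
(ok,) = conn.execute("SELECT NULL IS NULL").fetchone()

assert eq is None and hits == [] and ok == 1
```

The NOT IN case is especially nasty because a single NULL sneaking into the list turns a working query into one that silently returns nothing.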
I like Codd's original relational algebra, even love it. It is both simple and practical. Most databases use RelAlg-like internal representations that optimisers work on for a reason!
But we're stuck with SQL and it will never go away.
SQL is just too old and too powerful to be replaced by something else. There might be implementations that improve on some subset of its functionality, but nothing can possibly cover everything it does without itself starting to feel kludgy.
> SQL is surprisingly easy to write and expressive
Counter-example: using only majority-support ISO SQL string functions and operators (SUBSTRING, CHARINDEX, LIKE, NOT LIKE, LEN, etc - but not any regular-expressions), for a string column that contains URI query-string style name+value pairs, e.g. "foo=bar&baz=qux", try to extract out the value for the key "baz" - note that the source column value could also be "&baz=qux&foo=bar&" or "baz=qux&foo=bar&".
Doing that is trivial in most conventional programming languages: just use a couple of local-variables to store CHARINDEX values then pass those into SUBSTRING to get the desired "qux" value.
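For instance, a minimal Python version of that "couple of local variables" approach (the helper name is made up; the find/slice calls mirror CHARINDEX/SUBSTRING):

```python
def get_query_string_value(s: str, key: str):
    """Hypothetical helper: extract the value for `key` from 'foo=bar&baz=qux'."""
    needle = key + "="
    key_idx = s.find(needle)                # CHARINDEX result, held in a local
    # the key must start the string or follow '&' (avoid matching e.g. 'xbaz=')
    if key_idx < 0 or (key_idx > 0 and s[key_idx - 1] != "&"):
        return None
    value_idx = key_idx + len(needle)       # start of the value
    sep_idx = s.find("&", value_idx)        # next separator, or end of string
    if sep_idx < 0:
        sep_idx = len(s)
    return s[value_idx:sep_idx]             # SUBSTRING

assert get_query_string_value("foo=bar&baz=qux", "baz") == "qux"
assert get_query_string_value("&baz=qux&foo=bar&", "baz") == "qux"
```

Each intermediate index is computed once, named, and reused - exactly the thing a plain SELECT expression list won't let you do.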
...but in SQL (at least to my knowledge) there's no concept of local variables in a SELECT query; instead you need to pass values along via derived-tables or CTEs - or move the logic to a scalar UDF - or give up trying to be succinct and do it with dozens of identical CHARINDEX call-sites and pray that the RDBMS is smart enough to memoize UDFs in a query step; none of those options are appealing even in their own right - so we're kinda stuck (and if you want to be succinct you'll need not only UDF support, but also inlined and concurrent execution of UDFs, otherwise your query performance will tank).
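Here is what the pass-values-along-via-CTEs version actually looks like, as a runnable sketch in Python's sqlite3 (instr/substr standing in for CHARINDEX/SUBSTRING; the string is padded with '&' on both sides so the key match is unambiguous; table and key are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (qs TEXT)")
conn.executemany("INSERT INTO src VALUES (?)",
                 [("foo=bar&baz=qux",), ("&baz=qux&foo=bar&",)])

# Every intermediate "variable" becomes a CTE column the next step reads.
rows = conn.execute("""
    WITH step1 AS (          -- find the key (instr is 1-based, 0 = not found)
        SELECT qs, instr('&' || qs || '&', '&baz=') AS key_idx FROM src
    ),
    step2 AS (               -- value starts right after '&baz=' (5 chars)
        SELECT qs, key_idx + 5 AS value_idx FROM step1 WHERE key_idx > 0
    ),
    step3 AS (               -- distance to the next '&' terminator
        SELECT qs, value_idx,
               instr(substr('&' || qs || '&', value_idx), '&') AS value_len
        FROM step2
    )
    SELECT substr('&' || qs || '&', value_idx, value_len - 1) FROM step3
""").fetchall()
# rows == [('qux',), ('qux',)]
```

Three named layers for what is a five-line function in any imperative language - which is the verbosity/DRY complaint in a nutshell.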
Instead, what if ISO SQL had the ability to define in-query "mini" pure-functions, which represent a (monadic?) sequence of operations, including mutable local variables, over its input parameters - this would solve the verbosity/DRY problem entirely - and without needing you to split a query's logic into two (i.e. the main SELECT query but also many CREATE FUNCTION statements). I note that the existing CTE syntax could be adapted for defining those functions, for example a query to extract querystring values could look like this:
```
WITH FUNCTION getQueryStringValue( input varchar, key varchar ) RETURNS varchar NULL AS (
    SET key = CONCAT( key, '=' );
    LET keyIdx = CHARINDEX( haystack: input, needle: key );
    IF keyIdx < 0 THEN RETURN NULL;
    LET valueIdx = keyIdx + LEN( key );
    LET nextSeparatorIdx = CHARINDEX( haystack: input, needle: '&', startIndex: valueIdx );
    IF nextSeparatorIdx < 0 THEN SET nextSeparatorIdx = LEN( input );
    LET valueLen = nextSeparatorIdx - valueIdx;
    LET value = SUBSTRING( text: input, startIndex: valueIdx, length: valueLen );
    RETURN value;
),
cte AS (
    SELECT
        src.someCol,
        getQueryStringValue( src.QueryStringA, 'foo' ) AS fooValue1,
        getQueryStringValue( src.QueryStringB, 'foo' ) AS fooValue2,
        getQueryStringValue( src.QueryStringC, 'foo' ) AS fooValue3
    FROM
        src
)
SELECT
    CONCAT( fooValue1, fooValue2, fooValue3 ) AS whatevs
FROM
    cte
ORDER BY
    someCol;
```
I don't disagree things like this are pain points, but one of the key things about unlocking the power of relational algebra is having normalized data, and putting multiple things (eg a bunch of key value pairs) in a single value sort of flies in the face of that. You can often get away with it, but it starts breaking down in more complex cases like this.
In the real world of course, sometimes it happens anyway. Maybe you were storing URIs and only later figured out you needed to extract keys from them. But thinking about this sort of thing beforehand as much as possible can save you a lot of heartache later on.
That said, this fails the ISO SQL requirement (I believe), but I'd use the table-valued function STRING_SPLIT in MS SQL; then I can just filter on LIKE 'baz=%' and SELECT the substring after 'baz='. Table-valued functions are great. The nice thing about this solution is that it's very easy to generalize.
I believe MySQL doesn't have a similar function, but you could do something with SUBSTRING and LOCATE - search for the first occurrence of baz=, then you can use that to find the first location of & after baz=, which will allow you to select the substring between baz= and &.
Let's not erase the work of Stonebraker et al whose efforts through the 70s were a lot closer to Codd's relational algebra than SQL. Ingres's query language, Quel, springs to mind: being directly based on the relational calculus it was far more readily composable than SQL (which might be charitably described as "borrowing ideas from the relational model") was or is.
As with the videotape format war, the eventual industry standard was decided commercially: Oracle was by far the most aggressive, and promoted SQL.
Yes, we can have relational databases without SQL, there are better ways for composability. Some might argue that SQL itself isn't a mess, but it sure does help people make a mess of themselves.
I had a period of preferring MongoDB for this reason. Because the query is the bare AST in Mongo, dynamic query construction is simply manipulating some JSON.
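That composability is easy to show without a server: a Mongo filter is just a dict, so conditional query construction is ordinary data manipulation (the builder function and its fields are made up for illustration; $gte, $in and $and are the real Mongo operators):

```python
# Query-as-data: composing a MongoDB-style filter is just dict manipulation.
def build_filter(status=None, min_age=None, tags=None):
    clauses = []
    if status is not None:
        clauses.append({"status": status})
    if min_age is not None:
        clauses.append({"age": {"$gte": min_age}})
    if tags:
        clauses.append({"tags": {"$in": tags}})
    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

q = build_filter(status="active", min_age=18)
# q == {"$and": [{"status": "active"}, {"age": {"$gte": 18}}]}
# versus splicing strings into "... WHERE status = 'active' AND age >= 18"
```

No string escaping, no worrying about where the WHERE keyword goes when the clause list happens to be empty.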
I’ve long felt that SQL was somehow backwards, but I don’t know what would be better. ‘Go and get two eggs’ is better than ‘go to the fridge and get eggs, only two’.
Something that strictly follows the semantic order of the query, whether forward or reverse, would be better.
The ordering issue of SQL is that it uses a mixed ordering, the physical layout of the query in text form doesn’t follow the logical execution of the query, which makes it very confusing.
Comparing it to short cutesy sentences is not helpful, as a non-trivial query has dozens or hundreds of steps; it’s closer to a build book or checklist.
Furthermore, English is far from a paragon of clarity or regularity, and languages which try to follow it (SQL, AppleScript) tend to scale a lot worse with instruction size and complexity.
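The mixed ordering is easy to demonstrate: clauses execute roughly FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, regardless of how they appear in the text. A small sqlite3 sketch of one consequence:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("a",), ("b",)])

# WHERE runs before grouping, so it cannot see aggregates...
try:
    conn.execute("SELECT grp FROM t WHERE count(*) > 1 GROUP BY grp")
    where_ok = True
except sqlite3.OperationalError:   # "misuse of aggregate" style error
    where_ok = False

# ...while HAVING runs after grouping and can.
rows = conn.execute("SELECT grp FROM t GROUP BY grp HAVING count(*) > 1").fetchall()

assert where_ok is False and rows == [("a",)]
```

Nothing in the written order tells you why WHERE and HAVING differ; you have to know the hidden logical pipeline.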
That's not standard SQL. This was standardized pretty late, so different RDBMS have various syntaxes, but the standardized one is to add a clause like this:
FETCH FIRST n ROWS ONLY
In PostgreSQL and MySQL the usual syntax is a LIMIT clause, but PostgreSQL also supports the standard. MSSQL and some others use TOP n directly after SELECT. I think in some others you need to use a window function.
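The fragmentation shows up even in Python's built-in sqlite3, which accepts the LIMIT dialect but not the ISO clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# SQLite speaks the LIMIT dialect...
rows = conn.execute("SELECT n FROM t ORDER BY n LIMIT 3").fetchall()
# rows == [(0,), (1,), (2,)]

# ...but rejects the ISO 'FETCH FIRST n ROWS ONLY' form as a syntax error
# (PostgreSQL accepts both; MSSQL wants TOP n instead).
try:
    conn.execute("SELECT n FROM t ORDER BY n FETCH FIRST 3 ROWS ONLY")
    iso_ok = True
except sqlite3.OperationalError:
    iso_ok = False
# iso_ok == False
```

So even the most basic "give me the first n rows" operation needs a per-database code path.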