Hacker News new | past | comments | ask | show | jobs | submit login
SQL: One of the most valuable skills (craigkerstiens.com)
821 points by duck 40 days ago | hide | past | web | favorite | 379 comments

SQL is one of the most amazing concepts I've ever experienced. It's nearly 5 decades old and there is no sign of a replacement. We've created countless other technologies to store and process data, and we always seem to try to re-create SQL in those technologies (e.g. Hive, Presto, KSQL, etc).

I run an early-stage company that builds analytics infrastructure for companies. We are betting very heavily on SQL, and Craig's post rings true now more than ever.

Increasingly, more SQL is written in companies by analysts and data scientists than typical software engineers.

The advent of the MPP data warehouse (Redshift, BigQuery, Snowflake, etc.) has given companies with even the most limited budget the ability to warehouse and query an enormous amount of data just using SQL. SQL is more powerful and valuable today than it ever has been.

When you look into a typical organization, most software engineers aren't very good at SQL. Why should they be? Most complex queries are analytics queries. ORMs can handle a majority of the basic functions application code needs to handle.

Perhaps going against Craig's point is the simple fact that we've abstracted SQL away from a lot of engineers across the backend and certainly frontend and mobile. You can be a great developer and not know a lot about SQL.

On the other end of the spectrum are the influx of "data engineers" with basic to intermediate knowledge of HDFS, streaming data or various other NoSQL technologies. They often know less about raw SQL than even junior engineers because SQL is below their high-power big data tools.

But if you really understand SQL, and it seems few people truly do today, you command an immense amount of power. Probably more than ever.

"we always seem to try to re-create SQL in those languages (e.g. Hive, Presto, KSQL, etc)."

This is largely because of the number of non-programmers who know SQL. Add an SQL layer on top of your non-SQL database and you instantly open up a wide variety of reporting & analytics functionality to PMs, data scientists, business analysts, finance people, librarians (seriously! I have a couple librarian-as-in-dead-trees friends who know SQL), scientists, etc.

In some ways this is too bad because SQL sucks as a language for many reasons (very clumsily compositional; verbose; duplicates math expressions & string manipulation of the host language, poorly; poor support for trees & graphs; easy to write insecure code; doesn't express the full relational algebra), but if it didn't suck in those ways it probably wouldn't have proven learnable by all those other professions that make it so popular.

SQL is amazing IF you understand it. You need to think in sets of things. It's like functional programming paradigms or recursion; once you really truly "get it" you start to feel like a Jedi master.

Unfortunately the vast majority of SQL users aren't that proficient. It's also fairly hard to learn, because it's something you only pick up with experience, and specifically long-term experience with a sufficiently complex data model.

I personally would never have gotten good at SQL if I hadn't stayed in my first two jobs for 5 years each working daily with the data-models and the domain.

I was fortunate that my early SE career experience involved a lot of SQL (and my first experience with a SQL-like query language happened under PICK).

But as my career has gone forward, I'm touching it less and less; today, I hardly touch it at all outside of my personal usage.

Much of that has to do with the fact that I'm now employed building and maintaining an SPA using JavaScript and Node.js, where the backend is accessed through a RESTful API; we never get to touch the actual database.

The few times I have seen some queries for that DB - albeit not in our API, though I could probably find them somewhere - all I can hope is that the SQL engine being used does some kind of query optimization on the fly, because there are so many cringe-inducing inner selects it ain't funny (like, I wonder if they've heard of joins and such).
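For what it's worth, the join-vs-inner-select difference can be sketched in a few lines; here SQLite (via Python) stands in for the real engine, and the users/orders schema is invented for illustration:

```python
import sqlite3

# Hypothetical schema: the same per-user total written two ways.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Correlated inner select: conceptually re-runs the subquery per user row.
nested = conn.execute("""
    SELECT name,
           (SELECT SUM(total) FROM orders WHERE orders.user_id = users.id)
    FROM users ORDER BY name
""").fetchall()

# Equivalent join + GROUP BY: one relational expression the planner can optimize.
joined = conn.execute("""
    SELECT u.name, SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name ORDER BY u.name
""").fetchall()

print(nested == joined)  # True: same result, very different shapes of query
```

Many engines will decorrelate the first form automatically; the hope expressed above is exactly that the optimizer does this rewrite for you.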

Before, I was involved in a lot of PHP web apps and backend server automation, where I needed to use SQL a lot; I feel like I am getting rusty in it.

You could make a language that had all those benefits + almost all the things nostrademons mentioned. For example, Apache Spark on top of Scala does this pretty well (but it's significantly more complex than SQL is.) Pandas in Python and SAS also have many of these features (but also have a higher learning curve.)

You need to think in bags of things

SQL certainly does not suck as a language; unfortunately this is a very common perception, almost a (false) consensus within the developer community. I changed careers from finance - I was an ACA (akin to a CPA), a wizard at Excel VBA, etc. - to software development, and after 10 years of learning I'm now finally also proficient with SQL. I have learnt other languages, but SQL is by far my favourite.

Yes, it has some mistakes in its design - the biggest being SELECT before FROM. Yes, some keywords, especially window functions, have a silly length - this creates an impression of verbosity, but SQL is actually the opposite: it is far more expressive than any object-oriented language I can think of. One line of code, say a window function with a FILTER clause in Postgres, is the equivalent of pages of Java. Lateral joins enable chaining; views, CTEs and functions provide simple, safely scoped composability.

Postgres has incredible aggregation and analysis ability built in, out of the box. Its string manipulation capabilities are wondrous: a one-line string_agg call can achieve the same as pages of VBA code - you may scoff, but VBA with its ancient, horrible editor is still necessary if you are in Excel, and is JavaScript that much less verbose? Pure SQL Postgres functions are amazing: you can chain them, you can easily make them pure, and they can include fast recursion and conditional logic.

I never delete data, just append; a bitemporal, immutable, functional DB is so powerful, enabling time travel and rock-solid data integrity. Materialized views, indexes and summary tables are easily created, often automatically. They provide an efficient, reliable cache layer, which can address most performance issues that may arise.
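To make the one-liner claim concrete, here is a minimal sketch in Python with SQLite standing in for Postgres (the sales table is invented; SQLite's group_concat plays the role of Postgres's string_agg):

```python
import sqlite3

# Toy sales table; dept/rep/amount names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (dept TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('toys', 'ann', 10), ('toys', 'ben', 30),
        ('food', 'cat', 20), ('food', 'dan', 40);
""")

# One window function: every row alongside its department total, no self-join.
rows = conn.execute("""
    SELECT rep, amount, SUM(amount) OVER (PARTITION BY dept) AS dept_total
    FROM sales ORDER BY rep
""").fetchall()
print(rows)  # ann/ben share dept_total 40; cat/dan share 60

# One aggregate call: group_concat here, string_agg in Postgres.
per_dept = conn.execute("""
    SELECT dept, group_concat(rep, ',') FROM sales GROUP BY dept ORDER BY dept
""").fetchall()
print(per_dept)
```

Each of these would be a loop plus bookkeeping in VBA or Java; in SQL it is a single clause.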

You're confusing the language and the database engine. I don't think that seasoned developers think relational databases suck per se, since most of their constraints are technically justified, but they don't like SQL as a language because its grammar is completely awful: inconsistent (with itself and across database engines), filled with quirks, and clumsy.

To illustrate with `select` queries: you start by listing the attributes and then the table, while with `update` you start by specifying the table and then the attributes on which to operate. This illustrates the grammar problem: in one case you start with the table you will use and the attributes to set; in the other, the attributes you need, while keeping the table in mind since you specify it after. It's not really a problem, but developers tend to hate any kind of cognitive load, and this is one source of it.

I'm not an ORM fan, but developers often use the programming language of their application to build SQL query strings, and with such tools you always start by specifying the target table.

Personally I really wish that RDBMSs would provide another intermediate language, or better, a data structure, to interface with them.

SQL is wildly powerful and important, but it's also got clear deficiencies for data traversal and manipulation.

I see it on a spectrum between declarative and imperative data structures, and where I think most people go wrong with it is in trying to create a monolithic solution to a broad spectrum of problems. I think you need a graduated approach where each data layer simplifies and satisfies the next, so you're using tables, procs, views, and in-memory constructs in concert. The database is a powerful tool, and SQL is just part of that bigger puzzle :)

I can't agree more with this. As with anything we use in our applications, understanding and leveraging the strengths of our different tools is vitally important. If, for example, I needed a Cartesian product of two datasets, SQL is the first thing that comes to mind due to how simple it is to write and the speed with which it will be processed in the database. On the flip side, I would never want to intermix frontend code in SQL, which, crazy as that sounds, I have seen before. Everything has its place in a logical development flow.
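A quick sketch of how little it takes (Python with an in-memory SQLite DB; the sizes/colors tables are made up):

```python
import sqlite3

# A Cartesian product is a single CROSS JOIN in SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sizes (s TEXT);
    CREATE TABLE colors (c TEXT);
    INSERT INTO sizes VALUES ('S'), ('M'), ('L');
    INSERT INTO colors VALUES ('red'), ('blue');
""")

combos = conn.execute(
    "SELECT s, c FROM sizes CROSS JOIN colors ORDER BY s, c"
).fetchall()
print(len(combos))  # 3 * 2 = 6 combinations
```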

I’ve seen queries returning HTML and JavaScript

It completely blows my mind when I see stuff like this. We're not all perfect programmers, I'm sure, but still. An alarm should go off in your head when the thought occurs to put front-end code into a query. The only exception I've encountered is if I want to inject some HTML formatting into a string for emails sent from SQL Server, but even then it's extremely limited in use, and I still think to myself, there must be a better way..

The worst thing is that the guilty programmer probably felt very clever when he wrote the code.

Starting with the end in mind (hi, Covey) is actually not such a bad idea. First state what result you want the query to produce, and then start describing where that data should come from.

Updates aren't different from select statements: first you state what you want (update a table with some new column values), and then you state where this data should come from and what data you want to update.

> Yes, it has some mistakes in its design

And this is why it sucks. Which is fine, the language comes from a different era when our understanding of computer languages was much more primitive. It is understandable that mistakes would be made. What is unfortunate is that there has been little to no progress in improving on those mistakes in this problem space.

In the procedural world, you could also say that C has some mistakes in its design. However, we've gone to great lengths to try and improve on C, for example in semi-recent times with Rust and Go. SQL could really benefit from the same. SQL gets a lot of things right. So does C. But we can do better on both fronts.

Unfortunately, it seems that people regularly confuse SQL with the underlying concepts that SQL is based on. Because there is basically no competition in this space, it is assumed by many that this type of problem can be expressed in no other way, and that you simply do not understand SQL well enough if you do not see it as perfect. I guess it comes from the same place as those who argue that if you make memory management mistakes in C, you just don't understand C well enough.

As a software engineer who later learned SQL, I could not disagree more. Within the parameters that it is designed for, SQL is a terrific language that makes exploring and manipulating data much easier than tools like python or Scala. That doesn't mean I have no place for python or Scala, but that I definitely see a class of problems where an SQL interface is far superior.

I use python/Pandas every day for data analysis and the like, and I would never dream of not writing most of the aggregation and filtering logic in SQL. If you're working with large datasets, there is absolutely no reason to pull unnecessary data into memory.
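A toy illustration of the difference, using SQLite from Python (the events table is invented; the point is where the aggregation runs, not the schema):

```python
import sqlite3

# Synthetic events table with 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i % 100, 'click' if i % 3 else 'view') for i in range(10_000)])

# Anti-pattern: pull 10,000 rows into memory just to count them per kind.
counts_py = {}
for _, kind in conn.execute("SELECT user_id, kind FROM events"):
    counts_py[kind] = counts_py.get(kind, 0) + 1

# Push the GROUP BY into the engine: two rows come back instead of 10,000.
counts_sql = dict(conn.execute(
    "SELECT kind, COUNT(*) FROM events GROUP BY kind"
))
print(counts_sql == counts_py)  # True
```

At warehouse scale the first version doesn't just waste memory; it moves the whole table over the network.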

I'm not sure what it is today - I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.

That is, minimize the network bandwidth by putting the work on the DB engine.

This of course necessitated creating and understanding proper SQL query-building practices. It was real easy to mess up if you didn't know what you were doing (i.e. inner selects, improper joins, etc.) and cause a combinatorial explosion that would consume all the RAM on the server and grind it to a halt.

That, or bring back a load of data that you then filtered on the "client" - better to let the DB server do that if you can. Of course, this was back when the clients were 486s and early Pentiums with maybe 8-16 MB RAM. Today it's a bit different, but you still want to minimize the network traffic.

> I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.

We're a machine learning shop, and this is absolutely one of our core design principles.

The hardware progress since the 486 era, along with massive parallelization, has only reinforced this approach.

Pulling into memory is an attribute of implementation - you could write LINQ in C# (which is a great abstraction too) and not care about the fact that it's translated to SQL that runs server-side.

Depending on the use case, I'd argue Spark can be a better choice for aggregating/filtering than SQL. SQL is great for simple queries, but once my queries start getting into the many hundreds of lines, then I start to miss all of the complexity management features of a true programming language.

There are a few things, but not many, where I have had a better experience in Spark as compared to using something like Apache Pig with UDFs. Now this part might be a matter of how things are set up where I work, but I find that working with Tez for process management and debugging is far easier than working with the process management built into Spark.

EDIT: when you read process management above, perhaps it's better to think task management.

I understand your perspective, but I look at it a different way.

SQL is troublesome to some programming types because it seems alien to ask for what you want instead of telling the computer what to do, and I find most programmers, especially ASD-types (who I think have an edge in some situations, like writing certain code in a huge org like Google), find this an unfamiliar and strange way of thinking.

You're right about some of it (especially string manipulation, which is brutal) but I think you look at it like most programmers. SQL engages more of a simulative mindset—one popular with analysts—than the acquisitive mindset that most software developers outside of data science employ.

Most programmers who are at all familiar with functional programming, DSLs, configuration languages, or optimizing compilers are very well versed with asking the computer for what you want rather than telling it what to do. At least when I was there, this was a very large percentage of Googlers.

My issue with SQL is that a programming language should allow you to compose and name building blocks, and then recombine them to build ever more useful software. SQL doesn't have this structure. If you have a query that almost does what you want but you need to add one more filter, you need to reissue the query (modulo views/temporary tables, which is why I said "very clumsily compositional" rather than "not compositional"). If you have an expression that computes some quantity and then you want to use it with slight modifications in a lot of places, you usually end up copy & pasting it (modulo stored procedures). Modern programming languages have made this sort of abstraction really easy, but it's quite clunky in SQL. You can do it (by using stored procedures, views, triggers, subqueries, etc.), but then most of your application ends up written in SQL and it starts to feel like something out of the 1970s.

My preferred interface would be something like the relational algebra where relations are represented as typed values in the host programming language and operators are normal method calls (or binary operators, depending on language flavor) and importantly, intermediate results can be assigned to variables. I don't care what the particular execution strategy is of the query, but I do think it should be possible to refine, join, project, and subquery using values you've already defined.

> My issue with SQL is that a programming language should allow you to compose and name building blocks, and then recombine them to build ever more useful software.

When I write a complex query it's kind of like this.

I start with one table (viewed in my head as an excel style grid of results). I join another table then filter on the join / where clause. Again the output is effectively another table. Rinse repeat.

The difference is that SQL is very powerful at what it does, so you don't need to compose blocks in the way that you would in other languages. Just think of everything as a table. The result of every join or filtering clause is another table. That's your building block.

You are going to need significantly less SQL than Java / Python to get the same results from your data.

True, if you modulo all the features like views, stored procedures, functions, foreign keys, triggers there is no reusability in SQL...

On the serious side: it's not a bad idea to contain critical business logic in the database. It's shared by any app using the database. Foreign keys especially link tables and let you cascade changes with no extra code.

Less overall code. Higher guarantees.

Everyone agrees it's a good idea to use a `datetime` field for a `created` field instead of `varchar` and letting the application parse/extract its own format in there.

It's also a good idea to add triggers that do specific things to keep your db in a consistent state upon actions.

Maybe adding/deleting a row needs to update a counter in a statistics table for quick access because count(*) on that table takes too long.

You could do that in your app - but what happens if someone changes the data without going through your app?

Instead you add it as a trigger. Now you can be sure that the number in the stats table is always the same as the actual number.
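A minimal sketch of that counter-maintaining trigger, in SQLite syntax via Python (table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY);
    CREATE TABLE stats (n INTEGER);
    INSERT INTO stats VALUES (0);

    -- Keep the counter in sync on every insert and delete.
    CREATE TRIGGER items_ins AFTER INSERT ON items
        BEGIN UPDATE stats SET n = n + 1; END;
    CREATE TRIGGER items_del AFTER DELETE ON items
        BEGIN UPDATE stats SET n = n - 1; END;
""")

conn.executemany("INSERT INTO items VALUES (?)", [(1,), (2,), (3,)])
conn.execute("DELETE FROM items WHERE id = 2")

# Any client that touches the table keeps the counter honest, app or not.
print(conn.execute("SELECT n FROM stats").fetchone()[0])  # 2
```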

Not sure why you are being downvoted; I would like to at least hear the counter-argument.

Database changes have been painful for me in every job I have worked at. They often took me longer to write, longer to test, and longer to fix when I introduced bugs. In my experience, doing things in the database is great for simple actions, transformations done in huge volumes, preprocessing large datasets, or logic that is unlikely to change. It's bad for doing things which don't have those properties, especially the last property due to database changes being painful.

I don't disagree with him (though I don't necessarily agree as strongly with respect to triggers), but the counter argument is that it's extremely easy to put up a web service that offers an API and interacts with JSON. Now all the complicated bits are in the API service, and the multiple apps don't need to know anything at all about the underlying data store.

The counter to that is that it can be very difficult to get an API that is as comprehensive and flexible as SQL is, and eventually you'd find yourself re-implementing SQL if you ever tried. Certain reporting scenarios or mass data manipulation scenarios really require actual data store access in order to perform remotely well.

Your counter-counter-argument about not being as flexible as SQL is spot on. A DAL that sits in front of storage mechanisms is not a bad idea, though.

I've found moderate success in doing the reverse and using communication channels from SQL => services either via NOTIFY or queues.

i.e. dispatching an event that signals a non-SQL service to do an action, e.g. saving a file to S3, then updating the database when the action is complete.

This doesn't really work when retrieving data synchronously, but in most cases if the data is stored elsewhere, then you probably want to access it via non-DB methods anyway (e.g. for an S3 file, you'd use the URL).

WRT JSON, please for the love of god no. Someone made the same argument a month ago, my reply here: https://news.ycombinator.com/item?id=18870838

> WRT JSON, please for the love of god no.

I don't like it either, but most scenarios that this sort of thing works for are a RESTful API or microservice or equivalent (or whatever you want to call it these days, since I think the expiration date on those terms has elapsed), and not all data can be meaningfully encoded without a structure like JSON or XML. Or rather, you can, but you're reinventing the wheel, just like you would with an API trying to be as flexible as SQL. You may run into various system limits on URL length, too.

Events and queues are great, but not all requests work that way. If you're USPS and you're providing an address and ZIP code resolver, you've got different requirements than getting data from one system to cascade across a series of systems with a dozen different asynchronous widgets. You'd have a burden of using a format that your customers would prefer, too.

Having worked with three large JSON HTTP microservice-based projects in the past, I abhor the day they got popular. Relative productivity has definitely gone down due to the project overheads they cause.

Not saying microservices don't work, they are great for specific use cases, but 9/10 people just want to microservice everything just to say they use microservices.

> If you're USPS and you're providing an address and ZIP code resolver

Not sure I see your point with this example - isn't a resolver like this just a fixed database of entries?

> Not sure I see your point with this example - isn't a resolver like this just a fixed database of entries?

Until you get to how well it handles misspellings and incomplete information. Addresses are also notoriously difficult to parse. There is some logic and ranking at work behind the scenes. The only times I've seen it consistently fail are when the city is incorrect, or it's a genuinely new address (new construction or address renumbering).

In any event, is a read-only service somehow less of a service? I'd wager read-only services see a lot higher demand than anything.

Not a down voter, but I imagine it could be because triggers are associated with write amplification and visibility concerns. They have their place, just like stored procedures.

This is a very real problem I bang my head against regularly. There just seems no way to achieve all three of readability, maintainability and performance in a large enough SQL codebase. You can piece together views right up until the moment the query planner forgets to push where clauses down. You can wrap a query in a function to guarantee the where clause is evaluated then and there, but now you have to maintain that _and_ your view/table. You can slowly rewrite your code bit by bit, adding complexity to force the query planner (in this supposedly declarative language) to behave the way you want it to, for identical results. You can wrap all of it in a materialized view so you don't have to care about performance anymore, right up until the point it takes 12 hours to refresh.

I'd be very much in the market for something more modular than SQL, which had a much more customisable query planner, so you could say "I don't care if you go away and compile this for an hour, as long as you make it as quick as physically possible and then save the query plan forever".

> save the query plan forever

Is this even possible? AFAIK, the query plan can depend on the data distribution and needs to be re-optimized when the data distribution changes.

Yeah, some of it is based on statistics, although in my experience, the relative distributions of rows don't massively change in my databases, so an ostensibly optimal query plan on day one isn't likely to be rubbish later. You can obviously imagine other people's mileage varying.

> My preferred interface would be something like the relational algebra where relations are represented as typed values in the host programming language and operators are normal method calls (or binary operators, depending on language flavor) and importantly, intermediate results can be assigned to variables.

Much if not all of what you're asking for can be done today at the database level with modern SQL affordances (CTEs, etc) and vendor-specific stuff like stored procedures. You can write functions that return tables/resultsets, assign those results to variables, etc.

There's not a host programming language database library that exposes those things in quite the manner you seem to be asking for, but that's not a limitation of these SQL-based RDBMSs themselves.

In the realm of what's possible right now, you could of course write your stored procedures directly in .NET languages (MSSQL), Python (Postgres), etc.

'With'-clauses (aka Common Table Expressions) allow you to name and compose subqueries much more cleanly.
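For example (SQLite via Python, with an invented orders table), each CTE names an intermediate relation that later blocks can reference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, total REAL);
    INSERT INTO orders VALUES (1, 50), (1, 70), (2, 10), (3, 200);
""")

big_spenders = conn.execute("""
    WITH
    spend AS (               -- first named block: total per user
        SELECT user_id, SUM(total) AS amt FROM orders GROUP BY user_id
    ),
    big AS (                 -- second block composes the first by name
        SELECT user_id FROM spend WHERE amt > 100
    )
    SELECT user_id FROM big ORDER BY user_id
""").fetchall()
print(big_spenders)  # [(1,), (3,)]
```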

Your preferred interface sounds like LINQ. An IQueryable<T> interface represents a query and can be composed and assigned to variables, and the type T represents the type of the relation or projection. LINQ composes a lot more elegantly than SQL itself. The instances do not represent intermediate results, though; they represent composed queries, which are executed in one go on demand. But I think that is what you want anyway.

> 'With'-clauses (aka Common Table Expressions) allow you to name and compose subqueries much more cleanly.

These are great, but their implementation varies pretty significantly. Notably, MySQL didn't have them at all until v8.0, and PostgreSQL's CTEs are... wonky, for lack of a better term (I believe they use an "evaluate and store" method that's closer to a temp table, and as a result they don't optimize well and have been known to have strange side effects). Not all RDBMSs support recursive CTEs, either.

There are indeed a number of decent swipes at the idea, like Linq. But they all suffer from some impedance mismatch. It would be better if the native language underneath was written to support it more natively.

As you and others cite, there's a variety of features that have crept up on the idea over the years, but they're all bodging things on to the side of something they can't really change. I'd love to see the Rust of database querying to SQL's C++. I draw that comparison because I think a lot of the same structural problems are in play preventing it; SQL is just so good that it is actively prevented from being fundamentally improved. It can be incrementally improved, but not fundamentally.

Views - create a general top-level view, then build more specific views on views using more filters to go down (i.e. more granular), then just join these with yet more views to combine, or aggregate to go back up. How is that not composable? If you hit performance issues, they are easily solved with a few materialized views. Also CTEs and user-defined functions (I use pure SQL functions, but in Postgres you can easily use Python or JavaScript instead).
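A small sketch of that layering (SQLite via Python; all table, column and view names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, total REAL, cancelled INTEGER);
    INSERT INTO orders VALUES
        ('eu', 10, 0), ('eu', 20, 1), ('us', 5, 0), ('us', 40, 0);

    -- general top-level view
    CREATE VIEW live_orders AS
        SELECT region, total FROM orders WHERE cancelled = 0;

    -- more specific view built on the first, adding a filter (going down)
    CREATE VIEW big_live_orders AS
        SELECT * FROM live_orders WHERE total >= 10;

    -- aggregating view built on the first (going back up)
    CREATE VIEW revenue_by_region AS
        SELECT region, SUM(total) AS rev FROM live_orders GROUP BY region;
""")

big = conn.execute("SELECT * FROM big_live_orders ORDER BY total").fetchall()
rev = conn.execute("SELECT * FROM revenue_by_region ORDER BY region").fetchall()
print(big, rev)
```

Each view is a named, reusable relation; the next layer neither knows nor cares whether it reads a table or a view.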

At least in the DB we're using, Sybase SQLAnywhere, materialized views come with a hefty price tag. They must be dropped and recreated every time you touch any of the base tables, like adding a column, which in turn requires any indexes on the materialized views to be recreated.

For a few of our customers, that meant that a 15 minute DB change (adding a column) turned into a several hour DB change (rebuilding materialized view).

If they're refreshed immediately, they also incur a penalty on the underlying tables, since any data changes essentially runs a trigger to see if the materialized view needs changes as well.

In our case we had one such immediate refresh materialized view which caused such a performance degradation of an underlying table that we had to find an alternate solution.

Materialized views for sure add some complexity and overhead, but I think that's true of any caching mechanism. They should probably be one of the last tools you reach for when trying to optimize queries, but in certain critical situations they can be super effective.

What's the difference between that and a non-SQL approach?

Writing a program that makes a temporary table somewhere, and if there's a mistake you have to start all over again?

Well, for immediate-refresh materialized views, how do you keep your non-SQL solution up to date?

For manual refresh, I agree, either non-sql or just plain temp tables is a decent alternative. In our case the views do a lot of joins and subqueries, so we've mostly used temp tables.

Datomic has host-process value compatibility (see snippet #2 at http://www.dustingetz.com/:datomic-in-four-snippets/) and composes (snippet #4).

I think you make a valid point about compositionality. SQL lacks elegant composition at the source/module/syntax level. Views and stored procedures are stateful ways of achieving something similar, but using them also involves migrating that state. I don't think you're saying SQL is not expressive, which seems to be the rebuttal offered by siblings.

Preach! I just started building something like this in Purescript which might interest you - https://github.com/ajnsit/purescript-consequence.

It allows writing things like -

    type PersonRel = Relation (name :: String, age :: Int)
    type EmployeeRel = Relation (employeeName :: String, managerName :: String)

    -- Get all managers older than 60
    oldManagers :: PersonRel
    oldManagers = employeeTable    -- assuming employeeTable :: EmployeeRel
        # renameField (SProxy :: SProxy "managerName") (SProxy :: SProxy "name")
        # join (RProxy :: RProxy (name :: String)) personTable
        # filter (\r -> r.age > 60)
        # project (RProxy :: RProxy (name :: String, age :: Int))

Resembles R dplyr pipelines using magrittr, although I find the dplyr version easier to read.

I am not familiar with R dplyr.

The code snippet I posted doesn't adequately demonstrate it, but the DSL is strongly typed with full type inference at every step. So for example, it knows that the type of the table after the "join" has a field "age" of type "Int", and that filtering on it is okay. Trying to filter on a non-existent field will cause a compile (not runtime) error.

Second, the "steps" are actually plain functions. `#` is simply the reverse function application operator defined as `a # f = f a`. Which means you can abstract operations. For example, you can define a generic function that can take any relation with an age field and filters for age > 60. Thanks to strong types, using it incorrectly (i.e. on a relation without an age field of type integer) would cause an compile error.

Also, the language itself is a deeply embedded DSL. So "ordinary" code also compiles down to SQL. You can do things like loops etc. (for example, create a function that filters over a list of age limits rather than just one) and it will try to generate the best SQL possible.

Of course, the SQL generation is currently not implemented, so all this is theoretical, but I'm getting there!

There is also the approach of https://querycombinators.org which is inspired by functional programming but results in a declarative language. Queries are composed of functions that can be named and tested separately.

You may be interested in my project "Binate"; I have some incomplete notes at http://canonical.org/~kragen/binary-relations about it, exploring a possible design space in which we might find a more concise and compositional query language. Of course that's not in a host programming language; for that there do exist some things like SQLAlchemy and SQLObject which, as a bonus, can compile the resulting query to SQL. The most popular of these is Django's "ORM", which unfortunately is very limited.

I am the author of a small parser/typechecker for a subset[0] of SQL, roughly matching that supported by SQLite, and I agree with you COMPLETELY. What follows is my rant directed at those who don't.

It is an ugly, verbose language. You can be very familiar with thinking in sets, and still not like SQL. It's what we've got, and the SQL databases available are very, very good products. But I do wish a cleaner language could have won. As you know, anyone who's used LINQ in C#, particularly by directly calling the extension methods .Select(...).Where(...).OrderBy(...), sees how much better it is from a composability standpoint.

SQL is the anti-Lisp. Lisp's design is about 4 pieces of syntax and a few fundamental operations from which all else is built.

Conversely, nearly every operation in SQL is a tacked-on special case to the ridiculously complex SELECT syntax. Filtering results? That's a clause of SELECT. Ordering? Clause of SELECT. Filtering after aggregating? Oh, that's a different clause of SELECT.

Once this philosophy has infected the brain of a SQL implementer, it spreads like wildfire. That's why you even see custom syntax pop up even in good'ol function calls sometimes, like in Postgres: overlay('abcdef' placing 'wt' from 3 for 2).

SQL fans often talk about the beauty of relational algebra. Once you achieve relational enlightenment, SQL is supposed to be beautiful. But if we wrote math like SQL, you wouldn't say 2 * 3 + 4. There would be a grand COMPUTE statement with clauses for each operation you could wish to perform. So you'd write COMPUTE MULTIPLY 2 ADD 4 FROM 3. Of course, the COMPUTE statement is a pipeline, and multiplication comes after addition in the pipeline, so if you wanted to represent 2 * (3 + 4) you'll push that into a sub-compute, like COMPUTE MULTIPLY 2 FROM (COMPUTE MULTIPLY 1 ADD 4 FROM 3).

SQL clauses could have been "functions" with well-defined input and output types, if the language designers had come up with a type system to match the relational algebra.

    WHERE - table<row type> -> predicate<row type> -> table<row type>
    JOIN - table<row x> -> table<row y> -> predicate<row x, row y> -> table<row x, row y>
    ORDER - table<row type> -> list<expression<row type>> -> table<row type>
These could be pipelined, rather than nested, either with an OO-style method call syntax or a functional style pipe operator.

You understand the idea. Again, as you've mentioned, the LINQ methods[1] are a great resource for people not familiar with this style. But, counterargument. Languages with minimal syntax and great power are often claimed to be unreadable. Sometimes it's nice to have special syntax, to help give a recognizable shape to what you're reading, instead of it being operator / function call soup.
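The clauses-as-functions idea above can be roughly sketched in Python, with lists of dicts standing in for typed tables. All names here are invented for illustration; this is the shape of the pipeline, not a real query engine:

```python
# Hypothetical sketch: SQL clauses as plain functions that each take a
# table and return a table, so they pipeline instead of nesting inside
# one monolithic SELECT statement.
def where(table, pred):
    return [row for row in table if pred(row)]

def order_by(table, key):
    return sorted(table, key=key)

def select(table, projection):
    return [projection(row) for row in table]

users = [
    {"name": "ada", "age": 36},
    {"name": "bob", "age": 72},
    {"name": "eve", "age": 64},
]

# Each step composes freely because the input and output types match.
result = select(
    order_by(where(users, lambda r: r["age"] > 60), lambda r: r["name"]),
    lambda r: r["name"],
)
print(result)  # ['bob', 'eve']
```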

So what did SQL accomplish by making everything a special case of SELECT?

Well, it's got this sensible flow to every statement. You see, the execution of SELECT logically flows as I've numbered the lines below.

    7 TOP n
    5 column expressions
    1 FROM tables
    2 WHERE predicate
    3 GROUP BY expressions
    4 HAVING expressions
    6 ORDER BY ...
I sure am glad they cleared that up. If it had been a chain of individual operations I'd have been utterly baffled.
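The numbered flow is real, though, and easy to check. A quick SQLite demonstration (table and data made up) of step 2 vs step 4: HAVING runs after GROUP BY and so may use aggregates, while the very same predicate is rejected in WHERE:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT)")
con.executemany("INSERT INTO users VALUES (?)", [("ann",), ("ann",), ("bo",)])

# HAVING (step 4) evaluates after GROUP BY, so aggregates are allowed.
rows = con.execute(
    "SELECT name, COUNT(*) FROM users GROUP BY name HAVING COUNT(*) > 1"
).fetchall()
print(rows)  # [('ann', 2)]

# WHERE (step 2) evaluates before grouping, so the same predicate errors.
err = None
try:
    con.execute("SELECT name FROM users WHERE COUNT(*) > 1")
except sqlite3.OperationalError as e:
    err = str(e)
print(err)  # "misuse of aggregate" error from SQLite
```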

Ignoring completely the syntactical design, the lack of basic operations is a pain too. Why can't I declare variables within queries? I often would like to do something like:

        let x = compute_something(...)
        x + q as column1
        x + r as column2
    from ...
Instead, when I really need that, I end up wrapping the whole thing in an outer query and computing X in the inner query. Hooray for SELECT, the answer to all problems! Oh and yes, views and stored procedures and UDFs are no answer to the need for one-off local composability within queries.
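That wrap-it-in-an-outer-query workaround looks roughly like this in SQLite (a CTE plays the inner query; the table, columns, and the `a * 2` computation are all hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, q INTEGER, r INTEGER)")
con.execute("INSERT INTO t VALUES (5, 10, 100)")

# Compute x once in the inner query, then reuse it twice in the outer
# SELECT -- the indirection the comment above is complaining about.
row = con.execute("""
    WITH inner_q AS (SELECT a * 2 AS x, q, r FROM t)
    SELECT x + q AS column1, x + r AS column2 FROM inner_q
""").fetchone()
print(row)  # (20, 110)
```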

Then you have the sloppy design of the type system in virtually all SQL dialects. SQL Server doesn't even have a boolean type. There is no type you can declare for `@x` that will let you `set @x = (1 <> 2)`.

And don't even get me started on the GROUP BY clause, the design of which contorts the whole rest of the language. If you GROUP BY some columns, you must not refer to any non-grouped columns in the later parts of your query (refer to my table above for which parts are "later"). Unless, that is, you are referring to them in aggregates. Then the HAVING clause was tacked on so that you'd have a way to do a filter -- the same thing as WHERE -- after the GROUP BY. Does it all make sense once you understand it? Yes, in that you can see how you'd end up with this system if you were adding things piece by piece and never went back to redesign from square one.

Wow, I have a lot of ranting to do about SQL. I feel like I haven't even scratched the surface. And hell, I still pick SQL databases every time I start a project! The damn language is useful enough and the products work great. But it has all the design elegance of the US tax code.

[0] https://github.com/rspeele/Rezoom.SQL [1] https://docs.microsoft.com/en-us/dotnet/api/system.linq.enum...

There is no need for ranting; if you need to reuse an expression from SELECT, you can simply use `LATERAL JOIN` or `CROSS APPLY`: https://blog.jooq.org/2014/01/06/the-sql-languages-most-miss... Please check the comment section of that blog post.

You make some good points about SQL's shortcomings, but some of the claims you make are factually incorrect or seem intentionally misleading. Since you mentioned SQL Server, I have to chime in with some corrections to your statements.

While you can't declare new variables within a SELECT statement, you can modify them with functions:


      SELECT
          @foo = dbo.ComputeSomething(...),
          @foo + @bar AS Val1
Additionally, SQL Server does have a boolean type by another name: bit. It is 1/0/NULL and you can convert the string values TRUE and FALSE into their respective numerical values.

To modify your example to use proper SQL:

  SELECT @x = 'TRUE' WHERE 1 <> 2
Or, to use a more real-world example, you'd probably use a CASE statement:

    SET @x = CASE WHEN 1 <> 2 THEN 1 ELSE 0 END
Yes, it's verbose. But it's possible. (SQL's slogan, probably, and understandably the root of a lot of your remarks.)

Finally, your GROUP BY / HAVING / WHERE example lacks an important distinction. HAVING <> WHERE. HAVING is specifically meant to deal with aggregate functions, while WHERE is a simple filter. You wouldn't use "HAVING PostId = @foo" any more than you might use "WHERE COUNT(*) > @bar". This is why it comes after the GROUP BY clause -- which you should think of as a "key" for the data that's being aggregated.

Look, SQL is not perfect or near perfect. As with any language, it could use improvements. But to criticize a language based on ignorance, misleading statements, or use cases that are not meant for the language is fairly dishonest, in my opinion.

If I use your variable example I get: A SELECT statement that assigns a value to a variable must not be combined with data-retrieval operations.

I'm assuming there is a way to make it work, but then it seems like it'd be kind of a messy imperative-style approach. Would the value of @foo persist between each row's evaluation, so now my query would be capable of (intentionally or not) stateful execution? Would that interfere with the optimizer's ability to re-order evaluation or would it be undefined behavior?

Other than the special-cased string value conversion, SQL Server treats bit as just a very short integer type. You cannot supply a bit value alone to WHERE or CASE; you must use a comparison operator. And likewise, you cannot assign the result of a comparison expression to a bit; you must wrap it in CASE as you have demonstrated. In fact, my own project, in an attempt to squeeze various dialects of SQL into a common type system, fakes booleans in T-SQL by inserting CASE expressions and (<> 0) comparisons as needed. See example:


If a normal programming language had special places where boolean expressions could appear (like in if and while loops), and boolean expressions did not produce values in and of themselves, it would stand out as being extremely hacky.

Of course that is a complaint only for T-SQL, other dialects do have true bools. But it is emblematic of design undertaken without effort to unify fundamentals (The question not asked by T-SQL's designers: "What if predicate expressions were just like any other expression, and had a value?") and that is what I find distasteful about SQL in general.
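In those other dialects the point is easy to see: a comparison is an ordinary expression with a value. SQLite, for instance, happily evaluates one on its own:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# In SQLite (and, with a true boolean type, Postgres) a comparison is
# just an expression -- it can be selected directly, no CASE required.
t_val = con.execute("SELECT 1 <> 2").fetchone()[0]
f_val = con.execute("SELECT 1 <> 1").fetchone()[0]
print(t_val, f_val)  # 1 0
```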

HAVING is another example of that lazy, slap-more-syntax-on design. As you correctly point out, HAVING evaluates after the GROUP BY, WHERE before, and that's why HAVING can refer to aggregate results. But if "WHERE" was just an operator it could be used in the order it was needed without hassle. In LINQ, you write:

    .Where(u => u.AccountIsActivated)
    .GroupBy(u => u.Name)
    .Select(g => new { Name = g.Key, NumberOfUsersWithName = g.Count() })
    .Where(g => g.NumberOfUsersWithName > 1)
    .Select(g => $"The name {g.Name} is shared by {g.NumberOfUsersWithName} users")
The .Where before the grouping is the same function as the .Where after it. No need for a special case.

Of course you could do the same thing in SQL without HAVING using a subquery:

    select 'The name ' + sq.Name + ' is shared by ' + cast(sq.NumberOfUsersWithName as varchar) + ' users'
    from
        (select u.Name, count(*) as NumberOfUsersWithName
         from Users u
         where u.AccountIsActivated = 1
         group by u.Name) sq
    where sq.NumberOfUsersWithName > 1
But it's ugly, so they threw in HAVING and that lets you do it all in one query.

    select 'The name ' + u.Name + ' is shared by ' + cast(count(*) as varchar) + ' users'
    from Users u
    where u.AccountIsActivated = 1
    group by u.Name
    having count(*) > 1
Understandable choice, because subqueries start to get unreadable fast. I believe this is due to the middle-out (select outer-stuff from inner-queries where more-outer-stuff) syntax, which gets hard to follow when nested more than a level or two. I find the front-to-back pipeline approach of LINQ or F#'s Seq module far easier to track as data gets filtered and transformed in the same sequence that the code reads.
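For a runnable check, here is the single-query HAVING version ported to SQLite (`||` replaces T-SQL's `+` for concatenation; the sample rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Users (Name TEXT, AccountIsActivated INTEGER)")
con.executemany("INSERT INTO Users VALUES (?, ?)",
                [("ann", 1), ("ann", 1), ("bo", 1), ("cy", 0), ("cy", 1)])

# WHERE filters rows before grouping; HAVING filters groups after.
rows = con.execute("""
    SELECT 'The name ' || Name || ' is shared by ' || COUNT(*) || ' users'
    FROM Users
    WHERE AccountIsActivated = 1
    GROUP BY Name
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [('The name ann is shared by 2 users',)]
```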

Let's go back to my needlessly cruel and exaggerated example of "SQL math", in which all expressions are written as a fill-in-the-clauses COMPUTE statement with each operation being its own optional clause, in a fixed order. Suppose that it was decided that it's helpful to be able to MULTIPLY after adding, which normally comes _before_ multiplying in the fixed order, but people were sick of writing:

So they came up with a new keyword for multiplying that goes _after_ the addition step. They'd use a different word so you could tell the operators apart, but it'd basically be a synonym.

That's how HAVING feels to me. It's a band-aid that wouldn't be necessary or even desired if WHERE was a standalone operator on table types, just as our hypothetical PRODUCT WITH clause wouldn't be desired if we had a standalone * operator.

I get that SQL works and is extremely useful. Remember, I spent quite a bit of time implementing a library for working with it. But there are languages that make you think things like, "Oh, that's simpler than I thought. I can use this kind of construct to iterate over anything that implements the right interface. That was smart of the language designers". Then there are languages that make you think, "Huh. Another special case. Couldn't they have thought this through a bit more carefully before speccing out the language?".

SQL feels like the latter and yet I see people call it a beautiful language or their favorite language and I get a strong "stop liking what I don't like" impulse.

I don't see views and subqueries as clumsy. Especially with sets, you think in creating new sets, and combining those into other sets.

In postgres i've created financial year reports, with monthly summaries per category just by having a few layers of views.

I think it's really valuable i can think in logical sets, and the database will takes all those layers of views and combine those into one optimized query plan.

But views and functions still require you to persist those objects in the database first. There's no such thing as a query "variable" that you can then re-use in multiple subsequent statements. Of course, you can use table variables or temporary tables to hold intermediate data, but those are eagerly evaluated, whereas functions, views and CTEs are lazily evaluated, and that allows for a massive performance boost (due to optimization).

I can see the appeal of such a construct in SQL. The requirement that reusable objects are persisted in the database requires a top-down design approach, and that doesn't really blend well with modern coding practices.

There is no limit to the reuse of views. You can create as many intermediate or base views as you like. Within a view definition, CTEs have local scope and are similar to variables; if you need schema scope, just define the query block as a view instead. UDFs are simple to create, are lazy, and can be easily reused and composed. In Postgres they are first-class objects: scalar UDFs can be called anywhere in SELECT, and set-returning UDFs in the FROM clause. Use more than one and they are also automatically lateral-joined, which is incredibly powerful and composable when you get your head around them.

No, CTEs do not have local scope, at least not like variables. CTEs are statement-scoped; no matter how many SQL statements you have in your current scope, your CTE vanishes after the first statement terminates.

This is what glue languages do with data. Not what SQL does with data. Why ever would you want it to?

> SQL is troublesome to some programming types because it seems alien to ask what you want instead of telling the computer what to do and I find most programmers, especially ASD-types (who I think have an edge for some situations, like writing certain code in a huge org like Google) find this an unfamiliar and strange way of thinking.

That's basically how most tools work outside of software development. When you use Google, you aren't telling it how to find your result; you're telling it what you want.

Meh, not quite. We've gotten so accustomed to phrasing our requests in search-engine-friendly terms that we don't realize we're doing it (e.g. cutting out all the small words).

You’re definitely right that SQL is alien to many procedural / object oriented / functional / etc developers.

Even I, who have worked with Oracle writing PL/SQL (which is an abomination, IMO) in past jobs, still have to sit down and get into the right mindset before writing efficient SQL for more complex queries.

I think SQL is entrenched by network effects. It has a unique conceptual paradigm, which is why it is hard to learn. But any replacement is also likely to be conceptually strange, and no one is going to put in the time to learn something weird unless it has the adoption of SQL. Thus we are stuck on a local minimum with SQL. (A pretty good one, though.)

It is not hard to learn - non programmers use it all the time.

Conceptually, SQL is much simpler than programming; it basically reads like English:

  SELECT customer, SUM(total)
  FROM orders
  WHERE created BETWEEN '2018-01-01' AND '2018-12-31'
  GROUP BY customer
Compare that to the programming necessary to implement the above:

  totals = {}
  for row in rows:
      if row.created > '2018-12-31':
          continue
      if row.created < '2018-01-01':
          continue
      if row.customer not in totals:
          totals[row.customer] = 0
      totals[row.customer] = totals[row.customer] + row.total
  for customer, total in sorted(totals.items(), key=lambda item: -item[1]):
      print(customer, total)
Not to mention the SQL version gets first hand knowledge on available indexes in order to speed up the query.

Now imagine adding an AVG(total) to both the SQL and programming version...
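In the SQL version, adding the average is a two-word change. Running it through the stdlib sqlite3 module with some made-up sample rows:

```python
import sqlite3
from collections import namedtuple

Row = namedtuple("Row", "customer created total")
data = [
    Row("acme", "2018-03-01", 100),
    Row("acme", "2018-06-01", 50),
    Row("wile", "2018-07-15", 75),
    Row("wile", "2019-01-02", 999),  # outside the date range
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, created TEXT, total REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", data)

# Adding AVG(total) next to SUM(total) is the whole change.
result = con.execute("""
    SELECT customer, SUM(total), AVG(total)
    FROM orders
    WHERE created BETWEEN '2018-01-01' AND '2018-12-31'
    GROUP BY customer
""").fetchall()
print(sorted(result))  # [('acme', 150.0, 75.0), ('wile', 75.0, 75.0)]
```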

I don't think it's non-programmers that keep SQL around; it's that there hasn't been a replacement for SQL that is better enough to replace it. I think C suffers from the same problem.

I find cypher to be a superior language to SQL. It expresses relationships more succinctly and intuitively, and has better support for aggregations.

I also find datalog superior.

I don't disagree. I just don't think we've figured out the next step. Other paradigms (MapReduce and graph databases are perfect examples) have introduced very interesting and clever ways to querying data, but nothing has replaced SQL. If anything, it seems like it will be additive on top of what SQL has already done.

I know a lot about data, but I can't solve that problem. To anyone out there much smarter than me: this is a real problem. If you solve it, you'll cement yourself in the history of computer science.

I have no doubt it will get done, though.

I think graph databases will, with time, eventually eat into some of the RDBMS market (used for the same purposes), but they haven't had the time to mature like SQL/RDBMSs.

I’m dubious. Postgres added OO features when that was the fashion, then XML, then JSON. To my mind, graphs are still relations, just indirect ones. I use graphs a lot within Postgres with no problem; it has native recursive CTEs, which are surprisingly fast for most graph queries. For very large graphs I cache the graph with an automatically generated transitive closure table or materialized view. With Postgres you can use UNION (distinct) inside the recursive CTE to prevent cycles, and it's remarkably fast: millions of rows in under a minute.
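The recursive-CTE graph traversal described here also works in SQLite, which makes it easy to sketch (edge table and data invented). Using UNION rather than UNION ALL de-duplicates visited nodes, so the query terminates despite the cycle:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("a", "b"), ("b", "c"), ("c", "a"), ("b", "d")])

# Reachability from 'a'; the a->b->c->a cycle doesn't loop forever
# because UNION discards rows already in the working set.
reachable = con.execute("""
    WITH RECURSIVE r(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN r ON e.src = r.node
    )
    SELECT node FROM r ORDER BY node
""").fetchall()
print(reachable)  # [('a',), ('b',), ('c',), ('d',)]
```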

I agree with you and the parent comment. I run a data science department and we rely on SQL. We even embed SQL queries into R scripts to pull the latest data. It isn't worth working with ORM for an analytics project when a few dozen lines of SQL can get you everything you need.

SQL has many limitations, but at its core it's a beautiful and very understandable language, very close to human language in the way it is expressed.

It’s not surprising librarians use SQL. Library science is all about organizing (informational) data!

I mean, a library is basically a B+-tree ;)

So it's protectionism?

It's way easier to teach someone to code who is already really good at manipulating/modeling data than the other way around, IMO.

> On the other end of the spectrum are the influx of "data engineers" with basic to intermediate knowledge of HDFS, streaming data or various other NoSQL technologies. They often know less about raw SQL than even junior engineers because SQL is below their high-power big data tools.

I always ask a basic SQL question (involving sorting/grouping and one simple join) in my interviews for backend and/or data. IMO if you are calling yourself a Data Engineer and do not know SQL, then you haven't really worked with data, you've mostly just worked on data infrastructure.

It's mandatory. It's like saying you understand calculus but not being able to do algebra.

I interview a lot of data engineers/analytics engineers and I always start with SQL. If you can't grasp SQL, you're in for a very bad time.

(Please excuse my typing; I have one hand to use ATM.) Yup, we bet super hard on SQL too. It's fantastic if you know what you're doing. We can import arbitrary data and expose it as normal tables to our customers for analysis/transform/export, along with having a siloed, access-controlled place for them to see just their data. SQL is a great technology, and is extremely flexible.

I love it. If you're looking to learn it and fundamentals, check out Jennifer Widom's course from Harvard. I can safely credit her w/ my career.

Presumably GP took this as a MOOC... some people are pretty blitzed in college but it would be hard to mistake California for Massachusetts!

Certain subsets of the developer community are usually excellent with SQL. It's especially evident with people who have been working with C# for > 10 years because Microsoft have always heavily pushed SQL Server with their developers.

There are plenty of use-cases for SQL with developers, especially in batch processes such as invoicing. A well-crafted SQL query can execute exponentially faster than iterative code and takes a lot less time to implement.

> we always seem to try to re-create SQL

This tells us that people think SQL is good, but it's unfortunate that this is attempted, because the good thing about SQL is how well it corresponds to its data structures. Attempting to imitate SQL's form, rather than its principle of correspondence, has been counter-productive and prevented the reproduction of its merits. Case in point: Cypher.

SQL is so long-lasting because it's math, sort of. It's a language for expressing data relationships in a very systematic, logical, and set-theoretic way, not a random ad hoc hack.

Its syntax is a bit old fashioned and I do think efforts to make it more native to programming environments rather than a bolt on might be fruitful, but its concepts are timeless.

Can I ask a simple question about efficiency?

It seems to me that graph databases are far more efficient than relational ones for most tasks.

That’s because all lookups are O(1) instead of O(log N). That adds up. Also, copying a subgraph is far easier, and so is joining.

Think about it, when you shard you are essentially approaching graph databases because your hash or range by which you find your shard is basically the “pointer” half of the way there. And then you do a bunch of O(log N) lookups.

Also, data locality and caching would be better in graph databases. Especially when you have stuff distributed around the world, you aren’t going to want to do relational index lookups. You shard — in the limit you have a graph database. Why not just use one from the start?

So it seems to me that something like Neo4J and Cypher would have taken off and beat SQL. Why is it hardly heard of, and NoSQL is instead relegated to documents without schemas?

Chasing pointers (on the same medium) is usually slower than the access patterns that databases usually use.

First, a database can have hash indexes instead of btree indexes, so lookups can be O(1) too, but it turns out that btrees are often better because they can return range results efficiently, and finding the range in a btree is only logarithmic for the first lookup. If your index is clustered - if it covers the fields needed downstream in the plan - no further indirections are needed and locality is far better than a hash lookup. For example, a filter could be efficiently evaluated along with the index scan. And if your final results need to be sorted, major bonus if the index also covers this.
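A toy illustration of the btree-vs-hash trade-off, using a sorted Python list as a stand-in for a btree index: one logarithmic search finds the start of a range, then results come out of sequential memory, whereas a hash map must test every key because hashing destroys ordering.

```python
from bisect import bisect_left, bisect_right

# A sorted array standing in for a btree index.
keys = [2, 3, 5, 8, 13, 21, 34, 55]

def range_query(lo, hi):
    # One O(log n) search per endpoint, then a sequential slice.
    return keys[bisect_left(keys, lo):bisect_right(keys, hi)]

r = range_query(5, 34)
print(r)  # [5, 8, 13, 21, 34]

# A hash map gives O(1) point lookups, but a range query has to
# touch every key.
hashed = {k: True for k in keys}
hr = sorted(k for k in hashed if 5 <= k <= 34)
print(hr)  # [5, 8, 13, 21, 34]
```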

Second, it's best to think of databases doing operations on batches of data at a time. Depending on your database planner, different queries will tend to result in more row-based lookups (MySQL generally does everything that isn't a derived table, uncorrelated subquery or a filesort in a nested loop fashion) but others build lookups for more efficiency (a derived table in MySQL, or a hash join in Postgres). The flexibility to mix and match hash lookups with scans and sequential access - which are usually a few orders of magnitude faster than iterated indirection - means it can outperform graph traversal that is constantly getting cache misses at every level of the hierarchy.

The reason NoSQL and document databases suck from a relational standpoint is that they are bad at joins, and work better if you denormalize your joins and embed child structures in your rows and documents. Add decent indexes to NoSQL and they gradually turn back into relational databases - indexes are a denormalization of your data that is automatically handled by the database, but have nonlocality and write amplification consequences which can slow things down.

In terms of distribution, a relational plan can be parallelized and distributed somewhat easily if your data is also replicated - sometimes join results need to be redundantly recalculated or shuffled / copied around - and most analytics databases use this approach, though usually with column stores rather than row stores, again because scanning sequential access is so much faster than indirection. Joins don't always distribute well, is the main catch.

Depends a lot on the actual access patterns of your data.

Many recent web & mobile apps have a lot of screens where you just want to grab one blob of heterogenous data and format it with the UI toolkit of choice. Or if they do display multiple results, it's O(10) rather than O(1000) or O(1M). Chasing pointers is fine for use-cases like this, because you do it once and you have all the information you're looking for.

This is also behind the recent popularity of key/value stores and document databases. If all you need is a key/value lookup, well, just do a key/value lookup and don't pay the overhead of query parsing, query planning, predicate matching, joins, etc. When I was working on Google Search > 50% of features could get by with read-only datasets that supported only key/value lookup. You don't need a database for that, just binary search or hash indexes over a big file.

Oh I agree. Though I think having the rest of a relational DB is really nice should you need more than a simple key-value lookup, and not have to cobble it together through application-side joins and denormalization.

The application I work on in my day job does not match the key/value lookup idiom at all. User-defined sorts and filters over user-defined schema, and mass automated operations over data matching certain criteria. If you squint a bit, the app even looks a bit like a database in terms of user actions.

And even relational databases (at least row-oriented with primarily disk storage) have their limit here. With increasing volumes of data, it can't keep up. We can't index all the columns, and indexes can't span multiple tables. We increasingly need more denormalization solutions that convert hotter bits of data into e.g. in-memory caches that are faster for ad-hoc sorts and filters. Database first is a decent place to start, though having a first-class event feed for updates would certainly be nice...

Depends a lot on the actual access patterns of your data.

Yup. Thing is: with RDBMSs you are in control of both the storage patterns and the access patterns of your data. That's where a large part of the performance benefit comes from.

> 50% of features could get by with read-only datasets that supported only key/value lookup

Did you implement a storage pattern that was ordered by key (or hash(key) if you used hashing)?

"Did you implement a storage pattern that was ordered by key (or hash(key) if you used hashing)?"

Well, I didn't implement it, Jeff Dean did, but yes, that's what we used for a lot of features.



A hash index is O(1) no matter what language it's called from. The reason most people don't use hash indices very often is that they do not allow you to do a lot of useful things that you often end up wanting to do, such as retrieving all values within a given range.

I still don’t think it’s O(1) in the limit. Why not just store pointers to the data, or one indirection if the data can be moved?

Graph databases are great for retrieving existing data with associated data.

The power of SQL comes from the fact that you can easily create new information out of the data: create new sets, group by certain features, aggregates on certain features.

It's a lot more powerful than just store and retrieve.

People forget that SQL isn't just a query language (DQL); it's also a definition (DDL), manipulation (DML), control (DCL), and transaction control (TCL) language [0].

Check out the platform that a full-blown Oracle license can provide to your DBA... It's definitely more than just CRUD.


[0] https://www.oreilly.com/library/view/discovering-sql-a/97811...

N4J's a lot slower at tons of common tasks. If you look into its underlying data model, it's clear why. It achieves the speed it does at certain graph-traversal operations by storing its data in such a way that it's highly specialized for those operations—one would expect this to come at a high cost for other operations, and sure enough, it does.

It also doesn't bother with tons of consistency guarantees and such that you (may) get from, say, PostgreSQL. Yet is still slower for many purposes.

In practice enterprise applications do a lot of relational operations.

Graph DBs might be valuable for a lot of problems, and it does feel like something like Neo4J would make a lot of sense for stuff like social networks, but for business records stuff isn't really that spread out

Although if you want versioning to the attribute level, you'd want something that spread out, see Datomic

> SQL is one the most amazing concepts I've ever experienced. ...and we always seem to try to re-create SQL in those technologies (e.g. Hive, Presto, KSQL, etc).

It's not SQL that's the concept. The concept there is set theory/intersection/union and predicates.

That's why you think you are "recreating" those, because they can be mapped using the same concept

SQL is only one way of expressing that concept.

When people say "SQL" they often refer to a bunch of concepts, some better than others, all bundled together.

SQL gets you tables and joins, sure. But it also gets you queries that (some) non-programmers can write, outputs that are always a table, a variety of GUI programs to compose queries and display the results, analytic functions to do things like find percentiles, and tools to connect from MS Excel. And it often means you get transactions, ACID compliance, compound indexes, explicit table schemas, multiple join types, arbitrary-precision decimal data types, constraints, views, and a statistics-based query optimiser.

Of course it also gets you weird null handling, stored procedures, and triggers.

I blame the NoSQL movement for a lot of the lack of understanding in recent years. That was a step backwards.

The people behind the NoSQL movement understand SQL better than anyone glorifying SQL here. In fact, I think this is a testament to how poorly most people actually understand SQL: they can't even see the basic, on-the-surface problems with it.

All I heard from NoSQL proponents was that "relational databases don't scale". These same people were usually dealing with data sets that would fit easily onto a single server.

No, I'm pretty sure you only heard it from your fellow RDBMS friends. I have never argued for scale myself, but plenty of times tried to debunk this exact stereotype.


I think there are people who went with NoSQL because it fit their specific needs better than a regular RDBMS, and people who went NoSQL because they found it easier to work with than a RDBMS (at first).

The latter group often didn't want to learn SQL and went with NoSQL's as a shortcut, not because they had an intimate understanding of the tradeoffs between the approaches and decided NoSQL was the better option.

There was the third group of resume driven developers who bought into the MongoDB hype around 6 or 7 years ago, which probably overlapped with the second group a bit.

Just because it is resilient, does not mean it is good.

'Being great' usually helps a product gain traction. When SQL came about, it was very useful because hey 'I can query data!'.

But it's usually other reasons that drive incumbency.

SQLite is a testament to the power of SQL, it's a little wonder itself, and it's likely the most installed database globally.

Can you recommend good resources for engineers to up their skills with SQL that'll provide the understanding you discuss?

I am not the OP but I would suggest anyone who wants to get better at SQL to read this book:


I disagree: you can't use SQL for Google search.

No engineer has ever made this possible.

We have CommonCrawl data, but you can't run SQL queries on that data; this makes SQL useless.

When smart people have figured out how to do that, come back claiming SQL is important.

Lisbeth Salander sure can.

> It's nearly 5 decades old

Five-decade-old SQL is nothing like modern SQL, with tons of proprietary extensions: partitions, windows, collations, typecasts, JSON, and god knows what else. Your examples "Hive, Presto, KSQL, etc." are proof of this; they are so vastly different from each other that you cannot simply learn "SQL" and expect to use those tools in any serious manner.

This is precisely the proof of opposite that sql has not stood the test of time.

How is this even close to sql of 5 decades old,

CREATE STREAM pageviews (viewtime BIGINT, user_id VARCHAR, page_id VARCHAR) WITH (VALUE_FORMAT = 'JSON', KAFKA_TOPIC = 'my-pageviews-topic');

or this

CREATE STREAM pageviews (viewtime BIGINT, user_id VARCHAR, page_id VARCHAR, `Properties` VARCHAR) WITH (VALUE_FORMAT = 'JSON', KAFKA_TOPIC = 'my-pageviews-topic');

Anything that looks vaguely like a weirdly formed English sentence is SQL?

I understood when OP said the "concept of SQL" that they referred to a pretty broad idea of querying tabular data in rows and columns with some structured language.

I find it hard to accept that people are talking about 'querying tabular data in rows and columns with some structured language' when they talk about SQL. They are surely talking about the specific syntax.


And then, when working in a legacy code base, you run into a dark corner inhabited by pre-ANSI joins and get eaten by a grue.

Great, maybe you can ask the parent poster, but the following are now factual statements:

1. You believe in your interpretation. This is true.

2. The parent poster has their interpretation. This is true.

3. The statement "They are surely talking about the specific syntax" is true if and only if their interpretation is the same as your interpretation.

4. This is a semantic argument now.

> 4. This is a semantic argument now.

What about wikipedia entry that lists a specific syntax?


It would be insane to expect somebody to arrive at a perfect, feature-complete working version of anything on the first try.

That's true for any piece of software. SQL is not done yet; the fact that you can still add neat little features to it is proof of how well it was designed.

Heck we still have Lisp evolving today. A lot of things got done right in the early days.

> Learning SQL once will allow you to re-use it heavily across your career span without having to re-learn.

Like all good abstractions, SQL is the practical expression of a mathematical theory. In the case of SQL, you use Zermelo-Fraenkel (ZF) set theory to reason about data sets.

While it is easy to come up with merely conjectural implementations for haphazardly doing things -- arbitrary trial and error, really -- it is notoriously hard to develop a systematic axiomatization such as ZF. Mathematics does not change and is hard to defeat. That is why math-based knowledge and skills are able to span an entire career without having to re-learn.

SQL got kicked off in all earnest when David Childs's March 1968 paper, "Description of A Set-Theoretic Data Structure", explained that programmers can query data using set-theoretic expressions instead of navigating through fixed structures. In August 1968, Childs published "Feasibility of a Set-Theoretic Data Structure. A General Structure Based on a Reconstituted Set-Theoretic Definition for Relations".

Data independence, by relying on set-theoretical constructs, was explicitly called out as one of the major objectives of the relational model by Ted Codd in 1970 in his famous paper "A Relational Model of Data for Large Shared Data Banks" (Communications of the ACM 13, No. 6, June 1970).

Using Codd's work, Donald Chamberlin and Raymond Boyce at IBM developed in the early 1970s what would later become SQL.

Relational algebra is a much closer foundation for SQL than ZFC (or plain ZF, because for finite sets the difference doesn't matter). You can say SQL has ZFC as its foundation but that's not very useful. In the same way you can say ZFC is the foundation of everything in mathematics. SQL doesn't even use anything beyond naïve set theory because we are dealing with finite sets, let alone the full power of ZF.

> In the same way you can say ZFC is the foundation of everything in mathematics.

Yes, technically, that is indeed true of every Turing-complete axiomatic theory.

You could conceivably even use the SKI combinator calculus as the foundation of everything in mathematics. Still, something tells me that a database query language in widespread use, based on the SKI combinator calculus is rather unlikely to emerge (certainly not at this point).

Hence, since ZFC is one of the many alternative, equivalent foundations of everything in mathematics, picking ZFC still amounts to making a choice, no matter how small ...

> SQL doesn't even use anything beyond naïve set theory ...

Agreed. However, we also have to consider what situation we came from. David Childs objected to the use of custom, application-specific code, data types and functions, because these custom structures could (more easily) be represented by one, general abstraction. Given the fact that back then, people were not even using naive set theory, his considerations were clearly a step forward in the right direction.

Just to say there is "Project:M36 Relational Algebra Engine" [1], which says about itself:


Project:M36 implements a relational algebra engine as inspired by the writings of Chris Date.

Unlike most database management systems (DBMS), Project:M36 is opinionated software which adheres strictly to the mathematics of the relational algebra. The purpose of this adherence is to prove that software which implements mathematically-sound design principles reaps benefits in the form of code clarity, consistency, performance, and future-proofing.

Project:M36 can be used as an in-process or remote DBMS.

1: https://github.com/agentm/project-m36

Have you written more things I can read?

His/her comment history is fairly interesting. Plenty of seemingly needlessly downvoted comments, which to me is a marker of a perceptive mind.

My first job out of university was on an analytics team at a consulting firm (big enough that you know them) that used MS SQL Server for absolutely everything.

Data cleaning? SQL. Feature engineering? SQL.

Pipelines of stored procedures, stored in other stored procedures. Some of these procedures were so convoluted that they outputted tables with over 700 features, and had queries that were hundreds of lines long.

Every input, stored procedure, and output timestamped, so a change to one script involved changing every procedure downstream of it. My cries to use git were unheeded (would have required upskilling everyone on the team).

It was probably the worst year of my life. By the end of it I built a framework in T-SQL that would generate T-SQL scripts. In the final week of a project (which had been consistent 60-70 hour weeks), the partner on the project came in, saw the procedures written in my framework and demanded that they all be converted back into raw SQL. I moved teams a few weeks later.

The only good bit looking at it, is that now I'm REALLY good with SQL. It's incredibly powerful stuff, and more devs should work on it.

That happens in the wild, unfortunately; I've seen some SQL Frankensteins here and there. But generally, if done properly, SQL is the easiest thing to grok and to understand the domain from. I was introduced to SQL circa 2000 (by my dad) and I've been heavily using it since.

The bulk of my SQL experience was in MS SQL. Recently I joined an Oracle shop, and the dialect is a bit different, though it didn't take too long to become productive in. However, if you ask me, Oracle feels clunky and has way too many features; let's say it's not my cup of tea...

The only question that remains is--Accenture or KPMG?

It could have been Deloitte, or as I convinced most of the contractors on several projects to pronounce it, "Delouche".

Both incredibly close, but not right on the money.

You can do some insanely complicated stuff in a stored procedure. A favourite was versioning and updating a set of data across multiple tables as an atomic transaction.

Also MS SQL let you throw a data object (say XML) at a stored procedure, convert it to a table structure with some XPath (another useful black art like regex) and use that as the input to a single INSERT.
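The same shape of trick works outside T-SQL, too. A minimal sketch of the pattern using only Python's stdlib (the element and column names here are invented for illustration): shred an XML document into row tuples, then hand them to the database as one batched INSERT instead of looping.

```python
import sqlite3
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1" total="9.99"/>
  <order id="2" total="24.50"/>
</orders>
"""

# XPath-style extraction: turn the XML document into a list of row tuples.
rows = [(int(o.get("id")), float(o.get("total")))
        for o in ET.fromstring(doc).findall("order")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
# One batched INSERT rather than one round trip per row.
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
```

In MS SQL the shredding step happens inside the engine itself, which is what makes it feel like a black art.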

>You can do some insanely complicated stuff in a stored procedure.

You’re giving me flashbacks to stored procs that directly invoke Java methods, and batch scripts that load client-supplied data from an FTP server into external tables.

> You’re giving me flashbacks to stored procs that directly invoke java methods

A legacy application I once worked at used this extensively, and good grief it gave me nightmares.

How anyone ever thought this is a good idea to use is beyond me; that stuff is not maintainable at all.

Yeah, the business logic was all over the place. Everything needed to be reverse engineered any time there was a problem, or if you wanted to make any changes.

To make it worse, the app had about 300 scheduled tasks that were a mixture of batch files, SQL scripts and java classes. None of which had source code, most of which were slightly different and essentially operated as mysterious black boxes. We ended up having a copy of JD-GUI on all the production servers because we needed to decompile for debugging so often.

xp_cmdshell always seemed like a really bad idea.

The sql server agent was easy and reliable for scheduling so it made sense to let it run the whole process.

There's no reason you can't store your sql/tsql/plsql in version control.

We were doing this 20+ years ago, all code was in csv (we upgraded from rcs to csv), and we had a productized distributed scheduling system that would deploy all the sql scripts every night on a number of oracle databases running from aix, to solaris, to vms, to hpux, to irix, and later linux and windows NT.

Similar like you would now use jenkins to build, deploy and test your java apps.

Developers would never touch the test/production databases, only commit sql to csv. Develop local, sql text files, test on a local or shared development database, and then commit to csv.

There's no reason you can't - but there's limited tooling support and limited worker mindshare for that kind of approach.

What tooling do you need? They're just source code files that you can edit with a code editor or an IDE (DataGrip). In addition you need a file to deploy your sql to a database, this can be a sql or shell script, or a gradle file or even make.

> What tooling do you need?

Jump-to-definition when something calls something else. Unit tests. Atomic build/install.

> this can be a sql or shell script, or a gradle file or even make.

Precisely the problem. There's too many different ways to do it, and no consistency.

Atomic build/install - well, you get transactions OOTB. Unit tests - In Microsoft world, 1st class citizen: https://docs.microsoft.com/en-us/sql/ssdt/verifying-database...

Debugging leaves more to wish, but still available in some limited form.

umm. That's cvs. csv is something else, and it matters if you're a datashoveler.

Oops, you're right. it's been a long time. Csv has aged better than cvs.

> My cries to use git were unheeded (would have required upskilling everyone on the team).

What is the best practice workflow using git with SQL server views/procedures? Can you actually somehow track changes in the views/procs themselves so that if someone happens to run ALTER VIEW, git diff is going to show something?

You can version your scripts & DB as they get pushed with a common tracking number. Personally I think it's best to establish a workflow where manual changes to the DB would be pointless and likely to result in them being overwritten.

Outside of commercial tools dedicated to the purpose: you can query the DB for the content of the procedures/tables and compare them to a given set of scripts, or the most recent expected version. Auto-generated ORM models can be used to validate table/view composition for a given DB/App version, as well. Having these capacities baked into the versioning and upgrade process can do a lot over time to correct deviating schemas and train developers away from meddling with DBs outside the normal update procedure :)

It's mostly about using git to ensure a consistent snapshot as changes are made rather than updating one procedure at a time.

If you are asking about SQL text then a good formatter would make it easy to see the physical diffs. Otherwise there are various vendors and tools that parse SQL and show the logical diffs in queries and schemas, along with doing deployments, backups, syncing changes, etc.

Every conversation I've ever had came back to using RedGate SQL Compare to diff databases and TeamCity for CI.

You basically shouldn't allow anyone to modify anything without it being scripted (bonus points if it comes with a rollback and is repeatable for testing). Your scripts then all go into Git.

RedGate's SQL Change Automation (formerly ReadyRoll) is pretty slick. DbUp is a good, free alternative.

I second the rollback and repeatability bonus. Every script should leave the database in either the new state or the previous good state no matter how many times it's run.
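One common idiom for that guarantee, sketched here with sqlite3 (table and column names invented): wrap the change in a transaction so it rolls back entirely on failure, and guard each step so re-running it is a no-op.

```python
import sqlite3

def migrate(conn):
    # Idempotent migration: safe to run any number of times.
    with conn:  # opens a transaction; rolls back on any exception
        conn.execute(
            "CREATE TABLE IF NOT EXISTS users "
            "(id INTEGER PRIMARY KEY, name TEXT)")
        # Only add the column if it isn't there yet.
        cols = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
        if "email" not in cols:
            conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # second run leaves the schema unchanged
```

The guards ("IF NOT EXISTS", checking the catalog before ALTER) are what buy repeatability; the transaction is what buys "new state or previous good state".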

> Every script should leave the database in either the new state or the previous good state no matter how many times it's run.

I wonder if there were somewhere a website to describe good idioms to achieve this?

Where I've seen this process work, testing for repeatability was part of the peer review process where any sql scripts to be reviewed were executed.

Having a second person run a script is the best way to ensure no mistakes.

I spent a year in a role where 50% of my duties was writing SQL reports. These reports were usually between 500 and 1000 lines of SQL a pop. Sometimes the runtime of a report was measured in hours, so learning to write efficient SQL was important. The company had a lot of people who had been writing SQL for a while, and there were lots of cool code snippets floating around. I learned a lot in that year.

I've moved to writing backend code. I'm surprised most of my peers cannot write anything more complicated than a join. Most people are perfectly happy to let the ORM do all the work, and never care to dig into the data directly. Every once in a while my SQL skills save the day, and several people in other departments contact me directly when they need Excel files of data in our database that we don't have UIs to pull yet.

Once your SQL gets into 500-1000 lines, and hours of runtime, I would suggest using data frames instead (in R or Python).

I wrote this post to introduce the idea:

What Is a Data Frame? (In Python, R, and SQL) https://www.oilshell.org/blog/2018/11/30.html

It's often useful to treat SQL as an extraction/filtering language, and then use R or Pandas as a computation/reporting language.

I think of it as separating I/O and computation. SQL does enough filtering to cut the data down to a reasonable size, and maybe some preliminary logic. And then you compute on that smaller data set in R or Pandas -- iterating in SECONDS instead of hours. The code will likely be shorter as well, so it's a win-win (see examples in my blog post).
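A minimal sketch of that division of labor, with sqlite3 standing in for the warehouse (the table and column names are made up): SQL does the I/O-heavy filtering, the application language does the statistic that's awkward in SQL.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, latency_ms REAL, country TEXT);
    INSERT INTO events VALUES
        (1, 120.0, 'US'), (2, 340.5, 'US'), (3, 95.2, 'DE'), (4, 410.0, 'US');
""")

# SQL handles the I/O: filter down to only the rows we care about.
latencies = [x for (x,) in conn.execute(
    "SELECT latency_ms FROM events WHERE country = 'US'")]

# The host language handles the computation SQL is awkward at.
median = statistics.median(latencies)
```

The filtered result set is small, so the iterate-and-compute step runs in seconds even when the underlying table is large.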

I can't think of many situations where hours of runtime is "reasonable" for an SQL query. In 2 hours you could probably do a linear scan over every table in most production databases 10-100 times.

For example, if your database is 10 GB, you could cat all of its files in less than 5 minutes (probably much less on a modern SSD). In 2 hours, you can do a five minute operation 24 times. I can't think of many reports that should take longer than 24 full passes over all the data in the database (i.e. pretending that you're not bothering to use indices in the most basic way). If it takes longer than that, the joins should be expressible with something orders of magnitude more efficient.

I've mainly worked with big data frameworks, but I think that almost any SQL database (sqlite, MySQL, Postgres) should be able to do a scan of a single table with some trivial predicates within 10x the speed of 'cat' (should be within 2x really). They can probably do better on realistic workloads because of caching.

This isn't meant to offend, rather as a point of consideration, but seeing your example use case being 10GB and then talk about big data frameworks makes it hard for me to take this advice seriously.

I might reach for that kind of tooling at the hundreds of TB to PB scale, but in our production applications we have _tables_ that are multiple terabytes. SQL is just fine.

Yes, we also have queries that run on the timescale of hours, and they are always of the reporting/scheduled-task variety and absolutely vital to our customers. Long-running reporting queries are pretty acceptable (and pretty much the norm since forever) outside of the tech industry, and your customers won't balk at them.

It seems like you misunderstood what I wrote. I'm saying you should consider using R or Python if your reports are taking a long time, not big data frameworks.

Big data was a reference to thinking about the problem in terms of the speed of the hardware. If it's 1000x slower than what the hardware can do, that's a sign you're using the wrong tool for the job.

Getting within 10x is reasonable, but not 100x or 1000x, which is VERY COMMON in my experience. These two situations are very common:

1) SQL queries that are orders of magnitude slower than a simple offline computation in Python or R (let alone C++). The underlying cause is usually due to bad query planning of joins / lack of indices.

You might not have the ability to add indices easily, and even if you did, that has its drawbacks for one-off queries.

2) You need to do some computation that's awkward inside SQL. Statistics beyond basic aggregations, iterative computations (loops), and tree structures are common problems.

ETL pipelines that hop between different kinds of storage/computing platforms to exploit local maxima, like you're pointing out with R and SQL working in concert, is pretty common in companies working their way up to BigData and academia.

From the Enterprise side, I think too many developers have unfounded expectations around data storage technology. There's this unchallenged belief in a monolithic data store that will solve their problems across the entire time/storage/complexity spectrum. By bringing multiple tools to bear instead, you end up with more purpose-built storage but far less domain impedance.

Slapping a denormalized NoSQL front-end for webscale onto a legacy RDBMS can be a cheap win/win to maximize the capabilities of both. SQL + R is oodles better than R or SQL in isolation.

Your advice is pretty good, but I would definitely say that SQL doesn't have a strong relationship between lines of code and runtime; you can line-break (and many do) your wide table's select into many columns and get there pretty quickly.

If you are writing SQL regularly, understanding the basics of how the queries you write execute is not that hard for your engine of choice, and everyone should be required to understand the basics of reading an execution plan so they can find the right inflection points between data gathering and processing.

I regularly (sigh) write and maintain SQL procedures that are >10k LOC, and their runtime never exceeds minutes, much less hours.

If catting 10 GB of files takes 5 minutes, then catting 500 GB of files takes 250 minutes. This is, of course, an overestimation of how long it takes to look at data on a disk. It's also the most trivial thing you can do with data: read it once.

I think we have a misunderstanding about the scale of data.

EDIT: Reading your post you mention that dataframes stores data in memory. Working with data in ram would provide a significant speedup. It just wasn't possible.

I'm not denying that you could have a query that takes hours and is within 10x the speed it should be, just saying that it's very common to have inefficient SQL queries for reporting. I would say it's almost the "default" state unless you do a bit of work.

Some more detail here: https://news.ycombinator.com/item?id=19150971

There are many interesting queries that don't touch all the data in the database. The 'cat' thing was basically assuming the worst case, e.g. as a sanity check.

Knowing how to optimise SQL (and also database indexes) is a valuable skill.

Reducing a highly used query's execution time by several orders of magnitude can be quite gratifying.

Stick the --+ORDERED flag on a random select sometime and look at what happens to the estimated query cost to see how easy it would be to fuck this up if you had to make all the optimisation choices yourself.

I had a similar role where I was writing boring LOB apps in a very gross language, but since we were using an SQL backend for all the data, I instead challenged myself to using bare templates in the actual programming language and writing all the extraction, transforms and logic into SQL selects and (when unavoidable) programmatic bits.

I also learned a crapload about obscure SQL since I would go to extreme lengths to achieve this. There was a lot of meta-SQL programming, where I would use SQL to generate more SQL and execute that within my statement, sometimes multiple layers deep. It was beautiful in its own way, expanding out in intricate patterns.

You might be interested in this project then https://www.getdbt.com/ it's a SQL compiler and executor which has many features like you're talking about (macros to generate SQL and so on). It makes it pretty easy to build up a complex DAG of SQL queries and transformations.

I never personally wrote meta-SQL but it was used at that office and I read it. I didn't see beauty. When I read the code and understood it I felt terror.

> Most people are perfectly happy to let the orm do all the work, and never care to dig into the data directly.

ORM all-too-often defines data structures from code.

Linus Torvalds wrote,

> I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.


Any recommendations of a place for sharing "extremely advanced" SQL skills?

Asking from wanting to make use of such a place, and haven't seen anything like it. So, probably need to bootstrap one instead (etc).

Are you asking for examples of advanced SQL skills?

In my experience, if you can grok lateral joins (aka cross apply), recursive CTEs, window functions, and fully understand all the join types, that's a gold star for understanding SQL!
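For anyone who wants to poke at two of those, here's an illustrative sketch using Python's built-in sqlite3 (SQLite has supported window functions and recursive CTEs since 3.25; the table and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 100), ('east', 300), ('west', 200), ('west', 50);
""")

# Window function: rank each sale within its region without collapsing rows,
# which a GROUP BY would do.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
""").fetchall()

# Recursive CTE: generate the numbers 1..5 with no table at all.
nums = [n for (n,) in conn.execute("""
    WITH RECURSIVE t(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM t WHERE n < 5
    )
    SELECT n FROM t
""")]
```

The window query keeps all four rows while still computing a per-group ranking; that "aggregate without collapsing" move is where window functions earn their keep.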

I would add indexes to that list. Knowing the different types and how they impact performance can be very valuable.

I didn't add indexes mainly b/c for analytic warehouses that are columnar, indexes are less important / meaningless. Partitions though, that's important!

> I've moved to writing backend code. I'm surprised most of my peers cannot write anything more complicated than a join.

I'd wager more don't even know what a join is.

Do you do much with Excel's SQL connectors? Sheets populated with database results are powerful and reasonably user friendly.

A thing I find even more user friendly is Import-Excel https://github.com/dfinke/ImportExcel

It's a powershell module that allows you to easily dump things directly to excel files, does pretty decent datatables, multi-tabs, etc.

From a performance perspective, most ORMs are trash if used naively.

SQL is a mind bender for me. I do a lot of work on data, and use Python and pandas to do a lot of data magic, but my problem is that my mind is too procedural in its thinking:

- Step 1
- Step 2
- Loop through results of Step 2
- Curate and finish output

I try very hard to transform the above steps into an SQL statement spanning multiple tables, but always fail, and I usually fall back to Python for manually extracting and processing the data.

Does anyone else face this problem?

Any suggested guides / books to make me think more SQL'ley ?

I took a course in prolog, really seemed to help get my mind in the right place.

Basically, instead of thinking in terms of how to get to the result, think instead of the result and figure out how to get there. For example, let's say I need to get a mapping of students to their classes. One way would be to get the students, then loop through that to get classes for each one. Another way to do it would be to ask the database to combine those sets of data for us instead and return the result (join students and class tables on the student ID).

Basically, find the data you want from each table, tell the database about the relationship between them (join on <field>), and set up your constraints. I guess you could think about the query as a function, and you're just writing the unit tests for it (when I call X with parameters Y, I expect Z).
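To make the student/class example above concrete, here's a sketch (table and column names invented, using sqlite3) contrasting the two mindsets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE enrollments (student_id INTEGER, class TEXT);
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO enrollments VALUES (1, 'Math'), (1, 'CS'), (2, 'CS');
""")

# Procedural mindset: N+1 queries, looping in application code.
mapping = {}
for sid, name in conn.execute("SELECT id, name FROM students"):
    classes = [c for (c,) in conn.execute(
        "SELECT class FROM enrollments WHERE student_id = ?", (sid,))]
    mapping[name] = classes

# Declarative mindset: ask the database to combine the two sets in one query.
joined = conn.execute("""
    SELECT s.name, e.class
    FROM students s
    JOIN enrollments e ON e.student_id = s.id
    ORDER BY s.name, e.class
""").fetchall()
```

Both produce the same mapping, but the join version does one round trip and lets the engine pick the access path.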

> I took a course in prolog, really seemed to help get my mind in the right place.

I have the reverse problem: every time I motivate myself to learn Prolog I feel like I could just input the constraints in a db and use SQL to get my results.

Two Things.

(1) Think of a database as a "big pile of stuff in a room". Some of it's ordered in a sane way, some of it's not. There is "a person at the door" to the room preventing you from entering - this is the Database Engine, not to be confused with the Database itself. You need to get things out of the room, and you need to put things in the room. You're not allowed to enter. If you want something out of the room, you must ask for it (query) by specifying what the thing looks like and which part of the pile you think it's in. If you want to put something in the room, you must provide explicit instructions for where it must go. When you write SQL, you are either asking a question of or giving instructions to the database engine. The DB Engine is distinct from the Database (the big pile of stuff...which you never have direct access to).

(2) As others have pointed out, SQL is very much the mathematics of set theory put into a programming language. Stanford has a free class on Databases. I took it a few years ago (it may have changed), and most of it had no SQL at all. It was fairly straightforward - things you can do with pencil and paper. It's free, so no pressure. Go sign up and do some of the homework assignments. Do them until you get them right (you can re-take wrongly answered problems immediately). It'll help break your procedural mindset. https://online.stanford.edu/courses/soe-ydatabases-databases

Thank you for the Stanford class reference. Will check it out.

I took this one ages ago, before I started programming (in prep for a bootcamp) and really enjoyed it.

I've thought about the same problem and think I have an explanation. SQL inherently avoids operations that cannot scale. This means that while the same operation could be written much more simply as code in pandas or Spark, doing it in SQL will be less forgiving if your underlying data schema is not optimal or if your logic isn't scalable. But more often than not, the SQL query will run faster and more reliably given similar amounts of compute power to the two engines.

Of course you can always screw up and write suboptimal SQL that will take forever to finish, but it's just harder to do unless you're really trying to be dumb about it.

So I do think there's still a lot of merit in trying to learn and use SQL more.

It also ignores operations of type (set of rows) => (set of rows) or (row) => (set of rows), and makes you jump through hoops of CTEs to implement them except in some special cases. Limited versions of these (flatMaps) work just fine in Spark and have no problems scaling.

I want to learn to do the same things in SQL because I don't always have the opportunity to run Python programs in the database server environment.

There are certain situations where an SQL query is all that I can work with.

Try some functional programming. It will liberate your brain from thinking operationally. You will soon no longer think in terms of steps, loops or anything like that. You will get used to thinking about data transformations, which is what SQL is good at.

I was about to add exactly this. Setwise operations, map/transform, aggregate, avoidance of mutable state, filters, avoidance of if/for with break conditions, quite strong avoidance of exceptions.
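That mapping from loops to set-wise operations shows up in miniature even in plain Python: the functional version of a filter reads almost like the SQL it corresponds to. A tiny sketch (data invented):

```python
orders = [
    {"customer": "a", "total": 120},
    {"customer": "b", "total": 40},
    {"customer": "a", "total": 60},
]

# Procedural: mutable state, explicit loop.
big = []
for o in orders:
    if o["total"] > 50:
        big.append(o["customer"])

# Functional / set-wise: a filter-and-project expression, no mutation --
# morally the same as: SELECT customer FROM orders WHERE total > 50
big2 = [o["customer"] for o in orders if o["total"] > 50]
```

Once you habitually reach for the second form, WHERE clauses and SELECT lists stop feeling alien.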

Any books that teach function programming with python?

I work a lot with data as well; primarily I use SAS, R and SQL. I'm also very procedurally oriented (as in I prefer breaking tasks down into discrete chunks); R's functional way of chaining operations with the pipe operator feels so alien to me.

I find myself having the opposite problem I probably use (abuse?) SQL more than should.

My world (industrial plant) is very DB/historian heavy everything speaks SQL it's pretty much the common tongue connecting everything. I think this is slowly changing some PLC/historians now offer a webservice which returns JSON objects via an ajax query - personally I vastly prefer SQL to ajax.

I'd estimate for a typical problem I'm working with maybe 90% is done in SQL vs 10% in R/SAS code.

When I need Regression, Principal Component Analysis, Time Series manipulation, Plots etc I have to break into dedicated language.

For most other things - extract, merge, filtering, high level aggregation (Count sum etc) type of operations using SQL feels more natural and expressive to me.

I used to think this too, and wondered what magical SQL perspective my brain was missing. I'd see these elegant but seemingly impenetrable queries and wonder why my brain wasn't in the "SQL dimension" yet.

But over time, and very heavy SQL reporting, I realized the procedural mindset still applies, it's just not expressed as such.

You need to think through the steps of how you want to retrieve, transform, aggregate etc. the data and build up the query to accomplish those steps. The resulting query doesn't look procedural, but the thought process was. Of course you need to know the sometimes opaque constructs/tricks/syntax to express the steps, which again, look like magic at face value or just reading docs.

I think this is why people struggle with understanding SQL. The thought process is still procedural, but the query isn't. You need to translate the query back into procedural terms for your own understanding, which is a PITA.

To me trying to emulate procedural stuff in SQL feels wrong, especially with multi-table queries.

Yes, that's what I'm saying. The difficulty people have is that the thought process (solution) is still procedural but the query is expressed declaratively. Once you learn how to navigate that gap you get good at SQL.

I like the Ben Forta SQL in Minutes book.

Just the basics, very clear and easy to understand.


I still struggle with thinking too "proceduraly" sometimes, but what really helps out with me is drawing / envisioning the sets of data as a series of circles and Venn diagrams.

It also helps to change the language you use in your inner monologue. Instead of thinking, "For each row in table A...", you should think, "For all the rows in table A that match on...".

If raw speed is not an issue you can always fall back on temp tables or a cursor for those cases where you might have extracted an initial dataset to then iterate over.

EDIT: A CTE (Common Table Expression) might be what you're thinking you need. You can do some fun recursive queries with them.
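For instance, walking a parent/child hierarchy -- the classic case where people reach for a cursor -- can be done in one recursive CTE. A sketch with sqlite3 (table and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Alice', NULL), (2, 'Bob', 1), (3, 'Carol', 2), (4, 'Dave', 1);
""")

# Everyone in Alice's reporting chain, however deep, with no loop or cursor.
reports = [name for (name,) in conn.execute("""
    WITH RECURSIVE chain(id, name) AS (
        SELECT id, name FROM employees WHERE manager_id = 1
        UNION ALL
        SELECT e.id, e.name
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT name FROM chain
""")]
```

The base case seeds the CTE with Alice's direct reports; the recursive member keeps joining until no new rows appear.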

> recursive queries

It's like all my fears concentrated, like the power of a thousand suns, onto a single sentence.

I should have mentioned cursors are conceptually a C pointer to really put the fear into people.

Highly recommend Dmitri Fontaine's book "Mastering PostgreSQL in Application Development". Really helps emphasize how you can leverage the power of SQL to minimize all the Python / Panda steps to get the same result.

I have the same issue, procedural thinking. I've made some small progress thinking in sets and working to re-phrase my goals in terms of sets of data, sometimes using window functions (row_number(), etc) when I need n-1 items from the query.

Some problems can be assembled from CTEs / subqueries when you might have reached for a procedural solution to 'finish' the problem. Grasping CTEs was a big thing for me, and now I'm wrestling with questions of style when mixing CTEs with tiny subqueries, wondering whether the increase in readability (for me) is the right call.

I'd love to see suggestions for digging deeper on this too.

These mental paradigms helped me a great deal:

1) Think in terms of Excel spreadsheets. At each step in your SQL you are creating a set of data. It doesn't usually matter how many rows it is; you just want the right columns.

2) Once you have a recordset, decide how the next recordset joins onto it. I would mentally visualize putting one worksheet next to the other and think through which columns I had to join together to get the rows I wanted.

3) Make sure you know what your first recordset should be. This is the base data that everything else can be joined into. I refer to this as the domain of records.

A good starting point could be understanding https://en.wikipedia.org/wiki/Declarative_programming

If you're mathematically minded (really just basic set theory), maybe try learning relational algebra. You'll express database queries in terms of Cartesian products, selections and projections. These operations translate directly into SQL, so maybe it'll help you think clearly first, then apply the syntactic translation. Honestly, if I had the choice I'd rather just use the syntax of relational algebra, as I find SQL too verbose and convoluted.

This book can help with getting into the right mindset: https://www.amazon.com/Joe-Celkos-Thinking-Sets-Management/d...

I think what you're doing is fine. SQL is strongest at answering questions, not processing data. I think Metabase has the right approach: what question do you want to ask your data? If you want to process and transform your data, the tools you're using are great for that.

Not trying to play word games, but what is the difference between answering a question and processing data? Aren't they effectively the same?

Using another tool for processing data often results in recreating SQL mechanics at application level. E.g. select this data, retrieve it, loop and if this, then set that, etc. SQL does it way better, guaranteed.

Of course that's often required for technical reasons (scalability etc.) or processing that's too complex to implement at data layer, or just for cleaner design.

But SQL is amazing at processing data!
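The "select, loop, if this then set that" pattern the parent describes can usually be pushed down into a single statement. A sketch with a made-up `accounts` table, again via sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, balance REAL, status TEXT);
    INSERT INTO accounts VALUES (1, -5.0, 'ok'), (2, 100.0, 'ok');
""")

# Application-level version (commented out): select everything, loop,
# test each row, issue one UPDATE per match.
# for acct_id, balance, _ in conn.execute("SELECT * FROM accounts"):
#     if balance < 0:
#         conn.execute("UPDATE accounts SET status = 'overdrawn' WHERE id = ?",
#                      (acct_id,))

# The same logic as one declarative statement the database can optimize:
conn.execute("UPDATE accounts SET status = 'overdrawn' WHERE balance < 0")
statuses = dict(conn.execute("SELECT id, status FROM accounts"))
```

One round trip, one statement, and the database guarantees it applies consistently.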

Agreed that SQL is amazing at processing data! I would argue that a lot of people are trying to both process their data _and_ ask a question of it in the same statement. Separating those out is really important to make analytics more scalable.

We do >95% of our transformations with pure SQL and the queries are primarily in Looker.

Processing to me - e.g. running through some sort of algorithm or complex logic. Not just a transformation.

I really like "Beginning Database Design" by Clare Churcher, and the follow-on book "Beginning SQL Queries". They sound like really basic books, and they definitely start from the basics, but then build up steadily.

Reading the Postgres manual cover to cover is a fairly good starting point.

SQL is the most powerful query language ever invented and widely implemented over relational databases, IMO (qualified heavily for lurking RDF zealots). Every time I see someone start to invent their own query language, I almost always mark it as folly (similar to when people invent their own configuration languages). Prometheus and GraphQL stand out as recent examples.

Systems like Kafka that recognize this and instead offer SQL on top of their engine (KSQL) take the right approach, IMO.

SQL is the most powerful query language ever invented

No, it's Datalog. Joking aside, they are equally powerful, but one could argue about explicit vs. implicit joins.

No idea why this is getting downvoted, people must really dislike Datalog.

A recent up-and-coming software tool is the Open Policy Agent[0] and it uses Datalog -- though it's a custom implementation. This comment felt familiar so I went back and looked and it's come up on HN before[1].

[0]: https://www.openpolicyagent.org/

[1]: https://news.ycombinator.com/item?id=18156144

While I agree with you, I met a friend of a friend in Denmark who works for PDC who makes the claim that, outside of them, there's only one other respectably-sized company on the planet that is actively keeping Prolog alive.

If I'm going to bet my career on one or the other, it'd be SQL.

Slight aside, I think there are good lessons to be learned from LINQ still.

I think one major reason it's so powerful is that it's built on pure mathematical foundations.

What don’t you like about GraphQL? I haven’t used it enough to form a strong opinion but it seems ok so far.

GraphQL doesn't even seem like the same kind of tool. GraphQL is what you expose as an API to users. It can only do a fraction of what SQL does, and that's a very good thing. You certainly don't want users running their own SQL on your server.

I'd disagree -- I (and some other SQL/postgres zealots), would say that allowing users to run SQL on your own server is only a bad thing when you are running a database with insufficient configuration and/or security features.

We just might happen to live in a world where all the RDBMSes have that same failing, but I'd argue Postgres doesn't fail as badly in this area, with things like row-level security[0].

[0]: https://www.postgresql.org/docs/current/ddl-rowsecurity.html

It's near trivial to exhaust resources when you have SQL access. That's not information disclosure, but it's darn painful if the set of users includes a sizable enough group of bad faith actors.

Some people will even do it in good faith. They’ll tank performance and for some reason it will never cross their mind that they could be causing the issue. Happens with any shared system without rate limiting.

A near canonical example of "Tragedy of the Commons".

That which is no one's responsibility will inevitably trend towards crap, IME.

I would 1000% not take my opinion as gospel on GraphQL -- I know how it works, but I have not written an implementation myself, and have not even spun up Apollo. With that heavy grain of salt, here's how I feel about GraphQL:

- It doesn't feel very interoperable or flexible (it's a different "ecosystem") with regard to other software -- meaning that it's much less "integrate with GraphQL" and more "build on top of GraphQL", or find the "GraphQL way" to do things. I find this to be different from approaches like OpenAPI/Swagger.

- Complicated GraphQL queries start to look like SQL to me, and I suspect that will become even more the case as time goes on, and people decide they want advanced things that are in SQL, like window functions.

- GraphQL locks you into knowing the backend design from the frontend -- though this was mostly always the case with REST (you need to know which endpoint to call), "true" REST/HATEOAS-compliant backends held the promise of just being able to ask for the shape of an object and make decisions about whether to dive deeper as you go in a principled way. This kind of gets into semantic-web concepts -- but that's mostly a sea of never-delivered promises, so...

- Its features are mostly offered by tools like PostgREST (vertical/horizontal filtering[0], resource embedding[1]), which offers less complexity.

- GraphQL is very likely going to spend the next few years running into issues/features that the REST/HATEOAS world has already solved (albeit not in a standardized way). I looked at a random issue[2] and it's definitely something that seems quite solved in the RESTish HATEOAS world (at the very least you don't have to wait for GraphQL to solve it).

Basically, I wish someone had built the ease-of-use GraphQL offers on top of the HTTP+REST+HATEOAS model -- because it's wonderfully extensible. Excellent solutions often lose out to "good enough" solutions, and I just... don't want that to happen this time (I mean I don't think it will, it's pretty hard to beat out good ol' http).

We just got to a really good place with Swagger/OpenAPIv3 + jsonschema/hyperschema, in being able to create the standardized abstractions on top of HTTP (and bring with it all the benefits of HTTP1/2/3 as they arrive, and it feels like GraphQL is a distraction. The "entity" graph is just a rehashing of the relational database model and it doesn't feel like it offers much outside of a standardized way of presenting your DB -- if you want to present your DB to frontends why not just let them send SQL queries directly? No I'm not tone deaf, I know most devs don't want to write SQL, but it seems like they're about to get into bed with GraphQL despite the likelihood that it's just going to be another SQL once it has enough features.

That said, GraphQL is doing amazing things for developer velocity, and it deserves praise for that -- tools that grasp mind share this quickly usually offer a large amount of real benefits to the people using it, and I've seen that GraphQL does that.

[0]: https://postgrest.org/en/v5.1/api.html#vertical-filtering-co...

[1]: http://postgrest.org/en/v5.1/api.html#resource-embedding

[2]: https://github.com/hasura/graphql-engine/issues/1598

Write a trivial app with it. I thought "oh yeah, looks ok, but not a huge benefit over REST", now I've started playing with it, I'm saying REST is the new SOAP.

SQL is nice on a surface level and helpful in practice.

Having a working intuition for relational databases is valuable on a deep level. I mean having a sense of how to organize the tables, what sizes are large and small, when to add what kind of index and what the size and speed limits are likely to be for a given data structure. That's extremely valuable.

BTW, we're preparing to move a postgres database that's a few TB from Heroku to AWS RDS. The catch is that we can't afford more than a few minutes of downtime. If this is in your wheelhouse, reach out! We'd like to talk.

Ok, we had a MySQL DB around 500GB, and not on Heroku, but we used the AWS Database Migration Service: https://aws.amazon.com/dms/

We had just a few minutes of downtime.

I did have few issues around permissions, but an AWS support person walked me through the process and we got it up and running. The support was great, and we just have standard business support.

EDIT: DMS might be a no-go, as jpatokal points out. While Heroku allows external connections, it doesn't allow superuser access, which, AFAIK, DMS requires.


Heroku Postgres doesn't support external replicas, which complicates migrations out considerably. If you can't tolerate more than a few min of downtime for writes, you pretty much have to build the replication at the app level.

> BTW, we're preparing to move a postgres database that's a few TB from Heroku to AWS RDS. The catch is that we can't afford more than a few minutes of downtime.

Would it be ok to serve stale data for an extended amount of time? If so then you could modify your application to write to AWS RDS but to keep reading from Heroku. Then you start copying over the data from Heroku to AWS RDS, and once the copying is done then you point your application to read from AWS RDS.

Another thing: If your tables have foreign key constraints then you need to create the tables in the new DB without them first, since the writes coming to the new DB could be referencing data from the old DB which have not yet been copied over. Once you’ve completed copying the data over you add the FK constraints to the new DB. Depending on how your application was made this could be a problem or it could be ok.

Depending on the size of the relations, something like adding a new FK might result in more than minutes of downtime due to an exclusive lock--although that's a heavily couched "might" given all the work the Postgres team has put towards online DDL.

That does sound in our wheelhouse. If you drop me an email happy to setup a time to talk.

If you're going Postgres to Postgres I enthusiastically recommend Bucardo.[1] I'm not very technical, but even I managed to get an ~1TB DB successfully replicating using it. In my case it was going from AWS RDS to AWS Aurora. Previously others had tried AWS DMS, but we kept having data integrity issues. And DMS cost us significantly more than Bucardo whose only cost was a medium size EC2 instance and data in/out.

[1] https://bucardo.org/Bucardo/

Are you controlling all the writes? If so I’d go a phased migration .. easier to pull off and much lower risk than a big-bang one-off approach.

I've been saying this for years, because I started doing web apps when we pretty much were sending raw SQL over to a database. Even after ORMs got bigger, I still stress knowing SQL even if you don't use it often, because it helps you understand the ORM.

And oddly enough, now I'm a data engineer. Everything I do, even when I'm not doing pure SQL, is influenced by years of experience in SQL. Either the languages of my big tools are still based on SQL in some fashion, or it simply helps to have an understanding of the ecosystem to figure out what's going on at scale.

Everything else has come and gone and come again, but SQL has held up pretty nicely. The only other skills that come close for me are working in languages that branched out of C (because the syntax isn't so different over time) and years of on-again, off-again procedural languages (ColdFusion, vanilla Javascript, etc. leads to it being way easier to pick up Python).

For those trying to up their SQL game from basic selects and grouping, I suggest learning:

- Common Table Expressions (CTEs). They will equip you to write clean, organized, and expressive SQL.

- CREATE TABLE AS SELECT. This command powers SQL-driven data transformation.

- Window functions, particularly SUM, LEAD, LAG, and ROW_NUMBER. These enable complex calculations without messy self-joins.

Learning those three will change your relationship with SQL and blow the doors open on the problems you can solve with the language.
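All three can be shown in one small sketch (hypothetical `orders` data, run through sqlite3): a CTE ranks each customer's orders with the ROW_NUMBER window function, and CREATE TABLE AS SELECT materializes the result as a new table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('a', 10), ('a', 30), ('b', 20);
""")

# CTE + window function + CREATE TABLE AS SELECT:
# keep each customer's single largest order.
conn.executescript("""
    CREATE TABLE top_orders AS
    WITH ranked AS (
        SELECT customer, amount,
               ROW_NUMBER() OVER (PARTITION BY customer
                                  ORDER BY amount DESC) AS rn
        FROM orders
    )
    SELECT customer, amount FROM ranked WHERE rn = 1;
""")
rows = sorted(conn.execute("SELECT * FROM top_orders"))
```

The "top N per group" problem here would otherwise need a messy self-join; ROW_NUMBER makes it one readable pass.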

I’d add LATERAL JOINs to the list.

For those unaware of LATERAL JOINs, it’s like a for each loop in SQL. Blew my mind and opened up so many possibilities.

ROW_NUMBER is absolutely the best, especially for marketing related customer data.

I consider myself functional, but not proficient in SQL. Is there a reliable and easy way to assess one's skills in SQL? I feel that most of the new things I learn about SQL these days are database specific. I wonder whether I'm missing out on something, or do I already have the "core" SQL knowledge down, and everything else is special cases?

The core concepts are the most important things to know: how SQL queries and their various clauses work, and how tables, data types, indices, and constraints fit together. With that knowledge you can handle most SQL databases and begin to learn and work around their quirks. Yes, there are a lot of database-specific details. Procedural languages can also be useful, such as PL/pgSQL for PostgreSQL and T-SQL for SQL Server.
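Those core pieces fit in a few lines of portable DDL. A sketch (table names invented) exercising types, primary and foreign keys, NOT NULL and CHECK constraints, and an index, via sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL CHECK (amount >= 0)
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (1, 1, 9.99)")
```

The quirks start exactly where portability ends (e.g. that PRAGMA is SQLite-specific), which is why the core concepts transfer but the details don't.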

(self plug) take a look at pgexercises.com . It's Postgres-focused but mostly cross-platform, and I think it covers most of the important stuff.

Thanks, I'll check it out!

I recommend you look at one of Markus Winand's presentations, such as this one : https://youtu.be/xEDZQtAHpX8 I felt it summarized well the modern foundation of SQL databases.

Thanks, I'll be sure to watch it!

The more queries I write on Elasticsearch, the more I value SQL.

Seriously. Each time I have to do something in elasticsearch I'm constantly trying to reason why the creators went about it like this. Yeah hey let's serialize some nested JSON object that contains no less than 2,000 brackets.

Me too. I spent years complaining about databases and all the structure and rules and inflexibility. It was initially great to go to NoSQL backends where suddenly you're free. Now I really miss them :(

Well, at least I can do simple query with SQL and Elasticsearch and on Hive.

I really miss being able to make arbitrary where clauses to adjust to feature changes.

Hell yes. It takes some time to switch from the procedural or OOP mindset to the highly declarative world of SQL, but man, it's worth pursuing. The only thing that really helped me build confidence was solving more real-world problems involving 2NF and 3NF design, JOINs, triggers, indexing, views and materialized views for denormalization, and CTEs and recursive CTEs.

You may call me an extremist, but from a server-side programming point of view, with Postgres and FDWs (Foreign Data Wrappers) and its ton of features beyond SQL, the only thing I miss is an HTTP server. :)

I’ve been a career analyst for about 12 years so I write SQL every day. I know it very well. I’ve developed a very strong love / hate relationship with it over the years.

On one hand, you write what you want to find in fairly common language and you almost always get back what you want if you do it correctly. In many ways, it’s like a precursor to Alexa but in written form. It’s super easy to pick up for non technical people.

On the other hand, it’s extremely difficult to code review, and on a very complicated piece of business logic, errors could mean the difference between hiring 10 more people or laying off 100. So almost always it’s just easier to re-write.

Imagine if engineers couldn’t understand each other’s work, and had to rewrite code every time someone leaves the team. It’s insane to me that this is standard practice.

> It’s super easy to pick up for non technical people.

I haven't found that particularly true. I've trained people in SQL and yes, they can do the basics pretty well, but it requires a particular skill to actually write effective, complex SQL statements. Non-technical people, being non-technical, generally cannot do that.

> it’s extremely difficult to code review

I don't see why it's any more difficult than any other logic? It's always important to test.

I think the operative word there is complex.

Yes non technical people can’t write complex sql, but they can self service basic questions easier than learning Python or R for example.

And it’s difficult to code review SQL because complex business logic means complex data manipulations, and those are hard to visualize and comprehend. You don’t have to think in a 3D space so much when you write Python.
