What I personally like about SQL here is that you can start to think about your transformations in a purer "relational algebra" sense. You're able to declare the expected outcomes, and let the engine figure out the best way to handle that, as opposed to actually having to figure out how to join, aggregate, window, etc.
SQL transformations shine in environments where regular batching in short intervals is acceptable. Materialize and Flink are both super interesting for overcoming that batching limitation and getting to real-time materialization via SQL.
Regarding the concerns about tests, it's true it's harder to do unit testing of SQL, but for me the tradeoff is it feels like there's a whole class of bug that's eliminated by focusing on the relational algebra. If you can avoid the category of work of defining the procedures to transform the data, and only define the expected outcome, things can be much simpler.
In my experience, testing SQL is just like any other language. You define input data, execute the script under test, and then check the output (and possibly the state of the system) to confirm. The issue is more pronounced when you start doing DDL on the fly, but as long as you generally confine yourself to extracting data in a specific format, it's not too hard. I've done it entirely in SQL - you can have a "test data" table and compare it to a "results" table that gives you an oracle of truth*.
The best is when you have rollups or other basic math that needs to be tested, so you can do parallel calculations (in another language or by hand) to ensure that the math is right. The worst is when you have small format changes that alter e.g. the order of the output - then you're left with either making your tests order-invariant or twiddling them for every ORDER parameter, which can be frustrating.
*Truth is only as good as your ability to input results, of course.
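For engines that support EXCEPT, that comparison can itself be a single query - a minimal sketch, assuming hypothetical expected_results and actual_results tables with the same shape; an empty result means the test passed:

    -- Symmetric difference of expected vs. actual; any row returned is a failure.
    SELECT 'missing' AS problem, m.*
    FROM (SELECT * FROM expected_results
          EXCEPT
          SELECT * FROM actual_results) m
    UNION ALL
    SELECT 'unexpected' AS problem, u.*
    FROM (SELECT * FROM actual_results
          EXCEPT
          SELECT * FROM expected_results) u;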
To me, data pipelines are about moving data from one place to another, and that's not really covered by the SQL spec. The article doesn't specify what constitutes a "data engineering pipeline", but for me there are three components to it:
- orchestration, or what happens when
- data manipulation, aka the T in ETL
- data movement, getting data from A to B
Orchestration? Please do not use dynamic SQL, you're digging a hole you'll never get out of. Moreover, SQL has zero support for time-based triggers so you'll need a secondary system anyway.
Manipulating the data? Sure, use SQL, as long as you're operating on data from one database. Trying to collate data from multiple databases may have unpredictable performance issues even in the engines that support it.
Moving data? There are much better options. And the same caveat as for orchestration applies here: every RDBMS vendor has its own idea of what external queries should look like (openrowset, external tables, linked servers), each with its own performance considerations.
Author here: All good points - I should have been more specific: I'm really talking here about the data transformation logic. I live in a world where raw data generally gets loaded into a data lake, and data engineering proceeds from that point - but appreciate that's not the general case.
I'm really contrasting SQL with other commonly used data manipulation APIs and dataframe libraries like dplyr, pandas, polars, etc., but again, I appreciate this should be clearer in the blog.
In a data lake, PySpark should be the default for large operations with multiple transformations. Since you're essentially working with Jupyter notebooks, it lets you lay out your transformation steps in a way that makes it clearer to the reader what each step does than SQL would.
The PySpark API has methods for SQL-like operations, so it's familiar to people like you and me who know and love SQL. There is no point to Pandas on a datalake since PySpark has dataframes.
I'm a huge proponent of SQL. I'm dubious of ORMs and think a lot of websites would be more performant if people just learned SQL, or whatever the idiomatic querying language for their chosen DB is. But with datalakes, pyspark is great.
I worry that if I write significant amounts of code in PySpark it will have fallen out of fashion in ten years' time and someone will end up either rewriting it or running an ancient, unmaintained version of PySpark forever just to keep my stuff working.
I'm much less worried about that happening to SQL scripts.
About 20 years ago I wrote a bunch of complex reports using SQL. The engineer who came after me tried briefly to understand it, decided it all needed to be rewritten in "a real language", broke all the reports, and they never really got them back.
So merely writing in a language that we believe will be around forever does not prevent your code from being rewritten.
I think that's always a risk (the old "which idiot wrote this? clearly needs a full rewrite!" followed by the rewriter rediscovering all the bugs the original had fixed), but with a well established language like SQL hopefully the risk is lower.
The fact is that, word for word, SQL is far more efficient for reporting than most programming languages. But since few programmers treat it as a real language, it tends to be written without formatting, which makes it hard to read.
Also, unfamiliarity will make SQL feel inefficient to non-SQL programmers. The experience of rewriting SQL teaches some people this lesson. But others find that the familiarity of the language of the rewrite makes it seem better to them, even though it is objectively worse - objectively worse in a way that only becomes visible when someone needs to optimize it for performance reasons. And it is seldom the programmer who did the inefficient rewrite who has the skills to optimize it back to what the SQL did in the first place!
PySpark is such a performance killer. Countless startups exist whose whole goal is to improve its performance and reduce its cost. It locks you into a small number of platforms (because many databases don't have an implementation and the details vary), and every database vendor eventually struggles to make its own implementation.
Spark is useful in that it’s pretty cheap and flexible. I can grind through 10 TB of data for like $50 and a few hours. If my experiments work, I can use it as a component in other projects. Plus, I can use (slightly non-standard) SQL right in pyspark, if I want to. And, it often performs faster because of the query optimizer.
If your db is scalable enough (like Snowflake) and you have the money, you can use SQL directly in the same way. But, I've seen data analysts writing bad SQL against Snowflake costing thousands of dollars per query.
It’s really tough to future-proof anything. At least with PySpark you will have the code. And there will always be experts out there. Worst case, if it’s important enough, someone will wrap it in an API and use it until something better comes along (see: 50-year-old mainframes still in use).
I don't want to pile on, but I do see startup after startup focused on improving the cost/performance payoff of PySpark. PySpark seems to be basically implemented by having the Databricks query planner embedded in it, which then generates multiple queries to evaluate the job. I think there's nothing inherent in that process that should make it faster than raw SQL queries. There's a theoretical advantage in maybe looking at the overall giant PySpark query and decomposing it into efficient pieces - compare that to someone who wrote a series of separate SQL queries to accomplish the same thing, where the query planner can't take advantage of any reused computation.
A single node of ClickHouse would likely be sufficient for a fraction of the cost and time. Use attached NVMe disks (EBS on AWS) or a physical local disk.
Thanks, that's an idea that makes a certain sense. The counterpoint is that technical things get replaced quickly when something vastly superior arrives. We don't use CP/M computers or Apple IIs any more. I think saying SQL has the Lindy effect is a clever slam against it. We don't use System/360 assembler any more. x86 seemed certain to own the world; maybe ARM or RISC-V will usurp it.
Either you work with data that’s not really big data or you have some super awesome engineers who know spark very well, because spark is pretty much not declarative as much as it says so. You have to constantly battle crap like GC errors, data skew, etc. Even on supposedly managed solutions like databricks. As opposed to solutions like snowflake which are a lot more declarative (nothing is perfect of course).
And the vendors are happy to tell you to throw more cores at it instead of grapple with the core issues.
One of the biggest problems with spark imo is that people think that the magic cloud is going to make it go fast, when it actually means that the REPL is just muuuuch slower to iterate on.
This doesn't really respond to the article. The argument here is "jupyter notebooks are clearer", but that's pretty subjective, and the article specifically addresses this kind of point:
"By using SQL, a much wider range of people can read your code, including BI developers, business analysts, data engineers and data scientists."
If you really want to advocate for PySpark in this context, you've gotta contend with everything the article brings up:
- More people will be able to understand your code
- Future proofed, with automatic speed improvements and ‘autoscaling’
- Make data typing someone else’s problem
- Simpler maintenance, with less dependency management
- Compatibility with good practice software engineering
- SQL is more expressive and versatile than it used to be
Even for data transformation logic, SQL isn’t the best choice. How would you handle the case where you need to apply the same transformations to a few dozen or a few hundred columns?
Some SQL engines support generating and evaluating queries. I stumbled upon a function in the MariaDB code base that was introduced and kept specifically for that use case (it returns a properly quoted SQL value as a string, including the string "NULL" for null values). It is aptly named QUOTE [1].
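QUOTE is for safely embedding values; for the many-columns case, the same generate-then-execute idea works off information_schema. A rough MySQL/MariaDB-flavor sketch with a hypothetical raw_events table, emitting one TRIM expression per varchar column; you'd then execute the generated string:

    SELECT CONCAT(
             'SELECT ',
             GROUP_CONCAT(CONCAT('TRIM(', column_name, ') AS ', column_name)),
             ' FROM raw_events'
           ) AS generated_sql
    FROM information_schema.columns
    WHERE table_name = 'raw_events'
      AND data_type = 'varchar';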
I'm not sure I understand the question, you can create tables/views based on others that have the transformations applied to them, tools like dbt[1] make this easy.
Even on the manipulation side I tend to disfavor SQL.
Even though it’s improved a lot, the testing situation in SQL is still not where a more traditional programming language is, and modularity comes either in the form of ever-nested CTEs, stored procs, or dbt-style templates. And SQL types are wholly dependent on the SQL engine, which can lead to wacky transforms.
SQL is great for ad hoc analysis. If something isn’t ad hoc, there is almost always a better tool.
Agreed. I would also point out maintainability. How do you test some SQL logic in isolation?
Additionally, in this day and age where enriching data by running it through some ML model isn't that rare, doing it in SQL by exposing the model through an API and invoking some UDF on a per-row basis is extremely inefficient due to network RTT. In my opinion it is much better to use something like Apache Beam, load the model in the memory of your workers, and run predictions "locally" on batches of data at a time.
On the other hand I see the value in expressing "simple" logic in SQL, especially when joining a series of tabular sources. That's why I am super happy with Apache beam SQL extensions (https://beam.apache.org/releases/pydoc/2.30.0/apache_beam.tr...) which, IMHO, has the benefits of both worlds.
This is no different from regular software development in a language like Java.
I would argue it's even better, because unit tests are always in tabular format and pretty easy to understand. Java unit tests, on the other hand, are never read by devs in practice.
> in this day and age where enriching the data by running it through some ML model isn't that rare,
Still pretty rare. This constitutes a very minor percentage of ETL in a typical enterprise.
> Even though it’s improved a lot the testing situation in sql is still not where a more traditional programming language is, modularity comes either in the form of ever nested cte, stored procs or dbt style templates.
I don't use nested CTEs/stored procedures. I simply extract it to a new file and mark it 'ephemeral' in dbt, no different from what you'd do in a regular programming language.
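For anyone unfamiliar, an ephemeral model is just a SQL file that dbt inlines as a CTE wherever it's referenced. A minimal sketch, with hypothetical model and column names:

    -- models/cleaned_orders.sql
    {{ config(materialized='ephemeral') }}

    select order_id, lower(email) as email
    from {{ ref('raw_orders') }}
    where email is not null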
> Manipulating the data? Sure, use SQL, as long as you're operating on data from one database. Trying to collate data from multiple databases may have unpredictable performance issues even in the engines that support it.
Depending on how big your data & org is - orgs solve this by writing their own query engines that can query many different data formats.
It is much more practical to optimize an engine, than to expect every data engineer to know how to optimally manipulate data from many different formats.
And why shouldn't this engine accept SQL queries as well if it's accessing existing SQL databases (probably just with different weights, retries, and batching/spooling)?
If you want to use the same logic / language to query some exotic dataset for specific use cases - it can often be worth it to write a custom engine that can do it - rather than expect all end-users to learn the ins & outs of the other databases & datasets (not to mention, to learn something beside SQL).
Instead, a single team (the query engine owners) can optimize the query engine - rather than individual users trying to optimize every script individually.
Your users can become masters of their engine / language (SQL) - because it can be used for the vast majority of cases.
For many reasons - you might want to store data in a format that MySQL / Postgres does not support natively (see Google Spanner). But, ideally, you'd still be able to leverage the fact that almost every programmer in the world can write SQL.
The simplicity and (relative) consistency of SQL makes it such a great choice. Sure, there are language variations but those can be avoided if that is a goal of the team.
Orchestration and data movement can be expressed with SQL. Assign shards however you need, run queries.
I am working on a (parallel, distributed) SQL engine and you can't even fathom how often I see SQL queries disguised as / hand-compiled into for loops in C++. Essentially, whatever a SQL engine works on can and should be expressed as data tables, and their processing and transformation should also be SQL-expressed.
What is then, in your opinion, an actually good ETL stack? Many use Spark/Databricks/Synapse, which has all 3, but these seem bulky and needlessly complicated for many smaller projects. Even TB sized datasets don’t need Spark, in my opinion.
I'd advocate starting as simple as possible. It's very context specific, but one very simple architecture to consider is seeing if DuckDB can handle your data (maybe on quite a large machine).
For larger data, I'm a fan of writing everything in SQL and executing it using AWS Athena (a Presto engine), which is extremely cheap.
If you have even larger data, you can consider AWS Glue (steering clear of any aws specific stuff and literally using it as spark SQL as a service). For many simple cases you don't really need to know any Spark, and you're just submitting SQL to AWS for execution.
But all of this assumes you're storing your data as files on disk (e.g. parquet). I'm less experienced in running everything in more traditional SQL engines like MS SQL Server or Postgres.
There's AWS Glue and Azure Data Factory, integration tools like Qlik Replicate or FiveTran Replication, or ConnX solutions if you have enterprise-scale money to burn.
We don't use any of them though; we have custom C# ingestion code that can talk to most databases and OData APIs. .NET's lazy collections make it quite easy to create a consumer/producer pipeline for streaming large datasets without too much overhead.
I'm fine with this, but SQL has some lingering shortcomings:
- Inability to express recursive/nested datasets directly (tree-like).
- General inability to express structural sharing and graphs directly.
- Inability to express associative (map-like) relationships directly.
- ...
Some of these are solved in the SQL standard, but not universally adopted. Others are solved by particular databases, but also not universally adopted. For the rest, of course you can solve it if you add the condition "some assembly required" when you get the data.
But if you have to assemble and disassemble the data at every step, it stops being a viable pipeline choice. It's as good as serializing into CSV, JSON, or whatever. In fact, JSON at least can nest.
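On the first bullet: WITH RECURSIVE has been in the standard since SQL:1999, but support and syntax details still vary by engine. A sketch with a hypothetical categories(id, parent_id, name) table:

    WITH RECURSIVE tree AS (
        SELECT id, parent_id, name, 1 AS depth
        FROM categories
        WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.parent_id, c.name, t.depth + 1
        FROM categories c
        JOIN tree t ON c.parent_id = t.id
    )
    SELECT * FROM tree ORDER BY depth;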
Largely my take as well... I will say, I'm not a fan of using the DB engine to self-ingest or export data... I do prefer newline-delimited JSON for import/export as it tends to be easier than CSV usually is.
The variations in SQL dialects/engines are pretty broad... and in some cases (MySQL, ugh) straight ANSI SQL syntax may or may not work as expected. Not to mention more esoteric data like JSON or XML columns.
One other niggle: don't do data analytics against the live database servicing your applications... use a read mirror or replica node. The types of queries data analysts tend to run are often less than ideal in terms of performance characteristics. Developers can create bad enough queries as it is, let alone a DA query locking a key table for many seconds.
It’s definitely true that making a dataset into something SQL can elegantly handle is not always easy, for example, dealing with JSON. But that’s also kind of the point. It thrives on flat, relational data. And having that relational layer makes doing things with data very easy.
IMO there's a big difference between university level SQL and the ability to competently write SQL in a professional setting. In university, at least at my school, you are getting maybe at most 3 tables overall, with 3-5 columns each. Professionally, you're dealing with hundreds, maybe thousands of tables, each with between 10-1000 columns, and usually not even in 1NF either. It really takes a lot of practice writing queries professionally to get to a level of competence here.
I think it's because of how popular ORMs are. You can get pretty far with an ORM without knowing any SQL, but you can inadvertently do some really dumb things if you don't understand how SQL works.
SQL was the first language I ever learned, so perhaps my perspective is skewed, but it seems like it would be pretty hard to use an ORM without gaining an intuition into how SQL works.
What is a trap I see often among ORM users is them running into the n+1 problem. However, that's actually a consequence of database implementations not being theoretically pure. If you had an ideal SQL database then their approach would be technically correct. They aren't wrong to think that way.
It's just that we don't have ideal SQL databases, so we have to resort to hacks to make things work in the real world. Why we need to break from the theoretical model and use those hacks can be difficult to understand if you aren't familiar with the implementation at a lower level.
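To make the n+1 thing concrete, a hedged sketch with hypothetical authors/books tables; the first form is what a naive ORM loop tends to emit, the second is what the engine can optimize as a whole:

    -- One round trip per parent row (the n+1 pattern):
    SELECT * FROM books WHERE author_id = 1;
    SELECT * FROM books WHERE author_id = 2;
    -- ...n more queries, each paying network latency...

    -- One set-oriented query the planner can optimize in a single pass:
    SELECT a.name, b.title
    FROM authors a
    LEFT JOIN books b ON b.author_id = a.id;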
"Object/Relational Mapping is the Vietnam of Computer Science. It represents a quagmire which starts well, gets more complicated as time passes, and before long entraps its users in a commitment that has no clear demarcation point, no clear win conditions, and no clear exit strategy." (Ted Neward)
Remember an ORM is an object-relational mapper, not a query builder. The "simple stuff" is repetitive, extremely common, straightforward, and boring -- exactly the sort of thing you want to automate.
> Yeah, but when you add/change a new property to your most used class you don't have to change 500 odd SQL statements, it just works.
ORM doesn't have anything to do with SQL statements. ORM only operates on the results those SQL statements produce.
You're probably thinking of the active record pattern, which combines ORM with query building into an interesting monstrosity. Indeed, query builders produce SQL statements and can save you from changing 500 odd raw SQL queries.
A lot of ORMs create SQL statements too, they have an integrated query builder. And functionality for updating, deleting and even creating/modifying schemas (migrations).
But as always: use the right tool for the job. A hammer is a great tool, but not suitable for removing screws (most of the time).
> A lot of ORMs create SQL statements too, they have an integrated query builder.
These are technically active record libraries. One of the popular ones is literally known as ActiveRecord, but the whole suite of them are implementations of the active record pattern.
ORM is simply the process of converting relations (i.e. sets of tuples, i.e. rows and columns) into structured objects and back again. This layer of concern is independent of query generation.
ORMs are not only great for simple stuff. They are great for a lot of complicated queries too. But as always, you need to know what you’re doing. An ORM is not a magic layer, that solves all of your problems.
I’m mostly using Entity Framework Core, which allows you to do quite complex queries that can be really easy to read and maintain. But if you’re careless, you can create quite slow queries, that’s clear. But you can easily create slow queries with plain SQL too.
Yes, I’ve found that it’s possible to write hundreds of lines of ORM code to do some very complex and sophisticated processing, that could be done in a dozen lines of plain SQL.
The ORMs I’ve used also needed every entity from the database to be duplicated in the host language, which is a horrific amount of duplication in any non trivial project. Not just the entities but also the relationships. And then they need to be kept in sync.
And then they have these awful and subtle caching issues that mean if you fall back to SQL then you take a massive performance hit. Or if you run multiple servers against the same database you need another layer of complex caching infrastructure. Or if you update a table outside of the application you can get inconsistent results.
The whole thing is a nightmare of unnecessary complexity apparently intended to hide developers from the much lower complexity of databases.
ORMs are an inefficient, leaky and slow abstraction layer on top of a fast and complete abstraction layer.
I’ll never willingly use an ORM again.
Edited to add: the whole fragile contraption is done in an effort to move slices of the database into the memory context of a remote application. To what end? So we can update some counter and push the whole lot back out? It literally makes no sense.
There are two ways to achieve a single source of truth:
1. Generate ORM entities from the database (usually there is a tool for that)
2. Generate the database from ORM entities, and make changes via generated/manual migrations
You're not supposed to edit the generated side; just re-generate it when the model changes.
I don't think it's the syntax of SQL that causes problems (and which ORMs try to replace)... it's the complex underlying logic necessary to accurately join multiple tables, set WHERE and HAVING conditions, and aggregate records, and to know to which dataset each should apply.
Countless "replacements" for SQL have come around to make data accessible to non-experts, but regardless of whether you're using a BI tool, reporting platform, ORM, whatever, pretty soon you're going to need to understand and express that logic, and that's when you really appreciate SQL's directness and terseness.
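Once you can express that logic, SQL states it about as directly as anything. A small sketch with hypothetical orders/customers tables:

    SELECT c.region,
           COUNT(*)     AS order_count,
           SUM(o.total) AS revenue
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.status = 'shipped'
    GROUP BY c.region
    HAVING SUM(o.total) > 10000;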
Every single org I've found that says "we don't use ORMs because they're slow" ends up constructing their own bespoke, wheel-reinventing "orm" or sticks all their application logic in sprocs, which are nightmares for imperative logic.
My memory could be a little hazy, but I don't remember any required course that dealt with SQL when I was in the CS program at a pretty highly-regarded university 25 years ago.
I took a course in which I learned quite a lot about SQL and in retrospect it was an extremely useful course to have taken.
I remember taking a databases class (either junior or senior year) that covered database design and also SQL with Oracle. We even got into Pro*C, which was some crazy Oracle-specific C pre-processor. It definitely wasn't required. My roommate took it and failed.
Mine ~15 years ago in the US had a required course that did some relational algebra and database normalization without ever touching a database (as a small part of a broader course, IIRC), and an elective where we did really simple stuff with a database (I came out of it not really understanding the difference between INNER JOIN and LEFT JOIN, for example).
At a Dutch CS program we had an SQL course that started with first-order logic and relational algebra and went on to project that into SQL. It also taught 3NF/4NF and BCNF, indices, R-trees, query planning and optimisation.
The one I took was elective. I am willing to admit that from a professional ROI perspective, it was one of the best uses of time in my entire life and easily worth hundreds of thousands of dollars.
I avoided learning SQL in the client-server training program that got me started in tech back in the late 90s, only to have to learn it on the job during a global ERP deployment a decade later. Should have taken the class, would have meant at least a few less sleepless nights.
Database education in CS programs in North America seems minimal. Most people I meet haven't taken any, or maybe just took a practical "how to query" course. In North America, it seems like only CMU has an active (excellent) program around this topic.
There's a woeful ignorance about what the relational data model is, how the industry arrived here, and how this is implemented. Problems or archaisms with SQL specifically become synonymous in people's heads with the relational model generally, and for a while that led down the quite problematic NoSQL road. Then slingshotted back to SQL -- but from my perspective SQL itself (not the relational model) is a problem. It doesn't compose well. It doesn't handle recursive relations well. It has an awkward syntax. It conflates concepts. It has an archaic datatype model. None of this is intrinsic to the relational model, but SQL becomes a limiting factor.
Disclaimer: I work for a DB company doing awesome stuff with the relational data model, but not SQL (RelationalAI) so am ... biased. Though I have always had those biases. :-)
I'm a faculty member at the Rochester Institute of Technology. We have multiple courses for both our undergrads and grad students in databases. I primarily teach Introduction to Big Data for our graduate students. We happen to cover all the things that you mention in your second paragraph :)
I've worked with many data scientists whose typical SQL usage was to load entire rows (or with simple 'with' filtering) into Pandas / R dataframes and then do all their work there.
I think it's a combination of Pandas and R having much simpler APIs and a ton of documentation on Stack Overflow, and modern hardware just being so good that you can load a big enough chunk of your dataset in-memory on a laptop.
I mostly use SAS. I tend to prefer plain SQL queries; where I typically depart SQL and jump into code is doing what SAS calls "By Group processing" (example https://support.sas.com/kb/26/013.html#)
I am not as familiar with R. Last time I worked in R (some years ago) the equivalent R code was something like this - caution, I'm no expert in writing R, so there might be a better/more intuitive way...
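For what it's worth, much of by-group processing maps onto window functions in SQL engines that support them. A rough sketch with a hypothetical visits table, where seq = 1 plays the role of SAS's first. flag and seq = group_size plays last.:

    SELECT patient_id,
           visit_date,
           ROW_NUMBER() OVER w                     AS seq,
           COUNT(*) OVER (PARTITION BY patient_id) AS group_size
    FROM visits
    WINDOW w AS (PARTITION BY patient_id ORDER BY visit_date);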
> Have they stopped teaching this in University or something?
It wasn't part of the mandatory coursework when I went to school. I really think it should be. It's not as though my CS program was 'pure theory', they taught lots of 'vocational' stuff but not SQL.
As for why people put off learning it.. I put off learning it longer than I care to admit because I thought I could simply do without it (ORMs, BerkeleyDB, etc), and fake it if put on the spot (I always understood the very basics, insofar as a SQL query can be understood if read as declarative English.) Ever since I bit the bullet and actually learned it properly, I've been kicking myself for not learning it upfront.
I think one of the main uncertainties in higher education the last couple decades is what balance, if any, they should strike between job training for software developers, coding as an engineering discipline, and preparation for an academic career in math.
Most programs include parts of all three, but SQL per se (rather than, like, relational algebra) is pretty firmly in the professional-training-for-software-developers set, and schools that reject that aspect may not teach it.
In my experience recent code school/boot camp grads have as much or more practical SQL experience as recent CS grads, probably because those schools are nearly totally focused on professionally applicable skills.
That may be likely, but schools which emphasize the math/theory part are then missing the opportunity to really teach the theoretical part of the relational-algebra stuff, which is really set theory and first-order logic. There's a lot of math-y stuff to dig into there.
It should be mandatory along with understanding O notation and data structures and algorithms, because really... unless you're doing a career which is like... 100% embedded development... you're going to be encountering databases and data modeling as part of your career.
I'd be much happier if CS programs would graduate people who did a whole semester of first-order logic, Date & Codd's foundational papers and why network & hierarchical databases are problematic, relational algebra / calculus, Datalog & friends, and then just toss in a "and this is how SQL does some of this but also mangles all of this..." at the end.
+ as a bonus, a DB implementation/internals course so people can understand what goes into query execution etc.
Because those people would then have some proper context before going off and butchering the world with ORMs and microservices...
The issue isn't usually competency, can vs. cannot. It is usually a matter of should or should not. I have seen databases thrown at problems where they decidedly should not have, and have seen the product development cycle suffer where it should not have. While some engineers and analysts relish CTE spaghetti -- solving complex logic utilizing enormous wads of SQL, others are more wary and rightly look to tackle complexity with more accountable and powerful tools when possible. I admit that tools like DBT can help address some of the shortcomings of SQL, but it still does not have the generality and conciseness of modern languages.
Anecdata point here, personally my SQL skills have lapsed because I've been incentivized to know NoSQL more. Same for frontend stuff like JS/React btw.
I would strongly recommend reconsidering the suggestion that SQL serve as a data engineering pipeline default. We use SQL heavily at our organization, and use it for broad, significant ELT workloads. Its use does help to interface across multiple departments to provide business intelligence across our organization. It does serve as a sort of data lingua franca and touchstone for analyst and engineer interactions. In essence, we use SQL, largely via DBT, where appropriate.
However, everywhere else, which comprises a significant collection of data pipeline services, uniform where possible but definitely heterogeneous, we default to Clojure; we also utilize Python where more appropriate. Of these 3 languages SQL clearly has the least generality, least testability, least API integration capability, and definitely the most awkward data processing capability. I feel it is naive to suggest it could serve as a default language for data engineering pipelines, particularly given the need to interact with cloud services and third parties.
I’ve found that SQL pipelines end up depending heavily on a specific database (SSIS packages for SQL Server), so they really suck when you don’t want to use that database any more, or when you still need something to coordinate and run all the SQL.
So if I have to have something to run all the SQL and test it, etc., why not do the pipeline there? You can still include SQL in the pipeline for the in-database parts.
But trying to do everything in the database can be unwieldy. For example, pulling a CSV from a file system, pulling JSON from an API, linking them, and storing the output as a CSV somewhere. Doing that in SQL is possible, but why?
I think having a mixed bag of tools for the pipeline is the right default. And something highly portable for the top layer of orchestration is probably not going to be SQL.
I agree for one-offs and for simple mappings. If I had to do this problem as part of some personal workflow used only by myself, then I would just use `pandas` or some equivalent for the entire thing and have it live in a jupyter notebook.
However, if the mapping is even somewhat complicated, or this pipeline has to be shared and productized in some way, then it would be better to load the data using some `pandas`-like tool, store it on a `tsql`-flavored database or data lake, and then export it as a .csv file using a native tool or another `pandas` equivalent again.
Having a pipeline live solely on a notebook that is passed around leaves too much risk for dependency hell and relying on myself to create the csv as needed is too brittle. Either have the pipeline live on its own container that can be started and run as needed by anyone, or dump the relevant data into the datalake and perform all the needed transformations there where the workflow can be stored and used repeatedly.
> Either have the pipeline live on its own container that can be started and run as needed by anyone
This is what I do. Each pipeline starts in a fresh conda environment, or sometimes entirely fresh container (if running in GitLab CI or GitHub actions, for example) and does a pip install to pinned packages. So every run is a fresh install and there’s no dependency hell.
Author here. Appreciate the feedback. I agree it's a tricky stylistic choice. I did think about it, and decided to go with it this time. The rationale was to avoid interrupting the flow by writing 'in my opinion' or 'I recommend' multiple times. I have tried to be balanced, though, and to explain that there is no 'one size fits all' solution. FWIW everything I reference in the blog is open source - I have no commercial motive.
One of the few lessons in school that is emblazoned in my mind is a writing class in middle school in the US (in central PA) where they deducted points for the use of first person in opinion pieces. The reasoning was that it's obviously an opinion piece, not law. Any use of first-person would be redundant.
SQL is simpler to understand for the majority of users. It's easier to get started without learning a toolchain, programming best practices, etc. that could present a challenge to new users. SQL is also great at representing relationships between datasets, and developing business logic transformations tends to be simpler and easier to understand. Oftentimes, when using another programming language you end up with many modules, imported libraries, code hacks and optimizations. It all can very quickly become difficult to read.
One example is dbt. It started with writing simple SQL models. With the introduction of Jinja, the majority of models look nothing like SQL anymore. You need to visualize your model to understand relationships. It took the beauty of SQL and mucked it up.
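To illustrate, a hypothetical dbt model in that style (made-up ref and column names), where the Jinja loop rather than the SQL carries the structure:

    select
        order_date,
        {% for m in ['revenue', 'tax', 'shipping'] %}
        sum({{ m }}) as total_{{ m }}{{ "," if not loop.last }}
        {% endfor %}
    from {{ ref('stg_orders') }}
    group by order_date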
At Upsolver we built a streaming+batch ETL tool that lets you build data pipelines in SQL. We did it because it's easier for non-data engineers to get started, easy to version and maintain as code, and easy to automate (not that you can't do this with other languages). The same goes for ksqlDB and Materialize; even Spark and Flink use SQL as a way to simplify onboarding for non-developers.
> One example is dbt. They started as writing simple SQL models. With the introduction of Jinja, majority of models look nothing like SQL anymore. You need to visualize your model to understand relationships. It took the beauty of SQL and mucked it.
I know you are pitching your startup, but that's just completely untrue.
It's a cultural thing, not a problem with the architecture. DBAs, analysts and scientists don't write tests in general, so when someone with that background builds a project you'll tend to see that. On the flip side, a lot of engineers seem afraid of the database, and since it lacks visible coverage metrics, they will skip writing tests for it.
It's actually pretty easy to write SQL test scripts for stuff.
Imho the biggest challenge is that testing prioritizes the zero-side-effects fully-isolated-to-one-class stateless unit test as king (which is good! That's the gold standard for a reason), which is the polar opposite of testing a giant ball of hyper-interconnected state that is an RDBMS.
I've had good luck writing scripts where the first one sets up preconditions, then it runs a million interconnected tests with no isolation between them, and then it tears down. It means you can't run a single test in isolation and you have to be aware of each test's postconditions, but it's a good "worse is better" solution for "I need to start testing now, not after I rearchitect my whole project to be isolatable and write a crapload of scaffolding and boilerplate". But every testing framework fights against this pattern.
I haven't had any problem testing with RDBMS, so I'm wondering if there's something I'm not understanding.
For me, a testing script will load a clean schema+procedures with no data (empty tables), and the schema never changes. Each test inserts a handful of rows as needed in relevant tables, runs the query, and then checks the query result and/or changed table state. Then DELETE all table contents, and move on to the next test. Finally delete the entire test database once you're done.
If you have queries that alter tables, then each test starts with a brand-new schema and deletes the whole database once done.
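Concretely, one test in that loop looks something like this (a sketch, hypothetical tables and values):

    -- arrange: a handful of rows in the relevant tables
    INSERT INTO orders (id, customer_id, total) VALUES (1, 42, 99.50);

    -- act: run the query under test
    SELECT customer_id, SUM(total) AS lifetime_value
    FROM orders
    GROUP BY customer_id;

    -- assert: compare the result to the expected rows (42 | 99.50)

    -- cleanup: back to empty tables for the next test
    DELETE FROM orders;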
The only kinds of tests this doesn't scale to are performance tests, e.g. is this query performant on a table with 500 million rows? But to me that's not the domain of unit tests, but rather full-scale testing environments that periodically copy the entire prod database.
So I'm curious why you consider an RDBMS to be hyper-interconnected state with interconnected tests?
My group is working on getting testing set up on the DB of a highly configurable SAAS product that's heavily normalized, so configuration for every feature is smeared across dozens of tables. It dates back from the "use procs for everything" era of development.
Thus, the code to set up preconditions would be as complicated as the entire service tier of the admin screens. So it makes sense to write the test scripts so that, instead of per-test inserts, it's:
1. "set up known good mostly-blank DB"
2. "Test the CREATE methods"
3. "Test the UPDATE methods"
4. "Test the DELETE methods"
5. "teardown"
* obviously it's not simple CRUD, this is just a simplification of how it goes.
It's not that this is an ideal workflow, it's just that "worse is better" here. This lets us get testing ASAP and move forward with confidence.
Of course, schemas change, and it's the migration of live data with those changes that are singularly most likely to break things. Will all your db access code still work? If you're normalizing everything through stored procedures, it gets somewhat harder still.
It's not just testing a single build up... but build ups and migrations from deployed, working options. Some solutions have different versions of an application/database at different clients that will update to "approved" versions in progress. (Especially in govt work)
Ah, got it. That workflow makes a lot of sense in that case -- not perfect but the most pragmatic.
I think the concepts of "application-owned RDBMS" vs "3rd party product that uses its own RDBMS" are the source of a lot of confusion sometimes.
If you manage your own database, testing shouldn't usually be particularly difficult. But when you're integrating with a third-party product, you generally can't effectively do unit tests. Just end-to-end integration tests of the type you're describing.
I'm in a similar boat with my current project, but am probably faring better than you (as described). I do work in the ERP space... so not unlike what you describe, 100s of highly normalized tables (the biggest of these systems I worked with had just over 1000 tables).
I'm currently working on a greenfield system in this class, and the database development workflow is specifically designed to help make things sanely testable.

First, the application is designed to be very granularly modular and each module handles its own database tables... and each module is fully testable by itself. Yes, there can be/are dependencies between modules, but only the direct dependencies ever have to be dealt with... when unit testing in the module itself, it's usually a pretty small slice of the database that needs to be handled. So what that means is any one component's test suite only needs to set the initial state for itself and its dependencies.

What I do then is each individual functional test can take that starting DB state, do its work in a transaction, and then roll the transaction back at the end of the test. This way, only the initial state created at the start of the process has to be dealt with. My "integration tests" will walk the business process statefully, using only the initial DB state for dependency setup and walking the business process of creating all the data the module itself handles.

Finally, unit test starting data is only ever defined in the component defining the tables which are to be loaded. This means that when dependents have their own tables, references are made to load that component's test data rather than trying to redefine a fresh set every time it shows up in the dependency tree.
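The per-test transaction is the key trick. Schematically (Postgres flavor; inventory.reserve_item is a hypothetical module function):

    BEGIN;
    -- arrange: module-local fixtures on top of the shared initial state
    INSERT INTO inventory.items (id, sku) VALUES (1, 'TEST-SKU');
    -- act + assert: exercise the module's procedure and check the result
    SELECT inventory.reserve_item(1);
    -- nothing persists; the next test sees the pristine starting state
    ROLLBACK;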
Anyway, as said earlier, this kind of code development testing discipline isn't common amongst DBA types (the ones best placed to figure out a good methodology for it), while a lot of application developers who have the testing discipline avoid the DB like the plague. So it never gets done until it absolutely has to be done. And then you end up ad hoc'ing it together.
Yeah, but SQL is even harder, as it implies an embedded engine just to execute a first "hello world" test.
This gets even more complicated when you start using vendor-specific extensions for string manipulation, json parsing etc in the data pipelines case. It might be literally impossible to execute your SQL outside of prod or a prod-like environment.
I would be delighted to see a tool grab enough mindshare to become the safe default testing tool for SQL code, in a similar way to pytest in Python world.
If you praise pytest as some unique test runner, I presume you haven't worked with many actually good testing engines... SQL has more than enough to compete with the stdlib of Python for testing.
If you can point me to testing frameworks that are better (as in more enjoyable to use) than pytest I would LOVE to hear about them.
(pytest isn't part of the Python standard library.)
Especially if you've got some suggestions for JavaScript, where I have yet to find anything that makes me enjoy test writing in the same way that pytest does.
Absolutely agree. I guess this is part of the tension and tradeoffs here: lots more people understand SQL, but many of them don't come from a software engineering background.
And historically it has been relatively hard to set up simple, fast running unit tests on SQL pipelines, though I think that's becoming easier nowadays.
For one employer I wrote a ton of tools that used SQLite as an in-memory datastore for manipulating data. All of the tools were written in Perl, and I found that if the data had any complexity then you needed an expert Perl coder to deal with the in-memory representation. However, if you simplified it down to a few SQL queries that updated/queried a SQLite database, then even the most novice Perl coder could use/maintain/enhance the tools. For really large datasets you could back the table with a file and get really good results - a lot of programmers can't deal with a dataset larger than core memory, but SQLite handles it with ease.
I found something similar, we moved a lot of logic out of Java and into SQL and PL/PGSQL. We realised that most of our business logic is mostly data transformation anyway.
Suddenly, the more advanced non-developers, who were logging into GitHub for other reasons, could understand the code, and even occasionally debug it or help us work through new features.
These people were domain specialists, so being able to tap into their skills and knowledge at the code level was amazing. And this was in addition to the very significant performance and code density benefits we achieved.
There were so many unexpected benefits from moving our logic into SQL that I would struggle to justify implementing future projects any other way.
I've joined projects for a few big data pipelines now and every single time people were using R (shudder) or Pandas for what ultimately could have been done with SQL (ideally with DBT, which makes testing much easier). And every time it was because data scientists apparently thought every single problem needed R or Pandas and SQL was beneath them.
Fortunately the field is splitting into data engineers who build the pipelines and data scientists who analyse the results. I think that’s a good thing as the pipeline should look more like software engineering.
The point that stood out to me was SQL and strong types.
I still don’t understand the fear of data types, even when dynamic-language programmers treat their variables as having invisible types anyway and leave the guessing game to the next guy.
There are only a handful of static values in the known universe that don’t have types or units. Intentionally avoiding types is irrational (math pun).
SQL is strongly typed, but not statically typed. It is dynamically typed. Like you say, it leaves you guessing what the variables might be and waits until runtime to blow up if you made a mistake.
I don't understand the fear of types either. We can let the lack of static typing pass for early revisions of SQL, as perhaps we didn't know any better back then, but how has modern SQL not caught up to the modern age of software engineering? Even JavaScript/ECMAScript is starting to introduce static typing features, and that's a low bar to contend with.
Most implementations of SQL are not dynamically typed - they are statically typed. There is an explicit compilation phase (`PREPARE`) that compiles the entire plan and handles any type errors.
For example - this query throws a type error when run in Postgres during query compilation without executing anything or reading a row of the input data:
CREATE TABLE varchars(v VARCHAR);
PREPARE v1 AS SELECT v + 42 FROM varchars;
ERROR: operator does not exist: character varying + integer
LINE 1: PREPARE v1 AS SELECT v + 42 FROM varchars;
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
The one notable exception to this is SQLite which has per-value typing, rather than per-column typing. As such SQLite is dynamically typed.
Types are static within the context of a query, not necessarily within the context of the lifetime of a database. Types are determined and propagated at query compile time. In a query, the value "v" refers to a VARCHAR column (as determined by the table definitions). At query compile time an error is thrown when a type violation is detected.
It is distinct from Python because in Python individual objects have types - and type resolution is done strictly at run-time. There is no compile time type checking because types do not exist at compile time at all. Note how in my example I am preparing (i.e. compiling) a query, whereas in your example you are executing the statement.
If you refrain from changing the types of columns in a database and prepare your queries you can detect all type errors before having to process a single row. That is not possible in a dynamically typed language like Python, and is very comparable to the guarantees that a language like Typescript offers you.
Types in Python are also static within the context of a statement. That's not particularly meaningful, though, as programs don't run as individual statements in a vacuum. Just like SQL, the state that has been built up beforehand is quite significant to what each individual statement ends up meaning.
> types do not exist at compile time at all.
Same goes for SQL. Given the previous program, it is impossible to determine what the type of v is until you are ready to execute the SELECT statement. If the CREATE TABLE statement fails, v will not be the VARCHAR you think it is. If someone else modifies the world between your CREATE TABLE statement and your SELECT statement, v also will not be the VARCHAR you think it is. All you can do is throw the code at the database at runtime and hope it doesn't blow up.
> whereas in your example you are executing the statement.
What do you think "" + 42 executes to, then? The thing is, it doesn't execute because it doesn't satisfy the type constraints. It fails before execution. If it were a weakly typed language, like Javascript, then execution may be possible. Javascript produces "42".
> and is very comparable to the guarantees that a language like Typescript offers you.
Not at all. Something like Typescript provides guarantees during development. SQL, being a dynamically typed language, cannot know the types ahead of time – they don't exist until the program has already begun execution – and will throw errors at you in runtime after you've moved into production if you've encountered a problem.
> Same goes for SQL. Given the previous program, it is impossible to determine what the type of v is until you are ready to execute the SELECT statement. If the CREATE TABLE statement fails, v will not be the VARCHAR you think it is. If someone else modifies the world between your CREATE TABLE statement and your SELECT statement, v also will not be the VARCHAR you think it is. All you can do is throw the code at the database at runtime and hope it doesn't blow up.
If you are working in a hypothetical adversarial world where people are changing data types out from under you then that might happen. That will also happen with any other shared data source, regardless of what language you might be using. If you are sharing a set of JSON documents and people start altering documents and removing keys then your Typescript program that ingests those documents will also start failing.
That is not a problem of the language but a fundamental problem of having a shared data source - changing to a different language will not solve that problem.
I think you are conflating SQL the language with a relational database management system. A relational database management system models a shared data source - and a shared data source can be modified by others. That introduces challenges no matter what language you are using.
> What do you think "" + 42 executes to, then? The thing is, it doesn't execute because it doesn't satisfy the type constraints. It fails before execution. If it were a weakly typed language, like Javascript, then execution may be possible. Javascript produces "42".
It is fundamentally different. Given a Python program, you cannot know whether or not it will produce type errors at run-time without executing the program. Type checking is done at run-time as part of the execution of the "+" operator. In SQL, type checking is done as a separate compilation step that can be performed without requiring any data. In Python the object has a type. In SQL the variable has a type.
Given a fixed database schema and a set of queries, you can compile the queries and know whether or not the queries will provide type errors. You cannot do this with Python.
> Not at all. The benefits of something like Typescript is that the guarantees are provided during development. SQL cannot know the types ahead of time – they don't exist until the program has begun execution – and will throw errors at you in runtime after you've moved into production if you've encountered a problem. That's exactly when you don't want to be running into type programs, and why we're largely moving away from dynamically typed languages in general.
SQL can provide the same guarantees as Typescript. However, unlike Typescript in which the variables are fixed, SQL always deals with processing data from an external data source - the tables. The data sources (i.e. tables) can be changed. That is not a problem a language can solve.
> Given a Python program, you cannot know whether or not it will produce type errors at run-time without executing the program.
Like you mentioned earlier, if you refrain from changing types, a compiler can theoretically determine Python types as well. The Typescript type checker when run on Javascript code is able to do this. But that puts all the onus on you to be super careful instead of letting the language hold your hand.
> SQL can provide the same guarantees as Typescript.
1. It cannot as it does not provide guarantees about its memory model. Declaring a variable as a given type does not guarantee the variable will be created with that type. Typescript does guarantee that when you declare a type that's what the variable type is going to be.
2. As SQL is dynamically typed, allowing you to change the type of a variable mid-execution, what type a variable is depends on when that statement is up for execution. And as SQL is not processed sequentially, at least not in a meaningful sense, it is impossible for a compiler to determine when in the sequence it will end up being called upon.
Again, you are still conflating tables in a relational database with variables in Typescript. While I can understand the confusion - given that SQL makes it so natural to interact with tables - they are not equivalent.
SQL as a language is statically typed. It has variables that are statically typed [1], much like Typescript. All columns in a query have fixed types, and there is a separate compilation phase that resolves types of parameters - much like any other statically typed language.
SQL generally operates on tables - and tables model a persistent, shared data source. The Typescript equivalent to a table is JSON files on disk. There is a fundamental need to be able to change data sources as time goes on and business requirements change. That is why SQL supports modifying shared data sources, and supports operations like adding columns, changing column types and dropping columns. SQL deals with the fact that changes can be made to tables by rebinding (recompiling) queries when a data source is changed.
Any hypothetical language that will replace SQL will have to deal with the very same realities of people wanting to modify shared data sources. A different language cannot solve this problem because there is a fundamental need to change how data is stored at times.
Perhaps what you are looking for is better tooling around this problem. SQL can be used to power such tooling, because of its strong static type system that works alongside the persistent shared data source. For example - you could check if all your queries will still successfully compile after changing the type of a column.
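A minimal sketch of that check, reusing the varchars table from earlier (Postgres flavor):

    -- After a migration, recompile every known query without touching data;
    -- any type breakage surfaces here rather than in production.
    ALTER TABLE varchars ALTER COLUMN v TYPE INTEGER USING v::integer;
    PREPARE check_v1 AS SELECT v + 42 FROM varchars;  -- compiles against the new type
    DEALLOCATE check_v1;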
> Again, you are still conflating tables in a relational database with variables in Typescript.
No, we're only talking about SQL here, and specifically to the example you gave. It demonstrated strong typing, but not static typing. I am not sure where you think Typescript comes into play with respect to the core discussion.
> SQL as a language is statically typed.
As before, SQL defines the ALTER statement. SQL is no doubt strongly typed. Once you declare a type it will enforce that type until such time as it is explicitly changed (SQLite notwithstanding), but it can be changed. To be static means that it cannot change.
> SQL generally operates on tables - and tables model a persistent, shared data source. The Typescript equivalent to a table is JSON files on disk.
That's but an implementation detail. Much of the beauty of SQL, the language, is that it abstracts the disk and other such machine details away. Conceptually all you have is a machine that has functions that operate on sets (or vectors, perhaps, since SQL deviates from the relational model here) of tuples. Perhaps your struggle here is that you are hung up on specific implementations rather than the concepts expressed in these languages?
> Any hypothetical language that will replace SQL will have to deal with the very same realities of people wanting to modify shared data sources.
That's not a reasonable assumption. In practice you don't want arbitrary groups of people modifying a shared data source. You want one master operator that maintains full control of that data source, extracting the information other people need on their behalf. Having n users feed a virtual machine code fragments to build up an entire application is an interesting concept, but not one I'm sure has proven successful. It turns out we've learned a lot about software development since the 1970s. These days we usually build a program that sits in front of the SQL program to act as the master source in order to hide this grievous flaw, but a hypothetical SQL replacement could address it directly.
> SQL can be used to power such tooling, because of its strong static type system
Let's go back to your original example:
CREATE TABLE varchars(v VARCHAR);
ALTER TABLE varchars ALTER COLUMN v TYPE INTEGER USING v::integer;
PREPARE v1 AS SELECT v + 42 FROM varchars;
What type is v? Well, it could either be a VARCHAR or an INTEGER. It depends on when the machine gets the `PREPARE v1 AS SELECT v + 42 FROM varchars;` statement. SQL makes no claims about order of operations, so PREPARE could come at any point. Therefore it is impossible for such tooling to figure out what type v is. If SQL were statically typed you could derive more information, but as it is dynamically typed...
Now, you said before that each statement, when observed in isolation, is statically typed. While I suppose that is true to the extent that the type won't randomly change in the middle of the execution of that statement, the statement alone doesn't provide type information either. Parsing `PREPARE v1 AS SELECT v + 42 FROM varchars;` in isolation, we still don't know what v is.
A data engineering pipeline is basically ETL. SQL is for running reports.
ETL can have many variants, especially with so many different kinds of data source, structured and unstructured. A new kind of data source (e.g. video) would require a new way to extract useful data, transform it into some useful form (e.g. products mentioned in the video), and load it into the common stores (e.g. an RDBMS). Then you can use SQL to manipulate the data and run reports.
SQL might be the right choice in many or even most situations, but I'm really skeptical whenever somebody says there's a "default choice" in technology. Every system has certain strengths and weaknesses compared to others. If you're only reaching for one technology choice without first assessing the project in context with all of the tradeoffs involved, I don't think you're doing things right.
I'd say that a SQL variant is probably the right first choice for most software projects where server-stored data is involved, as much as I enjoy other options (Mongo, Cassandra, BigTable, Dynamo, etc.) in different scenarios.
A lot of this comes down to the fact that many developers do use SQL first, especially in Java and C# circles (corporate IT developers in particular). So staying closer to what people are familiar with has value.
I will generally push for PostgreSQL or CockroachDB over other SQL varieties though. MS-SQL is okay, Oracle can be a pain, and both are costly. Not a fan of Maria/MySQL, only because of old behaviors; every time I've used it there's something annoying (utf8 not actually being UTF-8, as an example).
In either case, a SQL database service can often scale to the low millions of users if you're pretty good with how you structure things. Even a poorly constructed database and application can generally scale to at least thousands of users without issue. As most application development is for internal business applications, that's usually enough.
A default choice really means checking your context against the weaknesses of the default. If you aren't running into something the tool is specifically bad at, use it.
Default choices are an agile optimization -- they make sure you don't spend a lot of time on solving problems that you might not have. The detailed tradeoffs for the best and worst tool for a job are context dependent, and that context changes over time. Your default choice will be good enough under most contexts.
Note in the comments here people have used ELT instead of ETL a few times. It's not a typo, doing the "load" before the "transform" specifically refers to copying the data into the database without changing it, then doing the transformation within the database (probably with SQL).
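A minimal sketch of that flow, assuming PostgreSQL and invented file/table names:
-- Load: land the raw file in the database untouched (one JSON document per line)
CREATE TABLE raw_events (doc text);
COPY raw_events FROM '/data/events.ndjson';
-- Transform: reshape inside the database, in SQL
CREATE TABLE events AS
SELECT (doc::jsonb ->> 'id')::bigint AS id,
       (doc::jsonb ->> 'ts')::timestamptz AS occurred_at,
       doc::jsonb ->> 'type' AS event_type
FROM raw_events;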
Yes, this was my world for a number of years too. And now "suddenly", it's not. All the "niche" stuff I started hearing about in the early 2010s seems to have hit critical mass to even be the norm in boring industries' enterprise world.
I kind of think the pivoting to a new concept keeps arising because there is no magic solution. Data is hard. And yet since everybody is pretending to be experts in data mining, AI, ML, corporate execs feel left behind because they either don't have these people, or the team they do have isn't producing much, if anything, of value.
Every company/enterprise has more and more data yet it isn't translating to knowledge. The existing data infrastructure isn't working, some new idea comes along and gains steam and everyone thinks it will fix their problems, it won't, and then it too will become the solution that needs to be replaced by the next new idea.
Agree with the OP that SQL will almost assuredly still be in use 20+ years in the future, given the simplicity and flexibility of the declarative language, its standardization, and the fact that it is as applicable to today's big data problems as it was to yesterday's.
But as an engineer, I much prefer getting a notebook over a 1000-line plate of SQL spaghetti.
Some of our data scientists prefer SQL and that's fine. We figure out how to speed it up and ship it.
But I've gotten a few of them onto the PySpark+notebook train and it is just a much more productive way of working, IMO. We can extract things into functions with docs & linting. We can easily look at intermediate sub-queries. Yes, you "can" do all that in SQL, but the open-source tooling for Python is just really nice.
If you want to do it all in SQL, DBT gives similar capabilities. Modern data warehouses + DBT is really why there is a lot more "just use SQL" talk in the data engineering world recently.
Yeah, except that SQL is not composable, and managing thousands of lines of SQL in a repository with little more than partial means of abstraction (e.g. views, UDFs, etc.) is unattractive.
So any transformation that requires accessing something other than the data must be done in SQL? Connecting to an FTP server, parsing a YAML file, getting data from a REST API? And then text, date and maths operations also in SQL?
Yeah, no.
SQL is good at lookups and filtering, but not at linking heterogeneous sources and massaging data.
I recently ran into this dilemma. I was tasked with writing a feature to allow importing a bunch of CSV into a system. Do I 1) stand up Azure Data Factory and build a pipeline to take the CSV from blob storage, massage it, and upsert the db or 2) write a few lines of C#+SQL to suck in the CSV and do the same? #2 is not quite as trivial as it may seem in real-world. Customer wants preview of data before update, undo capability, etc. But even so, I went with #2. I'm wary of doing things because they seem "easier" to me, but these huge data pipeline tools just seem like overkill at least for the use cases I'm tasked with.
Most people don’t actually have a big data problem. If you can load and manipulate your CSV in memory within your app then it’s not worth all of that data infrastructure.
2) seems very reasonable to me. As a Python user, I'd have done something similar using the Python duckdb library (i.e. steering clear of any heavyweight tools).
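For concreteness, a minimal sketch of what that data munging might look like in DuckDB's SQL dialect (file, table and column names are invented, and the customers table is assumed to have a primary key):
-- Stage the CSV as-is; DuckDB infers column names and types
CREATE TABLE staged AS SELECT * FROM read_csv_auto('customers.csv');
-- Preview before committing, for the customer's sign-off
SELECT * FROM staged LIMIT 20;
-- Upsert into the live table once approved
INSERT OR REPLACE INTO customers SELECT * FROM staged;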
While this is true, it's also not moving the needle forward. HTML and CSS are great at building views, but you know what is easier to work with? JSX.
All of these specialized tools do their job well, but it's empirically easier and faster to just work in a single language to accomplish your project goals if at all possible.
Node didn't get popular because it was the "best" framework for apis. It got popular because a bunch of people who already knew JS could start writing server code.
Nothing trumps developer experience, and I hardly know anyone who could honestly say SQL is their favorite language.
The biggest win to me is: when your data pipelines are in SQL, changes and maintenance can be somebody else's problem.
I've had a ton of success asking our marketing and business teams to own changes and updates to their data - very often they know the data far better than any engineer would, and likewise they actually prefer to own the business logic and to be able to change it faster than engineering cycles would otherwise allow.
We do the same thing with our business logic & product configuration. I still haven't found something I couldn't expose as a SQL configuration opportunity.
Even complex things where you have to evaluate a rule for multiple entities can be covered with some clever functions/properties.
With some oversight this can be fine, but non-engineers can easily end up making the SQL dramatically slower, not to mention getting things wrong (most commonly misusing inner joins, or letting NULLs into WHERE ... NOT IN subqueries).
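The NULL trap in particular is easy to demonstrate; a minimal sketch with invented table names:
-- Looks reasonable, but silently returns zero rows if any archived.id is NULL
SELECT * FROM orders WHERE customer_id NOT IN (SELECT id FROM archived);
-- NOT EXISTS sidesteps the three-valued-logic surprise
SELECT * FROM orders o
WHERE NOT EXISTS (SELECT 1 FROM archived a WHERE a.id = o.customer_id);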
I've seen plenty of devs and DAs do the same. The nice thing about SQL is that it's easy enough to create a query that gives you what you want; the catch is that at scale it can fall flat or perform poorly. It's easy enough to work through most of the time, but too many people lack the understanding of when bottlenecks are likely to happen, and of whether they will become an issue.
I think of more than a couple basic joins in a query to be a code smell.
In geo analysis, you start with raw data and then apply a series of transformations to it. The idea that all these transformations could be summarized in a chain of SQL commands fascinated me.
I took that with me and apply it frequently every time anything even remotely resembles an ETL: "could it be done in SQL?"
I'm not happy with SQL for querying data. Joins, windowing, subqueries and CTEs are often quite clunky, and CTEs in particular often don't do what you intend (full table scans, poor performance).
SQL is a great standard, because you can use it everywhere, but it's not perfect for everything. You can feel that it comes from ancient times.
You could also argue that the 20th-century relational database is a terrible way to model information, as it is so different from the way we humans actually do. Hierarchies are key to information modeling for us, but a disaster with relational. We do not store information in countless tables with increasingly awkward joins between them; we think in terms of relations between things and their hierarchies of relations. One thing is not in 20 different places referenced via a key id, it's just in one place, mentally linked in relation to many others.
For this reason, the noSQL databases, like Apache Solr etc. (which are open source at no cost), are far superior at capturing information in a more human way, and do it at far bigger scale and faster. So I would say the last thing you would want to start with on any modern knowledge/information system would be SQL.
Funny, to me relational databases seem to fit human thinking much better.
Hierarchies aren't a disaster with relational, they're trivial. A table maps product attributes to products, products to orders, orders to customers.
But relational allows you to escape the limitations of strict hierarchy. Because another table also maps products to suppliers, and orders to shipments, and orders to customer satisfaction surveys.
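A sketch of the point, with invented tables: the same orders table participates in several hierarchies at once, each through a plain foreign key:
CREATE TABLE orders (
  id          INTEGER PRIMARY KEY,
  customer_id INTEGER REFERENCES customers(id), -- the customer's hierarchy
  product_id  INTEGER REFERENCES products(id),  -- the catalogue hierarchy
  shipment_id INTEGER REFERENCES shipments(id)  -- the logistics hierarchy
);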
People think with hierarchies yes, but those hierarchies are built out of relations. I'd actually say we think in relations, and that some of those relations can be expressed as hierarchies, but others (sometimes most) can't at all.
If you're using "countless tables" then you might have a problem with your modeling. And if you're using "increasingly awkward joins" you might be writing your joins wrong, or again modeling the data in a confusing way. Or you just have a prejudice against joins -- they're no more "awkward" than pointers in C. Database normal forms exist for a reason, because they ensure you're storing data in the way relational databases are designed to work with it.
I'd say the first thing you should start with on any modern knowledge/information system is SQL, and only migrate away from that if you're absolutely sure it's necessary.
The whole no-SQL era was caused by various performance issues that made the SQL databases of that era unusable, or difficult to work with, for certain kinds of workloads and data needs. As SQL databases evolved to support these workloads with features like JSON fields, the need for highly specialized database servers is less than it was, but there are certainly places where noSQLs shine: cache and data structure servers, schemaless persistence where everything is defined in frontend code, huge scale, etc.
> the noSql databases, like Apache Solr
SOLR (and Elasticsearch) is a search engine, not a database, and should never be confused with one.
> We do not store information in countless tables with increasingly akward joins between them,
The whole "no joins" thing was really silly as the first thing that you do when working with a noSQL is write code like this parent=someThing.get(someObject); children=someObject.get(childrenMatchingCriteria); Which is a join. What Mongo and other noSQLs did was ship a reasonably performant and more importantly, easy for the developer way to store JSON objects. I spent a good number of years building on CouchDB, Firebase, DynamoDB and Mongo because it was easy to go fast, especially where the back end was just storing settings objects... Ironically, every project eventually needed two things that noSQL was supposed to not need: schemas and joins... so we implemented them in code (often in a React, Angular or Vue frontend).
The primary issue is if you are trying to model information the way humans use it and need it, relational tables are just not even close, hierarchies are a disaster, etc. We humans don't have/use tables mentally.
The answer here is, like all things in tech, it depends. It depends on what the data is. It depends on what you want to do with the data. It depends on what expectations are being made about that data.
> We humans don't have/use tables mentally
I just ate lunch. When I stepped in they wrote my name on a list with the number in my party. The server wrote my order on a paper with one grid line per item. Two tables, the old fashioned way. Tables are very natural for people. Evidence: the enduring popularity of spreadsheets, and lunch.
Humans think in terms of categories of thing. The reason things live in the same category is that they share the same attributes.
The relational model works really well for this.
I'm ready to be convinced otherwise, but I'm going to need some really good examples of "human data" that fits NoSQL systems but can't be reasonably represented relationally.
Try doing recursive hierarchical traversal with SQL, something we humans do all the time, like realizing something is a person over there, and then instantly having access to all the inherited properties of a person, and then predicting what might happen next based on that, etc.
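For what it's worth, this particular challenge is expressible in standard SQL with a recursive CTE; a minimal sketch, assuming a hypothetical things table with a parent_id self-reference:
-- Walk up the hierarchy from one thing to all of its ancestors
WITH RECURSIVE ancestry AS (
  SELECT id, parent_id, name FROM things WHERE id = 123
  UNION ALL
  SELECT t.id, t.parent_id, t.name
  FROM things t
  JOIN ancestry a ON t.id = a.parent_id
)
SELECT * FROM ancestry; -- the thing plus every ancestor it inherits from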
What you're saying sounds good on paper, but I don't think real-world experience backs it up.
I've not seen a non-relational system really have sticking power for human knowledge representation yet, despite many people making many attempts at it.
Meanwhile, most SQL databases include excellent support for JSON column types now, if you want to mix in some data that doesn't fit neatly into rows and columns.
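e.g. in PostgreSQL, a sketch (table, column and key names are invented):
CREATE TABLE people (id INTEGER PRIMARY KEY, profile jsonb);
-- Query into the schemaless part with ordinary SQL;
-- the ? operator checks for an element in a jsonb array
SELECT id FROM people WHERE profile -> 'hobbies' ? 'karate';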
And why is that? What do you get out of relational in that case that Solr doesn't give you? The Solr boolean query interface is very impressive and very fast. The caveat is you need a good essentialized schema.
I would call joins a bug of relational rather than a feature. If you have all the data in one collection, you don't need or want joins. As for transactions, we handle those server side (in our own system): each request gets what we call a temporary awareness, which only writes once all operations have completed successfully, before importing the state changes into the primary awareness for that user.
Every large system I've ever encountered that's built on a NoSQL system such as MongoDB or Elasticsearch has inevitably ended up reinventing joins, poorly.
We have 1 collection with 4 doc types (attribute, state, event, process) and 14 relations between them, so a join concept makes no sense for us. I think the schema is the key issue in having success vs not. (Apache Solr - billions of docs)
Yes, humans think in hierarchies, but they think in multiple hierarchies.
You’re in one hierarchy at work, another in your family, another at your Karate dojo.
This cannot be represented in a single hierarchy. In non-relational you therefore end up duplicating data and suffer the associated ACID and performance issues. The only reason people are fine with it is because they don’t care about data integrity and won’t fix these problems in hindsight, but that’s what the non-relational database developers tell them to do since they declare it’s not their responsibility.
You wind up defining relationships in code instead, across dozens of API calls without a central spec and it's super easy to get wrong. If it were as easy as you say we'd all be using MongoDB now.
I tend to think of a classifieds site as pretty much the ideal case for a typical noSQL approach, with Mongo especially being a great fit. Today, I might use PostgreSQL/Cockroach with some fixed columns and a JSON column for extra bits based on the type of classified.
Not everything fits into this mold; it's a matter of mix and match today, and there are no clear rules for how to scale as needed...
That would be a different approach than ours, as our entire schema consists of 4 doc types with 14 keys/functions which models the universe, and then we use specifications in the data to control inference and system operations etc. So it's becoming a no-code system powered by the specs captured in the data.
We do both documents, the internals of documents and a full model of space-time with our digital twin technology, but we've spent decades arriving at an extremely essentialized lower-ontology schema with which we can model everything; that's really the secret to it.
At my shop we do exactly this, we do ELT as opposed to ETL. (E=Extract, L=Load, T=Transform).
We put our input files (think of JSON documents) in the database, and then the data processing is a materialised SQL query.
This has a couple of benefits:
- The SQL dumps are very light, as they consist only of the input files; the materialisation is just a query, so there is no need to store the transformed data.
- Sometimes there is an error in the business logic. How do we backfill? Easy: we update the body of the materialised SQL query and then refresh the materialisation.
One caveat: transactions are very hard.
We find this great for batch use cases.
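In PostgreSQL terms, a minimal sketch of that setup (table, key and view names invented):
-- Raw input documents, stored untransformed (the L in ELT)
CREATE TABLE input_files (id SERIAL PRIMARY KEY, doc jsonb);
-- The business logic lives entirely in the view definition (the T)
CREATE MATERIALIZED VIEW report AS
SELECT doc ->> 'customer' AS customer,
       SUM((doc ->> 'amount')::numeric) AS total
FROM input_files
GROUP BY 1;
-- After fixing the view's logic, backfilling is just a refresh
REFRESH MATERIALIZED VIEW report;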
Exciting to see progress in this space in recent years:
- Materialize, for incremental maintenance of materialised views
- Postgres has a patch in the works to support incremental maintenance of materialised views.
> These alternative tools were developed to address deficiencies in SQL, and they are undoubtedly better in certain respects.
Specifically, dplyr and pandas are handy because they live inside a complete programming language, which is something you might need if you are working on a data project.
One of the tricks that I have used with SQL, particularly during development, debugging and unit testing is to use the pattern of INPUT + PROCESS -> OUTPUT.
INPUT is the transactions we wish to process, and PROCESS is the SQL that takes the INPUT and the set of tables required for processing (essentially an SQL join) and creates the OUTPUT table. This is an atomic step and can be defined using a CREATE TABLE XX AS SELECT ...; statement. It is quite fast, and even for large tables it can be parallelized (using the parallel query extensions of the database engine).
It is also testable by comparing the INPUT with the OUTPUT and ensuring that the OUTPUT meets the specs.
In batch environments, this is an invaluable trick.
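A sketch of the pattern, with invented table and column names:
-- PROCESS: one atomic step from INPUT to OUTPUT
CREATE TABLE output_batch AS
SELECT i.txn_id, i.amount, r.account_name
FROM input_batch i
JOIN reference_accounts r ON r.account_id = i.account_id;
-- Test: every input transaction must appear in the output
SELECT txn_id FROM input_batch
EXCEPT
SELECT txn_id FROM output_batch; -- an empty result means nothing was dropped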
I find SQL very hard to use when the data schema and/or transformation graph needs to be dynamic (e.g. depends on the data itself). It's hard to make SQL dynamic even at build time -- Jinja+SQL is one of the worst development experiences I have ever had.
I use graph database, and resources of the graph are typed with types/supertypes. Relationships also are typed with types/supertypes.
And my queries are heavily dependent on that typing structure.
Honestly, I cannot live without that feature.
[sorry, that is my OOP minute. Continue without me...]
Inside the "from:" area, those alias and fields and crosstab keys are all at the same level.
But... "alias:" and "fields:" correspond to features of the from clause itself, whereas "crosstab:" is a special custom thing (a macro?) which ends up being compiled into a call to the identically-named crosstab() feature.
So to understand the YAML, I need to understand both how the crosstab() feature in PostgreSQL works AND how that YAML DSL represents "crosstab:" calls using a different syntax.
I looked up crosstab in the PostgreSQL manual but that information won't help me figure out the DSL.
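For reference, PostgreSQL's crosstab() comes from the tablefunc extension and pivots a (row, category, value) query into columns; roughly (names invented):
CREATE EXTENSION IF NOT EXISTS tablefunc;
-- Pivot quarterly sales rows into one column per quarter
SELECT * FROM crosstab(
  'SELECT region, quarter, sales FROM q_sales ORDER BY 1, 2'
) AS ct(region text, q1 numeric, q2 numeric);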
My goal is to build a composable and consistent DSL.
It should have as few keywords and structures as possible. Of course there are some "fixed" keywords, like crosstab above (due to inconsistent SQL syntax, I think).
"alias" + "fields" helps you build "as (field1, field2,...)" kind of thing.
"select" + "from" must form a normal "sql query".
The problem is that the default/naive mode for an ETL in most ACID SQL implementations is "lock every table you're touching and give zero feedback as to how long it will be locked".
Eventually you end up writing more and more cryptic code to work around the fact that this is not something supported out-of-the-box in any mainstream RDBMS.
Now, should it be better? Would I expect a vendor-supported solution like Microsoft SQL Server to provide a simple tool that lets me write a MERGE statement with the understanding that I want it to be eventually consistent, rather than expecting me to spend weeks learning yet another tool with its own DSL to handle this case? Absolutely.
I agree, and disagree. Thankfully new technology ZZZ challenges the current paradigm XXX and ensures continuous improvement in area YYY. Also XXX picks up the cool stuff from ZZZ (occasionally).
As someone who works with 10s to 100s of TB of data from SQL and SQL-like DBs on a daily basis, writes lots of analytical code, I can say, emphatically, BS.
Unless your analyses are trivial (column mean, max, etc.), chances are you need a real programming language behind you. And when you are working with dataframes that are 300+ million rows (glances over to his work machine, yup, that's a smaller analysis I need to run), you need a compiled and fast language for this, which can run in parallel on multiple threads and machines.
You aren't using SQL for this. You aren't likely using pandas for this, or pyspark.
There is a ton that can be done in SQL to transform data after it is bulk exported into staging tables. I really do like the transition from ETL to ELT. It's kind of nice, as the DB is usually really beefy in terms of compute, and if it's distributed like, say, BigQuery, it scales seamlessly (provided your credit card has a high enough limit) and you don't have to worry about all the distributed-systems stuff you might have to deal with if you were running a Spark cluster on your own.
I would say it may be easier to transform data into a canonical data model from outside the SQL sphere. Parsing flat files or XML files in SQL was common in the past. Today, there are tools with capabilities beyond SQL that can extract data and populate data sets. A clear separation between the responsibilities of data ingestion and data transformation makes the architecture more robust. SSIS, for instance, can easily become messy because things end up being coupled in unintentional ways.
It's a really interesting piece of technology that emerged in the last few years.
It's kind of a cross between SQLite and columnar analytical databases such as Snowflake or Cassandra.
You can run it happily on a laptop, but because it's column oriented it can calculate aggregates (group by/count queries for example) incredibly fast - so it's better for a lot of analytical workloads than regular row-oriented databases.
Cassandra isn't columnar, it's row-oriented. The original authors coined a new term "column family" which is just a nonsense (and confusing) term, meant to describe its original relatively unstructured API, which has unfortunately stuck in all descriptions of the database, and leads to confusion like this.
Cassandra is also not an analytics database, it is intended for "OLTP"-like workloads.
Thanks. I wanted another example other than Snowflake so I did a quick Google search and grabbed the first example from the top search summary box that I recognized.
I always considered Cassandra's 'column' designation a guarantee that OLTP transactions will apply atomically to column groups; or that you'll minimize IO writes for a row to the column group (as opposed to writing a new version of the entire row). But this is actually just a wild guess. Anyone want to chime in and confirm/deny?
DuckDB is a no-dependency SQL engine that's available from multiple programming languages. A bit like SQLite but optimised for analytics.
For instance, in Python, it can be installed using `pip install duckdb`. For things like unit tests, it can be very valuable because no additional services need to be running.
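e.g. a self-contained fixture is just a few statements in DuckDB's SQL dialect, with no server to stand up (the transformation under test here is invented):
CREATE TABLE test_input (id INTEGER, amount DOUBLE);
INSERT INTO test_input VALUES (1, 10.0), (2, -5.0);
CREATE TABLE expected (id INTEGER, amount DOUBLE);
INSERT INTO expected VALUES (1, 10.0), (2, 0.0);
-- Run the transformation under test and diff it against the expected rows
SELECT id, GREATEST(amount, 0) AS amount FROM test_input
EXCEPT
SELECT id, amount FROM expected; -- an empty result means the test passes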
I generally agree, but the lack of support for, and inadequate speed of, PIVOT-like operations stops it from being true. https://stackoverflow.com/tags/pivot/info Compare to tidyverse pivot_wider() and pivot_longer(). No contest.
Much of my current work is supporting a large data warehouse in the energy industry. Most of the ETL code is SQL. I hate it but I know that there's no good alternative. Lots of the ETL stored procs are >1000 lines of code. There is lots of duplication - of code and data. There is no testing. There is little documentation.
It's interesting that mapping templates for AWS products like API Gateway and AppSync use VTL and JavaScript for their data transformations instead of SQL.
I haven't seen SQL used as a transformation from api input to a table request, but it sounds interesting to explore
> it is the most dense language in essential complexity I know
QUEL is more dense for equivalent functionality:
replace users (preference = "blue") where id = 123
Not to mention that it actually adheres to relational calculus, unlike the wild and reckless SQL.
> * type checking
It is only dynamically typed, though, which isn't all that useful. There is good reason why we are seeing static type systems being bolted on to most dynamically typed languages these days (e.g. Typescript). SQL would do well to add the same, but it seems it is seen as a sacred cow that cannot be touched, so I won't hold my breath.
I write complex SQL queries regularly, and I can say that SQL is far more concise than imperative code.
Relative to its own semantics, SQL is actually verbose, but those semantics are so powerful that you still come out ahead on size.
Please share how many lines of SQL it takes to statistically normalize multiple columns, or even something simple like null-cleaning columns based on dynamic thresholds.
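For reference, a z-score normalization is short per column in standard SQL, but tellingly it must be spelled out once for every column, with no way to loop; a sketch with invented names:
SELECT
  (height - AVG(height) OVER ()) / STDDEV(height) OVER () AS height_z,
  (weight - AVG(weight) OVER ()) / STDDEV(weight) OVER () AS weight_z
FROM measurements;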
In my experience SQL gets unreadable fast when it comes to more complex queries. I prefer PySpark, where you can use Python functions and classes to structure your code.
Surely, AutoHotKey should be the default. Most «data moving people» are not programmers. Okay, maybe that’s «data pipelines», not «data engineering pipelines», but…
Copy/paste is the most widely used data pipeline in the world.
Sounds to me like a natural winner for data management.
[but AutoHotKey is definitely a valid #2 !]
Please no. SQL is more like the assembly language of databases: close to the database "metal" but not really suitable for elegantly expressing data transformation logic.
In fact this analogy may suggest that what we are missing in this space is higher-level SQL dialects that "compile" to SQL.
See the last part of the blog post! There are numerous promising attempts to do this. The problem at the moment is that most are at quite an early stage and it's unclear which will become popular.
It appears to me that many spreadsheet users have started to realize that, for spreadsheets being used to store data, there's a lot of value in sticking to a rigid rows-and-columns format rather than messing around with creative layouts.
At which point importing them into and out of database tables becomes a whole lot more feasible.
TLDR: SQL feels most natural for joining and aggregating data sets; Python is favorable for filtering and transforming data due to its higher flexibility and extendability