Hacker News new | past | comments | ask | show | jobs | submit login
Treating SQL like code (devgenius.io)
41 points by sebg 14 days ago | hide | past | favorite | 58 comments

The esteemed Google monorepo. Piper is a beautiful thing. Trunk-based development with rapid CI/CD. Everyone with access is essentially a knowledge worker and can create DAG-like data pipelines with their SQL code. There is no data team—they are all data team. They also have access to SQL macros and functions that have a global scope and evolves over time as more and more people build. Couples with a high performing database called F1, this is the secret to their success.

I think that many people forget that version control is part of your delivery pipeline, not just a way to “store and move code”. I think there are possibilities to use the version control tools to much more than just that. GitOps touches that but I believe there’s a lot more to it. I think there’s some of it in what you describe here.

Yes! VCS hasn’t been fully utilized yet. All pipelines need versioning and most things are a pipeline. This is an exciting area of exploration

I’ve been looking for something like piper, is there only perforce?

Microsoft chose to extend git for their internal source control, and Facebook went all-in on Mercurial+extensions. Google also has a Mercurial+extensions interface to their monorepo. The sheer size of their monorepos mean that tools that don't work for them will work elsewhere.

In particular, Git has made improvements recently working with giant (as opposed to merely bit or still-too-humongous-to-manage) repos so it's now workable after massaging the config.

The other alternative to Perforce is SVN. There have been some improvements with it that make it more workable in today's git/"decentralized" world. Not knowing your particular use case it's hard to say if that's at all appropriate.

dbt + dagster + GitHub

I've consulted at so many places that just wrote stored procedures/functions and put them in the DB but never put those in source control and managed them like code. Can't tell you how many times they'd split the SQL box onto two separate machines to support different clients and growth and then forget to update one copy of procs on one of the machines causing all sorts of hell.

I found it a lot in small to medium sized business which were transitioning from the couple of people who wrote all their software to a team needing to do things properly and consistently. But I also found it at 50 person dev teams, which was weird and the worst.

I've generally always managed it as code, just easier that way.

My experience also. This is particularly bad in the SQL Server space, since Microsoft essentially dropped source control integration in SQL Server Management Studio 2016 and later (there's an official workaround[0], but it's not well received).

Most likely they're happy to offload version control to add-ins. There are a few, but most of them are way out of budget for small businesses or small IT departments. Which means version control usually doesn't happen.

This has been such a problem for me with clients that I started a side project[1] to try to help fill the gap. If anything it's shown me that the problem is more widespread than I originally thought.

[0] https://cloudblogs.microsoft.com/sqlserver/2016/11/21/source...

[1] https://www.versionsql.com/

In the past, I used the SQL project feature of Visual Studio to great success to manage the DB. Everything except data up/down migrations was well supported, and it enabled us to keep our SQL in source control. I never even knew there was direct integration with SQL server.

SQL Projects are still nice and still work in the lastest Visual Studio. You can sync the DDL from a current DB to the project and vice versa.

That's good to hear. When I started that job, the DB upgrade was a massive, manually curated SQL file that would be run for every upgrade and contained every change from the beginning of time. You can imagine how well that worked. I blew peoples' minds with the idea of "building" a database package

It happened with me as well.

My previous employer created stored procedures in a MSSQL database and called it a day.

Once saw a system like that, however, not only was the system not structured like most are nowadays (e.g. one that allows the app to do most of the CRUD), but it also stored almost everything in the database - not just stored procedures, but information about what sorts of data views in the web interface should have, all of the configuration, all of the validation logic etc.

So if it were to be a traditional MVC design (which it quite wasn't), it would have both the model and controller in the DB and the Java bits only handled the view part with JSP and some other outdated tech. Rendering a view basically involved calling a DB stored procedure which prepared everything that should be visible and returned the data to the "fake back end" for further processing, before it got sent off to the browser. And saving new data or editing the existing data involved the same approach, but with numerous function parameters (think 20 to 40).

On the bright side: it worked fast.

On the not so bright side: everything else was horrible. The Java code was nightmarish and badly maintained, the discoverability of it was horrible since it also had to deal with naming things similarly to how they were in the DB, with unreasonable identifier length limits. The DB code was almost impossible to version well (no automatic migrations to speak of), routinely broke and there was little to no logging in place, as most DBMSes out there handle the concept of "application logs" pretty badly, unless you write your own, which the other devs hadn't done. Oh, and you can forget about putting a breakpoint in those stored procedures, or even having Apache2/Nginx/whatever tell you what was being called for any action. It really taught me a thing or two about whether i should eagerly agree to help with code that other companies have developed and someone now needs to fix/improve.

Since then, i've also seen all sorts of systems, some that have automated migrations but don't have a baseline, some that have so much data and complexity that everyone has to share a test DB instance, others where the migrations are automated but there are no seeding scripts so a newly initialized DB instance has the correct structure for local development but the system is not usable because of no data. Of course, i've fixed what i could over the years, but there are very few approaches that truly work.

The most functional approach that i've seen: automatic DB migration scripts in the project (versioned in VCS) from day 1, each developer having their local DB instance, with seeding scripts for that as well, so that they can do breaking changes locally and test them out as often as they like, without fearing ruining the schema (after all, locally you can just wipe the schema and data, run the stable migrations/seeding scripts and continue from there again), no manual changes on DB servers to schema/data (unless in an emergency, but even then the same should be done in scripts later).

Additionally: it's useful to have scripts that won't fail if you run them more than once (check if data needs to be altered first, do so if necessary, otherwise do nothing, maybe output information in some log table).

Optionally: it can be useful to have the ability to reverse migrations but sometimes that's not easy to do and increases the total work ~2X, depending on the complexity of your migrations. Forward only approach has been sufficient for most of my projects.

Optionally: i've also explored model driven development with MySQL Workbench, where i made all of the models in ER diagrams and used forward engineering to get the SQL for scripts. It was a wonderfully nice approach, but pgAdmin and SQL Workbench as well as others don't really support anything like that, so it's a niche concept: https://dev.mysql.com/doc/workbench/en/wb-forward-engineerin...

Not sure if article saying something too deep for me to comprehend or too trivial not to be obvious to me.

It's obvious to me and I imagine most programmers, too.

However, as a data engineer I've been at some companies that treat SQL differently and seen bad practices that I believe the article is addressing, like putting SQL statements as stored procedures or materialized views instead of in git.

I didn't understand the point about stored procedures. Why not have stored procedures? Yes, you treat it like code any changes to the stored procedure are scripted, stored in git and deployed with a data migration tool.

But store procedures give you a single place to call SQL. If there is a procedure that does a standard calculation used by 5 reports then put it in a function or a stored procedure and have every report call that procedure / function.

From the outside looking in, and when framed this way, it's obvious. But when SQL at a company is entered into a semi-bespoke internal system fronted by a textentry box on a webform, and there's no editor or git anywhere to be seen, it's less obvious that the work required to get it into a version control system like the rest of the code is even possible, never mind worth it. Especially when the data science team has a primarily statistics background and less of a programmer one. (even then, a programmer background doesn't imply a person knows git/version control.)

I think the gap is that most startup types don't use stored procedures, so their SQL is committed to version control by default. Few places do migrations to update their database in any way including the stored procedures that also works well. Nothing new, except for those not already doing one of these things.

"Because it is code?"

Exactly. “Why Google treats Javascript like code and you should too” would seem equally plausible as an article.

It’s obvious to most programmers - this was common practice in the previous century - but there’s been a spike of data scientist/analyst jobs where that might not be as common, and there used to be more of a subset of DBAs who self-identified as not being programmers and avoided “hard” things like this. I’ve definitely had people in roles like marketing have to deal with confusion or bugs caused by reports changing in undocumented ways.

Yes, it was that bad of an article.

It's the second one.

I wish this article also covered how google manages sql like code. In my projects I have sometimes struggled with keeping my table definitions, constraints, functions, etc in sync with the SQL in my git repo

You start with an initial table schema definition in a file, the base schema. The next batch of altering sql statements are put in a file with an explicit version in the file name. The version in the file name can just be the current date. The next batch of changes are put in another file with its own version name in the file name. And so on. These are called delta files. All the delta files are checked into the source repository.

To recreate the database upto the current schema, just check out the list of schema delta files. Play them back one by one in the order of the version in the file names.

Track any prepopulating table data the same way.

Edit: having a table called schema_version to track the current applied schema version of the database is very helpful. Then you can upgrade the db schema from any version rather than from the base schema by checking the schema_version table before applying the versioned schema delta files. Be sure to put an insert statement to the end of every versioned schema script file adding a row containing the version of the file to the schema_version table.

The issue with reconciling state is that you can have multiple branches at different times. Forgive the crude diagram but I’m on my phone.

If you have many branches and you want to bug fix on one or your schema needs to stay how it is for a certain release window then you quickly run into issues with the serialised version approach.

All the schema delta files are checked into repository. They themselves can be version tracked. Different branches can have different versions of a delta file. Merging branches need to merge the changes of the delta file. Might need special attention since merging sql statements needs special care.

Edit: Or you might choose to be explicit with the version of the delta files in a branch. Each branch has its own delta file naming format. When merging into main, roll the changes from all delta files into a new main delta file. If main has moved forward a lot, you have to insert the changes carefully.

Declarative schema management tools make this much easier. The concept is your schema repo just stores CREATE statements, and the schema management tool knows how to generate DDL to transition between the current state in your DB and the desired state in your repo. Not sure about Google, but Facebook/Meta has used this approach company-wide (via internal / non-open-source systems) for over a decade.

I'm the author of declarative schema management tool skeema (https://www.skeema.io, for MySQL / MariaDB). Some other options in this space are sqldef (https://github.com/k0kubun/sqldef, for MySQL or Postgres) and migra (https://github.com/djrobstep/migra, for Postgres). In MS SQL Server, SSDT DACPACs are also somewhat similar.

For OLAP workloads at least, Terraform for defining data warehouse resources acting as input, and dbt for ETL will give you version control, dependency graphs for development and execution, data quality assertions, CI/CD, all conveniently under version control in an external repository.

But your woes seem more related to the transactional systems.

Search for SQL or database as code and you'll find quite a few established SAAS offerings in this realm, with varying approaches and quite decent docs as well. Worth looking up I'd say.

Having done it the wrong way many times and a better way a few times, I can tell you that an ORM with migrations is a huge improvement.

Yep, something like Flyway or Liquibase in the Java world has always made a lot of sense, though you also should be able to explicitly say when you do/don't want to run said migrations (e.g. run different profiles).

Of course, recently i've also found Dbmate (https://github.com/amacneil/dbmate) which is a lightweight and language-agnostic tool that supports a variety of DBMSes (PostgreSQL, MySQL/MariaDB, SQLite and more recently even ClickHouse) and can be used for running SQL migration scripts regardless of what technologies that particular app uses (Java, .NET, Node, PHP and so on).

Now, setting up the CI might be a little bit more difficult, but having one tool across all of the services in your architecture can be worth it!

Bytebase (https://github.com/bytebase/bytebase) is another option. It has bulit-in integration with GitLab to achieve so called database-as-code (disclaimer: I am the author of it)

Not strictly germane to the article, but as code SQL has things precisely backwards IMHO. Consider:

- DML is declarative; actually, though, I prefer to think of my query (or update) in terms of a pipeline splicing tuples together, filtering etc, and I'd be happier if I could write it as such, i.e. in a more imperative style (to express intent I mean - the planner should still be free to do all sorts of clever shenanigans, as now)

- DDL, otoh, is imperative: "add this index", "remove this column". But here I want a declarative style! i.e. "I would like my table to look like this - make it so". A lot of the aggravation with database maintenance comes from having to manage a stack of non-idempotent DDL operations in order to wrestle the db into the required shape.

I think it's a bit more nuanced - while I would say SELECT is declarative, UPDATE or INSERT or DELETE feel pretty imperative to me. Not only are they literally orders, but only DELETE is idempotent.

But for analytics systems, you can get close to having declarative, idempotent DDL by doing CREATE OR REPLACE VIEW ... SELECT for derived data. Some systems even allow transactional and this idempotent DROP TABLE IF EXISTS; CREATE TABLE ... AS SELECT.

This doesn't in any way help with indexes, with tables that are primary storage and you can't just drop and recreate them and so on, so there definitely is something missing.

So they push SQL scripts to a repository?

Seems like a very basic and common approach. How do shops that don’t use it track database changes? Can’t remember the last time I was involved in a (non-legacy) project that did things differently.

"SQL is a query language. Why would Google treat it like code?"

Because...it's a programming language, and therefore...it is code?

> Because...it's a programming language, and therefore...it is code?

It's also literally the classic example of a "fourth-generation programming language."

Does anyone treat SQL as anything other than code?

Yes, it does exist. In the domain of "Drag & Drop" ETL tools (looking at you, Matillion, Talend, Alteryx etc.), I've seen horrible setups.

No version control, no centralized overview of queries, dependencies, code linting or auto-completion etc.

Normally these tools are chosen by unexperienced, rather non-technical data or business people, to whom the initial ease of development is highly appealing. Of course compelxity quickly spirals out of control as LoC across all kinds of different transformation steps increase.

What I don't get is that most of these tools have their own steep learning curve. You have massive menus with dozens of actions, and your workflow look like complex electrical diagrams. It will put off the least technical people as well as SQL. So in the end you end up with really motivated, kind-of-geeky business users who might as well learn SQL and perhaps a little bit of python for the same effort.

I would love to get every office users to do more with their computer than copy-pasting all day, but I came to belive that the problem is mentality, not tooling.

It might be hard to believe, but statisticians and data scientists can be really good at their job while being bad at traditional programming that would lend themselves to learning Python and SQL. I've seen some do transforms on data that are beyond my abilities, while also having difficulty writing functions. If the abstractions in their tools make more sense to them, who are we to question them? Learning SQL is a non-trivial investment, and while it may seem like being an, eg, expert user of Alteryx (and nothing else) may be a career limiting move, there are also tons of programmers who only know Java, don't want to learn anything else, and still have really good careers. I'm sure there's the same for an expert Alteryx user.

I agree but you are starting with a user who is already an Alteryx expert. I am starting with users who are expert at nothing and are wasting time with manual work. What I say is that those users will either learn nothing, or if they do, they might as well learn SQL rather than learn to use a complicated, no-code software like Alteryx for the same effort than SQL.

SQL seems pretty straightforward to people who done a fair bit of coding. Totally opaque (possibly terrifyingly so) to people with no coding experience.

There is a reason why no-code data wrangling tools are a multi-billion dollar industry.

Yes, because way more people write SQL than write a "proper" programming language.

Author of the article here, if anyone has any questions (or just continue roasting away; that’s good too)

I wrote the article because many companies treat SQL like excel files. It gets emailed around, unknowingly forked and modified, saved on desktops or shared file storage, and nobody knows how to fix anything.

For everyone who said “SQL is code is this news” - I agree with you. It is code. And it shouldn’t be news. And I hope all companies implement version control with SQL. I see Java version controlled at a significantly higher rate than SQL. That’s all. Good day.

from TFA:

Spend the time to migrate scripts from local locations to global locations, eliminate views, materialized views, and stored procedures.

Wow. So things will be better if we don't use the modularization you can get through views? I've just finished up a tool* to make it easy to keep code for SQL views in VCS, but I must have been thoroughly misguided ;-)

*) https://pypi.org/project/DBSamizdat/

Yeah I'm with you on this one. I think they keep these in code so that they can deploy over multiple databases. But I think this should be seen as a schema change. If you make a change to a view, materialized view or procedure you need to store that change in git and deploy it out to all instances via a migration tool.

It seems obvious, but one would be surprised how often the argument has to be made.

Though the real argument is that Tableau Workbooks, Alteryx ETL-graph-things, and the whole panoply of opaque propriety formats that contain, in their heart of hearts, a bunch of SQL are also themselves "code" and deserve the same testing, source control, code review, and deployment practices as the rest.

Am I missing something here?

SQL is code.

So what else would I treat it like?

They mean version control here. I've worked at places where the schemas and stored procedures were not tracked anywhere - only the latest version existed in the database itself. Change control was a disaster.

I've worked at places that threw scripts into a folder somewhere in a project, so that it would at least have history. Again, change control was a disaster.

If you're not using a change control tool of some sort specifically for databases, and using it and only it to manage all changes, it's going to be a disaster.

Of course, but this also applies to all code.

I still remember the time when it was deemed normal, or at least common, to email around diffs and patches in a dev team, or upload everything to the "code server" in little directories with iso-timestamps as names.

Not taking care of code properly, aka. in a VCS, is a disaster, no matter if its C, SQL or utility bash-scripts.

We've built in-house a tool to move database objects (everything except tables) from development to testing to production. It even lets you pick a function or procedure from a package and only send that.

At each step, the code gets commited to a git repository.

Google and many other companies that use Looker do something else, too, which is that they store models in source control as opposed to SQL. Models describe the schema of the database, and things like calculations, measures, rollups — which are reusable and composable.


Using models from your ORM of choice instead of the DDL they generate isn't that unreasonable. Suggesting that one should (or then could only?) use 'a simple API .. or UX' to actually query the database seems pretty objectionable though.

It locks you down to that ORM. If you are the owner of the ORM such as Google with Looker that is fine, otherwise it doesn't sounds like a good idea to me

Last time I checked the sky is blue, and yes I do keep SQL in git

Does anybody know a good SQL formatter and linter?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact