Simple declarative schema migration for SQLite (rothlis.net)
109 points by drothlis on May 3, 2022 | 28 comments



For anyone who needs to implement the 12-step procedure (https://www.sqlite.org/lang_altertable.html#otheralter) from the SQLite documentation for applying complex ALTERs, I've built a CLI tool and Python library that can apply it for you.

CLI example: https://sqlite-utils.datasette.io/en/stable/cli.html#cli-tra...

    sqlite-utils transform fixtures.db roadside_attractions \
      --rename pk id \
      --default name Untitled \
      --type longitude float \
      --type latitude float \
      --drop address
Python example: https://sqlite-utils.datasette.io/en/stable/python-api.html#...

    table.transform(
        rename={"pk": "id"},
        defaults={"name": "Untitled"},
        types={
            "latitude": float,
            "longitude": float
        },
        drop={"address"}
    )
Wrote more about those here: https://simonwillison.net/2020/Sep/23/sqlite-advanced-alter-...

(Just noticed this is already mentioned at the bottom of the blog entry!)


We are big fans of Datasette (though I've yet to use it in real life). I think it's important work.


This is an interesting approach to this problem.

I like the idea of building the new schema as an in-memory database. You could even go a step further and compare them using a SQL query that joins across the two databases (SQLite supports this, it's pretty neat).
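Roughly like this, for example (a quick sketch, not tested against the article's code; "schema.sql" and "app.db" are stand-in names):

    import sqlite3

    # Build the desired schema in an in-memory database, ATTACH the real
    # database file, and diff the table definitions with one query.
    pristine = sqlite3.connect(":memory:")
    with open("schema.sql") as f:
        pristine.executescript(f.read())
    pristine.execute("ATTACH DATABASE 'app.db' AS live")

    # Tables whose CREATE TABLE statement differs, or which are missing
    # from the live database entirely (l.sql is NULL in that case).
    for name, wanted, actual in pristine.execute("""
        SELECT p.name, p.sql, l.sql
        FROM sqlite_master AS p
        LEFT JOIN live.sqlite_master AS l ON l.name = p.name
        WHERE p.type = 'table' AND p.sql IS NOT l.sql
    """):
        print(name, "differs" if actual else "is missing")

The IS NOT comparison treats a missing table (NULL sql) as a difference too; you'd want a second pass the other way round to spot tables that only exist in the live database.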

I'm a bit nervous about how edge-cases might screw things up, but a nice thing about SQLite is that backups are really cheap so I guess you can protect against any risks by creating a backup before running this script.

My preferred approach to database migrations is the one implemented by Django: migrations are a sequence of transformations stored in files on disk, and the database includes a table that says which of those migrations have been applied. This keeps everything in version control and means there's no chance of the wrong migration being applied in the wrong way.

It's quite a bit more work to set up though. I don't have my own Django-style migration system for SQLite yet and I really need one.

I really love how clean and short the implementation of this is! https://david.rothlis.net/declarative-schema-migration-for-s...


I worked with Django at a startup for <1y on a team of ~10 engineers. Django migrations are easy to use and better than a lot of what's out there, but I thought they could still be improved, for the following reasons:

1. Merge migrations are annoying. Especially once you get some engineering velocity going, it's very common to hit a conflict. Almost every single time, the migrations being merged didn't actually conflict at all.

2. Migrations made checking out someone's code to run on your local machine difficult. You've got to remember to unmigrate before checking your own branch out again, otherwise some future migration may fail.

3. Migrating/unmigrating can be hard to reason about. There have been a lot of cases where a junior engineer pulls the latest code, then realizes their DB is borked due to migrations (maybe a phantom migration that is no longer in their codebase). In some of those cases, we just told them to start from scratch, since it's easier.

The solution outlined by OP is nice because it eliminates the overhead of those migration files and only looks at the intended state (the SQL file) and the actual state (the database). While the migration files do provide some ceremony for dealing with user data in prod, I'm not sure we've ever gotten enough out of that ceremony to justify the overhead of maintaining it.


Yeah, I think what you're describing is an issue for all "traditional" database migrations: switching between branches is painful.

I think the best thing to do is, during development, recreate and re-populate the database on every branch change automatically. I have that set up in my current project, although I have to remove the database file manually for now. My test data is hardcoded into my codebase, and the repository access code saves it on startup if a "load test data" flag is set.

Of course, using an in-memory database would also solve this, but then any changes I make to the data at runtime would be lost every time.
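Roughly like this (file names, the table, and the flag are made up for illustration; the real thing goes through the repository layer):

    import os, sqlite3

    DB_PATH = "dev.db"  # placeholder path

    def rebuild_dev_db():
        if os.path.exists(DB_PATH):
            os.remove(DB_PATH)                 # throw the old database away
        db = sqlite3.connect(DB_PATH)
        with open("schema.sql") as f:          # placeholder schema file
            db.executescript(f.read())
        if os.environ.get("LOAD_TEST_DATA") == "1":
            # test data hardcoded in the codebase; in a real project this
            # goes through the normal repository/data-access layer
            db.executemany("INSERT INTO users (name) VALUES (?)",
                           [("alice",), ("bob",)])
        db.commit()
        db.close()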


Disclosure: I'm the co-author of the blog post.

> My preferred approach to database migrations is the one implemented by Django: migrations are a sequence of transformations stored in files on disk, and the database includes a table that says which of those migrations have been applied. This keeps everything in version control and means there's no chance of the wrong migration being applied in the wrong way.

Do the transformations have to be applied in sequence? If so, how does this handle resolving git conflicts? With our system, conflicts in our database.sql script are mostly resolved automatically by git's merging. And CI works well because we downgrade the database to the version of database.sql found at $(git merge-base main HEAD) before running integration tests.


I don't know about Django but in Rails† the migrations are numbered with a timestamp (in the filename itself), instead of a sequential number, precisely to avoid (git) merge conflicts. Then they're run in alphabetical order. There's a dedicated table to record which migrations have been run already.

† 10 years ago, anyway.


Using a timestamp in a migration system doesn't really solve the problem at edge cases though: the time-based ordering of migration file creation may differ from the time-based ordering of migration application, especially accounting for different git branches, and different DB environments (prod, staging, dev DBs for each developer).

The canonical problem is two engineers with branches that each add a new column to the end of the same table. It's easy to end up with schema drift between environments, or cases where executing the migrations in alphabetical order yields a different result than executing them in git history order. You can add various guard rails and band-aids to catch this, but it's ugly.

In contrast, declarative schema management systems inherently avoid this situation. For example, if two engineers try to add a new column to the end of the same table, one of them will get a git merge conflict -- exactly like if two branches in an application code repo tried to edit the same line of a function.


> Using a timestamp in a migration system doesn't really solve the problem at edge cases though: the time-based ordering of migration file creation may differ from the time-based ordering of migration application, especially accounting for different git branches, and different DB environments (prod, staging, dev DBs for each developer).

In many cases this might be "good enough", depending on change frequency.

For schema drift, I've given up on persisted databases outside of production. We have a Docker image that gets built on each commit to our migrations. Staging environments have stopped making sense to me: they may or may not be representative of production, so at that point, what do they do for me?

I'm sure there's some team scale where this doesn't hold up, especially around things like load testing. If that were a concern of mine, I'd still want to treat the environments as ephemeral, and would try my best to set up and tear down the environment automatically.


One of the many benefits of declarative tooling is the ability to easily see if an environment's state differs from expectations, and then bring it back to the desired state automatically in a minimal number of steps. Generally this is much faster than tearing down and re-creating the environment from scratch.

You can still use declarative tooling to replace an environment, or spin up a new environment, whether persisted or ephemeral. Declarative tools are faster at this as well, relative to a traditional migration system which has to apply a full history of iterative DDL (e.g. multiple ALTERs vs a declarative tool's single CREATE).

And of course declarative tools may also be used on production, allowing use of the exact same tool chain in prod as in non-prod.


With Django migrations each migration needs to specify the migrations that have to be executed first - its dependencies.

Usually this is a single ID, but multiple IDs are supported.

Here's an example: https://github.com/CAVaccineInventory/vial/blob/0359ec1b44d0...

This means that you can actually have two migrations with the same number and they won't break, provided they have correctly specified their dependencies.
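For illustration, a migration file looks roughly like this (app, model, and migration names invented):

    from django.db import migrations, models

    class Migration(migrations.Migration):
        # This migration can only be applied after "0002_add_slug" in the
        # same app (the names here are invented for illustration).
        dependencies = [
            ("myapp", "0002_add_slug"),
        ]

        operations = [
            migrations.AddField(
                model_name="article",
                name="published_at",
                field=models.DateTimeField(null=True),
            ),
        ]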

Django has a mechanism whereby you can resolve conflicts caused by two people working in branches: https://docs.djangoproject.com/en/4.0/topics/migrations/#ver...

In practice though I've found it best to avoid these - as a developer I'll reconstruct my migration if I find someone else has landed one before I did, to keep things simple and linear.


Disclosure: I'm the creator of Skeema (https://www.skeema.io), a declarative schema management system for MySQL/MariaDB mentioned at the bottom of the original article.

> migrations are a sequence of transformations stored in files on disk [...] This keeps everything in version control

While traditional migration systems do give you something to put in version control, I'd argue that what they give you is the wrong abstraction, at least for schema management / DDL.

The migration files need to be ordered in some way based on their filenames, and their contents contain imperative DDL that must be executed in that order. This is conceptually equivalent to storing a bunch of .patch files in a git repo, and then using a file naming scheme to indicate the intended application of patches.

But, version control systems already have a first-class notion of ordering and diff'ing -- after all that's one of the main purposes of the version control system! So at best this scheme is redundant, and at worst it's a second source-of-truth which may conflict.

And then, separate from the repo ordering/source-of-truth problem, you also have a third source of truth which may conflict: the actual state on any given database server, which could also include further drift. Sometimes things get run out-of-order in the course of development, sometimes people run things manually as part of an emergency hot-fix, etc.

Finally, there's the additional problem of these migration files not truly being immutable. If someone accidentally edits, deletes, or renames one of these migration files after it has already been executed on at least one database environment, the result is basically undefined. Some migration systems store hashes of the migration contents, which helps catch the edited-after-execution case, but not the other cases.

> and means there's no chance of the wrong migration being applied in the wrong way.

Assuming that your commit order, filename order, and execution order are always all aligned, and your migration files are immutable, this is true. And some engineers really value having that history of ALTER TABLE statements.

But of course there's another wrinkle, at least in MySQL/MariaDB as well as SQLite: there are many situations where you're not really using ALTER TABLE, but instead are using a separate external tool to mimic the desired effects of the ALTER, due to various peculiarities and shortcomings of the database's ALTER support.

So if you're not really running an ALTER anyway, is a repo with a bunch of name-ordered ALTER statements even valuable? In contrast, a declarative system gives you a repo which expresses the intended state of the database, which may be used as an easy reference for engineers, and also contains useful `git blame` results to see why/when/who made a change.

That all said, I do concede that imperative migrations are useful for data migrations (DML), as well as renames (renaming a table or renaming a column within a table). The declarative model doesn't capture imperative changes. But nothing prevents you from using a declarative system for schema changes and pairing it with a traditional migration system for imperative changes.


Disclosure: I'm the co-author of the blog post.

One nice thing about this approach is that it works by inspecting a database that has been put into the desired state. This means it doesn't matter how you set up your database (with an ORM, a SQL file, etc.): the same function can be used to do the migration.
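Very roughly, the idea is something like this (a simplified sketch, not the actual function from the article; file names are placeholders):

    import sqlite3

    def migrate(db, pristine):
        # Compare table definitions between the live database and a pristine
        # database that has been put into the desired state (simplified: the
        # real implementation also handles indexes, dropped tables, and the
        # 12-step copy for incompatible changes).
        wanted = dict(pristine.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"))
        actual = dict(db.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"))
        for name, sql in wanted.items():
            if name not in actual:
                db.execute(sql)        # new table: just create it
            elif actual[name] != sql:
                ...                    # changed table: ALTER or copy & rename

    # The pristine database can come from a SQL file...
    pristine = sqlite3.connect(":memory:")
    with open("schema.sql") as f:      # placeholder file name
        pristine.executescript(f.read())
    # ...or from an ORM's create-tables call, or anything else.
    migrate(sqlite3.connect("app.db"), pristine)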

I'd really like it if something like this were built into SQLite. That way we'd have more confidence that it would be, and stay, correct and complete. It seems generic enough that it could be part of the built-in functionality. I would have proposed it upstream, but AIUI the SQLite authors don't accept external patches.

Edit: Another benefit of this being built in: there are plenty of places where SQLite is applicable but where you can't, or don't want to, run Python.


SQLite actually ships a little-known utility called sqldiff that implements a subset of this idea: https://www.sqlite.org/sqldiff.html

It's more focused on content diffs than schema diffs though.


Wow, I hadn't come across that before. It's so close to being what we want, but not quite right. They've even considered this specific use case:

> 3. The sqldiff utility is not designed to support schema migrations and is forgiving with respect to differing column definitions. Normally, only the column names and their order are compared for like-named tables before content comparison proceeds.
>
> However, the single-table comparison option, with "sqlite_schema" named, can be used to show or detect detailed schema differences between a pair of databases. When doing this, the output should not be used directly to modify a database.


An additional benefit: The CREATE TABLE statements serve as documentation of the structure of the production database.


Interesting approach, I'm taking notes!

Something I'm currently doing for my CV application[0] is auto-generating variables for every table/column name in my SQLite DB. As I'm using a dynamically typed language, it's normally easy to mess up with a typo when getting a value out of a query result. But when I use the auto-generated variables I'll know I messed up at compile time. That also means my SQLite schema is always my source of truth. Maybe there are better ways to do it, but I find it really useful so far!

[0] https://withoutdistractions.com/cv - I'm doing a "Show HN" today here: https://news.ycombinator.com/item?id=31246696
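Roughly, the generation step looks something like this (simplified, and sketched here in Python rather than the language I actually use; names are illustrative):

    import sqlite3

    def generate_constants(db_path, out_path="db_names.py"):
        # Emit a module of constants from the live schema, so table/column
        # typos are caught by the compiler/linter instead of at runtime.
        db = sqlite3.connect(db_path)
        lines = []
        tables = [r[0] for r in db.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            lines.append(f'{table.upper()} = "{table}"')
            for _, col, *_ in db.execute(f"PRAGMA table_info({table})"):
                lines.append(f'{table.upper()}_{col.upper()} = "{col}"')
        with open(out_path, "w") as f:
            f.write("\n".join(lines) + "\n")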

(I have to admit I'm kinda jealous you beat me to the front page within half an hour. What sacrifice did you make to which algorithm deity? I won't tell anyone.)


Thanks! Coincidentally one of the first articles I wrote for my blog was about compile-time checking of table/column names: https://david.rothlis.net/d/templates/

(Just something I was messing around with, not anything I ever used in production.)


Spotted this in the code:

    def _left_pad(text, indent="    "):
        """Maybe I can find a package in pypi for this?"""
        return "\n".join(indent + line for line in text.split('\n'))
You can use the textwrap module from the Python standard library for this:

    textwrap.indent('hello\nthis\n  is a demo', '    ')
    # '    hello\n    this\n      is a demo'


Ha! I use `textwrap.dedent` all the time but didn't make the leap of logic to check for the existence of `indent`. :-)

I suspect this was a leftover from Python 2.7 days.


Nice tool! One common problem I have with SQLite is that the DDL to alter tables is quite limited and the performance is not great. I actually opened https://sqlite.org/forum/forumpost/8b4e95fd55 a few days ago because it is currently a pain to drop a column that has a foreign key constraint.

Concerning performance, I am dealing with SQLite databases of a few GB (5-20) and dropping a column can easily take 15-20 minutes, plus it doubles the size of the table on disk. If someone has tips on how to improve that, I am interested!


I believe it doesn't vacuum by default, so you can potentially reduce the size back down by running VACUUM. I believe it's slow, but I've never used it.
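If it's the on-disk size you want back, something like this should work (file name is a placeholder); note that VACUUM rewrites the whole file, so it's not fast either:

    import sqlite3

    # VACUUM can't run inside a transaction, so open in autocommit mode.
    db = sqlite3.connect("big.db", isolation_level=None)   # placeholder path
    db.execute("VACUUM")   # rewrites the whole file, reclaiming freed pages
    db.close()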


We had something similar at Google, where you change some CREATE TABLE statements in an SQL file, then run a migration based on that changed SQL file. It was insanely nice. I think this tool can be super useful, especially with a lot more people using something like Litestream.

I'm personally (and maybe naively) of the opinion that DB migrations should be as painless/automated and without ceremony as possible. I wish there were more tools like this out in the wild.


I believe database migrations should be boring and common.

At many places I've worked they've been exciting and rare - which massively hurts engineering velocity, to the point that sometimes teams will build things in a really convoluted way rather than go through the bureaucracy of having a schema change accepted!


At my work right now (where we're using Laravel, which has migrations as a cultural standard) we have many migrations. But 95% of them, if not more, are adding stuff; very few are removing or refactoring stuff. The database evolved under harsh deadlines and growth, so there's a lot of technical debt in there.

It's definitely not terrible, but I just wish I saw people talking about using migrations to refactor databases a bit more.


I think this is an interesting way to generate migration scripts at design time, but I wouldn't trust auto-migration code in production that migrates from arbitrary schema to arbitrary schema. Take migrations that require copying the entire table - this could be fine for small tables, but extremely expensive for large tables. If disk space is a concern, this will also use 2x disk space! I at least want any such operations to be obviously visible at code review time, so the team can discuss any performance implications before we ship such a migration.

Our team uses a rigid workflow to maintain the SQLite schemas for our mobile and desktop apps:

1. Generate a new migration. This creates ${SCHEMA_VERSION}-migration-name.sql, where SCHEMA_VERSION will be the PRAGMA user_version after migration. We also generate a matching ${SCHEMA_VERSION}-migration-name.test.ts.

2. The developer updates the migration file to do the thing. This part we could replace with declarative table diffing as in the article, which would be cool! The developer also fills in the test. Our tests typically write some rows to the DB at SCHEMA_VERSION-1, perform the migration, and then make assertions about how the rows look after the migration.

3. When the changes are ready for review, the developer runs a command that packs up all the migrations into generated allMigrations.{json,ts} files, along with an auto-generated "fast-forward" migration that is just the final database SQL dump, and a JSON description of the expected end result schema.

4. On end-user devices, at app boot we read PRAGMA user_version and apply the migrations transactionally, in order, up to the latest one. At the end, we assert that the schema in the database matches the expected schema. If the database is new, like on a fresh install, we apply the fast-forward migration instead. (A rough sketch of this logic is below.)
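The sketch (simplified, in Python rather than our TypeScript; the fast-forward path and the final schema assertion are only hinted at, and the migration versions are assumed to be integers):

    import sqlite3

    def apply_migrations(db, migrations):
        # `migrations`: list of (version, [sql statements]) pairs, sorted by
        # version; an assumption about how allMigrations.json is laid out.
        current = db.execute("PRAGMA user_version").fetchone()[0]
        for version, statements in migrations:
            if version <= current:
                continue                      # already applied
            db.execute("BEGIN")
            try:
                for stmt in statements:
                    db.execute(stmt)
                db.execute(f"PRAGMA user_version = {version}")
                db.execute("COMMIT")
            except Exception:
                db.execute("ROLLBACK")
                raise
        # afterwards: assert the live schema matches the expected schema dump

    db = sqlite3.connect("app.db", isolation_level=None)  # manual transactions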


Related project: Alembic[1]. It has tight integration with SQLAlchemy and can autogenerate migrations based on code/DB changes.

[1]: https://alembic.sqlalchemy.org/en/latest/
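A generated revision (e.g. from `alembic revision --autogenerate -m "add published_at"`) ends up looking roughly like this; the table/column names and revision IDs here are just for illustration:

    """add published_at to article

    Revision ID: 1a2b3c4d5e6f
    Revises: 0f1e2d3c4b5a
    """
    from alembic import op
    import sqlalchemy as sa

    # revision identifiers used by Alembic (values invented here)
    revision = "1a2b3c4d5e6f"
    down_revision = "0f1e2d3c4b5a"

    def upgrade():
        op.add_column("article",
                      sa.Column("published_at", sa.DateTime(), nullable=True))

    def downgrade():
        op.drop_column("article", "published_at")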


Why are all these tools written in Python? I'm not complaining; sometimes I'd just like to see such a tool in PHP or Bash or something, so that my project doesn't need to include yet another language dependency... Doesn't the SQLite CLI, for example, ship as a bash or win32 CMD file...?



