
Show HN: Version Control for Databases - akie
https://github.com/infostreams/db
======
PaulKeeble
The way I have been doing database versioning and rollout for years is with
Liquibase ([https://www.liquibase.org/](https://www.liquibase.org/)). It is
based on the Database Refactoring book that came out some years ago, and it
provides forward and backward migration based on a table that tracks the
checksums of the changes, allowing them to be rolled out wherever the file
goes. You can share with teammates, propagate to environments, and keep
separate setups for structure, data and test data.
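
For those unfamiliar with the model, here is a minimal sketch of the tracking-table idea (illustrative only: Liquibase's real DATABASECHANGELOG table records more metadata, and SQLite is used here just to keep the example self-contained):

    import hashlib
    import sqlite3

    # Each change set is recorded with a checksum, so it runs exactly once
    # per environment and editing an already-applied change is detected.
    migrations = [
        ("001-create-users",
         "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
        ("002-add-email",
         "ALTER TABLE users ADD COLUMN email TEXT"),
    ]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE IF NOT EXISTS changelog"
                 " (id TEXT PRIMARY KEY, checksum TEXT)")

    for change_id, sql in migrations:
        checksum = hashlib.sha256(sql.encode()).hexdigest()
        row = conn.execute("SELECT checksum FROM changelog WHERE id = ?",
                           (change_id,)).fetchone()
        if row is None:
            conn.execute(sql)  # not applied yet: run it and record it
            conn.execute("INSERT INTO changelog VALUES (?, ?)",
                         (change_id, checksum))
        elif row[0] != checksum:
            raise RuntimeError(f"change set {change_id} was edited after being applied")
    conn.commit()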

Restoring snapshots isn't really what you want with databases. The data in
production exists only in production, so you need to refactor it in place. The
issue in today's 24/7/365 always-online world is that refactoring database
structure that way with a lot of data causes downtime, so these sorts of tools
require a lot of orchestration around them.

Liquibase can also capture an existing structure and data set with an export
from just about any database engine out there, so you can record your starting
point as-is and then move forward with each update as an individual change.
That makes it relatively easy to adopt.

~~~
djrobstep
I don't know why people are stuck on this model of a long chain/log of
successive schema changes when you can simply make a diff.

I lay out this approach here: [https://djrobstep.com/talks/your-migrations-are-bad-and-you-should-feel-bad](https://djrobstep.com/talks/your-migrations-are-bad-and-you-should-feel-bad)

~~~
ben509
Because diffs don't work. If you rename a table, for instance, a diff
algorithm will see that as dropping the table and creating a new one.
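
A toy illustration of the failure mode (hypothetical code, not any real diff tool):

    # A naive schema diff keys on table names, so renaming "users" to
    # "accounts" is indistinguishable from dropping one table and creating
    # another, and the generated DROP throws away the data.
    old_schema = {"users": "CREATE TABLE users (id INTEGER, name TEXT)"}
    new_schema = {"accounts": "CREATE TABLE accounts (id INTEGER, name TEXT)"}

    def naive_diff(old, new):
        stmts = [f"DROP TABLE {t};" for t in old.keys() - new.keys()]
        stmts += [f"{new[t]};" for t in new.keys() - old.keys()]
        return stmts

    print(naive_diff(old_schema, new_schema))
    # ['DROP TABLE users;', 'CREATE TABLE accounts (id INTEGER, name TEXT);']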

~~~
evanelias
People love to cite this reason, but practically speaking, it's FUD. Renaming
in production -- whether entire tables or just a column -- is operationally
complex no matter what.

Assuming any non-trivial software deployment (multiple servers), it's
impossible to execute the rename DDL at the exact same time as new application
code goes live. So either you end up with user-facing errors in the interim,
or you can try to write application logic that can interact with both old and
new names simultaneously. That's overly complex, typically not supported by
ORMs or DAOs, and very risky in terms of bug potential anyway.

I'm a database expert, and at 100% of the companies I've worked at or
consulted for, renames were either banned entirely or treated as a very rare
special-occasion event requiring extra manual work. Either way, lack of rename
support in diff-based schema management isn't really a problem, as long as the
tooling has these two properties:

1\. Catches unsafe/destructive changes in general and requires special
confirmation for them (preventing accidental rename-as-drop in the rare case
where a rename truly is desired; a sketch follows after this list)

2\. Offers a way to "pull" from the db side, so that if a rename is actually
needed, it can be done "out of band" / manually, and the schema repo can still
be updated anyway
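
To illustrate the first property, a minimal sketch of such a guard (hypothetical code, not taken from any real tool):

    # Refuse to run destructive statements from a generated diff unless the
    # operator explicitly opts in.
    DESTRUCTIVE_MARKERS = ("DROP TABLE", "DROP COLUMN", "DROP DATABASE")

    def apply_diff(statements, allow_destructive=False):
        for stmt in statements:
            dangerous = any(m in stmt.upper() for m in DESTRUCTIVE_MARKERS)
            if dangerous and not allow_destructive:
                raise SystemExit(f"refusing to run without confirmation: {stmt}")
            print(f"executing: {stmt}")  # stand-in for cursor.execute(stmt)

    apply_diff(["ALTER TABLE users ADD COLUMN email TEXT"])  # runs
    apply_diff(["DROP TABLE users"])                         # aborts with a message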

~~~
ben509
> People love to cite this reason, but practically speaking, it's FUD.

> Either way, lack of rename support in diff-based schema management

You're trying to polish a turd here: this isn't "lack of support", it's "it
will drop objects in production."

> Catches unsafe/destructive changes in general and require special
> confirmation for them

Again, polishing a turd: the only way your "automation" works is through
manual intervention.

> Assuming any non-trivial software deployment (multiple servers), it's
> impossible to execute the rename DDL at the exact same time as new
> application code goes live.

You can construct a view referencing the old table, and rename the table. Yes,
it has to be an updateable view and you need transactional DDL, but within
those constraints, it's doable.
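
For instance, a sketch of that pattern against PostgreSQL, assuming the psycopg2 driver (a single-table view like this is auto-updatable in Postgres, and Postgres DDL is transactional):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # Committed atomically: clients see either the old table, or the
        # new table plus a compatibility view, never a missing "users".
        cur.execute("ALTER TABLE users RENAME TO accounts")
        cur.execute("CREATE VIEW users AS SELECT * FROM accounts")
    # Old application code keeps reading and writing "users" through the
    # updatable view; new code uses "accounts". Drop the view once nothing
    # references the old name.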

> I'm a database expert, and among 100% of the companies I've worked at or
> consulted for, renames were either banned entirely or treated as a very rare
> special-occasion event requiring extra manual work.

If they're using a DBMS that doesn't support transactional DDL, completely
understandable. If their tools are liable to drop production data due to
renames, also completely understandable.

But the fact that they ban a trivial operation is a symptom of the problem
with all the half solutions and snake oil in SQL schema management. It's so
bad that you have large companies investing heavily in ripping out the schema
entirely, which itself is just more snake oil.

In the problem space of trying to keep a schema in sync, we know that diffing
leads to unacceptable answers; that is an indicator that it's the wrong
conceptual basis for a correct solution.

~~~
evanelias
> You're trying to polish a turd here, this isn't "lack of support" it's "it
> will drop objects in production."

That's a strawman argument. Any reasonable schema management implementation
has safeguards against DROPs. If your tooling blindly executes a DROP without
extra confirmation, use better tooling.

> Again, polishing a turd: the only way your "automation" works is through
> manual intervention.

There's absolutely nothing wrong with requiring extra human confirmation for
destructive actions. Quite the contrary. I've spent most of the last decade
working on database automation and operations at social network scale, and
will happily say this is a common practice, and it's a good one at that.

> You can construct a view referencing the old table, and rename the table.
> Yes, it has to be an updateable view and you need transactional DDL, but
> within those constraints, it's doable.

So you're assuming that every single table has a view in front; or you're
dynamically replacing the table with a view and hoping that has no detrimental
impact to other database objects or app performance. Either way, you're
talking about something operationally complex enough that it isn't fair to say
that production table renames or column renames are a "trivial operation" at
the vast majority of companies.

> It's so bad that you have large companies investing heavily in ripping out
> the schema entirely, which itself is just more snake oil.

This is frequently overstated. For example, although Facebook uses a flexible
solution for its largest sharded tables, there are many tens of thousands of
other tables at Facebook using traditional schemas.

> In the problem space of trying to keep a schema in sync, we know that
> diffing leads to unacceptable answers, that is an indicator that it's the
> wrong conceptual basis for a correct solution.

The only "unacceptable answer" you've cited is rename scenarios, which even if
it incorrectly leads to a DROP, the tooling will catch.

If you need crazy view-swapping magic to support an operation (renames), that
is an indicator that it's a conceptually problematic operation that should be
strongly reconsidered in production.

As I've already stated elsewhere in this thread, declarative schema management
has been successfully used company-wide at Facebook for nearly a decade, and
is also a common practice in the MS SQL Server world. If you're unconvinced,
that's fine, but many companies have found it to be a great workflow!

------
akie
I made this, and have been using it in production and for local development
for the past year or two. It's been immensely helpful to quickly save or
easily restore the database to a previously known 'good point'. I have also
been using it to quickly load staging databases into development and vice
versa. So far it's MySQL only, but the connection points for other databases
such as PostgreSQL are in place.

Its main goal is to simplify development - it's not intended as a replacement
for writing database migrations or to work with production databases. I would
be grateful for any feedback, preferably in the form of pull requests ;-)

~~~
westoque
I like this pattern of being able to restore snapshots in your development
environment. For the uninitiated, Docker actually has all these functions. For
example, if you create a container for postgres and mount a volume, you can
create a snapshot of the volume for distribution to your team, so they can
mount/unmount anytime.
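
One common way to capture a volume for distribution is to archive its contents from a throwaway container, since there is no single built-in `docker volume snapshot` subcommand. A minimal sketch (volume and file names are hypothetical):

    import os
    import subprocess

    # Archive the named volume's contents into ./pgdata.tgz using a
    # throwaway Alpine container; restoring is the reverse (untar into a
    # fresh volume before starting postgres against it).
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", "pgdata:/data",            # the volume backing the postgres container
         "-v", f"{os.getcwd()}:/backup",  # host directory to receive the archive
         "alpine", "tar", "czf", "/backup/pgdata.tgz", "-C", "/data", "."],
        check=True,
    )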

~~~
akhilcacharya
I was under the impression that container file systems make running a DB in a
container a bad idea? Or is it just in performance-critical situations?

~~~
tehbeard
Doing so on the main container filesystem (which is usually an overlay/union
stack) is probably bad for performance.

But a bind mount (which is usually the recommended way to do it so that data
survives the container lifespan) will avoid that and just leave you with
whatever "problems" the host filesystem has re: database files.

------
MadWombat
There are multiple tools that attempt to do this, some specific to a
framework, some not. To name a few,

\- flyway

\- liquibase

\- dbmate

\- alembic for python/sqlalchemy

\- django migrations for python/django

\- diesel migrations for rust/diesel

\- ecto migrations for elixir/ecto

\- persistent migrations for haskell/persistent

All of these, in one way or another, provide an ability to create sequential
database schema changes that can be stored in version control and
applied/rolled back.

What is your system doing that is different from all of these?

~~~
akie
This is more meant as something that supplements migrations, as a “time
machine” of your database so to speak. You would run it before you do any kind
of potentially destructive operation on your database, such as running or
rolling back a migration. Not all migrations are reversible (for example,
migrations that drop columns), so you cannot easily go back to before you ran
that migration. It’s also convenient when you check out an older version of
your code - you can just restore the database as it was at that point in time.
You could run a command in a git hook to enforce saving a new version when you
tag a new release for example. In any case, it makes your database less of a
black box, and it basically becomes ok to destroy it - because you can so
easily make a new snapshot and go back to how it was before.
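
For example, a post-checkout hook could take the snapshot automatically. A sketch (the exact `db` subcommand here is hypothetical; check the README for the real invocation):

    #!/usr/bin/env python3
    # .git/hooks/post-checkout: runs after every checkout.
    import subprocess
    import sys

    # git passes: previous HEAD, new HEAD, and a flag (1 = branch checkout)
    prev_head, new_head, is_branch = sys.argv[1:4]
    if is_branch == "1" and prev_head != new_head:
        # snapshot the database so this checkout has a restore point
        subprocess.run(["db", "save"], check=False)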

~~~
evanelias
> it makes your database less of a black box, and it basically becomes ok to
> destroy it - because you can so easily make a new snapshot and go back to
> how it was before.

There are some extremely serious caveats here that need mentioning. A
production DB is typically taking application writes 24/7. Restoring from a
snapshot would cause data loss of any writes that occurred after the snapshot
and before the restore.

You would need to also stop prod traffic and then replay the binary logs
beginning from the same point as the snapshot, but that can be complicated if
you're trying to reverse some DDL operation that is inherently incompatible
with the transactions in the binlogs.
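
Roughly, MySQL point-in-time recovery has this shape (the binlog file name, start position, and cutoff are placeholders; the `mysqlbinlog` flags themselves are real):

    import subprocess

    # Replay binary log events from the snapshot's recorded coordinates up
    # to a cutoff just before the bad event, then feed the resulting SQL
    # back into the restored server.
    events = subprocess.run(
        ["mysqlbinlog",
         "--start-position=4",                   # position recorded with the snapshot
         "--stop-datetime=2019-12-28 12:00:00",  # cutoff before the bad event
         "binlog.000042"],
        capture_output=True, check=True,
    )
    subprocess.run(["mysql", "--user=root"], input=events.stdout, check=True)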

Also, if I'm not mistaken, your tool's invocation of mysqldump is not using
the --single-transaction option. Without that, mysqldump doesn't create
consistent snapshots for InnoDB tables. This means what you're dumping and
restoring doesn't actually represent a single valid point-in-time of the DB's
state.
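
For comparison, a consistent InnoDB dump invocation looks like this (connection details are placeholders); `--single-transaction` makes mysqldump do the whole dump inside one REPEATABLE READ transaction instead of locking tables:

    import subprocess

    # Dump a single point-in-time snapshot of an InnoDB database without
    # blocking concurrent writes.
    with open("snapshot.sql", "wb") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction",
             "--host=localhost", "--user=app", "appdb"],
            stdout=out, check=True,
        )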

~~~
akie
You are right: this tool offers no guarantees whatsoever. Its intended use is
for development and staging. You could theoretically use it in production, but
not on a live database server. If you have that use case, you need to think
carefully about your tools and your processes to guarantee a consistent
outcome.

------
egyptiankarim
Is this approach materially different/more scalable than using the active
record, ORM, and seed file patterns supported by popular MVC frameworks (e.g.,
how Rails does it:
[https://edgeguides.rubyonrails.org/active_record_migrations.html#migrations-and-seed-data](https://edgeguides.rubyonrails.org/active_record_migrations.html#migrations-and-seed-data))?
Isn't the crux of all that to be able to source control and easily distribute
schemas and data in a database-agnostic way?

~~~
abhchand
This was my thought as well. Really impressive as an exercise to develop this
yourself, but this is core functionality of a lot of ORMs. Rails'
ActiveRecord is kind of the shining example of this, with its `schema.rb`
versioning file.

~~~
akie
This is more meant as something that supplements migrations, as a “time
machine” of your database so to speak. You would run it before you do any kind
of potentially destructive operation on your database, such as running or
rolling back a migration. Not all migrations are reversible (for example,
migrations that drop columns), so you cannot easily go back to before you ran
that migration. It’s also convenient when you check out an older version of
your code - you can just restore the database as it was at that point in time.
You could run a command in a git hook to enforce saving a new version when you
tag a new release for example. In any case, it makes your database less of a
black box, and it basically becomes ok to destroy it - because you can so
easily make a new snapshot and go back to how it was before.

------
beart
This doesn't tackle the primary issue I have with databases and source control
- diff.

A proper source control system for databases should handle both migrations and
diffing across migrations. I should be able to compare any two commits and
view exactly what is different between them.

~~~
ericschrock
There are purpose-built databases for this, such as Dolt
([https://www.dolthub.com/](https://www.dolthub.com/)), itself purportedly
derived from noms
([https://github.com/attic-labs/noms](https://github.com/attic-labs/noms)).
Because version control is baked into the database, you can do complex things
like diffs and merges, but it requires that you run everything on that RDBMS
vs. something standard. Generic diffing of two arbitrary data states of any
database is pretty challenging, even if you have all the transaction history.

~~~
zachmu
Yeah, Dolt is built for distribution / collaboration of data sets, rather than
for actually hosting live data in production (although it can do that). The
workflow you would use to diff your database versions would be very similar to
OP's: periodically take a snapshot dump of prod, then import that snapshot into
Dolt. The reason this is better than just diffing the snapshot directly (i.e.
textual diff where you see which INSERT statements are present in one version
but not the other) is that 1) you get git-like version control semantics where
each revision is associated with a person and a commit message, and 2) you can
run SQL queries to inspect what's different between two revisions, rather than
needing to write some analysis program.
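
Something along these lines, for instance (illustrative only; the exact `dolt_diff_*` system-table schema is documented by Dolt, and since Dolt speaks the MySQL wire protocol any MySQL client works):

    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", database="appdb")
    with conn.cursor() as cur:
        # Hypothetical query against a per-table diff system table: which
        # rows of `users` changed between the two imported snapshots?
        cur.execute("""
            SELECT from_commit, to_commit, diff_type, from_name, to_name
            FROM dolt_diff_users
            WHERE from_commit = 'HEAD~1' AND to_commit = 'HEAD'
        """)
        for row in cur.fetchall():
            print(row)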

------
timsehn
We built a Git-style version-controlled SQL database called Dolt
([https://github.com/liquidata-inc/dolt](https://github.com/liquidata-inc/dolt)),
mentioned deep in this thread. It allows diffs and merges across schema and
data based on commits.

We are focused on the data distribution/open data use case
([https://www.liquidata.co/blog/2019-10-09-where-is-the-data-catalog/](https://www.liquidata.co/blog/2019-10-09-where-is-the-data-catalog/),
[https://www.liquidata.co/blog/2019-12-06-the-history-of-data-exchange/](https://www.liquidata.co/blog/2019-12-06-the-history-of-data-exchange/)),
but I think a number of use cases mentioned in this thread are covered by
Dolt.

You do have to move a copy of your data to Dolt; it does not sit on top of or
beside your current database. So it is an extra step in your data modification
workflow, similar to versioning code: you had to add a version control step to
your code modification workflow.

------
lichtenberger
Pretty nice tool :-)

But what if the data store did all of that already?

    
    
      - retaining the history of your data:
        - free deduplication through copy-on-write and sharing of unchanged structures
        - further storage space savings through page-level versioning (not all records in a page have to be written: just the changed/inserted/deleted ones, plus, depending on the algorithm, a few more to fast-track its reconstruction)
        - read and write performance is balanced using a sliding snapshot algorithm: this avoids having to read long chains of incremental page changes, and it avoids write peaks from intermittent full page snapshots. Instead these are spread over a few revisions (usually three or four).
        - the storage manager stores timestamps for all revisions in a RevisionRootPage, which itself is indexed
    
      - Optional commit author and message
      - Reverting to and committing a past "good" revision just adds a few bytes to index a new RevisionRootPage which references the past revision's data (a simple pointer)
      - You can diff taking rolling hashes into account
      - Currently I'm working on a JSON diff format and a web front-end to display these using special space-filling approaches
      - Sophisticated time travel queries
      - typed, user-defined and versioned secondary index structures (based on AVL trees)
      - a path summary always kept up to date
      - a lot of metadata stored and kept up to date for each record/node
    

It's already implemented in [https://sirix.io](https://sirix.io) /
[https://github.com/sirixdb/sirix](https://github.com/sirixdb/sirix)

I'm working on getting it stable, but help from you, the open source
community, is always needed :-)

------
ericschrock
I'm working on a project looking to solve a similar problem, but using a very
different mechanism to provide git-like semantics for any database that runs
in a container ([https://titan-data.io](https://titan-data.io)). We support
local docker containers today and will be adding k8s support in the coming
weeks to run in any Kubernetes cluster using k8s snapshots (e.g. push data
state from CI/CD k8s cluster to S3 bucket, and clone locally later for
forensics).

Each automation approach (SQL, dumpfiles, or storage) carries different
tradeoffs, but we believe there's a real problem facing developers when it
comes to managing data within their workflows. Would love to learn more about
what some of the killer use cases are, and how these approaches can complement
each other.

------
kaiby
Which databases can this be used for? Based on the usage of phpmyadmin and
WordPress in the examples, I take it this only works for MySQL?

~~~
akie
Yes, so far this is MySQL only.

I don't think adding Postgres support would be that far-fetched - have a look
at
[https://github.com/infostreams/db/tree/master/bin/drivers/mysql/connectors/direct](https://github.com/infostreams/db/tree/master/bin/drivers/mysql/connectors/direct)
for example to see the meat of the load & save operations. You'd have to
implement it twice though, once for local servers and once for remote servers
(via ssh).
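
As a rough sketch of the two variants (the structure is hypothetical; `pg_dump` and its `--no-owner` flag are real):

    import subprocess

    # Local variant: run pg_dump directly against a local server.
    def save_local(dbname, outfile):
        with open(outfile, "wb") as out:
            subprocess.run(["pg_dump", "--no-owner", dbname],
                           stdout=out, check=True)

    # Remote variant: run pg_dump on the remote host over ssh and capture
    # the dump locally.
    def save_remote(host, dbname, outfile):
        with open(outfile, "wb") as out:
            subprocess.run(["ssh", host, "pg_dump", "--no-owner", dbname],
                           stdout=out, check=True)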

~~~
ianwalter
This should be in your README (ignore if it already is and I just missed it).

------
deepakputhraya
I think dotmesh does something similar -
[https://dotmesh.com/](https://dotmesh.com/)

Github link - [https://github.com/dotmesh-io/dotmesh](https://github.com/dotmesh-io/dotmesh)

------
rhyst
Looks very interesting. In the README it says:

> The text mentions connection type phpmyadmin, but it doesn't actually work.
> Don't use it.

Will phpmyadmin support work in future? My use case is making backups of a DB
where I don't have SSH access but do have a phpmyadmin account.

~~~
akie
Yes, that's why I originally added it. After a day of hacking I had it
(mostly) working on one particular instance of phpmyadmin, but when I tried it
on another one it completely broke. Not sure why that was, but since I didn't
have the need myself anymore I stopped developing that part.

------
oauea
So this just automates mysqldump and import, right?

~~~
MuffinFlavored
I can't tell if it is for just the schema, or the schema + data inside the
tables. If it is just the schema, I feel like there are better ways to define
and manage database migrations in software (i.e., make them part of the repo
with up/down methods to roll forward / roll back)

------
valw
_cough_ Datomic _cough_: [https://www.infoq.com/presentations/Datomic-Database-Value/](https://www.infoq.com/presentations/Datomic-Database-Value/)

------
k-godwin
I'm not sure what advantage this has over putting MySQL dumps into git (or
manually versioning them with filenames)?

------
husui
Hi, thanks for making this. I tried to give it a shot with a remote MySQL
database, but I'm not sure how to specify the connection port (I submitted an
issue about it). Sorry if it's obvious and I missed it.

------
simbakhadder
Seems useful as a tool for integration tests. Makes it a bit cleaner to update
the state of a database and test against the application-level code.

------
llagerlof
Nice idea to speed up the process of development and testing.

Does the snapshot generated include the views, triggers and stored procedures?

------
kuczmama
This is a really cool idea. Thanks for sharing this. I will definitely give it
a go if you have postgres.

------
llcoolv
Looks nice, but I don't see what advantages it has compared to liquibase or
flyway.

~~~
PrimeDirective
It's not just structure, it's data, too.

~~~
llcoolv
Well, most of the flyway migration schemes I have seen so far also cover data
(both initial refdata and migrating existing data when the structure changes).

------
dmidnight
If you were to take the hash and place it in a bitcoin transaction on the
blockchain, you could "notarize" each hash with your private key.

