Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: How do you version control a Database?
8 points by 719Ben on July 20, 2015 | hide | past | favorite | 15 comments
Hello All,

I am working on a project with 2 others and we have a PostgreSQL database that links to our project. We are trying to figure out the best way to share the layout of the database as it is edited. Currently we email each other when we change the table layout but this takes a lot of time and feels very backwards from regular Git commits. Does anyone have any advice to version controlling a database layout? We could care less about the actual data at this point because we are still developing the system.

If one of the many schema migration solutions out there doesn't work for you, hacking one together is simple enough:

* Put your schema/data migrations into .sql files as raw SQL

* Number the files from 0001 on

* Write a script to apply each migration file to the DB in order, recording the file name to a log table after successful application. Only apply migrations not found in the log table.

* Commit your .sql migration files along with the code it is tied to.

That's a good approach - I've used it in a legacy system, although I named the files in the format "TIMESTAMP_migration_description", which is a bit more descriptive.

Not sure if this is helpful, but you can use a database migration framework. Check update scripts into your version control, then the others can just run them when they pull your changes. We use alembic (http://alembic.readthedocs.org/en/rel_0_7/) but something else may be a better fit for you depending on what software you use.

So there are solutions to this issue as others have pointed out. Full products are made around it.

There are a few ways to do it quickly and more down and dirty though. 1, do what lastofus suggested, it works and is simple.

2, and the way I have used with small groups when there is limited data for testing and we are using localhost databases for dev. First, check in drop/create scripts for ever element in the database. Second, create a sql script that inserts all the test data, users etc. Third, create a shell script that you can run that gets the latest DB scripts and insert script and runs it all. It won't scale for a long time, but it works for longer than people think. Then whoever is making a new change is responsible for updating the insert script to make sure it stays 100% operational. And personally, we always tied the db script checkin to the same code set checkin that had the breaking change if there was one. Otherwise if it wasn't a breaking change which was our goal usually it could go separate.

Of course, a migration framework is superior, but sometimes the weight of it just isn't what you need yet.

Migration frameworks superior? No way.

Migration frameworks work on diffing dev with a copy of the deployment target. This is inferior. If you rename a table it may drop the table losing all data, then create a new one. It has no way of knowing a rename from 2 unrelated tables. It includes junk/testing objects that should never pollute proudction.

Hope your bank is not using one of these migration frameworks.

Nubmered scripts are the way to go. With a strict policy to never update a script, only create new ones. You're modifying state, so you must capture ordered steps that moved you from state 1 -> 2. This will avoid a host of subtle issues. After a release you can create a backup as a baseline for the next set of scripts to run against.

Migration/comparison frameworks/tools can be used poorly just like any tool, but they don't have to be. When you have the funds and a moderately sized team, especially in dev, I think the migration tools/frameworks can be invaluable time savers and are superior. Also, the OP isn't in production, but once in Production I would expect schema changes would be less regular and generally less breaking except for occasionally. So having a tool now is more advantageous in many ways but may not fit into their budget or learning curve window, and it isn't needed, just can be nice.

No tool mitigates the responsibility for reviewing the updates before applying them in production, that would just be irresponsible IMO. But then again to me, there is no such thing as a SQL database schema change that doesn't require a person monitoring deployment. I know for my teams over the years many times we have tested a script in dev and stage, only to find out in production some set of records or other change that wasn't reflected in the dev/stage environments causes an issue. That is also why we started using migration frameworks to do diffs for us between environments to help minimize the chance we would run into these issues. It also let us do comparisons of the scripts marked for deployment to make sure there weren't conflicting changes, again something that occasionally happens on larger or fast moving teams.

It isn't that numbered files don't work, but just like my other suggestion, they only scale and go so far, then you really need a tool to help you keep things straight and help compare your perception to reality. For the OP, either numbered scripts or my other suggestion works for now, but I wouldn't rule out migration/comparison tools as they can really be helpful.

And actually the bank point isn't really a good one, many of the financial clients I have worked for use SQL migration tools to manage their deployments, along with their DBA's playing overwatch. This allowed them to configure rules within those tools that would prevent errors from happening.

I think diff is OK as a verification check after a deployment. But diff is not OK to generate the change script itself.

It's not about abusing the diff. It's about diff being fundamentally the wrong technique to transition from state (C)urrent -> (T)arget.

A diff cannot always bring you to the desired T. The diff only factors in 2 states (C and T) and not X,Y,Z which you may have an import from and which may need to occur during a transitional state before T is reached.

It doesn't factor the desired state T may not even exist to diff against if developers move forward adding objects not for this release. It makes things easier on web developers at the expense of deployments.

Which "scales" better? Migration tools scale better at letting web-devs free wheel in the DB.

For bullet proof deployments migration tools don't scale. They excel at the simple cases but hit a hard limit.

BTW any details on what kind of data in production broke a deployment? Heads up to avoid it myself.

I agree you may have a few transitional states to go through before dev becomes production. And in general, we didn't let the migration tools write our scripts for us, we used them to automate the testing and deployment of our scripts, maybe that is the key difference in how we used them. Which is maybe why I have never experienced what you are saying, when we added the tools in place, properly set things up and used them consistently we had less issues, and smoother deployments (not perfect, but smoother). But again, they were larger teams all working in the same data sets so we had to have coordination etc. Plus we had DBA's that were gate keepers for the database systems, and we never allowed developers write access to production, that is just asking for issues.

As for some of the issues I have seen with teams moving fast or that are larger. In one case we had around 40 commonly used databases, hundreds of tables in most of them, millions of data rows, multiple servers etc. Not saying the DB architecture was ideal, but it was not uncommon for larger systems at mid+ sized Enterprises I have seen as a consultant. Many times it is when an organization is between rapid growth still and getting discipline that you see a lot more of these types of issues (larger teams, fast moving still with a lack of BTDT leadership). The issues we saw between dev, stage and production were 1. the datasets were too large to replicate from Production to each environment, 2. security concerns over too many people having access to some of the data, 3. multiple teams split functionally instead of per sprint.

So how did these cause issues.

1. We would not see all the data in each environment and while we would test across a wide swipe of data and also run queries trying to validate our alter and update scripts it wasn't uncommon every couple of deployments to have DB scripts fail to run because of some data we couldn't test against. The toolset helped us a little here, because the DBA's set it up to auto test changes against the production data before it would let the script run. And every time a script would fail they would add new test cases if it warranted it.

2. Since not everyone could have access to all the data they couldn't predict every outcome, see #1. So the toolset helped us automate what they couldn't see so we could prevent some issues. This is very common when dealing with financial or PII in health care settings. Also, this SHOULD and could have been mostly resolved by not splitting teams functionally #3.

3. Teams were split by their function instead of sprint. We had a data team, analytics team and development, each working mostly independent of each other. Meaning it wasn't rare for the data team to release a script to production in the morning which didn't get applied to dev/staging and then development comes along a few hours later tests and gets sign off on their changes in staging. Then development deploys to production and has scripts fail to run in production or causing some data havoc (or vice versa between the teams). Had the teams been integrated better this would have been less likely but still not impossible. And I hear all the purist saying well the problem was someone skipped stage etc. Yep, it sure is a core problem, but frankly in every environment I have worked I have seen it happen and so to dismiss it as it shouldn't happen is naive. Like saying well I am not testing for any errors in my code because there shouldn't be any, uh huh. So when we used the toolset it would tell us hey, something has changed here, and the DDL looks different than what you said it should look like, abort.

Also, most places I have seen need to routinely do production to stage and dev refreshes to help prevent a lot of these types of issues, which right there tells you that shit happens and it is nearly impossible to plan for all outcomes when it comes to complex systems. So tools are the right thing to help you.

Last point, when we did have major deployments and a lot of DB scripts were going out, the toolset also would run through all the scripts and warn us of conflicts. It would give us false positives sometimes, and miss some things, but it really saved our asses more than once.

Sorry for the length, just wanted to share. FWIW, we agree the simple methods work, its just that tools and frameworks, implemented properly really do mitigate some of the human and process failures we all have from time to time.

Ty for the share.

> Currently we email each other when we change the table layout but this takes a lot of time and feels very backwards from regular Git commits. Does anyone have any advice to version controlling a database layout?

One method is to use pg_dump to produce the script to recreate the database, and keep that in version control with source. It may not be the best method, but its probably the one that requires the least changes to existing workflow of modifying the database and still gives something like what you want.

A simple framework for database migration that has worked for my team:


Flyway is a Java based migration framework. It's pretty good. It's simple and straightforward. http://flywaydb.org/

Flyway is pretty good. It's Java-based. http://flywaydb.org/

My team is using Red Gate

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact