Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: How do you version your data?
114 points by webmaven on Feb 20, 2017 | hide | past | favorite | 55 comments
Superficially, It ought to be possible to use SemVer major.minor.patch (patch can change data and alter the schema but not extend the schema, minor can extend the schema in backward compatible ways like adding a column, and major version numbers are used for backward-incompatible changes), but in practice I don't see this being applied consistently, especially if data dumps are only occasional.

Is your data versioned? How?

I consider data to be a first class object. I break it into data in motion and data at rest.

Data in motion - messages - always look something like this:

    message := version timesent field1 field2 ... fieldn
    version := INTEGER
    time-sent := INTEGER

and the parsers know to reject messages with versions greater than what they can parse; depending on the system, they can also be backwards compatible. Time-sent turns out to be a lifesaver in debugging. You might also need TZ of time-sent, depends on the domain.

Versioning data at rest tends to be a little squirrely depending on the domain. Do you migrate data or do you not? what's your uptime? streaming or batch? Sometimes I version the actual table names, sometimes I migrate.. it depends. My preference is for migration to keep a consistent system, but that is not always feasible.

I'm a huge fan of SQL - it defines the data shape and structures the transforms possible on it, along with allowing a strong separation of data and computation. Postgres is my friend; I heavily use foreign keys and constraints on the schema. That way the data is reliable. (if your data isn't reliable, your schema should reflect that too of course). If I need to have multiple versions of data running at the same time, multiple tables or migrating is cleaner than versioning the specific rows. Otherwise you wind up with nulls and driving schema logic out into your code.

Typically I tack a unix time of insert into the rows for later analysis. You might also care to insert the current application name+version into the rows to catch any bugaboos when that changes.

Check out Pachyderm [0]. It supports distributed, version-controlled data storage. The API is very Git-like: you modify data by making commits.

[0] https://github.com/pachyderm/pachyderm

Very interesting, thanks for pointing this out!

The most RESTFul way to do this would be to use content negotiation using the `content-type` header set to something like `application/com.vendor.product+json; data-version=2.1; api-version=3.4` where the minor version indicates data changes and the major version indicates schema changes to the data. You can club together the API and data versions into one version like '' if you can define what a major/minor change to the API means. Exact details on how the client is exposed to the versioning will depend on the product requirements.

In terms of storing the data, we had a system where the content would be zipped after the content developers were done with the authoring and sent to a place which would convert it into appropriate JSON documents with the metadata and versioning information stored in the DB while the the document content could be stored in the cloud or a document database like Mongo or just Postgres. The content authors only knew excel who were trained to follow a schema while writing the content. That was like a low cost CMS. You can update only the content that has a diff or the entire content depending on how well you can identify a diff for the content. The entire content makes it simple.

Our migrations have unique datetime stamps. Stamps and migration names go into a table.

We do have a hash of all of the contents of the table, but the schema "version" is just the list of migrations that have been run. We run many different instances of our applications on several different "versions" and from time to time pull individual migrations back into earlier releases for hotfixes etc. The overall version is the instance, environment, and this hash.

Practically speaking, when we're handling the files ourselves though, it's instance, environment and backup date.

There are edge cases and it's not perfect (around ordering mostly), but it's incredibly rare for us to run into issues with this schema / data migration approach. We've found that even with a bunch of environments and a couple thousand migrations, data versioning isn't that serious a problem for us.

So, the versioning of the data is basically by the backup date (plus pointers to the relevant schema, etc.), correct?

Great question - we invented a system for this at Snowplow, called SchemaVer:


SemVer doesn't work for data - for one thing, there is no concept of a "bug" in your data (so patches are meaningless).

We have hundreds of companies actively using SchemaVer via the Snowplow (https://github.com/snowplow/snowplow/) and Iglu (https://github.com/snowplow/iglu/) projects.

> there is no concept of a "bug" in your data (so patches are meaningless).

Is there not? Lets say that you're changing the data in your database/data structure from state X to state Y. This involves transforming the data in some tables/data structures from the old structure to the new.

Lets say that you do this and it's all fine, the upgrade goes great. But then you discover there's a problem with the data upgrade.

While you have transformed the data into the new format, it's not been done right. So you actually need a second data change to ensure that your data upgrade is semantically equivalent to the data that went before it, even though the data conforms to your new schema.

Would that not count as a bug in your data?

Any change you make to the schema will be a breaking change.

Not so. Adding a new column is not a breaking change for reads, and may not be a breaking change for writes unless the new column is required, has no default value, and cannot be NULL.

That is not really true in practice.

Take for example a GetUserStatistics() call which provides a list of userids and the users last login date.

A client might be using this list to get statistics on system usage.

If you change the codebase to add the concept of a test user and add an isTestUser column to GetUserStatistics() you have broken the contract with your users.

You had an implicit contract based on shared understanding of the data.

Now of course to correctly determine user usage statistics you need to exclude the test users by checking the new column.

Changing the semantics is entirely besides the point. You can change the semantics of the data without making any schema changes at all!

Isn't it the whole point?

As a consumer it is what I care about.

And I'm not discussing changing the schema, but the data contained with the data structures.

When transforming the schema, you frequently have associated changes that you apply to transform the data from one form (in the 'before' schema) to another (in the 'after' schema). These transformations are code that can have bugs like any other.

In cases like these you can have data that is in the right format, but isn't correct, and can need a second change (to the data only) to correct it.

+1 for Snowplow's approach. If you have an app out there in the wild (with autoupgrades off for at least 10% of devices), there isn't such a thing as "we'll do a hard cutoff for our tracking data". Their way, the data is versioned in self-describing contexts so they are separated from JavaScript tracker to database (two different tables), and you can write an ongoing migration between the tables. Very helpful.

Looks quite interesting, though again this is versioning the data schema rather than the data.

I think you have a certain amount of fuzziness around the idea of an "interaction" with the data. It would probably help to think about compatibility and breaking changes in terms of reads vs. writes in order to get the determinism you're looking for and better alignment with SemVer.

That is, if a client using the previous schema can still do reads and writes without the data being invalid, you have forward compatibility, and this qualifies as an PATCH.

If a client using the previous schema can still do reads against the new schema without the data being invalid but not writes, that would qualify as a MINOR change.

(Aside: write-but-not-read compatible changes are possible, but are uncommon in practice)

A change that can prevent a client using the old schema from doing valid reads against the new schema (eg. a column is renamed or removed) would be a MAJOR change.


I agree there is some fuzziness. Long after we wrote this I read:


which has a very succinct explanation of forwards and backwards compatibility as it relates to producers and consumers.

It's high time we did a second draft of SchemaVer which explains it in terms of forwards/backwards compatibility; the actual behavior of it (when to bump etc) would barely change.

I like the idea but I'd suggest not using semver terminology as it is misleading.

Breaking writes would not be considered a minor change in semver.

In Data warehouse context, it is often managed by one of the Slowly Changing Dimension management techniques.


Shameless plug: https://github.com/chop-dbhi/scds (stands for slowly changing dimensions store). A prototype, but versions individual records and tells you what changed if anything. Also there is preliminary support for JSON schema.

Cool project, thanks!

Thanks for the reference!

I wrote[0] about a method I've used with success in the past. Essentially you use the previous version's hash as the name for the next version. The benefit is that merging the work of multiple developers is easier. I ran the process manually when I did it and would love to hear if someone writes a script that makes the process easier.

[0]: https://medium.com/@clord/for-migration-of-schemas-use-versi...

So, your data versions form a DAG, sort of like Git commits?

What a great idea, thanks for sharing!

BTW, why using shell scripts to write migrations is so unpopular? We have lots of language/framework specific solutions or some «language agnostic» tools using plain SQL, both ways are rather limiting.

So if you release version 1.0 with hash x, then release version 1.1 with hash y, and then release version 1.0.1 with hash z, 1.0.1 == 1.1?

For branching schemas, yes it gets complicated. But at least you get a nice conflict while merging the branches (at the right place) These conflicts help guide a rewrite of the schema history into something consistent. That's part of it: when a merge happens, you basically have to create a new migration that does the merge too.

and if you never merge the forks, there is no problem. they likely won't share a production database anyway, right?

It's like a block chain!

On a somewhat related note, I have an API that returns time series data. The API itself is versioned but not the data when it is added to since the endpoint is the same. How would I go about notifying consumers of the API that the data has been updated and they should call the API to get the latest data? Would versioning the data help with this?

Most data versioning approaches would help bulk consumers of the data, rather than through the API.

You could add a separate "changes" feed to help API consumers, or inline version info to the main API results.

Thanks, was thinking about doing a new endpoint for changes too but wanted to see if this topic of data versioning pertains to this or not

It does, but I think what you're really after here is a mechanism for cache invalidation.

Check out PostgreSQL's LISTEN and NOTIFY:



A related HN discussion a few years ago: https://news.ycombinator.com/item?id=6689213

If you can predict when you're going to get new content, just use an HTTP header to tell until when the content can be seen as fresh. Then the client will know when to call the API again.

If not, ask the client to provide an HTTP callback endpoint, and ping it to notify of new fresh data on your side. Using WebSocket is also an option.

Thanks, unfortunately the new data updates can vary.

Have seen two ways of doing in large organizations 1. Use tools like Optim (i think ibm) 2. Export CSV data, and export schema and drop it to Hadoop data store

Option 2 is preferred for very long time restore say after 15 years as there may not be a way to restore to Oracle/MS Products which may not have run time ecosystem after 15 years.

Edit: I realized that your question is just about the versioning scheme (the "what"), and my answer is more about the "how". I hope you still find it useful.

After being inspired by Hans Werner's answer here (several terabytes of binary data, 50000+ revisions), I chose Subversion. It's not conventional, but works very well in practice.


You get:

a) Natural audit trail & notes on data modification

b) Managed central dump of data => multiple, distributed local copies that you don't need to worry about keeping in sync. Just delete the cache and the data access API (see below) will check it out automatically again when you request the file.

Data access is encapsulated via an API that manages a /home/datadump/ of cached, revisioned files. You refer to the file using it's name + revision number (see below). I guess tags and branches can be used for more natural revison numbers, but I need to investigate whether Subversion's cheap copy works well in practice for this. They might be the perfect solution for you major.minor.patch needs?

User Code --> get_data("<file_path/file_name>", "r=38") --> API checks the cache for file_path/file_name_r=38. If it's not there it checks it out using a read-only user id and puts it in the cache and returns the path to /home/datadump/file_path/file_name_r=38.

Unusual, but works just fine for our purposes of mostly-read-only large files that need a revision history.

An idea that I did not explore was using ZFS or other revisioned file systems. Another "crazy" idea that works just fine for some folks is using Binary blobs in a database; not sure about size limits, though.

I suggest you look into schema migration tools.

What I have done in the past is to prefer lazy migrations (migrate once you access an old record), rather than migrating everything in batch.

Also a good idea to archive data that hasn't been used, moving it outside your primary database.

I am pretty familiar with several, such as Alembic, South, Django Migrations, etc.

I'm not just interested in versioning schema changes, but data changes as well.

There are database migration tools like Liquibase/datical and, I believe, Flyway that let you specify migrations not just of schema changes (DDL) but also data changes (DML).

I have used Liquibase in production for handling data changes for small-scale data (in megabytes) for dev/test/mock data and/or values in preconfigured application-controlling tables, but I wouldn't use it for gigabytes of transactional data.

Thanks for the references!

+1 for Liquibase, too

Depends on what it is, either semver for things where backward compatibility is necessary or just "version 1, version 2 etc." where a new version is a big enough change you should consult the documentation to know what is going on.

I sometimes do it within the same table. It's a bit of a hack. What you want is a column called revision_id (something to point back to the master record). You'd want a state column to track if it's a draft, revision, active, etc. You might also want a sequence column too, so that when you apply drafts, you can compare if the draft can modify the master record (if sequence column doesn't match, you prevent the draft from overwriting master record data). Upon applying a draft to the master record, you do a deep clone of the existing record, insert the existing data as a new row pointing back to the master row via revision_id.

So, this is per-record versioning only, and isn't tracking tables as a whole or schema changes, correct?

My hack way is to dump the tables to text and commit that. That was done by cron and not coupled with committing the code.

Granted, I was interested in the data within the table not the structure and the size was manageable.

Do you mean schema, not data?

I've seen three schemes used in production:

- single number versioning (V1->V2->V3->V4) - this is the scheme Android's sqlite helper insists on https://developer.android.com/reference/android/database/sql...

- full version numbers, but well defined upgrade tracks enforced in code.

- token based (e.g. a list of upgrades that have been applied)

Schema changes are only one (admittedly, particularly troublesome) type of change that should be versioned. Record changes, additions, and deletions can in many cases be usefully versioned as well.

That is, imagine that after a version of the data is released it increases 10X, but the schema doesn't change. That might still be worth making a release for, right?

Typically what I see in the wild are data dumps on some sort of semi-regular schedule such as weekly, monthly, or annually, "versioned" with the timestamp or some derivative thereof (eg. "August 2016").

This is much better than nothing, but just as a source dump for each release of an application or other software in a tarball is not as useful as being able to check out a tagged release from a repo, so too versioned data releases that can be checked out with a tag from a repo would be more helpful that a data tarball.

I've also seen single-number versioning in production that goes '' -> 'v2' -> 'v3' :)

You might want to check Datomic out. They have a solution for your problem: http://www.datomic.com

This project seems to be making progress for WordPress: https://versionpress.net/

ALL migrations insert major.minor.patch to the database. Migrations are committed to the database, and can be run anytime; they each have check code at the beginning to compare against what's in the database before executing the payload. Migrations won't execute if the version is semantically less or too far ahead from what's in the database.

It's worked reasonably well so far.

I use protobuf schema evolution. It's the same approach that Google use (naturally), and Facebook (with Thrift), among others.

There are several tools out there for doing this, if you're storing your data in databases.

SQL Server has Master Data Services. There's also Talend. There are other master data management tools available.

Depending on how you use your data, there are tools for improving data quality that tie in with those as well.

CQRS and event sourcing is an approach where all data changes are versioned. You can re-create earlier states of the data by processing a portion of the event log.

All data changes? That is, every INSERT and UPDATE?

I was thinking more along the lines of a versioned data "release".

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact