
Great question - we invented a system for this at Snowplow, called SchemaVer:


SemVer doesn't work for data - for one thing, there is no concept of a "bug" in your data (so patches are meaningless).

We have hundreds of companies actively using SchemaVer via the Snowplow (https://github.com/snowplow/snowplow/) and Iglu (https://github.com/snowplow/iglu/) projects.

> there is no concept of a "bug" in your data (so patches are meaningless).

Is there not? Let's say that you're changing the data in your database/data structure from state X to state Y. This involves transforming the data in some tables/data structures from the old structure to the new.

Let's say that you do this and it's all fine, the upgrade goes great. But then you discover there's a problem with the data upgrade.

While you have transformed the data into the new format, it's not been done right. So you actually need a second data change to ensure that your data upgrade is semantically equivalent to the data that went before it, even though the data conforms to your new schema.

Would that not count as a bug in your data?

Any change you make to the schema will be a breaking change.

Not so. Adding a new column is not a breaking change for reads, and may not be a breaking change for writes unless the new column is required, has no default value, and cannot be NULL.
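
To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, the `nickname` column, and the `old_client_compat` helper are all made up for illustration, and it assumes the old client names its columns explicitly rather than relying on `SELECT *`.

```python
import sqlite3


def old_client_compat(new_column_ddl, seed_value):
    """Return (read_ok, write_ok) for a client written against the OLD schema.

    The table is created with the NEW schema: the old columns plus one
    hypothetical extra column described by new_column_ddl.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        + new_column_ddl
        + ")"
    )
    # A row written by an up-to-date client that knows about the new column.
    conn.execute(
        "INSERT INTO users (name, nickname) VALUES ('alice', ?)", (seed_value,)
    )

    # Old-style read: only asks for the columns it has always known about.
    try:
        rows = conn.execute("SELECT id, name FROM users").fetchall()
        read_ok = rows == [(1, "alice")]
    except sqlite3.Error:
        read_ok = False

    # Old-style write: does not supply the new column at all.
    try:
        conn.execute("INSERT INTO users (name) VALUES ('bob')")
        write_ok = True
    except sqlite3.IntegrityError:
        write_ok = False

    conn.close()
    return read_ok, write_ok


nullable = old_client_compat("nickname TEXT", None)
required = old_client_compat("nickname TEXT NOT NULL", "al")
defaulted = old_client_compat("nickname TEXT NOT NULL DEFAULT ''", "al")

print(nullable)   # (True, True)
print(required)   # (True, False)
print(defaulted)  # (True, True)
```

Only the required, no-default, NOT NULL variant breaks the old client's write; the old read works in all three cases.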

That is not really true in practice.

Take, for example, a GetUserStatistics() call that returns a list of user IDs and each user's last login date.

A client might be using this list to get statistics on system usage.

If you change the codebase to add the concept of a test user and add an isTestUser column to GetUserStatistics(), you have broken the contract with your users.

You had an implicit contract based on shared understanding of the data.

Now of course to correctly determine user usage statistics you need to exclude the test users by checking the new column.
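
A rough sketch of what I mean, in Python. The rows, dates, and function names are invented for illustration; only GetUserStatistics() and the isTestUser column come from the example above.

```python
from datetime import date


def get_user_statistics():
    """Hypothetical response AFTER the is_test_user column was added."""
    return [
        {"user_id": 1, "last_login": date(2016, 5, 1), "is_test_user": False},
        {"user_id": 2, "last_login": date(2016, 5, 3), "is_test_user": False},
        {"user_id": 3, "last_login": date(2016, 5, 3), "is_test_user": True},
    ]


def active_users_old_client(rows, since):
    # Written before test users existed: every row is assumed to be a real user.
    return sum(1 for r in rows if r["last_login"] >= since)


def active_users_new_client(rows, since):
    # Aware of the new column: test users are excluded from the count.
    return sum(
        1 for r in rows if r["last_login"] >= since and not r["is_test_user"]
    )


rows = get_user_statistics()
old_count = active_users_old_client(rows, date(2016, 5, 1))
new_count = active_users_new_client(rows, date(2016, 5, 1))

print(old_count)  # 3
print(new_count)  # 2
```

The old client still runs without errors, it just silently reports an inflated number.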

Changing the semantics is entirely beside the point. You can change the semantics of the data without making any schema changes at all!

Isn't it the whole point?

As a consumer it is what I care about.

And I'm not discussing changing the schema, but the data contained within the data structures.

When transforming the schema, you frequently have associated changes that you apply to transform the data from one form (in the 'before' schema) to another (in the 'after' schema). These transformations are code that can have bugs like any other.

In cases like these you can have data that is in the right format, but isn't correct, and can need a second change (to the data only) to correct it.
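
As a toy illustration in Python (the field names, the seconds-to-milliseconds migration, and the specific bug are all invented for the example):

```python
def migrate_v1_to_v2_buggy(row):
    """Schema change: duration_secs (v1) becomes duration_ms (v2)."""
    # Bug: the factor should be 1000, not 100.
    return {"id": row["id"], "duration_ms": row["duration_secs"] * 100}


def conforms_to_v2(row):
    """Structural check only: right keys, right types."""
    return set(row) == {"id", "duration_ms"} and isinstance(row["duration_ms"], int)


def corrective_data_fix(row):
    """Second, data-only change: the schema stays at v2, the values are repaired."""
    return {**row, "duration_ms": row["duration_ms"] * 10}


v1_rows = [{"id": 1, "duration_secs": 3}, {"id": 2, "duration_secs": 45}]

migrated = [migrate_v1_to_v2_buggy(r) for r in v1_rows]
valid_after_migration = all(conforms_to_v2(r) for r in migrated)
correct_after_migration = all(
    new["duration_ms"] == old["duration_secs"] * 1000
    for old, new in zip(v1_rows, migrated)
)

fixed = [corrective_data_fix(r) for r in migrated]
correct_after_fix = all(
    new["duration_ms"] == old["duration_secs"] * 1000
    for old, new in zip(v1_rows, fixed)
)

print(valid_after_migration)    # True
print(correct_after_migration)  # False
print(correct_after_fix)        # True
```

The migrated rows pass every schema check and are still wrong, and the fix touches only the data, not the schema.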

+1 for Snowplow's approach. If you have an app out there in the wild (with autoupgrades off for at least 10% of devices), there isn't such a thing as "we'll do a hard cutoff for our tracking data". Their way, the data is versioned in self-describing contexts so they are separated from JavaScript tracker to database (two different tables), and you can write an ongoing migration between the tables. Very helpful.

Looks quite interesting, though again this is versioning the data schema rather than the data.

I think you have a certain amount of fuzziness around the idea of an "interaction" with the data. It would probably help to think about compatibility and breaking changes in terms of reads vs. writes in order to get the determinism you're looking for and better alignment with SemVer.

That is, if a client using the previous schema can still do reads and writes without the data being invalid, you have forward compatibility, and this qualifies as a PATCH.

If a client using the previous schema can still do reads against the new schema without the data being invalid but not writes, that would qualify as a MINOR change.

(Aside: write-but-not-read compatible changes are possible, but are uncommon in practice)

A change that can prevent a client using the old schema from doing valid reads against the new schema (eg. a column is renamed or removed) would be a MAJOR change.
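
The rule proposed above can be sketched as a tiny Python function. This is just my reading of the three cases, not how SchemaVer itself is defined, and the function name and its two flags are made up for illustration.

```python
def classify_change(old_reads_valid: bool, old_writes_valid: bool) -> str:
    """Classify a schema change from the viewpoint of a client on the old schema.

    old_reads_valid:  the old client can still do valid reads against the new schema
    old_writes_valid: the old client can still do valid writes against the new schema
    """
    if not old_reads_valid:
        # Covers renamed/removed columns, and also the uncommon
        # write-but-not-read case, since reads are broken either way.
        return "MAJOR"
    if not old_writes_valid:
        return "MINOR"
    return "PATCH"


# Example changes (assumed outcomes for an old client, for illustration only).
add_nullable_column = classify_change(old_reads_valid=True, old_writes_valid=True)
add_required_column = classify_change(old_reads_valid=True, old_writes_valid=False)
rename_column = classify_change(old_reads_valid=False, old_writes_valid=False)

print(add_nullable_column)  # PATCH
print(add_required_column)  # MINOR
print(rename_column)        # MAJOR
```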


I agree there is some fuzziness. Long after we wrote this I read:


which has a very succinct explanation of forwards and backwards compatibility as it relates to producers and consumers.

It's high time we did a second draft of SchemaVer which explains it in terms of forwards/backwards compatibility; the actual behavior of it (when to bump etc) would barely change.

I like the idea but I'd suggest not using semver terminology as it is misleading.

Breaking writes would not be considered a minor change in semver.
