
Ask HN: How do you version control your event stream schema? - stefaniabje
Anyone here built or used a data pipeline that validates event structures or pipes data to multiple destinations?

- Did you use JSON, Thrift, Protobuf, Avro or something else to define the schema for the events in the stream?

- Did you version the schema?

- Did you version each event or object in the schema?

- What type of versioning did you use? (E.g. semantic, incremental counters, git hashes, another type of hash, etc.)

- Any other cool things you'd like to share?
======
raidan
I'm in the same boat - looking to consolidate a few smaller data pipelines
that have organically grown over the years into a single pipeline, and taking
the opportunity to get the org to rethink how we emit events in various parts
of the stack. As teams operate fairly independently, we need something that
allows a flexible data structure while still letting us join data on common,
well-defined fields.

The path we're currently going down is a 'nested' JSON structure that lets
sub-processes in the various systems inherit values from the parent.
Something like:

{ "type": "schemaX", "version": 1, "payload": { "k1": "v1", "type": "schemaY",
"version": 3, "payload": { "kk1": "v1" } } }

The structure itself will be documented using JSON Schema, and hopefully we
will be able to verify the validity of events as they are processed, though
this might be too expensive in high-volume scenarios.
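
A rough sketch of what that per-level validation could look like (the schema
here is illustrative; in practice we'd resolve it from a schema store keyed
on the type/version pair):

    import jsonschema

    # Illustrative schema for one level of the nested envelope; in
    # practice this would come from a schema store keyed on type/version.
    ENVELOPE_SCHEMA = {
        "type": "object",
        "required": ["type", "version", "payload"],
        "properties": {
            "type": {"type": "string"},
            "version": {"type": "integer"},
            "payload": {"type": "object"},
        },
    }

    def validate_event(event):
        """Validate one envelope level, then recurse into nested payloads."""
        try:
            jsonschema.validate(instance=event, schema=ENVELOPE_SCHEMA)
        except jsonschema.ValidationError:
            return False
        payload = event["payload"]
        # A nested sub-event is itself a type/version/payload envelope.
        if "type" in payload and "payload" in payload:
            return validate_event(payload)
        return True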

The goal is then to build a low-latency router that takes in routing policies
to forward subsets of events to further data pipelines. The policies
themselves will be defined using some kind of DSL (possibly JSONPath?).
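
As an illustration of the routing idea, here's roughly how policies could
match on JSONPath expressions (using the jsonpath-ng library; the destination
names and paths are made up):

    from jsonpath_ng import parse

    # Hypothetical routing policies: forward an event to a destination
    # whenever the JSONPath expression matches at least one value in it.
    POLICIES = [
        ("metrics-pipeline", parse("$.payload.k1")),
        ("audit-pipeline", parse("$.payload.payload.kk1")),
    ]

    def route(event):
        """Return the list of destinations this event should go to."""
        return [dest for dest, expr in POLICIES if expr.find(event)]

    event = {
        "type": "schemaX", "version": 1,
        "payload": {"k1": "v1", "type": "schemaY", "version": 3,
                    "payload": {"kk1": "v1"}},
    }
    print(route(event))  # ['metrics-pipeline', 'audit-pipeline']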

As a whole, this seems to be a fairly common problem[0] that other companies
are trying to solve, but many of the low-level details aren't discussed
publicly. One component that does seem common to services like this is
Apache Flink.[1]

[0] Netflix Keystone - https://www.youtube.com/watch?v=sPB8w-YXX1s

[1] https://flink.apache.org/

~~~
stefaniabje
Thanks for sharing! So it seems you may end up using an incremental counter as
a version – or was that a placeholder example?

What are the main use cases for the data pipelines you're forwarding your data
into? Are the events fairly rich or fairly simple? Do you plan on doing any
type validation as well?

------
mrburton
We're using Avro + Kafka's Schema Registry. We try to only append fields to
our messages to keep backwards compatibility.

That being said, there are things we intentionally do to ensure we can
support this, e.g. allowing for nullable properties.
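
In Avro terms, that means new fields are added as nullable unions with a
default, which is what keeps old data readable under the new schema. A rough
sketch with fastavro (illustrative schemas, not our real ones):

    from io import BytesIO
    from fastavro import parse_schema, schemaless_writer, schemaless_reader

    SCHEMA_V1 = parse_schema({
        "type": "record", "name": "UserEvent",
        "fields": [{"name": "user_id", "type": "string"}],
    })
    SCHEMA_V2 = parse_schema({
        "type": "record", "name": "UserEvent",
        "fields": [
            {"name": "user_id", "type": "string"},
            # New field: nullable union with a null default keeps the
            # change backwards compatible.
            {"name": "referrer", "type": ["null", "string"], "default": None},
        ],
    })

    # A consumer on v2 can still read v1-encoded messages by passing
    # both the writer's and the reader's schema.
    buf = BytesIO()
    schemaless_writer(buf, SCHEMA_V1, {"user_id": "42"})
    buf.seek(0)
    print(schemaless_reader(buf, SCHEMA_V1, SCHEMA_V2))
    # {'user_id': '42', 'referrer': None}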

We pass the message type, version, and additional headers alongside each
message. It's fairly common to create an envelope around the actual message,
and I'd suggest considering this for a few reasons. Skipping the obvious
benefits, one powerful benefit is being able to add metadata to messages
without having to alter the actual event, e.g. security tokens, time to
live, and more.

This is probably fairly commonly known too, but I would also highly
recommend adding request correlation information.
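
A rough sketch of what such an envelope might look like (field names are
illustrative):

    import time
    import uuid

    def wrap(event, event_type, version, correlation_id=None):
        """Wrap an event in an envelope; field names are illustrative."""
        return {
            "type": event_type,
            "version": version,
            "headers": {
                # Correlation info lets you trace one request across services.
                "correlation_id": correlation_id or str(uuid.uuid4()),
                "produced_at": int(time.time() * 1000),
                # Metadata like TTLs or security tokens lives here, so it
                # can change without ever touching the event body.
                "ttl_ms": 86_400_000,
            },
            "payload": event,
        }

    print(wrap({"user_id": "42"}, "UserEvent", 2))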

~~~
stefaniabje
Thanks for sharing.

> We try to only append fields to our messages to keep backwards compatibility.

Do you additionally do versioning, or do you solve it by only allowing
backwards compatible changes such as appending to the message?

~~~
mrburton
We do - that's how the Schema Registry knows which schema to apply to the
Avro message. If you like, you can ping me at my username + gmail and I can
answer any other questions about what you're doing.
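
For context on how that lookup works: Confluent's wire format prefixes each
Avro payload with a magic byte and a 4-byte schema ID, which the consumer
uses to fetch the right schema from the registry. A minimal sketch of
reading that header:

    import struct

    def schema_id(message):
        """Extract the schema ID from a Confluent-framed Kafka message.

        Wire format: 1 magic byte (0x00) + 4-byte big-endian schema ID,
        followed by the Avro-encoded payload.
        """
        magic, sid = struct.unpack(">bI", message[:5])
        if magic != 0:
            raise ValueError("not a Confluent-framed message")
        return sid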

