
How to avoid fearing data migration? - notheguyouthink
So I've been spinning my wheels on multiple side projects lately due to one main concern: data/schema migration.

My projects are often applications to store data about my life, anything from wiki pages to file storage; it's just side projects that tend to interest me. However, lately I've been largely obsessing over choosing "perfect" schemas in projects whose data is more than just inside of a DB.

If it were in a DB (MySQL/etc), migration wouldn't be a concern; it's well understood. However, I'm often writing my own storage mechanisms, such as storing images on disk or as chunked bytes, whatever. So if I realize I'm missing something, say crypto-signing the blobs, I fear needing to change them after I've uploaded a TB worth of data.

So I don't even know what to ask, but I figure this has to be a more common problem than I realize. Are there articles that talk about managing your own data and ensuring that it always has a migration path?

Hopefully my problem is clear, but I'm sorry for not being able to define a better question in this context. Any tips for similar projects would be appreciated!
======
maksut
Maybe you can use one of the data interchange protocols that has a story for
backward/forward compatibility? Something like Apache Avro or Protocol Buffers
should allow you to work with different versions of your data at the same
time.

See: [http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html](http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html)

His book "Designing Data-Intensive Applications" has a section on this.

------
cimmanom
If you've scripted the data migration, what besides wait time is the
difference between migrating a KB of data and migrating a TB of data?

~~~
notheguyouthink
Data churn, I think. But it's a good question; I've debated the same.

My current thought on migration is that I'm going to write what is effectively
a database _(again, it's storing files locally and so on)_. Then, if I need to
make a schema change, I'll literally copy the db code to a v2 db. The v1 will
have vendored data types, so nothing outside of v1 can "break" the code. v1
will also incur no further dev time because it's entirely isolated.

So "migration" basically pulls the entire database from v1 -> v2, storing it
however v2 is written to.

What I don't like with this model is, as you mentioned, that it incurs a lot
of cost in storage and bandwidth, and who knows whether a large migration will
be needed across many versions (v1 -> v2 -> v3 -> v4, etc.). Taking a TB
library from v1 to disk to v2 to v3 and so forth seems costly. Doing it purely
in memory would be optimal, but then I'm likely writing lots of code
specifically tailored for in-memory migration, whereas I'm hoping to basically
pipe v1:read -> v2:write as the "migrations".

On the plus side, though, this is mainly a development concern, and that's
likely to only affect me. I guess I'm just obsessing, and I really don't like
that there doesn't seem to be a clear solution here. I've not encountered this
professionally.

Strangely, I've had the _potential_ to run into this professionally, but we've
always dealt with it manually. I.e., if we change how we store files on S3,
it's usually a one-off fix. Here I'm trying to avoid managing my data like
that; I just want to write features and have "rock solid" storage, which means
no one-off scripts, and instead proven migrations.

Anyway, this is just a bunch of rambling. You can't help me with this, I'm
sure, lol, but if you've dealt with filesystem migrations professionally I'd
be really curious to see what you've done. All I've ever seen for migrations
is traditional SQL migrations.

