
Ask HN: Would you use a “git for data”? - sha1-1b141e
Dear HN:

Let's say there was a thing that gave you the full git workflow (branch, sync, push, pull, merge, revert, etc.) efficiently for large-scale structured data.

Would such a thing be valuable? Would you use it? Would you pay for it?

Asking for a friend.
======
wtracy
What sort of structured data are we talking about?

If you format an XML or JSON file with one field per line (or just use YAML)
git itself should fit the bill perfectly.
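
To make the "one field per line" idea concrete, here is a minimal sketch in Python. The helper name `to_diff_friendly_json` and the sample records are made up for illustration; the point is only that sorted keys plus indentation keep a line-based diff limited to the fields that actually changed.

```python
import difflib
import json


def to_diff_friendly_json(record: dict) -> str:
    """Serialize with sorted keys and one field per line (hypothetical helper)."""
    return json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Two versions of the same record; key order differs, only "price" changed.
before = {"name": "widget", "price": 10, "tags": ["a", "b"]}
after = {"price": 12, "name": "widget", "tags": ["a", "b"]}

# Line-based diff with no context lines, similar in spirit to what git shows.
diff = difflib.unified_diff(
    to_diff_friendly_json(before).splitlines(),
    to_diff_friendly_json(after).splitlines(),
    lineterm="",
    n=0,
)

# Keep only the changed lines, dropping the "---"/"+++" file headers and "@@" hunks.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(changed))
```

This only helps while the file stays small enough for git to handle comfortably; it does nothing for the size limits discussed further down.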

Now, I do see lots of room for a git-like tool targeted at specific existing
binary file formats. Microsoft Office and Photoshop files come to mind off the
top of my head. (I believe that such tools already exist for those particular
formats, but they're expensive and currently have low adoption.)

~~~
sha1-1b141e
Any kind. JSON or XML could be input. Or things like SQL or protobufs.

Git itself doesn't work well for large-scale data
([https://help.github.com/articles/what-is-my-disk-quota/](https://help.github.com/articles/what-is-my-disk-quota/)).
I'm thinking terabyte or petabyte datasets.

------
anton_tarasenko
Relevant: GitHub Large File Storage
[https://git-lfs.github.com/](https://git-lfs.github.com/)

On paying for this: when data operations are built around pipelines, it's
often easier to re-run the pipeline or restore a snapshot. That requires a
good server, but not a service. So before paying, I'd check why the new tool
is better.

~~~
sha1-1b141e
Yeah, except that this is not really built into Git. You can't get detailed
differences or merge differences in these large files. They are basically
opaque blobs as far as Git is concerned.

------
tixocloud
From my experience in enterprise business intelligence, absolutely yes, but
more for the tracking than for the ability to roll back changes. Data
lineage is an important concept and is extremely valuable from a compliance
perspective. The key question is whether your system can integrate with
existing data sources.

You'll also likely be going up against Informatica if you want to play in the
Enterprise space. That said, I'm also interested in a solution like yours as
we're rolling out our own automation system but need to keep track of things
for compliance reasons.

Would be happy to have a chat.

------
tmaly
I would use it for storing and tracking specifications if this is a possible
usage. In regulatory compliance, there is always a need to know what the
specification was at some point in time.

However, any system would have to be available on a private internal network
for most places.

------
daveloyall
'dat jawn: git for tabular data':
[https://github.com/CfABrigadePhiladelphia/jawn](https://github.com/CfABrigadePhiladelphia/jawn)

------
dhogan
What advantage would this have over using something like a log table? Or a few
SQL (or whatever) commands?
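
For reference, a log table can be as small as the sketch below, written here with Python's built-in `sqlite3` module. The table, column, and trigger names (`items`, `items_log`, `items_price_log`) are made up for illustration and the schema is an assumption, not something from an existing system.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (
        id    INTEGER PRIMARY KEY,
        price INTEGER NOT NULL
    );

    -- Hypothetical log table: one row per price change.
    CREATE TABLE items_log (
        log_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        item_id    INTEGER NOT NULL,
        old_price  INTEGER,
        new_price  INTEGER,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Record the old and new value whenever a price is updated.
    CREATE TRIGGER items_price_log AFTER UPDATE OF price ON items
    BEGIN
        INSERT INTO items_log (item_id, old_price, new_price)
        VALUES (OLD.id, OLD.price, NEW.price);
    END;
    """
)

conn.execute("INSERT INTO items (id, price) VALUES (1, 10)")
conn.execute("UPDATE items SET price = 12 WHERE id = 1")
conn.execute("UPDATE items SET price = 15 WHERE id = 1")

# Full change history for the table, oldest first.
history = conn.execute(
    "SELECT item_id, old_price, new_price FROM items_log ORDER BY log_id"
).fetchall()
print(history)
```

This gives a linear history that can be queried or replayed, but it has no notion of branches or merges.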

~~~
zelloworld
Or something like Datomic ([http://www.datomic.com/](http://www.datomic.com/))

------
giuscri
The Dat project?...

