
Dolt is Git for data - timsehn
https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/
======
peteforde
Only 39 days since the last "GitHub for data" was announced:
[https://news.ycombinator.com/item?id=22375774](https://news.ycombinator.com/item?id=22375774)

I'll say what I said in February: I started a company with the same premise 9
years ago, during the prime "big data" hype cycle. We burned through a lot of
investor money only to realize that there was not a market opportunity to
capture. That is, many people thought it was cool - we even did co-sponsored
data contests with The Economist - but at the end of the day, we couldn't find
anyone with an urgent problem that they were willing to pay to solve.

I wish these folks luck! Perhaps things have changed; we were part of a flock
of 5 or 10 similar projects and I'm pretty sure the only one still around
today is Kaggle.

[https://www.youtube.com/watch?v=EWMjQhhxhQ4](https://www.youtube.com/watch?v=EWMjQhhxhQ4)

~~~
philipov
Git succeeded because it was free, and then business models were able to be
built up around the open-source ecosystem after a market evolved naturally.
There is a need, but if you go into it trying to build a business from
scratch, you're going to have a bad time.

~~~
TylerE
Git succeeded because of Linus.

Sure as hell wasn't because of the UX, else Mercurial would have won, or even
DARCS.

99.99999% of projects are not the Linux kernel

~~~
greggman3
Mercurial would not have won. Mercurial has since added features to support a
branching model similar to git's, though they are not the recommended workflow
according to its docs, and the default "as designed" workflow of hg is arguably
inferior to git's (yes, I know that word will get downvoted).

Without git, git's style of branching would likely never have been added to hg,
and even though it's been added now, AFAICT hg people don't use it. No idea
why. Git people get how much freedom git branches give them, freedom that
other VCSs, including hg, don't/didn't.

~~~
koonsolo
Git branching is not intuitive, because branches are not really branches but
pointers/labels. When you talk about the master branch, you are actually
talking about the master pointer.

The other VCSes have an intuitive concept of branches, because they are in
fact branches.

I liked Mercurial more than Git, but when Bitbucket dropped Mercurial I also
switched to Git.
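A toy sketch of that pointer idea (illustrative only, though it is close to what git actually stores under `.git/refs/heads`): a "branch" is nothing more than a name mapped to a commit id in the commit DAG.

```python
# Toy model of git's object graph: commits form a DAG and a "branch"
# is nothing more than a named pointer to one commit.
commits = {}   # commit id -> (message, parent ids)
branches = {}  # branch name -> commit id

def commit(message, branch):
    parent = branches.get(branch)
    cid = f"c{len(commits)}"
    commits[cid] = (message, [parent] if parent else [])
    branches[branch] = cid  # committing just moves the branch label forward
    return cid

commit("initial", "master")
commit("fix bug", "master")
branches["feature"] = branches["master"]  # branching = copying a pointer
commit("try idea", "feature")

# Both branches share the same underlying history; only the labels differ.
print(branches)  # {'master': 'c1', 'feature': 'c2'}
```

Creating a branch is just writing a 40-byte pointer, which is why it is so cheap in git compared with systems where a branch is a copy of the tree.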

~~~
Jestar342
AFAIK (from the rumour mill and not from any kind of reliable source) the `git
branch` command was only added as a cargo cult for all the SVN users flocking
to git and asking "So how do I branch?!". Prior to this, everything was
tags and checkouts.

Again, no verifiable source, just water cooler talk with other devs.

~~~
tosser678
From the first kernel merge (link found on Wikipedia):

[https://marc.info/?l=git&m=111377572329534](https://marc.info/?l=git&m=111377572329534)

I don't know about 'git branch', but it looks like 'git merge' wasn't a thing.

edit: from searching a bit, it appears that it had branches by June of the
launch year; dunno if it had them at release.

~~~
enigmo
The git log is also handy.

first "merge":
[https://git.kernel.org/pub/scm/git/git.git/commit/?id=33deb6...](https://git.kernel.org/pub/scm/git/git.git/commit/?id=33deb63a36f523c513cf29598d9c05fe78a23cac)

first "tag":
[https://git.kernel.org/pub/scm/git/git.git/commit/?id=bf0c6e...](https://git.kernel.org/pub/scm/git/git.git/commit/?id=bf0c6e839c692142784caf07b523cd69442e57a5)

first "branch":
[https://git.kernel.org/pub/scm/git/git.git/commit/?id=74b242...](https://git.kernel.org/pub/scm/git/git.git/commit/?id=74b2428f5573b1f68ce717706296ae7d1832cd65)

first Linus "branch" commit:
[https://git.kernel.org/pub/scm/git/git.git/commit/?id=e69a19...](https://git.kernel.org/pub/scm/git/git.git/commit/?id=e69a19f784b3ff19efc8ab765166e877fffb052e)

------
sytse
Very cool! The world needs better version control for data.

How does this compare to something like Pachyderm?

How does it work under the covers? What is a splice and what does it mean when
it overlaps? [https://github.com/liquidata-
inc/dolt/blob/84d9eded517167eb2...](https://github.com/liquidata-
inc/dolt/blob/84d9eded517167eb2b1f76073df88e85665eec1d/go/store/merge/three_way_list_test.go#L137)

Is it feasible to use Conflict-free Replicated Data Types (CRDT) for this?

~~~
timsehn
Here is an earlier blog we published comparing us to Pachyderm:
[https://www.dolthub.com/blog/2020-03-06-so-you-want-git-
for-...](https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-data/)

We've got a blog on the storage system coming on Wednesday. It's a mashup of a
Merkle DAG and a B-tree called a Prolly Tree. It comes from an open source
package called Noms ([https://github.com/attic-
labs/noms](https://github.com/attic-labs/noms)).
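Not Dolt's actual code, but the content-defined chunking idea at the heart of Prolly trees can be sketched like this: chunk boundaries are chosen by hashing the content itself, so two similar datasets chunk almost identically and their shared chunks can be stored (and diffed) once.

```python
import hashlib

def chunk(items, boundary_mod=4):
    """Split a sequence into chunks at content-defined boundaries.
    A boundary falls after any item whose hash is 0 mod boundary_mod,
    so similar sequences produce mostly identical chunks."""
    chunks, current = [], []
    for item in items:
        current.append(item)
        h = int.from_bytes(hashlib.sha256(repr(item).encode()).digest()[:4], "big")
        if h % boundary_mod == 0:
            chunks.append(tuple(current))
            current = []
    if current:
        chunks.append(tuple(current))
    return chunks

a = chunk(range(100))
b = chunk(list(range(50)) + [999] + list(range(50, 100)))  # one insertion
shared = set(a) & set(b)
# Most chunks are unchanged despite the edit, enabling structural sharing.
print(len(shared), "of", len(a), "chunks shared")
```

A fixed-size chunker would shift every chunk after the insertion point; here only the chunk containing the edit changes, which is what makes versioned storage of large tables cheap.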

I'm not familiar with CRDT. Will read up on that.

~~~
jdoliner
Weighing in as Pachyderm founder.

The post Tim links here is a very apt description of what Pachyderm does.
We're designed for version controlling data pipelines, as well as the data
they input and output. Pachyderm's filesystem, pfs, is the component that's
most similar to dolt. Pfs is a filesystem, rather than a database, so it tends
to be used for bigger data formats like videos, genomics files, sometimes
database dumps. And the main reason people do that is so they can run
pipelines on top of those data files.

Under the hood the data structures are actually very similar, though: we use a
Merkle tree rather than a DAG, but the overall algorithm is much the same.
Dolt, I think, is a great approach to version controlling SQL style data and
access. Noms was a really cool idea that didn't seem to quite find its groove.
Whereas dolt seems to have taken the algorithm and made it into more of a tool
with practical uses.

~~~
visarga
How does Pachyderm deal with GDPR requests? Is it possible to remove a file
not just from the present but also from the history? It would be no use to
delete a file on a GDPR request from the current version while still keeping
it around in past commits.

~~~
jdoliner
Requests to purge data are one aspect of the GDPR that Pachyderm makes
trickier. It makes it easier to remove a piece of data and recompute all of
your models without it, because it can deduplicate the computation. But for
truly purging a piece of data, deduplication becomes a hindrance, because the
data can be referenced by previous commits, and even by other users' data. You
can delete a piece of data and have it not be truly purged.

The best recommendation we have for that is that users' data should be
encrypted with a key that's unique to the user, and when that user asks you to
purge their data you should throw away the key. That means that even if two
users have the same data it will be stored encrypted by different keys, so if
one asks for the data to be purged the other can still keep their data.
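That pattern is often called crypto-shredding. A toy Python illustration (not Pachyderm's implementation, and the XOR keystream here is a stand-in for a real cipher; all names are made up):

```python
import hashlib
import secrets

keys = {}  # user id -> per-user encryption key

def _keystream(key, n):
    # Derive n pseudo-random bytes from the key (a stand-in for a real cipher).
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def store(user, plaintext):
    # Encrypt each user's data under that user's own key.
    key = keys.setdefault(user, secrets.token_bytes(32))
    return bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))

def load(user, ciphertext):
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(keys[user], len(ciphertext))))

def purge(user):
    # Throwing away the key renders every historical copy unreadable,
    # even copies still referenced by old commits or deduplicated blocks.
    del keys[user]

blob_alice = store("alice", b"hello")
blob_bob = store("bob", b"hello")  # same bytes, different key -> different ciphertext
purge("alice")
# blob_alice may linger in old commits, but can no longer be decrypted;
# bob's identical data is unaffected.
print(load("bob", blob_bob))  # b'hello'
```

The point of the per-user key is exactly the deduplication problem above: identical plaintexts encrypt to different blobs, so purging one user never touches another's data.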

~~~
visarga
But then wouldn't the storage and distribution of keys become a similar
problem to the original one? If the keys get distributed, then it's hard to
really remove them.

~~~
jdoliner
Yes, all the keys do is scale the problem down. In general this is a very
tough problem: everything else in the system is designed to avoid data loss,
since that's the biggest, scariest failure case. But then when you want to
lose data, all the measures in the system that prevent data loss prevent that
from happening.

------
samatman
> _Dolt is the only database with branches_

There's also litetree, whose slogan is simply "SQLite with branches":

[https://github.com/aergoio/litetree](https://github.com/aergoio/litetree)

~~~
chrismorgan
Also noms, which I had high hopes for three years ago, but it seems dead these
days (Salesforce acquired their company, and that was that):

[https://github.com/attic-labs/noms](https://github.com/attic-labs/noms)

(Judging by other comments in this thread, Dolt may be a descendant, partially
or completely, of Noms?)

~~~
aboodman
Yep, dolt is a fork of noms:

[https://github.com/liquidata-inc/dolt#credits-and-
license](https://github.com/liquidata-inc/dolt#credits-and-license)

(and sorry :( -- yay open source?)

~~~
lifty
I can't tell you how happy I was when I discovered noms several days ago, and
then how disappointed I was to find that it is not developed anymore. Anyway,
now that it's in the open maybe someone with the technical chops to develop
such a thing will continue development. Here's to hoping! And good luck in
your new endeavour, which also looks like a very cool project.

~~~
aboodman
Thank you! For the record, Replicache uses Noms internally, so I think ...
something ... will still become of Noms. We're just not sure how to move
forward with it at this point.

And I personally find Noms an incredibly satisfying mental model to work with,
so I hope that eventually some others will too.

------
timdorr
Any reason or history behind the name? It means "a stupid person", which seems
like a bad choice IMHO: [https://www.merriam-
webster.com/dictionary/dolt](https://www.merriam-webster.com/dictionary/dolt)

~~~
irrational
Git - (Chiefly British Slang) An unpleasant, contemptible, or frustratingly
obtuse person.

[https://www.ahdictionary.com/word/search.html?q=git](https://www.ahdictionary.com/word/search.html?q=git)

Dolt - A stupid person; a dunce.

[https://www.ahdictionary.com/word/search.html?q=dolt](https://www.ahdictionary.com/word/search.html?q=dolt)

~~~
unixhero
I think both names are apt.

------
flashman
So, we ingest a third-party dataset that changes daily. One of our problems is
that we need to retrospectively measure arbitrary metrics (how many X had
condition Y on days 1 through 180 of the current year?). Imagine the external
data like this:

UUID,CategoryA,CategoryACount,CategoryB,CategoryBCount,BooleanC,BooleanD...etc

When we ingest a new UUID, we add a column "START_DATE" which is the first
date the UUID's metrics were valid. When any of the metric counts changes, we
add "END_DATE" to the row and add a new row for that UUID with an updated
START_DATE.

It works, but it sucks to analyse because you have to partition the database
by the days each row was valid and do your aggregations on those partitions.
And it sucks to get a snapshot of how a dataset looked on a particular day. It
would be _much_ easier if we could just access the daily diffs, which seems
like a task Dolt would accomplish.

I mean it has a better chance of working than getting the third party to
implement versioning on their data feed.
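That validity-interval scheme (a type-2 slowly-changing-dimension pattern) can be sketched in a few lines; the column names and data here are made up for illustration:

```python
import datetime as dt

# Each row carries the interval during which it was valid;
# end_date=None means "still current".
rows = [
    # (uuid, count, start_date, end_date)
    ("u1", 10, dt.date(2020, 1, 1), dt.date(2020, 1, 15)),
    ("u1", 12, dt.date(2020, 1, 15), None),
    ("u2", 7,  dt.date(2020, 1, 3), None),
]

def snapshot(day):
    """Reconstruct how the dataset looked on a given day."""
    return {
        uuid: count
        for uuid, count, start, end in rows
        if start <= day and (end is None or day < end)
    }

print(snapshot(dt.date(2020, 1, 10)))  # {'u1': 10, 'u2': 7}
print(snapshot(dt.date(2020, 1, 20)))  # {'u1': 12, 'u2': 7}
```

Every aggregation has to run through a filter like `snapshot()` first, which is the partitioning pain described above; a store with first-class daily diffs would hand you these snapshots directly.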

~~~
jamesblonde
You can accomplish this using time-travel queries in frameworks like Apache
Hudi and Databricks Delta, which I mentioned in more detail in an earlier
comment. They only work for Spark-based data pipelines, though.

------
ChrisFoster
A year or so I looked into "git for data" for medical research data curation.
At the time I found a couple of promising solutions based on wrapping git and
git annex:

GIN: [https://gin.g-node.org/](https://gin.g-node.org/)

datalad: [https://www.datalad.org/](https://www.datalad.org/)

At the time GIN looked really promising as something potentially simple enough
for end users in the lab but with a lot of power behind it. (Unfortunately we
never got it deployed due to organizational constraints... but that's a
separate story.)

~~~
DocSavage
You might want to look at some of the work we're doing in "git for
connectomics": [http://dvid.io](http://dvid.io)

There will be a new backend released soon that should really improve our
ability to transfer and backup versioned data.

------
teraku
I think they could find funding and use-cases, if they had something like
lincensing and terms of use backed into data to track lineage. E.g. "this
columns contains emails" and is revokable. Or when you publish data, "this
column needs hashing/anonymizing/...". And if you track data across versions
and can version relations, you can create lineage.

Overall, I've seen many of these lately, and I'm waiting for one to really
shine. Not because I think it's a grand problem (I can already version my
DDL/DML/code), but I see some need for it because I have a lot of non-tech
people working with data, throwing it left and right and expecting me to
clean up after them.

------
cjbprime
Comparison to Dat?

[https://docs.dat.foundation/docs/intro](https://docs.dat.foundation/docs/intro)

~~~
timsehn
Dat is more of a distributed data sharing protocol. It does not do
branch/merge which Dolt does.

------
dominotw
> Dolt is the only database with branches.

datomic has branching too afaik.

~~~
benatkin
So does git. I know, I know, git doesn't provide the features you expect from
a full fledged database, but neither does Dolt.

Dolt even has some content in a GitHub wiki, which uses git as a database
backend for a web app: [https://github.com/liquidata-
inc/dolt/wiki](https://github.com/liquidata-inc/dolt/wiki)

------
rburhum
Eh, I worked on a database with branches for 3 years starting in 2002, while I
was at ESRI. It is called a versioned system... Here is how it works, from an
answer I gave several years back on gis.stackexchange:
[https://gis.stackexchange.com/questions/15203/when-
versionin...](https://gis.stackexchange.com/questions/15203/when-versioning-
with-arcsde-can-posted-edits-be-cancelled-or-rejected)

------
rad_gruchalski
Seems like a lot of work went into this and there are very smart people behind
it. However, I can’t help the feeling that this will lead to so many
unintentional data leaks.

Nevertheless, starred. Let's see where it goes.

------
kdamica
It's a cool idea. There's also
[https://quiltdata.com/](https://quiltdata.com/) but I haven't heard anything
about them in a long time.

~~~
timsehn
This blog compares us to Quilt and others:
[https://www.dolthub.com/blog/2020-03-06-so-you-want-git-
for-...](https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-data/)

------
oldgregg
Really interesting. Would be nice to see documentation. All their examples
show modifying the database by running command line SQL queries; does it spin
up a normal MySQL instance or just emulate one? Are hooks available in Go?
Surprised they don't market it as a blockchain database. I'm building a Dapp
right now and this could be really useful.

~~~
timsehn
You can spin up a MySQL compatible server using `dolt sql-server`. It does not
implement everything right now but we'll get there.

------
quickthrower2
I think data (as in raw, collected / measured / surveyed data) doesn't really
change, but you get more of it. Some data may occasionally supersede old data.
Maybe the schema of the data changes, so your first set of data is in one
form, and subsequent data might have more information, or recorded in a
different way.

~~~
jon_richards
One _really_ important feature of time series data is the preservation of what
the dataset looked like at each point in time. Financial data providers will
make a mistake (off by an order of magnitude, missed a stock split, etc.) and
then go back and correct it. This means you end up training models entirely on
corrected data, but trading based on uncorrected data.
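One way to sketch that requirement (hypothetical data; this is the bitemporal "as known at" idea, not any particular vendor's API) is to append every published value with its publication date and query "as known on day D" instead of "latest":

```python
import datetime as dt

# (observation_date, value, published_at) -- corrections are appended,
# never overwritten, so the feed's history is preserved.
publications = [
    (dt.date(2020, 1, 2), 100.0, dt.date(2020, 1, 2)),
    (dt.date(2020, 1, 2), 10.0,  dt.date(2020, 1, 9)),  # provider's correction
]

def as_known_on(obs_date, knowledge_date):
    """Return the value for obs_date as it appeared on knowledge_date."""
    candidates = [
        (published, value)
        for od, value, published in publications
        if od == obs_date and published <= knowledge_date
    ]
    return max(candidates)[1] if candidates else None

# What a live trader saw on Jan 5 vs. what a later backtest sees:
print(as_known_on(dt.date(2020, 1, 2), dt.date(2020, 1, 5)))   # 100.0
print(as_known_on(dt.date(2020, 1, 2), dt.date(2020, 1, 12)))  # 10.0
```

A backtest that replays with the correct `knowledge_date` trains on exactly the (wrong) numbers that were live at the time, avoiding the look-ahead bias described above.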

~~~
cbenz
That's exactly the reason why we created
[https://db.nomics.world](https://db.nomics.world)

------
aabbcc1241
It claims to be the first database that supports versioning. How does it
compare to the revision mechanism in CouchDB?

------
hinkley
Maybe not a killer app, but there are certain kinds of collaborative 'CRUD'
apps that could benefit greatly from having versioning built into the database
as a service.

For instance, how much of a functional wiki could one assemble from off-the-
shelf parts? Edit, display, account management, templating, etc could all be
handled with existing libraries in a wide array of programming languages.

The logic around the edit history is likely to contain the plurality if not
the majority of the custom code.

------
heinrichhartman
Looks like they are a fork of noms ([https://github.com/attic-
labs/noms](https://github.com/attic-labs/noms)). The object store has the
telling name `.dolt/noms`.

Inside are a bunch of binary files. It would be interesting to know more about
the on-disk layout of the stored tables.

I was not able to find any documentation. Does someone know more about this?
Pointers would be appreciated.

------
hypewatch
Does Dolt have any benchmarks against other databases at scale? I would think
that a git-style SQL database would not be very snappy at scale.

~~~
timsehn
We're working on building performance benchmarks right now. We started with
correctness. You can read about our correctness journey here:
[https://www.dolthub.com/blog/2019-12-17-one-nine-of-sql-
corr...](https://www.dolthub.com/blog/2019-12-17-one-nine-of-sql-correctness/)

We think over time (like years) we can achieve read performance parity with
MySQL or PostgreSQL. Architecturally, we will always be slower on write than
other SQL databases, given the versioned storage engine.

Right now, Dolt is built to be used offline for data sharing. And in that use
case, the data and all of its history needs to fit on a single logical storage
system. The biggest Dolt repository we have right now is 300GB. It tickles
some performance bottlenecks.

In the long run, if we get traction we imagine building "big dolt" which is a
distributed version of Dolt, where the network cuts happen at logical points
in the Merkle DAG. Thus, you could run an arbitrarily large storage and
compute cluster to power it.

------
ralfebert
Since Wil Shipley's presentation "Git as a Document Format" (AltConf, 2015,
[1]), the idea of using git to track data has stuck with me.

Cool to see another approach at this.

At first look, I miss the representation of data as plain old text files, but
I guess that's a bit in competition with the goal of getting performance for
larger data sets.

Anyway, I am wondering, did somebody here try using plain git like a database
to track data in a repository?

[1] [https://academy.realm.io/posts/altconf-wil-shipley-git-
docum...](https://academy.realm.io/posts/altconf-wil-shipley-git-document-
format/)

------
ComodoHacker
The idea is good, and the product may be good too (I can't find any
whitepapers or anything else about the underlying technology). But some of
their marketing is suspiciously unprofessional, like "Better Database
Backups". In the DB world, you can't call something a "backup" unless it can
restore all of your DB files bit-for-bit, deterministically. You can call it a
"dump", "export" or whatever, but not a backup.

I don't think they plan to compete in the DB backup storage market. So please
don't mislead your potential customers.

------
databeetle
I use a Python based CMS called CodeRedCMS for my website. They store all
their content in a file called db.sqlite3. I use PythonAnywhere for hosting
the site and they read the website-files from GitHub. So whenever I update my
site (including the blog), I just push the latest version of the db.sqlite3
file to GitHub and pull it into PythonAnywhere.

So, as I understand it, as long as the DB can be converted into files, it will
work like anything else on Git and GitHub. What am I missing?
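One thing this misses is that git treats a binary db.sqlite3 as an opaque blob: diffs and merges between versions are meaningless, and history grows quickly. A common workaround, sketched here with Python's stdlib `sqlite3`, is to commit a line-oriented SQL dump instead of the binary file:

```python
import sqlite3

# Build a tiny example database in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO posts (title) VALUES ('hello world')")
conn.commit()

# iterdump() emits the database as SQL statements: line-oriented text
# that git can diff (and often merge), unlike the raw binary .sqlite3 file.
dump = "\n".join(conn.iterdump())
print(dump)
```

Committing the dump alongside (or instead of) the binary file gets you readable `git diff` output per row, which is roughly the gap tools like Dolt aim to close natively.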

------
BiteCode_dev
Dolt is not Git for data.

Git takes existing files and allows you to version them.

Git for data would take existing tables or rows and allow you to version them:
a uniform, drop-in, open source way to have a history of rows, merge them,
restore them, etc., that works for Postgres, MySQL or Oracle in the same way,
and is compatible with migrations.

You can already have a history if you use Bigtable or CouchDB; no need for
Dolt if it's about using one specific product.

~~~
tobr
You can nitpick any comparison like that by pointing out the different ways
the metaphor breaks.

~~~
BiteCode_dev
It's not a metaphor. It's a sales pitch.

------
stared
Whether it works or not, I find the introduction confusing.

Compare and contrast it with the clarity of these introductions:

\- [https://git-lfs.github.com/](https://git-lfs.github.com/) (Git Large File
Storage)

\- [http://paulfitz.github.io/daff/](http://paulfitz.github.io/daff/) ("data
diff for tables")

------
apichat
It looks like Daff (align and compare tables)

[https://github.com/paulfitz/daff](https://github.com/paulfitz/daff)

and Coopy (distributed spreadsheets with intelligent merges)

[https://github.com/paulfitz/coopy](https://github.com/paulfitz/coopy)

------
aantix
Slightly related - how do ML teams track new input data and ensure that the
data hasn't introduced a regression?

I would assume there's an automated test suite, but also some way of diffing
large amounts of input data and visualizing those input additions relative to
model classifications?

What are the common tools for this?

~~~
visarga
You generally can't attribute the accuracy of an ML system to each individual
piece of data in the training set. Each batch of examples slightly changes the
model, making the updates interact and combine during the training process,
so it becomes extremely difficult to assign the contribution of individual
examples. Of course you could retrain the model leaving one example out, but
that would be exceedingly slow and the result would be inconclusive from a
single run because the stochastic noise of the training process is larger than
the effect of removing or adding one example.

Related areas are confidence calibration, active learning and hard example
detection during training. Another approach is to synthesise a new, much
smaller dataset that would train a neural net to the same accuracy of the
original larger dataset.

------
Noumenon72
Great start page. Very persuasive writing, tells me what the project will do
for me and not just what it is.

------
IanCal
Looks interesting, depending on performance this could neatly cover a few use-
cases I have at the moment without needing to build as much myself. At least
dolt on its own, whether we would need the hub is another matter but I guess
it depends on uptake.

------
pedro1976
Recently I was working with some open data and was in need of a tool that
transforms those CSVs/JSONs into something standardized that I can run queries
against and patch the data. Maybe this is a use case for dolt.

------
nerdponx
How does Dolt compare to DVC?

~~~
timsehn
We compared DVC and a bunch of other "Git for data" tools here:
[https://www.dolthub.com/blog/2020-03-06-so-you-want-git-
for-...](https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-data/)

------
brachi
> With Dolt, you can view a human-readable diff of the data you received last
> time versus the data you received this time.

How is this accomplished if the data is binary?

Also, how does this compare to git lfs?

~~~
timsehn
No data diffs if the data is binary. But diffs are cell-wise, like so:
[https://www.dolthub.com/repositories/Liquidata/corona-
virus/...](https://www.dolthub.com/repositories/Liquidata/corona-
virus/compare/qebrs5sdd9p9qjh8p2sbmf7lbnb81afh)

git-lfs lets you store large files on GitHub. With Dolt, we offer a similar
utility called git-dolt. Both of these allow you to store large things on
GitHub as a reference to another storage system, not the object itself.

------
jnbiche
Is there a way to page SQL results? Also, it would be awesome if I could use
rlwrap with `dolt sql`, so I can use the shortcuts I'm used to in a REPL
environment.

~~~
zachmu
Yeah, the SQL shell needs some work, including its readline implementation
which is kind of broken. Filed an issue for this:

[https://github.com/liquidata-
inc/dolt/issues/505](https://github.com/liquidata-inc/dolt/issues/505)

~~~
jnbiche
Awesome. Yeah, my question wasn't clear, but I meant paging in the shell, as
you correctly assumed. In a pinch, I can page an inline `dolt sql -q` query in
the OS shell. But it would be ideal to be able to page results in the dolt
shell, as we can in most SQL database shells.

BTW, I should have written it above, but dolthub/dolt is quite impressive. I
hope you all make it, because it's a great product that I would love to use at
work if I eventually shift back over to a data science position (right now,
working as a software dev).

------
zby
Non-binary data can be saved as text - for example, you can have an SQL
database dump. You can put that text into git. What does this solution add to
that simple idea?

------
honksillet
Pardon my ignorance, but is data copyrightable? Or can it be owned? Obviously
someone can get into trouble uploading proprietary code to GitHub. Is there
proprietary data?

~~~
timsehn
We wrote a couple of blog posts on data licensing. We're not lawyers, but we
did a bit of research:

[https://www.dolthub.com/blog/2020-02-24-data-
licenses/](https://www.dolthub.com/blog/2020-02-24-data-licenses/)
[https://www.dolthub.com/blog/2020-02-26-copyrightable-
materi...](https://www.dolthub.com/blog/2020-02-26-copyrightable-material/)

------
perfect_wave
Can you give some more information about what you're doing with your cloud
infrastructure? Would be intrigued to hear about what you're running.

~~~
timsehn
We use AWS but will switch over to Google or multi-cloud when we exhaust our
credits.

The system is still pretty simple. The main cost is the storage for the blobs
in the Dolt repos pushed to DoltHub. We use S3 for that. There is an API that
receives pushes and writes any other metadata (user, permissions, etc) into an
RDS instance that stores metadata for DoltHub. That instance is also used to
cache some critical things. Then it's just a set of web servers and a GraphQL
layer sitting on top serving our React app.

------
fmajid
Delphix has for years provided branching and test-database functionality for
real databases people actually use, like Oracle.

------
chrisweekly
We all know naming things is hard, but "dolt" \-- as in "idiot" or "imbecile"
\-- is a head-scratcher.

------
kthejoker2
Love that the name "rhymes" with Git (both are insults). Potentially a good
fit for MLOps, to version your training splits.

------
fleetside72
As far as I can tell, the only way to use this is to push everything into a
MySQL instance. Def some pros and cons there.

~~~
timsehn
You can export to whatever you like. In our imagination, data versioning would
sit upstream of production just like source code versioning. You take the data
out and do what you need with it in a "compile" step.

------
russfink
I can't tell from the font - is it DOLT - delta Oscar lima tango - or DOIT
delta Oscar India tango?

~~~
enriquto
delta zero one tango

------
senorsmile
This solves an immediate need that I was considering noms for. Thank you!

------
hypewatch
> As far as we can tell, Dolt is the only database with branches

What about pachyderm?

~~~
timsehn
See the comment down below from a cofounder of Pachyderm :-)

------
tgb
An example use case that "git for data" seems to break: storing data for
medical research where the participants are allowed to withdraw from the study
after the fact. Then their data must be deleted retroactively, not just in the
head node. I don't know of a good methodology for dealing with this at all as
it breaks backups, for example.

The problem extends beyond medical research due to privacy laws like the GDPR.
A participant or user must be able to delete their data not merely hide it so
as to protect themselves from data breaches. Suggestions welcome.

~~~
kspacewalk2
In principle, you should be able to 'rewrite history' in the same way you can
already do with git. It is clunky to remove a file from all versions using git
itself but easy using tools like bfg[0].

[0] [https://rtyley.github.io/bfg-repo-cleaner](https://rtyley.github.io/bfg-
repo-cleaner)

------
aerovistae
Really bad name lol. A dolt is an idiot.

~~~
mst
And a git is an asshole.

That's the joke.

------
amolo
I'm curious. So what is Kaggle?

------
olliej
Wow, this really emphasizes that the rationale for choosing “Ok” rather than
“Do It” for buttons was correct.

:-/

------
danzig13
Can I get a git for excel?

------
gitgud
Git tracks changes in logic (software).

Tracking changes in Data is simply called a _database_...

------
fiatjaf
What happened to Dat?

------
matthewbauer
I thought Git was Git for data.

~~~
AlexCoventry
Spotted the lisp fan.

~~~
fb03
Spotted the lisp fan.

