
Show HN: Splitgraph - Build and share data with Postgres, inspired by Docker/Git - mildbyte
http://www.splitgraph.com
======
chatmasta
Hey HN!

I’m Miles, co-founder of Splitgraph along with Artjoms (mildbyte). We met back
in 2018, when I reached out to him after reading his blog on HN and realizing
we lived right next to each other. Neither of us had a "real job," and we both
wanted to build something truly innovative and cool. We tossed around a few
ideas, but ultimately we couldn’t resist the idea of building “GitHub for
data,” which seemed like an obvious gap in the market. After nearly two years
of development, we are finally ready — and extremely excited — to share it
with the world.

We are not the first to notice this gap or to try to build this product, so we
wanted to make sure we did it right. We started from "first principles" and
really analyzed the problem space. We ended up realizing that
it’s not strictly Git or GitHub that people want “for data.” Rather, people
just want to be able to work with data as easily as they can work with code.
They want to experiment, build and maintain data without needless overhead.

Tools like Git and Docker are ubiquitous in any software engineer’s workflow,
and we took a lot of inspiration from them when designing Splitgraph. We
thought about _why_ people like and use these tools, and tried to translate
their benefits to the domain of data science. Our core philosophy is to stay
out of the way, and work with existing abstractions instead of introducing new
ones. You can version your code with Git without switching filesystems. You
can build Docker images without changing your code to work in Docker. Our goal
with Splitgraph is to provide an easy path to incremental adoption, so you can
introduce it into your existing workflows where and when it makes sense.

Splitgraph is powered by Postgres, and provides an easy way to build and share
versioned datasets, along with a whole bunch of other benefits. We encourage
you to read the landing page, which (hopefully) explains it well. The
documentation goes into much more detail, and if you have ten minutes and
Docker installed, you can try Splitgraph for yourself. [0] If you work with
data, we really hope you’ll give Splitgraph a try.

We’re here to answer any questions, and we’ve also created a Discord server
[1] to hopefully build a bit of a community around Splitgraph.

[0] https://www.splitgraph.com/docs/getting-started/five-minute-demo

[1] https://discord.gg/eFEFRKm

------
ahnick
Personally I think I'm more drawn to the dotmesh approach
(https://docs.dotmesh.com/concepts/architecture/), but the one problem data
has is that as it gets massive it becomes really hard to move around, and I
guess that's where trying to layer Git-like workflows on top of it becomes
intractable. It's like data has its own gravity, and often it is just easier
to bring other things to the data rather than the other way around. IIRC Bryan
Cantrill said something similar about data when Joyent was developing their
object storage system Manta
(https://www.youtube.com/watch?v=79fvDDPaIoY); ergo, perhaps the Splitgraph
approach will meet with better success.

------
ishcheklein
One of the DVC maintainers here :) Congrats! It's great to see more tools for
codifying data in different scenarios.

To be honest, since you introduce a new workflow and a few new concepts, it's
not that easy to get the right perspective in 5 minutes (I know the same
problem exists with DVC and we've been iterating on docs a lot). Mind a few
questions?

Do I understand it right that it is mostly focused on tabular data? Kinda git
checkout for an SQL table?

~~~
chatmasta
Hey! Thanks for the kind words. Yeah, it's a small space and we all seem to be
aware of each other. In fact, I don't know if you saw it, but we do briefly
mention dvc in the FAQ [0].

> Kinda git checkout for an SQL table

This is basically right, yes. We implement some Git- and Docker-like
operations on top of the SQL standard. For example, we borrow the idea of
"delta compression" (storing only changes) from Git. Like Git, we store
commits as a set of objects representing the changes since the last commit.
But whereas Git versions files with lines as the unit of change, Splitgraph
versions tables with rows as the unit of change. Our "objects" are actually
cstore files that represent fragments of a table, with content-addressable
hashes generated with LTHash [1]. If you want to read about this in more
detail, have a look at the documentation for "objects." [2]

[0] https://www.splitgraph.com/docs/getting-started/frequently-asked-questions#dvc-datalad-

[1] LTHash has the useful property that the sum of the hashes of the
individual fragments composing a table is equal to the content hash of the
whole table. This is how we are able to have content-addressable objects,
which unlocks a lot of efficiency tricks and allows techniques like "layered
querying".

[2] https://www.splitgraph.com/docs/concepts/objects
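
To make the additive-hash property described in [1] a bit more concrete, here
is a toy sketch of a hash where the fragment hashes sum to the whole-table
hash. This is not Splitgraph's actual implementation: blake2b and the lane
count below are stand-ins chosen purely for illustration.

```python
# Toy sketch of an additive ("homomorphic") hash in the spirit of LTHash.
# NOT Splitgraph's implementation: blake2b and the lane count are stand-ins.
import hashlib
import struct

LANES = 1024  # number of 16-bit lanes in the digest (assumed for this sketch)

def row_hash(row):
    """Expand one row (as bytes) into LANES 16-bit lanes."""
    lanes = []
    counter = 0
    while len(lanes) < LANES:
        block = hashlib.blake2b(row + counter.to_bytes(4, "big")).digest()
        lanes.extend(struct.unpack(">%dH" % (len(block) // 2), block))
        counter += 1
    return lanes[:LANES]

def combine(a, b):
    """Lane-wise addition mod 2^16: associative and order-independent."""
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]

def fragment_hash(rows):
    """Hash of a table fragment = sum of the hashes of its rows."""
    acc = [0] * LANES
    for row in rows:
        acc = combine(acc, row_hash(row))
    return acc

# The hash of the whole table equals the sum of its fragments' hashes,
# no matter how the rows are split into fragments:
frag_a = [b"row-1", b"row-2"]
frag_b = [b"row-3"]
assert combine(fragment_hash(frag_a), fragment_hash(frag_b)) == \
       fragment_hash(frag_a + frag_b)
```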

~~~
ishcheklein
Thanks! Still trying to wrap my mind around the use cases :) What do you see
as the main adoption path for the tool?

People who use flat files that are large enough and/or need some provenance?
Or people who already have Postgres and want to version it? Or is it about a
production use case (like Docker) - making a snapshot to deliver it
consistently?

Btw, regarding the FAQ - thanks! From my take on Splitgraph, though, one of
DVC's main differences is that it deals with flat files. From a software
engineering perspective, it is very close to Git LFS on steroids, plus some
higher-level features similar to makefiles, ML metrics, etc.

~~~
chatmasta
> What do you see as the main adoption path for the tool?

We hope that anyone who works with data on a daily basis can benefit from
Splitgraph. When we first started, we gave a presentation at a Docker meetup
called "Docker for Data" [0] with the idea that we wanted to do for data
scientists what Docker did for DevOps. We hope that people will be able to
throw out some of their fragile ETL scripts in favor of using Splitfiles, just
like DevOps engineers could throw out their Salt and Chef scripts when they
switched to Docker.

> Or people who already have Postgres and want to version it?

It's important to note that in many cases, Postgres is really just an
intermediary. Splitgraph allows you to "mount" data from any source, not just
Postgres databases, by leveraging Postgres foreign data wrappers (FDWs) [1].
This idea of "mounting" is one of the core abstractions of Splitgraph that
makes it really powerful, because it lets you use a common format (Splitfiles)
to transform and query data from anywhere. And once you've built an image from
a bunch of disparate data sources, you can use a core set of your favorite
tools to query it, since as far as they're concerned, it's just a Postgres
schema. So in this sense, Splitgraph can serve as a sort of universal
translation layer for data from all over the place.

For a really powerful demonstration of this idea, see the example where we
mount two tables from two separate data portals (Chicago and Cambridge) and
join between them. [2]
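
To give a flavour of what this looks like from the client side, here is a
rough sketch: once two sources are mounted as schemas on the engine, any
ordinary Postgres client can join across them. The connection parameters and
the schema/table/column names below are hypothetical, not the actual
Chicago/Cambridge example from the docs.

```python
# Rough sketch: joining across two mounted data sources with a plain Postgres
# client. All names and connection parameters here are hypothetical.
import psycopg2

# The local Splitgraph engine is just a Postgres instance.
conn = psycopg2.connect("host=localhost port=5432 dbname=splitgraph user=sgr")

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT c.neighborhood, count(*) AS incidents
        FROM chicago.incidents AS c      -- one mounted data source
        JOIN cambridge.locations AS m    -- another mounted data source
          ON c.location_code = m.location_code
        GROUP BY c.neighborhood
        ORDER BY incidents DESC
        LIMIT 10
    """)
    for neighborhood, incidents in cur.fetchall():
        print(neighborhood, incidents)
```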

[0] We gave two presentations in 2018, both similar. A lot of details have
changed since then, but the core abstractions are the same, and they might
give some insight into our direction:

[0.a] https://www.slideshare.net/splitgraph/splitgraph-docker-for-data-119112722

[0.b] https://www.slideshare.net/splitgraph/splitgraph-ahl-talk

[1] https://www.splitgraph.com/docs/ingesting-data/foreign-data-wrappers/introduction

[2] https://www.splitgraph.com/docs/ingesting-data/socrata#using-metabase-to-join-and-plot-data-from-multiple-data-portals

~~~
ishcheklein
Okay, thanks! I think I got the idea. Have you seen Quilt btw? They had the
same message initially - Docker for Data, packaging data, etc. The
implementation was very different, though.

I somewhat dislike this analogy, btw; note how you mentioned that Docker
changed life for DevOps in the first place (vs. engineers). The same applies
here - data scientists don't care about packaging data, so there needs to be a
strong incentive to do so.

A few specific questions:

1. Where does query execution happen - always on the client, or on the remote
as well?

2. In a global "GitHub for data" case, is there some discovery mechanism for
existing data?

3. Do you provide public storage to cover the "GitHub for data" case? Or is it
more like torrents - peers host and pay for data storage?

Btw, what is the major direction for you - GitHub (public collaboration and
sharing) or internal versioned warehouses (or some other internal case)?

~~~
chatmasta
Excellent questions, thank you.

> Have you seen Quilt btw?

Yes, we have. It seems in this space that everything has been done or pitched
before, but in our opinion nobody has hit the exact right execution yet. A big
problem with a lot of existing tools is that they disrupt your workflow, or
are otherwise hard to adopt without major adjacent changes. Our core
philosophy with Splitgraph is to stay out of the way. As long as we can keep
this up, and as long as we can continue building on a core set of simple
abstractions, we think we stand a pretty good chance.

> data scientists don't care about packaging data

Indeed. It's worth noting that packaging data with Splitfiles is entirely
optional. You can also run ad-hoc queries against a database with change-
tracking enabled (meaning, Splitgraph audit triggers are installed), and
periodically commit or checkout different versions as you see fit. This
workflow would be more similar to a Git workflow. But we encourage the use of
Splitfiles because of the advantages they add, namely reproducibility due to
provenance. It's sort of like how you can build a Docker image by running
arbitrary commands in a container and then `docker commit`. The problem with
that workflow is that you lose all the benefits of Dockerfiles. The same logic
applies to `sgr commit` and Splitfiles. Our bet is that data scientists will
find Splitfiles to be the path of least resistance to accomplishing their
goals.

> Where does query execution happen - always on the client, or on the remote
> as well?

At the moment, most of it happens on the client. But in Splitgraph Cloud, we
do have the capability to execute queries on the remote. In a public setting,
it's obviously more desirable to push down query execution to the client (or,
if it's done remotely, to charge them for it). But in a corporate setting, you
could imagine a shared remote cluster that executes queries on behalf of thin
clients. So, it's possible to support both, but at the moment we're focused on
the client.

> In a global "GitHub for data" case, is there some discovery mechanism for
> existing data?

Splitgraph Cloud includes discovery mechanisms such as search and topics.
We'll be adding a lot more features around this. We intend for the "data
catalog" to be a core part of our offering.

> Do you provide public storage to cover the "GitHub for data" case? Or is it
> more like torrents - peers host and pay for data storage?

At the moment, for simplicity and while we're in beta, Splitgraph Cloud is
providing storage at our discretion. However, Splitgraph is designed so that
data storage is decoupled from metadata storage. You can configure `sgr` to
upload objects to any S3-compatible store. Currently it's configured to upload
to object storage at Splitgraph Cloud, but there is no reason we could not
introduce some kind of federation protocol where users can upload to
independent silos of S3-compatible storage. But this raises a lot of questions
about reliability and responsibility, so we have not fully explored it yet. In
the near term, the easier solution will probably be charging clients for
storage at Splitgraph Cloud. But, federation is something that is technically
possible and at least academically interesting.

Also, note that Splitgraph Cloud does not host all the data it includes in its
index. For example, the 40,000+ datasets currently in the Splitgraph index are
not hosted by Splitgraph [0], but we index them and provide value-added
services like a REST API that does some remote execution of queries on your
behalf. Currently these use the Socrata mount point, but you could imagine a
situation in a corporate environment where the catalog might index lots of
databases that are not Splitgraph images, but can be mounted with an FDW in
the same way.

> What is the major direction for you - GitHub (public collaboration and
> sharing) or internal versioned warehouses (or some other internal case)?

Most likely, both. We will probably follow the GitHub model of offering a
public and on-premise version of the same product. In an ideal world,
companies or universities might pay to license an on-premise version of
Splitgraph Cloud that includes all the same features as the public version.
We've done a lot of work on our backend to make deployments like this
possible, so it's an appealing direction for us.

[0] https://www.splitgraph.com/docs/splitgraph-cloud/external-repositories

------
philips
This is so cool!

I have been looking around for databases that have any sort of cryptographic
digest of data to ensure integrity. And this is the first time I have seen
something do that.

Could the snapshots and content addressability be used for regular backups of
application databases?

~~~
philips
Via their (very good) FAQ:

> Writing to PostgreSQL tables that are change-tracked by Splitgraph is almost
> 2x slower than writing to untracked tables (Splitgraph uses audit triggers to
> record changes rather than diffing the table at commit time).

------
zmmmmm
I'm probably a bit naive about this, but could it make it unnecessary to
explicitly create database dumps as backups in scenarios where you need a
rollback? I.e., could I just tag the database and be guaranteed that I could
later get that data back if, for example, my upgrade failed and I wanted to
restore, simply by checking out the tag?

~~~
mildbyte
That does sound very ambitious for now! I discussed below that we're focused
on the OLAP use case (managing the actual data, not the DDL around it), so
triggers, indexes and functions that you create won't be stored in the
Splitgraph image you make (a Splitgraph image is not a full database dump).

For things like schema migrations, PostgreSQL itself has transactional DDL:
column deletions/additions can be wrapped in a transaction, so you can
ROLLBACK if your migration fails. This might be more appropriate for your use
case?
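
As a minimal sketch of what that looks like from a client (this is plain
Postgres behaviour, nothing Splitgraph-specific; the connection string and the
table/column names are made up for illustration):

```python
# Minimal sketch of PostgreSQL's transactional DDL via psycopg2.
# Names and connection details are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # assumed connection string
try:
    with conn.cursor() as cur:
        # DDL joins the transaction like any other statement in Postgres.
        cur.execute("ALTER TABLE users ADD COLUMN email text")
        cur.execute("UPDATE users SET email = ''")  # backfill step
        # ... further migration steps; any exception lands in the except block
    conn.commit()      # only reached if every step succeeded
except Exception:
    conn.rollback()    # the ALTER TABLE and the UPDATE are both undone
    raise
finally:
    conn.close()
```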

(Note that you can still run DDL commands against a Splitgraph table after you
check it out, since at that point it's just a Postgres table. In theory it
would be possible to track DDL changes with some other mechanism and apply
them after loading a version of the data.)

------
username3
How does this compare to Dolt and DoltHub?

~~~
timsehn
CEO of the company that built Dolt here. I just did some preliminary research
and it seems that we have a similar mission. We both want to version control
data.

The thing Dolt does that Splitgraph does not is support branches, diffs,
merges and conflicts on data and schema. Dolt has its own storage engine to do
this efficiently, whereas Splitgraph relies on Postgres.

Having native Postgres with a versioning layer on top has other advantages, so
we're excited to see how Splitgraph's approach works in practice, and to play
with it. We love to see more tools in this underserved space.

~~~
timsehn
Also a survey of other "Git for data" options from a few months ago:
https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-data/

