
Show HN: Quilt – manage data like code - akarve
https://quiltdata.com/
======
akarve
Hi, I'm one of the founders of Quilt Data (YCW16). We built Quilt to bring
package management to data. The goal is to create a community of versioned,
reusable building blocks of data, so that analysts can spend more time
analyzing and less time finding, cleaning, and organizing data.

Our broader goal is to create a new kind of data warehouse based on
code-management practices that haven't yet reached the data domain.

Feedback welcome. Ask me anything.

~~~
masklinn
The naming is a bit sad in that it conflicts with the patch sets manager:
[https://en.wikipedia.org/wiki/Quilt_(software)](https://en.wikipedia.org/wiki/Quilt_\(software\))

~~~
kevinemoore
Sorry to hear about the name conflict. We weren't familiar with the patch sets
manager. If it helps, you can keep the quilt (data) command-line tools out of
your path and still run all the quilt commands from inside Python.
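
For example, everything the CLI does should be callable from Python (a minimal
sketch, assuming the Python API mirrors the CLI; the package handles here are
made up):

    import quilt

    # Python equivalents of the `quilt` shell commands
    quilt.install("examples/wine")              # fetch a public package
    quilt.build("myuser/mypkg", "build.yml")    # build a package from a config
    quilt.push("myuser/mypkg")                  # publish it to the registry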

~~~
kchr
No worries, just make the name switch sooner rather than later. Since your
target audience is obviously developers in general - to whom the patch tool is
well known - it will most likely affect your business negatively if you
don't...

How did this not come up during your market research?

~~~
kchr
Also, fantastic product! My only criticism is the poor name choice and lack of
market research...

------
smilliken
It's outrageous how little tooling support there is for version control in
data compared to code. Every mainstream database forgets history with updates,
doesn't support distributed workflows, doesn't treat commit IDs as first-class
objects, and lacks most of the other basic features of VCSs. Databases just
aren't a solution to version control.

I can't imagine a future where we don't treat data version control as a
necessity in the same way as code version control. I hope Quilt can fill the
much-needed role of "GitHub for data".

~~~
xyzzy_plugh
I've often thought about this problem space. Imagine you have a database with
all the bells and whistles you suggest. At first, it's great. But at some
point, when you start experiencing growth (compute/storage) pressure (many
petabytes) all of this metadata adds up.

Does it remain cost effective at scale?

All the source code in the world is a drop in the bucket compared to the raw
data collected by a large business.

~~~
hinkley
If the user of the data insists on a time horizon of 'since our founding',
then yeah, there's an opportunity cost to keeping all of the data. The one
people usually miss is that it takes log(n) time to update or insert a record,
because you have to update all of the indexes. So when you have 1000 times as
much data, every insert takes roughly 10x as long as it did at the beginning
(log2 of 1000 is about 10). Or you use partial indexes and it gets maybe 2
times slower, which the users probably won't notice.

So people just give up and exfiltrate all of the data to another server to run
their reports on. But as a user I still want to be able to figure out 'did I
do that thing in February, or was it March?' fairly often.

One thing I've always wished database replication systems did (universally)
was allow you to run different indexes on different replicas, so you don't
have to export the data at all. Your insert time would still be a function of
network delay + worst case index update time, but you could segregate traffic
based on kind and continue to scale to a fairly large company before anyone
had to mention data warehousing or data lakes.

------
fredcash25
How is this different from/the same as the Dat project
([https://datproject.org/](https://datproject.org/) and
[https://github.com/datproject](https://github.com/datproject))?

~~~
akarve
Dat is a distributed transport layer for raw data. Quilt is a centralized
(your infrastructure or ours) transport and consumption layer for virtualized
data. As such we'll be able to, for example, run efficient queries across all
of Quilt, allow users to import data the same way (no data prep scripts)
across a variety of platforms, etc.

~~~
lmeyerov
I recently encountered [https://data.world/datanerd/inc-5000-2016-the-full-
list](https://data.world/datanerd/inc-5000-2016-the-full-list) and was
impressed by the combo of a pandas-friendly client library & centralized
online tier with bells & whistles. Feels closer to that direction.

There's obviously friction in this space -- this weekend I'm playing with
Databricks Spark notebooks and public data on S3, and the data prep will still
be annoying -- so I'm looking forward to design innovation here.

~~~
akarve
I'd like to chat more about this. Quilt uses S3, and I think we could make the
process of getting data into Databricks much simpler. Drop me a line if you'd
like to discuss: aneesh at quiltdata dot io.

------
gouggoug
This makes me think of [http://www.pachyderm.io/](http://www.pachyderm.io/).
Although Quilt seems to be more like github for data, whereas Pachyderm is
more like git for data.

~~~
lobster_johnson
Pachyderm was the first thing I thought of. In fact there should be
opportunities for collaboration here.

Pachyderm looks brilliant for high-scalability parallel data processing, and
the versioned-data part is a way not just to maintain the history of the data
but also to avoid reprocessing data that hasn't changed since the previous
run.

~~~
jaz46
Hi, I'm one of the creators of Pachyderm. We've been talking with the Quilt
founders about various ways to work together and think there are some really
exciting opportunities!

------
rspeer
I'm very excited. I want to use this to version ConceptNet's raw input and its
built data, all of which is public.

So I can assume this isn't going to be afraid of gigabytes, right? I've seen
services before that want to be a repository of data, and I try to upload a
mere 20 GB of data and they're like "oh shit nevermind". Even S3 requires it
to be broken into files of less than 5 GB for some inscrutable reason.

~~~
IanCal
I don't think you've had to break files up yourself on S3 for a long time.
You can treat files up to 5 TB as a single object. I think you have to do a
multipart upload, but that's probably not a bad idea anyway.
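
For what it's worth, boto3's managed transfer handles the splitting for you (a
sketch; the bucket and file names are made up):

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # files above multipart_threshold are split into parts, uploaded in
    # parallel, and stored as a single object (up to 5 TB)
    config = TransferConfig(multipart_threshold=8 * 1024 * 1024)
    s3.upload_file("conceptnet_dump.csv", "my-bucket",
                   "data/conceptnet_dump.csv", Config=config)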

~~~
rspeer
Ah, right. I'm remembering from when I was trying to use git-annex to version
the data, which was a problem for multiple reasons, including that their S3
driver didn't use multipart uploads.

------
stewbrew
It looks like a plain HTML page, but it requires JS to view anything except:
"Please enable JavaScript to use this site." What a wonderful time to live in.

Anyway, do I get this right: they expect users to be experts in data analysis,
yet unable to load the data into whatever software they use? They want me to
share data and to offload my data into their walled garden, which can be
accessed only via their service? If I wanted to share my data, wouldn't I
rather use something more accessible?

~~~
eternauta3k
> If I wanted to share my data, wouldn't I rather use something more
> accessible?

Still, it's good inspiration. Maybe I'll make a github repo with my city's
open datasets loaded into python.

~~~
akarve
There are some important differences from git:
[https://news.ycombinator.com/item?id=14772036](https://news.ycombinator.com/item?id=14772036)

------
edraferi
This looks like a cool project -- always glad to see new tools for statistical
collaboration and reproducible research.

How does this compare to what data.world [1] is doing? They recently released
a Python SDK [2] as well.

[1] [https://data.world/](https://data.world/) [2]
[https://github.com/datadotworld/data.world-
py](https://github.com/datadotworld/data.world-py)

~~~
akarve
Hi. Key differences from data dot world:
[https://news.ycombinator.com/item?id=14792143](https://news.ycombinator.com/item?id=14792143)

------
sixdimensional
I think it was a really interesting (and smart) choice to convert to Parquet
format. Columnar storage is so much more efficient, and working with data in
Parquet is pretty fast using the engines they mention (Apache Spark, Impala,
Hive, etc.).

I had actually been thinking about Parquet as a component of ETL, and whether
it might be possible to make ETL many times faster by compressing to Parquet
format at the source and then transmitting to a destination - especially in
limited-bandwidth situations where you need entire data sets moved around in
bulk.
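
The round trip is simple enough to sketch (assuming pandas with a pyarrow
backend; the file and column names are made up):

    import pandas as pd

    df = pd.read_csv("transactions.csv")    # row-oriented source extract

    # columnar + compressed: typically much smaller than the CSV,
    # so it's cheaper to move over a constrained link
    df.to_parquet("transactions.parquet", compression="snappy")

    # the destination can read back just the columns it needs
    subset = pd.read_parquet("transactions.parquet", columns=["date", "amount"])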

This looks really nice for sharing public data sets, but I wish there were a
better public non-profit org running indexes of public data sets... I guess if
something like the semantic web had ever taken off, the Internet itself would
be the index of public data sets, but that dream has yet to materialize.

~~~
kevinemoore
Once the interfaces are mostly settled, we plan to open-source the server so
that other organizations can run Quilt registries. If you know of non-profit
data indexes that Quilt should work with, or organizations that might be
interested in running a Quilt registry, please let us know.

~~~
themgt
Code for America & civic tech orgs would be quite interested, I would imagine.

------
geraldbauer
FYI: I've built a datapackage manager called datapak in Ruby [1][2]. datapak
supports the tabular datapackages (.csv with a .json schema) from the
Frictionless Data initiative (by the Open Knowledge Foundation). All open
source and public domain. See some examples, such as the Standard & Poor's
500. By default the datapackage gets auto-added from .csv to an in-memory
SQLite database for easy querying etc. Thanks to ActiveRecord you can use
PostgreSQL, MySQL, etc.

[1] [https://github.com/textkit/datapak](https://github.com/textkit/datapak)
[2]
[http://okfnlabs.org/blog/2015/04/26/datapak.html](http://okfnlabs.org/blog/2015/04/26/datapak.html)

~~~
edraferi
Very interesting. I've heard of the Open Knowledge Foundation [1] but wasn't
aware of the Frictionless Data initiative [2]. Looks like it's complementary
to the Common Workflow Language [3].

[1] [https://okfn.org/](https://okfn.org/) [2]
[http://frictionlessdata.io/](http://frictionlessdata.io/) [3]
[http://www.commonwl.org/](http://www.commonwl.org/)

------
reggieband
I think you're missing a trick with the pricing. My guess is the real money
will come once data is treated like a commodity. So the big, big money will be
in brokerages and exchanges.

Paying flat fees for access to repos is fundamentally thinking about the
problem incorrectly.

~~~
akarve
We charge business and on-prem users in TB-sized blocks. So that part is
variable cost, not flat. And we sell user seats in blocks of 10. What else
should we be thinking about? We want to be fair and also price in a way that
encourages sharing behind the firewall (e.g. shouldn't require manager
approval to add every new user).

~~~
reggieband
> What else should we be thinking about?

My feeling is brokering. Consider the market for wheat, where there is pricing
based on supply and demand. There are futures, options, etc.

Consider a NYSE for data. Why host the data? Be a discovery service, both for
the price and for brokering of access.

Data should not be priced on its size to store/transfer. That is leaving huge
money on the table. It should be priced based on what people are willing to
pay for it.

Why not allow someone to pay for the _option_ of exclusive resale rights to
some weather data service? Then allow someone to make a profit off their
ability to re-sell that data. Etc.

You may not be well placed to do that now but someone is going to. And why
would I go to you hoping to find specific data when I can go to a market full
of data, full of data re-sellers, etc.

~~~
akarve
Spot on. I get where you're coming from. The value of data is whatever the
buyer and seller agree upon. I was mostly talking about the value/price of the
service.

------
fiatjaf
What does this do that cannot be done with git or similar software + data
stored in some standard format?

~~~
akarve
Four things: serialization, virtualization, querying, and big data (even Git
LFS isn't very performant for large files). Quilt actually transforms data
into Parquet and wraps it in a virtualization layer so that the data can be
injected directly into code. Efficient querying is a function of the
serialization.

By contrast, GitHub is a blob store; it doesn't transform the data for either
serialization or virtualization.
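
Concretely, the consumption side looks roughly like this (a sketch; the
package handle and node name are hypothetical, patterned on the
`examples.sales` demo that comes up later in the thread):

    import quilt
    quilt.install("akarve/examples")      # hypothetical package handle

    # the package imports like a module -- no file paths, no parsing code
    from quilt.data.akarve import examples

    df = examples.sales()  # calling a DataNode deserializes its Parquet
                           # into a pandas DataFrame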

------
forkLding
Is there an option to sign up using GitHub? That would make life much easier
for me.

~~~
akarve
GH sign up would be useful. If you email us I will ping you when we add it:
feedback at quiltdata dot io.

~~~
forkLding
I've downloaded Quilt using pip. Is the login more for cloud storage of your
own data and projects, similar to GitHub? Maybe clarify that point on the
website? I'm not too sure what I would be signing up for. Thanks again, and I
love the service.

~~~
akarve
The login allows you to push packages. If you only want to consume public
packages: no login required.

------
tzm
Thanks for releasing. Looks useful and aligned with several projects I've
worked on.

The first thing I looked for was a canonical package / resource specification
in build.py. Any chance of supporting the Frictionless data resource spec for
interop?

[https://specs.frictionlessdata.io/data-
resource/](https://specs.frictionlessdata.io/data-resource/)

~~~
akarve
As Kevin mentioned, we can extend support to Frictionless (and are accepting
PRs on GitHub :). The thing we didn't love about Frictionless is that it
requires the user to fully specify the schema. We take a slightly more
automated approach: [https://docs.quiltdata.com/make-a-
package.html](https://docs.quiltdata.com/make-a-package.html)
[https://docs.quiltdata.com/buildyml.html](https://docs.quiltdata.com/buildyml.html)

I think we could generate a Frictionless schema pretty easily...

------
azag0
I am a scientist who sometimes publishes data sets with academic papers, and
this looks super useful both as a tool and as a potential publishing best
practice. Currently I make do with HDF5 and figshare. One necessary feature
for academia would be the ability to assign a DOI to a given version of data.
Is it feasible for Quilt to have such a feature?

~~~
kevinemoore
We'd love to include DOIs for each version of datasets. I think it's feasible
for us to do that, but we haven't scoped out how hard or expensive that will
be. In the meantime, if you have a way of creating DOIs from URLs, creating a
package version will give you a permanent URL for a version of a dataset. If
you have a recommendation for how to implement DOI creation, please let us
know. Thanks!

------
pinhead
Not to be confused with [http://quilt.io/](http://quilt.io/)

~~~
klodolph
Huh, I was thinking of
[https://linux.die.net/man/1/quilt](https://linux.die.net/man/1/quilt)

Basically the sanest way of maintaining patch sets (multiple patches -> quilt,
get it?). The logic of using "quilt" as the name for anything that doesn't
have to do with patches baffles me.

------
dawiddutoit
Awesome idea. Have you thought about a marketplace perhaps? Say for instance I
create a package with all the cities in the world or all species of canine,
host it in a marketplace and others can buy that data package to use in their
own projects.

~~~
akarve
Stay tuned :)

------
hfourm
Hey, the website did a good job of explaining what you are within like 2
seconds. I like it.

------
dpiers
Wow - this looks like it would be really useful for us, and fits perfectly
with our existing processes. I am building out the data analytics function on
the Internal Audit team at Uber, and one of our challenges is that we have to
pull and manage data from different business systems, and be able to track
which version of a data set a report/analysis was run against.

It would be really cool if quilt could generate documentation for datasets,
even if it was just column names/types. One of the issues we have is keeping
track of all of the data "assets" people have pulled or created.

~~~
kevinemoore
We're definitely planning to make column names and types more easily
accessible and searchable. It'd be great to learn more about what information
and metadata would be most useful to you. We also have an on-prem version
we're rolling out with a couple of pilot customers in case that's helpful.

------
nicodjimenez
This seems like a great way of publishing public datasets. However, as someone
who works on a computer vision startup I don't think I could really use this.
In my work data annotation, visualization, and versioning cannot be easily
separated. The effort we would need to put in to use quilt might be better
spent building a simple versioning system on top of our current data
infrastructure.

~~~
akarve
Where does it break down? Quilt can package and version directly from in-
memory objects, so if you are working in Python (more languages planned) you
can package as you go and include any dependencies.

------
temuze
This is great! I was looking for something like this for a while.

You guys should make the search bar a little more prominent. Took me a while
to find it!

~~~
akarve
Done. You should see a more obvious search bar on the next push. We'll also
make search case insensitive :)

------
servilio
What's the difference with CVMFS[1] and Nix[2]?

[1]
[http://cernvm.cern.ch/portal/filesystem](http://cernvm.cern.ch/portal/filesystem)
[2] [http://nixos.org/nix/](http://nixos.org/nix/)

~~~
akarve
Those are both interesting projects. Quilt has a rather different emphasis,
though. The difference from CVMFS is that Quilt is a full set of services
around data (build, push, install), whereas CVMFS is just the file system. In
the big data community Parquet, which we use as a virtualization format, has
far more traction than CVMFS. Nix is specialized for software package
management, so the distinctions between Quilt and git apply:
[https://news.ycombinator.com/item?id=14772036](https://news.ycombinator.com/item?id=14772036).

------
tannhauser23
I met one of your founders at an event recently. Very impressive product! Good
luck to you all.

------
wodenokoto
The workflow I would have imagined for versioning data is:

1) Load original data from source into quilt

2) Do transformation

3) Commit transformations to quilt, with commit message

4) Run experiment

5) Do new transformations

6) Commit to quilt

7) Run experiment

Rinse and repeat.

Looking at the video and documentation, this loop is not emphasised at all,
suggesting instead that edits to data should be saved as a new package.
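
Roughly, I mean something like this, using the Python equivalents of the CLI
commands mentioned elsewhere in the thread (a sketch; the handles are made
up):

    import quilt

    # 1) snapshot the original data
    quilt.build("me/experiment", "build.yml")
    quilt.push("me/experiment")

    # 2-4) transform, commit, run the experiment
    # ... apply transformations to the underlying files ...
    quilt.build("me/experiment", "build.yml")
    quilt.push("me/experiment")    # the registry keeps the previous version

    # rinse and repeat; inspect the history later
    quilt.log("me/experiment")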

~~~
akarve
You can absolutely edit in place and that will go into `quilt log` for the
package--as long as you are the package owner. Our docs were a bit confusing
on this point. I just updated them: [https://docs.quiltdata.com/edit-a-
package.html](https://docs.quiltdata.com/edit-a-package.html)

------
jinjin2
I would love it if someone did this with [http://realm.io](http://realm.io),
so that the data could be "live" and multiple users could collaborate on it in
realtime.

~~~
akarve
We've talked about this, though streams (e.g. Apache Beam) are closer to where
we think realtime data is going. It would be possible to wire Quilt to
something like Firebase to get realtime behavior... Happy to brainstorm other
solutions: aneesh at quiltdata dot io.

------
mdevere
Hey, this sounds really interesting and I'd like to play around with it.
However, I'm a novice and I've run into the following issue:

>>> examples.sales

<DataNode>

No idea what a DataNode is so am struggling to actually see the data! Any
tips?

~~~
akarve
Is there an example in the docs that just shows `examples.sales`? If so,
please let me know and I'll fix it. I searched and couldn't find such an
example (though maybe, through the "magic" of Chrome's service worker, I got
an old version of the website). As mentioned by Kevin, `()` or `_data()` is
what you need.
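
In other words (a quick sketch, reusing the node from the question above):

    >>> examples.sales          # a DataNode is a lazy handle to the data
    <DataNode>
    >>> df = examples.sales()   # calling it materializes a pandas DataFrame
    >>> df.head()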

------
torchous
Any thoughts on adding DOIs? It's a complex subject wrt versioning in
particular (new DOI per version? How to keep track?). It would help
tremendously with the academic community, for the bean counting.

~~~
akarve
The package name + hash is an implicit DOI. What if we added web support for
it so that users could visit
[https://quiltdata.com/packages/USER/PKG?doi=SOME_HASH](https://quiltdata.com/packages/USER/PKG?doi=SOME_HASH)
?

~~~
torchous
Yes, but it's not globally recognizable. That's why DOIs are standardized
through ISO: [https://www.doi.org/](https://www.doi.org/). Internally you
could implement a DOI->HASH mapping, but a Quilt hash isn't going to help in
the reference list of a paper (if you're lucky you can copy'n'paste it). How
do you know where to go? What happens if your package organization changes
internally? And so forth.

------
anon1253
Any relation to
[https://github.com/QuiltProject](https://github.com/QuiltProject) ?

~~~
kevinemoore
No, no relation.

------
ah-
How much of it is open source? Can I run my own?

~~~
akarve
The client is fully open source. You can indeed run your own and we are just
starting to roll that out. I can get you started: feedback at quiltdata dot
io. We are deliberating open sourcing the registry as well (making everything
open source). What do you think?

~~~
heinrichhartman
Being able to host my own repository is a must for me. We have many TB of data
and don't want to stream that over the internet, so having the repo on-site is
essential.

I'd appreciate the code being open source. I can afford to pay for a
(perpetual) license.

~~~
akarve
Understood. We can do something very close to that. Right now we give the
registry source to our on-prem users as part of the license. Email sales at
quiltdata dot io and we can discuss.

------
nosefouratyou
What's the difference between Apache Parquet and Apache Arrow? They're both
columnar formats, right?

~~~
akarve
Parquet is data at rest, on disk. Arrow is focused on in-memory analytics and
serializes out to a variety of formats (including Feather and Parquet).
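
With pyarrow, the two roles are explicit (a minimal sketch):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Arrow: an in-memory columnar table, built for fast analytics
    table = pa.table({"id": [1, 2, 3], "price": [9.5, 3.2, 7.1]})

    # Parquet: that same data serialized to disk
    pq.write_table(table, "prices.parquet")
    roundtrip = pq.read_table("prices.parquet")  # back into Arrow memory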

------
synaesthesisx
This is great - looking forward to seeing the available datasets grow!

------
atomical
Is there a similar project for Ruby?

------
huac
No charge for bandwidth?

~~~
kevinemoore
No, there's no charge for bandwidth. The most common uses so far are users
installing datasets locally, which caches the data at the destination, or
running batch jobs in ECS/EC2, which doesn't accrue charges on AWS.

~~~
huac
Huh, that's a pretty good setup for you then! I might mess around and see
about writing an R package/interface, because this looks very useful.

~~~
akarve
We welcome your contributions to the R interface. If you email me, aneesh at
quiltdata dot io, I can add you to our Slach channel where our engineers can
support your efforts. Several users have asked about R and if we combine them
together I think we have the horsepower to build an R layer for Quilt.

