
Launch HN: Quilt (YC W16) – A versioned data portal for S3 - akarve
We're Aneesh and Kevin of Quilt (https://open.quiltdata.com/). Quilt is a versioned data portal for S3 that makes it easier to share, discover, model, and decide based on data at scale. It consists of a Python client, web catalog, and lambda functions (all open source), plus a suite of backend containers and CloudFormation templates for businesses to run their own stacks. Public data are free. Private stacks are available for a flat monthly licensing fee.

Try searching for anything on https://open.quiltdata.com/ and let us know how search works for you. We kind of surprised ourselves with a Google-like experience that returns primary data instead of links to web pages. We've got over 1M Jupyter notebooks, 100M Amazon reviews, and many more public S3 objects on over a dozen topics indexed in ElasticSearch.

The best example, so far, of "S3 bucket as data repo" is from the Allen Institute for Cell Science: https://open.quiltdata.com/b/allencell/tree/

Kevin and I met in grad school. We started with the belief that if data could be "managed like code," data would be easier to access, more accurate, and could serve as the foundation for smarter decisions. While we loved databases and systems, we found that technical and cost barriers kept data out of the hands of the people who needed it most: NGOs, citizens, and non-technical users. That led to three distinct iterations of Quilt over as many years and has now culminated in open.quiltdata.com, where we've made a few petabytes of public data in S3 easy to search, browse, visualize, and summarize.

In earlier versions of Quilt, we focused on writing new software to version and package data. We also attempted to host private user data in our own cloud. For reasons that we would soon realize, these were mistakes:

* Few users were willing to copy data—especially sensitive and large data—into Quilt

* It was difficult to gather a critical mass of interesting and useful data that would keep users coming back

* Data are consumed in teams that include a variety of non-technical users

* Even in 2019, it's unnecessarily difficult and expensive to host and share large files. (GitHub, Dropbox, and Google Drive all have quotas and performance limitations, and none of them can serve as a distributed backend for an application.)

* It's difficult for a small team to build both "git for data" (core tech) and "GitHub for data" (website + network effect) at the same time

On the plus side, our users confirmed that "immutable data dependencies" (something Quilt still does) went a long way toward making analysis reproducible and traceable.

Put all of the above together, and we had the realization that viewing S3 as "git for data" would solve a lot of problems at once: S3 supports object versioning, a huge chunk of public and customer data are already there (no copying), and it keeps users in direct control of their own data. Looking forward, the S3 interface is general enough (especially with tools like min.io) to abstract away any storage layer, and we want to bring Quilt to other clouds, and even to on-prem volumes.

We repurposed our "immutable dataset abstraction" (Quilt packages) and used it to solve a problem that S3 object versioning doesn't: the ability to take an immutable snapshot of an entire directory, bucket, or collection of buckets.

We believe that public data should be free and open to all—with no competing interests from advertisers—that private data should be secure, and that all data should remain under the direct control of its creators. We feel that a "federated network of S3 buckets" offers the foundations on which to achieve such a vision.

All of that said, wow do we have a long way to go. We ran into all kinds of challenges scaling and sharding ElasticSearch to accommodate the 10 billion objects on open.quiltdata.com, and we are still researching the best way to fork and merge datasets. (The Quilt package manifests are JSONL, so our leading theory is to check these into git so that diffs and merges can be accomplished over S3 key metadata, without the need to diff or even touch the primary data in S3, which are too large to fit into git anyway.)

Your comments, design suggestions, and open source contributions to any of the above topics are welcome.
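For a concrete feel of the package workflow, here is a minimal sketch using the quilt3 client (the package and bucket names below are hypothetical; the calls follow the quilt3 API):

```python
import quilt3

# Browse a package (a "locked list" of S3 objects) without downloading data.
pkg = quilt3.Package.browse("examples/hurdat", registry="s3://quilt-example")

# Build an immutable snapshot of a local directory and push it to S3.
p = quilt3.Package()
p.set_dir("data", "./data")            # stage every file under ./data
p.push(
    "myteam/mydataset",                # hypothetical package name
    registry="s3://my-bucket",         # hypothetical destination bucket
    message="First immutable snapshot",
)
```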
======
timsehn
Congratulations to the Quilt team on the launch!

Quilt reached out to me and suggested I chime in: people interested in
versioning data should also check out Dolt
([https://github.com/liquidata-inc/dolt](https://github.com/liquidata-inc/dolt))
and DoltHub ([https://www.dolthub.com](https://www.dolthub.com)).

We've taken the Git and GitHub for data analogy a lot more literally than
Quilt has :-) We are a SQL database with native Git semantics. Instead of
versioning files like Git, we version table rows. This allows for diff and
conflict detection down to the cell level. We are built on top of another open
source project called Noms
([https://github.com/attic-labs/noms](https://github.com/attic-labs/noms)).

We think there is a ton of room in this space for a bunch of tools: Quilt,
Noms, QRI ([https://qri.io/](https://qri.io/)), Pachyderm
([https://www.pachyderm.io/](https://www.pachyderm.io/)), and even Git. We're
excited to see so many bright minds trying to solve this problem.

We're going to be populating DoltHub with a bunch of datasets we harvest from
the open data community to show off the capabilities of Dolt. The coolest one
so far is the Google open images dataset:
[https://www.dolthub.com/repositories/Liquidata/open-images](https://www.dolthub.com/repositories/Liquidata/open-images).

~~~
kevinemoore
Thanks Tim! I definitely second your observation that there's room and reason
for plenty of tools in this space. DVC probably belongs in your list too:
[https://github.com/iterative/dvc](https://github.com/iterative/dvc). Looking
forward to checking out Open Images.

~~~
timsehn
The coolest thing to do is diff the label_descriptions table between the V2
and V3 branches:

[https://www.dolthub.com/repositories/Liquidata/open-images/c...](https://www.dolthub.com/repositories/Liquidata/open-images/commits/j9mmf12fuat35k8l1kf6kr969kdgd8fv)

You can start to see the power of column-wise diffs, and to imagine what it
would be like to change this table and then merge Google's V3 changes onto
your modified copy. Very powerful. We need a query interface on top of diffs.
Lots to build...
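To make "column-wise diff" concrete, here's a rough pandas sketch of the idea (purely illustrative; Dolt implements this natively on its own storage engine, not via pandas):

```python
import pandas as pd

# Two hypothetical versions of a label_descriptions-style table, keyed by label_id.
v2 = pd.DataFrame({"label_id": ["a", "b", "c"],
                   "description": ["cat", "dog", "bird"]}).set_index("label_id")
v3 = pd.DataFrame({"label_id": ["a", "b", "d"],
                   "description": ["cat", "puppy", "fish"]}).set_index("label_id")

added = v3.index.difference(v2.index)    # rows only in V3
removed = v2.index.difference(v3.index)  # rows only in V2
common = v2.index.intersection(v3.index)

# Cell-level changes: compare shared rows column by column.
changed = v2.loc[common] != v3.loc[common]
for key in common[changed.any(axis=1)]:
    for col in v2.columns[changed.loc[key]]:
        print(f"{key}.{col}: {v2.at[key, col]!r} -> {v3.at[key, col]!r}")
# prints: b.description: 'dog' -> 'puppy'
```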

~~~
lichtenberger
Great work on Dolt; it's good to see temporal data stores emerging :-)

Regarding XML and JSON, SirixDB[1] already provides full-blown time-travel
queries using a fork of Brackit[2], which is basically XQuery, to process and
query both XML and JSON documents. That said, SirixDB could in principle also
store relational or graph data. The storage engine has been built from scratch
to offer the best possible versioning capabilities. However, I've never
implemented branching/merging, as I never came up with good use cases; with
them, it always becomes more of a versioning system like Git, just
finer-grained.

I always struggled to implement this, as SirixDB currently only allows a
single read-write transaction per resource. Thus, if it supported branching
and merging, users would have to handle conflicts manually when merging (or
automatically, using a merge strategy, which is often not good).

There's plenty of optimization potential, however, as SirixDB optionally
stores a lot of metadata for each node (number of descendants, a rolling hash,
Dewey IDs, number of children, and so on, as well as user-defined, typed
secondary index structures). I'll have to look at how to build AST rewrite
rules and implement a lot of optimizations in my Brackit binding in the
future, so this is just the starting point (but everything should at least
work already) :-)

[1] [https://sirix.io](https://sirix.io) and
[https://github.com/sirixdb/sirix](https://github.com/sirixdb/sirix)

[2] [http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publicatio...](http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publications/2013/Dissertation-Baechle.pdf)

------
breck
This is great. Thank you for so openly sharing your strategic thinking and
lessons learned.

I follow about 100 projects in this "github for data" space and haven't yet
seen a breakout hit. Yours looks like it has potential. I like the simplicity
and the "objects by file extension". Many of these sites, I think, get too
complex too quickly.

At the UH Cancer Center we routinely deal with datasets in the TB - PB range,
and that type of size definitely makes this problem qualitatively different.
Your splitting of the storage (S3) from the front end is the correct technical
decision, IMO.

I've worked in this space for about 10 years. My open source project is called
Ohayo. I used to try to do both the front end and the backend, then similarly
decided to drop the data-storage backend and focus on my strength, which is
front-end exploratory data analysis.

I think adding a "quilt" keyword to Ohayo, with access to Quilt datasets
directly in Ohayo, may be mutually beneficial. Ohayo is just a single dumb web
app (no online storage, no tracking; the full program source code is stored in
the URL) that pulls in data via HTTP. Here's an example program that shows the
post history of the two Quilt founders on Hacker News:
[https://ohayo.computer?filename=hncomparison.flow&yi=~&xi=_&...](https://ohayo.computer?filename=hncomparison.flow&yi=~&xi=_&data=hackernews.submissions_100_akarve_kevinemoore~_tables.basic~_hidden~_filter.where_by_!%253D_~__hidden~__filter.where_type_%253D_story~___hidden~___vega.scatter~____xColumn_time~____yColumn_score~____colorColumn_by~layout_column)

We use Vega for visualization. You could imagine allowing fast, simple EDA on
these Quilt datasets through simple Ohayo links. Ohayo version 14 is a
substantial improvement that I hope to ship in the next week or two; after
that, I'd love to add Quilt to the picture.

~~~
akarve
Oh I would love to get the UH Cancer Center data into Quilt! Do you happen to
have an S3 bucket with that data live? If the bucket is publicly permissioned
it should "just work." We can talk about indexing the data for search. We are
comfortable in the TB-PB range :)

I will look more closely at Ohayo.

~~~
rsync
"At the UH Cancer Center we routinely deal with datasets in the TB - PB range
..."

...

"Do you happen to have an S3 bucket with that data live?"

As someone not working in academia (or in this field at all), can you help me
understand the question you have just asked?

Specifically, wouldn't it be _tremendously profligate_ for them to keep that
PB-range dataset living in S3?

Given the resources a university has (Internet2 connectivity, hardware
budget, and (relatively) cheap manpower), why would they ever store that data
outside of their own UH datacenter?

If the answer is "offsite backup", wouldn't it be Glacier or Nearline or...
_anything but S3_?

~~~
akarve
Good questions. First, services like open.quiltdata.com and Amazon's Registry
of Open Data cover the S3 costs for public data, so that's one incentive.
Second, the cost of cloud resources is highly competitive with (if not
superior to) on-premise data centers (see
[https://twitter.com/mohapatrahemant/status/11024016152632238...](https://twitter.com/mohapatrahemant/status/1102401615263223809)),
so I don't think it's correct to think of S3 as expensive.

There are many ways to shave S3 costs (e.g. intelligent tiering, glacier), but
at some point the data become so slow to access that you can't offer a
pleasant user experience around browsing, searching, and feeding pipelines.

Most importantly, the "my data, my bucket" strategy gives users control over
their data. A university with its own bucket has more control over its data
than it would if Google, Facebook, etc. hosted and monetized it.

------
kevinemoore
Aneesh's co-founder here. I just want to add a word of thanks to Jed Sundwall
and the AWS Registry of Open Data. The support of AWS makes publishing data at
this scale possible. I also want to thank Jackson Brown and everyone else who
worked so hard to compile, document and annotate these large and extremely
valuable datasets.

------
lichtenberger
So you basically store S3 buckets in ElasticSearch, and you're using Git to
version a hierarchy of buckets, right?

It's interesting that versioning finally seems to be getting some traction in
mainstream database systems (even though, in my opinion, it's not really
optimal in those systems) and, for instance, in your data store as well. You
position this as a Dropbox or Google Drive replacement, right? :-)

I'm asking all these questions because I've been engineering a temporal,
versioned open source storage system myself (since my studies at the
University of Konstanz, which ended in 2012), possibly at a much more
database-oriented level -- currently for storing both XML and JSON data in a
binary format.

A resource in this storage system basically stores a huge tree of database
pages, with an UberPage as the main entry point (reminiscent of ZFS's
uberblock; SirixDB borrows some ideas from ZFS and brings them down to the
sub-file level), consisting of various more-or-less hash-array-based subtrees,
as in ZFS. Levels of indirect pages are added as more data needs to be stored.
I've also added some optimizations from in-memory hash-array-based tries.

Each revision is indexed. SirixDB stores per-revision and per-page deltas
based on a copy-on-write log structure.

I've thought about storing each database-page fragment in an S3 storage
backend as another storage option, and about using Apache BookKeeper directly,
or Apache Pulsar, to distribute an in-memory intent log (it doesn't need to be
persisted before committing to the data files, as the UberPage just needs to
be swapped atomically for consistency).
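As a toy illustration of the copy-on-write idea (my own sketch in Python; SirixDB itself is Java and far more elaborate): each commit copies only the pages on the path from the root to the modified leaf, so older revisions stay readable and the new root can be swapped in atomically.

```python
class Page:
    """A node in a copy-on-write page tree."""
    def __init__(self, children=None, data=None):
        self.children = children or {}
        self.data = data

def commit(root, path, value):
    """Return a new root that shares all untouched subtrees with the old one."""
    if not path:
        return Page(data=value)
    head, *rest = path
    children = dict(root.children) if root else {}
    child = root.children.get(head) if root else None
    children[head] = commit(child, rest, value)
    return Page(children=children)

rev1 = commit(None, ["docs", "a"], "v1")
rev2 = commit(rev1, ["docs", "b"], "v2")  # rev1 remains fully intact
assert rev1.children["docs"].children.keys() == {"a"}
assert rev2.children["docs"].children.keys() == {"a", "b"}
```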

For the interested reader:

[https://sirix.io](https://sirix.io) and
[https://github.com/sirixdb/sirix](https://github.com/sirixdb/sirix)

~~~
akarve
Not quite ;) S3 is the primary data and metadata store, so that the rest of
the stack is a pure function of S3 data (including Elastic). We don't use git
at all yet. We use S3 object versioning and then capture the version, SHA-256,
etag, etc. in a JSONL-based manifest:
[https://open.quiltdata.com/b/quilt-example/tree/.quilt/packa...](https://open.quiltdata.com/b/quilt-example/tree/.quilt/packages/04de92de5f4c4141ee9c326a6719467293599b6478a6556e82d9f42a92440e6a).
Said JSONL manifest is simply a "locked list" of all the S3 objects in that
package. The same manifests can be checked into git for fork/merge of data
sets, but we're still exploring the right way to do that.
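Schematically, such a manifest looks something like the snippet below (field names are illustrative; see the linked manifest for the exact schema): a header line with package metadata, then one line per object pinning a logical key to a versioned physical key, a size, and a hash.

```python
import json

# Illustrative two-line manifest: header first, then one entry per object.
manifest_jsonl = '''\
{"version": "v0", "message": "nightly refresh"}
{"logical_key": "data/cells.csv", "physical_keys": ["s3://my-bucket/cells.csv?versionId=abc123"], "size": 10842, "hash": {"type": "SHA256", "value": "9f2c..."}}
'''

header, *entries = (json.loads(line) for line in manifest_jsonl.splitlines())
for entry in entries:
    print(entry["logical_key"], "->", entry["physical_keys"][0])
```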

I'll let Kevin answer the database fragments question.

~~~
raghava
Neat. But wouldn't this build a dependency on S3's versioning and make it hard
to port this across other clouds?

~~~
akarve
Not quite. Abstraction layers like min.io support versioning. More
importantly, Quilt manifests only require a "fully qualified physical key"
that points to the data. In theory, the manifest can work with any URI: S3,
local disk, etc.

------
JoshTriplett
While naming conflicts aren't necessarily always a problem, given that you
specifically describe aspects of the problem as a "git for X", you should know
that "quilt" is already the name of a popular piece of version control
software.

~~~
akarve
Thanks. We were careful to publish as `quilt3` on PyPI, so there are no naming
conflicts with the `quilt` patch manager. We are also "Quilt Data, Inc."
officially.

Our thesis is that blob storage is already "git for data" and the interesting
problems, which we're working on in the open source, are to build a cross-
functional data portal atop that base.

~~~
JoshTriplett
Thank you for taking that into account; sounds quite reasonable.

------
heinrichhartman
Excited to see this being re-launched. "git for data" ranks pretty high on my
all time list of tech I want to see succeed.

I find the business model very interesting: a kind of "middle layer" SaaS,
where you provide a new front end for an existing service. I haven't seen that
very often. It certainly helps with the data privacy issues, and rapid
onboarding is another immediate benefit.

~~~
akarve
Just determining the shape of "git for data" has been a nontrivial exercise.
We found that, if you do the naive translation, you get a "one size fits
none," because data and code are fundamentally different.

What would be your main use cases with said "git for data"?

~~~
heinrichhartman
One use-case is this one:
[https://github.com/HeinrichHartmann/arxiv_meta](https://github.com/HeinrichHartmann/arxiv_meta)

------
rabidrat
Is there a way to get a "dataset of datasets"? That is, all datasets you have,
in downloadable tabular form with metadata for each dataset?

~~~
breck
+1 for this.

UC Irvine has a famous repo of datasets for machine learning research but does
not have a metadata dump anywhere. I had to crawl it manually and create one
[1]. It would be great if you offered a single URL with a CSV/JSON/other dump
of your available datasets.

[1]
[https://ohayo.computer?filename=ucimlrDemo.flow&yi=~&xi=_&da...](https://ohayo.computer?filename=ucimlrDemo.flow&yi=~&xi=_&data=ucimlr.datasets~_handsontable.basic)

~~~
kevinemoore
The list of datasets is pretty dynamic. Would an API call work as the URL
(assuming it could return CSV/TSV/etc.)?

~~~
breck
Yes! That is ideal.

~~~
kevinemoore
We'll look into that for sure!

------
gidim
Really excited to see this relaunched. Every DS team has issues around dataset
management. We previously shared a tutorial on how to get a fully reproducible
pipeline with Quilt + Comet.ml:
[https://blog.quiltdata.com/building-a-fully-reproducible-mac...](https://blog.quiltdata.com/building-a-fully-reproducible-machine-learning-pipeline-with-comet-ml-and-quilt-c0e682b8e25)

------
trailerfins
I also really appreciated your lessons learned — pretty compelling. The
showcase buckets on the site are awesome. What's the mechanism by which the
public data ends up in S3, just out of curiosity?

~~~
kevinemoore
In general, a data publisher simply creates an S3 bucket in their account and
sets the access control to allow public read. In the specific case of the AWS
Registry of Open Data, data providers create a clean AWS account to hold the
data (the account will have only S3 and won't run any compute services). Once
the dataset is accepted into RODA, AWS will cover the costs (S3 storage and
egress bandwidth) for the clean account.
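For the common case, "allow public read" boils down to a bucket policy along these lines (a sketch with a hypothetical bucket name; granting "Principal": "*" deserves real care):

```python
import json
import boto3

bucket = "my-open-data-bucket"  # hypothetical
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",                       # anyone
        "Action": "s3:GetObject",               # may read objects
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```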

------
lyal
I'm very excited to see this -- data portability and management is a primary
struggle we're trying to map out. Would love to see an engineering post on
what you did for ElasticSearch.

~~~
akarve
Can do. A lot of the magic happens in the es/indexer and search lambdas here:
[https://github.com/quiltdata/quilt/tree/master/lambdas](https://github.com/quiltdata/quilt/tree/master/lambdas).

The short of what we do: we listen for bucket notifications in Lambda, open
the object metadata, and send it, along with a snippet of the file contents,
to ElasticSearch for indexing. ElasticSearch mappings are a bit of a bear, and
we had to lock those down to get them to behave well.
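In pseudo-Python, the pattern is roughly the following (a simplified sketch, not our actual lambda; see the repo above for the real code, and the endpoint and index name here are hypothetical):

```python
import boto3
from elasticsearch import Elasticsearch

s3 = boto3.client("s3")
es = Elasticsearch("https://search-domain.example.com")  # hypothetical endpoint
SNIPPET_BYTES = 4096  # index only a small head of each object

def handler(event, context):
    for record in event["Records"]:                 # S3 bucket notification
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        head = s3.head_object(Bucket=bucket, Key=key)
        snippet = s3.get_object(
            Bucket=bucket, Key=key, Range=f"bytes=0-{SNIPPET_BYTES - 1}"
        )["Body"].read()
        es.index(index="objects", id=f"{bucket}/{key}", body={
            "key": key,
            "size": head["ContentLength"],
            "last_modified": head["LastModified"].isoformat(),
            "content": snippet.decode("utf-8", errors="replace"),
        })
```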

What are the big barriers you're bumping into on the data management and
portability side of things?

~~~
Felz
Seems like it'd be more elegant (and probably cost effective) if you stored
the Lucene indexes inside the buckets themselves.

~~~
akarve
That is an interesting idea. What kind of performance could we expect,
especially in the federated case of searching multiple buckets? Elastic has
sub-second latency (at the cost of running dedicated containers).

~~~
Felz
That's a bit of an open question right now, unfortunately. Using S3 to store
Lucene indexes is a roll-your-own thing, last I checked, and the
implementation I wrote currently deals with smaller indexes, where files can
be pulled fully to disk as needed. S3 does support range requests, though,
which I'd think would mimic random access well enough.
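For what it's worth, a range read against S3 is a one-liner with boto3 (bucket and key below are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-index-bucket",      # hypothetical
    Key="segments/_0.cfs",         # e.g. a Lucene compound segment file
    Range="bytes=1024-2047",       # fetch one 1 KiB slice of the file
)
chunk = resp["Body"].read()        # exactly the requested byte range
```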

Assuming the ElasticSearch deployment you're using is backed by SSDs, there'd
likely be more latency with S3, but I'd expect it to scale pretty well.
Internally, a Lucene index is an array of immutable, self-contained segment
files that store all the indices for particular documents. Searching multiple
indices is pretty much just searching through all their segments, which can
be as parallel as you want it to be.

To be honest, I'm actually surprised the Elasticsearch company doesn't offer
this as an option. Maybe because they sell hardware at a markup?

------
codetrotter
> Try searching for anything on
> [https://open.quiltdata.com/](https://open.quiltdata.com/) and let us know
> how search works for you.

I suggest adding the possibility of searching for exact matches with quotation
marks, and also to ensure that it works with the quotation marks that the
default keyboard on iOS has.

For example, I want to search for “Irish Setter” and only see results that
include those two words next to each other like that.

~~~
akarve
Here is an example of the "Irish Setter" query:
[https://open.quiltdata.com/search?q=irish%20%2B%20setter](https://open.quiltdata.com/search?q=irish%20%2B%20setter)

You would type "irish + setter" in the search box.
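Under the hood, this kind of syntax maps naturally onto ElasticSearch query-string queries, where quotes mean phrase adjacency and "+" means AND. A rough sketch of such a query (hypothetical endpoint and index, not necessarily what the catalog runs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://search-domain.example.com")  # hypothetical
resp = es.search(index="objects", body={
    "query": {"simple_query_string": {
        "query": '"irish setter"',   # quoted phrase = exact adjacency
        "fields": ["content"],
    }}
})
print(resp["hits"]["total"])
```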

~~~
codetrotter
Thank you :)

------
diegoscara
I'm really excited to start exploring these datasets through quilt and build
ML models. Thanks to all the Quilt team and everyone involved for making this
possible!

------
DTE
Congrats to Aneesh and team! We (Paperspace, YCW15) are big fans and have been
following these guys for a while now!

------
FanaHOVA
Congrats on the launch guys! Excited to read that you've already connected
with Tim, more and more smart people tackling this problem is always a plus
for everyone :)

------
foxhop
Does it work with DigitalOcean Spaces?

~~~
akarve
Not yet, but it's closer than one might think. Spaces has an S3-compatible
API, and we have plans to use something like min.io to make Quilt work with
"all the blobs": GCP, Azure, DigitalOcean, etc. OSS contributions welcome :)

What would you use Quilt for in Spaces?

~~~
thegagne
Is it possible to implement this using Backblaze B2 instead of S3?

~~~
akarve
Contingent upon support in something like Ceph or MinIO, yes.

------
digitaltrees
Congrats. You are an awesome team.

------
antman
Is there any tool in this space that also handles permissions e.g. per column
or table?

------
admirethemeyer
Thanks for sharing this and driving Quilt forward @Kevin!

