Launch HN: Quilt (YC W16) – A versioned data portal for S3
177 points by akarve on Sept 24, 2019 | 64 comments
We're Aneesh and Kevin of Quilt (https://open.quiltdata.com/). Quilt is a versioned data portal for S3 that makes it easier to share, discover, model, and decide based on data at scale. It consists of a Python client, web catalog, and lambda functions (all open source), plus a suite of backend containers and CloudFormation templates for businesses to run their own stacks. Public data are free. Private stacks are available for a flat monthly licensing fee.

Try searching for anything on https://open.quiltdata.com/ and let us know how search works for you. We kind of surprised ourselves with a Google-like experience that returns primary data instead of links to web pages. We've got over 1M Jupyter notebooks, 100M Amazon reviews, and many more public S3 objects on over a dozen topics indexed in ElasticSearch.

The best example, so far, of "S3 bucket as data repo" is from the Allen Institute for Cell Science https://open.quiltdata.com/b/allencell/tree/.

Kevin and I met in grad school. We started with the belief that if data could be "managed like code," data would be easier to access, more accurate, and could serve as the foundation for smarter decisions. While we loved databases and systems, we found that technical and cost barriers kept data out of the hands of people that needed it the most: NGOs, citizens, and non-technical users. That led to three distinct iterations of Quilt over as many years and has now culminated in open.quiltdata.com, where we've made a few petabytes of public data in S3 easy to search, browse, visualize, and summarize.

In earlier versions of Quilt, we focused on writing new software to version and package data. We also attempted to host private user data in our own cloud. For reasons that we would soon realize, these were mistakes:

* Few users were willing to copy data—especially sensitive and large data—into Quilt

* It was difficult to gather a critical mass of interesting and useful data that would keep users coming back

* Data are consumed in teams that include a variety of non-technical users

* Even in 2019, it's unnecessarily difficult and expensive to host and share large files. (GitHub, Dropbox, and Google Drive all have quotas and performance limitations, and none of them can serve as a distributed backend for an application.)

* It's difficult for a small team to build both "git for data" (core tech) and "GitHub for data" (website + network effect) at the same time

On the plus side, our users confirmed that "immutable data dependencies" (something Quilt still does) went a long way towards making analysis reproducible and traceable.

Put all of the above together, and we had the realization that if we viewed S3 as "git for data", it would solve a lot of problems at once: S3 supports object versioning, a huge chunk of public and customer data are already there (no copying), and it keeps users in direct control of their own data. Looking forward, the S3 interface is general enough (especially with tools like min.io) to abstract away any storage layer. And we want to bring Quilt to other clouds, and even to on-prem volumes. We repurposed our "immutable dataset abstraction" (Quilt packages) and used them to solve a problem that S3 object versioning doesn't: the ability to take an immutable snapshot of an entire directory, bucket, or collection of buckets.

We believe that public data should be free and open to all—with no competing interests from advertisers—that private data should be secure, and that all data should remain under the direct control of its creators. We feel that a "federated network of S3 buckets" offers the foundations on which to achieve such a vision.

All of that said, wow do we have a long way to go. We ran into all kinds of challenges scaling and sharding ElasticSearch to accommodate the 10 billion objects on open.quiltdata.com, and we are still researching the best way to fork and merge datasets. (The Quilt package manifests are JSONL, so our leading theory is to check these into git so that diffs and merges can be accomplished over S3 key metadata, without the need to diff or even touch primary data in S3, which are too large to fit into git anyway.)
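
A minimal sketch of the manifest-diff idea above, assuming each JSONL line is one object entry. The field names `logical_key` and `sha256` are hypothetical stand-ins, not necessarily Quilt's actual manifest schema. The point is that the diff touches only key metadata and never the primary data:

```python
import json

def load_manifest(jsonl_text):
    """Parse JSONL text into a dict keyed by logical key (hypothetical field name)."""
    entries = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            entry = json.loads(line)
            entries[entry["logical_key"]] = entry
    return entries

def diff_manifests(old, new):
    """Compare two manifests using only key metadata, never the object bytes."""
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "modified": sorted(
            k for k in new if k in old and new[k]["sha256"] != old[k]["sha256"]
        ),
    }

# Two tiny made-up manifests, one JSON object per line.
OLD = "\n".join(json.dumps(e) for e in [
    {"logical_key": "a.csv", "sha256": "aaa"},
    {"logical_key": "b.csv", "sha256": "bbb"},
])
NEW = "\n".join(json.dumps(e) for e in [
    {"logical_key": "a.csv", "sha256": "aaa"},
    {"logical_key": "b.csv", "sha256": "b2b"},
    {"logical_key": "c.csv", "sha256": "ccc"},
])

changes = diff_manifests(load_manifest(OLD), load_manifest(NEW))
print(changes)
```

Because each entry sits on its own line, a line-oriented tool such as git can also show these changes directly, which is presumably what makes checking manifests into git attractive.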

Your comments, design suggestions, and open source contributions to any of the above topics are welcomed.

Congratulations to the Quilt team on the launch!

Quilt reached out to me and suggested I chime in to recommend that people interested in versioning data also check out Dolt (https://github.com/liquidata-inc/dolt) and DoltHub (https://www.dolthub.com).

We've taken the Git and GitHub for data analogy a lot more literally than Quilt has :-) We are a SQL database with native Git semantics. Instead of versioning files like Git, we version table rows. This allows for diff and conflict detection down to the cell level. We are built on top of another open source project called Noms (https://github.com/attic-labs/noms).

We think there is a ton of room in this space for a bunch of tools: Quilt, Noms, QRI (https://qri.io/), Pachyderm (https://www.pachyderm.io/), and even Git. We're excited to see so many bright minds trying to solve this problem.

We're going to be populating DoltHub with a bunch of datasets we harvest from the open data community to show off the capabilities of Dolt. The coolest one so far is the Google open images dataset: https://www.dolthub.com/repositories/Liquidata/open-images.

Thanks Tim! I definitely second your observation that there's room and reason for plenty of tools in this space. DVC probably belongs in your list too: https://github.com/iterative/dvc. Looking forward to checking out Open Images.

The coolest thing to do is diff between the V2 and V3 branch for a label_descriptions table.


You can start to see the power of column-wise diffs. You can start to imagine what it would be like to change this table and then merge Google's changes in V3 onto your modified copy. Very powerful. We need a query interface on top of diffs. Lots to build...

Great work on Dolt and seeing temporal data stores emerging :-)

Regarding XML and JSON, SirixDB[1] already provides full-blown time-travel queries using a fork of Brackit[2], which is basically XQuery for processing and querying both XML and JSON documents. That said, SirixDB could in principle also store relational or graph data. The storage engine has been built from scratch to offer the best possible versioning capabilities. However, I've never implemented branching/merging, as I didn't come up with good use cases. It seems that it would then be more of a versioning system like Git, only more fine-grained.

I always struggled to implement this, as SirixDB currently allows only a single read-write transaction on a resource. Thus, if it supported branching and merging, users would have to handle conflicts manually when merging (or automatically, using a merge strategy, which is often not a good option).

There is, however, plenty of optimization potential, as SirixDB optionally stores a lot of metadata for each node (number of descendants, a rolling hash, Dewey-IDs, number of children... as well as user-defined, typed secondary index structures). I'll have to look at how to build AST rewrite rules and implement a lot of optimizations in my Brackit binding in the future, so it's just the starting point (but everything should at least work already) :-)

[1] https://sirix.io and https://github.com/sirixdb/sirix

[2] http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/publicatio...

Is what you're describing something like transaction time / "AS OF SYSTEM TIME"? From a different starting point, so to speak.

This is great. Thank you for so openly sharing your strategic thinking and lessons learned.

I follow about 100 projects in this "GitHub for data" space and haven't yet seen a breakout hit. Yours looks like it has potential. I like the simplicity and the "objects by file extension". Lots of these sites, I think, get too complex too quickly.

At the UH Cancer Center we routinely deal with datasets in the TB - PB range, and that type of size definitely makes this problem qualitatively different. Your splitting of the storage (S3) from the front end is the correct technical decision, IMO.

I've worked in this space for about 10 years. My open source project is called Ohayo, and I used to try and do both front end and backend, and then similarly decided to drop the data storage backend and instead focus on my strengths, which is front end exploratory data analysis.

I think adding a "quilt" keyword to Ohayo, and access to the Quilt datasets directly in Ohayo, may be mutually beneficial. Ohayo is just a single dumb web app (no online storage, no tracking, full program source code is stored in the URL) and pulls in data via HTTP. Here's an example program that shows the post history of the two Quilt founders on Hacker News: https://ohayo.computer?filename=hncomparison.flow&yi=~&xi=_&...

We use Vega for visualization. You could imagine allowing fast, simple EDA on these Quilt datasets through simple Ohayo links. Ohayo version 14 is a substantial improvement that I hope to ship in the next week or two, and then I would love to add Quilt to the picture.

Oh I would love to get the UH Cancer Center data into Quilt! Do you happen to have an S3 bucket with that data live? If the bucket is publicly permissioned it should "just work." We can talk about indexing the data for search. We are comfortable in the TB-PB range :)

I will look more closely at Ohayo.

> Do you happen to have an S3 bucket with that data live?

No. However, I'm helping start the Data Curation Core at the AIPHI here (https://aiphi.shepherdresearchlab.org/). Our intent is to be a one stop shop for all medical data in Hawaii. We don't yet have a plan on where we will actually store the public datasets (have solutions for private data), but it sounds like from what you folks are saying S3 is the place, and we should link to it via Quilt. That sounds like a good plan to me.

On a related topic, we just had a paper accepted ("Maternal Cardiovascular-Related Single Nucleotide Polymorphisms, Genes and Pathways Associated with Early-Onset Preeclampsia") with a smaller dataset (in the low TB IIRC) that we were unable to put online publicly for privacy reasons. Instead, we created a strongly typed schema for the data and wrote a method, "synthesizeProgram()", to generate fake but correctly typed data, so we could publish working code and other researchers could just swap out the CSVs to get real results. Perhaps that might be a good thing to integrate into Quilt.
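
To illustrate the general pattern (this is not the "synthesizeProgram()" method itself, whose internals aren't shown in this thread), here is a rough sketch that generates fake but correctly typed rows from a schema. The schema, column names, and value ranges are all made up for the example:

```python
import random

# Hypothetical schema: column name -> type tag.
SCHEMA = {"patient_id": "int", "age": "int", "snp": "str", "is_case": "bool"}

def synthesize_rows(schema, n, seed=0):
    """Return n fake rows whose values match the declared column types."""
    rng = random.Random(seed)  # seeded so published examples are repeatable
    generators = {
        "int": lambda: rng.randint(0, 100),
        "float": lambda: rng.random(),
        "str": lambda: "".join(rng.choice("ACGT") for _ in range(8)),
        "bool": lambda: rng.random() < 0.5,
    }
    return [
        {column: generators[type_tag]() for column, type_tag in schema.items()}
        for _ in range(n)
    ]

rows = synthesize_rows(SCHEMA, 3)
for row in rows:
    print(row)
```

Analysis code written against rows like these should run unchanged once the real CSVs are swapped in, as long as the real data follows the same schema.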

We have a data curators program on Quilt and I encourage you to apply (page bottom on open.quiltdata.com). For high-value public data sets, AWS's registry of open data will, if accepted, cover the costs of storage and egress. We went through this process with Allen Cell and I'm happy to help.

Great! Done. I'm on the mainland the rest of this month but would love to chat sometime in October.

I just want to give a plug for sharing data in the public cloud and S3 in particular. Jed Sundwall (AWS Global Open Data Lead) sums it up really well: "The cloud completely changes the dynamic for sharing data. When data is shared in the cloud, researchers no longer have to worry about downloading or copying data before getting to work. Instead, they can deploy compute resources on-demand in the cloud, where a single copy of the data is made available. It is much more efficient to move algorithms to where the data is, than to move the data to where the algorithms are, and this makes it cheaper for researchers to ask more questions and experiment often." See the full whitepaper here: https://s3-us-west-2.amazonaws.com/opendata.aws/AWS_Sharing_...

> "It is much more efficient to move algorithms to where the data is, than to move the data to where the algorithms are"

I love this quote, thanks. I do try to do things in the cloud as much as possible, but often times it's more practical for TCO reasons to do things locally.

This quote makes me wonder if in the future we'll see some sort of external SSDs with a Raspberry Pi-like portable GPU hooked up. Some sort of dedicated Storage+Computer USB hybrid.

What I like about our schema/anonymization solution, is you can put fake data and real code online, and then people can make changes to the real code on the cloud, and you can run those reliably on data locally.

There's no doubt that local processing is a lot cheaper than the cloud for a lot of workloads.

That's a very interesting pattern: publishing "fake" (perhaps safe or anonymized) data online along with code to spur research and development, then running the enhanced code on private data (e.g., PII) on local compute resources.

We hope Quilt packages can play a role to make that easier. The package serves as an interface and layer of abstraction between the code and the data so the same code can be run against the safe or private data.

"At the UH Cancer Center we routinely deal with datasets in the TB - PB range ..."


"Do you happen to have an S3 bucket with that data live?"

As someone not working in academia (or in this field at all) can you help me understand the question you have just asked ?

Specifically, wouldn't it be tremendously profligate for them to have that PB range dataset living in S3 ?

Given the resources that a university has (in both Internet2 connectivity, hardware budget and (relatively) cheap manpower), why would they ever store that data outside of their own UH datacenter ?

If the answer is "offsite backup" wouldn't it be glacier or nearline or ... anything but S3 ?

Good questions. First, services like open.quiltdata.com and Amazon's Registry of Open Data cover the S3 costs for public data. So that's one incentive. Second, the cost of cloud resources is highly competitive with (if not superior to) on-premises data centers (see https://twitter.com/mohapatrahemant/status/11024016152632238...). I don't think it's correct to think of S3 as expensive.

There are many ways to shave S3 costs (e.g. intelligent tiering, glacier), but at some point the data become so slow to access that you can't offer a pleasant user experience around browsing, searching, and feeding pipelines.

Most importantly, the "my data, my bucket" strategy gives users control over their data. A university with their own bucket has more control over their data than they do if Google, Facebook, etc. host and monetize it.

> If the answer is "offsite backup" wouldn't it be glacier or nearline or ... anything but S3 ?

Well, technically, S3 Glacier and S3 Glacier Deep Archive are still S3; Cloud Storage Nearline is similar, except that it's a tier on Google's S3-equivalent service.

But lots of public charities, especially academic institutions, host data in a way that is conveniently accessible to the public via well-known APIs, including S3. Because of their mission, they do this even when it is not the least expensive method viewed strictly in terms of storage cost and institution-internal access.

+1 for everything Aneesh said, but I also wanted to add that the public cloud offers opportunities in data sharing that academia hasn't yet provided, specifically the ability for collaborators to bring their code to the data. I posted a quote from Jed Sundwall, Global Open Data Lead at AWS in another thread. I think he really nails it when he says that the cloud "completely changes the dynamic for sharing data."

There certainly have been efforts in academia to provide shared computing resources. Cyverse (https://www.cyverse.org/about) comes to mind. At Wisconsin many researchers shared clusters using Condor. But, none to my knowledge come close to the scale, reliability and features of AWS and the other major cloud providers.

Just peeked at Ohayo and it looks neat. Feel free to email me: aneesh at quiltdata dot io. We have also been thinking about integration with Jupyter Lab data explorer and perhaps there's a common shim that we can use.

Aneesh's co-founder here. I just want to add a word of thanks to Jed Sundwall and the AWS Registry of Open Data. The support of AWS makes publishing data at this scale possible. I also want to thank Jackson Brown and everyone else who worked so hard to compile, document and annotate these large and extremely valuable datasets.

So you basically store S3 Buckets in Elastic Search and you're using Git for versioning a hierarchy of buckets, right?

It's interesting that versioning now finally seems to be getting some traction in mainstream database systems (even though it is not really optimal in these systems, in my opinion) and, for instance, also in your data store. You position this as a Dropbox or Google Drive replacement, right? :-)

I'm asking all these questions, because I'm engineering a temporal, versioned Open Source storage system myself (since I studied at the University of Konstanz until 2012), possibly on a much more database oriented level -- currently for storing both XML and JSON data in a binary format.

A resource in this storage system basically stores a huge tree of database pages, where an UberPage is the main entry point (reminiscent of ZFS's uberblock; SirixDB borrows some ideas from ZFS and applies them at the sub-file level), consisting of various more or less hash-array based subtrees as in ZFS. Thus, levels of indirect pages are added if more data needs to be stored. I've added some optimizations from in-memory hash-array based tries.

Each revision is indexed. SirixDB stores per revision and per page deltas based on a copy-on-write log-structure.

I've thought about storing each database page fragment in a S3 storage backend as another storage option and using Apache BookKeeper directly or Apache Pulsar for distributing an in-memory intent log (it doesn't need to be persisted before committing to the data files, as the UberPage just needs to be swapped atomically for consistency).

For the interested reader:

https://sirix.io and https://github.com/sirixdb/sirix

Not quite ;) S3 is the primary data and metadata store, so that the rest of the stack is a pure function of S3 data (including Elastic). We don't use git at all yet. We use S3 object versioning and then capture the version, SHA-256, etag, etc. in a JSONL-based manifest https://open.quiltdata.com/b/quilt-example/tree/.quilt/packa.... Said JSONL manifest is simply a "locked list" of all the S3 objects in that package. The same manifests can be checked into git for fork/merge of data sets, but we're still exploring the right way to do that.
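
As a rough illustration of the "locked list" idea, the sketch below builds one manifest entry per object and serializes the entries as JSONL. The field names (`logical_key`, `physical_key`, `size`, `sha256`) and the bucket, keys, and version IDs are assumptions made for the example, not a statement of Quilt's exact manifest format:

```python
import hashlib
import json

def manifest_entry(logical_key, bucket, key, version_id, body):
    """Pin one object: where it lives, which version, and what its bytes hash to."""
    return {
        "logical_key": logical_key,
        # The version ID is what makes the pointer immutable.
        "physical_key": f"s3://{bucket}/{key}?versionId={version_id}",
        "size": len(body),
        "sha256": hashlib.sha256(body).hexdigest(),
    }

def to_jsonl(entries):
    """Serialize entries as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in entries)

# Made-up objects; in practice the bytes and version IDs would come from S3.
entries = [
    manifest_entry("train.csv", "example-bucket", "raw/train.csv", "v1", b"a,b\n1,2\n"),
    manifest_entry("README.md", "example-bucket", "docs/README.md", "v7", b"# demo\n"),
]
manifest_text = to_jsonl(entries)
print(manifest_text)
```

Anyone holding the manifest can later verify that the bytes they fetch still match the recorded hash.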

I'll let Kevin answer the database fragments question.

Neat. But would this not build a dependency on S3's versioning and make it hard to port this across other clouds?

Not quite. Abstraction layers like min.io support versioning. More importantly, Quilt manifests only require a "fully qualified physical key" that points to the data. In theory, the manifest can work with any URI: S3, local disk, etc.
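
A small sketch of why a URI-style physical key is storage-agnostic: the scheme alone tells a client which backend to talk to. The `versionId` query parameter and the returned dictionary shape are assumptions for illustration only:

```python
from urllib.parse import parse_qs, urlparse

def describe_physical_key(uri):
    """Split a physical key URI into backend, location, and optional version."""
    parts = urlparse(uri)
    if parts.scheme == "s3":
        version = parse_qs(parts.query).get("versionId", [None])[0]
        return {
            "backend": "s3",
            "bucket": parts.netloc,
            "path": parts.path.lstrip("/"),
            "version": version,
        }
    if parts.scheme == "file":
        return {"backend": "local", "bucket": None, "path": parts.path, "version": None}
    raise ValueError(f"unsupported scheme: {parts.scheme!r}")

s3_key = describe_physical_key("s3://example-bucket/data/a.csv?versionId=abc123")
local_key = describe_physical_key("file:///tmp/data/a.csv")
print(s3_key)
print(local_key)
```

Supporting another store would mean adding another branch (or a registry of handlers) rather than changing the manifest itself.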

Sirix sounds like a very interesting system! Is it similar in its internal structure to noms (https://github.com/attic-labs/noms) or Dolt (https://github.com/liquidata-inc/dolt)?

I think S3 is a good match for storing database pages as long as they are immutable. The Vectorized query processing model seems to fit this approach very well (e.g., http://oai.cwi.nl/oai/asset/14075/14075B.pdf) Anyone out there from Snowflake care to answer how Snowflake stores database pages in AWS?

I haven't used BookKeeper or Pulsar myself so I can't comment on how well they might work for distributing an intent log.

I'm not really sure how Noms versions stuff. It seems they use a variant of a B-tree index. Or maybe only for secondary indexes?

I think if you can use a simple monotonically increasing sequence number SirixDB has indexing advantages as for instance when storing XML and JSON documents or graph data.

The cool thing also is that SirixDB not only copies changed database pages but it implements a sliding window algorithm for versioning the database pages itself along with well known backup versioning strategies. Furthermore user-specified, typed secondary index structures are also naturally versioned.

One downside is that SirixDB doesn't support branching. It would probably be relatively easy to implement, but I'm not convinced that it's needed. I don't want anyone to have to resolve merge conflicts, and I don't think automatic merge algorithms are the right thing either. But of course it's really interesting and I have also thought about it :-) Maybe someone has a really good use case? :-)

BTW: Everything in SirixDB is immutable regarding updates of resources in databases. Of course you can revert to an old revision and change stuff, but the revisions in-between will still be accessible.

Keep up the great work on Quilt :-)

While naming conflicts aren't necessarily always a problem, given that you specifically describe aspects of the problem as a "git for X", you should know that "quilt" is already the name of a popular piece of version control software.

Thanks. We were careful to publish as `quilt3` on PyPI, so there are no naming conflicts with the `quilt` patch manager. We are also "Quilt Data, Inc." officially.

Our thesis is that blob storage is already "git for data" and the interesting problems, which we're working on in the open source, are to build a cross-functional data portal atop that base.

Thank you for taking that into account; sounds quite reasonable.

Excited to see this being re-launched. "git for data" ranks pretty high on my all time list of tech I want to see succeed.

I find the business model very interesting: A kind of "middle layer" SAAS, where you provide a new front-end for an existing service. Not seen that very often. Certainly helps with the data privacy issues. Rapid on-boarding is another immediate benefit.

Just determining the shape of "git for data" has been a nontrivial exercise. We found that, if you do the naive translation, you get a "one size fits none," because data and code are fundamentally different.

What would be your main use cases with said "git for data"?

Is there a way to get a "dataset of datasets"? That is, all datasets you have, in downloadable tabular form with metadata for each dataset?

+1 for this.

UC Irvine has a famous repo of datasets for machine learning research but does not have a metadata dump anywhere. I had to crawl it manually and create one (1). Would be great if you offered a single URL with CSV/JSON/other dump of your available datasets.

1. https://ohayo.computer?filename=ucimlrDemo.flow&yi=~&xi=_&da...

The list of datasets is pretty dynamic. Would an API call work as a URL (assuming it could return a CSV/TSV/etc.)?

Yes! That is ideal.

We'll look into that for sure!

Thanks! That's really interesting feedback. We hadn't thought of that.

The package landing page is essentially that: a list of all of your datasets. This is constructed from the special s3://.quilt/* directory. Since a Quilt manifest is a list of keys, you can compose N manifests into a single manifest using the API, see e.g. https://docs.quiltdata.com/advanced-usage/working-with-manif...

Hmm, yes and no. It's certainly possible to nest a package (dataset) inside a package. But, there's no (current) API that returns the list of datasets as a table/DataFrame. How would you imagine using that feature?

I use local tools for all my data processing, and if I want to search for a relevant dataset, I'd like to do it locally instead of having to contact an API and do all the paging etc. The kinds of queries that might be important to me are outside the scope of any fixed API; I might want all datasets updated in December (of any year), or only those with between 100k and 1m rows, or to get a frequency table of datasets by license terms. It's very easy to do any of these queries if I can download a TSV (or equivalent) with all the metadata, and usually too frustrating to even attempt if I can't.
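
For concreteness, here is roughly what those three queries look like against a downloaded TSV, using only the standard library. The column names and the three sample rows are invented for the example; no such dump exists yet:

```python
import csv
import io
from collections import Counter
from datetime import date

# Made-up metadata dump: one row per dataset.
TSV = (
    "name\tupdated\trows\tlicense\n"
    "dataset-a\t2018-12-03\t500000\tCC-BY-4.0\n"
    "dataset-b\t2019-06-11\t1200000\tMIT\n"
    "dataset-c\t2017-12-20\t90000\tCC-BY-4.0\n"
)

def load_datasets(tsv_text):
    """Read the TSV and convert the date and row-count columns to real types."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {**row, "rows": int(row["rows"]), "updated": date.fromisoformat(row["updated"])}
        for row in reader
    ]

datasets = load_datasets(TSV)

# Updated in December of any year.
december = [d["name"] for d in datasets if d["updated"].month == 12]
# Between 100k and 1m rows.
mid_sized = [d["name"] for d in datasets if 100_000 <= d["rows"] <= 1_000_000]
# Frequency table by license terms.
by_license = Counter(d["license"] for d in datasets)

print(december, mid_sized, dict(by_license))
```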

Maybe the Elasticsearch cluster would help you here. Because all the datasets (and files within datasets) are indexed along with their metadata, you could write elastic queries to find the datasets you want--as long as the dataset creators are including the relevant metadata in the dataset annotations.

Really excited to see this relaunched. Every DS team has issues around dataset management. We previously shared a tutorial on how to get a fully reproducible pipeline with Quilt + Comet.ml https://blog.quiltdata.com/building-a-fully-reproducible-mac...

I also really appreciated your lessons learned — pretty compelling. The showcase buckets on the site are awesome. What's the mechanism by which the public data ends up in S3, just out of curiosity?

In general, a data publisher simply creates an S3 bucket in their account and sets the access control to allow public read. In the specific case of the AWS Registry of Open Data, data providers create a clean AWS account to hold the data (the account will have only S3 and won't run any compute services). Once the dataset is accepted into RODA, AWS will cover the costs (S3 storage and egress bandwidth) for the clean account.
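
As a sketch of what "allow public read" can look like in practice, the snippet below builds a standard S3 bucket policy document. The bucket name is a placeholder, and including `s3:ListBucket` and `s3:GetObjectVersion` is an assumption about what a browsing and versioning client would want rather than a documented Quilt requirement:

```python
import json

def public_read_policy(bucket):
    """Build a bucket policy that lets anyone list the bucket and read its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicList",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",  # bucket-level action
            },
            {
                "Sid": "PublicRead",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",  # object-level actions
            },
        ],
    }

policy = public_read_policy("example-public-bucket")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The resulting JSON would be attached to the bucket as its bucket policy. Note that S3 Block Public Access settings on the account or bucket can prevent a public policy from taking effect.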

I'm very excited to see this -- data portability and management is a primary struggle we're trying to map out. Would love to see an engineering post on what you did for ElasticSearch.

Can do. A lot of the magic happens in the es/indexer and search lambdas here: https://github.com/quiltdata/quilt/tree/master/lambdas.

The short of what we do: we listen for bucket notifications in Lambda, open the object metadata and send it, along with a snippet of the file contents, to ElasticSearch for indexing. ElasticSearch mappings are a bit of a bear and we had to lock those down to get them to behave well.
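
A simplified sketch of the first half of that flow, turning an S3 bucket notification into documents ready for indexing. This is not the code from the linked lambdas. The document field names and the `read_snippet` callable (which would fetch the first bytes of the object) are hypothetical, and the step that sends documents to ElasticSearch is left out:

```python
from urllib.parse import unquote_plus

def documents_from_event(event, read_snippet, snippet_bytes=1024):
    """Turn an S3 notification event into a list of index-ready documents."""
    docs = []
    for record in event.get("Records", []):
        # Only index newly written objects; skip deletes and other events.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        obj = record["s3"]["object"]
        key = unquote_plus(obj["key"])  # keys arrive URL-encoded in notifications
        docs.append({
            "bucket": bucket,
            "key": key,
            "size": obj.get("size"),
            "etag": obj.get("eTag"),
            "version_id": obj.get("versionId"),
            "snippet": read_snippet(bucket, key, snippet_bytes),
        })
    return docs

# Made-up event in the shape S3 sends to Lambda, plus a fake snippet reader.
event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "data/irish+setter.csv", "size": 12, "eTag": "abc", "versionId": "v1"},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "old.csv"}},
        },
    ]
}
docs = documents_from_event(event, lambda bucket, key, n: "breed,count"[:n])
print(docs)
```

Passing the snippet reader in as a parameter keeps the parsing logic testable without AWS credentials.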

What are the big barriers you're bumping into on the data management and portability side of things?

Seems like it'd be more elegant (and probably cost effective) if you stored the Lucene indexes inside the buckets themselves.

That is an interesting idea. What kind of performance could we expect, especially in the federated case of searching multiple buckets? Elastic has sub-second latency (at the cost of running dedicated containers).

That's a bit of an open question right now, unfortunately. Using S3 to store Lucene indexes was a roll-your-own thing the last time I checked, and the implementation I wrote currently deals with smaller indexes where files can be pulled fully to disk as needed. S3 does support range requests, which I'd think would mimic random access well enough.
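
To make the range-request point concrete, here is a minimal sketch of a seekable reader built on HTTP `Range` headers. The `fetch` callable is a stand-in for a real ranged GET against S3 and is faked here with an in-memory blob; the class is illustrative and is not the implementation mentioned above:

```python
def range_header(offset, length):
    """Build an HTTP Range header; byte ranges are inclusive on both ends."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

class RangeReader:
    """File-like random access where every read is one ranged request."""

    def __init__(self, fetch, size):
        self._fetch = fetch  # callable(headers) -> bytes, hypothetical
        self._size = size
        self._pos = 0

    def seek(self, pos):
        self._pos = pos

    def read(self, n):
        n = min(n, self._size - self._pos)
        if n <= 0:
            return b""
        data = self._fetch(range_header(self._pos, n))
        self._pos += len(data)
        return data

# Fake remote object and a fake fetch that honors the Range header.
BLOB = bytes(range(256))

def fake_fetch(headers):
    start, end = headers["Range"][len("bytes="):].split("-")
    return BLOB[int(start):int(end) + 1]

reader = RangeReader(fake_fetch, len(BLOB))
reader.seek(10)
chunk = reader.read(4)
print(list(chunk))
```

Each `read` costs a network round trip, so a real implementation would likely want read-ahead or caching of hot byte ranges.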

Assuming whatever ElasticSearch implementation you're using is backed by SSDs, there'd likely be more latency with S3, but I'd expect it to scale pretty well. Internally, a Lucene index is an array of immutable, self-contained segment files that store all indices for particular documents. Searching in multiple indices is pretty much just searching through all their segments, which can be as parallel as you want it to be.
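
A toy illustration of that last point: because segments are immutable and self-contained, each can be searched independently and the hits merged afterwards. The "segments" here are plain dictionaries of document ID to text, not real Lucene segment files:

```python
from concurrent.futures import ThreadPoolExecutor

def search_segment(segment, term):
    """Return IDs of documents in one segment whose text contains the term."""
    return [doc_id for doc_id, text in segment.items() if term in text.lower().split()]

def search_all(segments, term, max_workers=4):
    """Search every segment in parallel and merge the per-segment hits."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda segment: search_segment(segment, term), segments)
        return sorted(hit for partial in partials for hit in partial)

# Made-up segments, as if they belonged to two different buckets' indexes.
segments = [
    {"d1": "Irish Setter"},
    {"d2": "golden retriever", "d3": "red setter"},
]
hits = search_all(segments, "setter")
print(hits)
```

Threads suit this shape of work when each segment search is dominated by I/O, such as ranged reads from object storage.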

To be honest, I'm actually surprised the Elasticsearch company doesn't offer this as an option. Maybe because they sell hardware at markup?

> Try searching for anything on https://open.quiltdata.com/ and let us know how search works for you.

I suggest adding the possibility of searching for exact matches with quotation marks, and also to ensure that it works with the quotation marks that the default keyboard on iOS has.

For example, I want to search for “Irish Setter” and only see results that include those two words next to each other like that.

The search is powered by Elasticsearch. We want to support the full power of elastic’s query language, but so far haven’t found a great way to prevent overly intense (or malicious) queries and still allow arbitrary queries—love to hear from Elastic experts out there. Exact match on multiple words seems possible though.

Here is an example of the "Irish Setter" query: https://open.quiltdata.com/search?q=irish%20%2B%20setter

You would type "irish + setter" in the search box.
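
For anyone building such links programmatically, this small sketch reproduces the URL above by percent-encoding the query text. The `search_url` helper is made up for the example:

```python
from urllib.parse import quote

def search_url(query, base="https://open.quiltdata.com/search"):
    """Percent-encode the query so spaces and '+' survive the trip in a URL."""
    return f"{base}?q={quote(query, safe='')}"

url = search_url("irish + setter")
print(url)
```

The explicit encoding matters because an unencoded `+` in a query string is commonly decoded as a space.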

Thank you :)

You can do that :) See here for search syntax https://www.elastic.co/guide/en/elasticsearch/reference/6.8/... (there is a faint link under the big search bar). Let us know how that works for you.

I'm really excited to start exploring these datasets through quilt and build ML models. Thanks to all the Quilt team and everyone involved for making this possible!

Congrats to Aneesh and team! We (Paperspace, YCW15) are big fans and have been following these guys for a while now!

Congrats on the launch guys! Excited to read that you've already connected with Tim, more and more smart people tackling this problem is always a plus for everyone :)

Does it work with digital ocean spaces?

Not yet. But it's closer than one might think. Spaces has an S3-compatible API, and we have plans to use something like min.io to make Quilt work with "all the blobs": GCP, Azure, Digital Ocean, etc. OSS contributions welcome :)

What would you use Quilt for in Spaces?

Is it possible to implement this using Backblaze B2 instead of S3?

Contingent upon support in something like ceph or minio, yes.

Congrats. You are an awesome team.

Is there any tool in this space that also handles permissions e.g. per column or table?

Thanks for sharing this and driving Quilt forward @Kevin!
