We're Aneesh and Kevin of Quilt (https://open.quiltdata.com/). Quilt is a versioned data portal for S3 that makes it easier to share, discover, model, and decide based on data at scale. It consists of a Python client, web catalog, and lambda functions (all open source), plus a suite of backend containers and CloudFormation templates for businesses to run their own stacks. Public data are free. Private stacks are available for a flat monthly licensing fee.
Try searching for anything on https://open.quiltdata.com/ and let us know how search works for you. We kind of surprised ourselves with a Google-like experience that returns primary data instead of links to web pages. We've got over 1M Jupyter notebooks, 100M Amazon reviews, and many more public S3 objects on over a dozen topics indexed in Elasticsearch.
The best example, so far, of "S3 bucket as data repo" is from the Allen Institute for Cell Science: https://open.quiltdata.com/b/allencell/tree/.
Kevin and I met in grad school. We started with the belief that if data could be "managed like code," data would be easier to access, more accurate, and could serve as the foundation for smarter decisions. While we loved databases and systems, we found that technical and cost barriers kept data out of the hands of the people who needed it the most: NGOs, citizens, and non-technical users. That led to three distinct iterations of Quilt over as many years and has now culminated in open.quiltdata.com, where we've made a few petabytes of public data in S3 easy to search, browse, visualize, and summarize.
In earlier versions of Quilt, we focused on writing new software to version and package data. We also attempted to host private user data in our own cloud. For reasons that we would soon realize, these were mistakes:
* Few users were willing to copy data—especially sensitive and large data—into Quilt
* It was difficult to gather a critical mass of interesting and useful data that would keep users coming back
* Data are consumed in teams that include a variety of non-technical users
* Even in 2019, it's unnecessarily difficult and expensive to host and share large files. (GitHub, Dropbox, and Google Drive all have quotas and performance limitations, and none of them can serve as a distributed backend for an application.)
* It's difficult for a small team to build both "git for data" (core tech) and "GitHub for data" (website + network effect) at the same time
On the plus side, our users confirmed that "immutable data dependencies" (something Quilt still does) went a long way towards making analysis reproducible and traceable.
Put all of the above together, and we had the realization that if we viewed S3 as "git for data", it would solve a lot of problems at once: S3 supports object versioning, a huge chunk of public and customer data are already there (no copying), and it keeps users in direct control of their own data. Looking forward, the S3 interface is general enough (especially with tools like min.io) to abstract away any storage layer, and we want to bring Quilt to other clouds, and even to on-prem volumes. We repurposed our "immutable dataset abstraction" (Quilt packages) and used it to solve a problem that S3 object versioning doesn't: the ability to take an immutable snapshot of an entire directory, bucket, or collection of buckets.
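To make the snapshot idea concrete, here is a rough sketch (not Quilt's actual implementation). It assumes we already have a listing of objects with their S3 version IDs; the field names (`key`, `version_id`, `size`) and the `snapshot` function are hypothetical and chosen only for illustration. The snapshot is a sorted list of entries plus a hash over that list, so the same set of object versions always gets the same identifier.

```python
import hashlib
import json


def snapshot(objects):
    """Build an immutable snapshot from a listing of versioned objects.

    `objects` is an iterable of dicts with the (assumed) fields
    'key', 'version_id', and 'size'. Returns (top_hash, manifest_text),
    where manifest_text is one JSON record per line.
    """
    # Sort by key so the result does not depend on listing order.
    entries = sorted(
        (
            {"key": o["key"], "version_id": o["version_id"], "size": o["size"]}
            for o in objects
        ),
        key=lambda e: e["key"],
    )
    manifest_text = "\n".join(json.dumps(e, sort_keys=True) for e in entries)
    # The hash over the manifest identifies this exact set of object versions.
    top_hash = hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()
    return top_hash, manifest_text


listing = [
    {"key": "data/b.csv", "version_id": "v2", "size": 20},
    {"key": "data/a.csv", "version_id": "v1", "size": 10},
]
top_hash, manifest_text = snapshot(listing)
```

Because each entry pins a specific version ID, later writes to the same keys don't change what the snapshot points at; they just produce a different snapshot with a different hash.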
We believe that public data should be free and open to all—with no competing interests from advertisers—that private data should be secure, and that all data should remain under the direct control of its creators. We feel that a "federated network of S3 buckets" offers the foundations on which to achieve such a vision.
All of that said, wow do we have a long way to go. We ran into all kinds of challenges scaling and sharding Elasticsearch to accommodate the 10 billion objects on open.quiltdata.com, and we are still researching the best way to fork and merge datasets. (The Quilt package manifests are JSONL, so our leading theory is to check these into git so that diffs and merges can be accomplished over S3 key metadata, without the need to diff or even touch primary data in S3, which are too large to fit into git anyway.)
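As a sketch of what "diff over key metadata" could look like, the snippet below compares two JSONL manifests without reading any primary data. The record layout (`key` and `hash` fields) is an assumption made for this example and is not the real Quilt manifest schema; `load_manifest` and `diff_manifests` are hypothetical helper names.

```python
import json


def load_manifest(text):
    """Parse JSONL text into a dict mapping each record's 'key' to the record."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        entries[record["key"]] = record
    return entries


def diff_manifests(old_text, new_text):
    """Compare two manifests by key and per-key metadata only."""
    old, new = load_manifest(old_text), load_manifest(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old_manifest = "\n".join([
    json.dumps({"key": "a.csv", "hash": "111"}),
    json.dumps({"key": "b.csv", "hash": "222"}),
])
new_manifest = "\n".join([
    json.dumps({"key": "a.csv", "hash": "999"}),
    json.dumps({"key": "c.csv", "hash": "333"}),
])
result = diff_manifests(old_manifest, new_manifest)
```

The cost here scales with the number of keys, not the size of the objects, which is the property that makes git a plausible home for the manifests even when the data itself is far too large.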
Your comments, design suggestions, and open source contributions to any of the above topics are welcome.
Quilt reached out to me and suggested I chime in, so: people interested in versioning data should also check out Dolt (https://github.com/liquidata-inc/dolt) and DoltHub (https://www.dolthub.com).
We've taken the Git and GitHub for data analogy a lot more literally than Quilt has :-) We are a SQL database with native Git semantics. Instead of versioning files like Git, we version table rows. This allows for diff and conflict detection down to the cell level. We are built on top of another open source project called Noms (https://github.com/attic-labs/noms).
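For readers who want a feel for what cell-level diff and conflict detection mean, here is a toy illustration in Python. It is not how Dolt or Noms work internally; it only models tables as dicts keyed by primary key, and the function names are made up for this example. It also ignores added and deleted rows to stay short.

```python
def cell_changes(base, branch):
    """Return {(pk, column): new_value} for cells that differ from base.

    Tables are dicts of primary key -> {column: value}. Only rows present
    in both tables are compared (a simplification for this sketch).
    """
    changes = {}
    for pk in base.keys() & branch.keys():
        for col in base[pk].keys() | branch[pk].keys():
            old_val, new_val = base[pk].get(col), branch[pk].get(col)
            if old_val != new_val:
                changes[(pk, col)] = new_val
    return changes


def cell_conflicts(base, ours, theirs):
    """Cells changed on both sides to different values, relative to base."""
    a, b = cell_changes(base, ours), cell_changes(base, theirs)
    return sorted(cell for cell in a.keys() & b.keys() if a[cell] != b[cell])


base = {1: {"name": "apple", "qty": 1}}
ours = {1: {"name": "pear", "qty": 1}}     # we edit the name
theirs = {1: {"name": "apple", "qty": 5}}  # they edit the quantity
clean = cell_conflicts(base, ours, theirs)

theirs_clash = {1: {"name": "plum", "qty": 1}}  # they also edit the name
clash = cell_conflicts(base, ours, theirs_clash)
```

In the first case the two edits touch different cells of the same row, so there is nothing to resolve; a file-level or row-level diff would have flagged the whole row.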
We think there is a ton of room in this space for a bunch of tools: Quilt, Noms, QRI (https://qri.io/), Pachyderm (https://www.pachyderm.io/), and even Git. We're excited to see so many bright minds trying to solve this problem.
We're going to be populating DoltHub with a bunch of datasets we harvest from the open data community to show off the capabilities of Dolt. The coolest one so far is the Google open images dataset: https://www.dolthub.com/repositories/Liquidata/open-images.