Show HN: Quilt – manage data like code (quiltdata.com)
367 points by akarve 7 months ago | 178 comments

Hi, I'm one of the founders of Quilt Data (YCW16). We built Quilt to bring package management to data. The goal is to create a community of versioned, reusable building blocks of data, so that analysts can spend more time analyzing and less time finding, cleaning, and organizing data.

Our general inspiration is to create a new kind of data warehouse based on code management practices that haven't yet reached the data domain.

Feedback welcome. Ask me anything.
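
To make the "versioned, reusable building blocks" idea concrete, here is a minimal sketch of how a dataset snapshot can be given a stable version id by content hashing. This is an illustration of content addressing in general, not necessarily Quilt's actual hashing scheme, and the example data is invented:

```python
import hashlib
import json

def dataset_version(rows):
    """Derive a stable version id for a dataset snapshot.

    Serialize the rows deterministically, then hash the bytes.
    Identical data always yields the same id, so a package can
    be pinned to an exact version of the data.
    """
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"city": "Oslo", "pop": 634293}])
v2 = dataset_version([{"city": "Oslo", "pop": 634293}])
v3 = dataset_version([{"city": "Oslo", "pop": 700000}])
assert v1 == v2   # same data -> same version id
assert v1 != v3   # edited data -> new version id
```

The point of the sketch: once data has a content-derived version id, collaborators can depend on an exact, immutable snapshot the same way code depends on a pinned package version.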

The naming is a bit sad in that it conflicts with the patch sets manager: https://en.wikipedia.org/wiki/Quilt_(software)

I used to like this style of naming software with an allusive dictionary noun, but the sheer volume of new code is driving me to wish there were a canonical clearing house for naming, with agreed conventions even for the "given name". E.g. Quilt, which in documentation and prose reference is the most natural choice, paired with a real use handle carrying a context qualifier, e.g. Quilt_Data, plus a clearing-house website for reference and proposals, including short forms to be tested by popularity in use. Quilt_DM might be an alias to my preference, but I think it's worth the effort to take the time and confusion out of accidental conflicts like this.

Java has the right idea. Namespace by [reverse] domain name. It's sadly a tad verbose, though. There's high contention for short names, which is at odds with uniqueness.

Namespacing by reverse domain name is no panacea. Your company might change its name, might be bought, or might buy another company.

On the open source side, how much stuff was org.sourceforge that then had to move when SourceForge turned into a scummy operation?

Microsoft's CLR guidelines are a little less verbose but still run into the same problem with corporate name being permanently unique.

> naming software with a allusive dictionary noun

Perhaps <adjective> <noun>, <adjective> <noun> <verb>, or <adjective> <noun> <verb> <adverb>. Or <adjective> <adjective> <noun>. And so on.

Frog (Design, 200M ggl hits); Happy Frog (soil, copy center, 500k hits); Happy Frog Swimming (pool, 4 hits); Happy Frog Swimming Sideways (0 hits).

Quilted Data (250 hits).

But my impression is common nouns have trademark advantages.

Sorry to hear about the name conflict. We weren't familiar with the patch sets manager. If it helps to keep the quilt (data) command line tools out of your path, you can still run all the quilt commands from inside Python.

Quilt is an incredibly widely used tool. Many kernel developers use it as an alternative to git for patch management, and it's used by most distributions to manage their patches on top of upstream projects.

It's quite surprising. Quilt is a very common tool, e.g. for Debian developers: https://wiki.debian.org/UsingQuilt

You should probably rename the pip package and the CLI tool to quiltdata to avoid conflicts.

No worries, just make the name switch sooner rather than later. Since your target audience is obviously developers in general - to whom the patch tool is well known - it will most likely affect your business negatively if you don't...

How did this not come up during your market research?

Also, fantastic product! My only criticism is the poor name choice and lack of market research...

There is also a relatively well-known startup called Quid. They are doing something with data visualization. I first thought this was them.

Hmm. We were able to get the pip handle so didn't see major conflicts in our target space. Are there places in code/cli where we could name conflict with the patch sets manager?

I think it conflicts a bit:

  $ brew install quilt >/dev/null 2>&1 && which quilt
I have used it extensively in automation in the past. I don't know if it's a good idea to call it "quilt" on the $PATH. Maybe "dataquilt" or "data-quilt"? "quilted"?

When choosing a name, you should check the package manager index sites of large distros for potential conflicts. For example in Debian:


Quilt is a patchset manager used a lot in the Linux distro space (it was created by one of Linux's maintainers for his own use). Any global install of your tool will conflict with its existence since the utility names are the same; which one you get will depend on your PATH setup.
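
The conflict mechanics are just first-match-wins: the shell walks $PATH in order and runs the first executable named `quilt` it finds. A rough sketch of that lookup (the directory names and the `installed` map are illustrative stand-ins for a real filesystem):

```python
import os

def resolve_command(name, path_entries, installed):
    """Return the first hit for `name`, scanning PATH order.

    `installed` maps directory -> set of executable names, standing
    in for the real filesystem. Which `quilt` wins depends entirely
    on the order of `path_entries`.
    """
    for directory in path_entries:
        if name in installed.get(directory, set()):
            return os.path.join(directory, name)
    return None

installed = {
    "/usr/local/bin": {"quilt"},   # e.g. the data tool via pip/brew
    "/usr/bin": {"quilt"},         # e.g. the distro's patch manager
}

# Same binaries installed, different PATH order, different winner:
a = resolve_command("quilt", ["/usr/local/bin", "/usr/bin"], installed)
b = resolve_command("quilt", ["/usr/bin", "/usr/local/bin"], installed)
assert a == "/usr/local/bin/quilt"
assert b == "/usr/bin/quilt"
```

This is why renaming the CLI binary (while keeping "Quilt" as the product name) sidesteps the whole problem.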

> We built Quilt to bring package management to data.

No, you built Quilt to own the data. Package managers don't make users pay if they want their own package repositories.

> so that analysts can spend more time analyzing and less time finding, cleaning, and organizing data.

You should spend more time with data analysts then. You would understand that cleaning, filtering and preparing the data is actually part of data analysis. I would refuse to do any work on data that was preprocessed without having the exact list of what was done to the raw data.

Honestly all I see technically speaking in this project is a python program to download data frames... Woo-hoo. And for the business side, the now usual attempt to own a community of users and their data. Boring.

Namespaces are free. Public repos are free for unlimited data. Should we run a charity and also make private data free? :)

We understand that cleaning/filtering/preparation are key to the analysis, and we plan to support those operations as part of package construction. The question is whether or not that work should be done repeatedly, or once for the benefit of your collaborators.

You can store any kind of data in Quilt. Not just data frames. What's on Quilt today is just the beginning of what is possible.

Do you also dislike GitHub? We've opened up the client source and the community for as much free data as people can publish. It seems naive to expect that we wouldn't charge for anything ever. The users always and forever control their own data, by the way.

I'm not criticizing the price here. It is obviously totally OK to pay for data hosting.

> "Contact us to start Business or On-premise service." I DO reject the whole concept of having a "hosted-only" software. To me it makes the whole project useless. It means I cannot have my own private or confidential data. I cannot use it in my company, etc. More importantly, the data is not mine anymore. It's yours.

I cannot find any place on your website that explains what happens to the data once it is uploaded! As far as I'm concerned, it means I am giving you my data, for free, without any restriction on your side. You can resell it, modify it, rebrand it, prevent me from accessing it. Since there is no mention of how the data is stored, I also cannot know if the data is encrypted on your side, or have the capability to read everything that is uploaded.

We need to clarify the EULA and terms on the website. The data belongs to the users and we want to keep it that way. Quilt is not "hosted-only". The whole point of the on-prem install is that customers can run Quilt on their own infrastructure (it's Dockerized, etc.). We'll roll out a more formal on-prem solution as the project evolves. We're doing one thing at a time right now :)

> "The data belongs to the users and we want to keep it that way." Does it mean you plan to include in the EULA that Quilt will not parse, read, sell or otherwise use the data that users upload?

> "The whole point of the on-prem install is that customers can run Quilt on their own infrastructure (it's Dockerized, etc.)."

Yes, and on-premise is not possible at the moment, and on the website it is marketed the same as business use.

Anyway, I lost my cool a bit on previous comments, thanks for keeping yours, and good luck :)

> Package managers don't make users pay if they want their own package repositories.

Isn't that npm's business model?

This is a really impressive project. Do you think you would add data cleaning commands? How do you think you would handle datasets that are only available to academics or other restrictions?

On your front page you should have the owner/dataset name instead of just the dataset name so I wouldn't have to click through to find the owner name.

Data cleaning is so necessary. `build.yml` already supports a limited set of features (through pandas). In addition to custom data transformations, are there any "out of the box" cleaning functions you'd like, in the spirit of dplyr? We've looked at e.g. scikit feature for normalization, 1-of-n encoding, etc.

How about deduplication? Also table/dictionary lookup, string replacement, and regex replacement.

Got it. If you'd like me to ping you once we have custom build hooks: aneesh at quiltdata dot io. Have you tried Luigi or Bonobo for data cleaning?

No, I haven't. I think I just did it myself. I'll look into that.

We've been thinking of making a Python build file to make it easier to add richer transformations than the simple file->DataFrame that runs by default. We'd love suggestions and input on the types of transformations that would be most useful.
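
For what it's worth, the operations asked for upthread (deduplication, table/dictionary lookup, regex replacement) could look something like the following as build-time transformations. This is a plain-Python sketch of the semantics only, not the actual build-hook API, and the sample rows are invented:

```python
import re

def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def lookup(rows, column, table, default=None):
    """Replace values in `column` via a dictionary lookup."""
    return [{**r, column: table.get(r[column], default)} for r in rows]

def regex_replace(rows, column, pattern, repl):
    """Apply a regex substitution to one column."""
    return [{**r, column: re.sub(pattern, repl, r[column])} for r in rows]

rows = [
    {"country": "US", "phone": "555-0100"},
    {"country": "US", "phone": "555-0100"},   # exact duplicate
    {"country": "DE", "phone": "555.0199"},
]
rows = deduplicate(rows)
rows = lookup(rows, "country", {"US": "United States", "DE": "Germany"})
rows = regex_replace(rows, "phone", r"[.\-]", " ")
assert rows == [
    {"country": "United States", "phone": "555 0100"},
    {"country": "Germany", "phone": "555 0199"},
]
```

Doing this once at package-build time, rather than in every analyst's notebook, is the repeat-vs-once tradeoff discussed above.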

This looks pretty cool. Am I right to understand that this is providing a virtualized filesystem interface that dynamically loads the slices of data actually being accessed (but through a direct API rather than something generic like FUSE)?

Unrelated I found a typo on your blog. Search for "seriailize" on your "Manage data like source code" post.

Yes, essentially the quilt library abstracts the serialization (building packages) and deserialization (importing packages) so that user/client code doesn't need to include the mechanics of loading data (e.g., from a file system path or URL in S3).

I now wish there was a feature to make one's own comment less prominent yet still in thread: a self-downvote, in essence.

Because I took my first point in a direction away from the context, even though the reasons I wrote were the underlying reason.

I'm aiming at the extent to which data is susceptible to considerable performance improvements when you have the ability to align the physical store with the data structure.

Simultaneously, as I argued, punching holes in the operating system to get to hardware comes with costs, and the cost of losing backups is not trivial.

This is why I asked about the Weka FS in my comment, because if you were atop a system like that you would have theoretical redundancy already.

I glossed over my dubious omission that this is another FS, and not the hardware pass-through my comment was about.

However the first point of contact in any case is going to be a requirement to partition a store for pass through access, taking my comment literally.

That's a great path to disaster and not what I meant, but I omitted the required explanation:

If you had commit access to your FS code, you could accommodate the pass-through request to bare metal, but you'd just lose the point of the FS. I instead envisioned the FS providing you with hardware layouts that are suitable for the data you need running fast. I imagined an API to request that "on disk" format, with the FS able to indicate potential changes to balance the requested layout against the performance of FS features such as cluster latency, the computational efficiency of RAID schemes, and other features like T10 and the management of data overall. A scale between handing over the hardware (and merely alerting management consoles to the existence of an opaque store) and agreeing to the tradeoffs the full features provide, where the application is placed in authority to size performance at a level of reasonable cost.

I'm thinking of serialisation that's part of a larger system which provides reliability features in other components.

But the flexibility could exist to allow for flow under exceptional load, if doing so was critical to the performance of your overall system.

If you are buffering requests at a point where you have to pass the stream to the next component, which is responsible for acknowledging the requests and can resend if no acknowledgement is received, then the opportunity to trade normal FS behaviour for raw speed is possible under reported conditions known to the management instruments.

I'm assuming that the FS is not going to let you write data faster than it can function for jobs like replication, and then limits would indicate that discarding the FS's roles is not always good value.

I haven't seen any sign of the purported advent of intelligence in file systems for the sort of thing I am interested in here.

But certainly it must be a likelier possibility to cooperate with new FS companies than with the Windows design team and the weight their legacy brings.

I'm out of touch, but NTFS provides, or provided, interfaces for e.g. sparse file layouts, and there's nothing new in my central proposal. I've merely speculated how far it might be possible to go. And there's an entire field of data structure optimisation which can only be done in a worthwhile way with the FS's cooperation. To just leave it at hoping for nice features in FSs, or else writing your own driver, seems like a crude either-or proposition in the present times.

I hope this is not too tangential, but I have been thinking about the best ways to make use of direct access to non-volatile memories, ignoring the block and driver level and ordering the layout in your code. I suspect that your project is one which could take that direction very usefully.

I would very much like to hear your reaction to this suggestion, and I would also like to ask whether you have looked at the work at weka.io for their take on the convergence of storage. I increasingly like the idea of having the ability to use the hardware and have the management directed at the application level, where the developer is able to use intelligent measures and policies to tune their systems in a manner discrete from the OS. Close cooperation will enable the collection of data to provide a valuable resource for administrators and directly benefit the pace of production development, by providing a comprehensive universal instrumentation context.

I know this is almost arguing about the reason why databases should use their own on disk format and there is long history of the tradeoff involved with that.

But the trouble I have with the current storage space is that the equity in the file systems is not flexible enough for the kind of smaller mixed deployments I come into contact with in the lower small-business market. As an example, Ceph, or any FS which is a monolithic investment, means the most restricted resource is going to be management time. Additional FSs are difficult propositions for small shops. But the idea I'm looking for is that the application layer should be responsible for storage management and performance tuning, following best practice set by the software publisher and learned from collected instrument data.

I think the epiphany of the general operating system is nigh or even last century.

Nobody is able to use large software programs in a turnkey way, making assumptions about the OS environment. I hand-wave plenty saying that, but I can follow up flippantly to add that a friend's experience providing contracted management to small businesses is not atypical in my experience: he joked that he loves Linux because it means he gets a clean install and nobody likely to know how to mess it up.

In the contexts where I see very little leveraging of OS capabilities, particularly in the Windows Server user world, it looks like a lot of wasted effort and license expense.

I beg forgiveness in advance for this facetious illustration, but in conversation with a small-business web developer recently, I cited the example of Plenty of Fish and rhetorically asked if he knew it was a one-man gig, on Windows and IIS. He was unaware of this, so I teased him that he would be forever in his first money rounds and hiring if he had accepted a similar bet on building such a dating site, if he kept on reading HN so much... My joke is off-colour, sorry, but I wanted then, as now, to make the point that it has become all too accepted to automatically start with a complete development stack and seek advantages in the customization and deep power that leveraging highly experienced professionals brings. I worry about whether we all just do too much of this, and it's time to review the situation more broadly than my rotten humour alludes to, because the problem, if it is a problem, is much wider.

I agree wholeheartedly with this statement: "application layer should be responsible for storage management and performance tuning". I would take it one step further and say that storage should be virtualized in a high-performance way. HDF5 was the old way, Parquet/Avro is the new way, and something like Apache Arrow is the future. We are currently focused on efficient, cross-platform friendly ways of serializing [columnar] data and have chosen Parquet. Optimizing to the level of volatile caches, though, is probably not something we're ready to tackle. The performance gains to be had by eliminating parsing and lazily loading data (in the spirit of DMA) are absolutely huge. And good file-formats accomplish that. Moreover, the amount of time and performance lost to moving data around is staggering. https://weld-project.github.io/ and https://arrow.apache.org/ sketch the solution: 1) optimize the entire computation graph to minimize data materialization; 2) have a canonical in-memory representation that can quickly serialize results to a variety of clients.
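
The "minimize data materialization" point can be illustrated with a tiny fused pipeline: composing the transforms before running them means no intermediate collection is ever materialized, which is the spirit, much simplified, of what Weld-style optimizers do. A toy sketch, not the Weld or Arrow API:

```python
class LazyPipeline:
    """Compose element-wise transforms; materialize nothing until run()."""

    def __init__(self, source):
        self.source = source
        self.fns = []

    def map(self, fn):
        self.fns.append(fn)
        return self          # chainable, still lazy

    def run(self):
        # One pass over the data: intermediate results live only as
        # scalars, never as whole intermediate collections.
        out = []
        for x in self.source:
            for fn in self.fns:
                x = fn(x)
            out.append(x)
        return out

result = (LazyPipeline(range(5))
          .map(lambda x: x * 2)
          .map(lambda x: x + 1)
          .run())
assert result == [1, 3, 5, 7, 9]
```

An eager version of the same two maps would allocate a full intermediate list between them; fusing the computation graph avoids that allocation entirely, and the saving grows with the data.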

I have not heard of weka.io but will take a look.

As to your slicing question, yes. Data is lazily loaded. With Parquet as our data store we can do even more (but haven't yet): load only the columns referenced.
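
The win from column pruning is easy to see with a toy columnar layout: each column lives in its own store, so a query that references two columns touches only those two. Illustrative only; Parquet's actual on-disk layout (row groups, pages, encodings) is more involved:

```python
class ToyColumnarStore:
    """Columns stored independently, loaded only on request."""

    def __init__(self, columns):
        self._columns = columns          # name -> list of values
        self.loaded = []                 # track what was actually read

    def read(self, names):
        """Load only the referenced columns."""
        self.loaded.extend(names)
        return {n: self._columns[n] for n in names}

store = ToyColumnarStore({
    "city":  ["Oslo", "Bergen"],
    "pop":   [634293, 271949],
    "notes": ["...", "..."],             # never touched below
})

subset = store.read(["city", "pop"])
assert subset == {"city": ["Oslo", "Bergen"], "pop": [634293, 271949]}
assert "notes" not in store.loaded       # unreferenced column never loaded
```

In a row-oriented format (like CSV) every byte of every row must be parsed to answer the same query; that is the performance gap columnar formats close.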

I dream of being able to tell the file system to be ready for a range of requests when they are first anticipated.

In an extreme, I think about getting a NV store to load the pointers to the physical data in cache, and to not flush the cache until the context is released by the application thread. Going further into this reverie, anticipatory QoS for bandwidth is a personal desire. I'd even like to have a PCIe lane reserved, in my fascistic moments.

Now I'm regretting my wordy comments above, because the explanation I wrote wasn't needed.

But this dream I have to gain programmatic cooperation for data performance is surely not unique to me.

Maybe twenty years ago it made perfect sense to leave the task of resource sharing to the operating system and subsystems, but now not only do many people have high performance headroom to exploit, but the knowledge and experience of writing concurrent schedulers and load balancers is far more commonplace, just by virtue of the needs presented by the crunch expansion of the Internet.

The only point from my above comments that I think is under-appreciated by the end users of software, and worth approaching, is how much you can improve your software when you have a truly controlled environment. Talking straight through the operating system stack, you can obtain uncoloured measurements from which you can build optimal performance applicable to all installations. I may dream, but I grew up with the worshipping of vendor benchmarks as my bete noire, and I have been set against the waste created by supporting unrestrained variations ever since. That makes sense for a commercial operating system, but the opportunity to work with completely homogeneous stacks is something that ought to be recognised for its value potential by management. I'm convinced that this will be a critical commercial advantage for whoever first finds a solution that isn't service-only but customer-deployable.

If you're talking about lazy loading of data, for your R implementation you might want to (if you're not already) look into creating a custom dplyr backend that only loads data when needed (similar to the dplyr SQL backends).

Another wish of mine, is the reporting of the installation conditions on every individual installation.

It's a long time past, but I can't forget my experience in small businesses, where I had to disbelieve any reports about performance problems until I had full access, or even on-site immediacy, to get an idea of the circumstances.

I'm fed up with Microsoft getting the advantage of being the only one who collects metrics aggressively or at all.

The open source community should be the first to get the data out of customers from the production systems.

How can we do this?

I'm just feeling a personal sense of futility from my experiences optimising code as well as installations to sometimes minimal effect, bounded by the hardware budget. I certainly learned a valuable part of my skills in that way, but I have ever since felt sceptical how much of the development effort would be better allocated, if true installation performance instruments were reporting the whole user base.

(One assumes an R user will be well equipped, but I expect a broad spectrum of hardware, from students in India to multinational developers working with the latest generation.)

It's often true that the most impoverished users gain the most from optimised software, and the ethics of this result are impeccable. I'm simply asking for the concerted collection of instrument metrics to support optimisation efforts in open source software. Putting the authors, creators and hackers first is something that I wish was done as part of the process of promoting FOSS, and the provision of quality insights into how their work is used should, I think, be the basic foundation of our responsibility and gratitude for their work. Not to mention the improvements in our own work which will result. Surely this is not impossible to solve. I have often thought about a package manager taking snapshot performance characteristics and reporting to the developers by way of a public page update. But I've not even seen the idea anywhere else, and don't understand what gives...

Typo fixed. Thanks :)

Love the concept! Clear value proposition for anyone who knows both worlds (code + data). Look forward to giving it a try.

Also, if we want to produce our own data set for community consumption, what are your requirements, and what kind of payout could we expect?

To start with, we can add stars (pay with prestige). Getting more into science fiction--but very possible science fiction--we can put data on the blockchain and let people transact. The data owner would get the lion's share of the transaction.

Hmm. I can see some ways for this to work.

Awhile ago I started a project converting government voting records (both elections and congressional) into a database. Would that be interesting to you?

Here's an idea: You could also host a data bounty program, and/or start a grant program for the production of these data sets.

Still missing an answer to "what are your requirements"? How do you verify data quality, etc.? What format(s)?

We support arbitrary data formats in that Quilt falls back to a raw copy if it can't parse the file. On the columnar side (things we convert to Parquet) we support XLS, CSV, TSV, and actually anything that `pandas.read_csv` can parse. We use pandas and pyarrow for column type inference. We want to add a "data linter" that checks data against user-provided rules, and welcome such feature requests on GitHub or in our Slack Channel.

Currently, we support two "targets": a Pandas DataFrame and a file. Files can be any format. The Quilt build logic uses Pandas to read files into DataFrames, so any format Pandas can read should work in Quilt to create a DataFrame node.
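
As a sketch of what the column type inference mentioned above does: try the narrowest type first and widen on failure. This mimics the general behaviour of CSV readers; it is not the actual pandas/pyarrow algorithm:

```python
def infer_column_type(values):
    """Infer the narrowest type that fits every value in a column.

    Tries int, then float, then falls back to str, roughly the
    order a CSV reader's type inference works through.
    """
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)   # raises ValueError if it doesn't fit
            return name
        except ValueError:
            continue
    return "str"

assert infer_column_type(["1", "2", "3"]) == "int"
assert infer_column_type(["1.5", "2", "3"]) == "float"
assert infer_column_type(["1.5", "n/a"]) == "str"
```

A "data linter" of the kind mentioned upthread would sit right after this step, checking the inferred schema against user-provided rules.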

Do you plan to curate the free plans? If not, how will you prevent abuse? (thinking in terms of both the service and its users)

Yes :) Happy to discuss in detail if you have specific types of abuse in mind. Our first line of defense is to get help from the community through downvoting useless content.

The main thing is how you are able to offer unlimited disk when anyone can start uploading data. Some users might upload very large amounts of innocent data; others might simply abuse it as a sort of data backup system. More nefarious users might even try to distribute malware.

Could you tell us how this compares with Synapse (https://www.synapse.org/)? They've been doing this for a long time and have a large presence in computational biology.

I know only a bit about Synapse. It seems like they have some valuable data. I think the biggest differences are in the culture and user experience. Quilt gets its DNA from the open/community-driven cultures of GitHub and npm. We want to make it extremely light-weight for anyone to push and share data packages. Synapse frankly seems to have a lot more features. We want to compete on simplicity.

Does Synapse offer server-side filtering? That's something we're thinking about.

> We want to compete on simplicity.

Bravo, this is an excellent strategy. I've been using NPM packages to share small datasets for a little while due to the simplicity of distribution.

Would be great if you also had the schema / file format information as well on the site.

Agreed. We're considering doing more along these lines, such as generating the Hive DDL. So you would like to browse the schema, e.g. under Contents?

Awesome! Yup would like to browse the schema.

What type of representation would be most useful for schema information? Avro?

That or just a basic visual editor is fine too. Goal is to just know what I'm getting before getting it in more detail.

are you aware of datapackages [1]? do you plan on opensourcing the backend components?

[1]: https://github.com/frictionlessdata

Frictionless is interesting but 1) doesn't handle serialization (which is essential for performance); 2) requires users to hand-annotate schemas (we think schemas should be auto-generated whenever possible).

Yes and yes. Do you use frictionless data packages? If so, what do you like and not like? We've looked at their specs and have thought about ways we could integrate. We'd love to hear your suggestions.

Any way your web page could be made viewable without requiring Javascript?

I disable it in my web browsers because of security and privacy concerns, and it would be great if I could just read about your project without it.

Ah, that would be useful. In the meantime our docs are browsable without JS https://docs.quiltdata.com/. The articles linked at the bottom of the docs landing page are also legible without JS. Unfortunately the docs menu is not available without JS. I will look into fixing these issues. Email is feedback at quiltdata dot io if you have questions :)

The menu on https://docs.quiltdata.com/ isn't visible if you have JS disabled. I had to disable CSS to be able to navigate.

Is there planned support for any other languages than Python?

Yes! We'd like to get to R and Scala next and hopefully C++ soon. We'd love help from open-source collaborators. Our team is definitely strongest in Python.

I'd definitely be interested in looking into this but I'm on an R team. I didn't see anything on how to contribute in another language or an API to hit. Will there be information soon?

Hi. R is really important to us and we want to add support (probably through SparklyR). If you email me I can add you to our Slack channel and we can talk through extending Quilt (aneesh at quiltdata dot io). There is a sliver in our docs but it's not complete: https://docs.quiltdata.com/basics.html

The hooks for extending build targets are here: https://github.com/quiltdata/quilt/blob/4aa6897f9e33349b7778...

Again it's a bit raw but we're here to help make R support easy.

Is this only cloud hosted? I have been looking for this exact solution, and I'm very excited to try yours, but banks and other institutions will not tolerate storing their data in the cloud. It must be totally local without ever needing an outside network. Can you provide that?

We are rolling out on-prem, where the customer runs the package registry on their own infrastructure. The setup process takes a bit of customization depending on your environment. If you have Docker containers it's easier. If you email me I can get you rolling: aneesh at quiltdata dot io.

This is pretty neat! Is there an API or anything, so we can write support for other languages?

There's a RESTful API that's used by the Python client, though it's not documented yet. If you're willing to go through the code, you can find /api/... calls in https://github.com/quiltdata/quilt/blob/master/quilt/tools/c... .

We're planning to write proper documentation soon.

Which languages are most interesting to you? We wrote the client with an eye towards supporting R and Scala. A PR or FR on GitHub would be ideal. We also have a Slack channel where we can support you if you want to tackle adding language bindings to Quilt. feedback at quiltdata dot io.

Ruby and Go, personally.

The package metadata is stored in JSON so that should be pretty easy to access in either Ruby or Go. Tabular data is stored in Parquet by default. Do you know of any good libraries for reading and writing Parquet in Ruby and Go? Do either of those languages have a DataFrame-like class/struct?
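
Since the metadata is plain JSON, any language's standard library can walk it. A sketch in Python for shape, trivially portable to Ruby or Go; note the metadata structure below is invented for illustration and is not Quilt's real schema:

```python
import json

# Hypothetical package metadata; the real schema may differ.
raw = """
{
  "package": "examples/wine",
  "version": "1.0.0",
  "contents": {
    "quality": {"type": "table", "format": "parquet"},
    "readme":  {"type": "file",  "format": "md"}
  }
}
"""

meta = json.loads(raw)
# List just the tabular nodes, the ones a DataFrame-like type would back.
tables = [name for name, node in meta["contents"].items()
          if node["type"] == "table"]
assert meta["package"] == "examples/wine"
assert tables == ["quality"]
```

The harder part for Ruby/Go bindings is, as noted, Parquet reading and a DataFrame-equivalent type, not the metadata.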

This could be a big deal. Fantastic concept and great implementation.

Thanks. Where do you see it going in 2-5 years? Sometimes outsiders see things about the future that we haven't thought about :)

What's your privacy policy for private data?

Only the owner and designated collaborators can view it. We offer private data in the cloud (S3) and are rolling out on-premise in case you want to run Quilt on your own infrastructure, in which case you control blob storage. Does that answer your question?

It's outrageous how little tooling support there is for version control in data compared to code. Every mainstream database forgets history on update, doesn't support distributed workflows, doesn't support commit ids as first-class objects, and lacks most other basic features of VCSs. Databases just aren't a solution to version control.

I can't imagine a future where we don't treat data version control as a necessity in the same way as code version control. I hope Quilt can fill the much-needed role of "GitHub for data".

And most databases can't tell you who put that bad data into the database.

Nearly every place I've worked at, we've had to try to create our own version history and audit information. This is one of the examples I use when people start talking about how we have a shortage of engineers.

No, we have a shortage of good tools. Nobody should have been writing their own data versioning code in 2007, let alone 2017.
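
The hand-rolled version-history code being lamented here usually amounts to something like the following append-only audit log. A minimal sketch of the pattern teams keep reinventing, not any particular product's design:

```python
import datetime

class AuditedTable:
    """A dict-backed table that records who changed what, and when."""

    def __init__(self):
        self._rows = {}
        self.log = []                      # append-only history

    def put(self, key, value, user):
        self.log.append({
            "key": key,
            "old": self._rows.get(key),    # None on first insert
            "new": value,
            "user": user,
            "at": datetime.datetime.utcnow().isoformat(),
        })
        self._rows[key] = value

    def blame(self, key):
        """Who last wrote this key?"""
        for entry in reversed(self.log):
            if entry["key"] == key:
                return entry["user"]
        return None

t = AuditedTable()
t.put("price", 10, user="alice")
t.put("price", -1, user="bob")             # the bad data
assert t.blame("price") == "bob"
assert [e["new"] for e in t.log] == [10, -1]
```

Every team writes a variant of this (triggers, shadow tables, event logs) precisely because the database itself offers no blame, no history, no commit ids.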

I just joined a company that solves this very problem: https://www.delphix.com/

What you described is literally what we offer -- the ability to go backwards and forwards in time and create branches, bookmarks, and refreshes with very little overhead. [Edit: This is all for databases that by default, as you noted, don't do any of this stuff.]

We aren't (yet) super well known, but we've got a number of things in flight that should change that :)

And if you're curious how this magic works, the hint is: we have a number of really smart ZFS contributors on the engineering team.

How does it work? Is it filesystem snapshots? I know it's unreasonable to expect, but do you support forking/merging? What about incremental or logical replication?

Filesystem snapshots are part of it. Forking yes. Merging no. You can configure it to ingest incremental backups as well as realtime transaction logs, which means you can go backwards/forward in time all the way down to the individual transaction. It's quite cool!

A lot of the time, people get confused by source control & think, “I need to refer to this data as it was at a specific point in time, I’ll use git”. Then they get all the overhead of a DVCS without getting most of the benefits (blame, bisect, branches, merge resolution) when what they wanted was versioned releases of data packages…

I've often thought about this problem space. Imagine you have a database with all the bells and whistles you suggest. At first, it's great. But at some point, when you start experiencing growth (compute/storage) pressure (many petabytes) all of this metadata adds up.

Does it remain cost effective at scale?

All the source code in the world is a drop in the bucket compared to the raw data collected by a large business.

If the user of the data insists on a time horizon of 'since our founding', then yeah, there's an opportunity cost to keeping all of the data. The one people usually miss is that it takes log(n) time to update or insert a record because you have to update all of the indexes. So when you have 1000 times as much data, every insert takes 10x as long as it did at the beginning. Or you use partial indexes and it gets maybe 2 times slower which the users probably won't notice.
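The log factor is easy to ballpark: a B-tree index with roughly b keys per page needs about log_b(n) page touches per update. A toy sketch (the branching factor is a made-up round number):

```python
# Rough depth of a B-tree index: how many levels must be touched
# per insert/update, as a function of row count.
def btree_depth(n, branching=100):
    depth = 1
    capacity = branching
    while capacity < n:
        capacity *= branching
        depth += 1
    return depth

for n in (10_000, 10_000_000, 10_000_000_000):
    print(f"{n:>14,} rows -> depth {btree_depth(n)}")
```

The growth is gentle per index, but it multiplies across every index on the table, which is why heavily indexed tables feel it.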

So people just give up and exfiltrate all of the data to another server to run their reports on. But as a user I still want to be able to figure out 'did I do that thing in February, or was it March?' fairly often.

One thing I've always wished database replication systems did (universally) was allow you to run different indexes on different replicas, so you don't have to export the data at all. Your insert time would still be a function of network delay + worst case index update time, but you could segregate traffic based on kind and continue to scale to a fairly large company before anyone had to mention data warehousing or data lakes.

Well, there's human-generated data, and there's computer-generated data. The former is tiny but more dynamic, and where we want all of these nice creature comforts. The latter can be large, but workflows tend to be static so there's less need for these features.

As a thought, keep an eye on DBHub.io:


Still in early dev stages, and I'm on holidays for a few weeks atm, but we'll have something useful online in a few months. Supporting forking, branching, merging, etc.

Have you seen data.world? "GitHub for data" is basically their entire model.

Yeah. Important differences: Quilt offers unlimited public storage, users do not need to log in to use public data, offers on-premise solution, handles serialization, works offline with local builds, etc. You'll see more differentiation as we roll out new product features.

How is this different/the same as the DAT project (https://datproject.org/ and https://github.com/datproject)?

Dat is a distributed transport layer for raw data. Quilt is a centralized (your infrastructure or ours) transport and consumption layer for virtualized data. As such we'll be able to, for example, run efficient queries across all of Quilt, allow users to import data the same way (no data prep scripts) across a variety of platforms, etc.

I recently encountered https://data.world/datanerd/inc-5000-2016-the-full-list and was impressed by the combo of a pandas-friendly client library & centralized online tier with bells & whistles. Feels closer to that direction.

There's obviously friction in this space -- this weekend I'm playing with databricks spark notebooks and public data on S3 and the data prep will still be annoying -- so I'm looking forward to design innovation in it.

I'd like to chat more about this. Quilt uses S3 and I think we could make the process of getting data into Databricks much simpler. Drop me a line if you'd like to discuss aneesh at quiltdata dot io.

I'm very excited. I want to use this to version ConceptNet's raw input and its built data, all of which is public.

So I can assume this isn't going to be afraid of gigabytes, right? I've seen services before that want to be a repository of data, and I try to upload a mere 20 GB of data and they're like "oh shit nevermind". Even S3 requires it to be broken into files of less than 5 GB for some inscrutable reason.

Gigabytes should be no problem. We have yet to implement multi-part uploads though, so if your network gets interrupted you will need to re-upload the package from scratch. By design when the upload is in progress your client is talking directly to scalable blob storage. If you run into any issues, please email me: aneesh at quiltdata dot io. We also want to add server-side filtering for giant packages (where `quilt install` doesn't make sense). I can add you to our Slack channel if you'd like to discuss further.

I don't think you have had to break files up yourself for a long time on S3. You can treat files up to 5TB as a single object. I think you have to do a multipart upload but that's probably not a bad idea anyway.

Ah, right. I'm remembering from when I was trying to use git-annex to version the data, which was a problem for multiple reasons, including that their S3 driver didn't use multipart uploads.

This makes me think of http://www.pachyderm.io/. Although Quilt seems to be more like github for data, whereas Pachyderm is more like git for data.

Pachyderm was the first thing I thought of. In fact there should be opportunities for collaboration here.

Pachyderm looks brilliant for high-scalability parallel data processing, and the versioned data part is a way to not just maintain the history of the data, but also avoid reprocessing of data that hasn't changed since the previous run.

Hi, I'm one of the creators of Pachyderm. We've been talking with the Quilt founders about various ways to work together and think there are some really exciting opportunities!

We really like pachyderm and know the founders. Quilt is zero-config focused on storage and versioning, pachyderm is more focused on on-prem and compute (as a Hadoop replacement).

It looks like a plain html page but requires JS to view anything except: "Please enable JavaScript to use this site." What a wonderful time to live in.

Anyway, do I get this right: they expect users to be experts in data analysis, yet unable to load data into whatever software they use? They want me to share my data by offloading it into their walled garden, accessible only via their service? If I wanted to share my data, wouldn't I rather use something more accessible?

> If I wanted to share my data, wouldn't I rather use something more accessible?

Still, it's good inspiration. Maybe I'll make a github repo with my city's open datasets loaded into python.

There are some important differences from git: https://news.ycombinator.com/item?id=14772036

It's not about people being unable to load their data. It's about accelerating the loading with serialization, and about whether people want to focus on data cleaning, or have the cleaning done once and then available for posterity.

Quilt is, in my view, as open as git or GitHub. The de/serialization code is all open source, and uses an open format (Parquet). Parquet is accessible (and more efficient than text files) for things like Presto DB, Hive, etc.

> Quilt is, in my view, as open as git or GitHub

Can I run something that hosts a quilt repo on my own server?

Yes. In progress and the first on-prem installs are up and running. Happy to chat further: aneesh at quilt data dot io.

I know git is not great with binaries, but wouldn't just a git repo be a better start for what you want?

This looks like a cool project -- always glad to see new tools for statistical collaboration and reproducible research.

How does this compare to what data.world [1] is doing? They recently released a Python SDK [2] as well.

[1] https://data.world/ [2] https://github.com/datadotworld/data.world-py

Hi. Key differences from data dot world: https://news.ycombinator.com/item?id=14792143

I think it was a really interesting (and smart) choice to convert to Parquet format. Columnar storage is so much more efficient, and working with data in Parquet is pretty fast using the engines they mention (Apache Spark, Impala, Hive, etc.).

I actually had been thinking about Parquet as a component of ETL, and if it might be possible to make ETL many times faster by compressing to Parquet format on the source and then transmitting to a destination - especially when you're talking about limited bandwidth situations where you need entire data sets moved around in bulk.

This looks really nice for sharing public data sets, but I wish that there was a better public non-profit org running indexes of public data sets.... I guess if something like the semantic web had ever taken off, then the Internet itself would be the index of public data sets, but it seems like that dream is still yet to materialize.

Once the interfaces are mostly settled, we plan to open-source the server so that other organizations can run Quilt registries. If you know of non-profit data-indexes that Quilt should work with or organizations who might be interested in running a Quilt registry, please let us know.

Code for America & civic tech orgs would be quite interested, I would imagine.

FYI: I've built a datapackage manager called datapak in Ruby [1][2]. datapak supports the tabular data packages (.csv with .json schema) from the Frictionless Data initiative (by the Open Knowledge Foundation). All open source and public domain. See some examples such as the Standard & Poor's 500. By default the data package gets auto-loaded from .csv into an in-memory SQLite database for easy querying etc. Thanks to ActiveRecord you can also use PostgreSQL, MySQL, etc.

[1] https://github.com/textkit/datapak [2] http://okfnlabs.org/blog/2015/04/26/datapak.html

Very interesting, I've heard of the Open Knowledge Foundation [1] but wasn't aware of the Frictionless Data Initiative [2]. Looks like it's complementary to Common Workflow Language [3]

[1] https://okfn.org/ [2] http://frictionlessdata.io/ [3] http://www.commonwl.org/

I think you're missing a trick with the pricing. My guess is the real money will come once data is treated like a commodity. So the big, big money will be in brokerages and exchanges.

Paying flat fees for access to repos is fundamentally thinking about the problem incorrectly.

We charge business and on-prem users in TB-sized blocks. So that part is variable cost, not flat. And we sell user seats in blocks of 10. What else should we be thinking about? We want to be fair and also price in a way that encourages sharing behind the firewall (e.g. shouldn't require manager approval to add every new user).

> What else should we be thinking about?

My feeling is brokering. Consider the market for wheat where there is pricing based on supply and demand. There are futures, options, etc.

Consider a NYSE for data. Why host the data? Be a discovery service both for the price and for brokering of access.

Data should not be priced on its size to store/transfer. That is leaving huge money on the table. It should be priced based on what people are willing to pay for it.

Why not allow someone to pay for the option of exclusive resale rights to some weather data service? Then allow someone to make a profit off their ability to resell that data, etc.

You may not be well placed to do that now but someone is going to. And why would I go to you hoping to find specific data when I can go to a market full of data, full of data re-sellers, etc.

Spot on. I get where you're coming from. The value of data is whatever the buyer and seller agree upon. I was mostly talking about the value/price of the service.

I think the whole project is doomed: the technology is trivially clonable and the backend is a thin API over S3. Their best bet is an acquisition by Amazon etc., to become e.g. "Elastic Data Packs".

I don't think that trivial tech means they are doomed. GitHub, for example, is basically a GUI around git with a social aspect. Many others have implemented the same features: GitLab, Gogs, Bitbucket, etc. Yet GitHub has not failed. It's all about the execution.

GitHub solves a lot of performance bottlenecks and codified forking. They didn't write an app or client bindings, which it seems Quilt is giving away. It sounds to me like Quilt has a great interface to a closed hosting platform, and they hope enough people won't mind maybe paying to access their data in the future. Hence I think a cloud provider is the most probable acquirer.

edit by ‘doomed’ I mean “hefty discount on employee stock options/wouldn’t put my own money into it”

What does this do that cannot be done with git or similar software + data stored in some standard format?

Four things: serialization, virtualization, querying, big data (even Git LFS isn't super performant for large files). Quilt actually transforms data into Parquet and wraps it in a virtualization layer so that the data can be injected directly into code. Efficient querying is a function of the serialization.

By contrast, GitHub is a blob store, it doesn't transform the data either for serialization or for virtualization.

Is there an option to be able to sign up using Github? That would make life much easier for me.

GH sign up would be useful. If you email us I will ping you when we add it: feedback at quiltdata dot io.

I've downloaded Quilt using pip. Is the login more for cloud storage of your own data and projects, similar to GitHub? Maybe clarify that point on the website, because I'm not too sure what I would be signing up for. Thanks again, and love the service.

The login allows you to push packages. If you only want to consume public packages: no login required.

Thanks for releasing. Looks useful and aligned with several projects I've worked on.

The first thing I looked for was a canonical package / resource specification in build.py. Any chance supporting Frictionless data resource spec for interop?


As Kevin mentioned, we can extend support to Frictionless (and are accepting PRs on GitHub :). The thing we didn't love about Frictionless is that it requires the user to fully specify the schema. We take a slightly more automated approach: https://docs.quiltdata.com/make-a-package.html https://docs.quiltdata.com/buildyml.html

I think we could generate a frictionless schema pretty easily...

We've definitely been looking at that! Are you using frictionless data packages now?

I am a scientist who sometimes publishes data sets with academic papers, and this looks super useful both as a tool and as a potential publishing best practice. Currently making do with HDF5 and figshare. One necessary feature for academia would be the ability to assign a DOI to a given version of data. Is it feasible for Quilt to have such a feature?

We'd love to include DOIs for each version of datasets. I think it's feasible for us to do that, but we haven't scoped out how hard or expensive that will be. In the meantime, if you have a way of creating DOIs from URLs, creating a package version will give you a permanent URL for a version of a dataset. If you have a recommendation for how to implement DOI creation, please let us know. Thanks!

[disclaimer, work for digital science but not figshare]

Figshare currently has versioned dois, so you can refer to a specific release, and the API will let you download each one. I might put some small tooling around this as I release data there myself.

To add to the feature requests: for academia, trust that the data will survive the company going away is important to me.

Not to be confused with http://quilt.io/

Huh, I was thinking of https://linux.die.net/man/1/quilt

Basically the sanest way of maintaining patch sets (multiple patches -> quilt, get it?). The logic of using "quilt" as the name for anything that doesn't have to do with patches baffles me.

Awesome idea. Have you thought about a marketplace perhaps? Say for instance I create a package with all the cities in the world or all species of canine, host it in a marketplace and others can buy that data package to use in their own projects.

Stay tuned :)

Hey, website did a good job of explaining what you are within like 2 seconds. I like

Wow - this looks like it would be really useful for us, and fits perfectly with our existing processes. I am building out the data analytics function on the Internal Audit team at Uber, and one of our challenges is that we have to pull and manage data from different business systems, and be able to track which version of a data set a report/analysis was run against.

It would be really cool if quilt could generate documentation for datasets, even if it was just column names/types. One of the issues we have is keeping track of all of the data "assets" people have pulled or created.

We're definitely planning to make column names and types more easily accessible and searchable. It'd be great to learn more about what information and metadata would be most useful to you. We also have an on-prem version we're rolling out with a couple of pilot customers in case that's helpful.

This seems like a great way of publishing public datasets. However, as someone who works on a computer vision startup I don't think I could really use this. In my work data annotation, visualization, and versioning cannot be easily separated. The effort we would need to put in to use quilt might be better spent building a simple versioning system on top of our current data infrastructure.

Where does it break down? Quilt can package and version directly from in-memory objects so if you are working in Python (more languages planned) you can package as you go and include any dependencies?

This is great! I was looking for something like this for a while.

You guys should make the search bar a little more prominent. Took me a while to find it!

Done. You should see a more obvious search bar on the next push. We'll also make search case insensitive :)

What's the difference with CVMFS[1] and Nix[2]?

[1] http://cernvm.cern.ch/portal/filesystem [2] http://nixos.org/nix/

Those are both interesting projects. Quilt has a rather different emphasis, though. The difference from CVMFS is that Quilt is a full set of services around data (build, push, install) whereas CVMFS is just the file system. In the big data community Parquet, which we use as a virtualization format, has far more traction than CVMFS. Nix is specialized for software package management, so the distinctions between Quilt and git apply: https://news.ycombinator.com/item?id=14772036.

I met one of your founders at an event recently. Very impressive product! Good luck to you all.

The workflow I would have imagined for versioning data is:

1) Load original data from source into quilt

2) Do transformation

3) Commit transformations to quilt, with commit message

4) Run experiment

5) Do new transformations

6) Commit to quilt

7) Run experiment

Rinse and repeat.
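That rinse-and-repeat loop can be sketched generically (a toy content-addressed store to illustrate the idea, not Quilt's actual API):

```python
import hashlib
import json

commits = []  # append-only history: each entry pins an exact data version

def commit(data, message):
    # Content-address each snapshot so a hash identifies an exact version,
    # the same way a git commit id pins an exact tree.
    blob = json.dumps(data, sort_keys=True).encode()
    digest = hashlib.sha256(blob).hexdigest()[:12]
    commits.append({"hash": digest, "message": message, "data": data})
    return digest

raw = [{"city": "berlin", "pop": 3_645_000}, {"city": "paris", "pop": 2_161_000}]
commit(raw, "load original data from source")

# Transformation step, committed with a message; an experiment can then
# record the hash it ran against.
cleaned = [{**row, "city": row["city"].title()} for row in raw]
commit(cleaned, "normalize city names")

for c in commits:
    print(c["hash"], c["message"])
```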

Looking at the video and documentation, this workflow is not emphasised at all; they suggest instead that edits to data should be saved as a new package.

You can absolutely edit in place and that will go into `quilt log` for the package--as long as you are the package owner. Our docs were a bit confusing on this point. I just updated them: https://docs.quiltdata.com/edit-a-package.html

We definitely imagine saving edits as creating new versions of the same package. The most common pattern we've heard is adding more dataframes or files to a package as new results are generated. But, we can certainly imagine other transformations.

I would love if someone did this with http://realm.io, so that the data could be "live" and multiple users could collaborate on it in realtime.

We've talked about this. Though streams (e.g. Apache beam) are closer to where we think realtime data is going. It would be possible to wire Quilt to something like firebase to get realtime behavior... Happy to brainstorm other solutions: aneesh at quiltdata dot io.

Hey, this sounds really interesting and I'd like to play around with it. However, I'm a novice and run into the following issue:

>>> examples.sales


No idea what a DataNode is so am struggling to actually see the data! Any tips?

Is there an example in the docs that just shows `examples.sales`? If so please let me know and I'll fix it. I searched and couldn't find such an example (though maybe, through the "magic" of Chrome's service-worker, got an old version of the website). As mentioned by Kevin `()` or `_data()` is what you need.

try: example.sales()

it's shorthand for: example.sales._data()

For the future, learn to use dir().
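For example, with a toy stand-in for a package node (hypothetical class, not Quilt's actual implementation):

```python
class DataNode:
    """Toy stand-in for a package node, just to demo dir()."""
    def _data(self):
        return [1, 2, 3]
    def __call__(self):
        return self._data()

node = DataNode()

# dir() lists everything; filter out the dunder noise to see what
# the object actually offers.
methods = [n for n in dir(node) if not n.startswith("__")]
print(methods)   # ['_data']
print(node())    # [1, 2, 3] -- calling the node returns the underlying data
```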

Any thoughts on adding DOIs? It's a complex subject wrt versioning, in particular (new DOI per version? How to keep track?). It would help tremendously with the academic community; for the bean counting.

The package name + hash is an implicit DOI. What if we added web support for it, so that users could link to https://quiltdata.com/packages/USER/PKG?doi=SOME_HASH ?

Yes, but it's not globally recognizable. That's why DOIs are standardized through ISO: https://www.doi.org/. Internally you could implement a DOI->hash mapping, but a Quilt hash isn't going to help in the reference list of a paper (if you're lucky you can copy'n'paste it). How do you know where to go? What happens if your package organization changes internally, and so forth?

This is more than adequate. DOIs are simply redirects. It's up to the data owner to point the DOI at whatever resource contains the data. If a DOI is registered, it can be pointed to the quilt URL.

You can take it a step further and either integrate with a DOI provider or become one yourself and integrate the registration process within your api or create command line tools.

Good suggestion! From what we've heard from academic users, they'll want a DOI for a specific version, e.g., data from a particular paper or journal article. Anything else we should watch out for?

It's problematic when data publisher != data user/paper writer. I'm not familiar enough with DOI minting, and therefore don't know what issues large-scale DOI generation for minuscule changes in the data might bring. Ultimately, if I make data openly available, the worst case is that every change to the data requires a new DOI, as I don't know how many people have downloaded earlier versions and not published on them yet / don't care about my added cleaning (or think it's wrong).

I haven't done it in a while, but GitHub's collaboration with Zenodo results in a zip file hosted there. Obviously that reduces the number of DOIs created, but it's not great: as soon as my code changes, and someone uses that version in a paper, they'll use the old DOI, potentially resulting in non-reproducible results. The same is true for data. On the researcher side, you may end up with hundreds of DOIs, each with zero to few citations. Also not great.

A happy medium might be to leave it to the data generator to create DOIs for set versions, and drop anyone trying to resolve a DOI on a landing page that links to that original version and any updates since (maybe indicating later releases that have DOIs attached separately). That would certainly make me happy as a data user/supplier.

Thanks for describing the problem. That's really interesting. We can certainly aggregate counts of downloads and installs across versions in Quilt. I'll definitely look into providing DOIs within Quilt and see if it's something we can do.

No, no relation.

How much of it is open source? Can I run my own?

The client is fully open source. You can indeed run your own and we are just starting to roll that out. I can get you started: feedback at quiltdata dot io. We are deliberating open sourcing the registry as well (making everything open source). What do you think?

Being able to host my own repository is a must for me. We have many TB of data and don't want to stream that over the internet. Having a repo on-site is a must.

I'd appreciate the code being open source. I can afford paying for a (perpetual) license.

Understood. We can do something very close to that. Right now we give the registry source to our on prem users as part of the license. Email sales at quiltdata dot io and we can discuss.

What's the difference between Apache Parquet and Apache Arrow? They are both columnar formats right?

Parquet is stationary data on-disk. Arrow is focused on in-memory analytics and serializes out to a variety of formats (including Feather and Parquet).

This is great - looking forward to seeing the available datasets grow!

Is there a similar project for Ruby?

No charge for bandwidth?

No, there's no charge for bandwidth. The most common uses so far are users installing datasets locally, which caches the data at the destination or running batch jobs in ECS/EC2, which doesn't accrue charges on AWS.

Huh, that's a pretty good setup for you then! I might mess around and see about writing an R package/interface, because this looks very useful.

We welcome your contributions to the R interface. If you email me, aneesh at quiltdata dot io, I can add you to our Slack channel where our engineers can support your efforts. Several users have asked about R, and if we pool that interest I think we have the horsepower to build an R layer for Quilt.

That would be awesome! We're happy to help, but we only know a little bit of R. We've been looking at Sparklyr in case it might help read Parquet into R.
