Our general inspiration is to create a new kind of data warehouse based on code management practices that haven't yet reached the data domain.
Feedback welcome. Ask me anything.
On the open source side, how much stuff was org.sourceforge that then had to move when SourceForge turned into a scummy operation?
Microsoft's CLR guidelines are a little less verbose but still run into the same problem of relying on the corporate name being permanently unique.
Perhaps <adjective> <noun>, <adjective> <noun> <verb>, or <adjective> <noun> <verb> <adverb>. Or <adjective> <adjective> <noun>. And so on.
Frog (Design, 200M Google hits); Happy Frog (soil, copy center, 500k hits); Happy Frog Swimming (pool, 4 hits); Happy Frog Swimming Sideways (0 hits).
Quilted Data (250 hits).
But my impression is common nouns have trademark advantages.
You should probably rename the pip package and the CLI tool to quiltdata to avoid conflicts.
How did this not come up during your market research?
$ brew install quilt >/dev/null 2>&1 && which quilt
No, you built Quilt to own the data. Package managers don't make users pay if they want their own package repositories.
> so that analysts can spend more time analyzing and less time finding, cleaning, and organizing data.
You should spend more time with data analysts then. You would understand that cleaning, filtering, and preparing the data is actually part of data analysis. I would refuse to do any work on data that was preprocessed without having the exact list of what was done to the raw data.
Honestly all I see technically speaking in this project is a python program to download data frames... Woo-hoo.
And for the business side, the now usual attempt to own a community of users and their data. Boring.
We understand that cleaning/filtering/preparation are key to the analysis, and we plan to support those operations as part of package construction. The question is whether or not that work should be done repeatedly, or once for the benefit of your collaborators.
You can store any kind of data in Quilt. Not just data frames. What's on Quilt today is just the beginning of what is possible.
Do you also dislike GitHub? We've opened up the client source and the community for as much free data as people can publish. It seems naive to expect that we wouldn't charge for anything ever. The users always and forever control their own data, by the way.
> "Contact us to start Business or On-premise service."
I DO reject the whole concept of "hosted-only" software. To me it makes the whole project useless. It means I cannot have my own private or confidential data. I cannot use it in my company, etc. More importantly, the data is not mine anymore. It's yours.
I cannot find any place on your website that explains what happens to the data once it is uploaded! As far as I'm concerned, it means I am giving you my data, for free, without any restriction on your side. You can resell it, modify it, rebrand it, prevent me from accessing it. Since there is no mention of how the data is stored, I also cannot know whether the data is encrypted on your side, or whether you have the capability to read everything that is uploaded.
> "The whole point of the on-prem install is that customers can run Quilt on their own infrastructure (it's Dockerized, etc.)."
Yes, and on-premise is not possible at the moment, and on the website it is marketed the same as business use.
Anyway, I lost my cool a bit on previous comments, thanks for keeping yours, and good luck :)
Isn't that npm's business model?
On your front page you should have the owner/dataset name instead of just the dataset name so I wouldn't have to click through to find the owner name.
Unrelated I found a typo on your blog. Search for "seriailize" on your "Manage data like source code" post.
Because I took my first point in a direction away from the context, even though the reasons I wrote were the underlying motivation.
I'm aiming at the extent to which data is susceptible to considerable performance improvements when you have the ability to align the physical store with the data structure.
At the same time, punching holes in the operating system to get to the hardware, as I argued, comes with costs, and the cost of losing backups is not trivial.
This is why I asked about the Weka FS in my comment, because if you were atop a system like that you would have theoretical redundancy already.
I glossed over the dubious omission that this is itself another FS, not the hardware pass-through my comment was about.
However, taking my comment literally, the first point of contact in any case is going to be a requirement to partition a store for pass-through access.
That's a great path to disaster and not what I meant, but I omitted the required explanation:
If you had commit access to your FS code, you could accommodate the pass-through request all the way down to the metal, but you would lose the point of the FS. I instead envisioned the FS providing you with hardware layouts that are suitable for the data you need running fast. I imagined an API to request that "on disk" format, with the FS able to indicate potential changes to balance the requested layout against the performance of FS features such as cluster latency, the computational efficiency of RAID schemes, and other features like T10 and overall data management. A scale between handing over the hardware outright and merely alerting management consoles to the existence of an opaque store, and agreeing to the tradeoffs the full feature set provides, where the application is placed in authority to size performance at a level of reasonable cost.
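To make that concrete, here is a purely hypothetical sketch (in Python, since that's what Quilt itself uses) of the request/counter-offer exchange I have in mind; none of these names exist in any real FS API.

    # Hypothetical layout-negotiation API: the application describes the on-disk
    # layout it wants, and the FS answers with the closest layout it can honour
    # given its other duties (replication, RAID, latency targets).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class LayoutRequest:
        stripe_size_kb: int          # preferred physical stripe size
        contiguous: bool             # ask for contiguous extents
        replication_copies: int      # copies the application is willing to pay for
        max_write_latency_ms: float  # acceptable latency alongside other FS work

    @dataclass
    class LayoutOffer:
        granted: LayoutRequest                              # what the FS can actually honour
        tradeoffs: List[str] = field(default_factory=list)  # notes for management consoles

    def negotiate_layout(fs, request: LayoutRequest) -> LayoutOffer:
        """Ask the FS for a layout; it may counter-offer rather than hand over raw hardware."""
        offer = fs.propose(request)  # 'propose' is an imagined FS entry point
        if offer.granted.replication_copies < request.replication_copies:
            offer.tradeoffs.append("fewer copies than requested; application must handle re-send")
        return offer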
I'm thinking of serialisation that's a part of any larger system which provides reliability features in other components.
But the flexibility could exist to allow for flow under exceptional load, if doing so was critical to the performance of your overall system.
If you are buffering requests at a point where you have to pass the stream to the next component, which is responsible for acknowledging the requests and will be able to resend if it has not received acknowledgement, then the opportunity to trade normal FS behaviour for raw speed is possible under reported conditions known to the management instruments.
I'm assuming that the FS is not going to let you write data faster than it can function for jobs like replication, and then limits would indicate that discarding the FS roles is not always good value.
I haven't seen any sign of the purported advent of intelligence in file systems for the sort of thing I am interested in like this.
But it must certainly be a likelier possibility to cooperate with new FS companies than with the Windows design team and the weight their legacy brings.
I'm out of touch, but NTFS provides, or provided, interfaces for e.g. sparse file layouts, and there's nothing new in my central proposal. I've merely speculated how far it might be possible to go. And there's an entire field of data structure optimisation which can only be done in a worthwhile way with the FS's cooperation. To just leave it at the hope of getting nice features in FSs, or else write your own driver, seems like a crude either/or proposition in the present times.
I would very much like to hear your reaction to this suggestion, and I would like to ask whether you have looked at the work at weka.io and their take on the convergence of storage. I increasingly like the idea of having the ability to use the hardware and have the management directed at the application level, where the developer is able to use intelligent measures and policies to tune their systems separately from the OS. Close cooperation would enable the collection of data to provide a valuable resource for administrators and directly benefit the pace of production development, by providing a comprehensive, universal instrumentation context.
I know this is almost arguing about the reason why databases should use their own on disk format and there is long history of the tradeoff involved with that.
But the trouble I have with the current storage space is that the capability built into the file systems is not flexible enough for the kind of smaller mixed deployment I come into contact with in the lower small-business market. As an example, Ceph, or any FS that is a monolithic investment, is one where the most restricted resource is going to be management time. Additional FSs are difficult propositions for small shops. The idea I'm looking for is that the application layer should be responsible for storage management and performance tuning, following best practice set by the software publisher and learned from collected instrumentation data.
I think the epiphany of the general operating system is nigh or even last century.
Nobody is able to use large software programs in a turnkey way while making assumptions about the OS environment. I hand-wave plenty saying that, but I can follow up flippantly by adding that a friend's experience providing contracted management to small businesses is not atypical in my experience: he joked that he loves Linux because it means he gets a clean install and nobody likely to know how to mess it up.
In the contexts where I see very little leveraging of OS capabilities, particularly in the Windows Server world, it looks like a lot of wasted effort and license expense.
I beg forgiveness in advance for this facetious illustration, but in conversation with a small-business web developer recently, I cited the example of Plenty of Fish and rhetorically asked if he knew it was a one-man gig, on Windows and IIS. He was unaware of this, so I teased him that he would be forever stuck in his first money rounds and hiring if he had accepted a similar bet on building such a dating site, if he kept on reading HN so much... My joke is off colour, sorry, but I wanted then, as now, to make the point that it has become all too accepted to automatically start with a complete development stack and seek advantages in the customization and deep power that come from leveraging highly experienced professionals. I worry about whether we all just do too much of this, and whether it's time to review the situation more broadly than my rotten humour alludes to, because the problem, if it is a problem, is much wider.
I have not heard of weka.io but will take a look.
In an extreme case, I think about getting an NV store to load the pointers to the physical data into cache, and not flushing the cache until the context is released by the application thread. Going further into this reverie, anticipatory QoS for bandwidth is a personal desire. I'd even like to have a PCIe lane reserved, in my fascistic moments.
Now I'm regretting my wordy comments above, because the explanation I wrote wasn't needed.
But this dream I have to gain programmatic cooperation for data performance is surely not unique to me.
Maybe twenty years ago it made perfect sense to leave the task of resource sharing to the operating system and its subsystems, but now not only do many people have high-performance headroom to exploit, the knowledge and experience of writing concurrent schedulers and load balancers is far more commonplace, simply by virtue of the needs presented by the rapid expansion of the Internet.
The only point from my comments above that I think is under-appreciated by end users of software, and worth approaching, is how much you can improve your software when you have a truly controlled environment. Talking straight through the operating system stack, you can obtain uncoloured measurements from which you can build optimal performance applicable to all installations. I may dream, but I grew up with the worship of vendor benchmarks as my bête noire, and I have been set against the waste created by supporting unrestrained variation ever since. Supporting that variation makes sense for a commercial operating system, but the opportunity to work with completely homogeneous stacks is something whose value potential ought to be recognised by management. I'm convinced that this will be a critical commercial advantage for whoever first finds a solution that isn't service-only but customer-deployable.
It's a long time past, but I can't forget my experience in small businesses, where I had to disbelieve any reports about performance problems until I had full access, or even on-site immediacy, to get an idea of the circumstances.
I'm fed up with Microsoft getting the advantage of being the only one who collects metrics aggressively or at all.
The open source community should be the first to get this data out of customers' production systems.
How can we do this?
I'm just feeling a personal sense of futility from my experience optimising code as well as installations to sometimes minimal effect, bounded by the hardware budget. I certainly learned a valuable part of my skills that way, but I have ever since felt sceptical about how much of the development effort would be better allocated if true installation performance instruments were reporting across the whole user base.
(One assumes an R user will be well equipped, but I expect a broad spectrum of hardware, from students in India to multinational developers working with the latest generation.)
It's often true that the most impoverished users gain the most from optimised software, and the ethics of that result are impeccable. I'm simply asking for the concerted collection of instrumentation metrics to support optimisation efforts in open source software. Putting the authors, creators, and hackers first is something I wish were done as part of promoting FOSS, and providing quality insights into how their work is used should, I think, be the basic foundation of our responsibility and gratitude for their work. Not to mention the improvements in our own work that will result. Surely this is not impossible to solve. I have often thought about a package manager taking snapshot performance characteristics and reporting them to the developers by way of a public page update. But I've not even seen the idea anywhere else, and don't understand what gives...
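As a rough sketch of what I mean (all names and the reporting URL below are made up, and any real collection would have to be opt-in), a package manager could time a representative operation and publish an anonymous snapshot for the authors:

    # Sketch of an opt-in performance snapshot a package manager could collect
    # and report back to package authors. The endpoint URL is hypothetical.
    import json
    import platform
    import time
    import urllib.request

    def performance_snapshot(package: str, workload) -> dict:
        start = time.perf_counter()
        workload()                                   # some representative operation
        elapsed = time.perf_counter() - start
        return {
            "package": package,
            "elapsed_s": round(elapsed, 4),
            "python": platform.python_version(),
            "machine": platform.machine(),
        }

    def report(snapshot: dict, url: str = "https://example.org/perf-reports") -> None:
        req = urllib.request.Request(
            url,
            data=json.dumps(snapshot).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)                  # fire-and-forget for the sketch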
A while ago I started a project converting government voting records (both elections and congressional) into a database. Would that be interesting to you?
Here's an idea: You could also host a data bounty program, and/or start a grant program for the production of these data sets.
Still missing an answer to "what are your requirements"? How do you verify data quality, etc.? What format(s)?
Does Synapse offer server-side filtering? That's something we're thinking about.
Bravo, this is an excellent strategy. I've been using NPM packages to share small datasets for a little while due to the simplicity of distribution.
I disable it in my web browsers because of security and privacy concerns, and it would be great if I could just read about your project without it.
The hooks for extending build targets are here:
Again it's a bit raw but we're here to help make R support easy.
We're planning to write proper documentation soon.
I can't imagine a future where we don't treat data version control as a necessity in the same way as code version control. I hope Quilt can fill the much-needed role of "GitHub for data".
Nearly every place I've worked at, we've had to try to create our own version history and audit information. This is one of the examples I use when people start talking about how we have a shortage of engineers.
No, we have a shortage of good tools. Nobody should have been writing their own data versioning code in 2007, let alone 2017.
What you described is literally what we offer -- the ability to go backwards and forwards in time and create branches, bookmarks, and refreshes with very little overhead. [Edit: This is all for databases that by default, as you noted, don't do any of this stuff.]
We aren't (yet) super well known, but we've got a number of things in flight that should change that :)
And if you're curious how this magic works, the hint is: we have a number of really smart ZFS contributors on the engineering team.
Does it remain cost effective at scale?
All the source code in the world is a drop in the bucket compared to the raw data collected by a large business.
So people just give up and exfiltrate all of the data to another server to run their reports on. But as a user I still want to be able to figure out 'did I do that thing in February, or was it March?' fairly often.
One thing I've always wished database replication systems did (universally) was allow you to run different indexes on different replicas, so you don't have to export the data at all. Your insert time would still be a function of network delay + worst case index update time, but you could segregate traffic based on kind and continue to scale to a fairly large company before anyone had to mention data warehousing or data lakes.
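For what it's worth, PostgreSQL's logical replication (v10+) gets you part of the way there: the subscriber side is an ordinary table that can carry its own indexes. A minimal sketch with psycopg2, assuming a subscription to a reporting replica already exists (connection details and table names are made up):

    # Create reporting-only indexes on a logical-replication subscriber,
    # leaving the primary's insert path untouched.
    import psycopg2

    conn = psycopg2.connect("host=report-replica dbname=app user=reporting")
    with conn:
        with conn.cursor() as cur:
            # A heavy analytical index that would slow inserts on the primary.
            cur.execute(
                "CREATE INDEX IF NOT EXISTS orders_by_region_created "
                "ON orders (region, created_at)"
            )
    conn.close()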
Still in early dev stages, and I'm on holidays for a few weeks atm, but we'll have something useful online in a few months. Supporting forking, branching, merging, etc.
There's obviously friction in this space -- this weekend I'm playing with databricks spark notebooks and public data on S3 and the data prep will still be annoying -- so I'm looking forward to design innovation in it.
So I can assume this isn't going to be afraid of gigabytes, right? I've seen services before that want to be a repository of data, and I try to upload a mere 20 GB of data and they're like "oh shit nevermind". Even S3 requires it to be broken into files of less than 5 GB for some inscrutable reason.
Pachyderm looks brilliant for high-scalability parallel data processing, and the versioned data part is a way to not just maintain the history of the data, but also avoid reprocessing of data that hasn't changed since the previous run.
Anyway, do I get this right: They expect users to be experts in data analysis but not being able to load the data into whatever software they use? They want me to share data and to offload my data into their walled garden that can be accessed only via their service? If I wanted to share my data, wouldn't I rather use something more accessible?
Still, it's good inspiration. Maybe I'll make a github repo with my city's open datasets loaded into python.
Quilt is, in my view, as open as git or GitHub. The de/serialization code is all open source, and uses an open format (Parquet). Parquet is accessible (and more optimal than text files) for things like Presto DB, Hive, etc.
Can I run something that hosts a quilt repo on my own server?
How does this compare to what data.world is doing? They recently released a Python SDK as well.
I actually had been thinking about Parquet as a component of ETL, and if it might be possible to make ETL many times faster by compressing to Parquet format on the source and then transmitting to a destination - especially when you're talking about limited bandwidth situations where you need entire data sets moved around in bulk.
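That intuition is easy to try with pandas/pyarrow: write the extract as compressed Parquet on the source side and ship a much smaller, already-typed file. (File names and the compression codec here are just illustrative.)

    # Compress a tabular extract to Parquet at the source before shipping it.
    # Parquet's columnar layout plus compression typically shrinks text exports
    # substantially, and the schema travels with the file.
    import pandas as pd

    df = pd.read_csv("daily_extract.csv")                         # source-side extract
    df.to_parquet("daily_extract.parquet", compression="snappy")  # ship this file

    # On the destination, no re-parsing or type inference is needed:
    restored = pd.read_parquet("daily_extract.parquet")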
This looks really nice for sharing public data sets, but I wish that there was a better public non-profit org running indexes of public data sets.... I guess if something like the semantic web had ever taken off, then the Internet itself would be the index of public data sets, but it seems like that dream is still yet to materialize.
Paying flat fees for access to repos is fundamentally thinking about the problem incorrectly.
My feeling is brokering. Consider the market for wheat where there is pricing based on supply and demand. There are futures, options, etc.
Consider a NYSE for data. Why host the data? Be a discovery service both for the price and for brokering of access.
Data should not be priced on its size to store/transfer. That is leaving huge money on the table. It should be priced based on what people are willing to pay for it.
Why not allow someone to pay for the option for exclusive resale rights to some weather data service? Then allow someone to make a profit off their ability to resell that data, etc.
You may not be well placed to do that now, but someone is going to. And why would I go to you hoping to find specific data when I can go to a market full of data, full of data resellers, etc.?
Edit: by ‘doomed’ I mean “hefty discount on employee stock options / wouldn’t put my own money into it”.
By contrast, GitHub is a blob store; it doesn't transform the data either for serialization or for virtualization.
The first thing I looked for was a canonical package/resource specification in build.py. Any chance of supporting the Frictionless Data resource spec for interop?
I think we could generate a frictionless schema pretty easily...
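For instance, a first cut could be generated straight from pandas dtypes; this is only a rough sketch, and the dtype-to-Table-Schema mapping below is a simplification of the real spec:

    # Rough pandas-dtype -> Frictionless Table Schema generator. The mapping only
    # covers common cases; real interop would follow the spec more carefully.
    import json
    import pandas as pd

    DTYPE_TO_TABLESCHEMA = {
        "int64": "integer",
        "float64": "number",
        "bool": "boolean",
        "datetime64[ns]": "datetime",
        "object": "string",
    }

    def to_table_schema(df: pd.DataFrame) -> dict:
        fields = [
            {"name": col, "type": DTYPE_TO_TABLESCHEMA.get(str(dtype), "any")}
            for col, dtype in df.dtypes.items()
        ]
        return {"fields": fields}

    df = pd.DataFrame({"city": ["Oakland"], "population": [425000]})
    print(json.dumps(to_table_schema(df), indent=2))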
Figshare currently has versioned DOIs, so you can refer to a specific release, and the API will let you download each one. I might put some small tooling around this as I release data there myself.
To add to the feature requests: for academia, having trust that the data will survive the company going away is important to me.
Basically the sanest way of maintaining patch sets (multiple patches -> quilt, get it?). The logic of using "quilt" as the name for anything that doesn't have to do with patches baffles me.
It would be really cool if quilt could generate documentation for datasets, even if it was just column names/types. One of the issues we have is keeping track of all of the data "assets" people have pulled or created.
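In the meantime, a poor man's version is easy to script against whatever frame you have loaded; a small sketch that dumps a plain-text data dictionary (column, dtype, null count, example value), with the file name made up:

    # Generate a minimal data dictionary for a DataFrame: one row per column
    # with its dtype, null count, and a sample value.
    import pandas as pd

    def data_dictionary(df: pd.DataFrame) -> str:
        lines = ["column | dtype | nulls | example"]
        for col in df.columns:
            sample = df[col].dropna().iloc[0] if df[col].notna().any() else ""
            lines.append(f"{col} | {df[col].dtype} | {df[col].isna().sum()} | {sample}")
        return "\n".join(lines)

    df = pd.read_parquet("some_pulled_asset.parquet")  # whatever asset was pulled
    print(data_dictionary(df))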
You guys should make the search bar a little more prominent. Took me a while to find it!
1) Load original data from source into quilt
2) Do transformation
3) Commit transformations to quilt, with commit message
4) Run experiment
5) Do new transformations
6) Commit to quilt
7) Run experiment
Rinse and repeat.
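A rough sketch of that loop using Quilt's Python entry points (build/push/install); the exact signatures are from memory of the docs and may differ, and the package name is made up:

    # Version-per-transformation loop sketched with Quilt's Python API.
    import quilt

    # 1) Load original data from source and snapshot it as a package version.
    quilt.build("someuser/elections", "raw/")     # build a package from a directory
    quilt.push("someuser/elections")              # push the first version

    # 2-3) Do transformations, then commit the result as the next version.
    #      (cleaning/transformation code elided)
    quilt.build("someuser/elections", "cleaned/")
    quilt.push("someuser/elections")

    # 4) Run the experiment against the packaged data.
    quilt.install("someuser/elections")
    from quilt.data.someuser import elections     # access the data node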
Looking at the video and documentation, this is not emphasised at all; the suggestion instead is that edits to data should be saved as a new package.
No idea what a DataNode is so am struggling to actually see the data! Any tips?
it's shorthand for:
You can take it a step further and either integrate with a DOI provider or become one yourself and integrate the registration process within your api or create command line tools.
I'd appreciate the code being open source. I can afford to pay for a (perpetual) license.