Can I ask why, for a git-style system, IPFS was chosen instead of GUN or SSB?
Certainly, images/files/etc. are better in IPFS than GUN or SSB.
But you're going to have a nightmare doing any git-style index/patch/object operations with it; handling exactly that kind of thing is what both GUN's and SSB's algorithms are designed for.
Did you guys do any analysis?
We did look into SSB. I'll admit I hadn't heard of it until only a few months ago, but the main reason we chose IPFS was its single-swarm behaviour, which allows for natural deduplication of content (a really nice property for dataset versioning).
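To make the dedup point concrete, here's a rough Go sketch of content addressing. It's not IPFS's actual chunker or CID format, just the idea: identical bytes hash to the same key, so shared blocks are stored once no matter how many dataset versions reference them.

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // Toy content-addressed block store: blocks are keyed by the hash of
    // their bytes, so identical content is stored exactly once. (IPFS
    // really uses a chunker and multihash-based CIDs, not raw sha256.)
    type BlockStore struct{ blocks map[string][]byte }

    func (s *BlockStore) Put(data []byte) string {
        sum := sha256.Sum256(data)
        key := hex.EncodeToString(sum[:])
        s.blocks[key] = data // duplicate content just overwrites itself
        return key
    }

    func main() {
        s := &BlockStore{blocks: map[string][]byte{}}
        a := s.Put([]byte("shared column data"))
        b := s.Put([]byte("shared column data")) // same bytes, same address
        fmt.Println(a == b, len(s.blocks))       // true 1
    }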
The majority of our work has been in the exact area you mentioned, building up a dataset document model that will version, branch, and convert to different formats. We've gone so far as to write our own structured data differ (https://github.com/qri-io/deepdiff). I'm very happy with the progress we've made on this frontier so far.
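For anyone curious what a structured differ buys you over a line-based one, here's a very rough sketch of the idea in Go. This is not deepdiff's actual API or algorithm (see the repo for that), just an illustration of diffing decoded documents by path rather than by line.

    package main

    import "fmt"

    // Minimal structured diff: walk two decoded JSON-ish documents and
    // report changes by path. Only a sketch; deepdiff's real algorithm
    // and API are different.
    func diff(path string, a, b interface{}, out *[]string) {
        am, aok := a.(map[string]interface{})
        bm, bok := b.(map[string]interface{})
        if aok && bok {
            for k, av := range am {
                if bv, ok := bm[k]; ok {
                    diff(path+"/"+k, av, bv, out)
                } else {
                    *out = append(*out, "delete "+path+"/"+k)
                }
            }
            for k := range bm {
                if _, ok := am[k]; !ok {
                    *out = append(*out, "insert "+path+"/"+k)
                }
            }
            return
        }
        if fmt.Sprint(a) != fmt.Sprint(b) {
            *out = append(*out, fmt.Sprintf("update %s: %v -> %v", path, a, b))
        }
    }

    func main() {
        prev := map[string]interface{}{"rows": 100, "owner": "b5"}
        curr := map[string]interface{}{"rows": 101, "license": "MIT"}
        var changes []string
        diff("", prev, curr, &changes)
        for _, c := range changes {
            fmt.Println(c) // e.g. update /rows: 100 -> 101
        }
    }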
I'm a huge fan of SSB, but don't think it's well suited for making datasets globally discoverable across the network. In the end the libp2p project tipped the scales for us, providing a nice set of primitives to build on.
I'm the creator of DVID (http://dvid.io), which takes an entirely different approach to distributed versioning of scientific data, primarily at a larger scale (100 GB to petabytes). Like Qri and IPFS, DVID is written in Go. Our research group works in connectomics: we start with massive 3D brain image volumes and apply automated and manual segmentation to extract the neurons and synapses from all that data. There's also a lot of associated data needed to manage the production of connectomes.
One of our requirements, though, is having low-latency reads and writes to the data. We decided to create a Science API that shields clients from how the data is actually represented, and for now we've used an ordered key-value store for the backend. Pluggable "datatypes" provide the Science API and translate requests into the underlying key-value pairs, which are the units for versioning. It's worked out pretty well for us, and I'm now working on overhauling the store interface and improving the movement of versions between servers. At our scale, it's useful to be able to mail a hard drive to a collaborator to establish the base DAG data and then let them eventually do a "pull request" for their relatively small modifications.
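In case the key-value versioning idea is unclear, here's a toy Go sketch of the general pattern (made-up names, not DVID's actual key layout or datatype interface): writes are namespaced by version, and a read falls back through the version DAG until it finds the key, so unchanged data is never copied.

    package main

    import "fmt"

    // Toy versioned key-value store: composite keys of version + key,
    // with reads falling back to ancestor versions. Illustrative only.
    type Store struct {
        kv     map[string][]byte // "versionID/key" -> value
        parent map[string]string // child version -> parent version
    }

    func (s *Store) Put(version, key string, val []byte) {
        s.kv[version+"/"+key] = val
    }

    // Get returns the value visible at `version`, walking up the DAG.
    func (s *Store) Get(version, key string) ([]byte, bool) {
        for v := version; v != ""; v = s.parent[v] {
            if val, ok := s.kv[v+"/"+key]; ok {
                return val, true
            }
        }
        return nil, false
    }

    func main() {
        s := &Store{kv: map[string][]byte{}, parent: map[string]string{}}
        s.Put("v1", "block/0_0_0", []byte("base segmentation"))
        s.parent["v2"] = "v1" // branch v2 off v1
        s.Put("v2", "block/0_0_1", []byte("edited block"))
        val, _ := s.Get("v2", "block/0_0_0") // untouched block resolves to v1
        fmt.Println(string(val))
    }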
We've published some of our data online (http://emdata.janelia.org) and visitors can actually browse through the 3D images using a Google-developed web app, Neuroglancer. It's running on a relatively small VM, so I imagine any significant HN traffic might crush it :/ We are still figuring out the best way to handle the public-facing side.
I think a lot of people are coming up with their own ideas about how to version scientific data, so maybe we should establish a meeting or workshop to discuss how some of these systems might interoperate? The RDA (https://rd-alliance.org/) has been trying to establish working groups and standards, although they weren't really looking at distributed versioning a few years ago. We need something like a GitHub for scientific data, where papers can reference data at a particular commit and then offer improvements through pull requests.
Exactly my thought. Do you know of any working group that is working toward that goal?
The basic idea is distributed version control, like git, but over p2p swarms rather than clusters around “central” repositories. We have special handling for large datasets (but still using git) to improve transfer efficiency and diffing.
There’s a UI layer for collaboration (discussion, PRs, review) that supports deep linking to and embedding of files at specific commits, which sounds a bit like what you’re looking for.
Feedback is very much appreciated!
One of the issues for me is file-based versioning, which then requires the means to parse the format. A number of ventures and organizations (e.g., NeuroData without Borders) address versioning of the entire ecosystem necessary to correctly use the underlying data files, so I'm not sure whether that's an explicit part of your ecosystem. Most importantly, is your stack going to be open source?
Right now we support data versioning, interactive web previews, seamless loading into hosted Jupyter notebooks (Kaggle Kernels), seeing and sharing analytic results built on a given data version, and adding direct collaborators.
We don't support a data-oriented version of an "issue" or a "pull request" quite yet, but these needs are definitely on our radar.
Buying hard discs (100 TB for a few tens of thousands of euros, a few years ago) is a real investment for our institute. As far as I understand it, with distributed storage each participant volunteers to share their disc space to store their own (and others') data. So here's the devil's advocate question: why should I share my expensively bought disc space with you?
With versioned data, you could leverage the largesse of the big institutions to provide the base data, so that only the deltas for the child versions need to be handled by the users making changes.
full disclosure: I work at Qri
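To put the delta point above in code: you can think of a version as a manifest of chunk hashes, where a child version reuses the parent's hashes for everything it didn't touch, so a contributor only has to host the chunks they actually changed. A rough Go sketch with made-up names and layout:

    package main

    import "fmt"

    // A version as a manifest of chunk hashes. Made-up layout, just to
    // show why a child version is cheap for the person who creates it.
    type Manifest map[string]string // chunk name -> content hash

    // newChunks lists the chunks a child introduces beyond its parent,
    // i.e. the only data the contributor must store and serve.
    func newChunks(parent, child Manifest) []string {
        var added []string
        for name, hash := range child {
            if parent[name] != hash {
                added = append(added, name)
            }
        }
        return added
    }

    func main() {
        base := Manifest{"vol/0": "Qm...a", "vol/1": "Qm...b", "vol/2": "Qm...c"}
        edit := Manifest{"vol/0": "Qm...a", "vol/1": "Qm...d", "vol/2": "Qm...c"}
        fmt.Println(newChunks(base, edit)) // [vol/1]; the untouched base chunks stay with the institution
    }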
What are the benefits of it? What does git not do well enough?
to name a few other projects.
A large project using their language of choice (Go in this instance) gives external validation that their tribe is growing, and thus that they made the correct choice to join it.
I blame Go for these things not being done sooner.