
Qri: A global dataset version control system built on the distributed web - anewhnaccount2
https://github.com/qri-io/qri
======
marknadal
I really love the design and style of qri! It is fun!

Can I ask why, for a git-style system, IPFS was chosen instead of GUN or SSB?

Certainly, images/files/etc. are better in IPFS than GUN or SSB.

But, you're gonna have a nightmare doing any git-style index/patch/object/etc.
operations with it - both GUN & SSB's algorithms are meant to handle this type
of stuff.

Did you guys do any analysis?

~~~
b_fiive
hey, qri dev here. Delighted you like the design, we're hoping to make data a
little more "approachable" :)

We did look into SSB. I'll admit to not hearing about it until only a few
months ago, but the main reason we chose IPFS was its single-swarm behaviour,
which allows natural deduplication of content (a really nice property for
dataset versioning).
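The deduplication property follows from content-addressing: a blob's identifier is derived from its bytes, so an unchanged component of a new dataset version maps to the same identifier and is stored only once. A minimal Go sketch of the idea, with sha256 standing in for IPFS's actual multihash/CID scheme (this is an illustration, not Qri's or IPFS's implementation):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// store maps content hash -> bytes; identical content is stored once.
var store = map[string][]byte{}

// put content-addresses a blob: the key is derived from the bytes
// themselves, so writing the same bytes twice adds nothing new.
func put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	store[key] = data
	return key
}

func main() {
	v1 := put([]byte("city,pop\nNYC,8000000\n"))
	// the same component, resubmitted as part of a later version
	v2 := put([]byte("city,pop\nNYC,8000000\n"))
	fmt.Println(v1 == v2, len(store)) // same key, one stored copy
}
```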

The majority of our work has been in the exact area you mentioned, building up
a dataset document model that will version, branch, and convert to different
formats. We've gone so far as to write our own structured data differ
([https://github.com/qri-io/deepdiff](https://github.com/qri-io/deepdiff)).
I'm very happy with the progress we've made on this frontier so far.

I'm a huge fan of SSB, but don't think it's well suited for making datasets
globally discoverable across the network. In the end the libp2p project tipped
the scales for us, providing a nice set of primitives to build on.

~~~
marknadal
Nice work!

------
DocSavage
Interesting project, particularly with the choice of IPFS and DCAT --
something I'll have to look into. There have been other efforts to handle
mostly file-based scientific data with versioning in both distributed (Dat
[https://blog.datproject.org/tag/science/](https://blog.datproject.org/tag/science/))
and centralized ways (DataHub
[https://datahub.csail.mit.edu/www/](https://datahub.csail.mit.edu/www/)).
Juan Benet visited our research center to give a talk about IPFS a few years
ago. Really fantastic stuff.

I'm the creator of DVID ([http://dvid.io](http://dvid.io)), which has an
entirely different approach to how we might handle distributed versioning of
scientific data primarily at a larger scale (100 GB to petabytes). Like Qri
and IPFS, DVID is written in Go. Our research group works in Connectomics. We
start with massive 3D brain image volumes and apply automated and manual
segmentation to mine the neurons and synapses of all that data. There's also a
lot of associated data to manage the production of connectomes.

One of our requirements, though, is having low-latency reads and writes to the
data. We decided to create a Science API that shields clients from how the
data is actually represented, and for now, have used an ordered key-value
store for the backend. Pluggable "datatypes" provide the Science API and also
translate requests into the underlying key-value pairs, which are the units
for versioning. It's worked out pretty well for us and I'm now working on
overhauling the store interface and improving the movement of versions between
servers. At our scale, it's useful to be able to mail a hard drive to a
collaborator to establish the base DAG data and then let them eventually do a
"pull request" for their relatively small modifications.
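Since the key-value pairs are the units of versioning, one way to read the design is a composite key scoped to a version node and a datatype, with lookups falling back through the version's ancestry so a child version only stores the pairs it changed. The sketch below is my own illustration of that pattern; the struct fields and ancestry walk are hypothetical, not DVID's actual key encoding:

```go
package main

import "fmt"

// versionedKey scopes each pair to a version node in the DAG and a
// pluggable datatype. Field names here are illustrative only.
type versionedKey struct {
	version  string // version node, e.g. a UUID in the version DAG
	datatype string // the datatype providing the Science API
	key      string // datatype-specific key
}

type kvStore map[versionedKey][]byte

// get walks the version's ancestry (child first) until it finds the key,
// so a child version only stores the pairs it actually modified.
func (s kvStore) get(ancestry []string, dt, key string) ([]byte, bool) {
	for _, v := range ancestry {
		if val, ok := s[versionedKey{v, dt, key}]; ok {
			return val, true
		}
	}
	return nil, false
}

func main() {
	s := kvStore{}
	s[versionedKey{"root", "grayscale", "block/0,0,0"}] = []byte("base block")
	s[versionedKey{"child", "grayscale", "block/0,0,0"}] = []byte("edited block")

	// the child sees its own edit; untouched keys fall back to root
	val, _ := s.get([]string{"child", "root"}, "grayscale", "block/0,0,0")
	fmt.Printf("%s\n", val)
}
```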

We've published some of our data online
([http://emdata.janelia.org](http://emdata.janelia.org)) and visitors can
actually browse through the 3D images using a Google-developed web app,
Neuroglancer. It's running on a relatively small VM so I imagine any
significant HN traffic might crush it :/ We are still figuring out the best
way to handle the public-facing side.

I think a lot of people are coming up with their own ideas about how to
version scientific data, so maybe we should establish a meeting or workshop to
discuss how some of these systems might interoperate? The RDA
([https://rd-alliance.org/](https://rd-alliance.org/)) has been trying to establish working
groups and standards, although they weren't really looking at distributed
versioning a few years ago. We need something like a Github for scientific
data where papers can reference data at a particular commit and then offer
improvements through pull requests.

~~~
amirouche
> We need something like a Github for scientific data where papers can
> reference data at a particular commit and then offer improvements through
> pull requests.

Exactly my thought. Do you know of any working group that is working toward
that goal?

~~~
DocSavage
If by working group you mean a cross-company collection of people, I don't
know of any or I would've joined them :) I've been working toward that goal
for the last 5 years, but primarily with an eye to our kinds of data problems
in the Connectomics field. I've been meaning to look at RDA again but
reluctant to start a working group myself.

~~~
b_fiive
Hey DocSavage! I'm one of these Qri folks, I'd love to see that working group
exist. I have a friend or two at the RDA. Maybe we should get an email going
on the subject? Projects like these are bigger than any one company or tool :)

~~~
DocSavage
Agreed. Will follow up on email through your Qri contact page.

~~~
b_fiive
delightful. thanks!

------
guywhocodes
What are the benefits of using Qri over IPFS? At a glance it seems very
similar, just narrower.

~~~
b_fiive
Imagine git were built on top of IPFS, and aimed specifically at datasets. Qri
uses IPFS to store & move data, so all versions are just normal IPFS hashes.
eg this:
[https://app.qri.io/b5/world_bank_population](https://app.qri.io/b5/world_bank_population)
is just referencing this IPFS hash:
[https://ipfs.io/ipfs/QmXwh5kNGsNAysRx66jcMiw1grtFf9j7zLFGbK9...](https://ipfs.io/ipfs/QmXwh5kNGsNAysRx66jcMiw1grtFf9j7zLFGbK9jA2wEX2)
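The "git built on IPFS" framing means each dataset version is a content-addressed node that links to its predecessor by hash, forming a Merkle DAG: one version hash covers the entire history below it. A toy Go sketch of that chaining, with sha256 in place of real IPFS CIDs (an illustration of the general technique, not Qri's commit format):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hash content-addresses a blob (sha256 standing in for an IPFS CID).
func hash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])[:12]
}

// commit links a dataset body to its previous version by hash, so each
// version's identifier transitively covers its whole history.
func commit(prev, bodyHash string) string {
	return hash([]byte("prev:" + prev + "\nbody:" + bodyHash))
}

func main() {
	v1 := commit("", hash([]byte("population,2018\n7.6e9\n")))
	v2 := commit(v1, hash([]byte("population,2019\n7.7e9\n")))
	fmt.Println(v1, v2) // two version hashes, v2 chained to v1
}
```

Because the link is by hash, two peers that independently produce the same history arrive at the same version identifier, which is what lets a Qri reference resolve to a plain IPFS hash.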

full disclosure: I work at Qri

~~~
guywhocodes
Ah, that's excellent. Thanks for your time

------
mewwts
I love how the distributed web is seemingly built more and more in golang
these days.

\- [https://github.com/ethereum/go-ethereum](https://github.com/ethereum/go-ethereum)

\- [https://github.com/ipfs/go-ipfs](https://github.com/ipfs/go-ipfs)

\- [https://github.com/textileio/go-textile](https://github.com/textileio/go-textile)

\- [https://github.com/lightningnetwork/lnd](https://github.com/lightningnetwork/lnd)

to name a few other projects.

~~~
Protostome
Why do you love that it's Go in particular? (Seriously asking, out of
curiosity: why Go over all other languages, e.g. Rust and such?)

~~~
maccio92
Fanboyism

~~~
stingraycharles
On a more serious note, I do think it's probably group identity (described as
"tribes" in popular media) that explains it.

A large project using their language of choice (Go in this instance) gives
external validation that their tribe is growing, and thus that they made the
correct choice in joining it.

