
Show HN: Arvados, a free software storage and compute platform for big data - tetron
https://github.com/curoverse/arvados
======
tetron
Hello all, Arvados developer here to answer your questions.

~~~
brudgers
How does Arvados differ from other big data tools?

It looks to be focused mainly on bioinformatics. Do you see other use cases
where its features provide more leverage than general-purpose tools?

~~~
tetron
There are several features that make Arvados unique.

* The content-addressed storage system hashes all data, so you can unambiguously reference an immutable data set, similar to git but capable of handling huge individual files (hundreds of gigabytes) and scaling to petabytes.

* Every compute job is recorded in a database with hashes identifying the inputs, Docker image, and outputs, so re-running past jobs is easy.

* It is designed to federate multiple instances, both to support "hybrid cloud" setups within an organization and to allow different organizations to share data.

These are all features that are particularly important to the bioinformatics
community, but they solve problems that are common to big data informatics in
general.
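
To make the first two points concrete, here is a minimal sketch of the idea in Python. It is not the Arvados API: the ContentStore class, the job_record helper, the field names, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json


def content_address(data: bytes) -> str:
    """Identify a blob by the hash of its bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, not Arvados)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so identical data always
        # gets the same key and a key can never point at changed data.
        key = content_address(data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def job_record(input_hashes, image_hash, output_hash):
    """Describe one compute job purely in terms of content hashes."""
    record = {
        "inputs": sorted(input_hashes),
        "docker_image": image_hash,
        "output": output_hash,
    }
    # Hash the canonical JSON form so the record itself is addressable.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["job_id"] = content_address(canonical)
    return record
```

Because a record like this names its inputs and image by hash, re-running a past job amounts to fetching exactly those blobs and running the same image again.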

~~~
brudgers
So would it be correct to say there's an emphasis on immutability and
idempotency and a capability for heterogeneity across cloud computing
platforms?

~~~
tetron
Yes, that's right. I would also add data provenance, i.e., where a data set
came from and how it was computed. This flows directly from the features I
mentioned: content-addressed storage and recorded compute job history.
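
As a rough illustration of how provenance can fall out of hash-keyed job history, here is a small Python sketch. It is not Arvados code: the trace_provenance function and the shape of the jobs dictionary (output hash mapped to its input hashes and Docker image hash) are assumptions made for this example.

```python
def trace_provenance(output_hash, jobs):
    """Walk back from a data set to every job that contributed to it.

    jobs is a hypothetical mapping: output hash -> {"inputs": [...],
    "docker_image": ...}. A hash with no entry is treated as raw input.
    """
    lineage = []
    seen = set()
    stack = [output_hash]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        job = jobs.get(current)
        if job is None:
            continue  # raw input data, not produced by a recorded job
        lineage.append((current, job["docker_image"], tuple(job["inputs"])))
        stack.extend(job["inputs"])
    return lineage
```

Each tuple in the result names an output, the image that produced it, and the inputs it was computed from, which answers both "where did this come from" and "how was it computed".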

