
Hadoop on IPFS [pdf] - usgroup
http://www.cse.unsw.edu.au/~hpaik/thesis/showcases/16s2/scott_brisbane.pdf
======
imrehg
I think this highlights the insight that IPFS is more of a "decentralized
transfer" tool than a "decentralized storage" tool, necessarily. Storage
is part of the caching, but the discovery of the required data pieces is the
main advantage here, as they mention: "With IPFS installed on each
node, the traffic between the DataNodes (slaves) and the NameNode (master) is
reduced."
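
That discovery works because IPFS is content-addressed: blocks are requested by the hash of their bytes, so any peer holding the data can serve it without a central lookup. A toy Python sketch of that idea (this is just the concept, not the actual IPFS wire format, which uses multihashes/CIDs):

```python
import hashlib

def content_address(data: bytes) -> str:
    # Content addressing: the key is derived from the bytes themselves,
    # not from which node stores them.
    return hashlib.sha256(data).hexdigest()

store = {}  # stands in for any node's local block store

def put(data: bytes) -> str:
    cid = content_address(data)
    store[cid] = data  # identical content dedupes automatically
    return cid

def get(cid: str) -> bytes:
    # Any peer that has the block can answer the request;
    # no round trip to a central NameNode-style index.
    return store[cid]

cid = put(b"hello hadoop")
assert get(cid) == b"hello hadoop"
assert put(b"hello hadoop") == cid  # same bytes -> same address
```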

To me this understanding and shift in perception came after a few days of
intense experimentation with IPFS, trying different use cases, etc. There are
plenty of surprises if you just want to use IPFS as decentralized storage,
which is how most introductions seem to approach it.

And it's definitely very cool & inspiring to find different use cases for IPFS
:) For example, the IPFS-backed docker registry is similarly cool
[https://github.com/jvassev/image2ipfs/](https://github.com/jvassev/image2ipfs/)
(though it unfortunately seems abandoned, or just in deep hibernation...)

~~~
XR0CSWV3h3kZWg
Are you interpreting the speed up as due to less contention for the NameNode's
resources?

~~~
imrehg
Not a Hadoop expert, but that seems to be the case: more decentralized caching
due to the way IPFS works.

~~~
XR0CSWV3h3kZWg
If that's the case, you could also boost the replication factor of a given
folder/file in HDFS and see a similar speed up. There are circumstances where
IPFS is nicer by default, but it might not be a clear winner.
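
For reference, a minimal sketch of what boosting replication looks like (the value 5 is illustrative). Cluster-wide, via hdfs-site.xml:

```xml
<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>
```

Or per-path, with `hdfs dfs -setrep -w 5 /path/to/hot/data` (the `-w` flag waits until the target replication is actually reached).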

~~~
imrehg
I think this is definitely a worthy experiment if you'd like to dig more into
it. The poster (I believe) is very thin on details, so more eyes are better.

~~~
XR0CSWV3h3kZWg
Unfortunately I don't see the connector code, or the experiment set up.

------
ah-
[https://www.pachyderm.io/open_source.html](https://www.pachyderm.io/open_source.html)
does something really similar.

~~~
stmw
How is Pachyderm doing, in terms of adoption and features and so on? Do people
have comments?

------
XR0CSWV3h3kZWg
I can't seem to pull up any more info on this. The write-up really doesn't
explain why IPFS would have performed better in this case. Is it because the
data in question is replicated more times? Is it because the IPFS protocol is
faster than HDFS?

Can't seem to find the connector code either. It's a little strange to go back
to MapReduce when presumably YARN is available and Tez or Spark could be
used.

~~~
TallGuyShort
>> It's a little strange to go back to map reduce when presumably YARN is
available and tez or spark could be used.

YARN just refactored the scheduling / resource management out of MapReduce.
Hadoop 2+ still uses the term "MapReduce" for that particular application, and
that application runs on a YARN cluster, just like Tez (I think) and Spark
can. I don't see anything to indicate they're NOT using YARN.
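
In fact, the stock way to run MapReduce on Hadoop 2+ is as a YARN application; the framework is selected in mapred-site.xml:

```xml
<!-- mapred-site.xml: run MapReduce jobs on YARN (the Hadoop 2+ default setup) -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```

So a job written against the MapReduce API says nothing about whether the cluster is Hadoop 1 or YARN-based.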

~~~
agibsonccc
That happened _years_ ago. I haven't seen a Hadoop 1 cluster in the wild in a
long time now. When most people think of "Hadoop" now, it's more about the
ecosystem of "HDFS + some add-ons like Spark".

~~~
TallGuyShort
I'm just saying that when the article says "MapReduce", it in no way limits
their workload to Hadoop 1 or 0.20.x, as OP seems to have thought.

------
batoure
So I think this is pretty misleading. I could see how running against IPFS
could improve the map phase, because IPFS uses more advanced replication
strategies than HDFS. But the reduce step would be really important to measure
here, and my guess is that these numbers would be all over the place if you
were trying to use this out in the wild.

------
jaytaylor
It's really cool that they did this, and it would be more useful if they'd
share the code so we can see exactly how they made it work.

------
rkwasny
Interesting. Has anyone seen the code/.jar?

