Hadoop on IPFS [pdf] (unsw.edu.au)
93 points by usgroup on Oct 24, 2017 | 19 comments



I think this highlights the insight that IPFS is more of a "decentralized transfer" tool than a "decentralized storage" tool. Storage is part of it (through caching), but the discovery of the required data pieces is the main advantage here, as they mention: "With IPFS installed on each node, the traffic between the DataNodes (slaves) and the NameNode (master) is reduced."

To me this understanding and shift in perception came after a few days of intense experimentation with IPFS, trying different use cases, etc. There are plenty of surprises if you just want to use IPFS as decentralized storage, which is how most introductions seem to approach it.

And it's definitely very cool & inspiring to find different use cases for IPFS :) For example, the IPFS-backed Docker registry is similarly cool: https://github.com/jvassev/image2ipfs/ (though it unfortunately seems abandoned, or just in deep hibernation...)


Are you interpreting the speed up as due to less contention for the NameNode's resources?


Not a Hadoop expert, but that seems to be the case: more decentralized caching due to the way IPFS works.


If that's the case, you could also boost the replication factor of a given folder/file in HDFS and see a similar speed-up. There are circumstances where IPFS is nicer by default, but it might not be a clear winner.
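
For reference, a minimal sketch of that HDFS-side experiment using the standard org.apache.hadoop.fs.FileSystem API (the path and replication factor below are made-up examples; the equivalent shell command is hdfs dfs -setrep):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BoostReplication {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster's default filesystem (HDFS, per core-site.xml).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Replication is a per-file attribute in HDFS, so apply it to the job's
            // input files; hypothetical path and factor, for illustration only.
            Path hotInput = new Path("/data/wordcount-input/part-00000");
            fs.setReplication(hotInput, (short) 5);

            fs.close();
        }
    }

More replicas mean more map tasks can read a local copy, which is roughly the effect the comment above attributes to IPFS's caching.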


I think this is definitely a worthy experiment if you'd like to dig more into it. The poster (I believe) is very thin on details, so more eyes are better.


Unfortunately I don't see the connector code, or the experiment set up.


https://www.pachyderm.io/open_source.html does something really similar.


How is Pachyderm doing, in terms of adoption and features and so on? Do people have comments?


I can't seem to pull up any more info on this. The write-up really doesn't explain why IPFS would have performed better in this case. Is it because the data in question is replicated more times? Is it because the IPFS protocol is faster than HDFS?

Can't seem to find the connector code either. It's a little strange to go back to map reduce when presumably YARN is available and tez or spark could be used.


While I haven't looked for more info, I can assure you IPFS is a slower protocol than HDFS.

The existing IPFS implementation involves a lot of overhead in memory, CPU use, and latency [and as you can see in the MapReduce bar graph, it's slower], but overall, it improves performance when bandwidth is the bottleneck.

IPFS is pretty similar to BitTorrent, just more practical.


Why do you say bittorrent isn't practical?


BitTorrent isn't "not practical". It is, however, less practical than IPFS.

With IPFS, you can effortlessly link from one tree to another, already existing tree.

In BitTorrent, you have to include a .torrent file, or an infohash in some file, but no software that I know of will easily follow that link.

This linking ability is an extremely useful property that allows you to cheaply create a copy of a Merkle tree with a subset of the data replaced. The new tree will operate identically to one created from scratch.

You also don't need to hold all the data to "patch" the tree, which I imagine is useful in this Hadoop filesystem.
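
A conceptual sketch of why that works (plain SHA-256 over child hashes here, not IPFS's actual object/DAG format, and the block contents are made up): in a content-addressed tree, replacing one leaf only rehashes the path from that leaf up to the root, while every untouched subtree is reused by reference to its existing hash, so its bytes never need to be present.

    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.List;

    // Conceptual sketch only -- not IPFS's real object format.
    public class MerklePatchSketch {

        // A node is identified purely by the hash of its own bytes plus its children's hashes.
        static byte[] hashNode(byte[] content, List<byte[]> childHashes) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(content);
            for (byte[] child : childHashes) {
                md.update(child);
            }
            return md.digest();
        }

        public static void main(String[] args) throws Exception {
            // Original tree: root -> [leafA, leafB]
            byte[] leafA = hashNode("block A".getBytes(), List.of());
            byte[] leafB = hashNode("block B".getBytes(), List.of());
            byte[] root  = hashNode(new byte[0], List.of(leafA, leafB));

            // "Patch" the tree: swap in a revised leafB. leafA is reused purely by its
            // hash -- block A's bytes are never needed to build the new root.
            byte[] leafB2 = hashNode("block B, revised".getBytes(), List.of());
            byte[] root2  = hashNode(new byte[0], List.of(leafA, leafB2));

            // Only two nodes were (re)hashed for the patch: leafB2 and the new root.
            System.out.println("roots differ: " + !Arrays.equals(root, root2));
        }
    }

Linking one existing tree under another is the same trick: you mint a new parent node that holds the old root's hash as a child link.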

Unfortunately there is no real API for operating on torrents. You could say WebTorrent is pushing in the right direction to fix this, but there doesn't seem to be much adoption for it.


>> It's a little strange to go back to map reduce when presumably YARN is available and tez or spark could be used.

YARN just refactored the scheduling / resource management out of MapReduce. Hadoop 2+ still uses the term "MapReduce" for that particular application, and that application runs on a YARN cluster, just like Tez (I think) and Spark can. I don't see anything to indicate they're NOT using YARN.
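
As a quick illustration (assuming a stock Hadoop 2 setup; the job name is a made-up placeholder), a "classic" MapReduce job is routed onto YARN by the mapreduce.framework.name property, which is normally already set cluster-wide in mapred-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // In Hadoop 2+, this property is what makes a plain MapReduce job run as a
            // YARN application; it usually comes from the cluster's mapred-site.xml.
            conf.set("mapreduce.framework.name", "yarn");

            Job job = Job.getInstance(conf, "mapreduce-on-yarn-example");
            // ... set mapper, reducer, input and output paths as usual, then job.waitForCompletion(true).
            // The MapReduce ApplicationMaster negotiates containers from YARN,
            // just as a Tez or Spark application would.
        }
    }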


That happened years ago. I haven't seen a Hadoop 1 cluster in the wild in a long time now. When most people think of "Hadoop" now, it's more about the ecosystem of "HDFS + some add-ons like Spark".


I'm just saying that when the article says "MapReduce", it in no way limits their workload to Hadoop 1 or 0.20.x, as OP seems to have thought.


Yep, if they are running Hadoop 2 then they have YARN and can use MR, Tez, or Spark.


So I think this is pretty misleading. I could see how mapping against IPFS could improve the map phase, because IPFS uses more advanced replication strategies than HDFS. But the reduce step would be really important to measure here, and my guess is that these numbers would be all over the place if you tried to use this out in the wild.


It's really cool that they did this, and it would be more useful if they'd share the code so we could see exactly how they made it work.


Interesting, has anyone seen the code/.jar?




