I think this highlights the insight that IPFS is more of a "decentralized transfer" tool than a "decentralized storage" tool. Storage comes in as part of the caching, but the discovery of the required data pieces is the main advantage here too, as they mention that "With IPFS installed on each node, the traffic between the DataNodes (slaves) and the NameNode (master) is reduced."
For me, this understanding and shift in perception came after a few days of intense experimentation with IPFS, trying different use cases, etc. There are plenty of surprises if you just want to use IPFS as decentralized storage, which is how most introductions seem to approach it.
And it's definitely very cool & inspiring to find different use cases for IPFS :) For example the IPFS-backed docker registry is similarly cool https://github.com/jvassev/image2ipfs/ (though unfortunately seems abandoned or just in deep hibernation...)
If that's the case, you could also boost the replication factor of a given folder/file in HDFS and see a similar speedup. There are circumstances where IPFS is nicer by default, but it might not be a clear winner.
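For anyone who wants to try the replication comparison, here is a minimal sketch in Python that builds the `hdfs dfs -setrep` invocation. The path `/data/input` and the factor 5 are placeholders I made up, not values from the poster, and the helper `setrep_command` is just illustrative.

```python
def setrep_command(path, replicas, wait=False):
    """Build the argument list for `hdfs dfs -setrep [-w] <replicas> <path>`."""
    if replicas < 1:
        raise ValueError("replication factor must be at least 1")
    cmd = ["hdfs", "dfs", "-setrep"]
    if wait:
        # -w asks the command to block until re-replication completes
        cmd.append("-w")
    cmd += [str(replicas), path]
    return cmd


# Hypothetical input directory and replication factor
cmd = setrep_command("/data/input", 5, wait=True)
print(" ".join(cmd))
```

On a machine with the Hadoop client installed you could hand `cmd` to `subprocess.run(cmd, check=True)`, then rerun the same job to see how much of the reported gain comes from extra replicas alone.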
I think this is definitely a worthy experiment if you'd like to dig more into it. The poster (I believe) is very thin on details, so more eyes are better.
I can't seem to pull up any more info on this. The write-up really doesn't explain why IPFS would have performed better in this case. Is it because the data in question is replicated more times? Is it because the IPFS protocol is faster than HDFS?
Can't seem to find the connector code either. It's a little strange to go back to map reduce when presumably YARN is available and tez or spark could be used.
While I haven't looked for more info, I can assure you IPFS is a slower protocol than HDFS.
The existing IPFS implementation involves a lot of overhead in memory, CPU use, and latency [and as you can see in the MapReduce bar graph, it's slower], but overall, it improves performance when bandwidth is the bottleneck.
IPFS is pretty similar to BitTorrent, just more practical.
BitTorrent isn't "not practical". It is however less practical than IPFS.
With IPFS, you can effortlessly link from one tree to another, already existing tree.
In BitTorrent, you have to include a .torrent file, or an infohash in some file, but no software that I know of will easily follow that link.
This linking ability is an extremely useful property: it lets you cheaply create a copy of a Merkle tree with a subset of the data replaced. The resulting tree behaves identically to one created from scratch.
You also don't need to hold all the data to "patch" the tree, which I imagine is useful in this Hadoop filesystem.
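To make the linking and patching point concrete, here is a toy content-addressed tree in Python. It is only a sketch of the idea: the node layout, the JSON encoding, and the helper names (`put`, `put_file`, `put_dir`, `patch`) are my own assumptions and not IPFS's actual object format or API.

```python
import hashlib
import json


def put(store, node):
    """Store a node under the SHA-256 of its canonical JSON form; return the hash."""
    data = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    store[key] = node
    return key


def put_file(store, content):
    return put(store, {"type": "file", "content": content})


def put_dir(store, links):
    # links maps a child name to the hash of the child node
    return put(store, {"type": "dir", "links": links})


def patch(store, root, path, new_child):
    """Return the root hash of a tree equal to `root` except that `path`
    now links to `new_child`. Only directory nodes along `path` are read."""
    links = dict(store[root]["links"])
    head, rest = path[0], path[1:]
    if rest:
        links[head] = patch(store, links[head], rest, new_child)
    else:
        links[head] = new_child
    return put_dir(store, links)


# Build an original tree: /data/{a.txt,b.txt} and /docs/readme
store = {}
a = put_file(store, "alpha")
b = put_file(store, "beta")
docs = put_dir(store, {"readme": put_file(store, "hello")})
data = put_dir(store, {"a.txt": a, "b.txt": b})
root = put_dir(store, {"data": data, "docs": docs})

# Patch with a store that holds only the two directory nodes on the path,
# i.e. without any of the original file contents.
partial = {root: store[root], data: store[data]}
b2 = put_file(partial, "beta v2")
patched = patch(partial, root, ["data", "b.txt"], b2)

# Build the same result from scratch and compare root hashes.
scratch = {}
fresh = put_dir(scratch, {
    "data": put_dir(scratch, {"a.txt": put_file(scratch, "alpha"),
                              "b.txt": put_file(scratch, "beta v2")}),
    "docs": put_dir(scratch, {"readme": put_file(scratch, "hello")}),
})
print(patched == fresh)
```

The patched tree reuses the untouched `docs` subtree and `a.txt` by hash, so only the replaced file and the directory nodes above it are new.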
Unfortunately there is no real API for operating on torrents. You could say WebTorrent is pushing in the right direction to fix this, but there doesn't seem to be much adoption for it.
>> It's a little strange to go back to map reduce when presumably YARN is available and tez or spark could be used.
YARN just refactored the scheduling / resource management out of MapReduce. Hadoop 2+ still uses the term "MapReduce" for that particular application, and that application runs on a YARN cluster, just like Tez (I think) and Spark can. I don't see anything to indicate they're NOT using YARN.
That happened years ago. I haven't seen a Hadoop 1 cluster in the wild in a long time now. When most people think of "Hadoop" now, it's more about the ecosystem of "HDFS + some addons like Spark".
So I think this is pretty misleading. I could see how running the map phase against IPFS could be faster, because IPFS uses more advanced replication strategies than HDFS. But the reduce step would be really important to measure here, and my guess is that these numbers would be all over the place if you tried to use this out in the wild.