How Merkle Trees Enable the Decentralized Web (taravancil.com)
300 points by EGreg on Oct 24, 2017 | 52 comments


A similar concept (Merkle-DAG) is implemented by IPFS (ipfs.io)

"A Merkle-DAG is similar to a Merkle tree in that they both are essentially a tree of hashes. A Merkle tree connects transactions by sequence, but a Merkle-DAG connects transactions by hashes. In a Merkle-DAG, addresses are represented by a Merkle hash. This spider web of Merkle hashes links data addresses together by a Merkle graph. The directed acyclic graph (DAG) is used to model information. In our case, modeling what address has stored specific data. "

(https://www.cio.com/article/3174193/healthcare/from-medical-...)
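For what it's worth, a toy sketch of content addressing in a Merkle DAG (the node layout below is illustrative, not IPFS's actual block format):

    import hashlib, json

    store = {}  # content-addressed store: hash -> serialized node

    def put(data: str, links=()) -> str:
        """Store a node whose links are the hashes of its children."""
        node = json.dumps({"data": data, "links": list(links)}, sort_keys=True)
        addr = hashlib.sha256(node.encode()).hexdigest()
        store[addr] = node
        return addr

    leaf_a = put("chunk A")
    leaf_b = put("chunk B")
    root = put("directory", links=[leaf_a, leaf_b])

    # The root hash commits to the whole graph: changing any chunk would
    # change its hash, and therefore the root's hash as well.
    print(root)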


Why can't there be cycles in the Merkle graph?

That would be cool. You just have a partial order between children and parents, like in git.


This is a good paper that covers hashing graphs (including ones with cycles): "On Hashing Graphs" https://eprint.iacr.org/2012/352.pdf


Wow, very interesting.

Can you summarize how the actual algorithm is supposed to work, i.e. its main ideas?

I am surprised to find that using SHA hashes in a tree or DAG can leak information to people who shouldn't have it. Is this a serious flaw, and how do the authors ultimately solve the problem? I read the paper, but the algorithm is a bit hard to understand and follow.


Consider the simplest cycle, a node that points to itself. To create it, you must include the content hash of the document in the document itself.

To do that, you effectively need to find a fixed point of the hash function (about as hard as finding a preimage): you put a guessed content hash in the document, then keep modifying the rest of the document and hashing it until its hash matches the guess you embedded. It's not provably impossible, but with any decent hash function it would take an astronomically long time.
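A minimal sketch of that brute-force search (the document template and attempt cap are arbitrary):

    import hashlib

    # Brute-force attempt at a self-referencing document: embed a fixed
    # guessed digest, then vary the rest of the document, hoping the
    # document's real SHA-256 digest ever equals the embedded guess.
    GUESS = "0" * 64  # an arbitrary fixed guess for the embedded digest

    def search(max_attempts=1_000_000):
        for nonce in range(max_attempts):
            doc = f"my own hash is {GUESS}; nonce={nonce}"
            if hashlib.sha256(doc.encode()).hexdigest() == GUESS:
                return doc  # found a document containing its own hash
        return None  # the expected outcome: ~2^-256 odds per attempt

    print(search())  # prints None in any realistic run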


You would need to know the hash of a yet-to-exist node to create cycles on a Merkle graph :)


It is really not that different from the 'Grandfather Paradox': Merkle graphs are inherently 'causal sequences', and when combined with nodes as 'spaces of things' they define the evolving manifold of a 'world of things'. This seems to inform a sort of deep structure in our reality, manifesting in phenomena such as the Golden Section, etc. [Speculating at the end there, of course.]


To rephrase my comment:

Some disadvantages of the OP's method:

- Hashes are difficult for humans to read. URLs under a hierarchy share common patterns and carry some meaning; hashes all look the same and give no hint about the content.

- You have to copy and paste them; it's almost impossible to type one or to compare two visually similar strings.

- You may end up needing a link-shortening service for hashes, but you could already use link shortening to solve the portable file-hosting problem.

Merkle trees can solve some problems, but I don't think portable URLs are the right one.


I agree 100%. Also, Merkle trees would only benefit static content uploaded to the Internet. Updating any dynamic content would constantly generate a new root hash, meaning a new URL with each update to that content. It's just not a good option for anything other than static content.

One place I could see it being valuable is an online archiving service like Archive.org, where the content doesn't change except when a new snapshot is recorded to capture any changes.


The way this is done in Dat (via the hypercore module) is that the share key is a public key, and the Merkle tree is signed with the corresponding private key. Updates can stream in on the log, and only the parts of the tree that have changed need updating. The history is stored in the log, so you can retrieve any previous version.

https://github.com/mafintosh/hypercore
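A rough sketch of that signed-root idea (not hypercore's actual wire format; the tree construction and use of PyNaCl are assumptions):

    import hashlib
    from nacl.signing import SigningKey  # pip install pynacl

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        """Hash leaves, then pairwise-combine until one root remains."""
        level = [h(leaf) for leaf in leaves]
        while len(level) > 1:
            if len(level) % 2:  # duplicate the last node on odd levels
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    signing_key = SigningKey.generate()   # kept private by the writer
    share_key = signing_key.verify_key    # the "share key" readers use

    log = [b"entry 0", b"entry 1"]        # append-only log entries
    signed_root = signing_key.sign(merkle_root(log))

    # A reader who knows only the share key can check that the root
    # (and therefore every entry it commits to) came from the writer.
    share_key.verify(signed_root)

    log.append(b"entry 2")                # appending yields a new root,
    signed_root = signing_key.sign(merkle_root(log))  # which is signed again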


I actually really like the idea of updating new hashes and etc. While I don't disagree with your statement, I definitely feel it's a challenge worth looking into.

Why? Well, a lot of what I think is touted on the IPFS homepage, but to put it in my own words: I've become dismayed by how mutable the web is. It seems to entirely benefit those who seek to lie and disorient. Yet if something from an "honest person" leaks onto the internet (nude photos, credit card numbers, email, etc.), it's nearly impossible to remove it.

This has less to do with the technology and more to do with human nature. Facts are easy to mutate and spread misinformation about. Pages can be edited, blocked, DDoSed, etc. Yet when a small actor leaks information, e.g. if I accidentally upload my secret key, it's effectively permanent on the web. So I feel like the mutable web is all cost to the public, with no benefit.

I feel/hope that IPFS and IPNS can allow software to present a normal Reddit-like experience, where users never deal with weird hashes, but underlying it all is an immutable and entirely auditable paper trail. Information is key in this day and age, and if we can have an immutable web with no UX loss, I think it's a boon.

As it stands, people have identified trends among bad actors on the web, such as politicians botting Reddit to sway opinion, but the trail goes fuzzy quite quickly when the entirety of the bots' content can be deleted, mutated, etc.

Anyway, just sharing my thoughts / rambles.


I don't see how content addressing helps with leaks. If I have some stolen data I can distribute it with padding on the end so it has a different address faster than you can push out blacklists.


It doesn't - my point was that leaks (as in, anything of mine that gets out) already tend to be permanent on the web. Yet news sites, botters on social media, etc. all change their content with no paper trail for us to follow.

Anything I leak is basically permanent on the current web. If that's the case already, why would I want a mutable system in place where people can edit what they said, alter what was posted, alter votes, take down or DDoS content, etc.?


I see. Thank you for clarifying.


Exactly! Editing content is the elephant in the room, which the author didn't even address.


It was only a 10 minute talk, so I had to cut something out ;)

Blahah’s comment above may interest you:

https://news.ycombinator.com/item?id=15541392


Editing is easy, fwiw - content hashes are immutable and are not to be used for mutable content (obviously). That's why IPFS has IPNS - so we can have mutable pointers to data.


From a dumb layman's perspective (mine), it seems as though the addressing issue has a very simple solution: to refer to some arbitrary file hosted by some known third party (say, a video on YouTube), provide a link to it from your server and distribute that link (a link to a link) to others. When switching from YouTube to Vimeo, simply change the one link on your server and the link you distributed remains valid.

This also avoids the danger of hash collisions, although it is vulnerable to third-party hosts taking down the content (although again, there's not much cost to that. Just move the content elsewhere and change your redirect link).

With this method you do still have some server responsible for hosting the content rather than distributing that load, but that's actually a saving when looking at the storage load across the entire ecosystem.

It's a cool article and I really like the idea of content addressing instead of host addressing. I just feel like it's too divergent from the web as most people use it today (and I'm mildly concerned about malicious hash shenanigans), whereas the above method can be used and understood right now by many web users.
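For what it's worth, a minimal sketch of that redirect indirection (the port, path, and target URL are placeholders):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    TARGET = "https://www.youtube.com/watch?v=XXXXXXXXXXX"  # edit to re-point

    class Redirect(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/my-video":
                self.send_response(302)             # temporary redirect
                self.send_header("Location", TARGET)
                self.end_headers()
            else:
                self.send_error(404)

    # Visitors bookmark http://yourserver:8000/my-video; switching hosts
    # only means changing TARGET, not the distributed link.
    HTTPServer(("", 8000), Redirect).serve_forever()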


CDNs have made hosting pretty easy to scale from a low-powered controller; the hard problem is working out when to invalidate the cache. Which is easily solved by... another layer of Merkle trees, actually! It's Angela Merkels all the way down.


Or you can go one step further and use a third-party server to provide the link, since for many people it's a burden to maintain their own server. "Your server" is most likely somebody else's server anyway.

That would be like Google Voice for URLs. The OP's method is like using people's names as phone numbers.

Another problem with the OP's method is that hashes are difficult for humans to read and very difficult to verify by eye. There are also no common parts for files organized in the same hierarchy.


Hash collisions are really a non-issue. A collision in SHA-256 or larger spaces is less likely than being, say, struck by a meteorite twice in one day. When you get up to 384- and 512-bit hashes, you could content-address the entire Internet for thousands of years and likely never have a collision.


It’s not so much the Merkle trees that make this work as the idea of distributing the leaves around, so that you can’t efficiently corrupt the tree by replacing it. If it were all on one machine you could just fake the tree. According to the original Bitcoin paper, this idea came from Haber and Stornetta. (I have a personal theory that Stornetta is Satoshi... who doesn’t reference their own papers multiple times?!)


Can someone ELI5 what happens when I want to change a piece of data in a tree? What if I have a document, and I notice a typo, and I want to fix it?


With an append-only Merkle tree, you make 'edits' by appending a new piece of data that describes what changed relative to the older data. You also get the benefit of keeping a version history. I think Git works like this.
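A toy sketch of that append-only approach (the entry format is an assumption, and this is a hash chain rather than a full tree, but the append-only idea is the same):

    import hashlib, json

    def h(obj) -> str:
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

    # Append-only log of versions: each entry records the new content and
    # the hash of the previous entry, so history is tamper-evident.
    log = []

    def append_version(content: str) -> str:
        entry = {"content": content, "prev": h(log[-1]) if log else None}
        log.append(entry)
        return h(entry)  # the new head hash, i.e. the new "address"

    v1 = append_version("A documnet with a typo.")
    v2 = append_version("A document with the typo fixed.")  # the 'edit'

    # v1 is still addressable and unchanged; v2 chains back to it.
    assert log[1]["prev"] == v1 and v2 != v1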


Note that Git's commit graph isn't really a tree but a directed acyclic graph, in which nodes (commits) point to actual Merkle trees (trees/blobs).

(Also, Git keeps the full data, not only diffs.)


Interesting. I suppose deleting sensitive data is a more difficult challenge.


Not quite ELI5, but Blahah explained how Dat handles changes below:

https://news.ycombinator.com/item?id=15541392


Exactly. Then Merkle trees don't help you; they only work for static documents.


letlambda's explanation for IPFS is right. In Dat, that method is used to produce an append-only log out of a Merkle tree. Each time you add an entry, a leaf is added to the tree and the new head is signed. I wrote about it here [1] and have some GIFs demonstrating the process.

Merkle trees can work just fine for dynamic content. You just need to use a pubkey pointer.

1. https://beakerbrowser.com/2017/06/19/cryptographically-secur...


They certainly don't make that easier. But if the system supports a mutable link, you can have documents that always link to their own latest version.

IPFS does this by storing (mutable link, contentID) pairs in the DHT, which can be updated if you know the private key for the mutable link.
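A rough sketch of that mutable-pointer idea, with an in-memory dict standing in for the DHT (this is not IPFS's actual IPNS record format; PyNaCl is used as an illustrative signing library):

    import hashlib
    from nacl.signing import SigningKey, VerifyKey  # pip install pynacl

    records = {}  # stand-in for the DHT: pointer id -> signed content hash

    def publish(signing_key: SigningKey, content: bytes) -> str:
        content_id = hashlib.sha256(content).hexdigest()
        pointer_id = signing_key.verify_key.encode().hex()  # the mutable link
        records[pointer_id] = signing_key.sign(content_id.encode())
        return pointer_id

    def resolve(pointer_id: str) -> str:
        # Anyone can verify the record against the pointer's own public key.
        signed = records[pointer_id]
        return VerifyKey(bytes.fromhex(pointer_id)).verify(signed).decode()

    key = SigningKey.generate()
    link = publish(key, b"version 1")
    print(resolve(link))            # hash of version 1
    publish(key, b"version 2")      # same link, new target
    print(resolve(link))            # hash of version 2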


Good article, but there are a lot of claims of '100% certainty' that aren't necessarily true. The author even states that hash functions only guarantee no collisions to a high probability.


It looks like there are about as many Earth-like planets in the universe as grains of sand on the Earth. Write your name on a grain of sand on one of those planets. Now have someone else randomly pick a single grain of sand from some planet in the universe. How certain are you that they won't pick yours?

The chance of that happening is roughly equal to the chance of a collision randomly occurring somewhere in a few quadrillion SHA256 hashes.

https://crypto.stackexchange.com/questions/52261/birthday-at...

http://www.npr.org/sections/krulwich/2012/09/17/161096233/wh...

https://www.cnet.com/news/the-milky-way-is-flush-with-habita...


It's totally possible that reality is different from the theoretical limit. See the collision attacks against MD5 and SHA-0 by Wang Xiaoyun.

https://en.wikipedia.org/wiki/Wang_Xiaoyun


Yes, the chance that the hash will be broken is much higher than the chance of collisions occurring randomly. I'm just responding to "hash functions only guarantee no collisions to a high probability." People really underestimate how strong that probabilistic guarantee is.


Those are directed attacks not random collisions.


The article uses the context of content validation from untrusted sources lol


As long as the hash function remains unbroken, untrusted sources can't screw with you.

Hash functions tend to be broken gradually and publicly, and we migrate to new ones as they start to look shaky. It's theoretically possible for someone to privately break a function that everyone else thinks is secure, but it would be an extremely impressive achievement since lots of full-time cryptographers work on breaking these things and publish every little bit of progress they make.


Let's suppose everyone has moved to content addressing. Considering the amount of content generated every day, how much time would it take before real hash collisions started to emerge?


The number of 256-bit hashes you would have to generate in order to have a 50% chance that there's a duplicate in there is ~4*10^38. If we had a billion machines each generating a billion new hashes a second, it would take over 12 trillion years to get that many.
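A quick back-of-the-envelope check of that birthday-bound figure (constants approximate):

    import math

    BITS = 256
    # Birthday bound: ~sqrt(2 * ln 2 * 2^n) hashes for a 50% collision chance.
    n_for_50pct = math.sqrt(2 * math.log(2) * 2**BITS)
    print(f"{n_for_50pct:.2e} hashes")     # ~4.01e+38

    rate = 1e9 * 1e9                       # a billion machines * a billion hashes/s
    seconds = n_for_50pct / rate
    years = seconds / (60 * 60 * 24 * 365.25)
    print(f"{years:.2e} years")            # ~1.3e+13, i.e. roughly 13 trillion years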


This assumes computing power will remain constant; it won't.

Also, it's highly likely that flaws will be found in the hash algorithm long before 12 trillion years are up.


I don't assume computing power is constant, just that the rate of content generation is more or less constant (I think a billion devices publishing a billion new pieces of content every second is a pretty reasonable upper bound). OP asked about the odds of collisions occurring by accident due to the sheer volume of content generated and published, not about attacker scenarios.

Obviously we wouldn't use the same hash algorithm and setup for 12 trillion years, but the sheer absurdity of that length of time and that pace of content production shows this method will last, at least until flaws are found in SHA-256.


IPFS uses a multihash format, so that new hash functions can be introduced over time.
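For illustration, a multihash is just the digest prefixed by a code identifying the hash function and the digest length. A hand-rolled sketch (real code would use a multihash library):

    import hashlib

    def multihash_sha256(data: bytes) -> bytes:
        digest = hashlib.sha256(data).digest()
        # 0x12 = multihash code for sha2-256, 0x20 = digest length (32 bytes)
        return bytes([0x12, len(digest)]) + digest

    mh = multihash_sha256(b"hello world")
    print(mh.hex())  # starts with "1220...", so readers know how to verify it
    # Base58-encoding this value is what gives IPFS its familiar "Qm..." addresses.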


> 50% chance that there's a duplicate in there

I think even 0.01% would be enough to harm usability.


The 50% chance he's referring to is the chance that, out of all the hashes that exist, there will be two that match. He's saying that in his scenario it would be trillions of years before there's even a single duplicate.


So what you're telling me is I have a chance? :P


100% certainty is correct to integer rounding. It's correct to many significant figures too.


I think '100% certainty' should be reserved for claims that have been proven to always be true. I'm fine with saying 'near 100% certainty' lol


Video link of the talk if anyone wants: https://youtu.be/wGB5AYvFjxE?t=4h26m30s


> The Web is centralized,

It isn't.

> but why?

"The web" is a collection of independent networks providing a means to traverse hyperlinked text documents across any network that is addressable.

Not only are servers not a problem, host-based addressing is what makes the web work at all, as numerical addressing by itself would have killed the web's growth long ago, and users would have no reasonable way to address content.


You're ignoring the data silo problem. Of course the Web is decentralized in terms of accessing web pages, but the majority of data is not composed of pages, and the majority of actions on the Web are not hyperlink traversals. For the applications on the Web to be decentralized, the data and behaviors have to be decentralized as well, and they are not.


The majority of data is composed of pages, but the biggest uses of the web today are not. The web isn't the web anymore. It's a shitty application platform.

In this sense, the web works exactly how any application platform does. You can't get content out of old apps, or apps that no longer run, or on systems that no longer run. This isn't data siloing, this is just legacy applications.

With web apps the data is siloed, but it doesn't have to be, and it didn't use to be. The Internet Archive is proof. Getting the data out isn't that difficult, IF it's not hiding behind a web app. The difficult part is convincing people to stop writing applications that are siloed and prevent the easy access to content that the web used to provide.

Peer-to-peer networks are not a solution to incompatible legacy applications. It's like a vehicle that becomes immobile when it runs out of gas: instead of building the vehicle so it can still be used when it lacks power, they're changing the way the roads work. It's ridiculous.


You can't just wish that people would build apps a different way, though, right? You have to change the system somehow. So this is how we're doing it. No legacy support, but it does have interesting new use cases that can disrupt the legacy.



