UUCP [https://en.wikipedia.org/wiki/UUCP] used the computers' modems to dial out to other computers, establishing temporary, point-to-point links between them. Each system in a UUCP network had a list of neighbor systems, with phone numbers, login names and passwords, etc.
FidoNet [https://en.wikipedia.org/wiki/FidoNet] was a very popular alternative to the internet in Russia well into the 1990s. It used temporary modem connections to exchange private (email) and public (forum) messages between the BBSes in the network.
In Russia, there was a somewhat eccentric, very outspoken enthusiast of upgrading FidoNet to use web protocols and capabilities. Apparently, he's still active in developing "Fido 2.0": https://github.com/Mithgol
Usenet back then was spam-free and you could usually end up talking to the creators of whatever you were discussing. I rather miss it.
Quite a few tech companies used private newsgroups for support, so you'd dial into those separately. As they were often techie to techie they worked rather well.
I first came across Usenet and uucp via the Amiga Developer programme. Amicron and uucp overnight all seemed a bit magic back in '87 compared to dialing into non-networked BBSes to browse, very, very slowly!
I still use usenet! It's not quite what it used to be, but you should check it out.
you need the www
For those of us raised in the Soviet Union, it was an eye-opening experience that you could freely exchange messages with people around the globe.
These things are definitely systems to learn from, both their architectures and their histories; and people have already been drawing parallels to Usenet on this very page, notice.
Your email address might be george@cmu!vax!something!mitre!foo
Which meant: route the email through foo -> mitre -> something -> vax -> cmu.
Sysadmins would often keep tables of known routes, while people would describe their route relative to commonly known routing hosts.
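To make the mechanics concrete, here's a toy sketch in Python of how a node handled a bang path (the classic form was host!host!...!user; the hosts below are just the ones from the example above, not a real route):

    # Toy sketch: each UUCP hop strips its neighbor off the front of the
    # bang path and relays the remainder over a dial-up connection.
    def next_hop(path: str):
        hop, _, rest = path.partition("!")
        return hop, rest

    path = "vax!something!mitre!foo!george"
    while "!" in path:
        hop, path = next_hop(path)
        print("dial", hop, "and hand it:", path)
    # when only "george" remains, the message is delivered locally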
We were so living in the cyberpunk future then.
And we liked it!
(Just guessing, but I wouldn't be surprised if it happened to be true).
I tried to switch what server my account was on halfway through my GNU Social life, and you just can't; all your followers are on the old server, all your tweets, and there is no way to say "I'm still the same person". I didn't realise I wanted cryptographic identity and accounts until I tried to actually use the alternative.
That's also part of the interest I have in something like Urbit, which has an identity system centered on public keys forming a web of trust, which also lets you have a reputation system and ban spammers which you can't do easily with a pure DHT.
The challenges that I see:
- Making it easy for any user to get up and running.
- De-authenticating old devices.
- Making it available from any mobile device.
Ideally, the end solution would be dead simple. Download the Windows app, run it, put your credit card in if you need a URL registered, and it does everything, including daily backups to a folder on your disk.
This is not a damn cloud! This is a remote computer you can access over the internet. Can we stop with this use of marketing lingo, please?
"Host" is generally less ambiguous, referring to a specific thing given the context of the discussion. It's a pronoun for machines (kinda).
The purpose of the original coinage of the word "cloud" was to obfuscate that you really meant "someone else's computer". It gives a nice warm, fuzzy decentralised impression - clouds are natural and ubiquitous! No one owns them! If it's in "the cloud" (note the definite article) then it's safe in the very fabric of the network, right?
Nope. It's in Larry and Sergey's basement. Not decentralised at all. Just somewhere else.
The proper term is "server", "datacenter", or "network", depending on what you're actually trying not to say.
What percentage of users do you think would be affected by such cases? If it's something over 0.001%, it's a huge problem for a social network.
Sites like Coinbase and Github exist because they re-centralize distributed systems — users don't trust themselves to host their own data securely.
Alternatively, if this isn't a problem, why don't users simply host all their own infrastructure for existing tech problems today?
I'm sure someone capable of living in the Mojave Desert is capable of hosting their own infrastructure - is this network simply for those people, or is it also for journalists, trans people, and HR professionals?
Very few people have considered whether or not they should attempt to back up their Facebook account. Same's true for Flickr, Twitter, and Gmail.
I'm probably in the minority being so irresponsible with my own backups, but I'm not alone.
Google and Facebook have a lot on the line with regard to user trust of their reliability. Also, they can't monetize data that they've lost.
I'd rather have my own backup copies and take responsibility myself. The scenario you evoke here would not happen if you had a proper backup strategy; two is one and one is none.
Compare that to their experience at home with personal gear. Many like the convenience and reliability of Facebook over their own technical skills or efforts. You'd have to convince those people... a shitload of people... that they should start handling IT on their own. Also note that there's many good, smart, interesting, and so on people that simply don't do tech. Anyone filtering non-technical or procrastinating people in a service will be throwing out lots of folks whose company they might otherwise enjoy.
So, these kinds of issues are worth exploring when trying to build a better social network.
Same with backups, lose your data to drive failure or theft once and suddenly having a backup strategy becomes a priority.
But as long as they have not been bitten once they don't care enough to actually do something proactive.
It seems like the system would work just as well for people who decide to turn their system off when they go to work, or are on a sailboat. Of course it's not convenient in the same way that always-on social networks are, but that seems to be specifically not the point of SSB.
None of these things are a concern on traditional social networks. They have to be solved before the world has any chance of moving to a decentralized network.
I say no. Most of them don't have to be solved first.
Feel free to convince me that 3 days of downtime on my personal messaging account is a problem.
These are people who typed "Facebook Login" into a Google search, clicked the first result without reading, and got confused. Now tell these same users that Comcast blocked their social network or that they can't log in on their phone because their home Internet connection is down.
If you want a social network filled with just people like you and me, look at App.net or GNU Social for inspiration. If you want average users to sign in, these issues absolutely do have to be solved.
The early web (when I entered and before) wasn't for everyone. And that's OK with me. Actually I think it is a good way to start.
It got where it went by doing the opposite of what you're suggesting. The smart elites building walled gardens were mostly working on OSI, from what old-timers tell me. TCP/IP, SMTP, etc. involved lots of hackers trying to avoid doing too much work, much like the users you'd prefer to filter out. Then it just went from there, getting bigger and bigger due to the low barrier of entry. Tons of economic benefits and business models followed. Now we're talking to each other on it.
If your internet connection goes down, your power goes out, or you get DDoSed, that would hinder your ability to use any third-party online service anyway.
Data caps and restrictive ISP terms of service are a different problem, one that would be challenged and fixed if internet subscribers went the p2p self-hosting way. The commercial ISP situation is a terrible mess right now.
If you get hacked: unplug from the network, boot from recovery, restore from backup. You're back online in less time than it takes to recover a hacked Facebook account.
You say decentralized but it seems to me you meant distributed here.
One data center * 24 backup generators * 3 MW each = 72 MW per data center. Four of those = 288 MW > 1/4GW.
Google has more than three data centers.
Assuming that they're doubling energy consumption every year, they'd have reached 8GW in 2016. That's 8W per user if we assume 1 billion users. The energy usage of a Raspberry Pi is not insignificant relative to even this.
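For what it's worth, the arithmetic spelled out (the 8 GW figure is an extrapolation from the comments above, not a measured number):

    # back-of-the-envelope, numbers from the comments above
    per_dc_mw = 24 * 3        # 24 generators x 3 MW = 72 MW per data center
    four_dcs_mw = 4 * 72      # 288 MW, i.e. more than a quarter gigawatt
    per_user_w = 8e9 / 1e9    # 8 GW over ~1 billion users = 8 W per user
    # a Raspberry Pi draws a few watts at idle, so per-user it's the
    # same order of magnitude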
Doing things at scale is vastly more efficient. And only a subset of Google services can be relegated to a Raspberry Pi. Even if you host your own mail, are you ready to ditch the Google search index and Youtube?
And you just use it for yourself? Well, OK... but at that point you could also use a fully decent system.
Sounds interesting. How can you ban spammers when they can just create a new public key/identity if their old one is banned? And also, what does "banning" comprise, in a decentralized social network? I would assume it would be sufficient to just "unfollow" that particular identity.
I'd rather be able to email everybody and have it be annoying to switch than be able to only email people on my chosen provider (and then have to make an account on every service anyway).
This is a common misunderstanding. You do not need to use those nodes to bootstrap. Most clients simply choose to because it is the most convenient way to do so on the given substrate (the internet). DHTs are in no way limited to specific bootstrap nodes, any node that can be contacted can be used to join the network, the protocol itself is truly distributed.
If the underlying network provides some hop-limited multicast or anycast a DHT could easily bootstrap via such queries. In fact, bittorrent clients already implement multicast neighbor discovery which under some circumstances can result in joining the DHT without any hardcoded bootstrap node.
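Roughly, LAN-local discovery looks like this (a generic sketch in Python; the multicast group and message format are made up for illustration, not the actual BEP 14 wire format):

    import socket, struct

    GROUP, PORT = "239.192.0.1", 6771  # illustrative multicast group

    def announce(infohash_hex: str):
        # shout the infohash we're interested in to the local network
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        s.sendto(b"SEARCH " + infohash_hex.encode(), (GROUP, PORT))

    def listen():
        # anyone who answers is a live peer we can DHT-bootstrap from
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            data, (ip, _) = s.recvfrom(1500)
            print("neighbor:", ip, data)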
The multicast neighbor discovery is a neat idea. I wonder what percentage of clients/connections it results in successful bootstrapping for.
You could also run your own bootstrap node on an always-up server if downtime making the lists stale is a concern.
You can also inject contacts when starting the client, you would have to obtain them out-of-band from somewhere of course, but it still does not require anything centralized.
If you're desperate you could also just sweep allocated IPv4 blocks and DHT-ping port 6881, you'll probably find one relatively fast. Of course that doesn't work with v6.
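A DHT ping is cheap enough that sweeping really is plausible. Here's what one looks like on the wire, hand-rolled in Python (the transaction ID and node ID are arbitrary; this is a liveness probe, not a full client):

    import os, socket

    def dht_ping(ip: str, port: int = 6881, timeout: float = 2.0) -> bool:
        my_id = os.urandom(20)
        # bencoded {"t": "aa", "y": "q", "q": "ping", "a": {"id": my_id}}
        msg = b"d1:ad2:id20:" + my_id + b"e1:q4:ping1:t2:aa1:y1:qe"
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        s.sendto(msg, (ip, port))
        try:
            data, _ = s.recvfrom(1500)
            return data.startswith(b"d")  # any bencoded reply = live node
        except socket.timeout:
            return False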
So there is no centralization and no single point of failure.
> The multicast neighbor discovery is a neat idea. I wonder what percentage of clients/connections it results in successful bootstrapping for.
It could work on a college campus, some conference network or occasionally some open wifi. Additionally there are some corporate bittorrent deployments where peer discovery via multicast can make sense.
If I understand TFA correctly, scuttlebutt assumes(?) roaming through wifis and LANs. Those circumstances are ideal for multicast bootstrapping, so in principle the DHT can perform just as well as scuttlebutt, probably even better, because once it has bootstrapped it can use the global DHT to keep contact with the network even if there is no lan-local peer to be discovered.
There is no semantic difference between the two. The only difference is when you connect to the single-point-of-truth bootstrap, at download time (well, technically build-time) or at first startup time. And the latter probably gives you a more current, and not limited to long-lived nodes, thus better, answer.
> You could also run your own bootstrap node on an always-up server if downtimes making the lists stale is a concern.
Which itself needs to be bootstrapped. And once it is, it's equivalent to your local cache.
Possibly, which mechanisms are used varies from client to client. Usually DHT bootstrap is not a primary goal but a side-effect of other mechanisms. Things that work in some clients:
magnet -> tracker -> peer -> dht ping
torrent -> tracker -> peer -> dht ping
magnet -> contains direct peer -> peer -> dht ping
torrent or magnet -> multicast discovery -> peer -> dht ping
torrent -> contains a list of dht node ip/port pairs
I believe this is how bitcoin works. Or at least it used to.
are there any noteworthy resources for non-academics to get started?
But a DHT is usually just a low-level building block in more complex p2p systems. As its name says it's simply a distributed hash table. A data structure on a network. It just gives you a distributed key-value pair store where the values are often required to be small. In itself it doesn't give you trust, two-way communication, discovery or anything like that. Those are often either tacked on as ad-hoc features, handled by separate protocols or require some tricky cryptography.
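The "routing algorithm" part boils down to a distance metric over the key space. A minimal sketch of the Kademlia flavor (SHA-1 IDs, XOR distance; this shows the idea only, not a full implementation):

    import hashlib

    def make_id(name: bytes) -> int:
        # node IDs and keys live in the same 160-bit space
        return int.from_bytes(hashlib.sha1(name).digest(), "big")

    def closest(key: int, nodes, k=3):
        # the k nodes "responsible" for a key are those at the
        # smallest XOR distance from it
        return sorted(nodes, key=lambda n: n ^ key)[:k]

    nodes = [make_id(b"peer%d" % i) for i in range(100)]
    print(closest(make_id(b"some value"), nodes))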
The last two are particularly devastating. Even if the peers had a key/value whitelist and hashes (e.g. like a .torrent file), an adversary can still insert itself into the routing tables of honest nodes and prevent peers from ever discovering your key/value pairs. Moreover, they can easily spy on everyone who tries to access them. It is estimated that 300,000 of the BitTorrent DHT's nodes are Sybils, for example.
BEP42 has been implemented by many clients and yet nobody has felt the need to actually switch to enforcement mode.
All that is the result of the bittorrent DHT being a low-value target. It does not contain any juicy information and is just one of multiple peer discovery mechanisms, so there's some redundancy too.
If I'm "in" on the sharing, then I learn the IP addresses (and ISPs and proximate locations) of the other people downloading the shared file. Moreover, if I control the right hash buckets in the DHT's key space, I can learn from routing queries who's looking for the content (even if they haven't begun to share it yet). Encryption alone does not make file-sharing a private affair.
> BEP42 has been implemented by many clients and yet nobody has felt the need to actually switching to enforcement mode.
It also does not appear to solve the problem. The attacker only needs to get control of hash buckets to launch routing attacks. Even with a small number of unchanging node IDs, the attacker is still free to insert a pathological sequence of key/value pairs to bump hash buckets from other nodes to them.
> All that is the result of the bittorrent DHT being a low-value target. It does not contain any juicy information and is just one of multiple peer discovery mechanisms, so there's some redundancy too.
Are you suggesting that high-value apps should not rely on a DHT, then?
Someone who is "in" on encrypted content can observe the swarm anyway, and thus gains very little from snooping on the DHT. On the other hand a passive DHT observer who is not "in" will be hampered by not knowing what content is shared; he only sees participation in opaque hashes. Additionally payload encryption adds deniability because anyone can transfer the ciphertext but participants won't know whether others have the necessary keys to decrypt it.
What I'm saying is that any information leakage via the DHT (compared to public trackers and PEX) is quite small, and this small loss can be more than made up by adding payload encryption.
> the attacker is still free to insert a pathological sequence of key/value pairs to bump hash buckets from other nodes to them.
There is no bumping in kademlia with unbounded node storage. And clients with limited storage can make bumping very hard for others with oldest-first and one-per-subnet policies, i.e. bumping the attackers instead of genuine keys.
> Are you suggesting that high-value apps should not rely on a DHT, then?
No, they should use DHT as a bootstrap mechanism of easy-to-replicate, difficult-to-disrupt small bits of information (e.g. peer contacts as in bittorrent) which then run their own content-specific gossip network for the critical content. In some contexts it can also make sense to make reverse lookups difficult, so attackers won't know what to disrupt unless they're already part of some group.
I can see that this thread is getting specific to Bittorrent, and away from DHTs in general. Regardless, I'm not sure if this is the case. Please correct me if I'm wrong:
* If I can watch requests on even a single copy of a single key/value pair in the DHT, I can learn some of the IP addresses asking for it (and when they ask for it).
* If I can watch requests on all copies of the key/value pair, then I can learn all the interested IP addresses and the times when they ask.
* If I can do this for the key/value pairs that make up a .torrent file, then I can (1) get the entire .torrent file and learn the list of file hashes, and (2) find out the IPs who are interested in the .torrent file.
* If I can then observe any of the key/value pairs for the .torrent file hashes, then I can learn which IPs are interested in and can serve the encrypted data (and the times at which they do so).
This does not strike me as "quite small," but that's semantics.
> There is no bumping in kademlia with unbounded node storage. And clients with limited storage can make bumping very hard for others with oldest-first and one-per-subnet policies, i.e. bumping the attackers instead of genuine keys.
Yes, the DHT nodes can employ heuristics to try to stop this, just like how BEP42 is a heuristic to thwart Sybils. But that's not the same as solving the problem. Applications that need to be reliable have to be aware of these limits, and anticipate them in their design.
> No, they should use DHT as a bootstrap mechanism of easy-to-replicate, difficult-to-disrupt small bits of information (e.g. peer contacts as in bittorrent) which then run their own content-specific gossip network for the critical content. In some contexts it can also make sense to make reverse lookups difficult, so attackers won't know what to disrupt unless they're already part of some group.
This kind of proves my point. You're recommending that applications not rely on DHTs, but instead use their own content-specific gossip network.
To be fair, I'm perfectly okay with using DHTs as one of a family of solutions for addressing one-off or non-critical storage problems (like bootstrapping). But the point I'm trying to make is that they're not good for much else, and developers need to be aware of these limits if they want to use a DHT for anything.
It is quite small because bittorrent needs to use some peer source. If you're not using the DHT you're using a tracker. The same information that can be obtained from the DHT can be obtained from trackers. So there's no novel information leakage introduced by the DHT.
That's why the DHT does not really pose a big information leak.
> This kind of proves my point. You're recommending that applications not rely on DHTs, but instead use their own content-specific gossip network.
That's not what I said. Relying on a DHT for some parts, such as bootstrap and discovery is still... well... relying on it, for things it is good at.
> But the point I'm trying to make is that they're not good for much else, and developers need to be aware of these limits if they want to use a DHT for anything.
Well yes, but these limits arise naturally anyway since A stores data for B on C and you can't really incentivize C to manage anything more than small bits of data.
> I can see that this thread is getting specific to Bittorrent
About DHTs in general: you can easily make reverse lookups difficult or impossible by hashing the keys (bittorrent doesn't because the inputs already are hashes), you can obfuscate lookups by making them somewhat off-target until they're close to the target, and you can make data-lookups and maintenance lookups indistinguishable. You can further add plausible deniability by replaying recently-seen lookups when doing maintenance of nearby buckets.
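The "hash the keys" point in code form, for illustration (assuming the application-level key is something meaningful, like a username/path; the example key is made up):

    import hashlib

    def dht_key(app_key: bytes) -> bytes:
        # what goes on the wire is only the digest; an observer who
        # doesn't already know "alice/status" can't reverse it
        return hashlib.sha256(app_key).digest()

    print(dht_key(b"alice/status").hex())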
Replacing a tracker with a DHT trades having one server with all peer and chunk knowledge with N servers with partial peer and chunk knowledge. If the goal is to stop unwanted eavesdroppers, then the choice is between (1) trusting that a single server that knows everything will not divulge information, or (2) trusting that an unknown, dynamic number of servers that anyone can run (including the unwanted eavesdroppers) will not divulge partial information.
The paper I linked up the thread indicates that unwanted eavesdroppers can learn a lot about the peers with choice (2) by exploiting the ways DHTs operate. Heuristics can slow this down, but not stop it. With choice (1), it is possible to fully stop unwanted eavesdroppers if peers can trust the tracker and communicate with it confidentially. There is no such possibility with choice (2) if the eavesdropper can run DHT nodes.
> That's not what I said. Relying on a DHT for some parts, such as bootstrap and discovery is still... well... relying on it, for things it is good at.
> Well yes, but these limits arise naturally anyway since A stores data for B on C and you can't really incentivize C to manage anything more than small bits of data.
Thank you for clarifying. Would you agree that reliable bootstrapping and reliable steady-state behavior are two separate concerns in the application? I'm mainly concerned with the latter; I would never make an application's steady-state behavior dependent on a DHT's ability to keep data available. In addition, bootstrapping information like initial peers and network settings can be obtained through other channels (e.g. DNS servers, user-given configuration, multicasting), which further decreases the need to rely on DHTs.
> About DHTs in general, you can easily make reverse lookups difficult or impossible by hashing the keys (bittorrent doesn't because the inputs already are hashes), you can obfuscate lookups by making them somewhat off-target until they're close to the target and making data-lookups and maintenance lookups indistinguishable. You can further add plausible deniability by by replaying recently-seeing lookups when doing maintenance of nearby buckets.
I'm not quite sure what you're saying here, but it sounds like you're saying that a peer can obfuscate lookups by adding "noise" (e.g. doing additional, unnecessary lookups). If so, then my reply would be this only increases the number of samples an eavesdropper needs to make to unmask a peer. To truly stop an eavesdropper, a peer needs to ensure that queries are uniformly distributed in both space and time. This would significantly slow down the peer's queries and consume a lot of network bandwidth, but it would stop the eavesdropper. I don't know of any production system that does this.
In practice trackers do divulge all the same information that can be gleaned from the DHT and so does PEX in a bittorrent swarm. Those are far more convenient to harvest.
> I'm not quite sure what you're saying here, but it sounds like you're saying that a peer can obfuscate lookups by adding "noise" (e.g. doing additional, unnecessary lookups).
That's only 2 of 4 measures I have listed. And I would mention encryption again as a 5th. The others: a) Opportunistically creating decoys by having others repeat lookups they have recently seen as part of their routing table maintenance b) storing data in the DHT in a way that requires some prior knowledge to be useful, which ideally results in information leaking only when the listener could have obtained it anyway through that prior knowledge.
There's a lot you can do to harden DHTs. I agree that naive implementations are trivial to attack, but to my knowledge it is possible to achieve byzantine fault tolerance in a DHT in principle, it's just that nobody has actually needed that level of defense yet, attacks in the wild tend to be fairly primitive and only succeed because some implementations are very sloppy about sanitizing things.
> To truly stop an eavesdropper, a peer needs to ensure that queries are uniformly distributed in both space and time.
Not quite. You only need to increase the number of samples needed beyond the number of samples a peer is likely to generate during some lifecycle, and that is not just done by adding more traffic.
> Would you agree that reliable bootstrapping and reliable stead-state behavior are two separate concerns in the application?
Certainly, but bootstrapping is a task that you do more frequently than you think. You don't just join a global overlay once, you also (re)join many sub-networks throughout each session or look for specific nodes. DHT is a bit like DNS. You only need it once a day for a domain (assuming long TTLs), and it's not exactly the most secure protocol, and afterwards you do the heavy authentication lifting with TLS, but DNS is still important, even if you're not spending lots of traffic on it.
I'm confused. I can configure a tracker to only communicate with trusted peers, and do so over a confidential channel. The tracker is assumed to not leak peer information to external parties. A DHT can do neither of these.
> That's only 2 of 4 measures I have listed. And I would mention encryption again as a 5th. The others: a) Opportunistically creating decoys by having others repeat lookups they have recently seen as part of their routing table maintenance b) storing data in the DHT in a way that requires some prior knowledge to be useful, which will ideally result in the only leaking information when the listener could obtain the information anyway if he has that prior knowledge.
Unless the externally-observed schedule of key/value requests is statistically random in time and space, the eavesdropper can learn with better-than-random guessing which peers ask for which chunks. Neither (a) nor (b) address this; they simply increase the number of samples required.
> There's a lot you can do to harden DHTs. I agree that naive implementations are trivial to attack, but to my knowledge it is possible to achieve byzantine fault tolerance in a DHT in principle, it's just that nobody has actually needed that level of defense yet, attacks in the wild tend to be fairly primitive and only succeed because some implementations are very sloppy about sanitizing things.
First, no system can tolerate Byzantine faults if over a third of its nodes are hostile. If I can Sybil a DHT, then I can spin up arbitrarily many evil nodes. Are we assuming that no more than one third of the DHT's nodes are evil?
Second, "nobody has actually needed that level of defense yet" does not mean that it is a sound decision for an application to use a DHT with the expectation that the problems will never occur. So the maxim goes, "it isn't a problem, until it is." As an application developer, I want to be prepared for what happens when it is a problem, especially since the problems are known to exist and feasible to exacerbate.
> Not quite. You only need to increase the number of samples needed beyond the number of samples a peer is likely to generate during some lifecycle, and that is not just done by adding more traffic.
I'm assuming that peers are arbitrarily long-lived. Real-world distributed systems like BitTorrent and Bitcoin aspire to this.
> Certainly, but bootstrapping is a task that you do more frequently than you think. You don't just join a global overlay once, you also (re)join many sub-networks throughout each session or look for specific nodes. DHT is a bit like DNS. You only need it once a day for a domain (assuming long TTLs), and it's not exactly the most secure protocol and afterwards you do the heavy authentication lifting with TLS, but DNS is still important, even if it you're not spending lots of traffic on it.
I take issue with saying that "DHTs are like DNS", because they offer fundamentally different data consistency guarantees and availability guarantees (even Beehive (DNS over DHTs) is vulnerable to DHT attacks that do not affect DNS).
Regardless, I'm okay with using a DHT as one of many supported bootstrapping mechanisms. I'm not okay with using it as the sole mechanism or even the primary mechanism, since they're so easy to break when compared to other mechanisms.
But then you are running a private tracker for personal/closed group use and have a trust source. If you have a trust source you could also run a closed DHT. But the bittorrent DHT is public infrastructure and best compared to public trackers.
> I'm assuming that peers are arbitrarily long-lived. Real-world distributed systems like BitTorrent and Bitcoin aspire to this.
Physical machines are. Their identities (node IDs, IP addresses) and the content they participate in at any given time don't need to be.
> If I can Sybil a DHT, then I can spin up arbitrarily many evil nodes.
This can be made costly. In the extreme case you could require a bitcoin-like proof of work system for node identities. But that would be wasteful... unless you're running some coin network anyway, then you can tie your ID generation to that. In lower-value targets IP prefixes tend to be costly enough to thwart attackers. If an attacker can muster the resources to beat that he would also have enough unique machines at his disposal to perform a DoS on more centralized things.
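A proof-of-work identity scheme could be as simple as this sketch (the difficulty and the ID construction are made up for illustration):

    import hashlib, itertools

    BITS = 20  # illustrative difficulty: ~1M hashes per identity

    def mint_id(pubkey: bytes):
        # grind a nonce until H(pubkey || nonce) has BITS leading zeros;
        # the digest becomes the node ID, the nonce is the proof
        for nonce in itertools.count():
            h = hashlib.sha256(pubkey + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(h, "big") >> (256 - BITS) == 0:
                return h, nonce

    def check_id(pubkey: bytes, node_id: bytes, nonce: int) -> bool:
        h = hashlib.sha256(pubkey + nonce.to_bytes(8, "big")).digest()
        return h == node_id and int.from_bytes(h, "big") >> (256 - BITS) == 0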
> Are we assuming that no more than one third of the DHT's nodes are evil?
Assuming is the wrong word. I think approaching BFT is simply part of what you do to harden a DHT against attackers.
> Second, "nobody has actually needed that level of defense yet" does not mean that it is a sound decision for an application to use a DHT with the expectation that the problems will never occur.
I haven't said that. I'm saying that simply because this kind of defense was not yet needed, nobody tried to build it, as simple as that. Sophisticated security comes with implementation complexity; that's why we had HTTP for ages before HTTPS adoption was spurred by the Snowden leaks.
> Neither (a) nor (b) address this; they simply increase the number of samples required.
(b) is orthogonal to sampling vs. noise.
> I'm not okay with using it as the sole mechanism or even the primary mechanism, since they're so easy to break when compared to other mechanisms.
What other mechanisms do you have in mind? Most that I am aware of don't offer the same O(log n) node-state and lookup complexity in a distributed manner.
You're ignoring the fact that with a public DHT, the eavesdropper has the power to reroute requests through networks (s)he can already watch. With a public tracker, the eavesdropper needs vantage points in the tracker's network to gain the same insights.
If we're going to do an apples-to-apples comparison between a public tracker and a public DHT, then I'd argue that they are equivalent only if:
(1) the eavesdropper cannot add or remove nodes in the DHT;
(2) the eavesdropper cannot influence other nodes' routing tables in a non-random way.
> This can be made costly. In the extreme case you could require a bitcoin-like proof of work system for node identities. But that would be wasteful... unless you're running some coin network anyway, then you can tie your ID generation to that. In lower-value targets IP prefixes tend to be costly enough to thwart attackers. If an attacker can muster the resources to beat that he would also have enough unique machines at his disposal to perform a DoS on more centralized things.
Funny you should mention this. At the company I work part-time for (blockstack.org), we thought of doing this very thing back when the system still used a DHT for storing routing information.
We had the additional advantage of having a content whitelist: each DHT key was the hash of its value, and each key was written to the blockchain. Blockstack ensured that each node calculated the same whitelist. This meant that inserting a key/value pair required a transaction, and the number of key/value pairs could grow no faster than the blockchain.
This was not enough to address data availability problems. First, the attacker would still have the power to push hash buckets onto attacker-controlled nodes (it would just be expensive). Second, the attacker could still join the DHT and censor individual routes by inserting itself as neighbors of the target key/value pair replicas.
The best solution we came up with was one whereby DHT node IDs would be derived from block headers (i.e. deterministic but unpredictable), and registering a new DHT node would require an expensive transaction with an ongoing proof-of-burn to keep it. In addition, our solution would have required that every K blocks, the DHT nodes deterministically re-shuffle their hash buckets among themselves in order to throw off any encroaching routing attacks.
We ultimately did not do this, however, because having the set of whitelisted keys growing at a fixed rate afforded a much more reliable solution: have each node host a 100% replica of the routing information, and have nodes arrange themselves into a K-regular graph where each node selects neighbors via a random walk and replicates missing routing information in rarest-first order. We have published details on this here: https://blog.blockstack.org/blockstack-core-v0-14-0-release-....
> Assuming is the wrong word. I think approaching BFT is simply part of what you do to harden a DHT against attackers.
If you go for BFT, you have to assume that no more than f of 3f+1 nodes are faulty. Otherwise, the malicious nodes will always be able to prevent the honest nodes from reaching agreement.
> I haven't said that. I'm saying that simply because this kind of defense was not yet needed nobody tried to build it, as simple as that. Sophisticated security comes with implementation complexity, that's why we had HTTP for ages before HTTPS adoption was spurred by the snowden leaks.
Right. HTTP's lack of security wasn't considered a problem, until it was. Websites addressed this by rolling out HTTPS in droves. I'm saying that in the distributed systems space, DHTs are the new HTTP.
> What other mechanisms do you have in mind? Most that I am aware of don't offer the same O(log n) node-state and lookup complexity in a distributed manner.
How about an ensemble of bootstrapping mechanisms?
* give the node a set of initial hard-coded neighbors, and maintain those neighbors yourself.
* have the node connect to an IRC channel you maintain and ask an IRC bot for some initial neighbors.
* have the node request a signed file from one of a set of mirrors that contains a list of neighbors.
* run a DNS server that lists currently known-healthy neighbors.
* maintain a global public node directory and ship it with the node download.
I'd try all of these things before using a DHT.
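Something like this, where each mechanism is just another peer source to fall through (the hostnames, port, and ordering here are all hypothetical):

    import socket

    HARDCODED = [("node1.example.org", 4001), ("node2.example.org", 4001)]

    def from_dns(name="bootstrap.example.org", port=4001):
        # a DNS name round-robinning over currently healthy neighbors
        try:
            infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
            return [(info[4][0], port) for info in infos]
        except socket.gaierror:
            return []

    def from_hardcoded():
        return HARDCODED

    def bootstrap():
        for source in (from_dns, from_hardcoded):
            peers = source()
            if peers:
                return peers
        raise RuntimeError("no bootstrap source reachable")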
But in the context of bittorrent that is not necessary if we're still talking about information leakage. The tracker + pex gives you the same, and more, information than watching the DHT.
> we thought of doing this very thing back when the system still used a DHT for storing routing information.
The approaches you list seem quite reasonable when you have a PoW system at your disposal.
> have each node host a 100% replica of the routing information, and have nodes arrange themselves into a K-regular graph
This is usually considered too expensive in the context of non-coin/-blockchain p2p networks because you want nodes to be able to run on embedded and other resource-constrained devices. The O(log n) node state and bootstrap cost limits are quite important. Otherwise it would be akin to asking every mobile phone to keep up to date with the full BGP route set.
> assume that no more than f of 3f+1 nodes are faulty. Otherwise, the malicious nodes will always be able to prevent the honest nodes from reaching agreement.
Of course, but for some applications that is more than good enough. If your adversary can bring enough resources to bear to take over 1/3rd of your network he might as well DoS any target he wants. So you would be facing massive disruption anyway. I mean blockchains lose some of their security guarantees too once someone manages to dominate 1/2 of the mining capacity. Same order of magnitude. It's basically the design domain "secure, up to point X".
> I'm saying that in the distributed systems space, DHTs are the new HTTP.
I can agree with that, but I think the S can be tacked on once people feel the need.
> How about an ensemble of bootstrapping mechanisms?
The things you list don't really replace the purpose of a DHT. A DHT is a key-value store for many keys and a routing algorithm to find them in a distributed environment. What you listed just gives you a bunch of nodes, but no data lookup capabilities. Essentially you're listing things that could be used to bootstrap into a DHT, not replacements for the next-layer services provided by a DHT.
Funny you should mention BGP. We have been approached by researchers at Princeton who are interested in doing something like that, using Blockstack (but to be fair, they're more interested in giving each home router a copy of the global BGP state).
I totally hear you regarding the costly bootstrapping. In Blockstack, for example, we expect most nodes to sync up using a recent signed snapshot of the node state and then use SPV headers to download the most recent transactions. It's a difference between minutes and days for booting up.
> Of course, but for some applications that is more than good enough. If your adversary can bring enough resources to bear to take over 1/3rd of your network he might as well DoS any target he wants. So you would be facing massive disruption anyway.
Yes. The reason I brought this up is that in the context of public DHTs, it's feasible for someone to run many Sybil nodes. There's some very recent work out of MIT for achieving BFT consensus in open-membership systems, if you're interested: https://arxiv.org/pdf/1607.01341.pdf
> I mean blockchains lose some of their security guarantees too once someone manages to dominate 1/2 of the mining capacity. Same order of magnitude. It's basically the design domain "secure, up to point X".
In Bitcoin specifically, the threshold for tolerating Byzantine miners is 25% hash power. This was one of the more subtle findings from Eyal and Sirer's selfish mining paper.
> The things you list don't really replace the purpose of a DHT. A dht is a key-value store for many keys and a routing algorithm to find them in a distributed environment. What you listed just gives you a bunch of nodes, but no data lookup capabilities. Essentially you're listing things that could be used to bootstrap into a DHT, not replacing the next layer services provided by a DHT.
If the p2p application's steady-state behavior is to run its own overlay network and use the DHT only for bootstrapping, then DHT dependency can be removed simply by using the systems that bootstrap the DHT in order to bootstrap the application. Why use a middle-man when you don't have to?
It seems like we have quite different understandings of how DHTs are used, probably shaped by different use-cases. Let me see if I can summarize yours correctly: a) over time nodes will be interested in or have visited a large proportion of the keyspace b) it makes sense to eventually replicate the whole dataset c) the data mutation rate is relatively low d) access to the keyspace is extremely biased, there is some subset of keys that almost all nodes will access. Is that about right?
In my case this is very different. Node turnover is high (mean life time <24h), data is volatile (mean lifetime <2 hours), nodes are only ever interested in a tiny fraction of the keyspace (<0.1%), nodes access random subsets of the keyspace, so there's little overlap in their behavior. The data would become largely obsolete before you even replicated half the DHT unless you spent a lot of overhead on keeping up with hundreds of megabytes of churn per hour and you would never use most of it.
So for you there's just "bootstrap dataset" and then "expend a little effort to keep the whole replica fresh". For me there's really "bootstrap into the dht", "maintain (tiny) routing table" and then "read/write random access to volatile data on demand, many times a day".
This is why the solutions you propose are no solutions for a general DHT which can also cope with high churn.
Agreed on (a), (b), and (c). In (a), the entire keyspace will be visited by each node, since they have to index the underlying blockchain in order to reach consensus on the state of the system (i.e. each Blockstack node is a replicated state machine, and the blockchain encodes the sequence of state-transitions each node must make). (d) is probably correct, but I don't have data to back it up (e.g. because of (b), a locally-running application node accesses its locally-hosted Blockstack data, so we don't ever see read accesses).
> In my case this is very different. Node turnover is high (mean life time <24h), data is volatile (mean lifetime <2 hours), nodes are only ever interested in a tiny fraction of the keyspace (<0.1%), nodes access random subsets of the keyspace, so there's little overlap in their behavior. The data would become largely obsolete before you even replicated half the DHT unless you spent a lot of overhead on keeping up with hundreds of megabytes of churn per hour and you would never use most of it.
Thank you for clarifying. Can you further characterize the distribution of reads and writes over the keyspace in your use-case? (Not sure if you're referring to the Bittorrent DHT behavior in your description, so apologies if these questions are redundant). For example:
* Are there a few keys that are really popular, or are keys equally likely to be read?
* Do nodes usually read their own keys, or do they usually read other nodes' keys?
* Is your DHT content-addressable (e.g. a key is the hash of its value)? If so, how do other nodes discover the keys they want to read?
* If your DHT is not content-addressable, how do you deal with inconsistent writes during a partition? More importantly, how do you know the value given back by a remote node is the "right" value for the key?
I am, but that's not even that important because storing a blockchain history is a very special usecase because you're dealing with an append-only data structure. There are no deletes or random writes. Any DHT used for p2p chat, file sharing or some mapping of identity -> network address will experience more write-heavy, random access workloads.
> Are there a few keys that are really popular, or are keys equally likely to be read?
Yes, some are more popular than others, but the bias is not strong compared to the overall size of the network. 8M+ nodes. Key popularity may range from 1 to maybe 20k. And such peaks are transient, mostly for new content.
> Do nodes usually read their own keys, or do they usually read other nodes' keys?
It is extremely unlikely that nodes are interested in the data for which they provide storage.
> Is your DHT content-addressable (e.g. a key is the hash of its value)?
Yes and no, it depends on the remote procedure call used. Generic immutable get/put operations are. Mutable ones use the hash of the pubkey. Peer address list lookups use the hash of an external value (from the torrent).
> * If your DHT is not content-addressable, how do you deal with inconsistent writes during a partition? More importantly, how do you know the value given back by a remote node is the "right" value for the key?
For peer lists it maintains a list of different values from multiple originators, the value is the originator's IP, so it can't be easily spoofed (3-way handshake for writes). A store adds a single value, a get returns a list.
For mutable stores the value -> signature -> pubkey -> dht key is checked.
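In the spirit of that check, a sketch along the lines of BEP 44 using PyNaCl (not the exact bittorrent implementation, which signs a bencoded blob that also covers a sequence number and optional salt):

    import hashlib
    from nacl.signing import VerifyKey
    from nacl.exceptions import BadSignatureError

    def verify_mutable(dht_key: bytes, pubkey: bytes,
                       value: bytes, sig: bytes) -> bool:
        # 1) the slot must be derived from the signer's public key
        if hashlib.sha1(pubkey).digest() != dht_key:
            return False
        # 2) the value must verify against that key
        try:
            VerifyKey(pubkey).verify(value, sig)
            return True
        except BadSignatureError:
            return False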
also note: DHT: hash table, blockchain: linked list.
but there are a lot more data structures than that!
There are several DHT papers that talk about bootstrapping DHTs off of social networks. They all fail to solve the Sybil problem in the same way: an adversary simply attacks the social network by pretending to be many people.
Not everything needs a global singleton like a blockchain or DHT or a DNS system. Bitcoin needs this because of the double-spend problem. But private chats and other such activities don't.
I have been working on this problem since 2011. I can tell you that peer-to-peer is fine for asynchronous feeds that form tree-based activities, which is quite a lot of things.
But everyday group activities usually require some central authority for that group, at least for the ordering of messages. A "group" can be as small as a chess game or one chat message and its replies. But we haven't solved mental poker well for N people yet. (Correct me if I am wrong.)
The goal isn't to not trust anyone for anything. After all, you still trust the user agent app on your device. The goal is to control where your data lives, and not have to rely on any particular connections (e.g. to the global internet) to communicate.
Btw, it's ironic that the article ends "If you liked this article, consider sharing (tweeting) it to your followers". In the feudal digital world we live in today, most people must speak in a mere 140 characters to "their" followers via a centralized social network with huge datacenters whose engineers post on highscalability.com.
If you are interested, here I talk about it further in depth:
But I had never heard of Scuttlebutt until now. This looks even more ideal. In amateur radio, everyone self-identifies with their call sign; this follows the same model.
For amateur radio, there is a restriction against encryption (intent to obscure or hide the message), but the public messages would be fine. Private messages (being encrypted so that only those with the right keys can read them) might be a legal issue, so for a legit amateur radio deployment, the client would have to disable that (or at least operators would have to be educated that private messages may violate FCC rules).
(at 9m53s: https://youtu.be/WzMm7-j7yIY?t=9m53s)
How do you see this happening in such a relative short amount of time? Who (else) is going to do this? Is our culture predisposed to do this, and, if not, is there a strategy to overcome this culture factor?
edit: for clarity
Allow me to designate trusted friends / custodians. Store fractions of my private key with them, so that they can rebuild the key if I lose mine. They should also be able to issue a "revocation as of certain date" if my key is compromised, and vouch for my new key being a valid replacement of the old key. So my identity becomes "Bob Smith from Seattle, friend of Jane Doe from Portland and Sally X from Redmond". My social circle is my identity! Non-technical users will not even need to know what private key / public key means.
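The key-splitting part is well-trodden cryptography. A toy n-of-n version in Python (a real deployment would use a threshold scheme like Shamir's, so that any k of n friends suffice; this only shows the shape of it):

    import os

    def split(secret: bytes, n: int):
        # n-1 random shares, plus one that XORs back to the secret
        shares = [os.urandom(len(secret)) for _ in range(n - 1)]
        last = secret
        for s in shares:
            last = bytes(a ^ b for a, b in zip(last, s))
        return shares + [last]

    def recombine(shares):
        out = bytes(len(shares[0]))
        for s in shares:
            out = bytes(a ^ b for a, b in zip(out, s))
        return out

    key = os.urandom(32)
    assert recombine(split(key, 3)) == key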
Introduce a notion of the "relay" server - a server where I will register my current IP address for direct p2p connection, or pick up my "voicemail" if I can't be reached right away. I can have multiple relays. So my list of friends is a list of their public keys and their relays as best I know them. Whenever I publish new content, the software will aggressively push the data to each of my friends / subscribers. Each time my relay list is updated, it also gets pushed to everyone. If I can't find my friend's relay, I will query our mutual friends to see if they know where to find my lost friend.
There should be a way to create handles for real-life objects and locations. Since many people will end up creating different entries for the same object, there should be a way for me to record in my log that guid-a and guid-b refer to the same restaurant in my opinion. As well I could access similar opinion records made by my friends, or their friends.
Each post has an identity, as does each location. My friends can comment on those things in their own log, but I will only see these comments if I get to access those posts / locations myself (or I go out of my way to look for them). This way I know what my friends think of this article or this restaurant. Bye-bye Yelp, bye-bye fake Amazon reviews.
I will subscribe to certain bots / people who will tell me that some pieces of news floating around will be a waste of my time or be offensive. Bye-bye clickbait, bye-bye goatse.
Allow me to designate space to store my friends' encrypted blobs for them. They can back up their files to me, and I can back up to them.
a very nice person whom i like to call mix made a module for this recently: http://git.scuttlebot.io/%25XJz%2BcF9oIgd1eHYFGg3ycVwowLEseL...
The part which splits your key is now automated and part of Patchbay. I'll build the resurrection part when someone needs it
For identity, there's
Right now I'm particularly interested in https://github.com/solid/web-access-control-spec although I think it's incomplete when it comes to data portability and access control. From what I've seen on re-decentralizing the internet, access control is either non-existent, or relies on a server hosting your data to implement access control correctly.
What if, in the WAC protocol linked above, instead of ACL resources informing the server, we had ACL resources providing clients with keys to the encrypted resource (presumably wrapped in each authorized agent's pub key)? Host-proof data is a necessity for decentralized social networking IMO, even if the majority of agents would happily hand their keys over to their host.
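A sketch of that "ACL as key wrap" idea (using PyNaCl; the function name and structure are illustrative, this is not the WAC spec):

    from nacl.public import PublicKey, SealedBox
    from nacl.secret import SecretBox
    from nacl.utils import random as nacl_random

    def publish_protected(resource: bytes, agent_pubkeys):
        # encrypt the resource once under a fresh symmetric key...
        content_key = nacl_random(SecretBox.KEY_SIZE)
        ciphertext = SecretBox(content_key).encrypt(resource)
        # ...then the "ACL" is just that key sealed to each agent's
        # public key; the host stores both but can read neither
        acl = [SealedBox(PublicKey(pk)).encrypt(content_key)
               for pk in agent_pubkeys]
        return ciphertext, acl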
It's also important that an initial smaller community would be targeted and that the network would succeed there. FB did this with colleges; a federated network, in a world where FB already exists, would have an even harder time.
Depends on your definition of "distributed", I suppose
Keybase offers decentralized trust, in that the Keybase server can't lie to you about someone's keys -- your Keybase client will trust their public proofs and not the Keybase server -- but it's not a distributed/decentralized service as a whole, because you still receive hints from the server about where proofs live, and learn Keybase usernames from it.
(I work at Keybase.)
No, I don't think the tech is quite there yet. Even just handing out human-readable usernames requires blockchain-style consensus, and we don't have a blockchain being followed along by everyone's machines to adjudicate consensus requests (yet!).
The folks at Blockstack Labs are doing fine work in this area, though: https://blockstack.org/
The fact that it failed at the most basic thing of actually telling you what it's about, what it does, and how, is a good reason not to use Keybase.
It's unclear whether this can be changed later, and I'm not yet sure whether I want to use my real identity or a throwaway.
After creating an account with the default (randomly?) generated name, I tried to use an invite obtained from http://188.8.131.52/invited which was linked from https://github.com/staltz/easy-ssb-pub.
All I got back was "An error occured (sic) while attempting to redeem invite. could not connect to sbot"
It worked with http://pub.locksmithdon.net/ though I feel a bit odd trusting a "locksmith" I've never heard of to stream lots of data to my hard drive...
It's cool that anyone can host a pub – basically, an instance of FB/Twitter/Gmail, it seems – but things 1) will get expensive for them, and it's unclear how they'll fund that – and 2) now I have to trust random people on the internet – not only to be nice, but also secure.
As a "random technically aware netizen", I honestly trust fooplesoft more, since they have a multi-billion-dollar reputation to protect. (Not that I trust fooplesoft).
FWIW, you can use pub.lua.cz:8008:@xYSW6eVu8gTS/nTSXZiH97dgKZ+wp7NkomR6WKK/PBI=.ed25519~iQ16RuvjKZqy/RhiXXmW9+6wuZNq+SBI8evG3PotxvI= if you have trouble connecting to the ones on github.
Feel free to add it to the wiki, I do plan to run it long term, but I am not a github user.
Right, but someone I trust could have their message corrupted, no?
E.g. some political leader intends to write "everybody vote for Alice" and it is modified to read "everybody vote for Carol". Is this possible?
(I generally trust FB not to do this because their business would suffer if they were caught, for example – not so with ephemeral pubs)
Your followers, their followers, and their followers (assuming everyone is using the default replication settings). These may include pubs or people you follow. If you are able to connect to a pub then most likely it is willing to replicate your feed.
The social aspect is important though because in this architecture what you see is determined by who you follow (and who they follow, etc.)
What I mean to say is, Usenet's social model hardly prevented it from drowning in a sea of low-value content.
Maybe I'm not thinking about it right or use it differently than most :P
See also Joel Spolsky on the topic: https://www.joelonsoftware.com/2003/03/03/building-communiti...
The deciding factor between what came before and facebook and twitter is the ability to broadcast to the entire social network at once, so all of the world can see your brilliance! Feeding into that narcissism is the killer feature of modern social networks.
But yes, for the majority of people, talking about themselves is exactly what they do. They talk about their vacation to the beach. They talk about the drama going on at work. They talk about their sister's date. They don't talk about advances in database design.
I was at a live event (a play) recently and was fascinated by a small group of women in their late 20s / early 30s. They spent a good 10-15 minutes before the play started just taking pictures of themselves being at the play and posting it to their social networks. They talked about the pictures, asked others to send them their copy of the picture. They took pictures from one angle and then another. They talked about who "liked" the picture they just uploaded. It went on and on and on. Not once did I overhear them talking about the play they were about to see. It seemed to be not the point at all. The play was just a hashtag for their social media posts.
Most good conversationalists are good at it because they explicitly draw the other person into talking about themselves and their interests. Whether things become narcissistic is more a factor of personality, I think. Perhaps it's more than that, though. A good conversationalist would steer the conversation to more interesting content - i.e. why the person is passionate about their hobby rather than just talking about their accomplishments. Perhaps we need to think about social network features that model what good conversationalists do? Not sure what that looks like though.
[edit for typo]
See, for example, the indiewebcamp people, who are against "silos", as they call Facebook et al., but are recreating their same functionalities with personal blogs and a new version of Pingbacks called "webmentions".
That's what "webmentions" do.
> That way my feed was a mix of the discussions I'm having with others as well as my own stuff.
Here's something I like: everything you say is part of a public discussion, so you're never just talking alone; comments have about the same weight as standalone posts, and outsiders can join the discussion, so it isn't restricted to your current circle of friends.
Yes, but just because no one has tried to create a different social network. That's why I made my initial comment in the first place.
> What you are describing has existed for years and we called it Usenet. Or a forum. Or a mailing list.
I don't know about Usenet, but forums and mailing lists are generally oriented to narrow topics, it is not something in which you'll see your school friends or people with multiple areas to discuss varied subjects.
Tumblr has some pretty good discussion about movies and books.
Twitter is not so good for discussion because of the length limit, but there's plenty of people posting concise observations and jokes rather than posting about themselves.
On both systems, people can reply to content from strangers, and there's lots of conflict arising from that.
I do think Tumblr would be improved by making it easier to have discussions that don't go to all your followers by default, for example like on Twitter where if you tag people at the start of your tweet, it doesn't go into the main feed for your followers who aren't tagged.
Or you can go all the way to partitioning a system into topics, as with Reddit. I wouldn't call that a social network though, you don't just casually start a conversation with people you've chosen to connect with, you start a conversation with a subreddit.
When I used it, which admittedly was a long time ago now, the biggest setback was the lack of cross-device identities. So I ended up having two accounts with two feeds, `wesAtWork` and `wes`. Maybe they have solved this by now.
ps. Does patchwork still have the little gif maker? Because that was a super fun feature.
> forking a website so easily also makes spoofing very easy...
A fork copies the files of a site, so yes, it certainly would be easy to spoof somebody's site. It basically is a spoof button. But doing so creates a new cryptographic identity for the forked site, and that identity will be the basis of how we authenticate.
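A toy sketch of why that matters (using ed25519 via Python's `cryptography` package; this illustrates the scenario, not SSB's actual code): the fork can copy every byte of content, but it cannot produce signatures that verify under the original site's public key.

    # Illustrative only: a fork copies content but signs with a new key,
    # so anyone holding the original public key can tell the two apart.
    from cryptography.hazmat.primitives.asymmetric import ed25519
    from cryptography.exceptions import InvalidSignature

    original_key = ed25519.Ed25519PrivateKey.generate()
    fork_key = ed25519.Ed25519PrivateKey.generate()  # the fork's new identity

    post = b"my latest post"
    genuine_sig = original_key.sign(post)
    forged_sig = fork_key.sign(post)        # same content, different key

    pub = original_key.public_key()
    pub.verify(genuine_sig, post)           # passes: signed by the original
    try:
        pub.verify(forged_sig, post)        # raises: it's the fork talking
    except InvalidSignature:
        print("content copied, but identity doesn't check out")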
I understand that transparency might not be a design goal or technically possible; I'm just raising the concern.
Can't I just share my private key across multiple devices?
2) it would greatly complicate the replication protocol, which would have to take forks into account rather than assuming an append-only log, where the current synced state can be represented with a single counter.
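To make the single-counter point concrete, here's a minimal sketch (the function and data shapes are mine, not scuttlebot's API): because the log is append-only, "what I already have" is fully described by one number, and sync is just "send me everything after N".

    # Sketch of append-only log sync. Sequences run 1..n, so the count
    # of local messages *is* the last sequence we've seen.
    def sync(local_log, remote_log):
        latest = len(local_log)
        for msg in remote_log[latest:]:
            local_log.append(msg)
        return local_log

    # If forks were allowed, two messages could claim the same sequence
    # number, and peers would have to reconcile whole branch histories
    # instead of comparing a single counter.

    at_work = [{"seq": 1, "text": "hello"}]
    at_home = [{"seq": 1, "text": "hello"}, {"seq": 2, "text": "world"}]
    sync(at_work, at_home)
    assert [m["seq"] for m in at_work] == [1, 2]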
I ended up removing the gif maker in one iteration because it was so frequently buggy. That was probably the worst call I made.
Under the hood, patchwork connects to a scuttlebot server. Scuttlebot in turn is based on secure-scuttlebutt (ssb).
Patchwork is a user interface for displaying messages from the distributed database to the user, and for allowing the user to add new messages. The underlying protocol supports arbitrary message types; patchwork exposes a UI for interacting with a subset of them. Anyone could write and use other UIs while still contributing to the same database. Patchbay, for example, is a more developer-centric frontend.
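For a rough idea of what "arbitrary message types" buys you, a message in the shared database might look something like this (an illustrative shape, not the exact SSB wire format):

    # Illustrative: the envelope is fixed; the "content" is an arbitrary
    # typed payload that each UI decides how (or whether) to render.
    message = {
        "author":   "@...public key...",      # who signed it
        "sequence": 42,                       # position in the author's log
        "previous": "%...hash of msg 41...",  # chains the log together
        "content": {
            "type": "post",                   # patchwork renders these
            "text": "hello from the scuttleverse",
        },
        "signature": "...sig over the above...",
    }
    # A "git-update" or "audio-post" content type lives in the same log;
    # a UI that doesn't understand a type simply skips it.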
https://github.com/ssbc/patchbay
http://scuttlebot.io/
edit- they got unkilled.
_THIS_ is very much the spirit of SSB. :)
The technology is here, the only thing left is to make people actually use it…
But on top of this database you can build whatever you like - there's a github thing and a soundcloud thing and a facebook thing.
Forgive the rambling, this is the first time I've written any of this down...
My idea is to use email as a transport for 'social attachments' that would be read using a custom mail client (it remains to be seen whether it should be your regular email client or a dedicated 'social mail' client; if using your regular client, users would have to ignore or filter out social mails). It could also be done as a mimetype handler/viewer for social attachments. A rough sketch follows the list of advantages below.
Advantages of using email:
- Decentralized (can move providers)
- email address as rendezvous point (simple for users to grasp)
- Works behind firewalls
- Can work with local (i.e., Maildir) or remote (IMAP) mailstores. If using IMAP, this helps address the multiple-devices issue. Replication (Syncthing, Dropbox, etc.) could also handle it.
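As a rough proof of concept of the transport idea (the application/x-social-post subtype and all names here are invented for illustration), Python's standard email library is enough to build such a message: a plain-text fallback for ordinary clients plus a machine-readable attachment for the 'social mail' client.

    # Sketch: a social post as an email with a custom MIME attachment.
    import json
    from email.message import EmailMessage

    def make_social_mail(sender, recipient, post):
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = recipient
        msg["Subject"] = "[social] " + post.get("title", "update")
        msg.set_content("Social post; view in a social-aware client.")
        msg.add_attachment(
            json.dumps(post).encode("utf-8"),
            maintype="application",
            subtype="x-social-post",   # invented subtype
            filename="post.json",
        )
        return msg

    mail = make_social_mail("alice@example.org", "bob@example.org",
                            {"title": "beach day", "body": "photos soon"})

    # The receiving client filters on the content type:
    posts = [p for p in mail.iter_attachments()
             if p.get_content_type() == "application/x-social-post"]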
Scuttlebutt looks like a nice alternative though. Will be following closely.
The problem is that you don't have a means to publicly advertise your status and offer a way to subscribe. That would require a third-party provider. I can imagine someone fetching everyone's updates and resending the mail via a public web repository that would act as a public registration hub.
That would be a huge data mine, though. Unless you add PGP into the mix - and then you have to hit the mark on the client's PGP handling, to easily allow close friends to give out their public keys.
Wouldn't that make a fun POC project?
I remember I was thinking about it when pownce came out.
I still believe the net would be so much more fun with the likes of pownce and w.a.s.t.e around :(.
I remember having some actual conversations on w.a.s.t.e. That's never happening with torrents.
One thing I didn't mention was that there wouldn't be any public posts, so it's definitely more social network than social media.
Anyways, just an idea at this point, though a prototype would not be that hard to put together as an experiment.
Granted, I have been running mail for a long time, so I got to learn the complications as they happened, rather than all at once. But anyone who can set up a production-quality web server/appserver/DB along with the accessories that go along with it can handle it.
Now if email isn't important to your business and/or you just don't want to deal with maintaining it, that's valid. But it just isn't as difficult as a lot of people seem to want to make it out to be.
Are you using Sendmail?
Not in a long time. I haven't found a situation in which I couldn't use Postfix in quite a while. Although the occasional sendmail.cf flashback still hits me.
That's why I asked.
I think such an approach could be interesting, but it seems there is a need for a non-profit to govern such a thing.
So your approach should work in theory within the described framework.
But seriously, this is a very very interesting project.
Maybe it's just me but if I see an article is x+ hours old (15+ for example), I don't bother commenting.
What type of social networking would HN use for non-personal communication (i.e., not for family and immediate friends)? (I've tried hnchat.com; it's mostly inactive, imho.)
Perhaps HN could introduce email notifications (e.g., if somebody replies to one of your comments/posts, you get a notification by email).
1) be on the same wifi (presumably great for dissidents in countries with heavy-handed internet control, and inconvenient for everyone else)
2) use "pubs", which can be run on any server, and connected to ¿through the internet?
So most users would use pubs, which are described as "totally dispensable" (a nice property). But how can users exchange information about which pub to subscribe to? Is there a public listing of them?
It seems like the "bootstrapping server" problem (e.g., the reliance on router.bittorrent.com:6881) will still exist in practice. For that matter, is there currently an equivalent to router.bittorrent.com that would serve this purpose?
This seems like a potentially significant project, and I'm excited by the possibility that it might actually take off – hence the inquiry.
What about organizing groups that might currently use Slack? For example, political dissidents who don't necessarily all know each other personally. Must they use some other communication channel to agree on which pubs to use?
I know some people who work in that area, and every time one of them finds out I work in software their first question is about mesh networking. If SSB is what it seems to be (user friendly, no-frills ad hoc mesh networking) then that would be huge for emergency and disaster planners! Is it mature enough to be used in this way?
Also, please remember that the American First Amendment limits government restrictions on speech. Private communities and individuals can make any rules they want about socially acceptable speech.
Naturally. Where can I learn more?
If I want to have access to everything that's been shared with me, I have to store it all. In the case of images, the storage burden can get large quickly.
The flipside to your remark is that it is fully offline-capable, and I'm perfectly happy with that. Also: contrast it with how much space a Thunderbird profile takes up.
Is the protocol set up in such a way as to enable easy, automatic deletion of old data from local devices, while still storing them for easy search/scroll-based access on the Pub servers?
Thank goodness Facebook isn't what I want my social feed to look like... all those GIFs and garbage updates.
Also, I suppose that if you're linking out of the SocialApp and into the Web, most of the content is just "messages".
> then it syncs up with the larger storage on my home PC later.
I can hardly wait for devices that work this way.
Sure, but I have few enough friends as it is. I know literally no one who would use this, as neat as it is. Bootstrapping a social network is hard for both developers and users, but once it gets going, storage requirements would rise fast.
I observed in 2011 that HighWinds Media had not expired any non-binaries postings since 2006, and that Power Usenet had not expired a non-binaries posting for eight years ("3013+ days text retention" was in its advertising at the time). People effectively just turned non-binaries expiry in Usenet off, in the first few years of the 21st century. I did on my Usenet node, too.
I observed then that the Usenet nodes' abilities to store posts had far outstripped the size of the non-binaries portion of a full Usenet feed, which was only a tiny proportion of the full 10TiB/day feed of the time.
We are also basically betting on the size of our message logs to generally grow slower than our individual storage capacities, and it is interesting to know that that worked for Usenet too. For blobs, we will likely develop some garbage collection or expiring approaches. Since the network is radically decentralized, each participant can choose their own retention policy. You can, in fact, delete all your blobs (`rm -rf ~/.ssb/blobs`) and assuming some peers have replicated them, your client will just fetch them again as you need them.
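A local retention policy really can be that blunt. A hedged sketch (the 90-day cutoff is my invention; the path matches the default layout mentioned above): drop blobs that haven't been modified in a while and rely on peers re-serving them on demand. The append-only log itself is never touched.

    # Sketch of a per-participant blob retention policy.
    import time
    from pathlib import Path

    BLOB_DIR = Path.home() / ".ssb" / "blobs"
    CUTOFF = time.time() - 90 * 24 * 3600   # arbitrary: 90 days

    for blob in BLOB_DIR.rglob("*"):
        if blob.is_file() and blob.stat().st_mtime < CUTOFF:
            blob.unlink()   # any peer still holding it can serve it again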
> I'm simply pointing out the error in the premise of your question.
No, you made a non-sequitur factual post about Usenet. I see no actual error pointed out. The fact that Usenet stopped expiring non-binary posts after most of their traffic fled to other services is not a valid argument against possibly using the feature in a peer to peer distributed social network.
If you don't see an error in your premise being pointed out, then you need to put your "posts expire and get deleted, like on Usenet" right up against "people were not expiring non-binaries posts on Usenet" until the penny drops.
Then you need to notice the point, already made by others as well, that the premise of ISL's question is erroneous, too. The storage requirements are not necessarily "tremendous", if one actually learns from the past. Again, your comparison to Usenet needs to consider how Usenet treated binaries and non-binaries very differently. (One can look to the experience of the WWW for this, too, and compare the relative weights in HTTP traffic of the "images" that ISL talks about and the non-binary contents of the WWW. But your comparison to Usenet teaches the same thing.)
Your and ISL's whole notion, that everything is going to get tremendously big and so everything will need to be expired, rather flies in the face of what we can see from history actually happened in systems like this, such as the one that you made your comparison to. Usenet did not expire and delete non-binaries posts.
By making this comparison and then trying to pretend that it's someone else's non-sequitur, you are closing your eyes to the useful lessons actually to be learned from your comparison. Usenet, and the Wayback Machine, and the early WWW spiders, and Stack Exchange, and Wikipedia with all of its talk pages, and Fidonet in its later years (when hard disc sizes became large enough), all teach that one can in fact get away with keeping all of the "non-binary" stuff indefinitely, or at least for time scales on the order of decades, because that is not where the majority of the storage and transmission costs are.
People have already danced this dance, several times, and making a distinction between the binary and the non-binary stuff and not fretting overmuch about the latter when one looks at the figures is generally where it ends up.
Let's say for instance that the file you're downloading is a long text file containing a novel, but all you care about is chapter 3. Then all you need are the pieces for chapter 3 – the rest can stick around in the ether somewhere.
This is harder to do with bags of bytes, obviously – how do you know which bytes belong to chapter 3? – but if the pieces are self-contained messages, where you don't need either the previous or the next to make sense of one, then it should be trivial to link to them, and distribution could work like this. Whether it actually works like this or not I have no idea. Sounds like an interesting project anyway!
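A hedged sketch of that idea (purely illustrative; not how any particular network implements it): if each piece is a self-contained message addressed by its hash, then "chapter 3" is just a short list of hashes, and you fetch only those.

    # Content-addressed pieces: resolve only the hashes you care about.
    import hashlib

    def address(piece: bytes) -> str:
        return hashlib.sha256(piece).hexdigest()

    # The swarm's view: every piece keyed by its hash.
    pieces = {address(p): p
              for p in [b"ch1...", b"ch2...", b"ch3a...", b"ch3b..."]}

    # An index message for chapter 3 is just a list of addresses.
    chapter3 = [address(b"ch3a..."), address(b"ch3b...")]

    text = b"".join(pieces[h] for h in chapter3)  # fetch only chapter 3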