Internet Archive, decentralized (archive.org)
242 points by justin_ on Aug 4, 2018 | hide | past | favorite | 84 comments

I strongly believe IA or any serious project working on permanent persistence must provide an option (opt-in if you will) to make the published material irremovable, à la arXiv [1].

[1] Red parts of https://arxiv.org/help/license

And that is your right. However, most people would like the ability to remove slander with a court order.

Pretty sure arXiv would remove slander with a court order as well. It's an unrelated issue.

If material can be removed, it's not irremovable.

We regularly confuse irrevocability of licenses with actual availability. The former is something we here can probably agree makes sense (once the beans are spilled there's no going back; you can't withdraw intellectual property: what you gave is now irrevocably tied to the recipient, who gained knowledge). The latter is clearly wrong: there's no such thing as universal availability. As humans behind keyboards, we pretty much decide, in any way we want, what part of (information about) our knowledge we share.

I think it's fair to assume "not-${VERB}able" means "as not-${VERB}able as legally possible".

This gets tricky with things like blockchains.

My choice of words was indeed not optimal, though you probably know that ambiguity pervades natural languages, and that what was meant is often clear nonetheless. I would hope that you don’t fear physical amputation when a friend says to you “give me a hand”, for instance.

Thank you for bringing this up; it is another reason we need mutable data structures in the decentralized web - for social reasons, not just technological ones.

In another thread, I also mention that immutability creates algorithmic complexity that makes it difficult to scale: https://news.ycombinator.com/item?id=17693920

I didn't see any articles explaining how it is decentralized. Is the Internet Archive going for something similar to a blockchain setup? Because that would actually be one of the few cases where it would make sense.

I am not familiar with this new dweb subdomain, what is unique about it?

The Archive has actually lost some of the archives I stored on it, which is weird, because I ran a multitude of backups on it a few years back when the site got taken offline.

I wanted to submit an article about this but couldn't find anything either.

My understanding is that this loads content from various protocols (listed at the top of the page), many of which support replicating data in a decentralized way. As far as I know, there's no blockchain involved in anything here yet.

If you browse to the Community Video section and choose a video, you can see peer information as though downloading through WebTorrent. If I disable WebTorrent and look at a video, I don't see the peer information and it seems to fall back to HTTP. Pretty cool! It looks like almost everything is only seeded by the Internet Archive right now, but hopefully they want to encourage more people to participate.

Hm, that's interesting. I imagine the Internet Archive has lots of seedboxes/CDNs/datacenters distributed globally. I have no idea how much data the Internet Archive is currently backing up, but it's growing at an increasing rate.

I would love to see a write-up of how their infrastructure works, though.

My understanding is that all of their digital content is stored and served from their location in SF, although it would be awesome if they started geographically distributing their storage nodes.


Although, as a sibling comment points out, that article is very old, I can attest to its continuing general accuracy, having interviewed there last year.

Although I didn't get a look at their financials, my overall impression is that they can't afford anything as extravagant as geographic distribution, absent a huge corporate sponsor or two.

Even a modest increase in donation revenue would be unlikely to make a difference, as the other impression I had was that spending in that area was far from a priority, especially compared with data acquisition/conversion projects.

I think it's great that the IA is attempting to be a broad, general-interest digital library, since nobody else is doing that.

However, I also wish for a separate archive focused specifically on The Web. Besides not being subject to distractions or competing interests, I speculate that it would be more attractive for web-dependent companies, such as Google or CDNs, to donate to.

There's been at least a partial copy in Alexandria and Amsterdam for a long time, and they opened a full replica in Canada last year. At least I think the Canada one is done, can't find a blog post saying it was finished. https://blog.archive.org/2016/12/03/faqs-about-the-internet-...

Was not aware; this is excellent news!

Note this is an extremely old webpage.

Do you think it's built on top of PeerTube Social or is it something completely different?

> Archive has actually lost some of the archives I stored on it

Who currently controls the domain? One of the ways stuff can be lost is if a new domain owner fiddles with robots.txt. (The Archive has recently changed their policy about that)

> We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine.
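For reference, a parked domain's robots.txt is often just a blanket rule like this:

```text
User-agent: *
Disallow: /
```

which, under the old policy, was enough to hide the whole domain's history from the Wayback Machine.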


Is there a delay between the time the robots.txt changes and the time when the content becomes inaccessible via Wayback Machine? How often does the archive.org_bot crawl robots.txt?

Can a script check robots.txt periodically for changes and if changes are detected, then download the content from Wayback Machine before it becomes inaccessible?

Additionally, can a script check the domain registration for an anticipated expiration date, or perhaps monitor domain-name "drop lists"?
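Sketching the check (hypothetical JavaScript; fetching robots.txt on a schedule is left out, and the parser ignores grouped user-agent records):

```javascript
// Detect whether a new robots.txt is more restrictive than the old one, so a
// script could mirror pages from the Wayback Machine before the new rules
// take effect. Rough sketch only.
function disallowRules(robotsTxt) {
  // Collect Disallow paths that apply to all user agents ("*").
  const rules = new Set();
  let applies = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.replace(/#.*/, "").trim();
    const m = line.match(/^(user-agent|disallow)\s*:\s*(.*)$/i);
    if (!m) continue;
    if (m[1].toLowerCase() === "user-agent") {
      applies = m[2].trim() === "*";
    } else if (applies && m[2].trim() !== "") {
      rules.add(m[2].trim());
    }
  }
  return rules;
}

// Paths disallowed in the new file but not the old one.
function newlyDisallowed(oldTxt, newTxt) {
  const before = disallowRules(oldTxt);
  return [...disallowRules(newTxt)].filter((p) => !before.has(p));
}
```

Any newly disallowed path would be the trigger to start downloading that content before it becomes inaccessible.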

There's a video of the talk I did with Arkadiy (from IPFS) and Feross (from WebTorrent) at the Dweb Summit at https://archive.org/details/youtube-eO6pYYWZBs8?start=10447 . It goes into quite a lot of detail about how it was implemented and the challenges we faced along the way.

What is it? My browser (firefox) sees nothing* even after enabling javascript.

*Except the sentence "The decentralized web is everywhere, but we have to find it." and a Name form that does nothing.

The problem is not Firefox... it's probably add-ons or your internet connection blocking content.

It also shows nothing for me in a Firefox fork with almost all add-ons disabled. The error console shows a bunch of "class is a reserved identifier" errors and syntax errors about missing semicolons (before main() can even run).

My guess is it's trying to use some bleeding-edge ECMAScript features, which borks the parsing in non-bleeding-edge browsers.

Oh... I forgot, I'm using the beta version of Firefox; that might be why.

If you run into problems, I'd love to know specific details. This is an experiment to see what is achievable with Dweb tools like IPFS, WebTorrent, and GUN in browsers, with a big site like the Archive. It certainly pushes the edge of browsers - and does use the latest ES6 features without much attempt to support older browsers (unlike the production site at archive.org). We currently test only in the latest Chrome & Firefox, though it seems to work (mostly) on current Safari on iPad and iPhone, and I heard it's working on Android, though I haven't seen it myself yet.

If you can get on, there is a feedback button; if you can't, feel free to go straight to the form at https://docs.google.com/forms/d/e/1FAIpQLSe7pXiSLrmeLoKvlDi2... or open an issue on https://github.com/internetarchive/dweb-archive/issues .

> ... and does use the latest ES6 features without much attempt to support older browsers (unlike the production site at archive.org)

Except the Wayback Machine, which recently changed its interface to be entirely JS-dependent and fails on older browsers. And the old interface is no longer accessible.

When Cloudflare screwed up and published everyone's secrets, there was a coordinated effort involving, amongst others, archive.org to try and scrub the internet of those secrets. Are there any mechanisms available to allow similar efforts with this dweb version?

There should not be a mechanism for that. China mastered such mechanisms. We should rather optimize for not having those kinds of points of failure.

There's nothing inherently wrong with an opt-in coordination mechanism. It makes sense that if I trust someone to make a blacklist, I can have my node refer to that list. The problem is when an organization threatens you for "publishing" the wrong thing. That's a legal or social problem, not a technical one, and such organizations typically don't care in the slightest about how convenient it is for you to comply with their demands. If there's no blacklist mechanism, they're more than happy to demand that you shut down your node altogether.

Of the two evils (the possibility of censorship, or mistakes being eternal), I think I prefer the former.

For the same reason I find it comforting that an old fashioned bank transfer can be corrected if I transfer money and make a mistake writing the account number. Mutable history is a powerful feature.

> Of the two evils (the possibility of censorship, or mistakes being eternal), I think I prefer the former.

I get that. However, it's arguable that mistakes are all too often eternal against the most dangerous adversaries, such as authoritarian governments and other criminal organizations with state-level resources. And so it's arguably better to focus on resistance to censorship.

This feature is an illusion, though. Mallory can still save anything that appears online for even a second, and so can you.

That’s technically correct but in practice it’s about as accurate as, say, assuming that you shouldn’t own anything expensive because it’s possible for anything to be stolen.

In real life, there are not billions of Mallorys watching your stuff constantly. Most people are decent and most of the others are deterred by laws, and the number of people who are willing to help abusers is relatively small.

Just using some real-life examples, think about doxxing or revenge porn. It’s technically true that this data cannot provably be removed from the internet but in practice most people didn’t save it and the ones who did became a lot more covert once the legal system caught up, which means that in practice far fewer people see it. The initial damage may have been done but that doesn’t mean we should give up and do nothing because there isn’t a theoretically-perfect option.

Sure. But if I e.g. accidentally uploaded something sensitive to GitHub (that can’t simply be changed to a new secret), I’d certainly delete it in a hurry, rather than shrug and say ”oh well, it’s out there and someone has already copied it, so I’ll leave it”.

But in that case, are you saying you _wouldn't_ immediately change the credential you committed? Sure, the chance of an adversary forking your repo after that commit but before your revision is small, but it still exists.

Once a secret is exposed to the internet, it should be considered public and rotated. In this case mutability/immutability is moot, though there are likely applications for other, non-credential secrets that are not so easily rotated (like your home address or something).

Yes, a changeable credential you just change; but say it's the medical records of all the staff at your entire company, or similar.

> an old fashioned bank transfer can be corrected if I transfer money and make a mistake writing the account number. Mutable history is a powerful feature.

That's not necessarily mutable history though. Such a correction will usually be made by an inverse transaction, not by wiping the original transaction from the record.
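A sketch of the idea (illustrative only; the account names and amounts are made up):

```javascript
// "Correction by inverse transaction": the ledger itself is append-only, so
// history is never rewritten, yet a mistaken transfer remains fixable.
class Ledger {
  constructor() {
    this.entries = []; // append-only log of transfers
  }
  transfer(from, to, amount) {
    const id = this.entries.length;
    this.entries.push({ id, from, to, amount });
    return id;
  }
  // "Undo" by appending the opposite transfer, not by deleting the record.
  reverse(id) {
    const { from, to, amount } = this.entries[id];
    return this.transfer(to, from, amount);
  }
  // Balance is derived by replaying the full, immutable history.
  balance(account) {
    return this.entries.reduce(
      (sum, e) =>
        sum + (e.to === account ? e.amount : 0) - (e.from === account ? e.amount : 0),
      0
    );
  }
}
```

After a mistaken transfer plus its reversal, the balances are corrected, but the log still contains both entries.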

Ah, but who defines what is a "mistake" and what isn't?

Humans. That’s the feature. That agreements between humans (contracts, transfers, ...) are often imprecise. Matters can be argued between humans (in companies, authorities, courts)

But in this case there is a conflict between the agreement between IA and the content-submitter, and the agreement between IA and some political power.

In addition, optimize for dealing with those situations: secret rotation plans

"Please change the name of your first pet, the school you went to, and any distinguishing ratios on your body"

That's a funny take! At least the first two look like they need time machines, which would circumvent the solutions for the main issue in question here.

People actually fill those questions in with real info? Those types of questions are easily compromised with social media.


Ridgemont High.


Downvote because I think the nationalistic "us vs them" is uncalled for. Let's not make it about country politics when the topic isn't already politics.

Aww, I was going to add almost exactly the same thing to https://github.com/pirate/bookmark-archiver but IA beat me to the punch! I still hope to add decentralized storage and lookup mechanisms to BA eventually, but considering one of BA's archive outputs is the Internet Archive, it's less pressing now.

This is really cool! A bit sluggish, but I love the idea!

If you think back to, say, the library of Alexandria, to how much knowledge has been lost over the ages, it is so important to preserve as much as we can for future generations.

And building a decentralized foundation for this archive is a big step going forward, congratulations!

I recently went to the Internet Archive in SF, got a tour of their operations, etc. Absolutely amazing place, very forward-thinking, and they are indeed very serious about preventing something like the destruction of the Library of Alexandria from happening again.

There is a (great, IMHO) science fiction novel, "The Mote In God's Eye" by Larry Niven and Jerry Pournelle that describes humanity's first contact with another intelligent species.

Due to their biology, this alien species experiences periods of massive population growth that eventually lead to all-out war and collapse of civilization. Over uncounted thousands, maybe millions of years, these aliens have accepted this vicious cycle as kismet, and deal with it by building "warehouses" filled with their most advanced technology to jump-start civilization after the next, inevitable, collapse.

I hope humanity will never have to deal with such a collapse, but given our collective tendency towards self-destructive behavior, maybe we should build such an archive as if it were meant for future cavemen, to jump-start them into a new Anthropocene. Even if that collapse never happens (I am keeping my fingers crossed!), the resulting tome of knowledge would be a suitable monument to all the incredible things that humanity has accomplished, as well as an insurance policy in case we manage to mess up on a monumental scale.

EDIT: A more positive perspective would be The Library from David Brin's uplift saga, a humongous collection of knowledge acquired by many, many species over millions, if not billions of years.

EDIT: typo

How does this work?

Look-, feel-, and content-wise, this feels a lot like the internet of the 90s...

Unfortunately, it's not 90's enough, as it still requires fucking javascript.

Would you like to add more dependencies on server-side code for this decentralized project?

This is one of the (admittedly few) cases where javascript is actually required.

Of course it is. JavaScript is what allows us to run significant code in browsers without extensions, plugins, or downloaded apps/peers, and it's amazing how far it's come as a language in the last few years. WebAssembly would probably work as well, but it was far less developed when we started this project last year.

Unsupported mediatype: data


Yes - we don't yet support some of the more unusual types on the Archive (including "data" and "software")

On Firefox it hangs past HTTP.

It's still an experiment to see what is possible with a big site, in today's browsers, using some of the emerging tools (IPFS, WebTorrent, GUN). We only test on absolutely current releases of Chrome & Firefox. If you run into problems on the site, there is a feedback button; otherwise you can report directly via https://docs.google.com/forms/d/e/1FAIpQLSe7pXiSLrmeLoKvlDi2... or open an issue on https://github.com/internetarchive/dweb-archive/issues

very nice

GUN author here (https://github.com/amark/gun) happy to answer any questions.

IA did this integration in 1 week, Mitra is awesome.

Also, decentralized Reddit (https://notabug.io) was built in 1 week on us, and pushed 0.5TB P2P traffic on 1st day.

Note: I may not be awake for several hours, and might not be able to reply until Monday.

GUN looks great! I love the quick-start tutorial too: https://gun.eco/think.html

How suitable is GUN for live multiplayer (non turn-based) web games? (Similar to https://airma.sh/.)

I see one game example at https://github.com/amark/gun/wiki/Awesome-GUN, although it's turn-based.

Thank you!!!

I need to fix the organization of the documentation (and update the docs, oye!). I'm impressed you found Awesome-GUN.

https://github.com/amark/gun/blob/master/examples/game/space... is probably what you were searching for.

What would be better is if I made a blog/tutorial for that link. It's not a priority for me, sadly, but maybe it is for somebody out there who could help?

Thanks again!

Hi, thanks for stopping by :)

I'm still trying to get a handle on the security aspects of gun. Say you want to create a blog/note app that holds both private notes and drafts, and things that are to be shared with some friends.

Would you effectively have to store data encrypted in gun, and manage access via sharing encryption keys - in order to be able to both securely store data, and share it?

In the examples, it appears things like "create user" is called in client side code - which seems to imply anyone can write any data to a gun db? (by adding themselves as admin?)

Is the use-case of gun more a public, structured wiki - where all content is fundamentally untrusted - but easily updated by anyone?

Right back to you! :)

That is probably because I've done a poor job communicating it, since I'm still finding time to write about it. Thank you for bringing this up!

Probably most relevant: I kinda sorta had a demo of a P2P LinkedIn working https://www.youtube.com/watch?v=ZiELAFqNSLQ .

So we do have an unstable API that automates key management and key sharing, but all production apps (notabug.io , etc.) today directly use our https://gun.eco/docs/SEA shim over WebCrypto.

Unfortunately, that means you have to be aware of how to apply it - thankfully, we did make a cartoon cryptography crash course on this (in link), so it is viable to get started.

Obviously, if you have any new insights, would love to hear it!

Without SEA, gun is very much like what you say. With SEA, you can protect against just anyone randomly writing to GUN. Jump in and ask more questions about it on https://gitter.im/amark/gun , or circle back around later - hopefully that gives you a helpful direction?

Thanks. Most important statement of mine: I bet you'll enjoy the cartoon cryptography series.

Thanks for replying. I guess https://gun.eco/explainers/data/summary.html sums up the situation, but there are a few things that aren't quite clear: by design, everyone can access all encrypted data? So there's some metadata that's easy to find, such as checking whether an account exists and how much data is associated with it, plus the ability to record the approximate rate at which data is written to the account?

For example, if the login is an email, the app is an exercise logger - I might be able to infer that someone is out jogging by looking at the data?

Another, related, question: at https://github.com/amark/gun/blob/master/README.md we can read that:

"Distributed - GUN is peer-to-peer by design, meaning you have no centralized database server to maintain or that could crash. This lets you sleep through the night without worrying about database DevOps - we call it "NoDB". From there, you can build decentralized, federated, or centralized apps."

And then it goes on to show how to boot an instance on Heroku etc. But is a production setup documented anywhere? I'd assume one would want three server instances (to allow taking one down for upgrades), so that clients can always write data to a managed instance and data is reliably backed up?

Apologies if I've overlooked an obvious documentation link.

Yes, can I follow up with you more on this later / in the chatroom[1]? I don't want to leave you hanging but won't be able to reply in detail for probably 1 week - but I do have an answer for you (I apologize the docs are slacking!).

Thanks / sorry!

Thank you for creating GUN! The decentralized Reddit demo loads so quickly.

To prevent spam, it looks like they require a proof of work (PoW) on each vote. Does every update to a GUN database from untrusted peers require something like this? Is authentication intended to stop spammers in the future?

PS. For anyone reading, there's a recording of a talk Mark gave on GUN from the DWeb Summit, available here: https://youtu.be/kW6e1GCpqpE?t=43m22s

Author of notabug here.

GUN uses a proof of work for account creation/login, I think, but otherwise no, there is no proof-of-work requirement for updates.

I added the proof-of-work requirement to votes as part of my own validation. The difficulty at https://notabug.io is set quite low, but https://dontsuemebro.com is a peer that still has it set quite a bit higher; it rejects the cheaper votes from notabug.io, so the scores/sorts are different.

I spent a lot of time focusing on performance; notabug.io is running GUN with Redis as a storage adapter and doing server-side rendering to speed up the user experience.

Domain pages are currently all gun/client-side though, without the server doing anything special to help at all.


Also, when using the infinite scroll feature or chat, almost all content is loaded directly through gun without intermediary REST calls.

Decentralized Reddit sounds neat.


I just upvoted myself to 200 points (making it the top post of all time on notabug), then saw someone else downvote me to -100 points (in 5 minutes), effectively censoring me.

While this was just me with one computer, how will you stop bad actors (especially state, corporate, or other political actors) with immense technological resources from gaming the voting system to silence people?

It's a flaw inherent to democratic Internet voting-based comment filtering, no?

Decentralize how votes are weighed too.

Reddit (tries to) weigh the votes of bots, sockpuppets, and other no-do-gooders at 0, and the rest of us at 1.

Similarly, perhaps you weigh the votes of your friends at 1, your friends' friends at max(1, their_friends/10), and your friends' friends' friends at max(0.1, their_friends/100). Except for Bob, whose votes you weigh at 0, because he's always getting his account hacked or suckered into yet another bitcoin ponzi scheme.
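A sketch of that weighting (the names are made up, and min() is used to cap each weight at the stated ceiling, which I take to be the intent behind the max() above):

```javascript
// Weight a vote by the voter's social distance from you. Hypothetical scheme.
function voteWeight(distance, theirFriends, blocked) {
  if (blocked) return 0;                                 // e.g. Bob, always 0
  if (distance === 1) return 1;                          // direct friends count fully
  if (distance === 2) return Math.min(1, theirFriends / 10);
  if (distance === 3) return Math.min(0.1, theirFriends / 100);
  return 0;                                              // strangers: one possible default
}

// Tally a post's score from your point of view.
// votes: [{ value: +1 | -1, distance, theirFriends, blocked }]
function score(votes) {
  return votes.reduce(
    (sum, v) => sum + v.value * voteWeight(v.distance, v.theirFriends, v.blocked),
    0
  );
}
```

Since every user runs this with their own social graph, two users can see different scores for the same post, which is exactly the "no single point of truth" property.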

There won't be any single point of truth as to the "real" points of a post in this kind of model, but that's probably OK. Actually, there already wasn't: the same link in two different subreddits might gain wildly different amounts of points, with the subreddit acting as a proxy for a group of people whose votes you've decided to weight at 1.

Tech aside, decentralization just puts the onus for more finely deciding the weights of people's votes on the end users instead of on admins. With the right tools you can manage and limit abuse.

A flaw inherent in this model is that doing admin stuff is probably more work than the average user wants to do, so such a model will probably never take off.

In my inevitably biased view, I see democratic Internet voting-based comment filtering as a tool that will be exploited by state, corporate, or other pretender political actors for censoring speech and promoting their own speech. No matter the technical implementation, they have the resources to easily buy 30k votes to use on any thread and skew perception on topics vital to their agenda; we have been doing so for years, not just on Reddit with votes but anywhere there is a user-input box, using a variety of textbook mass social-engineering tactics.

Knowing the problem so personally, I'm partial to throwing out voting-based comment filtering altogether and replacing it with a mix of the Metafilter and Slashdot models: make registration a one-time $5 fee; have a "firehose" that isn't filtered by votes but is organized by date posted; have professional editors who select and curate user-posted links; and order comments from oldest to newest by default (this prevents a lot of the manipulation, since the oldest comments are usually free from it, as nobody can get to the thread faster, and it undermines later manipulation, since people usually follow the leading comment). Or, even better, get rid of comments altogether, because a reliable and open forum like you had in the 1990s is something good (and dangerous) that you will never have again, or at least not without state-level attempts at infiltration.

So one of the fundamental differences between GUN and a blockchain is that peers don't need to have the same dataset, or to necessarily agree on the state of the world.

The goal with notabug is that you should be able to run a peer with any sort of moderation structure you like including what you describe.

It's already possible, with just UI changes, to ignore votes entirely; the new sort works this way.

Comments can be sorted by new in this manner as well, but that isn't exposed in the UI yet.

The filtering you describe will be achievable with the moderation system I plan to build here:


You would set up a lens with a list of users (public keys) who paid you the verification fee.

You build a space with that lens as the good lens, and lenses for each of your editors to remove or highlight content in other spaces.

I don't know what the best model for online communities is. My vision for notabug.io is to shamelessly clone open source reddit in functionality and UX. But my vision for notabug more generally is as a system for disparate approaches to online forums in a connected system.

The possibilities sound very interesting, especially as an experiment to minimize the damage from social media's capability to be a very powerful, cheap, and all-inclusive propaganda tool capable of causing society-wide disruption, and all the state, corporate, and pretender political-actor interference that comes with that kind of threat. The combination of actors with state powers and the anonymity that lets them go unnoticed and unaccountable particularly terrifies me. I will look into it.

I recently described my plans for moderation here:


The goal here is that moderation won't prevent people from speaking; it will make it possible to delegate filtering of content you don't like to other people, in a way that doesn't outright censor that content.

Open to suggestions and PRs.

> It's a flaw inherent to democratic Internet voting-based comment filtering, no?

Quite possibly so; increasing the vote difficulty may help here. But one thing to keep in mind is that proof-of-work voting is not necessarily the only voting approach that could be supported. Proof-of-work voting, I think, works best at a large scale of users, something notabug doesn't have yet.

It was something easy to implement that works decently enough for now, but with a decentralized network, different peers can experiment with different voting and sorting strategies.

Wow, your work (GUN) is really great. I hope you don't mind me asking how it works in layman's terms? Thank you for your work.

Huge honor! Yes:

- http://gun.js.org/distributed/matters.html

- https://youtu.be/EHZyaupYjYo?t=55m52s

- http://gun.js.org/explainers/data/security.html (not about GUN itself, directly, but SEA)

Is at least 1 of those helpful? If not, have any ideas on how I can improve the explanation?

Is that website related to https://notabug.org in any way?


FYI: you can't post anonymous comments or chats on Notabug as of 8/7/18

And there was no way of me telling you this besides here...
