
The Wayback Machine: Fighting Digital Extinction in New Ways - toomuchtodo
https://blog.archive.org/2019/10/18/the-wayback-machine-fighting-digital-extinction-in-new-ways/
======
tpmx
I'm really curious about how the creators of the Wayback machine are working
to save modern (perhaps sometimes somewhat unnecessarily overcomplicated) web
pages that are using SPA "techniques". Have they implemented a googlebot-like
crawler that interprets JavaScript and spits out.. some predigested final DOM
tree? Or.. do they record all web-page-initiated network traffic and just let
it replay, sort of? Lots of interesting research opportunities here, btw.

This is where archival meets browser/web tech, in a kinda complicated way. I
would hope that people from both of these backgrounds have been working on
this stuff together. If not, please start soon.

~~~
dvanduzer
A crawler has two high level options: parse the page, or render the page.

Most of our parser-based crawling is done by Heritrix (crawler.archive.org)
and most of our render-based crawling is done by a proxy-based recorder
similar to what you theorize
([https://github.com/internetarchive/brozzler](https://github.com/internetarchive/brozzler)).
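To get a feel for why the parse/render split matters, here's a minimal sketch of the parse-based option (not Heritrix's actual code, just an illustration using Python's standard library): it extracts outgoing links from static markup, and anything a script generates at runtime stays invisible to it, which is exactly the gap render-based crawling closes.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href/src URLs from static markup, the way a
    parse-based crawler discovers resources to fetch next."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

page = """<html><body>
<a href="/about">About</a>
<img src="/logo.png">
<script>document.write('<a href="/hidden">only exists after JS runs</a>')</script>
</body></html>"""

parser = LinkExtractor()
parser.feed(page)
print(parser.urls)  # the script-generated link is never discovered
```

A render-based crawler like brozzler drives a real browser instead, so the `/hidden` link above would exist in the final DOM and get captured.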

~~~
tpmx
Thanks for sharing. That lets me sleep a bit easier.

------
8bitsrule
Frequent Firefox users of the Wayback Machine may want to add the Wayback
Machine add-on to their toolbar. Along with 'first', 'recent' and 'overview'
selections it includes 'Save Page Now', as well as related Alexa, Whois and
Twitter links.

[https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new/](https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new/)

~~~
jolmg
I see that it's licensed under the GPLv3, but where's the source?

EDIT: Maybe it's this one[1], but it's under a different license, AGPLv3. The
repo also hasn't been updated since 2016, but the extension page says last
update was in 2018. Are the changes and re-licensed source elsewhere?

[1] [https://github.com/internetarchive/wayback-machine-firefox](https://github.com/internetarchive/wayback-machine-firefox)

------
thrwn_frthr_awy
Does the Wayback Machine have a long-term plan that anyone is familiar with?
Is their goal to preserve the web indefinitely? Is the hope that storage and
compression improvements over time will keep up with content creation?

And just to be clear, I think the Wayback Machine is great, and the fact that
I can look up my personal, basically zero-traffic website from 15 years ago
and see it is truly astonishing to me. I'm just curious what this looks like
in 10, 20, 50 years.

~~~
toomuchtodo
The best way to get this question answered would be to email or tweet at
Brewster Kahle, who started and heads the Internet Archive.

If you're in San Francisco, the Internet Archive is also hosting a block party
the evening of Oct 23rd from 5pm-10pm, and staff will be there to answer
questions (tickets are $15).

Disclaimer: No affiliation

~~~
jonah-archive
If anyone's interested in attending our party next Wednesday but the cost is
presenting a difficulty, shoot me an email (in my profile) and I'll send you a
ticket at no cost.

------
magashna
Save Page Now seems huge. It's a real bummer going to old forums for an
obscure hobby or fandom only to find all the text and none of the images,
music, etc.

~~~
btrettel
I've had the exact same problem before.

I used to run an old forum that's now just an archive, and I've been meaning
to download all the images linked in the posts in case they go offline. At
some point during the forum's run I added a file upload feature, which seems
to have helped a lot (by avoiding external dependencies) but did not solve the
problem. Fortunately I believe I have many of the missing images saved, but
there are very likely still important things missing.

I'm planning to launch a new forum next year, and I think I'm going to write a
script to periodically archive all images and links posted to the forum. I
might disallow external images entirely, though that seems rather extreme and
might just make people post a link rather than use the upload feature.
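One way such a script could work, as a rough sketch: pull the external image URLs out of each post's HTML and submit them to the Wayback Machine's public Save Page Now endpoint (`https://web.archive.org/save/<url>`). The forum structure and URLs below are made-up assumptions, not the actual setup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

SAVE_ENDPOINT = "https://web.archive.org/save/"

class ImageCollector(HTMLParser):
    """Pulls image URLs out of a forum post's HTML, resolving
    relative paths against the post's own URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(urljoin(self.base_url, src))

def archive_post_images(post_html, post_url):
    """Request a Wayback snapshot of every image found in the post.
    Returns the save URLs that were requested."""
    collector = ImageCollector(post_url)
    collector.feed(post_html)
    requested = []
    for img_url in collector.images:
        save_url = SAVE_ENDPOINT + img_url
        urlopen(save_url)  # fire the archival request (rate-limit in practice!)
        requested.append(save_url)
    return requested
```

A real version would want throttling, retries, and probably local copies of the images as well, since a Wayback snapshot only helps if readers know to look there.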

~~~
duskwuff
Tangent:

There's an incredible amount of information stored in obscure web forums,
often in posts with photos. The damage that services like Photobucket have
done by deleting old files, or by restricting hotlinking, has been
_incalculable_. I worry that Imgur has the potential to do even worse damage,
as so many forum users have converged on their service after others became
unavailable.

(Imgur's popularity with Reddit users leaves Reddit highly vulnerable as
well.)

~~~
jborichevskiy
Crazy idea: a browser extension users can install that downloads images as
they come across them while browsing and uploads them to something
distributed, perhaps built on top of IPFS? Users could choose which domains it
would be active on. The network could be split up by either domain or topic
(say, people interested in diagrams of space, which might span several
domains/sites).

Just thinking out loud here.

~~~
duskwuff
The problems with putting that kind of data in any sort of distributed service
are that:

1) It depends upon enough users being able to consistently contribute a lot of
storage to the system. It turns out that this is hard. Casual users are
actually a _hindrance_, because they'll suck up a bunch of bandwidth trying
to replicate data, then drop out of the swarm forever.

2) The service will inevitably be used to host illegal pornographic content.
Without some sort of centralized control, there's no way to stop this, making
participation legally problematic.

------
ElijahLynn
Got me thinking about how much storage the Internet Archive uses. The answer is:

Total used storage: 50 PetaBytes


[https://archive.org/web/petabox.php](https://archive.org/web/petabox.php)

A few highlights from the Petabox storage system:

Density: 1.4 PetaBytes / rack

Power consumption: 3 KW / PetaByte

No air conditioning; excess heat is instead used to help heat the building.

Raw Numbers as of August 2014:

4 data centers, 550 nodes, 20,000 spinning disks

Wayback Machine: 9.6 PetaBytes

Books/Music/Video Collections: 9.8 PetaBytes

Unique data: 18.5 PetaBytes

Total used storage: 50 PetaBytes

~~~
ElijahLynn
And that got me thinking to how much 50 petabytes would cost...

From [https://www.backblaze.com/blog/petabytes-on-a-budget-10-years-and-counting/](https://www.backblaze.com/blog/petabytes-on-a-budget-10-years-and-counting/) on September 24, 2019.

-----------------------------------------------------

Storage Pod 1.0 allowed us to store one petabyte of data for about $81,000.

Today we’ve lowered that to about $35,000 with Storage Pod 6.0.

-----------------------------------------------------

Obviously storage cost more in the past than it does today, and it's a
different solution, but if you were to buy 50 petabytes of Backblaze's Storage
Pod 6.0s today, it would come to $1,750,000.

And there are ongoing maintenance costs and drive replacements on top of that.
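The back-of-the-envelope math, spelled out using Backblaze's quoted per-petabyte figures (hardware only; power, redundancy, and replacement drives would come on top):

```python
total_pb = 50  # Internet Archive's total used storage, per the petabox page

cost_pod_1 = total_pb * 81_000  # Storage Pod 1.0: ~$81k per petabyte
cost_pod_6 = total_pb * 35_000  # Storage Pod 6.0: ~$35k per petabyte

print(f"Pod 1.0: ${cost_pod_1:,}")  # $4,050,000
print(f"Pod 6.0: ${cost_pod_6:,}")  # $1,750,000
```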

I feel the need to donate to Internet Archive soon, as I have greatly
benefited from it in the past and am sure to in the future too!

~~~
db48x
If you poke around you can find some more recent stats:
[https://catalogd.archive.org/report/space.php](https://catalogd.archive.org/report/space.php)

That yearly graph is pretty nice :)

~~~
ElijahLynn
Nice, thanks for that!

------
ElijahLynn
Love the URL in the example screenshot, whitehouse.gov. Fantastic example of a
source that _needs_ to be archived.

------
tannhaeuser
As much as I appreciate the Wayback Machine, it's the responsibility of
authors to choose an authoring format that can stand the test of time, at
least for content you care about. HTML is built on a rich foundation of markup
languages that is more than adequate for preservation. Just that it renders in
a browser isn't good enough, as browsers have turned into overly complex
monstrosities with a real risk of losing further browser code bases going
forward (e.g. Mozilla losing their Google deal, and developing browsers
becoming economically infeasible), at which point we're at the mercy of an ad
company to even read our documents.

------
tripzilch
Tip: For people using DuckDuckGo as their default search, if you happen upon a
site that's no longer available, just type "! wayback " in front of the URL.

I suppose you can also set it up as a keyword search in your favourite
browser.
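For the keyword-search route, a bookmark with a `%s` placeholder works in Firefox (and as a custom search engine in Chrome); the keyword `wb` here is just an example:

```
Keyword:  wb
URL:      https://web.archive.org/web/*/%s
```

Typing `wb example.com` in the address bar then jumps straight to the Wayback capture calendar for that URL.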

------
tambre
Unfortunate that their crawler doesn't support IPv6. Trying to save IPv6-only
websites results in "Couldn't resolve host". Hopefully that'll get fixed soon
and not too much will be lost...

~~~
class4behavior
[https://archive.is/](https://archive.is/) supports IPv6.

Whenever possible, everyone should be archiving to both anyway.

------
white-moss
Oh, I donated some of my money to them a few days ago. I'm very happy to read
such wonderful news :) The outlinks feature is great! Ultra useful for blog
sites.

I'm a fan of them.

------
sizzle
Who runs www.archive.is and are they related?

~~~
333c
They're not related.

