
Inside Wayback Machine, the internet’s time capsule - rmason
https://thehustle.co/inside-wayback-machine-internet-archive
======
pronoiac
If you're in San Francisco this Wednesday, check out their annual bash:
[https://blog.archive.org/2018/08/20/save-the-date-building-a...](https://blog.archive.org/2018/08/20/save-the-date-building-a-better-web-internet-archives-annual-bash/)

------
branweb
Always good to pause and reflect on the ephemeral nature of knowledge on the
www. I've always admired the Internet Archive's Sisyphean mission to preserve
some piece of it.

P.S. Those statues of internet saints occupying the old benches of the former
church/current IA HQ are neat and kinda disturbing.

~~~
Fnoord
At the same time, the past is now fully documented. The knife cuts both
ways.

------
tannhaeuser
I very much appreciate Wayback Machine's work and would like to support them
by offering our SGML software for free (see contact info on
[http://sgml.io](http://sgml.io) or PM).

SGML can be used as a Swiss Army knife for all kinds of difficult HTML
parsing, manipulation, and preservation tasks, since it uses classic DTD
grammars for your HTML flavor at hand rather than having a particular HTML
grammar hardcoded. For example, see our HTML 5.1 DTD at [1] (which can be used
freely with any SGML software anyway).

In today's dark age of the web, we're losing content daily as classic web
sites are shutting down.

[1]: [http://sgmljs.net/docs/html5.html](http://sgmljs.net/docs/html5.html)

------
rmason
Does anyone know if they take kindly to visitors? I'm always looking for
things to do when I'm in SF and have a few hours to spare, and this
interests me.

~~~
jonah-archive
We have public tours nearly every Friday at 1pm! Ping us beforehand to let us
know you're coming:
[https://archive.org/about/contact.php](https://archive.org/about/contact.php)

------
celerity
I wonder if the Wayback Machine people are using a (potentially more modern)
version of the AOPIC algorithm to decide what to archive. I wrote an article
about that algorithm (which is similar to the original PageRank, but simpler
IMO), and stated that a service "_like_ the Wayback Machine would probably
use something like AOPIC." It would be nice to remove that first _like_ from
the sentence!

[1] [https://intoli.com/blog/aopic-algorithm/](https://intoli.com/blog/aopic-algorithm/)
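
For anyone who hasn't seen AOPIC before, here is a minimal sketch of the
cash/history idea it builds on. This is a simplified illustration, not the
code from the linked article and not anything the Wayback Machine is known to
run. The function name `aopic`, the `VIRTUAL` page name, the greedy "visit the
richest page" policy, and the toy link graph in the usage note are all
assumptions made for the example.

```python
from typing import Dict, List

VIRTUAL = "__virtual__"  # hypothetical name for the helper page (assumption)


def aopic(links: Dict[str, List[str]], steps: int = 5000) -> Dict[str, float]:
    """Estimate page importance with a simplified OPIC-style cash/history loop."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    if not pages:
        return {}

    # Every real page also links to a virtual page, and the virtual page links
    # back to every real page, so cash never gets stuck in dead ends.
    out = {p: list(links.get(p, [])) + [VIRTUAL] for p in pages}
    out[VIRTUAL] = list(pages)

    cash = {p: 1.0 / len(pages) for p in pages}  # total cash starts at 1.0
    cash[VIRTUAL] = 0.0
    history = {p: 0.0 for p in out}

    for _ in range(steps):
        # Greedy policy (an assumption): visit the page holding the most cash.
        page = max(cash, key=cash.get)
        amount = cash[page]
        history[page] += amount  # record the cash this page had collected
        cash[page] = 0.0
        share = amount / len(out[page])
        for target in out[page]:
            cash[target] += share  # split the cash evenly over the outlinks

    # Importance is each real page's share of all recorded history.
    total = sum(history[p] for p in pages)
    if total == 0:
        return {p: 0.0 for p in pages}
    return {p: history[p] / total for p in pages}
```

Calling `aopic({"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]})`
should rank `hub` highest, since three pages link to it. A crawler could use
scores like these to decide which pages to revisit first.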

~~~
greglindahl
No, Heritrix doesn't really do any ranking as it crawls; it's all up to a
cleverly chosen seed.

AOPIC looks to me like it's roughly the same as Yahoo's iterative PageRank
algorithm, but I didn't look at it that carefully.

------
bane
Anybody know of something like this that I can use for personal archiving?

~~~
PeterMikhailov
[https://wallabag.org/en](https://wallabag.org/en)

If you run your own, make sure you turn on "Download images"

~~~
zaarn
Wallabag wasn't quite suitable for archival use last I checked, since it
doesn't present or save the original webpage.

That's one of the major reasons archive.is/Wayback is popular: you get the
exact same page as you get in the browser. Perfect for archival.

------
datavirtue
Love the Wayback Machine. An exploit of MODX recently resulted in losing a
website that I maintain. Remembered the Wayback Machine... all content plus my
ass saved.

------
cyborgx7
The Wayback Machine is an archive, not a time capsule. Very different things.

