
Archiving web sites - stargrave
https://lwn.net/Articles/766374/
======
fake-name
Whee, a topic close to my heart!

[https://github.com/fake-name/ReadableWebProxy](https://github.com/fake-name/ReadableWebProxy) is a project of mine that started out as a simple
rewriting proxy, but at this point is basically a self-contained archival
system for entire websites, complete with preserved historical versions of
scraped content. It has a distributed fetching frontend[1], uses chromium[2]
to optionally deal with internet-breaking bullshit (Helloooo cloudflare!
Fuuuuucccckkkkk yyyyooouuuuuuu), supports multiple archival modes (raw, i.e.
not rewritten and destyled, and a rewritten format which makes reading
internet text content actually nice), and a bunch of other stuff. The links in
fetched content are rewritten to point within the archiver, and if content is
not already retrieved, it's fetched on-the-fly as you browse.

It also has plugin-based content rewriting features, allowing the complete
reformatting of content on-the-fly, and functions as a backend to a bunch of
other projects (I run a translated light-novel/web-novel tracker site, and it
also does the RSS parsing for that).

I've been occasionally meaning to add WARC forwarding to the frontend, and feed
that into the internet archive, but the fetching frontend is old, creaky and
brittle (it's some old code), and does a lot of esoteric stuff that would be
hard to replicate.

[1]: [https://github.com/fake-name/AutoTriever](https://github.com/fake-name/AutoTriever)

[2]: [https://github.com/fake-name/ChromeController](https://github.com/fake-name/ChromeController)

------
londons_explore
Things that attempt to rewrite links and inline css and javascript are doomed
to fail. Many sites do weird javascript shenanigans, and without a million
special cases, you'll never make it work reliably. Just try archiving your
facebook news feed and let me know how it goes.

Instead, archivists should try to record the exact data sent between the
server and a real browser, and then save that in a cache. Then, when viewing
the archive, use the same browser and replay the same data, and you should see
the exact same thing! With small tweaks to make everything deterministic
(disallow true randomness in javascript, set the date and time back to the
archiving date so SSL certs are still valid), this method can never 'bit rot'.
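
Roughly, with off-the-shelf tools (warcprox for the capture, pywb for the
replay, libfaketime to wind the clock back; treat the exact flags and paths
as a sketch rather than a recipe), that looks something like:

    # record: point a real browser at a WARC-writing MITM proxy
    mkdir -p warcs
    warcprox -p 8000 -d warcs &
    chromium --proxy-server=http://localhost:8000 --ignore-certificate-errors 'https://example.com/'

    # replay: load the captured WARCs into a local wayback instance (pywb)
    wb-manager init mycoll
    wb-manager add mycoll warcs/*.warc*
    wayback &    # serves on localhost:8080 by default

    # view with the clock wound back to the capture date, so certificate
    # and date checks behave as they did at archiving time
    faketime '2018-11-02 12:00:00' chromium 'http://localhost:8080/mycoll/https://example.com/'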

When technology moves on, and you can no longer run the browser and proxy, you
wrap it all up in a virtual machine, and run it like that. Virtual machines
have successfully preserved games console data nearly perfectly for ~40 years
now, which is far better than pretty much any other approach.

~~~
megous
Communication is dependent on JS events, which are dependent on the user's actions.
There's also localStorage and other such things. Your method might work for
some simple JS based websites, but it's no silver bullet.

~~~
bicubic
While what you say is true, the above method is the _only_ method to archive
arbitrary web pages. Yes it depends on user interactions to some extent, but
it's possible to reasonably let a page load until fetches stop, and consider
it rendered. Generally speaking, you can only archive some preset interactions
with a modern web page. You can't hope to capture it all.

There are tools like WebRecorder[0] that do this to some extent by recording
and replaying all requests. It's certainly a step in the right direction and
demonstrates that the approach is viable. This was the only approach I tried
that worked for archiving stuff like three.js demos. Worth mentioning there's
also an Awesome list[1] that covers various web archival technologies.

[0] [https://github.com/webrecorder/webrecorder](https://github.com/webrecorder/webrecorder)

[1] [https://github.com/iipc/awesome-web-archiving](https://github.com/iipc/awesome-web-archiving)

~~~
nikisweeting
What I'm trying to do with Bookmark Archiver is record user activity to create
a puppeteer script while the user is visiting the page, then replay that on a
headless browser later and record the session to a WARC. That should cover
both dynamically requested and interactive stuff that is otherwise being
missed by current archiving methods.
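
The record-to-WARC half of that is the easier part to sketch: a headless
browser driven through a WARC-writing proxy. This is just an illustration
using warcprox, not Bookmark Archiver's actual pipeline, and it leaves out
the replayed interactions:

    # capture everything a headless Chromium session requests into WARCs;
    # the replayed puppeteer script would drive this same browser instance
    mkdir -p warcs
    warcprox -p 8000 -d warcs &
    chromium --headless --disable-gpu \
        --proxy-server=http://localhost:8000 --ignore-certificate-errors \
        --virtual-time-budget=10000 --dump-dom 'https://example.com/' > /dev/null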

I also plan on saving an x86 VM image of the browser used to make the archive
every couple months so that sites can be replayed many decades into the
future.

------
AlphaWeaver
In the realm of scraping and page archiving, I'd like to note a library I
found useful recently, called `freeze-dry` [0][1]. It packages a page into a
SINGLE HTML file, inlining relevant styles. The objective is to try and
replicate the exact look and structure of the page instead of all the
interactive elements. Very useful for building a training dataset for any
algorithms that read web pages.

[0]: [https://www.npmjs.com/package/freeze-dry](https://www.npmjs.com/package/freeze-dry)

[1]: [https://github.com/WebMemex/freeze-dry](https://github.com/WebMemex/freeze-dry)

------
kevingrahl
I’ve been using

    wget -E -H -k -K -nd -N -p -P pageslug URL

for some time now and never had any issues with it. I created a .bash_aliases
entry so that now I only have to type

    war pageslug URL

to archive some website.
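
For reference, the .bash_aliases entry can be as simple as this (the `war`
name and the flag set are the ones above; since `-P` takes the directory and
the URL comes last, a plain alias is enough):

    # "war": mirror a single page plus its prerequisites into the named directory
    alias war='wget -E -H -k -K -nd -N -p -P'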

I haven’t archived too many websites (I focus more on media files like videos,
ebooks and such), which is probably why I haven’t run into any issues yet, but
I’d be interested if somebody has a link that doesn’t work with this method,
just so I can see what the result would be like.

Here’s an explanation of the method I use for anyone interested:
[https://gist.github.com/dannguyen/03a10e850656577cfb57](https://gist.github.com/dannguyen/03a10e850656577cfb57)

------
ivarv
Archiving one's web browsing trail seems to be a common use case. Here are
some promising related projects that have been on HN:

* [https://github.com/pirate/bookmark-archiver](https://github.com/pirate/bookmark-archiver)

* [https://getpolarized.io/](https://getpolarized.io/)

------
danso
I've got a set of small repos of small government sites that I've snapshotted
using a combination of `wget`, `curl`, and other shell commands, mostly so I
can have a reliable mirror when teaching web scraping:
[https://github.com/wgetsnaps](https://github.com/wgetsnaps)

But as the submitted article points out, archiving the Web is much trickier
these days, and wget is no longer sufficient for anything relatively modern.
I've been impressed with what the Internet Archive has seemingly been able to
do, and I've been curious whether it's the result of improved techniques on
their side, or of certain sites following a standard that happens to make them
more archivable.

For example, 538's 2018 election trackers are very JS dependent, yet IA has
managed to capture them in a way that not only preserves the look and content,
but keeps their widgets and visualizations almost fully functional:

[https://web.archive.org/web/20181102125134/https://projects....](https://web.archive.org/web/20181102125134/https://projects.fivethirtyeight.com/2018-midterm-election-forecast/house/)

However, even the excellent archive of 538's site shows a huge weakness in
IA's efforts: IA (quite understandably) aggressively caches a site's
dependencies, such as external JS and JSON data files. If you scroll down the
538 example posted above, you'll see that despite being a snapshot on Nov. 2,
2018, many of its widgets only contain data from the last time IA fetched its
external dependencies, which appears to be August 16, 2018.

------
unicornporn
This post doesn't mention the Webrecorder Player[1], which is a GUI app that
displays WARC files. It's probably the easiest way to view Web ARChives.

For those willing to set up a docker container, check out Warcworker[2].

[1] [https://github.com/webrecorder/webrecorder-player](https://github.com/webrecorder/webrecorder-player)

[2]
[https://github.com/peterk/warcworker](https://github.com/peterk/warcworker)

~~~
anarcat
I have dietary restrictions against Electron apps, which is why I didn't
mention Webrecorder Player. Besides, last time I suggested anything touching
node.js to LWN readers, the answer was a clear "nope", so I tend to tread
carefully there as well. ;)

------
batuhanicoz
Disclaimer: My company works with Teyit and I've built the archiving product.
Also: This is a shameless plug.

Teyit.org[0], the biggest fact-checking organization in Turkey, has their own
archiving site called teyit.link[1].

It's a non-profit organization and they automatically archive any link that's
sent to them via their site, Twitter, Facebook, etc. It's also usable by the
public.

It's open source on GitHub[2] and we've actually been developing a new
version[3], with plans to add `youtube-dl` alongside WARC support.

[0] [https://teyit.org](https://teyit.org)

[1] [https://www.teyit.link](https://www.teyit.link)

[2] [https://github.com/teyit/teyitlink-web](https://github.com/teyit/teyitlink-web)

[3]
[https://github.com/noddigital/teyit.link](https://github.com/noddigital/teyit.link)

------
patrickyeon
gwern has a very involved post on archiving as well,
[https://www.gwern.net/Archiving-URLs](https://www.gwern.net/Archiving-URLs)

Somewhere on my to-do list is archiving everything I visit on the internet.
It's frustrating to know that I've seen something, but be unable to find it
again.

~~~
gildas
For this, you could use SingleFile [1], an extension for Chrome and Firefox.
It can auto-save pages.

[1] [https://github.com/gildas-lormeau/SingleFile](https://github.com/gildas-lormeau/SingleFile)

~~~
vackosar
unfortunately it requires a lot of permissions. I try to minimize exts like
this

~~~
gildas
I really did my best to minimize the APIs used by the extension. Note that
Chrome 70 allows you to restrict extensions by host [1].

[1] [https://blog.chromium.org/2018/10/trustworthy-chrome-extensi...](https://blog.chromium.org/2018/10/trustworthy-chrome-extensions-by-default.html)

------
burtonator
I think you guys might like this personal web archival tool I launched about a
month ago:

[https://getpolarized.io/](https://getpolarized.io/)

It's basically an offline browser where you can capture full HTML pages
locally including the iframes, and tag and annotate the content.

I should have cloud sync support in the next release (1-2 weeks) which will
allow you to keep your data in the cloud and sync it between machines.
Initially it will just support Firebase but I have plans to support other
cloud providers via plugins.

I'd also like to support end-to-end encryption so that you don't have to worry
about people reading your data.

There's a huge Hacker News thread about Polar here:

[https://news.ycombinator.com/item?id=18219960](https://news.ycombinator.com/item?id=18219960)

A semi-requested feature is full recursive archival of content but I don't
think we're going to go in that direction. Instead I think we're going to
support pasting or importing a list of URLs.

Many documentation sites have an index or table of contents, and this way I
can just fetch and store all those URLs without over-fetching.

My background is search and I built a petabyte-scale search service named
Datastreamer ([http://www.datastreamer.io/](http://www.datastreamer.io/)). I'm
also one of the inventors of RSS - so I have a lot of ideas on the roadmap
here.

It also supports PDFs, text and area highlights, comments, flashcards and sync
with Anki.

The initial response after our release has been amazing. The user base is
really engaged, with thousands of monthly active users and contributors.

Anyway. Take it for a spin. It's free and Open Source.

------
frontier
I recently had to do this, and after a lot of frustration with wget, httrack,
and some commercial tools as well, I ended up settling on the results of this
free product, WebCopy.

[https://www.cyotek.com/cyotek-webcopy](https://www.cyotek.com/cyotek-webcopy)

Background: We couldn't keep the existing platform running, so we had to
transition to static HTML files.

I used the WebCopy scan log to create the apache rewrite rules to preserve the
existing link structure.

Where I say WebCopy was better, it was this simple log, but also the file
structure it produced, which was much cleaner, with fewer junk pages and
duplicates. (The site was an absolute inconsistent mess to begin with.)

~~~
frontier
I was surprised not to find any product that could create a perfect static
clone of the original, at least as far as maintaining the incoming link
structure goes.

I know there would be a tonne of edge cases and obviously it would need to be
targeted to a particular platform, but I think we came pretty close with this
simple technique.

------
jnurmine
I am of the opinion that anything you might need to read two or three times
over a longer period of time should be copied locally, or to some service, for
later retrieval.

I use Pocket a lot, and "share" into it from different devices.

But sometimes I just do "wget --mirror -np -k --limit-rate=10k
https://interesting.stuff.here.org/" on a PC.

I'd actually like a self-hosted variant of Pocket, but have not really
researched whether those even exist. Anyone with a suggestion?

------
nikisweeting
Thanks for the mention of bookmark-archiver! WARC support has been high on my
list for a long time, but unfortunately, I have a day job that keeps me super
busy.

The author, Antoine Beaupré, is also an engineer living in Montreal who works
on mesh networking stuff. Are we the same person?! I just sent him an email to
make sure it doesn't land in my own inbox...

~~~
anarcat
Hi Nick! Yes, I am the same person as ... er... myself, if that makes any
sense at all. :) I'll follow up by email, thanks for reaching out!

------
anarcat
Author here. AMA.

Since I wrote this, I started experimenting with grab-site:

[https://github.com/ludios/grab-site](https://github.com/ludios/grab-site)

It's a wrapper around (and a fork of) wpull, but the main advantage over it is
that it can do on-the-fly reconfiguration of delay, concurrency, ignore
patterns and so on. It also provides a nice web interface. If you're only
crawling one site every once in a while, wpull and crawl are fine, but for
larger projects, grab-site is a must.
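
To give a flavour of the on-the-fly part, a session looks roughly like this
(the file names inside the crawl directory are from grab-site's README as I
remember them, so double-check before relying on this):

    gs-server &                                   # web dashboard for watching crawls
    grab-site --concurrency=2 'https://example.com/'

    # a running crawl can be reconfigured by editing plain files inside the
    # crawl directory grab-site created (concurrency, delay, ignores, ...)
    echo 1 > example.com-*/concurrency            # actual crawl dir name will differ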

------
shakna
I was working on an archiving tool a little while back, though I haven't
touched it recently.

It would recursively convert a page into a single URI. Chrome seems to have a
limit on URL length, but Firefox doesn't, so far as I can tell.

Copy the contents of [0] into your URL bar, and you'll see not just the page
but also the Python script embedded in it. (It's a bit long to dump onto a
forum page.)

[0]
[https://shakna.keybase.pub/offlineweb](https://shakna.keybase.pub/offlineweb)
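
The core trick, minus the recursive inlining, is just packing the page into a
`data:` URI; a minimal sketch (GNU coreutils `base64`, not the actual script
behind [0]):

    # turn a saved HTML file into a single self-contained data: URI
    printf 'data:text/html;base64,%s\n' "$(base64 -w0 page.html)"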

------
ernsheong
Those desiring something simple can check out
[https://www.pagedash.com/](https://www.pagedash.com/). Note: requires login
and saves pages to the cloud.

(Disclaimer: I am the maker of PageDash)

------
kev009
I use a wget invocation similar to the one listed for ps-2.kev009.com. I've
recently used HTTrack in a few places where wget had issues, and was impressed
with it as well.

------
alwaysreading
Love wget! The Wayback Machine is a great tool, but I wish there were a more
robust/complete service out there. Maybe the government is archiving, or will
archive, the top million sites or something like that.

