
ArchiveBox: Open-source self-hosted web archive - based2
https://github.com/pirate/ArchiveBox
======
burtonator
I actually talked to the author of ArchiveBox about 1-2 weeks ago. Nice guy
and it's good that he's also friendly with the Internet Archive.

ArchiveBox uses WARC as its backing store:

[https://en.wikipedia.org/wiki/Web_ARChive](https://en.wikipedia.org/wiki/Web_ARChive)

which is nice because it's standardized.

We were discussing integrating Polar web archives along with ArchiveBox and
maybe having some sort of standard to automatically submit these WARCs to the
Internet Archive as part of your normal browsing activity.

Polar has a similar web capture feature, but it's not WARC-based

[https://getpolarized.io/](https://getpolarized.io/)

(yet)...

WARC is probably the easiest standard for Polar to adopt. Right now we use
HTML encoded in JSON objects.

When the user captures a web page we save all resources and store them in a
PHZ file which you can keep as your own personal web archive.
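
For what it's worth, going from a captured page to a WARC record shouldn't be much code. A minimal sketch using the warcio library, where the URL, HTML, and headers are placeholder values rather than Polar's actual capture format:

    from io import BytesIO

    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    # Placeholder capture; a real converter would pull these out of the PHZ.
    url = 'https://example.com/article'
    html = b'<html><body>captured page</body></html>'

    with open('capture.warc.gz', 'wb') as fh:
        writer = WARCWriter(fh, gzip=True)
        http_headers = StatusAndHeaders('200 OK',
                                        [('Content-Type', 'text/html')],
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record(url, 'response',
                                           payload=BytesIO(html),
                                           http_headers=http_headers)
        writer.write_record(record)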

What I'd like to eventually do is update our extension to auto-capture web
pages so you could use Polar's cloud storage feature to basically store every
page you've ever visited.

It really wouldn't be that much money per year. I did the math and it's about
$50 per year to store your entire web history.
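
Roughly, the back-of-envelope version looks like this, where every number is my own assumption rather than a measured figure:

    # Back-of-envelope check of the ~$50/year figure. All inputs are
    # assumptions, not Polar's actual pricing or measurements.
    pages_per_day = 150          # assumed browsing volume
    mb_per_page = 2.0            # assumed average capture size
    price_per_gb_month = 0.023   # commodity object-storage pricing

    gb_per_year = pages_per_day * 365 * mb_per_page / 1024
    yearly_cost = gb_per_year * price_per_gb_month * 12
    print(f'{gb_per_year:.0f} GB/year -> ${yearly_cost:.2f}/year')
    # ~107 GB/year -> ~$30/year; a couple of years of accumulated history
    # lands in the $50/year ballpark.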

If I can get Polar over to WARC, tools like ArchiveBox and Polar could
interop, and we could do things like automatically send the documents you
browse to the Internet Archive.

There's one huge problem though: what do we do about cookies and private data?
I'm really not sure what to do there. It might be possible to strip this data
for certain sites (news) without any risk of violating the user's privacy.
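
One option would be a sanitizing pass over the WARC before submission. A hypothetical sketch with the warcio library (this isn't something Polar or ArchiveBox implements; a real tool would also recompute record digests and scrub identifiers embedded in response bodies):

    import sys

    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    # Copy in.warc.gz to out.warc.gz, dropping headers that carry private
    # state. Usage: python strip_private.py in.warc.gz out.warc.gz
    with open(sys.argv[1], 'rb') as inp, open(sys.argv[2], 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        for record in ArchiveIterator(inp):
            if record.http_headers:
                for header in ('Cookie', 'Set-Cookie', 'Authorization'):
                    record.http_headers.remove_header(header)
            writer.write_record(record)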

~~~
peterwwillis
Why don't they store everything as plain documents in a zip file and keep
metadata in a JSON file? Seems more future-proof and easier for users to
manipulate than WARC.

~~~
bnewbold
Have you looked at the WARC format? It's ridiculously simple: basically
concatenated raw HTTP requests and responses, with some extra metadata
headers mixed in (a la extra JSON metadata keys). You can open it with a text
editor. It's very simple to manipulate, and very efficient to iterate over or
generate.
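
For illustration, a response record looks roughly like this (abridged; lengths and IDs elided):

    WARC/1.1
    WARC-Type: response
    WARC-Target-URI: https://example.com/
    WARC-Date: 2019-01-13T00:00:00Z
    WARC-Record-ID: <urn:uuid:...>
    Content-Type: application/http; msgtype=response
    Content-Length: ...

    HTTP/1.1 200 OK
    Content-Type: text/html

    <html>...</html>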

[https://iipc.github.io/warc-specifications/specifications/wa...](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#file-and-record-model)

Arguably the biggest problem is that it isn't complex enough: there is no
built-in index of contents (the standard .csv-like index format of
URL/timestamp/hash/offset lines is called CDX).
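
Building a minimal CDX-style listing is correspondingly easy; a sketch with the warcio library (the filename is a placeholder, and real CDX lines carry more fields, like hash, mime type, and status):

    from warcio.archiveiterator import ArchiveIterator

    # Print one URL/timestamp/offset line per response record.
    with open('example.warc.gz', 'rb') as fh:
        records = ArchiveIterator(fh)
        for record in records:
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'),
                      record.rec_headers.get_header('WARC-Date'),
                      records.get_record_offset())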

There aren't a ton of tools in the web archiving space in general, but almost
all of the ones that do exist work with WARC. Existing tools (for interchange)
include bulk indexers (for search, graph analysis, etc.) and "replay" via a
web interface or browser-like application. Apart from the centralized web
archiving services that use WARC, there are several large public datasets,
like Common Crawl, that are released in WARC format.

~~~
peterwwillis
Yeah, the fact that I'd have to convert them afterwards is just an unnecessary
extra step for me, so I'll stick to wget and httrack for my archives. Once I
mirror a site I can just copy the files anywhere and browse them on any
browser/device.

------
aloer
I have spent the last few hours reading up on everything WARC-related that I
could find, but I still haven't been able to answer my main question: why do
WARCs only get created by external crawlers?

There does not seem to be a tool that captures a WARC directly from your own
browser session. webrecorder
([http://webrecorder.io/](http://webrecorder.io/)) is the only example I could
find that comes close in terms of user experience, but it still requires a
third party and different browsing habits.

- are there browser extensions that can save a WARC while you browse?

- are there API limitations that require external browser control? Something
browser extensions can't be used for?

- or is it simply a question of use case, with crawlers being more popular
(for archiving) than locally recorded browsing history (for search/analytics)?

edit:

I have now found
[https://github.com/machawk1/warcreate](https://github.com/machawk1/warcreate);
the related discussions in issues #111 and #112 are quite interesting. Looks
like there are some serious limitations for browser extensions. I will look
deeper into how webrecorder works and how the two could be combined.

~~~
machawk1
When I initially coded up WARCreate, the webRequest API was still
experimental. I believe there are more mature APIs that can be used from the
extension context but some require DevTools to be visually open, which is not
a common usage pattern of a typical web user.

Per the ticket, we have worked on a few other JavaScript-driven web
preservation projects like
[https://github.com/N0taN3rd/node-warc](https://github.com/N0taN3rd/node-warc)
and
[https://github.com/N0taN3rd/Squidwarc](https://github.com/N0taN3rd/Squidwarc),
among others.

Web archiving from a browser extension is difficult, but it can be improved. I
don't know of any other approaches to doing this via a browser extension,
beyond submitting a URI to an archive.

~~~
nikisweeting
Woah both these tools are awesome, thanks for sharing them!

I'm adding links to both from
[https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community)

------
jalopy
This looks awesome. Is there an easy way to transfer my current Chrome session
/ cookies to the ArchiveBox Chromium instance? I would love for my
subscription websites (e.g. nytimes) to register my logged-in state and allow
the capturing instance full logged-in access.

EDIT: Nevermind, should have RTFM - see CHROME_USER_DATA_DIR in
[https://github.com/pirate/ArchiveBox/wiki/Configuration](https://github.com/pirate/ArchiveBox/wiki/Configuration).

This looks really, really awesome.

------
amenod
This sounds like a very good idea, but I'm having trouble making it work. For
example, let's say I want to save a great website which will probably
disappear soon ([https://launchaco.com](https://launchaco.com)). I run `echo
https://launchaco.com | ./archive` and then...? The generated index.html
doesn't load CSS and JS files. Or is this more for static content?

Is there some tool that would allow one to make a copy of a modern SPA? Is
that even possible?

EDIT: I'm sad to see launchaco.com go; it would be a perfect tool for a
project I'm working on. I don't mind paying, but I gather this is not possible
anymore, and anyway, it might take some time for me to have everything
ready...

~~~
gwern
> The generated index.html doesn't load CSS and JS files. Or is this more for
> static content?

Why doesn't it do that? I thought that was the point of using Chrome as a
headless browser: load all the dynamic elements into a final DOM so they could
then be captured & serialized out: "ArchiveBox works by rendering the pages in
a headless browser, then saving all the requests and fully loaded pages in
multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long
after the original content disappears off the internet."

~~~
nikisweeting
It does do that; it's possible it just broke on the page he tried. It doesn't
work perfectly on 100% of pages, which is why it also saves a PDF, a
screenshot, and other formats as fallbacks.
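
For reference, the Chrome side of those fallbacks boils down to invocations roughly like this (a simplified sketch with standard headless-Chrome flags, not ArchiveBox's exact commands; the binary name and output paths vary by setup):

    import subprocess

    URL = "https://example.com/"  # placeholder

    # One headless-Chrome run per artifact: rendered DOM (printed to
    # stdout), PDF snapshot, and PNG screenshot.
    for flag in ("--dump-dom",
                 "--print-to-pdf=output.pdf",
                 "--screenshot=screenshot.png"):
        subprocess.run(["chromium", "--headless", "--disable-gpu", flag, URL],
                       check=True)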

------
platz
This looks very nice.

As an aside,

    wget -r -k -np   # recursive (-r), convert links for offline viewing (-k), don't ascend to the parent directory (-np)

works surprisingly well for my offline needs.

For permanent access I defer to archive.is

~~~
toomuchtodo
Consider
[https://github.com/ArchiveTeam/grab-site](https://github.com/ArchiveTeam/grab-site)
(ArchiveTeam/grab-site) in the future. There is a difference between
fetching/mirroring content and recording the HTTP request and response headers
while retrieving all objects for a site.

I believe there is even a Docker container for a quick pull and run.

~~~
j88439h84
When I want to archive a site, don't I _want_ the content, not just the
headers? How do I save the content?

~~~
toomuchtodo
When archiving, you want the request and response headers _and_ the content.
Grab-site does that (as will really any tool that writes WARC files). Sorry if
my comment was ambiguous in that regard.

[https://en.wikipedia.org/wiki/Web_ARChive](https://en.wikipedia.org/wiki/Web_ARChive)

[https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...](https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml)

If you wanted to write WARC files out with wget, you'd use the options
specified in
[https://www.archiveteam.org/index.php?title=Wget_with_WARC_o...](https://www.archiveteam.org/index.php?title=Wget_with_WARC_output)
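
That boils down to something like the following, a sketch in the spirit of ArchiveBox's own Python wrapper around wget (the URL and WARC filename are placeholders):

    import subprocess

    # Mirror a site while also recording full request/response traffic
    # into example.warc.gz. Flags are per the ArchiveTeam wiki page
    # linked above.
    subprocess.run([
        "wget",
        "--mirror",             # recursive fetch with timestamping
        "--page-requisites",    # also grab the CSS/JS/images pages need
        "--warc-file=example",  # write example.warc.gz alongside the mirror
        "https://example.com/",
    ], check=True)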

~~~
progval
Depends on what you want to use your archive for. If it's just for opening it
locally in a web browser, mirroring the content is enough.

~~~
toomuchtodo
Yes, _but_ if the effort to go the WARC route is minimal above mirroring,
might as well do it. You never know if you’ll need data you didn’t grab.

~~~
nikisweeting
We intend to store both the WARC of the headless browser render and the wget
clone (I'm the creator, @pirate on Github). The idea is to archive anything in
many redundant formats (HTML clone, WARC, PDF, etc.) and for people to run
this on a deduplicating filesystem like ZFS/BTRFS, or to disable specific
methods if they care about saving space.

You can track the progress on implementing WARC saving for all archived
resources here:
[https://github.com/pirate/ArchiveBox/issues/130](https://github.com/pirate/ArchiveBox/issues/130)

------
JustARandomGuy
I was poking through the documentation - how does ArchiveBox generate the WARC
file? I see the web page is archived in HTML, PNG and PDF using Chrome, but I
don't think Chrome natively has the ability to create WARC files, does it?

~~~
wipseabusbus
Seems like it's the --warc-file= argument to wget:
[https://github.com/pirate/ArchiveBox/blob/a74d8410f4c3f2f08f...](https://github.com/pirate/ArchiveBox/blob/a74d8410f4c3f2f08f75f223c15f8615be85744f/archivebox/archive_methods.py#L208)

------
ernesth
That is great. I am currently archiving my Pocket archive. Some time ago I
considered using my own wallabag instance, but it was too much work and
maintenance. ArchiveBox worked locally right away.

Unfortunately, it seems that for some sites it is archiving the GDPR warning
or a bot captcha instead of the content. For example, tumblr only gets me
"Before you continue, an update from us" with an 'accept' button that does
nothing; for Fastcompany, the PDF output is overlaid with a half-page popup on
each page; ouest-france gives me an "I am not a Robot" captcha...

Those issues are probably linked to the fact that I do not use Chromium,
though it is installed. Once I understand how to archive the masked content, I
intend to look at each screenshot to detect problems and re-archive those
pages.

~~~
ernesth
Urgh. I had not thought about some ill effects:

1. It used no ad-blocking, so the bot happily downloaded all the trackers and
advertisements! Also, trackers know my Pocket bookmarks now...

2. Websites are ugly. In a 1440x2000 screenshot of a National Geographic
article, the top 1440x1500 pixels consist of black with just a grey round
progress indicator in the middle and the word "ADVERTISEMENT" at the top.

3. Websites are wasteful. Many WARC archives clock in at 30 to 50 MB. Even
without downloading media, a single (article) webpage is 30 MB!

4. While consulting the archive, my adblocker allows everything, since the
initial page is served over the file:// protocol. It seems ArchiveBox does not
rewrite the URLs in its HTML output (as wget's convert-links option would do),
so it's trackers and ads galore when opening output.html, and even
archive/<timestamp>/index.html automatically loads the other files. A rough
sketch of a convert-links-style rewriting pass is shown below.
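
Hypothetically, a crude post-processing pass could point absolute URLs at local paths the way wget's --convert-links does. This is not an existing ArchiveBox feature; the path and rewrite rule are illustrative only, and robust rewriting needs a real HTML parser:

    import pathlib
    import re

    # Crudely turn absolute URLs into relative ones so the archived page
    # stops phoning home to trackers when opened locally.
    page = pathlib.Path('archive/1546000000/output.html')  # placeholder path
    html = page.read_text(errors='replace')
    html = re.sub(r'(src|href)="https?://', r'\1="./', html)
    page.with_suffix('.offline.html').write_text(html)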

------
CMCDragonkai
Can archived content be put on a public permalink, e.g. via IPFS?

------
ohlookabird
That looks pretty nice! I currently have Wallabag running on a server (using
it with a browser plugin to save pages) and it works pretty well. While
Wallabag is strictly for websites as far as I understand, this seems to
support more data (like bookmarks, history, etc.), which is great, and I will
certainly give it a try; luckily, import from Wallabag is supported too.

------
midnightdiesel
This is great. I was just lamenting the lack of a nice tool like this, and I’m
looking forward to seeing this develop.

------
walterbell
If this were running on a self-hosted server, could it be invoked from an iOS
shortcut on phone/tablet?

~~~
sah2ed
Potentially, yes.

There is an open issue on GitHub to add an API. Once that feature is
implemented, adding a web page where links can be submitted for archival
becomes trivial.

~~~
WrtCdEvrydy
ArchiveBox UI coming soon

------
5_minutes
So this is a direct Pinboard competitor?

Would be nice to save entire pages without paying extra.

------
thrilleratplay
Has anyone heard any news on Mozilla open-sourcing Pocket? There are pieces
that have been released, but ultimately I would like to self-host a Pocket
archive, similar to Wallabag.

~~~
toomuchtodo
> Has anyone heard any news on Mozilla open-sourcing Pocket?

All of Pocket? It's fairly straightforward to extract your Pocket data through
their API (though their export functionality unfortunately leaves much to be
desired).

------
mimixco
This sounds terrific! Thank you for building it!

~~~
nikisweeting
You're welcome <3

Let me know if you have any questions! (I'm @pirate on Github)

------
ausjke
What about SPA sites where JavaScript renders the HTML? If this can archive
SPA sites, that will be great.

~~~
nikisweeting
The short answer is that it can (usually) archive SPAs successfully. The long
answer is that the wget archive method will download the JS, and in theory it
should execute in the archived version just like it does on the live site, but
in practice it doesn't always work 100%. Luckily, that's why ArchiveBox also
saves a Chrome-rendered version of the page as a PDF, a screenshot, and a DOM
dump, so you should be able to archive most JavaScript-heavy content without
too many problems.

------
joustfawrthis
Great idea, but the automated installer might not be ready for prime time.

The automated install did not detect an existing Python 3.6 on OSX and created
quite the mess...

Also, it defaults to using Google Chrome vs. Chromium when both are installed.

~~~
nikisweeting
Yup, I totally agree (I'm the creator @pirate on Github).

I personally don't like automated setup scripts, which is one of the reasons I
spent a bunch of time on our Manual Install docs:
[https://github.com/pirate/ArchiveBox/wiki/Install](https://github.com/pirate/ArchiveBox/wiki/Install)

Pip, apt, & homebrew packages are the install methods we're moving towards
currently. I just caved to user demand in the short term and added a helper
script about a year ago as a crutch until we finish releasing the real
packages.

Stay tuned for updates on that progress here:
[https://github.com/pirate/ArchiveBox/issues/120](https://github.com/pirate/ArchiveBox/issues/120)

------
C-Consciousness
All browsers should aggressively cache everything they get and bypass all
"anti-cache" inanities. There are addons that modify response headers, but
this is all just filthycasulness; everything is hostile to the user--just as
it should be. If people do not get exsanguinated violently enough, they feel
really antsy.

All data is to be immediately uploaded to Tor+IPFS. The browser is meant to
"play coy" and act as though it respected the headers in online mode, but
shift to offline and it loads the entire history fully. This would work well
with Tor Browser. The same thing should apply to all videos, of course. Using
such tools would break people's anonymity set, which is why after each TBB
close the entire cache2 folder should be exported to a file that can then be
imported offline-only. A simple way to do this would be to copy the entire
tor-browser directory and then reopen it in work-offline mode. The problem is
that the anti-cache websites are then lost, so just as with JavaScript, refuse
to use websites that employ anti-cache mechanisms.

The user is not supposed to feel like everything can get annihilated at any
time: this level of hostility towards her will not go unpunished. Generally,
whoever takes anything down harms the Noosphere and should be viewed as an
enemy of Posthumanism. Steaks on the table by choice and consent--treat them
cruelly and without mercy. The Sibyl System will show no mercy on those who
have ever forced users to enable JavaScript or prevented IA from archiving
their pages.

