ArchiveBox uses WARC as its backing store, which is nice because it's standardized.
We were discussing integrating Polar web archives with ArchiveBox, and maybe having some sort of standard way to automatically submit these WARCs to the Internet Archive as part of your normal browsing activity.
Polar has a similar web capture feature, but it's not WARC.
WARC is probably the easiest standard for Polar to adopt. Right now we use HTML encoded in JSON objects.
When the user captures a web page, we save all of its resources and store them in a PHZ file, which you can keep as your own personal web archive.
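Roughly speaking it's a ZIP of the captured resources plus JSON metadata. A toy sketch of the idea (illustrative only, not the actual PHZ layout):

    import json, zipfile

    # Hypothetical capture: the page's HTML plus the resources it references.
    capture = {
        "url": "https://example.com/article",
        "captured_at": "2019-01-21T06:00:00Z",
        "resources": ["index.html", "styles.css", "hero.jpg"],
    }

    with zipfile.ZipFile("capture.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.json", json.dumps(capture, indent=2))
        zf.writestr("index.html", "<html>...captured page...</html>")
        # Binary assets go in as raw bytes -- no base64 needed inside a ZIP.
        zf.writestr("hero.jpg", b"...raw image bytes...")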
What I'd like to eventually do is update our extension to auto-capture web pages so you could use Polar's cloud storage feature to basically store every page you've ever visited.
It really wouldn't be that much money per year. I did the math and it's about $50 per year to store your entire web history.
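The math is simple enough to redo with your own numbers. Here's the shape of it, with purely illustrative placeholders (your browsing volume and storage pricing will differ):

    # Illustrative placeholders only -- plug in your own figures.
    pages_per_day = 150          # assumed pages captured per day
    avg_page_mb = 1.5            # assumed average size of a captured page
    price_per_gb_month = 0.023   # assumed object-storage price in USD

    gb_per_year = pages_per_day * avg_page_mb * 365 / 1024
    cost_per_year = gb_per_year * price_per_gb_month * 12
    print(f"~{gb_per_year:.0f} GB/year, ~${cost_per_year:.0f}/year to keep stored")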
If I can get Polar over to WARC, that would mean tools like ArchiveBox and Polar could interoperate, and we could do things like automatically send the documents you browse to the Internet Archive.
There's one huge problem though: what do we do about cookies and private data? I'm really not sure what to do there. It might be possible to strip this data for certain sites (news) without any risk of violating the user's privacy.
Arguably the biggest problem is that it isn't complex enough: there is no built-in index of contents (the standard CSV-like index format of URL/timestamp/hash/offset is called CDX).
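For a concrete picture, a CDX index is just one space-separated line per captured record, roughly like this (the exact field layout varies between profiles, but it's essentially SURT-form URL key, timestamp, original URL, MIME type, HTTP status, content digest, record length, byte offset, and the WARC file the record lives in):

    com,example)/article 20190121060000 https://example.com/article text/html 200 QXJ3... - - 24500 893 captures-2019.warc.gz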
There aren't a ton of tools in the web archiving space in general, but almost all of the ones that do exist work with WARC. Existing tools (for interchange) include bulk indexing (for search, graph analysis, etc) and "replay" via web interface or browser-like application. Apart from specific centralized web archiving services that use WARC, there are several large public datasets, like Common Crawl, that are released in WARC format.
If Polar gets popular enough, it can potentially dictate the standard. So I think it is best to compare the formats from a purely technical perspective. ZIP + JSON with as simple an inner structure as possible has the huge advantage of being easy for anyone to handle: I can open (and actually read/modify) such a file with tools I'll have on any machine, anytime. It is so simple and obvious that I could write a script that packs some data into a format probably (at least partially) readable by your software in a minute.
1. Can I (meaningfully) do the same with this WARC thing?
2. What are the technical benefits of WARC over this unnamed (yet) file format?
3. Can WARC be losslessly converted into *.polar? And the other way around?
4. Are there tools to do #3? Is it tricky to implement on a new platform?
I mean, if you can actually propose something better than WARC (or whatever), you can potentially save the world from another WebDAV (or name your favorite horrible standard that we cannot get rid of because everyone uses it).
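On #1: pretty much, at least in Python. With the warcio library (the same one webrecorder/pywb build on), capturing live HTTP traffic into a WARC and reading it back is only a few lines -- a minimal sketch, not production code:

    # pip install warcio requests
    from warcio.capture_http import capture_http
    import requests  # note: import after capture_http so it can be patched
    from warcio.archiveiterator import ArchiveIterator

    # Record live HTTP traffic into a gzipped WARC file.
    with capture_http("example.warc.gz"):
        requests.get("https://example.com/")

    # Read it back: iterate records and pull out the response bodies.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"),
                      len(record.content_stream().read()), "bytes")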
The advantages of SQLite are too numerous to list, and it has a built-in compressed archive format (sqlar), so there's no size bloat to worry about vs. ZIP files.
As an example, this should make it practical, given a few iterations, to store multiple snapshots as deltas, deduplicating identical content. It also obviates having to base64 encode images and other binary assets.
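To make the dedup idea concrete, here's a toy sketch (hypothetical schema, not a proposal for the real thing): assets are stored once, keyed by content hash, and snapshots just reference them.

    import hashlib, sqlite3

    db = sqlite3.connect("archive.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS blobs (
        sha256 TEXT PRIMARY KEY,   -- content hash doubles as the dedup key
        data   BLOB NOT NULL       -- raw bytes, no base64 needed
    );
    CREATE TABLE IF NOT EXISTS snapshot_files (
        snapshot_id TEXT,
        path        TEXT,
        sha256      TEXT REFERENCES blobs(sha256)
    );
    """)

    def store(snapshot_id, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Identical content across snapshots is stored exactly once.
        db.execute("INSERT OR IGNORE INTO blobs VALUES (?, ?)", (digest, data))
        db.execute("INSERT INTO snapshot_files VALUES (?, ?, ?)",
                   (snapshot_id, path, digest))
        db.commit()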
SQLite is a recommended storage format (https://www.sqlite.org/locrsf.html) by the Library of Congress
> As of this writing (2018-05-29) the only other recommended storage formats for datasets are XML, JSON, and CSV
- they promise support until at least 2050 (https://sqlite.org/lts.html)
- if a promise isn't enough, SQLite is to be supported for the entire lifetime of the Airbus A350 airframe (https://mobile.twitter.com/copiousfreetime/status/6758345433...). I would assume Airbus will be paying for it.
A real problem with depending on a single application is that it makes it extremely hard to create a completely independent alternative. A good protocol should have multiple implementations with nothing in common; otherwise you end up relying on specific implementation details. The infamous case of WebSQL should remind us of that.
Thanks for chiming in here! I only just saw ArchiveBox hit HN while I was away for the weekend. Just responding to everything now...
(@everyone in this thread, go check out Polar, it's awesome!)
- are there browser extensions that can save a WARC while you browse?
- are there API limitations that require external browser control? something browser extensions can't be used for?
- or is it simply a question of use case, and crawlers are more popular (for archiving) than locally recorded browser history (for search/analytics)?
I have now found https://github.com/machawk1/warcreate; the related discussions in issues #111 and #112 are quite interesting. It looks like there are some serious limitations for browser extensions. I will look deeper into how webrecorder works and how this could be combined.
Web archiving from a browser extension is difficult but can be improved. I don't know of any other approaches to doing this via a browser extension beyond submitting a URI to an archive.
I'm adding links to both from https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...
EDIT: Nevermind, should have RTFM - see CHROME_USER_DATA_DIR in https://github.com/pirate/ArchiveBox/wiki/Configuration.
This looks really, really awesome.
Is there some tool that would allow one to make a copy of a modern SPA? Is that even possible?
EDIT: I'm sad to see launchaco.com go, it would be a perfect tool for a project I'm working on. I don't mind paying, but I gather this is not possible anymore, and anyway, it might take some time for me to have everything ready...
Why doesn't it do that? I thought that was the point of using Chrome as a headless browser: to load all the dynamic elements into a final DOM so they could then be captured & serialized out: "ArchiveBox works by rendering the pages in a headless browser, then saving all the requests and fully loaded pages in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the original content disappears off the internet."
If the interesting content is inside a database at the backend, copying the interface is not enough. You also want the database.
As an aside, there's always:

    wget -r -k -np

(recursive, convert links for local viewing, don't ascend past the starting directory)
For permanent access I defer to archive.is
I believe there is even a docker container for a quick pull and run.
If you wanted to write WARC files out with wget, you'd use the WARC output options (--warc-file and friends) specified in https://www.archiveteam.org/index.php?title=Wget_with_WARC_o...
You can track the progress on implementing WARC saving for all archived resources here: https://github.com/pirate/ArchiveBox/issues/130
Unfortunately, for some sites it seems to be archiving the GDPR warning or bot captcha instead of the content. For example, Tumblr only gets me "Before you continue, an update from us" with an 'accept' button that does nothing; for Fast Company the PDF output is obscured by a half-page popup on each page; Ouest-France gives me an "I am not a Robot" captcha...
Those issues are probably linked to the fact that I do not use Chromium, though it is installed. Once I understand how to archive the masked content, I intend to look at each screenshot to detect problems and re-archive those pages.
1. It used no ad-blocking, so the bot happily downloaded all the trackers and advertisements! Also, trackers know my Pocket bookmarks now...
2. Websites are ugly. In a 1440x2000 screenshot of a National Geographic article, the top 1440x1500 pixels consist of black, with just a grey circular progress indicator in the middle and the word "ADVERTISEMENT" at the top.
3. Websites are wasteful. Many WARC archives clock in at 30 to 50 MB. Even without downloading media, a single article page is 30 MB!
4. While browsing the archive, my ad blocker allows everything, since the initial page is served over the file protocol. It seems ArchiveBox does not rewrite the URLs in its HTML output (as wget's convert-links option would do), so it's trackers and ads galore when opening output.html, and even archive/<timestamp>/index.html, which automatically loads the other files.
There is an open issue on GitHub to add an API. Once that feature is implemented, adding a web page where links can be submitted for archival becomes trivial.
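In the meantime it's easy to hack together a stopgap yourself -- e.g. a tiny endpoint that shells out to the CLI. This sketch assumes an "archivebox add <url>"-style command and an arbitrary port; check the actual entrypoint of whatever version you're running:

    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    class Submit(BaseHTTPRequestHandler):
        def do_GET(self):
            url = parse_qs(urlparse(self.path).query).get("url", [None])[0]
            if url:
                # Hand the URL off to the archiver; adjust to your install's CLI.
                subprocess.Popen(["archivebox", "add", url])
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"queued\n")

    HTTPServer(("", 8899), Submit).serve_forever()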
It's a nice way to update my archive with one tap from my phone.
Would be nice to save entire pages without paying extra.
All of Pocket? It's fairly straightforward to extract your Pocket data through their API (their export functionality leaves much to be desired unfortunately).
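For reference, the retrieve call is a single POST to their v3 endpoint. A quick sketch (you still need to register a consumer key and run Pocket's OAuth flow to get the access token):

    # pip install requests
    import requests

    resp = requests.post(
        "https://getpocket.com/v3/get",
        json={
            "consumer_key": "YOUR_CONSUMER_KEY",   # from your registered Pocket app
            "access_token": "YOUR_ACCESS_TOKEN",   # obtained via Pocket's OAuth flow
            "state": "all",
            "detailType": "complete",
        },
        headers={"X-Accept": "application/json"},
    )
    for item in resp.json()["list"].values():
        print(item.get("resolved_url") or item.get("given_url"))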
Let me know if you have any questions! (I'm @pirate on Github)
The automated install did not detect my existing Python 3.6 on OSX and created quite the mess...
Also, it defaults to using Google Chrome vs. Chromium when both are installed.
I personally don't like automated setup scripts, which is one of the reasons I spent a bunch of time on our Manual Install docs: https://github.com/pirate/ArchiveBox/wiki/Install
Pip, apt, & homebrew packages are the install methods we're moving towards currently. I just caved to user demand in the short term and added a helper script about a year ago as a crutch until we finish releasing the real packages.
Stay tuned for updates on that progress here: https://github.com/pirate/ArchiveBox/issues/120