Fun fact, modern wget binaries can emit warc natively, if one wanted to kick the tires on such a thing
$ wget --version
GNU Wget 1.21.2 built on linux-gnu.
-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls
+ntlm +opie +psl +ssl/openssl
$ wget --help | grep warc
--warc-file=FILENAME save request/response data to a .warc.gz file
--warc-header=STRING insert STRING into the warcinfo record
--warc-max-size=NUMBER set maximum size of WARC files to NUMBER
--warc-cdx write CDX index files
--warc-dedup=FILENAME do not store records listed in this CDX file
--no-warc-compression do not compress WARC files with GZIP
--no-warc-digests do not calculate SHA1 digests
--no-warc-keep-log do not store the log file in a WARC record
--warc-tempdir=DIRECTORY location for temporary files created by the WARC writer
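For anyone who wants to kick those tires, a minimal invocation would be something like the following (the filename prefix is just an example; the flags are the ones listed above plus wget's usual mirroring options):

    $ wget --mirror --page-requisites --warc-file=example --warc-cdx https://example.com/

That should leave an example.warc.gz plus a CDX index next to the normal mirror directory.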
Very helpful to know that! Zimit[1] also uses WARC files as an intermediate step to producing Zim files. You can use these Zim files to read and search websites offline with the excellent app Kiwix[2]. I think 'Kiwix for Android' and the Kiwix PWA support Zim files made with Zimit, and support in the desktop Kiwix application is currently a work in progress.
Other useful information about archiving websites is available from Webrecorder[3].
I've used it for about a year. ArchiveBox works well. My only complaint is that there's no JSON API to trigger archiving a page--it's primarily a CLI tool despite being built with Django. This can be worked around of course, and there's a browser plug-in to auto-archive sites you visit based on a whitelist.
I have an event-sourcing refactor in progress now to allow us to pluginize functionality like the API (similar to Home Assistant with a plugin app store); it will take a month or two. Next up is the REST API using the new plugin system.
Yeah, thanks for linking to the issue. I'm one of the upvotes on that. However, the age of that issue suggests to me that current contributors are (or have historically been) uninterested in a REST API.
Still, would be great to see such an update, and I'm still using ArchiveBox either way.
It's not so much that the team isn't interested in building it, more that there were some volunteers who indicated they had already started work on APIs, so I moved on to work on other areas of the codebase. Those draft PRs have since been closed as stale... so I'm planning to come back to doing the API myself in the next major version cycle.
On a website, tap the share sheet icon, at the top of the tray tap “options”, and finally “web archive”. From there you can save/send/copy the file to your desired location.
I was positively stunned to learn such a thing, so while I don't have iOS, I was able to find some references to that behavior on discussions.apple.com, which led me to the fact that Safari on macOS can do it, too.
But Apple gonna apple about using the plist hammer for every file format nail in their world, so no: definitely not the same thing we are discussing here
What's the point of WACZ? It appears to wrap a number of WARC files into a single zip, enabling Range requests to specific WARC files so it can be served by a passive file server. But why is that needed?
It's huge for being able to replay big WARC files in a browser without having to download the whole thing. (e.g. try loading a 700 MB WARC from IPFS to visit one page within it; it's too slow to work as-is)
It's used extensively by the Browsertrix/Webrecorder.io projects (whose team pioneered the WACZ format) and a few other projects.
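To make the "without downloading the whole thing" part concrete, the underlying trick is just an HTTP Range request against a static file. A minimal Python sketch, with a made-up URL, offset, and length (in a real WACZ those come from the bundled index):

    import urllib.request

    # Hypothetical archive URL plus a record offset/length taken from an index.
    url = "https://example.org/archive/crawl.warc.gz"
    offset, length = 1_048_576, 32_768

    # Ask the static file server for only the bytes of one gzipped record.
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{offset + length - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        record_gz = resp.read()  # a single gzip member, decompressable on its own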
I remember looking into this file format when starting on an unfinished and mostly dead project. My goal was to build a local, searchable database of all the web traffic from the sites I visited. This really should be what is generated when clicking "Save page as".
Anyone archiving every website they visit? I often want to search for something I read recently, which is awkward to do if you don't remember where you read it.
I'd like to store everything I visit with my browser once the page is loaded, but haven't found any nice solutions for this.
Interesting! How well does this proxy deal with bot detection (Cloudflare and such)? Most proxies I've used, usually without archiving capabilities, have me fill out CAPTCHAs for every other site.
On macOS, HistoryHound gives you a searchable index of webpage content, though it doesn't store the full contents of pages in a way that can be viewed offline.
Nice! This is very close to what I'm hoping to find. My main issue is that it seems the pages are fetched from the linkwarden server, instead of preserved by my browser extension. For example, if I use the browser extension to save an HN page, the saved version is an unauthenticated view of the page instead of showing my username.
If the pages were preserved directly from my browser, no additional requests to the server would be required, and it would be realistic to archive everything automatically instead of only when I press the button.
This is probably in the news since Fedora 40 announced it will replace wget with wget2, which removes WARC support. I also looked yesterday at what I will be missing, and decided I will not miss it. FTP, on the other hand, was very handy and actually used in a lot of places.
Interestingly, wget2 was made by the same person who maintains the original wget. I'm curious about the motivation.
From the announcement:
> The major benefit of switching to wget2 is leveraging the cleaner codebase that leverages modern practices for development and maintainability, including unit tests and fuzzing as a security-sensitive component. Users will also see better support for newer protocols over time as they are more easily and quickly plumbed into wget2 than wget.
Great article on the WARC file format. I really appreciate the efforts of the ArchiveIt team in developing and maintaining such a crucial tool for web archiving.
I'm trying to understand why the WARC format was made the way it is. Can someone explain the main reasons behind its design? Specifically, I'm curious about:
- Why not use a regular folder and file system with separate metadata files? It seems simpler and more user-friendly.
It's a pretty simple design, and it's based on the ARC format (https://archive.org/web/researcher/ArcFileFormat.php) which is even simpler. In response to your questions, here's my take (as somebody who used to work on web archiving).
1. Two reasons: First, many files are harder to manage. WARC files might contain hundreds or thousands of files. It's easier to manage big groups of files that are roughly the same size. Both for humans, and, at least in the past, for the file systems themselves. Second, once you break them up into files, what do you name the files? If you give them a name unrelated to the URL that was fetched, what is the advantage? If you name them based on the URL, suddenly you have a problem of mapping a URL to a legal file name, which can vary based on the file system. This would be a huge headache.
2. Yes, it predates SQLite, but also, why would you use sqlite? That's adding a huge amount of complexity. Is SQLite even good at storing big binary blobs?
Additionally, WARC files are gzipped in a clever way: each record is compressed individually, which allows random access into the file, so you can read enclosed content without decompressing the entire WARC file.
Nope! SQLite is good for lots of small-ish blobs (kilobytes), but once you start getting into the megabyte range, less so. There's also currently a hard upper limit of 2 GiB on blob size.
To add to this, WARC.gz files are also concatenated gzip records, so you can read any record by starting a decompression at a known offset. This gives you the access time of a file with the efficiency of having many many records only taking up one file.
WACZ also extends this functionality to allow streaming archives off a server without having to request the whole file to get one page. https://replayweb.page/docs/wacz-format
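A quick sketch of that per-record gzip trick in Python (file name and offset are hypothetical; real offsets would come from a CDX index): decompressing one gzip member stops cleanly at the record boundary, so the rest of the file is never touched.

    import zlib

    def read_one_record(path: str, offset: int) -> bytes:
        # Decompress a single gzip member starting at a known byte offset.
        with open(path, "rb") as f:
            f.seek(offset)
            # 16 + MAX_WBITS tells zlib to expect a gzip wrapper.
            decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            out = bytearray()
            while not decomp.eof:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                out.extend(decomp.decompress(chunk))
            return bytes(out)  # one full WARC record, headers and all

    # e.g. read_one_record("example.warc.gz", 0) returns just the first record
    # (typically the warcinfo record) of a multi-gigabyte archive.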
Thanks for the insights, egh! It's clear now why SQLite wouldn't be ideal for this purpose. Also, the point about URLs not always being valid filenames really makes sense.
> - Why not use a regular folder and file system with separate metadata files? It seems simpler and more user-friendly.
Filesystems generally don't deal well with having billions of files. You can work around some issues with deep directory structures, but even then it's not very effective, and in many cases you'll run out of inodes before you even reach a single billion. This is not a use case filesystems in general tend to optimize for.
> - Why not use SQLite?
The main reason WARC looks like it does is to be as recoverable as possible if the crawler hard-crashes. There are far fewer things that can go wrong compared to other formats. Records are only stored in exactly one location. Data corruption, write faults, and other error states are all easy to detect and reason about. You have to work really hard to mess up a WARC file beyond being able to salvage most of it.
> WARC is also resistant to hardware errors though, and has this property without requiring constant fsyncing.
I just downloaded a sample WARC file to check for any checksums to detect bit rot, but I couldn't find one. Can you share any resources I can explore to understand how WARC is more resistant to hardware errors compared to other file formats?
It supports hashes of both the whole record block and just the payload, content lengths for both, and on top of that, if a record is bad, you can easily find the next one from the format itself.
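For a concrete view of that metadata, a sketch using the warcio library (not mentioned above, just a common way to read WARCs in Python):

    from warcio.archiveiterator import ArchiveIterator

    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            h = record.rec_headers
            print(
                h.get_header("WARC-Target-URI"),
                h.get_header("Content-Length"),       # length of the record block
                h.get_header("WARC-Block-Digest"),    # hash over the whole block
                h.get_header("WARC-Payload-Digest"),  # hash over just the HTTP payload
            )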
Thanks for that info! Didn't realize WARC predated SQLite. Makes me wonder, are there any modern updates that could enhance WARC, or is it the only format for archiving?
Yeah, the format really needs an update. For starters, WARC only officially supports HTTP/1.1. Webrecorder has started faking HTTP/1.1 data in WARC files in order to save other versions, but I don't think faking data is great for an archival format, especially if it isn't standardized.
> Why not use a regular folder and file system with separate metadata files? It seems simpler and more user-friendly.
First of all, no matter how easy it is to extract the files, the results will be kinda user unfriendly - unless someone goes through the archived html files updating all the URLs for images and javascript and so on, it'll probably end up looking pretty broken. For users who just want to view a few files, Wayback Machine is a better tool for the job.
With that said, some situations where warc is helpful include:
* If a website had a page at www.example.com/foo and a file at www.example.com/foo/bar.txt you can't express that in a regular filesystem, as you can't have a file and a directory with the same name.
* If www.example.com embeds an image from exampleusercontent.com you can capture the image in the same archive file.
* If for some reason you want to store daily copies of www.cnn.com in the same archive for comparison purposes - you can.
* It lets you store headers, 3xx redirect responses, case-sensitive filenames, and all that sort of stuff.
* And it's an extremely simple format - basically human readable. So if you think you're archiving for the super-long-term and want to make really conservative choices, you can be pretty confident in plain text.
Interesting points about the WARC format. I'm just wondering how it handles tricky stuff like those cache-busting query parameters and dynamic content generated by scripts. These seem like important details for preserving web pages accurately.
It's basically like a dumb flight recorder. It saves the http request and the response, including all headers, cookies, and the full query string, with additional metadata on top.
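For a feel of what that looks like on disk, here's a hand-written (not captured) response record; every value is a placeholder:

    WARC/1.1
    WARC-Type: response
    WARC-Target-URI: https://example.com/
    WARC-Date: 2024-05-01T12:00:00Z
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    WARC-Payload-Digest: sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    Content-Type: application/http; msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8
    Set-Cookie: session=placeholder

    <html>...the body bytes exactly as served...</html>

A matching request record, with the outgoing headers, is typically stored right alongside it.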
For personal archiving, how does WARC compare to tools that can save pages as single HTML that can be directly opened in browser, like SingleFile, monolith?
I have never used monolith, so I can't say anything with certainty, but two things are worth highlighting about the difference in goals between WARC and the umpteen bazillion "save this one page I'm looking at as a single file" type projects:
1. WARC is designed, as a goal, to archive the request-response handshake. It does not get into the business of trying to make it easy for a browser to subsequently display that content, since that's a browser's problem
2. Using your cited project specifically, observe the number of "well, save it but ..." options <https://github.com/Y2Z/monolith#options> which is in stark contrast to the archiving goals I just spoke about. It's not a good snapshot of history if the server responded with `content-type: text/html;charset=iso-8859-1` back in the 90s but "modern tools" want everything to be UTF-8 so we'll just convert it, shall we? Bah, I don't like JavaScript, so we'll just toss that out, shall we? And so on
For 100% clarity: monolith, and similar, may work fantastic for any individual's workflow, and I'm not here to yuck anyone's yum; but I do want to highlight that all things being equal it should always be possible to derive monolith files from warc files because the warc files are (or at least have the goal of) perfect fidelity of what the exchange was. I would guess only pcap files would be of higher fidelity, but also a lot more extraneous or potentially privacy violating details
Anything that could ever appear in an http transaction. Which is to say, anything. This is like asking what possible sensitive information a piece of paper might have.
If you just mean unseen info that wasn't obviously displayed right on the screen (where the original user would know not to be careless with the archive of that page), the answer is still anything.
Depends on what you’re archiving. Are you archiving a site with poorly protected authentication mechanisms? Something with personal data? In any case, it won’t have a bunch of internal network traffic that could be sensitive regardless of what you’re archiving.
Singlefile uses the browser it's running in to execute all the JS before saving. The Singlefile CLI runs a headless browser to do the same.
Most tools that can handle JS use puppeteer, playwright, or chrome headless directly.
ArchiveBox uses Singlefile and Chrome headless directly for screenshot, PDF, and HTML saving, though we may switch to playwright soon.
The best currently for high-fidelity with JS support is ArchiveWeb.page/Browsertrix though, they use puppeteer under the hood with a lot of magic to even make embedded YouTube videos work in the native player on replay.
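As a tiny illustration of the headless-browser approach (a generic Playwright sketch, not ArchiveBox's or Browsertrix's actual code; the URL and output paths are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Let scripts run and the network settle before capturing anything.
        page.goto("https://example.com/", wait_until="networkidle")
        html = page.content()             # the DOM after JS has executed
        page.screenshot(path="page.png")  # pixel snapshot of the rendered page
        browser.close()

    with open("page.html", "w", encoding="utf-8") as f:
        f.write(html)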
It doesn't appear to care about viewing, only preserving.
Viewing would be a separate job, and there's probably more than one right answer for how it should be done. I.e., just replay the transactions verbatim, or try to translate or substitute some parts to adapt to the current context, etc.
Well "don't care" is wrong. The entire format is designed to be readable and decipherable even if it's 200 years in the future and you have no idea what the file is and are starting from scratch looking at it. They obviously care a great deal.
In that sense, yes you can look at it on a phone, since you can look at it in any text editor. And that is not some pedantic useless statement but an explicit design goal.
But interpreting it in any fancier way, like replaying the transactions and rendering the original browser view on some other browser with a different physical screen size, different browser features and limitations, different plugins and scripting languages and versions enabled or disabled, different security settings, etc... that is really an entirely different job with no single right way to do it, so there will need to be many different viewer apps that all make different choices to attain different results from the same raw source data.
>In that sense, yes you can look at it on a phone, since you can look at it in any text editor. And that is not some pedantic useless statement but an explicit design goal.
But it is: I can't look at HTML layouts in plain text, let alone (minified) JS; those pieces are only useful when they create the actual human-readable page.
> The entire format is designed to be readable and decipherable even if it's 200 years in the future and you have no idea what the file is and are starting from scratch looking at it.
Well, if the first 15 years is any indication, there might not be anyone caring enough in 200 years
Is there a browser add-on that automatically archives every single page I visit? The solutions suggested here seem to be focused on indexing. I want complete saving of all resources so that I can rewind and inspect my entire browsing experience.
The ArchiveBox extension can save all browser history (among others [1]), but I don't recommend it. Most people who try it eventually switch to only archiving some domains or some pages.
It ends up being too much data to be useful, although that might change with better AI-based search and summarization tooling. Most people also don't want to manage the terabytes of storage it would take to do such a thing.