An Introduction to the WARC File (2021) (archive-it.org)
176 points by dsego on Jan 29, 2024 | 68 comments


Fun fact, modern wget binaries can emit warc natively, if one wanted to kick the tires on such a thing

  $ wget --version
  GNU Wget 1.21.2 built on linux-gnu.

  -cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls
  +ntlm +opie +psl +ssl/openssl
  $ wget --help | grep warc
         --warc-file=FILENAME        save request/response data to a .warc.gz file
         --warc-header=STRING        insert STRING into the warcinfo record
         --warc-max-size=NUMBER      set maximum size of WARC files to NUMBER
         --warc-cdx                  write CDX index files
         --warc-dedup=FILENAME       do not store records listed in this CDX file
         --no-warc-compression       do not compress WARC files with GZIP
         --no-warc-digests           do not calculate SHA1 digests
         --no-warc-keep-log          do not store the log file in a WARC record
         --warc-tempdir=DIRECTORY    location for temporary files created by the
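
If one did want to kick those tires, a minimal invocation might look like this (the URL and output name below are just placeholders):

  $ wget --mirror --page-requisites \
         --warc-file=example-crawl --warc-cdx \
         https://example.com/

This writes example-crawl.warc.gz (plus a CDX index) alongside the normal mirror output; wget adds the .warc.gz extension itself.
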
The ArchiveBox project (which gets reposted on the regular: e.g. https://news.ycombinator.com/item?id=38954189 ) also saves in WARC https://github.com/ArchiveBox/ArchiveBox#output-formats although I've personally not used it, so I can't comment further


Very helpful to know that! Zimit[1] also uses warc files as an intermediate step to producing Zim files. You can use these Zim files to read and search websites offline with the excellent app Kiwix[2]. I think 'Kiwix for Android' and the Kiwix PWA support Zim files made with Zimit, with support in the desktop Kiwix application currently a work in progress.

Other useful information about archiving websites is available from Webrecorder[3].

[1]: https://youzim.it/

[2]: https://kiwix.org/

[3]: https://webrecorder.net/


Are there any reasons why Kiwix doesn't use WARC/WACZ directly?


I think ZIM files contain way more data (like 80 GB for Wikipedia) than a typical WARC; they're optimized for different things.


I've used it for about a year. ArchiveBox works well. My only complaint is that there's no JSON API to trigger archiving a page--it's primarily a CLI tool despite being built with Django. This can be worked around of course, and there's a browser plug-in to auto-archive sites you visit based on a whitelist.
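
For anyone hitting the same thing, the workaround is just shelling out to the ArchiveBox CLI from whatever needs to trigger a capture (a sketch; the URL is a placeholder):

  # run from inside an initialized ArchiveBox data directory
  $ archivebox add 'https://example.com/some/page'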


API is coming soon (relatively, it's still a one-man project)! Stay tuned https://github.com/ArchiveBox/ArchiveBox/issues/496

I have an event-sourcing refactor in progress now to allow us to pluginize functionality like the API (similar to Home Assistant with a plugin app store); it will take a month or two. Next up is the REST API using the new plugin system.


Yeah, thanks for linking to the issue. I'm one of the upvotes on that. However, the age of that issue suggests to me that current contributors are (or have historically been) uninterested in a REST API.

Still, would be great to see such an update, and I'm still using ArchiveBox either way.

(Just read your bio, thanks for creating this)


It's not so much that the team isn't interested in building it, more that there were some volunteers who indicated they had started work on APIs already, so I moved on to work on other areas of the codebase. Those draft PRs have since been closed as stale... so I'm planning to come back to doing the API myself in the next major version cycle.


You can also generate warc natively in iOS!

On a website, tap the share sheet icon, at the top of the tray tap “options”, and finally “web archive”. From there you can save/send/copy the file to your desired location.


I was positively stunned to learn such a thing, so while I don't have iOS, I was able to find some references to that behavior on discussions.apple.com, which led me to the fact that Safari on macOS can do it, too

But Apple gonna apple about using the plist hammer for every file format nail in their world, so no: definitely not the same thing we are discussing here

  $ xxd -l 512 'You can also generate warc natively in iOS! On a website, tap the share sheet ic... | Hacker News.webarchive'
  00000000: 6270 6c69 7374 3030 d201 0203 205f 100f  bplist00.... _..
  00000010: 5765 6253 7562 7265 736f 7572 6365 735f  WebSubresources_
  00000020: 100f 5765 624d 6169 6e52 6573 6f75 7263  ..WebMainResourc
  00000030: 65a5 040d 1216 1bd4 0506 0708 090a 0b0c  e...............
  00000040: 5f10 0f57 6562 5265 736f 7572 6365 4461  _..WebResourceDa
  00000050: 7461 5f10 1357 6562 5265 736f 7572 6365  ta_..WebResource
  00000060: 4d49 4d45 5479 7065 5f10 1357 6562 5265  MIMEType_..WebRe
  00000070: 736f 7572 6365 5265 7370 6f6e 7365 5e57  sourceResponse^W
  00000080: 6562 5265 736f 7572 6365 5552 4c4f 111c  ebResourceURLO..
  00000090: dd62 6f64 7920 207b 2066 6f6e 742d 6661  .body  { font-fa
  000000a0: 6d69 6c79 3a56 6572 6461 6e61 2c20 4765  mily:Verdana, Ge
  000000b0: 6e65 7661 2c20 7361 6e73 2d73 6572 6966  neva, sans-serif

  $ plutil -p - < 'You can also generate warc natively in iOS! On a website, tap the share sheet ic... | Hacker News.webarchive'
  {
    "WebMainResource" => {
      "WebResourceData" => {length = 5725, bytes = 0x3c68746d 6c206c61 6e673d22 656e2220 ... 3e3c2f68 746d6c3e }
      "WebResourceFrameName" => ""
      "WebResourceMIMEType" => "text/html"
      "WebResourceTextEncodingName" => "UTF-8"
      "WebResourceURL" => "https://news.ycombinator.com/item?id=39185542"
    }
    "WebSubresources" => [
      0 => {
        "WebResourceData" => {length = 7389, bytes = 0x626f6479 20207b20 666f6e74 2d66616d ... 64656e20 7d0a7d0a }
        "WebResourceMIMEType" => "text/css"
        "WebResourceResponse" => {length = 1880, bytes = 0x62706c69 73743030 d4010203 04050607 ... 00000000 0000065c }
        "WebResourceURL" => "https://news.ycombinator.com/news.css?15bzD7gVH5AOktnbu3y0"
      }
      1 => {
        "WebResourceData" => {length = 131, bytes = 0x3c737667 20686569 6768743d 22333222 ... 2f3e3c2f 7376673e }
        "WebResourceMIMEType" => "image/svg+xml"
        "WebResourceResponse" => {length = 1842, bytes = 0x62706c69 73743030 d4010203 04050607 ... 00000000 00000636 }
        "WebResourceURL" => "https://news.ycombinator.com/triangle.svg"
      }
      2 => {
        "WebResourceData" => {length = 315, bytes = 0x3c737667 20686569 6768743d 22313822 ... 2f3e3c2f 7376673e }
        "WebResourceMIMEType" => "image/svg+xml"
        "WebResourceResponse" => {length = 1842, bytes = 0x62706c69 73743030 d4010203 04050607 ... 00000000 00000636 }
        "WebResourceURL" => "https://news.ycombinator.com/y18.svg"
      }
      3 => {
        "WebResourceData" => {length = 43, bytes = 0x47494638 39610100 010080ff 00c0c0c0 ... 00000202 4401003b }
        "WebResourceMIMEType" => "image/gif"
        "WebResourceResponse" => {length = 1826, bytes = 0x62706c69 73743030 d4010203 04050607 ... 00000000 00000626 }
        "WebResourceURL" => "https://news.ycombinator.com/s.gif"
      }
      4 => {
        "WebResourceData" => {length = 5224, bytes = 0x66756e63 74696f6e 20242028 69642920 ... 636c6963 6b293b0a }
        "WebResourceMIMEType" => "application/javascript"
        "WebResourceResponse" => {length = 1945, bytes = 0x62706c69 73743030 d4010203 04050607 ... 00000000 0000069b }
        "WebResourceURL" => "https://news.ycombinator.com/hn.js?15bzD7gVH5AOktnbu3y0"
      }
    ]
  }


Wow I didn't know this, thank you!


Thanks, that's much better than iOS PDFs which don't preserve links.


If interested in WARC, recommend also checking out WACZ: https://specs.webrecorder.net/wacz/1.1.1/


What's the point of WACZ? It appears to wrap a number of WARC files into a single zip, enabling Range requests to specific WARC files so it can be served by a passive file server. But why is that needed?


It's huge for being able to replay big WARC files in a browser without having to download the whole thing. (e.g. try loading a 700 MB WARC from IPFS to visit one page within it; it's too slow to work as-is)

It's used extensively by the Browsertrix/Webrecorder.io projects (whose team pioneered the WACZ format) and a few other projects.


Oh, I may have missed that part. So the WACZ (indexes?) can contain offsets into the WARC file itself for each individual page?


WACZ is a replacement for WARC that has the index with offsets built in.
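
Roughly, a WACZ is just a ZIP with a conventional layout, something like this (sketched from the spec, with illustrative file names):

  example.wacz
  ├── archive/
  │   └── data.warc.gz      <- the original WARC(s)
  ├── indexes/
  │   └── index.cdx.gz      <- URL -> (offset, length) lookup into the WARCs
  ├── pages/
  │   └── pages.jsonl       <- list of pages to surface in a replay UI
  └── datapackage.json      <- metadata and digests

A replay tool can fetch datapackage.json and the index with small range requests, then pull only the byte ranges of the records it actually needs.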


But it uses warc files inside as the archive format. It seems weird to call it a replacement when the original is still present.


I just meant from a user's perspective it's a format that supersedes WARC. But internally, yes, one is an encapsulation format for the other.


I remember looking into this file format when starting on an unfinished and mostly dead project. My goal was to build a local, searchable database of all web traffic I visited. This really should be what is generated when clicking "Save page as".


Before you build, there's tons of tools that do this already!

https://wiki.archivebox.io/Web-Archiving-Community#other-arc...


Anyone archiving every website they visit? I often want to search for something I read recently, which is awkward to do if you don't remember where you read it.

I'd like to store everything I visit with my browser once the page is loaded, but haven't found any nice solutions for this.


A bit of a late response, but yes I've been storing full text of every website I visit and it's excellent for finding stuff again.

The idea is to index pages as you visit them using a browser extension, thus avoiding all the pitfalls of being treated like a bot.

Here's the project: https://github.com/iansinnott/full-text-tabs-forever


YaCY (self-hosted search engine + index) can do this when using it as a proxy.

https://yacy.net/


Interesting! How well does this proxy deal with bot detection (Cloudflare and such)? Most proxies I've used, usually without archiving capabilities, have me fill out CAPTCHAs for every other site.


On MacOS HistoryHound gives you a searchable index of webpage content, though it doesn't store the full contents of pages in a way that can be viewed offline.

https://www.stclairsoft.com/HistoryHound/


Created Linkwarden exactly for this:

https://linkwarden.app


Nice! This is very close to what I'm hoping to find. My main issue is that it seems the pages are fetched from the Linkwarden server, instead of preserved by my browser extension. For example, if I use the browser extension to save an HN page, the saved version is an unauthenticated view of the page instead of showing my username.

If the pages were preserved directly from my browser, no additional requests to the server would be required, and it would be realistic to archive everything automatically, instead of only when I press the button.


I'm using the Forethink web extension for that. It keeps a local index of every page I visit for searching

https://forethink.ai


was that not literally a built-in feature of Opera 12?


This is probably in the news since Fedora 40 announced it will replace wget with wget2, removing WARC support. I also looked yesterday at what I will be missing, and decided I will not miss it. FTP, on the other hand, was very handy and actually used in a lot of places

https://discussion.fedoraproject.org/t/f40-change-proposal-w...


Interestingly wget2 was made by the same guy that maintains the original wget. I'm curious about the motivation.

From the announcement:

> The major benefit of switching to wget2 is leveraging the cleaner codebase that leverages modern practices for development and maintainability, including unit tests and fuzzing as a security-sensitive component. Users will also see better support for newer protocols over time as they are more easily and quickly plumbed into wget2 than wget.

wget2 appears to be written in C.

Notably WARC support is in the todo.txt


Great article on the WARC file format. I really appreciate the efforts of the Archive-It team in developing and maintaining such a crucial tool for web archiving.

I'm trying to understand why the WARC format was made the way it is. Can someone explain the main reasons behind its design? Specifically, I'm curious about:

- Why not use a regular folder and file system with separate metadata files? It seems simpler and more user-friendly.

- Why not use SQLite?


It's a pretty simple design, and it's based on the ARC format (https://archive.org/web/researcher/ArcFileFormat.php) which is even simpler. In response to your questions, here's my take (as somebody who used to work on web archiving).

1. Two reasons: First, many files are harder to manage. WARC files might contain hundreds or thousands of files. It's easier to manage big groups of files that are roughly the same size. Both for humans, and, at least in the past, for the file systems themselves. Second, once you break them up into files, what do you name the files? If you give them a name unrelated to the URL that was fetched, what is the advantage? If you name them based on the URL, suddenly you have a problem of mapping a URL to a legal file name, which can vary based on the file system. This would be a huge headache.

2. Yes, it predates SQLite, but also, why would you use sqlite? That's adding a huge amount of complexity. Is SQLite even good at storing big binary blobs?

Additionally, WARC files are gzipped in a clever way: each record is compressed individually, which allows random access into the file for reading enclosed content without needing to decompress the entire WARC file.


> Is SQLite even good at storing big binary blobs

Nope! SQLite is good for lots of small-ish blobs (kilobytes), but once you start getting into the megabyte range, less so. There's also currently a hard upper limit on blob size of 2 GiB.


To add to this, WARC.gz files are concatenated gzip members, so you can read any record by starting decompression at a known offset. This gives you the access time of individual files with the efficiency of many, many records taking up only one file.
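
You can even see this from the command line: given a record's byte offset (in practice it comes from a CDX index; the file name and offset below are made up), decompression can start right at that gzip member:

  # hypothetical offset taken from a CDX line
  $ dd if=crawl.warc.gz bs=1 skip=3015 2>/dev/null | zcat 2>/dev/null | head -12

zcat decompresses from that member onward and head cuts the output off after the record header, so only a small part of the file is ever read.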


WACZ also extends this functionality to allow streaming archives off a server without having to request the whole file to get one page. https://replayweb.page/docs/wacz-format


Thanks for the insights, egh! It's clear now why SQLite wouldn't be ideal for this purpose. Also, the point about URLs not always being valid filenames really makes sense.


> - Why not use a regular folder and file system with separate metadata files? It seems simpler and more user-friendly.

Filesystems don't generally deal well with having billions of files. You can work around some issues with deep directory structures, but even then it's not very effective, and in many cases you'll run out of inodes before you even reach a single billion. This is not a use case filesystems in general tend to optimize for.

> - Why not use SQLite?

The main reason WARC looks like it does is to be as recoverable as possible if the crawler hard-crashes. There are far fewer things that can go wrong compared to other formats. Records are only stored in exactly one location. Data corruption, write faults and other error states are all easy to detect and reason about. You have to work really hard to mess up a WARC file beyond being able to salvage most of it.


> The main reason WARC looks like it does is to be as recoverable as possible if the crawler hard-crashes.

I disagree on the recovery part. SQLite, being ACID compliant, arguably offers better recovery than the WARC format.


WARC is also resistant to hardware errors though, and has this property without requiring constant fsyncing.

A CSV file is arguably harder to recover by comparison.


> WARC is also resistant to hardware errors though, and has this property without requiring constant fsyncing.

I just downloaded a sample WARC file to check for any checksums to detect bit rot, but I couldn't find one. Can you share any resources I can explore to understand how WARC is more resistant to hardware errors compared to other file formats?


It supports hashes of both the record and record+payload, content length for both, and on top of that if a record is bad, you can easily find the next one from the format itself.
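
Concretely, every record starts with a small plain-text header block carrying those fields; a response record looks roughly like this (all values below are placeholders, not from a real capture):

  WARC/1.1
  WARC-Type: response
  WARC-Target-URI: https://example.com/
  WARC-Date: 2024-01-29T12:00:00Z
  WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
  WARC-Block-Digest: sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  WARC-Payload-Digest: sha1:BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
  Content-Type: application/http;msgtype=response
  Content-Length: 1234

  HTTP/1.1 200 OK
  ...

Content-Length tells a reader exactly where the record's block ends, and the next record begins with another "WARC/1.1" line, which is what makes skipping past a damaged record feasible.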


As to your second question: WARC is based on an older format that was started in the 1990s, before SQLite existed.


Thanks for that info! Didn't realize WARC predated SQLite. Makes me wonder, are there any modern updates that could enhance WARC, or is it the only format for archiving?


The main update to WARC now is WACZ.

https://specs.webrecorder.net/wacz/1.1.1/


Yeah, the format really needs an update. For starters, WARC only officially supports HTTP/1.1. Webrecorder has started faking HTTP/1.1 data in WARC files in order to save other versions, but I don't think faking data is great for an archival format, especially if it isn't standardized.


> Why not use a regular folder and file system with separate metadata files? It seems simpler and more user-friendly.

First of all, no matter how easy it is to extract the files, the results will be kinda user unfriendly - unless someone goes through the archived html files updating all the URLs for images and javascript and so on, it'll probably end up looking pretty broken. For users who just want to view a few files, Wayback Machine is a better tool for the job.

With that said, some situations where warc is helpful include:

* If a website had a page at www.example.com/foo and a file at www.example.com/foo/bar.txt you can't express that in a regular filesystem, as you can't have a file and a directory with the same name.

* If a website uses some absurd 2000 character URL like https://s3.eu-west-2.amazonaws.com/document-api-images-live.... you don't end up with a filesystem-breaking filename.

* If www.example.com embeds an image from exampleusercontent.com you can capture the image in the same archive file.

* If for some reason you want to store daily copies of www.cnn.com in the same archive for comparison purposes - you can.

* It lets you store headers, 3xx redirect responses, case-sensitive filenames, and all that sort of stuff.

* And it's an extremely simple format - basically human readable. So if you think you're archiving for the super-long-term and want to make really conservative choices, you can be pretty confident in plain text.


Interesting points about the WARC format. I'm just wondering how it handles tricky stuff like those cache-busting query parameters and dynamic content generated by scripts. These seem like important details for preserving web pages accurately.


It's basically like a dumb flight recorder. It saves the http request and the response, including all headers, cookies, and the full query string, with additional metadata on top.



I wrote a library to handle these and the older ARC files to use with an archiving proxy that a friend and I built for the Internet Archive. https://github.com/internetarchive/warc. He wrote the WARC parts while I did the ARC one using this https://archive.org/web/researcher/ArcFileFormat.php

Good memories.


For personal archiving, how does WARC compare to tools that can save pages as single HTML that can be directly opened in browser, like SingleFile, monolith?

[0] https://chromewebstore.google.com/detail/singlefile/mpiodijh...

[1] https://github.com/Y2Z/monolith


I have never used monolith, so I can't say anything with certainty, but two things in your description are worth highlighting about the goals of WARC versus the umpteen bazillion "save this one page I'm looking at as a single file" type projects:

1. WARC is designed, as a goal, to archive the request-response handshake. It does not get into the business of trying to make it easy for a browser to subsequently display that content, since that's a browser's problem

2. Using your cited project specifically, observe the number of "well, save it but ..." options <https://github.com/Y2Z/monolith#options> which is in stark contrast to the archiving goals I just spoke about. It's not a good snapshot of history if the server responded with `content-type: text/html;charset=iso-8859-1` back in the 90s but "modern tools" want everything to be UTF-8 so we'll just convert it, shall we? Bah, I don't like JavaScript, so we'll just toss that out, shall we? And so on

For 100% clarity: monolith, and similar, may work fantastic for any individual's workflow, and I'm not here to yuck anyone's yum; but I do want to highlight that, all things being equal, it should always be possible to derive monolith files from warc files because warc files are (or at least have the goal of being) a perfect-fidelity record of what the exchange was. I would guess only pcap files would be of higher fidelity, but they also contain a lot more extraneous or potentially privacy-violating detail


What privacy violating details does WARC retain?


Everything transmitted over HTTP. That includes cookies, passwords, etc. You need to be careful when writing WARCs.


They said that a pcap could contain unwanted private info.

I don't know what pcap is though, tcpdump?


they said pcap could contain "a lot more", so I was wondering what "lesser" private info WARC has


Anything that could ever appear in an http transaction. Which is to say, anything. This is like asking what possible sensitive information a piece of paper might have.

If you just mean unseen info that wasn't obviously displayed right on the screen, where the original user would know not to be careless with the archive of that page, the answer is still: anything.


Depends on what you’re archiving. Are you archiving a site with poorly protected authentication mechanisms? Something with personal data? In any case, it won’t have a bunch of internal network traffic that could be sensitive regardless of what you’re archiving.


And here I was printing to PDF. How do these tools deal with content rendered by JavaScript, lazy loading, etc.?


Singlefile uses the browser it's running in to execute all the JS before saving. The Singlefile CLI runs a headless browser to do the same.

Most tools that can handle JS use puppeteer, playwright, or chrome headless directly.

ArchiveBox uses Singlefile and Chrome headless directly for screenshot, PDF, and HTML saving, though we may switch to playwright soon.

The best currently for high-fidelity with JS support is ArchiveWeb.page/Browsertrix though, they use puppeteer under the hood with a lot of magic to even make embedded YouTube videos work in the native player on replay.


Can you view this format on a phone?


It doesn't appear to care about viewing, only preserving.

Viewing would be a separate job and probably more than one right answer about how it should be done. Ie, just replay the transactions verbatim or try to translate or substitute some parts to adapt to the current context, etc.

Well "don't care" is wrong. The entire format is designed to be readable and decipherable even if it's 200 years in the future and you have no idea what the file is and are starting from scratch looking at it. They obviously care a great deal.

In that sense, yes you can look at it on a phone, since you can look at it in any text editor. And that is not some pedantic useless statement but an explicit design goal.

But interpreting it in any fancier way, like replaying the transactions and rendering the original browser view, but on some other browser with a different physical screen size and different browser features and limitations and plugins and scripting languages enabled/disabled/versions and security settings etc etc etc... That is really an entirely different job with no single right way to do it, so there will need to be many different viewer apps that all make different choices to attain different results from the same raw source data.


>In that sense, yes you can look at it on a phone, since you can look at it in any text editor. And that is not some pedantic useless statement but an explicit design goal.

But it is: I can't look at HTML layouts in plain text, let alone (minified) JS; those pieces are only useful when they create the actual human-readable page

> The entire format is designed to be readable and decipherable even if it's 200 years in the future and you have no idea what the file is and are starting from scratch looking at it.

Well, if the first 15 years is any indication, there might not be anyone caring enough in 200 years


Yes, you can replay in a browser without any software even! See: https://ReplayWeb.page

(it uses service workers under the hood to mimic a server and replay the original archived responses)


is there a browser add-on that automatically archives every single page I visit? The solutions suggested here seem to be focused on indexing. I want complete saving of all resources so that I can rewind and inspect my entire browsing experience.


The ArchiveBox extension can save all browser history (among others [1]), but I don't recommend it. Most people who try it eventually switch to only archiving some domains or some pages.

It ends up being too much data to be useful, although that might change with better AI-based search and summarization tooling. Most people also don't want to manage the terabytes of storage it would take to do such a thing.

[1] https://wiki.archivebox.io/Web-Archiving-Community#other-arc...



