Wow, it's really rare these days to see a tool that supports WARC.
Despite being an ISO standard [1] and the default archive format of the Internet Archive, and despite a handful of lovingly crafted tools (such as webrecorder [2], warcprox [3], etc.), it never seems to have caught on in a broader context.
Really a shame - I'm deeply convinced that the ability to archive and replay requests is a technique for defending and strengthening user rights.
I have taken a shine to ArchiveBox [0] - a self-hosted application that stores dumps in a variety of formats (PDF, HTML, extracted text, WARC, and it can optionally ping the Internet Archive to take a snapshot) and keeps a SQLite metadata index. I have taken to using it for all blog articles I read - given the internet's amnesia, archiving is the only way I can rely on retaining the content.
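For what it's worth, this is easy to script; a minimal sketch (assuming `archivebox` is on your PATH, the current directory is a collection you've already created with `archivebox init`, and the URLs are placeholders):

    # Batch-import a reading list into an existing ArchiveBox collection
    # via its CLI. Each `archivebox add <url>` fetches the page, stores
    # the configured outputs (HTML, PDF, WARC, ...) and adds an entry to
    # the SQLite index.
    import subprocess

    urls = [
        "https://example.com/some-blog-post",
        "https://example.org/another-article",
    ]

    for url in urls:
        subprocess.run(["archivebox", "add", url], check=True)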
Oh, I see they recommend writing WARC files with wget using `--no-warc-digests`. You really should not do that - one, it's just a SHA-1, and neither costly in terms of CPU nor storage. Two, the digest is used to create revisit records for de-duplication; if you disable it, you or someone else might end up with lots of duplicate resources when re-crawling.
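To make that concrete, here's a rough sketch (using the warcio library; 'crawl.warc.gz' is a placeholder) of how a de-duplicating crawler leans on the payload digest - without the digest header there is nothing to compare, so every re-crawl stores the payload again:

    # Why payload digests matter for de-duplication: on a re-crawl, a
    # response whose WARC-Payload-Digest has already been seen can be
    # stored as a tiny "revisit" record pointing at the original capture
    # instead of a second full copy.
    from warcio.archiveiterator import ArchiveIterator

    seen = {}  # payload digest -> (original URI, original capture date)

    def is_duplicate(record):
        digest = record.rec_headers.get_header('WARC-Payload-Digest')
        if digest is None:
            # Written with --no-warc-digests: nothing to compare against.
            return False
        if digest in seen:
            return True
        seen[digest] = (record.rec_headers.get_header('WARC-Target-URI'),
                        record.rec_headers.get_header('WARC-Date'))
        return False

    with open('crawl.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                print(uri, 'duplicate' if is_duplicate(record) else 'new')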
Absolutely. You can save a HAR file from devtools at least.
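And turning a HAR into a proper WARC afterwards is not much work; a minimal sketch using the warcio library (file names are placeholders, and a real converter needs more care - request records, bodies that were not captured, content-encoding mismatches, etc.; I believe webrecorder also publishes a har2warc tool for exactly this):

    # Convert a DevTools HAR export into WARC response records.
    import base64
    import json
    from io import BytesIO

    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open('capture.har', encoding='utf-8') as f:
        har = json.load(f)

    with open('capture.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        for entry in har['log']['entries']:
            resp = entry['response']
            content = resp.get('content', {})
            body = content.get('text') or ''
            if content.get('encoding') == 'base64':
                payload = base64.b64decode(body)
            else:
                payload = body.encode('utf-8')
            headers = [(h['name'], h['value']) for h in resp.get('headers', [])]
            status_line = f"{resp['status']} {resp.get('statusText', '')}".strip()
            http_headers = StatusAndHeaders(status_line, headers, protocol='HTTP/1.1')
            record = writer.create_warc_record(entry['request']['url'], 'response',
                                               payload=BytesIO(payload),
                                               http_headers=http_headers)
            writer.write_record(record)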
If you want to generate WARC from a browser, warcprox is relatively easy and fast - but configuring the browser's proxy settings etc. is cumbersome if all you want is a single archive.
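For one-off captures you can also skip the browser and push a few requests through warcprox from a script; a sketch assuming warcprox is running locally on port 8000 (which I believe is its default) and that 'warcprox-ca.pem' is wherever you exported its generated CA certificate - that path is an assumption:

    # Capture a handful of requests through a locally running warcprox
    # instead of reconfiguring a whole browser.
    import requests

    proxies = {
        'http': 'http://localhost:8000',
        'https': 'http://localhost:8000',
    }

    # warcprox man-in-the-middles TLS, so responses are signed by its own CA.
    resp = requests.get('https://example.com/', proxies=proxies,
                        verify='warcprox-ca.pem')
    print(resp.status_code)  # the exchange now sits in warcprox's WARC output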
By the way, there are some great tools that use WARC under the hood such as perma [1]. They provide reliable snapshots of single documents with a stable URL.
Edit: I guess they're philosophically opposed to PKI and are instead promoting certificate pinning and Web-of-Trust:
http://www.stargrave.org/Harmful.html
This makes me wonder if it's possible to get `Let's Encrypt` to cross-sign an HTTPS certificate issued by another CA?
It's naive to think you can scale the web we have today on a web of trust. If you didn't know, Google Chrome largely ignores the certificate revocation infrastructure because it's slow and has privacy issues [0]. How would things work if you checked every single web request against the web of trust? Your peers would know every site you visit, or worse, a notary you trust would know every site you visit. PKI works because the check is local and based on time and cryptographic signatures.
It is currently possible to present multiple certificates when a TLS connection is initiated. These certs usually form a chain - from a CA through an intermediate CA down to the final certificate - but it should be possible to present certificates from different chains. The question is how clients would handle this.
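For reference, the usual single-chain case is just a concatenated bundle on the server side; a minimal sketch with Python's ssl module (file names are placeholders) - whether clients would accept unrelated chains in that bundle is exactly the open question:

    # The server presents its leaf certificate plus the intermediates as
    # one concatenated PEM bundle.
    import ssl

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # fullchain.pem = leaf certificate followed by intermediate CA certs;
    # the root is normally omitted because clients already have it.
    ctx.load_cert_chain(certfile='fullchain.pem', keyfile='privkey.pem')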
> I am tired that everyone provides very limited certificates trust management capabilities, like either certificate or SPKI pinning with TOFU. Even my beloved Xombrero browser still pins only the whole certificate, but its public key would be much more sufficient and convenient to work with.
Eh, a public key plus information about its validity (date, subject) is a certificate? So when you "pin" a key to a domain, you've effectively made a (non-standard) certificate that you explicitly trust?
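Pinning the key rather than the whole certificate usually just means hashing the SubjectPublicKeyInfo, so the pin survives re-issued certificates that keep the same key; a small sketch with the `cryptography` package ('site.pem' is a placeholder):

    # Compute an SPKI pin (the HPKP-style pin-sha256 value) from a
    # PEM-encoded certificate.
    import base64
    import hashlib

    from cryptography import x509
    from cryptography.hazmat.primitives import serialization

    with open('site.pem', 'rb') as f:
        cert = x509.load_pem_x509_certificate(f.read())

    spki = cert.public_key().public_bytes(
        serialization.Encoding.DER,
        serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    pin = base64.b64encode(hashlib.sha256(spki).digest()).decode()
    print(f'pin-sha256="{pin}"')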
Links:
[1] https://www.iso.org/standard/44717.html
[2] https://github.com/webrecorder/webrecorder-desktop
[3] https://github.com/internetarchive/warcprox