
Wow, it's really rare these days to see a tool that supports WARC.

Despite being an ISO standard [1] and the default archive format of the Internet Archive, and despite a handful of lovingly crafted tools (such as webrecorder [2], warcprox [3], etc.), it never seems to have caught on in a broader context.

Really a shame - I'm deeply convinced that the ability to archive and replay requests is a technique for defending and strengthening user rights.

Links:

[1] https://www.iso.org/standard/44717.html

[2] https://github.com/webrecorder/webrecorder-desktop

[3] https://github.com/internetarchive/warcprox




I have taken a shine to ArchiveBox [0], a self-hosted application that stores dumps in a variety of formats (PDF, HTML, extracted text, WARC, etc.), can optionally ping the Internet Archive to take a snapshot, and keeps a SQLite metadata archive. I have taken to using it for all blog articles I read: given the internet's amnesia, it is the only way I can rely on retaining the content.

[0] https://archivebox.io/


Oh, I see they recommend writing WARC files with wget using `--no-warc-digests`. You really should not do that. One, it's just a SHA-1, and it is costly in neither CPU nor storage. Two, the digest is used to create revisit records for de-duplication. If you disable that, you or someone else might end up with lots of duplicate resources on re-crawling.
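
To make the de-duplication point concrete, here is a rough sketch of the idea, not what wget or any particular crawler actually does internally. It hashes each payload with SHA-1, labels it in the `sha1:<base32>` style that commonly shows up in WARC digest headers, and marks a fetch as a revisit when the same digest has been seen before. The names `payload_digest` and `plan_records` are made up for illustration.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Label a payload in the 'sha1:<base32>' style often seen in WARC headers."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def plan_records(fetches):
    """Decide, per fetch, whether to store a full response or a revisit.

    fetches: iterable of (url, payload_bytes) pairs, in crawl order.
    Returns a list of (record_type, url, digest) tuples.
    """
    seen = {}  # digest -> first URL that produced this payload
    plan = []
    for url, payload in fetches:
        digest = payload_digest(payload)
        if digest in seen:
            # Same bytes as an earlier fetch: a small revisit record is enough.
            plan.append(("revisit", url, digest))
        else:
            seen[digest] = url
            plan.append(("response", url, digest))
    return plan


if __name__ == "__main__":
    crawl = [
        ("http://example.com/logo.png", b"same bytes"),
        ("http://example.com/logo-copy.png", b"same bytes"),
        ("http://example.com/index.html", b"different bytes"),
    ]
    for record_type, url, digest in plan_records(crawl):
        print(record_type, url, digest)
```

Without the digest there is nothing to key the `seen` lookup on, so every re-crawl has to store the full payload again.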


I really wish browsers like Firefox/Chromium could output WARC.


Absolutely. You can save a HAR file from devtools at least.
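
For what it's worth, a HAR file is plain JSON, so you can at least script against it. A minimal sketch, assuming the usual HAR 1.2 layout (`log.entries[]`, each with a `request` and a `response`); `summarize_har` is just a name I picked, and this only lists what was captured, it does not convert anything to WARC.

```python
import json


def summarize_har(har_text: str):
    """Return (method, url, status) for every entry in a HAR document."""
    har = json.loads(har_text)
    return [
        (entry["request"]["method"], entry["request"]["url"], entry["response"]["status"])
        for entry in har["log"]["entries"]
    ]


if __name__ == "__main__":
    # Tiny hand-written HAR fragment; a real export has many more fields.
    sample = json.dumps({
        "log": {
            "version": "1.2",
            "entries": [
                {
                    "request": {"method": "GET", "url": "https://example.com/"},
                    "response": {"status": 200},
                }
            ],
        }
    })
    for method, url, status in summarize_har(sample):
        print(method, url, status)
```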

If you want to generate WARC from browsers, warcprox is relatively easy and fast - but setting up the proxy settings etc. is cumbersome if all you want is a single archive.

By the way, there are some great tools that use WARC under the hood such as perma [1]. They provide reliable snapshots of single documents with a stable URL.

[1] https://perma.cc/



