digitaldragon's comments

The data is saved as a WARC file, which contains the entire HTTP request and response (compressed, of course). So it's much bigger than just a short -> long URL mapping.
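
For a sense of scale, here is a minimal sketch of a single response record for a shortlink redirect, using the warcio library (the short URL and redirect target are made up):

    # pip install warcio -- sketch only; URLs are hypothetical
    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    with open('shortlinks.warc.gz', 'wb') as fh:
        writer = WARCWriter(fh, gzip=True)
        # the full HTTP status line and headers get stored,
        # not just the short -> long mapping
        http_headers = StatusAndHeaders(
            '301 Moved Permanently',
            [('Location', 'https://example.com/the/long/target'),
             ('Content-Length', '0')],
            protocol='HTTP/1.1')
        record = writer.create_warc_record(
            'http://sho.rt/abc123', 'response',
            payload=BytesIO(b''), http_headers=http_headers)
        writer.write_record(record)

Each record also carries WARC headers on top of that (record ID, date, digests), which is part of why the files are so much larger than a bare mapping.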


Did they follow the redirect and archive the page content? But why?


Unfortunately, Browsertrix relies on the Chrome DevTools Protocol, which strips the transfer encoding (and possibly transforms the data in other ways). This results in Browsertrix writing noncompliant WARC files, because the spec requires that the original transfer encoding be preserved.
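
To make the spec point concrete: a compliant WARC response record stores the body exactly as it crossed the wire. For a chunked response, those bytes look roughly like this (illustrative, not from a real capture):

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked

    5
    hello
    0

CDP's Network.getResponseBody hands back the already-dechunked payload ("hello") with the Transfer-Encoding framing gone, so the original chunk boundaries can't be reconstructed when writing the WARC record.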


Unfortunately, there is not much we can do about transfer encoding, but the data is otherwise exactly as returned by the browser. Browsertrix uses the browser to create web archives, so users get an accurate representation of what they see in their browser, which is generally what people want from archives.

We do the best we can with a limited standard that is difficult to modify. Archiving is always lossy; we try to reduce that as much as possible, but there are limits. People create web archives because they care about not losing their stuff online, not because they need an accurate record of the transfer-encoding property of an HTTP connection. If storing the transfer encoding is the most important thing, then yes, there are better tools for that.


You could use a proxy.

"Archiving is always lossy" No.


You're talking to the guy who built the best proxy recorder in the archiving industry ;) ikreymer created https://pywb.readthedocs.io/en/latest/

I think he has more context than any of us on the limits of proxy archiving vs browser based archiving.

But also, if you really need perfect packet-level replication, just Wireshark it, as he said. Why bother with WARCs at all?
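
For reference, recording through pywb's proxy is only a couple of commands. A sketch, assuming a collection named my-coll; the --proxy-record flag is from memory of the pywb docs, so check wayback --help on your version:

    pip install pywb
    wb-manager init my-coll
    # run pywb as a recording HTTP/S proxy and browse through it
    wayback --proxy my-coll --proxy-record

And if packet-level fidelity really is the requirement, a Wireshark/tshark capture combined with an SSLKEYLOGFILE (which Chrome and Firefox honor, and which Wireshark can use to decrypt the TLS stream) preserves the wire bytes that no WARC writer keeps.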


pywb has WARC issues too, due to its use of warcio:

https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem


Every archiving tool out there makes trade-offs about what is archived and how. No one preserves the raw TLS-encrypted H3 traffic, because that's not useful. When you browse through an archiving MITM proxy, there are different trade-offs: there's an extra HTTP connection involved (that's not stored), a fake MITM cert, and a downgrade of the H2/H3 connection to HTTP/1.1 (some sites serve different content over H2 vs. HTTP/1.1, sites can detect the downgrade, etc.).
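
That downgrade is observable from the client side. Here's a quick sketch (using the httpx library; the URL is a placeholder) for checking whether a server varies its response by HTTP version:

    # pip install 'httpx[http2]' -- sketch; URL is a placeholder
    import httpx

    url = 'https://example.com/'
    for use_http2 in (False, True):
        with httpx.Client(http2=use_http2) as client:
            r = client.get(url)
            # compare the negotiated protocol and body size across runs
            print(r.http_version, len(r.content))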

The web is best-effort, and so is archiving the web.

