
Yes, this is true currently. If you need nice WARCs I recommend Browsertrix by our friends at Webrecorder instead.

It's on my roadmap to improve this eventually, but currently I'm focused on saving raw files to a filesystem, because that's more accessible to most users and easier to pipe into other tools.

I encourage people to use ZFS to do deduping and compression at the filesystem layer.






Browsertrix (and Webrecorder tools in general) also violate the standard by modifying response data. It's supposed to be the raw bytes as they are sent over the network (minus TLS).

The entire WARC ecosystem is kind of a mess.
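
To make the "raw bytes" point concrete, here is a minimal sketch of a WARC/1.1 response record built with only the Python standard library. The function name `build_warc_response_record` and the example URL are hypothetical, and this is an illustration of the record layout under those assumptions, not how Browsertrix or any other tool actually writes WARCs. The key detail is that the record block is the HTTP response exactly as received (status line, headers, body), with nothing decoded or rewritten.

```python
import base64
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, raw_http_response: bytes) -> bytes:
    """Wrap an unmodified HTTP response in a WARC/1.1 'response' record.

    Hypothetical helper for illustration: the block is stored byte-for-byte,
    so Content-Length and the digest describe the raw bytes as received.
    """
    # Block digest in the common "sha1:<base32>" labelled form.
    digest = base64.b32encode(hashlib.sha1(raw_http_response).digest()).decode("ascii")
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http;msgtype=response",
        f"WARC-Block-Digest: sha1:{digest}",
        f"Content-Length: {len(raw_http_response)}",
    ]
    # Header lines end with CRLF, followed by one empty line before the block.
    header_block = ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8")
    # Every record is terminated by two CRLFs after the block.
    return header_block + raw_http_response + b"\r\n\r\n"


if __name__ == "__main__":
    raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 5\r\n\r\nhello"
    record = build_warc_response_record("http://example.com/", raw)
    print(record.decode("utf-8"))
```

Anything that changes `raw_http_response` before this step (decompressing, rewriting headers, swapping a video rendition) means the stored block is no longer what was sent over the wire, which is the crux of the disagreement here.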


This isn't really true. Our tools do not just modify response data for no reason!

Our tools do the best that we can with an old format that is in use by many institutions. The WARC format does not account for H2/H3 data, which is used by most sites nowadays.

The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make it accessible/viewable in the browser. That means stripping TLS and H2/H3, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.

If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.


He didn't say you modify the data for no reason, he said you violate the standard. Which is true. You could respect it, but you don't.

imo the Webrecorder stuff is truly state of the art, if they're pushing the limits of WARC standards it's for good reason, and I trust their judgement. They pioneered the newer WACZ standard and are really pushing the whole field forward.


