
Things that attempt to rewrite links and inline CSS and JavaScript are doomed to fail. Many sites do weird JavaScript shenanigans, and without a million special cases you'll never make it work reliably. Just try archiving your Facebook news feed and let me know how it goes.

Instead, archivists should try to record the exact data sent between the server and a real browser, and then save that in a cache. Then, when viewing the archive, use the same browser and replay the same data, and you should see the exact same thing! With small tweaks to make everything deterministic (disallow true randomness in JavaScript, set the date and time back to the archiving date so SSL certificates are still valid), this method can never 'bit rot'.
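For what it's worth, here's a rough sketch of that record-then-replay idea using puppeteer in TypeScript. The JSON cache format, the ARCHIVE_DATE constant and the seeded RNG are my own assumptions for illustration; a real tool would record to WARC and handle far more edge cases.

  import puppeteer from 'puppeteer';
  import { promises as fs } from 'fs';

  // Freeze "now" at the archiving date so date checks behave as they did at capture time.
  const ARCHIVE_DATE = Date.parse('2019-01-01T00:00:00Z');

  type CachedResponse = { status: number; headers: Record<string, string>; body: string };

  async function record(url: string, cachePath: string) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const cache: Record<string, CachedResponse> = {};

    // Save every response exactly as the server sent it, keyed by URL.
    page.on('response', async (res) => {
      try {
        const body = (await res.buffer()).toString('base64');
        cache[res.url()] = { status: res.status(), headers: res.headers(), body };
      } catch {
        // Some responses (e.g. redirects) have no retrievable body; skip them.
      }
    });

    await page.goto(url, { waitUntil: 'networkidle0' });
    await fs.writeFile(cachePath, JSON.stringify(cache));
    await browser.close();
  }

  async function replay(url: string, cachePath: string) {
    const cache: Record<string, CachedResponse> = JSON.parse(await fs.readFile(cachePath, 'utf8'));
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Determinism tweaks: pin the clock and replace Math.random with a seeded PRNG.
    await page.evaluateOnNewDocument((ts: number) => {
      let seed = 42;
      Math.random = () => (seed = (seed * 16807) % 2147483647) / 2147483647;
      Date.now = () => ts;
    }, ARCHIVE_DATE);

    // Answer every request from the cache instead of the network.
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      const hit = cache[req.url()];
      if (hit) {
        req.respond({ status: hit.status, headers: hit.headers, body: Buffer.from(hit.body, 'base64') });
      } else {
        req.abort();
      }
    });

    await page.goto(url, { waitUntil: 'networkidle0' });
    // ...screenshot or browse the replayed page here, then close the browser.
  }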

When technology moves on and you can no longer run the browser and proxy, you wrap it all up in a virtual machine and run it like that. Virtual machines have successfully preserved games console software nearly perfectly for ~40 years now, which is far better than pretty much any other approach.




So the article does go into detail about how just "wget-ing" a website isn't sufficient. This is what WARC files are built for, and that's why I insisted on that principle.

But while it's true that some sites require some serious JavaScript work to be archived properly, my feeling is that if you design your site to be uncrawlable, you are going to have other problems to deal with anyway. There will be accessibility problems, and those affect not only "people with disabilities" but also search engines, mobile users, voice interfaces, etc.

If you design your site for failure, it will fail and disappear from history. After all, it's not always the archivist's fault sites die - sometimes the authors should be blamed and held to a higher standard than "look ma, I just discovered JavaScript and made a single-page app, isn't that great?" :p


Communication is dependent on JS events, which are dependent on the user's actions. There's also localStorage and other such things. Your method might work for some simple JS-based websites, but it's no silver bullet.
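That said, some of that client-side state can at least be snapshotted alongside the traffic. A hedged sketch with puppeteer, where the dump format is simply a JSON array of key/value pairs I made up:

  import type { Page } from 'puppeteer';

  // Serialize every key/value pair the page has put into localStorage.
  async function dumpLocalStorage(page: Page): Promise<string> {
    return page.evaluate(() => JSON.stringify(Object.entries(localStorage)));
  }

  // Re-seed localStorage before any page script runs, e.g. during replay.
  async function restoreLocalStorage(page: Page, dump: string) {
    await page.evaluateOnNewDocument((data: string) => {
      for (const [k, v] of JSON.parse(data) as [string, string][]) {
        localStorage.setItem(k, v);
      }
    }, dump);
  }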


While what you say is true, the above method is the only way to archive arbitrary web pages. Yes, it depends on user interactions to some extent, but it's possible to let a page load until network fetches stop and reasonably consider it rendered. Generally speaking, you can only archive some preset interactions with a modern web page; you can't hope to capture it all.
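Concretely, with a headless browser that "load until fetches stop" heuristic is just a wait condition; puppeteer's networkidle0 treats the page as settled once there have been no network connections for 500 ms. The timeout below is an arbitrary guess of mine:

  import puppeteer from 'puppeteer';

  async function loadUntilSettled(url: string) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // networkidle0: resolve once there have been no network connections for 500 ms.
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 60_000 });
    // Treat the page as "rendered" at this point and snapshot/record it.
    await browser.close();
  }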

There are tools like WebRecorder[0] that do this to some extent by recording and replaying all requests. It's certainly a step in the right direction and demonstrates that the approach is viable. This was the only approach I tried that worked for archiving stuff like three.js demos. Worth mentioning there's also an Awesome list[1] that covers various web archival technologies.

[0] https://github.com/webrecorder/webrecorder

[1] https://github.com/iipc/awesome-web-archiving


What I'm trying to do with Bookmark Archiver is record user activity to create a puppeteer script while the user is visiting the page, then replay that in a headless browser later and record the session to a WARC. That should cover both dynamically requested and interactive stuff that is otherwise missed by current archiving methods.
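For the replay half, here's a hedged sketch of what that could look like with puppeteer; the Action log format below is invented for illustration (Bookmark Archiver's real format may differ), and the WARC-writing step is left out:

  import puppeteer from 'puppeteer';

  // Hypothetical shape of a recorded user-activity log.
  type Action =
    | { kind: 'click'; selector: string }
    | { kind: 'type'; selector: string; text: string }
    | { kind: 'scroll'; y: number }
    | { kind: 'wait'; ms: number };

  async function replayActions(url: string, actions: Action[]) {
    const browser = await puppeteer.launch(); // headless by default
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });

    for (const a of actions) {
      switch (a.kind) {
        case 'click':  await page.click(a.selector); break;
        case 'type':   await page.type(a.selector, a.text); break;
        case 'scroll': await page.evaluate((y: number) => window.scrollTo(0, y), a.y); break;
        case 'wait':   await new Promise((r) => setTimeout(r, a.ms)); break;
      }
    }

    // ...capture the resulting traffic (e.g. via page.on('response')) and write it to a WARC here.
    await browser.close();
  }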

I also plan on saving an x86 VM image of the browser used to make the archive every couple of months, so that sites can be replayed many decades into the future.



