
Things that attempt to rewrite links and inline CSS and JavaScript are doomed to fail. Many sites do weird JavaScript shenanigans, and without a million special cases you'll never make it work reliably. Just try archiving your Facebook news feed and let me know how it goes.

Instead, archivists should try to record the exact data sent between the server and a real browser, and then save that in a cache. Then, when viewing the archive, use the same browser and replay the same data, and you should see the exact same thing! With small tweaks to make everything deterministic (disallow true randomness in JavaScript, set the date and time back to the archiving date so SSL certificates are still valid), this method can never 'bit rot'.
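For what it's worth, here's a rough sketch of that record-then-replay idea using puppeteer in TypeScript. The JSON cache format, the ARCHIVE_DATE constant and the seeded RNG are my own assumptions for illustration; a real tool would record to WARC and handle far more edge cases.

  import puppeteer from 'puppeteer';
  import { promises as fs } from 'fs';

  // Freeze "now" at the archiving date so date checks behave as they did at capture time.
  const ARCHIVE_DATE = Date.parse('2019-01-01T00:00:00Z');

  type CachedResponse = { status: number; headers: Record<string, string>; body: string };

  async function record(url: string, cachePath: string) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const cache: Record<string, CachedResponse> = {};

    // Save every response exactly as the server sent it, keyed by URL.
    page.on('response', async (res) => {
      try {
        const body = (await res.buffer()).toString('base64');
        cache[res.url()] = { status: res.status(), headers: res.headers(), body };
      } catch {
        // Some responses (e.g. redirects) have no retrievable body; skip them.
      }
    });

    await page.goto(url, { waitUntil: 'networkidle0' });
    await fs.writeFile(cachePath, JSON.stringify(cache));
    await browser.close();
  }

  async function replay(url: string, cachePath: string) {
    const cache: Record<string, CachedResponse> = JSON.parse(await fs.readFile(cachePath, 'utf8'));
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Determinism tweaks: pin the clock and replace Math.random with a seeded PRNG.
    await page.evaluateOnNewDocument((ts: number) => {
      let seed = 42;
      Math.random = () => (seed = (seed * 16807) % 2147483647) / 2147483647;
      Date.now = () => ts;
    }, ARCHIVE_DATE);

    // Answer every request from the cache instead of the network.
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      const hit = cache[req.url()];
      if (hit) {
        req.respond({ status: hit.status, headers: hit.headers, body: Buffer.from(hit.body, 'base64') });
      } else {
        req.abort();
      }
    });

    await page.goto(url, { waitUntil: 'networkidle0' });
    // ...screenshot or browse the replayed page here, then close the browser.
  }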

When technology moves on and you can no longer run the browser and proxy, you wrap it all up in a virtual machine and run it like that. Virtual machines have successfully preserved games console software nearly perfectly for ~40 years now, which is far better than pretty much any other approach.




So the article does go into detail about how just "wget-ing" a website isn't sufficient. This is what WARC files are built for, and that's why I insisted on that principle.

But while it's true that some sites require some serious JavaScript work to be archived properly, my feeling is that if you design your site to be uncrawlable, you are going to have other problems to deal with anyway. There will be accessibility problems, and those affect not only "people with disabilities" but also search engines, mobile users, voice interfaces, etc.

If you design your site for failure, it will fail and disappear from history. After all, it's not always the archivist's fault sites die - sometimes the authors should be blamed and held to a higher standard than "look ma, I just discovered JavaScript and made a single-page app, isn't that great?" :p


Communication is dependent on JS events, which are dependent on the user's actions. There's also localStorage and other such things. Your method might work for some simple JS-based websites, but it's no silver bullet.
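That said, some of that client-side state can at least be snapshotted alongside the traffic. A hedged sketch with puppeteer, where the dump format is simply a JSON array of key/value pairs I made up:

  import type { Page } from 'puppeteer';

  // Serialize every key/value pair the page has put into localStorage.
  async function dumpLocalStorage(page: Page): Promise<string> {
    return page.evaluate(() => JSON.stringify(Object.entries(localStorage)));
  }

  // Re-seed localStorage before any page script runs, e.g. during replay.
  async function restoreLocalStorage(page: Page, dump: string) {
    await page.evaluateOnNewDocument((data: string) => {
      for (const [k, v] of JSON.parse(data) as [string, string][]) {
        localStorage.setItem(k, v);
      }
    }, dump);
  }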


While what you say is true, the above method is the only way to archive arbitrary web pages. Yes, it depends on user interactions to some extent, but it's possible to let a page load until network fetches stop and reasonably consider it rendered. Generally speaking, you can only archive some preset interactions with a modern web page; you can't hope to capture it all.
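Concretely, with a headless browser that "load until fetches stop" heuristic is just a wait condition; puppeteer's networkidle0 treats the page as settled once there have been no network connections for 500 ms. The timeout below is an arbitrary guess of mine:

  import puppeteer from 'puppeteer';

  async function loadUntilSettled(url: string) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // networkidle0: resolve once there have been no network connections for 500 ms.
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 60_000 });
    // Treat the page as "rendered" at this point and snapshot/record it.
    await browser.close();
  }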

There are tools like WebRecorder[0] that do this to some extent by recording and replaying all requests. It's certainly a step in the right direction and demonstrates that the approach is viable. This was the only approach I tried that worked for archiving stuff like three.js demos. Worth mentioning there's also an Awesome list[1] that covers various web archival technologies.

[0] https://github.com/webrecorder/webrecorder

[1] https://github.com/iipc/awesome-web-archiving


What I'm trying to do with Bookmark Archiver is record user activity to create a puppeteer script while the user is visiting the page, then replay that in a headless browser later and record the session to a WARC. That should cover both dynamically requested and interactive stuff that is otherwise missed by current archiving methods.
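For the replay half, here's a hedged sketch of what that could look like with puppeteer; the Action log format below is invented for illustration (Bookmark Archiver's real format may differ), and the WARC-writing step is left out:

  import puppeteer from 'puppeteer';

  // Hypothetical shape of a recorded user-activity log.
  type Action =
    | { kind: 'click'; selector: string }
    | { kind: 'type'; selector: string; text: string }
    | { kind: 'scroll'; y: number }
    | { kind: 'wait'; ms: number };

  async function replayActions(url: string, actions: Action[]) {
    const browser = await puppeteer.launch(); // headless by default
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });

    for (const a of actions) {
      switch (a.kind) {
        case 'click':  await page.click(a.selector); break;
        case 'type':   await page.type(a.selector, a.text); break;
        case 'scroll': await page.evaluate((y: number) => window.scrollTo(0, y), a.y); break;
        case 'wait':   await new Promise((r) => setTimeout(r, a.ms)); break;
      }
    }

    // ...capture the resulting traffic (e.g. via page.on('response')) and write it to a WARC here.
    await browser.close();
  }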

I also plan on saving an x86 VM image of the browser used to make the archive every couple of months, so that sites can be replayed many decades into the future.



