Hacker News
Archiving Web Sites (2018) (lwn.net)
112 points by butterthebuddha on Feb 29, 2020 | 11 comments



I just installed Memex today to help with this problem for myself. I would really like an extension that records everything and saves it as WARC or some other replayable format. I hope Memex can answer the question I keep running into, which is "Crap, which web page was that quote/keyword/idea on?". Chrome history is iffy at best.


SingleFile [1] can do that, i.e. automatically save viewed pages and/or bookmarks, but it will save pages in HTML. Alternatively, SingleFileZ [2] can do the same but will produce zip files (disguised as HTML files). Disclosure: I'm the author.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/gildas-lormeau/SingleFileZ


I'm using something similar, I believe. I simply wrote a Puppeteer-automated browser that goes through every page and saves it as `.mhtml`. This works quite well for my purpose. I was archiving a site with content that I pay for and that sits behind my login. I often use material from it when I'm offline, hence the need to put together this hack.

The code below does the job of saving the page as a single file.

```

        const page = await this.browser.newPage()
        const response = await page.goto(url, { timeout: 50000 })

        if (response.status() === 404) {
            await page.close()
            throw new Error('not found')
        }

        // credit: https://stackoverflow.com/questions/54814323/puppeteer-how-to-download-entire-web-page-for-offline-use
        const cdp = await page.target().createCDPSession();
        const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });

        const htmlFilename = "./data/" + slugify(url)+'.mhtml';
        fs.writeFileSync(htmlFilename, data);
```
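The snippet above does not show where `slugify` comes from. As a rough sketch only, a dependency-free stand-in could look like the following; the name `urlToFilename` and the exact character rules are assumptions for illustration, not what the original code used.

```
// Hypothetical helper: turn a URL into a flat, filesystem-friendly filename.
// Uses only the URL class that ships with Node and browsers.
function urlToFilename(url, extension = '.mhtml') {
    const { hostname, pathname, search } = new URL(url)
    const slug = (hostname + pathname + search)
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, '-') // collapse every run of unsafe characters into one dash
        .replace(/^-+|-+$/g, '')     // trim dashes left at either end
    return slug + extension
}

console.log(urlToFilename('https://example.com/docs/Intro?page=2'))
```

Note that this mapping is lossy: two different URLs can end up with the same filename (for example `/a/b` and `/a-b`), so a real archive would probably want to append a hash of the full URL as well.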


I've been looking for something like this for a while now, to store all pages I visit into a personal archive, but all the options I found either involved setting up a proxy and MITMing all your requests (too much effort to set up) or saved to a format I could not easily access.

So far, SingleFile looks like a perfect fit, thanks!


https://github.com/dosyago/22120

produces replayable archives in a format that doesn't have the issues of WARC. It's a work in progress though, and currently archives everything instead of just what you select.

You can of course edit the archive by hand. It's very simple to do, but not as simple as being able to select only what you want.

I think there's a domain whitelist option that works per archive.


I believe the web packaging proposal allows something like this: https://www.infoq.com/news/2020/01/web-packaging-bundles-wic...


What is Memex?



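Sites that use JavaScript-based cache-busting code are probably the hardest to archive. Replaying a request for CDN.com/jQuery.js?rnd=63926195 isn't exactly easy without hand-coded workarounds, because the random value differs on every page load and so never matches the archived URL.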
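One such hand-coded workaround is to normalize URLs before looking them up in the archive, dropping query parameters that are known to be random. A minimal sketch, assuming the cache-buster can be recognized by its parameter name; the name list and the function `stripCacheBusters` are made up for illustration, and real sites will need their own rules.

```
// Assumption: these parameter names carry random cache-busting values.
const CACHE_BUSTING_PARAMS = new Set(['rnd', 'cb', '_', 'nocache', 'cachebust'])

// Return the URL with cache-busting parameters removed, so that two fetches
// of the same resource map to the same archive key.
function stripCacheBusters(rawUrl) {
    const url = new URL(rawUrl)
    // Copy the keys first, because deleting while iterating skips entries
    for (const name of [...url.searchParams.keys()]) {
        if (CACHE_BUSTING_PARAMS.has(name.toLowerCase())) {
            url.searchParams.delete(name)
        }
    }
    return url.toString()
}

const archived = stripCacheBusters('https://cdn.example.com/jquery.js?rnd=63926195')
const replayed = stripCacheBusters('https://cdn.example.com/jquery.js?rnd=11111111')
console.log(archived === replayed)
```

This only helps when the random value lives in a recognizable query parameter. Cache-busters embedded in the path, or parameters that look random but actually select content, still need per-site handling.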


Mods: could we get 2018 in the title please?


The most expedient way to request changes like this is to shoot off an email to hn@ycombinator.com.



