Archiving Web Sites (2018)

joshstrange · on Feb 29, 2020

I just installed Memex today to help with this problem for myself. I would really like an extension that recorded everything and saved it as WARC or some other replayable format. I hope Memex can do what I want right now, which is to answer "Crap, which web page was that quote/keyword/idea on?". Chrome history is iffy at best.

gildas · on Feb 29, 2020

SingleFile [1] can do that, i.e. automatically save viewed pages and/or bookmarks, but it will save pages in HTML. Alternatively, SingleFileZ [2] can do the same but will produce zip files (disguised as HTML files). Disclosure: I'm the author.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/gildas-lormeau/SingleFileZ

santa_boy · on Feb 29, 2020

I'm using something similar, I believe. I simply wrote a Puppeteer-automated browser that goes through every page and saves it as `.mhtml`. This works quite well for my purpose. I was archiving a site with content that I pay for and that sits behind my login. I often use material from it when I'm offline, hence this hack.

The below code does the job of saving the page as a single file.

```

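const fs = require('fs')
const slugify = require('slugify') // assumed: the npm "slugify" package (not stated in the original)

// Save one page as a single .mhtml file under ./data (the directory must exist).
// `browser` is an already-launched Puppeteer browser.
async function savePage(browser, url) {
    const page = await browser.newPage()
    try {
        const response = await page.goto(url, { timeout: 50000 })

        // goto() can resolve to null, so check before reading the status.
        if (response && response.status() === 404) {
            throw new Error('not found')
        }

        // credit: https://stackoverflow.com/questions/54814323/puppeteer-how-to-download-entire-web-page-for-offline-use
        // Ask Chrome, over the DevTools Protocol, for an MHTML snapshot of the page.
        const cdp = await page.target().createCDPSession()
        const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' })

        const htmlFilename = './data/' + slugify(url) + '.mhtml'
        fs.writeFileSync(htmlFilename, data)
    } finally {
        // Close the tab whether or not the snapshot succeeded.
        await page.close()
    }
}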

```
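
Going through every page then only needs a small driver loop around the snapshot step. A minimal sketch, assuming a hypothetical async `savePage(url)` function that wraps the snapshot code and rejects on failure; `archiveAll` and its return shape are illustrative names of mine, not part of Puppeteer.

```javascript
// Hypothetical driver: visit URLs one at a time and record which ones failed.
// `savePage` is an assumed async function (url) => Promise that saves one page.
async function archiveAll(urls, savePage) {
    const failed = []
    for (const url of urls) {
        try {
            // Sequential on purpose: one page at a time is gentler on the site.
            await savePage(url)
        } catch (err) {
            // Keep going; a single missing page should not abort the whole run.
            failed.push({ url, reason: err.message })
        }
    }
    return { saved: urls.length - failed.length, failed }
}
```

Collecting failures instead of throwing means one 404 does not stop the run, and the `failed` list can be retried later.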

ordinary · on Feb 29, 2020

I've been looking for something like this for a while now, to store all pages I visit into a personal archive, but all the options I found either involved setting up a proxy and MITMing all your requests (too much effort to set up) or saved to a format I could not easily access.

So far, SingleFile looks like a perfect fit, thanks!

zuckluni · on Feb 29, 2020

https://github.com/dosyago/22120

produces replayable archives in a format that doesn't have the issues of WARC. It's a work in progress tho, and currently it archives everything instead of just what you select.

You can of course edit the archive by hand. It's very simple to do, but not as simple as being able to select only what you want.

I think there's a domain whitelist option that works per archive.

Fudgel · on Feb 29, 2020

I believe the web packaging proposal allows something like this: https://www.infoq.com/news/2020/01/web-packaging-bundles-wic...

copperx · on Feb 29, 2020

What is Memex?

progval · on Feb 29, 2020

Probably this: https://github.com/WorldBrain/Memex

londons_explore · on Feb 29, 2020

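Sites that use JavaScript-based cache-busting code are probably the hardest to archive. The random query string changes on every load, so a replayed page asks for a URL that was never captured. Fetching CDN.com/jQuery.js?rnd=63926195 isn't exactly easy without handcoded workarounds.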
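
One such handcoded workaround is to look resources up by a normalized URL with the cache-busting parameter removed. A rough sketch, assuming the randomness lives in a query parameter with a known name; the `CACHE_BUSTERS` list and the `archiveKey` helper are made up for illustration, and a real site would likely need its own list.

```javascript
// Assumed parameter names; adjust per site. These are guesses, not a standard.
const CACHE_BUSTERS = new Set(['rnd', 'cb', '_', 'nocache', 'cachebust'])

// Return a lookup key for an archived resource: the URL minus cache-busting params.
function archiveKey(rawUrl) {
    const url = new URL(rawUrl)
    // Copy the keys first so deleting does not disturb the iteration.
    for (const name of [...url.searchParams.keys()]) {
        if (CACHE_BUSTERS.has(name.toLowerCase())) {
            url.searchParams.delete(name)
        }
    }
    return url.toString()
}
```

Storing and replaying by `archiveKey(url)` makes two loads of the same script map to one archived copy. It does not help when the random value is built into the path or computed in a way the archive cannot recognize.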

edwhitesell · on Feb 29, 2020

Mods: could we get 2018 in the title please?

mintplant · on Feb 29, 2020

The most expedient way to request changes like this is to shoot off an email to hn@ycombinator.com.