
Archiving Web Sites (2018) - butterthebuddha
https://lwn.net/Articles/766374/
======
joshstrange
I just installed Memex today to help with this problem for myself. I would
really like an extension that records everything and saves it as WARC or some
other replayable format. I hope Memex can handle what I want right now, which
is answering "Crap, which web page was that quote/keyword/idea on?". Chrome
history is iffy at best.

~~~
gildas
SingleFile [1] can do that, i.e. automatically save viewed pages and/or
bookmarks, but it will save pages in HTML. Alternatively, SingleFileZ [2] can
do the same but will produce zip files (disguised as HTML files). Disclosure:
I'm the author.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/gildas-lormeau/SingleFileZ

~~~
santa_boy
I'm using something similar, I believe. I simply wrote a Puppeteer-automated
browser that goes through every page and saves it as `.mhtml`. This works quite
well for my purpose. I was archiving a site with content that I pay for and
that sits behind my login. I often use material from it when I'm offline, hence
the need to put together this hack.

The code below does the job of saving the page as a single file.

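```
// Assumes `browser` is an already-launched Puppeteer Browser and that
// `slugify` is the npm package of that name; the ./data directory must exist.
const fs = require('fs');
const slugify = require('slugify');

async function savePageAsMhtml(browser, url) {
    const page = await browser.newPage();
    const response = await page.goto(url, { timeout: 50000 });

    if (response.status() === 404) {
        await page.close();
        throw new Error('not found');
    }

    // credit: https://stackoverflow.com/questions/54814323/puppeteer-how-to-download-entire-web-page-for-offline-use
    // Ask Chrome over the DevTools protocol for a single-file MHTML snapshot.
    const cdp = await page.target().createCDPSession();
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });

    const htmlFilename = './data/' + slugify(url) + '.mhtml';
    fs.writeFileSync(htmlFilename, data);
    await page.close();
}
```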

------
londons_explore
Sites that use JavaScript-based cache-busting code are probably the hardest to
archive. Fetching CDN.com/jQuery.js?rnd=63926195 isn't exactly easy without
hand-coded workarounds.
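
One hand-coded workaround of the kind mentioned here might look like the sketch below: strip known cache-busting parameters before looking a URL up in the archive, so that every random variant maps to the same stored copy. The function name `canonicalizeUrl` and the parameter list in `CACHE_BUSTER_PARAMS` are assumptions for illustration; real sites use whatever names they like, which is why this tends to need per-site tuning.

```javascript
// Hypothetical list of query parameter names treated as cache busters.
// This is an assumption; adjust it for the sites being archived.
const CACHE_BUSTER_PARAMS = new Set(['rnd', '_', 'cb', 'nocache']);

// Return the URL with cache-busting parameters removed, suitable as an
// archive lookup key. Other query parameters are left untouched.
function canonicalizeUrl(rawUrl) {
    const url = new URL(rawUrl);
    // Copy the keys first so deleting does not disturb the iteration.
    for (const name of [...url.searchParams.keys()]) {
        if (CACHE_BUSTER_PARAMS.has(name.toLowerCase())) {
            url.searchParams.delete(name);
        }
    }
    return url.toString();
}
```

With this, `CDN.com/jQuery.js?rnd=63926195` and the same path with any other `rnd` value produce the same key. It does nothing for sites that put the random value in the path or under an unlisted parameter name.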

------
edwhitesell
Mods: could we get 2018 in the title please?

~~~
mintplant
The most expedient way to request changes like this is to shoot off an email
to hn@ycombinator.com.

