
Ask HN: What is the best way to archive a webpage - badwolff
I am working with some peers who have a website that links to and catalogs a number of resources (think blog posts).

It would be ideal for the administrators to be able to archive, or keep a copy of, each linked page on their server in case the original post is deleted, links move, servers go down, etc.

Currently they are using http://archive.is as a half solution. It does not work for some websites, and ideally they could host their own archived copies.

What are easy solutions for this?

With Python I was thinking requests - but that would just grab the HTML, not images or content generated by JavaScript.

With Selenium, you could take a screenshot of the content - but that's not the most user-friendly to read.

What are some other solutions?
======
mdaniel
I've enjoyed great success with various archiving proxies, including
[https://github.com/internetarchive/warcprox#readme](https://github.com/internetarchive/warcprox#readme)
and
[https://github.com/zaproxy/zaproxy#readme](https://github.com/zaproxy/zaproxy#readme)
(which saves the content to an embedded database, and can be easier to work
with than WARC files). The benefit of those approaches over just Save As from
the browser is that, almost by definition, the proxy saves every component
required to re-render the page, whereas Save As only grabs the parts it sees
at that time.
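As a rough sketch (assuming warcprox is installed, e.g. via `pip install warcprox`), you start the proxy and then route any client through it:

```shell
# Directory where the proxy will write its WARC files.
mkdir -p warcs

# Start the archiving proxy on port 8000; everything fetched through it
# is recorded to timestamped WARC files under ./warcs.
warcprox -p 8000 -d ./warcs &
PROXY_PID=$!
sleep 2

# Fetch a page through the proxy. For https URLs you'd need --insecure
# (or to trust warcprox's generated CA), since it man-in-the-middles TLS.
curl --silent --proxy http://localhost:8000 http://example.com/ -o /dev/null

kill $PROXY_PID
ls warcs/
```

The same proxy setting works for a real browser, so you can archive JavaScript-heavy pages by simply browsing them through the proxy.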

If it's a public page, you can submit the URL to the Internet Archive,
benefiting both you and them.

------
cimmanom
If it's not doing silly things like using JavaScript to load static content,
wget can do recursive crawls.
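A minimal sketch of such a crawl, demonstrated against a tiny throwaway site served locally (point wget at any real URL the same way):

```shell
# Build a small test site: one page referencing a stylesheet and an image.
mkdir -p site
cat > site/index.html <<'EOF'
<html><head><link rel="stylesheet" href="style.css"></head>
<body><img src="logo.png"></body></html>
EOF
echo 'body { margin: 0; }' > site/style.css
printf 'png' > site/logo.png
python3 -m http.server 8099 --directory site >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

mkdir -p archive && cd archive
# --mirror: recursive download with timestamping;
# --page-requisites: also fetch the images, CSS, and JS each page needs;
# --convert-links: rewrite links so the copy browses offline;
# -nH: don't create a hostname directory level.
wget --quiet --mirror --page-requisites --convert-links -nH \
     http://localhost:8099/
cd ..

kill $SERVER_PID
ls archive/
```

After the crawl, `archive/` holds the page plus its requisites, with links rewritten so it can be opened straight from disk.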

------
adultSwim
Either curl or wget will get you pretty far. Learn one of them well. They are
basically equivalent. I use curl.

For current web apps, there is an interactive archiver written in Python,
Webrecorder. It captures the full bi-directional traffic of a session.
[https://webrecorder.io/](https://webrecorder.io/) Webrecorder uses an
internal Python library, pywb. That might be a good place to look.
[https://github.com/webrecorder/pywb](https://github.com/webrecorder/pywb)

It looks like Selenium has done a lot of catching up on its interface. I'd be
curious how they compare now.

Talk to librarians about archiving the web. They built the Internet Archive
and have a lot of experience.

------
inceptionnames
Save the page using the browser's Save feature and zip the created assets (an
HTML file plus a directory with graphics, JS, CSS, etc.) for ease of sharing.

------
tyingq
If you’re okay with something easy that saves to a third party:
[https://www.npmjs.com/package/archive.is](https://www.npmjs.com/package/archive.is)

------
anotheryou
perma.cc looks sweet, but it's very limited for private individuals.

~~~
adultSwim
Run it yourself: [https://github.com/harvard-lil/perma](https://github.com/harvard-lil/perma)

~~~
anotheryou
I especially like them for their rescue plan and that it will stay online no
matter what _I_ do...

