Archiveis: simple Python wrapper for the archive.is capturing service (github.com)
35 points by ingve 8 months ago | 14 comments

  domain = ""
  save_url = urljoin(domain, "/submit/")
Huh? What's this IP? Why does it send stuff in cleartext?

That used to be one of archive.is's IPs, but not since February. Definitely worth a pull request.

[edit] adding screenshot from passive dns lookup https://m.imgur.com/a/uW9eL2h
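A minimal sketch of the fix being suggested, assuming the package submits captures via a POST to `/submit/` (the `capture` helper and its return value are illustrative, not the package's actual API): use the domain name over HTTPS instead of a hardcoded cleartext IP, so DNS and certificate validation do their job.

```python
from urllib.parse import urljoin, urlencode
from urllib.request import urlopen

# Use the service's domain over HTTPS rather than a stale, hardcoded IP
# reached in cleartext.
domain = "https://archive.is"
save_url = urljoin(domain, "/submit/")

def capture(target_url):
    """Submit target_url to the archiving endpoint and return the HTTP status."""
    data = urlencode({"url": target_url}).encode()
    with urlopen(save_url, data=data) as resp:
        return resp.status
```

If the IP was ever needed to dodge DNS issues, pinning the hostname in the request while connecting to an IP is still possible, but plain HTTPS to the domain is the simpler pull request.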

Maybe of interest:

I recently open-sourced an archive.is package and command-line client written in Go:

https://jaytaylor.com/archive.is (aliased to https://github.com/jaytaylor/archive.is)

    go get jaytaylor.com/archive.is
So far I've been using it and it's worked reasonably well :)

One cool thing: I actually used this Python package as a starting point for figuring out how to automate archive.is submissions.


RE: archive.is: The person running archive.is deserves a lot of credit; it is a remarkable system. It may not be immediately clear how challenging it is to capture and bottle up (safely and _reliably_) the contents of arbitrary URLs until you actually try to build such a thing. The archive.is maintainer has implemented it at scale, and plans to keep the content available indefinitely [0], all on their own dime.

Mad props.

[0] https://archive.is/faq

Yet I still can't shake the idea of a potential hidden agenda behind the effort (not saying there is, just curious).

Why go to so much trouble and fund it out of pocket, when it doesn't seem to bring the maintainer anything in return over time?

For a long time access to archive.is was blocked to the whole of Finland, which was frustrating.

I would love it if Archive.is was more open about their process.

I have a fairly large bookmark collection and I want to archive it. Burdening archive.is with that seems a bit of a waste tbh; I'd rather host a screenshot and a self-contained HTML file myself without costing someone else money.

I've been a part of the archiving community for a number of years. Is archive.is now considered the de facto standard for snapshotting sites?

> Is archive.is now considered the de facto standard for snapshotting sites?

No way!

web.archive.org is the de facto standard for snapshotting[0] sites[1]!

[0] http://web.archive.org/save/https://news.ycombinator.com/ite...

[1] http://web.archive.org/web/*/https://news.ycombinator.com/it...

I wrote something similar some years back for a bot I used to run. Given an input URL, it concurrently attempted to create snapshots across as many archive sites (archive.is, Wayback Machine, etc) as possible, with caching, retries, smart backoff, and continuous updating of the destination as HTTP responses arrived. Always meant to release it as a standalone library, but never got around to doing so. I wonder where the code is now...
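The approach described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the commenter's actual code: the service endpoints are real, but `submit` is stubbed where the real HTTP call would go, and the retry/backoff numbers are arbitrary.

```python
import concurrent.futures
import time

# Illustrative endpoints; a real implementation would add more services.
SERVICES = {
    "archive.is": "https://archive.is/submit/",
    "wayback": "https://web.archive.org/save/",
}

def submit(endpoint, url, attempts=3):
    """Try one service with exponential backoff; return the snapshot URL or None."""
    for attempt in range(attempts):
        try:
            # A real implementation would POST/GET the endpoint here and
            # parse the snapshot URL out of the response.
            return endpoint + url
        except OSError:
            time.sleep(2 ** attempt)  # back off before retrying
    return None  # service down, slow, or blocking us

def snapshot_everywhere(url):
    """Fan the URL out to all services concurrently; collect results as they land."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(submit, ep, url): name
                   for name, ep in SERVICES.items()}
        return {futures[f]: f.result()
                for f in concurrent.futures.as_completed(futures)}
```

The key point of the design is that one slow or broken service can't block the others, since each runs in its own worker and failures degrade to `None` rather than raising.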

Gwern has released such a piece of software: https://github.com/gwern/archiver-bot

See also http://www.gwern.net/Archiving-URLs for a description of their usage of it.

Sort of. Looks like it archives to each service serially, doesn't retrieve the resulting snapshot URL, and doesn't do any caching, retries, etc. If one service goes down, starts responding slowly, or blocks your requests, it'll break. This is like the first version of my own implementation; it took a lot more work from there to make it robust.

Why is everyone using archive.is now? What happened to web.archive.org?

Archive.is handles JavaScript-heavy pages somewhat better than the Wayback Machine, and it ignores robots.txt.

Yeah, but archive.org is run by a reputable non-profit with strong governance and support, which we can be fairly confident will remain operational for the foreseeable future. Who runs archive.is? How certain are we that they will be around 1, 5, 10 years into the future? How certain are we that the links will still work?

This is archiving that we're talking about, after all.
