
Linkarchiver, a new bot to back up tweeted links - luu
https://parkerhiggins.net/2017/07/linkarchiver-a-new-bot-to-back-up-tweeted-links/
======
cyphar
URLTeam (part of the larger ArchiveTeam) are doing this for quite a few
different URL shortening services and they're providing torrents to download
the database[1] (similar to how they created torrents for the entirety of
GeoCities[2]).

I would recommend contributing to their various tools[3] so that the archive
can be maintained for future digital archeologists.

The Wayback Machine is run by the Internet Archive (and ArchiveTeam is a
project run in part by folks from the Internet Archive and a whole gaggle of
volunteers), so I'm not really surprised they were happy with you providing
backups of URL-shortened links. But I would recommend helping URLTeam more
directly (since their method of keeping archives of URL shorteners is much
more efficient).

[1]: [http://urlte.am/](http://urlte.am/)

[2]: [https://thepiratebay.org/torrent/5923737/Geocities_-_The_Torrent](https://thepiratebay.org/torrent/5923737/Geocities_-_The_Torrent)

[3]: [https://github.com/ArchiveTeam/terroroftinytown](https://github.com/ArchiveTeam/terroroftinytown)

~~~
thisisparker
ArchiveTeam and URLTeam do great work, but those projects and the OP have
different goals. (I'm the OP.) This bot will NOT produce any kind of corpus of
tweeted links or links shortened by Twitter's t.co shortener; in fact, it
bypasses t.co entirely and I don't have any kind of record of those shortened
links.

Instead, it backs up the contents of the pages that are linked to at the
_time_ of the tweets. Frankly, my tool doesn't do anything interesting at all
with the URLs — it just submits the "expanded URL" that was tweeted to the
Wayback Machine and lets it sort out any and all 301s.
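
The core of it is little more than one request to the Wayback Machine's "Save
Page Now" endpoint. A rough sketch of that step (not the real code; the
function name is made up):

    import requests

    def archive(expanded_url):
        # Ask the Wayback Machine's "Save Page Now" endpoint to snapshot the
        # page; the Wayback Machine follows any 301s/302s on its own end.
        resp = requests.get("https://web.archive.org/save/" + expanded_url,
                            timeout=60)
        resp.raise_for_status()
        # The snapshot path is typically reported in the Content-Location header.
        return resp.headers.get("Content-Location")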

~~~
gojomo
It might be nice to save as well:

(1) the tweet-detail page, for the tweet that includes the link;

(2) the t.co mapping, so that the tweet-detail page's t.co link can somehow be
resolved to the (archived) page to which it links.

I don't think there are any blocks against doing (1).

Unfortunately for (2), Twitter has a blanket robots.txt prohibition in place
for domain t.co. Perhaps IA could be convinced to ignore that robots.txt in
the public interest.

Alternatively, perhaps another site could be set up that itself accepts t.co
link-paths, in the background queries t.co, and returns both an HTML page and
working redirect that _isn't_ robots.txt-blocked. LinkArchiver (and any other
similar sites) could as a convention archive responses of this other site
whenever they'd like to archive t.co.
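
A resolver like that (or LinkArchiver itself) could record the mapping with a
single redirect-following request, since t.co generally answers non-browser
clients with a plain HTTP redirect. Rough sketch, assuming the Python
`requests` library:

    import requests

    def record_tco_mapping(tco_url):
        # Follow the t.co redirect chain; t.co typically serves a plain 301
        # to non-browser clients, so allow_redirects resolves it directly.
        resp = requests.head(tco_url, allow_redirects=True, timeout=15)
        # The (tco_url, resp.url) pair is the mapping worth archiving somewhere
        # that isn't blocked by t.co's robots.txt.
        return tco_url, resp.url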

------
jcahill
> It also required that I figure out how to “daemonize” the script[.] I found
> this aspect surprisingly difficult; it seems like a really basic need, but
> the documentation for how to do this was not especially easy to find. I host
> my bots on a Digital Ocean box running Ubuntu, so this script is running as
> a systemd service.

Agreed on the relative dearth of well-gisted tutorial content re: routine
sysadmin tasks.
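
For what it's worth, the unit file for a bot like that usually boils down to
about a dozen lines (paths and names here are made up):

    [Unit]
    Description=LinkArchiver Twitter bot
    After=network.target

    [Service]
    ExecStart=/usr/bin/python3 /opt/linkarchiver/bot.py
    Restart=on-failure
    User=linkarchiver

    [Install]
    WantedBy=multi-user.target

Drop it in /etc/systemd/system/, `systemctl enable --now` it, and you're done.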

Outlets that pump out tons of marginally-differentiable content on <today's
unnecessary stack element that'll have you terraforming the noosphere 333%
more, free> tend to prevail by sheer numbers.

That being said, systemd is its own problem, one worth abstracting away from.
Supervisor⁽¹⁾⁽²⁾ might be the layer you're looking for.
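
A minimal [program:x] stanza for something like this would be (paths again
hypothetical):

    [program:linkarchiver]
    command=/usr/bin/python3 /opt/linkarchiver/bot.py
    directory=/opt/linkarchiver
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/linkarchiver.out.log
    stderr_logfile=/var/log/linkarchiver.err.log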

____________________

¹ [http://supervisord.org/index.html](http://supervisord.org/index.html)

² [https://digitalocean.com/community/tutorials/how-to-install-and-manage-supervisor-on-ubuntu-and-debian-vps](https://digitalocean.com/community/tutorials/how-to-install-and-manage-supervisor-on-ubuntu-and-debian-vps)

