

ArchiveTeam + Yahoo Messages Shuttering + EC2 Spot Instances = MegaCrawl - duggan
http://rossduggan.ie/blog/technology/archiveteam-yahoo-messages-shuttering-ec2-spot-instances-megacrawl/

======
tgeek
I built 2 CloudFormation templates to allow you to easily spin up a ton of
these things across multiple availability zones. It uses the Amazon Linux AMI
that exists in each region instead of the ones listed, and builds up the
dependencies and the application on the fly. You can run either of these two
templates in any region and it should just work.

Download one of these files:

With a keypair (so you can log in to the host)
<http://files.wordsaboutbytes.com/yahoo-messages-save.cf.txt>

Without a keypair (can’t log in locally, but it will run)
<http://files.wordsaboutbytes.com/yahoo-messages-save-nokeypair.cf.txt>

Then:

1\. Open the console (<https://console.aws.amazon.com>)

2\. Go to CloudFormation

3\. Give your stack a name

4\. Browse and select the file you downloaded from above

5\. Click Next.

6\. Fill in the parameters (number of instances, the nick you want to be
tracked with at the ArchiveTeam site, the spot price you are willing to pay,
and optionally a keypair if you selected that file).

7\. Check the box at the bottom acknowledging that the template will create
IAM resources (used by the host to bootstrap).

8\. Click Continue.

9\. Add tags if you want, or just click Continue.

10\. Review. Click Continue.

11\. Close.

This will launch however many instances you told it to, as t1.micro instances,
at the spot price you set. When you want to stop, just delete the stack in
this console and everything should go away.

I'm running this right now in us-west-2, spread across all 3 AZs there, about
90 instances total, and cranking through things.
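
For anyone who would rather script the console steps above, here is a minimal sketch using boto3, which is my choice and not something the templates require. The parameter keys (`InstanceCount`, `Nick`, `SpotPrice`, `KeyName`) are hypothetical; check the `Parameters` section of whichever template you downloaded for the real names. The `CAPABILITY_IAM` entry corresponds to the checkbox in step 7.

```python
def build_stack_request(stack_name, template_body, instance_count, nick,
                        spot_price, key_name=None):
    """Build the keyword arguments for a CloudFormation CreateStack call."""
    # Hypothetical parameter keys: replace them with the names that the
    # downloaded template actually declares.
    params = {
        "InstanceCount": str(instance_count),
        "Nick": nick,
        "SpotPrice": str(spot_price),
    }
    if key_name is not None:
        # Only the "with a keypair" template would take this parameter.
        params["KeyName"] = key_name
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted(params.items())
        ],
        # Same acknowledgement as the IAM checkbox in the console.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def launch_stack(region, request):
    """Submit the request; needs boto3 installed and AWS credentials set up."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("cloudformation", region_name=region)
    return client.create_stack(**request)["StackId"]
```

Usage would be something like `launch_stack("us-west-2", build_stack_request("yahoo-save", open("yahoo-messages-save-nokeypair.cf.txt").read(), 90, "yournick", "0.005"))`, with the bid and count being illustrative values. Deleting the stack afterwards is `client.delete_stack(StackName=...)`.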

------
duggan
For those having trouble with the EC2 instructions: I think the ArchiveTeam
Warrior[0] (which is much easier to get up and running on your laptop, etc.),
running over my tethered cellphone, is my most performant client.

Yahoo don't appear to rate limit mobile devices / IP blocks as aggressively as
everything else (probably because cellular providers tend to have many
customers behind one IP).

[0]
<http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior>

~~~
duggan
I've also made the image available in all regions, for those who want to run
in additional regions:

    
    
      N. Virginia: ami-2400984d
      Ireland: ami-d8d2d8ac
      Tokyo: ami-a361e1a2
      Singapore: ami-6e703c3c
      Sydney: ami-4e0e9f74
      São Paulo: ami-9d7aa180
      N. California: ami-94f6dbd1
      Oregon: ami-cf9206ff

------
conroy
If you have boto installed, this is all you'll need to do:

    
    
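        import boto.ec2

        # Connect to the region that hosts the AMI (N. Virginia here).
        conn = boto.ec2.connect_to_region("us-east-1")
        # Bid at most $0.005/hour for one t1.micro; USERNAME is a placeholder.
        conn.request_spot_instances('0.005', 'ami-2400984d',
                                    instance_type='t1.micro', user_data='USERNAME')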
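
The same call can be repeated per region. Below is a rough sketch, assuming the AMI IDs duggan listed earlier in the thread and my own mapping of those region names to AWS region codes. The helper names `spot_request_plan` and `submit` are made up for illustration, and nothing here checks that the AMIs still exist.

```python
# AMI IDs copied from the list earlier in the thread, keyed by region code.
AMI_BY_REGION = {
    "us-east-1": "ami-2400984d",       # N. Virginia
    "eu-west-1": "ami-d8d2d8ac",       # Ireland
    "ap-northeast-1": "ami-a361e1a2",  # Tokyo
    "ap-southeast-1": "ami-6e703c3c",  # Singapore
    "ap-southeast-2": "ami-4e0e9f74",  # Sydney
    "sa-east-1": "ami-9d7aa180",       # Sao Paulo
    "us-west-1": "ami-94f6dbd1",       # N. California
    "us-west-2": "ami-cf9206ff",       # Oregon
}


def spot_request_plan(nick, max_price="0.005", count=1, regions=None):
    """Return (region, kwargs) pairs, one spot request per region."""
    if regions is None:
        regions = sorted(AMI_BY_REGION)
    plan = []
    for region in regions:
        kwargs = {
            "price": max_price,                  # maximum bid per hour, as a string
            "image_id": AMI_BY_REGION[region],
            "count": count,                      # instances requested in this region
            "instance_type": "t1.micro",
            "user_data": nick,
        }
        plan.append((region, kwargs))
    return plan


def submit(plan):
    """Send every planned request; needs boto and AWS credentials."""
    import boto.ec2  # imported lazily so the plan can be inspected without boto

    for region, kwargs in plan:
        conn = boto.ec2.connect_to_region(region)
        conn.request_spot_instances(**kwargs)
```

Building the plan first makes it easy to print and sanity-check the bids before anything is actually requested, e.g. `submit(spot_request_plan("USERNAME", regions=["us-east-1", "us-west-2"]))`.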

------
Auguste
We're currently at 16,000+ items and 61 GB uploaded. Nice work.

Edit: 25 minutes later and we are at 18,700+ items and 67 GB uploaded.
Distributed computing at its finest!

------
vitovito
When your spot instance gets killed mid-download, does the Warrior system
handle that and re-assign it to someone else immediately?

Or was your spot instance assigned some URLs to download with the assumption
that your Warrior would be reliable, and now they won't get reassigned until
they check them all at the end, which may be too late?

~~~
Cameron_D
They will be reassigned later on

------
pronoiac
If you'd rather run a script than a vm, check out:

<https://github.com/ArchiveTeam/yahoomessages-grab>

~~~
sp332
That works for this project, but having the warrior makes it easy to join
other ArchiveTeam projects now and in the future.

------
lignuist
I always hate it when companies remove user-generated content from the
internet. Why doesn't Yahoo just send some DVDs with the content to
Archive.org?

~~~
knackers
I was thinking the same thing whilst reading the article. There are surely
people at Yahoo! who would gladly help if asked.

~~~
Cameron_D
I'm sure there are people at Yahoo who would want to help; the problem is
actually finding and getting in contact with someone who does care.

~~~
seanp2k2
...then clearing it with legal and getting approval within the org to actually
/do/ this. AFAIK, this would be unprecedented, but would probably win Yahoo! a
lot of fans when they need it most. I have no real hope for them or anyone in
a similar position to provide user data back to community custodians.

~~~
ersii
.. as long as you forget all the other user-generated-content sites Yahoo has
closed over the years.

------
JustARandomGuy
Is there a way we can donate to this? If you can supply a Paypal account, I'd
gladly throw in a few bucks.

~~~
MichaelStubbs
Their wiki answers your question:
<http://archiveteam.org/index.php?title=Posterous#Can_I_donate_some_cash_instead.3F>

------
rpicard
I've got 50 spot requests running and another 50 pending evaluation. I'm
really curious to see how much I can contribute with that in 24 hours.

------
seanp2k2
Happy to help, but it looks like we're getting throttled hard by Yahoo. Anyone
have a contact over there who they can ping about this?

------
kogir
Is this whole archival movement the Internet equivalent of hoarding? When is
it ok to clean house?

~~~
sp332
Several groups tried to salvage as much of Geocities as possible before Yahoo
killed it. They got most of it (about 1 terabyte) and you can fix most
geocities.com links by changing them to point at reocities.com instead. The
main reason is that it's user data and deleting it is rude. Second is that you
can do interesting analysis on a terabyte of user accounts during the boom of
the internet. <http://contemporary-home-computing.org/1tb/archives/3297> The
third is that it's history! NSFW example from Geocities:
<http://contemporary-home-computing.org/1tb/archives/2736>

