Hacker News
ArchiveTeam + Yahoo Messages Shuttering + EC2 Spot Instances = MegaCrawl (rossduggan.ie)
74 points by duggan on March 23, 2013 | 21 comments



I built two CloudFormation templates that let you easily spin up a ton of these things across multiple availability zones. They use the Amazon Linux AMI that exists in each region instead of the ones listed, and build up the dependencies and the application on the fly. You can run either template in any region and it should just work.

Download one of these files:

With a keypair (so you can log in to the host): http://files.wordsaboutbytes.com/yahoo-messages-save.cf.txt

Without a keypair (you can't log in to the host, but it will run): http://files.wordsaboutbytes.com/yahoo-messages-save-nokeypa...

Then:

1. Open the console ( https://console.aws.amazon.com )

2. Go to CloudFormation

3. Give your stack a name

4. Browse and select the file you downloaded from above

5. Click Next.

6. Fill in the parameters (number of instances, the nick you want to be tracked with on the ArchiveTeam site, the spot price you are willing to pay, and optionally a keypair if you chose that file).

7. Check the box at the bottom acknowledging that the template will create IAM resources (used by the host to bootstrap).

8. Click Continue.

9. Add tags if you want, or just click Continue.

10. Review. Click Continue.

11. Close.

This will launch however many instances you asked for, as t1.micro instances, at the spot price you set. When you want to stop, just delete the stack in this console and everything should go away.

Running this right now in US-West-2, spread across all 3 AZ's there, about 90 instances total, and cranking through things.


For those having trouble with the EC2 instructions: the Archive Warrior[0] is much easier to get up and running on your laptop, etc. I think the one I'm running over my tethered cellphone is my most performant client.

Yahoo doesn't appear to rate-limit mobile devices / IP blocks as aggressively as everything else (probably because cellular providers tend to have many customers behind one IP).

[0] http://www.archiveteam.org/index.php?title=ArchiveTeam_Warri...


I've also made the image available in all regions, for those who want to run in additional regions:

  N. Virginia: ami-2400984d
  Ireland: ami-d8d2d8ac
  Tokyo: ami-a361e1a2
  Singapore: ami-6e703c3c
  Sydney: ami-4e0e9f74
  Sao Paulo: ami-9d7aa180
  N. California: ami-94f6dbd1
  Oregon: ami-cf9206ff


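If you have boto installed, this is all you'll need to do:

    import boto.ec2
    # Connect to the region that hosts the AMI (N. Virginia here).
    conn = boto.ec2.connect_to_region("us-east-1")
    # Bid $0.005/hour for one t1.micro spot instance; USERNAME is a placeholder.
    conn.request_spot_instances('0.005', 'ami-2400984d',
                                instance_type='t1.micro', user_data='USERNAME')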
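
To cover every region in the AMI list above, the same call can be looped over a region-to-AMI map. This is a rough sketch: the region codes are the standard AWS identifiers for the listed locations, while `plan_requests` and `submit` are hypothetical helper names. `submit` repeats the `request_spot_instances` call from the snippet above, has not been tested against a live account, and is not invoked here.

```python
# AMI IDs copied from the list in this thread, keyed by AWS region code.
REGION_AMIS = {
    "us-east-1": "ami-2400984d",       # N. Virginia
    "eu-west-1": "ami-d8d2d8ac",       # Ireland
    "ap-northeast-1": "ami-a361e1a2",  # Tokyo
    "ap-southeast-1": "ami-6e703c3c",  # Singapore
    "ap-southeast-2": "ami-4e0e9f74",  # Sydney
    "sa-east-1": "ami-9d7aa180",       # Sao Paulo
    "us-west-1": "ami-94f6dbd1",       # N. California
    "us-west-2": "ami-cf9206ff",       # Oregon
}


def plan_requests(username, price="0.005", count=1):
    """Return one (region, kwargs) pair per region, sorted by region code."""
    plans = []
    for region, ami in sorted(REGION_AMIS.items()):
        kwargs = {
            "price": price,
            "image_id": ami,
            "count": count,
            "instance_type": "t1.micro",
            "user_data": username,
        }
        plans.append((region, kwargs))
    return plans


def submit(plans):
    """Send the spot requests. Needs boto and AWS credentials; not called here."""
    import boto.ec2  # imported lazily so planning works without boto
    for region, kwargs in plans:
        conn = boto.ec2.connect_to_region(region)
        conn.request_spot_instances(**kwargs)


plans = plan_requests("USERNAME")
for region, kwargs in plans:
    print(region, kwargs["image_id"])
```

Keeping the planning step separate from `submit` lets you review exactly what would be requested in each region before spending anything.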


We're currently at 16,000+ items and 61 GB uploaded. Nice work.

Edit: 25 minutes later and we are at 18,700+ items and 67 GB uploaded. Distributed computing at its finest!
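
For a sense of the rate those two snapshots imply, here is a quick back-of-the-envelope calculation. It treats the "16,000+" and "18,700+" figures as exact, which they are not, so read the result as a rough estimate; `rates` is just a hypothetical helper name.

```python
def rates(items_before, items_after, gb_before, gb_after, minutes):
    """Return (items per minute, GB per hour) between two snapshots."""
    items_per_min = (items_after - items_before) / float(minutes)
    gb_per_hour = (gb_after - gb_before) / float(minutes) * 60
    return items_per_min, gb_per_hour


# Figures quoted above: 16,000 items / 61 GB, then 18,700 items / 67 GB
# 25 minutes later.
items_per_min, gb_per_hour = rates(16000, 18700, 61, 67, 25)
print(round(items_per_min), round(gb_per_hour, 1))
```

That works out to roughly 108 items per minute and about 14.4 GB per hour across everyone contributing at the time.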


When your spot instance gets killed mid-download, does the Warrior system handle that and re-assign it to someone else immediately?

Or was your spot instance assigned some URLs to download with the assumption that your Warrior would be reliable, and now they won't get reassigned until they check them all at the end, which may be too late?


They will be reassigned later on


If you'd rather run a script than a vm, check out:

https://github.com/ArchiveTeam/yahoomessages-grab


That works for this project, but having the warrior makes it easy to join other ArchiveTeam projects now and in the future.


I always hate it when companies remove user-generated content from the internet. Why doesn't Yahoo just send some DVDs with the content to Archive.org?


I was thinking the same thing whilst reading the article. There are surely people at Yahoo! who would gladly help if asked.


I'm sure there are people at Yahoo who would want to help; the problem is actually finding and getting in contact with someone who cares.


...then clearing it with legal and getting approval within the org to actually /do/ this. AFAIK, this would be unprecedented, but would probably win Yahoo! a lot of fans when they need it most. I have no real hope for them or anyone in a similar position to provide user data back to community custodians.


.. as long as you forget all the other user-generated-content sites Yahoo has closed over the years.


Is there a way we can donate to this? If you can supply a PayPal account, I'd gladly throw in a few bucks.



I've got 50 spot requests running and another 50 pending evaluation. I'm really curious to see how much I can contribute with that in 24 hours.
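
A rough upper bound on what a fleet like that costs for 24 hours, assuming all 100 requests are fulfilled and using the $0.005/hour bid from the boto snippet elsewhere in this thread. Both are assumptions on my part: the actual spot charge can be lower than the bid, and this ignores storage and data transfer. `max_spot_cost` is a hypothetical helper name.

```python
def max_spot_cost(instances, bid_per_hour, hours):
    """Upper bound on spend: every instance runs the whole time at the bid."""
    return instances * bid_per_hour * hours


# 50 running + 50 pending requests, $0.005/hour bid, 24 hours.
cost = max_spot_cost(50 + 50, 0.005, 24)
print("at most about $%.2f" % cost)
```

So even the worst case for 100 t1.micro instances over a day is around $12.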


Happy to help, but it looks like we're getting throttled hard by Yahoo. Anyone have a contact over there who they can ping about this?


Is this whole archival movement the Internet equivalent of hoarding? When is it ok to clean house?


Several groups tried to salvage as much of GeoCities as possible before Yahoo killed it. They got most of it (about 1 terabyte), and you can fix most geocities.com links by changing them to point at reocities.com instead. The main reason is that it's user data, and deleting it is rude. The second is that you can do interesting analysis on a terabyte of user accounts from the boom of the internet: http://contemporary-home-computing.org/1tb/archives/3297 The third is that it's history! NSFW example from GeoCities: http://contemporary-home-computing.org/1tb/archives/2736


When bits are essentially free, there isn't much of a reason not to. Can you imagine how fascinating it would be to dive into the everyday culture of 100, or even 1000, years ago? The anthropological impact of archiving day-to-day life is huge. For once, history may be written from facts rather than by the victors.

That may be a bit naïve, but who cares if they aren't hurting anyone.



