How You Can Help Save Upcoming.org, Posterous, and More

evanmoran · on April 20, 2013

"While companies like Yahoo work to destroy as much human history as possible, Archive Team is the only group actively trying to save it."

What kind of article is this? It is cool they are trying to save some of our virtual history, but these kinds of statements aren't helpful. Clearly Yahoo's main intention isn't destruction.

lubujackson · on April 21, 2013

Destruction is exactly what "sunsetting" is all about. Or to be more accurate, it's like saying you're going to stop watering your fern. No, you're not destroying the fern, you're just... not keeping it alive anymore.

The problem with people using online services is that their data is stored there and can't be recovered ever again. You can say the data is pointless crud or everyone should know better, but I don't see how you can argue that the data isn't being lost against some people's wishes.

lhl · on April 21, 2013

I agree that Yahoo's intent isn't the destruction of historical data, and that the phrasing is hyperbolic, but as a description of the end result it is more accurate than not.

Internet companies and service providers like Yahoo have collected huge amounts of human communications, intent, and intellectual output ("UGC") that continues to have historical significance even if the commercial interest is no longer there.

Maybe these companies don't make any implicit promises for archiving this data, but as a society, we're a lot poorer for its wholesale destruction and we should be thinking a lot more about the social contract/implications...

brk · on April 21, 2013

I had never heard of Upcoming.org until a day ago. I had heard of Posterous, but never really used it.

Why do these sites need to be saved? It appears that they have been shut down due to lack of widespread traction or apparent value.

Sure, they are someone's "baby", and it's natural for some people to take this personally. But is there really truly anything of value in saving these sites?

gcr · on April 20, 2013

I don't understand. If this virtual machine:

- Downloads dying web pages, and

- Uploads them to the Wayback Machine at the Internet Archive,

then I'm not saving the Internet Archive any bandwidth at all, and this is no more efficient than them just downloading the site themselves.

The virtual machine itself is 174MB. This times however many volunteers means distributing the virtual machine is probably more stressful than the actual archiving operation.

lhl · on April 20, 2013

As mentioned, the issue is getting around YDOD throttling (and that the shutdown is happening in 10 days).

If you have proper boxen, you don't need the VM. Here's how to get it up and running on a clean Ubuntu setup, for example:

  # Build tools and libraries used by the steps below
  sudo apt-get install -y build-essential git libgnutls-dev liblua5.1-dev python-distribute
  # pip, then Archive Team's seesaw package (provides run-pipeline)
  sudo easy_install pip
  sudo pip install seesaw
  # Fetch the Upcoming grab scripts and build wget-lua
  git clone https://github.com/ArchiveTeam/yahoo-upcoming-grab.git
  cd yahoo-upcoming-grab
  ./get-wget-lua.sh
  # Start grabbing; replace [YOURNAMEHERE] with your own name
  run-pipeline pipeline.py [YOURNAMEHERE]

bdonlan · on April 21, 2013

Does that download posterous as well, or only upcoming?

lhl · on April 21, 2013

For Posterous you'd need to check out https://github.com/ArchiveTeam/posterous-grab and run that particular script, I believe.

mark_olson · on April 20, 2013

Yahoo and others either rate limit or temporarily ban IPs that access too many pages too quickly. Distributing the tools through a VM spreads the workload across many IPs in a way that is pretty easy to set up.
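
To put rough numbers on why spreading the work across IPs matters, here is a back-of-the-envelope sketch. Every figure in it (page count, per-IP request rate, number of volunteers) is a made-up assumption for illustration, not a measurement of Upcoming or of Yahoo's actual limits.

```shell
#!/bin/sh
# All three values are hypothetical; none come from Yahoo or Archive Team.
pages=1000000      # assumed number of pages left to fetch
rate_per_ip=2      # assumed requests per second one IP can make without a ban
workers=50         # assumed number of volunteers, each on a separate IP

# Seconds needed if a single IP fetches everything at the throttled rate.
single=$(( pages / rate_per_ip ))

# Seconds needed if the pages are split evenly across all workers.
distributed=$(( pages / (rate_per_ip * workers) ))

echo "one IP:      $(( single / 3600 )) hours"
echo "$workers IPs: $(( distributed / 3600 )) hours"
```

With a per-IP limit, total throughput grows roughly in step with the number of independent IPs, which is the point of handing the same tool to many volunteers.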

pronoiac · on April 20, 2013

The Internet Archive crawler is polite, going at a slow rate.

With a deadline looming so soon, a more aggressive effort is called for.

nwh · on April 21, 2013

It does actually save quite a lot of bandwidth in the process. The files are gzipped as they are uploaded to Archive Team's server, so your virtual machine will download a lot more than it uploads.
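
A quick way to see the effect of gzip on markup is the sketch below. The "page" is a made-up block of repetitive HTML, so the ratio it prints is only illustrative; real pages will compress more or less depending on their content.

```shell
#!/bin/sh
# Build a stand-in page of repetitive HTML (invented content, for illustration).
page=$(mktemp)
i=0
while [ "$i" -lt 2000 ]; do
  echo '<div class="event"><a href="/event/123">Some event</a></div>'
  i=$(( i + 1 ))
done > "$page"

# Size as fetched from the site versus size after gzip, in bytes.
# The $(( ... )) wrapper strips any padding that wc adds on some systems.
original=$(( $(wc -c < "$page") ))
compressed=$(( $(gzip -c "$page" | wc -c) ))

echo "raw:     $original bytes"
echo "gzipped: $compressed bytes"
rm -f "$page"
```

Markup repeats itself a lot, which is why the gzipped size usually comes out far smaller than the raw size.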