How You Can Help Save Upcoming.org, Posterous, and More (waxy.org)
31 points by neilk on April 20, 2013 | 11 comments


"While companies like Yahoo work to destroy as much human history as possible, Archive Team is the only group actively trying to save it."

What kind of article is this? It is cool that they are trying to save some of our virtual history, but these kinds of statements aren't helpful. Clearly Yahoo's main intention isn't destruction.


Destruction is exactly what "sunsetting" is all about. Or, to be more accurate, it's like saying you're going to stop watering your fern. No, you're not destroying the fern, you're just... not keeping it alive anymore.

The problem with people using online services is that their data is stored there and can't be recovered ever again. You can say the data is pointless crud or everyone should know better, but I don't see how you can argue that the data isn't being lost against some people's wishes.


I agree that Yahoo's intent isn't the destruction of historical data, and that the phrasing is hyperbolic, but the end result is more accurate than not.

Internet companies and service providers like Yahoo have collected huge amounts of human communications, intent, and intellectual output ("UGC") that continue to have historical significance even if the commercial interest is no longer there.

Maybe these companies don't make any implicit promises for archiving this data, but as a society, we're a lot poorer for its wholesale destruction and we should be thinking a lot more about the social contract/implications...


I had never heard of Upcoming.org until a day ago. I had heard of Posterous, but never really used it.

Why do these sites need to be saved? It appears that they have been shut down due to lack of widespread traction or apparent value.

Sure, they are someone's "baby", and it's natural for some people to take this personally. But is there really truly anything of value in saving these sites?


I don't understand. If this virtual machine:

- Downloads dying web pages, and

- Uploads them to the Wayback Machine at the Internet Archive,

That means I'm not saving the Internet Archive any bandwidth at all, and this is no more efficient than them just downloading the site themselves.

The virtual machine itself is 174MB. This times however many volunteers means distributing the virtual machine is probably more stressful than the actual archiving operation.


As mentioned, the issue is getting around YDOD throttling (and that the shutdown is happening in 10 days).

If you have proper boxen, you don't need the VM. Here's how to get it up and running on a clean Ubuntu setup for example:

  # Build tools, git, and the development libraries used when building wget-lua
  sudo apt-get install -y build-essential git libgnutls-dev liblua5.1-dev python-distribute
  # Install pip, then Archive Team's seesaw pipeline runner
  sudo easy_install pip
  sudo pip install seesaw
  # Fetch the Upcoming grab scripts and build the bundled wget-lua
  git clone https://github.com/ArchiveTeam/yahoo-upcoming-grab.git
  cd yahoo-upcoming-grab
  ./get-wget-lua.sh
  # Start the pipeline; replace [YOURNAMEHERE] with your own name
  run-pipeline pipeline.py [YOURNAMEHERE]


Does that download posterous as well, or only upcoming?


For Posterous you'd need to check out https://github.com/ArchiveTeam/posterous-grab and run that particular script, I believe.


Yahoo and others either rate limit or do temporary IP bans if you access too many pages too quickly. Distributing the tools through a VM distributes the workload in a pretty easy-to-setup way.
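
To put rough numbers on why spreading the work across many IPs matters, here is a back-of-the-envelope sketch. The 10-day deadline is from this thread; the page count and per-IP rate are made-up assumptions for illustration, not Yahoo's actual limits, and `workers_needed` is a hypothetical helper, not part of any Archive Team tool.

```python
import math

def workers_needed(total_pages: int, per_ip_rate: float, seconds_left: float) -> int:
    """Minimum number of distinct IPs needed to fetch every page before the
    deadline, assuming each IP may fetch at most per_ip_rate pages per second."""
    pages_per_worker = per_ip_rate * seconds_left
    return math.ceil(total_pages / pages_per_worker)

# Assumed figures: 10 million pages, 1 request/second tolerated per IP,
# and the 10 days remaining before the shutdown.
ten_days = 10 * 24 * 3600
print(workers_needed(total_pages=10_000_000, per_ip_rate=1.0, seconds_left=ten_days))
```

Under those assumptions a single polite crawler cannot finish in time, while a modest number of volunteers can, each staying under the limit.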


The Internet Archive crawler is polite, going at a slow rate.

With a deadline looming so soon, a more aggressive effort is called for.


It does actually save quite a lot of bandwidth in the process. The files are gzipped as they are uploaded to Archive Team's server, so your virtual machine will do a lot more downloading than it does uploading.
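
As a rough illustration of the compression point, here is a small sketch using Python's standard `gzip` module. The sample page is an invented stand-in for crawled HTML, not real Upcoming data, and real pages will compress less dramatically than this deliberately repetitive example.

```python
import gzip

def compression_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    compressed = gzip.compress(payload)
    return len(compressed) / len(payload)

# Hypothetical stand-in for a crawled page: highly repetitive HTML markup.
sample_page = b"<div class='event'><a href='/event/123'>Some event</a></div>\n" * 500

ratio = compression_ratio(sample_page)
print(f"original: {len(sample_page)} bytes, compressed/original: {ratio:.3f}")
```

Markup-heavy text tends to shrink well under gzip, which is why the upload side of the job is smaller than the download side.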



