
Come and help save Posterous from oblivion - jacquesm
http://jacquesmattheij.com/come-help-save-posterous-from-oblivion
======
kevinalexbrown
Whoa, instead of just lamenting the shutdown and how talent acquisitions are
horrid, and how VCs won't do the right thing and companies like Twitter are
killing innovation (you might believe these things to be true; that doesn't
change my point), these guys (Jason Scott and the Archive Team and jacquesm)
did something about it. And when Jacques couldn't do it himself he organized
other people.

My favorite kind of leadership, and an example of the (double-edged) sword of
instant communication: people can be rallied around time-sensitive causes,
like SOPA or posterous shutting down.

I know this seems a little obvious, but it's striking how rare it actually
seems to be. I'm curious why. Maybe it's just my perception and it's happening
all the time. Certainly people are doing great things, but I'm curious why we
haven't yet seen more specific, directed actions like this. Does it depend on
relatively homogeneous communities like Reddit (SOPA) and HN (this)? If there
is a proliferation of such communities, say subreddits or otherwise, could we
expect this to happen more frequently? Do we want it to happen more
frequently, or do we run the risk of, say, DHS running a pro-search campaign
like China's 50 cent army? I'm just curious about why this seemed so striking
to me.

This also fits in with an article I'd been meaning to read by Fukuyama[0] on
social capital written 15 years ago.

 _The vice of modern democracy is to promote excessive individualism, that is,
a preoccupation with one's private life and family, and an unwillingness to
engage in public affairs. Americans combated this tendency towards excessive
individualism by their propensity for voluntary association, which led them to
form groups both trivial and important for all aspects of their lives._

Perhaps we'll see more of these spontaneous actions as voluntary associations
are easier to make as the infrastructure that supports them (e.g. reddit)
becomes more well known and fine-tuned.

[0]
[http://www.imf.org/external/pubs/ft/seminar/1999/reforms/fuk...](http://www.imf.org/external/pubs/ft/seminar/1999/reforms/fukuyama.htm)

~~~
MatthewPhillips
I find this particularly disturbing:

> I made an offer to continue to host posterous.com and all the stuff on it
> but never received an answer.

Has Twitter completely lost touch with the web community? Is it possible to
get a response from them these days if you're not an advertiser?

~~~
corin_
As someone who spends reasonably large amounts of money on digital media buys,
I've found that most companies like Twitter (though actually I have no
experience with Twitter, except as a user) certainly are a lot more friendly
when you have money to spend.

Fun story about Facebook: when I first wanted to start spending decent amounts
with them (decent not huge - $10,000s/month) a few months ago I literally
could not get in touch with a single person. Even completing contact forms I
wasn't hearing back from them. A friend who used to work for an SEO/SM company
told me a name of someone to contact, I used LinkedIn's InMail (a paid
feature) to message him, and 24 hours later I had 3 account managers
(including a technical expert, a media strategist and an overall account
manager), who answer calls to their mobiles at any time of day. Now I have a
nice route to get answers on any topic, not just paid advertising, thanks to
my spending. (Was actually shocked about how hard it was initially to make
contact and give them money, too.)

------
AlexMuir
This is a great effort. But it makes me furious that the founding team can be
so disparaging to their users.

Sachin Agarwal, you used this community to enrich yourself and further your
own career. In return, at the very least, you owe an explanation for why such
a convoluted effort has to be made to get this content off the servers.

It's also exceptionally discourteous to ignore emails from upstanding
community members, but perhaps you missed these. But I know for sure that you
will read this thread, so I'd love to hear why a database dump can't be
provided, or a couple of IPs whitelisted to just rip through a scrape.

------
dpcan
How does one opt out? I've migrated my blog to my own system so if I want to
make changes to an old post I can, but I won't be able to control my content
that you are archiving which is a problem for me. It's actually rather
bothersome to me. I figured I'd just go make sure my blog is deleted, but it
could be archived by now.

~~~
craigmc
That is the world wide web. You published it and anyone could save a copy of
what you published (right-click, save as) at any time. This is no different
surely?

~~~
skeletonjelly
And Facebook not actually deleting your photos when you click the "delete"
button is ok too right? It's published forever?

~~~
Dylan16807
I would presume that Facebook photos are usually not public. There is quite a
distance between the way private and public data should be handled.

------
mseebach
It took me a bit of fiddling to get running on a spare Debian box, so I
thought I'd share:

    
    
        $ sudo apt-get install virtualbox-ose
        $ wget http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova
        $ tar xf archiveteam-warrior-v2-20121008.ova 
        $ VBoxManage import archiveteam-warrior-v2-20121008.ovf
        $ screen
        $ VBoxHeadless --vnc --startvm archiveteam-warrior-2
    

Hit Ctrl+A, D to exit screen and leave the VM running. From a non-headless
box:

    
    
        $ ssh -L8001:localhost:8001 you@yourserver.com
    

Point a browser to <http://localhost:8001>

EDIT: Added 'screen' to steps. You're gonna wanna use screen.

~~~
ersii
Thanks a lot for writing that up. I'll be sure to add it to the FAQ/wiki on
<http://www.archiveteam.org/index.php?title=Posterous>

------
wheaties
You're my new hero. I don't use it and I never used geocities but this is
still awesome. Historians will thank you in years to come. Sociologists and
such will praise what can be mined. And the lists go on...

~~~
sp332
Here's one of the best blogs about the Geocities data:
<http://contemporary-home-computing.org/1tb/>

~~~
runarb
The screenshot archive that page mentions is fantastic:
<http://oneterabyteofkilobyteage.tumblr.com/>

Brings back so many memories about the early days :)

~~~
jacquesm
Most of those pages are still alive, for instance:

<http://www.reocities.com/Area51/vault/5058/>

Just change the 'g' from geocities to the 'r' of reocities.
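
If you have a whole list of old links, the swap is easy to script. A rough sketch, assuming the mirror keeps the same path after the hostname; `to_reocities` is just a made-up helper name:

```shell
# Hypothetical helper: rewrite a geocities.com URL to point at reocities.com.
# Assumes the path after the hostname is the same on the mirror.
to_reocities() {
    printf '%s\n' "$1" | sed 's/geocities\.com/reocities.com/'
}

to_reocities "http://www.geocities.com/Area51/vault/5058/"
```

Only the first `geocities.com` in each URL is replaced, so the rest of the path is left alone.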

------
jdrock
@jacquesm please send me a list of URLs to crawl (10M+), and I'll set up an
80legs job to do this. shion - at - 80legs - com.

~~~
ersii
The problem is that Posterous is hard to crawl. For one, they'll continuously
and automatically ban your IPs, even if you rotate through a lot of them. For
another, Posterous can't handle all of the requests.

We (ArchiveTeam) have unfortunately made Posterous unresponsive multiple times.
So please be careful not to bring it down completely if you're doing a solo
effort.

Please also bear in mind that it's not just a matter of "chucking it into the
downloader".
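
For anyone attempting a solo crawl anyway, the "be careful" part roughly means one request at a time with a pause in between. A minimal sketch of that idea, not what the Warrior actually runs; `FETCH_CMD`, `DELAY` and `polite_fetch` are illustrative names, and the 5 second default is an arbitrary guess:

```shell
# Sketch of a "polite" fetch loop: one request at a time, with a pause
# between requests. FETCH_CMD and DELAY are illustrative knobs, not
# ArchiveTeam settings.
FETCH_CMD="${FETCH_CMD:-wget --quiet --tries=2}"
DELAY="${DELAY:-5}"

polite_fetch() {
    # Reads one URL per line from stdin.
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        $FETCH_CMD "$url" || echo "failed: $url" >&2
        sleep "$DELAY"                     # back off before the next request
    done
}

# Usage: polite_fetch < urls.txt
```

This does nothing about IP bans; it only keeps a single client from hammering the site.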

~~~
ersii
Also, please use a sensible format if you're crawling/archiving this.

We're using WARC (Web ARChive), which is an official ISO file format standard
that the Internet Archive's Wayback Machine can use. It's also a good
format for archiving web pages in general.
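
If you just want to see what producing a WARC looks like, GNU Wget can write one directly (1.14 or newer has `--warc-file`). A hedged sketch, not necessarily how the Warrior does it; `warc_grab`, `DRY_RUN` and the example blog URL are placeholders:

```shell
# Sketch: save one page plus its images/CSS into a WARC file with GNU Wget.
# Needs Wget 1.14 or newer for --warc-file. Function name, DRY_RUN switch
# and the example URL below are placeholders.
warc_grab() {
    url="$1"
    name="$2"
    # --page-requisites: also fetch images, CSS and scripts the page needs
    # --span-hosts:      allow requisites that live on other hostnames
    # --warc-file:       write NAME.warc.gz alongside the normal download
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} wget --page-requisites --span-hosts \
        --warc-file="$name" "$url"
}

# Usage: warc_grab "http://example.posterous.com/" example-blog
```

Whether `--span-hosts` is needed depends on where a given blog's images are hosted, which is an assumption here.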

------
btown
For those of us who might want to donate cloud computing time but have
weak/memory-limited laptops, is there an EC2 image that we could fire up for
the cause?

~~~
veeti
The OVA image is a standard format. I'd imagine EC2 would support importing
it.

~~~
btown
Sadly, it doesn't seem to (ERROR: Unknown disk image format: OVA). I wouldn't
know the first thing about how to _reliably_ convert the format.

~~~
mintplant
StackOverflow (well, _ServerFault_) has a little bit on it:
[http://serverfault.com/questions/387049/how-to-run-an-ova-ov...](http://serverfault.com/questions/387049/how-to-run-an-ova-ovf-appliance-inside-of-aws-ec2)

Rather inconclusive, though.

------
guelo
In the long term, web startups will be dead because it will become more and
more obvious that it's impossible to trust your data to for-profit companies
that simply cannot keep your interests in mind, no matter what promises they
make. Everything important should be on open-source, community-run, nonprofit
platforms.

~~~
ricardobeat
Or be hosted with a company that has a sustainable business model from the
get-go.

------
Samuel_Michon
Well, this is exciting: we've just now passed the halfway point!

items: 1470951 done, 1467514 to do

[url to leaderboard deleted upon request]

(NB: it does look like this thing has been underway since Feb 27)
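
Taking the two counts above at face value, the "halfway" claim is easy to check. A quick sketch; `percent_done` is a made-up helper:

```shell
# Share of items done, given "done" and "to do" counts.
percent_done() {
    LC_ALL=C awk -v d="$1" -v t="$2" \
        'BEGIN { printf "%.2f\n", 100 * d / (d + t) }'
}

percent_done 1470951 1467514   # prints 50.06
```

That is 1470951 out of 2938465 items, so just over half.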

~~~
ersii
Please don't link to the leaderboard (the link in the parent post); it is a bit
loaded and fragile, and it is used by all the ArchiveTeam Warriors.

Also, unfortunately, those numbers aren't totally up to scratch. Please hang
on, we'll have a FAQ describing that in a moment.

~~~
Samuel_Michon
Thanks for the heads up. I've removed the link.

~~~
ersii
Thanks for doing so. I'll be sure to include your thoughts from your post (re
our goals) in our FAQ by the way.

------
danparsonson
Here's a link to VMware's OVF converter tool if you use VMware instead of
VBox:
[http://communities.vmware.com/community/vmtn/server/vsphere/...](http://communities.vmware.com/community/vmtn/server/vsphere/automationtools/ovf)

~~~
sp332
The VMware Player (and Workstation, I'm assuming) will import it
automatically. The only issue is that you have to choose a different
connection for the second virtual disk. Not sure why that's a problem but
moving it works fine.

------
bherms
Why is it so important to save all this information? Seems like projects like
archive are just contributing to more information clutter. We generate more
information every single day; what makes a few million people's blog posts so
damn important? We can't just keep saving shit forever. The progression of
technology makes it easier and easier, I guess, but eventually we're going
to hit a limit.

~~~
ersii
How are we to know what's important and not? Surely, there's interesting
content available at Posterous. Just to mention an example, CloudFlare's blog
is hosted there.

Sure, there are plenty of spam accounts and crappy content, but that might
prove valuable in the future. Maybe someone will study what kind of content
we as a race were contributing to that kind of platform in this day and age;
maybe someone is researching the automated spam.

This is not really taking up all that much space, in this day and age. There's
around 2.2 TB downloaded - it's mostly text and images. That's half a single
4TB drive. Not really storage capacity to fight about in my opinion.

~~~
bherms
Yeah I guess you're right about the storage piece, however, I don't think it's
useful at all. We always live in the moment of "right now is the most
important moment in history", when really most of the content we're saving is
junk, and, as more and more of it compounds, more and more junk will just
accumulate on the pile. I'd assume that 90% of what's in posterous is
worthless, the other 10% is just people reiterating good points, but the key
word is _re_iterating. Do we really need tens, then hundreds, then thousands
of years of files of things people said on personal blogs in the past?
Absolutely not.

------
josephlord
Ethically this is far better behaviour than those who are shutting down the
service and there is altruism rather than profit motivation BUT legally isn't
this epic scale copyright infringement of millions of works created by
thousands of people?

~~~
sp332
It's been discussed. Because it was published in a public forum, fair use is
certainly a consideration. _Is it legal to copy stuff from websites without
permission? U.S. courts haven’t made a clear determination. Andy Sellars, a
staff attorney at the Citizen Media Law Project, says he would argue that it
counts as “fair use” under copyright law. However, he notes that the Archive
Team’s torrents don’t offer a mechanism for copyright holders to demand that
certain material be taken down, which could hurt its case in a court._
[http://www.technologyreview.com/featuredstory/426434/fire-in...](http://www.technologyreview.com/featuredstory/426434/fire-in-the-library/)

------
itsnotvalid
I think Posterous, or whoever is ultimately responsible, should be giving the
data to archive.org directly. This is the only sensible thing to do. Of
course, they would have to clear copyrights first.

------
cedricd
I love it. This is creating a benign botnet.

~~~
mortenjorck
Benign merely denotes non-maliciousness – I would call this a benevolent
bonnet.

------
frasierman
I think I saw a comment somewhere about your IP being banned after an hour...
anything we should do to avoid this? I'd hate to be scanning for 15 minutes
only to be banned and not be able to help anymore.

~~~
sp332
The bans nominally last 24 hours. There was a point where there were so many
IPs running (from AWS servers) they overflowed the ban list and the bans were
shorter!

~~~
frasierman
I didn't even think about AWS... someone should put together an article on how
to set them up... I'd be happy to launch some instances for the cause

~~~
ersii
We've had a few guys using Amazon Web Services and continuously rotating
IPs/setting up new instances. Unfortunately, the last time we went too 'hard'
on them, effectively making Posterous unavailable.

We're thinking about this right now. Feel free to hang around #preposterus on
EFNet for updates.

------
savrajsingh
Awesome! Love the spirit of the effort, running a Warrior now for kicks. Just
curious, isn't it possible for someone at Google to press a button and make
this happen? :)

~~~
ersii
Most likely, yes. They probably have most of Posterous cached already.

It's however very unlikely that they'd just release that data. They probably
have their own archive format and things as well..

------
tectonic
Are you guys archiving photos / images too? When I just wget the site those
don't generally come down. I don't know if they're being loaded via JavaScript
or a plugin.

To preserve my own blog I just saved it as a pdf with wkhtmltopdf, but it'd be
nice to have a full HTML version.

[http://blog.andrewcantino.com/blog/2013/03/12/archive-a-pdf-...](http://blog.andrewcantino.com/blog/2013/03/12/archive-a-pdf-of-your-posterous-blog/)

------
lutusp
This event, and many like it, only remind me that the modern Internet is about
which groups you belong to, not who you are as an individual.

Websites owned and operated by individuals are now vanishingly rare, while
aggregations of people -- Facebook, Twitter, et al. -- have become the norm.

Example individual website: <http://arachnoid.com/>

~~~
ersii
Yes, and in my mind, it's a bit unfortunate to have it all centralized to a
few major portals.

Especially since nothing lasts forever, like we clearly see here. When a giant
falls..

------
hedgehog
Torrent for the VirtualBox appliance:

magnet:?xt=urn:btih:b1b3df637bf9bb78f32e1667944535b60c9d37c1&dn=archiveteam-warrior-v2-20121008.ova

~~~
jacquesm
c8c86a1e225bb28ccdd229f6f27fe39ad65c9831

Sha1 of image, in case you want to verify.
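
One way to check it once the download finishes. A small sketch, assuming the hash above is for the `.ova` file itself and that `sha1sum` from GNU coreutils is available (on macOS, `shasum -a 1` is the usual equivalent); `verify_sha1` is a made-up helper:

```shell
# Compare a file's SHA-1 against an expected hex digest.
verify_sha1() {
    file="$1"
    expected="$2"
    # sha1sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha1sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Usage:
# verify_sha1 archiveteam-warrior-v2-20121008.ova c8c86a1e225bb28ccdd229f6f27fe39ad65c9831
```

A non-zero exit status means the file does not match and should be downloaded again.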

------
doki_pen
Something like this would be cool as a Chrome app. I'm not sure how
complicated the job is, but if it's just a matter of hitting APIs and queueing
data for upload, it should be possible with local storage. You could even use
subdomains to defeat storage limits.

------
niggler
"I made an offer to continue to host posterous.com and all the stuff on it but
never received an answer."

Have you tried making a public appeal (say, over twitter) to the Posterous
owners? They may be able to talk with Twitter and arrange to capture the
content directly

~~~
jacquesm
I will definitely make another one but it was here on HN to one of the former
posterous owners.

------
dasil003
Stoked to help with this. Will put my new BT fibre to the test.

~~~
ersii
Bandwidth in itself is not per se needed in this project. IPs, however, are
very useful.

Feel free to stop by our project IRC channel at #preposterus (Note: It's not
spelled pre-posterous) or our main channel at #archiveteam on EFNet

~~~
Samuel_Michon
_"Bandwidth in itself is not per se needed"_

I guess so; I have a 200Mbit fiber connection but surprisingly little is
happening.

I started the instance 10 minutes ago and it has only uploaded 20MB. I've set
'concurrent items' and 'rsync threads' to the max.

~~~
ersii
We're rate limiting how many items/users get handed out, because Posterous is
very fragile and we're blowing away their front-end caches, which they rely
on heavily.

Because of this, we practically hit the back end every time (as do other
users of Posterous, because we blow the cache away), which makes Posterous
very slow.

We've run a few hundred more threads earlier, successfully making Posterous
completely unavailable, unfortunately.

------
Zirro
The large "Run your own ArchiveTeam Warrior!" link is currently just linking
to a fragment identifier. I think you'd like to change it.

~~~
jacquesm
Good point, thank you, fixed now.

------
zopticity
Stop being a necromancer and resurrecting a dead service. There's got to be a
good reason why they killed Posterous, so let it die. Let it go; stop holding
onto the past.

~~~
jacquesm
Please read the intro to the blog post and think again. I've seen these
comments by the boatload when we took on saving geocities and you are simply
_wrong_.

~~~
saraid216
Is there a place I can find this Geocities archive? I apparently missed
something I wanted to keep when I archived my own stuff, and I'm wondering if
I can find it again.

~~~
jacquesm
<http://reocities.com/>

lots at

<http://web.archive.org/>

more at:

<http://ascii.textfiles.com/archives/2720>

