

ArchiveTeam needs OPMLs and feed URLs to grab cached data from Google Reader - ivank
http://allyourfeed.ludios.org:8080/

======
ivank
ArchiveTeam has saved cached Reader feed data for 37.3M feeds so far, and even
though this seems like a lot, it still doesn't include many of the feeds
people are subscribed to. Hence the request for OPMLs/subscriptions.xml files.

If you're interested in being able to read old posts in some future feed
reading software, or just like having the data preserved, you can upload your
OPMLs and ArchiveTeam will make its best effort to grab the feeds.

More details:
[http://archiveteam.org/index.php?title=Google_Reader](http://archiveteam.org/index.php?title=Google_Reader)

7TB+ of compressed feed text:
[http://archive.org/details/archiveteam_greader](http://archive.org/details/archiveteam_greader)

Also, if anyone has billions of URLs that I can query, I could use them to
infer feed URLs and save an incredible amount of stuff. See email in profile
if you do.
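
ivank doesn't spell out how that inference would work, but the basic idea is to guess likely feed locations from ordinary page URLs. A rough Python sketch of that guessing step (the path list is an assumption for illustration, not ArchiveTeam's actual heuristics):

```python
from urllib.parse import urlsplit

# Common places blogs expose feeds; illustrative only, not ArchiveTeam's list.
COMMON_FEED_PATHS = [
    "/feed", "/feed/", "/rss", "/rss.xml",
    "/atom.xml", "/feed.xml", "/index.xml", "/?feed=rss2",
]

def candidate_feed_urls(page_url):
    """Yield plausible feed URLs derived from an ordinary page URL."""
    parts = urlsplit(page_url)
    for path in COMMON_FEED_PATHS:
        yield "%s://%s%s" % (parts.scheme, parts.netloc, path)

for url in candidate_feed_urls("http://example.com/2013/06/some-post"):
    print(url)
```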

------
ersii
I went to Google Takeout
([https://www.google.com/takeout/#custom:reader](https://www.google.com/takeout/#custom:reader)),
checked my feeds for private ones (didn't have any), and submitted them.

------
epaulson
I wrote some bad python to save my own copy of all the cached content from the
feeds I subscribe to:

[https://github.com/epaulson/stash-greader-posts](https://github.com/epaulson/stash-greader-posts)

It doesn't upload them anywhere, but at least I've got my own copy of them if
I ever think of something I want to do with them.
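
For context, grabbing the cached content means paging through Google Reader's unofficial (and soon defunct) stream API for each feed. A simplified sketch of that loop, with the endpoint and parameters written from memory (epaulson's script may differ in detail):

```python
import json
import urllib.request
from urllib.parse import quote

# Unofficial Reader endpoint, from memory; details may differ from the script.
BASE = "https://www.google.com/reader/api/0/stream/contents/feed/"

def fetch_cached_entries(feed_url, per_page=1000):
    """Page through Reader's cached items for one feed via continuation tokens."""
    continuation = None
    while True:
        url = BASE + quote(feed_url, safe="") + "?n=%d" % per_page
        if continuation:
            url += "&c=" + continuation
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for item in data.get("items", []):
            yield item
        continuation = data.get("continuation")
        if not continuation:
            break
```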

~~~
ivank
Thanks for the link. I highly recommend that people do this for the feeds they
care about, since ArchiveTeam cannot guarantee bug-free operation.

Also, it looks like there is another tool mentioned in
[https://news.ycombinator.com/item?id=5958188](https://news.ycombinator.com/item?id=5958188)

------
tmzt
Cool, installing the applications in Docker on my dedi.

I have a couple of questions, though:

Will the data remain archived on my system after it is updated? And what
format will that be in?

Will there be a public API to access this data once it's uploaded, or a way for
services such as Feedly to import the archived entries back into their feeds?
(I would hope they would support that, but a public API would be enough for me.)

Thank you for providing this service.

~~~
ivank
After the greader*-grab programs upload data to the target server, it is
removed from your machine. All of the data eventually ends up in WARCs at
[https://archive.org/details/archiveteam_greader](https://archive.org/details/archiveteam_greader)

As for an API, someone will hopefully write one to directly seek into a
megawarc in that archive.org collection, or import everything into their feed
reading service.
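
Until such an API exists, the WARCs can be read locally. A sketch using the warcio library (my suggestion, not something mentioned in the thread; the filename is a placeholder):

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Iterate over a downloaded megawarc and pick out the Reader API responses.
with open("greader.megawarc.warc.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if uri and "/reader/api/0/stream/contents/" in uri:
            payload = record.content_stream().read()  # raw JSON for one feed page
            print(uri, len(payload))
```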

~~~
tmzt
Any way I could patch the program to stop it from deleting the data after it
is uploaded?

Among other things I would like to set up an ElasticSearch cluster for my own
feeds.

Is the WARC format defined somewhere? I haven't looked at any other
ArchiveTeam projects, so I don't know whether this format is used elsewhere.

~~~
ivank
Yeah, you could patch seesaw-kit to not delete local data. Note that
greader-grab just gets a random work item from the tracker.

There's an ISO spec for WARC and tools linked at
[http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem)

~~~
subsystem
I haven't tried it, but the --keep-data option might work?

[https://github.com/ArchiveTeam/seesaw-kit/blob/master/run-pipeline#L48](https://github.com/ArchiveTeam/seesaw-kit/blob/master/run-pipeline#L48)

~~~
ivank
Yeah, that looks like the right thing.

------
antimatter15
This isn't relevant (except tangentially to the Internet Archive's Wayback
Machine), but I'm curious about the ethics (or legal standing) of rehosting
the Google Reader app (the client-side portions) with a re-implementation of
the (internal) Google Reader API, so that the app remains usable in an
unchanging state.

~~~
eli
The client code and assets are surely copyrighted by Google, no?

Also, I think the backend is the hard part.

~~~
th0ma5
I think there is at least one project out on GitHub that is trying to make a
compatible backend API built upon Node.js and other technologies.

------
drivebyacct2
Haha, no one procrastinated on exporting their data now, did they?

------
webwanderings
I am sorry, but I don't think people should upload their OPMLs like this.

Your feed collection is like your personal life. It should be private (even if
the individual URLs are public and generic). By disclosing your collection, you
are effectively living in a glass house.

Unless I'm missing something, I don't see a single helpful reason for this
service.

~~~
zeckalpha
The point is most feeds do not have all of their historical posts. Google
Reader preserved old posts in the feeds. ArchiveTeam is taking our OPMLs and
getting a copy of Reader's archive before it is taken down.

~~~
webwanderings
If you plug a feed into InoReader, you get thousands of items from the past.
It seems they are fetching historical feed data (I can't tell how far back
they go).

But why would ArchiveTeam want to preserve the historical items in a feed if
the feed does not belong to them in the first place (nor did it belong to
Google)?

~~~
eli
Isn't that like asking why you should preserve an old book even if you aren't
its author?

~~~
webwanderings
No, there is a difference. An old book on the brink of extinction still
belongs to you. You can get third-party services to preserve the book for you
with the stipulation that the preservation work and the book carry decent
privacy protections (what you're doing won't be broadcast to the world).
Remember the old days when you would take your camera roll to a store to get
it developed? The service implied that your picture content was between you
and the developer of the film.

It's fine if people don't see any privacy implication in submitting their
reading collection. But speaking only for myself, I don't see why I should
upload my OPML for the sake of preservation. I have a hard time uploading it
to any of the Google Reader replacements out there trying to compete.

~~~
sp332
Oh, so you're not really worried about people downloading the _content_ of the
blogs, you just don't want anyone to connect the list of blogs to you? Well,
you can just take the URLs out of the OPML file, and submit them one at a
time. You can even use different IP addresses if you're that paranoid.
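
A minimal sketch of that extraction step, assuming the standard Reader subscriptions.xml/OPML layout where each feed is an outline element with an xmlUrl attribute:

```python
import xml.etree.ElementTree as ET

def feed_urls(opml_path):
    """Yield just the feed URLs from an OPML/subscriptions.xml export."""
    root = ET.parse(opml_path).getroot()
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:
            yield url

for url in feed_urls("subscriptions.xml"):
    print(url)
```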

~~~
webwanderings
" you just don't want anyone to connect the list of blogs to you?"

That's exactly my point.

