
Homemade RSS aggregator followup - ingve
http://leancrew.com/all-this/2015/12/homemade-rss-aggregator-followup/
======
abengoam
That's awesome! I should know because I also created my own RSS aggregator
after the demise of Google Reader.

Here's a screenshot [https://imgur.com/YHJOiEX](https://imgur.com/YHJOiEX)

It fills my needs perfectly because I created it specifically for myself and I
control it fully in all aspects.

It's been such a tremendous success for me and so fun to create that I am
thinking about replacing other online services with custom-made versions, such
as google calendar, google tasks, etc. Something to look forward to in 2016.

Great job, and keep at it!

~~~
scrollaway
Your reader looks awesome. Did you open source it?

~~~
abengoam
Thank you! Alas, I did not. I am fully behind the open source/free software
movement(s), but right now I am at a point in my life where I can't manage and
support an open source project as my availability is super spotty. Also, you
don't want to see the css... it makes gotos look good.

~~~
scrollaway
You should not be afraid of open sourcing code you can't support or are not
happy with :)

Here's my most popular GitHub repository. I never use it and it's entirely
supported by its users; I just keep an eye on it:

[https://github.com/jleclanche/django-push-notifications](https://github.com/jleclanche/django-push-notifications)

And here is some of my worst production code. Fully undocumented and with my
first Python code ever in it, never rewritten/cleaned up:

[https://github.com/jleclanche/pywow](https://github.com/jleclanche/pywow)

There, I just showed you terrible code and unmaintained crap ;) Don't be
ashamed!

------
mmb
The bigger problem is that the amount of information you can consume using RSS
feeds is declining. Most sites don't publish RSS feeds of their content any
more.

Sadly RSS is left over from a time when things were more open. Now everything
is an app and everyone wants you to stay in their walled garden.

~~~
anotherevan
Yeah, there's a couple of interesting sites I would like to follow, but they
don't provide RSS feeds, so I don't bother.

~~~
l1n
I personally use [http://changedetection.com/](http://changedetection.com/) to
create RSS feeds for sites that don't have them (though I may move to Huginn
agents in future).

------
rcarmo
I have a fair amount of code that people can re-use to build their own
aggregators, since I did a number of experiments when Google Reader died.

One was a Fever clone that had a number of strategies for doing parallel
fetching:

\- [https://github.com/rcarmo/bottle-fever](https://github.com/rcarmo/bottle-fever)

Andrea Peltrin took that and evolved it into Coldsweat:
[https://github.com/passiomatic/coldsweat](https://github.com/passiomatic/coldsweat)
\- which I recommend if you want a web UI.
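For a sense of what those parallel-fetching strategies involve, here is a minimal thread-pool sketch - not code from bottle-fever itself, just the general shape, with URLs and worker count as placeholders:

```python
# Sketch of a parallel-fetch loop for an aggregator: grab many feeds
# concurrently with a thread pool, tolerating per-feed failures.
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Fetch one feed; raises on network errors."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()

def fetch_all(urls, workers=8):
    """Fetch all URLs in parallel; broken feeds end up in `errors`."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:  # broken feeds are common; keep going
                errors[url] = exc
    return results, errors
```

The point of collecting errors instead of raising is that, with hundreds of feeds, a handful are always down or malformed, and one bad feed shouldn't abort the whole run.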

I did a number of other things, but eventually went back to what I used
_before_ Google Reader: e-mail.

I was one of the contributors for
[http://newspipe.sourceforge.net/](http://newspipe.sourceforge.net/), and
after getting bottle-fever going I decided to investigate the state of the art
and did a quick fork of rss2email that injected messages into an IMAP store
instead of sending them via SMTP, to avoid spam traps.

It was a quick hack, but it allowed me to read feeds using any mobile IMAP
client, and a friend eventually did a Go version, which I've also tweaked to
my liking:

\- [https://github.com/rcarmo/rss2imap](https://github.com/rcarmo/rss2imap) (Python)

\- [https://github.com/rcarmo/go-rss2imap](https://github.com/rcarmo/go-rss2imap) (Go)

Any of the above are likely to save people a fair amount of time (do bear in
mind that the Python version was a hack atop code that was written by Aaron
Swartz a decade ago, and it shows its age).

These days I ended up going back to Feedly, simply because I have to use
Windows, the Web UI is good enough and there are lots of good clients for the
platforms I use (NextGen, Reeder, etc.)

Plus I realised that trying to archive stuff from hundreds of feeds was
somewhat pointless -- the stuff I really want to keep around goes into Pocket
or OneNote, and that's that.

Edit: Also, here are some notes from 2008 on Bayesian classification and its
effectiveness: [http://taoofmac.com/space/blog/2008/01/27/2203#an-update-on-...](http://taoofmac.com/space/blog/2008/01/27/2203#an-update-on-my-rss-setup)

~~~
voltagex_
I like the idea of (ab)using protocols to do not- _quite_ what they were
intended to do.

Pushing RSS feeds into IMAP is a great idea - I wonder how much work it'd take
to make Newsbeuter do that, then expose it somewhere and have FastMail pull it
into a folder for me.

These hacks eventually start looking like Rube Goldberg machines, but they've
got a certain charm.

Offtopic: I wrote a Wake-on-Lan server that allows me to turn on VMs as if
they were physical machines -
[https://github.com/voltagex/junkcode/tree/master/CSharp/Virt...](https://github.com/voltagex/junkcode/tree/master/CSharp/VirtualBoxAsAService/NullReference.WakeOnVirtual)

The next one for me is probably going to be a DNS server that resolves the
name and IPs of VMs.

------
aw3c2
My perfect aggregator would also create a WARC archive of the webpage of each
post, including all external references, maybe the referenced external
websites and their references (with that single depth of recursion). The
internet is friggen fragile and I would love to archive what I consume.

~~~
derefr
To go further: there's basically no point in the "description" part of an RSS
item. RSS is broken in that authors need to lure people onto their sites, so
they make the RSS item itself enclose just enough of a preview to make you
"click through"—whereupon their site can show you ads and they can make money.

How RSS _should_ work, in an ideal technical sense, is to eschew enclosing any
content-body in feed items themselves, and instead just encourage RSS
consumers (feed-reader clients; feed-muxer daemons) to scrape the permalinks
of the feed items, and then heuristically extract the body-content from the
scrape-result, and cache both the resulting page-archive and the resulting
cleaned-up text, making _both_ representations available offline.

This, obviously, kills blog ad revenue. But it's better to kill it and replace
it with something better (402 micropayment-required errors at point-of-
caching, handled automatically by the RSS content-spidering daemon as an HTTP
client, with costs passed on to its subscribers?) than to continue on with
this semi-braindamaged "I have an offline cache but that doesn't actually mean
I can read anything offline" world.
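A toy version of the heuristic body extraction derefr describes, assuming nothing beyond the stdlib's html.parser - real readers use far smarter readability heuristics, but the shape is the same:

```python
# Crude body extraction: keep <p> text, drop chrome (nav/header/footer)
# and scripts, and filter out trivially short paragraphs.
from html.parser import HTMLParser

class BodyExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # how many SKIP elements we're inside
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.paragraphs[-1] += data

def extract_body(html):
    """Heuristic: the article body is the set of non-trivial paragraphs."""
    parser = BodyExtractor()
    parser.feed(html)
    return "\n\n".join(p.strip() for p in parser.paragraphs
                       if len(p.strip()) > 40)
```

A feed-muxer daemon would run something like this over each scraped permalink, then cache both the raw page archive and this cleaned-up text for offline reading.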

~~~
okasaki
All the blogs I read include the full content in the feed. Some even include
top comments.

And if you're parsing sites then you have no use for RSS - you can just parse
the index to see what's new. Sounds like a nightmare to me though. Who's
going to maintain these parsers for hundreds of sites?

~~~
derefr
RSS tells you what's actually a new page. Usually, if a site wants to be
Google-friendly, it'll have a page for every new content item—so, if a site
publishes an RSS feed, that's usually enough to "chunk" their content by time.

Heuristics to decide what's new on a site are much, much harder to code than
heuristics to extract content when you're explicitly told what's new. There
was a service called Dapper, who tried the "extract content from a CSS-
selector-specified zone of a page when content changes" approach... and it
didn't go well. Yahoo Pipes had similar aspirations; again, got shuttered.

There are always sites that are really bad internet citizens. Some sites just
change their front page to update, without creating a canonical archival
URL—so then, if they _do_ publish an RSS feed, every item is just a link back
to said front page. There's not much you can do in these cases beyond just
crowdsourcing "parsers for hundreds of sites" (which does happen—webcomics
being a frequent and horrifying example.)

But there is a pretty good alternative, I think: if you can scrape the site
itself _as a whole_ at regular intervals, with fine enough granularity that
any two contiguous scrapes will detect either 0 or 1 change, then you can
_probably_ convert that into a useful RSS feed without trying to figure out
"what changed" from DOM deltas. Basically, shove each scrape into the
working index of a git repo, commit, and generate an RSS feed of the diffs.
(Wikipedia can give you this in some form; I think gwern.net has an RSS feed
that also basically works this way.) And, if the site has even the most
minimal source RSS feed, you don't need the "regular intervals" approach: you
can use the source RSS feed to provide "cueing" information (i.e. timestamps
of when the site has changed) for your scrapes.
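A minimal stand-in for that scrape-commit-diff pipeline, using difflib in place of an actual git repo (the item and feed markup here is deliberately bare-bones, not a full RSS 2.0 implementation):

```python
# Turn a chronological series of page scrapes into RSS-style items,
# where each item's description is the unified diff between scrapes.
import difflib
from xml.sax.saxutils import escape

def diff_to_items(snapshots):
    """snapshots: list of (timestamp, page_text) in chronological order."""
    items = []
    for (t_old, old), (t_new, new) in zip(snapshots, snapshots[1:]):
        diff = "\n".join(difflib.unified_diff(
            old.splitlines(), new.splitlines(), lineterm=""))
        if diff:  # only emit an item when something actually changed
            items.append(
                "<item><title>Change at {}</title>"
                "<description>{}</description></item>".format(
                    escape(t_new), escape(diff)))
    return items

def items_to_feed(title, items):
    return ('<?xml version="1.0"?><rss version="2.0"><channel>'
            "<title>{}</title>{}</channel></rss>".format(
                escape(title), "".join(items)))
```

The git version of this buys you durable history and cheap storage of near-identical scrapes, but the feed-generation logic is the same: diff consecutive states, emit one item per change.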

Which is all to say, RSS is a very "utilitarian" format. You don't need to
rely on all of this processing happening on the server side, but rather can
just take one RSS feed, and then use whatever bits of it you want to generate
another RSS feed, and then someone can consume _that_ RSS feed and write a
heuristic to cluster and combine the feed items from it into higher-level
summaries, etc.

This sort of thing really shows the "theme" of RSS, to me. RSS wasn't really
designed or intended for direct consumption by readers, but rather to make it
easy to have _something_ on your site (even your statically-generated site, or
your crap one-off PHP blog) that's easy for other services to consume and
_turn into_ stuff. It's "Really Simple Syndication": it's not the whole
process of getting stuff to clients, it's just the lowest barrier-to-entry to
get the _supply_ -side of the equation involved, so that well-made _delivery
infrastructure_ can take over from there.

There was never really meant to be "RSS reader" software, or even e.g.
consume-side RSS-to-email gateways. Instead, RSS was meant to be a lowest-
common-denominator format for other supplier-side services to consume. Instead
of writing "lifecycle emails", for example, you were meant to just have your
web app generate a lifecycle notification stream RSS feed for each user—and
then subscribe an emailing service to it.

We've replaced this behavior, for the most part, with webhooks—having our web-
apps actually reach out using a REST client and prod other services' APIs. But
RSS (with _one_ consumer—the hooked service) is much, much cheaper than a REST
client, from all perspectives: your web-app can likely already generate XML,
it likely already has a list of changes and "view rendering" capabilities,
etc. Webhooks are for the 1% of apps that can both 1. run server-side, 2.
reach-out from their sandboxes to poke something elsewhere on the web, and 3.
have a public URL where they can be poked back at. An RSS-based pub-sub event
system, meanwhile, will work even if the event-generator is on your local
machine: as long as you proxy over the RSS feed URL itself, you don't need to
worry about machines trying to figure out whether the site that prodded them
with a webhook request is coming from the right IP for that client or not,
etc. They just get to blindly consume a URL, like any other piece of idiomatic
web software.

------
pmoriarty
I've recently come back to using newsbeuter[1] and have been quite impressed.
It's really feature rich and very customizable. It's a terminal app, which
some might not like, but for me that's a plus.

[1] -
[http://www.newsbeuter.org/index.html](http://www.newsbeuter.org/index.html)

------
gerty
I guess Tiny Tiny RSS hasn't been mentioned yet. FOSS, self-hosted, with
multiple Android clients. I had been using Feedly since Google Reader went
down but should have actually been using TTRSS since the beginning. I ain't no
power user, but it definitely has more than I would ever ask for.

------
wanda
If anybody happens to be looking for an RSS aggregator, I'd like to recommend
GoRead.

Obviously I wouldn't pay for it, but self-hosting is pretty straightforward
and it has a companion Android app.

Never cared for Feedly and I don't really fancy making my own.

It's the best Google Reader clone I've found.

[https://www.goread.io](https://www.goread.io)

~~~
ents
I don't like feedly either, but as a backend for apps it works fine, and is
free.

------
petercooper
I made a similar script but that also has 'plugins' so 'URLs' like @username,
twitter:topic, /r/subreddit, and hn:topic load up the tweets, Twitter search
results, sub-Reddit items, or HN search results respectively for certain
keywords, using their respective APIs.
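One way such plugin "URLs" could be wired up is a dispatch table mapping pseudo-URL prefixes to fetcher functions. The handlers below are hypothetical stubs standing in for the real Twitter/Reddit/HN API calls:

```python
# Dispatch pseudo-URLs like "@user", "/r/python", or "hn:rust" to the
# plugin that knows how to fetch that kind of content.
def resolve(pseudo_url, handlers):
    """Find the first handler whose prefix matches; return it plus the rest."""
    for prefix, handler in handlers:
        if pseudo_url.startswith(prefix):
            return handler, pseudo_url[len(prefix):]
    return None, pseudo_url

# Stub fetchers; real ones would call the respective APIs and return items.
HANDLERS = [
    ("@", lambda user: "tweets by " + user),
    ("/r/", lambda sub: "posts in " + sub),
    ("hn:", lambda topic: "HN results for " + topic),
]

def load(pseudo_url):
    handler, rest = resolve(pseudo_url, HANDLERS)
    if handler is None:
        raise ValueError("no plugin for " + pseudo_url)
    return handler(rest)
```

The nice property is that the aggregator core never needs to know about any particular API; adding a source is just adding one (prefix, fetcher) pair.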

~~~
oneloop
Care to share?

------
krylon
Very interesting!

I am currently building an RSS aggregator, too. Mine is a little more complex,
though - I wanna be able to rate items as interesting or boring and use some
kind of filter (currently a simple Bayesian classifier, which I intend to
replace or at least enhance with something more sophisticated over the
holidays) to weed out news that I am not interested in.

The biggest problem is that web design is not my strong suit (to put it
mildly), so the thing looks pretty ugly. Classification does not work very
well, yet, but I am not sure if this is because the classifier sucks in
general or if my training set is too small at the moment (I've only been using
the thing for a couple of days now).
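For what it's worth, an interesting/boring filter of this kind can be quite small. A bare-bones naive Bayes over word counts - a sketch, not krylon's code; a real version would persist the counts in the database alongside the items:

```python
# Two-class naive Bayes with add-one smoothing, trained from manually
# rated items, classifying new items as "interesting" or "boring".
import math
from collections import Counter

class NewsFilter:
    def __init__(self):
        self.words = {"interesting": Counter(), "boring": Counter()}
        self.totals = {"interesting": 0, "boring": 0}

    def train(self, label, text):
        for word in text.lower().split():
            self.words[label][word] += 1
            self.totals[label] += 1

    def score(self, label, text):
        # Sum of smoothed log-probabilities of each word under `label`.
        vocab = len(set(self.words["interesting"]) | set(self.words["boring"]))
        total = self.totals[label] + vocab
        return sum(math.log((self.words[label][w] + 1) / total)
                   for w in text.lower().split())

    def classify(self, text):
        return max(("interesting", "boring"),
                   key=lambda label: self.score(label, text))
```

With only a few days of ratings the vocabulary overlap is tiny, which matches the observation that it's hard to tell whether the classifier is weak or the training set is just too small.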

Anyway, it is quite interesting to see another approach to the problem.

~~~
alanpost
Do you use a single classifier for your entire feed, or do you categorize the
feeds and maintain a classifier for each topic? (Or, as always, secret option
#3: neither.)

~~~
krylon
Currently, I use a single classifier for all the news items.

I store the items, along with manual ratings, in a database, from which the
classifier is trained. (This also allows me to keep a history of older items,
which I can search.)

I hope to make it more sophisticated, eventually.

------
oxplot
Given that email clients are fairly mature and advanced already (especially
Gmail's), it seemed logical to go the Unix way and use one as the UI for a
stream of feeds sent as email. I wrote a bit of Python [1], stuck it on a free
OpenShift cartridge, and it sends me one email per feed item. It's been up
since April this year, shortly after I abandoned Feedly. I like it more than
Google Reader now.

[1]: [https://github.com/oxplot/lapafeed](https://github.com/oxplot/lapafeed)
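A sketch of the one-email-per-item plumbing (not lapafeed itself): remember which entry IDs have already been mailed, and hand only new ones to an SMTP sender. The file path, addresses, and host are placeholders:

```python
# Deduplicate feed entries against a persisted seen-IDs file, then mail
# each fresh entry as its own message.
import json
import os
import smtplib
from email.message import EmailMessage

def new_entries(entries, seen_path="seen.json"):
    """Filter out entries already mailed (by id/link); persist the set."""
    seen = set()
    if os.path.exists(seen_path):
        with open(seen_path) as f:
            seen = set(json.load(f))
    fresh = [e for e in entries if e.get("id", e.get("link")) not in seen]
    seen.update(e.get("id", e.get("link")) for e in fresh)
    with open(seen_path, "w") as f:
        json.dump(sorted(seen), f)
    return fresh

def mail_entry(entry, sender, recipient, host="localhost"):
    """Send one feed entry as one email."""
    msg = EmailMessage()
    msg["From"], msg["To"] = sender, recipient
    msg["Subject"] = entry.get("title", "(untitled)")
    msg.set_content(entry.get("summary", "") + "\n\n" + entry.get("link", ""))
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

Run from cron, this gives you Gmail's search, labels, and mobile clients as the reader UI for free.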

------
axx
I'm also working on an open source RSS Reader called HappyFeed. It's
compatible with Fever RSS API, so you can use it with Reeder, ReadKit, Press
and so on.

I work on this project mainly for myself, but if you're interested and want to
contribute, feel free to get in touch!

Screenshots and development blog:
[https://need.computer/happyfeed/2015/12/20/happyfeed-drag-an...](https://need.computer/happyfeed/2015/12/20/happyfeed-drag-and-drop-sidebar-and-image-proxy.html)

GitHub:
[https://github.com/aleks/HappyFeed](https://github.com/aleks/HappyFeed)

------
fasouto
I started creating an RSS aggregator some time ago
([https://github.com/fasouto/django-feedaggregator](https://github.com/fasouto/django-feedaggregator))
and it was more difficult than expected. There are many broken feeds and
different interpretations of the standard.

One day I should finish it...

~~~
rcarmo
Check my top-level comment.

------
younata
Been working on my own RSS reader for iOS.
[https://github.com/younata/RSSClient/](https://github.com/younata/RSSClient/)

Pretty much all the internal logic (parsing feeds/OPML files) is also written
from scratch, which was interesting to do.

------
newtang
I built a similarly simple feed aggregator for anyone to use at
[https://plumfeed.com](https://plumfeed.com) Shows the most recent post of
each feed. I've found it particularly nice for blogs and comics that update
once a day or less.

------
hasteur
Adding another "Google Reader" replacement: Tiny Tiny RSS. There is some
"jiggery-pokery" that needs to be done, but it does nearly everything you
could want (I wouldn't mind getting the behind-the-cut reveals).

------
ojiikun
Surprised there has been no mention of NewsBlur. Though you can pay the author
for use of the central instance, it is also fully open sourced on GitHub.
Saying it is more feature-rich than most examples would be an understatement.

