
Create your own machine-learning-powered RSS reader - doppenhe
http://blog.algorithmia.com/post/93293999119/create-your-own-machine-learning-powered-rss-reader-in
======
photorized
I recently launched a Twitter app designed to filter out the noise by
measuring the velocity of all tweets from all people I follow. It only shows
me 5 things at any given moment that are most likely "interesting".

Check it out, it's free, and it only wants read-only auth permissions via
Twitter: [http://skim.io/](http://skim.io/)

Still tweaking the algorithm.
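
A rough sketch of what "velocity" ranking could look like, for anyone curious. This is not the actual skim.io algorithm: the `Tweet` fields, the choice of engagements (retweets plus favorites) as the signal, and the per-hour definition of velocity are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Tweet:
    author: str
    text: str
    posted_at: datetime
    engagements: int  # assumed signal: retweets + favorites


def velocity(tweet, now):
    """Engagements per hour since posting (a hypothetical definition)."""
    # Clamp the age to one minute so brand-new tweets don't divide by zero.
    age_hours = max((now - tweet.posted_at).total_seconds() / 3600.0, 1 / 60)
    return tweet.engagements / age_hours


def top_by_velocity(tweets, now, k=5):
    """Return the k tweets with the highest velocity, fastest first."""
    return sorted(tweets, key=lambda t: velocity(t, now), reverse=True)[:k]
```

A tweet with 20 engagements in its first half hour outranks one with 50 engagements over two hours, which is the point of ranking by rate rather than by raw counts.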

~~~
trickjarrett
Years ago I did something kind of similar. I created my own Twitter client
(never public and now defunct) but one of the features was that it would
prevent one user from monopolizing my stream by hiding their posts after the
three most recent. So say someone was live tweeting an event I would only see
their three most recent, and after that it would display a collapsed bar to
let me know it had hidden something.

Made skimming much easier, I still miss this feature.
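
For illustration, the collapsing rule can be sketched roughly like this. The original client is defunct, so this is a reconstruction under assumptions: the timeline is a newest-first list of `(author, text)` tuples, and the `collapse_timeline` name and the output tuple shapes are made up here.

```python
from collections import Counter


def collapse_timeline(posts, max_per_author=3):
    """posts: list of (author, text) tuples, newest first.

    Keeps each author's most recent `max_per_author` posts and replaces
    the rest with a single "collapsed" marker per author.
    """
    totals = Counter(author for author, _ in posts)
    seen = Counter()
    out = []
    for author, text in posts:
        seen[author] += 1
        if seen[author] <= max_per_author:
            out.append(("post", author, text))
        elif seen[author] == max_per_author + 1:
            # Emit the collapsed bar once, where the first hidden post was.
            hidden = totals[author] - max_per_author
            out.append(("collapsed", author, f"{hidden} more hidden"))
    return out
```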

~~~
photorized
pmarca would hate that feature, it would interfere with his tweet storms. :)

In all seriousness, noise is the biggest problem I have with Twitter. When you
follow more than 100 accounts, the timeline becomes pretty much useless.

Your comment gives me another idea for my app though - we will make it so each
of the Top 5 is from a different unique publisher... and if more than one of
their comments are trending, we choose the 'fastest'.
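
A minimal sketch of that idea, assuming each candidate arrives as a `(publisher, text, velocity)` tuple with the velocity already computed. The function name and tuple layout are hypothetical, not anything that exists in the app today.

```python
def top_distinct(items, k=5):
    """items: iterable of (publisher, text, velocity) tuples.

    Keeps only the fastest item per publisher, then returns the k fastest
    of those, so no publisher appears more than once.
    """
    best = {}
    for publisher, text, vel in items:
        if publisher not in best or vel > best[publisher][2]:
            best[publisher] = (publisher, text, vel)
    return sorted(best.values(), key=lambda item: item[2], reverse=True)[:k]
```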

------
kevindavis
Really interesting, has my mind spinning around the possibilities. Would love
to be able to apply this stuff to filter my Feedly account - how hard would
that be using Algorithmia?

~~~
doppenhe
I am not super familiar with the Feedly API (or how it works) but
integrating with Algorithmia should be really simple. There is a registration
link at the bottom of the blog post and you can check out the docs at
[http://algorithmia.com/docs](http://algorithmia.com/docs).

Feel free to drop me a note at diego at algorithmia dot com as well.

~~~
kevindavis
Just checked out the docs, looks like they have an easy call to mark
one/multiple articles as read - so could have something running periodically
filtering out stuff you're not interested in

------
photorized
On sentiment analysis, there seem to be a lot of false negatives - at least
when parsing TechCrunch. For example, all these were tagged as 'negative':

- Timely Turns Your Calendar Into A Time Tracker

- Audi Tests Self-Driving Cars On Florida’s Roads

- Twitter Acquires Password Security Startup Mitro, Open Sources Its Product

~~~
doppenhe
We used the Stanford NLP library with their training data set which is
considered to be one of the better ones. I did notice false negatives as well
but it can definitely be trained to be more accurate.

~~~
photorized
I decided there's no such thing as a good NLP library. In my other apps, I
usually use several libraries at once, then make them vote. :) Works out
better than Stanford NLP.

I do need to add some intelligence to skim.io though.
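
The "make them vote" part is roughly the following. This is a sketch only: each classifier is assumed to be a plain callable that takes a string and returns a label, and the lambdas in the usage note are stand-ins, not real library calls.

```python
from collections import Counter


def majority_vote(text, classifiers, tie_label="neutral"):
    """Run every classifier on `text` and return the most common label.

    `classifiers` is a list of callables (text -> label). If the top two
    labels get the same number of votes, `tie_label` is returned instead.
    """
    if not classifiers:
        raise ValueError("need at least one classifier")
    votes = Counter(clf(text) for clf in classifiers)
    ranked = votes.most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]
```

For example, `majority_vote(title, [lib_a, lib_b, lib_c])` returns "positive" when two of the three hypothetical wrappers say so, even if the third disagrees.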

~~~
hnriot
This is called ensemble classification where you feed the outputs of multiple
classifiers as features into an ensemble classifier that produces the final
result.

How are you using the Stanford NLP? That's all GPL?

There are alternatives you could look at for sentiment analysis but short
"documents" like those referenced will always produce poor results because
there's just not enough signal to work with. The training models need to have
vocabulary overlap with the documents (at least for word features); try
TextBlob which uses a lexicon approach rather than a classifier, or try
rolling your own with an off-the-shelf SVM and pull labeled training data from
one of the many sources (or generate your own using Crowdflower.) Small
documents (tweets/titles etc) pose unique challenges, especially when there's
irony or sarcasm involved or implicit sentiment through pragmatic knowledge.
For example knowing Sarah Palin and how she's regarded automatically gives a
person a head start in determining the sentiment of a short document with her
name. This kind of pragmatic knowledge is hard for classifiers to learn.
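
To make the lexicon approach concrete, here is a toy sketch. It is not how TextBlob is implemented; the six-word `LEXICON` and its scores are invented for illustration, and a real lexicon would have thousands of entries.

```python
import re

# Toy polarity lexicon (made-up scores in [-1, 1], for illustration only).
LEXICON = {
    "great": 1.0,
    "love": 1.0,
    "useful": 0.5,
    "broken": -1.0,
    "hate": -1.0,
    "useless": -0.5,
}


def lexicon_polarity(text, lexicon=LEXICON):
    """Average the scores of the words found in the lexicon.

    Returns 0.0 when no word matches, which is the common case for short
    titles: there is simply no signal to work with.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```

A headline like "Audi Tests Self-Driving Cars On Florida's Roads" has no vocabulary overlap with this lexicon, so it scores 0.0, which illustrates the short-document problem described above.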

~~~
walterbell
In the example above, could social network analysis (e.g.
[https://en.wikipedia.org/wiki/NodeXL](https://en.wikipedia.org/wiki/NodeXL))
be used to profile Sarah Palin, then combined with text classification?

~~~
hnriot
It's an option, though I'm not sure about NodeXL. Wikipedia is already
available in structured form in Freebase and DBpedia, but understanding something so
complex as a reputation is beyond current machine learning. Bringing the
pragmatic background knowledge to sentiment analysis is going to be one of the
differentiations. We got 65% easily with a lexicon, we got 75% with SVMs and
gobs of training data, we've gone past that with hierarchical aspect models
and technologies like word vectors, but the problem of making improvements
gets exponentially harder. Before social networks can play a role in sentiment
analysis, we'll likely see breakthroughs in coreference and similar problems
that will help eke out more signal from training data. We can certainly use
the "hive mind" to assist in this problem, even something as simple as
collaborative filtering can help.

------
minimaxir
What are the rate limits on the APIs? Especially since they're very
computationally expensive.

~~~
doppenhe
Currently no rate limits while in private beta.

------
Noctyrnal
This is great stuff! Super useful!

~~~
doppenhe
Thanks! Please feel free to send any feedback to diego at algorithmia dot com.

------
toisanji
The signup link doesn't work.

~~~
doppenhe
Should be working again. Sorry about that.

~~~
jscheel
Still isn't working for me.

~~~
doppenhe
Traffic killed our frontend for a hot sec. Should be good to go now. Thanks
for the patience.

------
mmenafra
Awesome!

