

Show HN: ParseRSS gets full-text articles from an RSS feed - Valid
http://parserss.com/

======
jkldotio
Nice. The output is similar to that of my RSS ingest pipeline for
<http://jkl.io>. I've yet to add my custom document/topical hash, sentiment,
and topical classifiers directly, but it already produces the article, a
stemmed article, the first sentence (which will evolve into a summary), and
named entities, and it resolves URL redirects.
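For the curious, a minimal sketch of that kind of pipeline might look like
the following, assuming feedparser, requests, and nltk (with the punkt,
tagger, and NE-chunker data downloaded); extract_article here is a
hypothetical placeholder for whatever boilerplate remover you prefer:

    import feedparser
    import requests
    from nltk import pos_tag, ne_chunk
    from nltk.stem import PorterStemmer
    from nltk.tokenize import sent_tokenize, word_tokenize

    stemmer = PorterStemmer()

    def ingest(feed_url):
        for entry in feedparser.parse(feed_url).entries:
            # Follow the redirect chain to the canonical article URL.
            resp = requests.get(entry.link, allow_redirects=True, timeout=10)
            article = extract_article(resp.text)  # hypothetical extractor
            tokens = word_tokenize(article)
            sentences = sent_tokenize(article)
            # Flat NE tree: root "S" plus chunks labelled PERSON, GPE, etc.
            tree = ne_chunk(pos_tag(tokens))
            entities = [" ".join(tok for tok, _ in st.leaves())
                        for st in tree.subtrees(lambda t: t.label() != "S")]
            yield {
                "url": resp.url,  # final URL after redirects
                "article": article,
                "stemmed": [stemmer.stem(t) for t in tokens],
                "first_sentence": sentences[0] if sentences else "",
                "entities": entities,
            }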

I am thinking I should clean up the code, add a few more extractors, and
release it soon as a URL analysis library (I was thinking "demands" would be
a good name to pair with Python's "requests"). I would like to get entity
disambiguation against Wikipedia into it first, though, as I think that is a
vital feature. My funding pitch largely failed, so I will approach that
somewhat more slowly, but the methodology and libraries for building
reasonable entity disambiguation from topic modelling (rather than
heaviest-subgraph approaches) are out there.
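To make the topic-modelling idea concrete, here is a hedged sketch assuming
gensim, where candidate_texts would be extracts of the candidate Wikipedia
pages for an ambiguous mention (the function name and inputs are
illustrative, not anyone's published API):

    from gensim import corpora, models, similarities

    def disambiguate(mention_context, candidate_texts):
        # Fit a small LDA model over the candidate Wikipedia extracts.
        texts = [t.lower().split() for t in candidate_texts]
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(t) for t in texts]
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)
        index = similarities.MatrixSimilarity(lda[corpus],
                                              num_features=lda.num_topics)
        # Pick the candidate whose topic mixture best matches the
        # mention's surrounding context.
        query = lda[dictionary.doc2bow(mention_context.lower().split())]
        scores = index[query]
        return max(range(len(candidate_texts)), key=lambda i: scores[i])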

I recently saw an API on HN selling basically this type of extraction from
URLs, but I think it's necessary (along with Common Crawl and other such
things) for this base layer to exist for free so people can properly compete
with Google. I believe Google currently runs 200+ extractors and classifiers
on every page, so it has a huge advantage over startups (and non-profits,
which are my area of interest) here, and Common Crawl can't close that gap
by just providing the raw data.

~~~
sdoering
As I am trying to learn some basics of automated text processing and
categorization, I am always fond of experiments and ideas like yours.

The idea of releasing it as "demands" sounds great. I would love to hear
from you when it is released.

------
sdoering
Made me smile when I tried a German RSS feed: the sentiment analysis was
always negative, because the German article "die" (the feminine/plural form
of "the") was mistaken for the English verb "to die".
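A cheap guard against that failure mode is to detect the language before
scoring, for example with the langdetect package (english_model here is a
stand-in for whatever English-only sentiment scorer is in use):

    from langdetect import detect

    def safe_sentiment(text, english_model):
        # German "die" (the) would otherwise be scored as English "to die".
        if detect(text) != "en":
            return None  # or dispatch to a language-specific model
        return english_model(text)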

Nonetheless, this really is a great service, and I will try to incorporate
it into an experiment I am running. I have been using the Readability API
until now, but on German news sites it is not that good at extracting the
pure text content.

It nearly always leaves navigation or advertisement text in the result,
which makes the text analysis I am trying to do on the content difficult.
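One alternative worth trying is the readability-lxml port, which on some
sites strips navigation and ad blocks more aggressively than the hosted API
(a minimal sketch; the URL is a placeholder):

    import requests
    from lxml import html
    from readability import Document  # pip install readability-lxml

    resp = requests.get("http://example.com/artikel", timeout=10)
    main_html = Document(resp.text).summary()         # cleaned article HTML
    text = html.fromstring(main_html).text_content()  # plain text to analyse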

------
eli
Are you using something like Diffbot, or are you doing the scraping yourself?

Edit: Ah, I see, it's Streamified.me

