
Show HN: Realtime Hacker News Summarization Using Machine Learning - bexp
http://hn10.org/index.html
======
ecesena
Not sure what you use to scrape, but I can recommend newspaper to extract
text.

I contributed a while ago and it was pretty solid.

[https://github.com/codelucas/newspaper](https://github.com/codelucas/newspaper)

~~~
sohkamyung
Thanks for that. The Newspaper demo site seems to show it working quite well
on the news sites I usually browse.

------
xiamx
Got that this is using nltk. Can you comment on which summarization algorithm
you are using (TextRank, SumBasic, etc.)? What's the motivation for
choosing one algorithm over another, and how does it perform on average HN
articles?

~~~
a_c
This. It would be great if we could have the source code along with OP's
comments on his/her summarization.

~~~
apapli
Agree. It would be great to see your source code if at all possible; it would be
greatly appreciated.

------
spraak
I'm glad that someone made this! Ever since I've known about ML and HN I've
wanted to make something like this

------
acd
Good summary, how is hn10 built? It would be cool to summarize other sites'
feeds with the same tech.

~~~
bexp
I used NLTK in Python [http://nltk.org](http://nltk.org) + Twisted for
serving HTTP

~~~
amirouche
NLTK has no machine learning algorithm AFAIK

------
alistproducer2
This would really benefit from an explanation of what I'm seeing. Can't really
upvote it without knowing what's going on....

~~~
bexp
There are a couple of things here: 1. scrape the web page to get the text content 2.
use NLTK to process the text and get a summary and keywords 3. wrap it into a REST
API and serve it as a web service
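
Step 2 above could look roughly like the following. This is only a minimal sketch, assuming a plain word-frequency extractive scorer; the thread does not say which algorithm hn10 actually uses. It sticks to the standard library so it runs anywhere; the regex splitter, the tiny `STOPWORDS` set, and the function names are all illustrative assumptions, and NLTK's tokenizers and stopword corpus would be the natural replacements.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's corpus).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "are", "was", "be", "this", "by",
}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase words, minus stopwords.
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

def keywords(text, n=5):
    """Return the n most frequent non-stopword terms."""
    freq = Counter(w for s in split_sentences(text) for w in tokenize(s))
    return [w for w, _ in freq.most_common(n)]
```

Calling `summarize(article_text)` and `keywords(article_text)` on the scraped text would give the two fields that step 3 then serves.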

~~~
jeffehobbs
You could probably get cleaner input for step 1 via the Mercury API
[https://mercury.postlight.com/web-parser/](https://mercury.postlight.com/web-parser/);
it has a lot of affordances for different kinds of HTML formatting.

~~~
bexp
Thanks, will try at some point. My biggest concern was that those kinds of APIs
are almost all paid and rarely open source.

------
ostegm
Can you describe a little about how you go from raw text to a summary? Is this
supervised learning? If so, what is the training data?

------
usgroup
Consider just using the first paragraph as the summary.

For the sort of thing on HN, it'll be spot on more times than not.
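
A rough sketch of that baseline, assuming the article text has already been extracted and that paragraphs are separated by blank lines. The `min_chars` cutoff is a made-up heuristic for skipping bylines and datelines, not something tuned on real HN articles.

```python
import re

def first_paragraph(text, min_chars=40):
    """Return the first paragraph that is long enough to be body text."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        para = " ".join(para.split())  # collapse internal line breaks
        if len(para) >= min_chars:     # skip short lines such as bylines
            return para
    return ""
```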

~~~
aaronhoffman
Yes, could probably just grab the open graph tag data for most articles and
display that. But this is a cool idea.
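
For illustration, pulling the Open Graph fields out of a page can be done with the standard library alone. This is a sketch under the assumption that the page exposes `<meta property="og:...">` tags; the class and function names are invented for the example.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:*" content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        content = attrs.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.og.setdefault(prop[3:], content)

def open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og
```

`open_graph(page_html).get("description")` would then be the candidate summary, with a fallback needed for pages that do not set it.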

------
earthly10x
Would be nice to connect other data to it, like stocks, e.g.
[https://cymetica.slack.com/apps/A26G72726-quantbot](https://cymetica.slack.com/apps/A26G72726-quantbot)

------
s_ngularity
It would be good if it still included a link to the HN comments for each post

~~~
bexp
Done!

~~~
alexilliamson
There are still posts with no link to the comments. Seems like it is happening
whenever the summary is also missing below the title (currently "5. Ask HN:
Leave job right before app goes to production?" is like this)

------
travelinreid
[https://thepul.se/](https://thepul.se/) also uses an algorithm based on nltk
to summarize general news. (It also includes a nice speed-reader.)

------
brunopedro
What summarization algorithm are you using?

------
Raphmedia
Does it summarize our comment threads or the pages it links to?

~~~
bexp
Just pages for now.

~~~
Raphmedia
Alright! Did you ever try to run a summarizer on a post's comment section? It
pretty much creates an "article" for you. I've always thought that article
generation from comment threads is an area that should be explored.

------
singularity2001
js error: "get top stories error index.html:56:4"

------
mulrian
Is this open source?

