Show HN: Realtime Hacker News Summarization Using Machine Learning (hn10.org)
138 points by bexp on Jan 24, 2017 | 29 comments



Not sure what you use to scrape, but I can recommend newspaper to extract text.

I contributed a while ago and it was pretty solid.

https://github.com/codelucas/newspaper


Thanks for that. The Newspaper demo site seems to show it working quite well on the news sites I usually browse.


Thanks, will try that.


Got that this is using nltk. Can you comment on which summarization algorithm you are using (TextRank, SumBasic, etc.)? What's the motivation for choosing one algorithm over another, and how does it perform on average HN articles?


The summary seems to pick the most unique sentences, so maybe LexRank? Amazing though, I've been really wanting a way to skim HN articles quicker, since there is so much volume and you don't want to miss anything

Great work!


This. It would be great if we could have the source code alongside OP's comment on his/her summarization.


Agreed. It would be great to see your source code if at all possible; it would be greatly appreciated.


I'm glad that someone made this! Ever since I've known about ML and HN I've wanted to make something like this


Good summary. How is hn10 built? It would be cool to summarize other sites' feeds with the same tech.


I used NLTK in Python http://nltk.org + Twisted for serving HTTP


NLTK has no machine learning algorithm AFAIK


This would really benefit from an explanation of what I'm seeing. Can't really upvote it without knowing what's going on...


There are a couple of things here: 1. scraping the web page to get the text content 2. using NLTK to process the text and get a summary and keywords 3. wrapping it into a REST API and serving it as a web service
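For anyone wondering what step 2 might look like, here is a minimal sketch of an extractive summarizer plus keyword picker. It is not hn10's code: the thread never says which algorithm is used, so this assumes a plain word-frequency scorer, uses only the standard library instead of NLTK, and every name in it (`STOPWORDS`, `split_sentences`, `tokenize`, `summarize`, `keywords`) is made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK ships a fuller one).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that",
    "this", "for", "on", "with", "as", "are", "was", "be", "by", "at",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]


def keywords(text, n=5):
    """The n most frequent non-stopword words."""
    freq = Counter(w for s in split_sentences(text) for w in tokenize(s))
    return [w for w, _ in freq.most_common(n)]
```

Calling `summarize(article_text, max_sentences=3)` on the scraped text from step 1 would give the sentences to display, and `keywords(article_text)` the tags. A real pipeline would likely want a better sentence splitter and stopword list than the ones assumed here.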


You could probably get cleaner input for step 1 via the Mercury API [https://mercury.postlight.com/web-parser/] — it has a lot of affordances for different kinds of HTML formatting.


Thanks, will try at some point. My biggest concern was that those kinds of APIs are almost all paid and rarely open source.


Can you describe a little about how you go from raw text to a summary? Is this supervised learning? If so, what is the training data?


Consider just using the first paragraph as the summary.

For the sort of thing on HN, it'll be spot on more times than not.
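A rough sketch of that baseline, assuming the article has already been extracted to plain text with blank lines between paragraphs. The function name `lead_paragraph` and the 300-character cap are arbitrary choices for illustration, not anything from the site.

```python
import re


def lead_paragraph(text, max_chars=300):
    """Return the first non-empty paragraph, truncated at a word boundary."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        para = " ".join(block.split())  # collapse internal whitespace
        if para:
            break
    else:
        return ""  # no non-empty paragraph found

    if len(para) <= max_chars:
        return para
    # Cut at the last space before the limit so no word is split.
    cut = para.rfind(" ", 0, max_chars)
    return para[: cut if cut > 0 else max_chars].rstrip() + "..."
```

This makes a cheap fallback for pages where a fancier summarizer returns nothing.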


Yes, could probably just grab the open graph tag data for most articles and display that. But this is a cool idea.
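Pulling the Open Graph data is doable with just the standard library. A minimal sketch, assuming the page HTML has already been fetched and that the site actually sets `og:title` / `og:description` (many do, not all). `OpenGraphParser` and `open_graph_summary` are hypothetical names.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value if a property is repeated.
            self.properties.setdefault(prop, content)


def open_graph_summary(html):
    """Return (title, description); either may be None if the tag is absent."""
    parser = OpenGraphParser()
    parser.feed(html)
    og = parser.properties
    return og.get("og:title"), og.get("og:description")
```

When the description comes back as `None`, the caller can fall back to the first paragraph or to a computed summary.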


Would be nice to connect other data to it like stocks e.g. https://cymetica.slack.com/apps/A26G72726-quantbot


It would be good if it still included a link to the HN comments for each post


Done!


There are still posts with no link to the comments. Seems like it is happening whenever the summary is also missing below the title (currently "5. Ask HN: Leave job right before app goes to production?" is like this)


https://thepul.se/ also uses an algorithm based on nltk to summarize general news. (It also includes a nice speed-reader.)


What summarization algorithm are you using?
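Since nobody in the thread got a definite answer, here is a minimal sketch of TextRank, one of the candidates named above, for readers who want to see what a graph-based summarizer does. It is not a claim about what hn10 runs. It assumes sentences are already split, uses the overlap similarity from the TextRank paper, and the names `_words`, `_similarity`, and `textrank_scores` are illustrative only.

```python
import math
import re


def _words(sentence):
    """Set of lowercase words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def _similarity(a, b):
    """Word overlap normalised by the log of each sentence's length."""
    # Guard: with fewer than 2 words log(len) is 0 and the ratio is undefined.
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [_words(s) for s in sentences]
    weights = [[_similarity(bags[i], bags[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores
```

The summary is then the top few sentences by score, shown in their original order. Sentences that share vocabulary with many others rise to the top, while isolated ones sink to the baseline score.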


Does it summarize our comment threads or the pages it links to?


Just pages for now.


Alright! Did you ever try to run a summarizer on a post's comment section? It pretty much creates an "article" for you. I've always thought that article generation from comment threads is an area that should be explored.


js error: "get top stories error index.html:56:4"


Is this open source?



