Got that this is using NLTK. Can you comment on which summarization algorithm you are using (TextRank, SumBasic, etc.)? What's the motivation for choosing one algorithm over another, and how does it perform on average HN articles?
The summary seems to pick the most unique sentences, so maybe LexRank? Amazing though, I've been really wanting a way to skim HN articles quicker, since there is so much volume and you don't want to miss anything
There are a couple of things here:
1. scrape the web page to get the text content
2. use NLTK to process the text and get a summary and keywords
3. wrap it in a REST API and serve it as a web service
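For anyone curious what steps 1 and 2 might look like, here is a minimal sketch using only the Python standard library. It is not the code behind this site: the regex splitting stands in for NLTK's tokenizers, the scoring is a plain word-frequency heuristic rather than TextRank or LexRank, and the names `TextExtractor`, `html_to_text`, `summarize` and `STOPWORDS` are made up for illustration. Step 3 (the REST wrapper) is left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption for this sketch);
# NLTK's stopwords corpus would be a fuller choice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "this", "for", "on", "with", "as", "are", "was", "be"}


class TextExtractor(HTMLParser):
    """Step 1: collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(sentence):
    """Lowercase words with stopwords removed (stand-in for an NLTK tokenizer)."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, n_sentences=2, n_keywords=5):
    """Step 2: frequency-based extractive summary plus top keywords."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average word frequency, so long sentences are not favored by length.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    top = sorted(ranked[:n_sentences])  # restore original sentence order
    summary = " ".join(sentences[i] for i in top)
    keywords = [w for w, _ in freq.most_common(n_keywords)]
    return summary, keywords
```

Calling `summarize(html_to_text(page_html))` returns a `(summary, keywords)` pair. A real scraper would need something smarter than this to strip navigation and boilerplate from the page.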
You could probably get cleaner input for step 1 via the Mercury API [https://mercury.postlight.com/web-parser/] — it has a lot of affordances for different kinds of HTML formatting.
There are still posts with no link to the comments. Seems like it is happening whenever the summary is also missing below the title (currently "5. Ask HN: Leave job right before app goes to production?" is like this)
Alright! Did you ever try to run a summarizer on a post's comment section? It pretty much creates an "article" for you. I've always thought that article generation from comment threads is an area that should be explored.
I contributed a while ago and it was pretty solid.
https://github.com/codelucas/newspaper