
Analyzing Articles on Hacker News Using NLP - luu
http://nbviewer.jupyter.org/github/jayantj/news-analyze/blob/master/notebooks/Analyze%20HN%20using%20NLP!.ipynb
======
stuartaxelowen
> Firstly, any articles that received under 50 points were filtered out

Why? There's still a lot of information in the posts that didn't receive
significant interest, and unexplained filtering here seems more suspicious
than anything.

------
baccheion
HN users/voters seem habituated and predictable. What if this is applied to
link titles to predict likelihood of making it to the front page? What words
(or sequences of letters) are associated with popularity?
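
One rough way to start on the "what words are associated with popularity" question is to compute, for each title word, the share of titles containing it that reached the front page. The sketch below assumes you already have titles paired with a front-page flag; `word_hit_rates`, `min_count`, and the toy titles are all made up for illustration and are not taken from the linked notebook.

```python
import re
from collections import Counter


def word_hit_rates(titles, hit_flags, min_count=2):
    """Return {word: smoothed share of titles containing it that were hits}."""
    seen = Counter()
    hits = Counter()
    for title, hit in zip(titles, hit_flags):
        # Count each word once per title so repeats do not inflate it.
        for word in set(re.findall(r"[a-z0-9']+", title.lower())):
            seen[word] += 1
            if hit:
                hits[word] += 1
    # Laplace smoothing (+1 / +2) keeps rarely seen words away from 0.0 or 1.0.
    return {w: (hits[w] + 1) / (seen[w] + 2) for w in seen if seen[w] >= min_count}


# Made-up toy data, only to show the shape of the input.
titles = [
    "Show HN: a tiny parser",
    "Show HN: my weekend game",
    "Why parsers are hard",
    "A weekend in Lisbon",
]
hit_flags = [True, True, False, False]

rates = word_hit_rates(titles, hit_flags)
for word, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {rate:.2f}")
```

A per-word rate like this ignores word order, posting time, and the submitter, so at best it is a baseline to compare a real model against.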

I also wonder how many unique voters are present, especially regulars and
those who vote before a link hits the front page. I bet there aren't that
many. And I bet they are mostly INTJ. That is, content seems
curated/controlled by a handful of users (i.e., a bubble). What can be done to
buffer against bias? How can submissions automatically be surfaced/tested
(shown on the front page to patterned/known users) even before receiving any
votes?

I've always thought some percentage of the front page should be dedicated to
(randomly chosen, though slightly weighted) new submissions. Some percentage
of the page some percentage of the time to some percentage of users able to
vote. Or maybe the new tab should be shown inline on the right. That is, I'm
guessing most only see what's shown on the first page.
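
A minimal sketch of the "reserve some percentage of the front page for new submissions" idea, assuming a ranked list of regular stories and a pool of new ones with positive weights. The function name `mix_front_page` and the `page_size` and `new_share` parameters are hypothetical; nothing here reflects how HN actually ranks stories.

```python
import random


def mix_front_page(ranked, new_items, page_size=30, new_share=0.1, rng=None):
    """Fill most of the page from `ranked`, the rest from weighted random new items.

    `new_items` is a list of (item, weight) pairs; weights must be positive.
    """
    rng = rng or random.Random()
    n_new = min(int(page_size * new_share), len(new_items))

    # Weighted sampling without replacement: draw one, remove it, repeat.
    pool = list(new_items)
    picked = []
    for _ in range(n_new):
        weights = [weight for _, weight in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        picked.append(pool.pop(idx)[0])

    # Top-ranked stories take the remaining slots.
    page = [item for item in ranked if item not in picked][: page_size - n_new]

    # Drop each new story at a random position so it is not stuck at the bottom.
    for item in picked:
        page.insert(rng.randrange(len(page) + 1), item)
    return page


ranked = [f"r{i}" for i in range(20)]
new_items = [(f"n{i}", i + 1) for i in range(5)]  # slightly weighted, made-up values
page = mix_front_page(ranked, new_items, page_size=10, new_share=0.2, rng=random.Random(0))
print(page)
```

Showing the mix only "some percentage of the time to some percentage of users" would be a matter of calling this for a random subset of page loads and serving the plain ranked list otherwise.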

~~~
minimaxir
> HN users/voters seem habituated and predictable. What if this is applied to
> link titles to predict likelihood of making it to the front page? What words
> (or sequences of letters) are associated with popularity?

Apropos of nothing, I am working on building a model for predicting HN post
performance.

TL;DR it is not easy.

~~~
everdev
This was built in '17:
[https://intoli.com/blog/hacker-news-title-tool/](https://intoli.com/blog/hacker-news-title-tool/)

HN discussion:
[https://news.ycombinator.com/item?id=14400603](https://news.ycombinator.com/item?id=14400603)

~~~
minimaxir
Interesting and good to see that as a reference. However, I suspect that
approach overfits, and the author does not mention using a validation set.
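
To illustrate why a validation set matters here, the sketch below holds out part of the data and scores a deliberately overfit "model" that simply memorizes training titles. Everything in it (the helper names, the 80/20 split, the synthetic rows) is an assumption for illustration and says nothing about how the linked tool was actually trained.

```python
import random


def split_train_validation(rows, validation_share=0.2, seed=0):
    """Shuffle a copy of `rows` and hold out a share of it for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_share)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, rows):
    """Share of (title, label) rows where predict(title) matches the label."""
    return sum(predict(title) == label for title, label in rows) / len(rows)


# Synthetic (title, made_front_page) rows, only to make the example runnable.
rows = [(f"title {i}", i % 2 == 0) for i in range(50)]
train, validation = split_train_validation(rows)

# A "model" that memorizes the training titles and guesses False otherwise.
memory = dict(train)


def memorizer(title):
    return memory.get(title, False)


train_acc = accuracy(memorizer, train)
val_acc = accuracy(memorizer, validation)
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

The memorizer is perfect on the rows it has seen and no better than a constant guess on the held-out ones, which is the gap a training-only evaluation hides. For HN posts a split by submission date may be a more honest test than a random shuffle.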

~~~
everdev
Yes, apparently "Rust, Rust, Rust, Rust!" almost guarantees 1st page according
to the model.

------
TeMPOraL
A minor correction:

> _This was the time when SpaceX successfully launched and landed its
> satellites at sea._

They didn't launch any of their own satellites, and they haven't landed any
satellites at all :). I suppose the right word here would be "rockets".

------
gitinstinct
This is very cool research. I would love to see topics like this built into a
browser extension (or even into HN itself although that may be beyond the
scope of HN's core features). I find that there's a decent bit of content that
gets posted on HN that I'm not personally interested in and, on the flip side,
there are times when I want to go more in-depth on a topic but can't find more
posts that cover it. I don't really want a whole subreddit-style navigation
system, so some automatic topic tagging could be a nice middle ground.

------
pouta
Great job! Thank you for sharing this; I will surely contribute to it.

------
pX0r
Good work!

