

Which topics get the upvote on Hacker News? - datalink
http://blog.datadive.net/which-topics-get-the-upvote-on-hacker-news/

======
teraflop
For what it's worth: a scatter plot with lots of huge points, like the one
you've drawn for "upvotes vs. comments", is pretty useless for drawing
conclusions about the data. It tells you about the _support_ of the joint
distribution (the region on which it's non-zero) but very little about its
shape.

In particular, that graph could represent a fairly strong correlation (in the
R^2 sense), or a fairly weak one, or anything in between. If you want to say
something more quantitative about the data, you can do a linear regression and
look at the coefficients and residuals.
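
A minimal sketch of that regression in plain Python, assuming an ordinary least-squares fit of comments on upvotes. The data points and the `linear_fit` helper are made up for illustration and are not taken from the post:

```python
# Ordinary least-squares fit of comments on upvotes, in plain Python.
# The data points below are invented for illustration only.

def linear_fit(xs, ys):
    """Return slope, intercept, R^2 and residuals for a simple OLS fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus the value predicted by the line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared, residuals

upvotes = [10, 25, 40, 80, 150, 300]
comments = [2, 9, 12, 30, 41, 95]
slope, intercept, r_squared, residuals = linear_fit(upvotes, comments)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

Residuals that fan out or curve as upvotes grow would suggest a straight line is the wrong model, even if R^2 looks respectable.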

~~~
imh
When there's too much data for a scatter plot, a heat map will do nicely.
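
Roughly, a heat map is just a count of points per grid cell. A small sketch, assuming fixed-width bins. The points, the bin widths and the `bin_counts` helper are all illustrative, not from the post:

```python
# Bin (upvotes, comments) pairs into a coarse grid. The per-cell counts are
# what a heat map would colour. Data and bin widths are illustrative only.
from collections import Counter

def bin_counts(points, x_width, y_width):
    """Count how many points fall into each (x_bin, y_bin) cell."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts

points = [(3, 1), (5, 0), (8, 2), (12, 4), (14, 3), (90, 40)]
counts = bin_counts(points, x_width=10, y_width=5)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Unlike overlapping markers, the counts keep growing where the data is dense, so the shape of the distribution stays visible.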

~~~
ced
Or pass alpha=0.3 to the plotting function.

------
eridal
Some interesting chains pop up from the data!

    
    
      - google-microsoft-windows-video-browser-user-support-chrome
      - security-key-attack-password-hacker-encryption-network-secure
      - language-type-program-code-programmer-java-write-class
      - number-point-algorithm-value-example-result-set-problem
      - space-nasa-tesla-rocket-launch-start-china-nuclear
      - data-database-map-table-analysis-information-graph-model

------
dangowango
Some of the categories are quite surprising, for example:
space-nasa-tesla-rocket-launch-star-CHINA-nuclear as space; ruby not as
programming; com-http-www-EMACS-LIST-org-book-pdf as junk. Those seem very
specific, or plain wrongly classified. Maybe you can show them individually,
just as a big dump of generated graphs? No need to make another post about
it, but I'd like to see them in context. Thanks!

~~~
datalink
I think you're mixing up what topics are. The actual topics as generated by
LDA are the concatenated word lists (actually distributions over all words in
the corpus, of which I concatenate the top 8 words to generate a meaningful
descriptor of the topic). So
server-client-http-request-service-ruby-connection-user is one topic / word
distribution, in which "ruby" happens to be the 6th most probable word, likely
because it appears a lot in posts on servers, web services etc. It does not
mean that ruby, the word itself, is classified as server related. The same
applies to the other examples you gave.

The categories/domains I simply assigned manually, to show how one could
possibly interpret these word distributions that LDA generated.
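
To make the descriptor step concrete, here is a small sketch of building a label from one topic's word distribution by joining its most probable words. The probabilities and the `topic_label` helper are invented for illustration. Only the word order mirrors the example above:

```python
# Build a readable descriptor for one topic by joining its top-N words.
# The probabilities are made up. Only their ordering matters here.

def topic_label(word_probs, top_n=8):
    """Join the top_n most probable words of a topic with hyphens."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return "-".join(word for word, _ in ranked[:top_n])

# One hypothetical topic: a probability for each word in the vocabulary.
topic = {
    "server": 0.050, "client": 0.041, "http": 0.037, "request": 0.030,
    "service": 0.026, "ruby": 0.022, "connection": 0.019, "user": 0.017,
    "rocket": 0.0004, "password": 0.0009,
}

label = topic_label(topic)
print(label)
```

Every word in the vocabulary has some probability under every topic. The label only shows the head of the distribution, which is why a word like "ruby" can turn up in a topic that is mostly about servers.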

~~~
hurin
I think you might want a new classification approach.

~~~
datalink
Not sure what you mean by a new classification approach. There is no
classification here, since there are no labeled documents. This is purely
unsupervised topic modelling. The topics are mathematical objects. How they
are later named or grouped for better human readability is a subjective
matter.

------
karmacondon
I used a similar technology stack for categorizing bookmarks (boilerpipe +
gensim LDA). Interesting that we wound up choosing the same tools.

In the interest of reporting on failed experiments, I also tried a k-means
analysis written in PHP. It was slow and worthless; I wouldn't recommend
anyone else go down that road.

In terms of next steps, I've been trying to use the open source HLDA software
from David M. Blei's group [0] to do hierarchical clustering, to avoid having
to decide on the number-of-topics parameter. I haven't gotten it to compile on
my machine yet, though.

[0] http://www.cs.princeton.edu/~blei/topicmodeling.html

------
kitwalker12
Great analysis. You got me interested in the particular Java libraries you use
for content analysis.

So, according to your analysis, this particular post won't be getting too many
comments :P

~~~
ErikRogneby
Now I'm conscious of skewing the data by posting this comment...

I don't think the API exposes it, but the aggregate upvotes of the comments on
a post might be interesting. It's one thing to have a lot of comments, but a
measure of the quality of the discourse would be worth knowing.

------
kator
I'm surprised language wars didn't show up. It seems to me "X is better than
Y" threads always get a lot of heat and light.

~~~
datalink
The topic "language-type-program-code" is the 6th ranked topic out of 30 in
terms of comments, so it's pretty high. Considering the error bars, it could
possibly be even further up.

------
Mz
Could use some basic topic labels like "math" and "history" and "women's
issues" (or "women in STEM" or something) and "culture."

