
Words growing or shrinking in Hacker News titles: a tidy analysis - var_explained
http://varianceexplained.org/r/hn-trends/
======
flavio81
A frequent post then would be:

"Using VR to train a deep learning neural network on driving and react
correctly to unexpected conditions, a bot implemented via a microservices
stack using aws as a container and of course connected with cars and related
traffic devices via the IoT, logging unexpected events into a blockchain."

~~~
forgot-my-pw
Missing "HN" somewhere.

------
robteix
I'm surprised that "NSA" and "surveillance" are two of the fastest shrinking
words. I thought we saw them more now than ever. Shows how perception doesn't
always match reality.

~~~
jdminhbg
When the Snowden leaks first dropped, the front page was absolutely
overwhelmed with NSA news, to the exclusion of nearly everything else. It would
not be possible to keep that level of interest up without making this an
exclusively NSA/surveillance-driven site.

~~~
Houshalter
IIRC the mods also soft banned it because of that. So posts with the word
"NSA" in the title get penalized and ranked much lower than other posts. Hence
the shrinking.

~~~
dang
We did that for a while but stopped years ago. It was only needed
until the story barrage leveled off.

------
minimaxir
Hmm, the BigQuery HN dataset is now updated daily and contains comments as
well as stories? That's new, and I'll certainly give it another look for my
projects.

With the bigrquery R package
([https://github.com/rstats-db/bigrquery](https://github.com/rstats-db/bigrquery)),
you can access the HN dataset directly from R, using dplyr syntax too (for
simple queries at least; you can pass raw SQL for complex queries).

As noted, the resulting dataset of words is large, so mapping the words in
BigQuery itself may be more practical (using a combo of SPLIT and UNNEST with
standard SQL), although of course you can't do complex operations like
logistic regression or splines there.
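
A minimal sketch of the SPLIT + UNNEST idea, written in Python so the query text and a local equivalent sit side by side. The table name `bigquery-public-data.hacker_news.full` and its `title` and `type` columns are assumptions about the public dataset, and `count_title_words` is a hypothetical helper that only mirrors the query's logic on an in-memory list; nothing here is taken from the article's own code.

```python
from collections import Counter

# Standard SQL that tokenizes titles inside BigQuery.
# Assumption: the table and column names below match the public HN dataset.
WORD_COUNT_SQL = """
SELECT word, COUNT(*) AS n
FROM `bigquery-public-data.hacker_news.full`,
     UNNEST(SPLIT(LOWER(title), ' ')) AS word
WHERE type = 'story' AND title IS NOT NULL
GROUP BY word
ORDER BY n DESC
"""


def count_title_words(titles):
    """Local mirror of the query: lowercase, split on single spaces, count."""
    counts = Counter()
    for title in titles:
        if title is None:
            continue  # corresponds to the "title IS NOT NULL" filter
        counts.update(title.lower().split(" "))
    return counts


if __name__ == "__main__":
    sample = ["Show HN: A tidy analysis", "show hn: another analysis", None]
    print(count_title_words(sample).most_common(3))
```

Only the aggregated word counts would leave BigQuery this way, which is far smaller than one row per word; the modeling step would still have to happen client-side.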

------
SippinLean
>I don’t currently have a guess for why “million” and “billion” had sudden
dropoffs in 2014. Is it some artifact of the Hacker News policy, with the word
becoming edited or deleted in newer posts? Or is it a real change in what the
site discusses?

Any guesses on this one?

------
aswanson
A more interesting analysis would be comment length.

~~~
minimaxir
In my old analysis
([http://minimaxir.com/2014/10/hn-comments-about-comments/](http://minimaxir.com/2014/10/hn-comments-about-comments/)),
it's not that interesting.

Comments are getting longer over time on average
([http://minimaxir.com/img/hn-comments/monthly_average_words.png](http://minimaxir.com/img/hn-comments/monthly_average_words.png)),
and there is a slight positive correlation between comment score and comment
length
([http://minimaxir.com/img/hn-comments/distribution_comment_points_words.png](http://minimaxir.com/img/hn-comments/distribution_comment_points_words.png)),
but that can't be remade with the BigQuery dataset since comment scores are no
longer public.
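
For anyone who still has scored comments from an older dump, the two measurements mentioned here could be computed roughly as follows. This is only a sketch: the `(month, text)` input shape and both function names are hypothetical, and it is not the code behind the linked charts.

```python
from collections import defaultdict
from math import sqrt


def monthly_average_words(comments):
    """comments: iterable of (month, text) pairs, month like '2014-01'.

    Returns {month: mean number of whitespace-separated words}.
    """
    totals = defaultdict(lambda: [0, 0])  # month -> [word_sum, comment_count]
    for month, text in comments:
        totals[month][0] += len(text.split())
        totals[month][1] += 1
    return {m: s / n for m, (s, n) in sorted(totals.items())}


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Calling `pearson(word_counts, scores)` on paired per-comment lists would give the score/length correlation; a value slightly above zero would match the "slight positive correlation" described above.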

------
drenvuk
It would be nice to see a comparison of fastest growing words between the last
5 years vs 10 years ago. I'm wondering about the demographics of this site and
if they've changed.

------
joecool1029
I am extremely surprised rust wasn't included in here.

~~~
var_explained
I've got a followup coming about what words lead to _upvotes_, and rust
features quite prominently there!

