

Hacking Hacker News Headlines - th0ma5
http://metamarketsgroup.com/blog/hacking-hacker-news-headlines/

======
vnorby
So the optimal hacker news headline is as follows:

 _Why showing the future is essential to acquiring data_

Noted.

~~~
sesqu
_Showing why acquiring data is essential to the future_

I have a feeling that headline wouldn't do all that well, but it does seem to
be in keeping with Google's culture.

~~~
Jarred
Testing your hypothesis here - <http://news.ycombinator.com/item?id=2520098>

------
agscala
I'm not sure if I'm being naive, but do the '|' and the '-' have some sort of
NLP significance?

From the article, what's the difference between "data |" and "data -"?

~~~
HardyLeung
I think '|' denotes the beginning or the end of the title, and '_' denotes a
wildcard.
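
If so, the features were probably generated with boundary padding, something
like this (my guess at the preprocessing, not anything from the article):

    def padded_bigrams(title):
        # '|' marks the start and the end of the title, so a feature
        # like "data |" means the title *ends* with "data"
        tokens = ["|"] + title.lower().split() + ["|"]
        return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

    padded_bigrams("Why showing the future is essential to acquiring data")
    # [..., 'acquiring data', 'data |']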

------
joshu
I feel like you should work in absolute points-space, rather than rank-space.

Also, no clue if the factors you pulled out are orthogonal.

~~~
rauljara
True. But credit where credit is due. Very cool analysis for a throwaway blog
post specifically manufactured to garner karma.

The only thing I'll add as a data critique: the negative factors are reported
as things to avoid. But, in fact, all of the reported-on titles actually made
it onto the Hacker News front page (1). There are an awful lot of submissions
that never make it that far. In fact, the significance of the findings
indicates that those terms make it onto the front page A LOT (2). I don't
think the negatively correlated terms should necessarily be viewed as
failures, just as less successful. My own suspicion is that those titles do
draw eyeballs, but someone using titles like those is also likely to be kind
of a bad writer, which keeps those stories from getting upvotes. It would be
very hard to prove a correlation between quality of title and quality of
writing, though.

(1) I believe. Hard to tell from the post.

(2) Otherwise there wouldn't be enough data for them to be significant.

~~~
mbreese
_It would be very hard to prove a correlation between quality of title and
quality of writing, though_

Especially since the author of the linked article isn't always the one who
submits the article and therefore gets to choose the HN title.

~~~
ma2rten
The point of the article, as I understand it, was what influence the title
has independent of the content. Stuff like quality of writing is just noise
that stops mattering once you have enough data.

------
pge
And the best headline for Hacker News is a headline about hacking Hacker News
Headlines :) Way to put the research to good use...

~~~
HardyLeung
The submission should have been titled:

"Essential Lessons Showing How to Hack Hacker News with Data Visualization"

and it will be ranked #1 in no tim... never mind :D

------
vorbby
As a counter-point...

This headline uses none of the hacks described in the article, yet it is
ranking quite well.

Perhaps people should focus on letting the content speak for itself rather
than using tricks like this?

~~~
derefr
It does, however, use the trick of being alliterative—which an N-gram analysis
of terms will miss.

~~~
ma2rten
it's short, which is one of the hacks

------
vlokshin
64% isn't the greatest accuracy, but you guys were transparent about
everything and the numbers look legit. Awesome job putting this together!

------
gjm11
Very nice, but the analysis seems to assume that HN rank is determined by the
headline and not by the content. (More precisely: for the analysis to give
useful guidance to would-be HN headline writers, it needs not to be the case
that content features correlated with headline features make a big difference
to HN rank.)

My proposal for a good headline according to the numbers in this article:
_Showing why impossible future controversy survived the problem could hire
data_. Score: 1.3 (could) + 1.2 (problem) + 1.3 (survived the) + 1.0
(controversy) + 0.9 (impossible) + 0.7 (why ___ future) - 3.3 (11 words) + 2.6
(showing) + 0.5 (hire) + 1.9 (data [END]) = 8.1. For comparison, _Why showing
the future is essential to acquiring data_ gets 1.4 (essential) + 0.7 (why ___
future) - 2.7 (9 words) + 2.6 (showing) + 1.7 (acquiring) + 1.9 (data) = 5.6
-- except that it doesn't really get the points for "essential" (not at start)
or "why ___ future" (two words in between) or "acquiring" (not in second
place, word isn't quite right). Of course my headline has the little drawback
of being total nonsense.
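
Or, redoing that sum in code (coefficients copied from the article's table,
and ignoring whether the patterns would actually match):

    # per-feature scores for the nonsense headline, straight from the table
    score = sum([
        1.3,   # could
        1.2,   # problem
        1.3,   # survived the
        1.0,   # controversy
        0.9,   # impossible
        0.7,   # why ___ future
        -3.3,  # 11 words
        2.6,   # showing
        0.5,   # hire
        1.9,   # data [END]
    ])
    print(round(score, 1))  # 8.1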

------
brendano
Great -- I'm hoping L1-regularized logistic regression will become the
standard first thing to try in these quick-n-dirty "predict response variable
from text" experiments. That's our approach too. (I assume this is L1 or
similar since you mention regularization causing feature selection.)

[[ Edit: deleted question about what 'k' is for the discretized 1{ rank <= k }
response. It's mentioned in the article ]]

~~~
joeraii
yeah pretty strong l1--most features were 0. we binarized rank on
I_{rank<=20}. it turns out there are tons of articles beyond the first page
that stay low forever. check out the interactive viz vad made:
<http://hn.metamx.com> (warning 2.6MB compressed js ahead)
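
the shape of it is basically textbook, something like this (toy data, not our
actual code):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # toy stand-ins for the scraped titles and their best ranks
    titles = ["Why showing the future is essential to acquiring data",
              "My weekend project"]
    ranks = [3, 250]

    # unigram + bigram indicator features from the titles
    X = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(titles)
    # binarized response: I_{rank <= 20}
    y = (np.array(ranks) <= 20).astype(int)

    # small C = strong l1 penalty, so most coefficients end up exactly 0
    model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    model.fit(X, y)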

~~~
brendano
Another question, how are standard errors calculated? I assume they're not
from the bootstrapping since the p-values clearly aren't from the standard
errors ( +/- 1.96*se is crossing coef=0 for several cases but with small
p-values). The other way I would think to get p-values would be the percentage
of bootstrap replicates that have (coef==0). But with only 20 replicates
you're stuck with p-values that are multiples of 0.05.

I'm genuinely curious how to do coef significance testing for L1-regularized
models. I once saw someone ask this at a Tibshirani talk and he said "oh we
have no idea, we've resorted to the bootstrap before".

~~~
joeraii
to be honest we just recorded the coeff values for each replicate and did the
bootstrap variance calculation.

% of replicates with (coef==0) is potentially much more clever, especially
since that's the test we want to perform anyway. i'll run that over the data
and see what changes.

~~~
equark
I think the question is these don't look like NormalCDF(coef/se) p-values
given the coef and se you report. They tend to be too small.

From a frequentist perspective, counting zeroes doesn't make much sense,
because under the null of coef=0 there is still a chance you don't estimate
coef=0, even after regularization.

~~~
brendano

        I think the question is these don't look like NormalCDF(coef/se) p-values given the coef and se you report.  They tend to be too small.
    

right that's my question

~~~
joeraii
interesting yeah some of them definitely don't look right. the output is from
scipy's stats.ttest_1samp
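
roughly this, if anyone wants to poke at it:

    import numpy as np
    from scipy import stats

    # coefficient estimates for one feature across the 20 replicates
    # (made-up numbers)
    reps = np.array([0.0, 0.8, 1.1, 0.0, 0.9] * 4)

    # what we ran: one-sample t-test of the replicates against 0.
    # note ttest_1samp uses std(reps)/sqrt(n) as its standard error,
    # which is sqrt(20)x smaller than the plain bootstrap SE std(reps),
    # so its p-values will come out smaller than NormalCDF(coef/se)
    t, p = stats.ttest_1samp(reps, 0.0)

    # brendano's alternative: fraction of replicates shrunk exactly to 0
    p_zero = (reps == 0.0).mean()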

------
powdahound
It'd be interesting to see how the domain shown next to the title factors into
this too. Seems like everything from GitHub always does very well.

------
kingsidharth
I thought this problem was limited to Digg, but I've experienced the same
with my submissions. It's funny that people judge content by headlines; we
need a better way.

~~~
roc
There is: delegate.

Find someone who reads Hacker News [1] and blogs. Subscribe to their RSS feed.
There's a trick to [1], no doubt. But for most people the time savings far
outweigh the occasional mismatch between your interests and the delegate's.

[1] For the same types of articles you do, ideally.

~~~
ultrasaurus
That's exactly what I do. Though I don't read HN every day, I have a script
off in the cloud watching the comments RSS feed and showing me every article
that certain people comment on. As a quality filter, I find it better than
the front page.
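
The core of it is just feed polling plus a username filter, roughly like this
(the feed URL is a placeholder, and this isn't the actual script):

    import feedparser  # assumes the feedparser library

    PEOPLE = {"someuser", "anotheruser"}  # commenters I use as a filter
    FEED = "http://example.com/hn-comments.rss"  # placeholder feed URL

    seen = set()
    for entry in feedparser.parse(FEED).entries:
        # entry.author is the commenter, entry.link points at the story
        if entry.get("author") in PEOPLE and entry.link not in seen:
            seen.add(entry.link)
            print(entry.title, entry.link)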

~~~
kingsidharth
Is that script on GitHub, by any chance? Would love to have a look at it.

------
bzupnick
also cleaning up the graph somehow would be great

