
Analyzing 10,000 Show HN Submissions - anton_tarasenko
http://antontarasenko.github.io/show-hn/
======
minimaxir
A large number of Show HNs are not startups; they are personal projects.

Relatedly, using "the site is not dead" as a proxy for _the startup is alive_
is a bad idea. People do not shut everything down when a Hacker News
submission does not get upvotes ( _especially_ in the case of Show HN, where
many projects are hosted on GitHub Pages and are free to host, although you
account for that in the regression).

The regression has a few issues:

> It's a classification problem (alive or dead) so OLS doesn't make sense.

> Score and Comments are multicollinear and cannot both be in the same model.

> You don't answer or give statistics toward "how well" the model predicts.

> Related to the comment earlier, you don't comment on the magnitude of the
> on_GitHub coefficients, which are _huge_ and skew the entire result of the
> regression!

While I always appreciate analyses of HN data, the conclusions raise more
questions than answers.

~~~
anton_tarasenko
0\. I measure Show HN projects, some of which are startups. The dead/alive
status as success is a proper measure for survival. For success, see the
section on commercial success.

1\. OLS works fine in classification problems. And it has advantages. For
example, see
[http://chrisblattman.com/2015/07/22/statistician-neal-beck-j...](http://chrisblattman.com/2015/07/22/statistician-neal-beck-just-justified-my-longstanding-hatred-and-loathing-of-logit/)

And I purposefully included results from three models, not one. They all say
the same thing.

2\. Multicollinearity refers to perfect multicollinearity. Having correlated
independent variables (as in score and comments) is okay. See
[http://stats.stackexchange.com/questions/86269/what-is-the-e...](http://stats.stackexchange.com/questions/86269/what-is-the-effect-of-having-correlated-predictors-in-a-multiple-regression-mode)

3\. I mention R² as one measure of predictive power.

4\. The regressions without GitHub-hosted projects return the same results.

~~~
christopheraden
A couple of questions.

> OLS works fine in classification problems. And it has advantages.

Do you have more explanation of these advantages? I read through the link you
sent, and a bit more about linear probability models. Such things were never
discussed in my statistics curriculum (BS, MS, PhD), except for motivating why
logistic regression was necessary. I'm not sure I understand the economist's
arguments in favor of LPM. Both the interpretation and the distribution of the
test statistics will be totally different with OLS versus Logistic Regression,
and the overall probability of a defunct project is pretty small
(\hat{P}(y=0) = 0.07), small enough that there would be pretty big differences. To
be clear, my reservation is with the p-values in the OLS model, not the
predictions it generates. While the models agree on the direction of the
covariates, the magnitudes are quite different, even when you convert
logit/probit to be on the same scale as LPM.
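To make the scale conversion concrete, here is a rough sketch. It assumes the usual textbook result that a logit coefficient beta shifts the predicted probability by roughly beta * p * (1 - p) at probability p. The function name and the coefficient value are made up for illustration and are not taken from the article's regression.

```python
def logit_marginal_effect(beta: float, p: float) -> float:
    """Approximate change in P(y=1) per unit of x for a logit coefficient,
    evaluated at predicted probability p (slope of the logistic curve)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return beta * p * (1.0 - p)


# Hypothetical logit coefficient, for illustration only.
beta = 1.0

# At p = 0.5 the scale factor p * (1 - p) is at its maximum of 0.25.
effect_at_half = logit_marginal_effect(beta, 0.5)

# With P(y=0) = .07, i.e. about 93% of projects alive, the factor is far smaller.
effect_at_observed = logit_marginal_effect(beta, 0.93)

print(effect_at_half, effect_at_observed)
```

The same logit coefficient therefore maps to a much smaller probability change near the observed base rate than near 0.5, which is one plausible reason the LPM and logit magnitudes differ.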

> Multicollinearity refers to perfect multicollinearity.

Perfect multicollinearity will definitely mess up the estimation, but even if
Score and Comments are not perfectly collinear, it's difficult to talk about
each one's effect on the probability individually, as is the interpretation of
coefficients in a (logistic) regression. What do the VIFs look like for
Score and Comments, in particular?
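A quick way to get a feel for the VIF question, as a sketch only: when just two predictors are considered, the VIF of either one reduces to 1 / (1 - r^2), where r is their Pearson correlation. The article's model has more covariates than that, so this is a lower-effort approximation, and the numbers below are invented rather than real HN data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF for either predictor when the model contains only x and y."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Invented stand-ins for (log) score and (log) comments.
score = [1, 2, 3, 4]
comments = [2, 1, 4, 3]
print(vif_two_predictors(score, comments))
```

A value near 1 means the two predictors carry mostly separate information; larger values mean the individual coefficients become harder to pin down.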

> I mention R² as one measure of predictive power.

But the outcome is binary, so you'll have a similar issue as Minimaxir's first
point about OLS. If you wanted to talk about prediction accuracy, what about a
confusion matrix, misclassification rate, or specificity/sensitivity/F1?
Granted, you'll not want to predict on the same tagged examples that you
trained the model on, but maybe you could split it 80-20? Or tag another
20-50? There are also R²-like measures you can use when the dependent variable
is binary (a whole class of pseudo-R² measures).
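The hold-out evaluation suggested here could look roughly like the following. This is a sketch in plain Python with hypothetical helper names; it assumes labels are coded 1 = alive and 0 = dead, and it says nothing about how the predictions themselves are produced.

```python
import random


def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed and hold out test_share of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived rates for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "misclassification": ratio(fp + fn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

With a base rate around 93% alive, a model that always predicts "alive" would already have a low misclassification rate, so specificity (how many dead projects are caught) is probably the number to watch.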

I would be curious to see the relationship between these predictors and the
response. It's usually been my experience that linearity is a strong
assumption to make, and that I'd expect for something like comments or score
that once it reached a certain threshold, there was no extra value added by
getting more comments/score. Are the log-score and log-comments linear over
their entire support?
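One low-tech way to eyeball that linearity question is to bin submissions by log-score and look at the share alive in each bin. A minimal sketch, assuming `log1p` as the transform (so that a score of 0 is defined) and unit-width bins; both choices, and the function name, are assumptions rather than anything the article does.

```python
import math
from collections import defaultdict


def alive_rate_by_log_bin(scores, alive, width=1.0):
    """Return {bin_index: share alive}, where bin_index = floor(log1p(score) / width)."""
    totals = defaultdict(int)
    alive_counts = defaultdict(int)
    for score, is_alive in zip(scores, alive):
        bin_index = int(math.log1p(score) // width)
        totals[bin_index] += 1
        alive_counts[bin_index] += is_alive  # labels are 0 or 1
    return {b: alive_counts[b] / totals[b] for b in sorted(totals)}


# Invented data, for illustration only.
scores = [0, 1, 2, 5, 10, 100]
alive = [0, 1, 1, 1, 1, 1]
print(alive_rate_by_log_bin(scores, alive))
```

If the rates climb steadily across bins, a linear term in log-score looks reasonable; if they flatten out after some bin, that would suggest the threshold effect described above.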

------
timr
The dataset interpreted my site as dead because it 301 redirected to the SSL
connection. This is probably quite common, so take the living/dead stats with
a grain of salt.

~~~
Dru89
Came here to say exactly this. Most stats should probably either treat 3XX as
"success" or follow the link to figure out what the next location's status is.
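The "follow the link" option could be sketched like this. The `fetch` callable is a hypothetical stand-in for whatever HTTP client the analysis used; it is assumed to return a status code and the `Location` header (or `None`) without following redirects itself. A fake lookup table replaces the network so the example runs offline.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# A fetcher returns (status_code, location_header_or_None) for a URL.
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def final_status(url: str, fetch: Fetcher, max_hops: int = 5) -> int:
    """Follow 3XX redirects for at most max_hops hops and return the last status seen."""
    status, location = fetch(url)
    hops = 0
    while 300 <= status < 400 and location and hops < max_hops:
        url = urljoin(url, location)  # Location may be a relative URL
        status, location = fetch(url)
        hops += 1
    return status


def is_alive(url: str, fetch: Fetcher, max_hops: int = 5) -> bool:
    """A site counts as alive if the redirect chain ends in a 2XX response."""
    return 200 <= final_status(url, fetch, max_hops) < 300


# Fake responses standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": (301, "https://example.com/"),
    "https://example.com/": (200, None),
    "http://gone.example/": (404, None),
}


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    return FAKE_WEB.get(url, (404, None))


print(is_alive("http://example.com/", fake_fetch))
```

The hop limit is there so a redirect loop ends up classified as not alive instead of hanging the crawl.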

------
xando
This is a really great overview.

Not complaining, just a note. Those counts are from different points in time
of Hacker News, which means that the chance of getting points grows with the
user base. Something extremely popular in 2011 can't compete with something in
2016.

It's probably hard to guess how many users HN had at those points in time. But
maybe adding a section by year could help a bit.
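Short of knowing the user counts, one possible workaround is to compare each submission only with others from the same year. A minimal sketch, assuming each row is a `(year, score)` pair; the function name and the sample numbers are invented.

```python
from collections import defaultdict


def percentile_within_year(rows):
    """For each (year, score) row, return the share of same-year rows
    whose score is less than or equal to this row's score."""
    scores_by_year = defaultdict(list)
    for year, score in rows:
        scores_by_year[year].append(score)

    percentiles = []
    for year, score in rows:
        peers = scores_by_year[year]
        percentiles.append(sum(1 for s in peers if s <= score) / len(peers))
    return percentiles


# Invented data: a score of 50 ranks very differently in 2011 and in 2016.
rows = [(2011, 50), (2011, 10), (2016, 50), (2016, 200), (2016, 400), (2016, 20)]
print(percentile_within_year(rows))
```

A within-year rank like this sidesteps the growing user base, at the cost of assuming that each year's submissions are of comparable quality.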

Anyway, Anton, great job.

------
deanclatworthy
How were you able to use the crunchbase data set without paying their high
prices?

~~~
anton_tarasenko
I had their Excel export file c. 2015, when it was still free.

~~~
minimaxir
The last export date was 2013, which they still offer:
[https://data.crunchbase.com/page/pricing](https://data.crunchbase.com/page/pricing)

~~~
anton_tarasenko
No, it's the new conditions. Some time ago they just sent full Excel files to
journalists and researchers.

------
joeblau
Heh, I have 3 projects on that list and I'm still running two of the three.
One was a flappy bird clone in Swift; the other is a Touch Visualizer for iOS
and the third is [https://www.gitignore.io](https://www.gitignore.io). Pretty
interesting analytics; it's funny because I'm in the process of migrating the
touch visualizer to a Swift project and a new project owner.

------
maniyamamoto
Interesting. Thanks for putting this together!

------
BinaryIdiot
I'm just happy my msngr.js project is in here; the last two or so posts about
"Show HN" analysis left it out and it made me mildly sad.

