Popularity of technology on Stack Overflow and Hacker News: Causality Analysis (github.com)
151 points by l____ on June 10, 2018 | 58 comments

None of this is causal. For a problem to be, in a statistical sense, causally identified there must be some random or as-if random manipulation of treatment. The two major ways of thinking about causality in statistics are Judea Pearl's DAGs (a representation of causes and effects as a directed acyclic graph where pathways between variables must be clear of "colliders" which threaten causal validity) and the Neyman-Rubin causal model (also called "potential outcomes", where a unit's hypothetical outcomes under both treatment and control are considered).

One example of an identification strategy here would be to find two languages that are identical in all regards (including in their overall popularity and exposure across the world), but where one was more popular than the other on SO specifically. This would be a matched selection on observables strategy, where selection into "treatment" (popularity on SO) is not globally random, but random conditional on certain pre-treatment covariates, such that the non-treatment potential outcome of both languages would be expected to be the same and the difference between the observed outcomes (the level of popularity of both languages on HN) is only a product of the treatment.

Here's a simple inferential threat: the author treats the popularity of a technology on SO as causing the popularity of that technology on HN. What if, instead, some third common cause caused both, but it caused SO spikes faster than HN spikes? Now, in the world I've described (where the true treatment effect is zero), what statistical test involving comparing SO and HN data, even incorporating temporal ordering, would correctly come up with an estimate of 0? If your answer does not come up with an estimate of 0, then its real-world causal estimate is also presumptively wrong.
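To make the thought experiment above concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the hidden driver `z`, the two-period delay, the noise levels); it is not the author's data or method. SO and HN both respond to `z`, SO two periods earlier, and neither series affects the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
lag = 2

# Hypothetical hidden driver (e.g. outside buzz about a technology).
z = rng.normal(size=n + lag)

# so[t] follows z[t + 2], hn[t] follows z[t]: HN reacts two periods after SO.
# Neither series enters the other's equation, so the true SO -> HN effect is 0.
so = z[lag:] + 0.5 * rng.normal(size=n)
hn = z[:n] + 0.5 * rng.normal(size=n)

# Lagged correlations: "SO leads HN" looks strong, "HN leads SO" looks absent.
corr_so_leads = np.corrcoef(so[:-lag], hn[lag:])[0, 1]
corr_hn_leads = np.corrcoef(hn[:-lag], so[lag:])[0, 1]

# Naive lagged regression of HN on past SO.
y = hn[lag:]
X_naive = np.column_stack([np.ones(n - lag), so[:-lag]])
naive_slope = np.linalg.lstsq(X_naive, y, rcond=None)[0][1]

# Same regression, but also adjusting for the (normally unobserved) driver.
X_adj = np.column_stack([np.ones(n - lag), so[:-lag], z[lag:n]])
adjusted_slope = np.linalg.lstsq(X_adj, y, rcond=None)[0][1]

print(f"corr(SO leads HN) = {corr_so_leads:.2f}")
print(f"corr(HN leads SO) = {corr_hn_leads:.2f}")
print(f"naive slope = {naive_slope:.2f}, adjusted slope = {adjusted_slope:.2f}")
```

In this toy world any test that only sees `so` and `hn` reports a sizeable lagged relationship, while the slope on past SO drops to roughly zero once the common cause is included. In real data the driver is usually not observed, which is the point of the objection.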

I also have concerns about how the author measured both treatment and outcome.

Overall I think there is an interesting DESCRIPTIVE (non-causal) question somewhere in this article, but it's bogged down by the author trying to apply something they heard about from a Wikipedia article as though it were a substitute for taking causality seriously. We've all heard Alexander Pope's adage that "a little knowledge is a dangerous thing".

Fair point, thank you for the feedback. As I wrote, I did not detect causation (except for a couple of examples in which the null hypothesis of the Granger causality test was rejected). The aim of the analysis was to find relationships between data from Hacker News and Stack Overflow without checking other data sources, which could influence both websites.

I neither said that I had found causation nor wrote that I had not. I specifically stated that I tested the hypothesis of Granger causality, which does not mean causality. Even if the main question of the article was causality, I could not answer it (and I stated this in the text).

Nevertheless, I agree that I should have probably been more careful with the term "causality" so thank you once again for pointing this out.

Measuring causality is hard. Without randomized controlled trials, you pretty much need to assume some causal graph structure (plain data is not enough).

See: "ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus" http://www.inference.vc/untitled/

When it comes to this SO vs HN article, based on the data one can look at correlations. Even if there are some delayed correlations, that still does not imply causation (e.g. the HN crowd may be faster to post links to new technologies than the SO crowd is to post questions).

Sure, but you need to make your assumed graph convincing. As the GP said,

Here's a simple inferential threat: the author treats the popularity of a technology on SO as causing the popularity of that technology on HN. What if, instead, some third common cause caused both, but it caused SO spikes faster than HN spikes?

You either need to account for this, or make a convincing argument that no such common cause exists.

Sure, I fully agree with that.

>> For a problem to be, in a statistical sense, causally identified there must be some random or as-if random manipulation of treatment.

It would be good to see your ideal example of a causal design, so that non-statistics people can understand your long note better.

You take a random sample of 1,000 white men between the ages of 45 and 55, who have lived in New England for at least 10 years, with no known history of heart disease. You randomly split them in half. You give half of them a supplement to take every day for 12 months, and you give the other half a placebo. If the number of heart attacks in the placebo sample is greater than in the treatment sample, you have some believable evidence that the supplement can help prevent unexpected heart attacks, at least in white men in their 40s and 50s.

The idea is that you've controlled for just about every factor that could affect the rate of unexpected heart attacks, or those factors are evenly distributed throughout both samples because you were careful to sample randomly. Therefore, if there is a difference between the groups, on average, it must be because of the treatment that you introduced to one group and not the other.

I'm hand-waving, of course, and I'm sure there are medical researchers out there who will read my study design and laugh at how badly controlled it is. But that should give you the general picture of one common method used to perform "causal" analysis.
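A rough simulation of the trial described above may help non-statisticians see why random assignment works. All numbers are made up for illustration: a 10% baseline risk, a hypothetical smoking effect of +15 points, and an assumed true supplement effect of -5 points.

```python
import numpy as np

TRUE_EFFECT = -0.05  # assumed effect of the supplement on heart-attack risk

def run_trial(rng, n=1000):
    """Simulate one randomized trial and return the difference in attack rates."""
    # Hypothetical risk factor that the experimenter does not control.
    smoker = rng.random(n) < 0.3

    # Random assignment: exactly half of the sample gets the supplement.
    treated = np.zeros(n, dtype=bool)
    treated[rng.permutation(n)[: n // 2]] = True

    risk = 0.10 + 0.15 * smoker + TRUE_EFFECT * treated
    heart_attack = rng.random(n) < risk
    return heart_attack[treated].mean() - heart_attack[~treated].mean()

rng = np.random.default_rng(1)
estimates = np.array([run_trial(rng) for _ in range(2000)])

print(f"mean estimate over 2000 trials: {estimates.mean():.3f}")
print(f"spread of single-trial estimates (std): {estimates.std():.3f}")
```

Averaged over many hypothetical repetitions the estimate sits close to the assumed effect even though smoking is never measured or adjusted for. Any single trial of 1,000 people is still noisy, as the spread shows.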

Great example. In the design you propose, in expectation, you would have an unbiased causal inference. We would probably want to check for pre-treatment balance between groups to make sure that stochastic (chance) imbalance did not emerge even though the process itself is good. I don't know anything about heart attacks so I don't have the subject matter knowledge here, but imagine that smoking causes heart attacks. If that's the case, although your design should not cause the presence of smokers among treated and control units to systematically vary, maybe it did by chance. We'd want to assess balance. Same with any other potential confounders.

Another technique we might use is a blocked (or stratified) random sample. Knowing that there will be both smokers and non-smokers, we recruit two separate samples, and randomize treatment assignment within each. This ensures that smoking status does not predict treatment assignment and guards against some potential threat from overall randomization.
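A minimal sketch of blocked randomization, assuming smoking status is known before assignment. The helper name `block_randomize` and the 30% smoking rate are hypothetical.

```python
import numpy as np

def block_randomize(strata, rng):
    """Treat half of each stratum (if a stratum is odd, the extra unit is a control)."""
    strata = np.asarray(strata)
    treated = np.zeros(len(strata), dtype=bool)
    for level in np.unique(strata):
        members = np.flatnonzero(strata == level)
        chosen = rng.permutation(members)[: len(members) // 2]
        treated[chosen] = True
    return treated

rng = np.random.default_rng(2)
smoker = rng.random(1000) < 0.3          # hypothetical pre-treatment covariate
treated = block_randomize(smoker, rng)

print("share of smokers, treated:", smoker[treated].mean())
print("share of smokers, control:", smoker[~treated].mean())
```

Because assignment happens separately inside each block, the share of smokers in the two arms can differ only through rounding, rather than by chance as in simple randomization.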

We could also mitigate the imbalance that does exist by doing a matched analysis, where each treated unit is paired with a control unit that looks most like him (some control units are reused). Or we could match on propensity scores. Or we could weight on inverse propensity weights. Or we could weight using covariate balancing. Or...
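As one illustration of the weighting options listed above, here is a sketch of inverse propensity weighting on simulated observational data. It assumes smoking is the only confounder and that it is observed; the uptake rates and risks are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
TRUE_EFFECT = -0.05  # assumed effect of the supplement

smoker = rng.random(n) < 0.3
# Observational setting (assumed): smokers are less likely to take the supplement.
treated = rng.random(n) < np.where(smoker, 0.2, 0.6)
risk = 0.10 + 0.15 * smoker + TRUE_EFFECT * treated
outcome = (rng.random(n) < risk).astype(float)

# Naive comparison mixes the supplement effect with the smoking imbalance.
naive = outcome[treated].mean() - outcome[~treated].mean()

# Propensity score estimated as the treated share within each smoking stratum.
e_hat = np.where(smoker, treated[smoker].mean(), treated[~smoker].mean())
weights = np.where(treated, 1.0 / e_hat, 1.0 / (1.0 - e_hat))

ipw = (np.average(outcome[treated], weights=weights[treated])
       - np.average(outcome[~treated], weights=weights[~treated]))

print(f"naive difference: {naive:.3f}")
print(f"IPW estimate:     {ipw:.3f}")
```

The naive difference overstates the benefit because the treated group contains fewer smokers, while the weighted estimate lands near the assumed effect. This only works here because the one confounder is measured, which real observational studies rarely get to assume.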

My point in doing this info dump is to a) back up nerdponx's example, which is great and b) illustrate how there's a lot to learn about how statisticians have taken the problem of causal analysis seriously and developed techniques appropriate for answering causal questions.

People on the CS side of things tend to use Pearl's DAGs for conceptualizing this stuff. I'm on the stats/econ side of things so I use Neyman-Rubin. They're equivalent. Allow me to suggest Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences, as a good textbook that we assign to graduate students learning this stuff. Some of my students tell me the "Causal Inference Mixtape" is popular among people who want less statistical theory and more "what should I do as a practitioner". A virtue of both the resources I just mentioned is that they discuss not just experimental designs but also observational data studies, like the one the original post would have wanted to conduct.

> “After each time series is transformed (if necessary) to a stationary one, a Granger causality test is performed.“

This is actually methodologically incorrect. You test Granger non-causality by taking a VAR system and you do not difference the data for stationarity! You use the cointegration order to determine how many extra lags of the exogenous variable to use beyond the order p that defines the lags tested against the null hypothesis (what the authors mention is tested by AIC, BIC, and observing autocorrelation of residuals).
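A single-equation sketch of the lag-augmented idea described above (usually associated with Toda and Yamamoto): fit in levels with p + m lags and apply the Wald test only to the first p lags of the candidate cause. The function name, the simulated series, and the choice p = 1, m = 1 are assumptions for illustration; a real analysis would choose p by information criteria and m from the maximum order of integration.

```python
import numpy as np

def lag_augmented_wald(y, x, p, m):
    """Wald statistic for 'x does not Granger-cause y', estimated in levels.

    Regresses y_t on a constant and p + m lags of both y and x, then tests
    that the first p lags of x are jointly zero. The extra m lags are
    included in the fit but deliberately left out of the test.
    """
    y, x = np.asarray(y, float), np.asarray(x, float)
    k = p + m
    T = len(y) - k
    lags_y = [y[k - j : k - j + T] for j in range(1, k + 1)]
    lags_x = [x[k - j : k - j + T] for j in range(1, k + 1)]
    X = np.column_stack([np.ones(T)] + lags_y + lags_x)
    target = y[k:]

    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (T - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)

    idx = np.arange(1 + k, 1 + k + p)   # columns holding x lags 1..p
    b = beta[idx]
    return float(b @ np.linalg.solve(cov[np.ix_(idx, idx)], b))

rng = np.random.default_rng(4)
T = 500
x = np.cumsum(rng.normal(size=T))            # non-stationary driver (random walk)
y_dep = np.zeros(T)
y_dep[1:] = 0.8 * x[:-1] + rng.normal(size=T - 1)   # depends on lagged x
y_ind = np.cumsum(rng.normal(size=T))        # unrelated random walk

stat_dep = lag_augmented_wald(y_dep, x, p=1, m=1)
stat_ind = lag_augmented_wald(y_ind, x, p=1, m=1)
print(f"Wald statistic, y depends on lagged x: {stat_dep:.1f}")
print(f"Wald statistic, y independent of x:    {stat_ind:.1f}")
```

Under the null the statistic is compared against a chi-square distribution with p degrees of freedom. Note that no differencing is applied anywhere; the extra lags are what absorb the non-stationarity.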

> “The ADF test, differencing and the Granger causality test were performed on data aggregated to monthly frequency”

Hmm. It looks like differencing was done based on the ADF test... this is not valid for Granger causality.

Remember, with the frequentist statistical tests, there are so many fraught issues like this.

In the end, a test like this only talks about the following:

“Under the assumed hypothesis of no causality, how extreme do the observed test statistics appear to be?”

— where the test statistics are in part based on estimated model parameters that are highly sensitive to methodological errors, like differencing prior to computing the coefficient-is-non-zero tests in this Granger VAR model.

In the end it makes me skeptical of interpreting any of the results from the OP.

A good explanation is here: [0].

[0]: http://davegiles.blogspot.com/2011/04/testing-for-granger-ca...

They're "data scientists", what did you expect.

I expect they are paid 2x more than trained statisticians unfortunately. Happily though, they didn’t try to solve this problem with Judea Pearl-style causal inference hype.

A better approach to this would probably be treating the data as panel data and either doing monthwise hierarchical Bayesian regressions, and testing if the coefficients pooled over time are significantly non-zero, or else using a sort of distributed lag model or instrumental variables model on the monthwise simple regressions. In these cases, it’s better to roll up lagged features of the exogenous time series to serve as constructed covariates in the lag model, than to rely on differencing and time series tests which are like a landmine of interpretation issues.

Checking causality in models like this is very tricky, so a lot of care has to be taken, and you should expect work to proceed slowly and require a lot of posterior checks.

Hey, thank you for informative comments. I would gladly read more about hierarchical Bayesian regression. Would you recommend any sources from which I could learn more?

(I am fully aware that I can google it but in my opinion it's usually better to ask someone who is already familiar with the topic because he/she may know good learning materials.)

- http://www.stat.columbia.edu/~gelman/book/

- http://www.stat.columbia.edu/~gelman/arm/

If you happen to use Python, pymc3 is a great place to look for hierarchical and Bayesian timeseries models with full examples.

Thank you. One of my personal goals with regard to this analysis was to learn Python on-the-go (I feel more comfortable in R) so I will definitely check pymc3.

As it's often repeated "correlation is not causation".

Personally, I would only expect the former and not the latter. In my opinion it's more plausible that some other cause drives the increase in popularity on both sites.

The rise of Swift, for example, was caused first by Apple releasing the language, the increase of online articles and communities around the language and its open-sourcing.

This led to an increase in popularity pretty much everywhere, including SO and HN. These are correlated, but there is no causal relationship between the two.

Hey, thanks for the comment. I am aware that "correlation is not causation" and did not state otherwise. The correlations would not answer the original question about causation, and it seems that this analysis cannot cope with it. The Swift example suggests that the causes of the popularity of a given technology could be found outside Stack Overflow. It is, however, not impossible for Stack Overflow itself to be one of the causes of the popularity of posts on Hacker News.

I disagree with this analysis. From a non-statistician's point of view, "popularity" is not the same as significance, and neither term has a time-aware component.

Said differently, the number of interested questions, over certain periods of time, concurrently with adoption and use of the platforms being measured, are being measured here.

Not as simple as "correlation is not causation"

Modern statistical theory has adopted techniques of causal inference from statistical data. There are some questions about the theoretical foundations but the stats folk by and large have run with it and feel perfectly comfortable talking about causality in certain situations.


Yes, but all of them require a complete model of all possible influences and hidden factors. It's still only as good as the model, which in the OP is just "HN and SO".

If everyone ties their ego/identity/career/intellect to upvotes then what is voted up the most is accepted as most valid, most correct, or best - even if the individual does not understand all of the language or background material mentioned.

This can create a causal feedback cycle that establishes general trends in technology, because it's rooted in self identity and the appearance of intelligence (doesn't have to actually be intelligence).

Correlation is not causation but correlation combined with a system that intertwines (belief, personal motives/self identity/self image, crowd validation) can cause correlation to become causal.

That's why the author didn't simply look at correlation, but used Granger causality tests.

And then the author remarked that Granger causality doesn't indicate causation. It is a measure of how well one variable predicts upcoming changes in another. It doesn't mean that direct manipulation of the first variable will necessarily influence the second one.

Causation is an ill-defined concept. Already Aristotle had to define several kinds of causation, and philosophers still struggle with it.

It's even possible, per Hume, that causation doesn't actually exist.

Granger causality is basically just a time-shifted correlation. It doesn't prove causation any more than a correlation test (although it's directional in a way regular correlations aren't).

> Additionally, in case of two technologies (JQuery and Tensorflow) the variables regarding data from SO were pointed out as potential results of variables from Hacker News. The idea of Hacker News influencing popularity of technology on Stack Overflow is not so easy to accept (at least for me) as the opposite one, nevertheless, it shouldn’t be entirely disregarded.

What is so implausible about this hypothesis?

I came here to say the exact same thing. To me the hypothesis put forward by the author (SO influencing the popularity on HN) seems unlikely. I visit HN to find out about new technology and only visit SO when the technology I've already chosen isn't working out - and even then it's only indirectly via Google search results.

Yeah, I'm sure the causal arrow goes both ways. You learn about it on HN, try it out, ask a question on SO. Somebody else hears of the tech, visits SO, sees that there's activity, tries it out, and mentions it on HN.

But I'm puzzled by the framing. If I were looking at the relationship, I'd be asking questions like: Is HN a good place to post if I want to make something popular? If I learn about something on HN, is it likely to become popular? If I want to know what the next big thing is, is HN an indicator of that?

If I'm thinking of SO as the cause of something, it'd be mainly in the category of docs: a supporting factor. So I'd be inclined to measure documentation and developer resources more generally.

> Yeah, I'm sure the causal arrow goes both ways.

Even without fully realizing this (see my reply to @tofflos), I did not exclude such a possibility, and that's why I performed the Granger causality test for both hypotheses (SO influencing HN and HN influencing SO).

> But I'm puzzled by the framing. If I were looking at the relationship, I'd be asking questions like: Is HN a good place to post if I want to make something popular? If I learn about something on HN, is it likely to become popular? If I want to know what the next big thing is, is HN an indicator of that?

I guess these are different questions which are not covered by my analysis.

> If I'm thinking of SO as the cause of something, it'd be mainly in the category of docs: a supporting factor. So I'd be inclined to measure documentation and developer resources more generally.

To be honest, I do not understand this one. Could you please elaborate?

Sure. I personally wouldn't put SO down as the cause of anything in that I think it's intermediate in the kinds of causal chains I see around software popularity. I almost never hear about a new technology on SO. I'm there because I am already using it, have a question, and see a Google result.

So if I'm looking at technology adoption, SO is in a basket of factors that I think of as supporting. Does this technology have a website? Is it easy to get started? Is there a place where I can chat with people? Is there a meetup? Is there a conference? Are there good docs? Are there good videos? Are there blog posts and books? And, of course, are there good questions and answers?

All of these supporting factors can keep somebody on the road to adoption, in that once somebody has decided to try the technology, they aid the person in getting to a useful result. I don't think they're causal in the sense of initiating anything. But if I squint I could call them causal in the sense that if you invest in them, you see increased adoption. Given their substitutability, though, if I were looking at causality I'd try to find a broader metric than SO alone.

Is that helpful?

I was thinking rather otherwise, that is that people who use certain technology (technologies) would check posts about them on HN. Your point of view did not come to my mind so thanks for pointing this out.

I wonder if we could use this kind of correlation analysis to detect shilling/astroturfing, like when there's a spike on HN that isn't backed up by SO for a given topic. It's also timely I guess since MS seems to have been on a spree lately on reddit and elsewhere.

Meh: A more developer friendly product would cause less activity on Stack Overflow, right? How about frameworks with their own forums?

Isn't this one of the fundamental problems of the advertising industry? I.e., answering the question "did this ad cause the increase in sales?"

How do they approach this?

There are two primary approaches, both with their issues. Media Mix Modeling is the top down approach, where you input all marketing activities over time with corresponding response (sales) data. Direct attribution (multi touch attribution) attempts to identify each touchbase a specific customer had with your brand, and use that to infer causality.


Probably estimates through closer studies of other ads of the same format, sales increase correlation & just a whole lot of guessing.

The graphs shown are cumulative, which is defined by the README to be the sum of all the points up to that date (the most common definition). However, some graphs show a dropping number of points, like the one for Java, even in the non-standardized plot. (https://raw.githubusercontent.com/dgwozdz/HN_SO_analysis/mas...) Would this indicate some sort of error in the data collection, or did I miscomprehend the "cumulative" label?

He explains that when discussing HTML.

>Such a situation occurred due to greater number of downvotes than upvotes. This may be result of high number of duplicates since 2014 or the questions which were not formulated in a clear way or were not reproducible (and therefore were downvoted).
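A tiny worked example of why a cumulative score can fall: if the net score (upvotes minus downvotes) of new questions is negative in some months, the running sum declines. The monthly figures below are hypothetical, not taken from the repository.

```python
import numpy as np

# Hypothetical net score (upvotes minus downvotes) of questions asked each month.
monthly_net_score = np.array([120, 80, 30, -10, -40, -25])

cumulative = np.cumsum(monthly_net_score)
peak_month = int(np.argmax(cumulative))

print(cumulative.tolist())   # [120, 200, 230, 220, 180, 155]
print("cumulative score peaks at month index", peak_month)
```

So a dropping "cumulative" curve is consistent with the definition as long as the quantity being summed can be negative.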

Thanks for the answer, I missed that paragraph. It seems quite disconcerting to me that the average question in the Java tag has been rated negatively the last 4 years. I'm curious as to the specific reason for this downward trend, which isn't reflected in most of the other languages.

One thing that comes to mind for the three which exhibit these patterns, is that they're high volume, slow moving ecosystems. Given the SO community's high penchant for closing questions as duplicates, I wonder if we're just at the point where these languages have exhausted their supply of "sufficiently-unique-and-not-homeworky questions for SO".

Hey, I was thinking about writing something like that in the analysis. I finally decided not to do this because languages develop over time. I'm not a user of Java/HTML/Pascal/PHP but I use SQL quite frequently and in my opinion it is possible to ask a well defined question that did not appear before, at least for some "dialects" (if I name it correctly) like TSQL or Oracle SQL. Maybe if those "dialects" were investigated, their trends would not be downward.

You mentioned Rust, but it wasn't included in your analysis. Would you be able to regenerate the analysis easily to include it? I'd be super interested in seeing trends around Rust.

I only reported plots that I found interesting or in which I saw some resemblance between HN and SO, and in the case of Rust probably neither was the case. However, it must be said that I didn't have any quantitative criterion for this; the decision was purely subjective.

In case the author decides to do this, might I request Nim be included too? I'm curious how it would compare.

The easiest way to perform this analysis is to use a statistical package that features Vector Autoregression and Vector-valued Error Correction Models.

Example: http://www.statsmodels.org/dev/generated/statsmodels.tsa.vec...

Thanks! I wasn't aware that such a function exists.

- number of times questions from a certain time span (e.g. from a given day) were tagged as a favourite,

- number of comments for questions from a certain life span,

- number of views of questions from a certain life span,

- number of replies for questions from a certain life span.

Perhaps it's me, but questions (on SO) could be indicative of poor documentation, or a lack of answers elsewhere. The latter, obviously, somewhat counter to popularity.

Thanks for pointing this out, I didn't think about this. Nevertheless, I assume that in an ideal situation, when the documentation for two programming languages is of the same quality, the number of questions would be greater for the language with the larger group of users; therefore I still think of this measure as valid. I am also aware that we do not live in an ideal world.

Offtopic: I do not know how to measure and compare the quality of documentation and am curious whether there are any methods, so if you know some and could elaborate on this, I would be grateful.

At the bottom of the README you’ll find the answer:

> To sum up: does popularity of technology on StackOverflow (SO) influence popularity of post about this technology on Hacker News (HN)? There seems to be a relationship between those two portals but I could not determine that popularity on Stack Overflow causes popularity on Hacker News.

Common sense would dictate that one website does not cause another website to do anything, particularly those with vast communities. This is a classic case of someone misunderstanding the utility of an analytical technique. IMHO they may execute it right (didn't bother reading) but apparently fail to ask the right question in the first place or comprehend the limited utility of the result.

It would be interesting to see if the prevalence of comments of the common internet form “I am deliberately ignorant, but I think...” caused a decline in quality on HN, or if the decline in HN quality was caused by the prevalence of such comments.

To clarify: I read the intro and the ending, not the middle. I am not ignorant; I run companies where my entire job is to manage investment in areas I cannot possibly fully comprehend, so spotting errors in process is a skill I have honed. I came from a software background and am familiar with delving into specifics. The point I was making was that there is no point in delving if you've asked the wrong question or will misinterpret the result, as seems to be the case here.

How do you measure the decline?

Measuring it would mean first figuring out what was valuable in HN and then tracking that. It may turn out that what was valuable isn't easily measurable, even with ML.

In a preliminary way, however, it seemed there used to be a stronger correlation between length of comment (as a proxy for thoughtfulness, perhaps crossed with reading level) and comment score. It also seemed there were fewer of the "I know nothing about field X, but let me pontificate on X" comments. Frequency of one-liners has probably increased, in relative terms. We could probably also look at the relative frequency of comments of the type "Thank you for your (specialist) explanation" or "Very interesting!" over time, as a more or less direct proxy for quality.

One might also be able to come up with a measure of "feeling of safety." In other words, the crowd probably perceives the level of discourse and guards their contributions accordingly. For example, "How do you measure the decline?," is interpretable as a technical question, or as a rhetorical question with "one" replacing "you," or as a snarky attack (by reading it with a sneering emphasis on "you," to indicate skepticism that a decline has happened), and so on. The general tone of replies might serve as an indicator, especially if crossed with something like days-since-joined-HN or karma.

But, really, my original comment was meant to contribute something positive while calling out, en passant, a particularly egregious kind of comment behavior that I am saddened to see on HN, particularly since I perceive a general decline over time.

Full(ish) disclosure: I burned down a previous account because I had made a shameful anti-Muslim comment a la Sam Harris, whom I'd been reading (and was inexplicably taken with) at the time. I could no longer edit or remove the comment by the time I'd come to my senses, so opted to irreversibly change the password on that account to gibberish. This is all just to point out that, despite my apparent HN youth, I can indeed seem to remember a time when thoughtful comments that represented either expertise or care elicited far more upvotes than they do now, and a time when short quips and adolescent snark elicited far swifter and more brutal karmic beat-downs.

Thank you for your thorough explanation. My question was indeed sincere in that I wanted to understand how you perceive the decline. And your statement makes sense to me. I think HN is becoming much more mainstream these days. My feeling is that it’s growing out of its niche. That is probably driving “the decline”. I don’t know whether this is a good or a bad thing. Maybe, however, the development makes a good case for an HN alternative once enough people think the way you do.

Given that he's tested for lots of languages, shouldn't he be applying something like a Bonferroni correction to the Granger causality results? Wouldn't this have the effect of making them all insignificant?
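For readers unfamiliar with the correction, a minimal sketch follows. The p-values are hypothetical placeholders, not the ones reported in the analysis, so the example says nothing about which of the actual results would survive.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and which tests remain significant."""
    p = np.asarray(p_values, dtype=float)
    adjusted = np.minimum(p * len(p), 1.0)   # same as testing p < alpha / len(p)
    return adjusted, adjusted < alpha

# Hypothetical p-values from five separate Granger tests.
p_values = [0.003, 0.02, 0.04, 0.30, 0.65]
adjusted, significant = bonferroni(p_values)

print(adjusted)      # adjusted p-values, capped at 1
print(significant)   # only the first test survives at alpha = 0.05
```

Whether everything becomes insignificant depends on how small the raw p-values are relative to the number of tests: results just under 0.05 disappear, very small ones can remain.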

"Correlation doesn't imply causation"

Hey, thanks for the comment. I am fully aware of that and did not state otherwise in the analysis. I only established that there seems to be some kind of relationship but it is not possible to determine causality.

This source includes a helpful summary of Granger causality, written by Granger himself. http://www.scholarpedia.org/article/Granger_causality

The topic of how to define causality has kept philosophers busy for over two thousand years and has yet to be resolved. It is a deep convoluted question with many possible answers which do not satisfy everyone, and yet it remains of some importance. Investigators would like to think that they have found a "cause", which is a deep fundamental relationship and possibly potentially useful.

In the early 1960's I was considering a pair of related stochastic processes which were clearly inter-related and I wanted to know if this relationship could be broken down into a pair of one way relationships. It was suggested to me to look at a definition of causality proposed by a very famous mathematician, Norbert Wiener, so I adapted this definition (Wiener 1956) into a practical form and discussed it.

Applied economists found the definition understandable and useable and applications of it started to appear. However, several writers stated that "of course, this is not real causality, it is only Granger causality." Thus, from the beginning, applications used this term to distinguish it from other possible definitions.

The basic "Granger Causality" definition is quite simple. Suppose that we have three terms, X_t, Y_t, and W_t, and that we first attempt to forecast X_{t+1} using past terms of X_t and W_t. We then try to forecast X_{t+1} using past terms of X_t, Y_t, and W_t. If the second forecast is found to be more successful, according to standard cost functions, then the past of Y appears to contain information helping in forecasting X_{t+1} that is not in past X_t or W_t. In particular, W_t could be a vector of possible explanatory variables. Thus, Y_t would "Granger cause" X_{t+1} if (a) Y_t occurs before X_{t+1}; and (b) it contains information useful in forecasting X_{t+1} that is not found in a group of other appropriate variables.

Naturally, the larger W_t is, and the more carefully its contents are selected, the more stringent a criterion Y_t is passing. Eventually, Y_t might seem to contain unique information about X_{t+1} that is not found in other variables which is why the "causality" label is perhaps appropriate.

The definition leans heavily on the idea that the cause occurs before the effect, which is the basis of most, but not all, causality definitions. Some implications are that it is possible for Y_t to cause X_{t+1} and for X_t to cause Y_{t+1}, a feedback stochastic system. However, it is not possible for a determinate process, such as an exponential trend, to be a cause or to be caused by another variable.

It is possible to formulate statistical tests for what I now designate as G-causality, and many are available and are described in some econometric textbooks (see also the following section and the references). The definition has been widely cited and applied because it is pragmatic, easy to understand, and easy to apply. It is generally agreed that it does not capture all aspects of causality, but enough to be worth considering in an empirical test.
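The definition quoted above can be sketched as a comparison of two one-step forecasts of X: one built from the past of X and W only, and one that also uses the past of Y. The data-generating process below is invented for illustration, and the in-sample mean squared error stands in for the "standard cost functions"; formal tests use an F or Wald statistic instead.

```python
import numpy as np

def forecast_mse(target, predictors):
    """In-sample mean squared error of a least-squares one-step forecast."""
    X = np.column_stack([np.ones(len(target))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return float(resid @ resid / len(target))

rng = np.random.default_rng(5)
T = 2000
w = rng.normal(size=T)
y = rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    # Assumed process: X depends on its own past and on past W and Y.
    x[t] = 0.5 * x[t - 1] + 0.4 * w[t - 1] + 0.6 * y[t - 1] + rng.normal()

target = x[1:]
past_x, past_w, past_y = x[:-1], w[:-1], y[:-1]

mse_without_y = forecast_mse(target, [past_x, past_w])
mse_with_y = forecast_mse(target, [past_x, past_w, past_y])

print(f"forecast MSE using past X and W:    {mse_without_y:.2f}")
print(f"forecast MSE using past X, W and Y: {mse_with_y:.2f}")
```

Here adding the past of Y clearly improves the forecast, so Y "Granger causes" X in this toy system. The improvement only shows predictive content: a hidden common driver with different delays would produce the same pattern.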

There are now a number of alternative definitions in economics, but they are little used as they are less easy to implement.
