The functions I am referring to are estimators of cognitive indicators (like backwards digit span, say, that they use in the paper), as a function of working hours.
Take a look at page 21 for some plots. For each of the cognitive indicators, the estimator is a downwards parabola as a function of working hours. What I am saying is that this is an artifact of the analysis. The shape could be far different -- in fact it could be a bad case of curve fitting. Additionally -- why not just directly plot the data as a scatter plot or a binned average of cognitive indicators for bins between, say, 20 - 25 hours, 25 - 30 hours, etc? Then at least we could see if the parabolas are close to the data...
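A binned average of the kind suggested here could be computed roughly as follows. This is only a sketch: the function name `binned_means`, the bin edges, and the sample numbers are all made up for illustration, since the study's data are not shared.

```python
import numpy as np

def binned_means(hours, scores, edges):
    """Mean score per bin; bin i covers [edges[i], edges[i+1])."""
    hours = np.asarray(hours, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # np.digitize returns 1 for values in the first bin, so shift to 0-based.
    idx = np.digitize(hours, edges) - 1
    means = []
    for i in range(len(edges) - 1):
        in_bin = idx == i
        # Empty bins are reported as NaN rather than silently dropped.
        means.append(float(scores[in_bin].mean()) if in_bin.any() else float("nan"))
    return means

# Hypothetical sample: five respondents, bins 20-25, 25-30, 30-35 hours.
sample_hours = [21, 24, 26, 29, 31]
sample_scores = [10, 12, 14, 16, 18]
sample_means = binned_means(sample_hours, sample_scores, [20, 25, 30, 35])
print(sample_means)
```

Plotting these bin means next to the fitted parabola would show at a glance whether the curve tracks the data.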
You can't just plot a single predictor against the sample outcome and expect the plot to be particularly revealing in multiple regression. Plus, this isn't even multiple regression; this is a two-stage least squares multiple regression. The working hour (WH) variables are instruments, not predictors. See page 6.
Instrumental variables exist specifically to deal with the case of a possible bidirectional causal association between predictor and response.
I'm unfortunately too busy to make it clearer, so I will just leave you with a koan.
Why not include a third order term in the regression? What about an n-th order term? What assumptions do we "bake into" the results of a statistical regression as an effect of including, or not, any function on the original data?
The statistical significance of the quadratic term is actually dependent upon the presence of any more complex or higher order terms in the regression, just as the coefficients and the statistics of the linear term will depend on the presence of the second order term in the analysis.
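A small sketch of that dependence, using made-up data (an exactly cubic response on five points, nothing from the paper). It only shows the coefficient changing, not a significance test, but the same logic carries over to the test statistics.

```python
import numpy as np

# Hypothetical data: an exactly cubic response on a small grid.
x = np.arange(5, dtype=float)   # 0, 1, 2, 3, 4
y = x ** 3

# np.polyfit returns coefficients from the highest power down.
quad = np.polyfit(x, y, 2)      # constant, linear, quadratic terms only
cubic = np.polyfit(x, y, 3)     # same data, with a cubic term added

quad_coef_without_cubic = quad[0]   # x^2 coefficient in the quadratic model
quad_coef_with_cubic = cubic[1]     # x^2 coefficient in the cubic model
print(quad_coef_without_cubic, quad_coef_with_cubic)
```

Without the cubic term, the quadratic coefficient absorbs the curvature and comes out large; with the cubic term present, it drops to essentially zero on the same data.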
I'm not saying you should never include a quadratic term in a regression, I'm saying we should understand what the regression is doing when it is fitting a model.
I understand what the regression is doing. The authors understand. You do not, though.
You've already admitted to not having a background in stats, yet you keep throwing around words like "significance" and "model fitting" without having the faintest clue what they mean mathematically. I'm sorry, but I can't fit several semesters of undergrad-level stats in these comment boxes.
Here is my redoubled effort.
Part of what gives me hope is that the CEO of a prominent data analysis company, who does have a background in statistics and data analysis, and has a PhD in computational mathematics from Stanford, said that my original comment was "amazing" and that "the complacency of selecting variables is lost on the hive mind."
And so, while I have diminished hope of getting through to you this time, since things seem to have regressed to assertions that I don't have the "faintest clue" about things like model fitting and significance (this is absolutely false; I'm quite deeply aware of what they mean), this conversation does have at least some merit, even if only outside of this surprisingly argumentative Hacker News context.
Including a specific number of terms in a Taylor series expansion (as per the suggestion of xapata), or in any expansion, be it a Fourier expansion, a Lagrange expansion, or whatever, is a somewhat perilous choice that can distort the meaning. Any form of model fitting has this problem. But one cannot dismiss all other models that could be fit to the data as insignificant in this case!
In particular, the choice of a quadratic function for fitting, using constant, linear, and quadratic terms, automatically distorts this data, because of the nature of the data, where there is an anti-correlation between unemployment and cognitive indicators.
This is demonstrated by the following example, which took me about 5 minutes to construct -- and a bit more to explain and write about here.
Suppose the data show a completely flat response for IQ versus working hours, except for the unemployed population, which has a lower set of cognitive indicators.
The data and curve are linked here.
In this example, the data show no optimum number of working hours, and IQ doesn't diminish for more hours worked. But the quadratic fit does suggest this: a peak for IQ near 25 hours worked.
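For anyone who wants to reproduce the effect without the linked plot, here is a minimal sketch of the same kind of toy data. The numbers are invented (they are not the study's data, and not necessarily the ones in my plot), and exactly where the apparent peak lands depends on how the working hours are distributed.

```python
import numpy as np

# Hypothetical toy data: three unemployed respondents at 0 hours with a
# lower score, and a completely flat score for everyone who works.
hours = np.array([0, 0, 0, 10, 20, 30, 40, 50], dtype=float)
score = np.array([90, 90, 90, 100, 100, 100, 100, 100], dtype=float)

# Least-squares fit with constant, linear, and quadratic terms.
c2, c1, c0 = np.polyfit(hours, score, 2)

# Vertex of the fitted parabola: the "optimum" the fit appears to find.
peak_hours = -c1 / (2.0 * c2)

working = hours > 0
spread_among_workers = score[working].max() - score[working].min()

print(f"quadratic coefficient: {c2:.4f}")
print(f"apparent optimum: {peak_hours:.1f} hours")
print(f"score spread among workers: {spread_among_workers}")
```

The fitted parabola opens downward and peaks strictly inside the working range, even though the score among workers does not vary at all.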
Obviously, this example is not the data the study worked on. The study doesn't directly share the data. But, from the graphs on page 20, the example I constructed is quite like the data. The part-time and full-time work probability density curves are practically identical to one another -- they are right on top of each other. The only really significant difference is between the working and not working populations.
Yet, the authors do not hedge their findings.
"Our findings show that there is a non-linearity in the effect of working hours on cognitive functioning. For working hours up to around 25 hours a week, an increase in working hours has a positive impact on cognitive functioning. However, when working hours exceed 25 hours per week, an increase in working hours has a negative impact on cognition."
and the study concludes "Our study highlights that too much work can have adverse effects on
In my judgment, this analysis does not demonstrate this, even though it would be convenient for me for this to be true.
Because they did two-stage least squares and used fitted values instead of working hours directly, the example above needs a slight adjustment in order to be relevant.
It is not entirely obvious exactly how well the anti-correlation for cognitive indicators will carry through after "working hours" are estimated by regression with the variables:
Vacancy rate, Inner regional, Outer regional, Remote, Very remote, Number of dependent, Children, Parent is still alive, Other public benefits, Australian citizen, Work experience, Ownhouse.
I mean, literally the strongest connection is with "other public benefits", which is a variable with an effect measured in dozens of hours of work per week. Everything else has a far smaller effect. So, effectively, what the second stage of least squares is really doing is a regression of cognitive indicators on the variables above; and really, mostly on whether or not they receive public benefits.
A large fraction of those people who have public benefits will have their "estimated work hours" estimated below 0, then will be reassigned to 0 for the purposes of the final regression. Hence, if there is an anti-correlation for "receiving other public benefits" (their terminology) with cognitive indicators -- and there is -- it will appear that there is a significantly lower set of cognitive indicators for the instrument WH* that they estimate.
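A rough sketch of that mechanism, under heavy assumptions: the data are invented, there are only two instruments (a benefits dummy and a hypothetical "experience" variable) instead of the paper's full list, and the first stage is plain OLS rather than whatever the authors actually ran.

```python
import numpy as np

# Hypothetical data, not the study's: 10 benefit recipients who work 0 hours
# and 10 non-recipients who work 30-39 hours. "experience" stands in for the
# weaker instruments.
benefits = np.repeat([1.0, 0.0], 10)
experience = np.tile(np.arange(10, dtype=float), 2)
hours = np.where(benefits == 1, 0.0, 30.0 + experience)
score = np.where(benefits == 1, 90.0, 100.0)

# First stage: ordinary least squares of hours on the instruments.
X = np.column_stack([np.ones_like(hours), benefits, experience])
beta, *_ = np.linalg.lstsq(X, hours, rcond=None)
wh_raw = X @ beta

# Negative predictions are reassigned to 0, as described above.
wh_star = np.clip(wh_raw, 0.0, None)

recipients = benefits == 1
n_negative = int((wh_raw[recipients] < 0).sum())
mean_score_at_zero = score[wh_star == 0].mean()
mean_score_positive = score[wh_star > 0].mean()

print(f"recipients with negative predicted hours: {n_negative} of 10")
print(f"mean score at WH* = 0: {mean_score_at_zero}")
print(f"mean score at WH* > 0: {mean_score_positive:.2f}")
```

In this toy setup, half of the recipients get a negative prediction and are piled up at WH* = 0, and that pile has the lower scores purely because of the benefits variable.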
After that, the rest of my toy example is still quite apt -- there can be no effect in IQ as a function of WH* or WH (as measured) outside of the unemployed population, even though the quadratic analysis will suggest an optimum.
Isn't this, effectively, what they are doing when they say they have found an optimum number of "working hours" for one to work to maximize cognitive indicators?
I'm not even quibbling whether they're demonstrating causality versus correlation. The problem is that their result and the number that they found are likely artifacts of the method of the analysis, and the choice to include only linear and quadratic terms.