Gah, I wish I had time to fully read this and get into it, but I have to spend t...

andersource · on April 22, 2022

Hey omnicognate, good to see you here, appreciated our previous discussion.

What you're saying is that we need to verify the statistical reliability of the skill tests DK gave, and to some extent that we need to scrutinize the assumption that there indeed is such a thing as "skill" to be measured in the first place. I hope we can both agree that skill exists. That leaves the test reliability (technical term from statistics, not in the broad sense).

What's simulated by purely random numbers is tests with no reliability whatsoever. Of course if the tests DK gave to subjects don't actually measure anything at all, the DK study is meaningless. If that's what the original article's author is trying to say, they sure do it in a very roundabout way, not mentioning the test reliability at all. I'd be completely fine reading an article examining the reliability of the tests. Otherwise, I again fail to see how the random number analysis has anything to do with the conclusions of DK.

In fact, DK do concern themselves with the test reliability, at least to some extent. That doesn't appear in the graph under scrutiny but appears in the study.

If you assume the tests are reliable, and you also assume that DK are wrong in that people's self-assessment is highly correlated with their performance, and generate random data accordingly, you'll still get no effect even if you sample twice as you propose.

> The key point though is that if you use a separate sample of X that correlation disappears completely

Separate sample of X under the assumption of no dependence at all of the first sample, i.e., assuming there is no such a thing as skill, or assuming completely unreliable tests. So, not interesting assumptions, unless you want to call into question the test reliability, which neither you nor the author are directly doing.

topaz0 · on April 22, 2022

I think the other piece that has been glossed over a bit is that DK are using quantiles (for both the test and the self-assessment). That means everything is bounded by 0 and 1, and you can't underestimate your performance if it was poor, or overestimate your performance if it was perfect. Or conversely, if you're the most skilled person in the room, your (random) actual performance on the day of the test is bounded above by your true skill, and vice versa for the least skilled. So e.g. we could simulate data with perfect self-assessment of overall skill, add a small amount of noise to actual performance on the day of the test, and get the same results. The bottom quartile (grouped by actual test score) will be a mix of people who are actually in the bottom quartile in skill and some who are in the higher quartiles. The top quartile by actual test score will be a mix of some from the top quartile in skill and some from lower quartiles.

andersource · on April 22, 2022

I agree in principle, although I think to get an effect size similar to what DK observed you'd need quite large noise. Which again comes back to the test reliability.