Basically, this autocorrelation take shows that if performance and evaluation of performance were random and independent, you would get a graph like the D-K one, and therefore it states that the effect is just autocorrelation. But in reality, it would be very surprising if performance and evaluation of performance were independent. We expect people to be able to accurately rate their own ability. And D-K did indeed show a correlation between the two, just not as strong of one as we would expect. Rather, they showed a consistent bias. That's the interesting result. They then posit reasons for this. One could certainly debate those reasons. But to say the whole effect is just a statistical artifact because random, independent variables would act in a similar way ignores the fact that these variables aren't expected to be independent.
Yup. Assuming the sample sizes are large enough for the differences to be statistically significant, the original paper clearly shows:
- On average, people estimate their ability around the 65th percentile (actual results) rather than the 50th (simulated random results) -- a significant difference
- That people's self-estimation increases with their actual ability, but only by a surprisingly small degree (actual results show a slight upwards trend, simulated random results are flat) -- another significant difference
The author's entire discussion of "autocorrelation" is a red herring that has nothing to do with anything. Their randomly-generated results do not match what the original paper shows.
None of this really sheds much light on to what degree the results can be or have been robustly replicated, of course. But there's nothing inherently problematic whatsoever about the way it's visualized. (It would be nice to see bars for variance, though.)
The autocorrelation point is important because it shows that the transformation to a D-K plot will always give you the D-K effect for independent variables.
However, the focus on autocorrelation is not very illuminating. We can explain the behaviors found quite easily:
- If everyone's self-assessment scores are (uniformly) random guesses, then the average self-assessment score for any quantile is 50%. Then of course those in the lower quantiles (less skilled) are overestimating.
- If self-assessment scores and actual scores are proportionally dependent, then the average for each quantile is always at least its quantile value. This is the D-K effect, which becomes weaker as the correlation grows.
- The opposite is true for a disproportional relation.
So, the D-K plot is extremely sensitive to correlations and can easily exaggerate even the weakest of correlations.
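A minimal simulation of the first bullet, assuming everyone's percentile guess is an independent uniform draw (the sample size and parameters are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Actual skill and self-assessment are independent uniform draws,
# i.e. everyone guesses their percentile completely at random.
actual = rng.uniform(0, 100, n)
guess = rng.uniform(0, 100, n)

# Average guess within each actual-skill quartile.
quartile = np.digitize(actual, [25, 50, 75])
for q in range(4):
    print(f"Quartile {q + 1}: mean self-assessment = {guess[quartile == q].mean():.1f}")
# Every quartile averages ~50, so the bottom quartile "overestimates"
# and the top quartile "underestimates" purely by construction.
```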
> "On average, people estimate their ability around the 65th percentile (actual results) rather than the 50th (simulated random results) -- a significant difference"
This is a different issue than D-K. The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals. That people think they're better than average is a different (and much less controversial) bias.
---
[DK-Effect] : I totally know I scored at least a 30% on that test, and that's certainly way better than average (it's not). [Actually scored 10%]
[No DK-Effect] : I totally know I scored at least a 30% on that test, and that's certainly way better than average (it's not). [Actually scored 30%]
> The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals.
Isn't that what the graph shows? The bottom quartile group is guessing almost 50 percentile points higher than their actual performance, whereas the top quartile is at most 15 points off.
They're all guessing somewhere between the 60th and 75th percentiles (i.e. "I'm a bit better than average") - with some upwards trend since the high performers seem to at least know they have some skill, although not very accurately. It's just that for the poor performers, a guess of the 60th percentile is wayyy off the mark.
EDIT: Something important for the rest of this post. In case it's not clear, the graph is showing your percentile ranking within the group - not your actual score.
Nope, because there's an interesting statistical trick in play. Imagine you take 100 highly skilled physicists and give them some lengthy series of otherwise relatively basic physics questions. Everybody is going to rate their predicted performance as high. But some people will miss some questions simply due to silly mistakes or whatever. And those people would end up on the bottom 10% of this group, even if the difference between #1 and #100 was e.g. 0.5 points. Graph it as D-K did, and you'd show a huge Dunning Kruger effect, even when there is obviously nothing of the sort.
In fact, the smaller the differences in ability within a group, and the greater the relative ease of a task, the bigger the Dunning-Kruger effect you'd show. Because everybody will rate themselves relatively high, but you will always have a bottom 10%, even if they are practically identical to the top 10%.
You can see this most clearly in the original paper. They carried out 4 experiments. The one that was most objective and least subject to confounding variables was #2, where they asked people a series of LSAT based logic questions, and assessed their predicted vs actual results. And there was very little difference. Quoting the paper, "Participants did not, however, overestimate how many questions they answered correctly, M = 13.3 (perceived) vs. 12.9 (actual), t < 1. As in Study 1, perceptions of ability were positively related to actual ability, although in this case, not to a significant degree." Yet look at the graph for it, and again it shows some seemingly large D-K effect.
And there are even more issues with D-K, and especially experiment #1 (which is the one with the prettiest graph by far), but that's outside the scope of this post. I'm happy to get into it if you are, though. I find this all just kind of shocking and exceptionally interesting! I've referenced the D-K effect countless times in the past; never again after today!
Yes yes yes! I’m in the very same boat, and came to an epiphany that the ranking trick here, combined with some subjective questions (ability to appreciate humor - seriously!?), hides almost everything about actual skill. Not only does it amplify mistakes, it also forces the participants to have to know something about their cohort. Having to guess your ranking fully explains the less than perfect correlation. It also undermines all claims about competence and incompetence. They’re not testing skill, they’re only testing the ability to randomly guess the skill of others.
What about the slight bias upwards? Well, what exactly was the question they asked? It’s not given in the paper. They were polling only Cornell undergrads looking for extra credit. What if the question somehow accidentally or subtly implied they were asking about the ranking against the general population, and then they turned around and tested the answers against a small Cornell cohort? I just went and looked at the paper again and noticed that the descriptions of the ranking question changed between the various “studies” with the first one comparing to the “average Cornell student” (not their experiment cohort!). The others suggest they’re asking a question about ranking relative to the class in which they’re receiving extra credit. Curiously study 4 refers to the ranking method of study 2 specifically, and not 3. The class used in study 4 was a different subject than 2 & 3. How they asked this question could have an enormous influence on the result, and they didn’t say what they actually asked.
Cornell undergrads are a group of kids who got accepted to an elite school and were raised to believe they’re better than average. Whether or not all people believe they’re better than average, this group was primed for it, and also has at least one piece of actual evidence that they really are better than average. If these were mostly freshman undergrads, they might be especially poorly calibrated to the skills of their classmates.
In short, the sample population is definitely biased, and the potential for the study to amplify that bias is enormous. The paper uses suggestions and jumps to hyperbolic conclusions throughout. I’m really surprised that evidence and methodology this weak claims to show something about all of humanity and got so much attention.
> The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals.
I’m not sure that’s an accurate summary. The correlation of the perceived ability is effectively the slope of the line, and the slope is more or less constant. The paper suggests that the bias of the bottom quartile is higher than the bias of the upper quartile, not that the correlation is any different.
But it’s strange that the DK paper makes an example of the lower performers, since the bias of the scores appears to be constant; it appears the high performers have pretty much the same bias as the low performers — it’s a straightish line that goes through 65% in the middle rather than the expected straight line that goes through 50% in the middle. If the ‘high performers’ had a different bias, then the line wouldn’t be so straight.
1. The slope of self-perceived ability is lower than that of actual ability
2. The y intercept depends on the difficulty of the test
Therefore with an easier test the better test-takers are more accurate, and with a very difficult test the worse test-takers are more accurate, because of where the lines intersect. Meaning DK is an artifact of test difficulty.
This also means that if the test were difficult enough you could create a bizarro-DK effect where the better test-takers were less accurate.
For 1, the data is based on guessing, so it’s zero surprise that self-perceived ability doesn’t correlate perfectly with actual ability. It would be extremely surprising and unbelievable if the slopes were the same, right?
For 2, the DK paper shows one thing, but the replication attempts have shown this effect doesn’t even exist for very complex tasks, like being an engineer or lawyer. The DK effect doesn’t generalize, and doesn’t even measure exactly what it claims to measure, which is why we don’t need to speculate about the bizarro-DK reversal effect - we already have evidence that it doesn’t happen, and we already have a big enough problem with people mistakenly believing that DK showed an inverse correlation between confidence and competence, when they did no such thing.
> The D-K hypothesis is that self assessment and actual performance are less correlated for weaker than higher performing individuals
That may have been a hypothesis Dunning and Kruger had at some point, but it's not the effect they actually identified from their research. And I don't think it's even that; it's an “effect” people have associated with D-K because they heard discussion of the D-K research that got distorted at multiple steps from the original work, and then that misunderstanding, because it made a nice taunt, replicated widely and became popular.
To be fair, the paper itself uses hyperbolic language that completely distorts its own data. It heavily pushes and leads the reader into one possible dramatic explanation for their results, while downplaying and ignoring a bunch of other less dramatic explanations. Using words like “incompetent” is almost completely unfounded based on what they actually did. Section headings like “competence begets calibration”, “it takes one to know one”, and “the burden of expertise” are uncurious platitudes and jumping to conclusions. I’m kind-of stunned at the popular longevity of this paper given how unscientific it is and how often replication results with better methodology have shown conflicting results.
"Perhaps more controversial is the third point, the one that is the focus of this article. We argue that when people are incompetent in the strategies they adopt to achieve success and satisfaction, they suffer a dual burden: Not only do they reach erroneous conclusions and make unfortunate choices, but their incompetence robs them of the ability to realize it."
> That people's self-estimation increases with their actual ability, but only by a surprisingly small degree (actual results show a slight upwards trend, simulated random results are flat) -- another significant difference
If everyone thinks they are slightly above average, isn't this inevitable? If everyone thinks they are slightly above average, people who are slightly above average are going to be the most accurate at predicting where they land?
> If everyone thinks they are slightly above average, isn't this inevitable? If everyone thinks they are slightly above average, people who are slightly above average are going to be the most accurate at predicting where they land?
Yes, it’s inevitable. And this study only asked Cornell undergrads what they think of themselves - people who were taught to believe they are above average, and also people who got into a selective school and probably all had higher than average scores on standardized tests. Is it surprising in any way that this group estimated their ability at above average?
Even if "people tend to slightly overrate their own ability," was the only takeaway, it would still refute the author's conclusion that DK has nothing to do with human psychology.
Have you not just summarized the Dunning-Kruger effect in other words?
That essentially follows from everyone assuming they are slightly above average. That's also the crux of the refutation and why the whole autocorrelation point is a red herring: even if we all self-assessed completely randomly, that would actually confirm the Dunning-Kruger effect is real (because if we self-assess randomly, worse performers are more likely to overestimate).
We could argue that this is not surprising, but the "surprising" bit is that the curves show that better performers are actually more skilled at assessing their performance, which incidentally was also confirmed by the followup studies.
Is it though? Everyone overestimating their ability a bit isn't the DK effect. It's when people with less knowledge and ability vastly overestimate their ability (because they don't know how little they know - while others do), and the opposite for those who are truly more able and knowledgeable (again because they understand how vast the topic is, and though they know more and are more capable than the average person, they also understand how little they truly know compared to what they don't know).
There are those that don't know, and don't know that they don't know. They evaluate themselves the highest.
There are those that know, and don't know that they don't know. They evaluate themselves a bit better than those before.
There are those that know, and know that they don't know. They evaluate themselves worse than those before them. This is the d-k valley, imposter syndrome, confidence issues.
There are those that know, and know that they know. They are much better at evaluating themselves than those before them. They have the experience to know what they know and what they don't know, and they still continue to underrate themselves vs the first bunch, but they are more accurate and closer to the truth.
> And D-K did indeed show a correlation between the two, just not as strong of one as we would expect. Rather, they showed a consistent bias. That's the interesting result.
"D-K effect in its original form" vs "D-K effect in pop culture" is the biggest D-K effect live example. Of course I mean D-K effect in pop culture here.
Interestingly, the "interesting" part of the original result is that the correlation between actual performance and perceived performance is less than people intuitively think.
But as the "D-K effect in pop culture" spreads, people's collective intuition changes. Today if you explained the original D-K effect to a random person on the internet, they might find it interesting because the correlation is greater than they thought: they thought the correlation would be negative!
> And D-K did indeed show a correlation between the two, just not as strong of one as we would expect. Rather, they showed a consistent bias. That's the interesting result.
Right, so:
1. If the data were truly random, with no correlation, we'd expect the line to be straight across the middle, with the first quartile at 50% and the last quartile also at 50%
2. If the data were 100% accurate and precise [1], we'd expect the line to be diagonal, with the first quartile at 12.5% and the last quartile at 87.5%.
3. If the data were accurate but not precise (i.e., basically right but with some randomness built in), we'd expect the line to be in between #1 and #2 -- basically, changing from #2 into #1 as the randomness increases, but with the intersection at 50%.
That's because someone in the 2nd percentile can't underestimate themselves as much as they can overestimate themselves, and someone in the 98th percentile can't overestimate themselves as much as they can underestimate themselves. But in any case, the "0 bias" case looks symmetric.
4. But what we actually see is none of the above: we see the 1st quartile being at (eyeballing the chart) 60%, and the last quartile at 75%.
That shows that there is indeed some ability for self-evaluation, but it's off. The fourth quartile could indeed just be random, the effect of clipping at the top meaning that the upper quartile cannot overestimate themselves as much as they underestimate themselves. But there's no getting around the fact that the bottom quartile are overestimating themselves.
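A rough sketch of scenarios 1-3 above and the quartile averages each would produce (the noise level in scenario 3 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
actual = rng.uniform(0, 100, n)                     # actual percentile
quart = np.digitize(actual, [25, 50, 75])

scenarios = {
    "1. pure guessing": rng.uniform(0, 100, n),
    "2. perfect self-assessment": actual.copy(),
    "3. unbiased but noisy": np.clip(actual + rng.normal(0, 25, n), 0, 100),
}

for name, perceived in scenarios.items():
    means = [perceived[quart == q].mean() for q in range(4)]
    print(name, [round(m) for m in means])
# Scenario 3 sits between 1 and 2 and pivots around the 50th percentile;
# the published D-K curve instead runs from roughly the 60th to the 75th.
```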
> But there's no getting around the fact that the bottom quartile are overestimating themselves.
It's because higher competence goes along with more accurate self-assessment but not less bias. So the high performers underestimate with less magnitude than the low performers overestimate, but they both under and over estimate themselves with the same frequency.
The author of this assumes the conclusion in order to decide how to analyze his data.
He cannot reasonably say both:
> we have a decision to make: what are we going to assume? How are we going to quantify our surprise from the results?
> The first option is, as in the case of the state census, to assume dependence between X and Y. I.e. to assume that, generally, people are capable of self-assessing their performance.
> The second option conforms with the Research Methods 101 rule-of-thumb “always assume independence.” Until proven otherwise, we should assume people have no ability to self-assess their performance.
> It seems to me glaringly obvious that the first option is much, much more reasonable than the second.
— and -
> most notably the claim that the more skilled people are, the better they are at self-assessing their performance. This result is supported by their plot, but in any case, my issue is not with objections to this claim
and then expect to carry any credibility.
The author of this piece both suggests that a key variable is fixed and later admits it varies within the same dataset.
I guess at least they admit it, but this lacks basic self-consistency.
I'm utterly confused. The latter statement is just the author explaining which parts they didn't discuss in their article; it has no bearing whatsoever on the section before it.
It discloses the cognitive dissonance in his position. He seems to be saying both “skill at assessing ability is random and mathematically bounded only” while admitting “skill at assessing ability changes with ability.”
> The author of this piece both suggests that a key variable is fixed and later admits it varies within the same dataset.
I don't see how that variable changes; here is an example of how the error variable can be exactly the same for everyone and reproduce the results:
Let's say the overconfidence is always that you feel 50% of those better than you are actually worse than you. So everyone is equally overconfident, just that the top won't move their own placings as much as the bottom, since there are far fewer people that they can mistake for being worse than them. Then apply noise to this and you get the graph Dunning-Kruger got.
You could say "But they are better at estimating their rank!", but that is just a mathematical artefact, it isn't a psychological result. Even if everyone always guessed that they are number 1, the better you are the better your guess will be, but in that case it is easy to see that everyone overestimates their skill in the same way instead of the better people having a fundamentally different way of evaluating themselves.
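A sketch of the rule described two paragraphs up (perceived percentile = your own percentile plus half the distance to the top, plus noise). The noise level is arbitrary, and how closely the resulting numbers match Kruger and Dunning's figure depends entirely on the noise chosen; the point is only that a single rule applied to everyone produces an upward-sloping, compressed "perceived" line.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
actual = rng.uniform(0, 100, n)                     # true percentile

# "You believe half of the people above you are actually below you":
# perceived percentile = actual + (100 - actual) / 2 = 50 + actual / 2,
# the same overconfidence rule for everyone, plus arbitrary noise.
perceived = np.clip(50 + actual / 2 + rng.normal(0, 15, n), 0, 100)

quart = np.digitize(actual, [25, 50, 75])
for q in range(4):
    print(f"Q{q + 1}: actual {actual[quart == q].mean():5.1f}, "
          f"perceived {perceived[quart == q].mean():5.1f}")
```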
Both analyses seem to agree on one finding: people’s skill at estimating their own ability increases with that skill. It can’t be a purely mathematical artifact because you would see a tapering at either end, or a narrowing distribution of errors at the bottom end, not just a narrowing toward the top end.
This should be unsurprising for anyone who has become sufficiently skilled at something. Beginners can’t even discern the differences the experts are discussing, and frequently make errors in classes they don’t even understand.
Beginners, by definition, are guessing 100%. Some will guess high, others low, and the rest in between. But they are all guessing. Perhaps there's a cultural bias to over-estimate their skill? Perhaps there's a nudge in the process of the study that led them to overestimate?
The lede isn't that people over-estimate their skill level. The lede is: why would that be, when they have nothing else to go on? What is the trigger or triggers? And to say the more experienced estimate better? Well, duh.
> Lets say the overconfidence is always that you feel 50% of those better than you are actually worse than you. So everyone is equally overconfident, just that the top wont move their own placings as much as the bottom since there are much fewer people that they can mistake being worse than them. Then apply noise to this and you get the graph Dunning-Kruger got.
But the data of original D-K paper shows that the top 25% people underestimate their placings. So this whole paragraph, while logically true, has little to do with the original D-K effect.
> You could say "But they are better at estimating their rank!", but that is just a mathematical artefact, it isn't a psychological result. Even if everyone always guessed that they are number 1...
If everyone always guessed that they are number 1, it's a huge psychological result: it means people are extremely irrational when it comes to self-evaluation.
> But the data of original D-K paper shows that the top 25% people underestimate their placings. So this whole paragraph, while logically true, has little to do with the original D-K effect.
That is what you would expect under my model, due to the randomness being limited upwards for the high placings but still going downwards. That is the effect the article we are talking about refers to when they say "Autocorrelation".
I found two very interesting things in the original D-K paper [1] that challenge your otherwise reasonable point. The first is that the graph everybody associates with D-K, the one showing the beautifully perfect linear result, is one of 4. The other 3 graphs are far messier, and indeed the paper discusses the fact that the correlations tend to be weaker and in some cases nonexistent.
The second thing is that that beautiful perfectly linear graph everybody references, was measuring 'humor'!!! Humor is going to be something that's all but guaranteed to create near complete noise between self evaluation and 'expert' (professional comedians in this case) evaluation. And if everybody is essentially randomly guessing on their performance, then it will always show an extremely strong D-K effect with the top performers underestimating themselves, and the bottom performers overestimating themselves.
The experiment that most simply and directly measured 'intelligence', without complicating matters in a potentially confounding fashion, is #2. It was based on logic problems from the LSAT. And the resultant graph is just all over the place. Quoting the paper's evaluation of this study:
---
"Participants did not, however, overestimate how many questions they answered correctly, M = 13.3 (perceived) vs. 12.9 (actual), t < 1. As in Study 1, perceptions of ability were positively related to actual ability, although in this case, not to a significant degree."
Yes, D-K is another one of those "classic" psychology studies that everyone knows about but is actually rubbish and shouldn't be cited for anything. You're not the first to notice this, I pointed it out on HN last year:
At some point I should write up a proper blog post on the D-K paper in the hope that it eventually surfaces in search results, because it's past time for this paper to be put to bed. The problems you cite aren't even the full set. The whole thing was (of course) a study on a handful of psych undergrads, their selection method for expert comedians has circular logic in it and it all goes downhill from there.
But again isn't the fact that, "perceptions of ability were positively related to actual ability, although in this case, not to a significant degree" an interesting result? Not the fact that they were related, but the fact that they mostly weren't! That does demonstrate the core result as I understand it, that people are little better than random at evaluating their own performance, which was a surprising finding.
Nope, because I think D-K played a neat little trick. Whether it was intentional or not is another topic. They were using a largely homogenous group of people - Cornell undergrads taking psychology classes, and querying them on things where all performance would fall close to a similar mean.
Imagine you take 100 literal clones of somebody and query them on something, and then ask them to predict their performance. Assuming these clones are smart, they'd all estimate their performance as being at 50%, which is what would be expected. But due to natural variance, not everybody will score identically (in the same way even identical twins do not perform identically). And so you'd end up seeing some huge D-K effect with literal clones! Those at the bottom would be greatly overestimating their performance, while those at the top would be greatly underestimating it. Now step away from clones into the regular world of students, where everybody is going to think they're a bit better than average, and you get people predicting a score of about 60%. Now suddenly you see the same thing, except the lower performers would be overestimating their performance by a greater degree than the top performers were underestimating theirs.
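A quick sketch of the clones thought experiment (the 60th-percentile guess and the tiny score spread are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Nearly identical "clones": tiny true differences in score,
# but everyone guesses they're a bit above average (60th percentile).
score = rng.normal(50.0, 0.5, n)
perceived_pct = np.full(n, 60.0)

# Percentile rank of each clone's score within the group.
actual_pct = 100 * (score.argsort().argsort() + 0.5) / n

quart = np.digitize(actual_pct, [25, 50, 75])
for q in range(4):
    print(f"Q{q + 1}: actual pct {actual_pct[quart == q].mean():5.1f}, "
          f"perceived {perceived_pct[quart == q].mean():5.1f}")
# The bottom quartile appears to overestimate by ~47 points and the top
# to underestimate by ~27, even though everyone is essentially identical.
```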
To truly measure D-K, you'd need an extremely heterogenous group of people, and you'd also need questions with perceived and real domain expertise. Would a farmer with a 5th grade education evaluate his performance on a differential equations test as above average? Would a professor of diff eq evaluate his performance on a test of optimal growth strategies for buckwheat and corn, as above-average? Of course not, but then you can't make a shocking claim, don't get published, and don't become famous.
The issue is people have differing personal definitions of Dunning Kruger. The generally demonstrated effect in the sample of people Dunning and Kruger analyzed was "people tend to estimate the percentile of their own skill as closer to the average than it really is, with a slight bias towards an above-average mean. This leads to overestimation of relative ability by those in lower percentiles, and the opposite for those in higher percentiles"
However when people cite Dunning Kruger in popular culture they mean "below average people think they're above average, and above average people assume they're below average", which was not shown in the original study, and wouldn't show up in an analysis attempting to justify it via a misunderstanding of autocorrelation.
The general point in the rebuttal is correct. A completely noisy graph of people's estimations of their own ability would show a Dunning-Kruger resembling residual graph (x-y vs x). However, one wouldn't expect people in the 1st percentile to have an equal distribution of perceived skill as people in the 50th or 99th percentile. If that were true, it would be worth reporting.
> "below average people think they're above average, and above average people assume they're below average"
There’s no way to know if you’re wrong, but when I see it used it seems to be pointing out - “some (not all) under qualified people tend to defer to their own beliefs rather than the views/statements from experts, even when that is demonstrably silly.”
^ Referring to the pop-sci interpretation, not in disagreement with the general point.
The rebuttal by Daniel (andersource.dev) is useful, generally. However, when he writes ...
> The history of statistics is well out of scope for this post, but very succinctly, my answer is that statistics is an attempt to objectively quantify surprise.
... I cannot agree. Statistics is not this; it is much broader. One may or may not be surprised by particular statistics, sure, but there are _specific_ concepts that map more directly to surprise, such as entropy from information theory.
You aren't suggesting that statistics as a field defined a notion of "order", prior to thermodynamic entropy or Shannon entropy, are you? To me, that would be circular.
Based on my knowledge, it seems likely the first published quantification of disorder arose in the study of thermodynamic entropy. Later, Shannon defined entropy in information-theoretic terms, independent of physics. It can be interpreted as a notion of 'surprise' or what he called information.
My claims:
First, the field of statistics is _not_ historically rooted around concepts such as: "order/ordering" or "information/surprise".
Second, the field of statistics, as a directed graph of abstractions, is not rooted in ordering nor surprise.
Third, in teaching statistics, practically or conceptually, the concept of surprise isn't foundational. The idea of _variation_, on the other hand, is central.
I'll add a few more comments. To talk meaningfully about 'surprise', there has to be a stated or assumed baseline or 'expectation' about what is _not_ surprising. For Shannon, if the probability of an event is certain, there is no surprise. Probability and statistics work together, but they are conceptually separable. This is particularly clear when you compare descriptive statistics with, say, probabilities over combinatorics problems.
> The field of statistics is not organized around concepts relating to "order" or "ordering".
Sure but reduced to the simplest form, statistics are used to predict things, the most basic thing in the Universe being "is this particle gonna stay put or move a little in a given direction", which is related to entropy, so to me intuitively these two things seem very related. The fact that in statistics we don't use the words "order" and "disorder" doesn't mean it doesn't reduce to that.
Btw I'm an electrical engineer that isn't amazing at statistics or thermodynamics so beware I might just be talking nonsense.
> ... reduced to the simplest form, statistics are used to predict things
Inferential statistics is not the simplest kind of statistics. Descriptive statistics are both simpler and foundational for inference.
P.S. I should say that I am a bit of a stickler regarding discussions along the lines of e.g. "these things are related". Yes, many things are related, but it is really nice when we can clearly tease things apart and specify what depends on what.
I was surprised by the figure from the original article; imho that's the strongest rebuttal: perceived ability grows strictly monotonically with actual ability, with no sign of the famous non-monotonic U-curve. Yeah, the slope is less than one, and it grows a bit faster from the second to the third quartile than from the first to the second, but none of that changes the fact that people tend to slot themselves correctly. The chart is interesting in that it confirms that everyone perceives themselves to be slightly above average in terms of ability, which of course can't be true in practice. But what it also shows is that when they think they'll be below or above that (false) baseline, they're actually correct about it. So pretty much the exact opposite of what the Dunning-Kruger effect claims.
> The chart is interesting in that it confirms that everyone perceives themselves to be slightly above average in terms of ability, which of course can't be true in practice.
No, everyone biases their self-assessments toward a point slightly above the mean. That's not the same as saying everyone thinks they're slightly above average, nor that people's self-assessments have no predictive power whatsoever. The lowest performers still think they're below average, just not as much as they should. The highest performers still think they're considerably above average. But they all have a bias toward (slightly above) the middle.
So yes, people are generally correct in the direction that they deviate from that median self-assessment, but that just shows that people's self-assessments aren't completely without basis. Which D-K certainly didn't claim.
D-K claim a non-monotonic relationship, which simply isn't supported by that data, as you yourself point out: people rank themselves correctly (ordinally). I didn't mean to say that all self-assessments are the same, if that was the misunderstanding. My point is that the self-assessments indeed are meaningful, even more so than D-K claim.
Check the original paper by D-K. Fix only focused on the first plot, which has a monotonically increasing trend. The later plots show varying degrees of nonmonotonicity, though sadly they don't include error bars to indicate how statistically significant the differences between groups are.
But we don’t know their true ability, only the results on one test. It could be they accurately predicted their ability but because of random chance they did better/worse than their guess. Then you would get the exact data that is observed.
The slope will be less than one if there's e.g. any random guessing in the test, even if the self-assessment is perfect (apart from whether they know if their guess is right or wrong, of course) [1].
I think this is the effect that the post is dancing around, but doesn't seem to really understand (and how "autocorrelation" and independence are discussed is very nonstandard, to be charitable).
I agree, the statistical analysis in the original post makes me very uneasy. I think it could be a case where the conclusion is correct, even though the argument isn't necessarily.
And yes, the fact that the slope is less than one is fairly uninteresting.
The real problem here is that the Dunning-Kruger effect, as it's classically stated, claims that if you asked four people to rank themselves in terms of ability, the result would be 1-3-2-4, ie the people who know a little would put themselves above the people who know a lot but aren't quite experts. The problem is the data shows that they'd actually rank themselves correctly 1-2-3-4. But such a boring finding probably wouldn't have made the authors quite as famous, which might be why they tried a bit of data mangling, and they found this really cool story that everyone would secretly love to be true.
Which is a shame, because I think the fact that the mean of perceived ability is too high (and the variance too low) is really interesting too, and perfectly supported by the raw data.
Yes. The methodology in the original D&K is quite shoddy, and vulnerable to e.g. good old regression to the mean, and the interpretations are too strong. This is sadly very common in psychology (and many other fields I'd guess) and even researchers don't care so much if the story is juicy enough.
The pop version of the DK effect seems to be something like a 4-3-2-1 ranking, which is obviously not supported by the data.
But they wouldn't. They'd rank themselves something like 1,2,2,3. We're not dealing with a population collaborating to all rank themselves in order, but rather each person individually estimating where their abilities lie in the population.
The point is that if you ask someone in the, say, 5th percentile of ability what their ability is compared to the population, they might say 25th percentile. Ask someone at the 25th, and they might say 40th. At the 40th they could say 55th. And at the 90th, maybe they'll say 80th. So yes, if you order their guesses, they will be in roughly the correct order. But, crucially, that doesn't mean that they are ranking themselves correctly!
I really appreciate that he points out that the use of the term autocorrelation in the original article is nonstandard. It is nonstandard, but that's a rather flippant way to dismiss the rest of the article.
This rebuttal seems weak because it’s using unbounded datasets (population). A big issue with the DK research is using bounded data (test scores). For example if I get 100% right it’s mathematically impossible to have overestimated.
I have to agree. You cannot separate the statistical analysis from the meaning of the study. In the article, the author's random data is exactly an extreme replication of Dunning-Kruger. Why? Because, in his random data, people with low test scores almost always overestimate their ability, while people with high test scores almost always underestimate.
That is precisely the premise of the Dunning-Kruger effect. The fact that the original Dunning-Kruger paper shows a less extreme effect? That just shows that people are slightly better than random at estimating their own abilities - but still nowhere near accurate.
> But in reality, it would be very surprising if performance and evaluation of performance were independent. We expect people to be able to accurately rate their own ability.
This seems to be attacking an irrelevant point in the analysis. The argument goes as such: a researcher carries out all the studies needed to prove the Dunning-Kruger effect, then trips and drops all the results into a vat of acid. But he's ashamed and quickly generates random numbers for the results, and somehow the data still proves the Dunning-Kruger effect. Not just that, repeating the same exercise again and again with completely random data leads to the same result, the effect is always present. So is the Dunning-Kruger effect so powerful that it exists in the very fabric of the universe devoid of any human interaction, or is something amiss?
In this situation we are forced to look at the test we have that concluded from the data that the Dunning-Kruger effect exists and conclude that it's a bad test, we need something different.
You seem to be arguing "oh no, you can't look at random data, because we wouldn't expect the experiment to yield random data!". But that doesn't work as an argument for why the test should still be considered good. If it's supposed to have any worth, then the test has to be able to come to one of two conclusions: the Dunning-Kruger effect exists or the Dunning-Kruger effect doesn't exist. And if the test is set up such that it comes out positive for positive experimental results or for plain random noise, and comes out negative only in an extremely unlikely and narrow band of the possible outcome space, then the test is bad.
If we want to rephrase everything a bit to make the issue much clearer: let's set up a coin-toss competition between ChatGPT and a group of 100 people. Each participant goes 1:1 against ChatGPT, where both parties toss a coin and whoever has the most heads wins; on draws, toss again. In case a pair goes into an infinite loop that doesn't end before our allotted trial time, they get removed from the study. A human assistant tosses on behalf of ChatGPT on account of it not having arms yet.
Now we ask each person how they would rate their ability vs. ChatGPT in a coin-toss, everyone answers 50/50, for obvious reasons.
So we run the experiment. The line for "ability plotted against ability" is a straight diagonal line. The line for estimated ability vs actual ability is a straight flat line at 50%.
Eureka! To the presses! We have just proven the Dunning-Coin-Kruger effect! People who are worse at throwing coins tend to overestimate their ability, and people who are better at throwing coins underestimate their ability! What a marvelous bit of psychological insight, it really tells us something about how the human mind works, and has broader insights about our society! But naturally we always expected this outcome: people who are bad at tossing coins are dumb and of course they are overconfident, not like people who are good at tossing coins, who have a remarkable intellect and are therefore humble in their self estimation... and so on and on about preconceived biases that have nothing to do with the actual test we performed.
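A minimal version of the coin-toss experiment, simplified to ranking raw head counts rather than head-to-head matches against ChatGPT (group size and number of tosses are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, tosses = 100, 20

heads = rng.binomial(tosses, 0.5, n)        # "coin-tossing ability" is pure luck
perceived_pct = np.full(n, 50.0)            # everyone correctly says 50/50

# Percentile rank of each person's head count (ties broken arbitrarily).
actual_pct = 100 * (heads.argsort(kind="stable").argsort() + 0.5) / n

quart = np.digitize(actual_pct, [25, 50, 75])
for q in range(4):
    print(f"Q{q + 1}: actual pct {actual_pct[quart == q].mean():5.1f}, "
          f"perceived {perceived_pct[quart == q].mean():5.1f}")
# A textbook "Dunning-Coin-Kruger" plot from pure luck: the unlucky appear
# to overestimate themselves and the lucky appear to underestimate.
```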
But we would not expect the coin toss to have a correlation. Whereas we might expect a correlation between actual and perceived ability.
So yes, both are null results, but only one is interesting.
For instance, we would probably expect there to be a correlation between height and ability to dunk a basketball. If someone were to show that there is not a correlation, that would be an interesting result. Just because random data would match my result doesn't mean my result is nonsensical. Getting results that look like random data is still a result--it just means there isn't a correlation.
thinking about this more (i'm replying to myself!) -- i guess what the experiments for D/K show is exactly that performance on a test is uncorrelated with your idea of the performance on a test.
yes, it's kind of surprising that, having dropped the "real" results in a vat of acid, our hapless researcher replaces the missing data with random numbers and gets the same result -- but that's only because we didn't expect random numbers to model the outcome.
instead, we would have expected that, towards the bottom of the distribution of test-takers, those folks would rate themselves lower, while towards the top they would rate themselves higher. at the extreme of perfect self-awareness, the line for subjective results would exactly match the line for objectively-scored results.
this is the exact argument that is made in the post linked in the top comment:
> by using random data to argue that the Dunning-Kruger effect is not real, the author is arguing to default to the base assumption. But which base assumption do they make? One even more radical than what’s proposed by Dunning-Kruger. In the author’s world, the Dunning-Kruger study should be interpreted in the reverse direction, claiming that there is at least some self-awareness in the way people self-assess.
Yeah this must be some high end satire where the guy Dunning-Krugers up an explanation of Dunning-Kruger. Since even an economist is supposed to understand ANOVA I have to conclude that this article is a joke.
So what we have here is some scientists trying to prove that the Dunning-Kruger effect doesn’t exist and instead they give us a perfect example of the Dunning-Kruger effect.
> The irony is that the situation is actually reversed. In their seminal paper, Dunning and Kruger are the ones broadcasting their (statistical) incompetence by conflating autocorrelation for a psychological effect. In this light, the paper’s title may still be appropriate. It’s just that it was the authors (not the test subjects) who were ‘unskilled and unaware of it’.
The effect that the worst overestimate their skill was known before; that wasn't the main result of Dunning-Kruger. The effect that the best underestimate their skill can be chalked up to auto-correlation.
The best don't tend to overestimate their skill; they underestimate it. The D-K results show a consistent bias in estimates toward (somewhere near) the mean. Hence an overestimate at the bottom and an underestimate at the top.
I have the same question...why do some get it so wrong? Was there a nudge in the process of the study that caused some to answer what they did?
Heck, I'm wondering if "Honestly, I can't say" was an allowed response. Or were they forced to pick a number? If so, then I'd want to know what happens when you ask 100 ppl to pick a number between 0 and 100. I bet it's not evenly distributed. Maybe the beginners give a "discounted" version of the distribution?
Even if the autocorrelation explanation is off, there do now seem to be flaws in DK, at least from the perspective of pure and proper science.
The authors did "X - Y vs X," but that's not even the biggest problem. The authors subtracted two measures that had been transformed and bounded from 0 to 1 (think percentiles). What happens at the extremes of those bounds? How much can your top performers overestimate their performance? They're almost at 1 already, so not much. If they were to overestimate and underestimate at the same rate and by the same magnitude in terms of raw values, the ceiling effect on the transformed values means that the graph will make it look like they underestimate more often. The opposite problem happens for the worst performers.
See "Random Number Simulations Reveal How Random Noise Affects the Measurements and Graphical Portrayals of Self-Assessed Competency." Numeracy 9, Iss. 1 (2016), particularly figures 7, 8, and 9.
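A minimal illustration of the ceiling/floor point (not a reproduction of that paper's figures): symmetric, identically distributed errors on a bounded scale come out looking asymmetric once the scale is clipped. All numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

true_score = rng.uniform(0, 1, n)                   # bounded 0..1, like a percentile
raw_error = rng.normal(0, 0.25, n)                  # symmetric: over/under equally likely
estimate = np.clip(true_score + raw_error, 0, 1)    # but the scale is bounded

quart = np.digitize(true_score, [0.25, 0.5, 0.75])
for q in range(4):
    bias = (estimate - true_score)[quart == q].mean()
    print(f"Q{q + 1}: mean (estimate - actual) = {bias:+.2f}")
# The bottom quartile shows positive bias and the top quartile negative bias
# purely because the scale is clipped, even though the underlying errors
# are symmetric and identical for everyone.
```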
Exactly, that was my thought. How would it be possible to get anything other than the D-K effect, even if it wasn't just averaging to the mean?
The lowest quartile can't say they're below the lowest quartile, so any error at all will be counted as "overconfidence." The top quartile can't say they're above the top quartile, so any error at all will be counted as "underconfidence."
> Exactly, that was my thought. How would it be possible to get anything other than the D-K effect, even if it wasn't just averaging to the mean?
Quite easily with the method they demonstrate in the study in figure 11. In that study test participants are not rating themselves in terms of population percentages, but in terms of the percentage correct they got on the test. In such a case the test could be designed to have a huge ceiling that even the most knowledgeable participants would have trouble reaching. And could have such a low floor that even the least knowledgeable participants would still get some answers correct (unless they weren't even trying, which would allow throwing out their data points).
With 20 questions you could have four gimmes and four impossible questions, bounding the worst participants to about 20% and the best to about 80%.
It would have been noteworthy in the original design if more than one group of participants were, on average, within their quartiles on the guessing. I also find it noteworthy that the average guess of the lowest quartile is lower than the average guess of the second lowest quartile, and on up the quartiles. On one hand this shows some awareness of relative ability along a massively smooshed logarithmic scale. On the other hand I wonder if this laddering follows as the averages are split into quintiles and deciles.
I think if people at all levels of skill were reasonably good at measuring their own ability, we would see two curves that roughly overlap. Instead we see the graph given.
The fact that random noise can generate a mean curve on the Y axis doesn’t mean DK doesn’t exist. It just means DK’s mean self analysis resembles a middling random mean, which if you think about it, makes sense. Most people will probably self evaluate as average, regardless of their actual skill. This means DK is right as rain.
> I think if people at all levels of skill were reasonably good at measuring their own ability, we would see two curves that roughly overlap. Instead we see the graph given.
Actually, due to the construction of the test, the ability to evaluate your own absolute ability in a subject isn't sufficient for the two lines to be able to overlap.
It's a percentile axis, so you need to be able to reasonably accurately estimate the ability of everyone taking the test, and where you fall in the quartile range of those participants.
Why does it matter if it’s absolute result vs percentile result?
In the former, you’re asked to predict your score.
In the latter, you’re asked to predict your place among others.
Yes, the latter is more difficult to do accurately, but if people were really able to evaluate themselves, they would be able to understand whether they’re, on average, below or above the median. The results of DK show that most people think they _are_ the median (we all think people are like us). This means, as a result, less capable people overestimate their abilities and more capable people underestimate their abilities. It tracks.
> Yes, the latter more difficult to do accurately, but if people were really able to evaluate themselves, they would be able to understand they’re, on average, below or above the median.
Let's suppose I ask you "How tall are you?" Would you be able to answer? Good, then you are able to accurately assess your own height.
Now let's suppose I ask you "How tall are you, as a percentile of this group, including you and 99 people you don't know?" You should realize that you can't do that quite as accurately, because you don't have perfect knowledge of their heights just from knowing your own.
Now for the even more convoluted, and actual, Dunning-Kruger assessment. I ask you "How tall are you?" Great, now at which percentile do you think your deviation from your actual height falls, compared to the deviations of these other 99 people's estimates from their actual heights?
How on earth are you supposed to answer that unless you have some sort of knowledge about how they perform? Are people a cm off? Are some people 10cm off? Are people being mm precise?
The problem with the Dunning-Kruger effect is that it effectively says "People who are, on average, worse at estimating their own height tend to overestimate how accurate they are, while people who are, on average, better at estimating their own height tend to underestimate how accurate they are", but if you look at the absolute ability of people to estimate their own height, it's similar independent of how close people get. But the Dunning-Kruger analysis methodology is set up such that it transforms random noise into an observation of the Dunning-Kruger effect, which is the problem highlighted in the OP. Part of the problem here is having participants estimate on a percentile range instead of doing a simple absolute estimation. You could ask people "So how far off do you think you are, in cm?" and you'd see that people are fairly consistent in assessing their own ability, and so the Dunning-Kruger effect goes away. The effect is a result of the methodology, not of the actual people being tested.
But that's a hard sell for most people, because they have a bias about "dumb people" and the effect as originally stated confirms that bias, so people hold on to the conclusion even as holes in the methodology become apparent.
This can be dealt with to an extent by truncating the extreme ends. Even the middle quartiles in the graphs in the linked article show the same trends.
Lognormality of data is fatal to the methods of social scientists. If I were to hypothesize the underlying mechanism, it would be that raw skill is lognormally distributed among those taking tests at all (participating in these tests usually entails an implicit lower bound on IQ, but consider also the long tail of high performance in, say, sports); tests try to measure performance but with a reduction to normality (or 4 categories); and then people estimate their own skills based on their task and grading experiences, which are also a reduction to a normal or constant distribution ("I was always a B- in math in high school and expect that to have distribution X and this test to follow that distribution").
It’s three places where reductions in dimensionality take place both implicitly and explicitly. I don’t envy researchers trying to unpeel this onion. I do like the unraveling of all these problems that pop up in pretty accessible designed experiments. It makes for better understanding.
Thanks for stating just how much of a statistical minefield this is. The reference does a great job showing just how wrong the DK studies are. Unfortunately, most people have already made up their minds and are happy to link conflicting blog posts as evidence.
The DK studies are not wrong, they are misinterpreted by people who don't know what they're talking about (e.g. what the DK effect actually is), like this blogger.
"People have worse self assessment ability as their real ability declines" would be a valid interpretation of the DK data and notably would NOT be a valid conclusion from the random data in the blog post.
> The most common critique of our metacognitive account of lack of self-insight into ignorance centers on the statistical notion of regression to the mean. Recall from elementary statistics classes that no two variables are ever perfectly correlated with one another. This means that if one selects the poorest performers along one variable, one will see that their scores on the second variable will not be so extreme. Similarly, if one selects the best performers along a variable, one is guaranteed to see that their scores on the second variable will be lower…
His full response is longer than is appropriate to quote here, but you can easily find the chapter online.
Dunning, David (1 January 2011). "Chapter Five – The Dunning–Kruger Effect: On Being Ignorant of One's Own Ignorance". Advances in Experimental Social Psychology. Vol. 44. Academic Press. pp. 247–296. doi:10.1016/B978-0-12-385522-0.00005-6. ISBN 9780123855220
> Some scholars observe that Fig. 5.2 looks like a regression effect, and then claim that this constitutes a complete explanation for the Dunning–Kruger phenomenon. What these critics miss, however, is that just dismissing the Dunning–Kruger effect as a regression effect is not so much explaining the phenomenon as it is merely relabeling it. What one has to do is to go further to elucidate why perception and reality of performance are associated so imperfectly. Why is the relation so regressive? What drives such a disconnect for top and bottom performers between what they think they have achieved and what they actually have? [...] As can be seen in the figure, correcting for measurement unreliability has only a negligible impact on the degree to which bottom performers overestimate their performance (see also Kruger & Dunning, 2002). The phenomenon remains largely intact.
The DK effect says roughly, "low performers tend to overestimate their abilities." Yet when researchers analyzed the data, they found that high and low performers overestimate and underestimate with the same frequency. [0] It's just that high performers are more accurate than low performers (note how this statement differs from the DK effect). Since you can completely explain the "X graph" by the random noise combined with the ceiling effect, and since beginners' self evaluations are noisier than experts', you don't even need regression to the mean to explain why you get the "X graph."
0. Nuhfer, Edward, Steven Fleisher, Christopher Cogan, Karl Wirth, and Eric Gaze. "How Random Noise and a Graphical Convention Subverted Behavioral Scientists' Explanations of Self-Assessment Data: Numeracy Underlies Better Alternatives." Numeracy 10, Iss. 1 (2017): Article 4. DOI: http://dx.doi.org/10.5038/1936-4660.10.1.4
The discussion between Nicolas Boneel and the author in the comments of the article is interesting and Nicolas expresses the doubts I had when reading this. The whole point of the DK effect is that people are bad at estimating their skill, so if you assume that they randomly guess their skill level then of course you will replicate the results.
The correct model for a world without DK should be something like (estimated test scores)=(actual test scores)+noise, and then the only form of spurious DK you'd expect is caused by the fact that there's a minimum and maximum test score. But this effect would be proportional to the variance of the noise, and I assume the variance on the additional dataset is too low to fully understand the effect seen there.
Also, in this model on average everyone should still guess correctly in which half of the distribution they are, but even the bottom quartile seemed to estimate their abilities as above the 50th percentile
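A quick check of that claim under the stated model, with arbitrary noise levels: even with the floor at 0 pushing estimates upward, the bottom quartile's average estimate stays well below the 50th percentile.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
actual = rng.uniform(0, 100, n)                     # actual percentile

quart = np.digitize(actual, [25, 50, 75])
for sigma in (10, 25, 40):
    # Estimated percentile = actual + unbiased noise, clipped to the 0-100 scale.
    est = np.clip(actual + rng.normal(0, sigma, n), 0, 100)
    q1 = est[quart == 0].mean()
    print(f"noise sd {sigma}: bottom-quartile mean estimate = {q1:.1f}")
# Even with large unbiased noise, the bottom quartile's mean estimate stays
# well below the 50th percentile, so the ~60th-percentile average in the
# D-K data needs an actual bias, not just bounded noise.
```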
The correct model is probably (estimated test score + estimation noise) = (actual test score + test noise). The test contains a random element, e.g. guessing, that the person can't estimate.
Just because the data appear random doesn’t mean you’ve gotten at the cause though.
From those charts it could equally be low skill throughout, or something nuanced like lack of skill at estimating at the bottom, improving skill in estimating through the middle, and high skill and learned modesty at the top.
> Also, in this model on average everyone should still guess correctly in which half of the distribution they are, but even the bottom quartile seemed to estimate their abilities as above the 50th percentile
Depends on the noise applied. If the noise is -10% to +100% for everyone then you get roughly the graph Dunning-Kruger got. So there is no reason to believe that the best are better at estimating their abilities, just that you can't estimate your own rank as better than the best.
That's a great observation. For what it's worth though, it does seem logical to me that the best would also be best at estimating their skill. Not necessarily because they're better at it per se (though there's likely some of that too, for the reasons originally posited by D-K), but also because they have an easier problem to solve. When you know something well, it's fairly obvious that that's the case. (Think of the experience of acing a math test. It's entirely possible you'd know you answered everything correctly.) When you struggle somewhat though, it's much more difficult to estimate how much you're struggling compared to how others would fare.
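As a quick sketch of the asymmetric-noise idea a couple of comments up (noise roughly -10% to +100%, clipped to the score range), here is a toy simulation; the noise range is that commenter's guess and the rest are my own assumptions, not anything fitted to Dunning and Kruger's data:

```python
# Sketch of the asymmetric-noise suggestion above: estimate = actual + noise,
# with noise uniform on roughly -10 to +100 percentile points and the result
# clipped to 0-100. The noise range is the commenter's guess, not anything
# fitted to Dunning and Kruger's data.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
actual = rng.uniform(0, 100, n)
estimate = np.clip(actual + rng.uniform(-10, 100, n), 0, 100)

for q in range(4):
    mask = (actual >= 25 * q) & (actual < 25 * (q + 1))
    print(f"quartile {q + 1}: mean actual {actual[mask].mean():5.1f}, "
          f"mean self-estimate {estimate[mask].mean():5.1f}")
# The bottom quartile overestimates by roughly 45 points while the top quartile
# overestimates only slightly, so a D-K-style narrowing gap appears even though
# nobody in this toy model has any real insight into their own skill.
```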
Nonstandard terminology warning: the author is using "autocorrelation" in a way I've never seen before. There is a much more common usage of "autocorrelation" to refer to the correlation of a timeseries with itself (shifted by some amount).
If you use autocorrelation to refer to the thing in OP, you'll probably confuse people who know statistics, and vice versa.
> Nonstandard terminology warning: the author is using "autocorrelation" in a way I've never seen before.
That's a nice way of putting it. A more accurate description would be: the author is butchering the key essence of autocorrelation, since they don't clearly mention that it is a temporal relationship!
> What is autocorrelation?
> Autocorrelation occurs when you correlate a variable with itself.
Groan.
A standard definition is:
> Autocorrelation refers to the degree of correlation of the same variables between two successive time intervals. It measures how the lagged version of the value of a variable is related to the original version of it in a time series. Autocorrelation, as a statistical concept, is also known as serial correlation.
The more common experience with autocorrelation is with time series, but what the author said is correct even in that context. A time-series autocorrelation relates the same series to itself at different times. At the simplest you plot the arrays X vs X where X[i] = f(t[i]). You may then complicate it further with some transformation, plotting g(X) vs X (e.g., a moving average).
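To make the terminology point concrete, here is a small sketch contrasting the usual lagged, time-series sense of autocorrelation with the article's usage (correlating x with a quantity that contains x); the data-generating choices are arbitrary:

```python
# Small sketch contrasting the time-series sense of autocorrelation (correlation
# with a lagged copy of the same series) with the article's usage (correlating x
# with a quantity that contains x). The data below are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(2)

# Time-series autocorrelation: lag-1 correlation of a series with itself.
walk = np.cumsum(rng.normal(size=1000))        # a random walk is highly autocorrelated
lag1 = np.corrcoef(walk[:-1], walk[1:])[0, 1]

# The article's usage: correlate x with (y - x) where y is independent of x.
skill = rng.uniform(0, 100, 1000)
self_estimate = rng.uniform(0, 100, 1000)      # independent of skill
shared_term = np.corrcoef(skill, self_estimate - skill)[0, 1]

print(f"lag-1 autocorrelation of the random walk: {lag1:.2f}")
print(f"corr(skill, estimate - skill), independent variables: {shared_term:.2f}")
# The second correlation is strongly negative (about -0.7) even though skill and
# self-estimate are independent; that is the effect the article calls
# "autocorrelation", and it involves no time lag at all.
```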
Consider the imaginary world that the author describes, in which people's estimate of their score is independent of their actual score. Wouldn't it be fair to say that, in this imaginary world, the DK effect is real?
The point of the effect is that people who score low tend to overestimate their score and people who score high tend to underestimate. Of course there are lots of rational reasons why this could occur (including the toy example the author gave, where nobody has any good sense of what their score will be), but the phenomenon appears to me to be correct.
The author's example with random points is bad because you might reasonably expect people to behave differently than uniform random points.
It'd be reasonable to expect that people who are good at a thing estimate that they are good at it, and that people who are bad at a thing estimate that they're bad at it. I mean, my kids love math and always estimate themselves to do well on math tests (and they usually do). They have classmates who loudly detest math, estimate they'll do badly, and often do (at least somewhat). Similarly I'm a bad cook and I have no doubt that if I join a cooking contest, I'll get few jury points. The expected data is correlated.
So if a study finds that, well actually, the data is not at all that correlated! Lots of people who estimate that they'll do fine actually don't, and equally many people who estimate that they'll do badly, actually do fine (ie it looks like uniform random data), then that's surprising, and that's the D-K effect.
Right? I'm no statistician at all so I might be missing something.
If it's a statistical illusion, the correlation is still true, it just has no business being studied by psychologists.
If I roll a die, and then roll a second die, I might study the behaviour of the second die and wonder why it wants to add up to 7 with the first die. Since they're dice, I can dismiss that as a stupid idea, but if they were people, I could certainly be led astray by psychological theories about them.
Autocorrelation occurs when you correlate a variable with itself.
Wikipedia's definition of autocorrelation:
Autocorrelation, sometimes known as serial correlation in the discrete time case, is the correlation of a signal with a delayed copy of itself as a function of delay.
Of course, zero delay is the trivial case, but really, the article's definition is at best inaccurate. D-K has nothing to do with time delay, and calling it autocorrelation seems like a weird pun that doesn't quite land.
To be fair, there is such a thing as spatial autocorrelation (from geostatistics), the term autocorrelation does not necessarily imply the varying dimension is time
I think the issue here is a confusion about what "bias" means. If they are self-assessing at random, then the high performers will all underestimate themselves, but this is not a bias towards underestimation as they are choosing randomly.
That said, the chart from D-K seems to show a different bias and line up roughly with what you would expect. Someone with no knowledge assumes they are average skill and hence inflates their position, someone who is very good doesn't want to rate themselves the best because they assume others know as much as they do. The assumption underlying both groups is that you are normal and others are similar to you.
I hypothesise that most people think they're average, which is something you could easily test by asking them to rate how well they think the average person would do on a test and comparing it to that individual's test score. I'm almost certain that high performers will overestimate the average, and low performers underestimate it.
If there is a linear relationship between test score (X, ability) and test score self-assessment (Y, self-perception), then the random variables are modeled as:
$$
Y \sim aX+b+N
$$
Where N is some statistically independent noise, mean zero.
To get a "DK effect" we need (a-1) < 0, or a < 1. If a=0, in the case of the blog post, then this is absolutely true. If a=1 (which, along with b=0, is the ideal scenario), then this is barely not true. If a > 1, then we'd have a whole new effect about arrogant experts.
So the only thing that matters from this "auto-correlation perspective" is the rate at which an individual's self-assessment increases with their ability. As long as they underestimate the increase, a "DK effect" will occur.
However, in the above analysis, we ignored the variable b. If a = 0.8 and b=0, we'd never have the so-called "DK effect" even though it matches the "auto-correlation perspective" because everyone would underestimate their ability.
This tells me that the value of b matters. It is sort of like the prior ability everyone assumes they have. What the DK paper shows is that b > 0.5, which I think is in line with the spirit of the popular interpretation of the "DK effect". People should not be assuming that, at a minimum, they have a capacity higher than the average.
At the same time, the value of b isn't insanely higher than 0.5, which also makes me want to cut those unskilled and unaware some slack. It "seems reasonable" to assume your baseline is average. That can't be true for everyone, but it feels intuitive.
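A rough numerical sketch of that Y = aX + b + N model, with scores on a 0-1 percentile scale and purely illustrative parameter values:

```python
# Rough sketch of the Y = a*X + b + N model above, checking when bottom
# performers overestimate and top performers underestimate. Scores live on a
# 0-1 percentile scale; the parameter values are purely illustrative.
import numpy as np

rng = np.random.default_rng(3)

def mean_error_by_quartile(a, b, noise_sd=0.1, n=100_000):
    x = rng.uniform(0, 1, n)                       # actual ability (percentile)
    y = a * x + b + rng.normal(0, noise_sd, n)     # self-assessment
    quartile = np.minimum((x * 4).astype(int), 3)
    return [round((y[quartile == q] - x[quartile == q]).mean(), 2) for q in range(4)]

print("a=0.0, b=0.50:", mean_error_by_quartile(0.0, 0.50))  # the blog's random model
print("a=1.0, b=0.00:", mean_error_by_quartile(1.0, 0.00))  # accurate self-assessment
print("a=0.3, b=0.55:", mean_error_by_quartile(0.3, 0.55))  # roughly DK-shaped
print("a=0.8, b=0.00:", mean_error_by_quartile(0.8, 0.00))  # a < 1, yet everyone underestimates
```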
That is not an autocorrelation. The OP is equating linear dependence with autocorrelation, which is not how we use that term. Autocorrelation is when a random process is correlated with a time-lagged version of itself.
Figure 2 in this paper shows the result of an experiment where skill and perception of one's skill are measured independently, to eliminate any statistical artifact of auto-correlation. And lo and behold, on average skill is uncorrelated with the accuracy of one's own assessment. No DK effect at all. What does show up is that more qualified people are more consistent in estimating their skill (i.e. their assessments are less variable), but the mean accuracy is still 0.
So indeed, on average actual and perceived skills are uncorrelated. That's exactly what the numerical proof with random numbers shows and why in many cases we apply Occam's razor.
How does the author miss the fact that in his graph the self-estimation hovers around the 50th percentile (as expected with random data), whereas in the DK graph it averages around the 60th to 75th percentile? That is a significant bias.
As a control, the author should have plotted Fig. 8, but based on DK's data (or at least estimated how Fig. 8 would look based on the results in Fig. 3). Then it would have been obvious that the self-estimation error tends more towards overestimation in the lower quartiles than it does towards underestimation in the higher quartiles, which is exactly what DK's conclusion was.
In essence: this article actually confirms that DK is not just a random statistical artifact.
At the risk of sounding like a complete idiot, isn't the hypothesis of the original paper still true? Let's assume self assessment score is perfectly random between 0% and 100%, so on average every group will always estimate themselves to be 50% correct
Then by definition that means people who are unskilled and often incorrect will overestimate themselves, while people who are often correct will underestimate themselves. Take a complete idiot for example. You always get 0% test score. Yet your self-assessment is random between 0% and 100%. Hence you overestimate yourself much more often than people who always get 100% test score.
In fact, if the two are uncorrelated, then that still means that the unskilled will, on average, overestimate themselves and the skilled will underestimate themselves, which is the pattern the original paper describes.
And here it is from OP (which made me laugh—right or wrong). And leave your hubris at home unless you rate yourself a damn fine statistician ;-)
“However, there is a delightful irony to the circumstances of their blunder. Here are two Ivy League professors arguing that unskilled people have a ‘dual burden’: not only are unskilled people ‘incompetent’ … they are unaware of their own incompetence.
“The irony is that the situation is actually reversed. In their seminal paper, Dunning and Kruger are the ones broadcasting their (statistical) incompetence by conflating autocorrelation for a psychological effect. In this light, the paper’s title may still be appropriate. It’s just that it was the authors (not the test subjects) who were ‘unskilled and unaware of it’.
I disagree. Dunning-Kruger is not a statement about predicted score correlating with actual score in some way. It states that predicted score does not correlate well with actual score. This can be rephrased as the prediction error having a negative correlation with the actual score. The article then claims that this negative correlation is autocorrelation. That is true, but the correlation still exists. The thing is that ideally we EXPECT there to be no correlation of the prediction error with the actual score, but we find autocorrelation. Going back to variables where this autocorrelation is not there, we EXPECTED to find a 1:1 positive correlation between predicted score and actual score but find no correlation, or a weak correlation.
So finding autocorrelation when you expected to find no correlation is pretty much the Dunning-Kruger effect here.
In fact their example with the random data totally makes sense: Suppose people uniformly randomly estimate their performance. Then the people who are low skilled will consistently over-estimate and the people who are high-skilled will consistently underestimate. Of course there is no causation here, as the people choose randomly, but there is an undeniable correlation. I guess the question is if you view the Dunning-Kruger effect as a claim to low skill CAUSING positive prediction error, or just correlating with it.
Naïve take: I’ve always felt like Dunning-Kruger is just the result of the fact that when guessing the value of anything people tend towards some common mean, and so if the true value is low your guess tends to be high, and vice versa. This assumes nothing about what is being guessed, but does assume (perhaps wrongly) that there is a commonly believed mean value and that people tend to imagine they are close to it.
That's essentially the plain-language interpretation of what the author of this article is pointing out: when you plot (actual score) against (difference between self-assessed score and actual score), you will always find a trend that underperformers overestimate and overperformers underestimate, for the exact reason you state.
I know I'm not smart enough on statistics or psychology to evaluate the article but it always struck me that D&K seemed to say something similar to what my grandpa said when I was a wee lad, "The more you know, the more you realize how much you don't know", I know he wasn't the first person to say that, but he was the first person to say it to me. I don't know if D&K is autocorrelation or not, but I know that an awful lot of people seem to think they know more than maybe they actually do, probably me included. Hmmm, maybe the author of that article as well? I wonder if that occurred to him, seems like a glaring oversight not to at least recognize that possible irony.
In the article, a real study was used as a counterexample to the DK effect.
Part of the results was a correlation that people who were "less capable" were also worse at predicting their own skill, and people who were "more capable" were better at predicting their own skill.
While similar to the DK effect, this is different, as the DK effect states that "less capable" individuals specifically _overestimate_ their skill, as opposed to simply being wrong (both over and under -estimating).
With relation to some people "seeming to think they know more than they actually know", this is likely confirmation bias in the sense that there are an equal number of people who don't know much, and know that they don't know much.
A related effect that I've wondered about is: perhaps lower-skilled people compare themselves to the general public, while perhaps skilled people compare themselves to a smaller group of skilled peers.
In other words, if you asked me if I'm good at riding a bicycle, I'd compare myself to others in the general population and say "yes". But if you ask a weekend bicyclist, they'd be better than me but perhaps compare themselves to weekend bicyclists, and rate themselves lower. And the effect might repeat for competitive bicyclists.
If true, this could explain why we intuitively believe the DK effect.
1. People really like the idea of smart people being humble and arrogance meaning stupidity, so they like to believe that DK is true, and they like to repeat this.
2. Some smart/skilled people are humble, some are arrogant.
3. Some smart/skilled people underestimate their skills, some overestimate.
4. Some stupid people are humble, some are arrogant.
5. Some stupid people underestimate their skills, some overestimate.
Overall, even if there is a correlation, you can't tell by just arrogance of a person whether we are dealing with DK or whether it's an effect at all. People's personalities, skills and everything are a bit more complex than that.
Overall bringing DK up seems like some sort of social justice/fairness effort rather than something that is actually true given any situation where someone is arrogant.
The author fails to make his point quite badly. Of course if everyone's self assessment was random the bottom quartile would overrate themselves! And that would be half of the Dunning-Kruger effect and we could truthfully say "the bottom quartile of people overrate themselves"!
The other part where those at the top have a better idea or where they rank noticeably does not come out in his toy example.
Honestly, he comes across as not having the slightest understanding of how people interpret those graphs...
In my experience, people abuse flattery too much, so it is hard to tell if their positive opinions of me are genuine and with merit. Generally speaking, I try to see the big picture and realize that no matter how well I do, in a more global sense I am at best top 50th percentile, slightly above average. It is chance, relationships, and supply/demand economics that ultimately decide our ability to apply our talents effectively.
When it comes to others, I wish more people experienced the D-K effect. It gets frustrating sometimes dealing with smart and talented people who think they are revolutionary rockstars. You know the kind: they see other people's work and are shocked at how bad everything is, but never fear, they, our heroes, are here to refactor everything until they leave and another hero looks at their work and rescues metropolis from it again. Patience and humility are rare virtues for all of us.
We discussed this in a previous thread. The author is basically hypothesizing that perhaps people are so universally terrible at predicting their ability, their self-rating is like an unconditional random variable - just a random draw that is not influenced by their actual ability level at all.
If this is true, then when your actual ability is high, your self-rating is likely to be lower than your ability simply by random chance. For example, if ability ranges from 0-100, your actual ability is 99, and your self-rating is a uniform random number from 0-100, your self-rating is 99% likely to be lower than your actual ability. Conversely, if your actual ability is low, your self-rating is likely to exceed your actual ability level.
When it's explained clearly and simply, the criticism raises a lot of questions. Are people actually that bad at rating their own ability? I doubt it.
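A quick numerical check of that hypothetical (uniform, ability-independent self-ratings); the sample size and ability levels below are arbitrary:

```python
# Quick numerical check of the hypothesis described above: if self-ratings are
# uniform on 0-100 and independent of ability, someone with ability 99
# underestimates themselves about 99% of the time. Purely an illustration.
import numpy as np

rng = np.random.default_rng(4)
ratings = rng.uniform(0, 100, 1_000_000)

for ability in (99, 75, 50, 25, 1):
    p_under = (ratings < ability).mean()
    print(f"ability {ability:>2}: rated below own ability {p_under:.0%} of the time")
```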
> However, there is a delightful irony to the circumstances of their blunder.
Indeed. And I find the tendency of people in this comment section to defend the flawed theory is further confirmation of another scientific finding: that we decide based on emotion and then justify our decision using rationality.
I would call this type of argument a case of regression to the mean rather than "autocorrelation". That, of course, in principle requires independence between performance and assessment of performance. In many cases, it would make little sense to assume that performance and assessment of performance are independent. But even then, one can simulate random data with some correlation, and still get a DK effect merely as a statistical artifact. An overview of similar critiques, and a similar argument, is in https://www.frontiersin.org/articles/10.3389/fpsyg.2022.8401... .
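As a sketch of that point, one can draw performance and self-assessment from a correlated bivariate normal (my own choice of model and rho values), convert both to percentile ranks, and look at the quartile averages:

```python
# Sketch of the point above: even when performance and self-assessment are
# positively correlated (not independent), a DK-style gap between quartile
# averages appears from regression to the mean alone. The bivariate-normal
# model and the rho values are my own illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def gap_by_quartile(rho, n=100_000):
    cov = [[1.0, rho], [rho, 1.0]]
    perf, assess = rng.multivariate_normal([0.0, 0.0], cov, n).T
    perf_pct = stats.rankdata(perf) / n * 100      # percentile of actual performance
    assess_pct = stats.rankdata(assess) / n * 100  # percentile of self-assessment
    quartile = np.minimum((perf_pct / 25).astype(int), 3)
    return [round((assess_pct[quartile == q] - perf_pct[quartile == q]).mean(), 1)
            for q in range(4)]

for rho in (0.0, 0.3, 0.6, 0.9):
    print(f"rho={rho}: mean (perceived - actual) percentile by quartile:", gap_by_quartile(rho))
# The bottom quartile "overestimates" and the top quartile "underestimates" for
# any rho < 1; the gap shrinks as the correlation rises but never vanishes.
```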
So from my understanding, the Dunning-Kruger Effect paper doesn’t show the distribution of the perceived test scores nor the standard deviation, only an average, which rises with actual test score level.
If they showed the spread bar in each bin, you could form very different conclusions. Do low skilled people consistently estimate their score at around 60, or do they give effectively random results centred around 60?
Assuming the latter, it could mean that low skilled individuals are completely unable to evaluate their performance while higher skilled people are slightly better at it but still not very good, giving a slightly positive correlation which… is very distinct from what the DK effect implied.
You can take out the x from both sides, and the y would still not be a horizontal line.
In their eagerness to 'deconstruct' the narrative, do the authors merely provide another example of Dunning-Kruger by overestimating their own cleverness?
The DK effect has gotten WAY more cred than it should. Today, it is just another feel-good piece that people use to justify their feeling that they're (ironically) surrounded by loud idiots.
The author measures the Dunning Kruger effect on his random data exactly because he assumes it when generating his random data.
By modelling skill and perceived skill as uniform draws between 0 and 100, the unskilled (e.g. skill=0) will over-estimate their skills (estimated skill = 50, the mean on the uniform random variable) and the skilled (e.g. skill=100) will underestimate it (as 50 as well, again the mean of the same random variable). The only ones who will be correct (on average) are the average skilled ones (skill=50).
Yes. This article highlights the 2016, 2017 and 2020 debunkings of DK. But it hangs on as an oft repeated scientific fallacy.
The fact that anyone has to ask if it has been debunked shows how desirable some people find the DK myth. Even in the comments here, people are not willing to be skeptical of DK. That's interesting psychology.
Idk, I genuinely feel this after having had to deal with 10+ doctors who all had different opinions. The last doctor finally came to the same conclusion as me, and he was the last person I had to see.
There's always exceptions. And sometimes reading publications pertaining to a very specific thing should give you more say on a subject.
I just feel bad that American taxpayer money and the best years of my life were spent telling medical professionals they don't know what they are talking about.
It's fascinating how great Elo and similar ranking systems are at curbing DK. You just get a number, and that's how good (bad) you are. It's incredibly precise too, there's just no arguing with it.
Also since the topic is D-K I'm a bit scared that I'm the fool here, but isn't he misusing the term autocorrelation? What he describes sounds like just normal correlation?
I think what this article is missing is “the chart DK should have used.”
Instead we get a spurious explanation that doesn't make a lot of sense, based on completely fabricated data. It's entirely natural for something that looks like DK to emerge from randomized data, especially when the Y-axis values cluster around the mean (about 50 in this case).
If self evaluations are random, and you group a bunch of them together, then you'll see values around the 50th percentile. That's why their self evaluation line is nearly flat.
In the actual data though, the line clearly trends upward. The people who did well appear to be scoring themselves non-randomly.
This is not ‘autocorrelation’, it is regression to the mean. I find the article unclear and imprecise.
For those interested in a better overview of the Dunning–Kruger effect, I recommend this short article by McIntosh & Della Sala instead:
> in the academic literature, it has been suggested that the signature pattern of the DKE (Figure 1A) might be nothing more than a statistical artefact. In a typical study, people’s tendencies to under- or overestimation are analysed as a function of their ability for the task. This involves a ‘double dipping’ into the data because the task performance score is used once to rank people for ability, and then again to determine whether the self-estimate is an under- or over-estimate. This dubious double-dipping makes the analysis prone to a slippery statistical phenomenon called ‘regression to the mean’.
The best way to differentiate DK from autocorrelation is motive. Low-performance people will focus on motives that reinforce the perception of their competence, for example preferring code style over code delivery because, while both may arguably be important, one requires less effort and risk to attain.
There is research to qualify this out of Stanford. People will shift motives to attain compliments, and the types of compliments received will dictate the challenges they are willing to accept. When a compliment is specific to an action and measurable, people will strive for continuously more challenging tasks so as to continually receive specific compliments. When compliments are generic and directed at the person, they will tend to prefer progressively less challenging tasks so that they continue to shine relative to the attempted effort. The differences in behavior produce a natural Dunning-Kruger effect wherein people seeking less qualified activities are more likely to overestimate their potential and degree of success.
This is also statistically verified in research that correlates predictions to confidence. The more confident a person is in their predictions, such as political talk radio hosts, the less accurate their predictions tend to be.
I don't know if I agree that it's an autocorrelation, but one way to explain the Dunning-Kruger effect is by acknowledging this simple fact:
Most people think that they are an average person, but they can't all be average; there must be some people substantially below the median. Therefore, those people must overestimate their abilities.
This also applies to other aspects, such as attractiveness. Less attractive people would overestimate their attractiveness.
For all of the tests and rebuttals of the Dunning-Kruger effect the people tested are not drawing from the totality of other people, but trying to compare themselves solely to those who also took the same test.
Anyone in a position to take such a test is almost guaranteed to be above average compared to the general population (which includes babies for intellectual tests, or the extremely old for attractiveness tests).
> If the Dunning-Kruger effect were present, it would show up in Figure 11 as a downward trend in the data (similar to the trend in Figure 7). Such a trend would indicate that unskilled people overestimate their ability, and that this overestimate decreases with skill. Looking at Figure 11, there is no hint of a trend.
There certainly is a hint of a trend. Why do people, when visualizing data with a distinct trend, say that no trend exists just because the "error bars" from a particular statistical test overlap zero!?
Freshmen trend to over-confidence. Grad students trend to under-confidence. Undergrads in general trend to over-confidence (though this trend decreases as year in school increases), and post-graduates, whether grad students or professors, trend to under-confidence.
These "trends" are not statistically significant, but they certainly are a trend!
Also, the random data distribution in figure 9 doesn't show the same trends as Dunning-Kruger's curve in figure 2. Perhaps there is at least one psycho-social mechanism here worth investigating?
If they're actually error bars, you can shrink them with more data. That will turn the hint of a trend into an observation of a trend. If it wasn't random noise giving a fake hint.
> If they're actually error bars, you can shrink them with more data.
Assuming the new data has the same systemic or instrumental bias as the old data. Even using a different test date could skew results enough to widen the error bars.
I place mechanistic theory prior to statistics in science. Mechanistic theory can be tested, statistics are a kind of test.
If a statistically-insignificant result shows consistent, though non-significant deviations, such as the kind seen in Figure 11, then it tells me it's worth investigating whether mechanism(s) are explaining a very small portion of the variation that will not, in itself, show up as statistically significant, as it's being swamped by variation in other parameters.
Consistency is a synonym for statistical significance. If there's consistency beyond random alignment, then there should be a statistical test you can apply over your data to extract the signal.
You can extract surprisingly small signals relative to variation in other parameters. But if it's actually swamped, then it might not be real, so go get more data.
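As a small illustration of that point (a weak but real trend becoming detectable with more data), here is a sketch with an arbitrary slope, noise level, and sample sizes:

```python
# Small illustration of the point above: a weak but real trend that fails a
# significance test in a small sample becomes detectable with more data. The
# slope, noise level, and sample sizes are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def fitted_trend(n, true_slope=0.05, noise_sd=1.0):
    x = rng.uniform(0, 10, n)
    y = true_slope * x + rng.normal(0, noise_sd, n)
    fit = stats.linregress(x, y)
    return fit.slope, fit.pvalue

for n in (50, 500, 5000):
    slope, p = fitted_trend(n)
    print(f"n={n:>4}: estimated slope {slope:+.3f}, p-value {p:.3f}")
# A slope this weak will often fail a significance test at n=50; with enough
# data the same underlying trend is detected reliably.
```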
> Consistency is a synonym for statistical significance.
So basically you're telling me that if I can visually see a consistency that does not show up in their statistical test, then they aren't running an appropriate statistical test on what I'm seeing.
> But if it's actually swamped, then it might not be real, so go get more data.
> So basically you're telling me that if I can visually see a consistency that does not show up in their statistical test, then they aren't running an appropriate statistical test on what I'm seeing.
Either they're not doing the right statistics, or it's a "consistency" that is much more likely to show up randomly than you naively expect, and the study needs to be repeated or enhanced.
Sometimes you can see a pattern that's just a figment of chance. See also: numerology, jelly bean xkcd
Psychologists using their pet theories to explain results, and people then taking that explanation as the truth when they should really just look at the data, is probably as large a problem as the replication crisis.
Auto-correlation, or self-correlation: a correlation between different things may indicate an actual relation (smoking is correlated with early mortality), whereas a self-correlation is a tautology.
The numeric experiment does not produce a line identical to what DK report. If DK's line were horizontal at 50%, it would indeed be nothing but autocorrelation.
DK for me is simply: "You don't know what you don't know." When that happens, it's easy - surprise, surprise! - to misjudge your skill level. In a way, it almost feels cruel to ask someone with too few points of reference to say how much they know. The fact is whether high, low, or in the middle...they are guessing.
On the other hand, with enough experience the depth and breadth of your context improves, as it should. At that point, mis-self-assessment is the result of arrogance, bravado, etc. That's a different problem than simply not knowing.
If nothing else, DK has a case of apples vs. oranges.
I must object to this paragraph: "To be honest, I’m not particularly convinced by the analytic arguments above. It’s only by using real data that I can understand the problem with the Dunning-Kruger effect. So let’s have a look at some real numbers."
He then goes on to use synthetic data.
Beyond that dishonest sleight of hand, this is in the category of "one thought experiment didn't prove the phenomenon exists, therefore it must not exist" logical errors.
Most people, even here on HN, do not know what the DK effect actually claimed to show. It does not show that confident people are more likely to be incompetent. Their primary result shows a positive correlation between confidence and supposed skill. (What skill, you ask?*)
I don’t know which statistical artifact it is, but I am quite convinced that the so-called DK effect is not demonstrating something interesting about human psychology, I don’t buy that this is a real cognitive bias. I’ve read the paper several times, and the methodology seems to be lacking rigor. They tested a small handful of Cornell undergrads volunteering for extra credit, not a large sample, not the general population, and tested nobody who actually fits the description of ‘incompetent’ in a meaningful way. They primarily measured how people rank each other, not what their absolute skill was - and ranking each other requires speculating on the skills of others. There are obvious bias problems with asking a group of pampered Ivy League kids how well they think they rank.
* One of the four “skills” they measured was ability to get a joke - “appreciation of humor” - Huh? This is subjective! The jokes used aren’t given in the paper, either. Another was ‘grammar’ tests.
The Dunning-Kruger effect isn't as the article first quotes. It's an effect that everyone experiences. We as humans tend to oversimplify things we don't understand well or at all, and therefore we overestimate our expertise on these subjects. We also tend to underestimate how much of an expert we are on subjects we do know well. Everyone does this. It's not just dumb people.
> We also tend to underestimate how much of an expert we are on subjects we do know well
Any evidence for this, except Dunning-Kruger? To me it looks like everyone overestimates themselves. There are a lot of professionals who think they are undervalued and that people worse than them get all the rewards and fame.
I went through the whole article, and I am not only very skeptical about the claimed debunk but also wonder what kind of psychological trope you might label as correlative to such an article.
I mean, "bad science built only on rhetoric" is a double-edged sword, you know.
To start with, the graph presented at the end does not look like the one from the original article, where the self assessment does grow significantly, though it starts higher than average and grows less quickly than external assessment.
Also, the article focuses on a "random" data set, but we know that there are different classes of apparently noisy plots. A noisy distribution of self-assessment would actually be an informative figure too.
So the biggest issue here is that it kind of pretends that however the ordinate value is defined, if it includes the abscissa in its definition you'll get the same kind of plot as a result, which is obviously false. You could easily come up with arbitrary values coupled to "x" that would look radically different.
I was curious if the self assessment is done before or after the test.
Bing chat gave me this wild answer:
> The effect is usually measured by comparing self-assessment with objective performance. For example, participants may take a quiz and estimate their performance afterward, which is then compared to their actual results 1. Therefore, people estimate their ability before the test by Dunning-Kruger.
In the case estimation is done before: If you've had training, like a soup of ingredients, that matches the priorities and biases of the test it would be strange if no measurable effect remained.
If it's done after: You can create trick questions specifically designed to test if someone learned a specific thing. A good test would test for that. If someone didn't learn the specific thing they could give/guess the wrong answer with some confidence.
The design of the test has great influence on how poorly you'll think you've done. I would argue that the superior test is the one designed to fool you. Hans Rosling famously created a multiple choice test with 4 answers per question with average results below 25%.
On a more fascinating note, unskilled means all areas of expertise outside your own.
People who are universally unskilled in all areas are of course more likely to think they are unskilled. In reality these people know little bits about many things.
This in contrast with people who spend all day, every day, for their entire lives pondering topics inside their area of expertise. If you are doing one thing you aren't doing all of the other things.
Wikipedia had hilarious instances of experts contributing to countless articles who accidentally ended up on the wrong page. Suddenly they have no patience, think they know everything, and act like children. It's funny because you can't just ban valuable contributors.
I would love to see this DK test done with professors furthest removed from the area of expertise.
DK says that skilled people tend to underestimate their skill while unskilled people tend to overestimate their skill. This is likely a statistical artifact.
IS (illusory superiority) says that people tend to overestimate their own skill compared to how other people estimate their skill. This seems likely true on average, but not necessarily in all cases.
I do think the original Dunning-Kruger plot is a bit of an odd presentation. The way I look at it is just to say that people's self-estimates of their ability fall into a relatively narrow range (e.g., 55-75th percentile on the graph), whereas their actual abilities of course cover the whole range from 0-100th percentile. You don't really need the plot of "x versus x" (average score in each quartile). You just need to say "people's self-assessments seem to start unrealistically high and only go up a little, even as their ability goes up a lot".
This makes sense. IMO, the reason why Dunning-Kruger effect is so popular among the upper classes (along with Impostor Syndrome) is that it helps to provide justification for social inequalities as it corrects inner monologues.
"How come I have so much given that I'm not as skilled as these other people? I must suffer from impostor syndrome."
"Look at all these people complaining instead of taking responsibility for their own failures, they probably suffer from Dunning-Kruger effect. Their work must not be good enough."
But of course this requires a certain detachment from reality (hence why many upper class people have blind spots). If they actually took a look at the evidence, they may find that some of these 'Dunning-Kruger people' are actually far more skilled than they imagine. I think it explains why people like Jürgen Schmidhuber who made significant contributions to AI tend to be ignored. Then because people are ignoring them, they are compelled to promote themselves harder to try to get their fair share of attention but they are then put in the 'Dunning-Kruger basket' until someone with a very good reputation like Elon Musk comes along and gives them credit. I think the same could be said about the mathematician Srinivasa Ramanujan; many mathematicians ignored his work or assumed he was a fraud because he seemed too sure of himself for someone who was completely unknown at the time. If such gross injustice can happen in a perfectly-quantifiable field like math, you can be sure it can happen in any field.
Article claims Dunning-Kruger is present in a population where everyone estimates their own skills based on dice rolls. Someone who estimates their own skills based on a dice roll is objectively crap at estimating their own skills. Dunning-Kruger claims people are objectively crap at estimating their own skills.
A general problem with Dunning Kruger is the assumption that if you score low on a test then you are bad at the subject it is evaluating. I’ve taken enough bad quizzes that purportedly evaluate skills that I am an expert in, to know that that is a leap.
I think this article would've made more sense if it had a title "The Dunning-Kruger effect is regression toward the mean", because that's what the author is actually showing.
OP's own analysis shows that using random data (two variables uniformly distributed over the same range!) for both skill and self-assessment results in a different graph. The original comparison therefore implies another effect in the second dimension, which could be interpreted as: people don't estimate their skills correctly, but drift towards the mean.
But then the question becomes: what did they really ask their subjects? To pick the percentile or a true test score?
The Dunning Kruger effect is simply the same reason expensive projects are undertaken and never hit budget - not because we cannot estimate costs but because if we did we would never do anything.
Every domain of expertise has two "Elo" systems: the niche one and the broader one.
E.g., you can learn basic juggling in 30 minutes such that you are in the top 10% of your friends/colleagues etc...
However, within the juggling community itself this is known as the "3 ball cascade", a really simple trick relative to the ones that require years to master. An outsider may not be able to tell the difference between the 1-year expert and the 10-year master.
A lot of Dunning-Kruger can be explained by people in one system or the other not understanding the other system.
What Blair Fix's article gets wrong is that there are two stark differences between what Fix generated with random data and what Dunning and Kruger observed in theirs.
Fix has each person guess randomly between 0 and 99 where they will lie in the percentiles. They simulate every person having no idea and giving equal probability to being the best or the worst. If we then sort them by how well they really did into quartiles and then evaluate the average of how well they thought they would do, we get what we would expect: each quartile has an equal chance of predicting that they will do well or do poorly, with an average expected percentile of 50, which is what you would expect by a random guess.
Note two key things about this:
- All quartiles guessed the same - there was no correlation between what they guessed and how well they actually did
- All quartiles guessed the expected average percentile - 50%. This means they were unbiased in how well they thought they would do.
If people were unbiased but also unaware, this is the null hypothesis we would expect: on average people predict themselves to be average and there's no correlation between how well they predicted they would do and how well they actually did.
Now compare that to what Dunning and Kruger observed:
- The quartiles did NOT guess the same. There was a bit of an upwards trend, which suggests that people at least somewhat were able to determine their actual percentiles, even if only weakly on average.
- The predictions were biased. All groups estimated they would do better than the expected average. That is to say, on average, they thought they were above average. This is an important bias.
- The differentials between quartiles are not equal. The first and second quartile typically predicted the same, over-estimated value, implying that neither group had any idea they were better or worse than each other. However, the upper quartile consistently estimates a higher average. That is to say, people who perform well, on average, believe they are performing even better than those who don't perform well. And perhaps most surprisingly, there was often a statistically significant dip at the third quantile. Comparing their beliefs, people who did well believed they had done worse than the people who actually did worse.
Fix also fails to go beyond the first figure of the paper. After seeing this inconsistent behaviour between the quartiles, Dunning and Kruger then test what happens if the respondents are given an opportunity to grade each other (thereby getting an idea of what the percentiles actually look like) and to have their skills improved (thereby possibly making them better able to judge their own and each other's abilities). Again, if Fix's premise that this is all just a result of manipulating the autocorrelation of an otherwise unbiased random sequence were true, then these interventions should have no discernible effect. Yet Dunning and Kruger find markedly significant changes after these interventions, and those changes are different within the different quantiles.
It is precisely this difference between quantiles which is the Dunning-Kruger effect. Fix effectively makes their point for them by building a null model and showing what would happen if there were no Dunning-Kruger effect - if people were fully unaware and unbiased. Instead, it is the way in which Dunning and Kruger's observations deviate from this model that is the very effect that bears their name.
Instead, all that Fix manages to do is point out how confusing the plot is that Dunning and Kruger produced. The plot can easily be misinterpreted to suggest that it's the difference between y and y-x that is important. Instead, in their writing, Dunning and Kruger actually focus on the differences in how y-x changes when the situation changes, demonstrating that it's actually dependent on knowledge and how different people respond to that knowledge. What they actually show is that delta(y-x) vs x has a nonzero relationship and this is particularly interesting.
Perhaps if Dunning and Kruger had not included the example of perfect knowledge as a comparison, but instead included the example of unbiased and unknowledgeable that Fix produced as the thing to compare against, the Dunning-Kruger effect would be much better understood.
Further, both could benefit greatly from plotting and tabulating not just an average, but the overall distribution within each group. Fix should know that variance is just as important as bias. Even if all groups are biased in their prediction, differences in variance between each group indicates their confidence in their belief. Knowledge should help to reduce both bias and variance. A guess with high variance tells us little, while a guess with low variance tells us quite a bit. Even if all quartiles predicted the same average, we wouldn't fault those with little ability for guessing a high number if they did so with low confidence. On the contrary, we would expect people with high ability to be more confident (and correct) in the assessment of their ability.
Lmao this article is an example of Dunning-Kruger at work. The author thinks they have found and are revealing something but they are just failing to fully understand the subject. Amazing.