> Why so angry?
I know I’ve taken this far too personally. I have no illusions that everything I read online should be correct, or about people’s susceptibility to a strong rhetoric cleverly bashing conventional science, even in great communities such as HN. But frankly, for the last few years, the world seems to be accelerating the rate at which it’s going crazy, and it feels to me a lot of that is related to people’s distrust in science (and statistics in particular). Something about the way the author conveniently swapped “purely random” with “null hypothesis” (when it’s inappropriate!) and happily went on to call the authors “unskilled and unaware of it”, and about the ease with which people jumped on to the “lies, damned lies, statistics” wagon but were very stubborn about getting off, got to me. Deeply. I couldn’t let this go.
I am afraid I actually agree with the author's point. The anti-intellectual, anti-scientific streak in many poor analyses claiming to debunk some scientific research is deeply concerning in our society. If someone is trying to debunk scientific research, they should at least learn some basic analytic tools. This observation is independent of whether the original DK paper could have been better.
That said, I give the benefit of the doubt to the author of "The DK Effect is Autocorrelation." It is a human error to be overly zealous about some opinions without thinking them through.
Let's not forget though that a great deal of "science" is in fact trash[1]. The problem isn't really people being anti-science or pro-science. The problem is science being done poorly, whether by scientists in the credentialed sense, or amateurs.
There is no pat "trust science more" or "trust amateurs less" answer here. The actual answer is that if you want to understand research, you need to actually understand mathematical statistics and the philosophy of statistics fairly deeply. There just isn't any way around it.
I think there's two extremes here. One is the issue covered well above. There is a great deal of junk science that gets published. That is a problem and it does erode trust. But in some ways it's also how the sausage gets made; there's going to be room for things to get published that later get refuted. People rightly so have distrust for results coming out in fields they don't have good knowledge in. Without becoming an expert yourself it's very difficult to know who or what to trust.
On the other end there's distrust of broad scientific consensus across different professions, countries, etc. It's the distrust at these levels that is the increasing problem we are facing today.
> I think there's two extremes here. One is the issue covered well above. There is a great deal of junk science that gets published.
It's more than that, I think. Sibling-thread poster hit the nail on the head when he complained of politicised science.
The social sciences have this dominating and silencing effect on the rest of the sciences.
There's always been junk science, and when found out it gets discredited. This is still happening and is a good thing.
What's new is that any research that might produce results counter to what the PC-mob deems acceptable is attacked. Whether or not there is consensus amongst researchers in that field is irrelevant when the mob calls for the firing of any researcher who doesn't toe the current political party line.
Sure, we're not actually in the dark ages, but a trend of silencing voices in the name of purity of thought is particularly troubling, especially as the mob asking for this is unashamedly attempting to implement Newspeak[1].
[1] See the argument in yesterday's threads about what "man" and "woman" mean, whether dictionaries should be changed, etc.
No this is not new. There has always been a direction set by political views; even when we have decided they are wrong, they are still hard to kill, like: smoking is good, white people are superior. There is still "science" being done to bolster those political views.
>> What's new is that any research that might produce results counter to what the PC-mob deems acceptable is attacked.
> No this is not new.
I don't recall a PC-mob being used to silence any and all non-supportive voices until quite recently.
> There has always been a direction set by political views; even when we have decided they are wrong, they are still hard to kill, like: smoking is good, white people are superior. There is still "science" being done to bolster those political views.
I don't see what that has to do with what I said - that a very vocal bunch of non-science people seem to have successfully lobbied to silence specific topics.
> I don't recall a PC-mob being used to silence any and all non-supportive voices until quite recently.
Are you genuinely serious, or were you completely unaware of anti-communist, government-sanctioned blacklisting of academics suspected of being communist for political clout? Are you unaware of churches excommunicating Galileo for daring to do scholarly research into the earth rotating around the sun? Are you unaware of our own Alan Turing, of the Turing Award, literally castrated not for his research but because he was a known gay researcher? Are you unaware of why HBCUs exist (black scholars were segregated for being black, their research dismissed because of the race of the researcher)?
>Are you unaware of churches excommunicating Galileo for daring to do scholarly research into the earth rotating around the sun?
Your point is good but this is a pet peeve of mine. Galileo was not punished by the church for saying the earth orbits the sun. Galileo was indicted and punished by the church because he was a local elite with several personal and political enemies within the church, and more directly, because he slighted the pope, his former friend, by taking a philosophical argument made by said pope, and putting it, paraphrased, into the mouth of a character in his book who was named "simplicio" and cast as a moron. That pope literally gave him permission to publish his claim that the earth orbited the sun, a claim which Galileo did not make based on science, but instead made because he felt the resulting (incorrect and based on outdated observations) mathematical model for the orbits of planets was more "elegant".
If the church truly wanted to punish him, they would not have sentenced him to literally stay at home in a beautiful villa and write books all day. His official charge was that lay people are not allowed to interpret the scripture, which he did a bit in his book. The church did not care if you made mathematical or scientific arguments about how the world worked. They only cared that you leave theology to the priests.
"Galileo was not punished by the church for saying the earth orbits the sun. Galileo was indicted and punished by the church because he was a local elite with several personal and political enemies within the church, and more directly, because he slighted the pope, his former friend, by taking a philosophical argument made by said pope, and putting it, paraphrased, into the mouth of a character in his book who was named "simplicio" and cast as a moron."
This is not really true. While there is some truth to the claim that Galileo placed an argument made by Urban VIII in the mouth of Simplicio and that Urban took offense at this, the trial documents, especially Galileo's sentence, make it very clear that Galileo was being punished for heresy, and the heresy he was being punished for was the notion that the Sun did not move and that the Earth did. From the sentence:
> We say, pronounce, sentence, and declare that you, the abovementioned Galileo, because of the things deduced in the trial and confessed by you as above, have rendered yourself according to this Holy Office vehemently suspected of heresy, namely of having held and believed a doctrine which is false and contrary to the divine and Holy Scripture: that the sun is the center of the world and does not move from east to west, and the earth moves and is not the center of the world, and that one may hold and defend as probable an opinion after it has been declared and defined contrary to Holy Scripture.
Note that the term "vehemently suspect" is a technical term. The Roman Inquisition in the 17th century didn't generally deliver straight up-and-down guilty or not-guilty verdicts and instead organized convictions according to degrees of suspicion. "Vehement suspicion" indicated that there was at least some (but not much) degree of plausible deniability that Galileo didn't believe what he had written, and that was only because Galileo denied it to the court.
"That pope literally gave him permission to publish his claim that the earth orbited the sun, a claim which Galileo did not make based on science, but instead made because he felt the resulting (incorrect and based on outdated observations) mathematical model for the orbits of planets was more "elegant"."
No, Galileo was given permission to publish a book that presented a neutral comparison of the Copernican and Ptolemaic models on mathematical grounds, with the intention of proving that the Church was justified in its suppression of Copernicanism. Galileo's book was not neutral--it argued heavily in favor of Copernicanism--and that's why he got into trouble. Urban VIII had been his friend prior to this episode, so it's likely that, had Galileo not placed Urban's argument in Simplicio's mouth at the end, Urban would have protected him rather than punished him; but the reason that Simplicio was given that argument was that it was intended to be the end of the book. After four days of continuously losing the debate, Simplicio finally raises Urban's argument about the omnipotence of God and his opponents are forced to agree with him. The idea being that Galileo could stick to the letter of his remit while still arguing what he wanted. Unfortunately he argued too well and readers realized where his real sympathies lay. Urban VIII was accused of protecting heretics (not just Galileo, but also by supporting the French against the Habsburgs in the Thirty Years' War), and so he made an example of Galileo.
Galileo's arguments were not simply mathematical. In fact, if they were, he would never have been punished, because it was already permissible to treat Copernicanism as a purely mathematical hypothesis; the 1616 prohibition of Copernicanism explicitly carved out that exception. But Galileo used a wide variety of arguments, including physical arguments. Galileo, after all, was primarily what we would call a physicist rather than an astronomer. It was Galileo's insistence that Copernicanism must be physically true, and not just a better mathematical model, that made him a heretic in the eyes of the Sacred Congregation.
It's important to note that the quality of Galileo's scientific arguments was never a subject of his trial. It was only his conclusions that the Sacred Congregation took issue with. Galileo's chief (but not only) argument, from the tides, is now considered to have been spectacularly wrong, but at no point did that come up in the trial. Galileo's arguments could have been 100% perfect and unassailable (and scientific arguments rarely are) and he would have still been punished.
"If the church truly wanted to punish him, they would not have sentenced him to literally stay at home in a beautiful villa and write books all day."
He was sentenced to life imprisonment. That was commuted to house arrest on account of his old age, and not at his own house at first. He was prohibited from receiving medical attention late in life. All books by him were placed on the Index and he was prohibited from taking visitors. He did manage to continue his work, but that was by publishing in the Netherlands, which was a Protestant country and hence outside the reach of the Church. He had first tried to publish in Venice, which was a hotbed of anti-clericalism and usually ignored Inquisitorial orders, but even they would not publish him.
"His official charge was that lay people are not allowed to interpret the scripture, which he did a bit in his book."
That was not the official charge. I quoted the official charge above. Galileo's interpretation of scripture happened much earlier and preceded the 1616 prohibition of Copernicanism. There Galileo gave counterarguments as to why Copernicanism didn't contradict scripture, and that served as the catalyst for the investigation that led to the prohibition. Galileo ultimately was not censured for writing on scripture, and his argument was even well received to a degree (he had checked it with a cardinal before publishing it), but the Sacred Congregation decided that it was more concerned about undermining the authority of the Church Fathers, many of whom took the famous passage from Joshua literally, than it was about accidentally hooking scripture to a provably false view of the world. If you read Cardinal Bellarmine's response to Foscarini regarding Galileo's letter, you'll see him very clearly cite the authority of the Church Fathers as his primary consideration. Bellarmine controlled the Sacred Congregation at the time, so his opinion on the matter was the Church's opinion.
"The church did not care if you made mathematical or scientific arguments about how the world worked.T hey only cared that you leave theology to the priests. "
The decree from the Index of Forbidden Books banning Copernicanism:
> This Holy Congregation has also learned about the spreading and acceptance by many of the false Pythagorean doctrine, altogether contrary to Holy Scripture, that the earth moves and the sun is motionless, which is also taught by Nicolaus Copernicus' On the Revolutions of the Heavenly Spheres and by Diego de Zuniga's On Job. This may be seen from a certain letter published by a certain Carmelite Father, whose title is Letter of the Reverend Father Paolo Antonio Foscarini, on the Pythagorean and Copernican opinion of the Earth's Motion and the Sun's Rest and on the new Pythagorean World System, in which the said Father tries to show that the abovementioned doctrine of the sun's rest at the center of the world and of the earth's motion is consonant with the truth and does not contradict Holy Scripture. Therefore, in order that this opinion may not advance any further to the prejudice of the Catholic truth, the Congregation has decided that the books by Nicolaus Copernicus and by Diego de Zuniga be suspended until corrected; but that the book of the Carmelite Father Paolo Antonio Foscarini be completely prohibited and condemned; and that all other books which teach the same be likewise prohibited, according to whether with the present decree it prohibits, condemns, and suspends them respectively.
Note that Diego de Zuniga and Paolo Foscarini are both priests. This wasn't about keeping theology to the priests, it was about prohibiting certain theology that would undermine the authority of the Church. The correction applied to Copernicus's book was that it be changed to suggest that his system was not intended as a literal interpretation but only as a mathematical model. As I mentioned earlier, an allowance for treating Copernicanism as a pure mathematical contrivance for the convenience of astronomers was made, but treating it as literally true was declared "error", and later upgraded to "heresy" during Galileo's 1633 trial.
Sorry for the essay, but this subject is a pet-peeve of mine.
Some good books on the subject:
1. Behind the Scenes at Galileo's Trial - Richard J. Blackwell
2. The Essential Galileo - Maurice A. Finocchiaro
3. Galileo Heretic - Pietro Redondi
Yes, I've read TOF's blog posts before. TOF's perspective is one that I've encountered many times before. You mostly find it in some very conservative Catholic circles.
If I were being generous, I would say that TOF was pushing back against the overwrought hagiography that often surrounds Galileo and his confrontation with the Church. It's true that the common story is an oversimplified account, and there are some persistent myths that have worked their way into the story over the years in order to make Galileo look more heroic and the Church more villainous than either actually was. The real story is a bit complicated.
But the Church really did ban heliocentrism and it really did punish Galileo for arguing for it. That's not in question. I feel that TOF's argument otherwise rests on a deliberate misrepresentation.
TOF's argument that the Church was actually just smartly waiting for proof before it changed its doctrines is a bit rich, in that Galileo offered proof and was punished for it. Now, Galileo's proof was bad, but that's not why he was punished. If it were, the Inquisitors would have mentioned that and not simply accused him of heresy for arguing that the Sun stood still. Not to mention that this whole idea rests on the bizarre notion that science works best when you silence debate until proof can be provided.
Ironically, that whole argument stems from a quote from Robert Bellarmine's letter to Foscarini, where he admits that Galileo has a point about not interpreting scripture in a way that is provably false. In that quote, Bellarmine isn't saying that the Church is waiting for proof; he's saying that while proof would force him to change his mind, he doesn't think such proof is possible, so he may as well go ahead and ban Copernicanism anyway. It's really obvious if you read the very next sentence, which TOF for some reason doesn't quote:
> I add that the one who wrote, "The sun also ariseth, and the sun goeth down, and hasteth to his place where he arose," was Solomon, who not only spoke inspired by God, but was a man above all others wise and learned in the human sciences and in the knowledge of created things; he received all this wisdom from God; therefore it is not likely that he was affirming something that was contrary to truth already demonstrated or capable of being demonstrated.
In other words, he doesn't believe that Galileo or anybody else will be able to find proof that the sun stays still because that would contradict what he already believes based on his reading of the Bible!
So no, TOF is misrepresenting the stance of the Church in 1616. A lot of people repeat this because it looks like a clever debunking of a common myth, but it's actually more of a bunking, in that he inserts a lot of detail in order to disguise some blatant misrepresentation.
> Are you genuinely serious, or were you completely unaware of anti-communist, government-sanctioned blacklisting of academics suspected of being communist for political clout? Are you unaware of churches excommunicating Galileo for daring to do scholarly research into the earth rotating around the sun? Are you unaware of our own Alan Turing, of the Turing Award, literally castrated not for his research but because he was a known gay researcher? Are you unaware of why HBCUs exist (black scholars were segregated for being black, their research dismissed because of the race of the researcher)?
Every single one of those was NOT a mob.
People in authority using their authority to push their PoV is very different from people with no authority forming a mob and demanding that the current authority silence other people from speaking their minds.
Whatever your view of the current authority is, it is infinitely better than mob-justice.
If you think castrating gay men wasn't mob justice at the time, maybe you should reconsider. Same for whether or not black scholars were considered equal to white ones.
The real problem is that all of those mobs have to answer to the PC mob now. Or at least that's what I hear from people who think that censorship didn't exist until coincidentally around the time of Gamergate.
Mobs that try to force their own views on others are common. PTAs, or strong-willed individuals with kids in school, are mob experts. Good things have come out of those mobs. Amsterdam's good bicycle infrastructure was built on mob rule instead of listening to "rational engineers" who wanted to build motorways through the city.
Good science defines its terms. Can you unroll "PC-mob" so we're all on the same page here? You don't sound like an asshole, so your meaning is probably not the usual "anything that leans a little left."
Groupthink exists in science too. I mean, Newton calling Leibniz a copycat was a setback, and it's crazy that I still had to learn about the priority controversy almost 300 years later. We feed on unnecessary controversy.
> People rightly so have distrust for results coming out in fields they don't have good knowledge in. Without becoming an expert yourself it's very difficult to know who or what to trust.
A problem here is that there are fields of science that are almost certainly bogus in themselves. One very likely candidate is nutrition, which seems to be fumbling in the dark and has a long history of producing worse recommendations than doing nothing (e.g. replace fat with sugar). More controversially, the entire field of economics is seen by some to be very suspect from a basic foundations view.
>A problem here is that there are fields of science that are almost certainly bogus in themselves. One very likely candidate is nutrition, which seems to be fumbling in the dark and has a long history of producing worse recommendations than doing nothing (e.g. replace fat with sugar).
It's not the field that is bogus in that case. People were quite literally bribed to push this. This could happen anywhere, anytime, in any field.
That was just an example, but we can go into more detail - at every point, nutrition studies are highly questionable. Sample sizes are usually minuscule (you can find important studies with 5-10 subjects, and even the largest studies rarely have more than a few hundred), they are often biased samples (only overweight people, only people with heart disease or diabetes etc.), they often don't account for likely confounding factors (using weight without accounting for muscle mass, no accounting for stress, time in the sun etc.) and on and on. And all of this in a subject where we basically don't have any clear idea about how much metabolism differs from one person to the next, based on what factors (e.g. gut microbiome has only been recently identified as a major component of digestion that can differ significantly between people; psychological effects of diet are even less well understood, even though your food choices are obviously not coming from some pure realm of reason).
Basically the digestive system is far too complex for us to understand from first principles at this time. Your diet has a very complex and very slow effect on your body, with some exceptions. Numerous diseases and environmental factors impact how this plays out exactly. So, to do real research in nutrition, the only chance right now would be to conduct massive studies over long periods of time with rigorous controls on subjects' nutrition and activities - which is basically impossible, or at least prohibitively expensive.
Instead, we get conclusions drawn from studies of a few dozen people over a few months or at best a year (in "long-term" studies). Or, we get conclusions drawn from comparing diet across huge populations ("the Mediterranean diet", "the Japanese diet", "the American diet" etc) with no possibility to control for obvious differences in nutrients, environmental factors, lifestyle differences, access to healthcare etc. Both of these are worthless conclusions, they don't tell you anything at all.
The only successes of nutrition science have been identifying the most basic nutrients we need to survive at a basic level (protein, fat, carbohydrates, and the various vitamins and minerals). Basically nothing beyond that should be trusted.
As a fun historical note, after the discovery of the macro-nutrients there was a budding field of nutrition scientists confidently recommending optimal diets using scientific methods. Unfortunately, they had no idea about the existence of micro-nutrients, so by following some of their diets you could actually end up getting scurvy or other serious malnutrition diseases. The current slew is not that bad, but I wouldn't be surprised if in the future we look back similarly at some common diet advice of today.
I think commercial operations like Huel and similar supplements will definitely be thought of like that. It just seems crazy, having watched the evolution of "scientific" diet advice over the past 70 years, to now think it's plausible that industrially-produced protein drinks are a suitable food substitute.
So, the reasons people learn a generalized distrust of science are that often the sausage doesn't get made. Bad science is published, applauded, cited, breathlessly covered in the media and may even be replicated, yet the first time outsiders to the field actually read the paper they realize it's nonsensical. But then they realize nobody cares because careers were made through this stuff, so why would anyone inside the field want to unmake them?
The degrading trust doesn't come from bad results per se, but rather the frequent lack of any followup combined with the lack of any institutional mechanisms to detect these problems in the first place beyond peer review, which is presented as a gold standard but is in no way adequate as such.
For example, consider how programmers use peer review. We use it, and we use lots of other tools too because peer review is hardly enough on its own to ensure quality. Now imagine you stumbled into a software company that held a cast-iron policy that because patches get reviewed by coworkers you simply don't need a test suite, nor manual testing, nor a bug tracker, code comments, security processes, strong typing, etc. And their promotion process is simply to make a ranking of developers by commit count and promote the top 10% every quarter, and fire the bottom 10%. Moreover they thought you were nuts for suggesting that there was any problem with this. You'd probably want to get out of there pretty fast, but, that's pretty much how (academic) science operates. So of course this degrades trust.
Maybe I'm missing something, but 'self-correcting' doesn't necessarily mean 'immediately self-correcting'. I guess it's safe to assume that incorrect studies are not cementing our world view and entirely stopping us from questioning studied topics again.
The way I see the self-correcting nature of science: the truthiness of our view about a specific set of topics increases over time (in some approximation).
Self-correcting doesn't mean immediately self correcting, but it does imply self-correcting in a somewhat reasonable time period, and ideally not needing to self correct too often.
What's reasonable, well, probably not years or decades. Average people cannot make major errors that destroy the value of their job output and then blow it off with "well but the company self corrected eventually so please don't fire me". When they judge science, they will judge it by the standards they are themselves held to in normal jobs.
And what's too often, well, probably papers that don't replicate should be a clear minority instead of (in some fields) the majority. Recall that failure to replicate is only one of many things that can go wrong with a study. Even if the replication rate was 100% many fields would still be filled with unusable papers.
Exactly this. It's even worse, distrust of broad scientific consensus is purposefully cultivated to further political and economic goals, and the methods to do so increasingly perfected. This damages our ability as societies to function in a healthy way. Our capacity to navigate hard problems is diminished by the ever decreasing influence of hard science on politics and policy.
> Without becoming an expert yourself it's very difficult to know who or what to trust.
Who says there is anyone who can be trusted? People keep looking for leaders they can trust and it takes only a brief look at history to see that the search won't stop despite the jaw dropping futility of the exercise.
The important thing is to check that people have incentives to tell the truth and no conflicts of interest. I'd trust someone untrustworthy if they were making money off my well-being. The only thing to watch out for is them not being forthright about their incentives.
We shouldn't trust that skyscrapers stay up because engineers are trustworthy. They stay up because the engineer goes down with the building.
Completely agreed. While I strongly believe in citizen science and people's right (and perhaps obligation) to critique established science, there are just so many poor analyses done by people to criticize scientific findings they do not approve of, motivated emotionally or otherwise. This phenomenon does not bring us closer to finding scientific facts or resolving the replication crisis. People should learn some basic analytic tools first.
I can see why this has been downvoted, but I didn’t mean to sound like a luddite. I really just think there is a lot of low-quality research, especially in the social sciences, done primarily to keep a publishing schedule up for pressured academics. We’d often be better off spending more time thinking about and planning fewer, better studies.
One way that journals could actually add value (a concept to which they seem resistant) would be to review the statistical analysis. Statistics is hard and easy to get subtly wrong, and is often a skill independent of the underlying science. If journals had statistics experts critique the analysis techniques prior to publishing, it would be a great improvement in the confidence with which we could read papers.
I think this is a great idea, but in my experience there are surprisingly few such experts available. In practice, most statisticians not doing active methods research (I'm thinking of 'trial statisticians' mostly here, in CTUs) just cargo-cult whatever procedures previous trials used. I guess they would pick up issues around sample size, but without also integrating that with some substantive knowledge about plausible effect sizes, I'm not sure what value they would have.
Plus, it would reduce the number of publishable papers quite substantially including from high profile authors/groups, so I don't think they want the fight. We should also remember that most journal editors are also involved in publishing this research — they often have no real incentives to make things awkward.
>The problem isn't really people being anti-science or pro-science. The problem is science being done poorly, whether by scientists in the credentialed sense, or amateurs.
That's a very simplistic take on it. Bad science is a necessary part of the process and dealt with accordingly by the scientific method.
The problem is that science that is bad or incomplete is being reported as fact or truth, or arguably even worse, as entertainment in order to gain an audience. This is what actually eroded the trust in science, as people kept repeating things that reinforced and further misshaped their biases.
Human nature tends to distrust stuff we don't understand. Hence, our trust in science, which in many fields is often beyond the understanding of laymen, has to have constant reinforcement. However, the goal of media, and especially social media, is to increase eyeballs for their content, and the truth sells a lot less well than sensationalised content pandering to the audience.
Simply put, there is not much profit in reporting science truthfully, and every incentive to sensationalise it.
Part of the issue is that producing shit research costs very little, and debunking shit research costs a lot. Just pick up one of the many "x reasons why earth is not a sphere" videos: some are pretty easy to debunk, but others require you to understand, e.g., potential fields (if the earth is round, why don't train engineers take curvature into account when laying tracks, and variations thereof).
Shit research can sometimes be debunked just by running the numbers given by the shit researchers. Flat earth theorists don’t often offer up actual research (and the one group I’ve seen try got negative results and concluded they messed up, not that they were wrong).
A theory ought to be able to answer questions like “Why don’t train engineers take the curvature of the Earth into account?”
The problem is when someone comes along and thinks an unanswered question (or even just someone not knowing the answer off the top of their head) proves a theory is completely false (or worse, proves their favorite theory correct). (And to even believe the Earth is flat is to be a conspiracy theorist, so in this particular case no proof will ever suffice anyway.)
I agree, but I don't think this is limited to science. I think a great deal of everything is in fact trash. This is why we need education and good faith discussion.
Science itself is the best method we have for exploring and making sense of the world around us. The method is rock solid.
In between us (gen pop) and The Method are scientists, and scientists are just as fallible as any other group of people - lawyers, politicians, coders, shop assistants.
In other words, emphasize the process more than the outcomes. If the scientific process used proves sound, then I have more confidence in the outcomes.
Alas, that's a hard sell to laypersons through the medium of soundbites and tweets.
What about the replication crisis? It's possible to use rigorously sound statistics to lie (or at least unknowingly spread falsehoods). I can't tell you how many times I've seen headlines or abstracts of studies that seem to contradict ones I've seen previously, and back and forth! Particularly in the social sciences.
I recall one study that said all white people are committing environmental racism against all non-white people. I dove in and read the whole thing wondering what method could have yielded scientific confidence in such a broad result. Turns out the model used was a semi-black box that required a request for access and a supercomputer to run. But it was in a Peer Reviewed Scientific Journal and had lots of Graduate Level Statistics so I guess it seemed trustworthy.
The issue here has not much to do with the replication crisis. It has to do with the fact that most people who use bits of information to make their point more convincing don't care whether that information is true or not. They are not seeking to convince the other side of the issue, they are seeking to convince other believers.
It is literally like this:
- someone makes a point that questions your belief
- you google a phrase that would come up in studies that prove otherwise
- you take the first thing that looks promising, skim the first page, and paraphrase a good bit in a way that makes your point
- you publish it as part of a post, youtube video or whatever
- danger averted
Bad studies play into this, but the same thing happens even if the studies are good, or with bad studies that have been retracted. Andrew Wakefield, who originally published the "combined vaccines cause autism" study after patenting a non-combined measles vaccine, had his study retracted by The Lancet soon after publication. He lost his status as a doctor, etc. And you will still find people who use his study as a source.
Of course, studies whose outcomes collide with our belief systems are always harder to trust than those that validate them; but this is why you look at the methods used and other indicators that might make a study bogus.
A replication crisis indeed exists. All the more reason to analyze rigorously. Poor analyses (and borderline name-calling) in the original article do not help with the crisis.
>it's possible to use rigorously sound statistics to lie (or at least unknowingly spread falsehoods).
I don't think this is true. It is possible to put a lot of work into unsound statistics and to make a lot of "noise and fury" about how mathematical you are while failing some basic principle, but I don't think sound statistics can mislead. The replication crisis was caused by scientists not being rigorous and journals not forcing them to be. You absolutely cannot accept publication as a sign of sound techniques except in journal/field combinations that have a deserved reputation.
Of course they can, unless you magically exclude all statistics that made a bad assumption of independence.
I plot all the daily high temperatures and the presence of the ice cream cart and it turns out the ice cream cart causes warmer highs! Solid statistics.
Turns out the guy that has the ice cream cart has a weather app on his phone though and doesn’t come out on forecasted cold days.
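A minimal sketch of that confound, with made-up numbers (not real data), just to show that the correlation calculation itself can be perfectly sound while the causal reading is not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up daily high temperatures for one summer (degrees C).
temps = rng.normal(25, 5, size=200)

# The cart owner checks the forecast and skips cold days, so the
# cart's presence is itself caused by temperature.
cart_present = (temps > 22).astype(float)

# Naive analysis: correlate cart presence with the daily high.
r = np.corrcoef(cart_present, temps)[0, 1]
print(f"correlation(cart present, daily high) = {r:.2f}")

# The number is correct and the statistics are sound; the claim
# "the cart causes warmer highs" is added by the analyst, not the math.
```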
Is that the fault of statistics though, or the non-statistical implication of causation that was tacked on the end of the statistical detection of correlation? Statistics is pretty explicit that it can't tell you about causality, right?
> It's possible to use rigorously sound statistics to lie (or at least unknowingly spread falsehoods)
The book "How to lie with statistics" is one of the best statistics textbooks that I have read. It basically makes you immune to misleading stats (charts, tables, everything).
IIRC, the only thing that is missing from the book (it's a really old book) which is very relevant is p-hacking.
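Since p-hacking comes up here, a minimal illustrative sketch of my own (a pure simulation, not from the book) of its multiple-comparisons form: test enough outcomes on pure noise and something will come out "significant".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 40 independent "outcomes" measured on two groups that are,
# by construction, drawn from the same distribution.
n_outcomes, n_per_group = 40, 30
p_values = []
for _ in range(n_outcomes):
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

significant = [p for p in p_values if p < 0.05]
print(f"{len(significant)} of {n_outcomes} comparisons came out 'significant'")
# Report only those comparisons and you have a publishable "effect"
# manufactured from noise.
```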
Given the explosion in the number of journals and the impossibility of effective peer review, being published in a journal does not mean what it used to. This is part of the material drivers for the replication crisis (journals can no longer effectively gatekeep scientific validity), but it also reflects something real about the practice of science: little social cliques come up with pet theories and, over time, "fight" with these theories on epistemic common ground. The successful ones, we'd like to think, are the ones that last the most rounds in the fight, but that probably only holds in the long run. Contradiction, in itself, is normal (and was before!)
> That said, I give the benefit of the doubt to the author of "The DK Effect is Autocorrelation." It is a human error to be overly zealous about some opinions without thinking them through.
If only there were a term for "a cognitive bias whereby people with limited knowledge or competence in a given intellectual or social domain greatly overestimate their own knowledge or competence in that domain relative to objective criteria or to the performance of their peers or of people in general"
That happens when science is politicized, and any scientist critical of the “official” results is destroyed. From climate to Covid, there are so many areas where that happens.
No, I mean researchers and professors suddenly not getting any research grants anymore, suddenly getting fired from their tenured jobs, not being invited anymore to conferences, etc.
You could build your whole career from the 50's on discrediting the link between smoking and cancer.
The Koch brothers were (and the surviving brother and the estate of the other still are) happy to write economists checks to prove that laissez-faire capitalism under a libertarian government is the best system. Or that climate change isn't real.
If you're willing to generate evidence climate change isn't real, Exxon etc have some nice checks for you.
If you're willing to show how corn syrup is good for Americans, then the Iowa farmer association has money for you.
It still seems to me like "The DK Effect is Autocorrelation" is basically correct. The important thing isn't whether or not independence should be the null hypothesis, because calling something a "null hypothesis" is just an arbitrary label that doesn't affect reality. The important thing is that what we can actually conclude from the Dunning-Kruger paper is a lot less than popular presentations of the concept claim. In particular, "more skilled people are better at predicting their own performance" is really not supported by the paper, since that's not true of random data, which has everyone being equally terrible at predicting their own performance. If the random data can reproduce that graph, then the graph can't be proof that more skilled people are also better predictors.
Anyway, "The DK Effect is Autocorrelation" definitely seems to be both statistically literate, and a good faith criticism of the Dunning-Kruger paper. In light of that, calling it "anti-scientific" seems unfair, since criticism and debate are an important part of science.
> calling something a "null hypothesis" is just an arbitrary label that doesn't affect reality
It does affect your conclusions though.
The choice of null hypothesis in "The DK Effect is Autocorrelation" determined how the random data was generated. The hypothesis is: "nobody has any clue whatsoever how competent they are". The random data was specifically crafted for that hypothesis.
The choice of null hypothesis in this article is: "everyone roughly knows how competent they are". This random data, too, is specifically crafted for the null hypothesis.
So what does this mean? If you pick a particular null hypothesis, then you can try to argue that the DK effect is a statistical artefact. But it's not; it is an artefact of choosing a particular null hypothesis.
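To make the two nulls concrete, here is a rough simulation of my own (not the code from either article; the uniform percentiles and the noise level are arbitrary assumptions). Binning by actual-score quartile and comparing mean actual vs. mean perceived percentile reproduces the familiar DK-style gap under the first null and mostly erases it under the second:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# True test performance, expressed as percentiles.
actual = rng.uniform(0, 100, n)

# Null A ("The DK Effect is Autocorrelation"): self-estimates are
# independent of actual skill.
perceived_a = rng.uniform(0, 100, n)

# Null B (this article): people roughly know their skill, with noise.
perceived_b = np.clip(actual + rng.normal(0, 15, n), 0, 100)

def quartile_means(actual, perceived):
    """Mean actual vs. mean perceived percentile per actual-score quartile."""
    order = np.argsort(actual)
    quartiles = np.array_split(order, 4)
    return [(actual[q].mean(), perceived[q].mean()) for q in quartiles]

print("Null A (no self-knowledge):   ", quartile_means(actual, perceived_a))
print("Null B (noisy self-knowledge):", quartile_means(actual, perceived_b))
# Under Null A every quartile's mean self-estimate sits near 50, which,
# plotted against the actual quartile means, already looks like the classic
# DK chart; under Null B the two lines track each other much more closely.
```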
No, nulls matter a great deal. If you want to test a claim in Null Hypothesis Statistical Testing, the "significance" of the claim is in direct reference to the null. Changing a null will change the significance of the alternative. My favorite statement of this is from Gelman:
> the p-value is a strongly nonlinear transformation of data that is interpretable only under the null hypothesis, yet the usual purpose of the p-value in practice is to reject the null. My criticism here is not merely semantic or a clever tongue-twister or a “howler” (as Deborah Mayo would say); it’s real. In settings where the null hypothesis is not a live option, the p-value does not map to anything relevant.
I think what contributes to this phenomenon are both second-option bias[1] and motivated reasoning, at least with respect to those who choose to believe in the poor analyses.
I read a lot of papers on behavioural economics and psychological decision-making experiments for university (Dunning-Kruger, Kahneman, etc.), and in my opinion the first autocorrelation article reads like a rebuttal paper, just more informal; the approach is scientific even if it may be flawed. This is how knowledge advances. I disagree that it is anti-science. Challenging accepted postulations is good. Even famous professors make mistakes, and I don't blame the writer for making an honest mistake. That's how we got this new piece of writing.
Behavioural science is a pretty new field, and it's pretty easy to get aberrant results or manipulate the results to show 'something' statistically. Many findings in earlier papers could not be replicated, had applied statistics incorrectly, or showed different results when research participants were not white college kids.
This is a whole other problem within academia: the pressure to publish something even when there is nothing, and perceived legitimacy based on the number of citations a paper has. My professor always said don't look at the number of citations; understand the method and the rebuttals. There are numerous low-citation but solid papers showing flaws in famous ones, but everyone who isn't deep into the subject holds the original assertion to be legitimate because it's "famous".
Most social science is shoddy, fake, or otherwise misleading (i.e. it proves nothing meaningful despite the claims of the researchers). If you believed every social science study you heard about, you'd be more wrong about the world than if you disbelieved them all.
> The anti-intellectual, anti-scientific streak in many poor analyses claiming to debunk some scientific research is deeply concerning in our society.
People endlessly reference the Dunning-Kruger effect as a meme, without ever having read the paper, let alone having checked its methods. You don't seem to have a problem with that.
On the other hand, after seeing an article that uses essentially statistical arguments to debate a scientific study you conclude that there is some "anti-intellectual, anti-scientific streak" in our society and that it should be of grave concern.
This doesn't make any sense except as an extreme case of virtue-signaling.
Seems quite reasonable to argue that superficially plausible "debunkings" by people that apparently misunderstood a paper are more harmful to scientific progress than people casually referencing the scientist's names as a meme or insult. (And I say that as someone who didn't think the DK "debunking" argument was totally without merit)
What's more harmful to medicine: a fashionably non-expert contrarian who doesn't understand the appropriate null hypothesis making a superficially plausible statistical argument that actually the trials suggest the drug is harmful to wide acclaim from laymen, or people casually referencing or even being administered the drug without reading the original trial writeups for themselves?
Read the actual paper [1], there is so much more than those charts. They ask for an assessment of one's own test score and an assessment of one's ranking among the other participants, to distinguish between misjudgments of one's own abilities and of the abilities of others. They give participants access to the tests of other participants and check how this affects self-assessments - competent participants realize that they have overestimated the performance of other participants and now assess their own performance as better than before; incompetent participants do not learn from this and even rate their own performance higher than before. They randomly split participants into two groups after a test, give one group additional training on the test task, and then ask all of them to reconsider their self-assessments - incompetent participants that received additional training are now more competent and their self-assessment becomes more accurate. This is not everything from the paper and probably also somewhat oversimplified; I just want to provide a better idea of what is actually in there.
Everyone is free to question the results, but after actually reading the entire paper I can confidently say that poking a bit at the correlation in the charts falls way short of undermining the actual findings from the paper. The actual results are much more detailed and nuanced than two straight lines at an angle.
I think if you wanted to poke holes in the paper you'd start with the generic issues that are typical to much psychological research:
1. It uses a tiny sample size.
2. It assumes American psych undergrads are representative of the entire human race.
3. It uses stupid and incredibly subjective tests, then combines that with cherry picking:
"Thus, in Study 1 we presented participants with a series of jokes and asked them to rate the humor of each one. We then compared their ratings with those provided by a panel of experts, namely, professional comedians who make their living by recognizing what is funny and reporting it to their audiences. By comparing each participant's ratings with those of our expert panel, we could roughly assess participants' ability to spot humor ... we wanted to discover whether those who did poorly on our measure would recognize the low quality of their performance. Would they recognize it or would they be unaware?"
In other words, if you like the same humor as professors and their hand-picked "joke experts" then you will be assessed as "competent". If you don't, then you will be assessed as "incompetent".
Of course, we can already guess what happened next - their hand-picked experts didn't agree on which of their hand-picked jokes were funny. No problem. Rather than recognize this as evidence that their study design is maybe not reliable, they just tossed the outliers:
"Although the ratings provided by the eight comedians were moderately reliable (a = .72), an
analysis of interrater correlations found that one (and only one) comedian's ratings failed to correlate positively with the others (mean r = -.09). We thus excluded this comedian's ratings in our calculation of the humor value of each joke"
The fact that this actually made it into their study at all, that peer reviewers didn't immediately reject it, and that the Dunning-Kruger effect became famous, is a great example of why people don't or shouldn't take the social sciences seriously.
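For concreteness, the exclusion procedure quoted above amounts to roughly the following (my own sketch with made-up ratings, not the paper's data or code):

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up humor ratings: 8 raters x 30 jokes.
n_raters, n_jokes = 8, 30
true_funniness = rng.normal(size=n_jokes)
ratings = true_funniness + rng.normal(0, 0.8, size=(n_raters, n_jokes))
ratings[3] = -true_funniness + rng.normal(0, 0.8, size=n_jokes)  # one dissenting rater

# Mean correlation of each rater with every other rater.
corr = np.corrcoef(ratings)                      # 8 x 8 interrater correlation matrix
mean_r = (corr.sum(axis=1) - 1) / (n_raters - 1) # subtract each rater's self-correlation of 1
print(np.round(mean_r, 2))

# The rule described in the paper: drop any rater whose ratings fail to
# correlate positively with the others, then average the rest per joke
# to form the "expert" standard that participants are scored against.
keep = mean_r > 0
expert_standard = ratings[keep].mean(axis=0)
```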
> is a great example of why people don't or shouldn't take the social sciences seriously.
Oh, the irony in your last statement. Somebody who hasn't done social science research professionally (this is an assumption, let me know if I'm wrong) has difficulty judging what social science research can (and can't) do ...
One does not need to have done social science research to be able to recognize obvious general philosophy of science level problems with the methods used in much social science.
I’d we take your claim seriously then we have to disallow all critiques of the replicatability crisis in the social sciences that don’t come from social scientists, but that would present an obvious new problem: conflict of interest. It’s also just an absurd requirement.
You are correct - I should have been more precise: I hypothesize that parent has not done science research professionally (again, happy to be proven wrong).
Don't get me wrong, I'm not defending social science research per se (yes, there are questionable methods). I'm critiquing parent who has high confidence in pointing out issues with the DK paper, yet misses the real issues. Which, in the context of discussing whether the DK effect is more than just regression to the mean, is quite ironic (which I have worded quite strongly, agreed).
Parent's arguments lead to absurd conclusions like "two Cornell professors not being very logical people" or "an HN poster being better at peer review than experts in the field". If you want to see a state-of-the-art critique of whether the DK effect is explained by metacognition vs. regression to the mean, see [1].
Why is this relevant? From the article:
> I have no illusions that everything I read online should be correct, or about people’s susceptibility to a strong rhetoric cleverly bashing conventional science, even in great communities such as HN. But frankly, for the last few years, the world seems to be accelerating the rate at which it’s going crazy, and it feels to me a lot of that is related to people’s distrust in science (and statistics in particular).
I completely agree with the author here. Science is rarely black and white, and, arguably, there are more shades of grey in the social sciences. Just as an example, since you mentioned the replicability crisis: I still see many commenters here on HN believing that from the failure to replicate a result it follows that the result is wrong. It doesn't. But that's a whole other discussion.
None of your points actually address the sample size and study design issues that would be unacceptable even in the social sciences today. Generalizing results from a fistful of privileged undergrads is a well-known issue even in the community.
I totally agree, the first study with the jokes seems silly. But I am also not from the field, maybe it is not actually as silly as it seems to me. But the other studies seem much better to me and removing the first one would not change the conclusions.
Is there supplemental material I didn't notice? I only skim-read it after the joke section, but I can't find any mention of supplemental data anywhere. That's a problem because although you say the other tests are better, no information appears to be provided on which we can judge that.
Let's look at the second test. It's advertised as a "logic test". The description is:
> Participants then completed a 20-item logical reasoning test that we created using questions taken from a Law School Admissions Test (LSAT) test preparation guide (Orton, 1993).
That's the entire description of their method. So immediately, we can see the following problems:
1. Just like the joke test, there's no way to replicate this given the description in the paper. Which questions did they take and why? In turn this throws all claims that the DK study has been replicated into question.
2. The citation is literally a Cliffs Notes exercise for students. It's about memorization of answers to pass law exams, not an actual test itself designed to verify logical reasoning ability. Why do they think this is a good source of questions for testing logic? Law is not a system of logic, there's even a famous saying about that: "the life of the law is not logic but experience". If you wanted to test logical reasoning a more standard approach would be something like Raven's Matrices.
Putting my two posts together there's a third problem:
3. Putting aside the obvious problems with subjectivity, their joke test is defined in an illogical way. They define a test of expertise (working as a comedian), select some people who pass this test and define them as experts, then discover that one expert would have been ranked by their own test as "incompetent but doesn't know it". Yet this is a contradiction, because this person was selected specifically because the researchers defined them as competent. Rather than deal with this logical contradiction by reframing the question they simply ignore it by discarding that comedian from their expert pool.
This is good evidence that DK themselves weren't particularly logical people, yet they claim to have designed a test of logic - a bold claim at the best of times. Ironically, it appears DK may be suffering from their own effect. They believe themselves to be competent at designing tests, yet the evidence in their paper suggests they aren't.
In my experience prepping and taking the test, I found the LSAT logic questions to be pretty good at assessing deductive reasoning.
They’re 100% divorced from law and are closer to puzzles of the nature of, “Six people sit at a table, four of whom are wearing hats, three of which are red, …”
I could not quickly find the LSAT preparation guide but I found some LSAT sample questions [1] and they seem suitable to assess reasoning abilities. Also I do not think that it really matters which questions you choose as long as they span a wide enough difficulty range so that you are able to separate participants.
Hmm, do they? The logical reasoning test in that page is a question about lab rat studies on coffee+birth defects, and a hypothetical spokesperson's response that they wouldn't apply a warning label because the government would lose credibility if the study were to be refuted in future. You're then asked a multiple choice question:
1. Which of the following is most strongly suggested by the government’s statement above?
(A) A warning that applies to a small population is inappropriate.
(B) Very few people drink as many as six cups of coffee a day.
(C) There are doubts about the conclusive nature of studies on animals.
(D) Studies on rats provide little data about human birth defects.
(E) The seriousness of birth defects involving caffeine is not clear.
Given the structure of this question I assumed there'd be more than one right answer but apparently, the only "logical" answer is C.
Maybe the word logic is used differently in the legal profession, but this doesn't resemble the kind of logic test I'm used to. It's about unstated/assumed implications of natural language statements i.e. what a 'reasonable' person might read into something, rather than some sort of tight reasoning on which logical laws could be applied. I can see why that's relevant for lawyers but it's not really about logic.
Still, let's roll with it. (A) and (B) are clearly irrelevant given the stated justification, strike those. But (C) and (D) appear to just be minor re-phrasings of each other. Why is C correct but D not? An implied assumption of the study is that rat studies provide a lot of data about human birth defects, and the government's position implies that they don't agree with that. D could easily be a reasonable subtext for that position. E could also be taken as a reasonable inference, that is, the government believes there's a risk the study authors are using an exaggerated definition of birth defect that voters wouldn't agree with, and that 'refutation' of the study would take the form of pointing out the definitional mismatch.
So if I was asked to score this question I'd accept C, D or E. The LSAT authors apparently wouldn't.
That said, the "analytical reasoning" sample question looks more like a logic test, and the logic test looks more like a test of analytical reasoning. But even their bus question is kind of bizarre. It's not really a logical reasoning test. It's more like a test to see if you can ignore irrelevant information. The moment they say rider C always takes bus 3, and then ask which bus {any combination + C} can take, the answer must be (C) 3 only. Which is the correct answer.
> I do not think that it really matters which questions you choose as long as they span a wide enough difficulty range so that you are able to separate participants.
The problems here are pointing at a fundamental difficulty: all claims about competence/expertise are relative to the person picking the definition of competent. In this case the tasks are all variants on "guess what the prof thinks the right answer is", which is certainly the definition of competence used in universities, but people outside academia often have rather different definitions.
So the questions really do matter. If the DK claim was more tightly scoped to their evidence - "people who think they're really good at guessing what DK believe actually aren't" - then nobody would care about their results at all. Because they generalized undergrads guessing what jokes Dunning & Kruger think are funny to every possible field of competence across the entire human race, they became famous.
> Given the structure of this question I assumed there'd be more than one right answer
I did not, since the question is explicit about there being one correct answer only:
“Which of the following is most strongly suggested by the government’s statement above?”
> but this doesn't resemble the kind of logic test I'm used to. It's about unstated/assumed implications of natural language statements
Agreed.
> But (C) and (D) appear to just be minor re-phrasings of each other.
I think the key here is “there are doubts”. The government’s position stems from doubts on the conclusive nature of the study, that’s it. The statement doesn’t say anything about how much data studies on rats provide about human birth defects. If we’re being logical, studies on rats provide “no data” on human birth defects. Across many studies with different substances there may be a correlation (p(human birth defect | rat birth defect) = x), but an observation of birth defects on rats for a particular substance gives us data about rat birth defects, not human ones.
Ah yes - is vs are. You're right. I think I assumed there'd have to be >1 right answer after reading the options.
It's a remarkably poor question, but option (C) isn't about doubts on the conclusive nature of this specific study, but rather the nature of all studies on all animals. You could credibly argue (and I'd hope a lawyer would!) that no government would base policy on doubting all animal studies and that their position in this case must therefore be due to something about this specific study, e.g. the usage of rats, or the topic of birth defects, or both. So they could argue that (D) is the most logical answer.
Not that it really matters. Pretty clearly the LSAT authors are using the word logical in the street sense of "makes sense" or "sounds plausible" rather than meaning "based on an inference process that's free of fallacies". If DK based their test of competence on questions like this then it doesn't mean much, in my view.
If the validity or significance of the paper depends on whether LSAT questions are fit for DK's purpose, we have entered a much more subjective realm than whether they mishandled the statistical analysis - but as we are there now, I feel that this particular question is not as bad as it is being portrayed.
Firstly, I think we should put aside the fact that it is labeled as a test of "logical reasoning": it is certainly not a test of formal logical reasoning, and an ambiguous or erroneous label does not necessarily make it a bad question (it is not necessary that it be characterized at all.)
Secondly, we are not logically obliged to accept that it has only one answer among the options presented, though if it has more or less than one while the people who posed it thought it had exactly one, that is a problem (I once was nearly expelled from a class for making this point at greater length than the instructor liked!) On the other hand, the question asks which of the candidate answers is most strongly suggested by the passage, which is not a statement that the others are false.
Here, however, it certainly has no more than one answer among the candidates: there is nothing in the passage that has any bearing on options A, B, D or E - this is perhaps most obvious in the case of B, but the others are like it. In particular, with respect to D, that specific issue is not raised, and furthermore, if it were the government's opinion now that D was the case to the extent of having a bearing on the decision, there would be no need to explain its position in terms of a potential future determination that the tests are inconclusive.
C, on the other hand, is suggested by the government's explanation: if the tests were conclusive, their future refutation would not be a worry.
As I said, this is not a test of formal logic, where the government's response would not imply the possibility of future refutation. Nevertheless, to explain something on the basis of a premise that is only formally possible would be almost as much an informal fallacy as begging the question, IMHO, and one might suspect it is being offered deceitfully (a concept that has no place at all in logic.)
The sort of analysis of natural language called for here (to see what are and are not the issues being considered) is useful and important, for lawyers and the rest of us, and it is, as I have set out above, more objective than "makes sense" or "sounds plausible." If people were more practiced in analytical reading, then corporations, governments and other organizations would less easily get away with blatant non-sequiturs in their explanations of their positions and actions ("there is no evidence the attackers took any personal or confidential information"...)
The second question, labeled analytical reasoning, is probably closer to what you consider a logical reasoning question; maybe they picked questions more like those?
That's the issue - we don't actually know what they did. Which means their claims would have to be taken on faith.
Now, maybe other researchers designed different more rigorous studies that are replicable and which show the same effect. That could be the case. The point I'm making here is that the DK paper isn't by itself capable of proving the effect it claims, and that you don't need a statistical argument to show that. Sanity checking the study design is a good enough basis on which to criticize it.
> It's about memorization of answers to pass law exams, not an actual test itself designed to verify logical reasoning ability. Why do they think this is a good source of questions for testing logic?
I'm not sure what that has to do with anything? The paper doesn't claim to have anything to do with testing logic. It's about people's self-perception in relation to a task at which they are, or are not, competent. That task could be juggling watermelons or strangling geese for all it matters.
> The paper doesn't claim to have anything to do with testing logic.
The paper reports on the results of a 'logic' test administered to undergrads and uses this to define competence. It's a key part of their evidence their effect is real.
> It's about people's self-perception in relation to a task at which they are, or are not, competent. That task could be juggling watermelons or strangling geese for all it matters.
The specific tasks matter a great deal.
The whole paper relies very heavily on the following assumption: DK can accurately and precisely tell the difference between competence and lack of competence. In other words, that they know the right answers to the questions they're asking their undergrads.
In theory this isn't a difficult bar to meet. They work at a school and schools do standardized testing on a routine basis. There are lots of difficult tasks for which there are objectively correct and incorrect answers, like a maths test.
But when we read their paper, the first two tasks they chose aren't replicable, meaning we can't verify DK actually knew the right answers. Plus the first task is literally a joke. There isn't even a right answer to the question to begin with, so their definition of "competence" is meaningless. The other tasks might or might not have right answers that DK correctly selected, but we can't verify that for ourselves (OK, I didn't check their grammar test but given the other two are unverifiable why bother).
That's a problem because the DK effect could appear in another situation they didn't consider: what if DK don't actually know the right answers to their questions but their students do. If this occurs then what you'd see is this: some students would answer with the "wrong" (right) answers and rate their own confidence highly, because they know their answer is correct and don't realize the professors disagree. Other students might realize that the professors are expecting a different answer and put down the "right" (wrong) answer, but they'd know they were playing a dangerous game and so rate their confidence as lower. That's all it would take to create the DK effect without the underlying effect actually existing. To exclude this possibility we have to be able to check that DK's answers to their own test questions are correct, but we can't verify that. Nor should we take it on faith given their dubious approach to question design.
> The paper reports on the results of a 'logic' test administered to undergrads and uses this to define competence.
Right, but my point is that 'logic' is simply being used as an example of 'a task'. It's immaterial whether it's actually a good test of logic. As long as you agree that whatever it is is a good example of 'a task', then it's equally probative for the purpose of their argument.
The tasks aren't arbitrary. They're meant to be a proxy for some universal concept of competence. That's why DK is a well known effect, it claims to hold true for anything even though they can't test every possible task.
> we presented participants with tests that assessed their ability in a domain in which knowledge, wisdom, or savvy was crucial: humor (Study 1), logical reasoning (Studies 2 and 4), and English grammar (Study 3).
They picked humor because they think it reflects "competence in a domain that requires sophisticated knowledge and wisdom". They then realized the obvious objection - it's subjective - and decided to do the logical reasoning task to try and rebut those complaints (but then why do the first experiment at all?):
> We conducted Study 2 with three goals in mind. First, we wanted to replicate the results of Study 1 in a different domain, one focusing on intellectual rather than social abilities. We chose logical reasoning, a skill central to the academic careers of the participants we tested and a skill that is called on frequently ... it may have been the tendency to define humor idiosyncratically, and in ways favorable to one's tastes and sensibilities, that produced the miscalibration we observed - not the tendency of the incompetent to miss their own failings. By examining logical reasoning skills, we could circumvent this problem by presenting students with questions for which there is a definitive right answer.
So logical reasoning was chosen because:
1. It's objective.
2. It's an important skill.
3. It's a general "intellectual" skill.
That makes it very important whether it's actually a good test of logical reasoning. If it were truly an arbitrary test like an egg-and-spoon race or something, then there's no reason to believe the results would generalize to other areas of life and nobody would care.
> The tasks aren't arbitrary. They're meant to be a proxy for some universal concept of competence.
I’ve seen absolutely nothing suggesting this. It’s explicitly about task competency; no particular task is specified nor needs to be specified.
> That's why DK is a well known effect, it claims to hold true for anything even though they can't test every possible task.
Yes, they claim it holds true for everything because it’s how human beings introspectively experience being poor at a task. It’s really not necessary to have some Platonic ideal of Task Competency … which is then specifically restricted to logical tasks for reasons known only to you.
> Logical reasoning was chosen because: It’s objective.
I think there’s a kernel of truth in this, albeit assuming by ‘objective’ you instead mean (as people often do) something like “people almost always agree in their evaluations of this quality”. You need that for a good experiment. I’m still not sure how it relates at all to your point here. Personally I would find it easier to just say “I was wrong, it’s not explicitly about logic, I just associated it with that because it’s commonly adduced in silly arguments about logic/intelligence on the internet” - but ah well, it’s an interesting theory so I’m happy to discuss it.
The reason no particular task needs to be specified to invoke DK is exactly because they argue that their initial selection of experimental tasks is so general, that the effect must apply to everything.
It feels like you and danbruc are inverting causality here. You start from the assumption that DK is a real effect and then say, because it's real and general, it doesn't matter what tasks they used to prove it. But that's backwards. We have to start from the null hypothesis of no effect existing, and then they have to present evidence that it does in fact exist. And because they claim it's both large and very general, they need to present evidence to support both these ideas.
That's why they explicitly argued that their tasks reflect general attributes like wisdom and intelligence: they wanted to be famous for discovering a general effect, not one that only applies in very specific situations.
But their tasks aren't great. The worst are ridiculous, the best are unverifiable. Thus the evidence that DK is a real and general effect must either be taken as insufficient, or you could widen the argument to include studies by other psychologists that pursue the same finding via different means.
> And because they claim it's both large and very general, they need to present evidence to support both these ideas.
To me the claims in the paper do not really seem that strong, almost to the point that I am not sure if they claim anything at all. If you read through the conclusions, they mostly report the findings of their experiments. The closest thing to any claims about generality I can find is that they discuss in which scenarios their findings will not apply. You could maybe read into this that they claim that in all other scenarios their findings apply, but that is not what they actually do.
But I guess the better way to discuss this is that you just quote the claims from the paper that you consider too strong and unjustified instead of me trying to anticipate what you are referring to or me going over each claim in the paper.
> The tasks aren't arbitrary. They're meant to be a proxy for some universal concept of competence.
This seems at least somewhat wrong to me - the competence is not universal but task specific. They compare how your competence at performing task X is related to your ability to assess your performance of task X, in absolute terms and relative to the other participants. They repeat this for different tasks and find that for all tested tasks the same pattern emerges - roughly, the better your performance, the better your ability to accurately assess your own performance and the performance of others.
So you can be competent doing task X and provide accurate assessments for task X performances while at the same time being incompetent doing task Y and being less accurate in assessing task Y performances. This essentially means that you can not be universally good at assessing performances of arbitrary tasks, you can only do this well for tasks for which you are yourself competent.
For completeness I would add that a good task must allow objectively rating the performance of participants without [much] room for debate. But given that, the whole setup is self-contained and task-independent. Let participants perform the task and establish their competence by rating their performance. Then let participants perform the meta-tasks of rating their performance in absolute and relative terms, and finally check how task and meta-task performances are related.
I can't quite figure out from this post and the posts after if you have any background in social science or not (you have stated you didn't do social science professionally - but I get a nagging feeling you have studied it) - and I'll try to explain why I think it matters. For what it's worth - I wouldn't necessarily object to what you wrote here if you finished with "great example of why people don't take the social sciences seriously" and left it there. I do have a problem with "shouldn't", although in a different setting (i.e. amongst social science people) I would probably argue for "shouldn't".
Full disclosure - I was a sociological researcher before I started working in IT - and would (I can appreciate the irony given all of this is about the DK effect) rate myself as very significantly above average in terms of methodological rigour and mathematical skill compared to other social researchers.
One thing that is taught to social researchers - although I've seen it much less with psychologists - is that social research is fundamentally different from natural sciences in that it is accepted as fundamentally subjective. Now, a radical such as myself will tell you that all research, including natural science, is not entirely objective due to very subjective navigation of selection bias, but putting that to the side - this is an extremely important point when evaluating social research.
Coming back to your original point - I would agree with the points you object to vis-a-vis the original DK effect paper; however, as a social researcher, I am always already coming into reading that paper knowing that I'll have to take it with spoonfuls of salt. There is no need to write the paper in a way that puts in many of the disclaimers you might expect, because we are institutionally taught that these disclaimers apply.
Having said that - one of my peeves with social research, and why I ultimately went away, is that a lot of garbage goes on and gets through peer review. There is almost no proper testing of quantitative instruments and methods. Which is why I agree with your point that it rightfully isn't taken seriously - but I would object to your assertion that it shouldn't be taken seriously, especially amongst IT professionals who are already going to have a bias against non-STEM. Point out the shortcomings and apply a different interpretive lens, rather than discounting the field completely - social science can be better and taken seriously if it is held to a higher standard, even with the methodological shortcomings we have today - but it is very often discounted wholesale, which I don't think is going to incentivise the bubble that is forming around it to reform and get better.
The plot to me always read "People estimate themselves at the 60-70th percentile - above average, but not the best". And then given this broad prior, people do place themselves accurately (because the plot is increasing).
So it seems people are bad at doing global rankings. If I tried to rank myself amongst all programmers worldwide, that seems really hard and I could see myself picking some "safe" above-average value just because I don't know that many other people.
There's also: If you take 1 class in piano 30 years ago and can only play 1 simple song, that might put you in the 90th percentile worldwide just because most people can't play at all. But you might be at the 10th percentile amongst people who've taken at least 1 class. So doing a global ranking can be very difficult if you aren't exactly sure what the denominator set looks like.
So I think it's an artifact of using "ranking" as an axis. If the metric was, "predict the percentage of questions you got correct" vs. "predict your ranking", maybe people would be more accurate because it wouldn't involve estimating the denominator set.
This is exactly my conclusion, and it seems obvious... just look at the self assessment line - pretty much everyone thinks they are slightly above average. Once you know that everyone thinks they are above average, you already know how it will play out... the bottom quartile will have the biggest gap between actual skill and estimated skill.
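To make that concrete, here is a minimal sketch (made-up numbers, not DK's data) of what happens if everyone guesses somewhere around the 65th percentile regardless of how they actually did:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    # Actual performance: uniform percentiles 0-100.
    actual = rng.uniform(0, 100, n)

    # Self-assessment: everyone guesses "a bit above average",
    # independently of how they actually did.
    perceived = np.clip(rng.normal(65, 10, n), 0, 100)

    # Group by actual-performance quartile, as in the DK plots.
    quartile = np.digitize(actual, np.percentile(actual, [25, 50, 75]))
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1}: actual {actual[m].mean():5.1f}, perceived {perceived[m].mean():5.1f}")

The bottom quartile "overestimates" by roughly 50 points and the top quartile "underestimates" by roughly 20, even though nobody's guess used any information about their own performance.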
> There's also: If you take 1 class in piano 30 years ago and can only play 1 simple song, that might put you in the 90th percentile worldwide just because most people can't play at all. But you might be at the 10th percentile amongst people who've taken at least 1 class. So doing a global ranking can be very difficult if you aren't exactly sure what the denominator set looks like.
Yes, and this literally implies that people in the lowest quartiles can't and won't rate themselves to be in the lowest quartiles when they are forced to give an answer. (Especially on tests that don't measure anything (getting jokes? really?), on tests they have no knowledge about (how would they know how their classmates perform on an IQ test?), or on tests that just have high variance.)
And therefore they will "overestimate their performance".
It's like grouping a bunch of random people and forcing them to answer whether their house is short, average, or tall. The "people living in short houses" will "overestimate the height of their houses", while the "people living in towers" will humbly say they live in a house of average height.
Is this an existing and relevant psychological phenomenon, different from the general inability to guess unknown things? I don't think so.
One difference is that since the experiments were run on psychology students, they know the population: those are their peers, with whom they interact on a daily basis, and they should have an idea of how they compare with them.
> how would they know how their classmates perform on an IQ test?
Are you serious? If you're interacting with your classmates, you definitely should have some idea of how their intellectual capabilities differ between each other and also with respect to you. In a small class doing lots of things together, someone might even literally count their "ranking" on some metric that highly correlates with IQ, estimating that Bob, Jane and Mary are above me and Dan and Juliet are below me, so I'm at the 40th percentile.
It's not appropriate to treat these aspects as unknown things or unknowable things.
One minor correction - the article creates a dichotomy where the hypothesis must be either
1) self-assessment is somewhat correlated with skill, or
2) completely uncorrelated
And this is a true dichotomy. The "autocorrelative" effect doesn't need perfect correlation, just some correlation.
I think Dunning-Kruger makes intuitive sense. When you become skilled in your field you learn from other people in your field, and your assessment of yourself is based on your relation to the skills of those other people. But if you know very little about something, you have no reference point to evaluate yourself against.
When you learn something you also learn some of the mistakes you can make. You then evaluate your performance against the mistakes you didn't make. Consider a piano player, or a figure skater. You have to know which figures are difficult to perform in order to evaluate a performance, and you don't know which the difficult ones are until you have studied and tried to perform them.
It’s been argued before that this is the only reason that DK gained any notoriety; because it feels right, not because it is right. It’s the “just-world” theory: we want to believe that confident people are overcompensating.
Is it actually intuitive though? Consider your own example. Most people who don’t know piano or figure skating are well aware that they don’t know, and do not rate themselves highly at all. Would it be surprising to learn that people who don’t know any law or engineering don’t often hold any doubts about their lack of skill, and by and large are not deluded nor erroneously believe they’re great at these things they don’t know?
The DK paper didn’t measure knowledge-based skills like piano, figure skating, or law. It measured things like the ability to get a joke, and conversational grammar. How would you rate your own ability to get a joke? (Does this question really even make a lot of sense?)
It’s important that the methods in the DK paper focused on tasks that are hard to self-evaluate, because when people have tried to replicate DK with more well defined knowledge-based activities, they have often demonstrated the complete opposite effect, that there is widespread impostor syndrome, and skilled people underestimate themselves.
"Most people who don’t know piano or figure skating are well aware that they don’t know, and do not rate themselves highly at all. "
I think this case (real_skill = 0, perceived_skill = 0) is maybe a trivial case, and that the bit of truth the DK idea captures is that when someone with very little skill considers how much work it would take to get to whatever a 'fully skilled' version would be, they woefully underestimate it.
Picture someone in their first summer of mountain biking watching youtube videos of the best guys. Yes, you know you can't jump like they do, or turn as skillfully, but you're getting better each month. However, you still grossly underestimate how hard it is to get to that skill level.
At least that's my personal experience as someone thoroughly unskilled!
Notoriety means “the state of being famous”, which is the meaning that I intended. I actually don’t want to use “notable” in this case, because that would imply that I believe the DK effect is real. Notorious might even be a good word here, since the paper has problems with its claims and its interpretation of its own data.
Thinking about all the "IANAL" answers and the obviously wrong "legal" advice and opinions on legal topics you can come across on HN on a daily basis, I think law is a good example of people overestimating their knowledge.
Replace figure skating with any other sport, though, and try having a discussion about, e.g., a defeat of any soccer team. All of a sudden everyone just became a soccer coach. And everyone is able to critique individual players' performance and skill and technique. If anything, this proves the DK effect rather well.
You might be forgetting that DK demonstrates a positive correlation between confidence and skill. They gave statistical evidence for people who believe they’re right actually being right more often on average. The question the paper is actually asking is why aren’t people’s self-estimates perfect, but it does not, contrary to popular misunderstanding, demonstrate that confident people are lower skilled. Reading bad legal advice on HN is not a demonstration of the so-called DK effect.
But is DK not explicitly about assessing yourself and not about assessing others?
I feel like being able to critique the performance of others is different from being able to critique yourself. There might be a correlation between the two, but they're not the same.
The DK paper is not about assessing one’s self. There was a self-eval, but the primary methodology used for most of the data & conclusions was to rank one’s self against the others in the group! You are spot on -- this ranking is a major problem for the credibility of the paper’s narrative. Being unable to rank against others precisely, especially when you don’t know their skill level, does not demonstrate that someone is unaware of their own lack of skill.
I think it's actually the next step; to get good at something you have to get good practice, and that requires good self-assessments and knowing how to practice the part you're weak on.
There are people who can't advance because they can't see the problem, and people who can't advance because they can't (or don't want to) correct it. The end result is the same though.
It's also intuitive if you think about the error in self-assessment. Skill is asymptotic to some upper bound. The closer you are to the asymptote (higher skill), the more likely the estimation error falls below your actual skill, since it cannot go above the bound.
Conversely, the error also cannot go below zero, so at low skill it is most likely going to land above the actual skill line (overestimation, since it's clamped from below).
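A quick sketch of that clamping argument (the noise level is an arbitrary assumption): give everyone a symmetric estimation error, clip the estimate to the valid 0-100 range, and the average error flips sign as you move from low to high skill.

    import numpy as np

    rng = np.random.default_rng(1)
    skill = rng.uniform(0, 100, 100_000)

    # Symmetric error, then clamp to the achievable score range.
    estimate = np.clip(skill + rng.normal(0, 25, skill.size), 0, 100)
    error = estimate - skill

    for lo, hi in [(0, 25), (25, 50), (50, 75), (75, 100)]:
        band = (skill >= lo) & (skill < hi)
        print(f"skill {lo:3d}-{hi:3d}: mean error {error[band].mean():+5.1f}")

Low-skill bands come out with a positive mean error (overestimation) and high-skill bands with a negative one, purely from the bounds.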
Most people can't play the piano or skate. Don't consider those ones. Consider the things that everyone can do. Let's pick driving a car. I am fairly convinced that most people feel after a few years they are excellent at driving their car but in fact they are just OK to terrible. And this is with a lot of practice!
I think that's assuming more ignorance than even an unskilled person has.
If you had never listened to a professional play piano before then you'd have no idea what level of performance is possible. Similarly, if you had never seen skilled skaters perform on TV.
But we have done these things, so it's obvious that they're doing something that's very difficult.
Maybe you don't fully appreciate the skill, though. You wouldn't do well as a judge who compares the performances of professionals. But comparing novices to professionals seems easy?
> If you had never listened to a professional play piano before then you'd have no idea what level of performance is possible. Similarly, if you had never seen skilled skaters perform on TV.
> But we have done these things, so it's obvious that they're doing something that's very difficult.
Sometimes the things we find most impressive, in a demonstration of a skill we don't have, aren't the most difficult things.
I remember being absolutely blown away by some aerial circus tricks and stunts I saw at shows. Later, I started studying and eventually performing myself, and it's often the case that the most crowd-pleasing stunts are some of the easiest to perform.
As a performer, you could always tell which members of the audience knew their stuff, because they'd be the only ones applauding the tricks that might not have looked so spectacular, but were actually the most difficult.
It's more like intermediate (most vulnerable to the DK-effect) to advanced (utmost appreciation for professionals).
Taking the piano example: after 1-2 years of progressive learning you can certainly give off the impression to somebody unfamiliar/untrained (including yourself to an extent) that you are actually quite good: Intermediate stage.
But after a while, when confronted with more and more challenging stuff, by discovering different styles and fine-tuning your hearing, you at some point reach the very visceral and uncanny sensation of the countless possible roads you can now explore: the advanced stage.
>Similarly, if you had never seen skilled skaters perform on TV.
then, as a person who has lived in the world and has the normal physical skills of such a person, you probably think "whoa, how in the heck did they do that" when you finally see it.
The OP article mentions in their rumination that there's some difficulty in generalising DK:
"And maybe there’s no contradiction - there’s always room for nuance, for finding out where the Dunning-Kruger effect is relevant and where it’s not. That can be done with more studies, but only if the authors manage to agree on assumptions and basic statistical practice."
Your post reminded me of one of my favorite Adam Savage videos where he touches upon this idea you're exploring. I encourage folks to see it, he articulates it so well.
I linked to the start of the video where he begins to build the idea. TLDR is he mentions Monet painting Impression Sunrise and how it was something that people have never seen before and it took a bit of time for it to blow people away--they needed to develop "new eyes" to see the genius. Adam then dives into this idea of "new eyes". I'm sure many of us have experienced this in our life and it was so nice to hear Adam unpack it.
This rings so true! I have no idea about programming, but in supply chain and logistics I do see the same thing. And the most frustrating people to work with are those stuck at > 2. without realizing it.
Fascinating! I don’t know about supply chain, but I can tell you it’s the same in programming. I suspect these points apply to creative ideas in general.
Well, so far I've seen it in SCM (professionally), you see it in sports basically every weekend (we have close to 80 million national soccer coaches in Germany), and I've seen it in boxing (I'm nowhere near number 2 there anymore, but usually novices come in with no idea, then they think they are good until someone shows them "nah, you still don't know how to box")... So I guess it is the same everywhere, in any domain.
If human cultures can be characterized as default arrogant or default humble then it stands to reason that arrogant cultures will have a DK effect, and in humble cultures you won't.
I think you have the common misconception about the DK effect, which is incorrectly summarised as "unskilled and unaware".
There is also the other end of the scale where "skilled and unaware" occurs: people under-assessing their skill (presumed that this is due to judging that most people also have similarly high skill levels).
I think your two "cultures" would shift the self-assessment line up or down on the graph (constant), but not affect the slope very much (multiplier). The line shape or line slope must change somewhat since values are limited (between 0 and 100).
Even people conditioned to be humble could have a strong motivation to believe something is true and overestimate their own knowledge/ability in order to stand on what they perceive as evidence. For example a person's depression, religious beliefs, or an over-emphasized belief in DK itself could be a possible reason they have an erroneously deflated opinion of themselves, and simultaneously employ inflated confidence in irrational arguments that demonstrate why they are almost completely worthless at their field. That's pretty much how depression is secretly prideful in a sense: over-estimating our own mental ability to assess our helplessness.
When I did some cognitive behaviour therapy, I un-learned things like "all or nothing thinking" and the expectation that I could accurately predict the outcome of any course of action by modeling future performance off of a past failure.
> I un-learned things like "all or nothing thinking" and the expectation that I could accurately predict the outcome of any course of action
Do you know any words, stereotypes, or clichés for this? Or even what the related mental disorder is called if it were to become debilitating? Or a specific word for the complete clustering of related signals?
I am guessing those issues plus their related issues (syndromic?) are common - but I don't know where to group it in my own mind.
It is closely associated with having a highly systematizing mind. People with ASD get drawn and pushed down a particular life history corridor. There are rewards of parental/teacher approval for high performance in an area of profound interest, and a punishment in the form of peer bullying for low social skills. This conditions them to operate this way to avoid bullying and optimize for time seemingly well-spent with these impersonal systems. To justify one's own existence, there is this urge to live in a world where a narrowly focused mind is able to predictably produce an ideal world through expertise (all), and a tendency towards refusal to live outside of that, sometimes advancing into self-destructive behaviour should that not be an available outcome (nothing). Getting ALL is not only about having things one wants, it is also about seeing the system work, and about identity, a sense of vindication.
Collective layer:
We see so much of it I think even among neurotypicals because we live in an extremely systematized world. Every single aspect of our world is seen by the "haves" of our society as a candidate for profitable separation through a digital layer. Anyone can end up metastasizing a systematizing mind. They just need to exhibit a hyper-focus on something impersonal and complex. As a global civilization, we have been doing this to ourselves and strapping others into it as much as we can. Mostly, only those living in rural parts of materially impoverished nations are spared this temptation.
Mental Symbolism layer:
It is symbolically speaking one entrance into a realm of mental death. With the devotion to lifeless systems the human is de-personalized, atomized, depressed, unrelational. They leave unfulfilled the inescapable truth of what it means to be human. To live your entire life this way is to betray your parents, ancestors, any friends or lovers you ever had, and anyone you could have helped.
Spiritual layer:
The pattern is quite literally satanic. The person has chosen to reign in this hell (both all and nothing), rather than serve in heaven (the humble, narrow middle path). The most powerful and beautiful of all created entities, with an astonishingly powerful mind, insisting on dwelling in a state of supreme perfection betrays the Father to become an engine of extinction.
No. I was just curious, and seeking understanding. I have friends that have told me their detailed future plans with exact timeframes and no contingency. I have seen others struggle to rationalise unpredicted forcing events in their lives - especially negative emotions when other people do not act according to their plan.
Something I generally keep in mind about articles posted to HN:
A large portion of the HN audience really, really wants to think they're smarter than mostly everyone else, including most experts. Very few are. I'm certainly not.
Articles which "debunk" some commonly held belief, especially those wrapped in what appears to be an understandable, logical, followable argument, are going to be catnip here.
Articles like this are even stronger catnip. If a member of the HN audience wants to believe they're mostly smarter than mostly everyone else, that includes other members of the HN audience.
So, whenever I read an article and come away thinking that, having read the article, I'm suddenly smarter than a huge number of experts, especially if, like the original article, it's because I understand "this one simple trick!", I immediately discard that knowledge and forget I read it.
If the article is right, it will be debated and I'll see more articles about it, and it'll generate sufficient echoes in the right caves of the right experts. Once it does, I can change my view then.
I am not a statistician, or a research scientist. I have no idea which author is right. But, my spider sense says that if dozens of scientific papers, written by dozens of people who are, failed to notice their "effect" was just some mathematical oddity, that'd be pretty incredible.
And incredible things require incredible evidence. And a blog post rarely, if ever, meets that standard.
"The second option conforms with the Research Methods 101 rule-of-thumb “always assume independence.” Until proven otherwise, we should assume people have no ability to self-assess their performance"
It's not that at all. The assumption should be that everyone is equally good (or bad) at assessing their performance - not that they have no ability, but that the mean is the same between groups rather than different. That is, the ability to assess themselves is independent of performance.
This confused me at first too. The issue is that "X" is your performance, and "Y" is your perceived performance.
Say that everyone is equally okay at assessing themselves, and gets within 0.1 of their actual performance (rated from 0 to 1). Then X and Y are going to be very correlated, as X - 0.1 < Y < X + 0.1. But X - Y will look like a random plot, since Y is randomly sampled around X.
The only case where X and Y wouldn't correlate at all is if people have no ability to assess their performance (i.e., Y isn't sampled around X, but is instead sampled from a fixed range).
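A small numeric check of both cases (the noise levels are assumptions, nothing from the paper):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, 10_000)                  # actual performance

    # Case 1: everyone estimates within ~0.1 of their actual performance.
    y_close = x + rng.uniform(-0.1, 0.1, x.size)
    # Case 2: estimates have nothing to do with performance.
    y_random = rng.uniform(0, 1, x.size)

    for label, y in [("Y sampled around X", y_close),
                     ("Y independent of X", y_random)]:
        r_xy = np.corrcoef(x, y)[0, 1]
        r_gap = np.corrcoef(x, y - x)[0, 1]
        print(f"{label}: corr(X, Y) = {r_xy:+.2f}, corr(X, Y - X) = {r_gap:+.2f}")

In the first case X and Y are strongly correlated and the gap Y - X is essentially noise; in the second, corr(X, Y) is near zero and corr(X, Y - X) lands around -0.7 even though nothing psychological is going on.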
The less you know, the more random your guess at your own knowledge is. The actual value is low and less than zero isn't an option, so this drags the average up consistently.
The more you know, the more accurate your guess of your knowledge is. Especially as you hit the limits of the test, this noise can only drag the average down, but less dramatically than the other case.
With the reasonable conclusion: We all suck at guessing how much we know, but the more you know the less you suck until you hit the limits of the framework you are using for quantization of knowledge.
I had the same thought while reading this. The test has a limited range of values, you can only estimate your score within that range, no higher or lower. Those at the top and bottom will naturally estimate into the body of the range since a lower or higher estimate is not possible. However, I’m not sure this explains the results entirely, and I’d like to see a statistician take this further.
Assuming a world where all the participants understand normal distributions, would this be addressed by asking people to rate how they did in terms of "standard deviations compared to the average" or such?
That still wouldn't be useful. The root problem is that a scoring system that isn't unbounded in both directions (or where the reasonably achievable scores are much farther from the bounds than the spread of guesses) will end up with "clipping" at the edges of the model.
There are ways to fix this:
- Throw out the extreme high and low ends of the data because the model breaks down there (which makes for a very boring result).
- Have people guess their score and a rough level of confidence alongside it (just a 0-5 sort of thing) and see what happens.
Note that I actually do think, from my own experience, that the effect is real, but the arguments presented fail to prove it statistically because the model breaks down at the extremes where the effect is detected.
I’m not a statistician but I do have some basic training in psychometrics. It might be interesting/helpful to point out that your priors about self-assessment seem more reasonable generally but also put a lot of faith in the test’s validity as a measure of skill.
I’m relying on intuition here, but it seems a little problematic that the actual score and the predicted score are both bound to the same measurement scheme. Given that constraint on some level we’re not really talking about an external construct of skill, just test performance and whether people estimate it well. Which is different from estimating their skill well.
Maybe someone with more actual skill can elaborate or correct haha.
What’s more interesting to me is what all the buzz over DK tells me. We are asymmetrically skeptical. In the same way as intelligent people doubt their own performance, they rightly doubt others’ performance. Maybe too much.
I think that most people who talk a lot about DK believe that they are the experts in one field or another.
It serves mostly as a way of reassuring themselves of their own superiority. The message (for them) basically amounts to "other people's claim to knowledge is just further proof that they don't know anything."
It’s a zero-effort, zero-evidence-required way for people to disparage others, in a way that they believe makes them sound smart. It’s also basically unfalsifiable in most of the cases where it’s referenced.
I feel like I’m honestly yet to see somebody make DK accusations in a way that’s not totally cringe.
> I think that most people who talk a lot about DK believe that they are the experts in one field or another.
I recall that either Dunning or Kruger once made a remark to that effect. That rather than an indictment of stupid people, it would be better to view it as a warning to those who consider themselves the smart ones.
I feel like there's a bit of a paradox here. The more I internalize how easy it is for us to be overconfident in our intelligence, the more confident I feel in my intelligence...
It's called a giant fucking lack of self awareness, with a good helping of societally instilled narcissism on the worst side of it all, and then add in imposter syndrome, self righteousness and gaslighting. The best side of it all basically is all of these things, but with a tight leash on things and sans the gaslighting. There might be better, but those people are probably off doing their own thing minding their own business; etc.
Well done. I read the autocorrelation post when it came out a couple of weeks back and it didn’t sit right with me. But I didn’t have the motivation to figure out why. Your explanation resonates perfectly with my initial (snap) intuition and I thank you for taking the time to write it out and post!
Gah, I wish I had time to fully read this and get into it, but I have to spend the next few hours driving.
Unfortunately the original article isn't very clearly explained, and it's only on reading the discussion in the comments under it that it becomes clear what it's actually saying.
The point is about signal & noise. Say your random variable X contains a signal component and a noise component, the former deterministic and the latter random. Say you correlate Y-X against X, and further say you use the same sample of X when computing Y-X as when measuring X. In this case your correlation will include the correlation of a single sample of the noise part of X with its own negation, yielding a spurious negative component that is unrelated to the signal but arises purely from the noise. The problem can be avoided by using a separate sample of X when computing Y-X.
The example in the original "DK is autocorrelation" article is an extreme illustration of this. Here, there is no signal at all and X is pure noise. Since the same sample of X is used a strong negative correlation is observed. The key point though is that if you use a separate sample of X that correlation disappears completely. I don't think people are realising that in the example given the random result X will yield another totally random value if sampled again. It's not a random result per person, it's a random result per testing of a person.
This is only one objection to the DK analysis, but it's a significant one AFAICS. It can be expected that any measurement of "skill" will involve a noise component. If you want to correlate two signals both mixed with the same noise sources you need to construct the experiment such that the noise is sampled separately in the two cases you're correlating.
Of course the extent to which this matters depends on the extent to which the measurement is noisy. Less noise should mean less contribution of this spurious autocorrelation to the overall correlation.
To give another ridiculous, extreme illustration: you could throw a die a thousand times and take each result and write it down twice. You could observe that (of course) the first copy of the value predicts the second copy perfectly. If instead you throw the die twice at each step of the experiment and write those separately sampled values down you will see no such relationship.
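The die analogy is easy to check directly (a throwaway sketch, not anything from either article):

    import numpy as np

    rng = np.random.default_rng(3)
    first = rng.integers(1, 7, 1000)      # 1000 die throws

    copied = first.copy()                 # write each result down twice
    rethrown = rng.integers(1, 7, 1000)   # or throw the die a second time instead

    print("corr(first, copied)   =", round(np.corrcoef(first, copied)[0, 1], 2))    # 1.0
    print("corr(first, rethrown) =", round(np.corrcoef(first, rethrown)[0, 1], 2))  # ~0.0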
Hey omnicognate, good to see you here, appreciated our previous discussion.
What you're saying is that we need to verify the statistical reliability of the skill tests DK gave, and to some extent that we need to scrutinize the assumption that there indeed is such a thing as "skill" to be measured in the first place. I hope we can both agree that skill exists. That leaves the test reliability (technical term from statistics, not in the broad sense).
What's simulated by purely random numbers is tests with no reliability whatsoever. Of course if the tests DK gave to subjects don't actually measure anything at all, the DK study is meaningless. If that's what the original article's author is trying to say, they sure do it in a very roundabout way, not mentioning the test reliability at all. I'd be completely fine reading an article examining the reliability of the tests. Otherwise, I again fail to see how the random number analysis has anything to do with the conclusions of DK.
In fact, DK do concern themselves with the test reliability, at least to some extent. That doesn't appear in the graph under scrutiny but appears in the study.
If you assume the tests are reliable, and you also assume that DK are wrong in that people's self-assessment is highly correlated with their performance, and generate random data accordingly, you'll still get no effect even if you sample twice as you propose.
> The key point though is that if you use a separate sample of X that correlation disappears completely
Separate sample of X under the assumption of no dependence at all on the first sample, i.e., assuming there is no such thing as skill, or assuming completely unreliable tests. So, not interesting assumptions, unless you want to call into question the test reliability, which neither you nor the author are directly doing.
I think the other piece that has been glossed over a bit is that DK are using quantiles (for both the test and the self-assessment). That means everything is bounded by 0 and 1, and you can't underestimate your performance if it was poor, or overestimate your performance if it was perfect. Or conversely, if you're the most skilled person in the room, your (random) actual performance on the day of the test is bounded above by your true skill, and vice versa for the least skilled. So e.g. we could simulate data with perfect self-assessment of overall skill, add a small amount of noise to actual performance on the day of the test, and get the same results. The bottom quartile (grouped by actual test score) will be a mix of people who are actually in the bottom quartile in skill and some who are in the higher quartiles. The top quartile by actual test score will be a mix of some from the top quartile in skill and some from lower quartiles.
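Here's a rough sketch of the simulation described above (the noise level is made up): self-assessment tracks true skill perfectly, only the test score on the day is noisy, and grouping by test-score quartile still produces a DK-looking gap.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 10_000

    skill = rng.normal(0, 1, n)
    test_score = skill + rng.normal(0, 1, n)   # noisy performance on the day
    self_assessment = skill                    # perfect knowledge of own overall skill

    def pct(v):
        # percentile rank, 0-100
        return 100.0 * np.argsort(np.argsort(v)) / (len(v) - 1)

    test_pct, self_pct = pct(test_score), pct(self_assessment)

    quartile = np.digitize(test_pct, [25, 50, 75])
    for q in range(4):
        m = quartile == q
        print(f"Q{q + 1} by test score: actual {test_pct[m].mean():5.1f}, "
              f"self-assessed {self_pct[m].mean():5.1f}")

The bottom quartile by test score contains people whose true skill is higher than their test result, so its mean self-assessed percentile sits well above its mean test percentile (and vice versa at the top); how big the gap gets depends entirely on the assumed noise.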
I agree in principle, although I think to get an effect size similar to what DK observed you'd need quite large noise. Which again comes back to the test reliability.
Beyond the validity of the statistical methods used... can someone clarify what the actual hypothesis about competence is that we are debating? And what does each article propose?
My understanding is that the hypothesis is "Those who are incompetent overestimate themselves, and experts underestimate themselves".
DK says: True
DK is Autocorrelation says: ???
"I cant let go..." says: True?
HN says: also True?
Is there really any debate here? The "DK is Autocorrelation" article seems to be the only odd one out, and it's not clear if it even makes a proposal either way about the DK hypothesis. It talks about the Nuhfer study, but that seems Apples vs Oranges since it buckets by education level. Then it also points out that random noise would also yield the DK effect. But that also does not address the DK hypothesis, and it would indeed be very surprising if people's self evaluation was random!
So should my takeaway here just be that the DK hypothesis is True and that this is all arguing over details?
DK is Autocorrelation says: The DK article is based on a false premise, we have to disregard it
"I cant let go..." says: Actually, given that we assume people are somewhat capable of self-assessment, which is reasonable, "DK is Autocorrelation" is the one based on a false premise, and we should disregard that one instead, and not DK.
> My understanding is that the hypothesis is "Those who are incompetent overestimate themselves, and experts underestimate themselves".
The DK hypothesis is "double burden of the incompetent": "Because incompetent people are incompetent, they fail to comprehend their incompetence and therefore overestimate their abilities more than experts underestimate theirs"
Arguably the hypothesis that matches the data from the DK paper best is: "Everyone thinks they're average regardless of skill level"
> The DK hypothesis is "double burden of the incompetent"
The actual DK result (which is much criticized, but that's a different issue) was actually a pretty much linear relationship between actual relative performance and self-estimated relative performance, crossing over at about the 70th percentile.
(Because there is more space below 70 than above, that also means that the very bottom performers overestimated their relative performance more than top performers underestimated theirs - not because of any “double burden” (overestimation didn't rise faster as one moved below the crossover), but just because there was more space below the crossover point.)
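To put rough numbers on the "more space below the crossover" point (the slope and crossover are eyeballed from the published plot, so treat them as assumptions):

    # A straight-line self-estimate: crossover at the 70th percentile, slope ~0.2.
    slope, crossover = 0.2, 70

    def estimate(actual_pct):
        return crossover + slope * (actual_pct - crossover)

    for actual in (10, 50, 90):
        est = estimate(actual)
        print(f"actual {actual:2d}th pct -> estimate {est:4.1f} (error {est - actual:+5.1f})")
    # actual 10th pct -> estimate 58.0 (error +48.0)
    # actual 50th pct -> estimate 66.0 (error +16.0)
    # actual 90th pct -> estimate 74.0 (error -16.0)

Same slope everywhere, yet the bottom overestimates by far more than the top underestimates, simply because the crossover sits above the median.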
> Arguably the hypothesis that matches the data from the DK paper best is: "Everyone thinks they're average regardless of skill level"
If there were a perceptual nudge toward average relative performance, you'd expect a crossover at the median with a slope below 1; instead, the nudge is toward a particular point above the average.
Sure, there's in aggregate a slight positive slope to self-assessment when plotted against performance. But all of these have in common that the range of self-assessments is small across the full range of performances and they're all centered somewhere around 60.
> The actual DK result
The "incompetent self-assessment because incompetent" claim is literally everywhere in the paper. It's in the title, the abstract, the introduction and every section thereafter until the end.
>Arguably the hypothesis that matches the data from the DK paper best is: "Everyone thinks they're average regardless of skill level"
No, if you look at the graph[0] everyone thinks they are above average (over 50). The worst think they are a little above average and everyone else thinks they are better and better but increasing by less than the real difference.
At any rate, the issue seems to be with how people imagine everyone performs - they seem to think there are a lot of people who are really bad, for a start, and somewhat more people who are really good than there actually are (at least if we assume the results are accurate).
What I don't like about statistics, or rather the use of them, is the tendency to focus exclusively on them instead of treating them as the tool they are. Statistical analysis is not the subject of the DK effect or paper; it is a tool D&K used in analyzing the effect, nothing else. D&K put more expertise, research and knowledge into their work than simple statistics.
I hate it when people "solely" use statistics, and other first-principle thinking approaches, to understand well researched and documented topics. And I hate it when people use statistics alone to criticize research without considering the other aspects of it. Does it mean the DK effect can be discarded or not? I don't know; I think some disagreement over the statistical methods is not enough to come to any conclusion.
Attacking the Dunning-Kruger study only on statistical grounds looks like a prime example of the DK effect in itself...
For anyone who is interested in playing around with these charts, the various assumptions that underpin them, etc., I've thrown together a Colab notebook as a starting point.
Observation: if you rank via true "skill" and assume that for a particular instance the predicted performance and observed performance are independent but both have the true skill as their mean, you don't observe the effect. CC of 0.00332755.
If you rank via observed performance and plot observed vs predicted the effect is there. CC of -0.38085757.
This is assuming very simple gaussian noise which is not going to be accurate especially as most of these tasks have normalised scores.
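For anyone who doesn't want to open a notebook, the comparison looks roughly like this (the noise level is an assumption; the exact coefficients depend on it):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100_000

    skill = rng.normal(0, 1, n)
    # Predicted and observed performance: independent noise around true skill.
    predicted = skill + rng.normal(0, 0.65, n)
    observed = skill + rng.normal(0, 0.65, n)

    gap = predicted - observed   # "overestimation"

    # Against true skill there is no relationship ...
    print("corr(skill, gap)    =", round(np.corrcoef(skill, gap)[0, 1], 3))
    # ... but against observed performance a negative one appears.
    print("corr(observed, gap) =", round(np.corrcoef(observed, gap)[0, 1], 3))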
What your simulation includes and the original article didn't (and I didn't touch at all in my article) is the statistical reliability of the tests they administered. Where you got a CC of -0.38 you used equal reliability (/ unreliability) of the skill tests and self-assessments. You can see that as you increase the test reliability, the CC shrinks and the effect disappears.
I have no idea what the actual reliability of the DK tests is; they do seem to consider it, but maybe not thoroughly enough. In my view it's very fair to criticize DK from that angle. But that would require looking at the actual tests and their data.
My point being, that any purely random analysis is based on assumptions that can easily be tweaked to show the same effect, the opposite effect, or no effect at all.
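A quick way to see this point, using the same toy setup and assumed noise levels as the sketch above: more reliable measurements, weaker spurious correlation.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 100_000
    skill = rng.normal(0, 1, n)

    # Sweep the measurement noise: as tests get more reliable, the spurious CC shrinks.
    for noise_sd in (1.0, 0.5, 0.25, 0.1):
        predicted = skill + rng.normal(0, noise_sd, n)
        observed = skill + rng.normal(0, noise_sd, n)
        cc = np.corrcoef(observed, predicted - observed)[0, 1]
        print(f"noise sd {noise_sd:4.2f}: corr(observed, gap) = {cc:+.2f}")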
That's a nice spot about the decreasing CC as we increase accuracy!
My hypothesis would be that some of the DK effect in the original paper may be down to an effect like this (as suggested in the original article), but asserting that the paper is completely incorrect because of it is premature. We'd need access to more data to verify that the level of reliability was sufficient.
Right. Just to be clear, "an effect like this" is (comparatively) unreliable tests, not some elusive statistical phenomenon as implied by the original article. I'd have no issue if the author had called the article "the DK effect is due to poor skill tests", spent 5 minutes showing that the DK results are consistent not only with their claims but also with unreliable tests (like you did), then went on to show data that indicates that the tests indeed are not reliable enough to draw the conclusions that DK did. Instead the author spends a lot of time digging under the wrong tree and no time at all saying anything about the reliability of the tests.
I agree, the article seems to imply the plot in the original paper will always be an incorrect thing to do, instead of something which can have some issues in cases when we have inaccurate tests.
I've gone back and updated the colab notebook to use orderings exclusively instead of values, and you can see that the autocorrelation plot B from the first article appears when the noise is high enough but disappears when you reduce it. Definitely not a statistical law.
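Again, not the actual notebook, but roughly what that rank-only check looks like, with a noise knob (all parameter values are mine):

  import numpy as np

  rng = np.random.default_rng(1)
  n = 10_000

  def dk_quartile_gap(noise_sd):
      skill = rng.normal(50, 15, n)
      observed = skill + rng.normal(0, noise_sd, n)
      predicted = skill + rng.normal(0, noise_sd, n)
      # work purely with orderings (percentile ranks)
      obs_pct = np.argsort(np.argsort(observed)) / n * 100
      pred_pct = np.argsort(np.argsort(predicted)) / n * 100
      quartile = obs_pct // 25
      # mean (perceived - actual) percentile within each observed-score quartile
      return [round(float(np.mean((pred_pct - obs_pct)[quartile == q])), 1) for q in range(4)]

  print(dk_quartile_gap(noise_sd=30))  # noisy tests: large positive gap in Q1, negative in Q4
  print(dk_quartile_gap(noise_sd=1))   # reliable tests: the gaps all but vanish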
Would it be possible to understand the results differently? It looks to me like the data could be explained by the participants moderating their self-assessment away from extremes, or perhaps towards the population mean, which is arguably not an unreasonable thing to do if your knowledge of the population mean is better than your knowledge of your own performance.
And this is why we need error bars on all plots. Looking at these plots there is no way to know whether people guessed uniformly or whether the self assessment is clustered around the mean.
Yeah I agree that's the likely explanation. Nobody wants to admit that they're terrible and nobody wants to boast that they're the best and be proven wrong.
So my suspicion is that the DK effect is not really a symptom of people's inability to accurately self-assess, but their unwillingness to accurately report that self-assessment.
And I don't think it is unique to self assessment either. It's common knowledge that ratings on a scale out of 10 for pretty much everything are nearly always between 6 and 9.
I don't know how they did the experiment but I bet they'd get different results if the self-assessments were anonymous and accuracy came with a big financial reward.
Anyway that's all irrelevant to the point of the article which I think is correct.
I think the main point of this post is correct -- just because you can find the effect in random noise doesn't mean it's not a real phenomenon that happens in real life. But it's missing a nuance there: if an effect can be replicated with random noise, then it's not a psychological effect (e.g. something that you would explain as a human bias), but a statistical effect. E.g. regression towards the mean is a real effect, but it's a statistical effect, not a psychological effect.
And that's the point the original article was trying to make ("The reason turns out to be embarrassingly simple: the Dunning-Kruger effect has nothing to do with human psychology. It is a statistical artifact — a stunning example of autocorrelation."), though that point does get lost a bit as it goes on.
> if an effect can be replicated with random noise, then it's not a psychological effect
This isn't true either. Statistical dependence does not determine or uniquely identify causal interpretation or system structure. See Judea Pearl's works (e.g. The Book of Why) for more on this.
People lacking the ability to self-assess is interesting psychologically. People can learn from experience in many other contexts. People can judge their relative position versus other people in many contexts. Why would they be so bad at this particular task? There could be a psychological underpinning.
Even if it turns out we have useless noise-emitting fluff in the place that would produce self-awareness of skill, that would be a psychological cause of a psychological effect. Not the ones that Dunning and Kruger believed they were seeing, but still.
Now, if you asked frogs for a self-assessment of skill, I would expect that data would not show any psychological effects.
> People lacking the ability to self-assess is interesting psychologically. People can learn from experience in many other contexts. People can judge their relative position versus other people in many contexts. Why would they be so bad at this particular task? There could be a psychological underpinning.
Is there an existing and relevant phenomenon about people lacking the ability to self-assess, that is true, proven, and not just trivia?
I do believe that people understand all the available information about their skills and performance, and they rate themselves according to it.
E.g. if they are asked whether they performed well on an IQ test against their classmates, they will produce noise (see e.g. the article "I can't let go of.."), and if they have the results of the IQ test, they will be able to correctly calculate which quartile they are in.
I have no idea. I think it would be a fascinating experiment to take people who have never taken an IQ test, ask them how they think they'll do, then compare that against their actual performance.
"Noise" isn't a generic concept that can be universally applied in the same way in any scientific context. That's the point of the article. It doesn't make much sense to assume that people's self-assessments come from an internal random number generator.
It makes sense as a robotically developed null hypothesis, but it doesn't make sense in the real world.
To riff on one of the author's previous comments, if height was uncorrelated with age for 0-20 year olds, that would be very surprising, and hopefully we wouldn't need to make posts saying "the fact 20 year olds are just as likely to be 1 ft tall as 1 year olds is not a physical phenomenon, it's a statistical effect."
As a novice on DK, it seems to me that, for DK to be 'surprising' (in the parlance of the OP), four phenomena must hold:
1) an incompetent person is poorer than average at self assessment of their skill
2) as a person's competence increases at a skill, their ability to self-assess improves, until they become 'expert' which is defined by underappreciating their own skill (or overappreciating the skill of others)
3) DK is surprising (interesting) only when some incompetent persons who suffer from DK cannot improve their performance, presumably because their poor self-assessment prevents their learning from experience or from others.
4) Worse yet, some persons suffering from DK cannot improve their performance in numerous skill areas, presumably because their poor self-assessment is caused by a broad cognitive deficit (e.g. political bias), preventing them from improving on multiple fronts (which are probably related in some thematic way).
If DK is selective to include only one or two skill areas, as in case 3, that is not especially surprising, since most of us have skill deficits that we never surmount (e.g. bad at math, bad at drawing, etc). DK becomes surprising only in case 4, when we claim there is a select group of persons who have broad learning deficits, presumably rooted in poor assessment of self AND others — to wit, they cannot recognize the difference between good performance and bad, in themselves or others. Presumably they prefer delusion (possibly rooted in politics or gangsterism) to their acknowledgement of enumerable and measurable characteristics that separate superior from inferior performance, and that reflect hard work leading to the mastery of subtle technique.
If case 4 is what makes DK surprising, then DK certainly is not described well by the label 'autocorrelation' — which seems only to describe the growth process of a caterpillar as it matures into a butterfly.
>it seems to me that, for DK to be 'surprising' (in the parlance of the OP), four phenomena must hold:
The surprising thing about DK, to me at any rate, is how unvarying it is in application. Under DK, people who are poor at something never think "wow, I really suck at this", or if they do, they are such a minuscule part of the population that we can discount them.
I've known lots of people who were not good at particular things and did not rate themselves as competent at it, although truth is they might have claimed competence if asked by someone they didn't want to be honest with.
There is an easy reframing that works well. People that are "poor" at something don't know enough to know just how good someone can be at something.
And this tracks for most skills. How good are you at tying your shoes? Probably average? Just how good can you get? Probably not that much better, all told. It is a clearly defined goal and likely has a limit on the skill you can build.
What about writing your name? Putting on your clothes? Making your bed? All things that are somewhat bound in just how good you can be.
Now, throw in something like "play the piano." Turns out, the expertise bar is much, much higher suddenly. But if you haven't been trying, how would you know?
If someone asked me how bad I am at playing the piano, having never tried, I would say I was totally incompetent, which I take to mean the worst possible. According to DK I should somehow be worse than that.
> And this tracks for most skills. How good are you at tying your shoes? Probably average?
In fact, this is pretty much what the Dunning-Kruger graphs look like. The article shows the one for humor, which has the bottom-quartile participants answer "eh, about average", while the top-quartile participants realize they're better than average, but estimate roughly 75th percentile rather than 87.5th percentile.
That is because DK does not say "incompetent people see themselves as pros". It says "they overestimate their abilities". They rate themselves low, but in fact their competence is even lower.
On a pure human level, a large portion of DK discourse seems to be a fight over which people are the "Unskilled and Unaware." Or more bluntly, who gets to call who stupid.
The author says as much in this article:
> Why so angry? [...] [Frankly], for the last few years, the world seems to be accelerating the rate at which it’s going crazy, and it feels to me a lot of that is related to people’s distrust in science (and statistics in particular). Something about the way the author conveniently swapped “purely random” with “null hypothesis” (when it’s inappropriate!) and happily went on to call the authors “unskilled and unaware of it”, and about the ease with which people jumped on to the “lies, damned lies, statistics” wagon but were very stubborn about getting off, got to me. Deeply. I couldn’t let this go.
> In their seminal paper, Dunning and Kruger are the ones broadcasting their (statistical) incompetence by conflating autocorrelation for a psychological effect. In this light, the paper’s title may still be appropriate. It’s just that it was the authors (not the test subjects) who were ‘unskilled and unaware of it’.
But on some level, the original paper sounds just as condescending and dismissive. It presents a scholarly and statistical framework for looking down on "the incompetent" (a phrase used four times in the original paper). In practice, most of the times I see the DK effect cited, it functions as a highbrow and socially acceptable way of calling someone else stupid, in not so many words.
Cards on the table, I've never liked DK discourse for this reason. It's always easy to imagine others as the "Unskilled and Unaware", and for this reason bringing DK into any discussion rarely generates much insight.
> it functions as a highbrow and socially acceptable way of calling someone else stupid
I think it's even worse than that: it's also a socially acceptable way of enforcing credentialism and looking down on others for not having a sufficiently elite education.
Even though I lack a medical degree, I have a high degree of confidence that the intestines of most people are not large enough to support that amount of faeces.
When I saw the graphs in the original article I immediately came to a different conclusion - that people with a given amount of skill have low confidence in their ability to gauge how skilled they are compared to an arbitrary group.
For example, if someone gave me (or you) a leetcode-style test, told me I'd be competing against a sample picked from the general population, and asked me how well I'd do, I'd probably rate myself near the top with high confidence.
Conversely, if my competitors were skilled competitive coders, I'd put myself near the bottom, again with high confidence.
Now, if I had to compete with a different group, say my college classmates or fellow engineers from a different department, I'd be in trouble. If I scored high, what does that mean? Maybe others scored even higher. Or if I couldn't solve half of the problems, maybe others could solve even fewer. The point is, I don't know.
In that case the reasonable approach for me would be to assume I'm in the 50th percentile, then adjust it a bit based on my feelings - which is basically what happened in this scenario, and would produce the exact same graph if everyone behaved like that.
No need to tell tall tales of humble prodigies and boastful incompetents.
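A rough sketch of that scenario, with purely illustrative numbers of my own (everyone anchors on the 50th percentile and nudges a little based on how they actually did):

  import numpy as np

  rng = np.random.default_rng(2)
  n = 10_000
  actual_pct = rng.uniform(0, 100, n)  # actual percentile rank
  # "I'm probably about average", plus a small adjustment and some noise
  reported_pct = 50 + 0.3 * (actual_pct - 50) + rng.normal(0, 10, n)

  for q in range(4):
      m = (actual_pct // 25) == q
      print(q + 1, round(actual_pct[m].mean(), 1), round(reported_pct[m].mean(), 1))
  # Bottom quartile: actual ~12.5, reported ~39 ("overestimation")
  # Top quartile:    actual ~87.5, reported ~61 ("underestimation")

Which is more or less the shape of the published figure, without anyone being delusional.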
> Again, my main point is that there’s nothing inherently flawed with the analysis and plots presented in the original paper.
I find the use of quartiles suspicious, personally. It's very nearly the ecological fallacy[1].
> I’m not going to start reviewing and comparing signal-to-noise ratios in Dunning-Kruger replications
DK has been under fire for a while now, nearly as long as the paper has existed[2]. At present, I am in the "effect may be real but is not well supported by the original paper" camp. If DK wanted to they could release the original data, or otherwise encourage a replication.
Agree. From the DK article graph it is not possible to separate the cases
1. Average self assessment coincides with true skill, but variance increases with low skill.
2. Average self assessment is biased, and the bias is positive when you are unskilled and negative when you're highly skilled.
These two situations would create indistinguishable DK graphs (a quick sketch of this is below). I don't understand how anyone can be sure of either (1) or (2) after seeing one instance of such a graph.
As I see it, the only way out for "DK positivists" is to say that the DK hypothesis is unrelated to the truth values of (1) and (2). Or, that there is other evidence making DK convincing.
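To illustrate that indistinguishability point, here is a toy comparison with hand-picked parameters of my own (nothing from the paper): one model is unbiased but much noisier at low skill, the other is low-noise but systematically pulled toward a high anchor.

  import numpy as np

  rng = np.random.default_rng(3)
  n = 200_000
  actual = rng.uniform(0, 100, n)  # actual percentile

  # Case 1: unbiased around actual skill, but far noisier for the unskilled,
  # and clipped to the 0-100 scale
  case1 = np.clip(actual + rng.normal(0, 1, n) * (100 - 0.8 * actual), 0, 100)

  # Case 2: low-noise self-assessment with a systematic pull toward ~60
  case2 = np.clip(60 + 0.3 * (actual - 60) + rng.normal(0, 5, n), 0, 100)

  for q in range(4):
      m = (actual // 25) == q
      print(q + 1, round(actual[m].mean(), 1),
            round(case1[m].mean(), 1), round(case2[m].mean(), 1))
  # Both give a flattened "perceived" line that crosses the rising "actual" line;
  # the quartile plot alone doesn't tell you which mechanism produced it.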
FWIW, the extreme-groups design (e.g. using only the upper and lower quartiles) is well understood to inflate effect sizes (there are even formulas to correct for this, given an extreme-groups design).
It's definitely related to ecological fallacy in the sense that both underestimate relative error and inflate effect sizes.
If others can't replicate it entirely on their own without "encouragement", then it isn't useful at all and the original experiment can be safely ignored as irrelevant to humanity, along with any "prestige" associated with it.
If you measure competence as relative performance, a person cannot know how competent they are compared to others... because to do that correctly, they would not only have to know how much they know but also know how much other people know... preferably in relation to them.
This is not possible, so the self-assessment data will be random because it is a random guess, and it does not correlate with actual performance or anything else for that matter. Hence, the DK effect has to be a result of faulty statistical analysis.
I believe we'd have completely different results if the question was framed differently: "how many do you believe you got right?". Then, more confident people, regardless of competence, would answer that they got more right and less confident people, again regardless of competence, would believe that they must have gotten more wrong than they did.
> If you tell me you didn’t have a single serious thought of self-assessing today, even semi-conscious, I simply won’t believe you.
I stopped reading at this point. Someone that is so certain that they say “I simply won’t believe you.” is too self-assured to be worth paying much attention to.
> I stopped reading at this point. Someone that is so certain that they say “I simply won’t believe you.” is too self-assured to be worth paying much attention to.
Actually it is even more ironic. You are too self-assured that a multi-page article is not worth paying attention to because of a single sentence in it that irritates you.
The author seems to go completely astray at some point.
> “Never assume dependence” gets so ingrained that people stubbornly hold on to the argument in the face of all the common sense I can conjure. If you still disagree that assuming dependence makes more sense in this case, I guess our worldviews are so different we can’t really have a meaningful discussion.
Hypothesis testing is concerned with minimization of Type I and Type II errors. In the Neyman-Pearson framework this calls for a specific choice of the null hypothesis. Of course, nothing prevents you from defining the sets for H0 and H1 as arbitrarily as you want, as long as you can mathematically justify your results.
It seems like the author fundamentally misunderstands the basics of statistics.
It bugs me that DK reached popular consciousness and gets misinterpreted and misused more often than not. For one, the paper shows a positive correlation between confidence and skill. The paper is very clearly leading the reader, starting with the title. The biggest problem with the paper is not the methodology nor the statistics; it's that the waxy prose comes to a conclusion that isn't directly supported by their own data. People who are unskilled and unaware of it is not the only explanation for what they measured, nor is it even particularly likely, since they didn't actually test anyone verifiably incompetent, or even suspected of being incompetent. They tested only Cornell undergrads volunteering for extra credit.
If DK is regression to the mean (a view I find convincing), that doesn't mean the effect isn't real; i.e. one would still observe that people of low ability overestimate their ability, simply because there is more "room" for overestimates than underestimates. And vice versa.
Put differently, if everyone's estimate was exactly the mean, you'd still see a "DK effect".
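To put rough numbers on that: if every single person reported the 50th percentile, the bottom quartile (whose actual mean is around the 12.5th percentile) would appear to overestimate by roughly 37.5 points, and the top quartile (around the 87.5th) would appear to underestimate by the same amount, which already gives you the crossed-lines picture without any differential self-awareness at all.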
I’m not sure I understand. If the effect shown in the paper is regression to the mean, then that does mean the paper doesn’t actually demonstrate what it claims to, right? I mean you can argue that the idea is still plausible, but this would mean that the paper doesn’t support the claim that low skill people overestimate themselves, right?
It’s also an interpretation to focus on unskilled people as the explanation. DK’s data shows the very same effect on highly skilled people. The people in the top quartile were just as bad at self-estimating as the bottom quartile, yet the paper claims only the unskilled people were unaware!
I recommend reading the DK paper. It didn’t test any people of low ability, and it did not evaluate skill in absolute terms. The sample size was tiny. The kids who participated were all earning extra credit in a class (it’s a self-selecting population that might have excluded both A students and F students.) The students were all Ivy League undergrads who might all overestimate their abilities precisely because they’re in a prestigious school and their parents told them they’re great. The paper didn’t test any actual low IQ population. The paper has methodology problems when it comes to non-native English speakers.
It absolutely blows my mind that the paper is held up as evidence for some kind of universal human trait with such minuscule and completely questionable evidence. I have no doubt that some people overestimate their abilities in some situations. Like you, I'm sure, I've witnessed that. But as a commentary on all of humanity, I'm becoming convinced that the so-called DK effect does not exist, that they didn't show what they claim to show. It doesn't help that many replication attempts have not only failed to replicate, but have ended up showing the opposite effect: that for many kinds of skilled activities, people underestimate their abilities.
I don't really understand the article. My understanding was that the mistake in the original DK paper was that the error bounds differ depending on the test score. A test score of 0 or 100 allows an error of up to 100 points in one direction, whereas a test score of 50 allows an error of at most 50 points either way. So if you take a group of people who score 0-25 points, even if their self-assessment is completely random you'd still see a bias toward overestimating their score, because people who would give themselves an even lower score, were that possible, are unable to.
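A minimal sketch of that bounded-scale point, under the (strong) assumption that self-assessments are uniform random guesses:

  import numpy as np

  rng = np.random.default_rng(4)
  n = 100_000
  score = rng.uniform(0, 100, n)  # actual test score
  guess = rng.uniform(0, 100, n)  # completely random self-assessment on the same 0-100 scale

  low = score < 25
  high = score > 75
  print(round((guess - score)[low].mean(), 1))   # about +37.5: low scorers "overestimate"
  print(round((guess - score)[high].mean(), 1))  # about -37.5: high scorers "underestimate"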
The charts make it clear that people's self-assessment was (roughly) independent of their skill level. It's not obvious that students' self-assessment would be mostly random / unrelated to skill level. For me that's a non-obvious result.
If people wander off through the verbiage of any article, where the chatter isn't supported by data, sure, they'll tend to get speculation.
I don't really understand what you're saying. Are you saying the charts don't actually make it clear, or that they make it clear that self-assessment is independent but not necessarily uniform?
The charts make it clear that self-assessment was roughly unrelated to ability. That's not an artifact of autocorrelation, instead it appears to be an experimental result.
Imagine in the Dunning-Kruger chart the second plot (perceived ability) was a horizontal line at 70, which is not true but not far off from the real results. Now imagine I told you "did you know that, regardless of their actual score, everyone thought they got a 70?" That's a surprising fact.
I think the most egregious thing about the original presentation is that it leads you to believe that people with a given skill level all self-assessed similarly. If you plotted the scores and self-assessments of each individual you would see that it's not "everyone [in the first quartile] thought [they were about average]", it's that their self-assessments varied wildly, from low and accurate to high and inaccurate.
> Most people have an above-average number of legs.
The arithmetic mean and the median are both averages, but the upthread comment was about the median and yours about the arithmetic mean.
> There's really no contradiction there; all it takes is for there to be a couple low scores pulling the average down.
Well, no, when what you are estimating is relative performance by score percentiles, and people's self evaluation is biased toward the 70th percentile, that's not what is happening.
It seems like the people who want to disprove Dunning-Kruger are falling victim to it.
I honestly think people take it way too seriously and apply it too generally. Quantifying "good" is hard if you don't know much about the field you're quantifying. Getting deep into a particular field is humbling -- Tetris seems relatively simple, but there are people who could fill a book with things _I_ don't know about it, despite playing at least a few hundred hours of it.
Is there an answer to that humility gained by being an expert in one field being translated to better self-assessment in other fields? I feel myself further appreciating the depth and complexity of fields I "wrote off" as trivial and uninteresting when I was younger as I get deeper into my own field (and see just how much deeper it is too).
> Is there an answer to that humility gained by being an expert in one field being translated to better self-assessment in other fields?
I think that often the opposite is true: people who become experts in one domain often assume that they are automatically experts in completely unrelated fields. I suspect that this is the cause of "Nobel disease": https://en.wikipedia.org/wiki/Nobel_disease
The open question this raises to me is why a DK=true set of data would show up with the same graph as a uniformly random set
What I'm really missing is a plot of the data without the aggregation. I find it very strange that X is broken down into quartiles but Y isn't. And when both are in quartiles, people estimated their skills relative to each other quite well: the line still goes up, and from bottom to top it would be a perfect X-to-X correlation.
Uniformly random data means that someone’s perception of their ability is uncorrelated with their actual ability, which is exactly what DK=true is saying!
In partial "defense" of the "autocorrelation" article, the author was in fact arguing against their own perceived definition of DK, not what most people consider to be DK. They just didn't realise it.
Which is an all too common thing to begin with. (that particular article pulled the same stunt with the definition of the word 'autocorrelation', after all).
I read about DK and I was absolutely convinced that the effect was real. Then I read the article about DK being mere autocorrelation and I came away absolutely convinced that DK was bullshit. Then I read this article and I'm absolutely convinced that the 'DK is autocorrelation' hypothesis is utter BS. Sigh. There are lies, damned lies and statistics... :-)
Consider taking a more Bayesian view of the world, especially with scientific papers. I informally tell the students I work with to look for a constellation of papers that offer supporting evidence from multiple perspectives.
Me too. I believe the effect contains a logical recursion that is impossible to escape from. Maybe the randomness variable in it? It looks as if all validations and refutations of it are always going to appear logical. I don't know what to call it or compare it with but it feels this must be documented as being a prime example of its category.
In the sense that people shouldn't let themselves be convinced by arguments they don't fully understand, that seems somewhat related to DK, in that people shouldn't be believing they are more competent than they are.
(That's not to criticize OP - when someone makes an argument that sounds convincing, it can be pretty convincing! It's just different than actually being valid.)
They immediately reached a conclusion about something upon reading about it, and now, as they learned more, they understand that there might be more nuance to it.
Thank you infinitely for taking the time to respond.
I don't have this luxury in my life right now but I admit after reading the "original" post almost a fourth time, I was really hoping someone would take the time to explain why/how the author could be completely wrong (or not).
Sounds like the premise is flawed. He's assuming kids are good at getting another 10 minutes before bedtime. All of them? What about those who fail? Those that don't even try?
The issue is not the way our brains generalize, but that you are using just one brain, one life's experience.
It can give us an indication of how the growth rate depends on size.
Except that what you've plotted there isn't the growth rate, but the absolute growth. Your argument for DK isn't convincing either; they claimed something much stronger than that we can't assess our own skills.
Question for you folks that are smarter than me (see what I did there?) - DK has surfaced a lot here and in the online world more broadly with seemingly increased frequency. Why do you think that is?
Science is in deep crisis. Its only utility today is supporting industry and some public infrastructure. Social sciences are a scam, with economics being the greatest racket amongst them all.
tldr; D+K's experiment was: Assign the numbers 1 thru 10 to ten people. Have each roll a 10-sided die. The person assigned a 1 will roll higher than his assigned number 90% of the time.
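Spelling out the arithmetic of that toy model: the person assigned k rolls strictly above k with probability (10 - k)/10 on a fair d10, so the person assigned 1 overshoots 90% of the time and the person assigned 10 never does, even though the rolls carry no information about the assigned numbers at all.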
Daniel:
>It’s not a “statistical artifact” - that will be your everyday experience living in such a world.
You can experience statistical effects. I think a lot of controversy comes from how Dunning and Kruger's paper leads people to interpret the data as hubris on the part of low-performers, and the statistical analysis demolishes that interpretation. Not knowing how well you performed is not the same thing psychologically as "overestimating" your performance.
Dunning-Kruger is precisely about the surprising result that people are bad at estimating their performance!
If you accept the 'D-K is autocorrelation' argument, you don't get to throw out the existence of the D-K effect: you are saying Dunning and Kruger failed to show that humans have any ability to estimate how skilled they are at all.
That seems like an even more radical position than the D-K thesis.
> Dunning-Kruger is precisely about the surprising result that people are bad at estimating their performance
Isn't DK about estimating your performance relative to the rest of the population? To do that, you need to not only know your own performance but also everyone else's. To me, guessing the performance of others sounds quite difficult.
The implicit hypothesis is that if a test is full of questions you have no idea how to answer, you really ought to have strong priors that you're a below-average performer. Tests are generally designed so that people familiar with the relevant material and methodologies can attempt answers; you don't need to know exactly how good other test takers are for it to be reasonable to assume you're in the bottom quartile if you can't attempt any. Same as you should have a lot less difficulty than most cyclists estimating whether your time trial was a good one relative to the rest of the field if you struggled to stay on the bike.
Of course, there are tests where the bottom quartile find the majority of it easy and have no particular reason to assume that most others found it even easier, and circumstances in which the weak undergrad who can only answer half the questions may reasonably believe that the test is being administered to a general population full of people who won't understand any of the material at all. But in general, it's reasonable to assume that if there's a lot of stuff you don't know, other people will know better.
The claim that skill does not exist or that people are totally unable to recognize how good they are at anything is quite radical.
You are sort of smuggling in the assumption for example that Olympian medalist lifters, when asked how much they can deadlift, will have the same distribution of answers as people who never deadlift (but are aware that totally sedentary men can probably deadlift like 200lbs and totally sedentary women can probably deadlift like 150lbs). If this were true, it would be worth publishing a paper about it.
It's sort of surprising to me to read your comment because TFA is an extended rebuttal of your comment.
> I think a lot of controversy comes from how Dunning and Kruger's paper leads people to interpret the data as hubris on the part of low-performers, and the statistical analysis demolishes that interpretation. Not knowing how well you performed is not the same thing psychologically as "overestimating" your performance.
D-K actually found that low performers were less accurate at assessing their skill than high performers, and the article you refer to obviously did not find this effect in random data, so I'm not sure how it was demolished.
> We don’t need statistics to learn about the world.
A sentence written by the author, commented on by me, and read by the HN community, all on devices which exist only thanks to 80-90 years of rigorous, statistics-based QA in engineering, especially mechanical/hardware engineering.
Anyhow, after spending years on a team filled with social science PhDs, I would not waste my time reading papers about statistical analysis done by social scientists.
> I don't think your interpreting the sentence correctly
And I think you are injecting the words "only", "there are" and "everything" here and there just to change the meaning of the sentences I quoted and I have written...
I feel like the author read the autocorrelation result, hated it, and ignored the central point. There are ways to bucket data that remove the autocorrelation, and in those experiments we also see the DK effect disappear. Trying to argue that we should study the effect with the autocorrelation present but ignore the autocorrelation for 'reasons' is not the way forward.
I feel like this article is severely over-complicating the analysis. Looking at the original blog post [1], their key claim appears to be that "random data produces the same curves as the DK effect, so the DK effect is a statistical artifact".
However, by "random data", the original blog means people and their self-assessments are completely independent! In fact, this is exactly what the DK effect is saying -- people are bad at self-evaluating [2]. (More precisely, poor performers overestimate their ability and high performers underestimate their ability.) In other words, the premise of the original blog post [1] is exactly the conclusion of DK!
Looking at the HN comments cited [3] by the current blog post, it appears that the main point of contention from other commenters was whether the DK effect means uncorrelated self-assessment or inversely correlated self-assessment. The DK data only supports the former, not the latter. I haven't looked at the original paper, but according to Wikipedia [2], the only claim being made appears to be the "uncorrelated" claim. (In fact, it is even weaker, since there is a slight positive correlation between performance and self-assessment.)
So, my conclusion would be that DK holds, but it does depend on what exactly the claim in the original DK paper is.
> I haven't looked at the original paper, but according to Wikipedia [2], the only claim being made appears to be the "uncorrelated" claim.
Is it that hard to actually check the original paper before bothering to make such a claim? The original paper explicitly claims to examine "why people tend to hold overly optimistic and miscalibrated views about themselves".
Yeah, the model is a simple linear model (which I've yet to see written down) with some correlation coefficient which is the unknown. Derive an estimator for that correlation coefficient, being explicit about the assumptions, then we can have a discussion. Until then it's all lots of noise. The raw data would help too.
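One plausible way to write such a model down (my guess at what is meant, not anything stated in the thread): reported percentile = a + b * (true percentile) + noise, where the DK-style claim amounts to a > 0 and b < 1 (a flattened line), and the "pure noise" reading amounts to b = 0. An explicit estimate of b, with its uncertainty, from the raw data would at least give the discussion something concrete to argue about.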
The "The Dunning-Kruger Effect is Autocorrelation" article is an example of obvious bullshit.
Their claim that "If we have been working with random numbers, how could we possibly have replicated the Dunning-Kruger effect?" is the first blatantly false statement, and then the rest is built upon that so it can be safely disregarded.
It's easy to see this because while the effect is present if everyone evaluates themselves randomly, it's not present if everyone accurately evaluates themselves, and these are both clearly possible states of the world a priori, so it's a testable hypothesis about the real world, contrary to the bizarre claim in the paper.
Also, the knowledge that the authors published that article provides evidence for the Dunning-Kruger effect being stronger than one would otherwise believe.
Your comment amounts to saying that some of the randomly generated data really is consistently overestimating its performance. How absurd.
Like similar analyses here, you don't factor in that DK is about bias. Of course you can't see bias when test score = self-assessment. That's because "IF everyone perfectly knows their score then there is no bias in their assessment" is a tautology.
That original article was bogus and needlessly combative. I feel like the majority view in the HN comments saw it as such.
Most comments were splitting hairs on what _exactly_ the Dunning-Kruger effect was, plus some general nerd-sniping on how the original article was off base.
IMO it was something that fell flat on its own rather than something that needed a lengthy refutation, but I can understand that sometimes these things get under your skin.
Just based on the graph just under the "The Dunning-Kruger Effect" section, one observation I'd like to present is that the subjects' numerical self-assessments fall into the same range as passing but non-stellar grades do in school. This may reflect a psychological bias in how the subjects use and understand percentages. Accordingly, that the two lines cross is a red herring.
The corollary of Dunning Kruger is that everyone is equally capable and equally capable of assessing their performance. This nicely suits the current social rhetoric but does not match observed reality.
Any discussion of statistics-based reasoning should include the concept of systematic bias, and that's not mentioned in this article at all. An example of systematic bias is a precise but miscalibrated thermometer, where the spread of measurements at a fixed temperature is small, but all measurements are off by some large amount.
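A toy contrast between the two kinds of error (illustrative numbers only):

  import numpy as np

  rng = np.random.default_rng(5)
  true_temp = 25.0
  n = 1_000

  noisy_unbiased = true_temp + rng.normal(0, 2.0, n)        # random error only
  precise_biased = true_temp + 3.5 + rng.normal(0, 0.1, n)  # tiny spread, constant offset

  print(round(noisy_unbiased.mean(), 2), round(noisy_unbiased.std(), 2))  # ~25.0, ~2.0
  print(round(precise_biased.mean(), 2), round(precise_biased.std(), 2))  # ~28.5, ~0.1
  # Averaging more readings shrinks the first kind of error but never the second.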
Now, with D-K, the proposed problem is statistical autocorrelation due to lack of independence, not systematic bias, as here:
> "Subtracting y – x seems fine, until we realize that we’re supposed to interpret this difference as a function of the horizontal axis. But the horizontal axis plots test score x. So we are (implicitly) asked to compare y – x to x"
Regardless, it's fairly obvious that D-K enthusiasts are of the opinion that a small group of expert technocrats should be trusted with all the important decisions, as the bulk of humanity doesn't know what's good for it. This is a fairly paternalistic and condescending notion (rather on full display during the Covid pandemic as well). Backing up this opinion with 'scientific studies' is the name of the game, right?
It does vaguely remind me of the whole Bell Curve controversy of years past... in that case, systematic bias was more of an issue:
> "The last time I checked, both the Protestants and the Catholics in Northern Ireland were white. And yet the Catholics, with their legacy of discrimination, grade out about 15 points lower on I.Q. tests. There are many similar examples."
I am reminded of something my very accomplished PI (in the field of earth system science) confided privately to me once... "Purely statistical arguments," she said, "are mostly bullshit..."
> Regardless, it's fairly obvious that D-K enthusiasts are of the opinion that a small group of expert technocrats should be trusted with all the important decisions
It seems like you're roughly the only person who thinks this.