Stop Using the Flesch-Kincaid Test (andreadallover.com)
55 points by polm23 14 days ago | 23 comments



It's time people realized that a great many scientific papers are written merely as a requirement to finish a master's or doctoral degree, or to meet the department's yearly publication quotas. I know very proficient writers who probably haven't written a single article worth reading.


While there is a great deal wrong with the quantitative output requirements of modern science, arguing that "a great many scientific papers are written merely as a requirement to finish a master's or doctoral degree" is somehow a bad thing is disingenuous.

For many, their PhD thesis is probably both the most laborious and the most scrutinized piece of work they will ever produce.


The idea of counting syllables for languages other than English reminded me of a fun personal experience.

When I was in high school, I was at a camp with some American Peace Corps students in my country. They were learning Romanian, since they had been spending some time here and were curious. At one point, one of them asked, 'How do you say "hug"?', to which we replied 'îmbrățișare' (uhm-bruh-tzee-shuh-reh, approximately). They were taken aback, and after a pause, jokingly quipped, 'Don't you have a shorter word? Like... hug?'.


"embrace" it sounds like


That's because Romanian is a Romance language, descended from Vulgar Latin after the Roman colonization of present-day Romania.

Both come (through intermediaries, at least in the case of English) from the Latin prefix "in-" and "bracchia" (arm).

Also compare to modern Italian "abbracciare".


Yes, that is the same root.

Just as a side note, 'brace' here (brațe in Romanian, braccia in Italian) is simply the word for arms (as in hands, not as in swords), so the word in Romance languages sounds essentially like 'enarm' would in English (without the weapon connotation).


English is remarkably monosyllabic, especially for nouns.


English in particular also tends toward shorter nouns (often deriving from its Germanic roots) for simple, everyday words, plus longer, more formal or precise synonyms derived from French or Latin. So there's a correlation between length and whether you have the 'plain' word or the 'formal, precise, intellectual' one. But that doesn't carry over to other languages inherently -- for instance, in Japanese the 'plain' word is usually derived from Old Japanese and is more multisyllabic, whereas the 'formal' word is usually a two-syllable Chinese loanword.


Expanding on this point: English is a Germanic language with a vocabulary that's 80% Romance (i.e., French, Latin) in origin, due to the Roman and then Norman occupations of England (though the 20% that's Germanic makes up the most commonly used vocabulary). Because the ruling classes for several centuries spoke Latin or French, education became synonymous with those languages, and English speakers borrowed heavily from them, leading to Germanic and Romance versions of the same words ("pee" vs. "micturate", "shit" vs. "defecate") where the latter is the more formal, highbrow version of the former.

A similar transition is happening with Japanese, where English, the dominant world language in the 20th century and second language for many Japanese, is having its vocabulary widely adopted into the Japanese language. For example, the more common word for "ball" in Japanese is now "bōru" (literally, "ball" pronounced with Japanese syllables), while "tama" refers to more traditional items.

French, interestingly, is consciously hostile to borrowing vocabulary from other languages, to the point of having the Académie française in France [0] (and, to a lesser extent, the Office québécois de la langue française in Quebec [1]) act as "the authority" on the language, prohibiting borrowed words and issuing directives on correct usage and new vocabulary.

[0] https://en.wikipedia.org/wiki/Acad%C3%A9mie_fran%C3%A7aise [1] https://en.wikipedia.org/wiki/Office_qu%C3%A9b%C3%A9cois_de_...


Just a note, but differences like 'pee' vs. 'micturate' are common in at least all European languages, even the Romance ones. The more formal words are often almost identical across languages, derived from a cleaner Latin with fewer phonetic transformations than the more vulgar words.

One of its best traits! It's just not that hard to say stuff in it; the words roll off the tongue!


The author's criticism that Flesch-Kincaid isn't suitable for non-English text is on point, but history hasn't been kind to critics of readability scores in general: http://www.impact-information.com/impactinfo/newsletter/smar...

The usual counter-arguments against their use in general ("but what about context!", "they're clearly too simple!") have been rehashed for years. Readability formulas are surprisingly robust, although obviously weak to adversarial input.


Not just adversarial input. FK simply uses word length as a proxy for frequency, which is obviously wrong: it makes "crocodiles" and "elephants" score as harder than "gybe" and "vaunt". It also doesn't acknowledge that embedded clauses are more difficult than a "sequence" of clauses, etc. In practice, it'll often point in roughly the right direction, but you need to take the outcome with a pinch of salt.
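
For a concrete feel, here's a minimal sketch of the F-K grade-level formula in Python (the syllable counter is a crude heuristic of my own; real implementations look words up in a pronunciation dictionary such as CMUdict):

    import re

    def count_syllables(word):
        # Crude heuristic: count runs of vowels, dropping a common
        # silent trailing 'e'. Real tools use a pronunciation
        # dictionary instead.
        word = word.lower()
        if word.endswith("e") and not word.endswith(("le", "ee")):
            word = word[:-1]
        return max(1, len(re.findall(r"[aeiouy]+", word)))

    def fk_grade(text):
        # Flesch-Kincaid grade level:
        #   0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (0.39 * len(words) / sentences
                + 11.8 * syllables / len(words) - 15.59)

    print(fk_grade("The crocodile watched the elephant."))  # ~12, "hard"
    print(fk_grade("We gybe and vaunt."))                   # < 0, "easy"

Short, rare words beat long, common ones every time, because the formula never sees frequency at all.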

Here's an almost random, older paper that discusses readability score differences between genres: http://csjarchive.cogsci.rpi.edu/Proceedings/2008/pdfs/p1978.... Their conclusion:

... the analyses confirmed that several frequently used approaches for measuring vocabulary difficulty tend to be structured such that resulting text difficulty estimates overstate the difficulty of informational texts while simultaneously understating the difficulty of literary texts. These results can be explained in terms of the higher proportion of “core” vocabulary words typically found in literary texts as opposed to informational texts.


To be fair, these measures are most useful when we are considering alternative ways to state a given set of straightforward facts. It is unlikely that one version will use 'crocodile' and another will use 'gybe'... which, on the other hand, is a good reason for regarding them as merely a first step in studying political speech.

I wonder what William Buckley would have thought of the claim that the language of conservatives is simpler than that of liberals?


Even then, many of these metrics will prefer "the rat the cat the dog bit chased escaped" over "the dog bit the cat that chased the rat that escaped". They are very, very rough measures, to be used only in case of emergency.
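
This is easy to check with an off-the-shelf implementation; here's a quick sketch using the third-party textstat package (pip install textstat -- my choice for the demo, not anything from the article):

    import textstat

    nested = "The rat the cat the dog bit chased escaped."
    flat = "The dog bit the cat that chased the rat that escaped."

    # The center-embedded sentence is shorter, so F-K rates it
    # easier, even though humans find it far harder to parse.
    print(textstat.flesch_kincaid_grade(nested))
    print(textstat.flesch_kincaid_grade(flat))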

> the language of conservatives is simpler than that of liberals?

Times have changed.


More importantly, the test measures complexity, not how 'smart' the text is. A higher score is not necessarily good. In fact, one should strive to express a given idea as simply as possible - from that perspective, a high score is bad. As programmers, this should be very clear to us.


The title is perhaps missing "... for spoken and/or non-English sources, preferably not at all".

If we should stop using this test, what should we start using? In the author's comment on the study, they noted "There are ways to study linguistic complexity".

I'm aware, for example, of this Python project, which provides F-K scores along with seven other readability metrics to consider: https://pypi.org/project/py-readability-metrics/
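
Usage looks roughly like this, going from the project's PyPI docs (it relies on NLTK's punkt tokenizer and wants at least 100 words of input; treat the exact attribute names as approximate across versions):

    import nltk
    nltk.download("punkt")  # sentence tokenizer used internally

    from readability import Readability

    # Needs 100+ words; repeating a pangram just for the demo.
    long_text = " ".join(
        ["The quick brown fox jumps over the lazy dog."] * 20)

    r = Readability(long_text)
    fk = r.flesch_kincaid()
    print(fk.score, fk.grade_level)

    # The same object also exposes r.flesch(), r.gunning_fog(),
    # r.coleman_liau(), r.dale_chall(), r.ari(), r.linsear_write(),
    # r.smog(), and r.spache().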


I read the headline and said "I don't even know what a Flesch-Kincaid test is." Then I read the article, and realized that I actually do use this test every day. TIL

The Boomerang extension for Gmail applies a test, in the free version, to give you a readability grade level. It always scores me at 12+, and I try to do better (a lower score), because I know that nobody will read my emails, and I will have wasted my time, unless I make them short and to the point.

And now I know the name of the test. Cool.


> Liberals lecture, conservatives communicate: Analyzing complexity and ideology in 381,609 political speeches

At what point does it become irresponsible to put out studies with this kind of title?

All this does is create two warring factions: one side that is bolstered by the claims and feels superior, and another that feels attacked and loses some faith in the institution behind studies like this.

Furthermore, these studies are rarely conclusive. They take a topic like understanding and interpreting linguistics (an already incredibly difficult field of study) and boil it down to a dinner-table quip.

But the damage is done. This clickbait will be freely shared on social media and consumed by unwitting participants.


So much popular science reporting is abysmal that I have to flip it around: assume that the study is going to be misreported and just accept that as a given. I don't want to let that certainty factor into the choice of what research gets done.

Politics is hard to study, but it's important. If the results are valid, people will apply them, and that has a direct effect on people's lives. The data may be messy and the theory doubtful, but if there's any measure of merit to it, it can gain a small advantage that can be the critical difference. And hopefully, each subsequent study builds evidence for a paradigm shift that makes the theory less dubious over time.


Imagine applying F-K to code. Since it's based around syllables, we'd discover that Perl5 code I write on the command-line is some of the most readable code out there.


I feel like the author of this criticism is screaming into the void. The study is probably flawed, and has likely not proven beyond doubt any of the hypotheses it puts forward. But while FK may not be a great test of linguistic complexity, it is a test.

To truly invalidate the study, the author needs a smoking gun - for example:

- Take a couple of the source texts that score wildly far apart

- Rework the punctuation such that an objective reader still agrees that the transcription is a valid representation of the original speech

- Show that the readability scores of the re-punctuated texts are inverted - or, at the very least, fall within a small percentage of each other.

It's a bit like trying to measure the happiness of a city by counting the percentage of smiling people as they walk down the high street. There are many social reasons a person might smile from the depths of misery, different people might classify a smile differently, and different cultures might view walking down the street with a grin as either frowned upon or mandated. Weather, time of day, month of year, public holidays - etc - all will affect the results.

Even if you work hard to eliminate these biases, you still couldn't use such a measure to test whether Beijing is happier than Belfast, but you might use it as an indicator of whether left-handed people are happier than right-handed people. You can't tell whether men are happier than women (social pressures will influence facial indicators along gender lines from an early age in many cultures), but you might be able to begin a conversation on whether conservatives are happier than liberals. It won't be conclusive proof, but as a starting point for future research, it still holds some validity.

My point is that nothing in this criticism conclusively proves that this particular measure is unsuited for measuring against the stated outcomes. Even a measure that is the current darling of the linguistics community will have its drawbacks, and the author needs to prove that the weaknesses in the measure present a bias in and of themselves, not just that they are generally deficient.

To my mind, the better criticism is intent. A person can - and should - craft a speech based on the audience. A speech given at a university or research facility should have a different complexity score than one given at a rally or disaster zone. The interesting question to my mind is whether more effort is put into this by conservatives or liberals, and then comparing that to those in power vs their contenders.

But then, that paper wouldn't have such a catchy title, would it?


The smoking gun you're looking for is in the Language Log posts linked from the blog post. In particular, https://languagelog.ldc.upenn.edu/nll/?p=21847 shows how trivial changes to punctuation can inflate a paragraph's score from 4.4 to 12.5.
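
To make that concrete, here's a trivial re-punctuation demo (again a sketch with the third-party textstat package; the exact numbers won't match the Language Log example, but the direction is the point):

    import textstat

    as_spoken = ("We went to the park and we played on the swings and "
                 "then we got ice cream and then we went home.")
    repunctuated = ("We went to the park. We played on the swings. "
                    "Then we got ice cream. Then we went home.")

    # Same words; only the punctuation differs. The run-on version
    # scores several grade levels above the four short sentences.
    print(textstat.flesch_kincaid_grade(as_spoken))
    print(textstat.flesch_kincaid_grade(repunctuated))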


