Seems like OP should have considered a popularity weighting. Among the 25 weirdest we see Mandarin and Spanish, the world's two most popular languages by native speaker count. That's a hint that what you're measuring isn't exactly weirdness.
Meanwhile I see Hungarian, famous for being the hardest language to learn, is fifth least weird. Even stranger, Cantonese, which is almost exactly the same in writing as 'weird' Mandarin, is sixth least weird. How can two languages that you write the same be so very far apart in feature set?
I think 'weirdness' in this context should not be conflated with 'norm'. Instead, I read it the same way you would think about the regularity of a language (to put it in compsci terms).
I speak both Mandarin and Cantonese. I think a lot of the 'weirdness' factor comes from actually saying the words. Also, there are some words in Cantonese that are not in Mandarin. For example, in Cantonese there is 'mou' (the negation of having), while in Mandarin it would be pronounced as two words ('mei2 you3'). Spoken Cantonese is, in some regards, easier to pick up than spoken Mandarin (definitely easier to cuss in).
I wouldn't say all English speakers - someone saying "gonna" sounds distinctly American to me. Around here it's something like "gonnae" (which sounds more different than the small difference in spelling would suggest). Common usage:
As I understand it, dialects which somehow qualify as languages are weighted more than English and all its dialects? Definitions of languages tend to be quite arbitrary or even politically motivated at times; look no further than the Balkans or Moldovan.
This ceased to be when Moldova declared its independence in the early nineties, reverting the official language to the Latin alphabet.
Saying that Moldavian is a dialect of Romanian is sort of an overstatement, because besides a couple of regional words and phrases imported from Russian, it's the same as literary Romanian in writing and when spoken we have the same accent used in the region of Romania that borders Moldova (and that's bigger than Moldova itself, we even call that region Moldova and when we refer to Moldova, the country, we're calling it Basarabia).
Romanians are also the largest population group living in Moldova. This country is basically a part of Romania that was taken from us. Unfortunately they don't want to unite with us, because many Romanians there forgot that they are Romanians or consider themselves to be Moldovans. We also help them whenever we can and we also represent their door to EU, since obtaining Romanian citizenship is easy for Moldovans. Unfortunately having them close to our hearts went without returned favours. But whatever, if they want to be Moldavian-speaking Moldovans, living in Moldova, to each his own.
As a Finn, I have to point out that just because Finnish and Hungarian are related, it doesn't mean it is easy to learn, or in any way "not weird" to us. Some basic things about the grammar are the same, but you wouldn't be able to tell that they're related without a higher degree in linguistics. I think Hungarian is usually considered very difficult to learn precisely because it doesn't resemble any other language. Nothing you know already will make it (noticeably) easier to learn.
Seems like they are just counting the weird features of languages. It is correct that a language like Turkish (probably Hungarian as well) is least weird because it does not have any of the weird features like word gender, irregular verbs, prefixes etc. But two main features of the language, extreme inflection and vowel harmony, make learning it difficult.
I suspect that Turkish scores low on the weird score because it was reformed in the 20th century. Prior to its reformation it was mostly only a spoken language, occasionally shoehorned into a more ornate language's Arabic script (called Osmanci - pronounced Oss-mann-juh).
Turkish was the language the lower and middle Turkish classes typically used during the Ottoman period. When the language reforms came in not long after the establishment of the modern Turkish republic, they were specifically designed with the idea of unifying the people by language and being easy to learn to read and write, which is how Turkey went from a literacy rate of 30% to 70% in just a few years.
Turkish is weird to a Western European, but once you get past the initial weirdness it is probably the most regular and structured language you'll ever learn. Which you'll find exceptionally weird if you've ever tried to queue for a ferry ticket in Turkey (which is not regular or structured at all).
It would be cool if the author made the dataset available. It would be fun to try other things with it: population weighting (as WildUtah mentions), grouping the languages by family and calculating intra- and inter-group weirdness (and distances), clustering the languages into new groups, calculating weirdness as the mean distance in the 21-dimensional space to all other languages, projecting the space itself onto a plane so that we can see it better, ...
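To make the "mean distance to all other languages" idea concrete, here is a minimal sketch. It is not the author's method: the toy languages and feature values are made up, only three features are used instead of 21, and Hamming distance (the fraction of features on which two languages disagree) is my own assumption for how to measure distance between categorical features.

```python
# Hypothetical toy data: each language is a tuple of categorical feature values.
# The real dataset reportedly has 21 features; three keep the sketch short.
LANGS = {
    "A": ("SVO", "tones", "no_gender"),
    "B": ("SVO", "no_tones", "no_gender"),
    "C": ("SOV", "no_tones", "no_gender"),
    "D": ("VSO", "tones", "gender"),
}


def hamming(x, y):
    """Fraction of features on which two languages disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def mean_distance_weirdness(langs):
    """Weirdness of a language = mean Hamming distance to every other language."""
    scores = {}
    for name, feats in langs.items():
        others = [f for n, f in langs.items() if n != name]
        scores[name] = sum(hamming(feats, o) for o in others) / len(others)
    return scores


scores = mean_distance_weirdness(LANGS)
# Languages ordered from weirdest to least weird.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

With the real data you would swap LANGS for the 239 languages and their 21 features; the same pairwise distances could also feed the clustering and projection ideas.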
That's the thing, he reduced the dataset from 2677 languages with 192 features to 239 languages with 21 features. That's usually the hardest part of any analysis, so it would be nice if he shared the results of his hard work with us...
I don't know where to start on this page's absurd notion that Cantonese is a non-"weird" language, but that Mandarin is weird. This must be some academic nonsense based on sounds and disregarding other critical language features.
Let's put aside the fact that Cantonese has among the most complex pronunciation systems in the world, with seven tones and both long and short versions of a number of sounds. This is a language which has four different communication modes. Here's how weird Cantonese is.
1. Cantonese speakers speak in Cantonese.
2. Cantonese speakers read and write in a totally different language, namely Modern Chinese, which for all intents and purposes is Mandarin.
3. Cantonese speakers read out loud in Modern Chinese (Mandarin), but pronounce each of the characters with radically different Cantonese sounds.
4. For purposes of comic book dialogue, etc., it is possible to read and write some Cantonese using various co-opted Chinese characters. But you can't pronounce all of Cantonese this way: many words have no written form whatsoever. This has resulted in a bizarre pidgin written form. For example, one very common word ("di1" -- "a few") is actually usually written as "D" rather than as a character. Other characters are impossible to write in current fonts, or are also used in Modern Chinese but for different words than in Cantonese, and so you see Latin letters like "o" and "a" next to them to suggest a different meaning.
I think we should have expected that Mandarin is weird but Cantonese is very normal - in the same way that Japan has a weird primary writing system, but its secondary writing system (Hiragana) is one of the most regular in the world. In fact the same reasoning could apply to Hindi - it's (or was until recently) a secondary language in India, with English as the language of government. Do other countries with two languages follow the same pattern? E.g. I would predict from this that Afrikaans would be a very non-weird language.
It'd be cool if they could include artificial languages like Esperanto and Lojban. Given that one goal of both languages is to appeal to speakers of any language, it would be interesting to see if they achieved their goal (i.e. produced a very "non-weird" language).
I see this was posted overnight in my time zone. Several of the earlier comments correctly point out that empirically, a language that has been acquired by many second-language speakers (for example, English) must not strike too many people as unlearnably "weird." Many widely spoken languages have undergone a process that linguists call "koineization" (after the spread of Koine Greek as a common language of the ancient eastern Mediterranean and Near East) in which the language simplifies some grammatical (and possibly phonological) features as it is spoken by more second-language speakers for trade or for use as a language of national administration in a multilingual region.
The United States is largely an English-speaking country, but only about one-fourth of Americans have ancestors who spoke English before arrival in North America. (Indeed, only one of my four grandparents, all of whom were born in the United States, grew up in an English-speaking household.) In other words, General American English is a koine language of second-language learners of English, so it is not surprising that it is spreading all over the world.
P.S. Feel free to visit my user profile here on HN to see more about my background in linguistics and language learning and teaching.
AFTER EDIT: Cantonese versus Mandarin as "dialects" or "languages" were mentioned in other comments. Cantonese is at least as different from Modern Standard Chinese (Mandarin) as German is from English. How you might write the conversation
"Does he know how to speak Mandarin?"
"No, he doesn't."
in Modern Standard Chinese characters contrasts with how you would write
"Does he know how to speak Cantonese?"
"No, he doesn't."
in the Chinese characters used to write Cantonese. As will readily appear even to readers who don't know Chinese characters, many more words than "Mandarin" and "Cantonese" differ between those sentences in Chinese characters.
I read something about how English was simplified as a result of the Viking invasion of England in the Middle Ages. It sounds like koineization. English might be weird, but it seems to be weird in a way that makes it highly exportable, like a successful product.
There are a few potential problems which limit the significance of this:
1. There is not a universal definition of what defines a language and what is simply a dialect or a regional variation; this applies especially on large continents where there can be greater variation in language features between geographically remote locations, but no clean boundary at which you can say people speak one or the other language.
2. Languages evolve, diverge and sometimes borrow, and so a group of related languages can share the same potentially idiosyncratic feature because of common evolutionary roots rather than because the feature makes sense. This could explain the result for Hindi - it is a standard language that 'averages' a large number of other Indian languages.
I would call these more "caveats" than "problems" -- anything with a title like "weirdest languages" is going to be incredibly subjective, and there will no doubt be no shortage of people who disagree with specific choices, but as long as the reasoning is clear it can still be interesting/useful.
> This could explain the result for Hindi - it is a standard language that 'averages' a large number of other Indian languages.
I think this is a little misleading; a lot of Indian languages come from a completely different language family to Hindi (Hindi is Indo-European, but a lot of Indian languages are Dravidian, e.g. Tamil and Urdu), though I'm not qualified to speak about this.
Could well be true! I'm not qualified to speak about that. Looking it up, seems like I mis-spoke when I said Urdu was Dravidian. But there are a lot of Indian languages that are (e.g. Kannada), so the overall point stands that the "averaging" comment is a bit misleading.
The main problem with this analysis - I think - is that each language counts as 1, regardless of its number of speakers and its history. There is a reason why some languages are much larger than others (mostly historical and political), and these larger languages (like English) have thus evolved towards more simplified versions of their former selves.
English is thus - for all intents and purposes - not weird.
A more interesting approach would be to take larger languages, like English, Mandarin, Spanish, etc. and value their features higher than languages spoken by tribes or very few people, and thus you could determine a more accurate 'weirdness' index out of that.
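A rough sketch of what that weighting could look like, under loud assumptions: weirdness here is one minus the average share of languages (or of speakers) that have the same value for each feature, which is my simplification and not necessarily how the original index was computed. The language names, speaker counts, and feature values below are all made up.

```python
from collections import defaultdict

# Hypothetical toy data: (speakers in millions, feature values).
LANGS = {
    "Big":    (900, ("SVO", "tones")),
    "Medium": (100, ("SOV", "no_tones")),
    "Small1": (1,   ("SOV", "no_tones")),
    "Small2": (1,   ("SOV", "no_tones")),
}


def weirdness(langs, weighted):
    """1 - mean share of languages (or speakers) sharing each feature value."""
    n_features = len(next(iter(langs.values()))[1])
    # share[i][value] = total weight of the languages with `value` for feature i
    share = [defaultdict(float) for _ in range(n_features)]
    total = 0.0
    for speakers, feats in langs.values():
        w = speakers if weighted else 1.0  # every language counts as 1 if unweighted
        total += w
        for i, value in enumerate(feats):
            share[i][value] += w
    scores = {}
    for name, (_, feats) in langs.items():
        commonness = sum(share[i][v] / total for i, v in enumerate(feats)) / n_features
        scores[name] = 1.0 - commonness
    return scores


by_count = weirdness(LANGS, weighted=False)
by_speakers = weirdness(LANGS, weighted=True)
print(by_count)
print(by_speakers)
```

In this toy set "Big" is the odd one out when every language counts as 1, but becomes the least weird once speakers are counted, which is exactly the flip described above.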
One possible theory is that it could be related to how the data set deals with (or fails to deal with) Book-Norwegian vs New-Norwegian. Also, modern Swedish and Danish grammar only uses two genders while Norwegian still has three, so that could weigh in.
Actually, Swedish and Danish ARE very weird, but they didn’t make the cut-off of “14 or more of the 21 features attested”. Swedish has 12 of the 21 features listed in WALS, Danish has 13. Both of them are actually weirder than Norwegian (15/21 features):