Hacker News
What's the fastest language in the world? (atlasobscura.com)
55 points by ColinWright 61 days ago | 61 comments



Article is new but paper mentioned within is from 2019, predating LLM revolution. From abstract:

>We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate.

Quite interesting result. Curious how constructed language Ithkuil, designed to convey information in highly compressed manner, compares.


I would hypothesize that Ithkuil would be spoken fairly slowly similar to Cantonese and so yield a similar bit rate in line with the overall results for natural languages.

That said, nobody speaks Ithkuil fluently enough to test this so maybe I'll be proven wrong some day!


> predating LLM revolution

So? It's not about LLMs?


Maybe they were referring to tokenization? Tokens per sentence might be a useful metric although imperfect.


Current LLMs are biased toward English: assuming a model emits tokens at a constant rate and the paper's result holds (the same ~39 bits/s across languages), the model delivers different bits per second in different languages. That's because tokens, as LLMs use them now, do not reflect information density across languages (tokenization may degrade to one token per Unicode code point for non-English text).
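A rough way to see the worst case, assuming a tokenizer that falls back to one token per UTF-8 byte (that fallback is an assumption for illustration; real tokenizers merge many of these bytes, and the sample words are arbitrary):

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


# Arbitrary sample words, each meaning "cloud".
samples = {"English": "cloud", "Russian": "облако", "Japanese": "雲"}

for language, word in samples.items():
    # Under the byte-fallback assumption this is also tokens per character.
    print(language, utf8_bytes_per_char(word))
```

Under that assumption, the same word costs one token per character in ASCII, two in Cyrillic, and three for a CJK character.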


The most interesting takeaway from this article for me was that there's an inverse correlation between the number of syllables spoken per second and the bits of information conveyed in that time.

> Japanese, for example, has an extremely high number of syllables spoken per second. But Japanese also has an extremely low degree of complexity in its syllables, and much less information encoded per syllable.

It seems like our brains might only be capable of processing ~39 bits of spoken information per second. Now I want to see a comparison of the information throughput of other forms of communication!
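As a sketch of the trade-off, information rate is just syllable rate times information per syllable. The numbers below are made up for illustration and are not taken from the paper:

```python
def information_rate(syllables_per_second, bits_per_syllable):
    """Information rate in bits/s: speech rate times information density."""
    return syllables_per_second * bits_per_syllable


# Hypothetical languages: one fast with simple syllables, one slow with dense ones.
fast_but_sparse = information_rate(syllables_per_second=8.0, bits_per_syllable=5.0)
slow_but_dense = information_rate(syllables_per_second=5.0, bits_per_syllable=8.0)

print(fast_but_sparse, slow_but_dense)
```

Very different speech rates can land on the same bits per second as long as density moves the other way.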


Maybe we can process more than 39 bps, since people will often increase the speed on podcasts. Perhaps the challenge is in transmitting.


(Good) podcasts are not recorded at anything approaching the normal limits of understanding. They're paced for good conversation or storytelling and know that the vast majority of their listeners are multitasking, not giving them their full attention.


I'd say less transmitting, and more preparation of the message.

"Please move" or "Get the fuck out of the goddamn way" both communicate the same information, one a bit more colorfully.

Establishing and maintaining context, desired action and desired outcome take (well, me anyway) a substantial amount of time. Partly (for me) figuring out what the desired outcome actually is, and partly encoding that in a way that will be well received.


Yeah, the sender side is probably the main bottleneck. Just consider how often people speak filler phonemes like "uhm": it's so common that you might not notice unless you're looking for them. They are basically placeholders in the data stream, indicating that it isn't over yet but there's a delay producing the next item.

In contrast, consider a listener who is equally focused and invested as the producer: They don't often indicate that their own buffer is full or request a repeat. While you may say "hold up" or "run that by me again", it's usually for reasons other than word-rate. (For example, to prompt the producer to try another encoding, to express disbelief or contempt, etc.)


> It seems like our brains might only be capable of processing ~39 bits of spoken information per second.

I'm quite sure this isn't true, since I can listen to even fast English speakers at 2X speed and still understand them. Although that's right up against my limit, presuming the speaker is already on the fast side of normal.

I would say, rather, that the bottleneck is the human ability to synthesize and speak the message they intend to convey.


This article is terrible. Here's a link to the actual paper. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6984970/pdf/aaw...

Figure 1 on page 3 is incredible. I'm so annoyed the article didn't include it. It cleanly visualizes how different language families have wide ranging syllable rates but remarkably similar information rates. Honestly the entire article could be replaced with just this image.

If you're really lazy here is said figure on imgur: https://i.imgur.com/QUbhBKq.png


Awful and sad

> There’s a technical meaning devised by a guy named Claude Shannon that involves, basically, how quickly a listener can reduce their uncertainty about the message they’re getting. This involves calculations of the number of possible syllables in a language, the relative popularity of each of those syllables, and the probability that a certain syllable will follow another. All the Shannon stuff is kind of abstract and involves a lot of math that, frankly, made my head hurt.
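For anyone whose head hurts less with code: a minimal sketch of the two quantities the quote gestures at, estimated from raw counts over a syllable sequence. This is an illustration of the textbook definitions, not the paper's exact estimator, and the function names are mine:

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(syllables):
    """H(next | previous) = H(previous, next) - H(previous), in bits."""
    pairs = Counter(zip(syllables, syllables[1:]))
    previous = Counter(syllables[:-1])
    return entropy(pairs) - entropy(previous)


# Toy sequence: two equally popular syllables that strictly alternate.
toy = ["ba", "na", "ba", "na", "ba", "na"]
print(entropy(Counter(toy)))        # uncertainty ignoring context
print(conditional_entropy(toy))     # uncertainty once you know the previous syllable
```

Relative popularity sets the first number; how predictable each syllable is from the one before it sets the second.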


Wait, so the IR chart shows that Italian with all its bippity-boppity conveys fewer bits of information per second than languages with silent final e like French and English?

nuvola, nuage, clowde


This is about spoken language, not written, so silent letters or pathological spelling (qu'est-ce que c'est?) don't matter.

FWIW, the information density of written languages has also been studied, and unsurprisingly Chinese comes out on top because it has 4000+ characters to express concepts with.


Whether or not a letter is silent literally only matters when speaking/listening. Back in the day, writing was a reflection of what was spoken but in the past few centuries of orthography reform and standardization the written form has become canonical. By silent e I mean to refer to the lack of an additional syllable after the last consonant e.g. faro/loin/far, davanti/devant/before, via/voie/weye

As for Chinese, I don't imagine they measured information density per brushstroke

𠔻𠔻𠔻


Per brushstroke is interesting but per unit area in the average font size is also key.


QR codes must have Mandarin beat


There is a hidden speedup in Japanese in that words are kept short by reusing them for many meanings: it has an astonishing number of homonyms. The words are distinguished in spellings (like different combinations of kanji and kana) but to a much lesser extent in spoken language (all you have is pitch accent; but there are many homonyms that are the same down to pitch accent).

Japanese printed dictionaries are ordered by the kana expansion of a word, so you can see the homonyms together.

Pretty much on every page of a Japanese dictionary, you can spot some runs of homonyms or near homonyms.

Sometimes it's like clusters of 9 words or even more. (Of course, they're not equally frequent: maybe only two of those will be in frequent use.)

Also, informal Japanese will be even faster, by dropping various repeated formalities that add syllables in every sentence.


I'd like to see metrics for Turkish. I was told by a native speaker that it developed over the years for efficient military communication purposes and I've learned it as a foreign language for a couple of years.

There's an unusual level of modularity, concatenation and configurability to the base verbs. For instance "we were not able to buy" can be expressed as a single word (I think it's alamadık but someone correct me)


> I’d like to see metrics for Turkish.

Turkish is in figure 1; its blob is on the lower end.

> developed over the years for efficient military communication purposes

A crude statement of this view would be pseudohistory. It’s common enough for registers of language to develop (and even diverge) in particular social circumstances, and so it wouldn’t be wholly surprising if a military register of Turkish were to have emerged. But there have been lots of other influences on the historical development of Turkish: for example, Arabic and Persian influence, especially in literary Ottoman Turkish; the natural evolution of the vernacular during the Ottoman period; and language reform under the Türk Dil Kurumu to remove Arabic and Persian vocabulary. Any plausible claim of this kind would have to be about the relative influence of military registers.

It’s also not obvious whether the needs of military communication would be particularly different from e.g. industrial communication, or whether that would prioritise speed over e.g. accuracy gained from redundancy.


I would say that Turkish is just "normal". Very typical. Often used alongside Japanese or Cree in linguistics textbooks as an example of how a "typical" language works. Fully regular affixes (what you call modular concatenation is known as agglutination in linguistics). And it has a heavy focus on verb modality.

Japanese and Mongolian grammars have many surprising similarities to Turkish (all those languages are unrelated as far as anyone knows). "I was made to read it repeatedly" can be one word, a verb with the right suffixes.

The Indo-European languages with their irregular suffixes and inflections are actually unusual. Many languages have no irregular verbs, for example. Most that have irregular verbs only have a couple. For some inexplicable reason, many Indo-European languages have dozens and dozens. This is a major outlier.


Unrelated? Japanese derived from Turk languages, sorry.

And the whole study doesn't take into account the speed of spoken languages, or where and how the vowels and consonants are formed. Guttural consonants need much more time to form than labial ones. We had that simplification (i.e. speedup) with the German Second Lautverschiebung (and elsewhere), and then we can see how English bypassed Bavarian German in terms of simplicity and speed. Or especially Modern Greek, which consists mostly of labially formed sounds, just to be able to speak faster. Greek is missing, which is highly suspect, because every linguist should know that Greek should be the fastest spoken language. No slow guttural sounds.


> Unrelated? Japanese derived from Turk languages, sorry.

Yes, unrelated. Please don't perpetuate the now widely discredited Altaic hypothesis. If you must bring it up, at least acknowledge that it's highly controversial.


The density vs syllabic speed concept is pretty widely known, but I'd be interested in analyses of accents inside languages. Why are some accents much slower than others, and how is that not a serious disadvantage? I have firsthand knowledge of the fact that when I, as an American Southerner, try to speak as quickly as a Northerner my mouth falls to pieces.


Article notes that info transmission is roughly the same for all languages. I wonder if this is the case for all subgroups? Do New Yorkers convey information more quickly than Vermonters?


It's evolution. If something is better. It will take over.

Thus it's not surprising that all languages have evolved to about the same speed. None is "better". They all convey information at the same rate. That's why there is no pressure to speak just "the best one".

Like you, to my ear, American English seems to be spoken at different rates regionally. I don't know how to think about that...

I tried learning Danish. It's not like French where you don't speak the ends of many words. There, you just say only the most important syllables in the sentence, usually "deh" or "leh" (which are pronounced the same). Respect the Danes! Their language is impenetrable!


> It's evolution. If something is better. It will take over.

That's not how evolution works, actually. How it really works is: “If it's better at spreading itself then it will take over”, but that says nothing about the intrinsic quality of the trait. And for memes (literally speaking, as described in the last chapter of Dawkins' Selfish Gene) it means being cool enough for people to copy, no matter whether it's actually more efficient at conveying information or anything.


Roughly is the operative word there. It's a ballpark figure, like within an order of magnitude. It's hard to quantify in the first place.

We can think of a language like a digital modulation, with the set of sounds being a symbol constellation, and the syllable time, as the baud rate. When you do such numbers, most (all?) languages fall in a fairly narrow window (something like 30 - 50 bits of actual information per second at average speaking speed).

If a language has fewer possible sounds, the syllables come more quickly. This is an innate tradeoff with a communication channel in information theory.
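To put numbers on the modem picture, here is a sketch of the textbook upper bound: with M equally likely symbols, each symbol carries log2(M) bits, so the raw rate is baud times log2(M). The inventory sizes and syllable rates are invented for illustration, and real syllables are nowhere near equally likely:

```python
import math


def raw_bit_rate(symbols_per_second, constellation_size):
    """Upper bound in bits/s for equiprobable symbols: baud * log2(M)."""
    return symbols_per_second * math.log2(constellation_size)


# Hypothetical languages: small syllable inventory spoken fast,
# versus large inventory spoken slowly.
print(raw_bit_rate(symbols_per_second=8, constellation_size=32))
print(raw_bit_rate(symbols_per_second=5, constellation_size=256))
```

Shrinking the constellation costs bits per symbol, so the symbol rate has to rise to hold the same throughput.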

But the analogy is only partial because, unlike most designed protocols, the symbol distribution is not uniform or random. And human language has, in information theory terms, internal redundancy and compression (pronouns do a table lookup) as well as error correction codes and checksums (most combinations are illegal).

This makes an analogy between spoken language and a modem less than straightforward, though they are governed by the same underlying laws.


I wonder if Greek (not considered in the paper cited) might beat even Japanese for syllables per second. I'm learning it and it sounds incredibly fast sometimes.


With automated translation and the general availability of giant corpuses of English text, it strikes me that at the very least a partial ordering can be imposed on the set of languages.

As in: to compare two languages A and B, translate large English corpus to A, then to B, then compare sizes. Ta-da.

And if actual speed is really what matters, feed the results into some text-to-speech algorithm and clock the output.
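A minimal sketch of the "compare sizes" step, assuming the translations already exist as strings (the translation itself is out of scope here, and the function names are hypothetical). Raw byte counts mostly reflect the script's encoding, so this also reports zlib-compressed size as a crude proxy for information content:

```python
import zlib


def corpus_sizes(translations):
    """Map each language to (raw UTF-8 bytes, zlib-compressed bytes)."""
    sizes = {}
    for language, text in translations.items():
        raw = text.encode("utf-8")
        sizes[language] = (len(raw), len(zlib.compress(raw, 9)))
    return sizes


def rank_by_compressed_size(translations):
    """Languages ordered from smallest to largest compressed corpus."""
    sizes = corpus_sizes(translations)
    return sorted(sizes, key=lambda language: sizes[language][1])
```

Call it with a dict like `{"A": text_in_a, "B": text_in_b}` built from the same source corpus. Compression only helps on large inputs, and translation quality will skew the result.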


Don’t speedrunners normally choose Japanese versions of the game for faster text?


It's normally Chinese because only the number of characters would be relevant here.


I've always found it frustrating that our languages lack bandwidth and are serialized.

I love the idea of parallelizing sentences to provide much needed context rather than having to reference the history in text or conversation.


Are unnecessary syllables used for error correction? Or are they mostly just noise?


Like the NATO Phonetic alphabet. Make everything two syllables, because it makes the correction of corrupted recordings less ambiguous.


November.


Okay, thank you opposing counsel, ya got me. I should have said "make everything unambiguous, usually by using two or more syllables, and in a couple of instances where it was okay, one syllable."


Golf.


In Russian, half the unnecessary syllables are just swallowed. Together with half the necessary ones. (Source: somebody who studied the language for many years but never managed to understand a native speaker.)


Could have a similar function to grace notes in music.


Pellegrino's research on this has been discussed here before

2019: https://news.ycombinator.com/item?id=20880789

2011: https://news.ycombinator.com/item?id=2976044


I wonder if any South Indian languages were included because the speed at which syllables come at you seems incredible.


We should be able to answer this pretty accurately today, with all the language model work happening.


Next up: What language has the most information bits per [blank] when in written form?


The one you understand.


This. Same goes for which programming language to build your startup in.


F77


C++


I mean, not really? Could argue that compile time is how long the "language" itself is being processed, and that's pretty slow in C++.


tl;dr: it's Japanese


That really was not the takeaway.

  …The paper found that, in terms of sheer number of syllables spoken per second, the fastest languages of the 17 studied were Japanese, Spanish, and Basque. The slowest were Cantonese, Vietnamese, and Thai.
  …
  What Pellegrino found is that, essentially, all languages convey information at roughly the same speed when all the factors are taken into account: around 39 bits per second. The higher the syllable-per-second rate, the lower the information density, which creates a trade-off that makes all languages around the same in terms of information rate.

So if you wanted to communicate the most quickly, you should rapidly speak in a more dense language like Cantonese or Vietnamese.


Machine code?


TLDR; All spoken languages convey information at approximately the same rate: 39 bits per second.


Okay but (just eye-balling the graph) 47 (French) seems significantly more than 35 (Hungarian)


I'm sure everyone who read the title immediately thought of the fastest programming language in the world.


The linked URL should be a context clue.


No, because "in the world" when talking about programming languages seems very odd.


Swift, QuickBASIC, Turbo Pascal?


Why, in God's given in your language name, are you putting together such a wild zoo of animals from ancient times and different planets? It's like cramming together in one cage a dinosaur with a lion and a predator. Of course the predator will always win, don't you know already?




