Human speech may have a universal transmission rate: 39 bits per second (sciencemag.org)
671 points by headalgorithm on Sept 4, 2019 | 250 comments

But, he says, instead of being limited by how quickly we can process information by listening, we’re likely limited by how quickly we can gather our thoughts. That’s because, he says, the average person can listen to audio recordings sped up to about 120%—and still have no problems with comprehension.

Some years ago I worked on an accessibility project for an app and website designed for people with disabilities. One of the team members had low vision, and used a screen reader that must have been set to 3x or even higher. I usually listen to YouTube and podcasts at 1.5-2x and I could barely understand the audio. He seemed surprised, which indicated to me that 3x+ was the norm for people in his circle.

I wonder whether his ability was trained through years of using fast screen readers, whether a lower visual processing load leads to better audio processing, or whether some other explanation applies.

I'm the blind dev who refactored a huge chunk of the Rust compiler [0]. I'm at roughly 800 words a minute with a synth, with a proven ability to top out at 1219. 800 or so is the norm among programmers. To get there we normally end up using older synths, which sound way less natural because modern synthesis techniques can't go that fast. There's a trade-off between sounding natural and 500+ words a minute, and the market now strongly prefers the former because hardware can now support e.g. concatenative synthesis.

1219 is a record as far as I know. We measured it explicitly by getting the screen reader to read a passage and dividing. I spent months working up from 800 to do it and lost the skill once I stopped (there was a marked drop in comprehension past 1000, but I was able to program there; still, in the end, not worth it). When I try to describe the required mental state, it comes out very much like I'm on drugs. Most of us who reach 800 or so stay there, though not always that fast for e.g. pleasure reading (I do novels at about 400). It's built up slowly over time, either more or less explicitly. I did it because I was in high school playing MUDs and got tired of not being able to keep up; it took about 6-8 months of committing to turning the synth faster once a week no matter what, keeping it there, and dealing with a day or two of mild headaches. Note that for most blind people these days, total synthesis time per day is around 10+ hours; this stuff replaces the pencil, the novel, etc. Others just seem to naturally do it. You have little choice: it's effectively a one-dimensional interface, so from time to time you find a reason to bump the knob. And that's enough.

Whether and how much the skill transfers to normal human speech, or even between synths, is person-specific. I can't do Youtube at much beyond 2x. Others can. It's definitely a learned skill.

0: https://ahicks.io/posts/April%202017/rust-struct-field-reord...

And as a followup to that--because really this is the weird part--some circles of blind people (including mine) talk faster between ourselves. That's not common, but it happens. I still sometimes have to remember that other people can't digest technical content at the rate I say it and remember to slow down. A good way to bring it out is to have me try to explain a technical concept that I understand really well. I have the common problem in that situation of not being able to talk as fast as I think, but I also seem to have the ability to assemble words faster in a sort of tokenize/send to vocal cords sense once I know what I want to say.

To me, the fact that this seems to be at least somewhat bidirectional is more interesting than the fact that I can listen fast.

I kind of want to hear a "blind Danish" conversation now. For those unaware: Danish is phonetically one of the most difficult languages in the world, to the point where children who acquire Danish as their first language on average start speaking a few months later than children with any other first language. To clarify: speaking is not the same as understanding - toddlers are often capable of understanding language before they have the motor control required to speak it, which is why baby sign language exists.

Actually, understanding seems to be affected, too! See:


Oh wow, did not know that! Then perhaps a better way to phrase it is that it does not imply slower general mental development of the children - at least that is what I read elsewhere.

Are you Danish? (your account name looks rather Dutch)

I always wanted to know whether it's true what I heard some years ago: a friend told me that older Danish people are complaining that the younger Danish are harder to understand because there has been a trend in recent decades to swallow consonants even more than in the past

(that's something hard to imagine for me as a non-Danish speaker :)

No, I'm Dutch like you guessed. While the Netherlands and Denmark are very similar in some aspects (flat countries obsessed with having the best bike infrastructure in the world) the language is quite different ;)

Living in Malmö I also find this quite hard to believe! Besides, isn't old people complaining about the younger generation butchering their language something that happens all the time in all cultures?

This is vague, because it was long ago, but I remember reading an article in a newspaper (Dagens Nyheter, Sweden, IIRC) in... Oh, the 1990s, I think.

It was about how Danish linguists were worried about how the language was getting more unintelligible. The example I remember was about the ever-increasing number of words that are pronounced, basically, “lää'e”: läkare, läge, läger, lager... Those are written in Swedish here; some Dane may translate. (In English: healer, situation, camp [laager], layer...) There were more that I don't remember; in all, I think they mentioned at least half a dozen words that had become basically indistinguishable from each other.

I don't think serious linguists have had the same fear about most other languages.

It is common for any "in-group" to speak faster than the same people do in other groups. This is commonly studied in Linguistics 101 courses by undergrads because it is so easy to observe - in your own groups - once you are looking for it.

But when the in-group is defined as "is blind"?? I'm not just talking about programmers in this context, or any other cross-section wherein there's some sort of shared vocabulary and context other than the disability itself. I don't think it's been studied, but I've noticed, my parents have noticed, in general enough people around me have noticed it over the years that I'm convinced the effect is real. Is whatever mechanism you're referring to typically general enough that the group can be defined this broadly and still have it happen?

That's awesome. It reminds me of the Bynars in Star Trek, who evolved ultra-fast spoken language: https://www.youtube.com/watch?v=52_iSQnB6W0

I don’t think this is a blind person thing. Ask any nerd about something they’re into, and you stand a pretty good chance of receiving a firehose of words representing their stream of consciousness.

Has anyone tried overlapping words instead of speeding them up? Like so:

I've often wondered if this, or at least sped-up speech, should be the default robotic interface... it would make sense to optimize for efficiency/speed (while maintaining intelligibility) if we can do so.

Wow, that's incredible. Do you find it frustrating talking to actual humans now? I'd imagine it feels like they're speaking in slow motion.

Edit: Hah, just saw your post on talking faster to other people who have the same audio skills.

Other people in realtime aren't... I guess the best way I have to put it is informationally sparse. There's a lot going on besides what's being said in conversation. Synths don't imply things, for example; in a context with active implication, slowing/pausing the synth is sometimes necessary. The skill doesn't extend beyond the blatant transfer of information. In social contexts, and especially when you can't go off any visual cues whatsoever to figure out what the other person is thinking/feeling, there's a lot more going on than simple information transfer.

However, most blind people I know who do this start hating audiobooks, start hating talks, and generally by far prefer the text option. Audiobooks aren't annoying, but they're below my baud if you will. Net result: boredom/falling asleep to whatever it is and the need to actively make an effort to listen. Some things which require active listener participation--math lectures for example--are different. I guess the best way I can put it is that speed is inversely proportional to the amount of, let's call it, active listening required.

I've given a lot of thought to this stuff, but we don't really have the right words for me to communicate it properly. A neuroscientist or linguist might, but I'm not either of those.

This is fascinating; thank you. How does an audio book read by the author, such as Anthony Bourdain's Kitchen Confidential, which contains autobiographical information, compare? Is it more like a social situation that you need real time to absorb, or do you prefer it at a higher speed? How does a stage play compare? Do you watch movies sped up?

Also, how does your "baud" vary with your familiarity with the ideas? I can't imagine it's independent. As a ~35 year old programmer with a decade of professional experience and a decade of hobby experience before that, I cracked open SICP for the first time and found almost everything familiar. I had digested the ideas from other sources, so I could read at a "natural" rate. If I had read it as a teenager, it would have been a mindfuck, and I would have taken multiple slow readings to understand. When you talk about numbers like 800, are you talking about writing that challenges you and changes the way you think, or are you talking about stuff you do for a living that is just information you're already primed to accommodate?

I haven't specifically tried different types of audiobooks to see if there's some preferred category.

With movies I don't bother with them unless they have descriptive audio, at which point you've got music, sound effects, and two somewhat parallel speech streams going on. That's high informational content.

I did an entire CS degree at 800 words a minute. I program in any programming language you care to name (including the initial learning) at that speed as well. For more complicated concepts I stay at that speed but pause after every paragraph or so to chunk the content as needed. I'm doing this thread at that speed. Pretty much the only time I slow down is pleasure reading, or sometimes articles when I want to go off and do chores while I listen, but even then it's still faster than human speech.

In general I think answering these sorts of questions needs research that, to my knowledge, we don't have. Nothing in my personal experience or background really allows me to give you good definitive answers. The sample size to work with is pretty small, and in all honesty there's not a lot of good research around day-to-day blindness in the first place.

> audiobooks aren't annoying, but they're below my baud if you will. Net result: boredom/falling asleep to whatever it is and the need to actively make an effort to listen. Some things which require active listener participation--math lectures for example--are different.

That sounds like a similar description to what it's like for people with IQs significantly higher than the average.

I can't find any recordings of 800 WPM synths. Would it be possible for you to make one? I'm curious what it sounds like.

I don't think it'd be legal for me to hand you an espeak recording, but it works fine at 800WPM.

    espeak -s 800 "Things to say."

You can pass Espeak recordings around legally. It's just GPL. The license applies to the software, not the content produced via it.

I will attempt to remember and find the time to take my demo recording of this on Rust compiler source code that's currently in Dropbox and put it up somewhere more permanent. I doubt Dropbox would be happy with me if I let HN-volume traffic hit my account. It's Espeak using an NVDA fork with an additional voice that some of us like, so vanilla espeak is in the ballpark.

What I don't remember is if vanilla non-libsonic espeak softcaps the speech rate. It might. I believe new versions of espeak integrate libsonic directly, but that old versions just silently bump the speaking rate down if it's over the max. I haven't used command line espeak directly for anything in a very long time.

Libsonic is an optimized library specifically for the use case of screen readers that need to push synths further: https://github.com/waywardgeek/sonic
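For the curious, the core idea behind this kind of speedup can be sketched in a few lines. Below is a naive overlap-add time compressor in Python/numpy, leaving out the waveform-similarity alignment that sonic actually performs to reduce artifacts (the function name and frame size here are arbitrary choices for the demo):

```python
import numpy as np

def naive_speedup(x, speed, frame=1024):
    """Time-compress audio x by `speed` while roughly keeping pitch.

    Reads windowed frames `speed` times farther apart in the input
    than they are placed in the output. Real libraries like sonic
    also align neighboring frames to avoid phasing artifacts.
    """
    hop_out = frame // 2                  # synthesis hop (50% overlap)
    hop_in = int(hop_out * speed)         # analysis hop
    window = np.hanning(frame)
    n_frames = max(1, (len(x) - frame) // hop_in)
    out = np.zeros((n_frames - 1) * hop_out + frame)
    norm = np.zeros_like(out)             # summed window weight
    for i in range(n_frames):
        a = i * hop_in                    # read position
        s = i * hop_out                   # write position
        out[s:s + frame] += x[a:a + frame] * window
        norm[s:s + frame] += window
    return out / np.maximum(norm, 1e-8)

# one second of a 440 Hz tone at 16 kHz, sped up 2x -> about half a second
sr = 16000
t = np.arange(sr) / sr
fast = naive_speedup(np.sin(2 * np.pi * 440 * t), 2.0)
```

Naively playing samples faster would shift the tone up an octave at 2x; overlap-add shortens the duration while keeping the pitch, which is part of why heavily sped-up synths stay intelligible.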

Here's an online version; I bet it sounds similar to the original program: https://eeejay.github.io/espeak/emscripten/espeak.html

There is a range slider that maxes out at 450, which is the maximum speed according to the manual.

I tried listening to a Wikipedia article at 450; I am amazed you can comprehend that. Perhaps it's the equivalent of my visually scanning the text instead of reading it; however, when I do that, I tend to focus on interesting parts for long stretches of time. With espeak, how do you focus? Can you pause it at will?

Screen readers have a lot of commands for reading different sized chunks of content. In general there's probably around 50 keystrokes I use on a daily basis. It's not as straightforward as reading from top to bottom, though it can be. I can usually do a Wikipedia article without pausing at 450 or so.

If anyone is curious, here is the NVDA keystroke reference: https://www.nvaccess.org/files/nvdaTracAttachments/455/keyco...

As an interesting sidenote, screen readers have to co-opt capslock as a modifier key, then there's fun with finding keyboards that are happy to let you hold capslock+ctrl+shift+whatever.

> Whether and how much the skill transfers to normal human speech, or even between synths, is person-specific. I can't do Youtube at much beyond 2x. Others can. It's definitely a learned skill.

I find that the maximum understandable rate varies a lot between speakers. For some speakers 2.5x is possible, but just 1.5x for others.

One advantage synths have is that they can control the speed at which words are spoken and the pauses between words independently. When watching or listening to pre-recorded content, I often find that I'd want to speed up the pauses more than the words (because speeding everything up until the pauses are sufficiently short makes the words unintelligible).

If someone knows of a program or algorithm that can play back audio/video using different rates for speech and silence, please share.
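I don't know of an off-the-shelf tool to point at, but the algorithm itself is simple to sketch: gate on short-window energy and drop most of each quiet stretch. A minimal Python/numpy version, where the threshold, window, and pause lengths are made-up values you'd tune per recording:

```python
import numpy as np

def compress_silence(x, sr, win_ms=30, threshold=0.02, keep_ms=60):
    """Shorten silent stretches in mono audio x (floats in [-1, 1]).

    Splits the signal into short windows, labels each as speech or
    silence by RMS energy, and keeps only `keep_ms` of each silent
    run, so pauses shrink while words play at their original rate.
    """
    win = int(sr * win_ms / 1000)
    keep = int(sr * keep_ms / 1000)
    pieces, silent_run = [], 0
    for start in range(0, len(x), win):
        chunk = x[start:start + win]
        rms = np.sqrt(np.mean(chunk ** 2))
        if rms >= threshold:
            silent_run = 0
            pieces.append(chunk)
        else:
            silent_run += len(chunk)
            if silent_run <= keep:   # keep a short pause for rhythm
                pieces.append(chunk)
    return np.concatenate(pieces)

# demo: 0.5 s of "speech" (noise), 1 s of silence, 0.5 s of "speech"
sr = 16000
speech = 0.3 * np.random.default_rng(0).standard_normal(sr // 2)
clip = np.concatenate([speech, np.zeros(sr), speech])
short = compress_silence(clip, sr)
```

This only shortens the pauses, so the words themselves still play at 1x; you'd combine it with ordinary speedup to get the effect described above.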

Are old speech synths not harsh on the ears to listen to for longer periods? Or maybe I'm just familiar with the super robotic ones (I like them for music production).

If so, have you considered using an EQ plugin to maybe turn down the harsher high frequencies a few notches? Just a thought.

They're harsh, but you get used to it in about a week. Espeak is an atypically bad example, which is why NVDA experimented with a fork (and maybe one day the NVDA work will make it upstream). Part of what allows them to stay intelligible is the harshness. I've never tried passing one through an EQ, but there are already pitch settings and similar to play with, and given that even not wearing headphones slows me down, I expect an EQ would probably be bad for it.

But more to the point there is nowhere to really plug that in to a screen reader, so we can't try it anyway. The audio subsystems of most screen readers are much less advanced than you'd think.

I've known a lot of people that push podcasts, videos, and audiobooks to extreme speed. I knew a guy who'd turn video speed up to 8x so he could binge watch a season of generic anime in an hour flat. I knew a girl who'd get through paperback romance novels by scanning each page diagonally, in 10 seconds each. And here in this thread we have a lot of people bragging along the same lines.

I just don't get the point. If you can process content much faster than it was meant to be played, it doesn't mean you're learning much faster than you could, it means the novel information density is low. Any content that can be sped up that much without loss is not worth listening to in the first place. You're just skipping the trite cliches, filler, and obvious facts.

I can read fast, and I typically go through fluffy NYT bestseller nonfiction at 600 WPM. But when I do this I constantly have a sneaking suspicion that I'm just wasting my time. When I read a good book full of new ideas, I barely go at 150 WPM, but the time always feels well-spent.

Exceedingly slow narration, particularly what's normal for audiobooks, is annoying to me because it's slower than I process words. It's like walking with someone whose pace is far slower than your natural gait -- it takes more energy and concentration to slow down. It's why slow-talkers are so annoying.

This isn't "how fast can I go through this" but "what is a comfortable pace"?

So I bump the speed up, though usually fairly modestly: 1.25x - 1.5x is generally enough.

I've noticed that preferred speeds vary tremendously with the quality of the work and speaker -- high-density information and an exceedingly good speaker, and I'll slow down. Slapdash redundant content and poor speaker, I'll speed up.

The degree of polish in the production matters tremendously. I've listened to CPG Grey's YouTube videos (highly polished) and podcasts (a lot of chit-chat with his co-host). The videos work well at normal speed, or perhaps slightly sped up. The podcasts I find nearly unlistenable, though they improve at much higher speeds (1.75x - 2x).

Yes, and the painful slowness becomes even more evident if you speed it up to the 1.25/1.5x range, get comfortable with it, and then go back to 1x. IME, it sounds like the speaker is going over-the-top to enunciate, like you're a small child or have learning disabilities or something.

Audiences vary, and many audiobook listeners are visually impaired and may also have hearing problems. Pitching the default to them seems a fair deal, particularly as speeding up is so straightforward.

I watch lectures online at 1.5-2x for a similar reason. At that speed I have to concentrate. Slower and it is easy for my mind to wander.

As a slow walker I respectfully disagree on the energy to slow down vs speed up! Concentration may be equal however. I get absolutely exhausted keeping up with you fast walkers. :)

That's interesting. If you're talking about Hello Internet, I enjoy listening to them at 1x. I find it quite relaxing to listen to passively.

To date, this is the only podcast I can actually enjoy. I generally do other things while listening, and feel like I'm in the company of friends.

It undoubtedly has a relatively low density of information; however, I wouldn't speed it up, I would slow it down!

I tend to listen to audiobooks at 2.5-2.8x speed, and all media I consume is at 2x speed.

Tip: there are dozens of Chrome extensions that let you change all HTML video to 2x speed or even higher (including most ads on sites like YouTube and Hulu).

I'm totally with you, I speak at effectively 1.5x in real life, so that's what feels natural to me listening. But I get turned off if something needs to be sped up _beyond_ reasonable human speaking speed to be palatable, like 2.5x or 3x. It's like putting a pile of salt on really bland food. I get the feeling that I'm panning for gold.

I can't keep up with fast talkers, and I like my books and movies at nominal speeds. When things are fast I can't pay attention. My brain needs the extra cycles to wander, consume, and let my focus be split.

I think some audio books are slightly slowed down by default, many I've listened to sound more natural at about 1.2x

Part of the reason for this is that it's rare for audiobook software to have a 0.9x playback option, but common to have a 1.25x option.

Yup! +1

I used to watch a lot of lectures at high speed. But I've come to realize that the faster I watch something, the faster I forget it.

It is like the information doesn't have time to settle in my memory, despite me understanding it.

It's maybe because when things are slow, I can use the dead time to think about the implication/corner cases of what's being said.

If you really want to understand it, I would encourage you to write down notes about it by hand.

For some reason I do not yet understand, the motion of physically putting pen(cil) to paper helps ingrain that information into your brain, in a way that typing it into a computer does not.

Some people say writing is better because typing is "unnatural", but from an evolutionary standpoint both are extremely, equally unnatural activities. There's no reason moving a cylinder of graphite with one hand should be inherently better matched to the brain than moving squares of plastic with two.

My personal guess is that, with fast typing speeds, it's too easy to just copy things word for word. With writing, you have to at least rephrase it and reorganize it to fit the notes reasonably on the page, which forces some processing to occur. I take notes solely by typing, but I only have retention if I do it slowly and reflectively.

I have a pet theory that our brains are hard-wired to store information "better" if it bears some kind of personal/individual imprint.

In other words, we can "store" knowledge more efficiently when written in our own handwriting than when typed into a neutral/generic text editor.

Hmm... then perhaps it's good that I'm extremely particular about how I type my LaTeX, down to microadjustments to the kerning to make it look nicer.

There is a difference: people have been using small handheld objects to create images on flat surfaces for tens of thousands of years. Moving squares of plastic to remotely and invisibly trigger a change in the patterns of light emitted by an electronic display is somewhat different. This could explain some of the studies that have found differences in retention of typed and handwritten notes, but there are many confounding factors.

While writing is ancient, in almost all cultures the ability to write hasn't been widespread until a few centuries ago. It just isn't that useful in agriculture when books cost a fortune, and historically almost everybody was a farmer. I doubt an ability possessed by a small portion of the population over a few thousand years had a large evolutionary impact.

IMHO it's primarily because you have to process and comprehend the information in order to make a decision about what's important enough to jot down.

Although, anecdotally, I find that writing notes by hand is better for recall than typing.

I'm not sure if this is because writing activates different parts of the brain, or simply because writing is slower and forces me to therefore think harder and comprehend better because I need to be 2x as selective about what makes it into my notes.

My money's on the latter, but who knows?

I think this is likely it. In my opinion, it reminds me of autoencoders. Having to compress the information means understanding the content enough to discard unnecessary bits.
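The analogy can even be made literal. Below is a toy linear autoencoder in Python/numpy (all dimensions, the learning rate, and the iteration count are arbitrary demo choices): squeezing 8-dimensional data through a 2-unit bottleneck forces the model to keep only the structure that matters, the way compressing a lecture into sparse notes does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 8-dimensional points that secretly live on a
# 2-dimensional subspace, so a 2-unit bottleneck suffices.
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 8))
X = Z @ A

# Linear autoencoder: encode 8 -> 2, decode 2 -> 8.
W_enc = rng.normal(scale=0.3, size=(8, 2))
W_dec = rng.normal(scale=0.3, size=(2, 8))

def loss(X, W_enc, W_dec):
    """Mean squared reconstruction error through the bottleneck."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial = loss(X, W_enc, W_dec)
lr = 0.02
for _ in range(3000):
    H = X @ W_enc                         # compressed "notes"
    E = H @ W_dec - X                     # reconstruction error
    g_dec = H.T @ E / len(X)              # gradient wrt decoder
    g_enc = X.T @ (E @ W_dec.T) / len(X)  # gradient wrt encoder
    W_enc -= lr * g_enc
    W_dec -= lr * g_dec
final = loss(X, W_enc, W_dec)
```

Because the bottleneck is narrower than the input, the only way to reconstruct well is to learn which directions in the data carry information and discard the rest, which is the "decide what's worth writing down" step in disguise.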

I can roughly keep up with speech in real time on the keyboard. Even though speaking is a bit faster than my max rate of around 90 WPM, by dropping filler words and gaps I can mostly avoid summarizing and quote verbatim. When transcribing at full speed like that, I feel like a conduit at times; very little sticks deeply in memory.

Contrast with writing notes, where I'll only write down things I find particularly important. Most of the time I'm just trying to actively listen.

Purely anecdotally, I feel like when I am typing and summarizing to the same compression ratio as writing, my retention with typing is better, because I can do it without looking, it's faster, and I can tune back in to the speaker.

On the other hand, with a laptop open, I'm much more likely to get distracted with emails/tasks.

Bottom line, I think we need to study more axes of this problem, if only because it gives neat insights into cognition.

Here's a paper on the topic which presents some ideas: https://journals.sagepub.com/doi/full/10.1177/09567976145245...

Also, I was amazed at how listening twice to a history lecture really helped my retention. I'd listen while walking to my next campus class.

This varies by person. Not everyone learns by writing. Some only learn by hearing or seeing. Some can only learn by doing. Most of us are a mix. I learn just about every way except by writing or typing. With online lectures I do best by playing at 1.2x at most and pausing regularly to think through implications and cross-link to things I have already learned.

This post was far more controversial than I thought it would be. It keeps getting voted up and down. Can anyone explain why?

People tend to enjoy genre fiction because it is full of predictable filler. Not all entertainment has to be surprising or life changing.

Just spending time in the moment with an enjoyable story is not wasting it.

I think this really is a case of to each his own. Pure fiction with drawn out descriptions and excess vocabulary drives me to anger.

I think I have a similar thing visually, where above a certain text size, I have a very hard time reading. The optimal text size for me is pretty small, even on a standard resolution display.

After doing speed reading exercises a few years ago I initially also started to use the techniques for novels, but that was a bad mistake in my opinion.

While I could get my comprehension percentage quite high with a bit of training, I lost all connection to the characters and story, stopped imagining the scenes, and felt like reading the book was a waste of time.

Novels should be read at a natural pace to give room to your imagination and dive into the story. You can still quickly scan over boring/repetitive filler text, but I did that without caring about WPM already.

With other things like textbooks / articles / reports cranking up your WPM and applying your attention more selectively by focusing on or re-reading critical parts is a very helpful skill though.

Maybe "speed readers" still receive knowledge at around 39bps - they just filter out a lot more.
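The article's figure makes this easy to sanity-check. Assuming a ballpark of about 10 bits of information per English word (my assumption, not a number from the article), the 39 bits/s budget converts to a reading rate like so:

```python
BITS_PER_SECOND = 39       # the transmission rate from the article
BITS_PER_WORD = 10         # rough assumption for English text

words_per_minute = BITS_PER_SECOND * 60 / BITS_PER_WORD
print(round(words_per_minute))  # -> 234
```

By the same arithmetic, a 600 WPM skim only fits the budget if the text carries under ~4 bits of genuinely new information per word, which is one way to phrase the "filtering" idea.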

I think there's something to that. You're zipping through just listening for the highlights. Similar to how you can listen to a full podcast, or often just watch highlight clips on Youtube and feel like you "mostly" enjoyed it. You got the best stuff. Obviously that isn't what some people want, but it's an interesting optimization.

> If you can process content much faster than it was meant to be played, it doesn't mean you're learning much faster than you could, it means the novel information density is low.

I 'read' your comment using TTS at 3x. What does that say about the information density of your comment?

(Little to nothing. TTS at that speed is still marginally slower than I normally read with my eyes. Human speech is generally much slower than is necessary to be understood.)

That's a good point, 3x speaking speed is reading speed, so there's nothing unusual about using it on comments largely intended to be read. That makes perfect sense.

What doesn't make sense to me is consuming things that have a _set_ speed, such as video/TV or lectures, at a dramatically faster speed.

I imagine it's a compromise between cutting through the fluff and using the primary source material.

You could just read the plot synopsis or watch the highlights, but sometimes those don't convey build-up, suspense, or other data that are hard to losslessly compress.

Being comfortable with the "boilerplate" of a given medium or genre usually lets you skim or skip it to jump right into the good stuff.

I listen to a lot of podcasts and audiobooks while doing other things; walking, cleaning, cooking, traveling, playing games, etc. Every time I try speeding up, even just to 1.25x, I don't enjoy it as much, as it feels rushed and stressful. I think it could be interesting to learn to listen and read at extremely high speeds, but nothing more than interesting, and I'm even doubting the usefulness of it.

heh, when I slow audiobooks down to 1x it sounds like the reader is drunk. I usually go with 1.5-2x depending on the reader.

On the flip side, I listen to podcasts at 2x and now the hosts' voices sound weird when I hear them at 1x elsewhere.

Same here. I go anywhere from 2.1 to 2.5 based on the podcast.

Mummy, why is everyone in real life drunk?

As I used to tell my boss: a man page read once thoroughly is worth more than ten-times skimming it quickly.

Any content that can be sped up [by 8x] without loss is not worth listening to in the first place.

"That's just, like, your opinion, man."

Obviously the people doing it are getting something out of it, otherwise they wouldn't do it?

"Don't yuck my yum."

> I can read fast, and I typically go through fluffy NYT bestseller nonfiction at 600 WPM. But when I do this I constantly have a sneaking suspicion that I'm just wasting my time. When I read a good book full of new ideas, I barely go at 150 WPM, but the time always feels well-spent.

I do the same, but with Hacker News comments :)

Hey, I never said my comment was a novel insight! I'm here to waste time too. ;-)

In my experience the best books routinely stop me dead in my tracks. I just started Invisible Man, and every paragraph is littered with really deep themes. I can normally finish a 500-page book in a few days (a majority of my reading is done during my commute), but this will definitely be a slow burn.

I think it depends on the speaker. I watch most videos at 1.5x speed, and some slow speakers with gaps at 2x.

Fluff has value. Fresh cotton candy should be eaten rapidly. It’s fluff. It’s allowed :)

For me, when I'm reading something dense, my WPM fluctuates. It could be 300WPM in the easy areas, and down to 30WPM in the conceptually challenging areas.

I can watch videos like Linus Tech Tips up to 2x speed and get just as much out of it as otherwise.

Here’s a blind programmer using Visual Studio with a ridiculously fast TTS: https://youtu.be/94swlF55tVc

I'm not blind, but I've tested some of my apps for VoiceOver and it's just utterly unusable with a "reasonable" speed. You have to pretty much set it to your reading speed for it to be useful, and that happens to be significantly faster than most people are comfortable speaking.

Start slow, and build up your tolerance to it over time. If you try to jump into the deep end, you'll just end up disappointed.

I'm reasonably good at listening to sped-up audio, personally, so this wasn't really an issue for me. I was just providing anecdotal reasoning of why TTS users may set their audio speed to something that might sound unreasonable: it takes forever to navigate the interface otherwise.

This makes me emotional.

Same. It really makes you realize what we take for granted as sighted developers. I can't imagine what the learning curve was like for him, as he lost his sight at age 7, long before many of these accessibility technologies existed. [1]

[1] https://news.microsoft.com/apac/features/saqib-shaikh-on-tec...

Yes, it was like this video (scroll ahead to the 1 minute mark).

Thank you for sharing that; it was super interesting.

Yes. Took me a while but I can comfortably understand 2x speed and now 1x podcasts seem weird like they're talking super slow. I would imagine it's something you just train even more out of necessity.

I typically use TTS near 2.5x (I turn it up when I'm alert and down when I'm tired.) It's definitely a learned skill; a few years ago I started at 1x and struggled even with that.

Every couple of months, take a moment to reflect on your comprehension. Is it currently easy for you to understand the audio? If yes, then crank it up a little bit until it's noticeably more difficult. Repeat this process periodically over a year or so and before you know it, it'll be set pretty damn quick.

I can watch YouTube content at 3x without issue. I did this without much intentional effort - I simply downloaded an extension which allows me to speed up the video in increments of 0.1x using a keyboard shortcut. Whenever it felt slow I would speed it up, and whenever it felt too fast I would slow it down. Without paying much attention to the actual numbers I had reached over 3x within a month or so.

I tried this right now with the trick below. 2x was no issue but 3x... that was a big step. It sounded like word salad. As if my brain was decoding the words out of order and was unable to assemble them into sentences.

It also depends on the density of the original content. Some things flow nicely at 1.5; others can be cranked up to 3x or even beyond if you turn on subtitles because they're just another YouTube video padded to hit the magic 10 minute mark, or whatever it is now, and you don't really need the audio.

OP did it over a month, not a single session.

Is there a trick to get it above 2x (which is the cap in the UI)? It'd be nice when watching videos of Congress.

I do it via `document.getElementsByTagName("video")[0].playbackRate = 3.0` in the dev console. You can press the up arrow to recall it on the next video, just like in a normal terminal. The nice thing about this is you can put in any float, e.g. 3.333.

There are also extensions of course.

You can also put this as a bookmarklet on your bookmarks bar (something like `javascript:void(document.getElementsByTagName("video")[0].playbackRate=3.0)`).

The extension he probably used is Video Speed Controller; it works on any HTML5 video, and you're not limited by whatever UI they've wrapped around that built-in functionality.

Streaming and download/playback tools can offer greater flexibility. I typically use mpv, though other options are mps-youtube, or for download + playback, youtube-dl and xine, mplayer, mpv, mps, or VLC.

I find I can typically listen at 1.25 - 1.75 readily. Exceptionally poor content I'll bump to 2x. I don't generally go much above that unless I'm fast-forward scanning video for specific content.

I created a simple browser extension for this purpose, "YouTube Turbo Button". As mentioned in the other comment, it can be easily done through the console, but I found having a button for it more convenient.

Opening directly in `mpv` gives a wide range of speed control, has keyboard shortcuts for everything, and as an added bonus, skips ads by default.

I am blind, but I am not a primary speech synthesis user; I prefer tactile braille. However, I know a number of people who use their speech synthesizers at rates similar to what you described above.

My theory/experience with this phenomenon is that a speech synthesizer never makes any errors. When it pronounces a word, it will do so exactly the same way every time the same word comes up. So the learning effect after a while is a bit higher than when you listen to a human. Humans will always have slight variation in how they pronounce the same word. So, as I understand it, you can "learn" to listen to your speech synthesizer at a fast rate more effectively than you would be able to listen to a fast human speaker.

And yes, I also listen to YouTube talks and audiobooks at about 1.5-2x rate. So I guess 80 bits per second is relatively easily doable for the receiver.

It depends on the speaker. Some speakers are particularly slow with lots of long pauses and others are faster.

I think anyone can get to 3x but it takes some time to adjust to faster and faster speeds. It also depends on what you are doing while listening. Distractions or listening while doing something else (driving for example) lowers my ability to comprehend. For example on the interstate without much traffic I'll listen to audiobooks at 3x, but in a city or a crowded highway I have to slow it down.

Overcast has a great feature where it cuts out pauses and otherwise leaves the speed intact. It easily speeds up podcasts by 30% or more.

FYI if anybody is interested in doing this without an app, check out ffmpeg's silenceremove filter.
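A sketch of how that might look (untested; the filenames are placeholders, and the duration/threshold values will need tuning per recording):

```sh
# Remove every stretch of near-silence (quieter than -45 dB for >0.7 s)
# from the middle of the file; stop_periods=-1 applies it throughout.
ffmpeg -i episode.mp3 \
  -af "silenceremove=stop_periods=-1:stop_duration=0.7:stop_threshold=-45dB" \
  episode-trimmed.mp3
```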

Most decent podcast players also support this.

If you close your eyes, listening comprehension goes way up. I top out around 2x if I have to look, but with my eyes closed I can get full comprehension at 3x+.

If it's a technical talk or something I'll still pause often to reflect on what was said, but I can hear full sentences just fine at >3x with my eyes closed.

Well I'm sure the ability requires training, but I wonder if it is specific to screen readers.

Consider what you quote: > we’re likely limited by how quickly we can gather our thoughts

Now the amount of relevant info on a screen is typically small enough that a sighted person can zero in on it at a glance and perhaps just click a button without thinking.

I.e. the amount of info that deserves "gathering our thoughts" is typically very small. So if that is the bottleneck, your colleague can keep cranking up the audio speed until low-level audio processing becomes the bottleneck, which is a regime that sighted people never deal with, not even the nerds who speed up their Joe Rogan podcasts.

It's easy to train yourself to do that though. Just find your favorite audiobook and listen to it daily. First listen to it at 1.5x, then adjust to 2x after a few days, then 2.5x after a few more days etc. You'd be surprised how fast your brain can actually process the information.

Personally, when I did this I felt irritated when speaking, because my sped-up audiobooks had conditioned me into thinking I should be speaking at that rate, but it's just not physically possible for my mouth and tongue to move that fast.

I recommend 0.2-step increases, as 0.5 is too much. I also recommend silence skipping if your player supports it.

>But, he says, instead of being limited by how quickly we can process information by listening, we’re likely limited by how quickly we can gather our thoughts. That’s because, he says, the average person can listen to audio recordings sped up to about 120%—and still have no problems with comprehension.

The deduction that is quoted does not follow: speeding up audio recordings to 120% requires the auditory system as well as the language and thought systems (or any other potential bottleneck) to be sped up proportionally, since it's a pipeline.

Similarly, the posted article (I have yet to read the original one) states in the title that "human speech" has a universal transmission rate, but the research tested reading, not speech, so this may or may not be true.

Perhaps the bottleneck is human speech, with the side effect that listening is never trained beyond the typical speech rate limit. (in this case the higher speed syllable languages would be easier to pronounce fast, and the lower speed ones harder to pronounce fast)

Perhaps the bottleneck was in the visual burden of reading, a language that encodes more bits per syllable implies more types of syllables, which irrespective of size or number of characters puts a classification demand on the visual system (classifying a symbol coming from a set of only 2 symbols will be easier, but will require more classification instances than classifying from a large set of characters but with fewer classification instances).

Perhaps the bottleneck was again in speech during reading by subconscious vocalizing of the text.

Perhaps the bottleneck was in the auditory "speech to syllable" classification.

Perhaps the bottleneck was in parsing text.

Perhaps the bottleneck was in "accessing thoughts" etc.

So it is rather hard to identify where the bottleneck is located without having a means of detecting where in the brain the "incoming queue is full" vs "incoming queue is waiting" during speaking, listening, reading. And which of these 3 causes this universal bottleneck (since I gave 2 examples of how an apparent bottleneck in reading could stem from not being trained beyond a possible universal bottleneck in speaking rate...)

That quote seems to imply that they have not measured the maximum speed to receive information but the speed at which we are comfortable outputting it.

There is no shortage of people training to receive a lot of information at once, and 39 bits per second seems to me on the lower end of what some video games require but in terms of constructed, linguistic output? They may be on to something there.

Fast chatters are not faster thinkers. I have yet to see people exchanging thought at a higher rate than usual.

> I wonder if his ability was trained through years of using fast screen readers, vs. a lower visual processing load leads to better audio processing, or some other explanation.

While I'm sure his visual cortex picked up some slack, I'm willing to bet it's mostly just through training. We just aren't trained for faster communication. I've known blind people and they are the same way with their readers.

For me I imagine the bottleneck would rather be how quickly I can translate my thoughts into speech. Oftentimes I will start out talking to myself to explain a topic, only to eventually digress into my "mental monologue" because I start to process thoughts faster than I can say them.

I'm fully sighted, but I use espeak TTS at around 1000wpm for fluff text like Economist articles and 300wpm for heavy-going text like the text sections of math books.

I also watch most video at 2-3x speed, since the skills seem transferable.

What flags do you use for espeak? Just -s 1000 seems incomprehensible to me.

-p 100 -s 1000 -v male7

I listen to youtube videos at 200%

What do i win???

Depends on the natural speed of the speaker, but I listen to most podcasts and YouTube commentary/narrative at 2x. Podcasts sometimes in the 3x range.

Sometimes it's worth slowing down to 1.5x to give myself a bit of time to process the ideas, though slowing below that sometimes hurts comprehension.

Side note: I find that YouTube in Chrome has the best pitch-preserving time stretching filter, and I've neglected all this time to figure out what exactly they use to accomplish that. I'd love to add that to mpv, if it's not already there.

Probably this is a healthy reminder of how the brain optimizes and uses sections of itself. Without the need for vision, those cranial areas can be better used for other things.

Apologies for being harsh, but this kind of thing is the phrenology of our time. I know it's utterly conventional to think this way about language in some circles that present themselves as doing legitimate science, but the view that you can calculate the amount of information in human speech, except in a super-technical sense that doesn't match any of the reporting on this study or the way people are interpreting it, has to be called out for the total nonsense that it is. It doesn't bear a moment's honest reflection.

And yes, I know information theory. It's language that these folks - many of them prominent and celebrated within their utterly normalized professions, just like in the days of phrenology - are fundamentally mistaken about. What quantity of information do you think there is in the word "trump," for instance? Is it the same over time, to bring up just one feature of how this funny thing called context informs human speech?

Wittgenstein's Philosophical Investigations is a good place to start if anyone's interested in understanding this issue.

They aren't talking about the semantic information of the word "trump". They explain the methodology for calculating information, and it's per syllable (based on the number of distinct syllables that are part of the language's phonetics). So, for English speakers, 'trump' has exactly 7 bits in it. That exact syllable may or may not exist in another language, but if so the same monosyllabic word "trump" would have a different number of bits to a speaker of that language. Maybe next time RTA?

In other words, they aren't factoring in compression.

>Maybe next time RTA?

I think it's you that has missed the point. Syllables have a very loose correlation to information. So great; we can stream out 39 bits' worth of syllables per second. In what way does that describe how information-dense those syllables are? Context matters here.

I think the fact that context matters so much is why we don't try to quantify it. The word 'trump' can convey a lot of meaning or next to nothing; e.g. in a card game the word trump can convey a lot of information about the state of play, and your reaction to it, to your competitors. It doesn't take any longer to say, and in the context of the game may take less time to think up as well.

The researchers are not making any claims wrt semantic information density.

s/exactly 7/a little more than 7/

You're saying phonology is the new phrenology?

Jokes aside, I agree that estimating the average absolute information content of a syllable seems pretty absurd.

However, if the primary goal here was to determine whether some languages convey more information per unit time than other languages, I think the authors did fine. To this end, they needn't define information per syllable in anything other than p.d.u. - procedurally defined units. If average Vietnamese speech has 2x the number of syllables/min as German, but it takes the same amount of time to recite War and Peace in both Vietnamese and German, it suggests that both languages convey the same high-level information 'per unit time', but not 'per syllable'.

And basically that's all they did... "We computed the ratio between the number of syllables [in the text passage] and the duration [it took to recite the passage]"

What do you see as the contradiction between Wittgenstein and information theory?

> And yes, I know information theory.

You clearly don't know linguistics though because the idea that a word conveys a constant quantity of information is hilarious.

Early on when Information Theory was emerging, there were attempts to measure the bandwidth of consciousness. They reckoned about 18 bits per second or less, which sounds very low.

Tor Norretranders book, The User Illusion, mentions some of the research:

W R Garner and Harold W Hake "The Amount of Information in Absolute Judgments" - Psychological Review 58 (1951) - they attempted to measure people's ability to distinguish stimuli (such as light and sound) in bits. Result: 2.2 to 3.2 bits per second.

W E Hick "On the Rate of Gain of Information" - Quarterly Journal of Experimental Psychology 4 (1952) - this experiment measured how much information a person could pass on if they acted as a link in a communication channel. That is, faced with a series of flashing lights, subjects had to press the right keys. Result: 5.5 bits per second.

Henry Quastler "Studies of Human Channel Capacity" - Information Theory, Proceedings of the Third London Symposium (1956). Measured how many bits of information are expressed by a pianist while pressing keys on a piano. Result: 25 bits per second.

J R Pierce "Symbols, Signals and Noise" (Harper 1961) - used experiments involving letters and symbols. Result: 44 bits per second.

Discussion of the research, Tor Norretranders book, and what the research may have missed here:


> instead of being limited by how quickly we can process information by listening, we’re likely limited by how quickly we can gather our thoughts. That’s because, he says, the average person can listen to audio recordings sped up to about 120%—and still have no problems with comprehension. “It really seems that the bottleneck is in putting the ideas together.”

Glad this paragraph was in the article, clears up their methodology. I wonder if it applies to writing too, or if skilled writers work faster.

Okay, but the experimental subjects didn't put any thoughts together but read out some text aloud...

The same text, at that. So the text has N bits of information and it was, according to the article, spoken at different speeds per language. So N bits at different speeds per language, exactly the opposite of their claim.

No, the text does not have the same amount of information in every language, because they are talking about the information in the audio, not the semantic value. Your brain categorizes each syllable it hears into one of N classifications, where N differs by language. So the max intelligible playback speed (in syllables per second), multiplied by the syllabic entropy (in bits per syllable), tends to land on 39 bit/s.
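As a back-of-the-envelope illustration of that trade-off (the per-language numbers below are rough ballpark figures chosen for illustration, not the paper's exact measurements):

```javascript
// Information rate = information density (bits/syllable) × speech rate
// (syllables/second). Numbers are illustrative ballpark figures only.
const langs = [
  { name: "Japanese",   bitsPerSyllable: 5.0, syllablesPerSec: 8.0 },
  { name: "English",    bitsPerSyllable: 7.0, syllablesPerSec: 6.2 },
  { name: "Vietnamese", bitsPerSyllable: 8.0, syllablesPerSec: 5.2 },
];

for (const { name, bitsPerSyllable, syllablesPerSec } of langs) {
  // Denser languages tend to be spoken more slowly, so the product
  // clusters around a similar value (~39 bit/s in the paper's data).
  console.log(`${name}: ${(bitsPerSyllable * syllablesPerSec).toFixed(1)} bit/s`);
}
```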

Ah, I read too quickly and misunderstood their meaning of "bit rate."

> researchers took their final step—multiplying this rate by the bit rate to find out how much information moved per second

Thank you for your explanation, worth a bag of gold!

Really depends on the language; if you're writing Java, you'll be putting out a lot more than that due to how stupidly verbose it is.

> if you're writing java, you'll be putting out a lot more than that due to how stupidly verbose it is

Being "verbose" means that each letter you type communicates fewer bits of information. If the bottleneck is putting ideas together then you would expect someone writing in a more verbose language to type more letters per minute but still take a similar amount of time to communicate the idea.

In practice most Java programmers are using IDEs with good auto-completion, though, so aren't actually needing to type as many letters as you'd think.

What might be more relevant, however, is the effect of verbosity on comprehension. Syntax highlighting can help, and humans will likely naturally perform input compression (chunking), but this must be learned.

Like the sibling comment mentioned, it seems verbosity hinders reading comprehension rather than writing. Many IDEs understand this and hide some of the boilerplate.

This raises the question: if the IDE autocompletes the boilerplate for you, and also hides it, why is it needed in the first place?

In an autocomplete + simplify situation you usually type a little more than the simplified version before autocompleting, and then allow the hiding to apply a lossy simplification. E.g. when the simplification is substituting a localization ID with the English text, there might be multiple IDs mapped to identical texts, and then the difference between the IDs is lost in the simplified representation. Those differences might be important, for example if the IDs are for different semantic concepts that map to separate texts in other languages. These simplifications are valuable because they are allowed to be "leaky": they can make the easy case even easier by leaving the tough ones to the underlying raw, unsimplified truth.

Consider the verbosity of XML compared to s-expressions. <html>...</html> vs (html ...)

The latter can trivially be used to output the former. The conclusion is obvious: some of these formats are objectively more verbose than others while having equivalent expressive power.

Interestingly, depending on the context I find one or the other to be more readable. XML is great when you've got a lot of content, because the closing tag provides additional context about where you are, but it's not as great when you don't have a lot of data, since the closing tag is just visual clutter.

So you're saying you're not a fan of RequestProcessorFactoryFactory.StatelessProcessorFactoryFactory?


I think you mean InternalFrameInternalFrameTitlePaneInternalFrameTitlePaneMaximizeButtonWindowNotFocusedState


That bit of text is 64 characters long. Do people who write in these languages just totally abandon line length restrictions? In Python, PEP8 dictates that lines of code be kept to 79 characters. Personally, I wrap at 72 characters in Python unless the entire statement can be placed on a single line of 79 characters or less.

I also think that in writing the bottleneck is in how fast and accurate your hand can move. I would agree that English is faster than Spanish because Spanish is more verbose.

But Spanish speakers also speak faster than English speakers. This isn't the first time that a study has shown that spoken languages seem to convey data at a reliable rate, and that the verbosity of a language seems to be inversely proportional to the speed at which it's generally spoken.

This is really cool. I am working in a related area and I think most of us have assumed that on average, the information rate is 'about the same' for the languages across the world. So it's exciting to see that their results confirm this assumption.

Two qualifying remarks.

1) The 'about the same' is important. Even in their data, there is still quite some variance. They found an average of 39 bits, with a stdev of 5. That means that about 1/3 of the data falls outside the range of 34-44 bits.

2) Which brings me to the uniform information density (UID) hypothesis. According to the UID, the language signal should be pretty smooth wrt how information is spread across it. For many years, the UID was thought to be pretty absolute: even across a unit like a sentence, it was thought that information will spread pretty evenly. Now, there is an increasing amount of research that shows that esp. in spontaneous spoken language, there is a lot more variance within the signal, with considerable peaks and troughs spread across longer sequences.

Why did everyone assume it would be the same on average? This seems weird to me.

Also, can you explain more about how the information density was calculated? Anything at the bit level seems crazy small to me. Words convey a lot of information. They cause your brain to create images, sounds, emotions, smells, etc. I guess we're calling language a compression of that? But even still, bits seems small.

> Why did everyone assume it would be the same on average? This seems weird to me.

(see edit below; but i leave this up; it might be interesting, also) you mean that even for smaller sequences, the UID holds, right? the assumption was that even for a single sentence, there are a lot of ways to reduce or increase information density so that you get a smoother signal. e.g.: "It is clear that we have to help them to move on.", you could contract it to "it's clear we gotta help them move on" and contract it even further in the actual speech signal ('help'em'). or you could stretch it: "it is clear to us that we definitely have to help them in some way to move on", or alike. the assumption was that such increases / decreases would even be done to 'iron out' the very local peaks and troughs, particularly in speech.

bits: yeah, that took me a while to get used to, as well. the authors used (conditional) entropy as a way to measure information density (which is a good measure in this instance imv). and bits is just per definition the unit that comes out of information theoretical entropy: https://en.wikipedia.org/wiki/Entropy_(information_theory) . btw: while technically possible, i don't think that the comparison in the summary article between 39 bits in language and a xy bit modem is a helpful comparison. bits in the context of entropy are all about occurence and expectation in a given context. bits of a modem/in CS, they represent a low level information content for which we do not check context and expectation.
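to make the 'bits' concrete: an event with probability p carries -log2(p) bits of surprisal, and entropy is just the average surprisal. a minimal illustration:

```javascript
// surprisal: the information content, in bits, of an event with probability p
const surprisal = p => -Math.log2(p);

console.log(surprisal(0.5));   // 1 bit: a fair coin flip
console.log(surprisal(1 / 8)); // 3 bits: one of eight equally likely syllables
console.log(surprisal(0.99));  // ~0.0145 bits: expected, so almost no information
```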

edit: ah, i realise you are asking why most in our community assumed that this universal rate applied across languages, right?

i guess the intuition was that all of us humans, no matter what language we speak, use the speech signal to transmit and receive information and that all of us have the same cognitive abilities. so the rate at which we convey information should be about the same. sure, there are probably differences according to some factors (spoken vs written language, differences in knowledge between speakers, etc.). but when the only factor that differs is English vs Hausa, esp. in spontaneous spoken language, then the information rate should be about the same.

> esp. in spontaneous spoken language, then the information rate should be about the same.

This is entirely non-intuitive to me. I would think with language evolving that some would be faster than others. If language starts as conveying extremely simple thoughts then it should take longer to convey certain things. I would then assume that as the language develops it gets better at conveying ideas. I would think that thoughts could go much faster than how we process it with language. Like I have constant thoughts that are really fast and can be complex. There's no internal dialogue there. But when I think with an internal dialogue it is much slower.

I think there is a distinction between "flux of incoming information" and "net knowledge gained by human as a result of incoming information".

After a few cocktails, once or twice, I've wondered with friends whether some "fuzzy" information rate constant might be a reference by which our brain understands the passage of time. In other words: if there is a fundamental processing rate of x/time, then theoretically, wouldn't our brains subconsciously use that for all kinds of neat reasons?

And the rate wouldn't have to be the exact same value for each individual, so long as the brain can attune its specific value to other reference points to time in nature.

So here is my own experience. I was an avid audiobook fan for the last 3 years, and a while ago some guy on Reddit told me how he listens to books on Audible using the high-speed option, like 2.x. I never tried that before last summer, since at higher speeds speech became incomprehensible to me.

What this guy told me is that it just takes time to adjust to it. So I basically started to listen to books at a slightly higher speed. Then I gradually increased it, and in a few days I could handle 2.0x speed no problem while listening to really complex fantasy (Malazan Book of the Fallen [1]). After two weeks I could handle 2.5x without a problem.

In the beginning it was harder to comprehend at high speed while walking or crossing the street, since I lost attention, but in a few months I could do anything while listening without missing any information or emotion from the narrator.

To give an example of how far this can go: this spring I was listening to The Expanse audiobook [2] at 4.0x speed. With some effort I could go even faster, like 5.x, in the case of these particular books, but obviously can not keep up for long.

I still usually listen to books at 2.0-3.0x depending on the narrator and the audio quality, and this skill doesn't go away even if I have an extended time between books, like a month or so.

[1] https://www.audible.com/pd/Reapers-Gale-Audiobook/B00M4LRBY6

[2] https://www.audible.co.uk/pd/Abaddons-Gate-Audiobook/B00T6NZ...

UPD: Edit. s/can keep up/can not keep up/

One thing I'd also like to develop / wish was integrated into audible and the like is silence trimming. Some speakers leave outsized pauses in their narration which can be significantly shortened effectively increasing speed with less distortion.

I have the opposite problem, where I have trouble paying attention to an audiobook at 1x. I get bored in between words and my mind wanders, making it very difficult to keep track of what is being said (as in, I hear individual words but have trouble keeping sentences in memory when everything comes too slowly).

I wish I had realized this in university and had been able to somehow record and playback lectures at 2x. I always got so little out of lectures because the information wasn't coming in fast enough for me to process correctly.

Overcast (a podcasting app) has great features to optimize the high speed listening experience. They have variable speed, a great silence trimmer, and a voice boost that makes speech clearer.

I just wish I could listen to arbitrary audio with Overcast. As it is I have a blog set up to feed Overcast audio that I give it, but it feels super clunky.

> One thing I'd also like to develop / wish was integrated into audible and the like is silence trimming.

I don't really use Audible, but if you're looking for a good audiobook player on Android, here is one that can do this:



Can you include the app name? I'm blocked from both links.

Voice Audiobook Player

Blind people use screenreader software sped up so fast as to be indecipherable to untrained ears. The screenreader can give them near instantaneous feedback about where on the screen they are and what's there when it's so sped up, and with a bit of practice perceiving the sped-up speech imposes no burden at all.

I can tell from experience that when I lie down in my bed with my eyes closed I can comprehend speech at a much higher speed than I can while walking on the street. No surprise blind people can handle it better, even though I have no clue how exactly it works in relation to the brain.

I was always curious to do actual research / a paper on this kind of thing, but as a non-scientist I simply have no time to do so. So I'm happy someone is actually doing it.

Ever had to turn the radio/music in your car off in some tricky situations?

I don't drive, but I get what you mean.

Concentration is crucial here.

Side question: I wonder if anyone has actually finished the entire Malazan series. It takes some serious dedication. I would be curious if the story still makes sense to you by the end when listening at that speed.

> Side question: I wonder if anyone has actually finished the entire Malazan series. It takes some serious dedication.

I only finished Malazan Book of the Fallen, the first two Tales books, and all The Path to Ascendancy books. I also started Forge of Darkness, but was too preoccupied with my life to finish it.

Honestly, Esslemont's books are just weaker overall. The Path to Ascendancy was much better, but the 3rd book is just too rushed.

> I would be curious if the story still makes sense to you by the end when listening at that speed.

Speed has no effect on the story at all. Basically, after you practice it for a bit you even get every emotion the narrator is trying to put into his speech.

As for the story in general, it makes more and more sense the closer you get to the end. It's a masterfully crafted world with a great theme of compassion, and even though I finished it more than a year ago I still have a flashback or two from time to time, since I loved some of the characters. Malazan is certainly one of my favorite book series.

Yet keep in mind there is an abundance of information and events, as well as unreliable narrators, which can confuse your view of the story lines.

Not OP, but I am currently on Reaper's Gale.

Malazan quickly became my favourite book series (and I am not even a fan of fantasy). It was hard initially. But it gets better.

However, I think that a re-read is a must if you want to fully grasp the whole thing.

Why would you want to do that though? Isn't the experience of listening to it the point? If not, why listen to it at all instead of reading a detailed summary?

  > Why would you want to do that though?
Because I don't just listen to books for the enjoyment of the process itself. I love complex stories with hundreds of characters and plot lines across many books. Reading something like Malazan or Wheel of Time is like a journey into another world for me, and I'm deeply immersed in these worlds while exploring them. Yet the amount of free time I have is limited, so getting more information in a short period of time is very convenient.

I totally get it when some people just love to read books slowly while enjoying their coffee or looking at nature, but I'm into books for the stories, and the format of fast-paced audio is fine for me.

  > Isn't the experience of listening to it the point? If not, why listen to it at all instead of reading a detailed summary?
I feel like you imply that by listening at high speed I miss some part of the experience. Yet other than the voices being slightly distorted (after some practice it's the same voices, but faster), I get exactly the same experience as any person who listens to or reads the unabridged book.

On the other hand, detailed summaries are not the thing the author designed, but someone else's retelling, which is usually far from perfect.

I'm a bit confused, here. (I went and looked at the original paper.) They estimated information density for each of the subject languages as a whole, on average:

> In parallel, from independently available written corpora in these languages, we estimated each language’s information density (ID) as the syllable conditional entropy to take word-internal syllable-bigram dependencies into account.

But the experiment uses the same text translated into each language! Why introduce this extra variable (and source of error) of estimated language-wide information density, if you are controlling your experiment such that you have the exact same information encoded in each language? That is to say, why use an _estimated_ information density when you could measure it exactly for the texts that are being spoken? Or, conversely, why go to all the trouble of having the speakers read the same text translated into each language, if you aren't going to make use of that symmetry?

Information depends on probability. If something is very probable then it doesn’t have much information (because you already saw it coming). If something is improbable then it has a lot of information.

In the paper they want to know how much information is in a syllable in context. To do that they need to know the probability of each syllable given the previous syllable. To estimate that probability distribution, you need to look at a lot of text, much more than just the passages that the authors used to measure speech rate.
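For anyone curious what "syllable conditional entropy" looks like in practice, here's a minimal plug-in estimator over a toy corpus (the syllables and sequences below are made up for illustration; the paper estimated this from large written corpora):

```python
import random
from collections import Counter
from math import log2

def conditional_entropy(syllable_seqs):
    """Plug-in estimate of H(S_i | S_{i-1}) in bits, from syllable sequences."""
    pair_counts, prev_counts = Counter(), Counter()
    for seq in syllable_seqs:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    total = sum(pair_counts.values())
    # Sum over bigrams: -P(prev, cur) * log2 P(cur | prev)
    return -sum((n / total) * log2(n / prev_counts[prev])
                for (prev, cur), n in pair_counts.items())

rng = random.Random(0)
# Toy "languages": uniform random syllables from inventories of size 2 vs 4.
# A bigger inventory means more bits per syllable (log2(2)=1, log2(4)=2).
small = [[rng.choice(["ba", "na"]) for _ in range(50)] for _ in range(200)]
large = [[rng.choice(["ba", "na", "ko", "ti"]) for _ in range(50)] for _ in range(200)]
print(conditional_entropy(small))  # ~1 bit/syllable
print(conditional_entropy(large))  # ~2 bits/syllable
```

With real text the conditional entropy comes out well below log2 of the inventory size, because some syllable transitions are far more likely than others.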

Good question!

I suppose that the experiment wants to capture the actual 'information density' of the language, and hence looks at the full language. Then, they want to avoid any modification in speech rate due to the semantics of the spoken text.

This does not make sense for a hypothesis where the actual bit-rate of speech tends towards 39 b/s. That is, when your text happens to convey more bits, you slow down.

However, for an alternative hypothesis, this design does make sense. The idea here is that a language naturally converges to a speech-rate that gives 39 b/s. The idea here is that the actual speech-rate is much more constant, and just drops until it becomes too fast. For that, I'd argue you don't want the mean bit-rate but something like the 90th percentile bit-rate. Because it seems to me that speech-rate that is 'too fast' more than 10% of the time would not really be natural.

The researchers obviously have to keep the scope narrow in order to get numbers at all.

That said, we should be aware that a tech nerd audience will find simple answers to complex non-tech questions appealing, and we should not over-estimate our understanding here just because we have a number.

There is a large amount of data transmitted through sub-communication and context, particularly during an in-person interaction, which is what people are wired for. Overall tone, body language, eye contact, and various social cues make up the bulk of data being transferred in many interactions. There's a reason why talking to some people feels exhausting and others invigorating, and it's not just the transcript.

We can avoid reading too much into the study by just remembering the error bars. It's not like 39 is a universal constant. It's more like 39 with a standard deviation of 6. That's a wide spread, but it's less wide than the spread you get from syllable rate alone, and that's all the study quantitatively tells us.

What are the things that make the difference between invigorating and exhausting?

Claude Shannon (of information theory fame) did similar research in his 1951 paper "Prediction and Entropy of Printed English". He used a particularly clever idea: leaving out words or letters from English text, he measured how accurately people could predict the missing text, then used those numbers in a statistical, information-theoretical analysis to estimate the information density at about 9-11 bits (IIRC) per letter.

Looked it up, 11.82 bits per word, 2.62 bits per letter.

Oh! Yes you're right. My mistake. That makes a lot more sense too, it would be weird if a letter was more bits than a byte, hm? :-)
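For a sense of where those numbers sit: a uniform 26-letter alphabet would be log2(26) ≈ 4.7 bits/letter, and single-letter frequencies alone already bring that down to roughly 4.2; Shannon's context-based estimates push it far lower. A quick sketch using commonly cited (approximate) English letter frequencies:

```python
from math import log2

# Commonly cited approximate English letter frequencies, in percent.
freq = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
        'q': 0.1, 'z': 0.07}

total = sum(freq.values())
# Zeroth-order entropy: ignores all context between letters.
H = -sum((f / total) * log2(f / total) for f in freq.values())
print(f"zeroth-order letter entropy: {H:.2f} bits")  # ~4.2 bits
```

The gap between ~4.2 bits here and Shannon's ~2.6 (or ~1.3 with long-range human prediction) is exactly the redundancy contributed by context.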

I'm not sure what the review process is for the source, but the paper [1] is pretty interesting. Lots of cool findings/references in there like:

- There is a measurable difference in information density based on the sex of the speaker

- Syllables were chosen as the base unit of measurement because morphemes (words) are too big/linguistically varied and phonemes (sound equivalent of letters) are too small and likely to be dropped in regular speech. I'd like to see the same analysis using phonemes to see how it changes, especially between dialects.

[1] https://advances.sciencemag.org/content/5/9/eaaw2594

In my initial Army job I had to learn how to copy Morse code. I got up to a respectable 23 groups per minute (a group is 5 characters). If I remember right, beyond about 13-15 GPM, once the code stopped transmitting I would still write for another 20 seconds or so. It was not something I consciously did, it just backed up in my brain.

I doubt that there's a universal rate - a universal mean (within the context of a shared language) makes more sense.

Every conversation acts as its own handshaking algorithm from which context is derived, and contexts will vary greatly in terms of amount of language required to convey concepts.

Jargon rich conversations between experts have the potential to transfer information at a rate far greater than average.

So basically human speech processing is a noisy channel with limited bandwidth. If the language we encode in compresses heavily, we need to transmit with fewer errors; but if it compresses lightly (or is outright redundant), we can support a high rate of transmission and still correct for more errors.

Which is kind of neat. Thank you Claude Shannon!
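That redundancy/error-correction trade-off can be sketched with the simplest possible channel code, a 3x repetition code over a binary symmetric channel (the error probability and message length here are arbitrary, just for illustration):

```python
import random

def repeat_encode(bits, n=3):
    """Trade bandwidth for robustness: send each bit n times."""
    return [b for bit in bits for b in [bit] * n]

def repeat_decode(coded, n=3):
    """Majority vote recovers each bit unless a majority of copies flipped."""
    return [int(sum(coded[i:i + n]) > n // 2) for i in range(0, len(coded), n)]

def noisy(bits, p, rng):
    """Binary symmetric channel: flip each bit with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

rng = random.Random(42)
msg = [rng.randint(0, 1) for _ in range(1000)]

# Uncoded ("fully compressed"): every channel error corrupts the message.
raw_errors = sum(a != b for a, b in zip(msg, noisy(msg, 0.05, rng)))

# 3x redundancy: one third the effective rate, but most errors are corrected.
decoded = repeat_decode(noisy(repeat_encode(msg), 0.05, rng))
coded_errors = sum(a != b for a, b in zip(msg, decoded))
print(raw_errors, coded_errors)
```

Natural languages sit somewhere in between: enough redundancy that a noisy room or a thick accent doesn't destroy the message.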

>Scientists started with written texts from 17 languages, including English, Italian, Japanese, and Vietnamese. They calculated the information density of each language in bits—the same unit that describes how quickly your cellphone, laptop, or computer modem transmits information.

I feel like the whole "bits" calculation is a neat way to get into the media, but not actually related to "information density".

Edit: Been informed I'm deeply ignorant on Information Theory.

Bit is the unit usually used for describing entropy, information density, and channel capacity, regardless of whether computer systems are involved. See https://pdfs.semanticscholar.org/d554/d933fd78e19d51d68ac4ea... and, more generally, https://en.wikipedia.org/wiki/Channel_capacity and https://en.wikipedia.org/wiki/Information_theory.

The field of Information theory effectively began with Claude Shannon. The same formalisms he developed are used outside computer science--linguistics, physics, microbiology, etc.

Human speech uses the extremely dense lossy compression method called "A Lifetime of Experience and Biases"

It's only lossy when the decoder's dictionary doesn't have a similar volume of data points to properly reinflate the data stream. Things that didn't make sense to me as a teen make far too much sense now because my experience can properly contextualize the old data.

Or lossy when the decoder has a different volume of data points and extracts totally different information or even info that never existed in the first place. Kinda like those anime/retro game upscaling filters, except with way higher variance on what comes out between decoders, or the same decoder at different times. Gives me an idea for a new JPEG decoder with a floating ADC as input.

This has actually been a fairly popular idea within linguistics for a long time, and is absolutely relevant to information density. Problems like these need to be better understood for natural language understanding and translation, and 'bits' provide a good baseline measurement.

It's hard to translate to bits, but there are definitely different information densities between languages. E.g., when you write a paragraph in Spanish, the total number of keystrokes you'll need is higher than when writing the same thing in English.

The bit is the base unit of information theory.

Theoretically a summarisation algorithm (or neural network) should reduce sentences of similar content from different languages to similar sizes no?

The authors use syllable conditional entropy to estimate information density, and entropy is measured in bits (or shannons, if you prefer).

If your language is complex enough (and it's my understanding that Ithkuil is[0]) that the emitter takes more time to translate thoughts into it, the transmission rate is not going to rise regardless of information density (remember rate is over time).

I guess if we consider non-real time communication, but in that case (e.g. in English, which is limited by the medium's rate) the reception rate is the main factor, which is probably not too far off the transmission rate.

I'd say Ithkuil is designed for information density and my guess is its actual max rate is pretty similar to the submission's 39bps.

[0] IIRC not even its creator is a fluent speaker.

From the paper [0]:


>Language is universal, but it has few indisputably universal characteristics, with cross-linguistic variation being the norm. For example, languages differ greatly in the number of syllables they allow, resulting in large variation in the Shannon information per syllable. Nevertheless, all natural languages allow their speakers to efficiently encode and transmit information. We show here, using quantitative methods on a large cross-linguistic corpus of 17 languages, that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually: Languages are more similar in information rates than in Shannon information or speech rate. These findings highlight the intimate feedback loops between languages’ structural properties and their speakers’ neurocognition and biology under communicative pressures. Thus, language is the product of a multiscale communicative niche construction process at the intersection of biology, environment, and culture.

[0] https://advances.sciencemag.org/content/5/9/eaaw2594

The Micro Machines commercial guy was transmitting at at least 900 baud.


John Moschitta held the title of World's Fastest Talker at one point: http://nymag.com/speed/2016/12/is-the-micro-machines-guy-sti...

I wonder how much face-to-face communication adds to the effective transmission rate where the physical gestures, body language, winks, and other non-verbal communication are taken into account.

Seems reasonable, from what is known about the capacity of speech, although extra cool if there is really some universal number we converge on. Kuyk and Kleijn found the upper bound of the speech channel to be 100 bits per second, in agreement with the lexical rate of about half that in "ON THE INFORMATION RATE OF SPEECH COMMUNICATION" (https://ieeexplore.ieee.org/document/7953233) using an interesting method based on mutual information.

Some highlights from the introduction that are relevant:

Broadly speaking, two approaches to measuring the information rate of speech exist: the linguistic approach, and the acoustic approach. The linguistic approach describes speech as a sequence of discrete perceptual units such as phonemes, words, or sentences. Taking the average talking speed as 12 phonemes per second [3], and using the English phoneme probabilities tabulated in [4], the lexical information rate is approximately 50 b/s. When the dependencies of the phonemes are accounted for the rate will be decreased further. The lexical information rate does not include information about talker identification, emotional state, and prosody. However, these variables vary relatively slowly in time and contribute little to the overall information rate. As an example, [5] estimated that the total amount of talker-specific information (e.g., age, accent, sex) was of the order of 30 bits

As an interesting side note, ham radio operators heavily use a digital mode called PSK31. It stands for "Phase Shift Keying, 31 Baud".

As I understand it, the 31 bit/s transmission rate was chosen because it is close to the entropy that operators can generate by typing on their keyboards. PSK31 does not transmit 8 bit bytes, but instead uses what they call a Varicode, a kind of Fibonacci code. More frequent characters are encoded using fewer bits, thus the encoded bit rate is an approximation of the entropy in the text stream.
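A sketch of the Varicode idea: because no codeword contains "00", the pattern "00" can serve as an unambiguous character separator, and frequent characters get short codes. The codebook below is illustrative only, not the actual PSK31 table:

```python
# Illustrative codebook (NOT the real PSK31 Varicode table). The key
# properties it shares with Varicode: every codeword starts and ends
# with 1, contains no "00", and common characters get shorter codes.
CODE = {'e': '11', 't': '101', 'a': '111', 'o': '1011', ' ': '1'}
DECODE = {v: k for k, v in CODE.items()}

def encode(text):
    # "00" between codewords marks character boundaries in the bit stream.
    return '00'.join(CODE[c] for c in text) + '00'

def decode(bits):
    # Codewords never end in 0, so stripping trailing zeros removes
    # exactly the final separator; then split on the "00" boundaries.
    return ''.join(DECODE[w] for w in bits.rstrip('0').split('00'))

print(encode("a tea"))
print(decode(encode("a tea")))
```

The effect is the same as in the parent comment: average bits per character tracks the character's frequency, so the on-air bit rate approximates the entropy of the typed text.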

I think it also really comes down to what videos or audio you listen to. In other words, the person delivering the audio and its quality.

I listen to most non-entertainment videos / podcasts at 1.5x and I would say about 80% of them are completely ok to comprehend from a "can I easily listen to this without struggling to figure out what they are saying" standpoint. But as soon as I try 2x, that drops down to maybe 30-50%, because the person isn't speaking clearly enough, they have an accent that overpowers any comprehension ability on my part, or the audio quality is too poor and introduces too many artifacts.

Sometimes I ask my friends to watch the same video at 2x to see if they can comprehend it and often times they can't (but sometimes they can). We're all in the same area.

I generally find a neutral accent and very clear enunciation help the most. I've had a bunch of people say they've watched my videos / courses at 2x without issues because apparently I have no accent, which is something I've heard from a number of people in different countries where English isn't their native language. I find it interesting because I've also heard a decent number of people say I speak very fast at 1.0x speed, so I do believe accents and enunciation have at least some role in this.

Does anyone know of any software where you can feed it an English audio sample and it spits back the number of syllables per second? Seems like a pretty cool potential ML project.

Where are you from if people think you don't have an accent, if you don't mind me asking?

Long Island, New York.

I'm not trying to plug my channel but here's my latest public video from the other week: https://www.youtube.com/watch?v=Kq_khHWovl4

I do believe audio quality plays a -huge- role in this.

For comparison, here's the most recent Railscasts video from a few years ago by another screencast author: https://www.youtube.com/watch?v=urPi4qZJeOE

I can deal with him at 2x but it's mentally taxing because his audio has a metallic wispy sound at that speed and it makes his words sound blended together. I think he also talks slightly slower than me at 1.0x as well, so it's not just base talking speed. Does anyone else notice that metallic sound too?

Here's another sample of Joe Rogan and Bill Burr on a podcast: https://www.youtube.com/watch?v=cS1KWv0das8

Listening to them talk at 2x feels like a joy. They are talking a little slower since it's a casual conversation, but the audio is crystal clear and both of them have very good word enunciation (not surprising since they talk on stage for a living).

I agree, you have a very neutral American accent, not much like a New Yorker at all. Other things I noticed include that you are indeed pronouncing each word very clearly, but also that you're speaking a bit more slowly than I would consider a typical speech rate (though not necessarily slower than the average Youtube video). In this video, you're also speaking at a bit higher pitch than a typical male voice, which I think further aids clarity because "sharper" syllables come out very clearly instead of being muddled like they are in very deep voices.

That said, I couldn't comprehend you well at all at 2x speed (using the Youtube controls). This might have just been due to distortion caused by the Youtube player on my computer, I'm not sure. At 1.75x you were still very clear, though I suspect at that rate I would find myself pausing the video now and then to think about what you were saying.

Thanks for the listen. All of my tests are always using Youtube's playback controls btw.

Were you able to listen to Joe Rogan's podcast at 2x? Skip somewhere in the middle and listen for 15 seconds maybe.

You are right in that I speak slower in that video. Most of my more recent Youtube videos are unscripted so I'm just thinking about things with zero preparation, where as I script my courses word for word (which leads to faster speaking generally) but I don't have any course videos with the same audio equipment to compare side by side.

I didn't read the paper but I wonder what they classify comprehension as. Personally I wouldn't listen to hardcore technical things at 2x because understanding the words isn't usually the goal of listening to it. It would be to fully absorb and understand what you're listening to so you can apply it on your own later. There's a big difference between a mechanical understanding of the words and really "getting" what you're listening to.

I typically reserve 2x for listening to tech talks where my goal is to get a high level overview of something quickly.

> Were you able to listen to Joe Rogan's podcast at 2x? Skip somewhere in the middle and listen for 15 seconds maybe.

Hmm, I skipped around to a few different spots. About half the time it was intelligible at 2x, but as soon as they started speaking faster it became a garble. Occasionally they would speak fast enough that I couldn't catch it even at 1.75x. So I'd say they have a lot more variability in their pacing than you do.

I scoffed at the idea that somebody from Long Island would have no accent, but after listening for myself I have to agree. Interesting.

Typing at 60 words/minute, 5 characters/word and 8 bits/character gives a gross bit rate of 40 bits/second.

Of course, the information rate is a lot less, since there are fewer than 8 bits of information per character in English. The paper says "from 4.8 bits per syllable for Basque to 8.0 bits per syllable for Vietnamese" and there are multiple characters per syllable. So the typing information bit rate is probably somewhere around 10 to 15 bits/second.

IIRC 1.3 bits per character is the value used as a rule-of-thumb for english text entropy, which would yield 6.5bps at 60WPM and an average stenographer can go at 4x that rate for approximately 26bps.
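The arithmetic from the two comments above, spelled out (5 chars/word and 1.3 bits/char are the rule-of-thumb assumptions from this thread, not measured values):

```python
# Gross vs. information rate for 60 WPM typing.
wpm, chars_per_word = 60, 5
chars_per_sec = wpm * chars_per_word / 60   # 5 chars/s

gross_bps = chars_per_sec * 8.0             # 8 bits/char stored -> 40 bits/s
info_bps = chars_per_sec * 1.3              # ~1.3 bits/char entropy -> 6.5 bits/s
steno_bps = info_bps * 4                    # stenographer at ~4x -> ~26 bits/s

print(gross_bps, info_bps, steno_bps)
```

Both estimates land below the paper's 39 bits/s for speech, which fits the intuition that typing, not reading, is the bottleneck for text.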

I took a peek at the paper and discovered the following: among the languages they studied, English was second only to French in the number of bits of information per second conveyed on average. (English also had significantly fewer syllables per second than French.)

Thus, if you're choosing a language to communicate in on the basis of how fast it is to get an idea across, English and French are likely your best choices! (Among languages in the survey.)

> Each participant read aloud 15 identical passages that had been translated into their mother tongue

I doubt there is an objective way to ensure that no information is lost or smuggled in when you translate a text into another language. For example, English has more than 100 words for 'walk', whereas Toki Pona (a constructed language known for its extreme simplicity) has only one. But does 'stroll' encode the same amount of information as 'tawa'? Depends on what you want to use that information for, I guess. If you only want to know where I went this morning, they are equally good. If you want to know that my act of going there was a recreational activity, possibly part of my morning routine, they are not.

It's my interpretation that by deciding the bits of each word, they included subtext/classification that would differentiate 'walk' and 'stroll'.

> Scientists started with written texts from 17 languages, including English, Italian, Japanese, and Vietnamese. They calculated the information density of each language in bits—the same unit that describes how quickly your cellphone, laptop, or computer modem transmits information. They found that Japanese, which has only 643 syllables, had an information density of about 5 bits per syllable, whereas English, with its 6949 syllables, had a density of just over 7 bits per syllable. Vietnamese, with its complex system of six tones (each of which can further differentiate a syllable), topped the charts at 8 bits per syllable.

How can you encode 643 syllables using 5 bits? Same for 6949 syllables / 7 bits?

If I understand this correctly, it isn't that they are uniquely encoding each syllable. It's that they are encoding the information in each syllable. Many syllables have very low information content and must be combined with other syllables to convey information. Many other syllables are redundant.
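Put another way, bits-per-syllable here is an entropy (an average), not a fixed-length code width. A fixed-length code for 643 syllables would need ~9.3 bits, but a skewed usage distribution averages far less. A sketch with a hypothetical Zipf-like distribution (the exponent 1.3 is made up for illustration, not the paper's measured distribution):

```python
from math import log2

n = 643  # Japanese syllable inventory from the article
print(f"fixed-length code: {log2(n):.2f} bits/syllable")  # ~9.33 bits

# Entropy is the *average* information: with a skewed (Zipf-like)
# distribution over the same 643 syllables, the average drops well
# below log2(n), because common syllables carry few bits each.
weights = [1 / (r ** 1.3) for r in range(1, n + 1)]
Z = sum(weights)
H = -sum((w / Z) * log2(w / Z) for w in weights)
print(f"entropy of skewed distribution: {H:.2f} bits/syllable")
```

With this particular made-up exponent the average lands near the article's ~5 bits/syllable, even though rare syllables individually carry more than 9 bits.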

Being able to convey emotion with tone in speech likely adds a lot more "information".

The increased information density in tonal language appears to be offset by a reduced speaking rate [0].

[0] http://ohll.ish-lyon.cnrs.fr/fulltext/pellegrino/Pellegrino_...

I think jumelles was referring to something like "don't do that" uttered as fun, angry, etc.

So very true. Said in the right tone, "I don't care" can mean -many- things. Emotional intelligence is acquired (or not). 'Bit rate' means nada.

As well, the 'amount of information' conveyed depends on the environment, and the preparation of the speaker and the listener. Some speakers (not to mention any names) spew a lot of BS (not information) to sort through.

I'd argue that, in a medical environment, the word 'sponge' conveys less information than the word 'ebola'.

For the prepared, the cascade of necessary reactions in the brain can take little or much time to process. Some authors/speakers (authoritative) can pack -a lot to think about- in a few words. Like 'Where is everybody?'

> After noting how long the speakers took to get through their readings, the researchers calculated an average speech rate per language, measured in syllables/second.

> No matter how fast or slow, how simple or complex, each language gravitated toward an average rate of 39.15 bits per second

So does this mean that we are "understanding" only those 39 bits of syllables per second, or more like we are using those 39 bits to index something like an internal address space?

And if the latter is the case, how big would that address space be?

It would also be cool to see this complemented with the data rate (bits/second) of emotion communicated per second and see if that increases the total effective rate of communication between people.

The article explicitly concludes that the speaker is the bottleneck, not the listener. You can understand sped-up Youtube videos just fine.

Yes. And they say that we can listen to speech sped up to about 120%, which means 46.8 bits/second of syllables.

So then, does it mean we "understand" 46.8 bits of information, or that we are using those bit to address some other, maybe bigger/more complex or detailed, memory space?

I would be interested if Rapid Serial Visual Presentation breaks this rule for any language. My guess is that it wouldn't, but I would be fascinated to find out. I would guess that languages with more general density of information per word would force the reader to slow their words per minute rate. Anecdotally, I can read with RSVP at around 620 words per minute in English and retain general comprehension. Sadly, I don't know another language well enough to compare that information with anything.


I recall a moment when, as an adult, I thought it might be fun to watch some cartoons that were my favorite as a child. I actually ended up giving up fairly quickly because the rate of the speech in many cartoons is apparently geared to be very slow. I remember firing up the first episode of the classic Thundercats show and being really struck at how much slower the characters spoke than my memory of the show.

On the flip side, I found noted radio show host Diane Rehm to be virtually unlistenable; her rate of speech is sooooo incredibly slow. Her guests sound like they are all at 2x speed compared to her.

I'm pretty sure that this number changes as we age and our processing faculties gear up and then down.

Diane Rehm has spasmodic dysphonia, which affects the way she speaks. She's not just old.

True, but spasmodic dysphonia doesn't cause slow speech [1]. It's the source of other speech issues she's worked hard to overcome.

Her pattern of speech is in general among the slowest I've ever heard -- sometimes approaching single digit words per minute. She's not always so slow, there's interstitials and other moments when her speaking is just kind of slow not unbearable.

Despite my personal feelings, I think her pace of speaking is part of what her appeal was. After being assaulted by other media all day, her show can also be a very relaxing listen and was nearly always a very intelligent conversation.

1 - https://www.youtube.com/watch?v=SqzfsKMaLqk

I didn't know the bit rate, but I learned this studying computational linguistics. Languages that seem "fast" tend to have more phonemes per semantic unit than "slow" languages. In general, in languages with a smaller inventory of phonemes, more phonemes are required to make a "word", so they get spoken faster. So the number of semantic units per period of time remains fairly constant across languages.

Mm.Hmm... Not sure the researchers have yet encountered the Deep South (US). I estimate approximately 17 bits of information in "Mm.hmm..." based on the 128K possible meanings. It takes about 5-10 seconds for proper gestation, when uttered expertly during the course of a lively conversation.

So more like approximately 1.7 to 3.4 bits/s.


I don't get the methodology. If you have a short-story, and translate it into 10 languages (and translate very well), it should have the same amount of information in it in each translation, no?

Therefore the transmission rate will simply be proportional to the time to read the story. This idea contradicts what their study found, no?

This has at-least tangential interest to me as one of my side projects is to create a custom "Number Station". Maybe out of a Raspberry Pi or similar, I'm still very much in the preliminaries. Since I'm in the US I am aware of the FCC.

See also this story from 8 years ago, which was about a previous study by the same group


I once saw research that said English speakers utter about 4 words per second. Shannon once empirically determined each English word carried about 6 bits of information. For what it's worth.

I would love to see the impact of acronyms on information density. It would also be interesting to see how many bits per second the average human maximum is.

YMMV, TLAs increase s/n but leverage ROI on prefamiliarisation.

Language is symbolic, all words are pointers. Whether you collapse complexity through an Apollonian use of religious icons or through initialisations and acronyms likely matters little.


Some people you might expect to have a very high bps, like auctioneers, are actually using a sort of parlor trick to create the illusion of rapid speech (a form of intentional, skillful stuttering which, when done right, is perceived as very fast speech.)

I expect some people are capable of some small multiple of this average, but probably not anything seriously dramatic.

(Of course there seems to be no lower bound, as in the case of involuntary stuttering.)

I feel like there's an important follow up study in how much humans can compress data with respect to those bits (cat -> large -> mane ~= lion).

Are there weights applied for regional disparities? For instance sped up for American New Englanders and slowed down for American South Easterners?

I wish I could provide references... I recall a tutor stating that human speech can be encoded at a very similar rate. This was in 1985 I think.

I wonder if there's a constant average reading speed (in terms of bits per second) across written languages. Chinese is obviously more dense per symbol than English, but is each character processed more slowly as a result, preserving a constant data rate across languages? Also interesting to consider how much harder real-time, verbal translation would be if data rates differed.

YouTube at 2x with captions can speed this to less than 78 seconds. "Less than " to account for advertising.

But we convey a lot more at a much higher bandwidth, because we use parallel channels, each with its own low bitrate limitation.

Interesting, I did a similar analysis for music scores before and arrived at 30 bits per second.

So what the title says is: It may or may not have that rate. Moving on, nothing to see here.

I bet Elon Musk is going to use this data during his next Neuralink talk: the human brain is limited to 39 bits/sec when outputting speech, and a BCI would surpass this limitation by outputting to smartphone speakers directly.

So speech is a very ineffective way to communicate?

the "text" portion? or already with intonation etc?

39.15/64 = phi

and 64 seconds is relevant here because?

It's 64 bits. The rate of information is measured in bits, and so 64 was the closest binary (2^6) denominator to 39.15 bits/sec.

Of course, Φ is irrational and the occurrence of golden ratio conjugate is almost certainly just a coincidence…

Curiously, this is also why memes can convey complicated political or sociological ideas faster and more efficiently than a few paragraphs on the topic.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact