Researchers: It takes 1.5 MB of data to store language information (medicalxpress.com)
65 points by lelf on March 28, 2019 | 59 comments



I haven't read the paper itself, but the article makes it sound like they simply counted the minimum number of bits required to represent the English language. It mentions no neuroscience-specific insights, so the "brain" part of the title is quite misleading.


Aside from that - is there really a way to express neural connections in the brain in terms of bits and bytes?


No. We still have a poor understanding of exactly how neurons encode and store information.

This page shows some of the current theories: https://en.wikipedia.org/wiki/Neural_coding#Hypothesized_cod...


The problem isn't whether there is "a" way; the problem is that there's a multiplicity of ways and we don't know which one is best. Part of the reason we have no idea which is best is that we can't even construct any of them from a real brain; that is, the "brain scanning" technology the techno-rapture-style Singularity expects to develop does not even remotely exist right now. So we have nothing to experiment with or gain any experience with. The only encoding we'd even remotely have access to right now would be some rather brute-force listing of neurons and connection strengths, and we pretty much know such a thing must be incredibly redundant, but we have no ability right now to remove that redundancy. And we can't do it on a very large neural set anyway.

It has been done on small models, though, at least to some significant extent: https://en.wikipedia.org/wiki/OpenWorm


Brain scanning techniques and simulation software have a long way to go, but there have been significant advances made:

A full fly connectome: https://www.nature.com/articles/s41684-018-0183-8

Architecture of the Mouse Brain Synaptome: https://www.cell.com/neuron/fulltext/S0896-6273(18)30581-6

So we are past just a simple worm simulation now.


Thank you for the better links. I just meant that as an example.


Most definitely. ANNs do just that.

That said, they assume a computer, and a program to run them. Similarly, the brain probably has many complex lower structures that compress language storage, but the compression software is not trivial to store itself.


This describes how to summarize a language in a computer system.

I think the storage requirement for the semantics in particular is questionable. If you can refer to a particular concept somewhere in the brain, you only need a pointer to it. However, single pointers might not be sufficient.

The apparatus to learn a language might also not be so easily distinguished from the structure to store the language, or the structure to quickly recall the language. In that case, talking about storage is like talking about compressing Plato's world of concepts.


>This describes how to summarize a language in a computer system.

It describes how to summarize a language from an information-theory standpoint (which is something different).

There's a good reason to believe evolution would use something close to the most information-theoretically efficient implementation (and even if not, it's a good lower bound).


> There's a good reason to believe evolution would use something close to the most information-theoretically efficient implementation

Do you have a citation? I would be very interested to see it.

I hate to make analogies between computing and biological systems, but I immediately think of denormalization in databases, error-correcting codes, and database replication as examples of optimizations that take you farther from information-theoretic efficiency.

Information theory applies very straightforwardly to the study of a spike train at a synapse (source, channel, and receiver). I remain unconvinced that systems like memory make sense as information channels.


Shannon observed this in his work towards information theory. Humans are able to predict the next letters in a stream of natural language with accuracy approaching optimal. The original paper is here:

https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf

The summary at wikipedia:

https://en.wikipedia.org/wiki/Entropy_(information_theory)#E...

Notably, PPM, the best compression algorithm at the time, was not able to predict as well. So that's a quantitative demonstration of optimality in language performance that includes a faculty for understanding syntax and semantics, and presumably is linked to phonology, pragmatics, etc.
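
As an aside, here is a minimal Python sketch of the machine side of that comparison: a crude order-2 character model reporting cross-entropy in bits per character. The path "corpus.txt" is just an assumed placeholder for any large plain-text file; real PPM does considerably better than this, Shannon's human subjects better still, and note it trains and scores on the same text, so it is optimistic.

    # Crude order-2 character model: rough bits-per-character estimate for a text.
    # "corpus.txt" is an assumed placeholder path; substitute any large plain-text file.
    import math
    from collections import defaultdict, Counter

    text = open("corpus.txt", encoding="utf-8").read().lower()

    counts = defaultdict(Counter)              # 2-char context -> next-char counts
    for i in range(2, len(text)):
        counts[text[i-2:i]][text[i]] += 1

    total_bits = 0.0
    for i in range(2, len(text)):
        ctx, ch = text[i-2:i], text[i]
        c = counts[ctx]
        p = (c[ch] + 1) / (sum(c.values()) + 256)   # Laplace smoothing
        total_bits += -math.log2(p)

    print("bits per character:", total_bits / (len(text) - 2))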

Chomsky discusses optimality in a 1998 talk here:

https://youtu.be/7Sw15-vSY8E, at about 1h 7m and onwards.

He says language evolved rather quickly for such a significant function (estimating it arose ~200kya). He says that though many things in evolution are pretty messy, language is close to an optimal design. He doesn't provide references, but I think (given the time and academic associations; anyone know?) he may be referring to Optimality Theory:

https://en.wikipedia.org/wiki/Optimality_Theory

Separately, our knowledge of the anatomy of the language faculty is that it is highly localized in Wernicke's and Broca's areas of the brain. This also suggests a small modification of an existing brain structure.

Additionally, optimality in biological systems has been a productive research direction in a few areas:

- Optimality Principles in Biology - https://link.springer.com/book/10.1007/978-1-4899-6419-9

- Near-optimal energy use in the metabolism of cell division - https://arxiv.org/pdf/1209.1179.pdf - https://www.quantamagazine.org/the-math-that-tells-cells-wha...


I think we may be talking about different things. You're talking about the language itself. I was talking about the neurological structures that produce the behaviors that are language.

I poked through the optimality literature in biology pretty thoroughly a few years back. There is a great deal of interest, but except in a few cases where a simple piece of natural history is nicely described by an evolutionary game, the choice of criterion to optimize is generally arbitrary. Without an evolutionary mechanism and observations of forces driving that mechanism there is no reason beyond just-so stories to apply any particular criterion.


I see the distinction but I am following Chomsky, that they're aspects of the same thing; language is a faculty in a specific, recently evolved neurological structure. That's a bold theory, but it seems to have held up better than approaches where language is an ability learned through a general cognitive intelligence mechanism. Chomsky says in that lecture that it's separate (and compact and recent) and somehow coupled to the general intelligence and performative areas located elsewhere.

I think we're at a point with language and its neural mechanism where by analogy it looks like the phenomenon of flight being closely determined by the exact form and use of a wing. Almost any shape will not do, so there is a form closely (efficiently) following function. Of course there were preadaptations for both which provide very inefficient flight or communication abilities, but something flips and the adaptive utility of the function rapidly sculpts and elaborates the form in that direction.

About optimality, I generally agree, though I think following thermodynamic efficiency seems a deep inductive step, as that allows a measure of what is arbitrary or not, e.g. in the work of England that I cited.


> I am following Chomsky, that they're aspects of the same thing; language is a faculty in a specific, recently evolved neurological structure.

That is a bold hypothesis, certainly, but not one that you could use as a basis for argument at this point in time. I also deeply doubt it when viewed in the context of Wittgenstein's language games. A human and a sheep dog can very effectively engage in a language game, even subtle ones such as naming. Nor is it pure behavioral conditioning, as what makes various dog breeds better at particular things is selection for incidence of particular behaviors.

Then you go to a gorilla or a parrot, which can engage in quite abstract language games. Language games with parrots always make you aware of Wittgenstein's lion, though.

For Chomsky's hypothesis to be true, you would need to have a particular inflection point where language games become language, and for that inflection point to be reified as the evolution of a specific neural structure instead of exaptation of lots of aspects in the brain.

It feels too much like Watson and Crick's nonsense "one gene = one protein," which could have been discarded with a modest amount of thought before it was allowed to damage biology for decades.

> I think following thermodynamic efficiency seems a deep inductive step as that allows measure of what is arbitrary or not

Until you look at, say, bower birds and realize that conspicuous waste is a signalling mechanism for fitness. There you have an evolutionary game that rewards thermodynamic inefficiency.


> There's a good reason to believe evolution would use something close to the most information-theoretically efficient implementation.

People used to think that about genomes, until we got to know in some detail what's in them.


Explain.


As far as the purpose of DNA is to tell the rest of the cell what proteins to synthesize, over 85% of our DNA is never translated to protein [1]. But leaving aside our knowledge of DNA, biology rarely works in a manner that's best suited for functional efficiency. For examples, look at how the vagus nerve evolved, or look at vestigial organs. The question for the gene is rarely "is this the most efficient way of doing things?" Rather, it is "is this strategy good enough for me (and my descendants) to pass on my genes?"

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3205562/


That's a pretty cursory understanding of how DNA works. The "Junk DNA" hypothesis has long been invalidated.

The non-coding genome is a large part of what causes us to look different from a mouse or a fruit-fly. Gene regulation (what gene to transcribe, and when) is largely controlled by non-coding DNA: https://en.wikipedia.org/wiki/Regulation_of_gene_expression

There are energy costs associated with useless information being stored and transcribed in the genome... so it is pretty reasonable to believe that information theory does play some (however minor) role in how data is stored.


Neither I nor the paper I cited use the phrase "Junk DNA".

> The non-coding genome is a large part of what causes us to look different from a mouse or a fruit-fly.

Now that's just fantasy. There have been a lot of changes since our common ancestors with mice and fruit flies. We are all animals and obviously would share genes that perform analogous functions -- like common genes for controlling wing shapes and organ shapes. But would you like to point me to evidence in the human DNA that codes for fruit fly wings? And what role do you think having virus DNA in our genome plays in our development?

DNA encodes information. Information theory does play a role. But as mannykannot pointed out, the topic of the conversation is something else.


We are drifting away from the original issue here. 'However minor' is a long way from near-maximal efficiency, and in the context of the original claim, 'efficient' means compact encoding.

The fact that the DNA involved in gene regulation is mostly non-coding does not mean that most non-coding DNA has a regulatory purpose. If it did, changes in non-coding DNA would be more important, in aggregate, than changes in coding DNA are, with respect to evolution and disease, on account of its prevalence.


Junk DNA is a phrase biologists use to describe the long-held (and recently disproven) idea that non-coding DNA ("DNA not coding for proteins") is unimportant relative to coding regions (genes).

Most non-coding DNA does have a regulatory purpose. Things like copy number variations (where information is "duplicated") between promoters and enhancers can change the rate at which particular genes get transcribed. This is my point: viewing the genome as "information" in the simple sense of each ATCG nucleotide corresponding to bits is absurd. The genome is a geometric AND chemical entity and encodes information using both paradigms.

My point is not to say that you are wrong, but rather that you do not know. It is FAR from well established that the genome is inefficient at storing information.


I don't know, you don't know, but only one of us is using speculative arguments (and, in at least one earlier case, fantasy, as lake99 pointed out above) to support her opinion.

From an information-theoretic point of view (which is where we started), even if all non-coding DNA had some purpose, that would not be a sufficient basis for assuming it is close to the most information-theoretically efficient implementation. On the other hand, the fact that whole genomes can be handily compressed with straightforward Huffman encoding techniques settles the information-theoretic efficiency question, regardless of whether your speculation about the usefulness of all non-coding DNA is correct.
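
To make that last point concrete, here is a minimal sketch (with made-up base frequencies, not measured ones) of Huffman-coding a genome treated as a plain A/C/G/T stream; with four roughly equiprobable symbols you land at about 2 bits per base, versus 8 bits per base for the ASCII text of a FASTA file, and real genome compressors that exploit repeats do considerably better:

    # Huffman code over nucleotide frequencies (illustrative frequencies, not real data).
    import heapq

    freqs = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}

    # Heap entries: (weight, tiebreaker, {symbol: code-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1

    codes = heap[0][2]
    avg_bits = sum(freqs[s] * len(code) for s, code in codes.items())
    print(codes, "average bits per base:", avg_bits)   # ~2.0 with these frequencies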


I haven't had a chance to look at it yet but here are the links to the paper and associated Open Science Framework data if anyone is interested:

Paper: https://royalsocietypublishing.org/doi/full/10.1098/rsos.181...

OSF Page: https://osf.io/ga9th/


How would it be possible that 40,000 words translates to 400,000 bits?! Am I missing something here?


Compression I suppose. Like storing the word "compression" is easier if you already know the familiar "com-" prefix as well as the noun "pression". Which is itself easier to remember if you know the verb "to press".

I've been actively learning Portuguese and Russian lately; it's impressive how much faster I can pick up Portuguese vocabulary vs. Russian. And that's even for words that don't have an obvious cognate in languages I already know. The structure of the words, the various building blocks, are just so much more familiar in Portuguese. A word like "atrever" (to dare) doesn't have any obvious cognate in languages that I know, but it just "looks right" in a way that, say, "atverer" or "aterver" wouldn't. Those last two words sound distinctly un-Portuguese (I might say, un-Romance) to me. That makes it a lot easier to remember the spelling.

Eventually, as I grow my Russian vocabulary, I start making similar connections. Волноваться is pretty tricky to memorize on its own, but it becomes easier when you know that ~ся is the reflexive ending, ~ть is the common verb ending, волна means wave, and ~овать is a very common building block for Russian verbs.


"Compression" applies very much to grammatical variants of words. It's why the only bizarre irregular verbs in a language tend to be the ones that get used all the time - be, do, go etc - because for anything more obscure the brain just forgets the special case and applies the general rule.

Steven Pinker's book Words and Rules is a great layman-oriented read if this sort of thing interests you.


It also makes sense that the most common verbs are irregular in most languages: it lets us pick the direct word quickly, instead of deriving it the slower way, from a rule.

So, they're like "constants" vs calling a function to calculate a value.


Not sure I follow. If (big if) there were really a measurable advantage to having us "pick the direct word quickly, instead of the slower way of deriving it from a rule", it doesn't follow that irregularity makes that easier. I could memorize "goed" just as easily as I can memorize "went".


>I could memorize "goed" just as easily as I can memorize "went".

It wouldn't have anything distinctive for people to latch on to, so they would be constantly trying to derive it from the general rules for regular verbs.

E.g.

(a) all verbs regular -> instinctively go to (slower) rule derivation instead of memorization of all verbs, even for the most common ones.

(b) most frequent verbs being irregular -> instinctively retrieve them from the (faster) "lookup table" of memorizations, and bypass the rule based derivation for them.

I.e. having the clear distinction of irregularity makes it faster to go directly to that kind of "constant" memory.
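
A toy sketch of that lookup-table-vs-rule picture (just the computational analogy, not a claim about how the brain actually implements it): a small table of memorized irregulars consulted first, with the regular "+ed" rule as the slower fallback.

    # Toy model: memorized irregular forms first, regular rule as the fallback.
    IRREGULAR_PAST = {"go": "went", "be": "was", "do": "did", "have": "had"}  # the "constants"

    def past_tense(verb):
        if verb in IRREGULAR_PAST:       # fast path: direct lookup
            return IRREGULAR_PAST[verb]
        if verb.endswith("e"):           # slow path: derive from the regular rule
            return verb + "d"
        return verb + "ed"

    print(past_tense("go"), past_tense("walk"))   # went walked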

That said, this is not my theory, read it years ago in a cognitive/linguistic pop science article. This seems to say more or less the same thing:

https://en.wikipedia.org/wiki/Regular_and_irregular_verbs#Li...

In studies of first language acquisition (where the aim is to establish how the human brain processes its native language), one debate among 20th-century linguists revolved around whether small children learn all verb forms as separate pieces of vocabulary or whether they deduce forms by the application of rules. Since a child can hear a regular verb for the first time and immediately reuse it correctly in a different conjugated form which he or she has never heard, it is clear that the brain does work with rules; but irregular verbs must be processed differently.


I took it the other way around: it's easy to memorize "went" because you use it all the time. If on the other hand a much less common verb like "to satiate" had a very irregular conjugation then it would regularize pretty quickly because nobody but ultra-pedants would bother to remember the exception.

I think a decent real example of that is fiancé/fiancée. Those are French borrowings and have, at least originally, kept the French grammatical gender inflection. However, nowadays I often see people using either spelling in a gender-neutral way, since most people don't bother to learn French grammar for this one word.


>I took it the other way around: it's easy to memorize "went" because you use it all the time.

That still wouldn't explain the why of having it like "went" vs "goed".

Sure, it's easy to memorize because we use it all the time, but why have something to memorize in the first place, versus something like "goed"?

So, this theory (which I tried to convey above) said that its being irregular ensured we don't slow down trying to derive it from regular rules, but instead have fast access to a memorized form.

Couldn't we just memorize "goed"? If it's just "frequency of use" that mattered, "went" and "goed" would work just as well.

But the extra idea is that "goed", being regular, would be too easy for us to confuse with thousands of other regular verbs, and we would not use our "fast recall" mechanism, regardless of that verb being needed all the time.

Not sure if correct - read it years ago. This seems to be related to that:

https://en.wikipedia.org/wiki/Regular_and_irregular_verbs#Li...


Your point is interesting but I think you're falling for the same type of fallacies people often have regarding evolution (if you keep going into water for a long time over many generations you'll eventually grow gills!).

Natural languages are not designed, they evolve. The irregular nature of the conjugation of "to go" might just be a remnant of some archaic form and nothing else, in the same way some argue that the plural of "octopus" is "octopi" or "octopodes". Does it serve any linguistic purpose? I don't see how, but it won't stop some people from saying it. Look at the use of the subjunctive in "if I were you", which is one of the only occurrences of the subjunctive outside of set phrases in day-to-day modern English. Is it really necessary? If one were to say "if I was you", would it lose some additional nuance or meaning in practice?

I often see people trying to rationalize some language features (such as arbitrary genders of nouns in many languages) as error correction or some "optimization" but I'm generally unconvinced. Maybe "I went" is just the linguistic equivalent of a platypus, some weird byproduct of a very long evolution with no other intrinsic purpose in the grand scheme of things.


>Natural languages are not designed, they evolve.

But that's part of my point. I don't say irregular verbs were designed to be effective that way, but that they evolved to be effective.

Hence the link with "language acquisition" being involved -- when regular and irregular verbs developed, that wasn't a known theory some "language designer" could consciously follow. Just something innate that could develop because of an evolutionary advantage.

In fact, if someone merely designed a language, they'd probably go for all regular verbs, rather than regular + irregular, as it's "cleaner".

>I often see people trying to rationalize some language features (such as arbitrary genders of nouns in many languages) as error correction or some "optimization" but I'm generally unconvinced.

Tons of language features are indeed optimizations for different things. Cold climates, for example, have languages with fewer vowels (keeping your mouth closed more).


English cognate to "atreve" is "attribute".

~ов is an iterative suffix - like ~le or ~er in gamble and chatter. It's useful to know, because you can rationalise why it is always dropped in the present tense - you can't be iterative at the moment, unless you're an Englishman)


>English cognate to "atreve" is "attribute".

I didn't know that, but you'll grant me that it's not a very useful cognate (either in spelling or in meaning).

>~ов is an iterative suffix - like ~le or ~er in gamble and chatter. It's useful to know, because you can rationalise why it is always dropped in the present tense - you can't be iterative at the moment, unless you're an Englishman)

Very interesting, thanks.


If you're serious about russian, feel free to ping me.


The number of "bits" something takes is not an absolute value. It is relative to the encoding scheme that is being used for the bits.

Converting a large word list to a small number of bits has been a computer science hobby for a long time. Here's a pretty good search result to start working through for more details: https://duckduckgo.com/?q=building+a+small+spell+checker+suf... It was especially important to write small and fast spell checkers in the 1980s and early 1990s, when you couldn't expect to have enough RAM sitting around to simply load up a naively-encoded list of words, and the act of spell checking a few thousand words could take noticeable time.
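
One of the classic tricks from that era, as a minimal Python sketch (the word list here is made up for illustration): front coding a sorted word list, i.e. storing each word as the number of characters it shares with its predecessor plus the differing suffix. Real spell checkers combined this with tries/DAWGs, affix rules, and hashing, but even this alone shrinks a dictionary considerably:

    # Front coding: store (shared-prefix length, remaining suffix) for each word in sorted order.
    words = sorted(["compress", "compressed", "compression", "compressor", "comprise"])

    def front_code(sorted_words):
        prev, out = "", []
        for w in sorted_words:
            n = 0
            while n < min(len(prev), len(w)) and prev[n] == w[n]:
                n += 1
            out.append((n, w[n:]))
            prev = w
        return out

    encoded = front_code(words)
    print(encoded)   # [(0, 'compress'), (8, 'ed'), (8, 'ion'), (8, 'or'), (5, 'ise')]
    raw = sum(len(w) for w in words)
    coded = sum(1 + len(suffix) for _, suffix in encoded)   # 1 byte for each prefix length
    print(raw, "bytes raw vs", coded, "bytes front-coded")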

So in an encoding scheme chosen to represent English compactly, I'm not too surprised that you can get things down quite small.

However, the question is, what relevance does that encoding scheme have to the human brain? Having just scanned through the paper, the answer is "probably not that much", which the researchers are well aware of. They explicitly present this as a lower bound, which is a reasonable thing to do. It is obvious, in many ways, that the brain does not simply store 1.5 MB of data the way a computer does.

To be honest, this amounts to an exercise in recreational mathematics more than anything else. There's nothing wrong with that, and that's not a criticism. My point is that I'm not sure it's worth trying to read the paper as anything else.


A quick test here, using a lossless compressor with no understanding of phonemes, human language, or grammatical context at all, on a dictionary of 370,000 English words resulted in 24 bits per word. It wouldn't surprise me if our ability to roughly contextualize language in terms of the language we already understand gives us a serious advantage here.
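
For anyone who wants to reproduce that kind of figure, a quick sketch using Python's built-in lzma; "/usr/share/dict/words" is just an assumed path, and any large newline-separated word list works. The point is only how "bits per word under a general-purpose lossless compressor" gets measured; the exact number depends on the list and the compressor.

    # Average bits per word after general-purpose lossless compression of a word list.
    import lzma

    # Assumed path; substitute any newline-separated word list (e.g. a 370,000-word dictionary).
    with open("/usr/share/dict/words", "rb") as f:
        data = f.read()

    n_words = data.count(b"\n")
    compressed = lzma.compress(data, preset=9)
    print(n_words, "words,", len(compressed), "bytes compressed,",
          round(8 * len(compressed) / n_words, 1), "bits per word")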

Now a few questions: Can you hear a word you've never written (in a language that you're familiar with) and intuitively spell it right the first time? Can you read a word that you don't immediately understand and figure out its meaning from the context in which it is used? Can you accurately complete half of a sentence?

That a lot of people can do these things suggests to me that we all sit on something superficially similar to an efficient lossy compressor in our brain.


I've seen somewhere that there is about 1 bit of entropy per letter of English text. The best packer (Hutter Prize) compresses 100 MB of Wikipedia down to 15 MB, including the packer itself, a ratio of about 6.5:1, which isn't that far off.

Considering that, 40k words to 400kbits is not too surprising.
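
The back-of-the-envelope version, assuming an average English word length of roughly five letters (my assumption, not a figure from the paper):

    # Does ~1-2 bits per letter square with "40,000 words ~ 400,000 bits"?
    bits, words = 400_000, 40_000
    avg_letters_per_word = 5                        # assumed average
    print(bits / words, "bits per word")            # 10.0
    print(bits / (words * avg_letters_per_word), "bits per letter")   # 2.0, the same ballpark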


My only guess is that they are implicitly ignoring information we use all the time when talking about information in a computer. The number 5 stored in one place in a computer is the same as the number 5 stored elsewhere. In the human brain two words are already different because they are in different places. The researchers aren’t counting information needed to tell them apart. Admittedly, this is not the way I think either.


Superposition / quantum mechanical


I'll believe it if they provide a program that can speak English in under 1.5 MB total size.


What does it mean to "speak English"? Does a program that plays a recording of the word "I" count as a program that speaks English?


I think you know the answer: think about how you find out whether a human speaks English or not.


Of the various estimates that go into this number, I am most skeptical of the semantics one. I doubt that one can draw a clear line between our semantic appreciation of language and the totality of our remembered experiences.


Makes me want to write a retrocyberpunk setting featuring languages loaded off floppy disks.


Do it!

Preferably, the floppy disks should be the 5-1/4 inch 1.2 MB high-density, double-sided type, though 3-1/2 inch disks sound good too ;-)


Do it!


This seems very low. Just think of all of the different ways you can greet someone. The inflection and modifiers based on circumstance and audience alone seem like they would take more than 1.5 MB.


So, all of it fits on a floppy? Nice.


Not quite, considering the most common floppies stored a bit less than 1.44 MB.


One feature of floppy disks is that the recording technology and the format are independent of the physical medium; the limitations are mostly imposed by the disk drive, the encoding, and the file system, not the medium itself.

Well, this "feature" creates significant compatibility issues, as all the vendors created their own proprietary formats, sometimes with patents. As a result, together with economies of scale, the standard format is often not the best design but a simpler one that serves as the lowest common denominator.

On the other hand, due to this feature, it was already possible to store 1,760 KB of data on a 3-1/2 inch HD floppy on a 1986 Amiga. And in 2000, it was possible to store 32 MiB of data on a standard 1.44 MB floppy disk, by using a SuperDisk LS-240 drive (although random write is sacrificed, the entire floppy must be rewritten if a change is needed, like CD-RWs).

I believe the advancements in magnetic recording technology over the past 20 years allow one to achieve even higher capacity on a standard floppy, and it could be as cheap as early floppy drives if mass-produced; it just doesn't make sense to do so.


I always wanted to know how far you could push the capacity of an old standard density floppy drive if one hacked on modern control electronics. Things like adding more tracks and variable sectors per track. Though I think you'd be limited by the magnetic density of the media and physical head size. Still, it would be fun to see just how much more data you could squeeze onto an old 1.44MB floppy.
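
The geometry arithmetic is easy to play with. The standard PC and Amiga HD figures below are well documented (and Microsoft's DMF distribution format really did squeeze 21 sectors per track out of ordinary drives); the last line is a purely hypothetical "hacked" geometry.

    # Raw floppy capacity = cylinders x heads x sectors/track x bytes/sector.
    def capacity_kib(cylinders, heads, sectors, bytes_per_sector=512):
        return cylinders * heads * sectors * bytes_per_sector / 1024

    print(capacity_kib(80, 2, 18))   # 1440.0 KiB -> the standard PC "1.44 MB" format
    print(capacity_kib(80, 2, 21))   # 1680.0 KiB -> Microsoft DMF distribution floppies
    print(capacity_kib(80, 2, 22))   # 1760.0 KiB -> Amiga HD, which drops inter-sector gaps
    print(capacity_kib(84, 2, 23))   # 1932.0 KiB -> hypothetical hacked geometry with extra tracks/sectors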


The issue then becomes how do you properly protect the magnetic medium inside.


This is especially problematic for the last bunch of floppies produced in the late '90s. According to some anecdotes, due to the popularity of PCs at the time, vendors implemented various cost-saving measures to drive the price down; as a result, the magnetic medium inside had the lowest quality in the entire history of floppy disks and was extremely unreliable... Meanwhile, many 5-1/4 inch disks from old mainframes are still working today.


1.5 million what? Fix your title.


reminds me of that "hacker steals millions of internet data" joke


Fixed!


Haha wrong





