What's the minimum number of words you'd need to define all other words? (2012) (reddit.com)
343 points by devilcius 41 days ago | 159 comments

The Oxford Advanced Learner’s Dictionary has a “Defining vocabulary” that they claim is used to write almost all definitions (I used the Fifth edition, where it is appendix 10). It’s about 8½ pages, with 5 columns of about 63 lines, so about 2,700 words.
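For anyone who wants to check the arithmetic behind that estimate, here is a quick sketch. It assumes roughly one word per line, which is my reading of the layout described above rather than something the dictionary states.

```python
# Rough size of the defining vocabulary, assuming about one word per line.
pages = 8.5
columns_per_page = 5
lines_per_column = 63

estimate = pages * columns_per_page * lines_per_column
print(f"about {estimate:.0f} words")
```

That lands a little under 2,700, so the rounded figure above is consistent with the page counts.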

It doesn’t list inflections, proper names, adjectives for colors such as yellowish and words used in an entry that derive from that entry (the dictionary mentions blearily and bleary-eyed being used in the definition of bleary)

They also say they occasionally had to use a word not in the list, but don’t say how often they had to. Those words _are_ defined in the dictionary, so it is possible that the reference graph does not have any cycles.

So, I guess 3,000 is a good first guess.

Considering the list seems to contain both "big" and "large", my guess is that there's quite a bit of overlap in the words used because they expect the 3000 are generally known and can be relied on. This means that if we were going to optimize for size, we could probably get to a much smaller number of words, and use those to define the others.

I didn't go searching, big was literally the first word on the list I read after going down a few pages, and I wondered about large, so I searched for it. I just looked a bit more, and there's "child", "childhood", and "grandchild", which while not the same problem, does illustrate that they are fairly liberal with their inclusions because they appear to want to use the minimum vocabulary to define something idiomatically, which is a slightly different question than what's the minimum required.

This problem actually seems to share a lot in common with database normalization.[1]

1: https://en.wikipedia.org/wiki/Database_normalization

I have been reading the comments and I think it needs to be stated the problem itself is misguided. This quote sums up the main issue quite well -

“Our language is an imperfect instrument created by ancient and ignorant men. It is an animistic language that invites us to talk about stability and constants, about similarities and normal and kinds, about magical transformations, quick cures, simple problems, and final solutions. Yet the world we try to symbolize with this language is a world of process, change, differences, dimensions, functions, relationships, growths, interactions, developing, learning, coping, complexity. And the mismatch of our ever-changing world and our relatively static language forms is a problem.” - Wendell Johnson

After having realized that a static lamguamge is a prolbem, find I oose-full that smarm nebibibibmd. Finibabde ilop impebnudee, {fna anf fophohot.} Thor (((irhs (pronim; mebidi) om flom.

Nice. From the description it sounds like it has something in common with https://en.wikipedia.org/wiki/Billy_Liar

A great book and fantastic movie.

Relax there, Humpty Dumpty. Or should we call a bondulance?

seriously are you okay ?

Are you okay?

yar eo?u kaoy

Yet you've used somebody's words to describe the problem.

In fact, it almost seems like there is no other way to describe such problems. They are conceptual, ephemeral, not wholly in the realm of things you can see or witness, but only really describe.

I'm not aware of anyone since Charles Sanders Peirce actually making a serious scientific effort to investigate this problem. His work is well worth the read for anyone who wants to see what semiotic looks like when one of the greatest logicians (I'm talking Frege tier) to live turned his mind to it.

I was waiting for someone to mention Peirce or Frege here! Kudos... Did you have any particular piece in mind?

On a New List of Categories[1] is a good entry point. I like How to Make Our Ideas Clear[2] as well, and it may be more germane to this topic. He was a prolific writer, and I've found everything of his I've read thought provoking.

Edit: Some Consequences of Four Incapacities[3] is another that deals with how we understand things.




Thank you!

Thank you. There is deep importance in understanding where we are from in order to understand where we are.

That's useful. There's Basic English, with about 1000 words. Using Basic English well is hard. During WWII, the BBC broadcast news to the British Empire countries in Basic English. George Orwell did some of the translations. He found translating to Basic English to be a political act. Ambiguity did not translate. He had to make political statements unambiguous.

That's where 1984's "Newspeak" came from. See "Orwell, the Lost Writings".

Here's a PDF of the Oxford 3000 [1]

[1] https://www.smartcom.vn/the_oxford_3000.pdf

Thanks for the link.

I have since found a better, more recent version which doesn't have the OCR problems of the list I linked to above. You can find both the new Oxford 3000 and 5000 lists in several forms by clicking on the download button here: [1]

[1] https://www.oxfordlearnersdictionaries.com/wordlists/oxford3...

Yeah, but there's always Popper's observation hovering in the background concerning definitions: 'all definitions involve the use of words which themselves remain undefined'. Now if particular constituents of language (nouns, verbs, qualifiers) have empirical referents (EG, oak tree) then something other than words can be supplied to buttress and shape consensus for any formulated definitions, using words which themselves have empirical referents. But with conceptual referents (EG, democracy) definitions become subjective and lack clear capacity for unambiguous validation. So a definition of a concept which resonates with one individual based on their understanding of its verbiage may be dissonant for another based on that individual's understanding of the content of the definition.

You are confusing 'definition' with positivism. A definition does not have to be epistemologically apodictic to be a definition. It simply requires that we can understand its ordinary uses. Do you think nominalists can't define anything, and exist in the world in a state of perpetual confusion and dissonance?

> You are confusing 'definition' with positivism

Interesting. But the subject is the nature of the definition. What is the OED definition of definition (circularity intended):

a precise statement of the nature, properties, scope, or essential qualities of a thing; an explanation of a concept, etc.; a statement or formal explanation of the meaning of a word or phrase

Well that's nice. The first component would be amenable to a sclerotic positivism (which denied subjective phenomena as inaccessible to measurement, ergo epiphenomena to be ignored; this was jettisoned by contemporary cognitivism and phenomenology); the second addresses the conceptual without a hint of pragmatic methodology; and the salient element of the third component is the word meaning, which OED defines as:

that which is or is intended to be expressed or indicated by a sentence, word, dream, symbol, action, etc.

So the definition of definition by the ipse dixit English authority on definitions alternates between a call for precision and some rather vague references to intentionality. That was the intent of the above tidbit on the topic of definition. Namely some labels for subjects are amenable to degrees of precision in definition while others with only conceptual referents will have their proffered definitions disputed, diluted, or otherwise hedged and seemingly imprecise.

Stephen Stich, in The Fragmentation of Reason, which is a personal overview of contemporary epistemology, alludes to the inherent vagueness of consensual definitions and eventually settles into what he calls pragmatic epistemology.

"Namely some labels for subjects are amenable to degrees of precision in definition while others with only conceptual referents will have their proffered definitions disputed, diluted, or otherwise hedged and seemingly imprecise."

It doesn't matter how contested a word is. You can nevertheless describe its main conventional uses. That would simply be an empirical observation.

Also, just a friendly suggestion: you are writing too much, and using too many long and unnecessary words. Simplicity is often better, both analytically, and to read.

It makes sense that it doesn't list joined words like bleary-eyed when its definition is obvious from the constituents bleary and eye, or inflected words because inflections like -ish and -ly each have the same meaning when modifying other words. But what about phrasal and prepositional verbs such as put off, put up, and put up with, where their meaning can't usually be deduced from the constituents such as put, off, up, and with?

The definition of “put” has about two pages, with “put off” (five variants), “put up” (eight variants, including “put up with sb/st”) in those pages.

“Put out” also has a zillion variants in those two pages, but itself also is in the list of defining words.

Notably, 3000 is a good chunk of what's (afaik) considered to be the average everyday-use dictionary: somewhere from 10000 to 20000 words.

(Though again I'm unsure if the endless English phrasal verbs are counted as distinct in these estimates, not doing which would probably be cheating.)

Generally speaking, in language acquisition papers anyway, vocabulary size is measured in "word families" rather than words. So "police" and "police station" are counted as a single "word" as long as you also have "station" in the list. Phrasal verbs ("look up to" vs "look in to" for example) are counted separately, if I'm not mistaken, because while the root of the word is the same, it's not the same "word family".

I remember 3000 being the target vocabulary when I was studying French. I forget where I got that number though. Might have just been the number of flash cards in the CleverDeck app shrug.

«it is possible that the reference graph does not have any cycles.»

It's impossible. An English dictionary defined using English words has to have cycles.

That's assuming that every word used in every definition is also listed in the dictionary. While that's probably the case, it doesn't necessarily have to be true.

Surely a dictionary that uses a particular word but does not define that word is not a very good dictionary. It feels like one easy test of "completeness" would be to check if every word used has been defined.
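That completeness test is easy to sketch. The toy dictionary and the function name below are made up for illustration, and the tokenizer is deliberately naive: it ignores inflections, so a real check would need to map "bigger" back to "big" before comparing.

```python
import re

def undefined_words(dictionary):
    """Return words used in some definition that have no entry of their own."""
    headwords = {word.lower() for word in dictionary}
    used = set()
    for definition in dictionary.values():
        # Naive tokenization: runs of ASCII letters only.
        used.update(re.findall(r"[a-z]+", definition.lower()))
    return used - headwords

# Hypothetical three-entry dictionary, just to exercise the check.
toy = {
    "big": "large",
    "large": "big in size",
    "size": "how big a thing is",
}
print(sorted(undefined_words(toy)))
```

An empty result would mean the dictionary is self-contained in the sense described above; here the toy example fails the test because the small function words have no entries.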

The work of Anna Wierzbicka and Cliff Goddard studied 'Semantic Primes', 'the set of semantic concepts that are innately understood but cannot be expressed in simpler terms'.


The combination of a set of semantic primes and the rules of combining them forms a 'Natural Semantic Metalanguage' , which is the core from which all the words in a given language would be built up.


The current agreed-upon number of semantic primes is 65 (see list at wikipedia links above).

That means that any English word can be defined using a lexicon of about 65 concepts in the English natural semantic metalanguage.

I've been following this stuff for years, it's fascinating. I'm particularly interested in the recent practical applications like Minimal English and its equivalents in other languages. For those that don't know, unlike other minimalist English subsets which usually focus on learnability or clarity, Minimal English focuses on maximum translatability.

I'm going to get silly now, but I can't help but think the semantic primes - if you can avoid thinking of them as words or even conscious experience - represent some core set of cognitive axioms, like the primitive elements for constructing mental models. As you go to simpler life forms the "word list" would get smaller. If there is any truth to that, I wonder what potential primitives we are missing that would allow us to think more complex thoughts and whether you could measure species intelligence by their "vocabulary" and working out what concepts can't be expressed when one of the primitives is missing. What would happen if you lost the concept of above'ness?

The other thing I find interesting and it might be no more than a coincidence, is how there is only the numbers one and two and then you have to use many or more. This in some way matches up with the ideas of the Parallel individuation system[1] whereby young children can only precisely recognize quantities up to 3, or 1 + 2 and an adult can only precisely recognize quantities up to 4, or 2 + 2. After that, the brain uses the Approximate number system[2]. So it's like there are only 2 slots to place a quantity.

[1] https://en.wikipedia.org/wiki/Parallel_individuation_system [2] https://en.wikipedia.org/wiki/Approximate_number_system

> some core set of cognitive axioms

This and the rest of the comment remind me of the Pirahã language, in which there are purportedly two numerals but researchers can't figure out what they are: https://en.wikipedia.org/wiki/Pirah%C3%A3_language#Numerals_...

> Frank et al. (2008) describes two experiments on four Pirahã speakers that were designed to test these two hypotheses. In one, ten spools of thread were placed on a table one at a time and the Pirahã were asked how many were there. All four speakers answered in accordance with the hypothesis that the language has words for 'one' and 'two' in this experiment, uniformly using hói for one spool, hoí for two spools, and a mixture of the second word and 'many' for more than two spools. The second experiment, however, started with ten spools of thread on the table, and spools were subtracted one at a time. In this experiment, one speaker used hói (the word previously supposed to mean 'one') when there were six spools left, and all four speakers used that word consistently when there were as many as three spools left.

Having read only your comment, I'll jump in and solve the puzzle.


    not enough

I taught myself to recognize five as a distinct quantity. Useful when counting up the "spare change jar".

I assume you see 3 objects on a table as a triangle. It's probably not equilateral, but any three objects on a table describe a triangle.

Make sure you can see 4 as a square, not 2+2. If you're stuck on seeing two pairs (or lines), try seeing 3+1 (a triangle and a point) instead. Then incorporate the point into the triangle...

Next, see pentagons. ... That's it.

I haven't tried to see "six"... Five was hard enough. :P

I don't think you would ever lose a concept like 'aboveness' - even if that word didn't exist in our language, we would have found a way to express the same idea, perhaps in a less abstract way like 'closer to the sky'

What proportion of linguists with an interest in semantics regard "semantic primes" as a useful concept? The Wikipedia articles don't seem to have a "Criticism" section, which isn't a good sign.

It looks interesting, certainly, but rather arbitrary. There are several pairs of opposites, which in a minimal language could be handled with the concept of "opposite", and I have no idea how you'd express some fundamental concepts of human experience such as hunger, cold, pain or surprise, while "live, die" do not seem to me to be such fundamental concepts: they seem more like concepts that need to be defined, for example by a philosopher or medical specialist, rather than experienced directly.

I think this is the right approach to the problem. It's a question of meaning and bootstrapping a minimal language that's based heavily on metaphor (specifically, the conduit metaphor). The answer from this perspective, based on semantic metalanguage, is 800 words. Minimal English, but also minimal across all languages. It's a core language system that's translatable, because language is based on concepts, and those are consistent across natural languages (Chomsky - Universal Grammar).

--- https://en.wikipedia.org/wiki/Conduit_metaphor



I am a little surprised that toki pona ("language of good", https://en.m.wikipedia.org/wiki/Toki_Pona) is not mentioned. It is a language that consists of about 125 words, which aims to make you think about describing complicated subjects. To give an example: The concept "friend" could both be described as "good man" or "man good to me" depending on whether you think your friend is intrinsically good.

Admittedly, the original question is specifically about the English language, but toki pona is a nice experiment related to this.

> "[...] Who are you?"

> "A friend!" Shouted back the man. He ran toward Zaphod.

> "Oh yeah?" said Zaphod. "Anyone's friend in particular, or just generally well-disposed to people?"

Adams, Douglas. The Restaurant at the End of the Universe.

"sina jan seme?"

"jan pona" mije li toki wawa. ona li tawa tawa jan Zaphod.

"jan pona?" jan Zaphod li toki. "jan pona tawa jan wan anu ale?"

jan Douglas Adams. ma moku lon pini pi ma suli.

pona. taso mi pilin e ni: "restaurant" li "tomo moku" li "ma moku" ala.

sina pona. mi pakala. tenpo ni la mi ken ala ante e lipu mi :-(

An interesting related talk, touching on the minimality and expressiveness of both natural and computer languages, is Guy Steele's 1998 talk "Growing a Language":

Video: https://www.youtube.com/watch?v=_ahvzDzKdB0

PDF: https://www.cs.virginia.edu/~evans/cs655/readings/steele.pdf

Prior HN discussion: https://news.ycombinator.com/item?id=16847691, https://news.ycombinator.com/item?id=2359174, & others

That's the first thing I searched for when opening this thread to see if anyone else had posted the link yet. Just brilliant.

Thank you so much for linking this. It's such a fun talk, and I've been looking for it for years but nothing I tried in Google would ever bring it up.

I think the approach I would use is as follows:

0. Get a dictionary.

1. Form a directed graph, with an edge from each word to every word that uses that word in its definition.

2. Remove all words that have no outgoing edges.

3. If you removed some words, go to step 1. Otherwise, all words left in the dictionary are minimal.

EDIT: If anyone knows of a machine-readable dictionary, I'd love to actually do this.
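A minimal sketch of that loop, assuming the dictionary has already been reduced to a mapping from each word to the set of words in its definition. The toy entries and the function name are invented for illustration, not taken from any real dictionary.

```python
def prune_unused(definitions):
    """definitions maps word -> set of words used in its definition.

    Repeatedly drop every word that no remaining definition uses,
    and return the words that survive.
    """
    remaining = dict(definitions)
    while True:
        used = set()
        for words in remaining.values():
            used |= words
        unused = [word for word in remaining if word not in used]
        if not unused:
            return set(remaining)
        for word in unused:
            del remaining[word]

# Hypothetical toy dictionary.
toy = {
    "multitudinous": {"myriad"},
    "myriad": {"many"},
    "many": {"large", "number"},
    "large": {"big"},
    "big": {"large"},
    "number": {"many"},
}
print(sorted(prune_unused(toy)))
```

On this input "multitudinous" goes first, which frees up "myriad" on the next pass, and the loop stops once every surviving word is used by some other surviving definition. Everything left sits on or feeds a cycle, so this only gives a starting set, not a proof of minimality.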

This will not yield a minimal set; in a cycle, it is only necessary to remove at least one word. The problem is thus to delete the minimum number of vertices to remove all cycles. This is the NP-hard Feedback Vertex Set problem. Here's a paper that solves it for a dictionary (there is some more): https://arxiv.org/abs/0911.5703
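To make the difference concrete, here is a brute-force sketch on a made-up four-word graph. The graph and function names are my own illustration and are not taken from the linked paper, and trying every subset is only viable for toy inputs, precisely because the problem is NP-hard. Edges run from a word to the words in its definition; "removing" a vertex means accepting that word as an undefined primitive.

```python
from itertools import combinations

def has_cycle(graph, removed):
    """Depth-first cycle check on the graph with `removed` vertices deleted."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph if v not in removed}

    def visit(v):
        color[v] = GREY
        for u in graph[v]:
            if u not in color:  # removed, or not a headword at all
                continue
            if color[u] == GREY:
                return True     # back edge: we found a cycle
            if color[u] == WHITE and visit(u):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in color)

def min_feedback_vertex_set(graph):
    """Smallest set of vertices whose removal leaves no cycles (brute force)."""
    vertices = sorted(graph)
    for k in range(len(vertices) + 1):
        for candidate in combinations(vertices, k):
            if not has_cycle(graph, set(candidate)):
                return set(candidate)

# Hypothetical toy graph with two separate two-word cycles.
toy = {
    "big": {"large"},
    "large": {"big"},
    "many": {"large", "number"},
    "number": {"many"},
}
primitives = min_feedback_vertex_set(toy)
print(len(primitives), "of", len(toy), "words are needed as primitives")
```

Pruning unused words would keep all four words here, since each is used in some definition, while breaking each cycle once needs only two.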

I checked the comments to check if this paper was mentioned anywhere. Good recommendation.

Looks like you found our answer! Someone's already done the hard work.

This is not necessarily the answer. It's an upper-bound for the answer.

Looks like somebody made txt and json versions of the Oxford Unabridged English Dictionary here: https://github.com/adambom/dictionary. The json version should let you build up the graph structures you're talking about pretty easily.

But you will come across a lot of words used in definitions that could easily be replaced with more common words. In some cases the change to the definition would be tiny, in others it might be more significant.

I'd like to see a DAG of WordNet, a database of English synonyms. Mapping single word synonyms solves the problem of common words in definitions.


Came here to say exactly the same! Great minds think alike!

It seems like a good start. Once you do that, you could start finding vertices with large amounts of incoming edges, attempt to redefine the word as a phrase composed of only words still in the graph, remove that vertex, and repeat.

That will get you much closer but it does ignore the ability to apply creativity to definitions to further reduce them. In the end, a machine-driven technique can give an approximate answer to this problem but it will never be the "perfect" answer.

You could try this using WordNet


and word2vec

Doesn't every word in a given definition have an "outgoing edge?"

Yes, but not all words defined in the dictionary are used in a definition.

So if "multitudinous" isn't used in a definition of another word, you remove it from the set. Maybe you then find out that "myriad" was only used in the definition of multitudinous, so you can take myriad out, and so on.

Definitions are not enough to fully capture the meaning of a word. In order to do that you need full language modelling and to ground words into other sensory modalities, plus the word in relation to actions taken in various situations when the word was used.

GPT-2 (of recent OpenAI fame) uses 1.5 billion parameters and, though capable of interesting results, is far from human level. It also uses just text so it's incomplete.


Another interesting metric is Bits Per Character - BPC. The state of the art is around 1.06 on English Wikipedia. This measures the average achievable compression on character sequences and doesn't include the size of the model, just the size of the compressed sequence.
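As I understand the metric, BPC is the average of -log2(p) over the probability the model assigned to each character that actually came next. A small sketch with invented probabilities (they are not from any real model):

```python
import math

def bits_per_character(probabilities):
    """Mean negative log2-probability of the characters that actually occurred."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)

# Made-up probabilities a model might assign to four successive characters.
probs = [0.5, 0.25, 0.5, 0.5]
print(bits_per_character(probs))
```

A model that always gave the correct next character probability 1 would score 0 bits, and one that guessed uniformly over 256 byte values would score 8.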


That's true but it's almost inherent in what a dictionary is, i.e. to catalogue the canonical semantic meaning of words, not to provide a complete model of language and its contextual variables.

I used to work for Pearson Longman, and one of their USPs was that their defining vocabulary was significantly smaller than the main competitors, namely OUP and CUP. Longman's was just over 2000 (about 2100 IIRC), whereas OUP's was approx 3000.

Even then, one is rather constrained and definitions frequently cross-referenced other words to bootstrap the definition.

Words in the English language are not the same as computer code. I'm not sure you can fully define most words in terms of other words -- hence the variety. Dictionaries generally only provide rough sketches of the meaning of a word. Even synonyms can have slightly different subtexts, connotations, and histories. Hell, individual words have wildly different meanings depending on context.

You could call this the Wittgensteinian critique of the question.

Besides Basic English, I've run into a neat French dictionary for children, https://www.amazon.com/Mon-premier-dictionnaire-Roger-Pillet...

It sticks to a basic vocabulary, has an entry for every word it uses, and goes heavy on examples and pictures in preference to formal definitions. (And it's monolingual even though written mainly for learners in North America.)

I don't have it to check, but estimating from memory: around 2000 to 4000 words. I found it useful while bootstrapping up from Duolingo.

> has an entry for every word it uses

That is actually a really interesting challenge: to have a completely self-contained dictionary. Especially in 1963, before modern automation, the proofreading required must have been a Herculean task.

Perhaps this could be some kind of measure for answering this question in and of itself: what is the smallest useful self-contained natural language dictionary that one can write?

EDIT: Oh, fginionio came up with an intuitive approach to do this automatically below: https://news.ycombinator.com/item?id=19332041

If it goes heavy on examples and pictures, then it can probably give a more relaxed definition for words, knowing the context will be picked up from the pics and examples. Do you find that true?

Yes, it was like that. The philosophy was to support learning that tries to come closer to real-life immersion than typical school foreign-language classes did. (From my memory of the preface, the only part in English. Of course nothing back in the 1960s could really approach moving to France -- maybe nowadays you could using the internet.)

I am looking for the same but in German from either English or French.

Me too! But maybe I was unclear: the only part not in French was the introduction, so it doesn't really matter what language you're coming from.

It depends on what is meant by "define". If we are allowed to use existing words in a language, L, to create a new language, L', then use expressions in L' to define each word in L, a single word w, originally in L, suffices.

The idea is to first index each word v in the lexicon of L (including w), starting at 1 and ending at n, whatever is the number of distinct words in the language. Alternatively, you can index _meanings_. Then (should be obvious where I'm going with this by this point) you map a sequence S_k of repetitions of w of length k in [1,n] to each k'th word, v_k, in L. So now L' is the language of n sequences S_1,...,S_n of w each of which maps to a word (or meaning) in L. And you have "defined" L in terms of a single word, the word w.
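The construction is short enough to write down. This is only a sketch: the three-word lexicon and the function name are placeholders I made up, and any word of L could play the role of w.

```python
def build_single_word_language(lexicon, w):
    """Map the k-th word of the lexicon (1-based) to w repeated k times."""
    encode = {v: " ".join([w] * k) for k, v in enumerate(lexicon, start=1)}
    decode = {sequence: v for v, sequence in encode.items()}
    return encode, decode

# Hypothetical tiny lexicon; w is itself one of its words.
lexicon = ["cat", "dog", "buffalo"]
encode, decode = build_single_word_language(lexicon, "buffalo")

print(encode["dog"])                       # S_2
print(decode["buffalo buffalo buffalo"])   # the word indexed by S_3
```

The mapping is one-to-one, so every word of L gets a unique "definition" in L', at the cost of sequences whose length grows with the size of the lexicon.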

But that's probably not at all what the reddit poster had in mind.

However, it should be noted that natural language is such that there's really no reason that we have many words- it's just convenient and helps us create new utterances without having to create long sequences of one word, as above. The important ability in human language is that we can combine words to create new utterances, forever- which we can do with one word just as well as with a few thousand.

Finally, I suspect that if there was a minimal set of (more than one!) words sufficient to define all other words (meanings) in a language, all natural languages would converge to about that number of words- which I really don't think is the case.

> probably not at all what the reddit poster had in mind

I'm pretty confident the goal is to choose a smallest subset of English so that, if you know this subset of English and are given a dictionary written in it, you can learn the entire vocabulary of full English.

That means you're not allowed to create any new words, so you can't create the magic uber-word w.

> if there was a minimal set of (more than one!) words sufficient to define all other words (meanings) in a language, all natural languages would converge to about that number of words- which I really don't think is the case.

This amounts to saying there is little to no redundancy in language. I'm not convinced. For example, once you've got "one" and "plus", the words "two", "three", "four", etc. are just convenience. Another example might be opposites: if you have "down", you don't absolutely have to have "up". But the thing is, people really like convenient ways of saying things. In fact, the economics probably drive you toward doing this. It makes for shorter sentences. Think of it like data compression: if a concept occurs often, you want a dedicated word for it so you can just say that word instead of saying the definition.

Of course there is redundancy; a lot of it comes from historical facts, where due to conquests and other forms of migration one language has been influenced by several. As such you could say that English is not one language but rather consists of three or four different languages.

So for English it would be rather easy to find this by looking up synonyms originating from France, Germany and even Scandinavia and of course Latin.

>> That means you're not allowed to create any new words, so you can't create the magic uber-word w.

Oh- w can be an English word. And the reddit post didn't say anything about not inventing a new language, with only English words (it would be a new language since it would have completely different grammar and semantics).

But I think you're right that what I propose above is totally cheating :)

I'd argue that even if you take the same letters as an existing word, adding completely unrelated definitions makes a new word.

>> words sufficient to define all other words.

Often the reason we have a word for a concept is precisely because no other combination of words would do. I'd suggest that the article's attempt to "define" one word with another is an oversimplification. It is not enough to sufficiently define a word, to convey its most common understanding. To declare a word superfluous, replaceable, one must define it absolutely. For many words I'd expect such a definition to fill an entire volume, not a short sentence.

The author also forgets that words have layers beyond literal meanings. Their tone, their length, and even their spelling can convey different meanings depending on context.

Assuming they exist, words that require an entire volume to functionally define must be rare and specialized. Plus, the question still stands: what’s the smallest number of unique words needed to write all those volumes?

You don’t have to have felt schadenfreude for someone to explain to you what it is.

I looked at this question a while back, and wrote this: https://kybernetikos.com/2007/12/03/atoms-of-english/ (blog is only up some of the time sadly, I'll fix it eventually).

I took Webster's dictionary from the Project Gutenberg site. I started with 95712 words. After the initial throwing away of words that weren’t in any definitions, I was down to 4489 words. After expanding them, and throwing away words that weren’t in the expanded definitions, I was down to 3601 words. Setting recursive definitions as atoms and continuing got me down to 2565 words.

I once found (plausibly from another HN commenter) a text based adventure where (almost?) all the words used were replaced with alternative English-sounding nonsense words, but have never rediscovered the link.

I feel this would be of interest to the thread, if anyone knows what I'm talking about or knows how to successfully Google for such a thing.

The Gostak

Finally, here you are. At the delcot of tondam, where doshes deave. But the doshery lutt is crenned with glauds.

Glauds! How rorm it would be to pell back to the bewl and distunk them, distunk the whole delcot, let the drokes uncren them.

But you are the gostak. The gostak distims the doshes. And no glaud will vorl them from you.

It has been on my to-play list for some time but I haven't got around to it yet.


And let us not forget about Lighan Ses Lion, a transcript of a fictitious game in a made-up language that just happens to overlap with English.


Thanks so much, this is it! The play online link was what I'd been linked to.


Reminds me of Randall Munroe's Thing Explainer:

"In Thing Explainer: Complicated Stuff in Simple Words, things are explained in the style of Up Goer Five, using only drawings and a vocabulary of the 1,000 (or "ten hundred") most common words."


A similar idea from the 19th century: Lucy Aikin (writing as "Mary Godolphin") wrote some children's novels using words of one syllable (except for proper names). There are only so many English words of just one syllable (Wiktionary lists 8626 words [1]), but you can get very far! The writing has a rather robotic cadence, though.

You can read some of her books online, such as "Robinson Crusoe In Words Of One Syllable" [2].

[1] https://en.wiktionary.org/wiki/Category:English_1-syllable_w...

[2] https://manybooks.net/book/189660/read#epubcfi(/6/2[item4]!/...)

A problem with that book is that while it uses the 1000 most common words, it doesn't use the 1000 most common meanings. Many of the 1000 words are used with somewhat uncommon meanings.

Love the book. Super fun to read, even for an adult.

It is a great book, and one of the best tools to teach kids in a fun way I’ve come across. He has a new book coming out later this year too which describes absurdly overengineered ways to solve simple problems. I’ve preordered. :)

That's fantastic. Didn't know that he has a new book, for the lazy ones, called "How To: Absurd Scientific Advice for Common Real-World Problems" [0]. Also pre-ordered!

[0] https://www.amazon.com/How-Absurd-Scientific-Real-World-Prob...

I find it reassuring that customers that bought this book also bought Tramontina 80114/535DS Professional Aluminum Nonstick Restaurant Fry Pan, 10"

I've actually been wondering about this a lot myself recently, though I have been thinking of it in terms of "axiomatic English" i.e. the set of words and grammar/syntax rules from which all other meanings expressible in English can be represented, and which cannot be explained themselves except through tautology. It's a really, really interesting question, and answering it would explain a lot about how we actually think.

Just one: "nor"


Hope you are one of the 10000 lucky ones whose mind is blown for the first time.

Or another one: "1"


Logic won't help you without a metatheory that links it back to the real world.

this reminds me of https://youtu.be/_ahvzDzKdB0 awesome talk!

The words needed to define a universal turing machine (and a program to simulate a human brain, but that doesn't require additional words).

We could extend it to cover words not conceivable by humans, and any universe, by using a program to simulate those, but (1) I assume the question implicitly assumes human words, though (2) it wouldn't require more words anyway.

Wow, I had no idea there was such a thing as simple.wikipedia.org! It apparently tries to follow 'Basic English'[0], which is made up of only 850 words. The simple version[1] of the artificial neural networks article is a lot more approachable than the normal version[2]!

0: https://simple.wikipedia.org/wiki/Basic_English

1: https://simple.wikipedia.org/wiki/Artificial_neural_network

2: https://en.wikipedia.org/wiki/Artificial_neural_network

0 obviously. Babies start with no definitions of words, but here we all are.

The baby learns the words via example, not by definitions.

I think implicit to the question is that you're just using words for defining.

How does this make any sense?

You could have 100 synonyms with the same "definition" but 100 different shades of meaning, implied degree of strength, or connotations.

You don't necessarily simplify anything by making people add additional words to get across those subtleties.

Of course some are useless equivalents, but many aren't.

Oh, you absolutely wouldn't simplify anything by doing this - ideas that used to be encompassed by a single word would have paragraph-long descriptions.

It's just a thought experiment about how much you could optimize one dimension (number of words) if you didn't care at all about optimization anywhere else in language.

>some are useless equivalents//

Not all synonyms amount to useless equivalents.

Occasionally it's useful to use a different word simply because one can; sometimes the felicitous utility of alternate mots justes serves its own purpose.

> Of course some are useless equivalents, but many aren't.

No such thing as a synonym. On the face of it, yes, many words share meanings. But a mutt is not the same as a dog, despite what thesaurus.com says.

Those shades of meaning can be described with other words.

What is the minimum number of words needed to define everything else?

Maybe some of the dictionary entries would have to be short stories or novels or poems.

The answer is 2: zero and one. What you need is to describe second-order logic. Just define the "every" quantifier to be 0 0 0, NAND [1] to be 0 0 1, and all other words as other sequences of 0s and 1s that, for clarity, look like 1 *. There might need to be some trick to make splitting a "sentence" into "words" unambiguous, but that should be trivial.

1: https://en.wikipedia.org/wiki/Sheffer_stroke
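For what it's worth, the propositional half of this is easy to check by brute force: NAND alone is enough to rebuild NOT, AND and OR. A minimal Python sketch (the function names are made up for illustration, and it says nothing about the quantifier or the second-order part):

```python
def nand(a: int, b: int) -> int:
    """Sheffer stroke: the result is 0 only when both inputs are 1."""
    return 0 if (a and b) else 1

# Everything below is built from nand() alone.
def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))

# Check all input combinations against Python's built-in bit operators.
for a in (0, 1):
    assert not_(a) == 1 - a
    for b in (0, 1):
        assert and_(a, b) == (a & b)
        assert or_(a, b) == (a | b)
```

The same trick works with NOR, which is presumably what the "nor" answer elsewhere in the thread is getting at.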

Taken to the logical extreme, the question is: "how many intrinsic symbols do we need to convey any meaning when presented to a fully logical being", to which, in my opinion, the answer is 1 (or 2, really, since 1 is only "possible").

You might not have words for it, but a fully logical being can decipher any bitstream given enough interactivity.

So start from 1 and 0, form the basis of mathematics and symbols, then start with physics from the very bottom.

0) Initialize set X to contain every word.

1) Y = set of words in every definition of the words in set X

2) X = Y - X (all words in Y that are not in X)

3) Repeat from 1 if the set of words in X has changed

Does that reduce all words down to the actual minimal set of words required to define other ones? Since you can build upwards from the resulting set X to get the original set of words.

Also, this reminds me of the knapsack problem a little bit (for example what is the minimum set of coins required to be able to make $X).
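Steps 0 to 3 above can be tried on a toy dictionary. This is only a sketch under one big assumption: taken literally, Y - X is empty on the first pass (X starts as every word, so nothing in Y is outside it), so step 2 is read here as "keep only the words of X that appear in some definition". The `definitions` data and the function name are invented for the example.

```python
from typing import Dict, Set

def defining_closure(definitions: Dict[str, Set[str]]) -> Set[str]:
    """Repeatedly drop words that no remaining word's definition uses."""
    x = set(definitions)  # step 0: start with every word
    while True:
        # step 1: every word used in a definition of some word in X
        y = set().union(*(definitions[w] for w in x))
        # step 2 (ASSUMED reading): keep the words of X that are in Y
        new_x = x & y
        # step 3: stop once X no longer changes
        if new_x == x:
            return x
        x = new_x

# Hypothetical five-word dictionary: word -> words used in its definition.
definitions = {
    "big": {"large"},
    "large": {"big"},
    "very": {"big"},
    "huge": {"very", "big"},
    "giant": {"huge"},
}

print(defining_closure(definitions))  # the two words "big" and "large" survive
```

On this toy input the loop stops at "big" and "large", which define each other. So under this reading the iteration finds a set you can build upwards from, but it keeps whole cycles rather than picking one word out of each, which suggests it is not yet the minimum.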

I've seen a dictionary that defined 120 words using those same 120 words (morphemes really), though some of the definitions were a bit ... weak. Toki Pona also has about 120 words but it's a very different set of words: Toki Pona's vocabulary is concrete and everyday, while the dictionary's was very abstract. So probably it's just a cute coincidence that both numbers were about 120.

Not according to these guys:


mentioned by mjgeddes in this very thread

It depends on the context assumed about the audience that is supposed to understand the definitions.

Do they have the experiences relevant to the word being defined? If not, what experiences do they have in common with the person providing the definition?

How intelligent are they? Can they understand complex concepts through logic, through examples or both?

How much do they know about English (besides the few words assumed known)?

Isn't the answer "two" ? "One" and "None" (or on and off) ?

Of course I see the obvious bootstrapping problem where you relate the encoding starting with just those two words but ... somehow I think that's easier to overcome than it seems ... as in, I think it must be possible.

If Helen Keller can write a book, surely I can relate digital encoding to a toddler over the course of a year or three, right ?

I think it depends on whether there is a distinction between digits/letters and words; otherwise "26" would be a good starting answer too (since each letter is its own word).

You're thinking of letters, not words.

I can actually answer this question. Back in the day I was going through the Oxford Dictionary and it mentioned that all the meanings use words from a list of about 3,000 words. The list, IIRC, was also at the back of the dictionary. And it also mentioned that on rare occasions they have to use words outside of those 3,000.

Source: My memory of something I read at British Council Library 17 years ago.

Others agree with your memory, eg https://news.ycombinator.com/item?id=19332648.

So, once the human race had discovered roughly this number of words (give or take a few for whatever language existed at the time, and minus the useless words demanded by grammar) then humans had a Turing Complete language? That must have been a crucial point for the evolution of human culture.

Much more crucial than the Turing completeness of the grammar was certainly the ability to write language down, which conserved knowledge and language for multiple generations as long as the decoding skills were passed along.

What effect this really had can be observed with the introduction of mechanical printing presses which reduced spatial and temporal distances of information flow significantly.

The internet might yet be another of those things..

I bet this heavily depends on what you consider an accurate definition.

There are lots of ways it's not at all the same, but it's at least sort of interesting to compare this question to the number of dimensions needed for effective word embeddings.

Reminds me of Toki Pona — with about 120 words it seems to work.

As an avid toki pona user, I've often contrasted it with NSM and noticed things that are really tough to express in toki pona (very likely intentionally).

One thing is that toki pona has no built-in comparatives at all. A usual thing is to say something like

mi sona e ijo mute ala. jan pi pali sama li sona e ijo mute.

'I know not many things. My colleague knows many things.'

ona li suli taso mi suli mute.

'She is big, but I am very big.'

jan ni li jo e mani mute. taso jan ante li jo e mani mute mute.

'This person has a lot of money. But the other person has lots and lots of money.'

Another thing is that there's no built-in way to make a relative clause at all.

mi sona e toki. mama meli mi li sona e toki sama.

'I know a language. My mother knows the same language.' (As opposed to 'My mother knows a/the language that I know'!)

mi sona e toki. mama meli mi li sona ala e toki ni.

'I know a language. My mother does not know this language.' (As opposed to 'I know a language that my mother doesn't (know)'!)

ona li pali e ijo. mi sona e jan ante. jan ni li pali kin e ijo ni.

'She does something. I know another person. This person also does this thing.' (As opposed to 'I know another person who does what she does'.)

moku mute li kama tan soweli. mi moku ala e moku ni.

'Many foods come from animals. I don't eat these foods.' (As opposed to 'I don't eat foods that come from animals'.)

It's also extremely tricky to construct specific tenses and specific logical conditions. The particle "la" can mean "when", "because", "also", or "if", and is only supposed to be used once per sentence. This is especially challenging when trying to contrast things that have happened with hypothetical conditions. For example

jan olin ona mije li moli la mi mute li pilin ike.

I intend this to mean 'we feel bad because his romantic partner died' but we can't really disambiguate, for example, 'we will feel bad when his romantic partner dies' or 'if his romantic partner dies, we will feel bad'.

You can qualify things with "tenpo pini/ni/kama la" ('in past/this/future time'), but you're not supposed to use more than one "la" in the same sentence, so it's discouraged to write things like

?tenpo pini la mi moku e ni la insa mi li pilin ike.

'Because, in the past, I ate this, my belly feels bad.'

You can try to break these up into multiple sentences.

tenpo pini la mi moku e moku jaki. mi pali e ni la mi kama pilin ike.

'In the past, I ate gross food. Since I did this, I started feeling bad.'

This gets really challenging if you have to refer to several different things of the same sort, which perhaps have conditional relations to one another that apply at different times or in different circumstances. For example, if you wanted to say "when my mother arrived, the plane that she was on was very warm because it had a broken air conditioning unit which the crew didn't know how to fix", you might end up making a long series of sentences that tell a story.

tenpo pini la mama mi li kama kepeken ilo tawa kon. ona li kama la kon lon ilo li seli mute. ni li kama tan ni: ilo lete li pakala. jan pali li sona ala pona e ilo lete.

In the past, my mother came using an air travel tool. When she/it arrived, the air in the tool was very hot. This happened because of this: the cooling tool broke. Workers did not know how to improve the cooling tool.

But some kinds of conditions don't necessarily lend themselves well to this form, like if I wanted to say "if she had known that this would happen, she wouldn't have taken this airplane", or quantifiers like "every Singaporean who goes to school in Singapore learns English and whatever the government defines as his or her family's language" or "everyone who was inside the building when the earthquake happened got injured by some object"...

I don't feel confident about my ability to describe the truth conditions of the latter two examples in toki pona in a way that's faithful to the English original.

It's also unclear to what extent we're allowed to stack "e ni:" and "tan ni:" in order to embed indirect discourse and chained reasons.

?ona li pilin pona tan ni: toki pona li pona tawa ona tan ni: ona li toki lili li jo ala e nimi mute.

'She was happy because of this: she liked toki pona because of this: it's a small language and doesn't have many words.'

Edit: also, NSM explications assume that you're deliberately defining new vocabulary in order to expand your language, which isn't really customary in toki pona. Even if we figure out how to express a concept or situation in toki pona, we don't then acquire a single word that we can use for that concept or situation in the future.

What's the minimum number of words you'd need to define the word "left", as in "left hand"?

opposite right?

Left is 'not right', right is 'not left'. So left is simply 'not not left' ... ez!

a, b, c, d, e, f, g, j, k, l, m, n, p, r, s, v, x, y, (, ), and, concatenate.

Some hints:

- "backwards j"

- "a circle"

- "a cross"

- "n, but rotated ninety degrees"

- "mirror of p"

- "vv, except no gap"

- "pixel-wise union n and l"

- "mirror of s, and make the lines straight"

Semantics are impossible anyways, I challenge you to define the word "dog".

Challenge: Do better, make sure you don't have circular dependencies.
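If anyone takes up the challenge, checking for circular dependencies is mechanical. A rough Python sketch using a depth-first search; the `glyphs` table is only one reading of three of the hints above (q from p, w from v, h from n and l), and `find_cycle` is a made-up helper name:

```python
from typing import Dict, List, Optional

def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):  # undefined names count as primitives
            colour = state.get(nxt, WHITE)
            if colour == GREY:  # reached something already on the current path
                return path[path.index(nxt):] + [nxt]
            if colour == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# Assumed reading of the hints: each derived letter lists what it is built from.
glyphs = {"q": ["p"], "w": ["v"], "h": ["n", "l"]}
print(find_cycle(glyphs))                 # None
print(find_cycle(dict(glyphs, p=["q"])))  # ['q', 'p', 'q']
```

Defining p as the mirror of q on top of q as the mirror of p is exactly the kind of loop the check reports.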

This feels intuitively like it's closely associated with some measure of the Kolmogorov complexity of a passage.

Can go from 1 word: Entity, to every word.

The tradeoff being density of information, understandability to the readers, and conciseness.

There are things which "are" which are not entities: objects.

Two words. "1" and "0"

10 words. "1" and "0"

fixed it for you.

01000100 01101001 01100100 00100000 01111001 01101111 01110101 00100000 01101101 01100101 01100001 01101110 00111010 00001010 00001010

00110001 00110000 00100000 01110111 01101111 01110010 01100100 01110011 00101110 00100000 00100010 00110001 00100010 00100000 01100001 01101110 01100100 00100000 00100010 00110000 00100010 00001010 00001010 01000110 01101001 01111000 01100101 01100100 00100000 01101001 01110100 00100000 01100110 01101111 01110010 00100000 01111001 01101111 01110101 00101110
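For anyone who doesn't read binary: the two blocks above look like plain 8-bit ASCII, one character per group. A small Python sketch for decoding them (assuming ASCII; `decode_bits` is just an illustrative name):

```python
def decode_bits(text: str) -> str:
    """Decode whitespace-separated 8-bit groups as ASCII characters."""
    return bytes(int(group, 2) for group in text.split()).decode("ascii")

first = (
    "01000100 01101001 01100100 00100000 01111001 01101111 01110101 "
    "00100000 01101101 01100101 01100001 01101110 00111010 00001010 00001010"
)
print(repr(decode_bits(first)))  # 'Did you mean:\n\n'
```

Run the same function on the second block and you get the "10 words" correction followed by "Fixed it for you."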


Also spaces??

1 and i if you are quantum computing

And I and I, if you're qualia computing

Finally, something thought-provoking! Everybody, ready your Internets, this gentleman deserves an answer!

"a" and "i": since it's binary, you could define all others.

Finally, something thought-provoking! Everybody, ready your Internets, this gentleman deserves an answer!

Reminds me of Toki Pona

I'm guessing but I can't really explain why, my gut feel is 42.

Randall Munroe of XKCD experimented with this in his book Thing Explainer:



unary codes still need two symbols because you need a terminator/separator.

binary codes can be prefix-free thus self terminating.
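To make the prefix-free point concrete, here is a minimal Python sketch. The code table is invented for the example; the only property that matters is that no codeword is a prefix of another, so a concatenated bit string splits back into symbols without any separator.

```python
# Hypothetical prefix-free code: no codeword is a prefix of any other.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(symbols: str) -> str:
    """Concatenate codewords with no separator between them."""
    return "".join(CODE[s] for s in symbols)

def decode(bits: str) -> list:
    """Read bits left to right, emitting a symbol as soon as a codeword matches."""
    reverse = {word: symbol for symbol, word in CODE.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:  # unambiguous because the code is prefix-free
            out.append(reverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return out

bits = encode("abcd")
print(bits)          # 010110111
print(decode(bits))  # ['a', 'b', 'c', 'd']
```

A unary scheme, by contrast, only has runs of a single symbol to work with, so it needs a second symbol to mark where one run ends and the next begins.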

and kernel is one of them

I'd say most nouns need to be seen.

To understand duck you must see a duck (Eat a duck, pet a duck, smell a duck, hear a duck)

Perhaps you could cheat and use pixels and coordinates, using English to draw photos and videos that explain ducks.

It depends on how well you want to "define" something. Wikipedia describing a duck:

Duck is the common name for a large number of species in the waterfowl family Anatidae which also includes swans and geese. Ducks are divided among several subfamilies in the family Anatidae; they do not represent a monophyletic group (the group of all descendants of a single common ancestral species) but a form taxon, since swans and geese are not considered ducks. Ducks are mostly aquatic birds, mostly smaller than the swans and geese, and may be found in both fresh water and sea water. Ducks are sometimes confused with several types of unrelated water birds with similar forms, such as loons or divers, grebes, gallinules, and coots.

But you could also describe a duck in two simple words: "water bird". Apparently that's a real term: https://en.wikipedia.org/wiki/Water_bird

That's not really a good definition, because it is too expansive. Penguins are water birds but not ducks.

(Preparing for downvotes)

See Genesis 2:19-20 (and its placement/context). God shows Adam forms to be named.

Don't know why you would be downvoted for this.

Every ancient-enough religion starts out, effectively, the same way, in its own "religious dialect": the universe was created, life on Earth was created, and then Mankind invented language. This has made a lot of people very angry and been widely regarded as a bad move.

Things were simply not named before we invented language and named things.

Don't Goedel's incompleteness theorems imply that it is impossible to define all words using words, unless you have some axiomatic words that are not defined within the system?

Generally speaking, the number of words is finite, not infinite, so I don't believe that Godel's theorems would apply here.

I'm not saying that new words aren't created, just that in a practical sense, unless you're creating new words to mess with someone trying to do this, it doesn't apply.

You have my most enthusiastic contrafribularities.

I don't think it applies here. With words you just end up with circular definitions.

With math a circular definition is unacceptable, and that's when the theorem comes into play.

I would think circular definitions would be just as unacceptable with words.

They're not, because we don't acquire language by definition of words (alone; sometimes at all; reading dictionaries comes a long way down the road of language acquisition).

If I understand correctly, axiomatic words are exactly what's being counted here.

Ah, I misread the question. Thanks.
