
The most frequent 777 characters give 90% coverage of Kanji in the wild - sova
http://japanesecomplete.com/777
======
diego
The thing is, 90% coverage is not that great. What happens is that you
understand common words that make up for a lot of structure, but when an
uncommon word appears it's probably important to the sentence. For example,
"son, if you go to the plumbf tomorrow morning don't forget to pick up some
zlonks."

98% is closer to what you need in order to read a text and have an idea of
what's going on. See this article:

[https://www.sinosplice.com/life/archives/2016/08/25/what-80-...](https://www.sinosplice.com/life/archives/2016/08/25/what-80-comprehension-feels-like)

~~~
WnZ39p0Dgydaz1
Being fluent in Japanese as a second language, I agree with this. It's Zipf's
law and sounds great, but 90% isn't as useful as it sounds. You'll mostly
recognize a single common Kanji of compound words consisting of 2-3
characters, or common structural words. It's a far cry from being able to
understand content you find in the wild.

Also, "understanding" a Kanji is an ill-defined term. Most Kanji have multiple
meanings based on context, and many different readings. So for each Kanji
you're not learning a single character, but possibly a lot more. Especially
the Kanji that correspond to more abstract concepts you cannot learn by
themselves. They don't have a concrete meaning like "eat" or "drink". You
essentially have to memorize all of their word compounds, a single character
does not help.

EDIT: I also started out learning Japanese by memorizing the top 2000 Kanji
using Spaced Repetition. While it definitely helped, it wasn't nearly as
useful as some of these marketing-driven sites want you to believe. Kanji
meanings are too complex to be captured like that. Even if you "know" all Kanji
in a word you'll likely not understand the word's meaning unless it's
something simple and concrete. I think you are much better off memorizing and
studying word compounds. Over time you will automatically "pattern match" the
Kanji you see often to their abstract concepts.

Example from 1min of browsing a JP text: Take "可能" which is a very common word
that usually means "possible". Knowing the two Kanji (tolerant and ability) is
not going to help you. It could mean dozens of other things based on those
simplified Kanji meanings. This is not an exception, the majority of words are
like this. On the other hand, let me give you a bunch of words containing "可":
許可、可決、可能性、不可欠 (permission, approval, possibility, essential) and you start to
pattern-match that "可" corresponds to something like "positive possibility",
but it's hard to translate.

~~~
WnZ39p0Dgydaz1
Afterthought: Kawaii (cute) is actually 可愛い, also containing "可", which
literally may mean something like "a thing that can be easily loved", or
simpler, cute. But you wouldn't be able to guess that if you just know the
Kanji.

~~~
mytailorisrich
I don't know for Japanese as meaning sometimes shifts from Chinese, but in
Chinese the standard definition of 可 is "can, may, be able to".

You obviously learn it by itself but as Chinese words are mostly a combination
of 2 characters, you immediately also have to learn e.g. 可以 (can, may, be able
to), 可能 (maybe), 可爱 (cute) etc.

So someone who's learning characters in order to get 90% coverage (or
whatever) would not simply learn characters but learn actual words. Learning
characters in isolation would not be that helpful, indeed.

When you don't know a word (i.e. a combination) but you know the individual
characters it is much easier to learn the new word either by guessing or
checking.

In context, the meaning of 可爱 would be fairly straightforward to guess, for
example. Even in English 'lovable' is a synonym of 'cute'.

~~~
fenomas
> the meaning of 可爱 would be fairly straightforward to guess

To be honest this whole thread about 可愛い is more or less bonkers, because it's
an ateji. The word's meaning doesn't derive from the characters, the
characters got arbitrarily attached to an existing word because they were
similar in sound and meaning.

As such, the whole thing is about as meaningful as talking about how easy it
is to guess that 珈琲 means "coffee"...

~~~
mytailorisrich
I must say I don't know much about ateji in Japanese.

In this case, though, it does seem that the characters were chosen at least
partly because of their actual meaning.

It seems that it is both an ateji and a jukujikun [1] because the word does
not come from the characters but the characters do have the correct meaning.

[1][https://en.wiktionary.org/wiki/%E5%8F%AF%E6%84%9B%E3%81%84#J...](https://en.wiktionary.org/wiki/%E5%8F%AF%E6%84%9B%E3%81%84#Japanese)

~~~
fenomas
> the characters were chosen at least partly because of their actual meaning

Sure, didn't I say they were in my post?

The point was, in a discussion of how well X predicts Y, it's not very useful
to examine a test case where Y came first and X was chosen post-hoc to match
it.

~~~
mytailorisrich
> _Sure, didn't I say they were in my post?_

No, quite the opposite actually ;)

~~~
fenomas
> the characters got arbitrarily attached to an existing word _because they
> were similar in sound and meaning_

~~~
mytailorisrich
It does not seem arbitrary in this case because the meaning does match.

I do take your point that using that word in the discussion above, which is
about Japanese, was not the best example. On the other hand, it is a good
example in Chinese.

~~~
fenomas
(A) In Japanese the meanings don't match that closely. The word didn't
originally mean cute, but rather pathetic or pitiable, and evolved over time.
More info: [http://gogen-allguide.com/ka/kawaii.html](http://gogen-allguide.com/ka/kawaii.html)

(B) By arbitrary here I mean that there is no linguistic connection.
"Arbitrarily chosen because they are similar" => "chosen for no reason other
than their similarity".

------
vsnf
An interesting assertion, but the article is both vacuous and confusing. It
doesn’t link to the underlying study, combines what one tends to think of as
kanji (single characters) with more complex juku-kanji (multiple characters)
with straight up whole dictionary words. Some of those words are presented
with English definitions to the side, but most aren’t. Some of the kanji are
presented with pronunciations, but most aren’t. There’s no support for the
thesis statement.

I have no particular reason to disbelieve the headline, it sounds reasonable
enough, but this page in specific is doing nothing to persuade me of its main
point.

~~~
sova
A Japanese Logographic Frequency List (2000), Chikamatsu et al.

[https://researchmap.jp/YOKOYAMA_Shoichi/%E8%B3%87%E6%96%99%E...](https://researchmap.jp/YOKOYAMA_Shoichi/%E8%B3%87%E6%96%99%E5%85%AC%E9%96%8B/?action=multidatabase_action_main_filedownload&download_flag=1&upload_id=6391&metadata_id=18550)

~~~
wodenokoto
Great find, digging up what appears to be the base of TFA’s list.

Some critique of using the linked list: it is based on a newspaper corpus
from 1993, which explains why day/sun is at the top. That would not be the
case for a list created from Wikipedia or television subtitles, both sources
that are arguably closer to “in the wild”.

------
kazinator
I know around 10,000 Japanese words. As far as kanji goes, I've studied some
1200 of them intensively; through vocab I know many more as parts of words. I
quite often have to reach for a dictionary when reading.

If you know 777 kanji in some way (like associating them with meanings,
through your native language) and you haven't crammed on any vocabulary, you
absolutely will not be able to read a thing.

In fact, even if you continue that way and memorize over 2000 kanji, and
recognize every single one in a given document, you still won't be able to
read anything without vocab.

The broadened knowledge will help support vocabulary building, though.

~~~
matt-attack
> I quite often have to reach for a dictionary when reading

So when reading kanji, how do you look up a word (picture) you don't know?
Since there isn't a minimal set of characters, the notion of "alphabetical
order" seems impossible. Weird that I've never thought about this until now,
but I'm honestly baffled.

~~~
laurieg
Let's say I see the word: 銀行 and want to look up the first character. I notice
that the left hand side is 金 so I go to a dictionary, turn to the kanji
radical section[1] find 金 and then find the original character by number of
strokes[2]. This character has 14 strokes. If you have any experience with
kanji then counting strokes is pretty trivial.

Also note that on the linked site, common kanji have a red background. Many
of the characters are obscure, so radical + stroke count narrows down the
choices to very few kanji.

[1]
[https://kanji.jitenon.jp/cat/bushu.html](https://kanji.jitenon.jp/cat/bushu.html)

[2]
[https://kanji.jitenon.jp/cat/bushu08002.html](https://kanji.jitenon.jp/cat/bushu08002.html)
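
The lookup described above amounts to a two-key index: radical first, then stroke count. Here is a minimal Python sketch of that idea. The tiny `KANJI_TABLE` and the `build_index` / `lookup` helpers are hand-made for illustration and are not taken from the linked site; a real dictionary holds thousands of entries.

```python
from collections import defaultdict

# Illustrative entries only: (kanji, radical, total stroke count).
KANJI_TABLE = [
    ("針", "金", 10),
    ("鉄", "金", 13),
    ("鉛", "金", 13),
    ("銀", "金", 14),
    ("銅", "金", 14),
    ("銭", "金", 14),
    ("鏡", "金", 19),
    ("行", "行", 6),
]

def build_index(table):
    """Group kanji under a (radical, stroke count) key, like a paper dictionary."""
    index = defaultdict(list)
    for kanji, radical, strokes in table:
        index[(radical, strokes)].append(kanji)
    return index

def lookup(index, radical, strokes):
    """Return the candidate kanji for a radical and stroke count (may be empty)."""
    return index.get((radical, strokes), [])

index = build_index(KANJI_TABLE)
candidates = lookup(index, "金", 14)
print(candidates)  # a short list to scan by eye for the character you saw
```

The point of the two keys is that the final step is a visual scan over a handful of candidates rather than a search through the whole dictionary.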

------
kazinator
> _Take solace in the fact that Japanese Complete has arranged the kanji and
> the verbs based on a frequency analysis of the Japanese corpus._

Jim Breen's KANJIDIC has frequency information.

[http://www.edrdg.org/wiki/index.php/KANJIDIC_Project](http://www.edrdg.org/wiki/index.php/KANJIDIC_Project)

> _The 2,501 most-used characters have a ranking which expresses the relative
> frequency of occurrence of a character in modern Japanese. The data is based
> on an analysis of word frequencies in the Mainichi Shimbun over 4 years by
> Alexandre Girardi. Note: (a) these frequencies are biased towards words and
> kanji used in newspaper articles, and (b) the relative frequencies for the
> last few hundred kanji so graded is quite imprecise._

------
olsgaarddk
A few years ago I downloaded several hundreds of megabytes of Japanese
subtitles, split into 3 categories: live action/drama, anime and foreign
film/tv

I’ve listed them in a google sheets together with a few other corpora

[https://docs.google.com/spreadsheets/d/1yb5dq4ahdwc_g0aQTL3Y...](https://docs.google.com/spreadsheets/d/1yb5dq4ahdwc_g0aQTL3YM6i2mKiZ2m-AvhpwygbZD4A)

Choose the jimaku tab for subtitles to see how big the variation between
corpora can be.

According to other comments here, it appears that OP's list is based on a
newspaper corpus from 1993.

~~~
echelon
The source links appear to no longer work. Do you know where we can download
Japanese subtitles?

I would love to attempt to segment a bunch of Japanese subtitles into words
and then do frequency analysis. My interest is in increasing my listening
ability, so I want to put the most frequently spoken words into SRS/Anki, and
perhaps even break it down by anime.

Alternatively, has anyone already done this?

~~~
olsgaarddk
That was my initial goal, but I had a lot of trouble with vanilla MeCab not
understanding a lot of the text. But this was before neologd, so I think it
would work better now.

I don’t have the source code on me, but I scraped it from a website that
publishes subtitles. The scraping was easy, the cleaning not, and I believe
this spreadsheet is generated from my first attempt at cleaning.

A lot of sources in Japanese NLP and linguistics have a bad habit of changing
URLs often, so links bitrot easily. Sorry.

------
xxxpupugo
This means very little though.

Take me as an English learner, for example. I would say I was only able to
understand everyday English without too much hassle after I had acquired
around 10k words, which, as I just checked, gives a coverage of about 98%+.

Note that this is still NOT enough, actually far from enough. Right now I
believe I have mastered around 15k to 20k words, by various estimates, and
navigating English on the internet goes like a charm, with very little
context switching to my native language.

Still, reading literature is a huge undertaking for me. About once a page I
have to let something go: if I stumbled on certain unknown words/phrases and
couldn't move on before fully understanding them, my pleasure in reading
would be ruined. Such compromise frustrates me still, to this day. On the
other hand, I never have second thoughts about reading the most cryptic novel
in my own language; understanding might still be a challenge, but it is
unlikely to be due to insufficient vocabulary.

~~~
wingerlang
How do you check your coverage? Or even how many words you know?

~~~
bspammer
Just googling around, I found this test: [https://www.arealme.com/vocabulary-size-test/en/](https://www.arealme.com/vocabulary-size-test/en/)

No idea about its accuracy, but as a native English speaker it's telling me my
vocabulary size is 30k words, which sounds roughly correct.

~~~
thwave
I've tried it now and got 22k, which seems not so bad for a foreigner ("Top
6.53% Your vocabulary is at the level of professional white-collars in the
US!"), but I feel like I cheated: most of the more fancy English words are
just misspelled Latin, and having even a modest Latin vocabulary (I don't
think I know more than 4k Latin words) makes their meanings pretty obvious.

------
redthrow
If Japanese people just used kana (like Korean people use hangul today), kids
in Japan wouldn't have to spend countless hours learning this complex (and often
irrational) writing system.

[http://kanamozi.org/hikari959-04.html](http://kanamozi.org/hikari959-04.html)

> If you really want to be native-level Japanese, kanji are essential

There are visually impaired people who have difficulty learning kanji but
speak Japanese fluently. Language is not just for people who can read and
write, let alone reading and writing complex characters, or spell things
"correctly".

Richard Feynman:

 _If the professors of English will complain to me that the students who come
to the universities, after all those years of study, still cannot spell
"friend," I say to them that something's the matter with the way you spell
friend_

~~~
asutekku
Have you read Japanese written only with hiragana? The language has so many
homonyms that it is hard to tell them apart in a text, as it is a heavily
context-based language. I’d argue that abolishing kanji and using only
hiragana/katakana or the Roman alphabet makes learning the language harder
after the very beginning, when your vocab starts to increase.

~~~
redthrow
I'd recommend playing the excellent SNES game Mother 2 (マザー２）, a huge hit in
Japan and nobody complained about the text being difficult to read.

Or you can watch one of the let's play videos with narration here:

[https://m.youtube.com/watch?v=F_UrqsO2JQ0&list=PLC4EWNG6GsuY...](https://m.youtube.com/watch?v=F_UrqsO2JQ0&list=PLC4EWNG6GsuY5SExvHdnN9l9nP_ghz5vo&index=2&t=0s)

~~~
innocenat
But games don't show you pages and pages of Hiragana. And it's an old game --
it's an accepted technical limitation.

Have _you_ read pages and pages in Hiragana? Or even paragraphs? It's not fun,
and it takes a lot of concentration.

~~~
redthrow
It's no harder than having conversations in spoken Japanese where you don't
see any text, kanji or otherwise.

It would look awkward initially, but that's just because people are just used
to the status quo which is kanji-kana mix.

~~~
innocenat
Spoken Japanese has intonation and rhythm which make identifying word
boundaries and homonyms easier.

Reading hiragana (especially without spaces as word boundaries) is totally
different. Reading long hiragana passages aloud helps, but there is still the
problem of ha/wa and he/e.

~~~
redthrow
Using space is a given if you write exclusively in kana. See Kanamoji-kai:

[http://kanamozi.org/](http://kanamozi.org/)

Also, the very reason there are so many homonyms in written text in the first
place is _because_ of the kanji (over)usage - that is, because people think
there are visual cues they are less careful about choosing words that are also
understood easily by people listening to the words.

When people speak, at least if they are a competent speaker, people tend to
avoid the overuse of homonyms (mostly kango).

------
jaredklewis
A brief tangent, but something I've noticed is that the meanings of words like
"fluent" and even "native" are so ambiguous and poorly defined, that it is
almost impossible to have a meaningful conversation about language learning
unless you avoid them completely.

The marketing materials for language learning resources tend to make full use
of this vagueness, like this one does. I wish these resources instead did more
to enlighten their prospects as to what one can actually expect to achieve and
in what kind of time frames.

Languages are endlessly deep. "Native" is not even close to the top. Even
amongst "native" speakers, skill with and understanding of language is
enormously varied. Compare the wedding toast of a skilled public speaker with
that of an average one. Compare a literature scholar's understanding of a
classic novel with that of an ordinary high school graduate. It's night and
day.

IME, 777 kanji wouldn't get you very far in a newspaper and certainly not a
novel. It would likely be enough to understand 90% of ordinary emails and text
messages.

So many great resources to learn Japanese with these days; this vocabulary
list is not one of them.

------
franciscop
This is known in the education space, and tests like the JLPT N5-N1 are based
on this! Also when learning English, the learning material and exams are based
on this. The order of frequency is not strictly followed, though: if you have
to learn "Monday" now and "Tuesday" in a couple of days, it makes sense to
bundle them all by concept and learn "getsukasuimokukindonichi" at once. So
in Japanese learning, on a day-to-day basis it might seem like you learn random
difficult-easy words or characters, but overall you only have to _memorize_
the top N characters/words for the next test and you'd be alright.

I made a free website to memorize Kanji that works offline:
[https://core.cards/](https://core.cards/). Initially I did maintain a list of
the top 100, top 500, and top 1000 (approx) if I recall correctly, extracted
from Wikipedia lists, to learn Japanese Kanji. But now I've switched it to
just follow the JLPT because they were almost the same.

------
echelon
Does anyone have frequency lists of vocabulary broken down by type (verb,
noun, etc.)? I've seen word lists on Wiktionary, and I'm attempting to cross-
reference jisho and other sources.

How many words do you need to comprehend for daily competency? Would the
10,000 suffice?

How many words do you need to be able to watch anime aimed at children (eg
Bono Bono) or teenagers (Boku no Hero Academia)?

~~~
sova
Great questions, let's figure out the answers! I think 6,000-7,000 words (just
a wild guess based on experience) would cover a lot of daily conversation,
plus specializing in whatever domain you're in a little.

------
bgee
Disclaimer: native Chinese speaker, knows some Japanese, English sufferer

Putting aside the argument of whether removing all Hanzi from Japanese text
would actually be more efficient or not, the question to me is: why stop at
Hanzi? Why not romanize all the Japanese literature? Surely almost all the
reasoning in favor of getting rid of Hanzi can also apply here?

edit: grammar

~~~
angelsl
That's essentially what Korean did. They replaced everything with their own
morphophonemic orthography.

But even then they still use Hanja to disambiguate sometimes.

~~~
bgee
I was not even talking about Hangul, the point I made was on using a Latin
alphabet. If we are revamping the whole writing system, why reinvent the wheel
if the main goal is "efficiency"?

edit: obviously Kana has already been created

~~~
redthrow
For ~150 years there have been notable Japanese people who argued that the
Japanese writing system should adopt the Latin alphabet.

[https://ja.m.wikipedia.org/wiki/%E3%83%AD%E3%83%BC%E3%83%9E%...](https://ja.m.wikipedia.org/wiki/%E3%83%AD%E3%83%BC%E3%83%9E%E5%AD%97%E8%AB%96)

I'm personally fine with both romaji and kana (learning kana is not a big
cognitive burden anyways).

I'm even fine with kanji (or Latin or anything) if people learn it
voluntarily. What's not ok is kids being forced to learn them in school.

------
LastZactionHero
As someone who spent a lot of time learning kanji alone, there's not much you
can do with kanji alone. It's a helpful step in learning actual words,
learning strokes, finding patterns, but I'd be skeptical of the utility of
this list.

If you're going to rote memorize something, I'd probably start with the
radicals.

------
visarga
What about kanji combinations? I bet you got to learn many thousands (made of
the same 777 kanjis).

------
wrp
Vocabulary frequency lists have long been very popular with publishers, so I
assume as well with language learners. The assumption behind these is that
with knowledge of, say, 90% of the vocabulary you encounter, you will
comprehend 90% of the writings you encounter. It doesn't work out at all that
way, though, because it is generally the low frequency vocabulary that carries
the key information in a text and on which the interpretation of everything
else hinges.
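
A rough back-of-the-envelope version of this point, as a sketch only: assume (unrealistically) that each word in a sentence is known independently with probability equal to the coverage figure. The 20-word sentence length and the function name are made up for illustration.

```python
def p_sentence_fully_known(coverage: float, sentence_length: int) -> float:
    """Chance that every word in a sentence is known, assuming independence."""
    return coverage ** sentence_length

for coverage in (0.90, 0.95, 0.98):
    p = p_sentence_fully_known(coverage, 20)
    print(f"{coverage:.0%} word coverage -> {p:.0%} of 20-word sentences with no unknown word")
```

Under these assumptions, 90% coverage leaves only about 12% of 20-word sentences free of unknown words, while 98% gives about 67%. Real text is not independent word draws, so treat the numbers as an illustration of the trend, not a measurement.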

~~~
jacobolus
In the US, there are beginning reader books (e.g. the “I Can Read!” and “Ready
to Read” series) which intentionally use somewhat limited vocabulary. These
are somewhere between a picture book and a chapter book: they have usually 1.5
pages of text and 0.5 pages of picture in each 2-page spread; the text is set
in a large font but there are at least a few sentences per page; usually the
books are 30–50 pages long, with 3–5 “chapters”.

I don’t know how useful these are for independent reading by 5–6-year-olds,
but anecdotally they are great material for reading to 2-year-olds, better
than most picture books. (Note: some of the recent readers are garbage
marketing gimmicks with movie tie-ins, ranging from boring to
incomprehensible; skip those.)

~~~
bdowling
Notable examples are _The Cat in the Hat_ , which uses 236 words, and _Green
Eggs and Ham_ , which uses only 50 words. (Wikipedia)

------
echelon
Before this falls off the front page, I figured I would ask the following:

1) Does anyone have a resource for Japanese subtitles (in Japanese/kanji, not
English)?

2) Does anyone have good frequency { word => frequency } lists? Especially if
they are topical, eg. school-related, anime-related, industry-related.

3) What are the best programs for segmenting Japanese text into words
reliably?

4) Does anyone have a vocabulary set for any given manga, anime, or film that
you could study before watching?

5) In addition to Anki and Wanikani, what are good SRS apps or programs?

6) Does anyone use Skype (or similar) to practice with native Japanese
speakers? How is it? How did you find people to practice with?

7) What is the inflection point (in terms of raw # of vocabulary) to being
able to understand Japanese anime or drama? What JLPT level does this
correspond to?

8) How many new words do you acquire per day of study? How long have you been
studying? Have you taken any of the JLPT tests?

~~~
htns
5) zkanji: [https://github.com/z1dev/zkanji](https://github.com/z1dev/zkanji)!
It's got a dictionary from which you can directly add words into its study
decks when looking them up, and it has handwriting recognition plus lets you
easily find similar looking kanji (with shared components etc), which is great
when tesseract ocr fails, or when the text is so blurry/compressed you can't
really even see it clearly (the online Japanese war history archives really
love to compress their scans).

~~~
echelon
Thanks for the heads-up! :)

Not long after this HN thread, this thread popped up on Reddit. It answered a
lot of my questions ([1], [2], [4], and [7]), and I found it immensely useful:

[https://www.reddit.com/r/LearnJapanese/comments/crlsqj/googl...](https://www.reddit.com/r/LearnJapanese/comments/crlsqj/googlesheet_anime_frequency_list/)

------
kerorin
私はネイティブの日本語話者です。確かに777個の漢字を知ると90%をカバーできるとは思いますが、それがイコール日本語の90%をマスターしたとは言えないはずです。実際、この文章は極めて平易ですが、外国人にはそれなりの難易度になっていて読むのに非常に苦労するんじゃないかなと思います。
それから、リストのすべてをレビューしましたが、一部「経（ふ）」や「格別空」は言葉としておかしいです。また「恬然」や「整復」は明らかに頻出語句ではなく、ソースに偏りが見られます。

I'm a native Japanese speaker. I agree that knowing 777 kanji probably covers
90% of the characters, but that does not equal having mastered 90% of
Japanese. In fact, the Japanese text above is extremely plain, yet I think it
is fairly difficult for foreigners and takes real effort to read. I also
reviewed the whole list. Some entries, such as「経（ふ）」and「格別空」, are odd as
words. In addition,「恬然」and「整復」are clearly not frequently used
expressions, so the sources look biased.

~~~
kerorin
You can estimate your Japanese vocabulary size on this site.
[https://www.arealme.com/japanese-vocabulary-size-test/ja/](https://www.arealme.com/japanese-vocabulary-size-test/ja/)

Here are examples of native Japanese speakers. Usually we can achieve 20,000
easily. My result was 36,000 words.

[http://burusoku-vip.com/archives/1798300.html](http://burusoku-vip.com/archives/1798300.html)
[https://b.hatena.ne.jp/entry/s/www.arealme.com/japanese-voca...](https://b.hatena.ne.jp/entry/s/www.arealme.com/japanese-vocabulary-size-test/ja/)

------
timwaagh
So what is the typical reason for people to learn Japanese? Is it a good
destination for something more important than visiting a few shrines or buying
rare manga? I'm currently learning Vietnamese, another East Asian language.
It's very difficult. I'm very good at learning languages, generally. But not
this one. The only reason I'm doing this at all is that I'm here anyway after my
plans failed two days into the trip and I can't communicate so I'm alone and
shit out of luck. I don't think I will know a reasonable level even after the
three weeks are over. Waste of a trip but I still get some decent pictures to
post online.

~~~
vharuck
I like learning foreign languages because it's interesting to see how they
outline ideas and idioms are a great way to learn about the culture. So it's
an enjoyable hobby that builds a skill (maybe of little value, but I can
deceive myself).

I chose Japanese because I watch and read a lot of stuff from Japan. It's also
very different from English, which is fun.

------
euske
The frustrating issue is that they don't state in _what situation_. I'm
certain that most people will just do fine with only 100 kanji characters for
daily life, or no kanji at all if they're just a tourist. But if you're
applying for a full-time Japanese speaking job in a Japanese company, knowing
only 777 characters is a joke (which is about the 5th-grader level). Just like
a programming language, the level of fluency required depends on a task.

p.s. just confirmed that the list is not enough for filing taxes in Japan.
They don't have words like 所得 (income), 控除 (deduction) or 医療費 (medical
expense).

------
fortran77
Not really sure what the major point is. Sure, there's a core set of "units"
(words / Kanji characters) that are useful when trying to learn a language. I
didn't even attempt to speak English conversationally until I had about 200
verbs and 500 nouns and maybe 100 adjectives memorized. This lets you have
useful conversations and basic understanding when reading, but it's still a
struggle to get that last 10%.

------
rootsudo
Website is about 777 kanji that give 90% coverage of Japanese.

No Kanji at all, just English words and some romaji like "tsu", which can also
be kana.

I was disappointed by the article.

~~~
emilfihlman
Disable automatic translation.

------
Shorel
Seems similar to what Fluent Forever (book, application and website) claims.

Memorize the 625 most used words to jump-start your language learning and then
you can move to grammar and other stuff.

[https://blog.fluent-forever.com/base-vocabulary-list/](https://blog.fluent-forever.com/base-vocabulary-list/)

------
lttlrck
For non-experts it is helpful to know there are around 50000 kanji characters
so 777 is ~1.6%

[https://japanese.stackexchange.com/questions/11735/how-many-...](https://japanese.stackexchange.com/questions/11735/how-many-kanji-characters-are-there)

------
chasontherobot
I've read through the linked paper and I can't understand where they get the
assertion that 777 characters give 90% coverage. The original paper isn't even
about that topic, but rather comparing and contrasting a corpus created in
1994 with a corpus created in 1962 and 1976.

------
dvduval
Yes, I have been studying Chinese for about five years now and I can recognize
about 2700 characters. Looking at this Japanese character set, I'm guessing I
know 80% or so. I always like charts like this.

~~~
jackklika
I've noticed the same where I can understand the basics of a Japanese program
if there's subtitles. But there are some oddities like 食 being used in a verb
(as in 食べる) or 行 as to go. But you still have good context and can at least
understand the topic in most cases.

~~~
wilsonthewhale
Knowing Cantonese helps here, as both the verbs you mention are still common
verbs in modern Cantonese.

------
slashcom
Zipf’s law

~~~
causality0
Beat me to it. To expound upon this, Japanese is not unique in basically
adhering to Zipf's law. In many organization data sets, including the
vocabulary of most languages, the most commonly used word is twice as common
as the second-most, and the second-most word is twice as common as the third-
most, and so forth.

~~~
fenomas
> second-most word is twice as common as the third-most, and so forth.

Normally Zipf's law refers to the frequency being inversely proportional to
rank - i.e. the 3rd most common element would be 1/3 as frequent as the first,
not 1/4th.
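
To make the difference concrete, here is a small sketch comparing the two readings: weights proportional to 1/rank (the usual statement of Zipf's law) versus weights that halve at every rank. The item count of 3000 is an arbitrary assumption for illustration, not a figure from any corpus.

```python
def zipf_weights(n: int, s: float = 1.0) -> list[float]:
    """Normalized weights proportional to 1 / rank**s."""
    raw = [1.0 / rank ** s for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def halving_weights(n: int) -> list[float]:
    """Normalized weights where each rank is half as frequent as the previous one."""
    raw = [0.5 ** (rank - 1) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def items_needed(weights: list[float], target: float) -> int:
    """Smallest number of top-ranked items whose weights sum to at least target."""
    covered = 0.0
    for count, w in enumerate(weights, start=1):
        covered += w
        if covered >= target:
            return count
    return len(weights)

zipf = zipf_weights(3000)
halving = halving_weights(3000)

print("rank 1 / rank 3, Zipf:   ", zipf[0] / zipf[2])        # ratio of 3
print("rank 1 / rank 3, halving:", halving[0] / halving[2])  # ratio of 4
print("items for 90%, Zipf:   ", items_needed(zipf, 0.9))
print("items for 90%, halving:", items_needed(halving, 0.9))
```

Under halving, four items already cover more than 90% of all occurrences, which no real vocabulary comes close to. Under 1/rank the tail is far heavier, which is why coverage figures like "N items give 90%" involve hundreds or thousands of items.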

------
canjobear
A simple information-theoretic argument suggests that most of the information
is in the remaining 10%.

~~~
tfha
Not enough information. For example, in English some of the least common
characters are also generally not important for conveying information.

Your argument only holds if Kanji was designed to be optimal in compressing
information, which is of course not how Kanji came to be.

That doesn't mean you are wrong, but your argument is invalid.

~~~
6gvONxR4sf7o
Aren't kanji closer to english words than english letters?

------
WalterBright
It's similar to how learning about 3000 words in a foreign language will make
you passably fluent.

------
RadioHacker
Wouldn't 777 words in any language give 90% coverage of all the words in a
typical newspaper?

~~~
grzm
One difference here is that Japanese characters can be combined in various
ways to create different words, so the 777 characters can be used to create
many more than 777 words. Compare with the Simple English Wikipedia which
strives to use only the 1000 most commonly used English words. I think you'll
find that experience quite a bit different from reading a typical
English-language newspaper.

[https://simple.wikipedia.org/wiki/Wikipedia:Simple_English_W...](https://simple.wikipedia.org/wiki/Wikipedia:Simple_English_Wikipedia)

Another example of "using only the ten hundred words people use most often" is
Randall Munroe's "Up Goer Five":

[https://xkcd.com/1133/](https://xkcd.com/1133/)

It may be helpful to think of "characters" as representing some middle ground
between words and alphabetic letters, a little like word stems.

------
viburnum
I thought the number of standard kanji was only about 1900 to begin with.

~~~
fiblye
2136.

But the list isn’t at all comprehensive. There are a considerable number of
kanji in regular use that aren’t on the list, and when you include
place/person names, it grows massively.

It’s possible to memorize all the standard kanji, but crack open a history
book and you won’t recognize half of the words.

~~~
viburnum
Haha, it was 1,945 when I was studying Japanese, times change. Kind of
surprised they added characters rather than further shrink the list.

------
blondie9x
This is kind of common knowledge. JLPT 4 then 3 then 2 then 1. Done

------
ekianjo
Lol, 777 kanji is just too few to even be comfortable reading emails at work
in Japanese. So what good is 90 percent if the only thing it allows you to do
is shopping?

~~~
quicklime
777 of the most popular kanji from general Japanese might not be useful for
work emails, but if you find the top 777 from a corpus of your own emails, it
might be.

It's not hard - I've done this before using a corpus made up of work
documents. There's a part-of-speech analysis tool called MeCab that gives the
word stems, and makes it easy to find word boundaries (since Japanese doesn't
use spaces).

The output went into Anki and it didn't take long before I was reading emails
and documents at work fairly easily.
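
A character-level version of this idea fits in a few lines of standard-library Python. This is a sketch under stated assumptions: it counts kanji only (it does not do the word segmentation that MeCab provides), it treats any character in the CJK Unified Ideographs block U+4E00 to U+9FFF as a kanji, and the three-sentence corpus and all function names are made up for illustration.

```python
from collections import Counter

def is_kanji(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def kanji_frequencies(texts):
    """Count how often each kanji occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if is_kanji(ch))
    return counts

def coverage_of_top(counts, n):
    """Fraction of all kanji occurrences accounted for by the n most frequent kanji."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total

# Stand-in for your own emails or work documents.
corpus = [
    "会議の資料を確認してください。",
    "明日の会議は十時からです。",
    "資料を送付しました。ご確認ください。",
]

counts = kanji_frequencies(corpus)
print(counts.most_common(5))
print(f"top 4 kanji cover {coverage_of_top(counts, 4):.0%} of kanji occurrences")
```

Swapping in a real corpus and raising `n` shows how many characters a given domain needs for a chosen coverage level, and the `most_common` output is a ready-made study order.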

~~~
sova
genius

------
microcolonel
The long tail is full of crucial nouns.

------
foota
In other news, only 24 English characters cover nearly 100% of all the
characters in use in English.

~~~
cyborgx7
But knowing 24 characters doesn't make you understand 100% of the words.
Actually, it gives you 0%.

~~~
phlyingpenguin
It turns out there are more than 777 words commonly used. Kanji is not
different in this respect.

------
gfodor
now do mandarin please :)

------
andrewkondelin
hgv

------
knolax
This is just a confirmation of Zipf's Law[0], which applies to all languages.

[https://en.m.wikipedia.org/wiki/Zipf%27s_law](https://en.m.wikipedia.org/wiki/Zipf%27s_law)

------
H8crilA
Yet another (re)discovery of Pareto's distribution. Net worth, stock returns,
popularity of words in languages, casualties in wars or natural disasters,
size of cities, popularity of artwork pieces like songs or computer games, ...

[https://en.wikipedia.org/wiki/Pareto_distribution#Applicatio...](https://en.wikipedia.org/wiki/Pareto_distribution#Applications)

~~~
sova
This is closer to 90:25 rather than the typical Pareto of 80:20

~~~
H8crilA
80:20 is a pop-culture take on the distribution. It does a good job of
visualising it to someone who doesn't know maths.

It's really any distribution with a tail (complementary CDF) of the form x^(-a)

~~~
sova
The wikipedia article says it's commonly formulated as 80:20, so that's where
I'm getting my info. You're saying it covers every nice Pareto ratio, which is
very different. Because 90:25 is much better than 80:20 but they are part of
the same phenomenon. Well, you can call everything a Pareto phenomenon then
and what's the point if everything fits in this universalish category? How can
I explain the value of 90:25 without invoking Pareto and having it constantly
diluted to 80:20?
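
One way to put the two ratios on the same footing, sketched under a strong assumption: if the data really followed a Pareto distribution with shape a > 1 (kanji frequencies need not fit it), the top fraction p would hold a share p^(1 - 1/a) of the total. Reading "90:25" as "the top 25% accounts for 90%" is also an interpretation made for this sketch, and the function names are illustrative.

```python
import math

def top_share(p: float, a: float) -> float:
    """Share of the total held by the top fraction p under Pareto shape a > 1."""
    return p ** (1.0 - 1.0 / a)

def shape_for(p: float, share: float) -> float:
    """Shape a for which the top fraction p holds the given share."""
    exponent = math.log(share) / math.log(p)  # this equals 1 - 1/a
    return 1.0 / (1.0 - exponent)

a_80_20 = shape_for(0.20, 0.80)
a_90_25 = shape_for(0.25, 0.90)

print(f"80:20 corresponds to shape a = {a_80_20:.3f}")
print(f"90:25 corresponds to shape a = {a_90_25:.3f}")
print(f"top 20% share under the 90:25 shape: {top_share(0.20, a_90_25):.1%}")
```

Both ratios come from the same family with different shape parameters. A smaller shape means a heavier concentration at the top, so quoting the shape (or the share at a fixed fraction) keeps 90:25 distinct from 80:20 instead of folding it into the slogan.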

