The thing is, 90% coverage is not that great. What happens is that you understand common words that make up for a lot of structure, but when an uncommon word appears it's probably important to the sentence. For example, "son, if you go to the plumbf tomorrow morning don't forget to pick up some zlonks."
98% is closer to what you need in order to read a text and have an idea of what's going on. See this article:
As someone who barely scraped by the Kanji/readings of my N2 but has to do a chunk of my work in Japanese, I gotta disagree.
Sure, reading isolated sentences with only 90% coverage is really hard sometimes. But usually you're reading whole blocks of text, so you have context. Also, maybe you don't know some kanji but it has similar radicals to others, so you can make educated guesses (only sometimes of course).
I mean it's just like English! There are radicals/etymological clues, there are context clues, there's so much that you can work off of to figure things out. And exercising "figuring things out with partial information" is a valuable skill that can compensate for a lot of missing kanji knowledge.
An aside: I've heard that when learning a language, you should be able to understand roughly 80% of what you're reading/listening to, because then you'll have enough of a base to not feel overwhelmed, while still having new stuff to absorb.
When I was learning English, the most memorable/epic lesson of advanced English I ever took was trying to understand what we were told was an "advanced" poem: Jabberwocky!
We were a class of 6-8 students, in pairs making our guesses of what the words that we didn't "understand" meant. We all got pretty advanced ideas and all got to similar conclusions. The teacher would tell us afterwards how those were made up words, and we were all in awe at how much we still understood and agreed upon the meaning of the words.
I don't disagree with your point in general but the irony is that your first sentence is an excellent demonstration of the parent comment's point.
I have no clue what a "N2" is and I'm not entirely sure what you meant by "Kanji/readings". There are only six other sentences of context and they didn't really help to understand the first one.
And still you're fluent in English (I assume) - it's not an issue with the English language but your understanding of the context. Japanese (and Chinese) work a bit differently than Latin languages. Even if you know 100% of the kanji (very few Japanese people do), it doesn't mean you know 100% of the words - and vice versa. Since the characters are idiomatic, it also makes it easier to guess the meaning of a word or character you don't know than if they would have been written out phonetically.
Kanji: The Chinese characters in question
N2: The second highest level of the Japanese Language Proficiency Test (JLPT). N1 is by many seen as a very high bar and many Japanese employers demand it if you'll be using Japanese as your main working language. N2 is generally attained after 1-2 years of Japanese studies.
> Since the characters are idiomatic, it also makes it easier to guess the meaning of a word or character you don't know than if they would have been written out phonetically.
Ideographic, as in trying to portray ideas graphically, not idiomatic, as in using expressions like a native speaker
> "N2 is generally attained after 1-2 years of Japanese studies"
From my own experience, this would only be true in the most favorable of situations. 2 years studying full-time while living in Japan sounds about right. 1 year studying as a hobby a few hours a week, no way.
Do kanji handwriting-recognition models even “know” (i.e. are trained on) all in-use kanji, or do they just not bother with some of the rarest ones because their presence in the dataset would decrease the likelihood of correctly matching the more common characters, which are overwhelmingly likely to be what you’re writing?
Presumably it's the same as words in English, where you'd never make a word suggestion model that just blindly considers all possible words. You swipe jow or gow and it will change it to "how" despite the other two existing in arguably correct English in some very obscure case.
I think even if you reach the parent comment's 98% point, "N2" will not appear there. But 90% English is probably good enough to be able to reason "N2 must be some kind of level".
Exactly. You need to be in the zone of proximal development or learning isn’t very efficient. I remember as a teen I had a fixed mindset and would get 100% on every calculus test. The problem became that I liked getting a perfect score so much that I ended up actively avoiding harder material and just sticking with what I was taught in school. I was afraid of posing interesting problems to myself out of fear that I wouldn’t be able to answer them.
This is a dangerous and corrosive mindset that did a lot of harm to me and other people I know. It took me maybe until 25 to actually start to accept my lack of knowledge and incompetence and start to do something with it. I sometimes wonder what my life would be like if I had the emotional maturity to challenge myself at a young age instead of pursuing destructive perfectionism.
I thought his first sentence made his point. I didn't know what an N2 was either, but from the context I thought it could be some kind of test he barely passed.
Maybe whether one can make a reasonable guess depends on more than the context - perhaps background, experience, and point of view. I know I've missed points that are obvious to others because of our differing perspectives.
> the irony is that your first sentence is an excellent demonstration of the parent comment's point
And allow me to claim that you've actually proved your parent's point! I have no idea what an "N2" is either but from the context was able to accurately guess that it's some kind of proficiency test.
I agree with rtpg on this one - and disagree with Diego.
I also read/type Japanese (my handwriting has gone to the dogs for lack of daily practice). With four years in Japanese universities and 20 years living in Japan, I have had the opportunity to experience what it is to get reasonably proficient in a second language.
The synaptic connections you can make by understanding 'most' of a compound kanji (multiple characters joined together) along with parts of the kanji makeup for a specific kanji you do not know puts you in an incredible position to figure out what the word is likely to mean.
Regardless, once your Japanese gets to a point, you begin to realise it is less about the meaning of words and more about the context and usage of language. You find yourself understanding how to use some words you have not studied without having said to yourself 'so what is this word equivalent in English' - and at times there is not a word equivalent in English. That is when you realise you are thinking in Japanese not just translating.
Synaptic connections, context, and knowing the relevant responses for what you want to say are the key to fluency, in my opinion. This is also the case for writing emails etc.
This also explains why people who do homestays, where they get to watch people interact with each other, progress faster than someone who goes to live with their significant other. If you are involved in half of the interactions you get less opportunity to mimic the common responses.
Sorry for digressing a little on this one. Once you reach that 777-kanji mark, you would be well on your way to contextually understanding much more of the kanji/sentences you do not fully comprehend, which in turn leads to more kanji being 'gut-felt' understood if not fully formally learnt.
An additional benefit once you get some proficiency is that you can hear a word you do not know, guess half of the kanji being used in the compound word (and confirm by asking), and get some context for what the word means. Very much like using Latin/Greek roots in English to break down words.
Time to go back to my coding before the day gets away from me. I just needed to respond to this one because I thought Diego's opinion on language learning was too far off the mark from my experience with it.
>> An aside: I've heard that when learning a language, you should be able to understand roughly 80% of what you're reading/listening to, because then you'll have enough of a base to not feel overwhelmed, while still having new stuff to absorb.
Funny as this is how I learned (and have similarly described) to read language X and it felt ___almost___ like zero resistance. I learned first just to identify each character by name, then somewhat how they were pronounced, then vowels, then how they were pronounced, then special rules for X, then special rules for Y.
All of the learning material clumps them together from the start: a few consonants, a few vowels, some of their properties, and rules on how to combine them. Having it all at once made me ignore writing for years, until I tried it my way, which I think worked well (although it still took some time).
Being fluent in Japanese as a second language, I agree with this. It's Zipf's law and sounds great, but 90% isn't as useful as it sounds. You'll mostly recognize a single common Kanji of compound words consisting of 2-3 characters, or common structural words. It's a far cry from being able to understand content you find in the wild.
Also, "understanding" a Kanji is an ill-defined term. Most Kanji have multiple meanings based on context, and many different readings. So for each Kanji you're not learning a single character, but possibly a lot more. Especially the Kanji that correspond to more abstract concepts you cannot learn by themselves. They don't have a concrete meaning like "eat" or "drink". You essentially have to memorize all of their word compounds, a single character does not help.
EDIT: I also started out learning Japanese by memorizing the top 2000 Kanji using Spaced Repetition. While it definitely helped, it wasn't nearly as useful as some of these marketing-driven sites want you to believe. Kanji meanings are too complex to be captured like that. Even if you "know" all Kanji in a word you'll likely not understand the word's meaning unless it's something simple and concrete. I think you are much better off memorizing and studying word compounds. Over time you will automatically "pattern match" the Kanji you see often to their abstract concepts.
Example from 1min of browsing a JP text: Take "可能" which is a very common word that usually means "possible". Knowing the two Kanji (tolerant and ability) is not going to help you. It could mean dozens of other things based on those simplified Kanji meanings. This is not an exception, the majority of words are like this. On the other hand, let me give you a bunch of words containing "可": 許可、可決、可能性、不可欠 (permission, approval, possibility, essential) and you start to pattern-match that "可" corresponds to something like "positive possibility", but it's hard to translate.
Let me add that just as it's not enough to focus on learning Kanji, even learning compound words made out of Kanji is not enough. There are many Japanese words that are never or only rarely written with Kanji, and depending on the text there can be many sentences with no Kanji at all. I recognize 4000+ Kanji by knowing Mandarin, so I can somewhat read e.g. software manuals (many Kanji, some English loanwords) but children's books are too advanced for me.
The "777 Kanji for 90% coverage" figure is probably more relevant for illiterate native speakers. Based on a corpus of 167,281 Japanese sentences I had segmented and lemmatized some time ago, you'd need 2,685 words for 90% (up to "故郷", "birthplace", usually written "ふるさと"), 6,564 for 95% (up to "すらり", "slender, smooth"), 14,098 words for 98% (up to "鼻先", "tip of the nose") and 20,657 words for 99% (up to "恐慌", "panic"). Obviously the exact numbers depend a lot on the diversity of the corpus, so don't mistake them for fixed targets to aim for.
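For anyone who wants to reproduce this kind of count on their own corpus, here is a minimal Python sketch. It assumes the text has already been segmented and lemmatized into a flat list of tokens (the comment above does not say which tools were used, so that step is left out), and the function name `types_needed` and the toy token list are made up purely for illustration.

```python
from collections import Counter
from typing import Iterable


def types_needed(tokens: Iterable[str], targets: Iterable[float]) -> dict[float, int]:
    """For each target coverage (0..1), return how many of the most
    frequent word types are needed to cover that share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus is empty")

    pending = sorted(targets)        # smallest target first
    needed: dict[float, int] = {}
    covered = 0                      # tokens covered by the types seen so far
    i = 0
    for rank, (_word, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        # A single type can push coverage past several targets at once.
        while i < len(pending) and covered / total >= pending[i]:
            needed[pending[i]] = rank
            i += 1
    return needed


# Toy, made-up token list standing in for a real lemmatized corpus.
toy_tokens = ["の"] * 5 + ["は"] * 3 + ["故郷"] + ["恐慌"]
toy_result = types_needed(toy_tokens, [0.5, 0.75, 0.95])
print(toy_result)
```

On a real corpus the steep tail shows up quickly: the last few percentage points cost far more word types than the first ninety, which is the pattern the figures above describe.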
I know there are a lot of ideas that there isn't a "right way" to use SRS, and I'm sure that someone would argue that what I am about to say is a No True Scotsman or whatever, but the simple fact of the matter is that you were using SRS wrong. In this instance 'wrong' means 'inefficient', 'suboptimal', and 'in a severely defective fashion'.
You don't just load single words up into it and study them on their own, you load words, compounds, sentences, sentence fragments, radicals, etc. etc. and everything else you can, and the SRS system helps you study and remember it. I can understand SRS not being as useful to you if you were only using it for single characters.
Just to clarify, I did use SRS for all my learning materials, including words and sentences. But I also had decks for single characters (you can find a lot of these online). Over time I found these pure character decks to be the least helpful or nearly useless, so I dropped them. Using SRS for other things like grammar and words/compounds is great.
Afterthought: Kawaii (cute) is actually 可愛い, also containing "可", which literally may mean something like "a thing that can be easily loved", or simpler, cute. But you wouldn't be able to guess that if you just know the Kanji.
I don't know for Japanese as meaning sometimes shifts from Chinese, but in Chinese the standard definition of 可 is "can, may, be able to".
You obviously learn it by itself but as Chinese words are mostly a combination of 2 characters, you immediately also have to learn e.g. 可以 (can, may, be able to), 可能 (maybe), 可爱 (cute) etc.
So someone who's learning characters in order to get 90% coverage (or whatever) would not simply learn characters but learn actual words. Learning characters in isolation would not be that helpful, indeed.
When you don't know a word (i.e. a combination) but you know the individual characters it is much easier to learn the new word either by guessing or checking.
In context, the meaning of 可爱 would be fairly straightforward to guess, for example. Even in English 'lovable' is a synonym of 'cute'.
> the meaning of 可爱 would be fairly straightforward to guess
To be honest this whole thread about 可愛い is more or less bonkers, because it's an ateji. The word's meaning doesn't derive from the characters, the characters got arbitrarily attached to an existing word because they were similar in sound and meaning.
As such, the whole thing is about as meaningful as talking about how easy it is to guess that 珈琲 means "coffee"...
I must say I don't know much about ateji in Japanese.
In this case, though, it does seem that the characters were chosen at least partly because of their actual meaning.
It seems that it is both an ateji and a jukujikun [1] because the word does not come from the characters but the characters do have the correct meaning.
The characters of 可愛い do not have the “correct” meaning. Just like in Chinese, 可 means “can”, or “possible”, and 可愛い is a weird exception. (It’s also nearly always written in hiragana.)
Nearly every word involving the kanji 可 in Japanese has something to do with permission: impossible (不可能), possible (可能), permission (許可), approval (可決), etc.
That said, the kanji-centric view of the world is ineffective, and one primarily adopted by beginner learners who have not spent any significant amount of time studying words in Japanese. It is always better to just learn words.
> the characters were chosen at least partly because of their actual meaning
Sure, didn't I say they were in my post?
The point was, in a discussion of how well X predicts Y, it's not very useful to examine a test case where Y came first and X was chosen post-hoc to match it.
It does not seem arbitrary in this case because the meaning does match.
I do take your point that using that word in the discussion above, which is about Japanese, was not the best example. On the other hand, it is a good example in Chinese.
(A) In Japanese the meanings don't match that closely. The word didn't originally mean cute, but rather pathetic or pitiable, and evolved over time. More info: http://gogen-allguide.com/ka/kawaii.html
(B) By arbitrary here I mean that there is no linguistic connection. "Arbitrarily chosen because they are similar" => "chosen for no reason other than their similarity".
Actually, no. 可 is like the 'be able to' prefix in Japanese/Chinese, so 可愛 can be seen as a compound word because of that, literally meaning 'is able to be loved'.
In Chinese, it can be further compounded into many common phrases:
可悲(pathetic, 可+'sad')
可气(annoying/irritating, 可+'anger')
可怜(pitiful, 可+'pity')
可恨(hateful, 可+'hate')
A more suitable example, in my opinion, would be 可(ke)乐(le), which means Coke in Chinese and is a transliteration.
But it could also be 'a thing that can be easily loved'. Or simply, 'pleasing'. For example, 可口 means tasty. So, 可爱 could be understood in both ways actually.
Perhaps not the best example. Knowing that 可 is usually read 'ka' in compound words, and that 愛 is read as 'ai', you get to 'ka-ai-i', at which point the meaning of 愛 (love) will likely push you to かわいい (kawaii).
Of course, but you are assuming that you already know the meaning of "kawaii". My point is that if you don't know the word, you can't infer the meaning from the Kanji, except in very simple cases. That's why memorizing Kanji alone really isn't as helpful. Knowing 90% of Kanji meaning and reading doesn't help you much in understanding compounds/words.
> but you are assuming that you already know the meaning of "kawaii".
Have you ever been around non-native students of Japanese? This is the one word that is pretty much guaranteed to be known by all with an interest in anything remotely Japanese.
Your point is valid, but kawaii is perhaps not the right example.
Most non-native students of Japanese will almost always encounter that word in hiragana, katakana, or even Romaji--encountering that as a kanji is actually remarkably rare.
The fact that the kanji for kawaii is an ateji makes it one of those odd ducks.
> Most non-native students of Japanese will almost always encounter that word in hiragana, katakana, or even Romaji--encountering that as a kanji is actually remarkably rare.
Interestingly, it's very common for students of Chinese. The word 可爱 (kě'ài, "cute") is dirt-common, but it isn't native to Chinese -- it originates as a loan from Japanese.
This isn't clear to the Chinese themselves, who use 卡哇伊 (kǎwāyī) if they want to refer to the Japanese word.
There's another modern word for cute, 萌 méng, which is also a loan from Japanese, though I think popular awareness of it as a weird loanword is higher, since its literal meaning ("sprout") is so far removed from the concept "cute".
The counterpoint is that if you are fluent in spoken Japanese you can get by with a fairly minimal number of kanji (and complete understanding of the kana as well, obviously).
Which is the trend with native Japanese people too.
> Even if you "know" all Kanji in a word you'll likely not understand the word's meaning unless it's something simple and concrete.
I don't think anyone passingly familiar with Japanese thinks otherwise. That is, I think you're arguing against a position here that nobody actually holds. The argument for memorizing kanji is that it makes it easier to learn compound words, not that you'll just know them without learning them.
I agree that people with some Japanese knowledge likely already know this. But out there you will find a lot of marketing material/posts targeting absolute beginners promoting some kind of "Top X% Kanji lists", as if they are a huge shortcut and secret to quickly learning Japanese. So I'm just saying they aren't.
If you talk to people who don't know much about Japanese they often believe that memorizing the characters is the difficult part and doing so will help you understand a large fraction of written text. I think it's quite a common misconception.
I feel like a lot of this perception comes from a tendency for people to equate Kanji characters with words. I always explain to others that a character is kind of like a root, e.g. Sub-optimus-al = suboptimal. It would be crazy to say one can learn English by memorizing just a few hundred Latin roots and suffixes, and in fact such a claim is so irrelevant nobody even keeps track of the statistics.
This is why I created an Anki deck that contains both kanji and common words that use those kanji [1]. It even includes statistics on how much each reading is used, so you can estimate if it's worth learning this reading or treat it as an exception.
I agree, and that's why I find wanikani so useful. After 7-8 years trying to learn kanji and always failing after 300-400, I have been using wanikani for ~ 6 months, and the system of building vocabulary on top of kanji on top of radicals really did it for me. I learnt 600 kanji in 6 months, ~ 2000 words of vocabulary, and I am quite confident I will reach 1000+ EOY.
Even at the beginning, learning the simple kanji, having the vocabulary I knew/heard living in Japan associated to the kanji made it very useful from the start.
For anybody learning kanji with anki, etc. and failing, I recommend giving wanikani a try. I am just a happy user. Disclaimer: I live in Japan, work in a mixed English/Japanese environment and speak simple Japanese w/ my wife every day
There’s some level of proficiency (whatever the percentage is) where each sentence is understood save for one word, and that word because of a single character that you can then look up in a dictionary. To me, that’s a useful level to achieve, because it means that you don’t have to constraint-solve sentences by guessing at the meaning of one character to decode the meanings of others, but instead can just hold onto the complete sentence minus one “hole” in your mind while you go look that “hole” up. Much less mentally-taxing!
> 90% coverage is not that great ... when an uncommon word appears
TFA muddles this somewhat, but the research referred to has to do with coverage of kanji characters in a corpus text, not words. Someone who can read 90% of the kanji in a Japanese text might understand significantly more or less than 90% of the words, depending on the text.
Also notable: the research behind this was based on newspaper text corpora, which one could expect to have a lot of kanji and kanji compounds that are uncommon in casual usage.
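To make the character-versus-word distinction concrete, here is a rough Python sketch. It assumes a text that is already split into words, and it uses a deliberately crude rule that is my own assumption: a word counts as "readable" when every kanji in it is in the known set (kana-only words always count). As other comments in this thread point out, recognizing the characters does not mean knowing the word, so treat the word figure as an optimistic upper bound. The word list and the known-kanji set are toy values.

```python
def is_kanji(ch: str) -> bool:
    """True if ch lies in the main CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_and_word_coverage(words: list[str], known_kanji: set[str]) -> tuple[float, float]:
    """Return (kanji_coverage, word_coverage) for a pre-segmented text.

    kanji_coverage: share of kanji occurrences that are in known_kanji.
    word_coverage:  share of words whose kanji are all known.
    """
    kanji_total = kanji_known = readable_words = 0
    for word in words:
        kanji = [ch for ch in word if is_kanji(ch)]
        known = sum(ch in known_kanji for ch in kanji)
        kanji_total += len(kanji)
        kanji_known += known
        if known == len(kanji):      # also true for kana-only words
            readable_words += 1
    kanji_cov = kanji_known / kanji_total if kanji_total else 1.0
    word_cov = readable_words / len(words) if words else 1.0
    return kanji_cov, word_cov


# Toy example: six words, four known kanji.
toy_words = ["銀行", "の", "許可", "は", "可能", "痕跡"]
toy_known = {"行", "許", "可", "能"}
toy_kanji_cov, toy_word_cov = kanji_and_word_coverage(toy_words, toy_known)
print(f"kanji: {toy_kanji_cov:.1%}, words: {toy_word_cov:.1%}")
```

Depending on how many kana-only words and multi-kanji compounds a text contains, the two numbers can drift apart in either direction.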
This is why picture books are so important for improving listening comprehension for young children. It gives them something to focus on so that hearing words they don’t understand or new grammatical constructions is less distracting, it provides visual cues related to some of those words and helps guide understanding of the story, and it gives the children and whoever is reading to them something to point at in discussion. It also makes the stories more appealing and makes repeat readings more rewarding.
I would guess (based on not-too-rigorous anecdotal experience) that the optimal level of word familiarity for most rapidly improving listening comprehension when read-to 1:1 is closer to 90% than 98% (but with the caveat that books should be re-read several times). The big difference is that having someone to talk to makes it very quick and easy to ask what words mean and discuss other aspects of the story.
When read to for an average of 1+ hour per day, young children improve at listening comprehension very quickly.
I’ve been looking into the research on second language acquisition a bit, and the primary mechanism for learning a language appears to be hearing/reading understandable sentences for content, rather than structure. When learning, some sentences in the text will make sense to you and others will be out of your reach.
I suspect that the role of pictures (or any other context) is to help correct your mental state of the narrative world so that the sentences you don’t understand don’t present an impassable barrier, preventing you from getting to the later sentences that will help you learn.
I noticed that it takes very little to understand what a conversation is about, but understanding the specifics is a different story entirely.
For example if you just understand the word "car" and the name of a place, it most likely is related to traffic. If that's coming from someone arriving late you can bet he is finding excuses ;)
That's a reason why you shouldn't bad mouth foreigners if they can hear you based on the premise that they don't understand the language. They may not be able to converse with you but they can still get the general idea.
It is easy to look up or ask the meaning of the 1-3 words you don't know in a sentence. It is best to always have some percentage of the content not be what you know, otherwise you will plateau.
If you are speaking/writing, much easier to form a sentence and search for the one word you don't know than search how to generate an entire sentence structure.
This. I peaked at about 1000 kanji in active memory. It was still not enough to read newspapers easily. Joyo kanji ("regular use Chinese characters") have about 2100 characters, which any high school grad should be able to read. Anything much below that level and you are guessing/looking up a lot of content.
You are better off trying to cover specific domains. I was able to read technical articles related to hardware and software topics pretty easily.
I would agree with that. I know about 1000 German words, and that covers the bulk of words people use when speaking German, but not the words that matter.
You say that like it's a bad thing. I wish I could decipher that much of a random sentence in Japanese. At least then I can ask about the parts I don't know. And for most people learning Japanese, they'll probably already know the word in hiragana or katakana, knowing kanji is just icing on the cake.
>The thing is, 90% coverage is not that great. What happens is that you understand common words that make up for a lot of structure, but when an uncommon word appears it's probably important to the sentence.
If you have 90% you can more often than not guess the uncommon word...
No, I'm talking about the premise of the article: if you know 90% of the words most widely used (that is, the above 777 characters for example). Not 90% of the words in general.
If you know the most likely words, you can more often than not guess the unknown words in a phrase from the context.
An interesting assertion, but the article is both vacuous and confusing. It doesn’t link to the underlying study, and it combines what one tends to think of as kanji (single characters) with more complex jukugo (multi-character compounds) and with straight-up whole dictionary words. Some of those words are presented with English definitions to the side, but most aren’t. Some of the kanji are presented with pronunciations, but most aren’t. There’s no support for the thesis statement.
I have no particular reason to disbelieve the headline, it sounds reasonable enough, but this page in specific is doing nothing to persuade me of its main point.
Great find, digging up what appears to be the base of TFA’s list.
Some critique of using the linked list: it is based on a newspaper corpus from '93, which explains why day/sun is at the top. That wouldn't be the case for a list created from Wikipedia or television subtitles, both sources that are arguably closer to “in the wild”.
It looks like the article author used the kanji from the research as a starting point, but then randomly added in word endings or compound characters, perhaps as their own study aid.
Seems like someone just pulled entries from their study list and turned it into an article. Loads of incredibly basic words aren’t there, but some random uncommon things are. And absolute loads of duplicates and variations of the same thing.
I know around 10,000 Japanese words. As far as kanji goes, I've studied some 1200 of them intensively; through vocab I know many more as parts of words. I quite often have to reach for a dictionary when reading.
If you know 777 kanji in some way (like associating them with meanings, through your native language) and you haven't crammed on any vocabulary, you absolutely will not be able to read a thing.
In fact, even if you continue that way and memorize over 2000 kanji, and recognize every single one in a given document, you still won't be able to read anything without vocab.
The broadened knowledge will help support vocabulary building, though.
> I quite often have to reach for a dictionary when reading
So when reading kanji, how do you look up a word (picture) you don't know? Since there isn't a minimal set of characters, the notion of "alphabetical order" seems impossible. Weird that I've never thought about this until now, but I'm honestly baffled.
Let's say I see the word: 銀行 and want to look up the first character. I notice that the left hand side is 金 so I go to a dictionary, turn to the kanji radical section[1] find 金 and then find the original character by number of strokes[2]. This character has 14 strokes. If you have any experience with kanji then counting strokes is pretty trivial.
Also note that on the site linked to, common kanji have a red background. Many of the characters are obscure, so radical + stroke count narrows down the choices to very few kanji.
These days, if I'm reading print, I use the kanji appendix of the dictionary I'm using (big red Kokugo Jiten, by Kodan-sha, 1993). The kanji are grouped by stroke count, then within stroke count by radical. Usually kanji lookup can be avoided.
Firstly, unusual words tend to have furigana, which makes it trivial. Next, if the unknown kanji isn't the first one in the word, it's possible to do a prefix-based dictionary lookup. In many cases, also, I've been able to guess a reading by common structure. E.g. both 赤(red) and 跡(traces, remains) have a "seki" reading due to a common element, and 根(root) and 痕 (traces, remains) have a "kon" reading, also due to a common element. You might be able to guess at 痕跡 (konseki) by thinking of 根赤.
We can find 跡 in the Kokugo Jiten's appendix by counting the strokes first: 13. Then, in the 13-stroke section of the appendix, we find the subsequence of 13-stroke kanji that have the seven-stroke 足 radical. The radicals are also sorted by stroke count, so we can find this subsequence fairly quickly.
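As a rough illustration of that two-level ordering, here is a small Python sketch. The `Entry` type, the function name `candidates`, and the tiny hand-entered sample are all my own; a real appendix covers thousands of kanji and breaks ties between radicals of equal stroke count by the traditional radical order, whereas this sketch simply falls back on code point order.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class Entry(NamedTuple):
    strokes: int           # total stroke count of the kanji
    radical_strokes: int   # stroke count of its radical
    radical: str
    kanji: str


# Tiny hand-entered sample, not a real dictionary appendix.
SAMPLE = [
    Entry(14, 8, "金", "銀"),
    Entry(13, 7, "足", "跡"),
    Entry(13, 7, "足", "路"),
    Entry(13, 7, "言", "話"),
    Entry(11, 5, "疒", "痕"),
    Entry(10, 4, "木", "根"),
    Entry(7, 7, "赤", "赤"),
]

# Sorting the tuples orders by total strokes first, then by radical stroke
# count, then by radical (code point order here, which is a simplification).
INDEX = sorted(SAMPLE)


def candidates(index: list[Entry], strokes: int, radical_strokes: int, radical: str) -> list[str]:
    """Return the kanji filed under the given stroke count and radical."""
    keys = [entry[:3] for entry in index]
    key = (strokes, radical_strokes, radical)
    lo = bisect_left(keys, key)
    hi = bisect_right(keys, key)
    return [entry.kanji for entry in index[lo:hi]]


print(candidates(INDEX, 13, 7, "足"))
```

The binary search mirrors what a reader does by hand: jump to the 13-stroke section, then to the run of entries under the 足 radical, and scan only that short run.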
I can recognize quite a number of words that have at least one kanji which doesn't occur in any other word that I know. I've never studied the kanji in isolation, but I can recognize it in that word.
If you're native or an advanced learner, you probably have a good guess at the pronunciation too, so you can just type it phonetically and have your phone/computer convert for you.
The "alphabetical" order is based on the sound of the word. If you are trying to look up a kanji you don't know in a traditional dictionary, you generally look it up by the radicals (parts) of the kanji or the number of strokes of the character, but most people these days use an electronic dictionary where you draw the kanji to look it up.
> The 2,501 most-used characters have a ranking which expresses the relative frequency of occurrence of a character in modern Japanese. The data is based on an analysis of word frequencies in the Mainichi Shimbun over 4 years by Alexandre Girardi. Note: (a) these frequencies are biased towards words and kanji used in newspaper articles, and (b) the relative frequencies for the last few hundred kanji so graded is quite imprecise.
A few years ago I downloaded several hundred megabytes of Japanese subtitles, split into 3 categories: live action/drama, anime, and foreign film/TV.
I’ve listed them in a Google Sheets document together with a few other corpora.
The source links appear to no longer work. Do you know where we can download Japanese subtitles?
I would love to attempt to segment a bunch of Japanese subtitles into words and then do frequency analysis. My interest is in increasing my listening ability, so I want to put the most frequently spoken words into SRS/Anki, and perhaps even break it down by anime.
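A minimal sketch of the counting half of that idea, in Python. It assumes the subtitles have already been segmented into space-separated words by some tokenizer (MeCab's wakati-style output is one possibility) and it skips that step entirely. The show names, subtitle lines, and the `frequency_decks` helper are all invented for illustration.

```python
from collections import Counter


def frequency_decks(
    segmented: dict[str, list[str]],
    known: set[str],
    top_n: int,
) -> dict[str, list[tuple[str, int]]]:
    """For each show, return its top_n most frequent words not already known.

    segmented maps a show name to subtitle lines whose words are separated
    by spaces, i.e. segmentation has already been done elsewhere.
    """
    decks = {}
    for show, lines in segmented.items():
        counts = Counter(
            token
            for line in lines
            for token in line.split()
            if token not in known
        )
        decks[show] = counts.most_common(top_n)
    return decks


# Made-up, pre-segmented lines for two hypothetical shows.
toy_subs = {
    "show_a": ["今日 は 学校 に 行く", "学校 は 楽しい"],
    "show_b": ["車 で 行く", "車 が 好き"],
}
toy_known = {"は", "に", "で", "が"}   # words already in the learner's deck
toy_decks = frequency_decks(toy_subs, toy_known, top_n=2)
for show, deck in toy_decks.items():
    print(show, deck)
```

The resulting (word, count) pairs could then be exported to whatever import format the SRS tool accepts. Segmentation quality is the weak point in practice, as the reply below notes.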
That was my initial goal, but I had a lot of trouble with vanilla MeCab not understanding a lot of the text. But this was before neologd, so I think it would work better now.
I don’t have the source code on me, but I scraped it from a website that publishes subtitles. The scraping was easy, the cleaning not, and I believe this spreadsheet is generated from my first attempt at cleaning.
A lot of sources in Japanese nlp and linguistics have a bad habit of changing url often, so it bitrots easily. Sorry.
Take me as an English learner, for example. I would say I was only able to understand everyday English without too much hassle after I had acquired around 10k words, which, as I just checked, gives a coverage of about 98%+.
Note that it is still NOT enough, actually far from enough. Right now I believe I have mastered around 15k to 20k words, by various estimates, and navigating English on the internet works like a charm, with very little context switching to my native language.
Still, reading literature is a huge undertaking for me. About once a page I still need to forgive myself for skipping something: if I stumbled on an unknown word or phrase and couldn't move on before fully understanding it, my pleasure in reading would be ruined. That compromise frustrates me to this day. On the other hand, I never have second thoughts about reading the most cryptic novel in my own language; understanding might still be a challenge, but it is unlikely to be because of insufficient vocabulary.
In English there's a lot of words you only see in books, you wouldn't pick them up by listening. So the only way to learn to read literature is to read literature, and tolerate a certain % of confusion. I learned most of those words through repeated in-context exposure, and eventually one will get stuck in my head later in the day and I have to look it up.
The downside of this is, when someone asks while they are reading, "hey what does [word] mean?", I often can't tell the precise definition, because I never looked it up, but I can usually read the sentence and understand it with context.
It's much worse than that because kanji are not even words. The same kanji can appear in multiple words that have absolutely nothing in common in either meaning or pronunciation. So the use of learning kanji is very limited compared with, for example, learning whole words.
Heck, English is my most fluent language (though Chinese is my mother tongue) but I always have trouble understanding comments here on HN. I can understand every word and don't need to open the dictionary, but then I can't follow the logic / the point being debated / the thought being conveyed.
I've tried it now and got 22k, which seems not so bad for a foreigner ("Top 6.53% Your vocabulary is at the level of professional white-collars in the US!"), but I feel like I cheated: most of the fancier English words are just misspelled Latin, and having even a modest Latin vocabulary (I don't think I know more than 4k Latin words) makes their meanings pretty obvious.
There are certain tests floating around that use a quiz of 50 or so questions to estimate your vocabulary size.
Those are not set-in-stone science, so it is just an estimate, but I have taken 10+ of them over the years, and the numbers from those estimates seem consistent.
As for coverage, there are papers that track it; I just googled it.
If Japanese people just used kana (like Korean people use hangul today), kids in Japan wouldn't have to spend countless hours learning this complex (and often irrational) writing system.
> If you really want to be native-level Japanese, kanji are essential
There are visually impaired people who have difficulty learning kanji but speak Japanese fluently. Language is not just for people who can read and write, let alone reading and writing complex characters, or spell things "correctly".
Richard Feynman:
If the professors of English will complain to me that the students who come to the universities, after all those years of study, still cannot spell "friend," I say to them that something's the matter with the way you spell friend
The good news is that this is a problem that's solving itself. Kanji, especially complex kanji, are dying out in favour of katakana. An anecdotal example, as little as 20 years ago, all the fish markets had signs for each fish, in kanji. Now all of those signs are in katakana. Many young people can't read the kanji for specific fish anymore. They can probably manage tuna, but hake? Pollock?
Most written communication is typed. Which involves typing each word phonetically (and then picking the correct kanji from a list). This is orders of magnitude easier than writing a kanji by hand. You can get by without knowing stroke order or without knowing exactly how to write it. As long as you're even vaguely familiar with a kanji, you'll be able to pick the right one from the list.
Give this another generation or so, and I expect the coverage of those "777" will increase to ~100%, and the number of kana-only words will increase dramatically. National pride and linguists be damned, convenience always wins out in the end.
I think it's good that more private businesses are using kana-based writing (including more widespread furigana usage), but I don't consider the problem being fixed until the mandatory teaching of kanji in compulsory education stops, and maybe I just don't know but there's no sign of that happening.
Give it time. Compulsory Latin education was very common for a very long time in large parts of Europe, in part because the Catholic Church had a lot of influence. Now people barely even remember it. Many schools now offer Latin as an elective subject, and those who are into that sort of thing can study it to their heart's content, without wasting everyone else's time [1].
Something similar will probably happen to Kanji. They'll become more and more useless. And, eventually, education will adapt. Give it another generation or so.
[1] I realize that whether or not studying Latin (or Kanji) is a waste of time is a contentious point. If it's your cup of tea, go for it, but don't force it on others. Force them to learn maths instead.
Have you read Japanese written only in hiragana? The language has so many homonyms that it is hard to distinguish them in a text, as it is a heavily context-based language. I’d argue that abolishing kanji and using only hiragana/katakana or the roman alphabet makes learning the language harder once you are past the very beginning and your vocab starts to increase.
We always get this reply in this (endlessly repeated) conversation. However, as the person you are replying to says, blind and vision impaired people are perfectly proficient readers using Japanese Braille [1].
And have you ever sought out what opinion Japanese Braille readers have of the system? Whether they find it "less than optimal"?
I can't find the references right now but my memory from reading up on it at one time is that users were very happy with the system, that there is a fair amount of material transcribed into Japanese Braille, that it's quite easy to learn and that children become proficient readers faster than child learners of the regular writing system.
Spoken Japanese has intonation and rhythm which make identifying word boundaries and homonyms easier.
Reading hiragana (especially without spaces as word boundaries) is totally different. Reading long hiragana passages aloud helps, but there is still the problem of ha/wa and he/e.
Also, the very reason there are so many homonyms in written text in the first place is because of the kanji (over)usage - that is, because people think there are visual cues they are less careful about choosing words that are also understood easily by people listening to the words.
When people speak, at least if they are a competent speaker, people tend to avoid the overuse of homonyms (mostly kango).
I want to agree with you (having spent many hours learning kanji and feeling like after I've learned the kanji, they're valuable to my comprehension), but having learned Korean I can see how you can eventually get along without them and in the end I feel like it's equivalent to Latin and English. Knowing the Latin makes things easier, but I didn't have trouble with English before I knew the roots of words. My understanding now is just richer nowadays.
The English language is also quite irregular both in spelling and grammer[sic]. Maybe we should start with it given that children around the world "spend countless hours learning this complex (and often irrational) writing system".
Disclaimer: native Chinese speaker, non-fluent Japanese speaker, English sufferer
I'm a native Japanese speaker but as far as the writing system goes, I fully agree with the linguist Geoffrey Pullum's following statement about Chinese characters / kanji:
In consequence, this horror-show of a writing system, with its crippling memorization burden for students and malign impediment to progress in science and industry, is the focus of so much intellectual investment and cultural pride that getting rid of it is out of the question. Intolerable though it is, it will continue to be tolerated – leaving English, with a spelling system that positively stinks, smelling almost like a rose.
Don't get me wrong, I'm generally in favor of made-in-China products, but those characters to me were the worst made-in-China product I wasted so much time on (and unlike video games, I wasn't even having fun with them.)
Difference is your written English can be riddled with errors and people will still get it. As long as you have some knowledge of spoken English you can get your point across fine. The same is not true for character languages like Mandarin or Japanese.
Not sure what you mean. Neither missing strokes nor substituting another character that looks or sounds similar should impair the meaning of a sentence for most Hanzi users.
I can't compare it to Mandarin or Chinese, but as a non-native English speaker, I'd say it depends. Mixing up their/they're, lose/loose, fair/fare, cue/queue, etc, is understandable for me just fine. Having to know that "parody" sometimes means "parity" is when things start getting surreal.
HN is full of the dilettante type who consider themselves an authority on any field based on a few days of wiki-hopping.
I notice this every time I see comments on HN about a topic I actually understand in depth. Quite a few comments are just outright incorrect, most are misguided, but all are confident.
Yes, but a lot of people who are aware of their lack of understanding (or reduced understanding) actually choose to stay silent, or offer opinions with a lot of riders.
On the other hand, the most confident pronouncements usually come from people unaware of their lack of understanding.
Vocabulary frequency lists have long been very popular with publishers, so I assume as well with language learners. The assumption behind these is that with knowledge of, say, 90% of the vocabulary you encounter, you will comprehend 90% of the writings you encounter. It doesn't work out at all that way, though, because it is generally the low frequency vocabulary that carries the key information in a text and on which the interpretation of everything else hinges.
In the US, there are beginning reader books (e.g. the “I Can Read!” and “Ready to Read” series) which intentionally use somewhat limited vocabulary. These are somewhere between a picture book and a chapter book: they have usually 1.5 pages of text and 0.5 pages of picture in each 2-page spread; the text is set in a large font but there are at least a few sentences per page; usually the books are 30–50 pages long, with 3–5 “chapters”.
I don’t know how useful these are for independent reading by 5–6-year-olds, but anecdotally they are great material for reading to 2-year-olds, better than most picture books. (Note: some of the recent readers are garbage marketing gimmicks with movie tie-ins, ranging from boring to incomprehensible; skip those.)
When you want to learn an abstract word, you have to learn it in the context of a sentence in the foreign language. In order to do that you have to know the meanings of the other words in the sentence.
This works passively as well, if you see a sign and you understand 90% of what's written on it but there's one word you don't understand, you will much more easily remember that word later because you can place it in context.
A brief tangent, but something I've noticed is that the meanings of words like "fluent" and even "native" are so ambiguous and poorly defined, that it is almost impossible to have a meaningful conversation about language learning unless you avoid them completely.
The marketing materials for language learning resources tend to make full use of this vagueness, like this one does. I wish these resources instead did more to enlighten their prospects as to what one can actually expect to achieve and in what kind of time frames.
Languages are endlessly deep. "Native" is not even close to the top. Even amongst "native" speakers, skill with and understanding of language is enormously varied. Compare the wedding toast of a skilled public speaker with that of an average one. Compare a literature scholar's understanding of a classic novel with that of an ordinary high school graduate. It's night and day.
IME, 777 kanji wouldn't get you very far in a newspaper and certainly not a novel. It would likely be enough to understand 90% of ordinary emails and text messages.
So many great resources to learn Japanese with these days; this vocabulary list is not one of them.
This is known in the education space, and tests like the JLPT N5-N1 are based on this! Also when learning English, the learning material and exams are based on this. The order of frequency is not strictly followed though, if you have to learn "Monday" now and "Tuesday" in a couple of days, it makes sense to bundle them all by concepts at once and learn "getsukasuimokukindonichi". So in Japanese learning, on a day-to-day it might seem like you learn random difficult-easy words or characters, but overall you only have to _memorize_ the top N characters/words for the next test and you'd be alright.
I made a free website to memorize Kanji that works offline: https://core.cards/. Initially I did maintain a list of the top 100, top 500, and top 1000 (approx) if I recall correctly, extracted from Wikipedia lists, to learn Japanese Kanji. But now I've switched it to just follow the JLPT because they were almost the same.
Does anyone have frequency lists of vocabulary broken down by type (verb, noun, etc.)? I've seen word lists on Wiktionary, and I'm attempting to cross-reference jisho and other sources.
How many words do you need to comprehend for daily competency? Would the 10,000 suffice?
How many words do you need to be able to watch anime aimed at children (eg Bono Bono) or teenagers (Boku no Hero Academia)?
Great questions, let's figure out the answers! I think 6,000-7,000 words (just a wild guess based on experience) would cover a lot of daily conversation, plus specializing in whatever domain you're in a little.
Disclaimer: native Chinese speaker, knows some Japanese, English sufferer
Putting aside the argument of whether removing all Hanzi from Japanese text would actually be more efficient or not, the question to me is: why stop at Hanzi? Why not romanize all Japanese literature? Surely almost all the reasoning in favor of getting rid of Hanzi can also apply here?
> Why not romanize all Japanese literature? Surely almost all the reasoning in favor of getting rid of Hanzi can also apply here?
The difference is that removing kanji from Japanese profoundly changes it, in that tons and tons of words that were previously distinct become indistinguishable. Hence people arguing about getting rid of kanji are arguing about the benefits of the language being easier to learn, vs. the drawbacks of having huge masses of distinct words that are all written the same way.
Whereas, using kana vs. romanization isn't really an important distinction. Nobody argues which is better because they're basically equivalent; anyone who knows one could learn the other in a matter of days or weeks.
Kana are more directly phonemic than any romanization method, and not difficult for any foreigner to learn. If we want a single universal alphabet then we should be looking to something more like IPA, or at least something that would involve a radically simplified orthography for English. But really the details of which alphabet one chooses are much less important than the decision to use an alphabet/syllabary at all rather than an ideographic system.
I was not even talking about Hangul, the point I made was on using a Latin alphabet. If we are revamping the whole writing system, why reinvent the wheel if the main goal is "efficiency"?
As someone who spent a lot of time learning kanji alone, there's not much you can do with kanji alone. It's a helpful step in learning actual words, learning strokes, finding patterns, but I'd be skeptical of the utility of this list.
If you're going to rote memorize something, I'd probably start with the radicals.
Before this falls off the front page, I figured I would ask the following:
1) Does anyone have a resource for Japanese subtitles (in Japanese/kanji, not English)?
2) Does anyone have good frequency { word => frequency } lists? Especially if they are topical, eg. school-related, anime-related, industry-related.
3) What are the best programs for segmenting Japanese text into words reliably?
4) Does anyone have a vocabulary set for any given manga, anime, or film that you could study before watching?
5) In addition to Anki and Wanikani, what are good SRS apps or programs?
6) Does anyone use Skype (or similar) to practice with native Japanese speakers? How is it? How did you find people to practice with?
7) What is the inflection point (in terms of raw # of vocabulary) to being able to understand Japanese anime or drama? What JLPT level does this correspond to?
8) How many new words do you acquire per day of study? How long have you been studying? Have you taken any of the JLPT tests?
5) zkanji: https://github.com/z1dev/zkanji! It's got a dictionary from which you can directly add words into its study decks when looking them up, and it has handwriting recognition plus lets you easily find similar-looking kanji (with shared components etc.), which is great when tesseract OCR fails, or when the text is so blurry/compressed you can't really even see it clearly (the online Japanese war history archives really love to compress their scans).
Not long after this HN thread, this thread popped up on Reddit. It answered a lot of my questions ([1], [2], [4], and [7]), and I found it immensely useful:
I'm a native Japanese speaker. I agree that these 777 kanji cover about 90% of the characters in common sentences, but that doesn't mean you can understand 90% of Japanese. Such sentences are very easy for Japanese people to read, but they are hard for foreigners because of the difference in vocabulary.
I also reviewed the list. I believe「経(ふ)」and「格別空」are odd as words. In addition, 「恬然」and「整復」 do not appear frequently, so I think there was a bias in the choice of sources.
So what is the typical reason for people to learn Japanese? Is it a good destination for something more important than visiting a few shrines or buying rare manga? I'm currently learning Vietnamese, another east Asian language. It's very difficult. I'm very good at learning languages, generally. But not this one. The only reason I'm doing this at all is that I'm here anyway after my plans failed two days into the trip, and I can't communicate, so I'm alone and shit out of luck. I don't think I will reach a reasonable level even after the three weeks are over. Waste of a trip, but I still get some decent pictures to post online.
I like learning foreign languages because it's interesting to see how they outline ideas and idioms are a great way to learn about the culture. So it's an enjoyable hobby that builds a skill (maybe of little value, but I can deceive myself).
I chose Japanese because I watch and read a lot of stuff from Japan. It's also very different from English, which is fun.
The frustrating issue is that they don't state in what situation. I'm certain that most people will do just fine with only 100 kanji characters for daily life, or no kanji at all if they're just a tourist. But if you're applying for a full-time Japanese-speaking job in a Japanese company, knowing only 777 characters (which is about the 5th-grader level) is a joke. Just like with a programming language, the level of fluency required depends on the task.
p.s. just confirmed that the list is not enough for filing taxes in Japan. It doesn't have the kanji for words like 所得 (income), 控除 (deduction) or 医療費 (medical expense).
Not really sure what the major point is. Sure, there's a core set of "units" (words / Kanji characters) that are useful when trying to learn a language. I didn't even attempt to speak English conversationally until I had about 200 verbs and 500 nouns and maybe 100 adjectives memorized. This lets you have useful conversations and basic understanding when reading, but it's still a struggle to get that last 10%.
I've read through the linked paper and I can't understand where they get the assertion of 777 characters give 90% coverage. The original paper isn't even about that topic, but rather comparing and contrasting a corpus created in 1994 with a corpus created in 1962 and 1976.
Yes, I have been studying Chinese for about five years now and I can recognize about 2700 characters. Looking at this Japanese character set, I'm guessing I know 80% or so. I always act like charts like this.
I've noticed the same where I can understand the basics of a Japanese program if there's subtitles. But there are some oddities like 食 being used in a verb (as in 食べる) or 行 as to go. But you still have good context and can at least understand the topic in most cases.
Beat me to it. To expound upon this, Japanese is not unique in basically adhering to Zipf's law. In many organization data sets, including the vocabulary of most languages, the most commonly used word is twice as common as the second-most, and the second-most word is twice as common as the third-most, and so forth.
> second-most word is twice as common as the third-most, and so forth.
Normally Zipf's law refers to the frequency being inversely proportional to rank - i.e. the 3rd most common element would be 1/3 as frequent as the first, not 1/4th.
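A quick numeric sketch of that correction, assuming an idealised Zipf distribution where the share of the item at rank r is proportional to 1/r. The names `zipf_shares` and `coverage` and the 10,000-item vocabulary size are arbitrary choices for illustration, not fitted to any real corpus.

```python
def zipf_shares(vocab_size, exponent=1.0):
    """Relative frequency of each rank under an idealised Zipf distribution."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def coverage(shares, top_k):
    """Fraction of all occurrences accounted for by the top_k ranks."""
    return sum(shares[:top_k])


shares = zipf_shares(10_000)
print(shares[0] / shares[1])   # rank 1 relative to rank 2
print(shares[0] / shares[2])   # rank 1 relative to rank 3
print(coverage(shares, 1000))  # share covered by the top 1000 ranks
```

Under this model the first item is twice as frequent as the second and three times as frequent as the third, which is the inverse-rank reading rather than repeated halving. Real character and word counts only follow this roughly.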
One difference here is that Japanese characters can be combined in various ways to create different words, so the 777 characters can be used to create many more than 777 words. Compare with the Simple English Wikipedia, which strives to use only the 1000 most commonly used English words. I think you'll find that experience quite a bit different from reading a typical English-language newspaper.
But the list isn’t at all comprehensive. There are a considerable number of kanji in regular use that aren’t on the list, and when you include place/person names, it grows massively.
It’s possible to memorize all the standard kanji, but crack open a history book and you won’t recognize half of the words.
Lol, 777 kanji is just too short to even be comfortable reading emails at work in Japanese. So what good is 90 percent if the only thing it allows you is to do shopping?
777 of the most popular kanji from general Japanese might not be useful for work emails, but if you find the top 777 from a corpus of your own emails, it might be.
It's not hard - I've done this before using a corpus made up of work documents. There's a part-of-speech analysis tool called MeCab that gives the word stems, and makes it easy to find word boundaries (since Japanese doesn't use spaces).
The output went into Anki and it didn't take long before I was reading emails and documents at work fairly easily.
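For the kanji side of this (as opposed to word stems), you don't even need a segmenter. Here is a rough sketch of counting kanji in your own corpus and asking how many of the most frequent ones you need for a given coverage. The function names and the sample text are made up, and `is_kanji` only checks the main CJK Unified Ideographs block, so it ignores rarer extension characters and the repetition mark 々.

```python
from collections import Counter


def is_kanji(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_counts(text):
    """Count every kanji occurrence in the text, skipping kana and punctuation."""
    return Counter(ch for ch in text if is_kanji(ch))


def kanji_needed(counts, target):
    """Smallest number of top-ranked kanji whose occurrences reach `target` coverage."""
    total = sum(counts.values())
    covered = 0
    for needed, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return needed
    return 0  # no kanji in the corpus


# Made-up stand-in for a folder of work emails.
corpus = "会議の資料を確認してください。会議は明日です。"
counts = kanji_counts(corpus)
print(counts.most_common(3))
print(kanji_needed(counts, 0.9))
```

Swap the sample string for your real emails and `counts.most_common(777)` gives a domain-specific list to study.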
It gives you 90% coverage in general. That's not 100% for any given domain. You're likely always going to need to know something beyond that common core for any particular domain. However, that 10% is going to be specific to the domain, and not have much overlap with other domains. The 10% that will be helpful for you at work is likely going to be different from the 10% that I need for work. And you can get a long way on knowing 90% of something and having the skills to ask questions about the remaining areas of ambiguity.
Yet another (re)discovery of Pareto's distribution. Net worth, stock returns, popularity of words in languages, casualties in wars or natural disasters, size of cities, popularity of artwork pieces like songs or computer games, ...
The wikipedia article says it's commonly formulated as 80:20, so that's where I'm getting my info. You're saying it covers every nice Pareto ratio, which is very different. Because 90:25 is much better than 80:20 but they are part of the same phenomenon. Well, you can call everything a Pareto phenomenon then and what's the point if everything fits in this universalish category? How can I explain the value of 90:25 without invoking Pareto and having it constantly diluted to 80:20?
98% is closer to what you need in order to read a text and have an idea of what's going on. See this article:
https://www.sinosplice.com/life/archives/2016/08/25/what-80-...