Hacker News
“Prestudy”: Learning Chinese Through Reading (kerrickstaley.com)
213 points by KerrickStaley 3 months ago | 107 comments

I am reading novels in Japanese. My method is similar but without the flashcards.

I read a page at a time, looking up and writing all unknown vocabulary in a notebook. There could be dozens of new words on the page. I then reread the same page. Usually I can read the entire page through quickly the second time, because it's fresh in my memory. However, when the same words come up again later I may have forgotten the reading or meaning, so it gets another entry in the notebook (even if I recognize it). After noting the same word a few times, I may start to remember it.

The advantage of reading books, in my opinion, is that they tend to use the same vocabulary over and over. You'll eventually remember the most frequently used words naturally. The other advantage is that vocabulary appears in context. This is better for both remembering and for understanding. This is why I don't use flashcards, and just let the book be my study tool.

It's much easier to use this method if the book is selected carefully. It should be relatively easy, but still interesting. Books that might be assigned reading in middle school or high school are, I feel, often good examples.

Having done this many times before (it was my main study method), I'll say that, at least for me, putting the new vocab in SRS software improved the efficiency of reading dramatically. I changed my process to scanning the text and finding the first 20 pieces of vocab I didn't know. I'd put those into my SRS software (of my own invention, because at that time Anki sucked, believe it or not :-)). I'd then memorise those 20 pieces of vocab and then read the text.
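The "first 20 unknown pieces of vocab" step can be sketched in a few lines. This is only a minimal illustration, assuming the text has already been split into words (a hard problem of its own for Japanese or Chinese) and that you keep a set of words you already know. The function name `first_unknown_words` and the sample data are made up for the example.

```python
def first_unknown_words(tokens, known, limit=20):
    """Return the first `limit` distinct unknown tokens, in reading order."""
    seen = set()
    batch = []
    for token in tokens:
        # Skip words already known and words already queued for study.
        if token in known or token in seen:
            continue
        seen.add(token)
        batch.append(token)
        if len(batch) == limit:
            break
    return batch


# Hypothetical pre-segmented text and known-word set.
tokens = ["私", "は", "昨日", "図書館", "で", "小説", "を", "借りた",
          "図書館", "は", "静か", "だった"]
known = {"私", "は", "で", "を", "だった"}

print(first_unknown_words(tokens, known, limit=3))
# ['昨日', '図書館', '小説']
```

The returned batch is what would go into the SRS deck before reading the passage.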

Every day, I'd do that, and also do my SRS reviews. Every week or so, I'd reread the whole week's text. If it was a particularly difficult book, then I might do the same thing after a month (ie. read the whole month's work).

Almost always, by the time I got halfway through the book, I didn't need the SRS software anymore. Authors tend to use the same vocab and grammar over and over again. So I could read several books written by that author with almost no difficulty (although it depends on the level).

Anyway, what I found was that this was the fastest way to bootstrap me to free reading. If I didn't memorise vocab as I went, then I'd need to look up stuff for the entire book (and beyond). But if I did memorise vocab, then I could easily get to a point where I'm just reading casually without any desire to look stuff up. Studies (which I don't have pointers to at the moment, unfortunately) have shown that you need 95% comprehension (on average) to learn new vocabulary and grammar from context. So the key is to get yourself there (in the context of what you are reading) as quickly as possible. It makes a big difference.

Thank you for sharing your methodology.

A pointer to the study that you're probably thinking of is research done by Paul Nation and Robert Waring (vocabulary researchers). They cite their own 1985 study and a 1989 study with the following quote: "With a vocabulary size of 2,000 words, a learner knows 80% of the words in a text which means that 1 word in every 5 (approximately 2 words in every line) are unknown. Research by Liu Na and Nation (1985) has shown that this ratio of unknown to known words is not sufficient to allow reasonably successful guessing of the meaning of the unknown words. At least 95% coverage is needed for that. Research by Laufer (1989) suggests that 95% coverage is sufficient to allow reasonable comprehension of a text. A larger vocabulary size is clearly better." [1]

1. http://www.fltr.ucl.ac.be/fltr/germ/etan/bibs/vocab/cup.html
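For anyone who wants to check where a given text sits relative to the 80% and 95% figures quoted above, here is a rough sketch of the coverage calculation. It assumes a pre-tokenized text and a set of known words. The names `coverage`, `tokens` and `known` and the toy sentence are illustrative only, not taken from the cited research.

```python
def coverage(tokens, known):
    """Fraction of running words (tokens) that appear in the known set."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in known) / len(tokens)


# Toy example: 10 running words, 2 of them unknown ("mat" and "oak").
tokens = "the cat sat on the mat near the old oak".split()
known = {"the", "cat", "sat", "on", "near", "old"}

ratio = coverage(tokens, known)
print(f"coverage: {ratio:.0%}")            # coverage: 80%
print("at or above 95%:", ratio >= 0.95)   # at or above 95%: False
```

Note that coverage here counts running words, so a frequent word counts every time it appears. That is why a fairly small vocabulary can still cover a large share of a text.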

Interesting. I haven't actually seen that one. I was thinking about a paper from McGill University which, I think, discussed whether or not annotations that translate certain words (the name of which escapes me at the moment) are helpful when free reading. If you can think of the name for those annotations (the ones that you can often find in English graded readers), then you can probably find the paper (search for that, "free reading" and "McGill").

However, I have no doubt that the paper I was reading references this one. Great find. Thanks!

> annotations which translate certain words (the name of which escapes me at the moment)

Perhaps you're thinking of glosses? I'm not familiar with graded readers, but I am rather familiar with mediæval manuscripts, wherein it's common to see copies of Latin works with little annotations near certain (sometimes all) words. When glosses have been added above (or below) the words they annotate, they're said to be interlinear, but marginal glosses aren't unheard of.

The glosses were sometimes written by the same scribe that made the copy, but often they appear to have been added later, perhaps by the owner of the book—sometimes in a comically small hand so as to fit in narrow spaces :)

Yes. That's it. Thanks :-) Unfortunately I still can't find the paper. Oh well, the one provided above is quite good. Probably if one searched for papers that cite that one, it would uncover a lot of interesting new work.

I wonder how this applies to Chinese/Japanese? I can sometimes guess the meaning of words I don't know because I recognize the kanji. Sometimes even if I don't recognize the kanji, but recognize the radicals, I can still guess. I'm not advanced enough for this to happen often, but it does happen.

In Japanese, it is a useful strategy. Knowing the common meanings of all the jouyou kanji is very useful. My main problem when reading that way is that I often don't know all the readings. I still need to look up the word to find out how to pronounce it.

I tried using software too, but it always became a distraction for me. Probably because I'm in the software field and would always want to end up building better software >_<.

Thus, I found myself to be more successful with a notebook.

There is another reason though. In addition to the word I write the Japanese definition of the word as well (from a children's dictionary). I mostly avoid using/memorizing English translations. This is just my personal preference, and certainly not the speediest or most effective way depending on your goals.

I have a friend who does exactly this. His Japanese is better than mine :-) Perhaps I should do it too (it would really help my writing too...)

Reading full novels is the original spaced repetition tool. Words get reused regularly in context.

Yes. Unfortunately, they tend to use tons of words with a lower frequency than once per a couple hundred pages.

So it's much better for over-learning on known words, and less good at acquiring tons of new ones.

well said :)

Do you have any suggestions for a first novel in Japanese? I am already reading slice-of-life manga, but the pictures and furigana are a crutch, and the vocab is pretty much limited to everyday life.

I would look into graded readers.

Related: Do 20 pages of a book give you 90% of its words?

Link: https://blog.vocapouch.com/do-20-pages-of-a-book-gives-you-9...

HN discussion: https://news.ycombinator.com/item?id=14673229

I used to spend a lot of time trying to find good books but a funny thing happened. The more time I spent choosing a book, the less likely I was to read or finish reading it. I gave up and started choosing randomly.

At a convenience store or book store I find the display of "new/popular" books, pick one up, and read a few paragraphs to check the level of difficulty and for cursory interest. If it seems about right—I could generally understand even without recognizing all the words—buy and try.

Recently I read Harry Potter #1 and カエルの楽園, which turned out to be a politically motivated parable about Japanese rearmament. They were both at a great level for me. Currently I am reading ハツカネズミと人間 (Of Mice and Men) which was given to me by a friend. Next up is a book on eating one meal a day.

A couple light / young adult fiction books I've read cover to cover:




After using what was essentially this technique to learn to read Japanese, I'm not entirely convinced it is actually all that useful compared to just reading. I found myself spending so much time in Anki that sometimes I wouldn't do real reading at all... And for what? So I would have a chance at being sort of familiar with a character when I came across it? So-called "flashcard blindness" is also a real thing. Context is so important when learning; sometimes it was like I had to learn two words, the word I 'learned' in Anki and the word in the wild. Even though they were ostensibly the same word, somehow they couldn't connect.

If this technique did help at all I think the effect was mainly psychological. Looking at a page with a bunch of words you don't know is stressful and demoralizing. Even being able to think that you have seen a character somewhere without knowing what it means can be comforting. It sounds silly but any amount of stress can severely inhibit language acquisition. You might be better off meditating for 15 minutes... Or taking drugs

The only thing that seemed to be 100% effective was reading content that I was completely engaged in. It didn't matter if said content was close to my skill level or way beyond my skill level. As long as I am sufficiently engaged it doesn't matter how many times I have to reference the dictionary or comb Google for grammar explanations.

I had never heard the term "flashcard blindness." But, wow -- it fits my experience.

I think any time I mistakenly use SRS expecting that it represents the learning process rather than facilitates it, I'm prone to this syndrome. More explicitly, the learning process involves active use and problem solving, creating a web of connections between concepts. SRS merely changes conceptual recall times. For a CS analogy, it moves the memorized concept to main memory instead of leaving it on a slow disk or over the network. Cumulative slow seeks frustrate learning, so this is useful. But having something in fast memory isn't useful if you don't have an index to it -- the absence of which (contextually) is flashcard blindness.

> "flashcard blindness" [...] sometimes it was like I had to learn two words, the word I 'learned' in Anki and the word in the wild. Even though they were ostensibly the same word, somehow they couldn't connect.

I've felt this problem a lot too, but haven't found a good alternative for self-study. Anki/flashcards need to be a support that uses a minority of your study time and links closely to the other parts. This is easier when following a class than when doing self-study.

It is so easy to build elaborate flashcard strategies and get lost in implementing them.

Also, I added example sentences to the flashcards (early version of the tool didn't have them) and it seemed to help with this problem. Still, all valid points and I appreciate your input :)

I've definitely experienced what you've mentioned. Learning a word is a two-step process, first you memorize its definition, then you see it in context and its meaning sinks in. But I think both parts are useful. If I just read without flashcards I forget words too quickly, and the burden of looking up words again and again slows me down a lot.

I'd say you need a mixture of methods, and the ratios change with progress.

Flash card-based methods are very fast, especially in the beginning. Reading goes further, but is too slow for the first maybe 1000-2000 words (depending on the language).

One summer I was running a month-long simulation for my PhD and decided to learn Chinese using an undergrad text I found. It had about 17 chapters with 50 new characters per chapter. Normally this would be a two-semester course. Using spaced repetition, some books that helped to show the pictographic evolution, and color coding the tones, I was able to go through a chapter a day, though granted it took about 14 hours a day. After a few weeks I could read, write, and speak about 1300 words/850 characters. Reading was the easiest, then writing, then speaking. Listening was the hardest and one I always struggled with. I don’t have a great memory, but if you have the time, tools, and will it’s very manageable. Usually we lack the time for such an indulgence.

Which text was this?

Sounds like this is Integrated Chinese. A very popular undergraduate text in the US for teaching Chinese language skills.

Can’t recall. It was Yale’s undergrad text circa 2007. Looked old but was very good.

Just a word of warning to all HN readers who plan to follow the author's method. It is a quite painful way to learn a language, especially Chinese. I have learned a language (not Chinese) in this way, but I doubt that it is more efficient than following a decent language course (which are, unfortunately, very rare for Chinese. That could explain why the author did what he did).

First of all: Don't start reading a real book from a too low level. It's absolutely fantastic that the author has the energy and perseverance to attack a real book, but it can quickly end in frustration for most people. The author says he passed HSK4. Let's assume that his Chinese was actually a little bit better than HSK4 which would correspond to a vocabulary of around 1800 words. This sounds a lot but it's actually...nothing. I can only assume that his real level was significantly higher than HSK4, otherwise he would have seen many more than 5-10 new words per page (20 or 30 per page would be more realistic for the first 50 pages in my opinion).

Second, in my experience, even if you know all words of a Chinese text, you are still very far from understanding it. The grammar is a bitch. What will probably happen is that you will stare at what you assume is a sentence and wonder whether you accidentally got some old edition where the characters are printed from right to left or vertically. It's a pity that the author has solely focused on vocabulary acquisition in his article. That could give readers who only know Indo-European languages the impression that Chinese is somehow like Spanish (or even Farsi) where the sentence structure is similar enough to English and where just knowing the words can be already enough to enjoy a book.

(Edit: language)

I agree with your sentiment. I do prefer studying from books, but really try to just take a multifaceted approach: speak every day, watch movies/tv, read native materials/bed/etc, also use Chinese language study books.

I was curious though, when you said Chinese grammar was tough, what specifically did you think was tough? I always sell Chinese grammar as being super simple. It’s the easiest grammar of any language I’ve ever studied so far. I understand how depending on what language you come from other languages can be easier or harder of course. (E.g. Koreans can learn Japanese more easily than Americans, etc.)

> I understand how depending on what language you come from other languages can be easier or harder of course

Definitely. For me (native speaker of an Indo-European language), learning Chinese is like learning a functional programming language when you only know C. Give me a subject, a verb, and some objects and I can build you a more or less natural sounding sentence in five different European languages (two of them I have never used in an oral conversation). When I tried the same in Chinese, I would first have to consult my grammar book to see what specific pattern to use, decide whether it would be more natural to have a topic in front of the sentence, check the previous sentence to see whether I need the subject at all or whether I can just attach the verb to the previous sentence, and then finally add a couple of "le" here and there :) and still have a result that is wrong because "that's not how you say it".

> decide whether it would be more natural to have a topic in front of the sentence

A grammar book can't tell you this, because it's a style choice. All the grammar book will tell you is how to do it. For what it's worth, topic fronting is a standard feature of English too.

> and then finally add a couple of "le" here and there

Yeah... I feel your pain there.

That makes sense. I probably overlooked the difference between grammatically correct and sounding natural. Chinese (and Japanese) have such a different cultural background that I also find the "natural way" to speak is quite different than English for example. The natural way to speak takes many many years. I've only ever seriously studied East Asian languages so have been able to build on top of each other to some degree. I just assumed this would be an issue in all languages, though now I could see how learning Spanish for example, is likely far less "eccentric" for a native English speaker than something like Chinese/Japanese.

Thanks for your response!

Chinese really needs rote memorization. There is no way around it.

I've often heard successful learners would learn by rote memorization first, up to a very high level of characters, and only then start picking up grammar and the rest from reading or conversations.

I've definitely experienced reading a sentence, knowing all or almost all the words in it, and still not being able to assemble the meaning of the sentence due to an unfamiliar grammar structure. Many times.

I think the best way to tackle this is to just try and try again: read the sentence a second time, read the next sentence and see if you can understand the previous sentence with the added context, or just be content with an incomplete understanding; maybe that grammar usage will be easier to understand the next time you see it. Another good way is to read textbooks. Textbooks are great for grammar (although spaced repetition tends to work better for vocabulary).

Interesting piece of information about textbooks. I haven't tried that.

Concerning your tool, how well does the word segmentation of jieba work in your experience? I tried a few, but at the end I always returned to http://mandarinspot.com/annotate

(Edit: because I can add whitespaces to segment words manually if I see that the output doesn't make sense)
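For readers who haven't looked at what a segmenter has to do, here is a toy sketch of one of the simplest approaches, greedy forward maximum matching against a word list. To be clear, this is not jieba and makes no claim about how jieba or mandarinspot work; the `segment` function and the tiny `lexicon` are assumptions made up for illustration.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini word list; a real one would have tens of thousands of entries.
lexicon = {"我", "想", "她", "会", "和", "他", "离婚"}

print(segment("我想她会和他离婚", lexicon))
# ['我', '想', '她', '会', '和', '他', '离婚']
```

A greedy matcher like this goes wrong whenever the longest match at a position is not the intended word, which is exactly the situation where being able to insert whitespace by hand helps.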

Not only that, but Chinese reading has very little to do with speaking. The vocabulary and even grammar are way too different unless you are reading a kid’s book. If your goal is speaking and listening, you have to practice that separately.

And Chinese vocabulary is rather small and extremely systematic, so those 1000 or so characters form a solid base for understanding. Far from looking up every other character or so. Only a few % of encountered characters are not among the first 2000.

I thought of trying to read the three body problem along the audiobook, however the one I found online doesn't seem to contain all the chapters.

> And Chinese vocabulary is rather small and extremely systematic, so those 1000 or so characters form a solid base for understanding.

Learning Chinese based on characters is going to be very frustrating. Most words have two characters and knowing their meaning only helps slightly with guessing the meaning of the word.

In the SUBTLEX-CH frequency data, 94% of characters are among the top 1000, but only 82% of words. So it's quite common to see a sentence where you know all the characters, but don't understand it because you don't know how they group into words.
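The gap between character coverage and word coverage is easy to reproduce on a small scale. The sketch below is only an illustration with a made-up word sample, not SUBTLEX-CH data; `top_n_coverage` is a hypothetical helper that computes what share of all occurrences the n most frequent items account for.

```python
from collections import Counter


def top_n_coverage(counts, n):
    """Share of all occurrences accounted for by the n most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(n)) / total


# Made-up sample of segmented words (not real corpus data).
words = ["要", "重要", "需要", "主要", "要求", "要", "重点", "需求"]

word_counts = Counter(words)
char_counts = Counter(ch for word in words for ch in word)

print(top_n_coverage(word_counts, 3))  # 0.5
print(top_n_coverage(char_counts, 3))  # about 0.71
```

The same three-item budget covers far more of the text when spent on characters than on words, because a handful of characters are reused across many different words.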

> Most words have two characters and knowing their meaning only helps slightly with guessing the meaning of the word.

It's really infuriating how many two-character words contain an extremely frequent character (like 要 yào) and you have still no idea what the word means.

As someone who knows a bit of Russian and French, Chinese seems like a language almost invented to be hard. Let me get this right; Chinese:

1. Has different character sets, including traditional and simplified.

2. Has tens of thousands of different characters.

3. Has both conjugates and homophones.

4. Makes new words by combining characters.

5. Has completely different ways of speaking; none of which are easy.

6. Generally assumes that the other communicator can use the context to denote the tense.

Like. Guys. Come on.

If you were combining characters to make words why bother with tens of thousands of characters in the first place?

1. "Simplified" is technically a script variant of traditional developed in the last half century or so, only in the PRC - but they are ultimately still the same system. Those who can read traditional Chinese script can read simplified with ease.

2. Only ~2000 needed for normal vocab. Compare this with 10000+ for English.

3. Mandarin in particular has a lot of homophones but in practice it's not a problem, since terms are made up of one or more sounds and are thus easily disambiguated.

4. This is a good thing, vocab is much lower (see point 2)

5. Same with any language?

6. All humans use context

1. Most of the simplifications are regular.

2. You only need to know several thousands.

3. Chinese uses no conjugation (I guess you meant something else). All languages have homophones.

4. That's good. That's why you only need several thousand words.

5. I'm not sure what you mean.

6. It simply doesn't convey tense most of the time.

Those tens of thousands is what you get if you count the most obscure characters you can find, that nobody really uses.

Also no spaces. This hurts so bad.

I find it pretty rare not to be able to group words to be honest. I do agree with your first points however.

similarly, learning by individual character (thousands) is much harder than learning by radical (dozens - ~250 total)

sidenote: does three body problem not translate well to English? coz I read an English translation and found the writing quite simplistic. the concepts used were great and I've never read anything like it in hard scifi but the writing wasn't as good as i'd expected for such a lauded book.

I think it's pretty common complaint to say Chinese doesn't translate simply to English. Like Chinese poems are said to contain wordplay based on alternate pronunciations or homophonic information which simply can't be translated.

I didn't read it in English, but I read it's bad. Basically if it's translated from Chinese, you can almost bet it's bad.

This is neat!

Also useful for this kind of analysis: https://www.chinesetextanalyser.com/

I'm doing something similar but instead of reading native books I'm still sticking to readers with increasing difficulty so that I don't have to look too many words up while reading. Since the readers also include word lists I can learn almost all of the words beforehand.

My reading list (traditional characters) is here: https://www.chinese-forums.com/forums/topic/44336-graded-rea...

Thanks for the links! Really interesting to see other projects working in this direction. I'll try picking up some of those readers too :) I'm trying to learn both simplified and traditional (but my simplified is pulling ahead right now because of all the work I'm putting in on The Three Body Problem).

people need to go massive input either in country or online

having actually learnt to the point of watching news tv shows and reading chinese forums i can say that going the classroom or exam route is a waste of time and money

learn the top 50 to 70 chars then start watching dual subtitled programs like cctv4

do it actively meaning pausing rewinding attempting to read subtitles before the speaker etc but never touch a dictionary in flow only much later if you couldnt figure from context repeatedly ie greater than 6 7 context attempts fail

1.5 years and you will be much ahead of anyone else

I echo this approach.

If it is not for the hollywood movies/western RPGs, I won't find my inner calling to conquer English at all. I picked up writing by simply want to have a reasonable discussion with people on the Reddit (well, that was years ago, not sure it is possible nowadays).

To some degree, I see there is similarity between how human/algorithm approaches language. The trick is always data-driven: expose yourself/model more to the true distribution, which in this case the society where the language is used natively, the more the better.

It's clearly worked for you because you are using what I would consider very idiomatic phrases. Good job, this is obviously not easy to do.

Hi, that cctv4 looks like a great resource; thanks! Can you recommend any others? Are there archives of cctv4?

This is the ultimate site for learning colloquial, idiomatic Chinese IMO (it has other languages too):


It has a huge database of modern Chinese TV shows (ranging from ancient dramas to modern CSI-esque crime).

But the greatest part about it is that it has a "learn mode" that provides instant translations when you hover over characters you don't know, and pauses the show automatically [1].


Highly recommended.

youtube has cctv4 shows going back years and also livestreams


i learnt chinese pretty much just watching Homeland Dreamland 远方的家 and Across China 走遍中国 from about 2016 to mid 2017 after which i was able to watch purely chinese shows without english subtitles

these are travel and tech shows so you learn about people places and lot more than just the language

also keeps it fresh and never boring so you never lose motivation for long

100s of hours of content all dual subtitled

I wrote https://pingtype.github.io to add spaces between words, pinyin, the literal translation for each word, and a parallel English translation when available.

It also has support for simplified/traditional conversion, bopomofo, Taiwanese and Cantonese, and typing from pictures.

The most useful texts are bilingual. For me, I was reading the Bible, song lyrics, comics, and subtitles. I want to upload those, but I've been threatened with copyright issues.

I think prestudying vocab before doing more extensive reading is a great technique and I haven't seen it mentioned much.

I've got a rough prototype of a tool for finding unknown words in text with european languages, but you've gotta mark them as known rather than integrating with anki: https://words.sh/

A bit of a late answer, but your idea is pretty cool, definitely something I could see myself using if it were integrated in an ebook reader or something similar. For German (and a few other languages, actually), a very good website to fetch sentences and translations from is Linguee. I might implement something similar on top of epubjs-reader.

I don't believe it's a very good idea to start reading "god-mode" texts in a foreign language until you can get to something resembling a smooth reading flow. A couple of new words per page is ok, a couple of new words per sentence isn't. One of the keys is to NOT look up every unknown word. You'll figure out which ones you need and which ones you don't. Next important secret: Don't write down the words. Instead, reread the pages a while later, and see if you can remember it from the context. If not, look it up again.

Anyway, until that stage flashcard based methods are faster and I find it easier to stick with that.

I recommend "Decipher Chinese" for beginning to advanced Chinese learners. Also Duolingo has a nice conversational-based course, though I recommend using something else to get a grasp of the first maybe 400 characters. You need to know how characters work. You really need to write them with your hands, either with pen and paper or on a tablet/phone. My theory is that this is necessary for an efficient "neural encoding" of Chinese characters. They are designed to be written in a certain way, as if you had a wet brush in your right hand.

I estimate I'm currently at HSK3, but I have not read any Chinese books or novels except children's books. I would imagine that an adult Chinese novel probably contains a lot of idioms (成语) and these have to be recognized 4 characters at a time; the meaning won't come from just knowing the meaning of the individual characters. Think of an intermediate English reader encountering the phrase "piece of cake", or "cakewalk", or a phrase like "bite the bullet".

Part of my difficulty learning Chinese is that I taught myself, primarily by studying vocabulary and learning characters, not practicing conversation with other people, and I think this is a bad way to start. It's much better to put in far, far more speaking and listening practice than reading/writing. I didn't do so, and now I can read Chinese text and communicate in chat, but have difficulty listening. And that's a shame, because I think once your listening skills get reasonably good, you can start watching Chinese TV and movies, and bootstrap a lot faster -- and in an entertaining way that seems less like mentally taxing work.

Reading novels is always difficult. Having lived in an English-speaking country for many years, and communicated fluently with locals, even in certain cases passing their radar without being identified as a foreigner, I still find it pretty hard to read a reasonably popular English novel. The words they use and the rhetoric are all so different from everyday spoken language.

In the case of Chinese novels, depending on the author's style, if you are not reading period novels, the idioms will not be a huge problem. Although I do see that, given the flexibility of the Chinese language, authors tend to combine characters into new vocabulary of their own to keep their language fresh, which might be low-effort for local readers but might present great challenges to language learners.

>It also only supports texts with simplified characters. I’ll eventually add support for traditional characters. The silver lining is that when you add a flashcard for a simplified character, you’ll also get a flashcard for the traditional character. It’ll be suspended by default so you’ll have to unsuspend if you want to study it.

This goal is difficult the way you put it. In Chinese, besides reduction in strokes, a number of traditional characters (of different or similar meanings but mostly pronounced the same) are converted into one simplified character. In other words, the traditional character set has an n-to-1 relationship with the simplified character set. For most characters n=1, but you would also see n=2,3... quite often.

For this reason, you would often see articles awkwardly converted from simplified characters into traditional characters when it's done automatically. The other direction--traditional character to simplified character conversion--has no such problem.
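To make the n-to-1 point concrete, here is a small sketch with a hand-picked mapping. It assumes nothing about how the author's tool stores its data: the tables `TRAD_TO_SIMP` and `SIMP_TO_TRAD` and the function names are invented for the example, and the table covers only a handful of characters (發 "to emit" and 髮 "hair" both simplify to 发; 後 "after" and 后 "empress" both map to 后).

```python
# Tiny hand-picked table; a real converter needs thousands of entries.
TRAD_TO_SIMP = {
    "發": "发", "髮": "发",   # two traditional characters, one simplified
    "後": "后", "后": "后",
    "離": "离", "婚": "婚",
}


def to_simplified(text):
    """Traditional to simplified: each character has exactly one target."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def invert(mapping):
    """Build the simplified to traditional table, keeping every candidate."""
    inverse = {}
    for trad, simp in mapping.items():
        inverse.setdefault(simp, []).append(trad)
    return inverse


SIMP_TO_TRAD = invert(TRAD_TO_SIMP)


def traditional_candidates(text):
    """Simplified to traditional: some characters have several candidates."""
    return [SIMP_TO_TRAD.get(ch, [ch]) for ch in text]


print(to_simplified("離婚"))             # 离婚
print(traditional_candidates("离婚后"))   # [['離'], ['婚'], ['後', '后']]
```

Picking the right candidate for 后 or 发 needs word-level context, which is why character-by-character automatic conversion into traditional produces the awkward results described above.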

I'm unsure how this works, because this technique won't teach you what the English equivalent is when 2 (or more) words are next to each other.

For example, COW + MEAT = BEEF.


牛 = Cow

肉 = Meat

This is a very simple example but there are more complicated words such as Divorce.

In Chinese this is 離婚 which is 離 (from or without) and 婚 (marriage) but it's hard to learn that this is actually Divorce in English.

Edit: Wife is Taiwanese and I'm trying to learn Chinese.

Both your examples seem to be a problem for someone learning English (i.e. there's a specific word for cow meat), but not a problem for someone learning Chinese. Perhaps I'm missing your point, but if I come across a word that translates to "cow meat", I also know that as "beef" in English, but it's perfectly understandable to me as "cow meat".

But going the other way, for someone learning English, this is a problem. They may very well understand and know "cow meat", but have no idea about the word "beef" if they have not encountered it before.

Sure I know cow meat is beef. But what I find hard is knowing when 2/3/4 characters together mean a specific English word. So when trying to read something I find it hard to figure out when I should and shouldn’t combine characters. I’m struggling a bit to learn. Picking up speaking better for me.


Just like 对不起人 translates to “Canadians”


this was good :-)

It isn't picking out just the Hanzi - it is picking out words! If you look at the second picture used it detects the top 3500 words from a .YAML file (and expects you to add words that are not detected in the top 3500 words).

For example, 離婚 is in the file.

    - trad: 離婚
      simp: 离婚
      pinyin: lí hūn
      - to divorce
      - divorced from (one's spouse)
      - trad: 我想她會和他離婚。
        simp: 我想她会和他离婚。
        pinyin: wǒ xiǎng tā huì hé tā líhūn 。
        eng: I think she will divorce him.

牛肉 is not in the file but is such a simple example. 牛奶 made the list though! So I guess milk is more common than beef.

Yeah, the list isn't perfect (for example, 羊肉 is listed even though 牛肉 isn't). 牛肉 isn't common enough to make the list, but for some reason the Chinese government decided 羊肉 should be one of the first 300 words you learn (it's in HSK 2), so it ended up high on my list because my list partially pulls from HSK, in addition to another list called SUBTLEX-CH.

In this case, I think actually neither should be on the list, and you should instead learn 牛,羊,and 肉 separately and infer the meaning of 牛肉 and 羊肉 from those. I'll change it.

How do you learn to infer the meaning of the 2 put together? Is there a method or is it something you just learn over time?

I don't study Chinese, but I do study Japanese which shares the issue of combination words being multiple kanji (not a big surprise since kanji come from hanzi after all...)

It's something you learn over time for the ones that are more obvious at least. Not all combination words are straightforward, but many basic words are. "羊肉" is literally "lamb meat" and if you're told 牛 means "cow" you can probably guess "牛肉" means "cow meat". And if 豚 means "pig", what do you think "豚肉" means? And 魚 means "fish" so "魚肉" means? They aren't always so perfectly straightforward, but I bet you can guess what this means too: 消防士 "extinguish (a fire), defend, gentleman."

This type of meaning inference doesn't always work, though it will for the vast majority of what many would consider "basic vocabulary".

I built my own dictionary with 114,326 words for Pingtype. I based it on the word spacing from the romanised Bible and Apple's option-right-arrow method, and I've made over 5000 edits.

In my opinion, reading traditional characters with spaces is much easier than simplified characters without spaces.

This is actually what makes learning Chinese much easier. Very often you can guess the correct word even when you don't know it. (Such as pingguojiu for cider, a real life example)

Looks like it segments the dictionary and then looks it up. It could still be a problem if it oversegments, but if you look at the screenshots, most "words" are two characters.

The segmentation is based on a library called jieba. It seems to work pretty well from my experience. https://github.com/fxsjy/jieba

I do notice that some generated words are bogus (compounds of other words or redundant in some way). I have a file that lists redundant words, and when I notice these words I add them to the file so the tool won't generate them again. It also lists the words that they are duplicates of, so that those words can be upranked. This is the file if you're curious: https://github.com/kerrickstaley/Chinese-Vocab-List/blob/mas...
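As a rough illustration of what dictionary-based segmentation does, here is a small Python sketch of greedy forward maximum matching. This is not jieba's algorithm and not the tool's actual code; the `segment` function and the tiny `VOCAB` set are made up for the example.

```python
def segment(text, vocab, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    that is in vocab; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical miniature word list standing in for a real dictionary.
VOCAB = {"我", "想", "她", "会", "和", "他", "离婚", "牛奶", "羊肉"}

print(segment("我想她会和他离婚", VOCAB))
# ['我', '想', '她', '会', '和', '他', '离婚']

# A word missing from the list gets split into single characters, which is
# the oversegmentation risk mentioned earlier in the thread.
print(segment("牛肉", VOCAB))
# ['牛', '肉']
```

Greedy matching is the simplest approach and makes mistakes whenever a shorter split is the right one; it is only meant to show why the quality of the word list matters so much.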

Sure, there are words that are relatively opaque based on their components, but 離婚 isn't one of those words. Look at the 6 definitions given for 離 in the ABC dictionary:

1. (verb) leave; part from; be away from -- 我的夫人離我而去 My wife left me.

2. (verb) separate

3. (verb) defy; go against

4. (preposition) distant / apart from -- 北京離上海有多远? How far is Beijing from Shanghai?

5. (noun) name of one of the 8 Trigrams

6. (noun) name of one of the 64 Hexagrams

The sense of 離 is separation; it makes perfect compositional sense in 離婚. A divorce involves two people starting together and going apart.

From the description, it sounds like it works with words, not characters, so 牛肉 would be one multi-character word. The screenshot even shows an example: 燃烧

That said, I wonder if it also creates cards for the individual characters when they make sense. I know that for me, it's often easier to learn a composite word if I know the individual characters, and it helps you figure out other likely words (e.g. cow + meat = beef, so pig + meat = pork).

Some single-character words are included in the vocabulary list if they're often used as stand-alone words. Others that are rarely used alone or whose meaning can be inferred from "compound" words (e.g. 边, which occurs in 旁边, 右边, 左边,etc) are omitted. It's all based on my best judgement (which is probably misguided sometimes). A lot of manual tweaking went into generating the vocab list. All this deduplication is encoded in this file if you're curious: https://github.com/kerrickstaley/Chinese-Vocab-List/blob/mas...

Also, in order to generate a flashcard for such a word, it would have to appear in a standalone context, e.g. 我没吃过牛肉 would not generate flashcards for 牛 or 肉, but 牛不吃肉 would.
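The standalone rule can be sketched in Python as a check on whole tokens. This is only an illustration, not the tool's code: the token lists below are assumed to be what a segmenter would return for the two sentences, and `flashcard_candidates` and `VOCAB` are hypothetical names.

```python
def flashcard_candidates(tokens, vocab):
    """Return vocab words that occur as whole tokens, in first-seen order.

    A character buried inside a longer token (牛 inside 牛肉) never matches,
    because only complete tokens are compared against the vocab list.
    """
    found = []
    for token in tokens:
        if token in vocab and token not in found:
            found.append(token)
    return found


# Hypothetical vocab list containing the single-character words.
VOCAB = {"牛", "肉", "吃"}

# Assumed segmenter output for 我没吃过牛肉: 牛肉 stays together as one word.
print(flashcard_candidates(["我", "没", "吃", "过", "牛肉"], VOCAB))
# ['吃']

# Assumed segmenter output for 牛不吃肉: 牛 and 肉 stand alone.
print(flashcard_candidates(["牛", "不", "吃", "肉"], VOCAB))
# ['牛', '吃', '肉']
```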

Not to be pedantic, but I think 離 in this case is abbreviated for 離開 (to leave). So divorce means to leave a marriage...

NER might help here.

For anyone learning Mandarin, the best advice I have is: don't start with reading, start with speaking.

The Michel Thomas method has worked pretty well, much better than Pimsleur, for me:


And the Yale "Speak Mandarin" series with romanized Chinese is also really helpful.


Learning Mandarin, overall, carries a much higher cognitive load than, say, Spanish, because the sounds don't map to 26 letters. They map to 1000s of characters that must be memorized. This makes reading a much less useful way for a beginner to approach a language.

I use Pleco dictionary for reading Chinese books (read 20% of 三体 so far). There is a built-in copy-paste reader in the app. You copy-paste any book (it's just text) you want and it's possible to look up any word and add it to flashcards, pronounce it, etc. Minimum effort and preparation.

This is what I do too. In the iOS version, you can directly open epubs. It's great

Interesting, I'll check out Pleco's tool for this.

Nice tool. I'm interested in Chinese websites primarily so here is some code I wrote to dump text from a Chinese website.

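    import re
    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/'  # placeholder: substitute the page you want to dump
    r = requests.get(url)         # fetch the page (this step was missing above)
    soup = BeautifulSoup(r.text, 'lxml')
    # Matches every character that is NOT a Chinese character (radicals, ideographs, compatibility forms)
    RE = re.compile(u'[^⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE)
    # txt = soup.find_all(string=RE)
    txt = RE.sub('', soup.get_text())  # strip everything else, leaving only Chinese text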

Sounds like a good technique.

Get a book that is written in conversational language and learn the vocabulary of a page before reading it. This probably gets you neither speaking nor writing in that language, but being able to read it is a good start.

Very interesting article. I'm trying to learn Mandarin with a focus on speaking and conversing (using Memrise 2x per day for 15min). Would this help as well? Reading seems to me like a much more difficult project to tackle.

I've always been a visual learner and listening and speaking never really did it for me. Reading, though harder because of the character system, has helped cement vocabulary and grammar in a way that speaking and listening didn't. My speaking and listening are getting better by osmosis too.

Author, since you're commenting: you have the pinyin for San Ti wrong in the first line of your first major subsection. San is first tone, but it's marked in your post as 4th. (you have Ti correctly in 4th)

I’ve been trying this technique via two apps:

* duchinese

* wordswing

I tried also:

* special books: but it’s impractical, as they usually don’t translate enough words for my level

* webpages with perapera plugin: font is too small, translation is unaware of context and the texts I found were not easy to read

That is _awesome_. I've been wanting to do exactly this for Thai Wikipedia for quite a while, but you have some serious issues with identifying words in Thai.

When I want to segment words in some language, I usually check what Apache Lucene does. In this case, the Thai tokenizer [1] simply uses java.text.BreakIterator [2] and hopes that Thai is supported.

[1] https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=...

[2] https://docs.oracle.com/javase/10/docs/api/java/text/BreakIt...

Fantastic tip, thanks

Kerrick, what are your thoughts on using Pleco flashcards instead of Anki for Chinese flashcards on iPhone?

The big advantage of flashcards in Pleco is the integration with the whole app. If you use the Pleco Reader, you can add new cards directly from the Reader. You can also create cards from dictionary entries, and while studying you can check other dictionaries for each card, play the audio, etc. Some people prefer Anki, but personally I would recommend the Pleco Flashcards for learning Chinese.

Software developers should be more efficient with symbols. This approach may actually work better.

Does anyone have experience with methods like Michel Thomas or Rosetta stone learning Mandarin?

I wonder if the author's tool can be generalized to be used for other languages.

Very nice!


The ROI of learning speaking and listening is very high. I’ve been learning Chinese for 10 years and have gotten much more out of it than my 6 years learning German.

Why is that? Living in England, there weren’t many German-speaking people around so I didn’t get much out of it. Even when I went to Germany, they all spoke English better than I spoke German so we mostly ended up conversing in English.

Chinese on the other hand... there are lots of Chinese speakers in most places, and they’re mostly very happy, relieved and even excited to speak Chinese with you. If you ever go to China, speaking Chinese opens you up to a whole host of new experiences. Few people confidently speak English there so you can get a lot of practice very quickly.

I do agree that ROI on reading/writing is low, and it’s mostly due to the insane difficulty of memorizing characters. My suggestion: learn how to read and type simplified Chinese. Don’t bother with writing (which interestingly is much harder than typing on a pinyin keyboard), or traditional Chinese (there are tools which convert back and forth between the two if you need).

>Don’t bother with writing (which interestingly is much harder than typing on a pinyin keyboard), traditional Chinese (there are tools which convert back and forth between the two if you need)

I personally find traditional Chinese characters easier to learn to read than simplified characters. Simplified characters are definitely much easier to write but that's not very useful in modern times. Plus learning traditional Chinese characters helps with Japanese.

The reason that traditional Chinese characters are easier for me to learn is that they keep components that hint at the meaning, which are left out of the simplified versions.

For example, hear is 聽 (traditional) and 听 (simplified). The traditional character has 耳 which means ear, the new simplified character has 口 (mouth) and 斤 (axe). That's not necessarily the best example since some of the other components of the traditional character are a bit weird in a character related to hearing but I do feel that it's easier to derive the meaning from the traditional character.

Likewise, I prefer 愛 (love) to have the heart component 心.

There are other reasons than simple return on investment for learning a language.

That's totally true, people can learn it out of pure love.

But realistically, Mandarin is very hard to learn. We study it here in China every day until the last day of high school, and the daily Chinese classes take no less time than any other major course. That's a serious time investment even for native speakers, yet many college students are still basically asses with it.

I'd say the information within its language world is low quality; large parts of it were translated poorly from English, and it is nowhere near the English one, since English is the world language. So you are basically reduced to using it for everyday conversation. If you really want to learn it, I suggest you don't invest more than that.

And I oppose teaching it in foreign middle schools, except to Chinese Americans; it's good to dabble in for some basics though.

learned chinese to have access to one of the biggest content sources in this world and thus prevent boredom

this roi is pretty good

also have huge respect for the chinese ultra realist world view though i dont agree with it

a story like the three body problem is basically impossible from anywhere else

chinese is also devastatingly succinct on a whole another level and much appreciated

Why? Because it won't be good enough to work in China?
