I've been exchanging emails with Richard Cook (the Unihan maintainer) about getting some "rare" Taiwanese characters added to Unicode. I say "rare" because they're in the Bible, which I think should be covered as a basic text (it's the most-read book in the world!)
My research into word spacing, font issues, and more is covered on the blog at https://pingtype.github.io (click the Docs or Blog header). Practical suggestions for better web design are also welcome.
Regarding fonts, this is my specific rant that made me move from Heiti to Pingfang. Unfortunately forcing users to download 13 MB of Pingfang font was too slow for mobile, so I decided to disable it for the web version of Pingtype.
Edit: These are the IDS codes of the missing characters.
Photo evidence from a paper Bible:
⿱髟煮 chhang.jpg Job 39:19, Job 4:15
⿸疒粒 (not 𤷟) liap.jpg 1Sa 5:6, 1Sa 5:9, 1Sa 5:12 ... (17 found) - also see WikiSource.
⿱⿳亠口冖足 37106亮足 lo-.jpg Deu 1:28, Deu 2:10, Deu 9:2
⿰牜周 tiau.jpg 1Ch 17:7, 1Sa 24:3, 2Ch 14:15 (25 found, although 2Ch 14:15 uses 牧 in the paper version)
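For anyone unfamiliar with IDS notation: each sequence above is a prefix expression, where an Ideographic Description Character (IDC) such as ⿱ or ⿰ is followed by its components. Here is a minimal sketch of a parser, assuming only the twelve original IDCs (U+2FF0 to U+2FFB, with ⿲ and ⿳ taking three operands); the names `arity` and `parse_ids` are made up for illustration, and later additions to the standard are not handled.

```python
# Ideographic Description Characters (IDCs) in the range U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the others in this range take two.
TERNARY = {"\u2ff2", "\u2ff3"}
IDC_RANGE = range(0x2FF0, 0x2FFC)


def arity(ch):
    """Return the operand count for an IDC, or 0 for an ordinary character."""
    if ord(ch) not in IDC_RANGE:
        return 0
    return 3 if ch in TERNARY else 2


def parse_ids(seq):
    """Parse an IDS string into a nested tuple (operator, operand, ...)."""

    def walk(pos):
        if pos >= len(seq):
            raise ValueError("IDS ended before all operands were supplied")
        ch = seq[pos]
        pos += 1
        n = arity(ch)
        if n == 0:
            return ch, pos  # a plain component character
        parts = [ch]
        for _ in range(n):
            node, pos = walk(pos)
            parts.append(node)
        return tuple(parts), pos

    tree, end = walk(0)
    if end != len(seq):
        raise ValueError("trailing characters after a complete IDS")
    return tree


print(parse_ids("⿱髟煮"))
print(parse_ids("⿱⿳亠口冖足"))
```

The nested case is the interesting one: the third entry in the list is ⿱ applied to a ⿳ stack (亠, 口, 冖) and then 足.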
UNIHAN is Unicode's Han Unification effort. It handles the variant issue, but actually also goes a step further, citing information from paper books, including stroke information, definitions, and even pronunciations.
I've compiled an overview of UNIHAN at https://unihan-etl.git-pull.com/en/latest/unihan.html.
unihan-etl is a project I've created that allows extracting the contents of UNIHAN's database: https://unihan-etl.git-pull.com. It can be used as a Python library, or as a self-serve export of the database to a tabular or structured format.
In addition, there is something I've worked on to make this data also available in SQLAlchemy / DB form: https://unihan-db.git-pull.com
I'm also using it as the basis for a spiritual successor to cjklib: https://cihai.git-pull.com
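To give a feel for the kind of data involved, here is a rough sketch that reads lines in the tab-separated layout used by the Unihan text files (code point, field name, value, with `#` comment lines). This is NOT unihan-etl's API; `parse_unihan_lines` is a hypothetical helper, and the sample rows are typed in by hand for illustration.

```python
from collections import defaultdict


def parse_unihan_lines(lines):
    """Collect {character: {field: value}} from Unihan-style text lines."""
    records = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        codepoint, field, value = line.split("\t", 2)
        char = chr(int(codepoint[2:], 16))  # "U+4E00" -> the character itself
        records[char][field] = value
    return dict(records)


# Hand-written sample rows in the assumed layout (illustration only).
sample = [
    "# sample data for illustration only",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E00\tkMandarin\tyī",
]

print(parse_unihan_lines(sample))
```

For real work the projects linked above handle downloading, field selection, and export, so a hand-rolled reader like this is only useful for poking at the raw files.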
This Bible is not written in Mandarin, but rather Hokkien/Min (I suppose it's more correct to say "Southern Min," but I'm used to saying Hokkien and you'll see Taiwanese Hokkien as a term appear). And this is actually a kind of interesting statement about the written language. If you've ever talked about languages/dialects in the Chinese language family you might have heard something like "They might sound different but they all share the same underlying written language and are hence dialects" or the opposite "They're entirely separate languages akin to the Romance languages with their own spoken and written languages that might occasionally be similar/have cognates across different languages."
The situation is a bit muddier than either of those statements might suggest. As this example demonstrates, it's certainly NOT true that if you just wrote down any language/dialect in the Chinese language family it's intelligible to a Mandarin speaker. Chinese Unicode caters towards Mandarin and as you can see there are certain characters here that essentially never show up in Mandarin (and hence a Mandarin speaker might not recognize them). There are also characters that a Mandarin speaker might recognize without understanding their meaning in this case (e.g. some of the pronouns, such as the third-person pronoun 伊, which is normally 他/她/它 or 其 in standard Mandarin depending on how formal the setting is). And certainly any Mandarin speaker can attest to the fact that there are certain languages/dialects that are not verbally mutually intelligible.
However, there often exists a written "embedding" of Mandarin into these dialects/languages that doesn't really have an analog among e.g. the Romance languages. For example, Cantonese pop music is often written in Mandarin, but sung in Cantonese, which means that each Mandarin character has a proper Cantonese pronunciation. It comes off as "Mandarinized" and it's not the way people normally speak, but there's a standard, well-defined way of doing this, and it's intelligible (well at least in certain settings; if you speak like that normally you're going to have a problem). There isn't really a similar way of doing this for French and Spanish. You can't just take a Spanish work, pronounce it using a French pronunciation, and then voilà have a French work. It's not even entirely clear for all words what "pronouncing it with a French pronunciation" means.
Nonetheless it's not the same thing as having a shared written language. Standard written Cantonese is going to be tricky for a Mandarin reader to read, just as this Hokkien Bible is going to be tricky for a Mandarin reader to read. And even this "embedding" I'm talking about isn't perfect since there are some Chinese characters that show up commonly in Mandarin but rarely in, say, Hokkien and vice versa. Having a shared set of characters only gets you so far when they can have somewhat divergent meanings and very divergent pronunciations. And it's not even the case that all dialects/languages have standardized written forms.
Regardless, despite the fact that there are many named languages/dialects in the Chinese language family, the mutual intelligibility situation isn't quite as fragmented as it's occasionally made out to be. There's a handful of large families that share a pretty high degree of mutual intelligibility among dialects/languages within them. A few examples are the Mandarin family of dialects (of which Standard Mandarin is a member), which probably has the broadest geographical distribution, running mainly in a diagonal from the Southwest of China (Sichuan) up to the Northeast (Heilongjiang), the Cantonese family of dialects, the Wu family of dialects, and the Min family of dialects.
Returning to this Bible, the funny thing here is that the preface to this Bible is written in Mandarin, but the Bible itself is written in Hokkien (which hints at some of the weird interconnectedness among Chinese dialects/languages). In turn the Chinese characters here are a transliteration of a previous Romanization of an Amoy (Xiamenese Min) translation of the Bible.
This post ended up running away from me a bit, so just to recap the original Unicode answer: Chinese Unicode caters towards characters that show up in written Mandarin and this isn't written Mandarin.
I guess this is similar to how every written English word has a well-defined Spanish pronunciation, as if it were written Spanish. You wouldn't always get something intelligible from an English listener's point of view, but by the rules of Spanish pronunciation, many English words have a similar pronunciation when read as Spanish. I don't think I've ever heard a song do something like you described between these two languages, though, and I'm not sure how it could be made to sound good either.
The Spanish-English example you give is really only a result of Spanish's regular correspondence between spelling and pronunciation (as far as my limited Spanish knowledge allows me to understand), which allows you to unambiguously pronounce most combinations of Latin letters. For example, there's a "Spanish pronunciation" of Quechua written in Latin script, but it's basically an unintelligible mess of sounds that violates a lot of Spanish sound rules. If you use some conventions to group letters together and settle on a single pronunciation of each of those groupings, you can do the same in English with any Latin letter combinations as well.
Generally speaking, pronouncing written English as Spanish results in something that neither Spanish speakers nor English speakers understand nor accept as part of their language. While there is a subset of vocabulary (most notably -ion words) that shares the same or similar spelling between the two languages, there isn't really a cohesive subset of the written language (that is imagine writing entire sentences and paragraphs) that can be shared.
Like you said, that's not really doable for Spanish-English. There are words like "similar" which are written and pronounced the same in both languages, but there aren't enough of them to write sentences.
FWIW it's not just songs; standard Mandarin is usually the prestige dialect/language so anything that's perceived as being more "prestige" than an everyday conversation (e.g. things that are meant to be formal or official or whatever) tends to become Mandarinized among the different dialects/languages. Cantonese is a special example because of Hong Kong's own interesting social environment.
For most other dialects/languages in the PRC, speakers just switch directly to Mandarin where they might have used a "Mandarinized" part of the dialect/language. Hong Kong is one of the few places where Mandarin comprehension is still not near-universal and so instead of just having Mandarin pop songs, you end up with Mandarinized Cantonese pop songs.
在呼召我之處 The Place Of Calling (國語+台語) - 約書亞樂團 Joshuaband
On the topic of cultural crossover, I can also recommend Inner Mongolian folk metal (traditional throat singing mixed with electric guitars is brilliant - 九大圣器 Nine Treasures, 杭盖乐队 Hanggai). For Taiwanese, check out 滅火器 Fire EX., 一步 One Step, and 閃靈 ChthoniC.
I don't think that's quite correct. It is true that many older Cantonese pop songs were not written in Cantonese, but they were not written in Mandarin either; instead they were in what's called Written Vernacular Chinese (白話文), which is based on a subset of Mandarin.
As an example off the top of my head, 咱們 is a very common word in spoken Mandarin, but it is never used in formal documents (written vernacular), and a Cantonese-only speaker would have trouble pronouncing it.
Maybe I've lost touch with the current Canto-pop scene, but I seem to recall quite a lot of modern-ish Canto-pop (late 2000s) being written in Mandarin.
白话文 is sort of ill-defined. It really just means "vernacular written language" as opposed to Classical Chinese (文言文). There is a Yue vernacular (粤语白话文), Wu vernacular (吴语白话文), etc. There is also a Mandarin vernacular (which is what 白话文 usually refers to by itself in Mandarin) that you can further specify as 官话白话文 (although that term is getting quite specialized and not every native speaker is necessarily going to have heard of that), but that is exactly what it sounds like, the semi-formal written Mandarin vernacular. This is exactly the same situation as standard written English. While this is a restriction of "English" in some sense (does this include regional-specific terms such as "yous?"), it still is more or less equivalent to what most people think of when they think of "English."
咱们 is common in Northern dialects of Mandarin, but used to be a pretty strong marker for a Northern Chinese speaker. That's changed somewhat over time due to broadcast media and the primacy of some Northern Chinese accents, but you're still not going to hear a ton of older Southern Chinese speakers use 咱们 when speaking Mandarin (and when I mean south I mean essentially anyone from Jiangsu/Zhejiang on down) even if they speak pitch-perfect Standard Mandarin. They almost exclusively use 我们 instead (which is also valid Standard Mandarin, everywhere you would use 咱们 you could also use 我们 although not vice versa).
I think this is the phenomenon you're picking up on, that a lot of Northern Chinese Mandarin terms don't show up in Southern Chinese Mandarin.
More broadly, I think what you're pointing out is that "Mandarin" is really itself a fuzzy category that starts falling apart when you poke at its seams, but really that's just the nature of language. "English" starts falling apart when you poke too hard at its particular words as well.
The central point about the embedding of Mandarin though is that it reads completely fluently to a Mandarin speaker. It's not a pidgin, it's not a situation where the Mandarin speaker can understand it in bits and pieces. It reads like anything a Mandarin speaker would have written.
For modern Cantopop, the recent push to match spoken Cantonese more closely is due to an increasing dislike of Mandarin among the young in Hong Kong.
Ooo... that's interesting. I knew that there's been a rise in regional dialect pride in general and that Hong Kong in general has never really been too fond of the Mandarin-speaking Chinese tourist masses, but hadn't realized it'd spread to modern Canto-pop. Might have to take a gander at some of the latest songs.
Maybe Bokmål is similar in this regard? The Wikipedia page says it's "Norwegianized variety of Danish": https://en.wikipedia.org/wiki/Bokmål
The whole story of how the various Scandinavian languages developed is actually quite interesting and has a lot of similarities to the situation among the different languages/dialects of Chinese. In particular, similar questions of mutual intelligibility come up.
However, as I understand it, Bokmål in this particular case is not quite the same. It historically came from Danish, but now is a written standard for Norwegian that competes with Nynorsk (which is in itself a fascinating phenomenon; not many other languages have competing written standards). It's not really a current embedding of Danish into Norwegian so much as it is a written standard for Norwegian that historically developed from Danish.
Have you considered sending Ken Lunde (@ken_lunde) a tweet? He's quite the expert in CJKV and he usually responds promptly to a tweet.
The example regarding wang4 seems like something that should be handled in terms of a Simplified/Traditional variant switch. On the other hand, the unusual characters in the Taiwanese Bible seem to be characters without meaning outside of Taiwanese (or more generally, Min), but that have specific semantic meaning in Taiwanese, and should be encoded in Unicode.
What is interesting is that Fujianese are the first to go abroad, so Minnanhua can be more common than Mandarin in SE Asia and even Europe.
Unlike Mandarin, most dialects have not undergone the same nationalist push to write their spoken language, so literature in the spoken language is few and far between. However, Christian proselytizers are often very determined to translate the Bible into local spoken languages, and the Bible is often one of the first significant works in any spoken language/dialect's literature. Hence I am not surprised that it is in the Bible that the top commenter found digitally disadvantaged Chinese characters not encountered in literature outside of Taiwanese Hokkien. You may be right in a sense that these same characters may be encountered in other Min dialects, but they would still be pronounced differently.
I don't think that's the case here though. Plenty of Taiwanese people I've spoken to are happy to (and often do) call it Hokkien (or Taiwanese Hokkien). Taiwanese as an adjective is usually ambiguous (Taiwanese Mandarin, Hokkien, or aboriginal Taiwanese languages are all possibilities). Taiwanese as a noun is usually less ambiguous (although the confusion in this thread makes me rethink that).
Hokkien (Fujianese) is to Southern Min what Cantonese is to Yue and to some extent Shanghainese is to Wu. They are place-name terms that nominally refer to a specific subset of a dialect family that are now used in informal speech to refer to the entire dialect family. But for example Macaonese (澳门话) is still a place-name term you'll hear even though it's within the Yue family and therefore you could (and people do) call it Cantonese. Taiwanese is a similar case (place-name term) here.
I don't think that's the case here though. There's nothing specifically different from the Hokkien spoken in Taiwan. It's a dialect of Chinese and certainly cannot be called linguistically different. It would be like saying someone spoke Texan instead of English with a Texas accent.
The written language is traditional Chinese, nothing else. This is a common misunderstanding
In the same way that you'll hear 重庆话 (Chongqingnese) and 成都话 (Chengdunese) both used to refer to region-specific dialects of Sichuanese (which in turn is really a member of the greater Mandarin (官话) dialect family) and 上海话 (Shanghainese) and 苏州话 (Suzhounese) used to refer to different region-specific dialects of Wu, Taiwanese is being used in a similar fashion here. It is indeed a dialect of Hokkien that is mutually intelligible with and very similar to other varieties of Hokkien. The differences are, as you imply, on par with the differences you might find between a Texan accent and a New York accent. But there's nothing special about Taiwanese here. This is just the way all region-specific dialects of Chinese are named.
The written language is kind of sort of traditional Chinese (depending on how liberally we're defining "traditional Chinese" here), but it's a stretch. The grammar is not the same as modern Mandarin (nor is it the same as Classical Chinese). Moreover, not all the characters used are attested in official Chinese sources. For example, the first character that the original comment way way up circled in his Flickr image is not found in any ancient Chinese dictionaries I know of (Guangyun/广韵, Shuowenjiezi/说文解字, Qieyun/切韵, Kangxi Dictionary/康熙字典), nor is it found in the modern Xinhua Dictionary (新华字典). And I don't have a copy handy but I wouldn't be surprised if I couldn't find it in the evocatively named Cihai/辞海 (Word Ocean). This is presumably why they are not in Unihan. A much larger set of characters are those that very very few Mandarin speakers recognize. And another (independent) large set of characters do not and have never had the meaning they have in Hokkien in other varieties of Chinese.
That's not to say there's not a lot of overlap with written Mandarin. A Mandarin speaker could probably muddle their way through this Bible if they tried, but then again, a French speaker could muddle their way through Haitian Creole. And even an English speaker could maybe get through a French newspaper given enough cognates. This is a far far cry from the "embedded Mandarin" situation I talked about in another comment though.
Maybe it's not quite something like https://en.wikipedia.org/wiki/Zhuang_logogram which is Chinese-looking, but definitely not what people would normally call Chinese characters, but for a few of these characters, it's getting pretty damn close.
EDIT: I do appreciate that "Taiwanese" as in Taiwanese characters is ambiguous. I initially also read that as a "characters used in works written for a Taiwanese audience in Taiwanese Mandarin."
These of course have nothing to do with the Chinese character set issue raised in the comments above.
Jared Diamond's Guns, Germs and Steel is obviously the go-to book in this area for introduction of why some people built empires and others did not.
It's unfortunate that awareness of the many Chinese dialects/languages is pretty poor in the Anglosphere, and parent comment could be clearer with what he meant by Taiwanese. It was obvious to me that it was Hokkien that he was referring to, but that is only by virtue of me having familiarity in the matter.
Even many of the “regular” characters have been simplified. Consider 吃 and 喫: they both mean to eat, but the one with fewer strokes became really the only modern choice to use (however, Japanese still uses the old variant). Another common one is a simplification of the first character for Taiwan (臺灣): 台 is often used in place of 臺.
Edit: fixing up my Japanese
Any Chinese person who tells you the truth about what your tattoo says is being very kind to you, but most Chinese won't say anything bad since they have no reason to embarrass the person.
One time when I was still in college my family took a trip out to Mexico. I forget which store we went into, but the cashier asked my dad to write the cashier's name (sorry, forgot that too) in Chinese. My dad spent a decent amount of time thinking of the proper characters, wrote them down, and we were on our way. I still think about that incident a lot, like what if my dad was a jerk and wrote something stupid for this guy to get tattooed (he wouldn't). But even then he's essentially trusting that my dad is not messing up his name in Chinese (it definitely wasn't something like Mark).
But the Chinese tattoos, and the English words, aren't meant for those native speakers to see. The tattoos are meant for other English speakers to see, and the words are meant for other Japanese speakers to hear. They're not meant to be understood, so much as just being visually/audibly cool.
It looks good, and it sounds good, and it's in an environment where almost no one will understand it, so it really doesn't matter if the content is correct. The idea is sufficient.
Now of course, if you get a Chinese tattoo misspelled and move to China, you'll look like a bumbling idiot. It's the same as being a native speaker and misspelling it.
The context/environment matters, when deciding how important that mistake is.
To be fair, I've seen a Taiwanese person with the English "spice girl" as a tattoo, which is a reverse-poor-translation, because it sounds like a singular member of a defunct pop band. Makes more sense in Chinese.
Can you clarify what it means? I'm guessing something like "sassy" or "hot" but it's really unclear without knowing the cultural connotations.
Tattoo characters being "ugly as hell" aren't unique to Eastern/Asian script on Westerners. There is plenty of ugly Western/Latin script on Westerners too.
are you a mod? why don't you let the mods mod.
That's a key concept not only for font design, but also for learners of Chinese. For certain characters like 醫 you have to scale down or elongate the radicals to be balanced within a unified whole. Add the importance of stroke order and simplified vs. traditional characters, and learning basic writing skills (let alone calligraphy) gets really tricky.
Note that when the PRC standardized its stroke order, it ended up with stroke orders that differed from calligraphic tradition: most notably 右 and 必. In addition, different calligraphic styles and regional variations sometimes have different character variants and different stroke orders.
In short, whereas there are a few mainstream ways to write words in Chinese, just like how there are a few mainstream ways to handwrite disconnected Latin letters, there are many variations that one is expected to be able to recognize, just as how one is expected to recognize a wide variety of Latin letterforms (copperplate, Palmer, blackletter, closed vs. open a/g, etc.)
* Pinyin can't be used for other dialects, meaning someone in Guangzhou won't understand written pinyin.
* There are only something like 400 sounds in Mandarin, which means there are a lot of homonyms which makes pinyin not suitable in certain contexts.
* Switching a highly literate society with more than a billion people to a different writing system would be a massive undertaking.
The Vietnamese made such a switch from Chinese characters to a Romanization system based on Portuguese, but at a time when literacy in Chinese characters was relatively low and a colonial power (France) dominated the country and its bureaucracy.
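On the homonym point in the list above, a toy illustration may help. The sketch below groups a handful of characters by their pinyin reading; the `READINGS` table is a tiny hand-picked sample, not a real dictionary, and `homophone_groups` is a made-up helper name.

```python
from collections import defaultdict

# A tiny hand-picked sample of characters with their pinyin readings
# (illustrative only; a real table would come from a dictionary source).
READINGS = {
    "是": "shì", "事": "shì", "市": "shì", "世": "shì",
    "马": "mǎ", "码": "mǎ",
    "水": "shuǐ",
}


def homophone_groups(readings):
    """Group characters by reading, keeping readings shared by 2+ characters."""
    groups = defaultdict(list)
    for char, reading in readings.items():
        groups[reading].append(char)
    return {r: chars for r, chars in groups.items() if len(chars) > 1}


for reading, chars in homophone_groups(READINGS).items():
    print(reading, "".join(chars))
```

Written in characters, the four "shì" entries are visually distinct; written in pinyin, they collapse into one spelling and only context can separate them.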
Of course you address some of the pain points like the fact that China is way more literate than Korea was when Hangul was introduced.
You write a "new" Chinese character and then there is: a) no way to represent it on a computer unless you draw it b) no way of knowing how it's pronounced
Latin, Cyrillic, Arabic, Hebrew (ok, they have some common roots), Korean are much more maintainable and "portable".
No, Chinese won't be the new English. You get to write and converse in English in a short time frame (1 yr). Not Chinese. And certainly the learning curve gets steeper the further you go.
Speaking of not knowing how to pronounce a Chinese character, that is not entirely true. Many of the characters you see in Chinese have a radical in the construct that indicates a sound. Unfortunately tone is not indicated.
Chinese won't be the new English, I agree. But emoji, which function exactly the same way Chinese characters do (image representing a meaning) are already an English augment.
Sort of an additional aside, there is a way to type a Chinese character by radical assemblage. So even if you don't know how to pronounce it by pinyin, you can still retype it into an e-dictionary. Similar method is used to look it up in a traditional dictionary. I never bothered to learn as it requires a touch typist skill to be fast enough to abandon the pinyin entry.
A former colleague remarked that spelling bees are interesting only because English is such a horrible language. A German spelling bee, in comparison, might go on for weeks!
Perhaps we should've gone through with one of these reforms: https://en.wikipedia.org/wiki/English-language_spelling_refo...
Five separate words - with five different ways to pronounce "ough"! Spelling in English is a complete disaster. I pity those people who have to learn it as a second language.
True, and my point applies to those as well. However, even English pronunciation is not that bad and grammar is much easier than German or Latin or most Romance languages.
> Many of the characters you see in Chinese have a radical in the construct that indicates a sound. Unfortunately tone is not indicated.
Many, but not all (I think Japanese uses some of those hints as well). But if you have "character from some ancient text" that doesn't have that radical you're SOL.
> there is a way to type a Chinese character by radical assemblage. So even if you don't know how to pronounce it by pinyin, you can still retype it into an e-dictionary
Good to know
The Japanese pronunciation hints are way less reliable than the Chinese. First, Japan inherited Chinese sounds and characters in dribs and drabs. Because Chinese has so many dialects, and a dialect may pronounce a certain character a vastly different way, the onyomi today may be a representation of old Fujian pronunciation or old Cantonese, not the Mandarin. Second, because the characters were introduced to different parts of Japan at different times, those characters were used for different purposes in the different regions in Japan. Today this means a single character can be used in different words with different meanings, with more than one onyomi. Never mind what happens in kanji compounds.
If a Chinese character were suddenly introduced with a new radical, that would be akin to suddenly finding a new letter in the alphabet. Highly unlikely. When was the last time a letter was added to or removed from the standard English alphabet? A quick google says the last letter added to the English language was J, added in 1524. Currently there are 214 radicals in Chinese. A new character would almost certainly be a combination of those existing radicals. Furthermore, the character could still be systematically written in computer language with brush stroke entry. That character would simply have to be added at the Unicode level for image rendering, same as having to define Alien Emoji as U+1F47D so that it does not render as a ?
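Both the code point and the radical count mentioned above can be checked from Python's standard `unicodedata` module. This is just a quick sketch: `assigned_in_range` is a hypothetical helper, and the result depends on the Unicode version bundled with your Python, though both of these have been in the standard for a long time.

```python
import unicodedata

# The emoji code point mentioned above.
alien = chr(0x1F47D)
print(f"U+{ord(alien):04X}", unicodedata.name(alien))


def assigned_in_range(first, last):
    """Count code points in [first, last] that have a character name."""
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.name(chr(cp), None) is not None
    )


# The Kangxi Radicals block spans U+2F00..U+2FDF; not every slot is assigned.
kangxi_count = assigned_in_range(0x2F00, 0x2FDF)
print(kangxi_count)
```

A character only gets a name here once it has been encoded, which is the step the missing Bible characters have not gone through yet.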
To be a bit pedantic, there's the long s (https://en.wikipedia.org/wiki/Long_s): you can still find newspaper articles in Google's archive that used it back in the mid-1800s.
As a nonnative speaker, it sometimes feels like you just have to know the words individually in order to pronounce them correctly: https://en.wikisource.org/wiki/Ruize-rijmen/De_Chaos.
Can you say squirrel? My trick with this word is to use different words. Start with the word "whirl". Repeat this word a couple of times to get your mouth around it. Then say the "sch" from the word "school". "Sch"... + ..."Whirl", "Sch".. + .."Whirl", "Sch". + ."Whirl", "Sch" + "Whirl", "Sch" "Whirl", "Sch-Whirl", "SchWhirl". Looks horrible typed but it works on the tongue. Just need to get that sound glide exiting "sch" into "whirl". If you end the "sch" with your lips pursed in anticipation for the "oo" sounds in school, it sets you up for the "wh".
>(Received Pronunciation, General Australian, UK) IPA(key): /ˈskwɪɹl̩/, /ˈskwɪɹəl/
>(Canada, US) IPA(key): /ˈskwɝl/, /ˈskwɝl̩/, /ˈskwɝəl/
I haven't really thought about this word before, I think I use the UK pronunciation /ˈskwɪɹəl/, and this word didn't come across as hard to pronounce.
aɪ heɪt tu seɪ ðɪs, bʌt aɪ doʊnt si ðə pɔɪnt ɪn meɪnˈteɪnɪŋ ˈkɑmpləˌkeɪtəd oʊld ˈraɪtɪŋ ˈsɪstəmz. (aɪ min, ʌv kɔrs aɪ si ðə hɪˈstɔrɪkəl ænd ˈkʌlʧərəl ˈvælju, bʌt aɪ doʊnt si waɪ ʃʊd ˈpipəl kip ˈjuzɪŋ ɪt)
ju raɪt eɪ "nu" ʧaɪˈniz ˈkɛrɪktər ænd ðɛn ðɛr ɪz: eɪ) noʊ weɪ tu ˌrɛprəˈzɛnt ɪt ɑn ə kəmˈpjutər ənˈlɛs ju drɔ ɪt bi) noʊ weɪ ʌv ˈnoʊɪŋ haʊ ɪts prəˈnaʊnst
ˈlætən, səˈrɪlɪk, ˈærəbɪk, ˈhibru (ˈoʊˈkeɪ, ðeɪ hæv sʌm ˈkɑmən ruts), kɔˈriən ɑr mʌʧ mɔr meɪnˈteɪnəbl ænd "ˈpɔrtəbəl".
noʊ, ʧaɪˈniz woʊnt bi ðə nu ˈɪŋglɪʃ. ju gɛt tu raɪt ænd kənˈvɜrs ɪn ˈɪŋglɪʃ ɪn ə ʃɔrt taɪm freɪm (1 jɪr). nɑt ʧaɪˈniz. ænd ˈsɜrtənli ðə ˈlɜrnɪŋ kɜrv gɛts ˈstipər ðə ˈfɜrðər ju goʊ.
I made a Pingtype English for language exchange the other way, and that uses IPA instead of Pinyin. I often find that users prefer ㄅㄆㄇㄈ though (which I derive from IPA).
Not at all! English has really strong historical roots for its spelling, but these could be eliminated even without IPA. Spelling could be standardized instead of staying historical. ("Speling kud bee stenderdized insted uv staying historikel".) Even within ASCII.
I was pointing out that the parent poster fails to realize that English itself has a "complicated old writing system" with "historical and cultural value". Just as having different characters allows for homophones to be recognized easily, English does the same thing with many words that sound the same (sea/see/C, or to/two/too, etc.) There is a lot of history or culture in English spelling, for example all the silent -e endings.
Chinese is not the only language that uses these characters. Japanese and Korean do too. For these groups, the learning curve is much less; that something does not come easy to English speakers is not evidence of its inherent deficiency.
I would suggest someone figure out a sane way to capture spoken Chinese in a modern (i.e. easy to digitise) writing system but I doubt it would gain any traction because of the cultural implications of the script. Most Westerners see their writing system as a simple fact of life, the Chinese seem to see it as a sacred traditional craft in the same vein as forging steel.
Well, it would be as difficult as understanding a spoken phrase that uses such homophones. (NLP people would probably hate that though :) )
Japanese yes. Korean historically but not really these days. If you try browsing Korean Wikipedia or Korean top newspaper sites, it's Hangul and some Latin characters and no Hanja in sight.
This merely means that you were (very likely) born in a society speaking English or some closely related west European language. You know, the "normal, easy" languages.
The difficulty of Chinese is more in the "long tail" of rare characters. If you memorize 5 characters a day (it gets easier once you know all the basic components), after a year you'll be able to read 90% of most texts and get by without much difficulty, but the remaining 10% require double or triple the amount of memorization and after that there'll still be rare characters that even native speakers don't recognize immediately.
This fairy tale is promulgated because of the fact that, when you look at the character frequencies, over 95% of the characters in any newspaper are easily among the first 2,000 most common ones. But what such accounts don't tell you is that there will still be plenty of unfamiliar words made up of those familiar characters. (To illustrate this problem, note that in English, knowing the words "up" and "tight" doesn't mean you know the word "uptight".) Plus, as anyone who has studied any language knows, you can often be familiar with every single word in a text and still not be able to grasp the meaning. Reading comprehension is not simply a matter of knowing a lot of words; one has to get a feeling for how those words combine with other words in a multitude of different contexts. In addition, there is the obvious fact that even though you may know 95% of the characters in a given text, the remaining 5% are often the very characters that are crucial for understanding the main point of the text. A non-native speaker of English reading an article with the headline "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS" is not going to get very far if they don't know the words "jacuzzi" or "phlebitis".
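The coverage figures being argued about here are easy to compute for any text you have on hand. Below is a minimal sketch: `coverage` is a made-up helper that measures what fraction of character tokens fall within the top N most frequent characters of the same text, and the sample string is a toy, not newspaper data. Note that it measures exactly the character-level statistic the quoted passage criticizes; it says nothing about multi-character words or comprehension.

```python
from collections import Counter


def coverage(text, top_n):
    """Fraction of character tokens in `text` covered by its top_n most
    frequent distinct characters (whitespace ignored)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total


# Learning 5 characters a day for a year, as suggested earlier in the thread.
chars_after_one_year = 5 * 365
print(chars_after_one_year)

# Toy text: 的 x4, 一 x3, 是 x2, 不 x1.
toy = "的的的的一一一是是不"
print(coverage(toy, 2))
```

Run against a real corpus with a separate frequency list, the same idea would show how quickly the curve flattens and how long the tail of rare characters is.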
Moser has been discussed to death in the past; his article is pretty exaggerated, to say the least.
I think it's worth keeping in mind that for most of history, Chinese referred to a writing system that was shared between many different languages, each with different pronunciations, coming from different places (what is now modern China, Korea, Vietnam, Japan), so the writing system was a script designed around semantic meaning, not sound. This also lets it get used in Kanji where words may have multiple readings (let's not even get into how crazy that is).
Now that Mandarin is de facto "Chinese", the writing system no longer serves its purpose, and in fact a lot of the character simplifications made in the 20th century make radical substitutions based on sound, further complicating things.
Reading a Chinese newspaper knowing 2000 characters is absolutely possible. You can either just rely on good old guessing from context since, you know, the headline is usually followed by an article, or you just look that word up. It's really not like you will stumble at every other character since, as that guy even admits, 95% are already covered by the 2000 you know. Imagine reading the sports section for the first time. You might find a dozen characters or words you don't know, but after a week or two you'll eventually know enough Chinese terms regarding soccer. Same for any specific topic in any foreign language. As for the character combos, first of all, while learning it's not like you memorize each character separately; you're usually given some examples of such combinations using characters you already learned, phrases to go along with it, etc. For many combinations it's even pretty easy to guess what they mean.
No, not really. Though Romance languages are related to English via PIE. So, it might have been easier than if my mother tongue was Finnish or Turkish or a non-PIE Asian language.
English is a poor comparison, because it - having consumed vocabulary from many other Latin-alphabet languages and preserved their spelling, among other historical accidents - does not have a consistent pronunciation rule: you have to know the word to know the pronunciation or otherwise only guess at it, which is similar to the performance of an ideographic language with pronunciation hints like Chinese.
Latin languages with consistent pronunciation are better examples.
Hangul, though, is a superb example: a truly designed system, syllabic and compositional. (Other readers: look it up, it's really cool how it works.)
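To make "compositional" concrete: Unicode lays out precomposed Hangul syllables arithmetically, starting at U+AC00, as 19 initial consonants x 21 vowels x 28 finals (including "no final"). Below is a small sketch of that arithmetic. The helper names `compose` and `decompose` are made up for illustration, and the letters are written with compatibility jamo for readability.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable

INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
# 28 finals; index 0 means "no final consonant".
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def compose(initial: str, medial: str, final: str = "") -> str:
    """Build one syllable block from its letters by index arithmetic."""
    i = INITIALS.index(initial)
    m = MEDIALS.index(medial)
    f = FINALS.index(final)
    return chr(HANGUL_BASE + (i * len(MEDIALS) + m) * len(FINALS) + f)


def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one precomposed syllable back into its letters."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < len(INITIALS) * len(MEDIALS) * len(FINALS):
        raise ValueError("not a precomposed Hangul syllable")
    i, rest = divmod(offset, len(MEDIALS) * len(FINALS))
    m, f = divmod(rest, len(FINALS))
    return INITIALS[i], MEDIALS[m], FINALS[f]


print(compose("ㅎ", "ㅏ", "ㄴ") + compose("ㄱ", "ㅡ", "ㄹ"))
print(decompose("한"))
```

The first print assembles the two syllables of the word "Hangul" itself from six letters; the second takes one of them apart again.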
It's unfortunate you were downvoted over the details here. The big picture - many languages have a large process of rote memorization as a barrier to literacy and such a process is entirely optional - is a great observation.
Although, I must admit, I continue to find the Chinese writing system to be quite beautiful. It's like I'm looking at a strange advanced alien script every time I see it. So, I hope that it never dies. This writing style is a treasure. It is a gift to humanity, to human history, and to human culture. We need to preserve it any way that we can.
I hope that one day, an advanced brain-computer-interface technology will allow us to just download a language package into our brains, and we can get instant Chinese comprehension in just 60 seconds. You can get zapped, and go: "I know Chinese." Or rather: "Wo dong zhongwen."
 Here is some background on written Korean.
 Here is an interesting comic to teach Korean in 15 minutes.
How it went was a classic tale of standardization without implementation. The first round of simplifications incorporated simplifications that were already in common use in informal settings, in calligraphy, and in ancient forms. This led it to be readily adopted. (Fun fact: "traditional Chinese characters" are in some cases newer than simplified characters, where the character in question acquired a popular different meaning due to its use for phonetically similar words, and the old sense started to be used with a modified character to distinguish it from the borrowed meaning!)
The next round of simplification was too radical for the people, and so it was eventually rolled back.
Why would you write a new character? How would others know what it means? Unless they figure it out from the radicals? (Which sounds lossy as hell.)
I then wrote a script that used Harfbuzz to extract every glyph of 75,000 characters as PNG files. That took about 4 months to run on a spare computer, writing 500 GB to an external disk.
I now want to sort out the blank glyphs, but it's really slow even on USB 3. Instead, I bought a 2 TB upgrade for my MacBook Pro, passed down the 512GB to my MacBook Air, and now I'm copying the files from the external disk to the SSD. There are about 90 million files to copy, estimated time remaining 4 days. When I remove the blank images, I plan to use the data to make my own Chinese OCR using Tensorflow.
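One cheap way to sort out the blanks, sketched here under a strong assumption: that every blank glyph renders to a byte-identical PNG (same dimensions, same encoder settings). If that holds, one known-blank file can serve as a reference and everything identical to it can be found without decoding any images. The function name `find_identical` and the file names in the demo are hypothetical; if the blanks are not byte-identical, this approach finds nothing and pixel-level checks would be needed instead.

```python
import hashlib
import tempfile
from pathlib import Path


def find_identical(root: Path, reference: Path) -> list[Path]:
    """Return PNG files under `root` whose bytes are identical to `reference`."""
    ref_bytes = reference.read_bytes()
    ref_digest = hashlib.sha256(ref_bytes).digest()
    matches = []
    for path in sorted(root.rglob("*.png")):
        if path == reference:
            continue
        # Cheap size check first, so most files are never read in full.
        if path.stat().st_size != len(ref_bytes):
            continue
        if hashlib.sha256(path.read_bytes()).digest() == ref_digest:
            matches.append(path)
    return matches


def demo() -> list[str]:
    """Tiny self-contained run; the byte strings stand in for real PNG data."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "blank.png").write_bytes(b"stand-in for a blank render")
        (root / "U+E000.png").write_bytes(b"stand-in for a blank render")
        (root / "U+4E2D.png").write_bytes(b"stand-in for a real glyph!!")
        return [p.name for p in find_identical(root, root / "blank.png")]


print(demo())
```

With tens of millions of files the size check matters more than the hash, since a `stat` call is far cheaper than reading each file over a slow link.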
The TTF files alone are 11.82 GB. If you can recommend a suitable file host, I could re-upload them for you.
Honestly most of what I've done since then has been data collection (song lyrics, movie subtitles, etc) instead of developing new features.
My favourite feature now is to read the song lyrics in church, find 4 characters I know, search my database of Christian song lyrics, load that into Pingtype, and sing along with the pinyin and understand the meaning. It's all automated, but I can't upload it because I've received copyright threats about redistributing the song lyrics. I'm not a limited-liability company (this is a side project) so I'd be personally liable for the consequences of putting it online.
I've done much more research to find new data sources. For example, 9gag helped me stumble upon a translated comic (Mixflavor & HowardInterprets). I transcribed all the comics, and I'm using it with my language exchange tutor every week. I decided I wanted to find more comics that are popular with my friends.
So I extracted my Facebook friends' liked Pages. (Yes, that sounds like Cambridge Analytica, but I did it myself using an AppleScript to scrape and some bash scripts to parse). I found 223,783 pages, in 865 Facebook categories. I reduced the Facebook categories to 30 of my own categories (Art, Music, Cooking, Driving, Pets, Shopping, Religion, etc). Then I found the top pages for each of those. So I know the most popular musicians in Taiwan. That's going to become a blog post and Show HN soon, when the paranoia about Facebook calms down.
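The reduction from Facebook's categories to a smaller personal set, followed by a per-category ranking, could look roughly like this. It is only a sketch: the input shape (one `(page, facebook_category)` pair per like), the mapping entries, and the function name `top_pages_by_category` are all assumptions, not the scripts described above, which used bash.

```python
from collections import Counter, defaultdict


def top_pages_by_category(likes, category_map, n=3):
    """Rank liked pages within each of your own categories.

    likes: iterable of (page_name, facebook_category), one entry per like.
    category_map: facebook_category -> own category (hypothetical mapping).
    Unmapped Facebook categories are collected under "Other".
    """
    counts = defaultdict(Counter)
    for page, fb_category in likes:
        own_category = category_map.get(fb_category, "Other")
        counts[own_category][page] += 1
    return {
        category: [page for page, _ in counter.most_common(n)]
        for category, counter in counts.items()
    }


# Made-up sample data for illustration only.
sample_likes = [
    ("Band A", "Musician/Band"),
    ("Band A", "Musician/Band"),
    ("Band B", "Musician/Band"),
    ("Cafe C", "Coffee Shop"),
]
sample_map = {"Musician/Band": "Music", "Coffee Shop": "Cooking"}
print(top_pages_by_category(sample_likes, sample_map))
```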
We're always looking for good submissions that didn't get much interest, so if anyone knows of others, please email links to email@example.com and we'll take a look.
What's your blog?
1. Get a list of IDs (I must be friends with them).
I manually maintain Lists of friends I met in each country. I went to my Taiwan list, scrolled down, and copied the source into TextWrangler. A few regex find-replaces later, I had a list of all my Taiwanese friends' IDs.
2. Find-replace to make URLs.
My ID is 705630362, so the URL of my Likes page is: https://m.facebook.com/timeline/app_collection/?collection_t...
3. Scroll down and copy out. This is GUI-intensive, so run it on a spare computer.
tell application "System Events" to tell process "Safari" to key code 119
Repeat those two while download_source does not contain "<div class=\"_51lb\">". If download_source contains "The page you requested cannot be displayed" then exit repeat.
4. Write it to a file (use cat, not Apple's recommended code, in order to preserve Unicode).
5. Convert to text.
In my case, the HTML files took up 479 MB for 1576 friends. I wrote another script to convert them to text.
Split the HTML based on the "<a class=\"darkTouch _51b5\" href=\"" delimiter.
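The control flow of steps 3 and 5 can be sketched in Python. The original used AppleScript driving Safari, so `get_source` and `scroll_to_end` below are hypothetical stand-ins for "fetch the page source" and "press End"; that the second repeated action is fetching the source is my reading of the description. The two marker strings and the delimiter are the ones quoted above. Taking the text up to the next double quote as the link target is also an assumption about what the split was used for.

```python
import html

END_MARKER = '<div class="_51lb">'
ERROR_MARKER = "The page you requested cannot be displayed"
LINK_DELIMITER = '<a class="darkTouch _51b5" href="'


def scroll_until_loaded(get_source, scroll_to_end, max_rounds=500):
    """Step 3: keep scrolling until the end marker or an error page appears.

    get_source / scroll_to_end are caller-supplied callables (hypothetical).
    max_rounds is a safety limit that the original description does not have.
    """
    for _ in range(max_rounds):
        source = get_source()
        if END_MARKER in source or ERROR_MARKER in source:
            return source
        scroll_to_end()
    raise TimeoutError("page never reached the end marker")


def extract_hrefs(page_html):
    """Step 5: split on the link delimiter and keep each href value."""
    chunks = page_html.split(LINK_DELIMITER)[1:]  # text before the first link is dropped
    # Each chunk starts with the href value, terminated by a double quote.
    return [html.unescape(chunk.split('"', 1)[0]) for chunk in chunks]


print(extract_hrefs(LINK_DELIMITER + '/somepage/?a=1&amp;b=2">Some Page</a>'))
```

Writing one extracted value per line gives a text file per friend that can be concatenated and counted.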
Now it's time to do research. What are the most common likes? Just combine all the files using cat, and use a bash script to find the most common lines:
cat "input.txt" | sort | uniq -c | sort -n -r > "output.txt"
I plan to write more about this soon, and I'll probably put it on the Pingtype blog. But I might put it on Medium, because people seem to like that these days. Maybe both. There's also my personal website, but I'm worried that people might complain about privacy, so maybe I should distance myself from it. I'm not afraid to write the comment here because we're all hackers.
Using it to create a new font would probably still require lots of manual labor to create training data and then check the output (you don't want to mess up the rare character appearing in someone's name...), but being able to easily interpolate should come in handy for exploration of the design space.
Sort of like MIME types, which were standardized a decade later.
Resources were limited to, I think, 4MB. 4MB was 32x the RAM capacity, and 10x the [floppy] disk capacity, of the original Macintosh.
One of the OSTypes was 'FONT'. Font data was simply a resource stuck in some file — the system file, or, as a kind of pre-web “web font”, an application.
When we added support for CJK fonts, around 1990, we had to also add OS/file system support for resource sizes > 4MB. (I think the limit was increased to 16MB.)
Resources were a clever invention that facilitated the development of GUI-heavy apps on what by today's standards are ridiculously resource-constrained computers. (The Apple Watch series 3 has more than 60,000 times the RAM of that first Macintosh — although only a third the screen resolution. :-)
Resources also enabled a limited kind of “view source” that helped a generation of programmers learn their way around Mac application structure. You couldn't view the actual code source, but you could browse the GUI resources of any application you could get your hands on. (This is similar to the modern web, where the use of webpack, Babel, uglification, and compile-to-js languages means the actual source code to a complex web site is not accessible, but the assets are.)
As of MacOS 10.0, which built on the Unix- (Mach-)based NextOS, resources (multiple data within a file; one OS file per UI file) were replaced by Bundles (many OS files, in a directory, per UI “file”). Bundles are a much better solution for a world with a heterogeneity of operating systems (macOS, Windows, Linux and other Un*xen), where files and tools need to port between multiple file systems. Although bundles come with their own portability problems.