Hacker News
The long, tortuous and fascinating process of creating a Chinese font (qz.com)
276 points by tosh 6 days ago | 152 comments

Are there any Unicode experts here?

I've been exchanging emails with Richard Cook (the Unihan maintainer) about getting some "rare" Taiwanese characters added to Unicode. I say "rare" because they're in the Bible, which I think should be covered as a basic text (it's the most-read book in the world!)

My research into word spacing, font issues, and more is covered on the blog at https://pingtype.github.io (click the Docs or Blog header). Practical suggestions for better web design are also welcome.

Regarding fonts, this is my specific rant that made me move from Heiti to Pingfang. Unfortunately forcing users to download 13 MB of Pingfang font was too slow for mobile, so I decided to disable it for the web version of Pingtype.


Edit: These are the IDS codes of the missing characters. Photo evidence from a paper Bible:


  ⿱髟煮 chhang.jpg Job 39:19, Job 4:15

 ⿸疒粒 (not 𤷟) liap.jpg 1Sa 5:6, 1Sa 5:9, 1Sa 5:12 ... (17 found) - also see WikiSource.

 ⿱⿳亠口冖足 37106亮足 lo-.jpg Deu 1:28, Deu 2:10, Deu 9:2

  ⿰牜周 tiau.jpg 1Ch 17:7, 1Sa 24:3, 2Ch 14:15 (25 found, although 2Ch 14:15 uses 牧 in the paper version)
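For anyone unfamiliar with IDS (Ideographic Description Sequences): each code above is a prefix expression in which an operator such as ⿱ (top/bottom) or ⿳ (top/middle/bottom) is followed by its components. Here is a minimal Python sketch of how such a sequence can be read into a tree. It assumes only the twelve original operators U+2FF0 to U+2FFB, and the names `IDS_ARITY`, `parse_ids` and `parse_full` are my own, not from any library:

```python
# Ideographic Description Characters U+2FF0..U+2FFB and how many
# components each one takes. Two of them are ternary; the rest are binary.
IDS_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDS_ARITY["\u2FF2"] = 3  # left / middle / right
IDS_ARITY["\u2FF3"] = 3  # top / middle / bottom


def parse_ids(text, pos=0):
    """Parse one IDS starting at text[pos]; return (tree, next_pos)."""
    if pos >= len(text):
        raise ValueError("IDS ended before all components were supplied")
    ch = text[pos]
    arity = IDS_ARITY.get(ch)
    if arity is None:
        # An ordinary character is a leaf component.
        return ch, pos + 1
    pos += 1
    children = []
    for _ in range(arity):
        child, pos = parse_ids(text, pos)
        children.append(child)
    return (ch, *children), pos


def parse_full(text):
    """Parse a string that must consist of exactly one IDS."""
    tree, end = parse_ids(text)
    if end != len(text):
        raise ValueError("trailing characters after a complete IDS")
    return tree


if __name__ == "__main__":
    for ids in ["⿱髟煮", "⿸疒粒", "⿱⿳亠口冖足", "⿰牜周"]:
        print(ids, "->", parse_full(ids))
```

The nested case is the third one: ⿳ groups 亠, 口 and 冖 into a single top component, which ⿱ then stacks on 足.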

As a native speaker (well, reader), I think I agree with Unicode's opinion that these are stylistic differences only. To me it's like you are trying to argue that the single-storey lowercase "a"[1] is wrong and every font must be in the double-storey style.

[1] https://en.wikipedia.org/wiki/A#Typographic_variants

Since we're on the subject, I'd like to take the time to mention UNIHAN, and also an effort I've created to make historical / regional variants of CJK characters more easily accessible.

UNIHAN is Unicode's Han Unification effort. It handles the variant issue - but actually also goes a step further, citing information from paper books, including stroke information, definitions, and even pronunciations [1].

I've compiled an overview of UNIHAN at https://unihan-etl.git-pull.com/en/latest/unihan.html.

unihan-etl is a project I've created that allows extracting the contents of UNIHAN's database: https://unihan-etl.git-pull.com. It can be used as a Python library, or as a self-serve export of the database to a tabular or structured format.
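To give a feel for what is being extracted: the raw UNIHAN data files are plain text, with `#` comment lines and one `U+XXXX<tab>field<tab>value` entry per line. The following is a hand-rolled sketch of reading that layout, not unihan-etl's API. The function name `parse_unihan` is made up, and the sample rows are illustrative rather than copied from the database:

```python
from collections import defaultdict

# Illustrative rows in the UNIHAN text layout (values are assumptions,
# not copied from the real database files).
SAMPLE = """\
# sample excerpt
U+4F0A\tkCantonese\tji1
U+4F0A\tkTotalStrokes\t6
U+98DF\tkCantonese\tsik6
"""


def parse_unihan(lines):
    """Group tab-separated UNIHAN entries into {character: {field: value}}."""
    records = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        char = chr(int(codepoint[2:], 16))  # "U+4F0A" -> the character itself
        records[char][field] = value
    return dict(records)


if __name__ == "__main__":
    for char, fields in parse_unihan(SAMPLE.splitlines()).items():
        print(char, fields)
```

A real export would need to handle fields whose values hold several space-separated readings, which this sketch leaves as a single string.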

In addition, there is something I've worked on to make this data also available in SQLAlchemy / DB form: https://unihan-db.git-pull.com

And also using it as a basis for a spiritual successor to cjklib [2]: https://cihai.git-pull.com

[1] https://www.unicode.org/reports/tr38/

[2] https://github.com/cburgmer/cjklib

I'm curious, how popular is the bible in Taiwan? And why does it use Taiwanese characters not commonly used elsewhere?

I don't know about the former, but I can answer the latter (and the presumably implied question of why doesn't Unicode cover what should be a fairly common use case?).

This Bible is not written in Mandarin, but rather Hokkien/Min (I suppose it's more correct to say "Southern Min," but I'm used to saying Hokkien and you'll see Taiwanese Hokkien as a term appear). And this is actually a kind of interesting statement about the written language. If you've ever talked about languages/dialects in the Chinese language family you might have heard something like "They might sound different but they all share the same underlying written language and are hence dialects" or the opposite "They're entirely separate languages akin to the Romance languages with their own spoken and written languages that might occasionally be similar/have cognates across different languages."

The situation is a bit muddier than either of those statements might suggest. As this example demonstrates, it's certainly NOT true that if you just wrote down any language/dialect in the Chinese language family it's intelligible to a Mandarin speaker. Chinese Unicode caters towards Mandarin and as you can see there are certain characters here that essentially never show up in Mandarin (and hence a Mandarin speaker might not recognize them). There are also characters that a Mandarin speaker might recognize without understanding their meaning in this context (e.g. some of the pronouns, such as the third-person pronoun 伊, which is normally 他/她/它 or 其 in standard Mandarin depending on how formal the setting is). And certainly any Mandarin speaker can attest to the fact that there are certain languages/dialects that are not verbally mutually intelligible.

However, there often exists a written "embedding" of Mandarin into these dialects/languages that doesn't really have an analog among e.g. the Romance languages. For example, Cantonese pop music is often written in Mandarin, but sung in Cantonese, which means that each Mandarin character has a proper Cantonese pronunciation. It comes off as "Mandarinized" and it's not the way people normally speak, but there's a standard, well-defined way of doing this, and it's intelligible (well at least in certain settings; if you speak like that normally you're going to have a problem). There isn't really a similar way of doing this for French and Spanish. You can't just take a Spanish work, pronounce it using a French pronunciation, and then voilà have a French work. It's not even entirely clear for all words what "pronouncing it with a French pronunciation" means.

Nonetheless it's not the same thing as having a shared written language. Standard written Cantonese is going to be tricky for a Mandarin reader to read, just as this Hokkien Bible is going to be tricky for a Mandarin reader to read. And even this "embedding" I'm talking about isn't perfect since there are some Chinese characters that show up commonly in Mandarin but rarely in, say, Hokkien and vice versa. Having a shared set of characters only gets you so far when they can have somewhat divergent meanings and very divergent pronunciations. And it's not even the case that all dialects/languages have standardized written forms.

Regardless, despite the fact that there are many named languages/dialects in the Chinese language family, the mutual intelligibility situation isn't quite as fragmented as it's occasionally made out to be. There's a handful of large families that share a pretty high degree of mutual intelligibility among dialects/languages within them. A few examples are the Mandarin family of dialects (of which Standard Mandarin is a member), which probably has the broadest geographical distribution, running mainly in a diagonal from the Southwest of China (Sichuan) up to the Northeast (Heilongjiang), the Cantonese family of dialects, the Wu family of dialects, and the Min family of dialects.

Returning to this Bible, the funny thing here is that the preface to this Bible is written in Mandarin, but the Bible itself is written in Hokkien (which hints at some of the weird interconnectedness among Chinese dialects/languages). In turn the Chinese characters here are a transliteration of a previous Romanization of an Amoy (Xiamenese Min) translation of the Bible.

This post ended up running away from me a bit, so just to recap the original Unicode answer: Chinese Unicode caters towards characters that show up in written Mandarin and this isn't written Mandarin.

I tend to think of it as classical versus Vulgar Latin. I can write Cantonese phrases word for word (with a few Cantonese-only characters) and Mandarin-only speakers won’t be able to understand it, or I can rephrase it into a “classical” way so that any literate Chinese person from any dialect can read it.

Just to clarify for future readers, "classical" here presumably does NOT mean Classical Chinese (文言文).

> There isn't really a similar way of doing this for French and Spanish. You can't just take a Spanish work, pronounce it using a French pronunciation, and then voila have a French work. It's not even entirely clear for all words what "pronouncing it with a French pronunciation" means.

I guess this is similar to how every written English word has a well-defined Spanish pronunciation as if it were written Spanish. You wouldn't always get something intelligible from an English listener's point of view, but by the rules of Spanish pronunciation, many English words have a similar pronunciation when read as Spanish. I don't think I've ever heard a song do something like you described between these 2 languages, though, and I'm not sure how it could be made to sound good either.

The rather interesting features about this particular "embedding" is that it results in a written language that is intelligible for both Mandarin and Cantonese speakers and that it also happens quite frequently in certain settings. In fact, it even comes off as "correct." That is it's not as if you're reading broken Mandarin or broken Cantonese and are able to piece together the meaning, as a speaker of a parent language of a pidgin language might feel when reading the pidgin. It's a full-fledged language that reads entirely naturally to a Mandarin speaker and just feels "Mandarinized" but otherwise "correct" to a Cantonese speaker.

The Spanish-English example you give really only is a result of Spanish's regular correspondence between spelling and pronunciation (as far as my limited Spanish knowledge allows me to understand), which allows you to unambiguously pronounce most combinations of Latin letters. For example, there's a "Spanish pronunciation" of Quechua written in Latin script, but it's basically an unintelligible mess of sounds that violates a lot of Spanish sound rules. If you use some conventions to group letters together and settle on a single pronunciation of each of those groupings, you can do the same in English with any Latin letter combinations as well.

Generally speaking, pronouncing written English as Spanish results in something that neither Spanish speakers nor English speakers understand nor accept as part of their language. While there is a subset of vocabulary (most notably -ion words) that shares the same or similar spelling between the two languages, there isn't really a cohesive subset of the written language (that is imagine writing entire sentences and paragraphs) that can be shared.

Ah. I understand now. Man, that's cool for a song to do that!

Like you said, that's not really doable for Spanish-English. There are words like "similar" which are written and pronounced the same in both languages, but there aren't enough of them to write sentences.

How different languages interact is always cool :).

FWIW it's not just songs; standard Mandarin is usually the prestige dialect/language so anything that's perceived as being more "prestige" than an everyday conversation (e.g. things that are meant to be formal or official or whatever) tends to become Mandarinized among the different dialects/languages. Cantonese is a special example because of Hong Kong's own interesting social environment.

For most other dialects/languages in the PRC, speakers just switch directly to Mandarin where they might have used a "Mandarinized" part of the dialect/language. Hong Kong is one of the few places where Mandarin comprehension is still not near-universal and so instead of just having Mandarin pop songs, you end up with Mandarinized Cantonese pop songs.

In case you're interested in songs, this is a popular worship song that switches between Chinese and Taiwanese/Hokkien in different verses:

在呼召我之處 The Place Of Calling (國語+台語) - 約書亞樂團 Joshuaband https://www.youtube.com/watch?v=31q8w6JpndA

On the topic of cultural crossover, I can also recommend Inner Mongolian folk metal (traditional throat singing mixed with electric guitars is brilliant - 九大圣器 Nine Treasures, 杭盖乐队 Hanggai). For Taiwanese, check out 滅火器 Fire EX., 一步 One Step, 閃靈 ChthoniC.

There are some early Bible translators who took their source texts and translated them word-for-word into their target language, preserving even the word order of the source where it differed from the target.

> For example, Cantonese pop music is often written in Mandarin, but sung in Cantonese, which means that each Mandarin character has a proper Cantonese pronunciation.

I don't think that's quite correct. It is true that many older Cantonese pop songs were not written in Cantonese, but they were not written in Mandarin either; instead they were in what's called Written Vernacular Chinese (白話文)[1], which is based on a subset of Mandarin.

As an example off the top of my head, 咱們 is a very common word in spoken Mandarin, but it is never used in formal documents (written vernacular), and a Cantonese-only speaker would have trouble pronouncing it.

[1] https://en.m.wikipedia.org/wiki/Written_vernacular_Chinese

Oh man... there's a long way to go down this rabbit hole...

Maybe I've lost touch with the current Canto-pop scene, but I seem to recall quite a lot of modern-ish Canto-pop (late 2000s) being written in Mandarin.

白话文 is sort of ill-defined. It really just means "vernacular written language" as opposed to Classical Chinese (文言文). There is a Yue vernacular (粤语白话文), Wu vernacular (吴语白话文), etc. There is also a Mandarin vernacular (which is what 白话文 usually refers to by itself in Mandarin) that you can further specify as 官话白话文 (although that term is getting quite specialized and not every native speaker is necessarily going to have heard of that), but that is exactly what it sounds like, the semi-formal written Mandarin vernacular. This is exactly the same situation as standard written English. While this is a restriction of "English" in some sense (does this include regional-specific terms such as "yous?"), it still is more or less equivalent to what most people think of when they think of "English."

咱们 is common in Northern dialects of Mandarin, but used to be a pretty strong marker for a Northern Chinese speaker. That's changed somewhat over time due to broadcast media and the primacy of some Northern Chinese accents, but you're still not going to hear a ton of older Southern Chinese speakers use 咱们 when speaking Mandarin (and when I mean south I mean essentially anyone from Jiangsu/Zhejiang on down) even if they speak pitch-perfect Standard Mandarin. They almost exclusively use 我们 instead (which is also valid Standard Mandarin, everywhere you would use 咱们 you could also use 我们 although not vice versa).

I think this is the phenomenon you're picking up on, that a lot of Northern Chinese Mandarin terms don't show up in Southern Chinese Mandarin.

More broadly, I think what you're pointing out is that "Mandarin" is really itself a fuzzy category that starts falling apart when you poke at its seams, but really that's just the nature of language. "English" starts falling apart when you poke too hard at its particular words as well.

The central point about the embedding of Mandarin though is that it reads completely fluently to a Mandarin speaker. It's not a pidgin, it's not a situation where the Mandarin speaker can understand it in bits and pieces. It reads like anything a Mandarin speaker would have written.

I totally agree with you that all these terms are ill-defined and fall apart once you "poke at their seams" as you put it. I think my point was that 1) the formal written forms across various regions have slowly converged into roughly the same form over the past century or so; and 2) this "formal written form" might be based on Mandarin, and reads like everyday modern, spoken Mandarin, but is not exactly 100% Mandarin.

For modern Cantopop, the recent push to match spoken Cantonese more closely is due to an increasing dislike of Mandarin among the young in Hong Kong.

Ah, but therein lies the rub. What is "100% Mandarin?" :). Is it broadcast news Chinese? Which essentially is just the written vernacular standard just read out loud? Is it the Mandarin that a Northern shop-keeper uses to talk to (Chinese) out-of-towners? Is it the Mandarin that he uses among his friends? Is it the Mandarin that kids learn in public school curriculum? Once these questions start to be relevant, I'd say that it's all just "Mandarin" for the purposes of this discussion.

Ooo... that's interesting. I knew that there's been a rise in regional dialect pride in general and that Hong Kong in general has never really been too fond of the Mandarin-speaking Chinese tourist masses, but hadn't realized it'd spread to modern Canto-pop. Might have to take a gander at some of the latest songs.

Indeed I should have been more careful and said "100% Beijingese" or "100% Mandarin as spoken around Beijing" instead. :)

> There isn't really a similar way of doing this for French and Spanish. You can't just take a Spanish work, pronounce it using a French pronunciation, and then voila have a French work. It's not even entirely clear for all words what "pronouncing it with a French pronunciation" means.

Maybe Bokmål is similar in this regard? The Wikipedia page says it's "Norwegianized variety of Danish": https://en.wikipedia.org/wiki/Bokmål

Caveat: My knowledge of Norwegian is approximately the level you would achieve with 5 hours of searching on the internet after being exposed to the Ylvisåker brothers and an armchair amateur's knowledge of Old English.

The whole story of how the various Scandinavian languages developed is actually quite interesting and has a lot of similarities to the situation among the different languages/dialects of Chinese. In particular, similar questions of mutual intelligibility come up.

However, as I understand it, Bokmål in this particular case is not quite the same. It historically came from Danish, but now is a written standard for Norwegian that competes with Nynorsk (which is in itself a fascinating phenomenon; not many other languages have competing written standards). It's not really a current embedding of Danish into Norwegian so much as it is a written standard for Norwegian that historically developed from Danish.

About 5% of Taiwanese are Christian, but culturally Christianity has played a bigger role than you might expect based on that number.

If the Taiping Rebellion had been successful, China might have become a (very unorthodox, not accepted by any Western church) Christian theocracy. https://en.wikipedia.org/wiki/Taiping_Rebellion

The short description of Taiping's policies [1] makes them sound like a Christian Taliban, or the society of Margaret Atwood's "The Handmaid's Tale".

[1] https://en.wikipedia.org/wiki/Taiping_Rebellion#Taiping_Heav...

One quarter as many deaths as World War 2, all of it in southern China.

When I first learned of it, years ago, I was stunned at two things - the scale of loss of life, and how little known it was ...

Something to think about is that in many languages and dialects, the Bible is frequently one of the premier pieces of vernacular literature written in the register that it is spoken. In the case of already-literate societies, until modern nationalism, typically what they write is a classical version of what they speak, which they regard as not prestigious enough to put to pen.

> Are there any Unicode experts here?

Have you considered sending Ken Lunde (@ken_lunde) a tweet? He's quite the expert in CJKV and he usually responds promptly to a tweet.

The example regarding wang4 seems like something that should be handled in terms of a Simplified/Traditional variant switch. On the other hand, the unusual characters in the Taiwanese Bible seem to be characters without meaning outside of Taiwanese (or more generally, Min), but that have specific semantic meaning in Taiwanese, and should be encoded in Unicode.

There is no Taiwanese. Just Chinese.

By Taiwanese, parent comment was referring to Taiwanese Hokkien, which is mutually unintelligible with Standard Mandarin. Even in writing, Taiwanese written at its spoken register is at best read with great difficulty by Mandarin speakers.

Hokkien is used on the mainland also, especially in southern Fujian, of course. Not that it matters in any case; I’m sure Taiwanese Hokkien has drifted over the last century.

What is interesting is that Fujianese were the first to go abroad, so Minnanhua can be more common than Mandarin in SE Asia and even Europe.

Get your facts straight. Hokkien, or Minnan Proper (閩南語/閩南話), is a Southern Min dialect group of Chinese. Ask anyone in Fujian Province.

I don't think we are in contradiction. Taiwanese Hokkien is a variety of Min, which is a dialect that shares ancient roots with the Mandarin dialect. My maternal grandmother speaks Teochew, an offshoot of the more common standard Hokkien in the area where I live.

Unlike Mandarin, most dialects have not undergone the same nationalist push to write their spoken language, so literature in the spoken language is scarce. However, Christian proselytizers are often very determined to translate the Bible into local spoken languages, and the Bible is often one of the first significant works in any spoken language/dialect's literature. Hence I am not surprised that it is in the Bible that the top commenter found digitally-disadvantaged Chinese characters not encountered in literature outside of Taiwanese Hokkien. You may be right in a sense that these same characters may be encountered in other Min dialects, but they would still be pronounced differently.

there are plenty of regional and area-specific dialects of traditional chinese that would require different glyphs

You're factually right. But this is HN and every time we discuss something of Chinese culture, the comment will always be "welll actualllly" or something about Taiwan that's completely irrelevant just because people here hate China. Oh well, it's how it goes.

No offense and not to sound political here, but there is no such language as "Taiwanese". People in Taiwan write in Traditional Chinese characters and use Mandarin (the official spoken language and the majority) and Southern Min (or Hokkien proper)[1] as their spoken languages. Granted, they might have invented a few characters here or there and have variations in the pronunciation of some words, but if you ask a linguist who is an expert in Sino-Tibetan languages, they will probably tell you the same thing.

[1] https://en.wikipedia.org/wiki/Southern_Min

Eh... "Taiwanese" is usually pretty well-understood to be Taiwanese Hokkien/Min, both in English and Mandarin Chinese (台语/台語/台湾话/臺灣話).

At least as a noun. When used as an adjective ambiguity definitely exists as to what language you're referring to.

Hokkien is a type of Chinese dialect. The only reason people rush to call it "Taiwanese" is because of politics, not culture.

There's certainly a ton of political baggage around the Chinese language and layered on top of that there's a crazy amount of political baggage around China and Taiwan (e.g. is it called 正体字 or 繁体字?).

I don't think that's the case here though. Plenty of Taiwanese people I've spoken to are happy to (and often do) call it Hokkien (or Taiwanese Hokkien). Taiwanese as an adjective is usually ambiguous (Taiwanese Mandarin, Hokkien, or aboriginal Taiwanese languages are all possibilities). Taiwanese as a noun is usually less ambiguous (although the confusion in this thread makes me rethink that).

Hokkien (Fujianese) is to Southern Min what Cantonese is to Yue and to some extent Shanghainese is to Wu. They are place-name terms that nominally refer to a specific subset of a dialect family that are now used in informal speech to refer to the entire dialect family. But for example Macaonese (澳门话) is still a place-name term you'll hear even though it's within the Yue family and therefore you could (and people do) call it Cantonese. Taiwanese is a similar case (place-name term) here.

There's certainly a fair amount of people from Taiwan who call it Hokkien as a matter of accuracy

I don't think that's the case here though. There's nothing specifically different from the Hokkien spoken in Taiwan. It's a dialect of Chinese and certainly cannot be called linguistically different. It would be like saying someone spoke Texan instead of English with a Texas accent.

The written language is traditional Chinese, nothing else. This is a common misunderstanding

Chinese has a long tradition of using place-names to refer to regional specific dialects that in English we might just call, as you put it, "English with an X accent."

In the same way that you'll hear 重庆话 (Chongqingnese) and 成都话 (Chengdunese) both used to refer to region-specific dialects of Sichuanese (which in turn is really a member of the greater Mandarin (官话) dialect family) and 上海话 (Shanghainese) and 苏州话 (Suzhounese) used to refer to different region-specific dialects of Wu, Taiwanese is being used in a similar fashion here. It is indeed a dialect of Hokkien that is mutually intelligible with and very similar to other varieties of Hokkien. The differences are, as you imply, on par with the differences you might find between a Texan accent and a New York accent. But there's nothing special about Taiwanese here. This is just the way all region-specific dialects of Chinese are named.

The written language is kind of sort of traditional Chinese (depending on how liberally we're defining "traditional Chinese" here), but it's a stretch. The grammar is not the same as modern Mandarin (nor is it the same as Classical Chinese). Moreover, not all the characters used are attested in official Chinese sources. For example, the first character that the original comment way way up circled in his Flickr image is not found in any ancient Chinese dictionaries I know of (Guangyun/广韵, Shuowenjiezi/说文解字, Qieyun/切韵, Kangxi Dictionary/康熙字典) nor is it found in the modern Xinhua Dictionary 新华字典. And I don't have a copy handy but I wouldn't be surprised if I couldn't find it in the evocatively named Cihai/辞海 (Word Ocean). This is presumably why such characters are not in Unihan. A much larger set of characters are those that very very few Mandarin speakers recognize. And another (independent) large set of characters do not have, and have never had, the meaning they have in Hokkien in other varieties of Chinese.

That's not to say there's not a lot of overlap with written Mandarin. A Mandarin speaker could probably muddle their way through this Bible if they tried, but then again, a French speaker could muddle their way through Haitian Creole. And even an English speaker could maybe get through a French newspaper given enough cognates. This is a far far cry from the "embedded Mandarin" situation I talked about in another comment though.

Maybe it's not quite something like https://en.wikipedia.org/wiki/Zhuang_logogram which is Chinese-looking, but definitely not what people would normally call Chinese characters, but for a few of these characters, it's getting pretty damn close.

Thanks very much for sharing some of the detailed background here. Stuff like this is what keeps HN a bit magical.

Not really, many Taiwan residents conduct business in Standard Mandarin, but speak in Taiwanese Hokkien at home. It's called Taiwanese because it has since drifted from mainland Hokkien, and sociolinguists consider Taiwanese a prestige dialect of Hokkien due to its cultural output and Japanisms.

He didn't claim the language was Taiwanese, only the characters. That seems reasonable for Chinese characters that were used predominantly in Taiwan, no?

This is a somewhat confusing way of putting it. This Bible isn't written in Mandarin (PRC or Taiwanese dialects thereof) and that's why it's using these non-Unihan characters. A Taiwanese Mandarin version wouldn't be using these characters. This Bible is written in (Taiwanese) Hokkien and that's why the GP is having issues with finding these characters in Unihan.

EDIT: I do appreciate that "Taiwanese" as in Taiwanese characters is ambiguous. I initially also read that as "characters used in works written for a Taiwanese audience in Taiwanese Mandarin."

The characters are all Chinese. Same characters can be found in written Cantonese dialect of Chinese.

GP never even implied there was a separate Taiwanese language, only that the characters were Taiwanese.

And in fact there are a number of Taiwanese languages, or Formosan languages, as Han Chinese settlement of the island is a fairly recent thing. The Taiwanese aborigines are Austronesian people - and in fact many historical linguists consider Taiwan to be the original homeland of the Austronesian language family.

These have of course nothing to do with the Chinese character set issue raised in the above comments.


It's really cool how advanced the Austronesians were in maritime navigation! I like to imagine that if history went down a slightly different path, we would be talking about Austronesian imperialism!

They were indeed impressive navigators - I think the most amazing feat is that they were the ones who first settled Madagascar, and just look at the distance from Southeast Asia - but I think they lacked most of the technological and societal prerequisites for building actual empires. Particularly, writing.

Jared Diamond's Guns, Germs and Steel is obviously the go-to book in this area for introduction of why some people built empires and others did not.

There's no such thing as Taiwanese characters. The island uses Traditional Chinese characters

Colloquially, most English speakers would recognize "Taiwanese" and "Traditional Chinese/Traditional Mandarin" as interchangeable.

Actually, that's not what parent comment meant. By Taiwanese he meant the Taiwanese Hokkien dialect, which is what is spoken by many Taiwanese at home that is mutually unintelligible from Mandarin. It has words that do not correspond neatly to Mandarin or classical Chinese, and so historically, writers have invented characters specifically to write Min languages.

It's unfortunate that awareness of the many Chinese dialects/languages is pretty poor in the Anglosphere, and parent comment could be clearer with what he meant by Taiwanese. It was obvious to me that it was Hokkien that he was referring to, but that is only by virtue of me having familiarity in the matter.

Minor typos in the article: In using the word "Horse" to show Chinese character evolution, the "Regular" is marked from 220 AD to 907 AD. As a matter of fact, characters of that kind were almost the "standard" in Chinese before the Chinese government simplified many characters around 1950. Even now, the Republic of China (a.k.a. Taiwan) still recognizes the "Regular" characters as the standard. Among Chinese people around the world, they are also known as the "Traditional" characters.

It’s funny because even in Chinese there’s widespread disharmony with respect to “complicated” (繁體字/繁体字) or “regular” (正體字/正体字) script, as opposed to “simplified” script (簡體字/简体字). (Left-hand side phrase is in traditional/complicated script, while right-hand side is in simplified, for comparison).

Even many of the “regular” characters have been simplified. Consider 吃 and 喫: they both mean to eat, but the one with fewer strokes became really the only modern choice to use (however, Japanese still uses the old variant). Another common one is a simplification of the first symbol for Taiwan (臺灣). 台 is often used in place of 臺.

Off the top of my head, the only place I can recall the 喫 character appearing is in the word 喫茶店 (coffee / tea shop). For "to eat", Japanese would use 食べる or 召し上がる. After referring to a dictionary, there is a word 喫する, but it's not common (as in I don't recall ever hearing or learning this word) and it means more generally to consume by mouth, as in drink / eat / smoke. Yes, the Chinese 喫 means eat, but no, the meaning is not exactly the same and it is not used with the frequency of 食べる or 召し上がる.

Conversely, 食 also means "eat" in Chinese, but is now used almost exclusively in nouns like 食物 (food) or 食堂 (dining hall). The Chinese character inventory is simply too large to keep all possible uses, especially across different languages.

食 is still being used every day in Cantonese as a verb. Granted, there are some who classify Cantonese as a different language from Mandarin, as there are many differences between the two such as this example.


Right—I was only writing about the character form. 食べる is used to mean to eat, but as far as raw characters are concerned, Japanese doesn’t use 吃 but rather 喫, the non-simplified form.

Apparently there is a use for 吃 in Japanese, 吃 is a sound used to represent a tile being discarded in a game, in compounds such as 吃驚, to be surprised. I think this compound is also used in Mandarin with the same meaning. Quick check of my Chinese dictionary indicates yes, they are a shared character compound / meaning.

Edit: fixing up my Japanese

Japanese does use it, but it's rare: 吃(ども)る to stutter

Tangentially related question: do Chinese websites use a larger font size (especially for traditional script)? I find those characters hardly readable on Hacker News at the default font size.

Yes for the most part, just check out weibo.com baidu.com xinhuanet.com for example, but it's still possible to read at hn size, just not that comfortable.

"longform" chars, like 臺 are typically used in formal writing, like official government docs [marriage lics, passports, etc]. There are also the counterfeit resistant "banker's" number chars used in official docs too [https://en.wikipedia.org/wiki/Chinese_numerals]

Where can I read more about this topic?

I think "regular" was used as an example of a script style.

Funny enough whenever I see a tattoo on a westerner's body, not only is it usually wrong in the grammar/spelling sense. But it is ugly as hell. Would you let a 5 year old tattoo the word, "Strength" onto your body? That's akin to what I see when I see the typography/style of the tattoo. "Sir, not only does it not say Superman, the characters are backwards and missing strokes"

Any Chinese person who tells you the truth about what your tattoo says is being very kind to you, but most Chinese won't say anything bad since they have no reason to embarrass the person.

One time when I was still in college my family took a trip out to Mexico. I forgot what store we went into but the cashier asked my dad to write down the cashier's name (sorry forgot that too) into Chinese. My dad spent a decent amount of time to think of the proper characters, wrote it down and we were on our way. I still think about that incident a lot, like what if my dad was a jerk and wrote something stupid for this guy to get tattooed (he wouldn't). but even then he's essentially trusting my dad is not messing up his name in Chinese (it definitely wasn't something like Mark).

You'll see the same thing with japan randomly using english words (like one word of a song), though afaik they usually use them correctly (but very awkwardly).

But the chinese tattoos, and the english words, aren't meant for those native speakers to see. The tattoos are meant for other english speakers to see, and the words are meant for other japanese speakers to hear. They're not meant to be understood, so much as just being visually/audibly cool.

It looks good, and it sounds good, and it's in an environment where almost no one will understand it, so it really doesn't matter if the content is correct. The idea is sufficient.

Now of course, if you get a chinese tattoo misspelled and move to china, you'll look like a bumbling idiot. It's the same as being a native speaker and misspelling it.

The context/environment matters, when deciding how important that mistake is.

Reminds me of when I lived in China with my friend, anytime we left the city for some tourism in the boondocks, he'd wear this white T-shirt with nothing on it except big block characters on the front - "外國人". (In simplified though, which apparently I don't have on my phone). Just means "foreigner." Chinese people got the biggest hoot out of it.

外国人 in simplified.
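
For anyone curious how the two spellings differ at the encoding level, here's a minimal Python sketch (the variable names are mine, nothing here comes from the parent posts). It compares the two strings code point by code point; only the middle character changes.

```python
traditional = "外國人"
simplified = "外国人"

# Positions where the two strings use different code points.
diff = [i for i, (t, s) in enumerate(zip(traditional, simplified)) if t != s]

for i in diff:
    t, s = traditional[i], simplified[i]
    print(f"position {i}: {t} U+{ord(t):04X} -> {s} U+{ord(s):04X}")
```

So a font or keyboard that lacks the simplified form is really just missing one separate code point, not a different "style" of the same one.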

To be fair, I've seen a Taiwanese person with the English "spice girl" as a tattoo, which is a reverse-poor-translation, because it sounds like a singular member of a defunct pop band. Makes more sense in Chinese.

> Makes more sense in Chinese

Can you clarify what it means? I'm guessing something like "sassy" or "hot" but it's really unclear without knowing the cultural connotations.

you got it right, it means "hottie". (still awkward to tattoo on body though..)

>Funny enough whenever I see a tattoo on a westerner's body, not only is it usually wrong in the grammar/spelling sense. But it is ugly as hell.

Tattoo characters being "ugly as hell" aren't unique to Eastern/Asian script on Westerners. There is plenty of ugly Western/Latin script on Westerners too.

To be honest the type of person who would get random chinese writing is usually putting less priority on getting the exact details perfect than on making a mysterious statement. They are the least likely group to have Asian friends.


I found his comments to be interesting, and related in a train of thought way to the submission. I’m sorry it didn’t add additional interesting content for you, but it did for me, and I don’t find these sorts of comments to be a derail.

is my post a complaint or just an anecdote? the actual article talks about the style and design of a chinese character, which is what my anecdote is about. westerners ignore that aspect of character writing, which is incredibly important.

are you a mod? why don't you let the mods mod.

But in Chinese, “every character has to be adjusted,” says Su of Justfont. “Each one is its own image, with its own design needs.”

That's a key concept not only for font design, but also for learners of Chinese. For certain characters like 醫 you have to scale down or elongate the radicals to be balanced within a unified whole. Add the importance of stroke order and simplified vs. traditional characters, and learning basic writing skills (let alone calligraphy) gets really tricky.

Noob question here... why is stroke order important? Surely if the final result is the same, why should the order in which the strokes were formed matter?

Write an English T. If you do the horizontal stroke first it's really easy to ensure the vertical stroke starts exactly on the horizontal stroke. The other way round is harder. This matters in Chinese because there are lots of similar characters and sometimes the only difference is whether lines meet or cross. Even when there are other differences, getting it right helps the reader.

(subjective experience) When writing fast the result will be easier to read if you wrote with the correct order. Writing a character on your phone with the correct stroke order will increase the chances of finding the character you want.

Stroke order is less important than you might think, and than what most learning materials make it out to be. But an idea of the general order is crucial to writing fast, and also, quick writing that does not follow either (1) the standardized stroke order or (2) a traditional stroke order can look weird to native eyes.

Note that when the PRC standardized its stroke order, it ended up with stroke orders that differed from calligraphic tradition: most notably 右 and 必. In addition, different calligraphic styles and regional variations sometimes have different character variants and different stroke orders.

In short, whereas there are a few mainstream ways to write words in Chinese, just like how there are a few mainstream ways to handwrite disconnected Latin letters, there are many variations that one is expected to be able to recognize, just as how one is expected to recognize a wide variety of Latin letterforms (copperplate, Palmer, blackletter, closed vs. open a/g, etc.)

Sorry if that is a very ignorant question but why not move to a system closer to the latin alphabet with only a handful of signs ?

Quick answer: China does have a romanization system called pinyin for the Mandarin dialect which is quite accurate as long as you know the tones. Problems:

* Pinyin can't be used for other dialects, meaning someone in Guangzhou won't understand written pinyin.

* There are only something like 400 sounds in Mandarin, which means there are a lot of homonyms which makes pinyin not suitable in certain contexts.

* Switching a highly literate society with more than a billion people to a different writing system would be a massive undertaking.

The Vietnamese made such a switch from Chinese characters to a Romanization system based on Portuguese, but at a time when literacy in Chinese characters was relatively low and a colonial power (France) dominated the country and its bureaucracy.

I am not speaking about romanization, although it would surely be nice if all the cultures in the world used the same writing system. I am referring to what korea did with hangul.

Of course you address some of the pain points like the fact that China is way more literate than Korea was when Hangul was introduced.

Chinese is not a single language but spans a large subset of the Sino-Tibetan language family. They tend to mostly differ in their phonologies and not much in syntax and vocabulary, so written logograms could enable communication otherwise impossible. The script was also adopted by, for example, Koreans so they could do the non-verbal communication in written Chinese. You probably cannot expect this kind of uniformity across the whole Indo-European languages.

I don't know Chinese but I guess the text will become much more difficult to read. At least in Japanese there are many characters that have the same reading but different meaning. So with roman alphabet all those characters will be written the same way.

You mean apart from thousand years of history/records and billions of users?

I mean like Hangul

Hangul was possible because it explicitly targeted a single unified language (Korean was politically unified since circa 900 and the language was stuck).

And Hangul was also developed a long time ago. And Korean has way fewer users than Chinese.

Why should the number of users matter? The work factor per person is the same no matter how many people there are.

IMO, it's because Han characters are a fundamental part of the national identity, and China has a fierce national pride.

I was once involved with a software project, actually the DOS version of Lotus 1-2-3 2.4J, which bundled some Japanese fonts that were licensed from a Taiwanese font maker. The QA manager told one of the staff to print out every character and check them. I thought it was crazy but the junior guy came back a few weeks later with a list of mistakes that he had found. They were reported to the maker and a new updated version was received. This was at the end of the era when software was distributed on physical media (CDs in this case) and providing updates was a costly business.

Taiwan and Japan have different standard characters, with some stylistic differences that few people are aware of! https://en.wikipedia.org/wiki/Han_unification gives a nice table of some variations, if your OS's fonts support them!

I hate to say this, but I don't see the point in maintaining complicated old writing systems. (I mean, of course I see the historical and cultural value, but I don't see why should people keep using it)

You write a "new" Chinese character and then there is: a) no way to represent it on a computer unless you draw it b) no way of knowing how it's pronounced

Latin, Cyrillic, Arabic, Hebrew (ok, they have some common roots), Korean are much more maintainable and "portable".

No, Chinese won't be the new English. You get to write and conversate in English in a short time frame (1 yr). Not Chinese. And certainly the learning curve gets steeper the further you go.

That argument can extend to any complicated system developed over hundreds and thousands of years. Off the top of my head, English Common Law is an example as well as English spelling and grammar. English is a horrible language with massive numbers of grammatical exceptions. I thank the many gods that I was born in an English speaking country.

Speaking of not knowing how to pronounce a Chinese character, that is not entirely true. Many of the characters you see in Chinese have a radical in the construct that indicates a sound. Unfortunately tone is not indicated.

Chinese won't be the new English, I agree. But emoji, which function exactly the same way Chinese characters do (image representing a meaning) are already an English augment.

Sort of an additional aside, there is a way to type a Chinese character by radical assemblage. So even if you don't know how to pronounce it by pinyin, you can still retype it into an e-dictionary. Similar method is used to look it up in a traditional dictionary. I never bothered to learn as it requires a touch typist skill to be fast enough to abandon the pinyin entry.
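
A toy illustration of the lookup-by-components idea, in Python. The decomposition table is a tiny hand-made sample and `find_by_components` is a made-up helper; real input methods use their own code tables and are far more involved than this.

```python
from collections import Counter

# Tiny hand-made sample of character -> components, for illustration only.
DECOMPOSITIONS = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
    "休": ["亻", "木"],
    "森": ["木", "木", "木"],
}

def find_by_components(components):
    """Return characters whose components match the given multiset exactly."""
    wanted = Counter(components)
    return [ch for ch, parts in DECOMPOSITIONS.items() if Counter(parts) == wanted]

print(find_by_components(["木", "木"]))   # two trees
print(find_by_components(["月", "日"]))   # order of entry does not matter here
```

The point is only that you can find a character without knowing how it sounds, as long as you can name its pieces.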

English is a horrible language with massive numbers of grammatical exceptions.

A former colleague remarked that spelling bees are interesting only because English is such a horrible language. A German spelling bee, in comparison, might go on for weeks!

Fun fact, English spelling kind of made sense before a bunch of scholars wanted to put the Latin back into the language, and now spelling is divorced from pronunciation. As an example, why don't these words rhyme?

* Enough

* Through

* Though

Perhaps we should've gone through with one of these reforms: https://en.wikipedia.org/wiki/English-language_spelling_refo...

And there's also:

* Bough

* Thorough

Five separate words - with five different ways to pronounce "ough"! Spelling in English is a complete disaster. I pity those people who have to learn it as a second language.

> That argument can extend to any complicated system developed over hundreds and thousands of years. Off the top of my head, English Common Law is an example as well as English spelling and grammar.

True, and my point applies to those as well. However, even English pronunciation is not that bad and grammar is much easier than German or Latin or most Romance languages.

> Many of the characters you see in Chinese have a radical in the construct that indicates a sound. Unfortunately tone is not indicated.

Many, but not all (I think Japanese uses some of those hints as well). But if you have "character from some ancient text" that doesn't have that radical you're SOL.

> there is a way to type a Chinese character by radical assemblage. So even if you don't know how to pronounce it by pinyin, you can still retype it into an e-dictionary

Good to know

> Many, but not all (I think Japanese uses some of those hints as well). But if you have "character from some ancient text" that doesn't have that radical you're SOL.

The Japanese pronunciation hints are way less reliable than the Chinese. First, Japan inherited Chinese sounds and characters in drips and drabs. Because Chinese has so many dialects, and a dialect may pronounce a certain character a vastly different way, the onyomi today may be a presentation of old Fujian pronunciation or old Cantonese, not the Mandarin. Second, because the characters were introduced to different parts of Japan at different times, those characters were used for different purposes in the different regions in Japan. Today this means a single character can be used in different words with different meanings, with more than one onyomi. Nevermind what happens in kanji compounds.

If a Chinese character was suddenly introduced with a new radical, that would be akin to suddenly finding a new letter in the alphabet. Highly unlikely. When is the last time a letter was added or removed from the standard English alphabet? A quick google says the last letter added to the English language was J, added 1524. Currently there are 214 radicals in Chinese. A new character would almost certainly be a combination of those existing radicals. Furthermore, the character could still be systematically written in computer language with brush stroke entry. That character would simply have to be added at a Unicode level for image rendering, same as having to define Alien Emoji as U+1F47D so that it does not render as a ?
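
To make the code point part concrete, here's a small Python sketch using only the standard library. It looks up U+1F47D, prints its official Unicode name, and shows the bytes it becomes in UTF-8. Whether you actually see a glyph still depends on your fonts; this only shows the encoding side.

```python
import unicodedata

alien = chr(0x1F47D)  # the code point mentioned above

# The character's code point and its name in the Unicode character database.
print(f"U+{ord(alien):04X}", unicodedata.name(alien))

# The same character as UTF-8 bytes; a font still needs a glyph to display it.
print(alien.encode("utf-8"))
```

A newly encoded Han character would work the same way: once it has a code point, text can carry it even on machines whose fonts can't draw it yet.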

> When is the last time a letter was added or removed from the standard English alphabet?

To be a bit pedantic, https://en.wikipedia.org/wiki/Long_s you can still find newspaper articles in Google's archive that use it back in the mid-1800s.

True, I only looked up "added". Thanks for that detail. I know there are a couple of other letters that used to be common-ish but are no longer.

>even English pronunciation is not that bad

As a nonnative speaker, it sometimes feels like you just have to know the words individually in order to pronounce them correctly: https://en.wikisource.org/wiki/Ruize-rijmen/De_Chaos.

Amen to that stupid poem. I'm a USA born white girl, and that poem broke me. I tried to read it aloud and got part way through before my tongue / brain made me give up. It's evil.

Can you say squirrel? My trick with this word is to use different words. Start with the word "whirl". Repeat this word a couple of times to get your mouth around it. Then say the "sch" from the word "school". "Sch"... + ..."Whirl", "Sch".. + .."Whirl", "Sch". + ."Whirl", "Sch" + "Whirl", "Sch" "Whirl", "Sch-Whirl", "SchWhirl". Looks horrible typed but it works on the tongue. Just need to get that sound glide exiting "sch" into "whirl". If you end the "sch" with your lips pursed in anticipation for the "oo" sounds in school, it sets you up for the "wh".

From Wiktionary[1]:

>(Received Pronunciation, General Australian, UK) IPA(key): /ˈskwɪɹl̩/, /ˈskwɪɹəl/

>(Canada, US) IPA(key): /ˈskwɝl/, /ˈskwɝl̩/, /ˈskwɝəl/

I haven't really thought about this word before, I think I use the UK pronunciation /ˈskwɪɹəl/, and this word didn't come across as hard to pronounce.

[1]: https://en.wiktionary.org/w/index.php?title=squirrel&oldid=4...

I would say it "skwir-rul," but that might be a northeastern regional pronunciation?

So almost a combination of school's "sch" plus "rural".

I think you mean:

aɪ heɪt tu seɪ ðɪs, bʌt aɪ doʊnt si ðə pɔɪnt ɪn meɪnˈteɪnɪŋ ˈkɑmpləˌkeɪtəd oʊld ˈraɪtɪŋ ˈsɪstəmz. (aɪ min, ʌv kɔrs aɪ si ðə hɪˈstɔrɪkəl ænd ˈkʌlʧərəl ˈvælju, bʌt aɪ doʊnt si waɪ ʃʊd ˈpipəl kip ˈjuzɪŋ ɪt) ju raɪt eɪ "nu" ʧaɪˈniz ˈkɛrɪktər ænd ðɛn ðɛr ɪz: eɪ) noʊ weɪ tu ˌrɛprəˈzɛnt ɪt ɑn ə kəmˈpjutər ənˈlɛs ju drɔ ɪt bi) noʊ weɪ ʌv ˈnoʊɪŋ haʊ ɪts prəˈnaʊnst

ˈlætən, səˈrɪlɪk, ˈærəbɪk, ˈhibru (ˈoʊˈkeɪ, ðeɪ hæv sʌm ˈkɑmən ruts), kɔˈriən ɑr mʌʧ mɔr meɪnˈteɪnəbl ænd "ˈpɔrtəbəl".

noʊ, ʧaɪˈniz woʊnt bi ðə nu ˈɪŋglɪʃ. ju gɛt tu raɪt ænd kənˈvɜrs ɪn ˈɪŋglɪʃ ɪn ə ʃɔrt taɪm freɪm (1 jɪr). nɑt ʧaɪˈniz. ænd ˈsɜrtənli ðə ˈlɜrnɪŋ kɜrv gɛts ˈstipər ðə ˈfɜrðər ju goʊ.

Haha! The IPA is great because it has support for many languages, and is a recommended tool for students learning English in Taiwan. If computer keyboards used IPA instead of ASCII, I think we'd all be using it by now. But typing is pushing everything towards ASCII (not even Latin because of the lack of diacritics on US/UK/ZH-TW keyboards).

I made a Pingtype English for language exchange the other way, and that uses IPA instead of Pinyin. I often find that users prefer ㄅㄆㄇㄈ though (which I derive from IPA).


>If computer keyboards used IPA instead of ASCII, I think we'd all be using it by now

Not at all! English has really strong historical roots for its spelling, but these could be eliminated even without IPA. Spelling could be standardized instead of staying historical. ("Speling kud bee stenderdized insted uv staying historikel".) Even within ASCII.

I was pointing out that the parent poster fails to realize that English itself has a "complicated old writing system" with "historical and cultural value". Just as having different characters allows for homophones to be recognized easily, English does the same thing with many words that sound the same (sea/see/C, or to/two/too, etc.) There is a lot of history or culture in English spelling, for example all the silent -e endings.

I recommend familiarizing yourself with the languages that these Chinese characters encode to better answer that question. Homophony is extremely common in Mandarin; eliminating characters would make reading very very difficult.

Chinese is not the only language that uses these characters. Japanese and Korean do too. For these groups, the learning curve is much less; that something does not come easy to English speakers is not evidence of its inherent deficiency.

It's true that homophones are much more common in Mandarin than say, English. It's not, IMHO, a very compelling argument against moving the common language to a more phonetic system like Korean did with Hangul. Something like 施氏食獅史 is already incomprehensible to a native speaker (who isn't familiar with the text) when read aloud.
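
A quick way to see the collision, sketched in Python. The readings table is hand-entered for just these characters (pinyin with tone numbers), and `group_by_sound` is a hypothetical helper, not any library's API.

```python
from collections import defaultdict

# Hand-entered Mandarin readings (pinyin + tone number) for a few characters.
READINGS = {
    "施": "shi1", "氏": "shi4", "食": "shi2", "獅": "shi1", "史": "shi3",
    "馬": "ma3", "茶": "cha2",
}

def group_by_sound(chars, keep_tone=True):
    """Group characters by reading, optionally ignoring the tone digit."""
    groups = defaultdict(list)
    for ch in chars:
        reading = READINGS[ch]
        key = reading if keep_tone else reading.rstrip("12345")
        groups[key].append(ch)
    return dict(groups)

title = "施氏食獅史"
print(group_by_sound(title))                   # tones separate some of them
print(group_by_sound(title, keep_tone=False))  # without tones, one syllable
```

Even with tones kept, two of the five characters still land on the same reading, which is the whole joke of that text.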

But spoken Chinese is quite different from written Chinese, which allows for more economy. Often a single character will be used in writing where a two-character word would be used in speech. And people's names usually can't be determined by the pronunciation; they are defined by the actual characters. If Chinese moved to a strictly phonetic writing system, a lot of culture would have to adapt: conventions around signage, poetry, proverbs, names (this is a big one!) of people and organizations, formal writing, wordplay, etc.

So what you're saying is Chinese speakers generally use two languages/registers: one for spoken language and another for written. Which is another way of saying the writing system is not actually a good match for the spoken language in the first place and mostly exists for ceremonial reasons.

I would suggest someone figure out a sane way to capture spoken Chinese in a modern (i.e. easy to digitise) writing system but I doubt it would gain any traction because of the cultural implications of the script. Most Westerners see their writing system as a simple fact of life, the Chinese seem to see it as a sacred traditional craft in the same vein as forging steel.

> Homophony is extremely common in Mandarin; eliminating characters would make reading very very difficult.

Well, it would be as difficult as understanding a spoken phrase that uses such homophones. (NLP people would probably hate that though :) )

Yes, it is difficult to understand a spoken phrase. That's why speakers rely on shared context: physical place, relationship between speaker and listener, previous interaction/s between speaker and listener.... none of which is present in a book or newspaper article that's being plucked out of the context-ether.

> Japanese and Korean do too.

Japanese yes. Korean historically but not really these days. If you try browsing Korean Wikipedia or Korean top newspaper sites, it's Hangul and some Latin characters and no Hanja in sight.

> No, Chinese won't be the new English. You get to write and conversate in English in a short time frame (1 yr).

This merely means that you were (very likely) born in a society speaking English or some closely related west European language. You know, the "normal, easy" languages.

To add to that, the level of ability you can achieve in 1 year learning English is also doable in Chinese. That's how long I had learned the language before I went for an exchange semester in Shanghai, and from leaving the airport onwards I spent a week speaking Chinese with everyone (taxi driver, checking in at the hotel, finding an apartment, registering my residence with the police ...). Things got funny when I didn't understand at first that my landlord wanted me to pay a deposit, but when someone wants your money, they can usually find a way to communicate, so in the end it worked out fine.

The difficulty of Chinese is more in the "long tail" of rare characters. If you memorize 5 characters a day (it gets easier once you know all the basic components), after a year you'll be able to read 90% of most texts and get by without much difficulty, but the remaining 10% require double or triple the amount of memorization and after that there'll still be rare characters that even native speakers don't recognize immediately.
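
If you want to check that kind of claim against a text you care about, a rough Python sketch follows. The sample sentence and the "known" set are made up for illustration, and the range test only covers the basic CJK Unified Ideographs block, which is an assumption that leaves out the rarer extension blocks.

```python
def coverage(text, known):
    """Fraction of the CJK characters in `text` that appear in `known`."""
    chars = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    if not chars:
        return 0.0
    return sum(ch in known for ch in chars) / len(chars)

chars_per_day = 5
print(chars_per_day * 365)  # characters studied after one year at that pace

sample = "我喜歡喝茶，也喜歡吃飯。"   # made-up sentence
known = set("我喜歡喝茶也吃")          # made-up set of learned characters
print(coverage(sample, known))
```

Note this counts running characters, not words, so it says nothing about whether you know the multi-character words built from them.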

Where does the writer, who claims to be a graduate student in Chinese studies, of the excerpt[0] below go wrong?

This fairy tale is promulgated because of the fact that, when you look at the character frequencies, over 95% of the characters in any newspaper are easily among the first 2,000 most common ones. But what such accounts don't tell you is that there will still be plenty of unfamiliar words made up of those familiar characters. (To illustrate this problem, note that in English, knowing the words "up" and "tight" doesn't mean you know the word "uptight".) Plus, as anyone who has studied any language knows, you can often be familiar with every single word in a text and still not be able to grasp the meaning. Reading comprehension is not simply a matter of knowing a lot of words; one has to get a feeling for how those words combine with other words in a multitude of different contexts. In addition, there is the obvious fact that even though you may know 95% of the characters in a given text, the remaining 5% are often the very characters that are crucial for understanding the main point of the text. A non-native speaker of English reading an article with the headline "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS" is not going to get very far if they don't know the words "jacuzzi" or "phlebitis".

[0]: http://www.pinyin.info/readings/texts/moser.html

Yup, this is a demonstration of Zipf's Law, which affects all human languages: https://simple.wikipedia.org/wiki/Zipf%27s_law
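
For a feel of what an idealized Zipf curve implies, here's a toy Python model. It assumes frequency is exactly proportional to 1/rank, and the inventory size of 8000 is a number I picked for illustration; real character frequencies don't follow this exactly, so don't expect it to reproduce the figures quoted above.

```python
def zipf_coverage(top_k, vocab_size, exponent=1.0):
    """Share of running text covered by the top_k most frequent items,
    assuming frequency is proportional to 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)

# Coverage grows quickly at first, then flattens out: the long tail.
for k in (100, 1000, 2000):
    print(k, round(zipf_coverage(k, 8000), 3))
```

The shape is the interesting part: each extra thousand items buys less coverage than the thousand before it.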

Moser has been discussed to death in the past, his article is pretty exaggerated to say the least.

I think it's worth keeping in mind that for most of history, Chinese referred to a writing system that was shared between many different languages, each with different pronunciations, coming from different places (what is now modern China, Korea, Vietnam, Japan), so the writing system was a script designed around semantic meaning, not sound. This also lets it get used in Kanji where words may have multiple readings (let's not even get into how crazy that is).

Now that mandarin is de-facto "Chinese", the writing system no longer serves its purpose, and in fact a lot of the character simplifications made in the 20th century make radical substitutions based on sound, further complicating things.

Seems like he's trying to make a point just for the sake of it. For a start yeah as a German I don't understand the example headline he's giving there even though I started learning English some 20 years ago. So what does that tell us? Apparently English is even harder than Chinese.. Or maybe that you'll eventually stumble upon an unknown word every now and then and can quickly look it up.

Reading a Chinese newspaper knowing 2000 characters is absolutely possible. You can either just rely on good old guessing from context since, you know, the headline is usually followed by an article, or you just look that word up. It's really not like you will stumble at every other character since, like that guy even admits, 95% are already covered by the 2000 you know. Imagine reading the sports section for the first time. You might find a dozen characters or words you don't know, but after a week or two you'll eventually know enough Chinese terms regarding soccer. Same for any specific topic in any foreign language. As for the character combos, first of all while learning it's not like you memorize each character separately, you're usually given some examples of such combinations using characters you already learned, phrases to go along with it etc. For many combinations it's even pretty easy to guess what they mean.

> born in a society speaking English or some closely related west European language

No, not really. Though Romance languages are related to English via PIE. So, it might have been easier than if my mother tongue was Finnish or Turkish or a non PIE Asian language.

When reading, we consume whole words at a time, even in syllabic / phonetic alphabet languages.

English is a poor comparison, because it - having consumed vocabulary from many other latin alphabet languages and preserved their spelling, among other historical accidents - does not have a consistent pronunciation rule: you have to know the word to know the pronunciation or otherwise only guess at it, which is similar to the performance of an ideographic language with pronunciation hints like Chinese.

Latin languages with consistent pronunciation are better examples.

Hangul though, is a superb example: truly designed system, syllabic and compositional. (other readers: look it up, it's really cool how it works).

It's unfortunate you were downvoted over the details here. The big picture - many languages have a large process of rote memorization as a barrier to literacy and such a process is entirely optional - is a great observation.

I always found the Korean Hangul to be quite fascinating.

Although, I must admit, I continue to find the Chinese writing system to be quite beautiful. It's like I'm looking at a strange advanced alien script every time I see it. So, I hope that it never dies. This writing style is a treasure. It is a gift to humanity, to human history, and to human culture. We need to preserve it any way that we can.

I hope that one day, an advanced brain-computer-interface technology will allow us to just download a language package into our brains, and we can get instant Chinese comprehension in just 60 seconds. You can get zapped, and go: "I know Chinese." Or rather: "Wo dong zhongwen."

[1] Here is some background on written Korean.


[2] Here is an interesting comic to teach Korean in 15 minutes.


At the turn of the 20th century, many Chinese literati shared the same sentiment. This was when Western ideas were in vogue, when China was getting humiliated by Western powers, and when Communist ideas had currency. In fact, at one point the CCP had plans to eventually transition Chinese characters to phonetic symbols.

How it went was a classic tale of standardization without implementation. The first round of simplifications incorporated simplifications that were already in common use in informal settings, in calligraphy, and in ancient forms. This led it to be readily adopted. (fun fact; "traditional Chinese characters" are in some cases newer than simplified characters, where the character in question acquired a popular different meaning due to its use for phonetically similar words, and the old sense started to be used with a modified character to distinguish it from the borrowed meaning!)

The next round of simplification was too radical for the people, and so it was eventually rolled back.

> You write a "new" Chinese character

Why would you write a new character? How would others know what it means? Unless they figure it out from the radicals? (Which sounds lossy as hell.)

"New" in the sense it's something not present in the dictionaries/fonts (see the other comment about the Bible characters). I'm not suggesting you invent a new character.

Since we're on this topic: I'm curious, is there any "Google Fonts" for Chinese fonts? That is, a high quality free font repository.

I downloaded 1304 fonts from here: https://chinesefontdesign.com

I then wrote a script that used Harfbuzz to extract every glyph of 75,000 characters as PNG files. That took about 4 months to run on a spare computer, writing 500 GB to an external disk.

I now want to sort out the blank glyphs, but it's really slow even on USB 3. Instead, I bought a 2 TB upgrade for my MacBook Pro, passed down the 512GB to my MacBook Air, and now I'm copying the files from the external disk to the SSD. There's about 90 million files to copy, estimated time remaining 4 days. When I remove the blank images, I plan to use the data to make my own Chinese OCR using Tensorflow.
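
Back-of-the-envelope check on those numbers, as a small Python sketch. It only uses the figures quoted above; reading "500 GB" as decimal gigabytes and treating the 4-day estimate as exact are my assumptions.

```python
fonts = 1304
chars_per_font = 75_000
total_bytes = 500 * 10**9      # "500 GB", taken as decimal gigabytes
files = 90_000_000             # "about 90 million files"
copy_days = 4

# If every font had a glyph for every character, this is the most you'd get.
max_glyphs = fonts * chars_per_font
avg_file_bytes = total_bytes / files
files_per_second = files / (copy_days * 24 * 3600)

print(f"upper bound on glyph images: {max_glyphs:,}")
print(f"average file size: {avg_file_bytes / 1000:.1f} kB")
print(f"copy rate needed: {files_per_second:.0f} files/s")
```

With files that small, the per-file overhead matters far more than raw throughput, which fits the observation that the external disk is slow even on USB 3.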

The TTF files alone are 11.82 GB. If you can recommend a suitable file host, I could re-upload them for you.

I would love to see a Show HN about your amazing project sometime.

I submitted it last year, without gaining much interest.


Honestly most of what I've done since then has been data collection (song lyrics, movie subtitles, etc) instead of developing new features.

My favourite feature now is to read the song lyrics in church, find 4 characters I know, search my database of Christian song lyrics, load that into Pingtype, and sing along with the pinyin and understand the meaning. It's all automated, but I can't upload it because I've received copyright threats about redistributing the song lyrics. I'm not a limited-liability company (this is a side project) so I'd be personally liable for the consequences of putting it online.

I've done much more research to find new data sources. For example, 9gag helped me stumble upon a translated comic (Mixflavor & HowardInterprets). I transcribed all the comics, and I'm using it with my language exchange tutor every week. I decided I wanted to find more comics that are popular with my friends.

So I extracted my Facebook friends' liked Pages. (Yes, that sounds like Cambridge Analytica, but I did it myself using an AppleScript to scrape and some bash scripts to parse). I found 223,783 pages, in 865 Facebook categories. I reduced the Facebook categories to 30 of my own categories (Art, Music, Cooking, Driving, Pets, Shopping, Religion, etc). Then I found the top pages for each of those. So I know the most popular musicians in Taiwan. That's going to become a blog post and Show HN soon, when the paranoia about Facebook calms down.
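The category reduction and "top pages per category" step could look roughly like this in Python. It is a sketch under assumptions, not the bash scripts mentioned above: the `CATEGORY_MAP` entries, the `"Other"` fallback, and the function name are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from Facebook categories to broader personal ones.
CATEGORY_MAP = {
    "Musician/Band": "Music",
    "Record Label": "Music",
    "Pet Service": "Pets",
}


def top_pages_by_category(likes, category_map, n=3, default="Other"):
    """Return the n most-liked pages for each reduced category.

    likes: iterable of (page_name, facebook_category), one entry per like.
    Categories missing from category_map fall into `default`.
    """
    counts = defaultdict(Counter)
    for page, fb_category in likes:
        counts[category_map.get(fb_category, default)][page] += 1
    return {cat: counter.most_common(n) for cat, counter in counts.items()}
```

Keeping the mapping as plain data means the 865-to-30 reduction can be edited by hand without touching the counting code.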

I've sent you an email about reposting it.

We're always looking for good submissions that didn't get much interest, so if anyone knows of others, please email links to hn@ycombinator.com and we'll take a look.

If I may make a suggestion, the part that's of interest to me is what went into the making. I also wouldn't have taken special note of the old submission, because it doesn't tell me anything about how it works or how it came to be, and I have no actual need to translate to or from Chinese. It seems you've been very thoughtful and done a lot of work, which I think would be of interest to people, should you feel the inclination to describe it. All the best :)

Sounds fascinating! I guess you might not have gotten up votes because between the title and the front page of the link, as an English speaker I didn't know what I was looking at or why it was interesting.

How did you scrape using applescript?

What's your blog?

The scripts are pretty messy, but the process went like this. It was necessary to use AppleScript because the Graph API doesn't give access to friends' Likes (because of privacy issues, e.g. Cambridge Analytica). But AppleScript has access to everything through the GUI. (If anyone from Facebook reads this, please don't ban me - I'm just doing this to find out what my friends here like, so I can learn Chinese. I'm not selling this data!)

1. Get a list of IDs (I must be friends with them).

I manually maintain Lists of friends I met in each country. I went to my Taiwan list, scrolled down, and copied the source into TextWrangler. A few regex find-replace later, I had a list of all my Taiwanese friends' IDs.

2. Find-replace to make URLs.

My ID is 705630362, so the URL of my Likes page is: https://m.facebook.com/timeline/app_collection/?collection_t...

3. Scroll down and copy out. This is GUI-intensive, so run it on a spare computer.

tell application "System Events" to tell process "Safari" to key code 119

tell application "Safari" to tell front document to set download_source to do Javascript "document.documentElement.innerHTML;"

Repeat those two while download_source does not contain "<div class=\"_51lb\">". If download_source contains "The page you requested cannot be displayed" then exit repeat.

4. Write it to a file (use cat, not Apple's recommended code, in order to preserve Unicode).

5. Convert to text.

In my case, the HTML files took up 479 MB for 1576 friends. I wrote another script to convert them to text.

Split the HTML based on the "<a class=\"darkTouch _51b5\" href=\"" delimiter.

6. Post-processing!

Now it's time to do research. What are the most common likes? Just combine all the files using cat, and use a bash script to find the most common lines:

cat "input.txt" | sort | uniq -c | sort -n -r > "output.txt"
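Steps 3, 5 and 6 above can be sketched in Python as follows. This is an illustration of the logic only, not the original AppleScript or bash. The `scroll` and `get_source` callables are hypothetical stand-ins for the two AppleScript commands, `max_rounds` is an added safety limit, and keeping each href up to its closing quote is an assumption about what the text conversion extracts.

```python
from collections import Counter

END_MARKER = '<div class="_51lb">'  # marker string from step 3
ERROR_TEXT = "The page you requested cannot be displayed"  # from step 3
LINK_DELIMITER = '<a class="darkTouch _51b5" href="'  # from step 5


def scroll_until_loaded(scroll, get_source, max_rounds=1000):
    """Step 3: scroll and re-read the page until END_MARKER appears.

    Returns the final HTML, or None if the error page shows up or
    max_rounds is exhausted (max_rounds is not in the original).
    """
    for _ in range(max_rounds):
        scroll()
        source = get_source()
        if ERROR_TEXT in source:
            return None
        if END_MARKER in source:
            return source
    return None


def extract_hrefs(html):
    """Step 5: split on the link delimiter; keep each href up to its closing quote."""
    parts = html.split(LINK_DELIMITER)[1:]  # drop the text before the first link
    return [part.split('"', 1)[0] for part in parts]


def most_common_lines(lines):
    """Step 6: count lines, most frequent first (roughly `sort | uniq -c | sort -n -r`)."""
    return Counter(lines).most_common()
```

Tie ordering differs from the shell pipeline: `most_common` keeps equally frequent lines in first-seen order, while `sort -n -r` orders them by the rest of the line.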

I plan to write more about this soon, and I'll probably put it on the Pingtype blog. But I might put it on Medium, because people seem to like that these days. Maybe both. There's also my personal website, but I'm worried that people might complain about privacy, so maybe I should distance myself from it. I'm not afraid to write the comment here because we're all hackers.

This is what I use on my computer: https://www.google.com/get/noto/help/cjk/

This is a job that is ripe for automation from deep learning.

Already happened but I don't think the results are good enough to use: https://github.com/kaonashi-tyc/zi2zi

(2015), please.

I wonder if this is something that machine learning could help with? You could train an aesthetic model to make suggestions and tweak as necessary.

Someone has semi-successfully applied GANs to the problem https://github.com/kaonashi-tyc/zi2zi

Using it to create a new font would probably still require lots of manual labor to create training data and then check the output (you don't want to mess up the rare character appearing in someone's name...), but being able to easily interpolate should come in handy for exploration of the design space.

When the Macintosh was introduced in 1984, files had a data fork and a resource fork[1]. The data fork was normal file data. The resource fork was a map of (OSType, int16) -> data, where OSType[2] was a four-character resource type identifier such as 'MENU' to specify a menu, 'PICT' for picture, etc.

Sort of like MIME types, which were standardized a decade later[3].

Resources were limited to, I think, 4MB. 4MB was 32x the RAM capacity, and 10x the [floppy] disk capacity, of the original Macintosh.

One of the OSTypes was 'FONT'. Font data was simply a resource stuck in some file — the system file, or, as a kind of pre-web “web font”, an application.
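As a rough illustration of the (OSType, int16) -> data map described above, here is a toy in-memory model in Python. It does not reproduce the on-disk resource fork format; the class and method names are invented, and treating the ID as a signed 16-bit integer is an assumption.

```python
from typing import Dict, List, Tuple

# (four-byte OSType, 16-bit resource ID); signedness is an assumption here.
ResourceKey = Tuple[bytes, int]


class ResourceFork:
    """Toy model of a resource fork: (OSType, int16) -> bytes."""

    def __init__(self) -> None:
        self._resources: Dict[ResourceKey, bytes] = {}

    def add(self, os_type: bytes, res_id: int, data: bytes) -> None:
        """Store data under (os_type, res_id), validating both parts of the key."""
        if len(os_type) != 4:
            raise ValueError("OSType must be exactly four bytes")
        if not -32768 <= res_id <= 32767:
            raise ValueError("resource ID must fit in a signed 16-bit integer")
        self._resources[(os_type, res_id)] = data

    def get(self, os_type: bytes, res_id: int) -> bytes:
        """Return the data stored under (os_type, res_id); raises KeyError if absent."""
        return self._resources[(os_type, res_id)]

    def ids_of_type(self, os_type: bytes) -> List[int]:
        """List the resource IDs present for one type, in ascending order."""
        return sorted(rid for (t, rid) in self._resources if t == os_type)
```

For example, `fork.add(b"FONT", 128, font_bytes)` followed by `fork.ids_of_type(b"FONT")` mirrors how a font could sit alongside 'MENU' or 'PICT' resources in the same file.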

When we added support for CJK fonts, around 1990, we had to also add OS/file system support for resource sizes > 4MB. (I think the limit was increased to 16MB.)

Resources were a clever invention that facilitated the development of GUI-heavy apps on what by today's standards are ridiculously resource-constrained computers. (The Apple Watch series 3 has more than 60,000 times the RAM of that first Macintosh — although only a third the screen resolution. :-)

Resources also enabled a limited kind of “view source” that helped a generation of programmers learn their way around Mac application structure. You couldn't view the actual code source, but you could browse the GUI resources of any application you could get your hands on. (This is similar to the modern web, where webpack, Babel, uglification, and compile-to-js languages mean the actual source code of a complex web site is not accessible, but the assets are.)

As of Mac OS X 10.0, which built on the Unix- (Mach-)based NeXTSTEP, resources (multiple data within a file; one OS file per UI file) were replaced by Bundles[4] (many OS files, in a directory, per UI “file”). Bundles are a much better solution for a world with a heterogeneity of operating systems (macOS, Windows, Linux and other Un*xen), where files and tools need to port between multiple file systems, although bundles come with their own portability problems[5].

[1] https://en.wikipedia.org/wiki/Resource_fork

[2] https://en.wikipedia.org/wiki/OSType

[3] https://tools.ietf.org/html/rfc2045

[4] https://en.wikipedia.org/wiki/Bundle_(macOS)

[5] https://productforums.google.com/forum/#!topic/drive/25XGSFt...

Quartz is publishing such interesting content.

Thanks for this post -- it was an education!

Turtle graphics all the way down.

Can't they just use a shorter set of characters (i.e. the Latin alphabet or the IPA) to write down the pronunciation?

You seem to have forgotten that homophones exist.
