
The long, tortuous and fascinating process of creating a Chinese font - tosh
https://qz.com/522079/the-long-incredibly-tortuous-and-fascinating-process-of-creating-a-chinese-font/
======
peterburkimsher
Are there any Unicode experts here?

I've been exchanging emails with Richard Cook (the Unihan maintainer) about
getting some "rare" Taiwanese characters added to Unicode. I say "rare"
because they're in the Bible, which I think should be covered as a basic text
(it's the most-read book in the world!)

My research into word spacing, font issues, and more is covered on the blog at
[https://pingtype.github.io](https://pingtype.github.io) (click the Docs or
Blog header). Practical suggestions for better web design are also welcome.

Regarding fonts, this is my specific rant that made me move from Heiti to
Pingfang. Unfortunately forcing users to download 13 MB of Pingfang font was
too slow for mobile, so I decided to disable it for the web version of
Pingtype.

[https://pingtype.github.io/docs/glyphErrors.html](https://pingtype.github.io/docs/glyphErrors.html)

Edit: These are the IDS codes of the missing characters. Photo evidence from a
paper Bible:

[https://www.flickr.com/photos/150180606@N08/sets/72157693398...](https://www.flickr.com/photos/150180606@N08/sets/72157693398863955)

⿱髟煮 chhang.jpg Job 39:19, Job 4:15

 ⿸疒粒not𤷟 liap.jpg 1Sa 5:6, 1Sa 5:9, 1Sa 5:12 ... (17 found) - also see
WikiSource.

 ⿱⿳亠口冖足 37106亮足 lo-.jpg Deu 1:28, Deu 2:10, Deu 9:2

⿰牜周 tiau.jpg 1Ch 17:7, 1Sa 24:3, 2Ch 14:15 (25 found, although 2Ch 14:15 uses
牧 in the paper version)
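
For readers unfamiliar with IDS (Ideographic Description Sequences): each sequence is a prefix expression in which an operator character such as ⿱ (top/bottom) or ⿰ (left/right) is followed by its parts. The Python below is a minimal sketch of reading one, assuming only the twelve classic operators U+2FF0..U+2FFB; `parse_ids` and `components` are hypothetical helper names, not part of any library.

```python
# Ideographic Description Characters U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; the others take two.
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}
TERNARY = {"\u2ff2", "\u2ff3"}


def parse_ids(s, pos=0):
    """Parse one IDS starting at s[pos]; return (tree, next_position)."""
    ch = s[pos]
    if ch not in IDC:
        # An ordinary character is a leaf.
        return ch, pos + 1
    arity = 3 if ch in TERNARY else 2
    pos += 1
    children = []
    for _ in range(arity):
        child, pos = parse_ids(s, pos)
        children.append(child)
    return (ch, *children), pos


def components(tree):
    """Flatten a parsed tree into its leaf components, left to right."""
    if isinstance(tree, str):
        return [tree]
    leaves = []
    for child in tree[1:]:
        leaves.extend(components(child))
    return leaves


tree, end = parse_ids("⿱⿳亠口冖足")
print(tree)
print(components(tree))
```

The nested tuple makes the structure explicit: the outer ⿱ stacks a three-part top (亠, 口, 冖) over 足.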

~~~
devy
No offense and not to sound political here, but there is no such language as
"Taiwanese". People in Taiwan write in Traditional Chinese characters and use
Mandarin (official spoken language and the majority) and Southern Min (or
Hokkien proper)[1] as their spoken language. Granted they might have invented
a few characters here or there and had variations in some pronunciations in
words, but if you ask a linguist expert in Sino-Tibetan languages, they will
probably tell you the same thing.

[1]
[https://en.wikipedia.org/wiki/Southern_Min](https://en.wikipedia.org/wiki/Southern_Min)

~~~
dwohnitmok
Eh... "Taiwanese" is usually pretty well-understood to be Taiwanese
Hokkien/Min, both in English and Mandarin Chinese (台语/台語/台湾话/臺灣話).

~~~
volgo
Hokkien is a type of Chinese dialect. The only reason people rush to call it
"Taiwanese" is because of politics, not culture.

~~~
dwohnitmok
There's certainly a ton of political baggage around the Chinese language and
layered on top of that there's a crazy amount of political baggage around
China and Taiwan (e.g. is it called 正体字 or 繁体字?).

I don't think that's the case here though. Plenty of Taiwanese people I've
spoken to are happy to (and often do) call it Hokkien (or Taiwanese Hokkien).
Taiwanese as an adjective is usually ambiguous (Taiwanese Mandarin, Hokkien,
or aboriginal Taiwanese languages are all possibilities). Taiwanese as a noun
is usually less ambiguous (although the confusion in this thread makes me
rethink that).

Hokkien (Fujianese) is to Southern Min what Cantonese is to Yue and to some
extent Shanghainese is to Wu. They are place-name terms that nominally refer
to a specific subset of a dialect family that are now used in informal speech
to refer to the entire dialect family. But for example Macaonese (澳门话) is
still a place-name term you'll hear even though it's within the Yue family and
therefore you could (and people do) call it Cantonese. Taiwanese is a similar
case (place-name term) here.

~~~
volgo
There are certainly a fair number of people from Taiwan who call it Hokkien as
a matter of accuracy.

I don't think that's the case here though. There's nothing specifically
different from the Hokkien spoken in Taiwan. It's a dialect of Chinese and
certainly cannot be called linguistically different. It would be like saying
someone spoke Texan instead of English with a Texas accent.

The written language is traditional Chinese, nothing else. This is a common
misunderstanding

~~~
dwohnitmok
Chinese has a long tradition of using place-names to refer to regional
specific dialects that in English we might just call, as you put it, "English
with an X accent."

In the same way that you'll hear 重庆话 (Chongqingnese) and 成都话 (Chengdunese)
both used to refer to region-specific dialects of Sichuanese (which in turn is
really a member of the greater Mandarin (官话) dialect family) and 上海话
(Shanghainese) and 苏州话 (Suzhounese) used to refer to different region-specific
dialects of Wu, Taiwanese is being used in a similar fashion here. It is
indeed a dialect of Hokkien that is mutually intelligible with and very
similar to other varieties of Hokkien. The differences are, as you imply, on
par with the differences you might find between a Texan accent and a New York
accent. But there's nothing special about Taiwanese here. This is just the way
all region-specific dialects of Chinese are named.

The written language is kind of sort of traditional Chinese (depending on how
liberally we're defining "traditional Chinese" here), but it's a stretch. The
grammar is not the same as modern Mandarin (nor is it the same as Classical
Chinese). Moreover not all the characters used are attested to in official
Chinese sources. For example, the first character that the original comment
way way up circled in his Flickr image is not found in any ancient Chinese
dictionaries I know of (Guangyun/广韵, Shuowenjiezi/说文解字, Qieyun/切韵, Kangxi
Dictionary/康熙字典) nor is it found in the modern Xinhua Dictionary 新华字典. And I
don't have a copy handy but I wouldn't be surprised if I couldn't find it in
the evocatively named Cihai/辞海 (Word Ocean). This is presumably why they are
not in Unihan. A much larger set of characters are those that very very few
Mandarin speakers recognize. And another (independent) large set of characters
do not and have never had the meaning they have in Hokkien in other varieties
of Chinese.

That's not to say there's not a lot of overlap with written Mandarin. A
Mandarin speaker could probably muddle their way through this Bible if they
tried, but then again, a French speaker could muddle their way through Haitian
Creole. And even an English speaker could maybe get through a French newspaper
given enough cognates. This is a far far cry from the "embedded Mandarin"
situation I talked about in another comment though.

Maybe it's not quite something like
[https://en.wikipedia.org/wiki/Zhuang_logogram](https://en.wikipedia.org/wiki/Zhuang_logogram)
which is Chinese-looking, but definitely not what people would normally call
Chinese characters, but for a few of these characters, it's getting pretty
damn close.

------
skypather
Minor typos in the article: In using the word "Horse" to show Chinese
character evolution, the "Regular" is marked from 220 AD to 907 AD. As a
matter of fact, that kind of character was almost the "standard" in Chinese
before the Chinese government simplified many words around 1950. Even now, the
Republic of China (a.k.a. Taiwan) still recognizes the "Regular" characters as
the standard. Among Chinese people in the world, it is also known as the
"Traditional" characters.

~~~
jasonjei
It’s funny because even in Chinese there’s widespread disharmony with respect
to “complicated” (繁體字/繁体字) or “regular” (正體字/正体字) script, as opposed to
“simplified” script (簡體字/简体字). (Left-hand side phrase is in
traditional/complicated script, while right-hand side is in simplified, for
comparison).

Even many of the “regular” characters have been simplified. Consider 吃 and
喫—they both mean _to eat_ , but the one with fewer strokes became really the
only modern choice to use (however, Japanese still uses the old variant).
Another common one is a simplification of the first symbol for Taiwan (臺灣). 台
is often used in place of 臺.

~~~
rabboRubble
Off the top of my head, the only place I can recall the 喫 character appearing
is in the word 喫茶店 (coffee / tea shop). For "to eat", Japanese would use 食べる or
召し上がる. After referring to a dictionary, there is a word 喫する，but it's not
common (as in I don't recall ever hearing or learning this word) and means
more generally consume by mouth as in drink / eat / smoke. Yes, the Chinese 喫
means eat, but no the meaning is not exactly the same and not used with the
frequency of the word eat 食べる or 召し上がる.

~~~
yorwba
Conversely, 食 also means "eat" in Chinese, but is now used almost exclusively
in nouns like 食物 (food) or 食堂 (dining hall). The Chinese character inventory
is simply too large to keep all possible uses, especially across different
languages.

~~~
spacehunt
食 is still being used every day in Cantonese as a verb. Granted there are some
who classify Cantonese as a different language from Mandarin, as there are
many differences between the two such as this example.

------
swang
Funny enough whenever I see a tattoo on a westerner's body, not only is it
usually wrong in the grammar/spelling sense, but it is ugly as hell. Would you
let a 5 year old tattoo the word, "Strength" onto your body? That's akin to
what I see when I see the typography/style of the tattoo. "Sir, not only does
it not say Superman, the characters are backwards and missing strokes"

Any Chinese person who tells you the truth about what your tattoo says is
being very kind to you, but most Chinese won't say anything bad since they
have no reason to embarrass the person.

One time when I was still in college my family took a trip out to Mexico. I
forgot what store we went into but the cashier asked my dad to write down the
cashier's name (sorry forgot that too) into Chinese. My dad spent a decent
amount of time to think of the proper characters, wrote it down and we were on
our way. I still think about that incident a lot, like what if my dad was a
jerk and wrote something stupid for this guy to get tattooed (he wouldn't).
But even then he's essentially trusting my dad is not messing up his name in
Chinese (it definitely wasn't something like Mark).

~~~
setr
You'll see the same thing with japan randomly using english words (like one
word of a song), though afaik they usually use them correctly (but very
awkwardly).

But the chinese tattoos, and the english words, aren't meant for those native
speakers to see. The tattoos are meant for other english speakers to see, and
the words are meant for other japanese speakers to hear. They're not meant to
be understood, so much as just being visually/audibly cool.

It looks good, and it sounds good, and it's in an environment where almost no
one will understand it, so it really doesn't matter if the content is correct.
The idea is sufficient.

Now of course, if you get a chinese tattoo misspelled and move to china, you'll
look like a bumbling idiot. It's the same as being a native speaker and
misspelling it.

The context/environment matters, when deciding how important that mistake is.

~~~
komali2
Reminds me of when I lived in China with my friend, anytime we left the city
for some tourism in the boondocks, he'd wear this white T-shirt with nothing
on it except big block characters on the front - "外國人". (In simplified though,
which apparently I don't have on my phone). Just means "foreigner." Chinese
people got the biggest hoot out of it.

~~~
hawkice
外国人 in simplified.

To be fair, I've seen a Taiwanese person with the English "spice girl" as a
tattoo, which is a reverse-poor-translation, because it sounds like a singular
member of a defunct pop band. Makes more sense in Chinese.

~~~
pluma
> Makes more sense in Chinese

Can you clarify what it means? I'm guessing something like "sassy" or "hot"
but it's really unclear without knowing the cultural connotations.

~~~
baconizer
you got it right, it means "hottie". (still awkward to tattoo on body though..)

------
ilamont
_But in Chinese, “every character has to be adjusted,” says Su of Justfont.
“Each one is its own image, with its own design needs.”_

That's a key concept not only for font design, but also for learners of
Chinese. For certain characters like 醫 you have to scale down or elongate the
radicals to be balanced within a unified whole. Add the importance of stroke
order and simplified vs. traditional characters, and learning basic writing
skills (let alone calligraphy) gets really tricky.

~~~
on_and_off
Sorry if that is a very ignorant question but why not move to a system closer
to the latin alphabet with only a handful of signs ?

~~~
ilamont
Quick answer: China does have a romanization system called pinyin for the
Mandarin dialect which is quite accurate as long as you know the tones.
Problems:

* Pinyin can't be used for other dialects, meaning someone in Guangzhou won't understand written pinyin.

* There are only something like 400 sounds in Mandarin, which means there are a lot of homonyms which makes pinyin not suitable in certain contexts.

* Switching a highly literate society with more than a billion people to a different writing system would be a massive undertaking.
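
The homonym point can be made concrete with a small illustration. The table below is hand-picked for this sketch (it is not a real dictionary), and it deliberately drops tones, which merges even more characters than toned pinyin would; `group_by_sound` is a hypothetical helper.

```python
from collections import defaultdict

# Toneless pinyin for a handful of common characters (illustrative only).
READINGS = {
    "是": "shi", "十": "shi", "事": "shi", "市": "shi",
    "史": "shi", "食": "shi", "獅": "shi",
    "马": "ma", "妈": "ma", "吗": "ma",
    "人": "ren",
}


def group_by_sound(readings):
    """Invert a character -> syllable table into syllable -> characters."""
    groups = defaultdict(list)
    for char, syllable in readings.items():
        groups[syllable].append(char)
    return dict(groups)


groups = group_by_sound(READINGS)
# Print the most crowded syllables first.
for syllable, chars in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(syllable, len(chars), "".join(chars))
```

A reader who sees only "shi" has to pick among all of those characters from context, which is the ambiguity the bullet above describes.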

The Vietnamese made such a switch from Chinese characters to a Romanization
system based on Portuguese, but at a time when literacy in Chinese characters
was relatively low and a colonial power (France) dominated the country and its
bureaucracy.

~~~
on_and_off
I am not speaking about romanization, although it would surely be nice if all
the cultures in the world used the same writing system. I am referring to what
korea did with hangul.

Of course you address some of the pain points like the fact that China is way
more literate than Korea was when Hangul was introduced.

------
stuartcw
I was once involved with a software project, actually the DOS version of Lotus
1-2-3 2.4J, which bundled some Japanese fonts that were licensed from a
Taiwanese font maker. The QA manager told one of the staff to print out every
character and check them. I thought it was crazy but the junior guy came back
a few weeks later with a list of mistakes that he had found. They were
reported to the maker and a new updated version was received. This was at the
end of the era when software was distributed on physical media (CDs in this
case) and providing updates was a costly business.

~~~
jhanschoo
Taiwan and Japan have different standard characters, with some stylistic
differences that few people are aware of!
[https://en.wikipedia.org/wiki/Han_unification](https://en.wikipedia.org/wiki/Han_unification)
gives a nice table of some variations, if your OS's fonts support them!

------
raverbashing
I hate to say this, but I don't see the point in maintaining complicated old
writing systems. (I mean, of course I see the historical and cultural value,
but I don't see why people should keep using them)

You write a "new" Chinese character and then there is: a) no way to represent
it on a computer unless you draw it b) no way of knowing how it's pronounced

Latin, Cyrillic, Arabic, Hebrew (ok, they have some common roots), Korean are
much more maintainable and "portable".

No, Chinese won't be the new English. You get to write and converse in
English in a short time frame (1 yr). Not Chinese. And certainly the learning
curve gets steeper the further you go.

~~~
intopieces
I recommend familiarizing yourself with the languages that these Chinese
characters encode to better answer that question. Homophony is extremely
common in Mandarin; eliminating characters would make reading very very
difficult.

Chinese is not the only language that uses these characters. Japanese and
Korean do too. For these groups, the learning curve is much less; that
something does not come easy to English speakers is not evidence of its
inherent deficiency.

~~~
khuey
It's true that homophones are much more common in Mandarin than say, English.
It's not, IMHO, a very compelling argument against moving the common language
to a more phonetic system like Korean did with Hangul. Something like 施氏食獅史 is
already incomprehensible to a native speaker (who isn't familiar with the
text) when read aloud.

~~~
monfrere
But spoken Chinese is quite different from written Chinese, which allows for
more economy. Often a single character will be used in writing where a two-
character word would be used in speech. And people's names usually can't be
determined by the pronunciation; they are defined by the actual characters. If
Chinese moved to a strictly phonetic writing system, a lot of culture would
have to adapt: conventions around signage, poetry, proverbs, names (this is a
big one!) of people and organizations, formal writing, wordplay, etc.

~~~
pluma
So what you're saying is Chinese speakers generally use two
languages/registers: one for spoken language and another for written. Which is
another way of saying the writing system is not actually a good match for the
spoken language in the first place and mostly exists for ceremonial reasons.

I would suggest someone figure out a sane way to capture spoken Chinese in a
modern (i.e. easy to digitise) writing system but I doubt it would gain any
traction because of the cultural implications of the script. Most Westerners
see their writing system as a simple fact of life, the Chinese seem to see it
as a sacred traditional craft in the same vein as forging steel.

------
wiradikusuma
Since we're in this topic: I'm curious, is there any "Google Fonts" for
Chinese fonts? That is, a high quality free font repository.

~~~
peterburkimsher
I downloaded 1304 fonts from here:
[https://chinesefontdesign.com](https://chinesefontdesign.com)

I then wrote a script that used Harfbuzz to extract every glyph of 75,000
characters as PNG files. That took about 4 months to run on a spare computer,
writing 500 GB to an external disk.

I now want to sort out the blank glyphs, but it's really slow even on USB 3.
Instead, I bought a 2 TB upgrade for my MacBook Pro, passed down the 512GB to
my MacBook Air, and now I'm copying the files from the external disk to the
SSD. There's about 90 million files to copy, estimated time remaining 4 days.
When I remove the blank images, I plan to use the data to make my own Chinese
OCR using Tensorflow.
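
As a rough sanity check on those figures, here is a back-of-the-envelope calculation in Python. It uses only the numbers quoted above and assumes decimal gigabytes and a steady copy rate, neither of which is stated in the comment.

```python
FONTS = 1304
CHARS_PER_FONT = 75_000
FILES_REPORTED = 90_000_000   # "about 90 million files"
TOTAL_BYTES = 500 * 10**9     # "500 GB", assumed to mean decimal GB
COPY_DAYS = 4                 # "estimated time remaining 4 days"

# Upper bound if every font produced an image for every character.
upper_bound = FONTS * CHARS_PER_FONT

# Average size of one PNG on disk.
avg_bytes = TOTAL_BYTES / FILES_REPORTED

# Files per second needed to finish the copy in the estimated time.
copy_rate = FILES_REPORTED / (COPY_DAYS * 86_400)

print(f"upper bound: {upper_bound:,} files")
print(f"average file size: {avg_bytes:,.0f} bytes")
print(f"required copy rate: {copy_rate:,.0f} files/s")
```

The upper bound of 97.8 million is close to the reported 90 million, and an average of roughly 5.5 KB per image means per-file overhead rather than raw bandwidth is the likely bottleneck.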

The TTF files alone are 11.82 GB. If you can recommend a suitable file host, I
could re-upload them for you.

~~~
sls
I would love to see a Show HN about your amazing project sometime.

~~~
peterburkimsher
I submitted it last year, without gaining much interest.

[https://news.ycombinator.com/item?id=14907618](https://news.ycombinator.com/item?id=14907618)

Honestly most of what I've done since then has been data collection (song
lyrics, movie subtitles, etc) instead of developing new features.

My favourite feature now is to read the song lyrics in church, find 4
characters I know, search my database of Christian song lyrics, load that into
Pingtype, and sing along with the pinyin and understand the meaning. It's all
automated, but I can't upload it because I've received copyright threats about
redistributing the song lyrics. I'm not a limited-liability company (this is a
side project) so I'd be personally liable for the consequences of putting it
online.

I've done much more research to find new data sources. For example, 9gag
helped me stumble upon a translated comic (Mixflavor & HowardInterprets). I
transcribed all the comics, and I'm using it with my language exchange tutor
every week. I decided I wanted to find more comics that are popular with my
friends.

So I extracted my Facebook friends' liked Pages. (Yes, that sounds like
Cambridge Analytica, but I did it myself using an AppleScript to scrape and
some bash scripts to parse). I found 223,783 pages, in 865 Facebook
categories. I reduced the Facebook categories to 30 of my own categories (Art,
Music, Cooking, Driving, Pets, Shopping, Religion, etc). Then I found the top
pages for each of those. So I know the most popular musicians in Taiwan.
That's going to become a blog post and Show HN soon, when the paranoia about
Facebook calms down.

~~~
komali2
How did you scrape using applescript?

What's your blog?

~~~
peterburkimsher
The scripts are pretty messy, but the process went like this. It was necessary
to use AppleScript because the Graph API doesn't give access to friends' Likes
(because of privacy issues e.g. Cambridge Analytica). But AppleScript has
access to everything through the GUI. (if anyone from Facebook reads this,
please don't ban me - I'm just doing this to find out what my friends here
like, so I can learn Chinese. I'm not selling this data!).

1\. Get a list of IDs (I must be friends with them).

I manually maintain Lists of friends I met in each country. I went to my
Taiwan list, scrolled down, and copied the source into TextWrangler. A few
regex find-replace later, I had a list of all my Taiwanese friends' IDs.

2\. Find-replace to make URLs.

My ID is 705630362, so the URL of my Likes page is:
[https://m.facebook.com/timeline/app_collection/?collection_t...](https://m.facebook.com/timeline/app_collection/?collection_token=705630362%3A2409997254%3A96)

3\. Scroll down and copy out. This is GUI-intensive, so run it on a spare
computer.

tell application "System Events" to tell process "Safari" to key code 119

tell application "Safari" to tell front document to set download_source to do JavaScript "document.documentElement.innerHTML;"

Repeat those two while download_source does not contain "<div
class=\"_51lb\">". If download_source contains "The page you requested cannot
be displayed" then exit repeat.

4\. Write it to a file (use cat, not Apple's recommended code, in order to
preserve Unicode).

5\. Convert to text.

In my case, the HTML files took up 479 MB for 1576 friends. I wrote another
script to convert them to text.

Split the HTML based on the "<a class=\"darkTouch _51b5\" href=\"" delimiter.

6\. Post-processing!

Now it's time to do research. What are the most common likes? Just combine all
the files using cat, and use a bash script to find the most common lines:

cat "input.txt" | sort | uniq -c | sort -n -r > "output.txt"
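
Steps 5 and 6 could also be sketched in Python. This is not the author's script: the delimiter string is the one quoted in step 5, but taking everything up to the next double quote as the link is an assumption, and `extract_links`, `most_common_lines`, and the sample HTML are made up for illustration.

```python
from collections import Counter

# Delimiter quoted in step 5 above.
DELIMITER = '<a class="darkTouch _51b5" href="'


def extract_links(html):
    """Return the href value that follows each occurrence of DELIMITER."""
    # Text before the first delimiter is not a link, so skip it.
    parts = html.split(DELIMITER)[1:]
    # Assumption: the href ends at the next double quote.
    return [part.split('"', 1)[0] for part in parts]


def most_common_lines(lines):
    """Count duplicates, most frequent first (like sort | uniq -c | sort -n -r)."""
    return Counter(lines).most_common()


# Hypothetical saved pages for two friends.
sample_pages = [
    '<div><a class="darkTouch _51b5" href="/pageA">A</a>'
    '<a class="darkTouch _51b5" href="/pageB">B</a></div>',
    '<a class="darkTouch _51b5" href="/pageA">A</a>',
]

all_links = [link for page in sample_pages for link in extract_links(page)]
ranking = most_common_lines(all_links)
for link, count in ranking:
    print(count, link)
```

`Counter` keeps everything in memory, which is fine for a few hundred megabytes of input; the shell pipeline scales further because `sort` can spill to disk.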

I plan to write more about this soon, and I'll probably put it on the Pingtype
blog. But I might put it on Medium, because people seem to like that these
days. Maybe both. There's also my personal website, but I'm worried that
people might complain about privacy, so maybe I should distance myself from
it. I'm not afraid to write the comment here because we're all hackers.

------
choonway
This is a job that is ripe for automation from deep learning.

~~~
zawerf
Already happened but I don't think the results are good enough to use:
[https://github.com/kaonashi-tyc/zi2zi](https://github.com/kaonashi-tyc/zi2zi)

------
intopieces
(2015), please.

------
yayitswei
I wonder if this is something that machine learning could help with? You could
train an aesthetic model to make suggestions and tweak as necessary.

~~~
yorwba
Someone has semi-successfully applied GANs to the problem
[https://github.com/kaonashi-tyc/zi2zi](https://github.com/kaonashi-tyc/zi2zi)

Using it to create a new font would probably still require lots of manual
labor to create training data and then check the output (you don't want to
mess up the rare character appearing in someone's name...), but being able to
easily interpolate should come in handy for exploration of the design space.

------
osteele
When the Macintosh was introduced in 1984, files had a data fork and a
resource fork[1]. The data fork was normal file data. The resource fork was a
map of (OSType, int16) -> data, where OSType[2] was a four-character resource
type identifier such as 'MENU' to specify a menu, 'PICT' for picture, etc.
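
To make the (OSType, int16) -> data mapping concrete, here is a toy in-memory model in Python. It is a sketch of the idea only and says nothing about the real on-disk resource fork layout; the `ResourceFork` class and `ostype` helper are hypothetical names.

```python
import struct


def ostype(code):
    """Pack a four-character code such as 'MENU' into a 32-bit big-endian int."""
    raw = code.encode("mac_roman")
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four bytes")
    return struct.unpack(">I", raw)[0]


class ResourceFork:
    """Toy model: a dictionary keyed by (resource type, 16-bit resource ID)."""

    def __init__(self):
        self._resources = {}

    def add(self, type_code, res_id, data):
        if not -32768 <= res_id <= 32767:
            raise ValueError("resource ID must fit in a signed 16-bit integer")
        self._resources[(ostype(type_code), res_id)] = data

    def get(self, type_code, res_id):
        return self._resources[(ostype(type_code), res_id)]


fork = ResourceFork()
fork.add("MENU", 128, b"File;Edit;View")
fork.add("FONT", 128, b"...glyph bitmaps...")

print(hex(ostype("MENU")))
print(fork.get("MENU", 128))
```

Because the type is part of the key, a 'MENU' and a 'FONT' resource can share the same ID without colliding.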

Sort of like MIME types, which were standardized a decade later[3].

Resources were limited to, I think, 4MB. 4MB was 32x the RAM capacity, and 10x
the [floppy] disk capacity, of the original Macintosh.

One of the OSTypes was 'FONT'. Font data was simply a resource stuck in some
file — the system file, or, as a kind of pre-web “web font”, an application.

When we added support for CJK fonts, around 1990, we had to also add OS/file
system support for resource sizes > 4MB. (I think the limit was increased to
16MB.)

Resources were a clever invention that facilitated the development of GUI-
heavy apps on what by today's standards are ridiculously resource-constrained
computers. (The Apple Watch series 3 has more than 60,000 times the RAM of
that first Macintosh — although only a third the screen resolution. :-)

Resources also enabled a limited kind of “view source”, that helped a
generation of programmers learn their way around Mac application structure.
You couldn't view the actual code source, but you could browse the GUI
resources of any application you could get your hands on. (This is similar to
the modern web, where the use of webpack, Babel, uglification, and the use
of compile-to-js languages, means the actual source code to a complex web site
is not accessible, but the assets are.)

As of MacOS 10.0, which built on the Unix- (Mach-)based NeXTSTEP, resources
(multiple data within a file; one OS file per UI file) were replaced by
Bundles[4] (many OS files — in a directory — per UI “file”). Bundles are a
much better solution for a world with a heterogeneity of operating systems
(macOS, Windows, Linux and other Un*xen), where files and tools need to port
between multiple file systems. Although bundles come with their own
portability problems[5].

[1]
[https://en.wikipedia.org/wiki/Resource_fork](https://en.wikipedia.org/wiki/Resource_fork)

[2]
[https://en.wikipedia.org/wiki/OSType](https://en.wikipedia.org/wiki/OSType)

[3] [https://tools.ietf.org/html/rfc2045](https://tools.ietf.org/html/rfc2045)

[4]
[https://en.wikipedia.org/wiki/Bundle_(macOS)](https://en.wikipedia.org/wiki/Bundle_\(macOS\))

[5]
[https://productforums.google.com/forum/#!topic/drive/25XGSFt...](https://productforums.google.com/forum/#!topic/drive/25XGSFtnGLk)

------
k5hp
Quartz is publishing such interesting content.

------
pcrh
Thanks for this post -- it was an education!

------
bayesian_horse
Turtle graphics all the way down.

------
John_KZ
Can't they just use a shorter set of characters (ie the latin alphabet or the
IPA) to write down the pronunciation?

~~~
crooked-v
You seem to have forgotten that homophones exist.

