Cantonese Font with Pronunciation (visual-fonts.com)
387 points by skogstokig on May 8, 2023 | 201 comments



As many have pointed out here, Mandarin, Thai, Cantonese and Vietnamese are tonal languages, and the meaning of a word depends on how you pronounce the syllables inside it. Mandarin has four tones, Thai has five, Cantonese has six and Vietnamese has six. Overall, about 20% of the world's population, or 1.5 billion people, converse daily in tonal languages.

It would be very helpful if someone came up with an automatic tone detection system that lets language learners check the correctness of their pronunciation in real time, as they speak. This could be accomplished with time-frequency analysis that scores the accuracy of each tone, similar to pronunciation apps such as ELSA Speak for English [1][2].
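To make the idea concrete, here is a minimal sketch of the pitch-tracking step such a tool would need. It is only an illustration under loud assumptions: the input is a synthetic sine glide rather than recorded speech, the pitch estimator is plain autocorrelation peak picking (one simple time-domain technique, not what ELSA Speak or any real product is known to use), and all function names, the 8 kHz sample rate and the 15 Hz threshold are made up for this example.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumed rate, low enough to keep pure-Python loops fast


def synth_glide(f_start, f_end, duration=0.3, sample_rate=SAMPLE_RATE):
    """Synthesize a sine wave gliding from f_start to f_end Hz (a stand-in for a voiced syllable)."""
    n = round(duration * sample_rate)
    samples, phase = [], 0.0
    for i in range(n):
        freq = f_start + (f_end - f_start) * i / n
        phase += 2 * math.pi * freq / sample_rate
        samples.append(math.sin(phase))
    return samples


def estimate_f0(frame, sample_rate=SAMPLE_RATE, f_min=75.0, f_max=400.0):
    """Estimate the pitch of one frame from the strongest autocorrelation lag."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation: longer lags sum fewer terms, which
        # biases the pick toward the first period rather than its multiples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def pitch_track(samples, sample_rate=SAMPLE_RATE, frame_len=400, hop=200):
    """Return one pitch estimate (Hz) per overlapping frame."""
    return [estimate_f0(samples[s:s + frame_len], sample_rate)
            for s in range(0, len(samples) - frame_len + 1, hop)]


def contour_direction(track, threshold_hz=15.0):
    """Crudely label a pitch track as rising, falling or level."""
    delta = track[-1] - track[0]
    if delta > threshold_hz:
        return "rising"
    if delta < -threshold_hz:
        return "falling"
    return "level"


print(contour_direction(pitch_track(synth_glide(120, 220))))
print(contour_direction(pitch_track(synth_glide(220, 120))))
print(contour_direction(pitch_track(synth_glide(160, 160))))
```

A real system would also have to find the voiced part of each syllable, normalize for the speaker's pitch range, and compare the contour against the expected tone, none of which is attempted here.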

[1] Time–frequency representation:

https://en.wikipedia.org/wiki/Time%E2%80%93frequency_represe...

[2] ELSA Speak:

https://elsaspeak.com/en/


> It would be very helpful if someone came up with an automatic tone detection system that lets language learners check the correctness of their pronunciation […]

The number embedded in the Jyutping pronunciation indicates the tone.

However, automatic tone detection is not as straightforward as it might seem, due to tone sandhi (a change in the tone of a word or syllable depending on the surrounding morphemes), and not all tonal languages have tone sandhi. Cantonese, Teochew and Hokkien can have pretty complex tone sandhi, though.

For Cantonese and Jyutping specifically, the tone sandhi is marked as, e.g., «faan4*2», which means that the standard pronunciation «faan4» (the low falling tone) can change into «faan2» (the high rising tone) depending on the following morpheme. The tone sandhi does not have fixed rules and depends on each morpheme. Depending on the tonal language, the tone change can be triggered by the preceding morpheme or by the following one. Certain tonal languages have complex tone sandhi that changes the tone not just of a single morpheme but also of the rest of the morphemes that make up a full word.

We would really be talking about a fully contextual, semantics-aware automated translator that applies the appropriate tone sandhi on a contextual basis to give a Western reader the correct tone rendition in the transliteration.


Note that while Cantonese has the phenomenon of changed tones (變音 - https://en.wikipedia.org/wiki/Changed_tone), it is not actually considered tone sandhi (https://en.wikipedia.org/wiki/Tone_sandhi#What_tone_sandhi_i...).

Tone sandhi is phonologically motivated, i.e., the tone changes arise from the pronunciation of the surrounding words, and are thus largely predictable. Cantonese changed tones, however, are generally lexically motivated, i.e., the changed tone is part of the realization of the word itself. Cantonese changed tones are thus more akin to Mandarin's erhua phenomenon (https://en.wikipedia.org/wiki/Erhua), which is also seemingly as random with regard to the words to which it applies.

> For Cantonese and Jyutping specifically, the tone sandhi is marked as, e.g., «faan4*2»

Note that the Jyutping standard actually doesn't specify how tone changes are marked. I believe the * convention originated at https://www.cantonese.sheik.co.uk/ to facilitate Cantonese learning. Wiktionary has adopted a similar convention but using a hyphen (-) instead (https://en.wiktionary.org/wiki/Wiktionary:About_Chinese/Cant...).
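For anyone processing such annotations, a small parser can accept both marking conventions mentioned here. This is a sketch only: the regular expression, the function name `parse_jyutping` and the returned dictionary keys are my own choices for illustration, not part of the Jyutping standard or of either site's tooling, and it assumes a single lowercase syllable with tone digits 1 to 6.

```python
import re

# Letters, a citation tone digit 1-6, then an optional changed tone introduced
# by "*" (the asterisk convention) or "-" (the hyphen convention).
_SYLLABLE = re.compile(r"^([a-z]+)([1-6])(?:[*-]([1-6]))?$")


def parse_jyutping(syllable):
    """Split one Jyutping syllable into its base, citation tone and changed tone."""
    match = _SYLLABLE.match(syllable.strip().lower())
    if match is None:
        raise ValueError(f"not a recognised Jyutping syllable: {syllable!r}")
    base, citation, changed = match.groups()
    return {
        "base": base,
        "citation_tone": int(citation),
        "changed_tone": int(changed) if changed else None,
    }


print(parse_jyutping("faan4*2"))
print(parse_jyutping("faan4-2"))
print(parse_jyutping("faan4"))
```

Keeping the citation tone and the changed tone as separate fields lets a dictionary or font decide which one to display, which matters because not every source marks the change.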


It is true that the tone change in Cantonese is largely predictable due to its lexical nature. Since the parent raised a question about a generic solution to the tone marking problem, I thought I would use tone sandhi as an example of a challenge that compounds the matter.

On the other hand, languages such as Hmong exhibit complex tone sandhi. Tone sandhi is not as pronounced in Cantonese as it is in some other languages.

> Note that the Jyutping standard actually doesn't specify how tone changes are marked.

No, it does not, and whether an asterisk or a dash is used is largely immaterial. What is important, though, is that not every Cantonese dictionary marks a lexical tone change, which might mislead the learner or the occasional reader.


I got curious about something with tonal languages: how are song melodies written for them? For some, the melody matches the tones of the words. Then there's Mandarin. Mandarin just follows the melody, and you can figure out the word by context. As an English speaker, this makes ~cents~ sense. Homophones aren't a big deal. If Mandarin doesn't need tones in lyrics, why does it need them normally?


Interestingly, Cantonese songs tend to preserve tone better than songs sung in Mandarin. The paper "Tone and Melody in Cantonese" by Marjorie K.M. Chan [1] mentions the following:

> For Chinese, modern songs in Mandarin and Cantonese exhibit very different behaviour with respect to the extent to which the melodies affect the lexical tones. In modern Mandarin songs, the melodies dominate, so that the original tones on the lyrics seem to be completely ignored. In Cantonese songs, however, the melodies typically take the lexical tones into consideration and attempt to preserve their pitch contours and relative pitch heights.

[1] https://journals.linguisticsociety.org/proceedings/index.php...


Tangential fact - while, as you described, most popular songs in Cantonese have tones matching the melody (usually the melody is written first, then lyrics are filled in), many songs from Christian churches don't follow this practice. (I don't know why, maybe a lack of lyricists for translation from English/Latin to Cantonese during earlier years?)

So, in Hong Kong, when somebody writes a song/lyric that doesn't quite have matching tones, we ask "which church are you from?" to make fun of it.


> many songs from Christian churches don't follow this practice. (I don't know why, maybe a lack of lyricists for translation from English/Latin to Cantonese during earlier years?)

If you want the Cantonese lyrics to be translations of English lyrics, it severely limits the words you can choose, to the point that it is impossible to fully match the melody.


It needs them because there are too few unique syllables in Mandarin. I'm sure a linguist can provide the proper terminology, but there are only around 400 unique sounds in Mandarin, ignoring tones. Even adding five tones still only increases this to ~1500 (not all are used). Compare this to English, where estimates are in the 10-15k range.

There are therefore an enormous number of homophones in Mandarin, which makes it very challenging to comprehend without context. I've often had native speaking friends eavesdrop on a conversation, only to tell me that they're not sure what is being discussed.

It also means that the language cannot be usefully written phonetically, and thus unique characters are required.

some discussion here:

https://chinese.stackexchange.com/questions/40574/why-does-m...

https://chinese.stackexchange.com/questions/39695/does-chine...

https://chinese.stackexchange.com/questions/14596/how-many-s...


Having few unique syllables doesn't mean tones are required, since syllables can be combined. Most Mandarin words are disyllabic or longer, and 400×400 = 160k is enough combinations for a quite large vocabulary.
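The arithmetic is easy to check. The sketch below uses only the rough figures quoted in this thread (about 400 toneless syllables, five tones); the variable and function names are invented for the example, and the toned figure is an upper bound since, as noted upthread, not every syllable occurs in every tone.

```python
BASE_SYLLABLES = 400  # rough toneless syllable count quoted in the thread
TONES = 5             # tone count used in the earlier estimate


def word_space(inventory, syllables_per_word):
    """Number of distinct syllable sequences of the given length."""
    return inventory ** syllables_per_word


toned_upper_bound = BASE_SYLLABLES * TONES      # every syllable in every tone
toneless_pairs = word_space(BASE_SYLLABLES, 2)  # two-syllable words, no tones
toned_pairs = word_space(toned_upper_bound, 2)  # two-syllable words, with tones

print(toned_upper_bound)  # 2000
print(toneless_pairs)     # 160000
print(toned_pairs)        # 4000000
```

These are counts of possible sequences, not of words that actually exist or that are easy to tell apart by ear.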

Unique characters being required to distinguish homophones in modern written Mandarin is mostly a circular effect due to the characters already being available, so people use them in ways that would be ambiguous when read aloud (as intentional puns or simply to be more concise.)

If there had been no preexisting writing system and written Mandarin was a simple transcription of spoken Mandarin, introducing characters would be about as helpful as indicating the Indo-European roots of words in English writing, which is to say that some people might get a feeling of epiphany after realizing the connection between seemingly disparate words, but it would hardly be practical for everyday use.


Evidence that Chinese can be perfectly understandable written without the use of characters can be seen in the Dungan language (https://en.wikipedia.org/wiki/Dungan_language), which can be considered a dialect of Mandarin Chinese, but is written in the Cyrillic alphabet.

> Unique characters being required to distinguish homophones in modern written Mandarin is mostly a circular effect due to the characters already being available, so people use them in ways that would be ambiguous when read aloud (as intentional puns or simply to be more concise.)

Indeed, because of the way Dungan is written, it ended up evolving differently with respect to how new vocabulary is derived, often borrowing words phonetically from Russian instead of constructing them from Chinese morphemes that might otherwise be considered ambiguous when used individually.


>Most Mandarin words are disyllabic or longer, and 400×400 = 160k is enough combinations for a quite large vocabulary.

While true, I'd bet that some combinations dominate because they sound better/are easier to pronounce.

Also just because you can technically differentiate 160k sound pairs doesn't mean you can do it in a noisy environment.

Japanese and Korean have a similarly limited number of syllables and have very long words compared to English. I'm guessing because they don't have tones.

If you look at communication theory you don't only need distinct sounds, you also need error correction. Which requires extra bits of redundant information.

Tones just make it possible to carry extra bits.

Longer strings of syllables like in Japanese and Korean do the same.

More complex syllables, like in English, too.

It's just multiple different ways of carrying enough bits in speech to work in a noisy environment.

Another analogy could be password strength. You can have a very long numeric password (Japanese & Korean), a medium-length password with a mix of a-zA-Z0-9 (English), or a shorter password with weird special characters (Chinese), and they all end up having the same entropy (given that the password rules are known to the attacker).
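The password analogy can be put in numbers. In this sketch the alphabet sizes (10 digits, 62 alphanumerics, a hypothetical 1024-symbol set) and the 40-bit target are arbitrary illustrations of the trade-off, not measurements of any language's syllable inventory, and the function names are made up.

```python
import math


def entropy_bits(alphabet_size, length):
    """Bits carried by a uniformly random string over the given alphabet."""
    return length * math.log2(alphabet_size)


def length_for_bits(alphabet_size, target_bits):
    """Shortest string length that carries at least target_bits."""
    return math.ceil(target_bits / math.log2(alphabet_size))


TARGET = 40  # bits; an arbitrary target for the comparison

for name, size in [("digits only", 10), ("a-zA-Z0-9", 62), ("1024 symbols", 1024)]:
    needed = length_for_bits(size, TARGET)
    print(f"{name}: {needed} symbols, {entropy_bits(size, needed):.1f} bits")
```

The smaller the per-symbol inventory, the longer the string has to be to carry the same information, which is the point being made about tones, syllable complexity and word length.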


There are languages with even smaller sets of unique sounds that do not have tones, like Hawaiian: https://en.wikipedia.org/wiki/Hawaiian_phonology

There are many common homophones in English that are distinguished by context. Conversation tends to have a lot of context. As a Mandarin speaker, I've never really experienced this context problem. You can make up some artificial examples in English and Chinese but they don't really reflect average communication. Like "The bat and the bow are on the table". It is important to know that a good percentage of words in Mandarin are multi-syllabic (not just one character).

Mandarin can be written phonetically perfectly fine. Currently the most popular systems are Hanyu Pinyin (used in China, Singapore and Malaysia) and Zhuyin (used in Taiwan). Kids learn these systems in school before they learn characters. Chinese characters have a strong historical and cultural value, that's why they are still around.


People simply have a LOT of romanticized bullshit views built around Chinese characters, or the relative difficulty of different ways of writing because they're fluent in the language and have spent thousands of hours immersed in a sinograph-based writing system. Of course a different writing system is difficult to read even if it's ultimately much easier to learn, you have no practice! It's like writing English in Latin script vs. writing it in runes, both work fine, but we're practiced on recognizing words in Latin script. ᛖᛚᛞᛖᚱ ᚠᚢᚦᚨᚱᚲ, ᚾᛟᛏ ᛋᛟ ᛗᚢᚲᚺ.

Vietnamese is written in alphabet without issue. The Dungan people of Kyrgyzstan and Kazakhstan even write their Mandarin-descended language with the Cyrillic script without any tone markings at all - the tones are supplied in a dictionary, but that's it. It works.

Most of the homophones etc. stuff comes from people having decided that sinographs are good and then coming up with justifications for keeping them, not really an actual analysis whether Sinitic languages or Japanese would work without. This is a Chinese dictionary: https://imgur.com/a/rdxVh9i

> Mandarin can be written phonetically perfectly fine.

To reinforce this to the readers: https://www.pinyin.info/readings/pinyin_riji_duanwen.html

The author is a native Mandarin speaker who specifically requested that her work not be rendered in sinographs. It should be standard Pinyin orthography except that the author writes 'de' as 'd'.


> Of course a different writing system is difficult to read even if it's ultimately much easier to learn, you have no practice!

Yes, people often confuse the "way I do it", "the way it's always been done" or the "official way" as the only way it can be done.


You could also write English as an abjad with no vowels but no sane person would consider it. You can aslo splel einlsgh lkie tihs and msot people colud raed it flriay esilay.[4] The fact that your type demands Chinese writing to not only be phonetic but also not have tones is pretty telling that your motivation for using phonetic writing has pretty much nothing to do with "it's easier" or "it's phonetically regular" but just comes from some sort of disdain for the Chinese language in general. These sorts of phonetic reforms also require writing in a style that is essentially newspeak on steroids, such as your second source, which uses no vocabulary above maybe a 2nd grade level, and yet still I couldn't figure out what some of the words were supposed to be.

Here's another quote from the source you use:

> "There is no doubt that romanized Classical Chinese would be gibberish"

Invariably these proponents of phonetic writing for Chinese are non-native speakers[1] from the west who seem to have an intense hatred for any aspect of the Chinese language that they consider "Classic Chinese" derived[3]. This of course extends to any sentence that goes beyond "where's the bathroom" and "hello my name is bob" except not even the second example because Chinese names are what these people would consider "classical derived". So you propose a system that would not be able to transcribe __names__. Go to Korean wikipedia and click on a disambiguation page[0]. Or go ask them to show you their ID card[2]. These are a people whose entire national identity is based around not using Chinese writing. A lifetime of both native chinese speakers and non-chinese alike not being able to pronounce my NAME right when rendered in Pinyin is apparently not evidence enough that it's an inadequate system.

> This is a Chinese dictionary: https://imgur.com/a/rdxVh9i

You also leave out that double digit percentages of the Dungan language comes from Arabic and Persian, Russian, Turkic etc. Not even their names are Chinese. What little Chinese is left is a fraction of the amount of Chinese morphemes a normal Chinese speaker knows. Even in your example the entry for "da" has 10 semantically, phonetically, and etymologically different entries. The PRC also tried to enforce phonetic writing on the Yi and Zhuang languages, which had their own scripts that work on the same principles as Chinese. The result was low literacy rates and a population that predominantly still used the old writing system.

I could very well turn your argument against you. Why doesn't English spell pique, peak, peek the same? Pours, pores, poors? Why did a phonetic writing system slowly evolve into what is essentially a logographic script. Why were you able to read the above example relatively easily, but sdrow eht esrever I fi ylkciuq sa ylraen ton? It's almost as if mature readers of all scripts focus primarily on morpheme clusters when reading, and whatever gains you have from supposedly phonetically regular spelling are offset by that, assuming no pronunciation differences of course. By the time you force everyone to either memorize the "proper" pronunciations or simply force them to only use your privileged dialect your orthography will already be out of date. You can reform again, but by then your lexicon will be so etymologically and semantically starved[6] that you'll probably have to construct all your technical terms from some dead language with a stable orthography anyways.

> an actual analysis whether Sinitic languages

It's called general Chinese. The only phonetic system that works for most dialects, and whose spelling requires the same amount of memorization as writing with logographic characters. Of course if your kind had your way, by the time you could force it on every Chinese speaker it would be out of date and not even regular anymore. Of course these discussions usually don't even touch on the concept of morpheme regularity.

Of course all this text is useless because you probably don't speak Chinese well enough to evaluate any primary source, and the motivation for all this is less rational and more a personal vendetta you non-native speakers hold against Chinese being "too hard to learn"[5]. What's funny is it's the same sentiment you expats have for Vietnamese and Korean, Arabic or even Dutch. Even if we lobotomize our language for your sake you'll simply demand we all adopt English anyways.

[0] https://ko.wikipedia.org/wiki/%EC%88%98%EB%8F%84_(%EB%8F%99%...

[1] or some sort of deranged newspeak proponent, usually diaspora

[2] https://learn.microsoft.com/en-us/answers/questions/815368/acceptable-types-of-identification-%28az-900-test%29?orderby=newest

[3] Usually the argument against 施氏食獅史, somehow a several sentence long story every native chinese reader would understand being rendered as gibberish shi shi shi shi shi shi, or maybe shi Shi shi shi shi if you're generous, is a totally reasonable reform in your eyes.

[4] https://www.ddginc-usa.com/can-you-read-this.htm

[5] Not limited to language apparently, no cultural differences can be tolerated by you globalists types. Even chopsticks compel your type to proclaim > "Really? A fork and a spoon is far more superior. It shocks me that chopsticks are still used and that people like using them" https://news.ycombinator.com/item?id=35877051

[6] > Romanticized bullshit views built around Chinese characters.

Leads to Oxymoronic statements where Refusing To "Romanize" is because of "Romanticism". How absurdity like this is supposed to be easy for non-native learners and native children to grasp is beyond me.


I'm surprised you've never experienced this. Even names often require an explanation, because the pronunciation is insufficient to convey which words (i.e. characters) are used.

>Mandarin can be written phonetically perfectly fine

It can, and I use hanyu pinyin daily, but my point is that given the small space of possible sounds, it often has a great deal of ambiguity, and is mentally taxing to read. Have you ever tried reading an essay or book in pinyin? With syllabic spacing? There will be many places where it is simply not possible to know for certain what a particular word is. And then there are text books, scientific books.

Chinese characters do indeed have strong historical and cultural value, but that is not why they are still around. They are still around because they are essential to the written language.


> Even names often require an explanation

You can still write the name in Hanyu Pinyin or Zhuyin perfectly fine. It is just that we like character names and that most characters are valid to be used in names so there is a lot more flexibility in what can be a name versus other cultures where there is a less flexible set of names. You can still do something similar in English where you say your name is "rainbow" but you spell it "rhaynbeau", people aren't going to be able to guess that.

> given the small space of possible sounds

Again, see languages like Hawaiian and Vietnamese. They also have small sets of sounds and do fine with romanization.

> Have you ever tried reading an essay or book in pinyin? With syllabic spacing?

Yup, it is just that most people are used to reading Chinese characters and not romanized Mandarin. There may be other advantages to Chinese characters like quicker recognition and occupying a smaller space, and I am not trying to advocate for eradication of Chinese characters, but I want to stress that it is perfectly possible to read and write Mandarin phonetically and characters are not essential.

Also I read and write Taiwanese (Hokkien) in romanized form. Feels like a waste of time to worry about characters, but many people do and end up not writing Taiwanese or using mixed script.


Every forum post I've seen mentioning 白話字 and 台羅 mentions how hard it is to read and how few Hokkien speakers can even read it. The few proponents for it seem to be holding on for religious reasons (Presbyterians).

>You can still do something similar in English where you say your name is "rainbow" but you spell it "rhaynbeau",

This is an insulting borderline racist comparison and ties to the same old western trope of treating our names like random sounds. "rhaynbeau" Isn't a word and doesn't carry any meaning.


> Every forum post I've seen mentioning 白話字 and 台羅 mentions how hard it is to read and how few Hokkien speakers can even read it. The few proponents for it seem to be holding on for religious reasons (Presbyterians).

I am not sure what you mean by holding on for religious reasons? There are a lot of reasons to write Taiwanese. Anyway, I don't know anything about these forums or have Presbyterian affiliation, but in my real life I use it quite often with friends and family. The reason few people can read it is because few people have learned it. For the majority of Taiwanese speakers it is only a spoken language. Written Taiwanese does not play a large role in public education in Taiwan.

> "rhaynbeau" Isn't a word and doesn't carry any meaning.

It's an imperfect example for non-Chinese speakers to illustrate that it can be hard to guess the characters of another person's name but people still understand the sounds when hearing it. A lot of thought goes into choosing the characters for a Chinese name. Other cultures have names that are not related to meaning or are separated very far from the original meaning (the words are for names). Others allow variations on previous names or borrowing from other languages, so likewise the spelling of those names might be challenging to know.


> They are still around because they are essential to the written language.

This argument used to be made in Korea, yet the country seems to have transitioned to alphabetic writing without issue. A lot of the tax of reading phonetic scripts of Chinese or Japanese is that fluent speakers are simply not at all used to it, even if they can read it.

For example:

ᚦᛁᛋ ᚨᚱᚷᚢᛗᛖᚾᛏ ᚢᛋᛖᛞ ᛏᛟ ᛒᛖ ᛗᚨᛞᛖ ᛁᚾ ᚲᛟᚱᛖᚨ, ᛃᛖᛏ ᚦᛖ ᚲᛟᚢᚾᛏᚱᛃ ᛋᛖᛖᛗᛋ ᛏᛟ ᚺᚨᚹᛖ ᛏᚱᚨᚾᛋᛁᛏᛁᛟᚾᛖᛞ ᛏᛟ ᚨᛚᛈᚺᚨᛒᛖᛏᛁᚲ ᚹᚱᛁᛏᛁᛝ ᚹᛁᚦᛟᚢᛏ ᛁᛋᛋᚢᛖ. ᚨ ᛚᛟᛏ ᛟᚠ ᚦᛖ ᛏᚨᚲᛋ ᛟᚠ ᚱᛖᚨᛞᛁᛝ ᛈᚺᛟᚾᛖᛏᛁᚲ ᛋᚲᚱᛁᛈᛏᛋ ᛟᚠ ᚲᚺᛁᚾᛖᛋᛖ ᛟᚱ ᛃᚨᛈᚨᚾᛖᛋᛖ ᛁᛋ ᚦᚨᛏ ᚠᛚᚢᛖᚾᛏ ᛋᛈᛖᚨᚲᛖᚱᛋ ᚨᚱᛖ ᛋᛁᛗᛈᛚᛃ ᚾᛟᛏ ᚨᛏ ᚨᛚᛚ ᚢᛋᛖᛞ ᛏᛟ ᛁᛏ, ᛖᚹᛖᚾ ᛁᚠ ᚦᛖᛃ ᚲᚨᚾ ᚱᛖᚨᛞ ᛁᛏ.

Same normal English, but Elder Futhark as the script. If you grew up reading that you'd read without issue. Now? It's a pain.


[flagged]


You've broken the site guidelines badly with this flamewar post, even stooping to personal attack. That's totally not ok.

I'm not saying that your points on the underlying topics are wrong–for all I know you're 100% right—but you can't abuse HN like this, no matter how right you are or feel you are. As you've broken HN's rules many times in the past and ignored our repeated requests to stop, I've banned the account.

https://news.ycombinator.com/item?id=33904225 (Dec 2022)

https://news.ycombinator.com/item?id=27830573 (July 2021)

https://news.ycombinator.com/item?id=22713311 (March 2020)

https://news.ycombinator.com/item?id=22591936 (March 2020)

https://news.ycombinator.com/item?id=20712243 (Aug 2019)

https://news.ycombinator.com/item?id=20191623 (June 2019)

It's a pity, because you're clearly knowledgeable on some of these topics and I hate to ban a knowledgeable user. But we don't have a choice when people break the rules like this and don't respond to warnings. If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.


> but there are only around 400 unique sounds in Mandarin, ignoring tones. Even adding five tones still only increases this to ~1500 (not all are used). Compare this to English, where estimates are in the 10-15k range.

Sure, but English needs them because there are only around 26 letters; compare this to Mandarin, where estimates are in the 400 range.


Why would a small number of letters make English need more syllables?

If anything, letters being overloaded limits the number of syllables we can express.


I was mostly just joking/pointing out that the comparison's a bit.. not to say 'apples and oranges', but a bit arbitrary, it seemed to me could just as well be comparing letter count - or even that that makes more sense as a comparison for (disregarding tone) characters, but still arbitrary, the languages just work differently.


Just because you can figure it out by context in songs (rarely upon the first listen, mind you), that doesn’t mean the added cognitive load isn’t excessively burdensome in everyday speech.


Realizing no one's going to change a language with 900 million speakers, do you think it's because there's a lot of ambiguity, or is it because it's a cognitive load people aren't used to? Mandarin is a newer language than Cantonese, and it has fewer tones. Languages tend towards laziness, so I wonder if it settled on the right number, or if it's an ongoing trend.

Edit: About languages losing features, English used to be declined like German or Latin. Only pronouns are declined in modern English, and we don't usually teach it as "pronouns are declined."


> Mandarin is a newer language than Cantonese

Both languages descended from a common ancestor, so you can't necessarily say that one is newer than the other. However, it is the case that Cantonese preserves several features that Mandarin has lost, in particular the complete inventory of final consonants and all of the tone categories of Middle Chinese, which makes it seem better suited for reciting 1000+ year old Tang dynasty poetry where rhyming and tones were especially important.

On the other hand, Cantonese has lost other features that Mandarin has preserved (such as medial vowels and the three-way distinction of initial sibilant consonants), but these features aren't as critical with respect to reciting Tang poetry. For this reason, Cantonese may seem "older" than Mandarin, even though in reality, it's simply that they each have preserved different features and the features that Cantonese preserved happened to make it better for reciting old poetry.

> Languages tend towards laziness, so I wonder if it settled on the right number, or if it's an ongoing trend.

All languages change and will continue to change over time, and while laziness may drive changes in some features of a language, often times other parts of the language become more complex to compensate. This process is called grammaticalization, and is thought to occur in cycles: http://websites.umich.edu/~jlawler/TheGrammaticalizationCycl...


Just as humans are newer than, say, some monkeys because they came later, even though we evolved from the same ancestor, this is a matter of fact. The recitation of Tang poetry and the greater complexity say a lot about this. Mandarin is later.


Stop embarrassing yourself in public.


I suspect what's going on here is that in music, it doesn't matter if you understand it right the first time.

How many songs do you misunderstand the lyrics for on the first few listens, in your native language? For me, in English, I either can't tell exactly what they're saying for some proportion of lyrics, or just totally mishear them _quite_ often (especially depending on the genre).

Music doesn't require every word to be perfectly understandable. Communication does, ideally.


> I suspect what's going on here is that in music, it doesn't matter if you understand it right the first time.

So there's this

https://www.youtube.com/watch?v=pdz5kCaCRFM

and more interestingly

https://www.youtube.com/watch?v=-VsmF9m_Nt8


Languages often lose features but they also gain features. Complexity of language is hard to compare, but we can still find many examples.

Modern English has less complex verbal morphology and noun declension (as you mentioned, only in pronouns). But the set of vowels in Modern English is more complex than that of Old English. Also the vocabulary of Modern English has two main sources: Germanic words (native) and French/Latin/Greek words where a single idea can be expressed in either vocabulary source with different nuances. Old English was mostly comprised of Germanic words with some words borrowed from Latin.

Another interesting thing to note is that languages without tones can gain tones (tonogenesis) and in Old Chinese tones played a much smaller role than the modern descendants. This is often the result of syllables/sound systems becoming less complex and losing contrast so the tone of the word becomes contrastive to maintain a distinction between words.


From my understanding Mandarin has a lot of two-syllable words and in many of the words the second syllable doesn't add much, if any, additional meaning.

Contrast that with Cantonese, which I believe still uses a single syllable for most words. (Someone please correct me if I'm wrong)

So it makes sense with fewer tones, because you have more syllables to disambiguate.


> which I believe still uses a single syllable for most words

Not sure about "most" (depends on the sample distribution I suppose), but single syllable (i.e. character) words are used much more often relative to Mandarin.

So in general you're probably right. Not sure whether that is a cause of more strict adherence to tones in songs though. It could be alternatively argued that the more complex syllable (due to more tones among other things) in Cantonese allowed it to retain single syllable words without having to add extra syllables to clarify any ambiguities.


> It could be alternatively argued that the more complex syllable (due to more tones among other things) in Cantonese allowed it to retain single syllable words without having to add extra syllables to clarify any ambiguities.

Yes, pretty much.


I've learned to speak some Mandarin. What helped me a lot while talking is using Google's Translate function to see if I get the tone right.

It's free and fast. It's very helpful when you want quick feedback.

I learned a lot from it about the pronunciation of consonants as well. The k-sound has way more air in it. You need to pronounce it as "kh". Same thing with the t-sound.

I just came back from a trip to China and it was noticeable how much more people understood me. Still need to work on vocabulary though...


> The k-sound has way more air in it. You need to pronounce it as "kh". Same thing with the t-sound.

Well, if you are Dutch, all other consonants from all other languages need a lot more aspiration ;)


You can even use Google Translate's text to speech API in whatever learning program you build for yourself:

  GET https://translate.google.com/translate_tts?ie=UTF-8&tl=zh_CN&client=tw-ob&q=<url encoded text>

This returns an audio/mpeg (mp3) response. Change the language code as appropriate.

It's not the most natural sounding TTS engine, but it's free, unauthenticated and trivial to use.
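If it helps, here is a small sketch that builds that request URL with proper percent-encoding. The endpoint and query parameters are exactly the ones quoted above; the function name `build_tts_url` and the default language are assumptions of this example, and since this is not a documented API its behaviour could change or requests could be rejected at any time.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the comment above; it is unofficial and may change.
TTS_ENDPOINT = "https://translate.google.com/translate_tts"


def build_tts_url(text, lang="zh_CN"):
    """Return the text-to-speech URL for the given text and language code."""
    query = urlencode({"ie": "UTF-8", "tl": lang, "client": "tw-ob", "q": text})
    return f"{TTS_ENDPOINT}?{query}"


print(build_tts_url("你好"))
```

The resulting URL can then be fetched with any HTTP client and the response body saved as an .mp3 file.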


What Google's Translate function are you talking about? Voice input?


When you look up a word there's a speaker icon that you press to hear the translated words.


The annoying part in written Thai is thattherearenospacesbetweenwords.


Spaces between words is a relatively recent Irish invention (7th or 8th century) in western written language, so it’s not like it’s an obvious thing to have.


> Spaces between words is a relatively recent Irish invention (7th or 8th century) in western written language, so it’s not like it’s an obvious thing to have.

Perhaps, but interpuncts between words are several centuries older than that and occur as natural developments in e.g. the Roman Empire. https://loeb-art-center.vassarspaces.net/wp-content/gallery/...

The concept of word separation is an obvious thing to have. Whether the separator is empty space is unimportant.


12 centuries should be plenty of time for a simple upgrade to improve the UX of a language.


The UX of a language for most people for most of that time was speech.


Very true, and we should remember that in a lot of (all?) cultures across time, literacy (learning to read and write) was a marker of class, and/or a protected trade, and/or considered sacred, and/or considered profane.

In other words, there wasn’t much an incentive or recognized need to make the scribe’s job easy to pick up.

Caesar (I think) tells us that the druids of the Celts did not allow members of their tradition to write down their beliefs, traditions, etc. Writing in that context (prior to 50 BCE) was profane.

Of course, those of us in America are familiar with slaves being prevented from learning to read. Forced illiteracy in this context was a tool of oppression. [1]

I think in one of S.M. Stirling’s fictional books (On the Ocean of Eternity?? Part of the Island in the Sea of Time series, anyway) there’s a great exchange with a Babylonian scribe who laughs at the simplistic alphabet of the American, condescendingly remarking that a child could learn that, to which the guy replied, yeah, that’s entirely the point!

1. https://docsouth.unc.edu/neh/singleton/singleton.html

I can’t help but link this recollection by William Henry Singleton. He recounts being whipped as a child because it was thought that he had merely opened a book, but the whole pamphlet he authored(!) (and available to read in full at that link) after becoming free, fighting in the War, and learning to read/write, is an utterly fascinating account from a primary source spanning from his experience being born into slavery in about 1830 to the point where he authored this in about 1920. It’s too easy to understate but this man saw a lot of change in a momentous century, first-hand.


It was not much needed until computer text processing became a thing.


Latin used dots to separate words.


Same with Chinese language, thus lexing and parsing requires knowing many more words than in languages with spaces between words.


As an English native speaker who learned Mandarin, I really didn't find the lack of spaces harmful to learning the language.

Since each character represents a syllable, rather than a specific sound, and the written language is essentially not phonetic, reading the characters is an entirely different experience.

OTOH, you have English and German and others that frequently use compound words, and the use of spaces becomes really important to understanding the writing.

I have zero experience with Thai.


> OTOH, you have English and German and others that frequently use compound words, and the use of spaces becomes really important to understanding the writing.

Schleifmaschinenverleih would like to have a word with you.

This is parsed as Schleif-Maschinen-Verleih. Verleih means a rental company. The middle one is machine, and the first one I find both sanding and whetting as translations, not sure which one it is. So you can rent sanding and/or whetting machines there.

There are cases where it's ambiguous but for the most part the lack of spaces in compound nouns in German is not an issue.

A somewhat infamous example is Rohrohrzucker which should be parsed as Roh-Rohr-Zucker (raw cane sugar), but Rohr-Ohr-Zucker is also possible (pipe ear sugar). It's pretty clear when it happens that you got the wrong parsing but it takes a while to figure out what the right parsing is :-)

As far as speaking is concerned: I guess the extra spaces in English don't necessarily translate to pauses, do they? Is plugin pronounced differently from plug-in due to the hyphen?
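
The Rohrohrzucker ambiguity is easy to reproduce mechanically. A small Python sketch, assuming a toy four-word lexicon that I made up for the purpose (a real dictionary would produce far more candidate splits):

```python
def segmentations(word: str, lexicon: set[str]) -> list[list[str]]:
    """Return every way to split `word` into pieces that are all in `lexicon`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            # Keep this prefix only if the remainder can also be fully split.
            for rest in segmentations(word[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon, lowercased; purely illustrative.
LEXICON = {"roh", "rohr", "ohr", "zucker"}

for parts in segmentations("rohrohrzucker", LEXICON):
    print("-".join(parts))
```

With this lexicon it finds both readings, roh-rohr-zucker and rohr-ohr-zucker. Choosing between them is the part that needs a human or a language model.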


To clarify, I'm not saying that compound words are difficult to parse due to a lack of spaces. I'm saying that without any spaces in a sentence at all, it's harder to differentiate between compound and non-compound words.

blueberry vs. blue berry

stand up vs. standup

online vs. on line (northeast US term for queueing)

cartwheel vs. cart wheel

Stick compound words in a sentence that doesn't have any spaces at all, and you either have to pause to grok context, or context won't even help you (blue berry vs blueberry). At least German capitalizes all of its nouns, which would certainly help.

Compare this with Chinese, Korean, Japanese or other similar languages that don't use spaces at all (except perhaps after punctuation).


> As an English native speaker who learned Mandarin, I really didn't find the lack of spaces harmful to learning the language.

Definitely. The logograms and being in a completely different language family are the real hurdles.


No, it's different between Chinese and Thai.

Lexing is very clear in Chinese. It's never the case that you look at a Chinese sentence and don't know where a character ends and another begins. Take this sentence in both languages: "good morning, how are you"

早安,你好吗

This sentence clearly has "spaces" and I'm pretty sure any person illiterate in Chinese could tell you there are 5 characters / words. Technically the third character is composed of 人 and 尔 but I don't know that anyone, even kids or beginners, would mistake those as _not_ going together.

สวัสดีตอนเช้าคุณเป็นอย่างไรบ้าง

In contrast, Thai is as you say: lexing and parsing bleed together. There are 7 words in this sentence, but you need to lex the 10 syllables and run them through your mental dictionary to recognize the possible words they could be. My Thai is very limited, but there are examples of sentences out there that actually have multiple valid readings with different semantic meanings, depending on how you group sounds together.


早安 is made up of 2 characters but is a single word. If you fall into the trap of thinking 1 character = 1 word, you won't understand a thing. In this case you'd have thought it meant "early safe" instead of "good morning".


Okay, you make a good point. Let's look back at the GGP's comment though:

> Same with Chinese language, thus lexing and parsing requires knowing many more words than in languages with spaces between words.

In English, can you get away with knowing the meaning of "good" and "morning" and not "good morning", and know that I'm greeting you instead of commenting on the quality of this morning?


Good morning is a bad example because it has a colloquial meaning that is at least a little idiomatic. Most other words/phrases in English don’t have this effect, while many Chinese words are like 早安. 了解, for example, can’t even be pronounced without correctly parsing the word.


Okay, I concede that I may have forgotten that Chinese has its exceptions too. 了解 is indeed a good example. There are plenty in English though. Even with context, sometimes I have to really pause and think whether to pronounce read as red or reed (I read it just fine, I read English just fine).

Where I've had pain specifically with Thai is that I can't even know where a syllable begins and ends until I read a few "syllables" together and decide whether some vowels go with the consonant in front or behind it, and whether an -ar should be pronounced as an -aan.


Chinese has pretty regular rules about grouping characters into words though, as most compounds are 2-characters, or a 4-character idiomatic phrase. Even if I know only half the characters in a sentence, I can usually guess the word boundaries correctly. It's not 100% reliable, but good enough to avoid confusion.


I guess it really depends on "dialect". Try that with Cantonese :)

As mentioned in another comment, single syllable words are much more common in Cantonese, and word combinations are much more "free" in the sense that there is a lot more ambiguity as to what counts as a "word" and what is merely two single-character-words idiomatically used together. There are also cases where grammatical constructs (and also foul words) are inserted in between a two-character word/idiomatic combo, and sometimes the characters are reversed, to the extent that it used to be a meme: https://evchk.fandom.com/zh/wiki/Y%E5%B7%B2x

It's gotten to a point where, after thinking about it for a couple years, I've come to believe that segmentation on Cantonese is a fool's errand...

Of course, there's also classical Chinese where most of the time a character is a word.


i think you're on to something about cantonese, but it's also true of mandarin. segmentation of words in chinese in general seems inherently messier than segmentation in english. also look at stuff like abbreviations: is 北大 one word? is it an abbreviation for 北京大学 the same way Caltech is an abbreviation for california institute of technology? is it just two single character words, each of which is an abbreviation? i think its much less clear than english


Segmentation in Mandarin is easier due to the tendency of the language to use 2+ characters for words. With a high quality wordlist you will go a long way.

The problem with proper nouns is that they don't end up in dictionaries, same with slang and other terms that for reasons don't end up in dictionaries.

The additional problem with Cantonese is that there's a larger class of words where the constituent characters can move around as if they were words themselves. Even for a native speaker with some experience in lexicography, it can be difficult to determine word boundaries as there are many cases where a word with characters X+Y can be interpreted as just word X and word Y with some idiomatic meaning. This issue is more pronounced in Cantonese because there are more single character words in active use.

I've actually done this before. My experience is that naive segmentation on Mandarin text with a wordlist is probably 80+% accurate, while using the same algorithm on Cantonese text (with a Cantonese wordlist) will definitely end up "wtf".
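
For readers wondering what "naive segmentation with a wordlist" can look like: one common approach is forward maximum matching, sketched below in Python. This is not necessarily the exact algorithm used above, and the wordlist is a tiny made-up sample.

```python
def max_match(text: str, wordlist: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching: take the longest known word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the wordlist matches.
            if size == 1 or candidate in wordlist:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Tiny illustrative wordlist; a real one would have many thousands of entries.
WORDS = {"我", "喜欢", "北京", "大学", "北京大学"}
print(max_match("我喜欢北京大学", WORDS))
```

Being greedy, it never reconsiders a choice, which is exactly where it falls over when X+Y could be either one word or two single-character words.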


The same problem exists in Japanese FWIW, whose speakers like to make the same sorts of abbreviations despite not having a bisyllabic meter like Mandarin does. Japanese is somewhat helped by having multiple orthographies, however.


Hence expertsexchange.com


Do any ideographic languages use spaces?

I'm used to it in Asian languages but it still does my head in when I try to read older Latin documents.


With the kind of mixed script used in Japan and that used to be used in Korea, they're not exactly necessary (still useful, but not necessary). Neither language uses prefixes much, so a sinograph is a pretty reliable indicator of the beginning of a word, followed by the inflection written out in a phonetic script like hiragana or hangeul. In Japanese's case, a switch from hiragana to katakana also indicates a word boundary and highlights that the word's likely a nonsinitic loan or the name of a plant or animal species or other technical term.

Say, for example:

"Korean people eat kimchi"

In Japanese/Korean, the structure would be:

Korean-person-topic marker-kimchi-object marker-eat-present tense.

In Japanese mixed script, that looks like:

韓国人はキムチを食べます。and would be read as "kankokujinwa kimuchiwo tabemasu".

Splitting it with spaces:

韓国人は キムチを 食べます。

The heftier kanji denoting "Korean person" and at the start of "eat" should be clear even to the untrained eye, while people who've studied the language can easily tell that キムチ is "kimuchi" written in katakana. The sentence is pretty easy to parse without spaces, at the cost of using one of the most insane writing systems in the world.

Now, what if we wrote the entire thing in hiragana instead?

かんこくじんはきむちをたべます。

... yyeaahh. Spaces. Please.

かんこくじんは きむちを たべます。There, much better, though almost no one fluent in Japanese has practice reading stuff like that.

In Korean, without spaces we'd have:

한국사람들은김치를먹어요. Again, similar problems. Korea has adopted spaces now that they don't use sinographs, so we'd have:

한국 사람들은 김치를 먹어요. (han'guk sa'ram'deul'eun kim'chi'reul mog'o'yo)

If we wrote "Korean person" with the same Sinitic loans in the Japanese sentence, we might get:

한국인들은 김치를 먹어요. (han'gug'in'deul'eun kim'chi'reul mog'o'yo)

Spaces clearly do help.
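
The "script switch marks a word boundary" heuristic for Japanese can be sketched in a few lines of Python. The rule below (start a new chunk whenever hiragana is followed by kanji or katakana) is my simplification of the point above, not a real tokenizer, and the function names are made up.

```python
def is_hiragana(ch: str) -> bool:
    return "\u3040" <= ch <= "\u309f"


def is_kanji_or_katakana(ch: str) -> bool:
    # Katakana block, then the main CJK Unified Ideographs block.
    return "\u30a0" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"


def split_on_script_change(sentence: str) -> list[str]:
    """Start a new chunk whenever hiragana is followed by kanji or katakana."""
    chunks = []
    current = ""
    for ch in sentence:
        if current and is_hiragana(current[-1]) and is_kanji_or_katakana(ch):
            chunks.append(current)
            current = ""
        current += ch
    if current:
        chunks.append(current)
    return chunks


print(split_on_script_change("韓国人はキムチを食べます。"))
print(split_on_script_change("かんこくじんはきむちをたべます。"))
```

On the mixed-script sentence this recovers the same three chunks as the spaced version, while the all-hiragana sentence comes back as one undivided blob.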


How many (modern, written) "ideographic languages" exist? I can think of two: Chinese and Japanese. Old Korean and Vietnamese used some Chinese characters, but the modern languages use none.

It is interesting to me when written Chinese and Japanese use commas. It is pretty much never required, but pure style. It does help to break up a complex sentence, similar to phonetic languages.


Comma is required in modern Chinese. Nobody will bother reading your text if you don't at least put some of them in the right places...

(I don't know Japanese so I can't speak to that)


Nice post. Can you give a simple example sentence where it is required? (I believe you.) I studied Mandarin and Cantonese for a few years, but I never got to the level where I thought commas were required.


Just a random text I had handy:

起初我見到一碟芽菜同西芹,以為佢上錯菜,用刀叉掘咗幾下,終於見到埋咗響底嘅叉燒。

Nobody writes it like this:

起初我見到一碟芽菜同西芹以為佢上錯菜用刀叉掘咗幾下終於見到埋咗響底嘅叉燒。

Can people parse the latter (with some difficulty)? Sure. The commas are not required to the extent that you can butcher the sentence even more without losing its essential meaning, but why stop there? You can remove even more stuff from it and still retain most of the meaning:

初我見碟芽菜同西芹以為上錯用刀叉掘幾下終見埋底嘅叉燒

But that's not how people write.


Japanese has a lot of “hint” on word (nouns) ending. And they use “full stop” plus one space to end a sentence as comma is not really needed in most cases. This is unlike chinese.


I’ve noticed in things translated from Japanese (video games, anime) there’s two features that seem constant and don’t seem to come from other languages. They seem to constantly say “in other words” and restate and clarify topics, and they put “quotation marks” around things that don’t seem to need quotation marks. I’ve always assumed these oddities would make more sense if I learned to speak or write Japanese.


    Japanese has a lot of “hint” on word (nouns) ending.
Can you give a simple example sentence to demonstrate your point? (I believe you.)

    This is unlike [C]hinese.
Can you give a simple example sentence to demonstrate your point? (I believe you.)


yougetusedtoitafterawhile

The thing that bugs me about written Thai is that there are spaces now and then and you would expect them to be at sentence breaks but they seem to be randomly placed throughout the text, almost as if that's where the writer felt like he needed to take a breath instead of where one sentence ends and another begins.


idk the more chinese i learn, the more im convinced that the very concept of individual words is blurred and not quite the same because of the way the writing system works

中国共产党, is that one word? should you break it up as 中国 共产党? what about 中国 共产 党? i dont think its nearly as clear which of these is correct as it is in english



There was a post on HN recently about the demise of speech recognition: https://news.ycombinator.com/item?id=35800935

That said, I still think your post raises some excellent points. To refine it, how about an online learning app that plays a sound or video clip from a national news TV or radio program. (Usually, the speakers have perfect pronunciation.) You repeat the words; the online app records it; then, it shows your tones/pitch/accent versus the native speaker. I think that could be incredibly useful. My point: Speech reco is an impossibly hard problem (currently) due to infinitely broad context in spoken languages. My idea would have "perfect context", so speech reco could really work.

When I was learning tones for a while, I used to record myself, then replay it. It's amazing to hear the difference between what you think you sound like and what you really sound like. An online app could help to fine tune your pronunciation very quickly.
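
To make the "show your pitch versus the native speaker" idea concrete, here is a rough Python sketch of the core step: estimating a pitch contour frame by frame with plain autocorrelation. It runs on synthetic sine glides rather than recorded speech, and the frame size, search range and helper names are all my own assumptions. Real speech would need voicing detection, smoothing and a better pitch tracker.

```python
import math


def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate the fundamental frequency of one frame by autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def pitch_contour(samples, sample_rate, frame_size=400):
    """Split the signal into non-overlapping frames and estimate pitch for each."""
    return [estimate_pitch(samples[s:s + frame_size], sample_rate)
            for s in range(0, len(samples) - frame_size + 1, frame_size)]


def synth_glide(f_start, f_end, seconds, sample_rate):
    """Synthesize a sine wave whose frequency moves linearly from f_start to f_end."""
    total = round(seconds * sample_rate)
    samples, phase = [], 0.0
    for n in range(total):
        freq = f_start + (f_end - f_start) * n / total
        phase += 2 * math.pi * freq / sample_rate
        samples.append(math.sin(phase))
    return samples


SAMPLE_RATE = 8000
rising = pitch_contour(synth_glide(150, 250, 0.3, SAMPLE_RATE), SAMPLE_RATE)
level = pitch_contour(synth_glide(220, 220, 0.3, SAMPLE_RATE), SAMPLE_RATE)
print([round(f) for f in rising])  # should climb, like a rising tone
print([round(f) for f in level])   # should stay near 220 Hz, like a level tone
```

An app along these lines would compute the same contour for the learner and the reference clip, normalize for each speaker's pitch range, and plot the two curves on top of each other.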


This article is from 2010, and it says that as of 2010, progress on speech recognition had flatlined since 2001.

Is this still the case?

How does 2001 software compare to 2023 software? Or 2010 to 2023, for that matter?


Speech recognition is leaps and bounds ahead of where it was in 2001/2010, driven by a combination of deep learning and massive datasets. I'm actually working on a similar concept (speech recognition for Mandarin language learning) and even accented speech is becoming less of a problem.

I haven't tried differentiating tones for discrete words with the recent batch of generic models but I think it might still struggle. I suspect a fine-tuned task-specific classifier might be needed for decent results.


> As many have pointed here Mandarin, Thai, Cantonese and Vietnam are tonal languages

There are also plenty of tonal languages outside East/Southeast Asia, in South Asia, Africa, Europe, North America, and that's just ones I know of.

Languages also switch, like ancient Greek.


Or indeed, ancient Chinese! Which IIRC was non tonal.

There's some linguistic pattern where consonant clusters at the end of words get dropped, but their 'effect' on the vowel remains and that's how these kinds of tones develop.

IIRC there are also two different kinds of tones, pitch tones and register tones....

Languages are crazy.


Hong Kong Cantonese has six tones. I knew it was different for Guangzhou Cantonese, but I wasn't sure exactly how many, so here's what Wikipedia says:

> In finals that end in a stop consonant, the number of tones is reduced to three; in Chinese descriptions, these "checked tones" are treated separately by diachronic convention, so that Cantonese is traditionally said to have nine tones. However, phonetically these are a conflation of tone and final consonant; the number of phonemic tones is six in Hong Kong and seven in Guangzhou.


Yeah, the seventh tone in Guangzhou Cantonese that’s gone in Hong Kong Cantonese is the high-falling tone.

In Guangzhou Cantonese, 衫 (shirt) and 三 (three) are not homophones, but they are in Hong Kong Cantonese. The Jyutping romanization (from the Linguistic Society of Hong Kong) reflects this change in HK Cantonese (saam1), whereas Yale, based on the older pronunciation, could represent the difference in tone (sāam vs sàam).

Interestingly enough, the high-falling tone is still retained in Hong Kong Cantonese for one exceedingly common word, the final particle 㖭 (tim1/tìm)!


I suspect the difference is more "academic" than otherwise.

It would be interesting to give a Hong-Konger a test with pairs of characters, one of 1st tone and the other 7th tone, and see whether they can guess which is which.

I suspect with minimal training they'd be able to score significantly better than random chance. (Willing to try it out!)


No need for a test, and I know nothing about the theory. But I'm sure we HK Cantonese speakers can hear almost immediately whether the other person is speaking Guangzhou (Canton) or Malaysian Cantonese.


And to add to that, there are a bunch of sandhis, which are when the tones are shifted or modified when followed by certain other tones.

Using tones is so natural that many native Cantonese speakers are unaware that the language even has tones lol.


Your last point: I would say that is generally true for all native-level speakers of tonal languages. It is interesting to watch them try to learn a different tonal language. It suddenly opens all these doors into their own language. "Oh, so that's why foreigners struggle with this sound." Mandarin: zai, sai, cai

When you read about tonal languages online, there is so much emphasis on "tonal languages are special". I don't understand why accent and pitch aren't added to the same bucket. In Japanese, accent and pitch are rarely taught, but incredibly important in daily life. Japanese is stuffed full of homophones which makes listening a tricky matter. BTW -- I am sure there are other "non-tonal" languages that I don't know about where pitch and accent are important (Korean?).

If you consistently pronounce words with the wrong accent or pitch, (average, uptight) Japanese listeners will refuse to understand it. Say what you like about it (fake/real/whatever), it is a common response.


It even matters in Latin languages somewhat. Which syllable is stressed can change meaning, sometimes avoiding ambiguity.


Exactly. I use the example of different tones when saying “really.” It can have very different connotations depending on the tone from questioning to sarcastic.


Mandarin, Cantonese, etc. have per-syllable tones. English, German, French and probably many more, have per-partial-sentence tones.


Isn't it true that Cantonese has 9 tones?


Yes and no. Cantonese has 9 tone categories that have 6 distinct tone contours. The 3 additional tones fall under the checked tone category (https://en.wikipedia.org/wiki/Checked_tone) for historical purposes, but their realized pronunciations coincide with the tone contours of 3 of the other 6 tones, so for most practical purposes, many sources describe Cantonese as having 6 tones.

I have an old Quora answer here that goes into more detail: https://qr.ae/pyNupi
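
In code terms, the 9-versus-6 question is just a renumbering. A tiny Python sketch, assuming syllables written with a trailing tone digit and the traditional numbering where 7, 8 and 9 are the checked tones; `to_six_tone` is a made-up helper name.

```python
# Checked (entering) tones 7, 8, 9 share their pitch contours with tones 1, 3, 6.
CHECKED_TO_CONTOUR = {7: 1, 8: 3, 9: 6}


def to_six_tone(syllable: str) -> str:
    """Rewrite a syllable numbered in the 9-tone scheme using the 6-tone scheme."""
    base, tone = syllable[:-1], int(syllable[-1])
    return f"{base}{CHECKED_TO_CONTOUR.get(tone, tone)}"


for s in ["sik7", "baat8", "luk9", "saam1"]:
    print(s, "->", to_six_tone(s))
```

Nothing is lost in the conversion, because the final p/t/k already tells you the syllable is checked.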


I like to tell complete newbies that Cantonese roughly has "4 tones."

- High level

- Mid level

- Low (includes "low falling" and "low level")

- Rising (includes "low rising" and "mid rising")

I've combined similar tones into the Low and Rising categories. If you are a non-native Cantonese speaker, and don't differentiate between "low falling" and "low level", native Cantonese speakers will still understand you.

It's difficult for a non-native speaker to distinguish between "low rising" and "mid rising".... so just treat it as a rising tone. I'm a native speaker and sometimes I forget which type of rising tone a particular word is.... I didn't learn it that way, haha. I just learned to say the word the same way my parents did.

The 7th, 8th, and 9th tones are short versions of the three level tones, and they all end in a consonant (like "k"). If you pronounced them the same, but make the syllable very short, you'll be fine.

So yeah.... think of it as 4 tones, just like Mandarin. Three different level tones at high, middle, low pitches, and one rising tone :-).
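
If you want to see what the 4-tone shortcut costs you, here is a small Python sketch. It assumes Jyutping tone numbers (1 high level, 2 mid/high rising, 3 mid level, 4 low falling, 5 low rising, 6 low level) and lists which syllables become indistinguishable; the bucket names and function names are my own.

```python
from collections import defaultdict

# Jyutping tone number -> simplified bucket, following the 4-way grouping above.
SIMPLIFIED = {1: "high level", 3: "mid level", 4: "low", 6: "low",
              2: "rising", 5: "rising"}


def simplified_tone(syllable: str) -> str:
    return SIMPLIFIED[int(syllable[-1])]


def merged_groups(syllables):
    """Return groups of syllables that collapse together under the 4-tone view."""
    groups = defaultdict(list)
    for s in syllables:
        groups[(s[:-1], simplified_tone(s))].append(s)
    return [g for g in groups.values() if len(g) > 1]


# The classic six-way set: 詩 史 試 時 市 事
print(merged_groups(["si1", "si2", "si3", "si4", "si5", "si6"]))
```

So the shortcut merges 史/市 and 時/事, and you are relying on context to tell those apart.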


> If you are a non-native Cantonese speaker, and don't differentiate between "low falling" and "low level", native Cantonese speakers will still understand you.

Whilst it is true that in Cantonese some tones can be misused without loss of comprehension in a conversation, and the non-native speaker will still be understood if the surrounding context is clear and concise, that is not the case with the low falling tone, which is the most unforgiving of all. Cantonese speakers are prone to get thoroughly confused when the low falling tone is substituted for a flat low tone or a low rising one. Consider 墳墓 and 分母 when the context is insufficient to deduce which word was actually meant; it is perhaps not the best example but I can't think of a better one at the moment.

EDIT: 大麻, 大馬 and 大媽 from https://news.ycombinator.com/item?id=35870392 are better examples.

> It's difficult for a non-native speaker to distinguish between "low rising" and "mid rising".... so just treat it as a rising tone. I'm a native speaker and sometimes I forget which type of rising tone a particular word is.... I didn't learn it that way, haha.

Most native speakers of tonal languages are not even aware of the fact their native language has tones. They don't think about it, they don't think about the tones. Tones are a concept for speakers of languages that do not have the tones in the first place.


It depends. Some people classify it as 6 tones, some as 9. The extra three are "entering <x> level tone", which are sort of shortened versions of a different tone.

So: some words end in a stop, which is sometimes counted as a different tone even though the pitch pattern isn't different. For example, consider fan versus fat.

https://en.wikipedia.org/wiki/Checked_tone


That's my favourite pet peeve.

TLDR: No. There are 6 tones in Cantonese, the 9 "categories" are made referring to Middle Chinese.

---

Middle Chinese had 4 tones[1]. The 4th tone, "entering" (or "checked"), is words that end in stops (p/t/k). Because of the way it evolved, none of those words in Cantonese have tones 2, 4, or 5 (but not exactly, see below). In other words, they all have tones 1, 3, or 6.

To emphasize this observation and to make a connection to the 4 tones in Middle Chinese, some analyses call them tones 7, 8, 9, with names upper dark/lower dark/light entering[2].

But such an analysis has nothing to do with how a modern Cantonese-speaking brain processes the sounds. E.g. Cantonese has a tone change to tone 2 for the diminutive form; when this happens to a word that ends with p/t/k[3], the 9 tone framework cannot describe that.

---

Caveat: when I said "Cantonese" above I mean the dominant dialect of Cantonese spoken in Guangzhou/Hong Kong.

[1] https://en.wikipedia.org/wiki/Four_tones_(Middle_Chinese)

[2] https://en.wikipedia.org/wiki/Cantonese_phonology#Tones

[3] https://en.wiktionary.org/wiki/%E7%8E%89#Pronunciation


Japanese has two tones, which is something I didn't know until recently.


This is not true in a strict sense; Standard Japanese has a pitch accent that has a "culminative" pitch contour over a word. Culminativity means that there is a single point of prominence at maximum. In Japanese, this gets realised as a drop in the pitch. (In variants of Japanese, there are more elaborate systems.)

Tone systems are different in the sense that each syllable has its own contour. (Of course, when realized, these get merged according to various phonological processes.) Japanese differs from tone systems in that it has only one culminative pitch contour over multi-syllable words.

(Disclosure: I am an expert in Japanese phonology, especially in pitch accent.)


    Disclosure: I am an expert in Japanese phonology, especially in pitch accent.
+9000. Please post more!


Not sure what to post, but here are four interesting tidbits about Japanese pitch accent:

1) "Culminativity" is actually considered the most helpful functionality of the pitch accent. It helps making sense of word boundaries, as there can be only one prominent syllable per word. In spoken word of any language, there are no "spaces" (there are no spaces in written Japanese either, though), so languages need to provide accommodations for aural parsing strategies.

2) The other functionality provided by the pitch accent is distinctiveness, which means that there are some homonyms (similar-sounding words), that are only differentiated by the pitch accent. However, that is only a secondary functionality. In Mandarin Chinese, tones play a lot more important role for making sense of the meaning of the word. In Japanese, they play a role, but not a significant one. (There are some hundreds of minimal pair words that differ only by the pitch accent pattern, but as many regional dialects also have slightly different accent patterns, clearly 100% nailing the pattern isn't required for communication, as long as the general principles (such as one drop per word) are followed.)

3) Japanese is said to be an "isochronic" language, which means it has (or at least, is perceived to have) a simple integer-based rhythm. For example, Haiku, a famous genre of Japanese poetry, is based on these rhythms, the unit of which is called haku (拍) in Japanese or mora in Latin/English. In the context of poetry metrics, mora is often misrepresented as a "syllable", but actually, there are uni- and bimoraic (and rarely, trimoraic) syllables in Japanese. An example: 2-mora "ma-to" means a target (in archery etc). 3-mora "ma-t-to" means a "mat/carpet". 3-mora "ma-n-to" means a cloak. 3-mora "ma-to-o" means "let's wait". 3-mora "ma-to-n" means "mutton meat". 4-mora "ma-t-to-o" means "proper/straight". But every one of these is a two-syllable word!

4) The rhythm of Japanese speech is not concerned with the pitch accent prominence, unlike stress in English, which tends to make stressed syllables longer and louder (and more defined in vowel quality). Indeed, it is often said that "syllables" are an irrelevant concept for Japanese. However, phonologically that isn't true at all. The rhythm of Japanese is indeed dependent mostly on moras, but the pitch accent pattern, i.e. where the prominent drop of pitch tends to happen, is highly dependent on syllable structure. It never happens on "weak" morae, which are called the "coda" or tail of the syllable. It always happens on the start of a syllable.
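
The mora counts in point 3 can be checked mechanically from the kana spelling: every kana is one mora, except that small ya/yu/yo and small vowels fuse with the preceding kana. A minimal Python sketch; the kana spellings of the example words are my own transcription of the romanized forms above, and `count_morae` is a made-up helper.

```python
# Small kana that combine with the previous kana instead of adding a mora.
# Note that small tsu (っ/ッ), ん/ン and the long mark ー each count as a full mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in a word written entirely in hiragana or katakana."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


for word in ["まと", "マット", "マント", "まとう", "マトン", "まっとう"]:
    print(word, count_morae(word))
```

That gives 2, 3, 3, 3, 3 and 4 morae respectively, matching the list above.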


In the same way that Swedish has two tones. Bitonal systems aren't quite as difficult as monosyllabic polytonal languages like Mandarin or Cantonese, though. There are a handful of words in Japanese which are differentiated in pronunciation only by tone, but these are relatively rare. If you screw up the tone in Japanese it will sound like a bad foreign accent but you will likely still be understood. Just about every word in Mandarin on the other hand has one or more conjugate tone pairings, and if you screw up the tone you're speaking nonsense.

(Source: 10 years learning Japanese, followed by marrying someone from Taiwan.)


this is awesome. thanks for sharing.

there is a wealth of resources for learning mandarin but only a smattering for cantonese.

two favorites:

1. web dictionary http://www.cantonese.sheik.co.uk/scripts/wordsearch.php?leve...

2. iOS dictionary that is free and comprehensive, also covers mandarin: https://apps.apple.com/us/app/pleco-chinese-dictionary/id341...

unaffiliated with either company, just a longtime user.

other suggestions for cantonese resources welcome.


I make a Cantonese dictionary app for Windows/Mac/Linux, similar to Pleco (has multiple sources, coloured characters for tones, etc)!

https://jyutdictionary.com/


awesome, just sent an email about potentially sponsoring this.


What I'm after is a resource for learning Mandarin if you're already a speaker of Cantonese.

I've found it surprisingly hard to find a summary of the differences.

By contrast learning German from Danish seemed to have some bits that made it clear reasonably fast.


I'm not affiliated nor do I personally have any experience with this service, but have heard good things about the Canto To Mando Blueprint: https://www.thecmblueprint.com/


If you can read hanzi, can't you just start reading whatever you want in Mandarin? Obviously starting with something easy and then moving on to harder and harder texts, of course.

And can't you just start watching TV shows and movies in Mandarin? It should be a lot easier than for us mortals who start out without knowing any Sinitic languages. We have to work hard just to handle the tones and the near total lack of shared vocabulary.

I am puzzled that learning Mandarin for you hasn't gone much the same way as learning German did.

There is of course the standard way of playing with HelloChinese (or similar apps) and LingQ/Du Chinese (or similar apps) + reading easy readers (Mandarin Companion, Chinese Breeze, and similar). You should be able to speedrun them, compared to the long slog path that we mortals have to take.

(A fellow Dane with zero East Asian parentage.)


Yeah I'm more looking for shortcuts. With German it was really just a matter of seeing a bit of it, and then boom I could read the newspaper with no problem.

Cantonese and Mandarin weirdly seem to have wandered further from each other than Danish and German. At least for me, but that may also be lack of exposure.


Do you mind sharing why you are learning Cantonese?


our family grew up with cantonese as the second language. my mom is from hong kong and my dad from the southern part of china.

unfortunately, i failed to cherish this opportunity and spoke english predominantly as a child, leaving me with heavily-accented and vocabulary-limited cantonese.

i have spent an inordinate amount of time repairing these deficiencies and learning mandarin as well.

a little more self-awareness and foresight as a teenager could have saved me years of learning as an adult.

on the upside, self-learning has yielded insights into chinese language and culture, learning, and accents. unsure if these other perspectives are worth the extra time and effort, though. :)


consolidating other resource links from the thread:

* jyutping typing game https://chaaklau.github.io/cantorocks/


I asked my Asian friend if this font is a good way to learn Chinese. He said a better option was to get a girlfriend who only speaks Cantonese.

Noted.


I’ve been married to a native Mandarin speaker for years now… this strategy does not always work.


This is pretty good for someone who can speak Cantonese but can't read/write it.

As an example, I speak Mandarin and can't read/write (much) Chinese characters as I spoke it at home while growing up in Australia. So, I can imagine there'd be quite a lot who are in a similar situation to me but with Cantonese who would benefit from this (not just as a learning tool).

I've been using the Zhongwen[0] browser extension to "read" websites that have Chinese characters for many years, as hovering over Chinese characters will display a popup with the pinyin pronunciation. It may not be the speediest way of understanding a block of Chinese text.

I could imagine someone creating a browser extension that would replace the font used on the website(s) with the Cantonese Visual Font when the extension is enabled.

[0] https://github.com/cschiller/zhongwen


Something's not clear to me here: how does this handle words with multiple pronunciations using a font alone?


I am not a Cantonese speaker; however, in Mandarin, fonts with phonetic guidance are very common.

e.g., Hann-Tzong Wang's (王漢宗) free font collection[1] includes two typefaces with phonetic pronunciation guidance. These are wp{0..3}10-05.ttf and wp{0..3}10-08.ttf [2] As you can see from the filenames, there are actually four different font files for each of these two typefaces. The font files numbered {1..3} are for 「破音字」, characters with alternate pronunciation.

When a user types a word like 「給予」 (ㄐ一ˇㄩˇ/jǐ yǔ) for which there is an alternate, less-common pronunciation (ㄐ一ˇ/jǐ instead of ㄍㄟˇ/gěi for 給) they simply change the font for just the affected character to the variant with the correct pronunciation.

In the case of this Cantonese Font, the authors distribute a single .ttf (alongside a “phrasebook” .ttf whose purpose is not clear to me) and indicate in the Roadmap section of the website that ligature support must be enabled. If alternate pronunciations are common in Cantonese, then I suspect that they must use some ligature-based method. I would have to imagine there must be cases where this could be ambiguous, but I don't know how you would resolve those.

(In practice, just swapping the font on a single character works fairly well.)

[1] https://code.google.com/archive/p/wangfonts/

[2] https://dywang.csie.cyut.edu.tw/dywang/download/pdf/sample-o...


Thanks for sharing this. When I saw the link, I wanted to see if there was something similar for Mandarin. Looking at the PDF sample I don't see anything around how to pronounce the characters, i.e. there's nothing along the top or bottom that looks like pinyin with tonal markers. Am I missing something?


The fonts at the bottom of page one and the top of page two (王漢宗中明體注音 wp010-05.ttf and 王漢宗中楷體注音 wp010-08.ttf) include pronunciation guidance using zhùyīn fúhào 注音符號 (also called “Bopomofo” 「ㄅㄆㄇㄈ」 after the first four consonants): https://en.wikipedia.org/wiki/Bopomofo

This is the phonetic system that is used most commonly in Taiwan.

Typically, phonetic pronunciation guidance is used only in educational materials. For native speakers, this means materials only for very small children. However, in Taiwan, it's not uncommon to see 注音符號 guidance to indicate when a word should be said with a non-Mandarin pronunciation. You'll see this, for example, in shops whose names contain a pun when using the Taiwanese Hokkien pronunciation.

There are other fonts that include pronunciation guidance in hànyǔ pīnyīn 漢語拼音, the phonetic system used most commonly in China: e.g., http://fonts.mobanwang.com/200909/5832.html

I don't think there are any fonts with pronunciation guidance in any of the other phonetic systems (e.g., 通用拼音, Wade-Giles, 國語羅馬字) but these have almost all fallen into disuse and appear only in old signage, historical place names, or in people's names.

(Presumably, if you are born in Taiwan, you get to pick how you want your name romanized… especially since you may want the spelling of your name to match that of your relatives. But are you allowed to pick any transliteration you desire?)


The fonts at the bottom of the first page (wp010-05.ttf) and the top of the second page (wp[123]10-05.ttf etc.) have zhuyin (aka bopomofo) phonetic symbols [1] to the right of each character. These are common in Taiwan, whereas pinyin is used in the mainland. jamesdutc also used them in the comment above; e.g., ㄐ一ˇㄩˇ = jǐ yǔ.

[1] https://en.wikipedia.org/wiki/Bopomofo


Perhaps they use the same technology as ligatures? There could be a glyph for the standalone character, but also special glyphs for certain combos?

The page says they do handle variations:

  Pronunciation in the Cantonese Font adapts to the context. Based on what comes before or after, the Jyutping romanization changes to the right one. The magic behind this is a careful curation from 100,000 contexts where the pronunciation differs from the standalone character.


Hello. Font's author here. You and Jeff are correct in guessing this is (ab)using ligatures maximally :) To satisfy your curiosity, we can go deeper.

----

Conceptually it is simple:

1. assign a default (most likely) sound for each character,

2. loop through contexts, extracting words (char-combos) where the sound is different from the default ("alt-word"),

3. create SVGs + font-paths (fallback for incompatible systems) for every char and every alt-word,

4. assign a ligature to substitute each char-sequence that forms the alt-word (e.g., "when 乾 隆 appear adjacently, replace with `uniF1234`", the codepoint for the alt-word 乾隆).
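The steps above can be modelled roughly as follows. This is only a toy sketch in JavaScript, not the author's build pipeline: the tables `DEFAULT_READING` and `ALT_WORDS`, the function `annotate`, and the handful of sample readings are all assumptions made for illustration, and a real font performs the substitution in its ligature tables rather than in script.

```javascript
// Step 1 (assumed sample data): one default reading per character.
const DEFAULT_READING = new Map([
  ["乾", "gon1"],
  ["隆", "lung4"],
  ["杯", "bui1"],
]);

// Steps 2-4 (assumed sample data): "alt-words" whose reading differs from
// the defaults. In the font, each entry would be one ligature substitution.
const ALT_WORDS = new Map([
  ["乾隆", ["kin4", "lung4"]],
]);

// Returns [character, reading] pairs, preferring the longest alt-word
// that starts at the current position and falling back to the default.
function annotate(text) {
  const chars = Array.from(text);
  const maxLen = Math.max(
    ...Array.from(ALT_WORDS.keys(), (w) => Array.from(w).length)
  );
  const result = [];
  let i = 0;
  while (i < chars.length) {
    let consumed = 0;
    for (let len = Math.min(maxLen, chars.length - i); len >= 2; len--) {
      const word = chars.slice(i, i + len).join("");
      const readings = ALT_WORDS.get(word);
      if (readings) {
        readings.forEach((r, k) => result.push([chars[i + k], r]));
        consumed = len;
        break;
      }
    }
    if (consumed === 0) {
      // No alt-word starts here: use the default reading (null if unknown).
      result.push([chars[i], DEFAULT_READING.get(chars[i]) ?? null]);
      consumed = 1;
    }
    i += consumed;
  }
  return result;
}

console.log(annotate("乾杯")); // default reading for 乾
console.log(annotate("乾隆")); // alt-word reading for 乾
```

With this sample data, 乾 comes out as `gon1` in 乾杯 and as `kin4` in 乾隆, which is the kind of context switch the ligatures are meant to provide.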

It is not perfect, but I didn't expect this to work so well, and was stunned when the testers reported high accuracy. I had always believed that bespoke computation with word segmentation (with some 1M frequency-attached library) and a large data bank (100k+ words) was necessary.

----

Practically it was a horrific, tedious, mind-numbing, gawd-awful set of "why this doesn't work":

1. SVG automation that works for 10^3 breaks with 10^5

2. what worked for Latin breaks for Unicode

3. what worked for Unicode breaks for PUA

4. what worked for monochrome breaks for color

5. what worked for single glyphs breaks for ligatures

6. what?! The assignments in the database are wrong??

7. [...]

As I was trying to coerce the system to do what it wasn't designed to do, many of these breaks are undocumented, pretty mysterious to solve, and some steps just got manually gritted through. (And each of the 15k+ glyphs got gritted through about five times.)

It does look pretty elegant at the end ;)


In the FAQ you mentioned

> Unfortunately, without being able to do proper word segmentation, this will remain a limitation.

Can the user manually add a zero width space to help?


Technically yes, but the general public probably doesn't have a concept of zero-width space.

(For everyone else wondering what ackfoobar is proposing: let's take the phrase (if you don't read Chinese, just treat them as shapes) 香港地少人多, which, properly segmented, is 香港.地少.人多. The font treats this incorrectly: because "香港地" is a commonly used fragment in which 地 has a special sound, it parses the phrase as 香港地.少.人多 and gives a mistaken sound for 地.

Ackfoobar is absolutely correct that we can coerce the correct reading by going 香港[ ]地少人多 --- where the [ ] is an invisible spacer. My contention is that most users don't know how to do that in their favorite word processor.

Someone is probably thinking: could you add "香港地少" as a fragment? A purist would say it's not pretty, but I'm a pragmatist, so I did do many of these patches. Whether to do this or not relies on some acumen as a native speaker, and there were hundreds of these decisions made. This language knowledge would be necessary if someone were to do Mandarin (or Thai or, ...))
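To make the zero-width-space trick concrete, here is a toy model in JavaScript. It is a sketch under assumptions: the fragment list `FRAGMENTS` and the function `matchFragments` are made up for illustration, and a real text-shaping engine may treat invisible characters differently from this simple greedy matcher.

```javascript
// Hypothetical fragment table: sequences that would be replaced by a
// single ligature glyph. Not the font's real data.
const FRAGMENTS = ["香港地", "香港", "人多"];
const ZWSP = "\u200B"; // zero-width space

// Greedy longest-match over the fragment table. A ZWSP never appears in a
// fragment, so it prevents a match from spanning it; it is then dropped
// from the output because it is invisible anyway.
function matchFragments(text) {
  const chars = Array.from(text);
  const maxLen = Math.max(...FRAGMENTS.map((f) => Array.from(f).length));
  const out = [];
  let i = 0;
  while (i < chars.length) {
    let matched = null;
    for (let len = Math.min(maxLen, chars.length - i); len >= 2; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (FRAGMENTS.includes(candidate)) {
        matched = candidate;
        break;
      }
    }
    if (matched) {
      out.push(matched);
      i += Array.from(matched).length;
    } else {
      if (chars[i] !== ZWSP) out.push(chars[i]);
      i += 1;
    }
  }
  return out;
}

console.log(matchFragments("香港地少人多"));
// → [ '香港地', '少', '人多' ]  (地 swallowed by the 香港地 fragment)
console.log(matchFragments("香港" + ZWSP + "地少人多"));
// → [ '香港', '地', '少', '人多' ]  (地 stands alone, so it keeps its default sound)
```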


This is an awesome piece of work - congratulations!

I notice you're using OpenType-SVG here; have you investigated whether it would be possible to implement this using COLRv1 (which would potentially result in a lighter-weight font, I suspect, and eventually wider support)? Or are there technical limitations in COLRv1 that make it impossible?


Color fonts really haven't converged on a standard, and their adoption is slow. OpenType-SVG was accepted 10 years ago, and it was implemented in FreeType only one year ago --- it hasn't even trickled down to most Linux distros (nor is it usable on Windows). I don't see COLRv1 in Win/Mac/Linux until 2026 at the earliest.

But I did try to make it into COLRv1 (as well as COLR/CPAL). The only tools that build COLRv1 right now are the tools from the Google Fonts team; I remember them stalling for hours before reporting completion, yet the output was broken (I can't remember how it was broken).

I personally would love to see a COLR/CPAL version, and have some idea on how that could happen. But I probably should be working on some revenue-generating product instead ;)


That is amazing work. You've really plumbed the depths of what's possible with font technology, kudos.


Thank you. Who is really amazing is Simon Cozens, who wrote a set of articles on fonts/global script: https://simoncozens.github.io/fonts-and-layout/

The history of digital fonts added a great deal of complexity to font formats, and without him writing such a concise yet comprehensive guide, I would have been stuck for even longer.


I've seen ligatures (or whatever the underlying feature is in font formats) used for some wild stuff, but this takes the cake. They're effectively encoding a small amount of natural language processing in a font.

Setting aside for a minute the question of whether you _should_, I wonder how far you can take this? I.e. what limits are there on how much context you can take into account, etc.?


(Font author)

This is all off the beaten path, so I suspect the answer is that no one knows. Font tables have a limit of 65k characters, but this ceiling can be busted in wacky ways using multiple lookups, useExtension... Practically, font building tools / operations crash (mysteriously), stall (mysteriously), or slow to a crawl (indistinguishable from stalling), and the Cantonese Font just about pushes the limit.


I think it also uses ligatures to render words from the "phrasebook", where you can take a word like "friend" and change the font from Times to this font, and it renders it as 朋友. Beautiful.



Maybe it uses font ligatures to change based on the surrounding characters.

https://en.wikipedia.org/wiki/Ligature_(writing)


You are correct!


As far as I know, Mandarin doesn't have multiple pronunciations for the same character -- does Cantonese? Aside from that, you could use ligatures for that, couldn't you?


Mandarin definitely has many characters with multiple pronunciations. One large class come from literary vs. colloquial reading differences: https://en.wikipedia.org/wiki/Literary_and_colloquial_readin...

Another large class comes from vestiges of derivational morphology in Old Chinese: https://en.wikipedia.org/wiki/Homograph#In_Chinese For instance, the character 度 in modern Mandarin can be pronounced dù (when used as a noun) or duó (when used as a verb), which derive from Old Chinese /daːɡs/ and /daːɡ/, respectively.

With Simplified Chinese characters, some of them come from the merger of originally different words that had similar, but not exactly the same pronunciations. For instance, both 髮 (fà) and 發 (fā) were merged into 发.


For the example you give, in Cantonese they're dou6 and dok6, respectively.

The 3/6 tones in Cantonese and ˋ (4th, falling tone) in Mandarin are the "departing" tone, which comes from the departing tone in Middle Chinese, which I believe comes from the -s ending in Old Chinese.


Mandarin absolutely does:

* 行: xíng or háng

* 的: de or dì

* 长: cháng or zhǎng

(plus I'm sure many more that I can't think of just right now)


Did it get more numerous with the adoption of simplified characters?


Yes, it did if you're not a Mandarin speaker. The simplification process was biased towards Mandarin, and there are some words that were merged that have different pronunciations in Cantonese but not in Mandarin.


I'm not sure, but I believe that most, if not all, mergers from simplification were homophones.


了 as le or liǎo, too.


In Mandarin there are actually different pronunciations depending on context.

Example:

觉得 juede, to think

睡觉 shuijiao, to sleep

Here the same character is pronounced jue or jiao depending on context.


Both Mandarin and Cantonese actually have multiple pronunciations for the same character. Here is an example in both:

- 说服/說服 Mandarin: shuì fú Cantonese: seoi3 fuk6

- 说话/說話 Mandarin: shuō huà Cantonese: syut3 waa6


Both do. A single, isolated Chinese character may have multiple unrelated meanings with some of them having an entirely unrelated pronunciation. It is, in fact, ubiquitous.

The idea that each honzi has exactly one meaning is a misconception.

With respect to ligatures, if by that you mean the length of the same word across different Sinitic languages, that depends on the specific language and its phonology. Mandarin, for instance, has lost a large number of finals over the course of its evolution, which has resulted in words generally being longer and requiring extra syllables to resolve the phonetic ambiguity. The Sinitic languages that have retained more finals (and sounds in general) tend to have more short words. Cantonese is one of them, albeit not the only one.


For those curious, the romanization system here is Jyutping: https://en.wikipedia.org/wiki/Jyutping

That's new to me; previously I had only seen Pinyin and Wade-Giles:

https://en.wikipedia.org/wiki/Pinyin

https://en.wikipedia.org/wiki/Wade%E2%80%93Giles

Wikipedia has a nice article on the history and a number of other systems:

https://en.wikipedia.org/wiki/Romanization_of_Chinese


Happy to see this here. I think there's tons of potential for making Cantonese easier to learn. The big difficulties I've had as an English speaker learning it are:

1. Multiple romanisation formats (Jyutping vs Yale)

2. Many community-led dictionaries with varying completeness.

3. Many web resources for learning words/phrases/etc use a mixture of traditional characters, jyutping, yale, or something else.

It's very difficult to find the content in the format a learner needs. Hopefully something like this will help learners use content written using traditional characters.


(Font author here)

I whole-heartedly agree. I am a native speaker, and "fluent" in jyutping, yet I have such a hard time with Yale.

One service I'm going to build is a mapping tool between {R1, R2, ...Rn} and {G1, G2, ...Gn}, where R is a romanization method and G are y/z-variants of glyphs. (These, for the most part, already exist inside packages I built for building the font, and just need a UI to expose them to the world.) It would sure save me lots of time trying to read Matthews-Yip...


I was thinking the same thing. Perhaps creating an API around PyCantonese?

My thought is that if there's a common data format for a Cantonese sentence with jyutping/yale/traditional + translation(s), the user could then pick what to display.
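As a rough illustration of what such a common format might look like, here is a minimal JavaScript sketch. The record shape, the field names (`tokens`, `traditional`, `jyutping`, `yale`, `translations`) and the `render` helper are all hypothetical and do not correspond to PyCantonese or any existing standard; the Yale field is deliberately left unfilled.

```javascript
// Hypothetical sentence record: one entry per word, with parallel
// representations, plus translations keyed by language code.
const sentence = {
  tokens: [
    { traditional: "唔該", jyutping: "m4 goi1", yale: null },
    { traditional: "你", jyutping: "nei5", yale: null },
    { traditional: "畀", jyutping: "bei2", yale: null },
    { traditional: "佢", jyutping: "keoi5", yale: null },
    { traditional: "本", jyutping: "bun2", yale: null },
    { traditional: "書", jyutping: "syu1", yale: null },
    { traditional: "我", jyutping: "ngo5", yale: null },
  ],
  translations: { en: "Please give me his book" },
};

// Renders the sentence in whichever representation the user picks,
// falling back to the traditional characters where a field is missing.
function render(record, field) {
  const joiner = field === "traditional" ? "" : " ";
  return record.tokens.map((t) => t[field] ?? t.traditional).join(joiner);
}

console.log(render(sentence, "traditional")); // 唔該你畀佢本書我
console.log(render(sentence, "jyutping")); // m4 goi1 nei5 bei2 keoi5 bun2 syu1 ngo5
console.log(sentence.translations.en);
```

Keeping the data per word rather than per character would also make the placeholder idea easier, since a learning exercise could swap out a whole token.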

It could then also be worked into games/learning exercises. Placeholders could be made with a number of options so users could learn how to slot different adjectives into sentences, for example.

(I have the same username on Reddit, by the way. Sorry I never got to test it out for you!)


This is amazing and very useful!

Does anyone know of such a font for Mandarin?



Seriously. I could really use this right now for Mandarin.


A friend of mine noticed the website thumbnail image is FizzBuzz but in classical Chinese.


The code in that image [1] is written in wenyan-lang [2], a programming language based on Classical Chinese.

It was previously featured on HN here: https://news.ycombinator.com/item?id=22213406

[1] https://visual-fonts.com/#jp-carousel-1092 [2] https://wy-lang.org/


Do you mean the og:image? It's Sieve of Eratosthenes, not FizzBuzz.


I find it, well, outrageous, that Google translate doesn’t support Cantonese.

Nationalism at its worst.


Gboard does, although for text input only. It supports both Yale and Jyutping romanisations.

Microsoft Bing Translator does support Cantonese tho.


Isn't it listed as "Chinese Traditional"?


No, "Chinese Traditional" refers to the character set used to write each character[1], but the actual text is still written in accordance with the vocabulary and grammar of Standard Written Chinese, which is based on spoken Mandarin.

As an example, this is the sentence "Please give me his book" written in Standard Written Chinese using "Chinese Traditional" characters: 請你給我他的書。

If you use "Chinese Simplified" characters[2] instead, it would still be the same words, but some of the characters have simpler forms: 请你给我他的书。

However, both of those renderings still follow Mandarin vocabulary and grammar. Even though Cantonese speakers generally read and write Standard Written Chinese (in either Traditional or Simplified characters), if they were to actually convey that sentence in spoken Cantonese, it would actually be quite different. Written Cantonese[3] is generally only used in informal contexts, but a rendering of the sentence in Cantonese would instead be: 唔該你畀佢本書我。 (Traditional) 唔该你畀佢本书我。 (Simplified)

The written Cantonese version uses vocabulary and grammatical constructions that are not part of Standard Written Chinese, and Google Translate is currently not able to translate to written Cantonese. I've found that over time though, Google Translate has been getting better at translating from written Cantonese to English (however, for the example I just gave, it appears it still completely botches the translation; it currently thinks it translates to "You shouldn't have let her book me").

[1] https://en.wikipedia.org/wiki/Traditional_Chinese_characters [2] https://en.wikipedia.org/wiki/Simplified_Chinese_characters [3] https://en.wikipedia.org/wiki/Written_Cantonese


I just tried Bing Translate, and even though it nominally supports "Cantonese (Traditional)", their translations don't seem to be much better for the example sentence I gave.

"Please give me his book" is translated to: 請把他的書畀我。 While it does use the usual Cantonese word for "give" (畀), the rest of the sentence still mostly uses Mandarin vocabulary and grammar, notably the 把 construction, the third-person pronoun 他, and the use of the 的 genitive particle, which aren't typically used in spoken Cantonese.

Conversely, 唔該你畀佢本書我 is translated to: "Please give him this book me." This seems to indicate that it doesn't understand Cantonese's ability to use measure words (such as 本) as genitive particles, nor does it understand that the word order of the direct and indirect objects in the sentence are switched compared to the usual order in Standard Written Chinese.


ChatGPT seems to fare much better:

> Translate "please give me his book" to Cantonese.

> > "請俾我佢嘅書" is the Cantonese translation for "please give me his book".

This appears generally fine though stylistically not as colloquial. Using 請 for "please" is considered more formal since it's usually more written than spoken. In addition, it uses the Cantonese genitive particle 嘅, though I think most speakers would prefer to use the measure word 本 in this context, but this is still perfectly grammatical. Lastly, it still retains the same word order for direct and indirect objects as Standard Written Chinese, which while not completely colloquial, is still considered acceptable when spoken.

BTW, since written Cantonese isn't standardized, 畀 can often be seen written as 俾, but they refer to the same word.

> Translate "唔該你畀佢本書我" to English.

> > The Cantonese phrase "唔該你畀佢本書我" translates to "Please give me his book" in English.

Exactly correct.


Google translate Cantonese: 唔該你畀佢本書我

Real life Cantonese: 拎佢本書黎thx


As a native HK Cantonese speaker, I find it surprising that you really cannot maintain the word sequence …


I use a similar one for Mandarin. Here's a cool little chrome extension I built to force the use of the font for all Hanzi/Han characters in a webpage.

```javascript
// This script changes the font of all Mandarin characters on a page.

const url = chrome.runtime.getURL('font/Hanzi-Pinyin-Font.top.ttf');

const fontFace = `
  @font-face {
    font-family: ZhongwenFont;
    src: url(${url})
  }
`;

window.onload = function () {
  const styleSheet = document.createElement("style");
  styleSheet.innerHTML = fontFace;
  document.head.appendChild(styleSheet);
  // For static pages, handle all elements:
  document.body.querySelectorAll("*").forEach(handleElement);
  // For dynamic apps, like Twitter, observe all DOM mutations
  // henceforth within document body:
  addMutationObserver();
};

function handleText(textNode) {
  // Regular expression to match any Chinese (Han) character
  const regex = /\p{Script=Han}/u;
  if (regex.test(textNode.nodeValue)) {
    const { style } = textNode.parentElement;
    style.fontFamily = 'ZhongwenFont';
    style.fontSize = "large";
  }
}

function addMutationObserver() {
  const observer = new MutationObserver(function (mutations) {
    mutations.forEach(function (mutation) {
      mutation.target.querySelectorAll("*").forEach(handleElement);
    });
  });
  observer.observe(document.body, { subtree: true, childList: true });
}

// Restyle the direct text children of an element, skipping user input.
function handleElement(e) {
  e.childNodes.forEach((child) => {
    if (child && !isUserInput(child) && child.nodeName === "#text") {
      handleText(child);
    }
  });
}

// Some helper functions used by the code above:
function isUserInput(node) {
  const tagName = node.tagName ? node.tagName.toLowerCase() : "";
  return (
    tagName == "input" ||
    tagName == "textarea" ||
    isInsideContentEditable(node)
  );
}

function isInsideContentEditable(node) {
  while (node.parentNode) {
    if (node.contentEditable === "true") {
      return true;
    }
    node = node.parentNode;
  }
  return false;
}
```


Does anyone know of good tools to practice tone-listening and recognition for Cantonese? Or tools for drilling jyutping more generally?

I've used https://chaaklau.github.io/cantorocks/, but I would like to also try others, as the audio is not great sometimes and I'd like to listen to different voices.

When studying Mandarin I found that getting the phonetics right early on really helped me; it can be done in a relatively short time and it's hugely motivating.



A package to power GNU Emacs chinese-ctlaub input method with 100% font coverage is needed. I see TOFU in the browser and emacs, for example

* https://humanum.arts.cuhk.edu.hk/Lexis/lexi-mf/shuowenRadica...

and there are graphs that can't be input by emacs chinese-ctlaub. Associating a cognitive complexity load score for each graph to guide learning would help.


I honestly wonder how good of a job one could do with Chinese at making a font that just translated each character into English?

Especially given modern font ligatures letting you change the meanings displayed based on context (the way things like Fira code do).

Maybe don't even show the Chinese characters, just their meaning in English? It's obviously not going to be perfect, but I think there's potential there for a fun project.


Short answer is horribly. Chinese is not one word per character. Not even close. Most words you encounter on a regular basis are two-char compounds, but individually each character would tell you less about the whole than ‘camp’ and ‘ground.’ Many words are four-char compounds.

You could potentially design ligatures for every word in the dictionary, but given the vastly different grammar, you wouldn’t be able to understand what a sentence said. You might be able to glean ‘this sentence involves a dog.’ Even Google Translate makes such poor translations that it can really only assist in understanding. It usually doesn’t understand the subject of the sentence and confuses all the pronouns into ‘I.’


Like with any language, there’s gonna be a lot of context-dependent words/phrases with multiple meanings that are hard to segment/parse/translate correctly. Things like DeepL or GTranslate take into account probabilities for segmentation and grammar (or use ICU libraries); but that’s harder to do from a context of using ligatures and basic font engine features.

e.g. The classic example is 大麻煩 - is it 大|麻煩 (a big inconvenience), or is it 大麻|煩 (marijuana annoyance)? Is 粉絲 fan, or vermicelli? Is 早唞! “good night!” or “go fuck yourself!”?
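The 大麻煩 ambiguity can be reproduced with a few lines of JavaScript. This is only a sketch: the tiny dictionary `DICT` and the function `segmentations` are assumptions for illustration, and real translators rank the candidates with probabilities rather than just listing them.

```javascript
// Enumerates every way to split `text` into words found in `dictionary`.
function segmentations(text, dictionary) {
  const chars = Array.from(text);
  const results = [];
  function walk(start, path) {
    if (start === chars.length) {
      results.push(path);
      return;
    }
    for (let end = start + 1; end <= chars.length; end++) {
      const word = chars.slice(start, end).join("");
      if (dictionary.has(word)) walk(end, [...path, word]);
    }
  }
  walk(0, []);
  return results;
}

// Hypothetical mini-dictionary, just enough to show the clash.
const DICT = new Set(["大", "大麻", "麻煩", "煩"]);

console.log(segmentations("大麻煩", DICT));
// → [ [ '大', '麻煩' ], [ '大麻', '煩' ] ]
```

Both splits are valid dictionary parses, so a font that can only do fixed substitutions has no principled way to choose between them.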


Marijuana literally means "big numb" in Chinese? Or "big horse" if I mispronounce it?


Haha, yeah, works that way for Cantonese and Mandarin. 大麻 daai6 maa4/dà má (marijuana, literally ‘big numb’) vs 大馬 daai6 maa5/dà mǎ (big horse).

Bonus: 大媽 daai6 maa1/dà mā (auntie or father’s elder brother’s wife, literally ‘big mother’).


麻 isn’t always numb. It’s also a generic term for all kinds of flax or fibrous plants like hemp. For example, sesame is 芝麻.


Iirc, opentype font tables are Turing complete so... you could put an AI translator in a font! The sky's the limit.


Funded by ... a dance studio! How wonderful. I walked past it the other day. Perhaps I'll pay them back for funding the creation of this font by taking a Tango class :)

> The Cantonese Font is a piece of culture funded from our savings, and from revenue from our dance studio in Causeway Bay. If you are in Hong Kong, come join our fun, friendly, best Argentine Tango classes for all levels at www.eli.dance


Absolutely. Eli is a superb dancer and teacher. And if you are under-25, we even offer 50% off the group classes!


That looks good! I’ll have to give it a try. I’ve been trying to make some resources to encourage my son to pick up more from my wife and have been using Hambaanglaang's graphical generator[1]. It’s good but can be a bit fiddly to use.

[1] https://hambaanglaang.hk/software-tool/#GC%20Generator


This wouldn't work in the same way in Japanese because characters can have different readings depending on the context. Could that be overcome with font ligatures? Still, even if we could overcome that massively complex task, some groups of characters (i.e. the aforementioned context) can still have multiple ways of reading them.


If you scroll down on the article in the OP, you'll see that Cantonese readings of Chinese characters are also context-specific, and they appear to have solved that problem with ligatures.


The proper solution for Japanese is a time machine to the 1940s and instituting a proper script reform instead of the touyou kanji list and Shinjitai simplifications. :)


I see it, I understand what's going on - it's clear, but I still cannot read it. How do you read those pronunciations?


You have to understand both how the jyutping romanization works (i.e. what sound 'bat' makes) as well as what the tones in Cantonese are. I'll let you read about the initials and finals (mostly consonants and vowels). There are some tricky ones: the 'c' in cat is pronounced more like an aspirated 'ts'.

There are six or nine tones in Cantonese:

- high

- middle rising

- middle

- low rising

- low

- low falling

There are also three "entering" tones that some people don't distinguish at all, as they're sort of just the high/middle/low tones but short. I couldn't tell you whether they're really separate tones or just short versions of the first six.

You can see how the tone marks on the font line up with the tones here -- if you know them, the font marks are a reasonable guide.
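For reading the annotations programmatically, a Jyutping syllable can be split into its letters and its tone digit. The sketch below is an illustration only: `parseJyutping` and `TONE_NAMES` are made-up names, the tone labels follow the usual numbering of the six Jyutping tones, and the optional `*N` suffix is the changed-tone notation (e.g. `faan4*2`) mentioned earlier in the thread.

```javascript
// Usual descriptions of the six Jyutping tone numbers.
const TONE_NAMES = {
  1: "high level",
  2: "high rising",
  3: "mid level",
  4: "low falling",
  5: "low rising",
  6: "low level",
};

// Splits e.g. "faan4" or "faan4*2" into its parts; returns null if the
// input does not look like a Jyutping syllable.
function parseJyutping(syllable) {
  const m = /^([a-z]+)([1-6])(?:\*([1-6]))?$/.exec(syllable);
  if (!m) return null;
  const [, base, tone, changed] = m;
  return {
    base,
    tone: Number(tone),
    toneName: TONE_NAMES[tone],
    changedTone: changed ? Number(changed) : null,
    // "Entering" tones are the short syllables ending in -p, -t or -k.
    entering: /[ptk]$/.test(base),
  };
}

console.log(parseJyutping("faan4"));
console.log(parseJyutping("faan4*2"));
console.log(parseJyutping("bat1"));
```

Here `bat1` is flagged as an entering-tone syllable because it ends in a stop, which matches the "same pitch, but short" description above.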


> There are some tricky ones: the 'c' in cat is pronounced more like an aspirated 'ts'.

I would be perfectly happy if a non-native speaker pronounces "c" and "z" like the English "ch" and soft "g".


For a rough guide of the sounds, the pronunciation key in Wikipedia does a good enough job.

https://en.wikipedia.org/wiki/Help:IPA/Cantonese


The font annotates each character using the Jyutping system for Cantonese pronunciation: https://en.wikipedia.org/wiki/Jyutping


Amazing. I started learning mandarin about 10 years ago, and spent a good deal of effort building flash cards where characters and accompanying pinyin were color coded by tone. If this font had existed then, I would've adapted it to mandarin and then used the shit out of it.


I really like this idea, though I haven't tried it yet. Spotify has lyrics for many songs, but for characters the first hurdle is you cannot really sing along even with the lyrics unless you know them.

Is there some easy way one can change the font used by Spotify to be this font somehow?


(Font author) I have tried, but I couldn't find a way to change font in Spotify app. If you find a way I'd love to know.


Wouldn't this be better handled as a font rendering engine feature so that it wouldn't be tied to a specific font and users would be able to configure which language to display the pronunciations in?

Good stopgap though. Do they have plans to do the same for other languages?


My girlfriend’s family strictly speaks Cantonese and I know nothing about it. I was just talking with a friend in a similar situation who lamented the lack of resources for Cantonese. This is awesome! Does anyone have any other hidden-gem Cantonese learning resources?


Unaffiliated with this group, but there are a bunch of graded readers, romanized with Jyutping, freely available, and very high quality![1]

If you’d prefer physical books, this online store imports quite a few from Hong Kong.[2]

A dictionary program for Windows/Mac/Linux that I make [3], and another for mobile that I don’t make.[4]

Finally, a list of other good resources from Cantonese Alliance.[5]

[1] https://hambaanglaang.hk/

[2] https://kozzi.ca/

[3] https://jyutdictionary.com/

[4] https://www.pleco.com/

[5] https://cantonese-alliance.github.io/language.html


Here's a side project of mine: https://cantowords.com/

It's an offshoot of https://words.hk (which has a Chinese interface). AFAIK we're the only modern Cantonese-Cantonese and Cantonese-English dictionary that is both comprehensive (50k+ entries) and word (詞) focused (i.e. not only explaining meaning of characters).

It's a huge undertaking, we've been on this for almost 10 years, and probably almost half-way done.

FWIW words.hk has quite a bit of traffic (from native/near-native speakers presumably) due to being well indexed by Google, but cantowords.com is relatively new and could benefit from more attention :)

PS: some of our content has made its way into Cantonese dictionary apps as mentioned in sibling post(s), but the web site contains the most up-to-date content (which does get updated on a day-to-day basis as editing work proceeds)


Cool! I made an effort to study Cantonese about 10 years ago and wish I'd come across these sites!

At the time, I heavily relied on another great Cantonese dictionary/resource site: http://www.cantonese.sheik.co.uk/dictionary/

In addition to the Jyutping Cantonese pronunciation, it also shows the corresponding Hanyu pinyin pronunciation for Mandarin. There are also a number of learning resources on the site. I love how you can mouse over any character to get a pop-up to learn more about the character: its meaning as a single character, pronunciation, etc.


If you have basic Mandarin knowledge, you could try Duolingo's Cantonese course for Mandarin speakers. So far, I think it is helpful for getting familiar with how words are pronounced and what expressions people use in vernacular Cantonese. As with the Mandarin course, there is little explanation, so you kind of have to complement it, e.g. with some drilling for tones.


This may look like an ad, but I was satisfied with the Cantonese101 app (except for the stupid fact that flashcards are not in the mobile app, only in the (responsive) web-based app). I went through 2 "levels" but I failed to start using the language on a daily basis, which is why I still don't really know it (my fault).


This is great! Desperately need this for Japanese!


I'm confident that either this already exists in Japanese or that apps/tools can do it.

Japanese children learn kanji with furigana, so there are solutions for them.

Some examples I found:

- https://support.microsoft.com/en-us/office/use-furigana-phon...

- http://www.furiganizer.com/static/about.html

- https://www.youtube.com/watch?v=2Fmw2IrRUXo


The furiganizer.com website does not seem to work :(

However, the third link, with Mac Pages, is really great.


Very interesting! Does anyone know of a similar font for Thai? Or information on how to make such a thing?


Does anyone know if something like this exists for Thai?


Trying to learn pronunciation through some sort of visual language annotation is one of the most counterproductive ways you could approach it. Pronunciation varies subtly from person to person and even from situation to situation. All this information can only really be conveyed by actually listening to people speak, whereas most systems for transcribing pronunciation have to optimize for regularity. The end result is that a transcription only conveys the minimum amount of phonetic information needed to distinguish between morphemes. If you add more information, the categories become more and more subjective and harder to distinguish. For example, try to do some IPA transcriptions for a language you do speak, or listen to trained linguists try to pronounce words in a non-native language.

Think of it as trying to compress several kilobytes of information down to several bytes and then trying to reconstruct the original data entirely on the CPU, when you have dedicated hardware that is several orders of magnitude more powerful and uses an incompatible black-box compression scheme.



