Jōyō kanji variants: The curious case of 叱 and 𠮟 (2016) (namakajiri.net)
160 points by yuhong on Nov 11, 2017 | 149 comments



> Can you spot the difference between 𠮟 and 叱? Me neither.

Of course I can, not because I am Chinese, but they really do look different. How can you NOT see the difference? Also they have different meanings.

Edit: Here is one that’s really hard to recognize, especially in handwriting, and that is often written wrong if you’re not careful.

已 vs 己

This puzzles many Chinese people...

Can you spot the difference? 已 has to do with time (stopped) 己 is self

So 自己 is me/self, while 已經 means already.

There is also 巳, which on the ancient Chinese clock means 9-11am, I believe.


I can't read any of these characters but to me the difference between 已 vs 己 is as clear and obvious as the difference between 𠮟 vs 叱


The question is if you would recognize the difference if they weren’t side by side.

I am trying to teach myself Japanese, which allegedly is easier than Chinese. Hiragana as far as I can tell makes no internal sense. The characters for yo and ya don’t incorporate elements from o or a. And when you are done with Hiragana you realize that it is on its own not enough to actually communicate in Japanese, so now in addition to those 46 characters, you have to memorize 2-4 thousand kanji. Oh man that is disheartening. I keep plugging away but it is slow going.


> I am trying to teach myself Japanese, which allegedly is easier than Chinese.

Who did you hear that from? The Japanese writing system is the Chinese one, minus all of the internal consistency, plus lots of characters the Chinese aren't traditional enough to keep in the language. It often uses the same character to write distinct Chinese loanwords which differ wildly in pronunciation based on when and from whom the Japanese heard them, and they often have different shades of meaning.

Chinese, on the other hand, uses only one writing system in any given body of text, uses the Latin alphabet with tone markings for phonetic spelling for learners, and is much less reliant on garbled loanwords.


If you know English, the loanwords become immediately accessible. Also Japanese dictionaries could be easier to check, since you just need to know the pronunciation.

Your conception of the writing systems is quite weird; consistency (if any) is the same, I would say. I do not get what you mean by characters being "traditional" or not.

Chinese loanwords are more garbled, with different methods of assigning them random chinese characters. And the Japanese use Latin characters to mark pronunciations also, minus the tone markings. (Let me introduce you to -- bopomofo)


English loanwords are only a tiny part of either language. The parent was pointing out that in Chinese 生 is always sheng, while in Japanese it can be any of sei, shou, san, nama, ikiru, ikeru, ikasu, umu and 50-odd more readings depending on context.

https://en.m.wiktionary.org/wiki/%E7%94%9F


> If you know English, the loanwords become immediately accesible.

That's a common misconception (on the level of "katakana is for loan words"):

I switched my televi off and rode my autobike from my mansion to the cleaning to pick up my Y shirt because my car's front glass broke. On the way home I stopped at the conveni.

Televi: TV; Autobike: motorcycle; Mansion: apartment; Cleaning: dry cleaning; Front glass: windshield; Conveni: convenience store

> Chinese loanwords are more garbled, with different methods of assigning them random chinese characters.

This was standard in Japanese too (ateji). It is more common now to just make a loanword, but if someone doesn't learn them they may be confused about why Japanese abbreviate America as "rice".

米 (US) comes from 亜米利加, A-ME-RI-KA (亜 is for Asia), e.g. 米軍 (US military) or 米ドル (US dollar)


IIRC 米国 (mi guo / "rice country") is used in Taiwan as well. The other Chinese word for America (美国 / mei guo / "beautiful country") is also a phonetic thing.

(I used to think 美国 was descriptive since it didn't sound phonetic, until I came across 米国 and was confused as to why anyone would describe the US as "rice country", and then I learned that both were phonetic, but for the "me" in "america")


You call it a misconception and then post an example that's extremely easy to understand? I'm confused.


I'm talking about the Chinese loanwords; I agree, the English and generally European loanwords are written phonetically and are pronounced consistently.


> If you know English, the loanwords become immediately accesible.

Bit of a stretch IMO - the pronunciation and usage can still often be different enough to trip you up, especially when you consider there are non-English loanwords on occasion (e.g. playing Persona 5, it took me far too long to realise that ルブラン, which I naively transliterated as ruburan, was actually the French "Le Blanc").


They don't make internal sense because they're originally derived from Kanji/Hanzi and more or less arbitrary:

https://en.wikipedia.org/wiki/Hiragana#History

As a Chinese speaker learning the kana, I found knowing the etymology of the kana somewhat useful as a learning tool, but obviously this is of no help if you don't already know Chinese.

Also, Mandarin (and other Northern Chinese) pronunciations have changed quite a bit more than southern dialects, so knowing Cantonese is more helpful than Mandarin, e.g.

か 'ka' comes from 加, which is 'jia' in Mandarin but 'ga' in Cantonese

A syllabary that 'makes sense' would be Korean Hangul, where the relationships between the various parts of the mouth were purposefully encoded into the relationships between the components of glyphs.


"And when you are done eh Hiragana you realize that it is on its own not enough to actually communicate in Japanese, so now in addition to those 46 characters, you have to memorize 2-4 thousand kanji. Oh man that is disheartening. I keep plugging away but it is slow going."

Don't worry about "learning kanji": learn words. A few kanji are used in isolation as words, but most kanji have meanings that you can only infer from context (as part of words). Furthermore, even if you "know the meanings" of kanji, you'll find you just have to memorize words anyway, because you usually can't combine the meanings of the kanji to come up with the meaning of a word. Not easily, anyway.

Too many westerners try to learn Japanese by memorizing 2,000 kanji, because that's easier than memorizing 30,000 words. It doesn't work. The way you "learn kanji" in immersion language schools is that you get word lists, and you're quizzed on how to read (i.e. write the readings) and write (i.e. write the kanji) those words. You quickly start to see patterns.

Memorizing lots of words is painful, but then, language learning is painful. This method comes with the huge advantage that the sooner you start learning words, the sooner you'll get positive reinforcement from the real world. You won't get that by memorizing kanji.


Probably goes without saying that there are as many ways to learn Japanese as there are students. The only really hard-and-fast rule is that whatever you're doing, you need to do a lot more of it than you think.

That said, I've found that studying the kanji separately really helped my Japanese studying. If nothing else, it's kept me from falling too far behind the Chinese and Taiwanese students in class...

The book I used for the kanji is The Kodansha Kanji Learner's Course by Andrew Scott Conning. Can't recommend it enough.

https://www.amazon.com/Kodansha-Kanji-Learners-Course-Step/d...


There's definitely a lot of wanking about learning methods, but if you're taking an immersion class, you have to consider the amount of word study you're already doing. I'm just responding to the self-learner folks who want to start by "learning kanji", as if that's some kind of preliminary step to learning the language. I've met a bunch of those people, and it rarely works well for them.

(Though, it should be said that the reason the Chinese kids are better isn't that they "know kanji", but that many of the words are written the same way in Japanese. Substitute some of the kanji with hiragana, and they'll lose the bubble quickly.)


And even if they don’t know the words, they have internalized the learning methods to quickly ingest characters. Chinese classes go from learning barely 200-500 characters in a semester at the introductory level to that much in a week by your third or so year.


Wellll...I don't know Chinese and have never studied in China, so I can't speculate in that domain. But I wouldn't be so quick to infer specific techniques from their memorization speed -- they're fluent in a very similar written language, and it's much faster to learn things (in any language) when you're fluent. They also grew up in a...well, let's call it a memorization-intensive learning environment.

The only things I can say with confidence are:

  * focus on words
  * learning words becomes faster in *any* language as you gain proficiency
  * learning radicals helps you recognize kanji, and is a good thing to do
That's really my complete set of advice on the subject.


There are many things about Chinese which are very easy, in particular the grammar and the fact that many words are constructed by logical combinations of smaller semantic units. There's nothing easy about Japanese IMO, it has extremely complex grammar with standard and non-standard inflections, it uses a mix of two different alphabets and Chinese characters, and the Chinese characters often have multiple different readings. Chinese has a big learning curve because of the role of tones and context, but once you get over the hump it's a pretty simple language.


Japanese grammar is way more consistent (almost without exceptions) than all western languages. And in Chinese you get multiple pronunciations for the same character all the time!


Yeah except for counters.

Those god forsaken counters.

You can't just say five pencils. You have to say

5 long thin object even-numbers-4-and-under-and-odd-numbers-5-and-over-counter pencil

Go hon enpitsu

The pattern for which is

Pon Hon Bon Hon Hon Pon Hon Pon Hon Pon etc.
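The Pon/Hon/Bon pattern quoted above can be written down as a small lookup. This is only a sketch: `hon_suffix` is a made-up name, it covers 1 through 10 only, and it ignores the changes to the number word itself (ichi becoming ip-, roku becoming rop-, and so on).

```python
def hon_suffix(n: int) -> str:
    """Return the romanized form the counter 本 takes after the number n.

    Hypothetical helper for illustration; it only encodes the
    Pon/Hon/Bon pattern for 1 through 10.
    """
    if not 1 <= n <= 10:
        raise ValueError("this sketch only covers 1 through 10")
    if n == 3:
        return "bon"
    if n in (1, 6, 8, 10):
        return "pon"
    return "hon"  # 2, 4, 5, 7 and 9


# Rebuild the sequence for 1..10 in the same style as the comment.
print(" ".join(hon_suffix(n).capitalize() for n in range(1, 11)))
```

The "hon" cases are exactly the even numbers up to 4 and the odd numbers from 5 up, which is where the long-winded counter name in the joke comes from.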

They can't simply say the number of clouds the same way they say the number of machine presses. Or count small animals the same way they count apples. Or demigods the same way they count flat objects.

Small round objects and military brigades would obviously be counted the same way though. Obviously.

There's also a unique counter specifically for armed naval vessels and slices of fish on top of balls of rice.

And one just for straw mats.

And one for graves, CPUs, wreaths, and dams.

And while small animals and arthropods have one counter, and large animals have another counter, you count butterflies using the counter for large animals.

Oh and the first 10, 14th, 20th and 24th days of the month essentially have unique names.

I knew a native Japanese guy who spent 4 years mostly away from Japan, and he started to have trouble remembering the days of the month.


Although if you just use the generic 个 for everything besides (formal) people, people will understand what you're saying and give you a pass for being a non-native speaker.

At this point for me, the measure words are actually a help instead of a hindrance when I encounter a new noun that I don't know -- it hints strongly enough at what sort of noun I'm dealing with that the context usually allows meaning to more readily snap into place. More useful than memorizing generic masc./feminine/neuter for most Indo-European languages.


Oh that's cool- so if they are talking about CERN and say "one particle accelerator" the counter might help you figure that out?

(Assuming it isn't just like.. "parutikkuru akuseroreta" or something)


This seems to be true in Chinese as well. I know perhaps 10 measure words, but folks seem to be ok with it when I default to 个.

(Chinese is also much simpler when it comes to measure words)


The "internal sense" is that they are both derived from kanji:

Hiragana: https://upload.wikimedia.org/wikipedia/commons/7/7a/Hiragana...

Katakana: https://upload.wikimedia.org/wikipedia/commons/0/0c/Katakana...

Kanji isn't that hard if you are realistic about how large of a task it is. Japanese children spend 10 years learning kanji at school and that is their native language.


That’s also true of common characters used in English depending on font and other things. Context generally makes things pretty clear, which I would guess is also the case with these glyphs in other languages.

Capital I, lowercase l, numeral 1

Numeral 0 and uppercase O


I said in handwriting, if not careful.


I think yeukhon means that if you get lazy and write 已 or 己 with a single stroke (easy to do with a ball-point pen), they are basically indistinguishable.


No educated Chinese-writer will write those in a single stroke AFAIK


I think you will be amazed at what can be written in a single stroke when you are in a hurry.


Yeah, but they are side by side here, and not in handwriting. If you're not careful in handwriting it'd be real hard to tell the difference.


> > Can you spot the difference between 𠮟 and 叱? Me neither.

> Of course I can, not because I am Chinese, but they really do look different. How can you NOT see the difference? Also they have different meanings

The difference at my resolution is about 10-15 pixels, this seems like much less than the difference between one font and another. But then of course I don't have any knowledge about Hanzi. Is the difference whether the diagonal dash touches the part to the left? Seems very hard to notice.


They were originally drawn with brushes, and the stroke direction was obvious. In handwriting, you're also supposed to use the same stroke order and direction. A font doesn't have that, so some fonts use differences in shape and thickness of the line to indicate the stroke direction.

So I'd say that the difference is mostly how your font chose to indicate the proper stroke direction.


I l 1

O 0

We have plenty of more ambiguous characters in our limited alphanumeric set, and if font designers did not take special care they could look exactly the same.

Every written system has its own clues that readers learn to focus on.

In Japanese the number and direction of strokes is important but may translate to barely noticeable pixel differences that Westerners would dismiss as noise.

I think that this particular pair is confusing because in Japanese orientation is not supposed to be important, but is, in that case.


I and l looking exactly the same in some fonts is a different problem, really. You're not being asked to pick out a two pixel difference in slope (seriously, two pixels using pretty standard resolution). You use grammatical rules to tell the difference, which is annoying but not visually difficult. And numbers aren't getting freely mixed with letters in text.


And probably why no one cares, especially when Unicode was limited to 16-bit.


> How can you NOT see the difference?

I saw the difference, because it reminded me of the difference between ン and ソ in katakana, but if you aren't Chinese I can see how you could assume they are the same. You can draw the same character differently in English without changing the meaning (e.g. 9 straight or curved, 7 with or without a bar through it, looped k or f), so overlooking the difference between 𠮟 and 叱 seems easy enough.


Chinese has a lot of tricky ones.

荼 vs 茶

日 vs 曰

已 vs 己 vs 巳 vs 巴

士 vs 土

水 vs 氺

人 vs 入 vs 八 vs 几

干 vs 千 vs 于

天 vs 夫

... the list goes on.


Here's some more. I made these mistakes when typing using Pingtype, and then checking my results with my teacher.

https://pingtype.github.io

給 耠

辦 辛力辛

重 量

困 因

痕 㾗

眉 㞒

部 卶

蓑 䒾

任 仼

挑 狣

推 猚

䒷 苦

阿 冋

共 芖

休 你

衡 𧗾

一 –

幹 龫

幸 辛

我 找


As in homographs in other languages, context is often sufficient to disambiguate.


>> 荼 vs 茶

Also 苶 looks similar.


At least these are all in the BMP...
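A quick way to check that claim, assuming it refers to the list of tricky pairs a few comments up: a character is in the BMP when its code point is at most U+FFFF. The helper name `in_bmp` is made up for this sketch.

```python
def in_bmp(ch: str) -> bool:
    """True if the single character ch is in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


# The characters from the "tricky ones" list, with the "vs" separators dropped.
lookalikes = "荼茶日曰已己巳巴士土水氺人入八几干千于天夫"

print(all(in_bmp(ch) for ch in lookalikes))  # True
print(in_bmp("\U00020B9F"))                  # False: 𠮟 from the title is not
```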


Maybe they should just romanize (switch to Latin script) the way Vietnam did.


It’s hard to do in Japanese due to all the homonyms. Kanji gives meaning where romanizations would not.


Early video games didn't have enough resolution to display Kanji, which is why all text was Kana-only. With spaces as word separators and sentence context, homonyms are no more of a problem than in other languages.


What do video games have to do with it? Pong didn't have the resolution to display Shakespeare, but we didn't stop using words because of it. Video games provide a visual context that reading alone does not. Also, early video games, especially consoles like the NES Famicom, were geared towards kids who don't understand many kanji. The same is true of books aimed at children in Japan. They don't use kanji and they provide visual context in the form of illustrations.

Ask any Japanese person why they use kanji and not just hiragana, katakana, or romaji and they'll tell you the same thing: Because it is hard to determine the meaning. They don't say it is impossible, just that it is difficult, which it is.

For a contrived example, if I write はしをさがしてる。What am I looking for, a chopstick, a bridge, or the side of something? There is no way to discern from the sentence alone. However, if I use this kanji for hashi, 箸, then it is obvious I'm talking about chopsticks. There is no ambiguity.
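To make the contrived example concrete, here is a tiny sketch. The `HASHI` table and the `spell_out` helper are hypothetical, and the three spellings are simply the usual dictionary ones for the three meanings listed above.

```python
# Hypothetical table: one kana reading (はし), three different kanji spellings.
HASHI = {"箸": "chopsticks", "橋": "bridge", "端": "edge"}


def spell_out(kana_sentence: str, reading: str = "はし") -> list:
    """Return each kanji spelling of the sentence together with its gloss."""
    return [
        f"{kana_sentence.replace(reading, kanji, 1)} ({gloss})"
        for kanji, gloss in HASHI.items()
    ]


# One ambiguous kana sentence becomes three unambiguous kanji sentences.
for line in spell_out("はしをさがしてる。"):
    print(line)
```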

This doesn't always help. Sometimes the homonyms use the same kanji. 青山, for instance, can be read あおやま, which is a surname, or it can be read せいざん, which means a lush, green mountain. Most of the time though, kanji provides meaning and clarity where hiragana, katakana, and romaji do not.


This hits you hard when you read children's books for the first time as an adult.


Spoken language doesn't have kanji and people can communicate just fine.


@HumanDrivenDev - Right, because in spoken conversation the listener can ask for clarification when the speaker's commentary is ambiguous, which happens all the time. You cannot do that when you're reading a book.


In mandarin people tend to add lots of little "particles" (not sure about the proper linguistic term) when they speak, which seems to be a way of adapting to homophones.


Spoken language has accent.


One thing I've noticed whilst trying to learn Japanese is that reading long stretches of romaji/kana is just tedious. There's certainly overhead to learning kanji (especially as a non-native speaker), but I'm not sure romanisation alone would make for a better language.


The hardest part for me is not the characters (I type using Pinyin, which is Romanised).

What Chinese lacks is spaces between words.

Even Korean has spaces. It makes reading a large block of text so much easier, and it helps me to identify vocabulary that I recognise.

Adding spaces and Pinyin are just a couple of the many features of Pingtype.

https://pingtype.github.io


You've been down-voted, but it's something I often think about as well. Chinese speakers - even those who speak chinese languages that aren't usually written - seem to have a deep attachment to characters. But the attachment is not free.


also 戊戌戍


And their friend 戉.


I was stunned to find that no article has ever been submitted to HN with 'Han unification' in the title[0].

(there are comments [1] on the topic if you're interested to see HN's discussion on the topic as I was)

[0]: https://hn.algolia.com/?query=Han%20unification&sort=byPopul...

[1]: https://hn.algolia.com/?query=Han%20unification&sort=byPopul...


The reason the second Japanese character is censored out of the title seems to be that having a character outside the BMP in the title broke everything horribly, not because of actual censorship of swearing or something.

I wonder how long it will be until UTF-8 is used everywhere and non-BMP characters enjoy first-class support and testing. You'd think the U+1Fxxx emoji would have been enough to make this happen.
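For anyone wondering why a non-BMP character trips up software, here is a minimal sketch of what 𠮟 costs in the two common encodings, next to its BMP sibling 叱. The function name `encoded_sizes` is made up for the example.

```python
def encoded_sizes(ch: str) -> dict:
    """Code point plus the storage cost of one character in UTF-8 and UTF-16."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        # UTF-16 code units are two bytes each.
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
    }


print(encoded_sizes("\u53F1"))      # 叱, inside the BMP
print(encoded_sizes("\U00020B9F"))  # 𠮟, outside the BMP

# The two UTF-16 code units of 𠮟 form a surrogate pair.
print("\U00020B9F".encode("utf-16-be").hex())  # d842df9f
```

叱 (U+53F1) takes 3 bytes in UTF-8 and one UTF-16 unit; 𠮟 (U+20B9F) takes 4 bytes and two units. Code that assumes one 16-bit unit per character, or at most 3 bytes per character, breaks on the second one.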


I just filed https://github.com/algolia/hn-search/issues/104 after editing the HN title and discovering problems.


It will still probably take time. It was only recently that MySQL allowed index keys longer than 767 bytes by default, for example; before that, utf8 was commonly used instead of utf8mb4.
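The arithmetic behind that trade-off, as a rough sketch. It takes 767 bytes as the old InnoDB index key prefix limit (treat the exact figure as an assumption), and relies on MySQL's `utf8` storing at most 3 bytes per character while `utf8mb4` stores up to 4. The names below are made up for the example.

```python
INDEX_LIMIT_BYTES = 767  # assumed old InnoDB index key prefix limit

# Worst-case bytes per character for each MySQL character set.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset: str, limit: int = INDEX_LIMIT_BYTES) -> int:
    """Longest fully indexable column, in characters, under the byte limit."""
    return limit // MAX_BYTES_PER_CHAR[charset]


def fits(ch: str, charset: str) -> bool:
    """Whether one character's UTF-8 form fits the charset's per-character cap."""
    return len(ch.encode("utf-8")) <= MAX_BYTES_PER_CHAR[charset]


print(max_indexable_chars("utf8"))     # 255
print(max_indexable_chars("utf8mb4"))  # 191
print(fits("\U00020B9F", "utf8"))      # False: 𠮟 needs 4 bytes
print(fits("\U00020B9F", "utf8mb4"))   # True
```

So a fully indexed 255-character column fit under the limit with `utf8` but not with `utf8mb4`, at the price of not being able to store characters like 𠮟 at all.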


Not even going into the "fun" of the Han unification, there are some weird things in Chinese/Japanese characters.

For example, 右 (right) and 左 (left). You'd think the top-left part is the same, but it's not. In the case of 左, you write the horizontal stroke first, but in the case of 右, it's the second stroke. They also have a slightly different shape.

Another example from the Jōyō kanjis, 臭 (stinking, odor) and 嗅 (smell). You'd think the second is just 口 (mouth) added to the first, but it's not. Etymologically, 臭 is 自(nose, simplified form of 鼻)+犬(dog), but was simplified in the Jōyō list, and became 自+大. 嗅 was not part of the list back then. Which doesn't mean it didn't exist. It just means it was not recognized as regular enough by the ministry of education. 嗅 was only recently added to the list (2010), but was not simplified to remove the extra stroke, so it's 口+自+犬, leading to this funny inconsistency.


My biggest frustration with Han unification is that on my phone Japanese text appears with Chinese variants and there's no way to correct this without rooting the phone or setting the system language to Japanese (both kind of a lot of work to get something working that should just work out of the box). People say "well, what's the big deal? You can read it anyway," but think about how you'd feel if English text randomly had Cyrillic or Greek equivalents mixed in with the Latin letters.


My feeling exactly. Even if the Han characters are the same under Han unification, the glyph matters a lot in context.

It's a little like replacing every French à with a plain a: à would only be accessible through variant fonts, which most OSes don't support, and if your system locale were French, every English word would show à instead of a.


Is this because developers have their locale set to their language, so it looks good for them, and they're forgetting to define the language of the text?

That's the case for HTML:

  <HTML lang="zh">
But I don't know about other instances.


Does that work in Android? I've never seen anything look right really. And for apps I don't think it's even an option.

Either way it's annoying to set people up to screw up in this way. And the unification also means it's quite difficult to get multiple variants into the same text.


not sure if you're coming from a purely japanese background, but in chinese 左 and 右 both have the same stroke order for the upper left portion.

and to show just how ridiculous han unification is, on my laptop your example makes no sense because the 臭 has a 犬!


So interestingly, for 左 and 右, the etymological stroke order is the japanese one. It was "simplified" in mainland China and Taiwan.


Do you have a source for your claim? Particularly, you're suggesting that the stroke order for the top left part was once different in China. I'd really like to see a document on this.


I can buy that, as I know the top left portion in chinese historically is for the left hand. If the historical variation of right was for the right hand, that would make a lot of sense.


Are you really suggesting that writers changed hands during writing a single character? That sounds totally impractical. Also, that does not explain why there should be a different stroke order.


No, 左 is "left" and 右 is "right", they're saying that the 𠂇 radical itself means "left hand", and they later added a similar one for "right hand".

Looking it up (on wiktionary, at least) it seems like originally 又 was "right hand", however if you look at really old versions of that character it looks mostly like a mirror of 𠂇.

Interestingly, Wiktionary lists both left and right as phonosemantic compounds where the phonetic part is "left hand" or "right hand" and the semantic part is "assist" and "mouth" (I think in this case the "mouth" is used to bolster the "pronounced like" of the phonosemantic compound, since it's used on the right side not the left). This seems to be because the word for "left hand" became the word for "left" and same for "right hand", so they're pronounced the same; and the semantic component was added later to bolster/specialize the glyph.

Anyway, it seems like the 𠂇 radical in 右 is etymologically a variant of 又 which is a mirror of 𠂇 (well, a mirror of a three-pronged historical form of that) except it was rotated around the glyph so that it looks exactly like 𠂇 but the stroke order is reversed.


Thanks for clearing that up.


Here's how I formed my opinion: https://en.wikipedia.org/wiki/Stroke_order#Stroke_order_per_... separates traditional stroke order, Taiwan, Japanese, mainland China and Hong Kong, and specifically talks about 𠂇, saying that the traditional stroke order differentiates its stroke order according to etymology and character structure.

Which in and of itself, is not sufficient, but is a strong clue.

Then, if you look up the seal script origins of 左 and 右, you find that the top-left part of the character comes from different directions: It comes from the left for 左 (https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/%E...) and from the right for 右 (https://upload.wikimedia.org/wikipedia/commons/thumb/e/ef/%E...). One way to look at it is that the descending stroke of 𠂇 in 左 is the arm, while it's the horizontal stroke in 右.

But at this point, that's mostly conjecture.

So the next step is to simply find how the characters were actually written by writers from far in the past. And it turns out there are such resources:

左: http://shufa.guoxuedashi.com/5DE6/

右: http://shufa.guoxuedashi.com/53F3/

This contains the characters as written from a lot of different material from different periods. And in those many instances of the characters, the way the two first strokes can be connected is a strong indicator of how they were written.

For instance, on 左, they connect on the top-right end, indicating the horizontal stroke comes first: http://pic.guoxuedashi.com/shufa/6t1/29792.jpg http://pic.guoxuedashi.com/shufa/6t1/29793.jpg http://pic.guoxuedashi.com/shufa/ks/R201308025_TM.TXT.0248.0... http://pic.guoxuedashi.com/shufa/sf26/r201509387.0252.9ea9de...

And on 右, they connect on the lower-left end, indicating the horizontal stroke is last: http://pic.guoxuedashi.com/shufa/xs/r201308029.0111.18[26288... http://pic.guoxuedashi.com/shufa/xsd1/r201308030.0150.23[3dd... http://pic.guoxuedashi.com/shufa/xs/r201308029.0111.16[24a6b... http://pic.guoxuedashi.com/shufa/xsd1/r201308030.0150.23[3dd...

You can't do such connections by writing the horizontal stroke first.

There are also examples of the horizontal stroke connecting to the 口 part, further indicating it's the second stroke: http://pic.guoxuedashi.com/shufa/xsd1/r201308030.0150.5[6f86... http://pic.guoxuedashi.com/shufa/xs/r201308029.0111.2[0a33aa... http://pic.guoxuedashi.com/shufa/xs/r201308029.0111.4[0a2633...

If you look at the pages linked above, you'll also find the names of those writers, whom you can look up to check that some of them lived more than 1,000 years ago.

Funnily, there's this interesting exception of 左 with the horizontal stroke second: http://pic.guoxuedashi.com/shufa/sf26/r201509387.0252.9eb7a9...


Wow, thanks for the resources! You make a convincing argument. I studied sinology, but did not go that deep into character etymology. So basically, what those writers did when they wanted to write properly, was to have the older script variants in their heads, even though the actual forms they were writing may look the same. Or rather, look the same in print and in later standards. Huh.


In Chinese calligraphy one is instructed to emulate the shapes of model master calligraphers, and typically there is only one stroke order that can consistently achieve the look and style of the model characters.


Facepalm. I actually was aware of the han unification of 自+犬 and 自+大 as 臭 (U+81ED), but forgot about it. There's now also 臭 (U+FA5C), that is 自+犬.
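A small sketch of how those two code points relate, using Python's standard `unicodedata` module. U+FA5C is a CJK compatibility ideograph whose canonical decomposition is U+81ED, so any Unicode normalization folds it back into the unified character.

```python
import unicodedata

unified = "\u81ED"  # the unified ideograph
compat = "\uFA5C"   # the compatibility ideograph for the 自+犬 form

# They are distinct code points...
print(unified == compat)                                # False

# ...but U+FA5C decomposes canonically to U+81ED,
print(unicodedata.decomposition(compat))                # 81ED

# so normalizing erases the distinction.
print(unicodedata.normalize("NFC", compat) == unified)  # True
```

This means any pipeline that normalizes text can silently turn U+FA5C back into U+81ED, and which glyph you then see depends again on the font.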


Also, radicals and roots can adjust shape to accommodate the partner, so they can vary in exact shape from one character context to another.


this is true, though these cases are fairly limited and 𠂇 is not a radical (you would look it up as 丿+ 1)

after looking through my dictionary, it appears that 右 was originally written with 又 which is a pictograph of a right hand (no idea if the split-top was an accepted variation of the radical or the original form) and was later standardized to use 𠂇 for both characters. Japanese picked them up before the standardization occurred and so missed out on the homogenization.

this also makes 友 a combination of the left and right hand, which imo is kind of cool.


> Japanese picked them up before the standardization occurred and so missed out on the homogenization.

Actually, the writing with 𠂇 is older than the japanese using the characters. The wikipedia page on stroke order suggests the homogenization in stroke order might have happened last century.


According to Japanese dictionaries, 友 comes from 又+又, which makes it a combination of two right hands.


While reading that I also got confused between 犬 (which I haven't encountered before) and 太 (which I have). Though that's just my lack of vocabulary. Still, it's pretty hard to tell these apart as a learner since it's not always clear which parts of a glyph are the important features.

My favorite one of these "look the same but aren't" pairs is 土 & 士 which differ only by the relative lengths of the upper and lower stroke. There's also 囗 and 口, which differ slightly in size but more importantly in the hooking of the second stroke; except half the fonts don't display that; and certainly not at the small font size I keep English text at (which is why I've tweaked the font-size for Chinese in my browser). It would be even more confusing when they get used as radicals except 囗 is always used to surround a glyph, whereas 口 is used squished up the "regular" way as in your example.


In the category "differ only by the relative lengths of the upper and lower stroke", there's also 未 and 末.

I'm not sure what you mean about the hooking of the second stroke of 囗 and 口. Do you mean the fact that in the former, the angle tends to be 90° while in the latter it tends to be < 90° (except in fonts, as you note)?

I don't know about Chinese, but in Japanese, the former is not a character on its own (at least, it's not in the 2136 Jōyō kanjis, neither is it in the ~6000 characters of the top level of the Kanji Kentei).


> Do you mean the fact that in the former, the angle tends to be 90° while in the latter it tends to be < 90°

No, I mean that the end of the stroke in the former is often hooked and crosses the third stroke, whereas in the latter the second stroke stops short of the third stroke and isn't hooked. I'm not 100% sure; and I'm still learning :)

I don't think it's a common character on its own in Chinese either; I know it as a radical.


Do you mean this? http://www.geocities.jp/yokomoko3/kihon-5.jpg In Japanese there is no hook, but there is this difference in how the second and third strokes interact.


Yeah. To be clear; this is what I've guessed from seeing it written in various ways; I don't have an authoritative source.

(In general there doesn't seem to be a good source for what kinds of variations in a character are accepted and what the "kernel" of the character is.)


There is, unfortunately, a great lack on this very topic. In the digital age, it's sad that for those things, you pretty much have to rely on books.

There are a few resources, like glyphwiki or Kanji-VG, that have stroke information, but TTBOMK, they don't contain any information about the kinds of strokes.

Even if they did, that would barely scratch the surface of what I am, personally, interested in. For instance, I'm taking Japanese calligraphy classes, and am interested in the various ways characters can be written in the different styles (楷書, 行書, 草書).

I've also found that interesting etymological facts about kanjis help me remember how they're written. To give a couple of examples of things I've found by looking at Japanese kanji dictionaries, 並 (line up) is derived from 立 (stand up) repeated twice (立立), and 自 (self) had the meaning of nose before the character for nose was created (鼻), which is why you'll find it in many nose-related characters like 臭 (smell) or 息 (breath). The latter is the kind of thing that I've found most learning apps completely miss. All those I tried, which are all essentially based on the same data sets anyway, will tell you that the radical 自 is for self, and be done with it. Somehow they do manage to tell you the radical 月 can be related to flesh (which is why it appears in many organ-related characters). In the worst case, Kanji-VG based apps will even tell you that characters with the radical 自 contain the radical 目, because you know, 自 looks like 目 with a stroke on top. Which is completely misleading.

Essentially, what I've found works for me is to go to the local library, open good old-fashioned Japanese-Japanese dictionaries, and research the subject. It helps that I live in Japan.

I did try a few Japanese-targeted apps, but they are mostly drill based, and/or heavily targeted at kids. That doesn't work for me. Maybe I simply haven't found the good ones... finding something on those app stores is so impossible.

As for Japanese learning tools for foreigners, they tend to be targeted at beginners, which is fine, when you're a beginner.

On the subject of drills, I haven't found any that actually tries to challenge on similar-looking, possibly even homophone characters. At best, they will make you pick between homonyms, like "given $context, do you write きんし 禁止 or 近視?" (both are correct writings for きんし, but have totally different meanings, so the context tells you which to use), but I haven't found any that makes you pick between homonyms that are purposefully wrong but very close looking. Like "do you write そくおん 促音 or 捉音?" (only one of those actually exists, the other would actually have the same reading if it did exist, but it doesn't).

But I digress.


The depth and breadth of the diaspora around Chinese characters makes it extremely difficult to get these things right. Even this article presenting highly specific domain knowledge tees off with a questionable example:

`The Japanese cross the blade in 刃, the Koreans don’t`

Well, the Japanese have both 刀 for katana, the well known sword, and 刃 for yaiba / "blade". But in fact 刀 is used for many kinds of blade, by itself as `katana` and as a component of words like 太刀, 印刀, 日本刀, &c. (see http://kanji.quus.net/jyukugo1498/) Is this really pointing to a distinct concept from 刀 in Korean, as the article suggests?



If that's what the author is referring to, there are less ambiguous ways to express it.

Even so: in my experience in Japan I've seen e.g. 認 hand-written both ways, and specifically remember becoming curious about this variant. My conclusion after consulting with native speakers and university professors I knew there was that they are essentially and functionally equivalent. That is, against the author's point, Japanese speakers do not note a difference at all.


FWIW many of the unified characters under Han unification are considered "wrong" specifically in Japan (in other CJK countries folks seem to be more lax). From what I understand the feeling is as if you started writing `s` as `ſ` in English.

I'm not very sure of this but it seems like the Japanese characters mostly stayed the same after branching off of Hanzi usage at various points many centuries ago (more than a millennium ago, actually), whereas in China these characters evolved.


To the extent this is true I'd say it's because Korean people only occasionally use Chinese characters and the default is almost always to display Chinese variants, and not because Japanese people are uniquely particular.


So on one side we have reference materials and on the other we have your anecdote. Even if we accepted its validity, this is just one example of many.


then is а really a distinct concept from a?
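To make that comparison concrete, here is a minimal Python sketch. It assumes the first letter in the question above is the Cyrillic small a (U+0430) and the second the Latin one (U+0061); the variable names are only illustrative.

```python
import unicodedata

# Assumption: the two letters being compared are Cyrillic "а" and Latin "a".
# Escapes are used so the example does not depend on how the page was copied.
cyrillic_a = "\u0430"
latin_a = "\u0061"

for ch in (cyrillic_a, latin_a):
    # Print the code point and its official Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The glyphs usually look identical, yet the strings compare unequal.
print(cyrillic_a == latin_a)
```

Unicode keeps these two apart because they belong to different scripts, even though most fonts draw them the same way.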

regardless of whether 刀 and 刃 have similar meaning in korean and japanese (in chinese the former refers to the knife/sword itself whereas the latter refers the edge of the knife), the unification of them across multiple languages makes it difficult to handle variations, nuance, and (most importantly for a forward thinking standard) divergence over time. that korean, japanese, and chinese have similar characters for historical reasons doesn't mean they are bound to be the same character forever in their respective languages.

that another commenter below me can't easily transmit to me the character for "smelly" in japanese because whatever font i use is preferring the chinese character for it is a signal that the standard is broken.


Those characters are not unified. What the article is actually saying is that when Koreans write 刃 they place the short stroke in a different place than the Japanese do (and users of traditional Chinese place it in still another).


rereading my words i realize i was ambiguous; i didn't mean to say that 刀 and 刃 were unified (how would i have written them?) but that unifying those characters across multiple languages is ignoring that they have nuanced meaning based on what language they're in.

then add in your point that the written forms themselves are diverging over time.


To the extent they're exactly the same I don't think it matters. We don't need a separate English and Spanish j because they're pronounced differently. But 刃 in particular isn't the same across languages. The GP seemed to be confused, interpreting the article to mean that Koreans used 刀 instead of 刃 when what the article actually meant is that there are variants of the latter character.


刀 simply means knife/sword, or knife in general, while 刃 means the sharpest part of the knife. In fact, 刀刃 is its own word in Chinese, in this case meaning the blade of a knife.


Never seen it before, but in Japanese that's apparently "tohjin," meaning the blade of a sword. http://www.weblio.jp/content/%E5%88%80%E5%88%83


or just any blade as "ha": https://ja.wikipedia.org/wiki/刃物


I meant specifically the word with both characters.


A favorite of Japanese study beginners is learning how to distinguish between ツ and シ, as well as between ソ and ン.


Handwriting them makes for an 'a-ha' moment. ツ and ソ are written left-right while シ and ン top-down.


For those of us who don't read Japanese, can you explain what the four characters in question mean?


They're katakana characters, used mainly for transcribing foreign words. シ is "shi", ツ is "tsu", ン is "n" and ソ is "so".


Not a Japanese reader so i can't fill you in on what exactly the characters mean but they are characters from the Japanese katakana script. Katakana are generally used for loanwords and each character represents a sound:

ツ - tsu

シ - shi

ソ - so

ン - n

The reason these are difficult to learn is because the tiny differences in stroke angles (especially when handwritten) make it easy to confuse them.


In college, my Japanese teacher told us that a lot of foreigners get these characters wrong, but if you know the stroke order then it's easy to see the difference.

シ (shi) is written top to bottom. You can see that all the starting points for the strokes line up vertically on the left. Also, the last stroke curves from the bottom-left to the upper-right.

ツ (tsu) is written from left to right. You can see that all the starting points for the strokes line up horizontally at the top. Also, the last stroke curves from the upper-right down to the bottom-left.

ン (n) like 'shi' is written top to bottom. The starting points for the strokes line up vertically on the left. It also uses the same direction for the longer, final stroke as 'shi'.

ソ (so) like 'tsu' is written left to right. The starting points for the strokes line up vertically on the left. It also uses the same direction for the longer, final stroke as 'tsu'.


Replying since I can't edit: The sentence for "ソ (so)" should read "The starting points for the strokes line up horizontally at the top".


All the kana are easy to learn if you use mnemonics, though I admit handwriting recognition might be harder. It took me a moment to recall some of the mnemonics themselves, eventually you just see them as they are... For tsu, I can think of Two sewing needles, for shi, I can think of "she has a funny face", for so it's one "so-ing" needle, for n it's a cyclops who can only say "nnnnnn". (https://www.tofugu.com/japanese/learn-katakana/ and https://www.tofugu.com/japanese/learn-hiragana/ have some great mnemonics to start out, I remember only altering a few.)


On their own they don't mean anything, however the first four may appear and indeed do as multiple words on their own, or as part of other words. Japanese has many homophones.


No meaning per se, these are from the katakana syllabic alphabet. They make the sounds "tsu" (ツ), "shi" (シ), "so" (ソ) and "n" (ン).


They are phonetic characters, so they don't mean anything. シ (shi), ツ (tsu), ノ (no), ソ (so) and ン (n), all in katakana (used to write loan words and slang).


Actually, katakana were derived from kanji, but the source characters were typically used only for their sound rather than their meaning. Roughly speaking, katakana are very, very simplified versions of how you would write the character quickly while lifting the brush (hiragana are the simplified versions written quickly without lifting the brush). So you are mostly right but not completely.

シ comes from the Chinese character 之, meaning either "of" or "this"

ツ comes from the Chinese character 州 (although some researchers disagree), meaning "state" or "province"

ソ comes from the Chinese character 會, meaning "meet", "party", or "interview"

ン comes from the Chinese character 尓 (again not clear), meaning "you" or "that"

This divergence happened because Chinese characters were used to write Japanese which is mostly from a different language family. Some Chinese characters were used by some for their meaning (ideogram) and used by others for the sound they made to approximate the sounds used in Japanese. Obviously Chinese influenced the Japanese language (for example numbers in Japanese sound similar to numbers in middle ages colloquial Hokkien). Over the last 1,000 years or so there have been numerous occasions for changes, standardisation and simplification. So even to those who study these changes over time it can sometimes be unclear.


> ソ comes from the Chinese character 會

ソ comes from 曾 ("once, at one point") not 會. Yes, those are different characters.


It gets worse. 曾 and 會 in your comment look clearly distinct to me in Safari, but if I check with Firefox, they suddenly look very similar.

The trick is that 曾 is sometimes written with 八 as its "roof", which makes it look like 會, and sometimes with 丷, depending on your font. (Similar to 兑 and 兌 in the article)


In my browser, those are just 4 identical empty squares. I guess I'm a beginner since I can't distinguish between them!


Installing the free hanazono fonts should sort you out

For arch:

    pacman -S ttf-hanazono


Interesting. Installed hanazono font HanaMinPlus.ttf in Firefox as J-default and even in this comment section the distinction between katakana 'n' and 'so' is quite clear whereas using MS Gothic it's somewhat ambiguous. Even more so when the two fonts are juxtaposed in LibreOffice.


You might want to look at some Kyokasho-tai typefaces; they should be pretty close to how we are taught to write.


awesome, thank you for actually being helpful rather than just downvoting and moving on!


Chinese gets even more complicated being a language without an alphabet. Even among the traditional and simplified variants, there are different forms for the same character based on popularity.

For example, 吃 and 喫 have the same modern meaning "to eat," but one is more commonly used. A character like 鎌 has a variant like 鐮, just as well as 塚 and 冢. People choose characters based on stylistic reasonings (and in Taiwan, many choose the Japanese variant to be "hip").


> Chinese gets even more complicated being a language without an alphabet.

If I could be very pedantic for a moment...

Mandarin can be written in many alphabets. Almost every single native speaker on earth uses an alphabet to input Chinese characters on computers and phones. And almost every native speaker learns their language using an alphabet - at least initially.

https://en.wikipedia.org/wiki/Hanyu_pinyin https://en.wikipedia.org/wiki/Bopomofo


To get even more pedantic: I wouldn't say Cangjie or Dayi inputs are an "alphabet", and a sizeable number of users use those. They're certainly a way of inputting characters; and a method of looking at how glyphs can be broken down (one by stroke order, one by shape), but for it to be an alphabet there must be a rough phoneme mapping AFAICT; which there isn't.

(but yes, the majority seems to use phonetic/alphabetic input methods, i.e. pinyin or zhuyin.)


As a fellow pedant I made sure to use words like "almost" (:


Right, it's not even "almost"; I'm saying a significant number of folks use Cangjie -- not a majority; but still a large number.


The alphabets are phonetic representations, and require reading in order to type.

I've made Pingtype, which lets you decompose a character into the parts, and rebuild another similar character if your handwriting recognition isn't perfect.

https://pingtype.github.io


It is the consequence of past 2 centuries' political struggle in East Asia.

Written Chinese has basically 3 different standards, in Sinosphere:

1. Mainland China: Simplified Chinese. Meanwhile, contrary to most people's understanding, the PRC has standardized its own Traditional Chinese writing as well, which is permitted for logos, trademarks and research purposes.

2. Hong Kong: Hong Kong has its own system, plus several hundred characters that were created specifically for Cantonese.

3. Taiwan: Taiwan's system has some interesting distinctions in characters like 骨 (bone), which is different from both Mainland China's and Hong Kong's systems.

Since the 20th century, Japan has simplified and revolutionized its Kanji standard on its own, but unlike in China, a lot of Kanji still follow the forms of the old Kangxi dictionary (康熙字典). Korea is an interesting case: they abandoned Hanja in everyday use, so they didn't even bother to simplify it, and thus Korean Hanja actually follow the most orthodox way of writing of all the systems mentioned above.


Historically alphabet based phonetic descriptive written languages are a recent invention.


Wait. Can you clarify that? Most of the world outside of East Asia has had that since there was a written language.


This gets into slightly pedantic definitional arguments, but broadly speaking the assertion is true that all of the earliest known writing systems were logographic (where a character represents a word, not a sound), though it is also true that no 'pure' logography has ever been known to exist (except for an artificially constructed language called Toki Pona[1] from c.2001 that's barely worth mentioning). In practice, all known writing systems have some level of phonemic association to characters along with the ideographic, be it using a special subset of characters for phonetic representation or by associating the actual utterances of the words with their ideograms.

Examples of writing systems that used characters primarily to represent sounds did come later, though similarly, it is difficult to argue that any known language is purely phonetic. As a simple example in English, take '&': though it is not a recognised letter, neither is it punctuation, and it is unarguably a glyph in the English writing system about which one could debate whether it is ideogrammatic or logographic. (But I digress.)

There are many types and varieties of phonetically representative writing systems; most of the earliest, like Linear B, are syllabaries[2]. The earliest known surviving examples date from the mid-15C BCE, written forms of Mycenaean Greek[3].

Fully segmental writing systems (please distinguish 'fully' from 'purely', as elaborated above) are definitely younger than the rest. The earliest examples of these are often abjads and abugidas, which either eschew vowel representations or form consonant-vowel digraphs respectively[4].

The Phoenician alphabet, or more correctly the Proto-Canaanite script that was used by the Phoenicians, is the earliest surviving example of a segmental writing system — it is an abjad of which the earliest surviving examples are from roughly 1000BCE[5].

The earliest preserved writings that aren't just numbers date to about 3100BCE in Sumer[6], so predate preserved phonetic (excluding syllabic) writings by about 2000 years, and are in the order of 5000 and 3000 years old, respectively. So it's a big stretch to call it a 'recent invention', but certainly not as ancient as logographic or mnemonic writing systems.

[1]https://en.m.wikipedia.org/wiki/Toki_Pona

[2]https://en.m.wikipedia.org/wiki/Syllabary and https://en.m.wikipedia.org/wiki/Linear_B

[3]https://en.m.wikipedia.org/wiki/Mycenaean_Greek

[4]https://en.m.wikipedia.org/wiki/Abjad and https://en.m.wikipedia.org/wiki/Abugida

[5]https://en.m.wikipedia.org/wiki/Phoenician_alphabet

[6]https://en.m.wikipedia.org/wiki/History_of_writing


& is simply a ligature of et, is it not?


Yup:

> The ampersand is the logogram &, representing the conjunction "and". It originated as a ligature of the letters et, Latin for "and".

https://en.wikipedia.org/wiki/Ampersand


Not really. The Phoenician alphabet, which is considered the origin of the writing systems of many Western languages, was created somewhere around 1200 BC. The oldest Chinese oracle bone script discovered dates from around the same time.


Mesopotamian and Egyptian writing goes back much further than that, and while the alphabet has its origin in the Phoenician sea peoples, it didn’t become widespread through Europe and Central Asia until the Greek empire a thousand years later. Alphabet-like writing systems have been prominent for less than half of written history, hence recent.


What? How is choosing a character variant "hip"? I know in some cases Taiwanese people are influenced by the default Microsoft font MingLiU, which has some Japanese variants.


Giving your text some flavor from a 'cool' language can absolutely be hip.


If you're interested in what the other mentioned kanji with regularly-used non-standard forms mean, here are some rough definitions:

餌 - animal feed, bait

遡 - go back in time; go upstream

遜 - humble, modest

謎 - riddle, enigma

餅 - rice cake; mochi


I'm often surprised these kinds of issues aren't fixed quicker. There are so many issues in browsers related to non Roman based letters (assuming that's the correct term for it)

on the one hand I get that most browser (and OS?) dev happens in the West by people unaffected by these issues. On the other hand with > 1 billion people using non Roman based languages I'd expect this kind of stuff to be more of a priority.

maybe this particular issue isn't that important? the one that bites me the most is pressing ESC to cancel IME editing and having it exit some dialog because the browser/os passed the ESC all the way down to the app when it was meant only for the IME. I get there are probably no easy solutions tho

this is also a place where VSCode fails because VSCode and any other js based text editing in html needs more IME info than current browser APIs provide


I am developing a program using a webbrowser control, for my own use, to aid in learning the Jouyou Kanji and their readings. I need some text controls to be exclusively latin, hiragana, or katakana.

The state of IME support in Windows is very poor.

WPF (legacy software?) does not support this at all, with reported bugs going back five years.

Winforms does, but doesn't meet my requirements.

HTML does define the inputmode attribute, but this only works as a hint to smartphone on-screen keyboards. The inputmode attribute is totally ignored by desktop browsers.

The Microsoft IME does not appear to have any API which can determine the current state or switch between modes. However, there is currently an HTML5 working group effort on an API for IME control.

In this day and age of multinational software, this is truly pathetic.


"non Latin text" would be the correct way to put it.

This script is called "Latin".

> on the one hand I get that most browser (and OS?) dev happens in the West by people unaffected by these issues.

As far as I can tell, this isn't in the browsers' hands. It's a font/unicode issue. Unicode has to encode these variants; and fonts have to support them.

That said, I am seeing that red table work in Firefox but not in Chrome; however this might just be because Firefox is my primary browser; and I've done a bunch of fiddling with font settings for any language I speak or am learning, and Chinese is one of them. The default OSX fonts may not be so good for that.


Interesting read. Chinese characters and Chinese character-like characters can quickly become hard to deal with as soon as you leave the BMP. The ids.txt file (Kanji database) is really a lifesaver in this case.
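For anyone wondering what leaving the BMP looks like in practice, here is a small Python sketch using the two characters from the article title. The helper name `is_in_bmp` is made up for this example.

```python
# The two characters from the article title, written as escapes so the
# example does not depend on font or editor support.
in_bmp = "\u53F1"          # 叱
beyond_bmp = "\U00020B9F"  # 𠮟

def is_in_bmp(ch: str) -> bool:
    """Return True if the single character lies in plane 0 (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF

for ch in (in_bmp, beyond_bmp):
    # Each UTF-16 code unit is 2 bytes; characters beyond the BMP need a
    # surrogate pair, i.e. two code units.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    print(f"U+{ord(ch):04X} in BMP: {is_in_bmp(ch)}, UTF-16 code units: {utf16_units}")
```

Software that assumes one UTF-16 code unit per character tends to mishandle the second one, which is one reason these characters are harder to deal with.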


I wish more handwritten kanji input systems would use stroke order and count as a weight and not a filter. As a learner I generally know the stroke order when looking at an unknown kanji, but as most input methods filter on order and count, a single wrong order or count means the matching kanji doesn't even show up in the list of possible matches.
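A rough sketch of the difference between the two approaches, in Python. Everything here is an assumption made for illustration: the tiny `STROKES` table, the one-letter stroke codes, and the use of plain edit distance as the weight. Real recognizers work on pen trajectories and are far more involved.

```python
# Toy data (an assumption, not a real dictionary): stroke sequences as codes,
# h = horizontal, s = vertical, z = turning stroke.
STROKES = {
    "十": "hs",
    "土": "hsh",
    "干": "hhs",
    "口": "szh",
    "日": "szhh",
}

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two stroke-code strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def hard_filter(drawn: str) -> list[str]:
    """Filter-style lookup: only exact order-and-count matches survive."""
    return [kanji for kanji, seq in STROKES.items() if seq == drawn]

def ranked(drawn: str) -> list[tuple[str, int]]:
    """Weight-style lookup: keep every candidate, closest match first."""
    scored = [(kanji, edit_distance(drawn, seq)) for kanji, seq in STROKES.items()]
    return sorted(scored, key=lambda pair: pair[1])

# 口 drawn with the first two strokes swapped.
print(hard_filter("zsh"))
print(ranked("zsh"))
```

With the filter, one swapped stroke leaves an empty result list; with the ranking, 口 is still among the candidates, just not necessarily first. A better weight than plain edit distance would be needed to put it on top.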


All the proprietary systems do that. For some reason, proprietary Chinese handwritten input systems are more widely available than Japanese.

macOS and iOS are good examples. They ship with a good one for Chinese and nothing for Japanese. If you buy a Sony Android phone you'll get good handwritten Japanese input out of the box.

The reason is that basic stroke order information is freely available, but you need much more information to build a good system (you need common abbreviations and mistakes, for example).


Han unification was really a pretty stupid idea.


Is the 'unspecified' rendering supposed to be the same as the 'popular' rendering (on the right); or is it supposed to be the one in the middle?

For me seemingly every example's left most and right most column (for the sets of 3) looked identical, while the middle form was different.


I tried a couple browsers, and I'm seeing the left and middle kanjis as being the same, with the right one different.


That’s left up to the font.


I see 「つかむ」 often being written as 掴む, but that seems to be a common abbreviation too, since the official version is 摑む. The simplification 國 → 国 is just applied there, similarly to the 辶 → ⻌ case.


One of the tables seems to show characters that are in the Unicode BMP but not in JIS X 0208 (many I think came from JIS X 0212). Another table of course shows characters not in the BMP.
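One way to probe this kind of coverage yourself is to try encoding a character with a legacy codec. The sketch below assumes Python's `shift_jis` codec is a reasonable stand-in for JIS X 0208 coverage; the helper name `encodable` is made up.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` can be encoded with the named Python codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Assumption: Python's "shift_jis" codec is used here as a stand-in for
# JIS X 0208 coverage.
for ch in ("\u53F1", "\U00020B9F"):  # 叱, 𠮟
    print(f"U+{ord(ch):04X} encodable as shift_jis: {encodable(ch, 'shift_jis')}")
```

The same helper can be pointed at other codecs or other characters from the article's tables to see where each one falls.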


And that is why I sometimes wonder if CJK encoding should have been something else entirely rather than part of Unicode, which is what led to Han unification.


Isn't there some form of case folding of one of the characters into the other?
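As far as I can tell, not in Unicode's own data. A quick check with Python's standard `unicodedata` module (the result depends on the Unicode version bundled with the interpreter, so treat it as a sketch rather than a definitive answer):

```python
import unicodedata

a = "\u53F1"      # 叱
b = "\U00020B9F"  # 𠮟

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # Normalize both and check whether they ever become the same string.
    same = unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
    print(form, same)

# str.casefold() is Python's case-folding operation; Han characters have no case.
print(a.casefold() == b.casefold())
```

Any mapping between the two would have to come from an application-level table rather than from normalization or case folding.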



