Katakana, Hiragana, and Unicode

innocenat · on Sept 28, 2022

To be honest, both questions can be answered in a few seconds by looking at the code point table for Hiragana/Katakana if you already know Japanese. Hence, that's why nobody write about it.

> How do the 46 characters map into the 90 characters?

Because there are actually more than 46 characters.

> Do they map the same way for both hiragana and katakana?

Yes. That's also how we do conversion between hiragana and katakana. By adding/subtracting 0x60.

lloeki · on Sept 28, 2022

For someone like me who knows somewhere between pretty much none to a very small bit of Japanese and slowly working my way up as time permits in a busy life, this was an interesting and very well presented article, saving me more than a few seconds of searching that I don't have to spare, and for which the reading time was both enjoyable and knowledge incrementing.

Hence, that's very much fortunate that someone wrote about it.

innocenat · on Sept 28, 2022

I don't disagree about that. I just answered the first question in the article about why there are no one writing about this.

spacehunt · on Sept 28, 2022

Another simple technical reason is that that's how JIS did it, and Unicode wants to have lossless round-trip encoding conversions in order to promote its adoption in East Asia at the time.

dhosek · on Sept 28, 2022

There are some interesting variations in different scripts thanks to how they were handled in pre-Unicode encodings. Perhaps the most interesting divergence is in the various scripts derived from the old Brahmi script. These are all abugidas (as are the Japanese kana) where vowels do not exist independently of consonants. But in Thai, for example, the syllable NA is written นา with น and า treated as separate characters, while in Devanagari, NA is written ना where न is the N sound and the A sound ा is a spacing mark which changes the shape and spacing of the first letter to give ना. Although a Thai reader will read the combination of consonant and vowel as a single entity, they are treated as two graphemes by Unicode, while the equivalent in Devanagari is a single grapheme (and it’s not simply because they’re printed connected since नाना will be connected but treated as two graphemes).

Perhaps most interesting in this respect is the comparison between the Devanagri ि and the Thai ใ which both appear before the consonant that they’re attached to, but in Thai the input will be ใ + ค to get ใค (so you input in the order of appearance rather than the order of pronunciation) while in Devanagari, the input would be क + ि to get कि (so you input in pronunciation order rather than graphic order).

innocenat · on Sept 29, 2022

Japanese Kana is syllabaries, not abugidas.

inkyoto · on Sept 29, 2022

> … in Thai, for example, the syllable NA is written นา with น and า treated as separate characters, while in Devanagari, NA is written ना where न is the N sound and the A sound ा is a spacing mark which changes the shape and spacing of the first letter to give ना.

Worth noting that «-a» is often implied and both, Thai and Devanagari (and in nearly all other Brahmi and Pali derived scripts), and is implicitely derived by the language speaker, and therefore is dropped from the spelling most of the time except for specific cases.

innocenat · on Sept 29, 2022

-a sound (ะ) isn't dropped in Thai in most case. It's only dropped in specific cases, mostly from words with Pali/Sanskrit origin.

superjan · on Sept 28, 2022

For curiosity: how are they sorted? Are hiragana/katakana symbols considered equivalent?

zerocrates · on Sept 28, 2022

The hiragana and katakana and various versions thereof for each mora all share the same "primary" Unicode collation value. Adding a dakuten or handakuten creates a secondary difference: e.g. は (ha) < ば (ba) < ぱ (pa).

As between the versions for the same mora, they get sorted with tertiary differences as: hiragana comes before katakana, small comes before regular-size, and for katakana regular width comes before halfwidth. There's also a "circled" set of the katakana that sort after the halfwidth ones.

So they're equivalent (or not) depending on how you're doing the collation/comparison.

innocentoldguy · on Sept 28, 2022

The article shows how they are sorted. Hiragana is used for things like Japanese words, particles, names, and to conjugate verbs. Katakana is used for things like foreign words, names, and sometimes emphasis. Both writing systems describe the same phonetics. For example, the hiragana か and katakana カ are both pronounced “ka.”

Pxtl · on Sept 28, 2022

I'm surprised they're both used, from that description it sounds like one would fall by the wayside, like cursive has in North America.

...

That said, culturally Japan seems like exactly the kind of place where, were they English-speakers, all the kids would absolutely be required to learn perfect cursive.

bsder · on Sept 28, 2022

> I'm surprised they're both used, from that description it sounds like one would fall by the wayside, like cursive has in North America.

Katakana, hiragana, and kanji are all in active use--that's why they don't fall away.

Kanji are your primary word base. They are sort of like root words in English.

Hiragana often serves as kind of a marker--endings of certain words as well as phrase markers (particles). These are particularly important because Japanese does not normally break words with spaces.

Katakana often denotes a foreign phonetic word or foreign names. Login is particularly good example for this forum: ログイン (ro-gu-i-n).

Japanese speakers actively use the differences as cues when reading. Watch a native Japanese speaker try to puzzle out Japanese learning materials for non-native speakers. If everything is written in hiragana (not uncommon for beginning materials), native speakers often have to puzzle over things a bit before they work out what a sentence says. This is one of the reasons why you want to get to Kanji as fast as possible when learning Japanese--the differences in script are important for reading comprehension.

You can see all of these in play on the Asahi Shimbun webpage: https://www.asahi.com/

popularonion · on Sept 28, 2022

Japanese people could ask the same question about why English continues to have uppercase and lowercase letters.

Actually when you look at the use of English in Japanese media, you’ll quickly notice a lot of unnatural-looking overuse of uppercase. That’s because to them it feels natural to use uppercase the same way they use katakana.

msbarnett · on Sept 28, 2022

> I'm surprised they're both used, from that description it sounds like one would fall by the wayside, like cursive has in North America.

In practice, there are no less than 4 separate scripts that are used in Japanese: hiragana, katakana, kanji, and romaji, and some mix of all 4 can appear in the same sentence.

It's not so much analogous to cursive, which is a different "style" of writing the same "thing" – katakana and hiragana developed at different times for different groups and came to play different roles, and there are (usually) semantic implications to which are used.

dfinninger · on Sept 28, 2022

I am very early on in my Japanese-learning journey. So if others contradict me, they are probably a better source. :)

But from what I understand Hirigana is used more for Japanese words, and Katakana more for loan words from other languages.

It actually leads to a nice shortcut for some words. If I’m reading Hirigana I’ll try to match that with words in Japanese that I know. However, if the word I’m looking at is Katakana, I’ll flip that off and start trying to match phonetically.

I assume with fluency this all becomes automatic, but I’m a ways off from that yet!

hinoki · on Sept 29, 2022

My understanding is that katakana is used for situations where it’s the sound that is important, not necessarily the meaning. So it’s used for loan words, but also for onomatopoeia, and when writing your name down on a waitlist at a restaurant. (Often it is not clear how to pronounce a name from how it is written)

enqk · on Sept 29, 2022

Katakana is also used for the names of animals. Even when the name is native japanese.

innocentoldguy · on Sept 29, 2022

The reverse is true for words like たばこ (tobacco). It's a foreign word but is almost always spelled using hiragana.

hinoki · on Sept 29, 2022

Yes! Sushi menus are often entirely in katakana.

teek · on Sept 30, 2022

This is not correct. Sushi menus for restaurants in Tokyo use a mixture of kanji, hiragana, and katakana.

smsm42 · on Sept 29, 2022

Yes, but Katakana is also sometimes AFAIK used for spelling out words that otherwise would be written with kanji, and sometimes for emphasis (like italics in Latin script).

AnIdiotOnTheNet · on Sept 28, 2022

English print still has two separate character sets with exactly the same pronunciation too. One is use most of the time, and the other is used to start sentences, for EMPHASIS on whole words, or to indicate proper nouns.

layer8 · on Sept 28, 2022

It’s more akin to italics in usage than to cursive.

rippercushions · on Sept 28, 2022

A very rough equivalence would be that hiragana is a regular typeface, while katakana is italics.

innocenat · on Sept 28, 2022

They are sorted by dictionary order (五十音順 gojuuonjun) and there are no duplicated symbols in hiragana-katakana blocks.

resoluteteeth · on Sept 28, 2022

There are a bunch of other annoying complexities with dealing with Japanese text like halfwidth/full-width characters: depending on what you're doing you may have to account for additional stuff like ｱ instead of ア, or Ａ instead of A. Ideally these wouldn't actually be used (this formatting should not be done at the character set level) but since they were included in unicode for backwards compatibility reasons, they do unfortunately get used a fair amount.

Also I guess this isn't specific to Japanese, but if you use normalization in NFD form, the modifiers like handakuten get split into separate characters (I don't think most people ever use unicode normalization but iirc mac filesystem paths are normalized so it can be really confusing when you do actually run into it).

rester324 · on Sept 29, 2022

Something which surprises me is that these full-width and half-width formattings are still enforced sometimes even these days in new Japanese web services.

Considering that there is a seemingly easy lossless transformation path from/to these full-width/half-width alternatives, it would suffice to only support the full-width form one would think (and do some transformation as necessary without the user being involved).

3np · on Sept 29, 2022

My pet peeve is when the same form requires different format in different fields, ie full-width name but half-width phone number is required. That, and the insistent breaking of the back button.

makeitdouble · on Sept 29, 2022

It's complicated.

You'll want to use half-width for full number fields because html number input is half-width, and you'd need to move to the "full" alpha numeric keyboard to get full-width. Also if you allow full-width it opens the door to Chinese numerals, which you'd have to convert as well.

Same thing for email address fields, allowing full-width character is user-friendly but you're pretty guaranteed to have no autocompletion on the user side, you deal with weird validations errors as some users will mistakenly convert part of the address to invalid chars, when forcing half-width pretty gets rid of that class of error altogether (assuming you don't allow full-width mail addresses in the first place. If I'm not mistaken they can be valid, including with a full-width domain)

Places where there's less room for error usually allow both (like addresses for instance, where you won't run any validation on it anyway. Except on old sites and anal retentive services that will force you to go full-width sometimes)

Symmetry · on Sept 28, 2022

I was confused by the lack of little `yx` characters there so looked it up on my own. The 'yu' in りゆ, "riyu" is 86 and the 'yu' in りゅ, "ryu", is 85.

viggity · on Sept 28, 2022

same with the small tsu which makes you kinda pause/emphasize the following consonant.

Ex. あさり = asari = ah - sa - ri あっさり = assari = ah <tiny pause, hard s> sa - ri

not to be confused with あつさり which is atsusari (which is a made up word), but because the tsu is regular sized, you pronounce it instead of altering other character pronunciations.

Also of note - they completely left out "n" ん in hiragana, ン in katakana.

And "wo" isn't really pronounced "wo", it is pronounced just "oh" and spelled "o" in romaji. And while there is a "wo" in katakana, I have never seen it used. It is used as a particle which is inherently a native japanese thing and ergo you use hiragana for it.

Anon1096 · on Sept 28, 2022

You see ヲ if you read stuff where there's a robot, alien, or super stereotypical foreigner speaking, since oftentimes their entire lines are written in katakana to feel non-native.

dwg · on Sept 28, 2022

Many applicable references can be found on the web. Example of a more complete table:

https://www.key-shortcut.com/en/writing-systems/%E3%81%B2%E3...

As can be seen in this chart, the code points for Hiragana and Katakana are x60 apart (also pointed out by @innocenat).

It's not clear why characters like ん, ン, and ヶ are excluded from the article. They are used extensively in everyday Japanese.

dudeinjapan · on Sept 28, 2022

And here's half-width katakana in non-Unicode, circa 1979: https://github.com/receipt-print-hq/escpos-printer-db/issues...

This is probably the system that is still used today by most bank printers in Japan. The code point usage was carried forward to JIS.

Waterluvian · on Sept 29, 2022

Tangential question: is it possible to evaluate the “information density” of characters of various alphabets and then evaluate, by setting aside all other important factors, the “efficiency” of an alphabet?

The Greek alphabet looks, to super biased me, fairly simple in comparison to others. And yet it only has 26 characters, not 90 or whatnot. So perhaps the complexity is completely necessary.

What about an alphabet that’s as minimalist as possible? Does one have to then fight with distinctiveness?

Are there any examples of alphabets that are both inefficient and hard to distinguish between similar characters? Did they evolve or fall out of favour?

What about data loss? How subtly “decayed” could letters be before you can’t differentiate similar ones?

I’m suddenly so interested in all this…

solarmist · on Sept 29, 2022

Language is optimized for speech not writing.

So it’s not exactly what you’re looking for, but very similar.

“Human speech may have a universal transmission rate: 39 bits per second No matter how quickly you speak, you still share the same amount of information” (https://www.science.org/content/article/human-speech-may-hav...)

As for writing efficiency I would say Korean Hangul is the best writing system in the world. It combines the best attributes of alphabets and syllabaries.

Also western writing systems aren’t as simple as they seem. Every letter has multiple forms upper/lower and print/cursive making at least 4x (cursive forms often vary depending of location within a word) as many forms as you expect.

Symmetry · on Sept 29, 2022

I'm not sure about alphabets but I'm pretty sure there are differences in written languages. All languages tend to have roughly the same bitrate when spoken in good conditions. But some languages tend to have simpler sounds spoken more rapidly, like Spanish or Japanese, and some tend to have more complicated sounds spoken more slowly, like English or Chinese. So if you compare the traditional Latin alphabet encoding of English versus Spanish then English will tend to be denser, though at the expense of a lot of pronunciation ambiguity for unfamiliar words due to having way more vowel sounds than the 5 the Latin alphabet has characters for.

the_clarence · on Sept 28, 2022

I wrote a unicode library for Pascal back when I was in my 40's, it is challenging

kensai · on Sept 28, 2022

John Cook is a genius. All his blog posts are gems. I love his job, I wish I could do it.

gweinberg · on Sept 28, 2022

What are these mysterious symbols that don't fit in the table?

teshigahara · on Sept 28, 2022

ゐ (wi) 𛄟 (wu) ゑ (we) 𛄠 (yi) 𛀁 (ye)

ゐ and ゑ (these days pronounced the same as Japanese i and e) are known by all native Japanese speakers, were used historically, and actually still see some use in certain scenarios (like signs, or names of things). The other ones were never actually used much afaik and only recently were introduced to Unicode at all, and are probably unknown to most Japanese people except those interested in this kind of thing.

sylware · on Sept 28, 2022

Is there a plain and simple C written text shaper for unicode Katakana, Hiragana strings?

mananaysiempre · on Sept 28, 2022

Modern horizontal hiragana and katakana are not complex or huge scripts, there are several dozen base characters (of one or two different widths) and two or so accent marks. There might be no spaces, you break lines whenever they run out without considering word boundaries. I expect anything capable of dealing with Latin should be able to handle this, and it hardly deserves the name of "shaping".

(Adding kanji into the mix somewhat complicates matters, as there are so many potential characters you cannot just blindly cache the rasterization of every one of them and never throw any away, but that's also not the degree of complexity you get from Arabic and such.)

amichal · on Sept 28, 2022

Line layout rules are a bit complex. Long long ago when i was 19 someone handed me a photocopied set of around 50 rules for line breaking Japanese text and followed them to implement our first draft of it in a text layout program we were adding Japanese support to. I implemented it blind, I dont speak Japanese , it never shipped and I dont remember the rules but i do remember quite some complexity around punctuation etc. This section from W3C covers some of what I remember and quite a bit more I'm sure https://www.w3.org/TR/jlreq/#line_composition_rules_for_punc...

brigandish · on Sept 29, 2022

In my experience, Japanese often gets broken at the page width, regardless of any other consideration. It makes reading, for a learner like me, a pain, as you have to read ahead on the next line to be sure you’ve got all the characters.

Arnavion · on Sept 29, 2022

>there are several dozen base characters (of one or two different widths) and two or so accent marks. There might be no spaces, you break lines whenever they run out without considering word boundaries.

You're forgeting about small kana which can be arbitrarily appended to the big kana. Regular words only have one or two small kana after a big kana, but technically there can be arbitrarily many. Either way, they affect word-breaking.

vore · on Sept 28, 2022

There's no shaping or really even anything fancy typographically required for kana, just put the glyphs next to each other fixed-width no kerning.

ranger_danger · on Sept 28, 2022

what is a text shaper?

HelloNurse · on Sept 28, 2022

The software that controls how to render text to images, relying on fonts but considering higher level issues (e.g. line breaking and metrics for multiple characters) than the low-level information in a font.

sapkernel · on Sept 28, 2022

this is cool language learning project in code. Thanks