
Sorting in Japanese – An Unsolved Problem (2011) - lelf
http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-%E2%80%94-an-unsolved-problem/
======
mopreme
Blog author here. Surprised to see one of my old posts on the front page while
browsing Hacker News.

It's interesting to reflect on what has improved since I wrote it, and what
has not.

Both Android and iOS, for instance, provide mechanisms to get this right, if
you know to use them and expose them for those locales (and only those
locales). For example, both have a Contact object that contains corresponding
phonetic-reading fields for first and last names.

iOS Contact - see phoneticGivenName, phoneticFamilyName
[https://developer.apple.com/documentation/contacts/cncontact](https://developer.apple.com/documentation/contacts/cncontact)

Android contact - see PHONETIC_GIVEN_NAME, PHONETIC_FAMILY_NAME
[https://developer.android.com/reference/android/provider/Con...](https://developer.android.com/reference/android/provider/ContactsContract.CommonDataKinds.StructuredName)

For fun I tried using Google Translate to translate the kanji name in the post
淳子 in various contexts to see what Google thinks it is:

- 淳子 translated to "Dumpling"

- 淳子さん translated to "Atsuko"

- 淳子様 translated to "Sadako"

- 淳子さま translated to "Mrs. Lion"

- 淳子殿 translated to "Mr. Reiko"

- 私の名前は淳子です translated to "My name is Miko"

- 私の名前は淳子です。 translated to "My name is Reiko."

- 私の名前は淳子です！ translated to "My name is gyoza!"

I expected them all to translate to Junko or Atsuko. The variation and
unexpected results for what should be exactly the same thing are very
interesting.

~~~
lifthrasiir
Many statistical machine translators like Google Translate are very sensitive
to the availability of bilingual corpora. In this case GT seems to have learned
that -子 means either a dumpling (餃子) or a female name ending with -ko, but
hasn't seen enough data to determine whether the preceding 淳 is pronounced
Atsu- or Jun- in the given context, so it is guessing. Combined with a
user-contributed corpus this can be rather disastrous: several machine
translators had translated Japanese "初音ミク" [1] to Korean "이명박" [2] ;-)

[1]
[https://en.wikipedia.org/wiki/Hatsune_Miku](https://en.wikipedia.org/wiki/Hatsune_Miku)

[2] [https://en.wikipedia.org/wiki/Lee_Myung-bak](https://en.wikipedia.org/wiki/Lee_Myung-bak)

------
hermitdev
Am I missing something subtle in this Kanji example, or are all 4 names
actually written the same?

"There are four Japanese women whose names you have to sort: Junko, Atsuko,
Kiyoko, and Akiko. This does not seem difficult, until they each show you how
they write their names in kanji:

淳子 (Junko) 淳子 (Atsuko) 淳子 (Kiyoko) 淳子 (Akiko)"

I'm not familiar with Japanese at all (and have never had to deal with
localization beyond date formats), even less so in its written form, but these
seem like wildly different pronunciations for something spelled the same.

I know English has its own large set of warts with pronunciations and
spellings (even disregarding US vs every other English speaking nation), but
this seems overly odd.

How do you get further context on how to pronounce a proper name like this?
The post mentions context, but obviously in the above example, lacking the
"abc" spelling (as the post terms it), what context do you have to know the
proper pronunciation?

~~~
Iv
It seems odd because it is. I live in Japan and tried to learn Japanese the
way I learned English and German: through reading, as the true asocial geek I
am. Well, that's not possible unless you know about 1000 kanji or stick with
kids' books or some manga (which, actually, I am not that interested in).

To learn Japanese you need to learn to speak it first.

I stopped caring about kanji when I realized that a 15 year old, top of her
class, was having difficulties reading news articles because of some words she
did not know.

That's polemical, and I would not say it that bluntly to my Japanese friends,
but I feel like kanji are a vastly inferior writing system compared to the
alphabets they have. The reason why it still exists is to create an
educational hierarchy: the more kanji you know, the more educated you are.

Kanji are of Chinese origin; they occupy a place in Japanese culture similar
to the one Latin has in European cultures. I wish they would do like the
Koreans and get rid of them altogether, but it will never happen. The culture
here is far too conservative for such a brutal change. They even regard
furigana (kanji "subtitles" that spell out the pronunciation) as a dumbing
down.

This is an example where obfuscation is confused for depth.

~~~
mikekchar
And as a counter example, I like kanji and learned Japanese by reading (though
I also like manga, so perhaps it was helpful). I find kanji _very_ helpful in
learning vocabulary. In fact, I actually measured how fast I could learn
vocabulary containing kanji I didn't know by learning the kanji first, versus
learning the vocabulary phonetically. It was about the same. However, as there
are only about 2200 kanji in the list of commonly used kanji and there are
about 20,000 words you need for adult level proficiency, it means that each
kanji is used in, on average 10 words. Actually, it's even better than that
because the first 1000 most common kanji are used in 90% of the words you will
encounter. I can learn vocabulary with kanji I know dramatically faster than
vocabulary with phonetics alone. Ironically, I think it's even more helpful in
Japanese than it is in Chinese because there are so many readings for the
character that if you learn the word phonetically, you may have no idea what
the root of the word is.

Kanji is frustrating to write, but a joy to read, IMHO. Yes, you need to learn
2000 (or so) characters, but, let's face it -- you've got _years_ to do it. I
can now read words that I don't know and have a pretty good idea what they
mean -- something that would be practically impossible phonetically in
Japanese. In fact, when I'm faced with new vocabulary, I _often_ ask people to
write it for me (just in the air) -- as soon as I see the kanji I almost
always know what the word means.

I've often thought how superior this writing system is to roman phonetic
characters (with bizarre historical spelling anachronisms to boot) ;-) I
appreciate that you don't agree, but I hope you'll also understand that I have
no intention of grading your education level on the number of kanji you know.

~~~
echelon
I agree with you wholeheartedly.

Kanji has aided in my study of the Japanese language in ways I couldn't fathom
before trying to do so. I made it to N5-level proficiency [1] after a few
semesters of study and only begrudgingly studied Kanji because it was required
by the text. I hated it and really only wanted to speak and understand spoken
Japanese.

Then I discovered Wanikani [2], a spaced repetition service that focuses on
teaching Kanji reading. Unlike Anki decks, Kanji is literally the only thing
Wanikani focuses on, but it does so incredibly well. With diligence it's even
possible to learn all Joyo Kanji within about a year.

After taking Kanji seriously I began to understand the root meanings of the
vocabulary I had known all along. It let me form associations I previously
wasn't privy to and enabled a much faster uptake of new vocabulary. Instead of
remembering why the syllables さいこう (saikou, pronounced "sigh co") mean "best"
or "utmost", I can simply recall the kanji: 最高. The character 最 means "most"
and the character 高 means "tall". "most tall" = "utmost".

Japanese is extremely logical like this, as is their grammar. For fans of
logic it's really top-notch.

If you're learning Japanese, please don't do yourself a disservice by skipping
Kanji. A fully-integrated learning approach that includes reading will yield
dividends in the long term.

[1]
[http://jlpt.jp/e/about/levelsummary.html](http://jlpt.jp/e/about/levelsummary.html)
[2] [https://www.wanikani.com/](https://www.wanikani.com/)

~~~
Iv
Of course kanji helps. Like Latin helps with many European languages. If you
learn Latin before learning French it will feel extremely logical.

Making it a pre-requisite to write a single sentence in French, however, seems
like a waste of time and goodwill.

------
needle0
As a Japanese native, it pains me a lot when westerners come upon linguistic
differences like this and start calling them things like "overly odd" or
"inferior" as they seem to be doing in the comments here. I can't help but
smell a whiff of anglophone condescension - I had thought political
correctness had swept over the west over the last few years, but perhaps it
only selectively applies to those who can talk back to you?

~~~
cmroanirgo
Reading the article showed me how ignorant I was to this problem... so much so
that I'm surprised I haven't seen more articles like this in the past.

For all the talk on ML and PC here, it's surprising that language is still a
huge unsolved issue. As someone who knows a tiny amount of Japanese, I'm
surprised that a common solution is to provide the katakana equivalent.
Outside of the tech realm, is this really what a native Japanese would write
on (say) a physical/ paper document: both the Kanji and Katakana?

~~~
fenomas
Yes, entering a name's kanji and kana separately is standard for most any form
in Japan, whether digital or on paper.

------
trw999
[https://en.m.wikipedia.org/wiki/Four-Corner_Method](https://en.m.wikipedia.org/wiki/Four-Corner_Method)

You can sort Chinese characters (including Kanji, though I'm not sure the
Japanese use it) by the Four Corners Method. Why would you need to sort
kanji phonetically in the first place? Do Japanese users actually expect names
to be sorted phonetically? English speakers don't expect names to be sorted by
IPA, so consistency of the sorting scheme should be all that matters.

~~~
fenomas
> Do Japanese users actually expect names to be sorted phonetically?

Yes - or that's how a human would sort them, at any rate.

~~~
trw999
I don't think you're in a position to speak for all of humanity.

~~~
fenomas
I meant a human as opposed to an algorithm.

Point being: kanji words aren't _always_ sorted phonetically, for the reasons
described in the article, so a user may not be surprised if they aren't. But
when a human is sorting kanji words they do so phonetically by reading.

------
quelltext
>In English, the logo goes with their saying: “Everything from A to Z.” This
is indicated by the arrow. But in Japan, and any other country that doesn’t
use English, A and Z aren’t always the first and last letters of the alphabet.

This is grasping at straws.

a) Many English speakers/countries are unfamiliar with the meaning of the
logo's arrow.

b) They will get the meaning if explained.

c) Some won't because A-Z standing for "everything" requires a given level of
literacy (cf. alpha and omega).

d) They will get it easily when explained because it's a simple concept.

Japanese people learn the English alphabet pretty early on in school, and
while some may not be familiar with the saying "A to Z", the logo and its
meaning still work perfectly well and would get an aha reaction when explained.

------
level3
For those who want to know more about this:

The article touches on just the tip of the iceberg. You might think that all
you need to do is add an extra field for phonetic readings, and then simply
sort on that field, but there are a lot of things that can go wrong. A naive
sort (i.e. based simply on character code) will hit the following snags:

1) Hiragana vs Katakana

The article focuses more on Kanji vs Kana, but Japanese users will expect
Hiragana and Katakana to be properly sorted together. Either you normalize
your sort field (by converting everything to Hiragana, for example) or you use
a Kana-insensitive collation.

2) Half-width characters

Katakana can be encoded as full-width or half-width characters (カ vs ｶ).
Generally you want these treated as the same, so again you need to normalize
or use a width-insensitive collation. There are also full-width alphabet
characters (Ａ vs A).

3) Youon and sokuon (small kana)

These are actual different characters (ゆ vs ゅ, つ vs っ), so you can't
normalize, but you want them sorted together. Here you need a collation that
treats the small and full-size forms as equal, much like a case-insensitive
collation does for upper and lower case.

4) Dakuten/Handakuten

Like youon, these are also different characters (は vs ば vs ぱ) so you can't
normalize, but you want them sorted together (insensitively). A sensitive sort
will give you (はね, ばつ, ぱすた) while an insensitive sort will give you (ぱすた, ばつ,
はね).

There has been a lot of work around this, resulting in many different database
collations over the years, some of which result in sorts that would greatly
confuse Japanese users. As of today, you probably want to be using (in the
case of MySQL) utf8mb4_0900_ai_ci or something similar.
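
If you cannot lean on a database collation, the four points above can be approximated by hand with a normalized sort key. The following is only a rough sketch, assuming the reading field contains kana only; the helper names (`kata_to_hira`, `strip_voicing`, `kana_sort_key`) are made up for illustration, and comparing hiragana by code point is used as a stand-in for proper gojuon collation, which a real collation library would handle more carefully.

```python
import unicodedata

# Small kana mapped to their full-size forms (youon, sokuon, small vowels).
SMALL_TO_LARGE = str.maketrans("ぁぃぅぇぉっゃゅょゎ", "あいうえおつやゆよわ")


def kata_to_hira(text):
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above the matching hiragana.
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def strip_voicing(text):
    # NFD splits ば into は + U+3099 (dakuten) and ぱ into は + U+309A
    # (handakuten); drop those combining marks and recompose the rest.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if c not in "\u3099\u309a")
    return unicodedata.normalize("NFC", kept)


def kana_sort_key(reading):
    # NFKC folds half-width katakana (ｶ) and full-width Latin (Ａ) to their
    # ordinary forms, which covers the width problem.
    folded = kata_to_hira(unicodedata.normalize("NFKC", reading))
    # Primary key ignores voicing marks and small/large differences.
    primary = strip_voicing(folded).translate(SMALL_TO_LARGE)
    # The folded form is kept as a tiebreaker so the order stays deterministic.
    return (primary, folded)


words = ["はね", "ばつ", "ぱすた"]
naive = sorted(words)                          # plain code point order
insensitive = sorted(words, key=kana_sort_key)
```

With the three example words from point 4, `naive` keeps the order はね, ばつ, ぱすた while `insensitive` yields ぱすた, ばつ, はね, matching the two orderings described above.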

~~~
txtsd
I'd assume you'd want ゅ and っ to be added to the kana they're attached to, and
then sorted. I'd want my き and きゅ together and び and っび together.

~~~
level3
That does make sense in a way, but I don't think that would feel natural to
any native Japanese speaker. I'm not native and even I would find that
ordering very odd.

At the very least, it would make the sorting algorithm a lot more complex if
you had to look ahead at later characters in order to sort the current prefix.

------
cooper12
Just a thought experiment, don't take it too seriously:

The crux of the issue is that kanji don't have an inherent "natural" ordering
that a user would expect. Sorting by their character code doesn't mean
anything to a Japanese person. But what if we made our own standard of what
constitutes a "natural order"? There's nothing about A–Z that makes the alphabet
obligated to be in that order (and not something like based on sound or shape)
other than it being the convention that developed. Even hiragana can have
different orderings (AIUEO vs IROHA [0])

One proposed method would be to do it how the dictionaries do it: first sort
by major radical, [1] and then by stroke count. This is something most
Japanese learn when learning how to write characters anyway (of course
ambiguities would arise when the radical is shared and the stroke count is the
same, but we could just choose a third arbitrary factor; we'd also have to
decide on a specific written form as stroke count can differ depending on
whether it is handwritten or the font).

We could then teach our approach to schoolchildren and it would just become
accepted over time like other things they learn. But wait you say, it's more
natural for them to sort on pronunciation. However, if I gave you a list of
polygon names and told you to sort by the number of sides they had, you'd be
perfectly capable of doing it despite that not being alphabetical. Things are
less "unnatural" if you grew up learning them and your brain doesn't
experience dissonance.

Anyway, just my hot take.

[0]:
[https://en.wikipedia.org/wiki/Iroha](https://en.wikipedia.org/wiki/Iroha)

[1]:
[https://en.wikipedia.org/wiki/Radical_(Chinese_characters)](https://en.wikipedia.org/wiki/Radical_\(Chinese_characters\))
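
As a rough illustration of the radical-then-stroke-count ordering proposed in this comment, here is a minimal sketch. The `KANJI_INDEX` table is a tiny hand-entered sample (Kangxi radical number, total stroke count of the modern Japanese form), not a real dataset, and using the code point as the "third arbitrary factor" is just one possible choice.

```python
# kanji -> (Kangxi radical number, total stroke count); illustrative sample only.
KANJI_INDEX = {
    "女": (38, 3),
    "好": (38, 6),
    "子": (39, 3),
    "字": (39, 6),
    "学": (39, 8),
}


def radical_stroke_key(word):
    # Compare character by character: radical first, then stroke count,
    # then the code point as an arbitrary but stable tiebreaker.
    # Characters missing from the sample table raise KeyError.
    return [KANJI_INDEX[ch] + (ord(ch),) for ch in word]


names = ["学", "好", "子", "字", "女"]
ordered = sorted(names, key=radical_stroke_key)
```

Here `ordered` comes out as 女, 好, 子, 字, 学: the two characters under the 女 radical first, then the three under 子, each group by ascending stroke count.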

~~~
jpatokal
The only problem with that is that the resulting order would be useless for
many applications. Say you're looking up your friend Tanaka Tarou, but you're
not sure which characters his name is written with. If the sort order is
phonetic, you can find the name and likely work out that this is the Tanaka
you were looking for. But how do you search for a name in a kanji-indexed list
if you don't know the kanji?

Incidentally, this is why kanji dictionaries invariably have _multiple_
indices: one by radical, the other by pronunciation(s).

~~~
cooper12
Great point. I was considering a very visual-minded reader but of course not
everyone would be so good at it nor would they always just care about how the
kanji looks rather than other aspects. It's a difficult problem indeed... My
intention is to show that we'd need some sort of radical solution that might
not be what we'd immediately jump to (for example the current approach the
author mentions is having a separate field for readings, but this is clearly
resource-intensive and wouldn't work on arbitrary data). To solve sorting for
Japanese, I feel we need to rethink what it means to sort.

------
tmm84
I've seen this problem solved a few different ways.

1) Have a romanized version of the value to sort.

2) Have a hiragana or katakana version of the value to sort (hiragana >
katakana > roman order for sorting).

Excel seems to sort things without being told the reading for anything. In JS
and a few other languages Japanese is sorted based on UTF-8/16 code. This
works for everything but kanji, because working out the reading requires a
human.
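
To make the difference concrete, here is a small sketch of approach 2, assuming each record carries a user-supplied hiragana reading next to the written name. The `Contact` class and its field names are hypothetical, and plain code point comparison of the hiragana stands in for a proper kana collation.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str     # as the person writes it, e.g. in kanji
    reading: str  # phonetic reading in hiragana, entered by the user


# The four names from the article: written identically, read differently.
contacts = [
    Contact("淳子", "じゅんこ"),
    Contact("淳子", "あつこ"),
    Contact("淳子", "きよこ"),
    Contact("淳子", "あきこ"),
]

# Sorting on the written form cannot separate them: every key is equal,
# so the (stable) sort simply returns the input order.
by_code_point = sorted(contacts, key=lambda c: c.name)

# Sorting on the reading gives the order a person would expect.
by_reading = sorted(contacts, key=lambda c: c.reading)
```

`by_reading` ends up as あきこ, あつこ, きよこ, じゅんこ, while `by_code_point` is unchanged from the input, which is the whole problem in miniature.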

~~~
Technetium_Hat
In JavaScript, I imagine using string.localeCompare would give a more helpful
sort than just UTF-8 value.

------
GorgeRonde
I think taking people's names as an example is extreme, because when Japanese
people give a name to a baby, they choose both a phonological form and a
kanji-based transcription: phonologically, they tend to pick quite a common
name (like Christian names in the west, in contrast to Native American names,
which are a lot more specific to the person), but try to be clever and
original when it comes to writing it in kanji.

And I'm not even sure Amazon has an additional input field just for the sake
of sorting names. Isn't this a common practice in the country? (If I have to
call a customer, how am I supposed to greet her if I can't pronounce her
name?)

------
aliswe
If I'm not mistaken, another unsolved problem is generating URL segments (or
"slugs") from characters such as Chinese, Arabic and possibly Japanese as
well.

~~~
kijin
You just use a (somewhat cleaned up) UTF-8 representation as a slug. UTF-8 URL
components are universally supported (transparently encoded and decoded) in
all modern browsers, and Google shows the real characters instead of the
urlencoded version. You still end up with the urlencoded version when you hit
Ctrl+C, but I expect that to be fixed in the near future as Unicode becomes
even more widely used.

For example, the following link should work perfectly well in all modern
browsers:

[https://ja.wikipedia.org/wiki/メインページ](https://ja.wikipedia.org/wiki/メインページ)
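
For anyone curious what "transparently encoded and decoded" means in practice, here is a small sketch using Python's standard `urllib.parse`. The `make_slug` helper is a made-up, deliberately minimal stand-in for the "somewhat cleaned up" step; a real site would likely do more.

```python
from urllib.parse import quote, unquote


def make_slug(title):
    # Hypothetical cleanup: collapse runs of whitespace into single hyphens
    # and keep every other character, kanji and kana included.
    return "-".join(title.split())


title = "メインページ"
slug = make_slug(title)

# What actually goes over the wire: the UTF-8 bytes, percent-encoded.
encoded = quote(slug)
url = "https://ja.wikipedia.org/wiki/" + encoded

# What the browser shows in the address bar after decoding.
decoded = unquote(encoded)
```

`encoded` is the long `%E3%83%A1...` form you get when copying the URL, and `decoded` round-trips back to メインページ.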

------
augbog
Ran into this issue at work the other day. Relevant SO question:
[https://stackoverflow.com/questions/54543528/intl-collator-s...](https://stackoverflow.com/questions/54543528/intl-collator-sorting-japanese-why-does-collator-not-prioritize-japanese-chara)

------
rootsudo
This is great, I never took this into consideration and I actually enjoy
Japanese as a hobby. I'm at JLPT N4 and my reaction was "wow, yeah, that makes
sense, it's a problem."

I always like unique problems like this, I never considered ABC vs Kana, Kanji
-- and Romaji.

------
seanlinmt
Interesting. I didn’t know this was a problem. Kanji are Chinese characters.
Since Chinese dictionaries exist, I’m assuming that sorting has been solved
for Chinese characters, so why can’t the same method be used for kanji?

------
Razengan
I'm not native to Japanese or English, but I do think Kanji is beautiful and
kana is "better" than the Alphabet, and although I dislike resorting to
whataboutisms† to assert a point, @everyone here who suggests the abolition of
kanji, please consider this proposal for "simplifying" the English language:

• Consolidate letters with identical pronunciations into a single letter: A/E,
E/I, C/S, C/K, C/Q, G/J, I/Y, K/Q, U/OO, V/W, X/KS

• Split letters with ambiguous/variable pronunciations into multiple letters
with fixed pronunciation: A, C, E, G, I, O, S, U, X, Y

• Remove any "silent" letters from all words.

• Disambiguate homographs and homonyms+homographs (e.g. lie, fair.)

I'm sure I missed or left out many examples, but would any of these proposals
fly, at all?

Do they not disregard all the subtle nuances and historical significance which
only advanced learners might appreciate? Do you feel a little aghast at my
naiveté for even suggesting this? What would be the response of native
speakers to a foreigner campaigning for a more streamlined and consistent
English?

† _or more accurately: Tu quoque_

\----

I'll just leave a piece of fun trivia as an example of the cool stuff that
would be lost if kanji was abolished:

The "slang" for a female ninja is _kunoichi_ : く _ku_ , ノ _no_ , 一 _ichi_ (the
first character being _hiragana_ and the second _katakana_ )

and the kanji for woman is 女, made up of the following strokes: く, ノ, 一

It's something that I noticed on my own some time ago and it brought a smile
to my face, like someone discovering an in-joke. There is a lot of arguably-
clever wordplay like this in Japanese, and losing that would just make things
bland for everyone, and for what? Just to make the language a little easier to
stomach by Westerners?

What we should be campaigning for is better resources and tools for learning
and looking up kanji.

------
ErotemeObelus
The correct word is collation.

------
yutori
Protip: Kirakira name

------
expat2003
Some statements you have made about the Japanese language are ambiguous at
best. Let's clear them up, but first let's be precise with the terminology.
Pronunciation and sound are not interchangeable terms. Every sound (or
sequence of sounds) in Japanese is associated with a specific character. It
happens that some Kanji have multiple sounds, or that a group of Kanji shares
the same sounds, but how each is read is still unambiguous. There is no
pronunciation involved. To give a quick example, あ has one and only one sound,
while in English the letter `a` is pronounced differently in pal, Paul,
paediatrics, and so on. For the sake of simplicity it is a widespread, sloppy
practice to use "reading", "pronunciation" and "intonation" interchangeably,
but you should know the difference. A pronunciation changes based on the
preceding and/or following characters. There's no such concept in Japanese.
The sounds of the Japanese language are uniquely dictated by Hiragana.
Therefore what you have is a "reading". A reading is made of one or more
sounds. But once again, readings or sounds do not have multiple
pronunciations. I hope I was able to shed some light on this complicated
subject. It's a nuance, but it makes an important difference.

Quoting: "I should note that there are two different alphabetical sorting
orders in Japanese. For this article I am going to use the a i u e o (あいうえお)
sort order." There is a single alphabetical order in Japan too, just as we
know it. The order of the Hiragana works exactly like our alphabetical order:
a fixed sequence of characters from A to Z is methodologically equivalent to
the Hiragana from あ to ん, although no words start with を or ん, the last two
characters of the alphabet. The Katakana alphabet is exactly like Hiragana,
only the `design` of the characters changes. It is only used to write, in
Japanese sounds, words imported from non-Japanese languages (Chinese and
Korean are exceptions because you can use Kanji to write them, although when
writing just Chinese or Korean sounds you will still be using Katakana).

When talking about `ordering` we normally refer to the sequence in which we
list the Kanji: by their Hiragana sounds is the most common way of listing
them. They can also be listed by what in English are called radicals (the
foundational shapes composing the Kanji ideograph), by stroke count, by their
Chinese reading (the Japanese sound of the equivalent Chinese Kanji), or even
by frequency. In Japanese, `ordering` by meaning is also frequently used, in
which Kanji may or may not be listed and only Hiragana is used to write all
the words. This method is the closest Japanese equivalent to what we are
accustomed to calling an English dictionary. More here
[https://en.wikipedia.org/wiki/Japanese_dictionary](https://en.wikipedia.org/wiki/Japanese_dictionary)

I think it's already clear where the problem of sorting lies. Kanji, Hiragana
and Katakana, though different-looking alphabets, each have a well-defined
scope and relationship in the Japanese language. Kanji are words, and thus
have meaning; Hiragana provides a vocalization for the Kanji and also fulfils
the crucial grammatical role, providing the language with adverbs,
prepositions, particles, determiners, verb conjugations and more. Katakana
usage is limited to words that are foreign in nature, hence they exclusively
represent sounds and carry no meaning. Consequently, when reading Japanese on
a general subject you mostly encounter Kanji, with some Hiragana connecting
them (it needs to be said that very common words are often written in Hiragana
only, e.g. こんにちは means hello) and sporadic Katakana when an imported word is
used. This happens because the Japanese language has hardly any punctuation at
all: yes, there's a full stop, a comma, and a way to mark direct speech, but
no concept of empty space.

You can read Japanese in whichever direction the text is set. Traditionally
it's read top to bottom, from right to left. Modern books are read as ours
would be, although sometimes you turn the pages backwards to advance, e.g. in
comics. The English alphabet is used for very technical terms or for proper
names, where the context may sometimes be better served by romaji. But that's
not a rule by any means. Katakana is employed more often. The adoption of the
Western alphabet (romaji in Japanese) is similar to how you would take up
French, Spanish, Italian or German words in your writing. (continued)

~~~
expat2003
Quoting from "Sorting Settings": "In this example you can see ABC and katakana
are separated. Kanji are then separated from katakana. There were no hiragana
in this list[...]" The Hiragana on that list are the characters と and の. On
that particular list の expresses correlation and と simply means `and`. For
instance 地域 と 言語 の オプション (note that I have entered the `spaces` myself for
clarity's sake, but there should be none) is a very instructive example. Those
are actually three distinct nouns you are trying to order as one: 地域 reads
ちいき (Hiragana for `chiiki`) and means area; 言語 reads げんご (Hiragana for
`gengo`) and means language; オプション reads opushon and is, you guessed it,
option in English. Note I didn't write that オプション means option. オプション IS the
word `option` written with Japanese sounds or, in other words, written in the
Japanese script, i.e. Katakana. So 地域と言語のオプション can be translated as
"Regional and Language Options".

Quoting from "Sorting Names": "It is very possible to have different people
with the same name write their name in different character sets. The
traditional way of writing the Japanese name of Ayumi would be written in
kanji; a modern, stylish way would be to write it in hiragana, and a second
generation Japanese-American might write their name in katakana or the
alphabet." Japanese people always write their names in Kanji. They don't use
different character sets. When the Kanji composing their names can have
several readings, they write what's called Furigana, the Kanji reading,
alongside, usually as a subscript or superscript. Furigana is written in
Hiragana (not Katakana as you stated later on), but on the Internet, where
websites could potentially be read by a non-Japanese audience, Katakana might
be used in some cases. Nonetheless, as I said earlier, Hiragana and Katakana
differ only in script style, if you wish to call it that. So which script is
in use is not a relevant issue for foreigners. All the other statements are a
matter of opinion, save for the last: Japanese people do sometimes write their
names in romaji, aka the "ABC alphabet", especially when they are dealing with
foreigners at any level. Why being a 2nd gen Japanese-American would entail
writing one's own name in Katakana, though, beats me. It's a tiny bit like
saying, forgive me here, that a 2nd gen Italian-American would write their
name in Latin.

Quoting from "Kanji - The Real Problem": "Kanji have multiple pronunciations,
determined by the context in which it appears.[...]Only from the context in
which the kanji appears do you know how to pronounce it." That's like saying
the pronunciation of the word 'pool' is determined by whether the context
refers to water or balls. It is not the pronunciation but the meaning of the
word that is changed by the context. We all agree on this. Single-Kanji words
change their meaning based on the context but, like the word 'pool', their
reading stays the same. E.g. あめ (reads ame) means 'candy' if written 飴 or
'rain' if written 雨. In this case the Kanji itself controls its meaning and
reading, not the context. When a single Kanji is a verb, the meaning and
reading change with what's called Okurigana (Hiragana written after the Kanji
that serves the purpose of conjugation), while the Kanji remains invariant.
For example 着く and 着る read つく (tsuku) and きる (kiru) respectively; the former
means 'to arrive' and the latter 'to wear'. Single Kanji aren't at all like
you describe them.

Compound Kanji words follow a different ruleset that depends on how many Kanji
they contain. Generally speaking, a Kanji in a two-Kanji word has multiple
readings depending on which word it appears in and where it appears in that
word. You can learn the rules, or you can get used to them just by seeing them
used very frequently. This case alone is as you've described: you need to know
the context in which the Kanji lives. But since the meaning also changes with
the reading, you may be able to catch the overall meaning of the sentence
without being able to read that single word. Compound Kanji words with
multiple readings aren't that common, though, and they generally represent
common words that don't take much effort to memorize. Compound words composed
of more than two Kanji read unequivocally in one way, as English words do,
with very few exceptions. More here
[https://ja.wikipedia.org/wiki/%E5%90%8C%E5%BD%A2%E7%95%B0%E9...](https://ja.wikipedia.org/wiki/%E5%90%8C%E5%BD%A2%E7%95%B0%E9%9F%B3%E8%AA%9E)

If anything, what, quote, "[...]keeps students up nights studying for
years[...]" isn't the multiple Kanji readings, but the fact that you need to
know between 2000 and 3000 Kanji and the combinations of them that build words
(mostly two-Kanji words). So it's like having permutations with repetition
(ordered arrangements) of 3000 syllables that make words in pairs or singly.

Quoting "Here is an example: 私は私立大学で勉強しています。[...]A second year Japanese
student could figure this out. For a computer, this is a very difficult
problem." The choice of example is particularly unfortunate. This isn't
difficult at all for a computer, provided you understand the Japanese
language. Let's dissect this phrase: 私 は 私立大学 で 勉強しています。

私 わたし (reads watashi) is the English pronoun 'I'. A computer instantly knows
it because only the watashi reading/meaning can be followed by the Hiragana は.
That's something a 6-year-old Japanese child knows, and something you would
learn in your first months or less of Japanese studies. Conversely, a computer
knows instantly that the reading/meaning isn't watashi when it sees that the
second 私 is followed by another Kanji. The reading of this compound 私立 can
only be しりつ (reads shiritsu), out of a staggering total of 2 combined
readings, and that's only because 立 has two usable Chinese readings, りゅう and
りつ (the third would require the Kanji to stand alone), while 私 has only one,
し. Kanji usually have one Japanese reading and one or more Chinese readings,
governed by strict rules on which reading group has to be used. Coding it
isn't that much of a headache.

As soon as the computer realizes that the character following the first two
Kanji is a Kanji as well, the range of possible readings collapses. That's
also due to the fact that the first two Kanji already make a word, which means
'private' in English; as often happens with Kanji words longer than two
characters, they are composed of multiple words, just as some long words in
Western languages are. With the same approach the computer instantly finds the
reading of the two Kanji 大学, だいがく (reads daigaku), which means university,
another very common noun. I think you already get the gist of it. The last
word, 勉強しています, the computer instantly knows is a verb because of the
distinctive okurigana しています (reads shiteimasu), the present continuous of "to
do", and 勉強 is both extremely common and has a unique reading, べんきょう (reads
benkyou), which is the noun 'study'. "I'm studying at a private university":
even a machine translation would be accurate here.
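
The rule of thumb described above (a compound reading wins whenever the next character is also a Kanji, otherwise the standalone reading applies) can be sketched as a greedy longest-match lookup. This is only a toy under heavy assumptions: the `READINGS` table is hand-entered for this one sentence, the names are made up, and real text generally needs a full dictionary and a morphological analyzer rather than four entries.

```python
# Hand-entered readings for the words in the example sentence only.
READINGS = {
    "私": "わたし",      # standalone reading
    "私立": "しりつ",    # compound reading takes priority via longest match
    "大学": "だいがく",
    "勉強": "べんきょう",
}
MAX_LEN = max(len(word) for word in READINGS)


def to_reading(text):
    out = []
    i = 0
    while i < len(text):
        # Try the longest dictionary entry first, then shorter ones.
        for length in range(MAX_LEN, 0, -1):
            chunk = text[i:i + length]
            if chunk in READINGS:
                out.append(READINGS[chunk])
                i += len(chunk)
                break
        else:
            # Kana and punctuation already spell out their own reading.
            out.append(text[i])
            i += 1
    return "".join(out)


reading = to_reading("私は私立大学で勉強しています。")
```

For the example sentence this yields わたしはしりつだいがくでべんきょうしています。, with the first 私 read as わたし and the second, inside 私立, read as し.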

I think the point is that there is no use in sorting words written in the
three different Japanese alphabets together in the same list. Microsoft knows
this so well that it has yet to implement it. In your final thoughts you
completely fail to understand that you don't need to attack the problem by
"pronunciations". You only have to treat the Kanji with a different approach
and transliterate Hiragana/Katakana into romaji, which was already done a long
time ago. I hope at least you're going to quit using "pronunciation" in favor
of "reading" by the time you've finished reading this post. If ever.

