
How to tokenize Japanese in Python - polm23
https://www.dampfkraft.com/nlp/how-to-tokenize-japanese.html
======
kanobo
If you're interested in this, you should take a look at Chrome's break engine
source code for Japanese/Chinese:
[https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ea...](https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ead19613bda84621794b72/icu4j/main/classes/core/src/com/ibm/icu/text/CjkBreakEngine.java)
[https://chromium.googlesource.com/v8/v8/+/refs/heads/master/...](https://chromium.googlesource.com/v8/v8/+/refs/heads/master/src/objects/js-break-iterator.cc#75)

~~~
polm23
So that works on similar principles to MeCab but it's more limited and
performance is... not very good. I wrote about some of the issues with it
recently here.

[https://github.com/polm/fugashi/issues/23](https://github.com/polm/fugashi/issues/23)

(Also, if you learned about Chrome's Japanese tokenization on HN, that was me
too!)

------
yomly
For someone in the know, how different are the tokenization strategies for
Japanese and Chinese? The grammar and sentence structure are different, and
Chinese does not use kana or qualifiers for things like verbs in the same way,
but both have compound words and text without spaces.

Do languages like German, which are rife with compound words, have any overlap
in tokenization, or do you end up with completely different strategies?

~~~
peterburkimsher
My strategy for tokenising Chinese is brute-force: read 5 characters ahead and
ask, is that a word? No -> try 4 characters, no -> 3 characters, no -> 2
characters -> yes, copy it to the output, move 2 characters ahead, and read
the next 5, 4, 3, 2. It's very slow.

This is how humans do it though.
[https://pingtype.github.io/docs/advanced.html#spacing](https://pingtype.github.io/docs/advanced.html#spacing)
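
In code, the idea is roughly this minimal sketch (the dictionary is a toy
stand-in and the function name is just illustrative, not Pingtype's actual
code):

    MAX_WORD_LEN = 5  # the same look-ahead cutoff described above

    def longest_match_segment(text, dictionary, max_len=MAX_WORD_LEN):
        """Take the longest dictionary word starting at the current position,
        falling back to a single character when nothing matches."""
        tokens = []
        i = 0
        while i < len(text):
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in dictionary:
                    tokens.append(candidate)
                    i += length
                    break
        return tokens

    words = {"我們", "喜歡", "吃", "蘋果"}
    print(longest_match_segment("我們喜歡吃蘋果", words))
    # ['我們', '喜歡', '吃', '蘋果']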

If you have some Chinese text that you want to read, throw it into Pingtype
(my program) and it should help you to pick up the vocabulary!
[https://pingtype.github.io/](https://pingtype.github.io/)

~~~
gruez
>read 5 characters ahead

Alright, programmer question: why was the magic number 5? Are there no Chinese
words that are longer than 5 characters? Or does 5 characters cover 99.999% of
all possible cases, and is therefore "good enough"?

~~~
peterburkimsher
It's the maximum limit for names on Ruten, a shopping site in Taiwan. That's
also the reason I used the name 尸三十三尺 for deliveries for a while! ASCII
characters weren't allowed.

There are some words that are longer (e.g. Mesopotamia) but in practice those
are basically all proper nouns.

------
contravariant
It seems a bit weird that the example somehow doesn't consider 'fugashi' a
compound noun and claims it's 2 nouns, 'fu kashi'.

Maybe that's intended, but then this is apparently pointless if you want to
know how things are pronounced.

~~~
Twisol
The key aspect not captured here is "rendaku" -- the voicing of a normally
unvoiced consonant in a compound word (it's why fu + kashi comes out as
"fugashi").

[https://en.wikipedia.org/wiki/Rendaku](https://en.wikipedia.org/wiki/Rendaku)

~~~
9nGQluzmnq3M
But rendaku rules are extremely complicated and often unpredictable, e.g. 中島
can be Nakashima or Nakajima depending on how the person chooses to pronounce
it.

------
sova
There have been some recent attempts at creating EBNF grammar definitions for
Japanese that try to balance simplicity and coverage:
[https://learn-japanese.org/2020/01/04/japanese-grammar-in-eb...](https://learn-japanese.org/2020/01/04/japanese-grammar-in-ebnf-notation/)

~~~
yorwba
If I'm not mistaken, that's a post by yourself, so you could just say "I
recently attempted..."

------
Danieru
The moment I saw the title, I thought of you, Paul (polm). No doubt you are
the world's expert on tokenizing Japanese.

------
sova
Also cool:
[https://www.aclweb.org/anthology/Y03-1034.pdf](https://www.aclweb.org/anthology/Y03-1034.pdf)
(parser from 2003 with 97% coverage on native text)

------
ezoe
Be careful what you do with a Japanese tokenizer. Just because you have gained
the ability (or I'd rather call it a probability, since it's not perfect) to
tokenize Japanese doesn't mean you can process Japanese like the languages
you're familiar with. Please, don't do clever things. It's better to just
leave a Japanese string as a binary blob and treat it as is, passing it
through left to right without any modification.

It is my observation that programmers whose native languages are Indo-European
tend to have the delusion that their native language is so fundamental that
any language processing they deem necessary must be applied to any and all
languages in the world.

As a result, the locale library in most programming languages tends to have
features that are unfit for Japanese, and they break Japanese. Yes, the mere
presence of the locale library breaks Japanese language support.

These locale libraries modify the string bytes and break the Japanese
encoding. You can't easily implement a no-op locale for Japanese, because
these modifications are often caused by the fundamental, common parts of the
library rather than the language-specific parts.
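
As a hypothetical illustration of that kind of byte-level damage (not any
particular library's code, just the general failure mode): naive byte-wise
ASCII lowercasing applied to Shift-JIS text silently corrupts any character
whose second byte falls in the A-Z range.

    # Toy Python example: in Shift-JIS, "アメリカ" contains the bytes 0x41 ('A')
    # and 0x4A ('J') as the second bytes of multibyte characters, so a naive
    # byte-wise tolower() rewrites them and changes the characters entirely.
    text = "アメリカ"
    sjis = text.encode("shift_jis")
    mangled = bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in sjis)
    print(mangled.decode("shift_jis"))  # mojibake ("ヂメリニ"), not "アメリカ"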

People have good intentions. They want to support multiple languages. But it's
a delusion to think that passing the string bytes to a locale library
magically solves the problem. The locale libraries are often badly written for
Japanese, so they break Japanese. I'd like to call such code locale-tainted.

Since the locale library is part of the standard library of the programming
language, and a lot of software relies on it, we can't just avoid using it
without a massive refactoring of the source code, which is impossible given
the resources available for localization. So if we, the unfortunate Japanese
programmers, get the job of localizing existing locale-tainted software to
Japanese, we have to come up with workarounds.

Modifying the source code to remove the locale library usage is the best
option, but it is often not possible. Hooking the locale library and replacing
it with a Japanese-safe version is a dirty workaround, but it requires
reimplementing most of the text manipulation, which costs many man-months and
rigorous testing; imagine implementing a printf/regex library to replace the
standard library that satisfies every use case in the wild. The worst
workaround is to let the stupid locale library do whatever stupid,
Japanese-breaking byte transformations it damn well wants, and then
reverse-transform the string after it passes through the locale library to fix
the mess. A mess that was unnecessary in the first place.

I was involved in the C++ Standards Committee, but the people there are so
delusional that they think it's a good idea for a new standard library to
depend on the broken locale library because it helps localization. Those
delusional people are also designing the Unicode library now. They won't fart
out a good library under such a delusion. I have lost all hope for this
industry.

So please. Don't introduce stupid, unnecessary language processing for a
language you don't understand. Make our job less horrible.

~~~
yorwba
Tokenizing strings is usually not something that happens unnecessarily; it's
required to implement e.g. keyword search, so it's not something people can
just avoid doing.
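
For instance, a minimal sketch of index-time tokenization with fugashi (the
tokenizer from the article), assuming a UniDic dictionary such as unidic-lite
is installed:

    from fugashi import Tagger

    tagger = Tagger()  # picks up the installed UniDic dictionary
    doc = "麩菓子は、麩を主材料とした日本の菓子。"

    # Index the surface form of each token so a keyword search can match
    # whole words instead of raw substrings.
    tokens = [word.surface for word in tagger(doc)]
    print(tokens)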

> These locale libraries modify the string bytes and break the Japanese
> encoding. You can't easily implement a no-op locale for Japanese, because
> these modifications are often caused by the fundamental, common parts of the
> library rather than the language-specific parts.

Can you give an example of a situation where that happens? Is it an issue with
e.g. Shift-JIS encoding vs. Unicode?

