
How does Chrome decide what to highlight when you double-click Japanese text? - polm23
https://stackoverflow.com/questions/61672829/how-does-chrome-decide-what-to-highlight-when-you-double-click-japanese-text/61673028#61673028
======
JonathonW
ICU (International Components for Unicode) provides an API for this:
[http://userguide.icu-project.org/boundaryanalysis](http://userguide.icu-project.org/boundaryanalysis)

Assuming Blink is using the same technique for text selection as V8 is for the
public Intl.v8BreakIterator method, that's how Chrome's handling this--
Intl.v8BreakIterator is a pretty thin wrapper around the ICU BreakIterator
implementation:
[https://chromium.googlesource.com/v8/v8/+/refs/heads/master/...](https://chromium.googlesource.com/v8/v8/+/refs/heads/master/src/objects/js-break-iterator.cc#75)

~~~
erjiang
According to your first link, BreakIterator uses a dictionary for several
languages, including Japanese. So I guess the full answer is something like:

Chrome uses v8's Intl.v8BreakIterator which uses icu::BreakIterator, which,
for Japanese text, uses a big long list of Japanese words to try to figure out
what is a word and what isn't. I've worked on a similar segmenter for Chinese
and yeah, quality isn't great but it works in enough cases to be useful.
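
To make the "big long list of words" idea concrete, here is a minimal sketch of dictionary-based segmentation using greedy forward longest-match. This is not ICU's actual algorithm, and the dictionary, `MAX_WORD_LENGTH`, and `segmentLongestMatch` are hypothetical names made up for illustration; a real segmenter would load a far larger word list.

```javascript
// Toy dictionary (an assumption for this sketch, not ICU's data).
const DICTIONARY = new Set(["東京", "東京都", "京都", "に", "住む"]);
const MAX_WORD_LENGTH = 3; // longest entry in the toy dictionary

// Greedy forward longest-match: at each position take the longest
// dictionary entry that fits, falling back to a single character.
function segmentLongestMatch(text, dictionary = DICTIONARY, maxLen = MAX_WORD_LENGTH) {
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  const words = [];
  let i = 0;
  while (i < chars.length) {
    let matchedLength = 1; // fallback: emit one character
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      if (dictionary.has(chars.slice(i, i + len).join(""))) {
        matchedLength = len;
        break;
      }
    }
    words.push(chars.slice(i, i + matchedLength).join(""));
    i += matchedLength;
  }
  return words;
}

console.log(segmentLongestMatch("東京都に住む")); // [ '東京都', 'に', '住む' ]
```

Greedy matching is simple but commits to the first long match it sees, which is one reason the quality of this kind of approach tends to be uneven.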

~~~
peterburkimsher
What did you use for your Chinese word segmentation? I wrote Pingtype myself
to help me learn by splitting words and translating them in parallel.

[https://pingtype.github.io](https://pingtype.github.io)

------
trnglina
Firefox, in contrast, breaks at script boundaries, so it'll select runs of
Hiragana, Katakana, and Kanji. Not nearly as useful, and definitely makes
copying Japanese text especially annoying.
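
The script-boundary behavior described here can be approximated with Unicode property escapes. This is only a sketch of the idea, not Firefox's code; `SCRIPT_RUNS` and `splitByScript` are hypothetical names. It uses `Script_Extensions` rather than `Script` so that the prolonged sound mark ー stays attached to a kana run.

```javascript
// One alternative per script, plus a catch-all for everything else.
const SCRIPT_RUNS = new RegExp(
  "\\p{Script_Extensions=Hiragana}+" +
    "|\\p{Script_Extensions=Katakana}+" +
    "|\\p{Script_Extensions=Han}+" +
    "|[^\\p{Script_Extensions=Hiragana}\\p{Script_Extensions=Katakana}\\p{Script_Extensions=Han}]+",
  "gu"
);

// Split text into maximal runs of a single script.
function splitByScript(text) {
  return text.match(SCRIPT_RUNS) ?? [];
}

console.log(splitByScript("東京タワーに行きます"));
// [ '東京', 'タワー', 'に', '行', 'きます' ]
```

Note how the verb 行きます is cut into 行 and きます: the kanji stem and its hiragana ending belong to one word, but they are different scripts.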

~~~
knolax
Personally I find double click highlighting to be a useless feature in any
language, but the Firefox approach is superior imo. Breaking at script
boundaries is predictable behavior the user can anticipate whereas doing some
janky ad hoc natural language processing invariably results in behavior that
is essentially random from a user perspective. I tried out double click
highlighting on some basic Japanese sentences on Chromium and it failed to
highlight any of what would be considered words.

It's not like English highlighting does complex grammatical analysis to make
sure "Project Manager" gets highlighted as one chunk and "eventUpdate" gets
highlighted as two chunks; most implementations just break at spaces like the
user expects.

~~~
microcolonel
I feel like this is a conclusion you could only reach by having an irrational
compulsion to defend the deficiencies of Firefox, and not being a regular user
of the Japanese language.

~~~
Wowfunhappy
I don't use/speak Japanese, but the GP's conclusion makes complete sense to
me.

Consistent behavior > inconsistent behavior, almost always. If I can
anticipate what my computer is going to do, I can plan around it, even if it
means a bit of extra manual work.

~~~
BigJono
That doesn't work if the consistent behaviour is completely useless. If double
click selected English tokens grouped by which half of the alphabet they came
from, nobody would ever double click anything, they'd just click-drag
highlight.

~~~
Wowfunhappy
Are spaces completely meaningless in Japanese? I was under the impression they
separated phrases.

~~~
microcolonel
Spaces have zero importance to the Japanese language itself, but they are
occasionally used like punctuation. e.g. some YouTube video titles will use
spaces around names of things that could be hard to parse as not part of a
sentence, another example is when you're typesetting phrases in lyrics.

In general, whitespace characters have no place or significance inside a
Japanese sentence, and most of the whitespace in Japanese typesetting is built
into punctuation marks.

Furthermore, even most punctuation is optional in Japanese. The full stop 。
and comma 、 are mostly a matter of preference, sometimes spaces are used in
place of full stops or commas.

------
polm23
OP here, surprised to see this took off.

I actually work with Japanese tokenization a lot - I took over maintenance of
the most popular Python Mecab wrapper last year, and I have another Cython-
based wrapper that I maintain.

Word boundary awareness for Japanese is a pretty uncommon feature in
applications, so I was surprised to see the feature had been in Chrome all
along, even if it's buried and the quality has issues.

Anyway, thanks to everyone who tracked down the ICU implementation and the
relevant part of Chrome!

------
LikeAnElephant
This is often determined by Unicode and not the browsers specifically (though
some browsers could override the suggested Unicode approach).

Each Unicode character has certain properties, one of which is whether that
character indicates a break before / after itself.

I've done extensive research on this for my job, but unfortunately don't have
time to do the whole writeup here. Here are several resources for those who
are interested:

Info on break opportunities:

[https://unicode.org/reports/tr14/#BreakOpportunities](https://unicode.org/reports/tr14/#BreakOpportunities)

The entire Unicode Character Database (~80MB XML file last I checked)

[https://unicode.org/reports/tr44/](https://unicode.org/reports/tr44/)

The properties within the UCD are hard to parse, here's a reference if you're
interested:

[https://unicode.org/reports/tr14/#Table1](https://unicode.org/reports/tr14/#Table1)

[https://www.unicode.org/Public/5.2.0/ucd/PropertyAliases.txt](https://www.unicode.org/Public/5.2.0/ucd/PropertyAliases.txt)

[https://www.unicode.org/Public/5.2.0/ucd/PropertyValueAliase...](https://www.unicode.org/Public/5.2.0/ucd/PropertyValueAliases.txt)

Overall, word / line breaking in Unicode in no-space languages is a very
difficult problem. Where the UCD says there can be a line break isn't where a
native speaker would put one. In order to do it correctly you have to bring in
Natural Language Processing, but that has its own set of complexities.

In summary: I18N is hard!

~~~
swang
Yes. This seems to work even when you pass it Chinese while maintaining ja-JP
as the language

    
    
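      // Split text into words with V8's break iterator.
      // Intl.v8BreakIterator is non-standard and only exists in V8-based runtimes.
      function tokenizeJA(text) {
        var it = new Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
        it.adoptText(text)
        var words = []

        var cur = 0, prev = 0

        // Each call to next() advances to the following word boundary.
        while (cur < text.length) {
          prev = cur
          cur = it.next()
          words.push(text.substring(prev, cur))
        }

        return words
      }

      console.log(tokenizeJA("今天要去哪裡?"))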
    

It still seems to parse just fine, so it is most likely segmenting based on
the input text itself rather than on the locale passed in.
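
For comparison, the standardized counterpart is `Intl.Segmenter`, which is available in current browsers and recent Node.js versions. A minimal sketch follows; `tokenize` is just an illustrative name, and the exact tokens depend on the ICU data shipped with the runtime, so no output is shown.

```javascript
// Segment text into words with the standard Intl.Segmenter API.
function tokenize(text, locale = "ja-JP") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each segment record carries the substring in its `segment` property.
  return Array.from(segmenter.segment(text), (record) => record.segment);
}

console.log(tokenize("今天要去哪裡?"));
```

With word granularity, each record also has an `isWordLike` flag, which can be used to filter out punctuation and whitespace.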

~~~
yorwba
The underlying library is actually using a single dictionary for both Chinese
and Japanese:
[https://github.com/unicode-org/icu/tree/7814980f51bca2000a96...](https://github.com/unicode-org/icu/tree/7814980f51bca2000a963307cb5c4d711cc05fdb/icu4c/source/data/brkitr/dictionaries)

------
darkerside
> The quality is not amazing but I'm surprised this is supported at all.

I find this line hilarious for some reason. Reminds me of the line about being
a tourist in France, "French people don't expect you to speak French, but they
appreciate it when you try"

~~~
curiousgal
Being a resident of France I have to say that French people are the opposite
of that. Same for Japanese people. The only people who get genuinely excited
as you butcher their language are Arabic speakers I've noticed.

~~~
toast0
As a tourist in Paris about 10 years ago, I felt I had better interactions
when I butchered the language and then switched to English after the person I
was speaking with did, than when I just started off in English.

I only knew a handful of phrases though, so anything off script and I was
pretty lost.

------
dlivingston
Here’s a brief write-up [0] on techniques and software for Japanese
tokenization.

[0]: [http://www.solutions.asia/2016/10/japanese-tokenization.html...](http://www.solutions.asia/2016/10/japanese-tokenization.html?m=1)

------
simplicio
I've been doing some work parsing Vietnamese text, which has the opposite
problem. Compound words (which are most of the vocabulary) are broken up into
their components by spaces, indistinguishable from the boundaries between
words.

~~~
enos
Is that why the name of the country is sometimes spelled with a space, "Viet
nam"?

~~~
freddie_mercury
Yes, that's how it is written in Vietnamese. To oversimplify: Vietnamese words
are a collection of single syllables that are always separated by a space when
writing.

"Viet Nam" is also, actually, the "official" English way to write it. (Check
how the UN puts it on all their stuff.) However, most Europeans don't do that
in their languages, so it usually gets written as Vietnam even by Vietnamese
when they're writing European languages.

------
oh_sigh
Given this property of Japanese text, is there wordplay associated with a
string of characters with double/reverse meanings depending on how the
characters are combined?

~~~
needle0
Yes. One that comes to mind is "[name]さんじゅうななさい" which can be either
interpreted as "[name]-san, 17 years old" or "[name], 37 years old", depending
on whether you interpret the さん(san) as an honorific or part of a number. (The
sentence would usually be written in a combination of hiragana and kanji, but
is intentionally written in all hiragana here to ensure ambiguity.)

~~~
needle0
Another one: "この先生きのこるには", which should be broken up at "この先/生きのこるには" to mean
"To keep surviving going forward", but since 先生 (teacher) is such a common
word, it jumps at your eyes and turns the sentence into "この先生/きのこるには" which
means the nonsense "For this teacher to mushroom(verb)". Usually this doesn't
happen because the "survive" part is written with more kanji as 生き残る, but here
it is written in hiragana to make the きのこ(mushroom) part visible and further
mess with lexing.

In both cases, some liberties have been taken with notation to intentionally
encourage silly misreadings; it happens much less often in ordinary text.
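
The ambiguity in the mushroom example can be shown mechanically: given a word list, more than one segmentation covers the whole string. Below is a small sketch that enumerates every valid split; the word list is a toy assumption that includes the nonsense verb きのこる, and `allSegmentations` is a hypothetical helper.

```javascript
// Toy word list for this one sentence (an assumption, not a real lexicon).
const WORDS = new Set(["この", "先", "先生", "生きのこる", "きのこる", "には"]);

// Return every way to split `text` entirely into dictionary words.
function allSegmentations(text, dictionary = WORDS) {
  if (text.length === 0) return [[]]; // one way to split nothing: no words
  const results = [];
  for (let end = 1; end <= text.length; end++) {
    const head = text.slice(0, end);
    if (!dictionary.has(head)) continue;
    for (const rest of allSegmentations(text.slice(end), dictionary)) {
      results.push([head, ...rest]);
    }
  }
  return results;
}

for (const words of allSegmentations("この先生きのこるには")) {
  console.log(words.join("/"));
}
// この/先/生きのこる/には
// この/先生/きのこる/には
```

A dictionary alone cannot say which split is intended; a segmenter needs word frequencies or context to prefer one reading over the other.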

------
AlchemistCamp
Word segmentation in Japanese isn't that difficult, for the most part. The
mixed scripts help quite a bit.

The really interesting question is how does Chrome decide what to highlight
when you double-click Thai text? It's a non-breaking, (baroquely) phonetic
script and the training set is much smaller.

------
peter303
I wonder if this is applicable to Chinese. In Chinese one to four characters
comprise a word. In post-revolutionary Chinese all the characters in a
sentence are run together and you have to mentally parse the words in your
mind. (It's worse in pre-revolutionary Chinese where neither sentences nor
paragraphs are punctuated.) (Pre-medieval European languages used to run all
their letters together without word or sentence breaks. Torah Hebrew still
does this.)

~~~
peterburkimsher
Yes, I wrote [https://pingtype.github.io](https://pingtype.github.io) for
Chinese word segmentation to help me learn.

------
1024core
So what is the most accurate way to tokenize CJK text?

~~~
edflsafoiewq
By hand using a native speaker.

~~~
tobyhinloopen
Is there an API for that?

~~~
cferr
Yes, but it hasn't been written yet.

~~~
edjroot
You joke, but...

[https://blog.ycombinator.com/scale/](https://blog.ycombinator.com/scale/)

------
wikibob
Safari seems to do this too.

------
dirtydroog
TIL: Some languages do not have spaces

~~~
SenHeng
Fun fact, Japanese originally didn’t have punctuation. It was a European
introduction.

[https://www.tofugu.com/japanese/japanese-punctuation/](https://www.tofugu.com/japanese/japanese-punctuation/)

------
emilfihlman
From a very quick cursory look, Japanese seems to have some sort of commas and
periods. Shouldn't it be just simple to adopt spaces, (and proper commas and
periods and parentheses, which they actually use already) too?

~~~
Freak_NL
You want to change a written language for the express purpose of enabling
computers to be able to detect word boundaries easier?

~~~
bluquark
Note that Japanese and Chinese already changed to commonly use horizontal
left-to-right text (in addition to vertical top-to-bottom/right-to-left text,
which is still usual especially in "proper published typesetting") in large
part because computers handled that much better for decades.

~~~
cooper12
It's not so much that they handled it better; rather, that _was_
how they handled it. Even to this day vertical text layout in CSS is
primitive.

Early Japanese on computers also used half-width kana but now it's all
properly fullwidth.

Don't mistake technical limitations for choice.

~~~
BigJono
I wonder if there's any Japanese sites out there that fully use the
traditional style? Text top to bottom, right to left, and with the horizontal
axis unbound (to the left) instead of the vertical.

~~~
needle0
I'm sure there are some extremely design-heavy sites that do it, but by then
you would be fighting the DOM/CSS pretty hard, as default behaviors for stuff
like underline position and 縦中横 (short bursts of horiz text within vertical
text, for things like two-digit numbers) seem to be sketchy at best, and many
things still don't seem to be uniform across browsers. UX would be affected as
well, as things like text selection or right-clicking would feel quite awkward
as it is so exceedingly rare. While us Japanese are fully capable of reading
both horizontally+LtR and vertically+RtL, it's not in our expectations for the
latter to appear in online web text.

------
knolax
Double-click highlighting barely makes any sense in English, let alone in a
language that doesn't use spaces. The fact that mobile browsers treat long
presses as a double press for URLs has been the bane of my existence. I doubt
any native Japanese speakers use this feature.

