
Displaying Japanese and English Text on the Web - polm23
http://www.nobadmemories.com/blog/2017/04/better-together-displaying-japanese-and-english-text-on-the-web/
======
greglindahl
The advice about utf-8 encoding is a bit incomplete. While adding <meta
charset="utf-8" /> is a good idea, it's also important to make sure that the
webserver isn't sending an HTTP Content-Type header that declares a charset
other than utf-8.

In HTML5, the browser is supposed to not sniff the document for a meta charset
if the server headers specify a charset in the Content-Type.
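
A minimal sketch of that precedence, using only the Python standard library. The helper names (`header_charset`, `pick_encoding`) and the 1024-byte sniffing window are assumptions for illustration, and the sketch ignores byte-order marks, which a real browser also checks:

```python
import re
from email.message import Message

# Loose pattern for <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def pick_encoding(content_type, body, default="utf-8"):
    """Prefer the HTTP header; only look at the markup when it is silent."""
    if content_type:
        charset = header_charset(content_type)
        if charset:
            return charset
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


# The header wins even though the markup says utf-8.
print(pick_encoding("text/html; charset=Shift_JIS", b'<meta charset="utf-8">'))
# No charset in the header, so the meta tag is used.
print(pick_encoding("text/html", b'<meta charset="utf-8">'))
```

So a stray charset in the server configuration can override a correct meta tag, which is why both need to agree.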

~~~
kevin_thibedeau
I've been scraping some Japanese sites lately and this has been a minor
annoyance. Content-Type rarely has the encoding and Requests doesn't default
to UTF-8 so you get mojibake for EUC-JP and UTF-8 unless you intervene.
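
To illustrate the mojibake, here is a small stdlib-only sketch. The sample string and variable names are made up for the example, and it assumes the client fell back to ISO-8859-1 when no charset was declared. With Requests, the analogous intervention would be setting `response.encoding` before reading `response.text`.

```python
text = "日本語"

# Bytes as a Japanese site might actually send them.
euc_bytes = text.encode("euc_jp")
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong codec never raises for ISO-8859-1,
# it just silently produces garbage.
euc_wrong = euc_bytes.decode("iso-8859-1")
utf8_wrong = utf8_bytes.decode("iso-8859-1")

# Decoding with the codec the page really uses recovers the text.
euc_right = euc_bytes.decode("euc_jp")
utf8_right = utf8_bytes.decode("utf-8")

print(repr(euc_wrong), repr(utf8_wrong))
print(euc_right, utf8_right)
```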

------
zepearl
Am I the only one on Linux for whom Firefox displays any characters correctly
(e.g. in this case the Japanese chars, Arabic/Russian chars on some other
pages, etc.), but Chrome can't display them at all? (I've been scanning
websites for months while writing a prototype, and it's always been like this
whenever I did some random checks in the browsers.)

~~~
suspectdoubloon
I had these issues on Fedora, especially around emoji. Firefox was fine but
Chrome was not. I ended up having to play around with some font settings to
force Chrome to use Noto fonts.

~~~
unsignedint
The reason Firefox displays emoji while Chrome doesn't is probably because
Firefox actually ships with fonts for them.

------
yingw787
I remember in the book "Remote: Office Not Required" by Jason Fried and David
Heinemeier Hansson, they mentioned finding internationalization issues early on
because globally distributed remote teams naturally dogfood for international
audiences. This struck me as one of their key selling points for remote work.

~~~
ken
You also, of course, need a company that is receptive to this as a goal. I
usually run my computer in whatever language I'm trying to learn at the
moment, and I've had more than one company dismiss i18n issues I've discovered
as "well, we're not going to sell in _that_ country any time soon!" (i.e.,
"get back to making the demo look pretty").

------
Grue3
I had a lot of problems making Japanese text on ichi.moe display correctly.
I'm using lang="ja" instead of lang="ja-jp" though, it seems to work and is
shorter. The main problem with Japanese characters displayed in a Chinese font
(which happens when the lang attribute is not set explicitly) is that some
characters are barely recognizable as the same character. Compare 誤 in a
Chinese font and a Japanese font. [1] Yep, this is the same character
according to Unicode. Imagine if Latin letter g, Cyrillic г and Greek 𝛾 had
the same Unicode codepoint.

[1]
[https://en.wiktionary.org/wiki/%E8%AA%A4](https://en.wiktionary.org/wiki/%E8%AA%A4)
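
A quick way to check this is to decode the percent-encoded character from the Wiktionary URL and look at its codepoint. This is only a sketch using the Python standard library, and the variable names are arbitrary:

```python
import unicodedata
from urllib.parse import unquote

# The path segment of the Wiktionary link is the UTF-8 encoding of the character.
char = unquote("%E8%AA%A4")
codepoint = f"U+{ord(char):04X}"
name = unicodedata.name(char)
print(char, codepoint, name)

# By contrast, the look-alike Latin, Cyrillic and Greek letters
# each have their own codepoint.
lookalikes = {c: f"U+{ord(c):04X}" for c in "gгγ"}
print(lookalikes)
```

There is only one codepoint for the character, so the choice between the Chinese and Japanese glyph has to come from somewhere else, such as the lang attribute and the fonts available.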

~~~
bgee
Can you explain more about the 誤 example you provided?

When you say it's barely recognizable, do you mean simplified vs traditional?
Because to me (as a native Chinese speaker) the Japanese and traditional look
almost identical. I can't comment on traditional vs simplified because I can
read both.

If it's simplified vs traditional, I wonder why OS/browser prefers to render
the character in simplified form (I assume the Chinese font you are using has
both styles).

~~~
fenomas
> the Japanese and traditional look almost identical

They have different radicals on the lower right. Unless there's some reason to
consider the two radicals equivalent (there isn't in JP, I wouldn't know about
CN) they're different characters.

~~~
titanix2
Still, it’s the same character. Enjoy the 16 other variants:
[http://dict.variants.moe.edu.tw/variants/rbt/word_attribute....](http://dict.variants.moe.edu.tw/variants/rbt/word_attribute.rbt?quote_code=QTAzODM4LTAxMA)

~~~
fenomas
They are considered variants of each other for linguistic reasons, but that's
not what GP was talking about. GP suggested the characters _look_ almost
identical, and I'm pointing out that they clearly don't.

------
mrob
Unicode's Han unification makes this more difficult than it needs to be. Now
that Unicode has more than 16 bits worth of characters anyway, it looks very
much like a mistake.

[https://en.wikipedia.org/wiki/Han_unification](https://en.wikipedia.org/wiki/Han_unification)

~~~
jessaustin
One can kind of understand why they did it. A lot of glyphs _are_ the same
from one language to another. If the codepoints for the Chinese "大", the
Japanese "大", and the Korean "大" were all different from one another,
that would generate some complaints. OTOH, a Unicode that includes _both_ "学"
and "學" certainly has enough room for several versions of "道". I suspect this
will eventually be fixed, but as a series of one-off additional codepoints,
not as a giant duplication for each of C, J, K, and V. Some of the problem
seems to have been that while the Chinese Unicode experts realized they wanted
to write both "学" and "學" in the same document, the Korean experts never
considered the possibility of including Chinese or Japanese text in a
primarily Korean document. I think by this point they've realized it, though,
so it will be fixed eventually.

~~~
kevin_thibedeau
The consortium always expected implementations to use higher level markup to
specify language specific variants.

~~~
mrob
That's impossible in some cases. How am I supposed to list a directory with a
mix of Chinese and Japanese filenames? Most filesystems don't have a filename
language attribute.

------
kijin
Noto fonts are fantastic for multilingual documents, but beware that the CJK
versions weigh multiple megabytes each. Loading them as a webfont will eat up
a lot of data, not to mention cause a noticeable delay. This is unavoidable
when there are 10,000+ characters to encode.

My Korean company website loads a subset of Noto Sans, but uses the system
default sans-serif if it is accessed with a mobile device. Fortunately most
Koreans don't use Hanja (Kanji/Hanzi) anymore, so visual consistency is not an
issue.

------
priansh
Interestingly, Google very quietly shuttered the Google Translate for web
plugin that made it possible to autotranslate sites.

The Web Speech API already exists for Speech Recognition and Speech Synthesis,
so hopefully they add a translation API directly to Chrome. Can't call it an
_inter_net if it doesn't easily support multiple languages!

~~~
tokyodude
You need a plugin? I thought you just paste the URL into translate.google.com

[https://translate.google.com/translate?hl=&sl=ja&tl=en&u=htt...](https://translate.google.com/translate?hl=&sl=ja&tl=en&u=https%3A%2F%2Fwww.yahoo.co.jp%2F)

Seems pretty straightforward to make a plugin that generates those URLs if
you want one.

~~~
priansh
It's not "allowed", i.e. if I want to have GTranslate by default on my site
regardless of what browser someone is on, I can't do that anymore.

Their translator can also be accessed via API, but again not "allowed." It's
not a huge issue since they don't do much other than try to 503 you, but it's
still "illegal."

------
jpatokal
> Spoilers: Google has an interesting solution in the works.

This appears to be a reference to Google's universal Noto font set, released
in 2014:

[https://www.google.com/get/noto/](https://www.google.com/get/noto/)

------
fouc
Mixed-language webpages are an interesting problem I hadn't thought about.

~~~
reaperducer
It's a fascinating problem, if you're into that kind of problem solving.

I recently had to build a large site that was English, Spanish, and Chinese.
Which was fun considering that some of the audience was es-mx, some was es-es,
and some were es-419.

------
sebazzz
Does all this apply to Mandarin / Taiwanese as well?

------
reaperducer
_For PCs, cursive falls back to Comic Sans_

Eep!

~~~
jandrese
That has to be awesome when viewing historical documents. On the other hand,
kids will have a much easier time reading the Constitution in Comic Sans
instead of the original cursive.

