Displaying Japanese and English Text on the Web

wumpus · on Feb 8, 2019

The advice about UTF-8 encoding is a bit incomplete. While adding <meta charset="utf-8" /> is a good idea, it's also important to make sure that the web server isn't sending an HTTP Content-Type header that declares a charset other than UTF-8.

In HTML5, the browser is not supposed to sniff the document for a meta charset if the server's Content-Type header specifies a charset.
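
A minimal Python sketch of that precedence, for anyone who wants to check their own pages. The function names, the 1024-byte scan window and the "utf-8" default are assumptions made for illustration; it ignores byte-order marks and the rest of a real browser's sniffing algorithm.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased charset, or None if absent


# Very rough pattern for a meta charset declaration (hypothetical helper,
# not a full HTML parser).
META_CHARSET = re.compile(
    rb'<meta[^>]*charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_meta(body, scan_bytes=1024):
    """Look for a meta charset near the top of the raw document bytes."""
    match = META_CHARSET.search(body[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, body, default="utf-8"):
    # The header wins; the meta tag is only consulted when the header is silent.
    # The default is an assumption of this sketch, not browser behaviour.
    return charset_from_header(content_type) or charset_from_meta(body) or default
```

So a page that says <meta charset="utf-8" /> but is served with "text/html; charset=EUC-JP" would be treated as EUC-JP here, which is the trap described above.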

kevin_thibedeau · on Feb 8, 2019

I've been scraping some Japanese sites lately and this has been a minor annoyance. Content-Type rarely has the encoding and Requests doesn't default to UTF-8 so you get mojibake for EUC-JP and UTF-8 unless you intervene.
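
A rough Python sketch of that kind of intervention, using only the standard library. The candidate list and the function name are assumptions, and trying codecs in order is a heuristic that can misfire on short inputs. With Requests, the equivalent step would be assigning to response.encoding before reading response.text.

```python
def decode_with_fallback(raw, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Try each candidate codec strictly; return (text, codec) for the first fit."""
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # Last resort: mark undecodable bytes instead of silently guessing.
    return raw.decode(candidates[0], errors="replace"), candidates[0]


sample = "日本語のテキスト"

# Mojibake: UTF-8 bytes read as ISO-8859-1 decode without error, but into garbage.
garbled = sample.encode("utf-8").decode("iso-8859-1")

utf8_text, utf8_codec = decode_with_fallback(sample.encode("utf-8"))
eucjp_text, eucjp_codec = decode_with_fallback(sample.encode("euc_jp"))
```

Strict decoding matters here: ISO-8859-1 accepts every byte sequence, so a Latin-1 default never raises and the mojibake goes unnoticed.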

zepearl · on Feb 8, 2019

Am I the only one on Linux for whom Firefox displays any characters correctly (e.g. in this case the Japanese characters, Arabic/Russian characters on some other pages, etc.), while Chrome can't display them at all? (I've been scanning websites for months while writing a prototype program, and it's always been like this whenever I did some random checks in the browsers.)

suspectdoubloon · on Feb 9, 2019

I had these issues on Fedora, especially around emoji. Firefox was fine but Chrome was not. I ended up having to play around with some font settings to force Chrome to use Noto fonts.

unsignedint · on Feb 11, 2019

The reason Firefox displays emoji while Chrome doesn't is probably because Firefox actually ships with fonts for them.

microcolonel · on Feb 8, 2019

Yeah, you should probably report that somewhere, probably to your distribution maintainer (assuming you're talking about a distro Chromium build, and not Google's Chrome blob).

Google distributes Chrome on Linux on tens of millions of devices every year; it displays fonts just fine for almost everyone.

zepearl · on Feb 8, 2019

Thx - but yes, it's really Chrome and Gentoo (my distro) downloads of course Chrome's blob => weird that I have this problem, isn't it?

microcolonel · on Feb 8, 2019

> Thx - but yes, it's really Chrome and Gentoo (my distro) downloads of course Chrome's blob => weird that I have this problem, isn't it?

Chances are that it's not that weird. If you have enough memory I recommend building Chromium from source. There's probably some library somewhere which is behaving in a way which packaged Chrome isn't fond of (and it seems like maybe the maintainer of the chrome binary package you're using needs to update it in some way).

(Aside: funny enough, Chrome OS is Gentoo-based IIRC.)

zerocrates · on Feb 8, 2019

I haven't had that problem, on the same distro even! I do remember them often differing in what fonts they would select by default, and Chrome often seemed to pick a worse-looking option... but I never really had any problems with the text simply not rendering.

yingw787 · on Feb 8, 2019

I remember in the book "Remote: Office Not Required" by Jason Fried and David Heinemeier Hansson, they mentioned finding internationalization issues early on because globally distributed remote teams naturally dogfood for international audiences. This struck me as one of their key selling points for remote work.

ken · on Feb 9, 2019

You also, of course, need a company that is receptive to this as a goal. I usually run my computer in whatever language I'm trying to learn at the moment, and I've had more than one company dismiss i18n issues I've discovered as "well, we're not going to sell in that country any time soon!" (i.e., "get back to making the demo look pretty").

Grue3 · on Feb 8, 2019

I had a lot of problems making Japanese text on ichi.moe display correctly. I'm using lang="ja" instead of lang="ja-jp" though; it seems to work and is shorter. The main problem with Japanese characters displayed in a Chinese font (which happens when the lang attribute is not set explicitly) is that some characters are barely recognizable as the same character. Compare 誤 in a Chinese font and a Japanese font. [1] Yep, this is the same character according to Unicode. Imagine if Latin letter g, Cyrillic г and Greek γ had the same Unicode codepoint.

[1] https://en.wiktionary.org/wiki/%E8%AA%A4
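
To make the comparison concrete, here is a small Python sketch using the standard unicodedata module (the describe helper is just an illustrative name). It shows that the three look-alike letters get separate code points, while 誤 is a single code point whatever the language of the surrounding text.

```python
import unicodedata


def describe(ch):
    """Format a character as its code point plus its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin g, Cyrillic г and Greek γ: three scripts, three code points.
lookalikes = {ch: describe(ch) for ch in "gгγ"}

# 誤: one code point, shared by Chinese and Japanese text.
go = describe("誤")
```

Since the string itself carries no language information, the choice between the Chinese and Japanese glyph shapes is left to the font, which is where the lang attribute comes in.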

bgee · on Feb 8, 2019

Can you explain more about the 誤 example you provided?

When you say it's barely recognizable, do you mean simplified vs traditional? Because to me (as a native Chinese speaker) the Japanese and traditional look almost identical. I can't comment on traditional vs simplified because I can read both.

If it's simplified vs traditional, I wonder why the OS/browser prefers to render the character in simplified form (I assume the Chinese font you are using has both styles).

Grue3 · on Feb 9, 2019

For reference this is what I see in my browser: https://imgur.com/a/F3pFzCm

fenomas · on Feb 9, 2019

> the Japanese and traditional look almost identical

They have different radicals on the lower right. Unless there's some reason to consider the two radicals equivalent (there isn't in JP, I wouldn't know about CN) they're different characters.

titanix2 · on Feb 9, 2019

Still, it's the same character. Enjoy the 16 other variants: http://dict.variants.moe.edu.tw/variants/rbt/word_attribute....

fenomas · on Feb 9, 2019

They are considered variants of each other for linguistic reasons, but that's not what GP was talking about. GP suggested the characters look almost identical, and I'm pointing out that they clearly don't.

fireattack · on Feb 8, 2019

lang="ja" absolutely should work [1], so I have no idea why it doesn't on your website.

Maybe it's related to the font you specified, which may directly or indirectly (via fallback) cause the problem. After all, `font-family` overrides language (which essentially just helps to pick the right font(s)).

If you don't mind providing a test page, I can help debug.

[1]: https://en.wikipedia.org/wiki/User:Fireattack/sandbox

Grue3 · on Feb 9, 2019

You misunderstood me, it does work on my website, I just had to do a lot of work to insert this attribute in all the right places where Japanese text can be displayed.
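
One way that chore might be automated is sketched below in Python. This is not how ichi.moe does it. The tag_japanese name and the character ranges are assumptions, and the sketch assumes that every run of kana or Han characters on the site is Japanese, which would be wrong for a site that also shows Chinese.

```python
import html
import re

# CJK punctuation, hiragana, katakana and the basic CJK ideograph block;
# a deliberately small set of ranges for this sketch.
JAPANESE_RUN = re.compile(r"[\u3000-\u303F\u3040-\u30FF\u4E00-\u9FFF]+")


def tag_japanese(text, lang="ja"):
    """Escape text for HTML and wrap each run of Japanese script in a lang span."""
    parts = []
    last = 0
    for match in JAPANESE_RUN.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f'<span lang="{lang}">{html.escape(match.group())}</span>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

For example, tag_japanese("The word 誤解 is tricky") wraps only 誤解 in a span with lang="ja" and leaves the English text alone.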

fireattack · on Feb 11, 2019

Ah you're right, sorry!

mrob · on Feb 8, 2019

Unicode's Han unification makes this more difficult than it needs to be. Now that Unicode has more than 16 bits worth of characters anyway, it looks very much like a mistake.

https://en.wikipedia.org/wiki/Han_unification

jessaustin · on Feb 8, 2019

One can kind of understand why they did it. A lot of glyphs are the same from one language to another. If the codepoint for the Chinese "大" were different from that for the Japanese "大", which in turn were different from that for the Korean "大", that would generate some complaints. OTOH, a Unicode that includes both "学" and "學" certainly has enough room for several versions of "道". I suspect this will eventually be fixed, but as a series of one-off additional codepoints, not as a giant duplication for each of C, J, K, and V. Some of the problem seems to have been that while the Chinese Unicode experts realized they wanted to write both "学" and "學" in the same document, the Korean experts never considered the possibility of including Chinese or Japanese text in a primarily Korean document. I think by this point they've realized it, though, so it will be fixed eventually.

kevin_thibedeau · on Feb 9, 2019

The consortium always expected implementations to use higher-level markup to specify language-specific variants.

mrob · on Feb 9, 2019

That's impossible in some cases. How am I supposed to list a directory with a mix of Chinese and Japanese filenames? Most filesystems don't have a filename language attribute.
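
A Python sketch of why the filename case is hard: without markup, about the best a file lister can do is guess from the script, and Han-only names stay ambiguous. The function name, the "han-ambiguous" label and the ranges are all assumptions of this sketch.

```python
import re

KANA = re.compile(r"[\u3040-\u30FF]")    # hiragana and katakana
HANGUL = re.compile(r"[\uAC00-\uD7A3]")  # precomposed Hangul syllables
HAN = re.compile(r"[\u4E00-\u9FFF]")     # basic CJK unified ideographs


def guess_language(name):
    """Guess a language tag for a filename from its script, if that is possible."""
    if KANA.search(name):
        return "ja"
    if HANGUL.search(name):
        return "ko"
    if HAN.search(name):
        # Could be Chinese or Japanese; the code points alone cannot say.
        return "han-ambiguous"
    return None
```

A name like 誤解.txt lands in the ambiguous bucket, so the lister still has to pick one font for it and risks getting it wrong.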

jessaustin · on Feb 10, 2019

"The consortium" is not a monolith. Some of them probably did expect that, but such expectation seems to conflict with the "学"/"學" distinction mentioned above and for that matter with similar distinctions for other scripts e.g. "a"/"α" and "A"/"Α".

Dylan16807 · on Feb 9, 2019

That can work in an unwieldy way for entire documents. But for inside a single document it should absolutely be Unicode's job to prevent quasi-mojibake.

kijin · on Feb 9, 2019

Noto fonts are fantastic for multilingual documents, but beware that the CJK versions weigh multiple megabytes each. Loading them as a webfont will eat up a lot of data, not to mention cause a noticeable delay. This is unavoidable when there are 10,000+ characters to encode.

My Korean company website loads a subset of Noto Sans, but uses the system default sans-serif if it is accessed with a mobile device. Fortunately most Koreans don't use Hanja (Kanji/Hanzi) anymore, so visual consistency is not an issue.
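
As a rough illustration of how a subset might be chosen (not a description of the site above), this Python sketch collects the characters a page actually uses and formats them in the style of a CSS unicode-range value. The function names are assumptions. The resulting list would then be handed to whatever font subsetting tool is in use.

```python
def codepoint_ranges(text):
    """Collapse the distinct non-whitespace characters of text into (start, end) runs."""
    points = sorted({ord(ch) for ch in text if not ch.isspace()})
    ranges = []
    for cp in points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(r) for r in ranges]


def unicode_range(text):
    """Format the runs in the style of a CSS unicode-range descriptor value."""
    return ", ".join(
        f"U+{start:04X}" if start == end else f"U+{start:04X}-{end:04X}"
        for start, end in codepoint_ranges(text)
    )
```

A page that only uses a few hundred distinct characters needs a far smaller font file than the full 10,000+ character set, which is the point of subsetting.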

cuuupid · on Feb 8, 2019

Interestingly, Google very quietly shuttered the Google Translate for web plugin that made it possible to autotranslate sites.

The Web Speech API already exists for Speech Recognition and Speech Synthesis, so hopefully they add a translation API directly to Chrome. Can't call it an _inter_net if it doesn't easily support multiple languages!

tokyodude · on Feb 8, 2019

You need a plugin? I thought you just paste the URL into translate.google.com

https://translate.google.com/translate?hl=&sl=ja&tl=en&u=htt...

Seems pretty straightforward to make a plugin that generates those URLs if you want one.
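
A sketch of such a generator in Python, built only from the query parameters visible in the link above (hl, sl, tl, u). The function name is an assumption, the example.com address is a placeholder, and whether the endpoint keeps accepting this form is not something the sketch can promise.

```python
from urllib.parse import urlencode

TRANSLATE_ENDPOINT = "https://translate.google.com/translate"


def translate_url(page_url, source_lang="ja", target_lang="en"):
    """Build a translate link for page_url using the hl/sl/tl/u parameters."""
    query = urlencode(
        {"hl": "", "sl": source_lang, "tl": target_lang, "u": page_url}
    )
    return f"{TRANSLATE_ENDPOINT}?{query}"


# Placeholder page, not a real target.
example = translate_url("https://example.com/")
```

urlencode percent-escapes the page address, so a target URL with its own query string does not get mixed up with the outer parameters.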

cuuupid · on Feb 14, 2019

It's not "allowed", i.e. if I want to have GTranslate by default on my site regardless of what browser someone is on, I can't do that anymore.

Their translator can also be accessed via API, but again not "allowed." It's not a huge issue since they don't do much other than try to 503 you, but it's still "illegal."

dspillett · on Feb 8, 2019

> Google very quietly shuttered the Google Translate for web plugin that made it possible to autotranslate sites.

The feature still seems to be built into Chrome: https://random.spillett.net/stuff/tmp/translate.png

Or was there a plug-in for other browsers too, that is now AWOL?

cuuupid · on Feb 14, 2019

Plugin for the site! You could add/trigger it and have the entire site autotranslated for anyone from any browser as long as you left the "translated by Google Translate" somewhere on the site.

_0nac · on Feb 8, 2019

> Spoilers: Google has an interesting solution in the works.

This appears to be a reference to Google's universal Noto font set, released in 2014:

https://www.google.com/get/noto/

fouc · on Feb 8, 2019

Mixed-language webpages is an interesting problem I hadn't thought about.

reaperducer · on Feb 8, 2019

It's a fascinating problem, if you're into that kind of problem solving.

I recently had to build a large site that was English, Spanish, and Chinese. Which was fun considering that some of the audience was es-mx, some was es-es, and some were es-419.
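
For a flavour of the problem, here is a simplified Python sketch of language tag lookup by truncation, loosely in the spirit of RFC 4647. The lookup name and the supported lists are assumptions, not what that site used.

```python
def lookup(requested, supported, default=None):
    """Return the supported tag that best matches requested, or default."""
    by_lower = {tag.lower(): tag for tag in supported}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()  # drop the most specific subtag and retry
    return default
```

Truncation alone sends es-MX to plain es. It has no idea that Mexico belongs to es-419, so routing es-MX readers to Latin American Spanish copy would need an explicit mapping on top of this.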

sebazzz · on Feb 9, 2019

Does all this apply for Mandarin / Taiwanese as well?

reaperducer · on Feb 8, 2019

> For PCs, cursive falls back to Comic Sans

Eep!

jandrese · on Feb 8, 2019

That has to be awesome when viewing historical documents. On the other hand, kids will have a much easier time reading the Constitution in Comic Sans instead of the original cursive.