The advice about UTF-8 encoding is a bit incomplete. While adding <meta charset="utf-8" /> is a good idea, it's also important to make sure that the web server isn't sending an HTTP Content-Type header that declares a charset other than UTF-8.
In HTML5, the browser is supposed to not sniff the document for a meta charset if the server headers specify a charset in the Content-type.
I've been scraping some Japanese sites lately and this has been a minor annoyance. Content-Type rarely has the encoding and Requests doesn't default to UTF-8 so you get mojibake for EUC-JP and UTF-8 unless you intervene.
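For what it's worth, here is a minimal sketch of the kind of intervention I mean. It trusts a charset in the Content-Type header first, then looks for a meta charset near the top of the body, and only then falls back to UTF-8. The helper names `pick_encoding` and `decode_html` are made up for this example and are not part of Requests; the 1024-byte scan window is also just an assumption.

```python
import re

# Hypothetical helpers (not part of Requests): choose an encoding by trusting
# the HTTP header first and a <meta> charset declaration second.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def pick_encoding(content_type, body, fallback="utf-8"):
    """Return the charset from the header, else from a <meta> tag, else fallback."""
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # Only the first bytes are scanned (assumed window of 1024 bytes).
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback


def decode_html(content_type, body):
    """Decode body with the chosen encoding, replacing undecodable bytes."""
    return body.decode(pick_encoding(content_type, body), errors="replace")


# An EUC-JP page served without a charset in the header.
page = '<html><head><meta charset="euc-jp"></head><body>日本語</body></html>'.encode("euc_jp")
print(pick_encoding("text/html", page))
print(pick_encoding("text/html; charset=UTF-8", page))
```

With Requests itself, the same idea would be to assign the chosen name to `response.encoding` before reading `response.text`.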
Am I the only one on Linux for whom Firefox displays all characters correctly (in this case the Japanese characters; on other pages Arabic, Russian, etc.) while Chrome can't display them at all? (I've been scanning websites for months while writing a software prototype, and it's been like this every time I did a random check in the browsers.)
I had these issues on Fedora, especially around emoji. Firefox was fine but Chrome was not. I ended up having to play around with some font settings to force Chrome to use Noto fonts.
Yeah, you should probably report that somewhere, probably to your distribution maintainer (assuming you're talking about a distro Chromium build, and not Google's Chrome blob).
Google distributes Chrome on Linux on tens of millions of devices every year, it displays fonts just fine for almost everyone.
> Thx - but yes, it's really Chrome and Gentoo (my distro) downloads of course Chrome's blob => weird that I have this problem, isn't it?
Chances are that it's not that weird. If you have enough memory I recommend building Chromium from source. There's probably some library somewhere which is behaving in a way which packaged Chrome isn't fond of (and it seems like maybe the maintainer of the chrome binary package you're using needs to update it in some way).
(Aside: funny enough, Chrome OS is Gentoo-based IIRC.)
I haven't had that problem, on the same distro even! I do remember them often differing in what fonts they would select by default, and Chrome often seemed to pick a worse-looking option... but I never really had any problems with the text simply not rendering.
I remember that in the book "Remote: Office Not Required" by Jason Fried and David Heinemeier Hansson, they mentioned finding internationalization issues early on because globally distributed remote teams naturally dogfood for international audiences. This struck me as one of their key selling points for remote work.
You also, of course, need a company that is receptive to this as a goal. I usually run my computer in whatever language I'm trying to learn at the moment, and I've had more than one company dismiss i18n issues I've discovered as "well, we're not going to sell in that country any time soon!" (i.e., "get back to making the demo look pretty").
I had a lot of problems making Japanese text on ichi.moe display correctly. I'm using lang="ja" instead of lang="ja-jp" though; it seems to work and is shorter. The main problem with Japanese characters displayed in a Chinese font (which happens when the lang attribute is not set explicitly) is that some characters are barely recognizable as the same character. Compare 誤 in a Chinese font and a Japanese font. [1] Yep, this is the same character according to Unicode. Imagine if Latin letter g, Cyrillic г and Greek γ had the same Unicode codepoint.
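A quick way to see this for yourself, as a small sketch using only Python's standard `unicodedata` module: the three look-alike letters each have their own codepoint and name, while the Han character is a single unified codepoint with no language attached to it.

```python
import unicodedata

# The three look-alike letters from the comparison each get their own codepoint.
for ch in "gгγ":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The Han character is one codepoint; nothing in the string itself records
# whether it should be drawn with a Chinese or a Japanese glyph.
han = "誤"
print(f"U+{ord(han):04X} {unicodedata.name(han)}")
```

That is why the language has to come from outside the text, such as the lang attribute.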
Can you explain more about the 誤 example you provided?
When you say it's barely recognizable, do you mean simplified vs traditional? Because to me (as a native Chinese speaker) the Japanese and traditional look almost identical. I can't comment on traditional vs simplified because I can read both.
If it's simplified vs traditional, I wonder why the OS/browser prefers to render the character in simplified form (I assume the Chinese font you are using has both styles).
> the Japanese and traditional look almost identical
They have different radicals on the lower right. Unless there's some reason to consider the two radicals equivalent (there isn't in JP, I wouldn't know about CN) they're different characters.
They are considered variants of each other for linguistic reasons, but that's not what GP was talking about. GP suggested the characters look almost identical, and I'm pointing out that they clearly don't.
lang="ja" absolutely should work [1], so I have no idea why it doesn't on your website.
Maybe it's related to the font you specified, which may directly or indirectly (through fallback) cause the problem. After all, `font-family` overrides language (which essentially just helps to select the right fonts).
If you don't mind providing a test page, I can help debug it.
You misunderstood me, it does work on my website, I just had to do a lot of work to insert this attribute in all the right places where Japanese text can be displayed.
Unicode's Han unification makes this more difficult than it needs to be. Now that Unicode has more than 16 bits worth of characters anyway, it looks very much like a mistake.
One can kind of understand why they did it. A lot of glyphs are the same from one language to another. If the codepoint for the Chinese "大" were different from the one for the Japanese "大", and that in turn different from the one for the Korean "大", that would generate some complaints. OTOH, a Unicode that includes both "学" and "學" certainly has enough room for several versions of "道". I suspect this will eventually be fixed, but as a series of one-off additional codepoints, not as a giant duplication for each of C, J, K, and V. Some of the problem seems to have been that while the Chinese Unicode experts realized they wanted to write both "学" and "學" in the same document, the Korean experts never considered the possibility of including Chinese or Japanese text in a primarily Korean document. I think by this point they've realized it, though, so it will be fixed eventually.
That's impossible in some cases. How am I supposed to list a directory with a mix of Chinese and Japanese filenames? Most filesystems don't have a filename language attribute.
"The consortium" is not a monolith. Some of them probably did expect that, but such expectation seems to conflict with the "学"/"學" distinction mentioned above and for that matter with similar distinctions for other scripts e.g. "a"/"α" and "A"/"Α".
That can work in an unwieldy way for entire documents. But for inside a single document it should absolutely be Unicode's job to prevent quasi-mojibake.
Noto fonts are fantastic for multilingual documents, but beware that the CJK versions weigh multiple megabytes each. Loading them as a webfont will eat up a lot of data, not to mention cause a noticeable delay. This is unavoidable when there are 10,000+ characters to encode.
My Korean company website loads a subset of Noto Sans, but uses the system default sans-serif if it is accessed with a mobile device. Fortunately most Koreans don't use Hanja (Kanji/Hanzi) anymore, so visual consistency is not an issue.
Interestingly, Google very quietly shuttered the Google Translate for web plugin that made it possible to autotranslate sites.
The Web Speech API already exists for Speech Recognition and Speech Synthesis, so hopefully they add a translation API directly to Chrome. Can't call it an _inter_net if it doesn't easily support multiple languages!
It's not "allowed", i.e. if I want to have GTranslate by default on my site regardless of what browser someone is on, I can't do that anymore.
Their translator can also be accessed via API, but again not "allowed." It's not a huge issue since they don't do much other than try to 503 you, but it's still "illegal."
Plugin for the site! You could add/trigger it and have the entire site autotranslated for anyone from any browser as long as you left the "translated by Google Translate" somewhere on the site.
It's a fascinating problem, if you're into that kind of problem solving.
I recently had to build a large site that was English, Spanish, and Chinese. Which was fun considering that some of the audience was es-mx, some was es-es, and some were es-419.
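To give a flavor of the problem, here is a rough sketch of a locale fallback chain for those three Spanish variants. Everything in it is an assumption for illustration: the `PARENTS` map, the function names, and the choice of English as the last resort are hypothetical, not what the actual site did.

```python
# Hypothetical parent map: which broader locale to try when a specific one
# has no translation. The es-mx -> es-419 link is an assumption for this sketch.
PARENTS = {"es-mx": "es-419", "es-419": "es"}


def fallback_chain(tag, default="en"):
    """Return the locales to try, most specific first, ending with default."""
    chain = []
    current = tag.lower()
    while current and current not in chain:
        chain.append(current)
        if current in PARENTS:
            current = PARENTS[current]
        elif "-" in current:
            current = current.rsplit("-", 1)[0]  # e.g. es-es -> es
        else:
            current = ""
    if default not in chain:
        chain.append(default)
    return chain


def pick_locale(tag, available):
    """Return the first locale in the chain that the site actually has."""
    for candidate in fallback_chain(tag):
        if candidate in available:
            return candidate
    return None


print(fallback_chain("es-MX"))
print(pick_locale("es-ES", {"en", "es", "zh"}))
```

The point of the explicit map is that plain truncation of the tag would send es-mx straight to es and skip the Latin American variant.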
That has to be awesome when viewing historical documents. On the other hand, kids will have a much easier time reading the Constitution in Comic Sans instead of the original cursive.