I wonder how the decisions for inclusion of languages were made, as there are some very odd decisions. For example, Osmanya is a script created for the Somali language that was hardly ever used (Somali literacy only became widespread after the Latin alphabet was adopted; previously Arabic was commonly used). The population of actual users of this script is pretty indisputably 0. 100,000 would be a wildly ambitious estimate of the number of people who have ever even seen the script.
On the other hand, Oriya, which has over 33 million native speakers, including 80% of India's Odisha state, does not appear to be supported.
In their defense, when you click India and scroll down, it does say, "not supported yet". Which leads me to believe they picked both languages with few characters (or straightforward to render?) and those most common, and they'll get to the rest shortly. :)
Meanwhile, I wonder if this means we'll see OCR and ePubs for all kinds of scripts now; or if this will help enable Google Translate in more languages? ;-)
Also maybe this was a 20% time thing and the programmer who started it just wanted to do those languages (probably because they couldn't be found elsewhere).
It's probably just a matter of whether or not there's somebody in the relevant team(s) who is familiar with, or at least has heard of, any given script.
I wouldn't be surprised if there happens to be an Osmanya geek at Google, but none of his teammates has ever heard of Oriya. For the same reason, I wouldn't be surprised if they added a bunch of geeky fictional languages before actual ones.
And the fun thing is, there are two old Persian scripts available (Pahlavi and Old Persian, both dead for almost 1500 years), yet the current Persian script is not supported :)))
Perhaps even more curious is the inclusion of Deseret, a phonetic alphabet developed by the Mormons in the mid-1800s. It never caught on, and few books other than the Book of Mormon were ever printed in it.
Good pick, but I also see the value in preservation — maybe 500 years from now, the Noto fonts will be the best or only representation of many dead or forgotten scripts and languages.
There are many more glyphs than there are codepoints -- a font contains a ton of information needed to reproduce a script, information that is not present in the Unicode tables.
This is particularly true for non-European languages – ligatures are a minor feature for most of the languages using Latin-1, but there are quite a few scripts which depend on complex, multi-letter combinations that are required for the text to be comprehensible.
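To make that concrete, here's a rough sketch with Python's fontTools (the Devanagari filename is just a placeholder for whichever Noto file you have locally). Comparing the glyph count with the cmap size shows how many shapes exist purely for ligatures and other shaping, with no code point of their own:

    from fontTools.ttLib import TTFont

    # Open a local copy of a complex-script Noto font (filename is an assumption).
    font = TTFont("NotoSansDevanagari-Regular.ttf")

    num_glyphs = font["maxp"].numGlyphs        # every drawable shape in the font
    num_codepoints = len(font.getBestCmap())   # shapes reachable directly from Unicode
    print(f"{num_glyphs} glyphs vs {num_codepoints} encoded code points")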
Hi, fellow Oriya speaker. Google has very bad support for Oriya, since the IT and R&D scene in Odisha is not that great. I work at IIIT Hyderabad, which is the leading NLP lab in India, and I don't see anything in Oriya.
I like the implementation of CJK fonts in Noto, which was just released this week. I particularly like that I can illustrate that the various Sinitic languages ("Chinese dialects") do NOT all use the same written characters, so that Chinese people who travel to different dialect regions sometimes find written signs that they cannot read, even if they are literate in Modern Standard Chinese. (I have seen this regional illiteracy on the part of native speakers of Chinese in several contexts.)
How you might write the conversation
"Does he know how to speak Mandarin?
"No, he doesn't."
他會說普通話嗎?
他不會。
in Modern Standard Chinese characters contrasts with how you would write
"Does he know how to speak Cantonese?
"No, he doesn't."
佢識唔識講廣東話?
佢唔識。
in the Chinese characters used to write Cantonese. As will readily appear even to readers who don't know Chinese characters (if you have a good Unicode implementation enabled as you read Hacker News), many more words than "Mandarin" and "Cantonese" differ between those sentences in Chinese characters.
I thought Han Unification meant that (most of the common) CJK characters were represented by the same Unicode code points, and that the way to differentiate like this is by specifying metadata that indicates the "language".
Obviously I'm wrong, because these are just regular Unicode characters, without an HTML "lang" attribute.
Two separate issues. Pan-CJK fonts like Noto solve a problem that arises from the fact that many CJK characters are only ever used in certain locales. Since most CJK fonts are made for a given locale, they tend not to include any of the (many) CJK characters never used in that locale. Hence, if you render mixed CJK text in (say) a Japanese font, any characters that don't appear in Japanese won't render at all. That's what (I believe) this comment refers to.
The Han Unification problem arises from the inverse case - characters that are used in several languages but rendered differently depending on locale[0]. For those characters, they'll render even without a pan-CJK font, but the problem is they'll render in a way that's not appropriate for their locale.
[0] Another way to phrase this would be "distinct characters which share a code point because Unicode mistakenly thinks they're a single character whose rendering differs by locale". The difference is basically subjective.
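To make the first point concrete (the "won't render at all" case), here's a rough fontTools sketch; the font filenames and the exact coverage are assumptions, but the check itself is just whether a font's cmap covers the code point at all:

    from fontTools.ttLib import TTFont

    def has_glyph(font_path, char):
        # True if the font's cmap maps this code point to any glyph at all.
        return ord(char) in TTFont(font_path).getBestCmap()

    # 佢 (U+4F62) is common in written Cantonese but rare elsewhere, so a
    # Japanese-locale font will typically lack it (filenames are assumptions).
    print(has_glyph("NotoSansJP-Regular.otf", "佢"))
    print(has_glyph("NotoSansCJKtc-Regular.otf", "佢"))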
> if you render mixed CJK text in (say) a Japanese font, any characters that don't appear in Japanese won't render at all.
In theory, yes, but that almost never happens. What really happens is that any missing characters will fall back to a different font, so the text is legible but looks terrible from an aesthetic point of view.
I'm not aware of any other font that does a decent job of handling all of Simplified Chinese, Traditional Chinese, Japanese, and Korean simultaneously, and with light, bold, thin etc variants to boot. Most existing fonts, even expensive commercial ones, are lucky to support two, and even then usually regular text only.
Still no Nastaliq [1] for Urdu and Persian script. There's a great piece on Medium [2] about the death of the Urdu script at the hands of the more structured Arabic Naskh font.
I saw the article before and I too was deeply moved by it, but don't you think you could word this better so as not to make it sound so blasé? "Pah, no nastaliq, useless!" -- this is an amazing project.
+1. It sounds incredibly condescending the way it's written right now. Like the omission of Nastaliq is a great tragedy and Google should be deeply ashamed.
> Like the omission of Nastaliq is a great tragedy and Google should be deeply ashamed.
Well, the heading says «Beautiful and free fonts for all languages» and the OP noticed that it fails to include at least one important script/language combination.
A. I wouldn't exactly call an almost-dead not-incredibly-commonly-used script "important". Especially when others, which are available on Noto, are more widely used.
B. There are much better ways of phrasing the request than the way it was written.
First of all, I wouldn't call one specific use case proof that it isn't almost dead. Just because it's being used in one case, for Bollywood wedding songs, does not mean it isn't "almost dead". Secondly, it was by the OP's own admission that the language is dying.
It's infuriating how many Japanese sites still don't use Unicode, purportedly because of this issue (though I suspect that it's just another example of Japan lagging when it comes to web/computer tech).
Please understand that Han unification is _the_ problem. It is clear that Unicode needs to realize that Han unification is wrong and accept what the native writers of those languages think about their scripts.
To make the problem more understandable to the people that are used to alphabetic scripts, suppose that tomorrow an Asian committee starts creating Uniword, a repertoire that maps complete words to numerical IDs. At a certain point they get to "colour".
Uniword committee: Well, that word shares meaning and origin with the other word "color", for which we have already a codepoint, so we will encode them under the same codepoint.
GB, Australia and Canada: Hey! No! To us those are different words; in particular, we do not want Mr. Colours to appear as Mr. Color.
Uniword committee: No problem, just add some out-of-band information like "nationality" or "<span lang='en-GB'>".
"colour"-people: That will not work; there are so many cases in which this can go wrong. Whenever I copy a field from a DB, do I also have to extract this extra information?
Uniword: Yes. Is that a problem? C'mon!
"colour"-people: but do you need to do that in your applications?
Uniword: no, we have one code for every single word in our languages, including codes for very old languages that exist only in two palimpsests.
"colour"-people: and why cannot we have the same level of granularity?
Uniword: because you have too many words!!! And when we started we had only 100k available integers.
"colour"-people: and now?
Uniword: now we have 2^32. But, yeah, that is not the point; just do as we suggest. This dialogue is getting too long.
The only way I could improve on this dialogue would be accusing the Australians of anti-American prejudice for refusing to accept the English unification.
That was perceived as happening more than a few times in the Han Unification debate.
This is a great summary, thanks for that. I'd never had this explained in a way I could personally relate to.
I remember being concerned about Han unification around the time Ruby 1.9 was released, since this seemed to be one of Ruby's major reasons for being encoding-independent instead of standardizing on Unicode. But I hadn't heard about this issue in a while, except to hear occasionally someone say it's not a problem (maybe it was a Chinese person instead of a Japanese person -- the Wikipedia page says that the Chinese aren't as concerned about Han unification since Traditional Chinese didn't get unified with Simplified Chinese).
Every discussion we ever had about going to Unicode: "Is it bidirectional compatible with SJIS?" "No." "OK, so when it breaks, what characters does it break on?" "People's names, mostly." "... And why is this being considered?" "It is very, very convenient for white people. Almost all of their stuff works out of the box."
I wouldn't say it's convenient for white people. Being able to write Japanese on all the non-Japanese people's stuff (basically, most websites and open-source software) should be mostly useful to the Japanese.
It may not be so infuriating when you consider how important names are to Japanese people. Breaking your name is unacceptable and breaking it because of a technical convenience chosen during development would be deeply offensive on top of being unacceptable.
Until Unicode stops breaking people's names, it will continue to be the one standard for Japanese systems on and offline. Even when (if?) it stops breaking Japanese names, it will take a very, very long time to roll over existing systems, and that's barring unforeseen problems during the conversion.
We should stop before we adopt the "not following the standard" = "broken" ideology. Especially when we consider whom the standard serves best.
Edit: By "it will continue to be the one standard for Japanese" in the 2nd paragraph, I meant ShiftJIS not Unicode. That looked a bit unclear.
What do you mean by saying that Unicode breaks people's names? The problem isn't that there's anything wrong with Unicode; the problem is that it's not possible to unambiguously convert text from SJIS to Unicode and back again, because SJIS has some duplicate mappings for historical reasons (compatibility with different pre-existing encodings, as I recall). The same would presumably be true of converting SJIS to any other encoding that didn't have the same duplicated code points.
Why does Unicode get to decide what "duplicate" means in this context? If Mr. Smythe is addressed as "Mr. Smith" in a letter from his bank, is he going to be mollified by the explanation that Unicode considers his preferred representation to be a duplicate?
Unicode didn't decide anything like that, the characters were duplicates per SJIS itself. SJIS intentionally has duplicate mappings in order to be compatible with vendor encodings that put the same character in different places.
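If you want to see those duplicates for yourself, here's a quick Python sketch using the cp932 codec (Microsoft's Shift JIS variant): it walks every valid double-byte sequence and lists Unicode characters reachable from more than one of them, which is exactly what makes a lossless round trip impossible.

    from collections import defaultdict

    # Map each decoded character back to every cp932 byte sequence that produces it.
    reverse_map = defaultdict(list)
    for lead in range(0x81, 0x100):
        for trail in range(0x40, 0xFD):
            raw = bytes([lead, trail])
            try:
                decoded = raw.decode("cp932")
            except UnicodeDecodeError:
                continue
            if len(decoded) == 1:
                reverse_map[decoded].append(raw)

    duplicates = {c: seqs for c, seqs in reverse_map.items() if len(seqs) > 1}
    print(len(duplicates), "characters have more than one cp932 encoding")
    for char, seqs in list(duplicates.items())[:5]:
        # Re-encoding has to pick just one sequence, so the other originals are lost.
        print(char, [s.hex() for s in seqs], "->", char.encode("cp932").hex())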
Basically, the characters in Unicode are a superset of what you'd find in any single Japanese encoding. The problem is that it mashes many of them together with characters you'd find in C or K encodings.
The number of Japanese systems that are SJIS-only is staggering; basically all traditional IT (banks etc.) uses it and does not support, nor will ever support, Unicode.
This, naturally, has an impact beyond those systems' borders.
Oddly, the following more modest proposal hasn't gotten much traction: characters that share history but have divergent graphical representations in the various dialects of alphabetic script shall share codepoints, and a mechanism beyond the scope of Unicode (like lang attributes or plain guesswork) shall be used to decide whether a given codepoint means L or ᴫ or Λ or whatever.
Actually I think that's roughly how things work today. It's not my area, but here's how this was explained to me recently by a colleague (the lead font guy at Adobe Japan):
There are two ways of dealing with glyphs that share code points. The first is TTC (truetype collection) fonts. A TTC is basically one set of glyphs with several sets of mappings (i.e. which code point maps to which glyph). When you install it, assuming your computer groks ttc, your system shows you a separate font for each mapping. Taking for example Source Han Sans, which adobe just released - if you go to the download page[0] and get the complete version (the "OTC" one), you get a bunch of files like "SourceHanSans-Bold.ttc". If you install one of them you'll see four new fonts: "Source Han Sans J", K, SC, and TC. Then when you use the font, depending on which font name you used the system will change which mapping it applies to the combined set of glyphs. (Hence the choice of font name is the selection mechanism you described.)
The second way is that TrueType fonts have a way to build locale settings into the font. I'm less clear on the details here but apparently it's similar to TTC behind the scenes, except that the mappings are associated with locales - so in an app that supports TT locales, even if you select "Foo J" as your font, when the locale was simplified Chinese you'd get the SC glyph. Of course now the selection mechanism is whether the application knows what locale the content is. (And also whether it supports the mechanism - I don't know how widespread this is.) Either way though, in principle you get different glyphs for the same code point, depending on context.
Or anyway that's the understanding I took away as a font layperson - happy to be corrected.
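For what it's worth, that matches what you can observe with fontTools. Here's a rough sketch (the .ttc path is an assumption, and the glyph names are whatever the font happens to use internally): in principle, each member of the collection maps the same Han-unified code point to its own glyph.

    from fontTools.ttLib import TTCollection

    CODEPOINT = 0x76F4  # 直, a commonly cited Han-unified character

    collection = TTCollection("SourceHanSans-Bold.ttc")  # path is an assumption
    for font in collection.fonts:
        full_name = font["name"].getDebugName(4)         # e.g. "Source Han Sans J Bold"
        glyph_name = font.getBestCmap().get(CODEPOINT)   # glyph this member picks
        print(f"{full_name}: U+{CODEPOINT:04X} -> {glyph_name}")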
The modest proposal was to extend this approach with all its complications to western alphabetic scripts where possible. Of course nobody wants to do that for obvious reasons, which also apply to the various languages that use Han-derived characters. The extra mechanisms required to work around unified Han remind me of the pre-unicode days when you needed to know what language a text was in to render it.
Then you break copy-paste. Or maybe we could add locale markers in unicode, then encode the different symbols as <locale><codepoint>. It will only take something like up to 12 bytes per character in UTF-8, no big deal, right?
Surely there are enough Unicode code points for this not to be a problem? Could you use the historical character + a combining mark (which shows which 'newer' version of the character to use), where the combining mark is ignored if the computer doesn't understand it, and only then does it fall back onto guesswork/lang attributes?
I don't know if that's a decent solution, but just guesswork doesn't sound like a good idea, because there are bound to be edge cases where it wouldn't work, and then we're back where we started...
I can partially understand why the Japanese refuse to support Unicode. And while most just adopt Unicode simply because it's convenient, it doesn't actually solve the underlying problem of Han characters having different forms and glyphs.
Over the years, I have started to think Han unification is a Western way of hacking around the CJK Han problem rather than actually solving it.
I wouldn't say the Japanese refuse to support Unicode. I'd say that legacy encodings like EUC-JP/JIS/SJIS are still used because it's hard/unfeasible to convert systems and data that were built for earlier encodings to anything newer. But it's not like they offer some particular technical advantage over Unicode - indeed, the only reason they don't suffer from CJK issues is that they have no support for C or K. ;)
But speaking as a front-end webdev guy, it's been a looong time since I came in contact with any encoding here besides utf-8.
Eh, most Japanese people don't know/care about the problem of Han unification. It's mostly because of the legacy data. The thing is that ASCII and UTF-8 are kinda compatible, with minor annoyances, while Shift JIS and UTF-8 are entirely different. People don't want to convert a trove of documents to another encoding which might not be supported yet in some apps. Slow software upgrades are another reason. As someone else pointed out, the default encoding of Windows is still Shift JIS, which is totally understandable for compatibility's sake.
Edit: Besides, a TTF font doesn't always have to use Unicode internally. It supports an arbitrary mapping from bytes (which could be in UTF-8 or SJIS) to a glyph number. People who really care about the looks (i.e. printing) have been using a charset for each specific language, such as Adobe-Japan1, which is different from both UTF-8 and Shift JIS.
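The "kinda compatible" point is easy to demonstrate in a couple of lines of Python, if anyone's curious (pure illustration, nothing Noto-specific):

    # ASCII bytes are already valid UTF-8, so plain-English data survives the switch.
    print(b"hello".decode("utf-8"))

    # Shift JIS bytes are not valid UTF-8; decoding them that way yields mojibake.
    sjis_bytes = "こんにちは".encode("shift_jis")
    print(sjis_bytes.decode("utf-8", errors="replace"))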
Part of the problem here is that Japanese Windows uses SJIS at its front end, perhaps for "legacy" compatibility reasons... (The backend, like the filesystem on modern Windows, is Unicode; this was apparent when adding files containing Japanese characters to Git: it would work if I retrieved the file from another Windows machine, but it was horribly corrupted when moving to other platforms.)
Fortunately, the Git folks added a fix to convert it into Unicode internally.
I think this is amazing. I have never seen Cherokee glyphs that beautifully rendered before. Apparently there are still missing scripts, but this is a great step forward. This couldn't have come cheap, and I'm happy that Google is investing effort into this.
The Google Code page used to have a comment on the origins of the name. Noto is short for 'no tofu', tofu being the rectangles you get when you don't have a font covering that glyph.
This is incredible and is going to be very useful for people developing applications for use in Eastern Asia. Nailing typefaces for Chinese, Japanese, and Korean is a huge challenge. Noto and the accompanying Source Han Sans is going to be a huge boon for people in Eastern Asia and hopefully it will have widespread adoption.
Sadly, it's probably still not possible to use as a webfont. A single font weight is over 8 MB, but there is a distinct possibility this could go into mobile devices and operating systems, which would be awesome.
Well, one reason would be that your web font would be 134 MB or so? (Looking at the size of the comprehensive Noto download.)
The other is simple practicality - these things take time to develop, you can either wait until all the glyphs are done, or release subsets that cover languages as you work; a subset that covers part of a language isn't very useful but subsets that cover whole languages are.
That is a technical problem to solve. There should be a way for browsers to only download the characters being rendered in the current page, so even if the file is 134 MB they could fetch only the little pieces of it they need.
They're not really "for" a language. They are optimized subsets of the same font that only contain glyphs from one or more languages to minimize the file size.
If your website is written in English plus the occasional accented character from other Western languages, there's no need to load a 50 MB web font containing all the Traditional Chinese characters.
Part of the answer is that Unicode has a lot of characters, and web pages use only a few, so for web fonts it makes sense to have the user download only ones likely to be used.
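Which is roughly what the subsetting tools do. A rough sketch with fontTools (file names are placeholders, and real pipelines tune the subsetter options for hinting, layout features, and so on): keep only the code points that actually appear in the page's text and serve that.

    from fontTools.subset import Options, Subsetter
    from fontTools.ttLib import TTFont

    page_text = "All human beings are born free and equal in dignity and rights"

    font = TTFont("NotoSans-Regular.ttf")   # input filename is an assumption
    subsetter = Subsetter(Options())
    subsetter.populate(text=page_text)      # keep only code points used in the text
    subsetter.subset(font)
    font.save("NotoSans-Subset.ttf")        # a far smaller file to serve as a webfont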
CJK unification in Unicode means that you don't know how to render a Unicode codepoint without also knowing the language of the text - the same Unicode codepoint looks different when rendered in Chinese, Japanese or Korean.
There are technical reasons for this. For example, OpenType only supports 64k glyphs within a font, which is enough for the BMP but nothing else (and counting the ligatures that are necessary for, e.g., Arabic, it might not even suffice for the BMP).
Then there are practical considerations. While Latin, Greek and Cyrillic are similar enough to warrant the same styles (serif, sans-serif, script, italic, and various weights) not all of them make very much sense for, say, CJK or a variety of other scripts. So having different fonts for different scripts that are still designed to go together is actually not that bad a solution.
It does mean that for good typography you need a matrix of fonts based on style and script. Word includes two fonts per style for this, to treat CJK differently, which might not be enough, depending on the numbers of different scripts involved in a single document. But a) several dozen scripts per document are somewhat rare apart from Wikipedia's language list per article and font demonstrations; and b) good typography needs effort, this won't change.
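The 64k figure above is literal, by the way: glyph IDs are 16-bit, so roughly 65,536 glyphs is a hard ceiling per font. You can check how close a pan-CJK font gets with a few lines of fontTools (filename is an assumption):

    from fontTools.ttLib import TTFont

    GLYPH_LIMIT = 0x10000  # 16-bit glyph IDs cap a font at ~64k glyphs
    font = TTFont("NotoSansCJKjp-Regular.otf")  # filename is an assumption
    used = font["maxp"].numGlyphs
    print(f"{used} of ~{GLYPH_LIMIT} possible glyphs ({used / GLYPH_LIMIT:.0%})")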
I would guess that it's because of the amount of work required to design glyphs for every character that retain the font's style while still capturing the character's look and meaning from its language.
According to the site all the fonts together are 134MB compressed. Maybe that's the reason or maybe because it's a work in progress so works out better in segments.
No. "Those fonts" are hundrends, if not thousands. There are already 100 or so unicode fonts installed with Windows, OS X etc. Do you really want 10GB of fonts though?
A typographer typically works within a small set of languages. It would be unfeasible for a single type foundry to cover every possible glyph with the same consistency.
I have been using Noto fonts for more than 6 months now (mostly the Indic fonts) and am quite pleased with them. And I just saw that they have "Noto Sans Brahmi" in the pipeline. Although the Brahmi script (an ancient Indian script used around 300 BC) entered Unicode in 2010, there is not a single font available that covers Brahmi.
I also couldn't find any font that covers mathematical symbols from the SMP.
EDIT: Just downloaded the zip archive. Unix permissions for the Bengali and Gurmukhi fonts are different from the rest of them.
"All human beings are born free and equal in dignity and rights"
I love the fact that they use The Universal Declaration of Human Rights as the text for showcasing the fonts, using every opportunity to stand for human rights!
I am in love. Why don't they offer a monospace programming version? Noto Sans easily outdoes my Consolas for clarity. No easy feat! Please release a Noto Sans Code, Google!
If they released a monospace version, I would switch from Inconsolata! Although it is probably more important to continue work on supporting more languages.
No technical reasons, but it would be unlikely for Apple or Microsoft to adopt a font made by Google, when their own alternatives exist, despite technical superiority. Android already uses Roboto, which has been heavily invested in for Android 4+ and now with Material design too. Of course, by Linux I'm assuming you mean the popular distributions of it like Ubuntu, but even Ubuntu has its own font which it is unlikely to change - other non-"branded" distros probably would be the only ones who might.
Was hoping for a moment that we could all come together in harmony and enjoy universal access to fonts for all languages without relying on webfont kludges... hope springs eternal
The Serif looks questionable to me as well. The vertical line on the lowercase 'h' in particular looks like it has artifacts, both too thick and oddly blurred. The Greek serif has a lot of the same artifacts as well. The sans-serif looks fine, but something looks pretty wrong with the serif. I also thought it might be a client-side rendering issue (I'm using a Mac), but the demo page has the text prerendered into an image.
edit: Found a different demo page that renders the webfont client-side instead of showing images, and looks much better to me: http://www.google.com/fonts/specimen/Noto+Serif. Maybe it's just that the pre-rendered specimens are made with a poor rendering engine?
I really like Noto Sans. From what I can tell, it's a fork of Open Sans. For the Latin alphabet it's mostly the same, but with a single story lowercase g.
Which itself is a redrawing of Droid Sans. This Typophile thread has some comparisons, and a couple comments by the designer, Steve Matteson: http://typophile.com/node/101655
It's just a shame that Noto Sans doesn't have a nice range of weights. They would have been better off improving Open Sans instead of creating yet another font.
I find the Canada->Cree glyphs very interesting (geometrical). The art from the area is also very beautiful. If you are ever in Ottawa a trip to the Canadian Museum of History (was Civilization) is well worth it.