If people are scared to put basic punctuation marks in the names of things, out of fear that badly-written software might break as a result, then that is a sign of just how far we still have left to go.
On 99% of the keyboards I've seen in my life, the middle row reads like this: ASDFGHJKLÖÄ'
The apostrophe is considered significant enough to be on that row, but so are Ö and Ä. It's reasonable for users to expect that software can accept the letter that is right next to L on their keyboard, yet there remain software engineers who assume that users won't be surprised that things break if they dare to touch this key.
For the majority of people, a string format that accepts 100% of ASCII but 0.1% of Unicode is just as useless as one that only accepts 95% of ASCII. Therefore the goal should never be to get your ASCII coverage from 95% to 100%.
Resolving the Unicode issues is undeniably a higher priority for speakers of languages other than English, but there are still plenty of languages and libraries whose support for Unicode ranges from nonexistent to antiquated to limited to just plain broken.
There's nothing application developers can do in the short term to fix those problems, but by examining their own code and removing fallacious assumptions they can better facilitate the proper handling of Unicode once it becomes available in their underlying technologies. In the meantime, though, they may only have ASCII, or ISO 8859-X, or KOI8-R, or Shift-JIS, etc. to test with.
The first A in ASCII stands for American. This is because it was an American standard, created by and for Americans, who pretty much all speak English (and spend dollars!). What about ASCII isn't anglocentric?
In fact, some IMEs already intercept ", ', :, etc., and won't print them literally, assuming instead that they're going into a combination with another letter.
I've talked to a few friends, and they said it was a holdover from typewriters and not really designed with modern computing in mind.
So, if you sort on frequency by clicking the appropriate arrows icon in the Swedish column, you can see that both Å and Ö are more frequently used in Swedish than B, C, J, Y, X, W, Z and Q. Of course, all of those have dedicated keys as well and I don't think that's going to be changing anytime soon.
The important point is that "the same letter" can have different roles in different languages. In many languages Ä is considered a "modified version" of A (like in German, I think) but still "mostly" an A when it comes to sorting and so on. In Swedish, that's just not the case, it's a separate letter. Our alphabet has 29 first-class characters (http://en.wikipedia.org/wiki/Swedish_alphabet).
If your last name is "Åkermark", you're sorting a loong way from "Andersson".
In Finnish, Ä is actually one of the most common letters while Ö is somewhat less frequent.
In Swedish, both "ö" and "å" are words on their own: ö means island, å is a small river or something like that :)
That doesn't mean that there aren't any Unicode bugs in other tools, but that wasn't the issue in this particular case.
mv foo \!\'
$ touch 'j-kidd'\''s file!'
$ ll 'j-kidd'\''s file!'
-rw-rw-r-- 1 uid gid 0 Apr 11 13:30 j-kidd's file!
$ ls "!$"
ls "'j-kidd'\''s file!'"
ls: cannot access 'j-kidd'\''s file!': No such file or directory
$ ls '!$'
ls: cannot access !$: No such file or directory
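For comparison, a pattern that sidesteps history expansion entirely is to store the awkward name in a variable and always expand it inside double quotes (this sketch just reuses the example filename from the transcript above):

```shell
# Store the awkward name once; $f inside double quotes is expanded as-is
# and is never subject to history expansion or word splitting.
f="j-kidd's file!"
touch -- "$f"
ls -- "$f"
```

The `--` guard also keeps a name that happens to start with `-` from being parsed as an option.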
"directly adjacent strings which are double quoted, "'single quoted or even'\ unquoted\ and\ possibly\ full\ of\ escape\ sequences\ 'get concatenated and count as a single parameter.'
Running touch on the above will create one file with a very long name.
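The concatenation rule is easy to verify; this is a minimal sketch with made-up string fragments:

```shell
# Adjacent double-quoted, single-quoted, and backslash-escaped unquoted
# pieces merge into a single word before the command ever sees them.
name="part one, "'part two, 'part\ three
echo "$name"
```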
Or, "don't be your computer's tool!" -- the operator should accept no arbitrary limitations.
That also lets you write confusing filenames like "foo\rbar". Which can be really irritating to figure out without a GUI.
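Without a GUI, one way to spot such a name is bash's printf %q, which renders control characters visibly instead of letting them mangle the terminal (a sketch, assuming bash for the $'...' and %q extensions):

```shell
# Create a file whose name hides a carriage return, then list matching
# names in a quoted form where the \r becomes visible.
touch $'foo\rbar'
for f in foo*; do printf '%q\n' "$f"; done
rm $'foo\rbar'
```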
We need courage like this! ;)
Fedora 20 ; rm -rf / # '; DROP table *; --
I have to admit, if I launched an email campaign, I'd use them too...
Firstly, broad character sets are all very well, but they have limited value if you can’t rely on everyone using your text being able to see them. Something like the ö (o-umlaut) is a reasonable thing to expect in modern Western fonts, but what about the emoticons I’m increasingly seeing in e-mails, or more advanced mathematical operators, or the cat in the title of this very article that many people commenting here can’t see in some contexts?
We need much better standardisation and prioritisation of sets of related glyphs that are, for example, permitted by the coding standards for a software project or supported by a font file. ASCII is too small, all of Unicode is too big, and picking glyphs for inclusion in fonts one at a time is too fine-grained for this purpose.
Secondly, it is crazy that with literally a million code points available in Unicode world, we don’t seem to have new control characters for “begin literal” and “end literal” to mark a range of text that should be interpreted verbatim regardless of context. Instead, we’re still using horrible hacks like quoting and escaping in environments like command lines and source code, and in text file formats like comma- or tab-separated values. These kinds of techniques are, invariably, horribly error-prone and terrible for usability; after all, in the case we’re discussing, it seems to be the apostrophe that is causing more problems than the ö! I think the computing world would be a much easier place for many, many people if there were one universal standard way of saying, “This is plain text”.
That is not of Unicode's concern. Markup is reserved for higher-level things and is not done within Unicode. There are historical oddities like language tags and variation selectors but the former are deprecated and the latter exist to serve a particular need. You will never see semantics like those you describe applied to code points in Unicode. That's not what Unicode is for.
Well, I’m not sure what authority you’d cite for your statements about what is or is not Unicode’s concern, but right on page 3 in chapter 1 we have:
“The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard. [...] Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption.”
That sounds familiar, given the various contexts I mentioned before. I wonder whether the cost of all the software bugs ever caused by getting quoting/escaping wrong is greater than other repeat offenders like dereferencing nulls.
Then on page 8 in chapter 2 there are several explicit text processes mentioned, including user interaction and treating text as bulk data.
I do understand that a basic principle of Unicode is to represent plain text rather than mark-up, but it seems to me that avoiding problems with representing widely used text strings in different ways using different quoting and escaping conventions is consistent with that principle and exactly the kind of lack of standardisation that Unicode should help with.
There are already more than 60 control characters in Unicode. There are also many characters that aren’t described as control characters but which convey things like layout information or rendering hints, such as the non-breaking space, soft hyphen, and direction controls. I know that language tags are deprecated, but those are rather complicated and context-specific and, as the standard itself observes, supporting them is a significant burden on implementers, so I don’t think it is reasonable to equate that situation with what I suggested.
On balance, as a practical matter, if you're working in a text-only medium, your character set is all you have. If anything else must be done using some form of mark-up built with the same character set, then a simple, unambiguous, standardised way to switch from one to the other seems entirely consistent with the general goals of the Unicode project.
Links to relevant sources:
Obviously there are issues with Unicode phishing of domain names and other cases where you might want to signal that a character is "strange", but surely the memory and processing requirements for this are low enough now. It doesn't have to be a good font!
Yes, I think it probably is. Unicode is vast, and the 100,000+ characters specified in the latest standard include numerous obscure, specialised, or downright gimmicky ones.
The effort required to create just one font that supports even a crude version of each and every character is probably measured in human lifetimes. Imposing that kind of burden as a barrier to entry for any new platform seems unrealistic.
Unicode serves many different needs and not all characters are necessary to support in a general-purpose OS. There are fonts to cover the missing pieces and professionals working in fields requiring those usually have them installed.
There is also little benefit to providing a single font that encompasses all of Unicode. Designers pick fonts for aesthetic reasons, and every script has different styles (although Latin, Greek and Cyrillic are fairly similar, which is why they are usually all included in every font). E.g. you have the main distinction into serif and sans-serif (for non-decorative body text), a distinction that never existed for scripts like Han, Hebrew, Arabic, the various Indic scripts, etc. So if you were to create only one font, what would you choose for each script? Pan-Unicode fonts are mostly useful as fallback fonts, to ensure that you can see some rarely-used glyphs, but for nearly all practical purposes they cannot be used for anything else. It's also an enormous effort beyond creating the glyphs, because you have to include kerning tables, define positions where combining characters appear, etc. Those issues often make such pan-Unicode fonts unusable: yes, they may contain plenty of glyphs, but they cannot be used reliably to render text that goes beyond simple scripts (and diacritic placement can even be wrong with just Latin).
I’m just not sure I see a compelling argument that any new device entering the market must be able to render advanced mathematical notations, animals, and tarot cards. That’s a very high barrier to entry.
In due course, if there are freely available, good quality fonts that do the job, then by all means include them, but we’re a long way from that situation today. Even the most comprehensive efforts, things like Unifont, don’t cover all of Unicode. Also, without wishing to belittle anyone’s efforts, some of these projects are working on bitmap fonts, and it’s increasingly a vector world. Perhaps they are still useful as a rendering of last resort, but I suspect anyone working on a new platform or device has more pressing concerns.
Plan 9 actually tried to do something about this: it assumed Unicode for everything, and invented their own encoding for the process. The OS didn't work out so well (regrettably), but we still use their encoding: it's called UTF-8 now. Still a superset of ASCII, I guess, but at least we've gotten beyond the single-byte assumption.
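As a quick check of that superset property (assuming xxd is available): ASCII characters keep their single-byte values in UTF-8, while higher code points use multi-byte sequences.

```shell
# 'a' (U+0061) is one byte, unchanged from ASCII.
printf 'a' | xxd -p
# U+00F6 (ö), written here as octal byte escapes, is the two-byte
# UTF-8 sequence c3 b6.
printf '\303\266' | xxd -p
```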
Also part of the problem was U+0027 and its use as a quote character in shell scripts. So that was not a Unicode issue but rather a shell script injection issue and programmers do that all the time (although usually with SQL, I guess).
Found a complete list here:
You can find all of them in the Emoji section of Special Characters through the Edit menu, as described in another comment. They work in the classic Terminal application as well, along with most other applications that use the standard OS X APIs or that have specifically added support.
Chrome on OS X 10.8 shows me a square, and so I assumed that the joke is that it represents Schrödinger's box -- you know, where he keeps the cat which may or may not be dead.
That being said, Emoji seem to work at least on Mac OS X (current) and Windows 8. Alas, Chrome doesn't render them within a page (but in the tab/title bar, maybe because the OS' rendering takes over there). Firefox and IE show them both, apparently (presumably because both are DirectWrite-based by now).
See http://apple.stackexchange.com/questions/41228/why-do-emoji-... for more details.
Also, I am not an expert about this so I could be wrong about the details.
Granted, this is on Linux, but I think it's only a matter of whether or not your font has glyphs for the emoji characters -- correct me if I'm wrong, but I don't see a technical reason to treat emoji differently from any other Unicode character.
Edit: changed glyph based on someone mentioning it had heart-shaped eyes
$ iconv --to-code UTF-32BE|xxd
0000000: 0001 f63b 0000 000a ...;....
The square looks like the qed symbol, I thought the article was about a proof or a box. Both worked fine in my mind :)
Why does Safari show the cat while Chrome doesn't? I can think of two explanations: (1) Mac OS doesn't actually have a designated place to put fonts for all applications to find (I don't use Mac OS so I have no idea how it works, but this sounds unlikely), (2) Mac OS does have a central location for fonts but Chrome doesn't use them and just uses some fonts it comes bundled with. Is either of those two explanations correct? If not, what is going on?
I suspect it's an OS-related issue, as the cat shows in Chrome on Kubuntu 12.10 for me.
Diacritical marks (training wheels) and unicode break this simplicity, you cannot program in a language with many characters that are hard/impossible to distinguish. Or with crazy characters that can switch direction of text and other nonsense. Unicode is a luxury to pretty things up for end users, not something to do serious work in.
> Diacritical marks (training wheels) and unicode break this simplicity, you cannot program in a language with many characters that are hard/impossible to distinguish.
Many languages get by with their communication in spite of having things like "spots over the o's" or whatever. I have no problem distinguishing them. Do you have experience with reading such languages? Or are you simply blowing hot air?
Ever looked at typewriter font for l and 1? Yeah, these things are not historical accidents at all...
> Unicode is a luxury to pretty things up for end users, not something to do serious work in.
It's a luxury for end users... yes, non-English users should count their blessings when they are able to use their whole alphabet. Part of this issue was correctly spelling the name of an Austrian physicist, not someone trying to write "cat" with some esoteric Unicode that looks like a cat.