Schrödinger's 😻 and outside-the-box naming (lwn.net)
123 points by edwintorok on Apr 11, 2013 | 84 comments



While disappointing, it's perhaps not surprising that the umlaut in "Schrödinger's Cat" caused trouble. The fact that the apostrophe was ruled too risky as well, however, is an indictment of software engineering as a profession.

If people are scared to put basic punctuation marks in the names of things, out of fear that badly-written software might break as a result, then that is a sign of just how far we still have left to go.


Unfortunately your comment is a sign of just how far we have left to go to get rid of the notion that computers are for use by English speakers primarily, and the rest of the world is an afterthought at best.

On 99% of the keyboards I've seen in my life, the middle row reads like this: ASDFGHJKLÖÄ'

The apostrophe is considered significant enough to be on that row, but so are Ö and Ä. It's reasonable for users to expect that software can accept the letter that is right next to L on their keyboard, yet there remain software engineers who assume that users won't be surprised that things break if they dare to touch this key.


My point was that Unicode is newer than ASCII, and that we can't hope to deal with Unicode (Ö) properly if we can't even cope with ASCII (') yet. Nothing to do with anglocentrism at all. I agree that there are lots of annoying computing problems for non-English speakers though.


What I tried to say is that dealing with ASCII is meaningless; it's not even a useful starting point.

For the majority of people, a string format that accepts 100% of ASCII but 0.1% of Unicode is just as useless as one that only accepts 95% of ASCII. Therefore the goal should never be just to get your ASCII coverage from 95% to 100%.


There are two issues here: one is not accepting Unicode properly, and the other is making incorrect assumptions about the content of strings. Both need to be resolved, and all this cultural butthurt is not productive to solving either of them.

Resolving the Unicode issues is undeniably a higher priority for speakers of foreign languages, but there are still plenty of languages and libraries whose support for Unicode ranges from nonexistent, to antiquated, to limited, to just plain broken.

There's nothing application developers can do in the short term to fix those problems, but by examining their own code and removing fallacious assumptions they can better facilitate the proper handling of Unicode once it becomes available in their underlying technologies. In the meantime, though, they may only have ASCII, or ISO 8859-X, or KOI8-R, or Shift-JIS, etc. to test with.


The ISO-8859-1 character encoding, which is ASCII-based, contains an Ö too. That's not even Unicode. It may not fit in ASCII's seven bits, but it's so widely implemented that not supporting it is just as ridiculous as not supporting the apostrophe.
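
To see the difference concretely (a quick sketch, assuming a UTF-8 terminal with iconv and xxd installed): Ö is the single byte 0xD6 in ISO-8859-1, but two bytes in UTF-8.

  $ printf 'Ö' | iconv -f UTF-8 -t ISO-8859-1 | xxd
  0000000: d6                                       .
  $ printf 'Ö' | xxd
  0000000: c396                                     ..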


The problem is that there are so many standards for what characters 128-255 stand for.


There are literally hundreds of code pages: http://www.unicode.org/Public/MAPPINGS/
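
Which is exactly why a bare byte above 127 is meaningless without knowing the code page. A quick sketch (assuming your iconv knows these charsets): the same byte 0xD6 decodes to three different letters under three different encodings.

  $ printf '\xd6' | iconv -f ISO-8859-1 -t UTF-8
  Ö
  $ printf '\xd6' | iconv -f KOI8-R -t UTF-8
  ж
  $ printf '\xd6' | iconv -f ISO-8859-7 -t UTF-8
  Φ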


The first A in ASCII stands for American, so it does have quite a bit to do with anglocentrism.


This is excessively literal and ignores the actual technical problems and circumstances, in a global context, related to ASCII and Unicode. North Korea calls itself the Democratic People's Republic of Korea. Does that mean South Korea is, necessarily, part of the DPRK, since that's what its name suggests?


I have no idea what you're talking about.

The first A in ASCII stands for American. This is because it was an American standard, created by and for Americans, who pretty much all speak English (and spend dollars $!). What about ASCII isn't anglocentric?


His point is perfectly valid. If Americans for some reason spoke Romanian, then the ASCII standard would also contain letters such as 'Â' and 'â' (perhaps pushing out some special characters to make room), so that Americans would be able to write Romanian with it. ASCII is the way it is because it is convenient for English speakers; it is not the way it is because it is somehow harder to encode 'Â' than it is to encode 'A'.


I'm curious: are the Ö and Ä characters so common that they warrant dedicated physical space? I've seen "Spanish" keyboards that have a dedicated ñ, but it seems like the letter isn't that common and would be better served via a key combo.

In fact, some IMEs already intercept " ' : etc and won't print them literally, but assume they're going into a combo with another letter.

I've talked to a few friends, and they said it was a holdover from typewriters and not really designed with modern computing in mind.


This handy page http://en.wikipedia.org/wiki/Letter_frequency has a frequency table that includes Swedish.

So, if you sort on frequency by clicking the appropriate arrows icon in the Swedish column, you can see that both Å and Ö are more frequently used in Swedish than B, C, J, Y, X, W, Z and Q. Of course, all of those have dedicated keys as well and I don't think that's going to be changing anytime soon.

The important point is that "the same letter" can have different roles in different languages. In many languages Ä is considered a "modified version" of A (as in German, I think) but still "mostly" an A when it comes to sorting and so on. In Swedish, that's just not the case; it's a separate letter. Our alphabet has 29 first-class letters (http://en.wikipedia.org/wiki/Swedish_alphabet).

If your last name is "Åkermark", you're sorting a loong way from "Andersson".
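
You can see this with locale-aware sort (a sketch, assuming a system with both locales generated): the same two names sort in opposite orders, because English collation treats Å as a variant of A while Swedish sorts it after Z.

  $ printf 'Åkermark\nAndersson\n' | LC_ALL=en_US.UTF-8 sort
  Åkermark
  Andersson
  $ printf 'Åkermark\nAndersson\n' | LC_ALL=sv_SE.UTF-8 sort
  Andersson
  Åkermark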


At least in Finnish and Swedish they are so common that they're considered letters in their own right. Å/Ä/Ö are placed in the alphabet after Z, rather than being mixed in with A and O as would usually be done with accented letters.

In Finnish, Ä is actually one of the most common letters while Ö is somewhat less frequent.

In Swedish, both "ö" and "å" are words on their own: ö means island, å is a small river or something like that :)


Is ñ less common than q, z, or j?


Notice that the fix was to use "Schrödinger’s Cat" (with U+2019, the right single quotation mark) instead of "Schrödinger's Cat" (with the ASCII apostrophe). The fix was removing the ASCII single quote, because the script checked for characters that would run afoul of string escaping. The problem as such had nothing to do with Unicode, and was actually fixed by using a Unicode character.

That doesn't mean that there aren't any Unicode bugs in other tools, but that wasn't the issue in this particular case.
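
The two characters are easy to tell apart at the byte level (assuming xxd and a UTF-8 terminal): the ASCII apostrophe is the single byte 0x27, which is what shell and SQL quoting care about, while U+2019 is three bytes in UTF-8 and harmless to both.

  $ printf "'" | xxd
  0000000: 27                                       '
  $ printf '’' | xxd
  0000000: e280 99                                  ...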


Handling of special characters is usually a bigger problem than Unicode. Just today, I tried to name a file with both a single quote and an exclamation point from the bash shell. I ended up doing it with a GUI file manager.


  mv foo \!\'


Better:

  $ touch 'j-kidd'\''s file!'
  $ ll 'j-kidd'\''s file!'
  -rw-rw-r-- 1 uid gid 0 Apr 11 13:30 j-kidd's file!
History expansion will happen on the ! in double quotes:

   $ ls "!$"
   ls "'j-kidd'\''s file!'"
   ls: cannot access 'j-kidd'\''s file!': No such file or directory
It won't happen on single-quotes:

  $ ls '!$'
  ls: cannot access !$: No such file or directory
The only issue is that you can't escape a single quote within single quotes, so you have to use the '\'' idiom (close the single-quote block, add an escaped literal single quote, then start a new single-quote block).


Yep, that's an important and maybe non-obvious behavior of the shell:

"directly adjacent strings which are double quoted, "'single quoted or even'\ unquoted\ and\ possibly\ full\ of\ escape\ sequences\ 'get concatenated and count as a single parameter.'

Running touch on the above will create one file with a very long name.
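
A smaller sketch of the same behavior (run in an empty directory): three differently quoted pieces collapse into one argument, hence one file.

  $ touch "foo "'bar'\ baz
  $ ls -1
  foo bar baz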


I was going to question you about that, but then I realized I, too, want computers to handle natural language filenames.

Or, "don't be your computer's tool!" -- the operator should accept no arbitrary limitations.


I usually wind up calling Perl for that kind of stuff.

That also lets you write confusing filenames like "foo\rbar", which can be really irritating to figure out without a GUI.


I'd argue it is worse that almost all languages cannot be used to name things, a pain that many computer users in non-English-speaking countries feel regularly. As a German, I set all my devices to US English for this reason, although this causes other problems. The apostrophe is really just related to shells and scripting and should be a bit easier to fix.


I agree and disagree. If everything had Unicode support, we'd still probably want to limit the code point set used for identifiers that are supposed to be unique. Lots of characters have visually indistinguishable glyphs ('х' could be Cyrillic kha as well as plain old x) and this already causes problems - see homoglyph attacks.
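
The two letters are indistinguishable on screen but not in the encoding (a quick check, assuming xxd): Latin x is U+0078, Cyrillic х is U+0445, so identifiers containing them are distinct.

  $ printf 'x' | xxd
  0000000: 78                                       x
  $ printf 'х' | xxd
  0000000: d185                                     ..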


In fact, some participants in the mailing list discussion proposed adding non-alphanumeric characters to future release names just to see what happens. (...) Peter Robinson proposed the project go right for the goal and choose "DROP table *;".

We need courage like this! ;)


Who's really going to use os-release in an SQL database? Real courage would be 'Fedora 20 ; rm -rf / #'


Don't forget --no-preserve-root


I believe this release name would combine the best of both:

    Fedora 20 ; rm -rf / # '; DROP table *; --


That part of the article made my day :)


I've seen a bunch of companies start to use icon/emoticon characters in their email subjects to get attention - Newegg, LinkedIn, etc. It certainly works on me. I'm afraid soon every mail will have an attention-grabbing icon and we'll just get inured to it.

I have to admit, if I launched an email campaign, I'd use them too...


It's very easy to spam-detect on non-alphanumeric chars in the subject.


There are two big lessons from this discussion, IMHO.

Firstly, broad character sets are all very well, but they have limited value if you can’t rely on everyone using your text being able to see them. Something like the ö (o-umlaut) is a reasonable thing to expect in modern Western fonts, but what about the emoticons I’m increasingly seeing in e-mails, or more advanced mathematical operators, or the cat in the title of this very article that many people commenting here can’t see in some contexts?

We need much better standardisation and prioritisation of sets of related glyphs that are, for example, permitted by the coding standards for a software project or supported by a font file. ASCII is too small, all of Unicode is too big, and picking glyphs for inclusion in fonts one at a time is too fine-grained for this purpose.

Secondly, it is crazy that with literally a million code points available in Unicode world, we don’t seem to have new control characters for “begin literal” and “end literal” to mark a range of text that should be interpreted verbatim regardless of context. Instead, we’re still using horrible hacks like quoting and escaping in environments like command lines and source code, and in text file formats like comma- or tab-separated values. These kinds of techniques are, invariably, horribly error-prone and terrible for usability; after all, in the case we’re discussing, it seems to be the apostrophe that is causing more problems than the ö! I think the computing world would be a much easier place for many, many people if there were one universal standard way of saying, “This is plain text”.
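
For a concrete example of the kind of hack I mean, this is what the standard CSV convention (RFC 4180 style) requires just to store a field containing a comma and a quote: the field must be wrapped in quotes, and the embedded quotes doubled.

  name,comment
  "Schrödinger, Erwin","He said ""maybe"" about the cat"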


> we don’t seem to have new control characters for “begin literal” and “end literal” to mark a range of text that should be interpreted verbatim regardless of context

That is not of Unicode's concern. Markup is reserved for higher-level things and is not done within Unicode. There are historical oddities like language tags and variation selectors but the former are deprecated and the latter exist to serve a particular need. You will never see semantics like those you describe applied to code points in Unicode. That's not what Unicode is for.


That is not of Unicode's concern. Markup is reserved for higher-level things and is not done within Unicode.

Well, I’m not sure what authority you’d cite for your statements about what is or is not Unicode’s concern, but right on page 3 in chapter 1 we have:

“The Unicode Standard began with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard. [...] Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption.”

That sounds familiar, given the various contexts I mentioned before. I wonder whether the cost of all the software bugs ever caused by getting quoting/escaping wrong is greater than other repeat offenders like dereferencing nulls.

Then on page 8 in chapter 2 there are several explicit text processes mentioned, including user interaction and treating text as bulk data.

I do understand that a basic principle of Unicode is to represent plain text rather than mark-up, but it seems to me that avoiding problems with representing widely used text strings in different ways using different quoting and escaping conventions is consistent with that principle and exactly the kind of lack of standardisation that Unicode should help with.

There are already more than 60 control characters in Unicode. There are also many characters that aren’t described as control characters but which convey things like layout information or rendering hints, such as the non-breaking space, soft hyphen, and direction controls. I know that language tags are deprecated, but those are rather complicated and context-specific and, as the standard itself observes, supporting them is a significant burden on implementers, so I don’t think it is reasonable to equate that situation with what I suggested.

On balance, as a practical matter, if you're working in a text-only medium, your character set is all you have. If anything else must be done using some form of mark-up built with the same character set, then a simple, unambiguous, standardised way to switch from one to the other seems entirely consistent with the general goals of the Unicode project.

Links to relevant sources:

http://www.unicode.org/versions/Unicode6.2.0/ch01.pdf

http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf


Is it too much to expect that every device should be capable of rendering every Unicode code point (latest version at release) in at least one font? With the option to fall back to that font if it tries to render something in a font that doesn't have that character?

Obviously there are issues with unicode phishing of domain names and other cases where you might want to signal that a character is "strange", but surely the memory and processing requirements for this are low enough now. It doesn't have to be a good font!


Is it too much to expect that every device should be capable of rendering every Unicode code point (latest version at release) in at least one font?

Yes, I think it probably is. Unicode is vast, and the 100,000+ characters specified in the latest standard include numerous obscure, specialised, or downright gimmicky ones.

The effort required to create just one font that supports even a crude version of each and every character is probably measured in human lifetimes. Imposing that kind of burden as a barrier to entry for any new platform seems unrealistic.


For one, all those glyphs don't need to be drawn by a single person or even coexist in the same font (they cannot, anyway, given current font formats). Of the 110182 characters Unicode 6.2 defines, Windows 8 (the only thing I can test at the moment) includes glyphs for 102082 of them out of the box. The missing ones fall mostly into either »rarely-used CJK ideograph« or »historical script« (like hieroglyphs).

Unicode serves many different needs and not all characters are necessary to support in a general-purpose OS. There are fonts to cover the missing pieces and professionals working in fields requiring those usually have them installed.

There is also little benefit to providing a single font that encompasses all of Unicode. Designers pick fonts for aesthetic reasons, and every script has different styles (although Latin, Greek and Cyrillic are fairly similar, which is why they usually are all included in every font). E.g., you have the main distinction between serif and sans-serif (for non-decorative body text). This distinction never existed for scripts like Han, Hebrew, Arabic, various Indic scripts, etc. So if you were to create only one font, which style do you choose for every script? Pan-Unicode fonts are mostly useful as fallback fonts to ensure that you can see some rarely-used glyphs, but for nearly all practical purposes they cannot be used for anything else. It's also an enormous effort beyond creating the glyphs, because you'll have to include kerning tables, define positions where combining characters appear, etc. Those issues often make such pan-Unicode fonts unusable: yes, they may contain plenty of glyphs, but they cannot be used reliably to render text that goes beyond simple scripts (and diacritic placement can even be wrong with just Latin).


Whether you try to supply a comprehensive set of characters in one font file or many isn’t really the issue, though. You’ve still got to get all those glyphs from somewhere, however they are grouped.

I’m just not sure I see a compelling argument that any new device entering the market must be able to render advanced mathematical notations, animals, and tarot cards. That’s a very high barrier to entry.

In due course, if there are freely available, good quality fonts that do the job, then by all means include them, but we’re a long way from that situation today. Even the most comprehensive efforts, things like Unifont, don’t cover all of Unicode. Also, without wishing to belittle anyone’s efforts, some of these projects are working on bitmap fonts, and it’s increasingly a vector world. Perhaps they are still useful as a rendering of last resort, but I suspect anyone working on a new platform or device has more pressing concerns.


You might be interested in GNU Unifont: http://unifoundry.com/unifont.html



Um, how could you ever show someone the string-termination character without escaping it?


Why is unicode support so bad in 2013?


It's basically Y2K without the sense of urgency. People assumed ASCII, or at least single-byte supersets of ASCII, would pretty much be the norm, so they never bothered with anything else. And since there are no apocalypse nutters breathing down people's necks to fix it, it often gets deprioritized.

Plan 9 actually tried to do something about this: it assumed Unicode for everything, and invented their own encoding for the process. The OS didn't work out so well (regrettably), but we still use their encoding: it's called UTF-8 now. Still a superset of ASCII, I guess, but at least we've gotten beyond the single-byte assumption.
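
The superset property is easy to see (a quick sketch, assuming xxd and a UTF-8 terminal): pure ASCII text encodes to exactly the bytes it always did, and only characters beyond ASCII grow extra bytes.

  $ printf 'cat' | xxd
  0000000: 6361 74                                  cat
  $ printf 'Schrödinger' | xxd
  0000000: 5363 6872 c3b6 6469 6e67 6572            Schr..dinger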


Because every implementation of anything has certain assumptions and when you're not someone who regularly juggles different scripts at once it's easy to miss a lot of things (e.g. blindly assuming that everything is ASCII).

Also part of the problem was U+0027 and its use as a quote character in shell scripts. So that was not a Unicode issue but rather a shell script injection issue and programmers do that all the time (although usually with SQL, I guess).
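
To make the injection concrete, a minimal sketch in bash (NAME is a made-up variable; PRETTY_NAME is the os-release field in question): naively splicing the value into generated shell code breaks on the apostrophe, while bash's printf %q emits a safely escaped version.

  $ NAME="Schrödinger's Cat"
  $ echo "PRETTY_NAME='$NAME'"      # the generated line is now broken
  PRETTY_NAME='Schrödinger's Cat'
  $ printf 'PRETTY_NAME=%q\n' "$NAME"
  PRETTY_NAME=Schrödinger\'s\ Cat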


You might ask why shell scripting and string escaping are so bad in 2013 as well. Those have a couple of decades more history behind them, after all.


Not all Unicode characters are indexed by search engines; for example, the 😻 apparently is not indexed: http://lwn.net/Articles/545993/


Because it's so much work for so little return.


It's kind of a dull problem, unfortunately. I think this is also why so many of the XML standards are poorly implemented.


Next release name: "' && rm -rf /"


More interested in the title... I had no idea Unicode emoticons existed.

Found a complete list here:

http://www.alanwood.net/unicode/emoticons.html


I don't know how, but the Homebrew project somehow prints a pitcher of beer in my iTerm2. Most likely Unicode, but it was not listed on the page you linked.


🍺 Beer Mug - Unicode: U+1F37A (U+D83C U+DF7A), UTF-8: F0 9F 8D BA

You can find all of them in the Emoji section of Special Characters through the Edit menu, as described in another comment. They work in the classic Terminal application as well, along with most other applications that use standard OS X APIs or have specifically added support.
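
You can also print it straight from those UTF-8 bytes (should work in any UTF-8 terminal):

  $ printf '\xf0\x9f\x8d\xba\n'
  🍺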



I have an emoji apple in the Terminal prompt on my Mac. If I don't see that splash of red I know I'm on one of my Linux servers.


Not sure how standard they are, but there are several hundred you can browse in OS X by selecting Edit > Special Characters... > Emoji in Finder. 🐈


They have been standardised in Unicode 6.


:( the cat glyph worked in Chrome on my phone but not on my desktop.


Oh, it's supposed to be a cat glyph?

Chrome on OS X 10.8 shows me a square, and so I assumed that the joke is that it represents Schrödinger's box -- you know, where he keeps the cat which may or may not be dead.


Mine only showed up as a box until I observed it.


Heh, that's actually kind of a neat, serendipitous, not-really-neat-because-this-shit-should-fucking-work-by-now-it's-2013-fer-chrissakes fallback.


To be fair, Emoji are a fairly recent addition to Unicode and they're not really ubiquitous outside of mobile phones (they started as a formal encoding of various pictograms used predominantly by Japanese carriers for SMS). No OS supplies enough fonts to cover 100% of Unicode out of the box.

That being said, Emoji seem to work at least on Mac OS X (current) and Windows 8. Alas, Chrome doesn't render them within a page (but does in the tab/title bar, maybe because the OS's rendering takes over there). Firefox and IE both show them, apparently (presumably because both are DirectWrite-based by now).


Firefox on Linux (where I assume DirectWrite isn't used) also displays it fine.


My OS X 10.7.5 Chrome shows a cat glyph in the title in the tab bar, but a square in the actual text. Quite weird.


Windows 8 Chrome does the same, here.


Different fonts, right?


Odd, considering it works fine on Firefox on OS X 10.8 for me. Perhaps your Chrome install is broken? Or does Chrome simply not support the cat symbol?


Chrome doesn't render emoji. I believe it uses its own font rendering engine or something along those lines. OS X does render emoji, so Firefox (which I believe relies on the OS font rendering engine) shows the character.

See http://apple.stackexchange.com/questions/41228/why-do-emoji-... for more details.

Also, I am not an expert about this so I could be wrong about the details.


http://imgur.com/dvHjJoa

Granted, this is on Linux, but I think it's only a matter of whether or not your font has glyphs for the emoji characters -- correct me if I'm wrong, but I don't see a technical reason to treat emoji differently from any other Unicode character.


No font can have complete Unicode support (OpenType supports 65k glyphs while Unicode has a little over 100k code points). So font rendering and layout engines usually pick different fonts for different scripts. The reason why Chrome on OS X (and Windows 8) displays Emoji in the title bar but not in the page is likely that the OS knows how to render emoji (in OS X's case even as coloured bitmaps) while Chrome's rendering engine does not. In both cases the emoji are very likely not part of the font (you ship one font that contains the icons and while rendering you can pick them from that font, instead of having to ship every font with those glyphs).
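
On Linux, this fallback machinery is inspectable through fontconfig (a sketch; exact output depends on the installed fonts):

  $ fc-list ':charset=1f63b' family    # fonts that actually contain U+1F63B
  $ fc-match -s sans | head -n 3       # the fallback chain tried when rendering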


The technical reason is that when a font is drawn, a few coherent chunks of Unicode are included and the rest are left out.


Maybe it's some special "Schrödinger's cat" glyph. My Firefox displays it properly on HN, but when I follow the link the tab bar shows a square. And title bar a cat.


That's the wavefunction collapsing.


I have yet to see the cat glyph in the title (Firefox, Chrome, or Safari). Is it U+1F63B in http://www.unicode.org/charts/PDF/U1F600.pdf ?

Edit: changed glyph based on someone mentioning it had heart-shaped eyes


Yes, it is U+1F63B. To confirm, use iconv and xxd, and copy-paste the character from the title:

  $ iconv --to-code UTF-32BE|xxd
  😻
  0000000: 0001 f63b 0000 000a                      ...;....



Hmm, is there a reference anywhere of who actually designed all these Unicode chars? I remember looking a while back to no avail; these more 'graphic' ones revived my interest. Did one poor sod do the whole lot [of Emojis], or...?


The ones you are seeing come from a font you have installed. There should be a copyright somewhere in there. For those in the Unicode code charts they come from several people and prototype fonts, usually.


Ah, of course. Thanks :)


For me, in Chrome on Mac OS 10.8, it just shows a square instead of an emoticon. Safari showed the cat.

The square looks like the QED symbol, so I thought the article was about a proof or a box. Both worked fine in my mind :)


It sounds like something weird is going on, I would expect the following behaviour: if you have a font with the cat glyph installed on your system all browsers display the cat, if you don't have a font with the glyph then no browser displays the cat.

Why does Safari show the cat while Chrome doesn't? I can think of two explanations: (1) Mac OS doesn't actually have a designated place to put fonts for all applications to find (I don't use Mac OS so I have no idea how it works, but this sounds unlikely), (2) Mac OS does have a central location for fonts but Chrome doesn't use them and just uses some fonts it comes bundled with. Is either of those two explanations correct? If not, what is going on?


> Why does Safari show the cat while Chrome doesn't? I can think of two explanations: (1) Mac OS doesn't actually have a designated place to put fonts for all applications to find (I don't use Mac OS so I have no idea how it works, but this sounds unlikely), (2) Mac OS does have a central location for fonts but Chrome doesn't use them and just uses some fonts it comes bundled with. Is either of those two explanations correct?

I suspect it's an OS-related issue, as the cat shows in Chrome on Kubuntu 12.10 for me.


There is a reason the Latin alphabet is the most used in the world, and why 7-bit ASCII is the standard for computing: simplicity. They are highly distinguishable characters consisting of mostly straight lines and simple curves.

Diacritical marks (training wheels) and Unicode break this simplicity; you cannot program in a language with many characters that are hard or impossible to distinguish, or with crazy characters that can switch the direction of text and other nonsense. Unicode is a luxury to pretty things up for end users, not something to do serious work in.


>, and why 7bit ascii is the standard for computing.

The US.

> Simplicity.

Oh...

> Diacritical marks (training wheels) and unicode break this simplicity, you cannot program in a language with many characters that are hard/impossible to distinguish.

Many languages get by with their communication in spite of having things like "spots over the o's" or whatever. I have no problem distinguishing them. Do you have experience with reading such languages? Or are you simply blowing hot air?

Ever looked at typewriter font for l and 1? Yeah, these things are not historical accidents at all...

> Unicode is a luxury to pretty things up for end users, not something to do serious work in.

It's a luxury for end users... yes, non-English users should count their blessings when they are able to use their whole alphabet. Part of this issue was correctly spelling the name of an Austrian physicist, not someone trying to write "cat" with some esoteric Unicode that looks like a cat.




