UTF-8 – "The most elegant hack" (hackaday.com)
315 points by raldu on Sept 28, 2013 | 174 comments



I continue to be amazed by Ken Thompson's name popping up in unexpected places even after all these decades. When I first learned that the co-creator of Unix also invented regular expressions, I was surprised. Then I came to know about that elegant hack of inserting malware into a compiler in a way that's practically invisible. And then ed... Then chess... And then UTF-8... And then Go... It feels like Ken Thompson is the Newton of computer science. If you randomly open a physics book to a section 3-4 times, you are very likely to read something about Newton's contributions. What's even cooler about him is this humble, down-to-earth line from his Turing Award lecture:

I am a programmer. On my 1040 form, that is what I put down as my occupation. As a programmer, I write programs.

It would be great to create a website that showcases all of his contributions in detail.


Nitpick: Thompson didn't invent regular expressions (that achievement belongs to Stephen Kleene), but he figured out how to parse them efficiently with a computer, turning them from a mathematician's toy into something practical.

I don't recall if Thompson is responsible for "modern" regular expression syntax--at least the `*` is borrowed from Kleene's notation. I'd look at Thompson's paper and find out, but I'm having trouble accessing my ACM subscription.


Thompson is at the very least responsible for the ^ and $ metacharacters.


Thompson's original paper used the example expression:

    a(b|c)*d
without any description of what it meant, so I take it that at the very least, that much was borrowed from Kleene. I am having a hard time reading it, to see if anything else was implemented, as I am unfamiliar with Algol-60 or IBM 7094 assembly.

He cites Kleene's paper, saying that he assumes the reader is familiar with regular expressions; I'd look at that paper and see what Kleene described, but I can't find a copy of that.

Thompson also cites a paper by Brzozowski. The abstract for that paper includes: "Kleene's regular expressions, which can be used for describing sequential circuits, were defined using three operators (union, concatenation and iterate)"

Which I take to mean that Kleene only came up with the syntax expressed in the example expression at the beginning.


If Ken Thompson is Newton of Computer Science (as opposed to say, unix), who is Niklaus Wirth?


The Euler of Computer Science, of course:

http://en.wikipedia.org/wiki/Euler_%28programming_language%2...


Or indeed Claude Shannon.


Shannon is more like the Feynman of Computer Science: http://en.wikipedia.org/wiki/Claude_Shannon#Hobbies_and_inve...


The fixed point is, of course, en.wikipedia.org/wiki/John_von_Neumann


Shannon was an electrical engineer by training and while all of his works have applications in computer science, their origins were solutions to engineering or math problems.


We should be grateful to UTF-8 for saving us from a fixed-width multibyte Internet (so persistently pushed by MS, IBM and others), which would have made little sense in this predominantly 8-bit world. I remember someone saying at the time: the future is 16-bit modems and floppy disks anyway, so why not switch to Unicode now? Somehow that sounded absurd to me.

Anyway, 20 years later, hardware is still mostly 8-bit, and basically nobody cares about Unicode apart from font designers and the Unicode Consortium.

(On a side note, UTF-8 as a hack is a distant relative to Huffman encoding, itself a beautiful thing.)


"16 bits should be enough for anybody." --Anonymous

The funny thing is that 16 bits wasn't enough.

We thought it would be, back then. Just use 16 bits for each character and we can return to the joy and simplicity of a fixed width encoding.

That didn't work out. UCS-2 was a fixed width 16-bit encoding, but it had to be extended into the variable width UTF-16 to get more code points. You get the bloat of 16-bit data, incompatibility with ASCII, and variable width characters too.

You have to go all the way to UTF-32 to get back to a fixed width encoding.

So UTF-8 really was a smart move. Bite the bullet on the variable width, but make it so you can always find the beginning of the next character even if you start in the middle of one, and make it backward compatible with ASCII too. Genius.

https://en.wikipedia.org/wiki/UTF-8

https://en.wikipedia.org/wiki/UTF-16

https://en.wikipedia.org/wiki/UTF-32


Please correct me if I'm wrong, but you don't really get fixed width in any way, do you? Because Unicode itself is not fixed width. There are situations where one character on the screen is represented by two consecutive Unicode code points.


Or a lot more than two, or even multiple characters for a single code point.

UTF-32 appears to simplify things a lot, but in practice O(1) indexing of code points and simple iteration over code points is not actually all that useful, because code points are not characters.


There are multiple factors at play here and it ultimately comes down to what you need:

If you need to access code points, then UTF-32 gives you a nice fixed-width (albeit quite large) encoding to work with. Some applications or libraries may use UTF-32 internally to speed things up.

If you deal with things on screen or what humans perceive as character (graphemes) then there is nothing really that can help you at the encoding level because graphemes can be arbitrarily long with regard to code points: e̬̱ͤ̈́̿̂̅́͟x̰̞̻̼̻̼͉̰ͫͧ̑͂͊͜ͅa̶̳͖͔̞̰͈̯͓͆̓̎̉̏̎̚m̛̮̩͙ͦ͗͆̄̋́̄p͎̥̠͈̮͚̾ͅl̛̗̈̄e̴̞̫͖̝̘̋̑̄̎ͥͬ̾ (this probably doesn't even render properly).

Sadly, most APIs use either UTF-8 or UTF-16 and consider code units the most important entity, not code points. And that's inexcusable and frankly a mess, because I can count the number of times I needed to access a code unit instead of a code point on zero hands.
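
To make the code point / code unit / grapheme distinction concrete, here's a quick Python 3 sketch (purely illustrative):

  s = "e\u0301"                           # 'e' + combining acute accent
  print(len(s))                           # 2 code points
  print(len(s.encode("utf-8")))           # 3 UTF-8 code units: 0x65 0xCC 0x81
  print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
  # ...and yet a reader perceives a single grapheme. No encoding fixes that for you.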


> If you deal with things on screen or what humans perceive as character (graphemes) then there is nothing really that can help you at the encoding level because graphemes can be arbitrarily long with regard to code points:

> e̬̱ͤ̈́̿̂̅́͟x̰̞̻̼̻̼͉̰ͫͧ̑͂͊͜ͅa̶̳͖͔̞̰͈̯͓͆̓̎̉̏̎̚m̛̮̩͙ͦ͗͆̄̋́̄p͎̥̠͈̮͚̾ͅl̛̗̈̄e̴̞̫͖̝̘̋̑̄̎ͥͬ̾

How many characters am I supposed to perceive this as, anyway? Some of those tiny squiggles are in fact letters (I mean they look like letters; I see an e, m, u, o, i and r in there). What if you made a crossword this way?


It says "example" with a pile of combining characters (77 to be precise): http://i.imgur.com/acyHBkR.png


I've experienced crosswords where e.g. the word going across uses a diacritic on a particular letter ("élan") and the word going down doesn't. It drives me crazy.


FYI, that is the way crosswords work in Italian: accents are considered irrelevant.


For index-based operations, code units are more useful than code points.

a = s.indexOf("e̬̱ͤ̈́̿̂̅́͟")

b = s.indexOf("m̛̮̩͙ͦ͗͆̄̋́̄")

s[a:b]

This operation is fast with code unit indices and slow with code point indices.
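
Roughly what that looks like if the code units are UTF-8 bytes (a Python sketch of the idea, not any particular API):

  text = "naïve café".encode("utf-8")  # operate on code units (bytes) directly

  a = text.find("ï".encode("utf-8"))   # searches return byte offsets...
  b = text.find("é".encode("utf-8"))
  chunk = text[a:b]                    # ...so slicing with them is O(1)
  print(chunk.decode("utf-8"))         # 'ïve caf'

  # Converting those offsets into code point indices would mean scanning the
  # string from the start, which is the slow part the parent comment refers to.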


If you use stream-safe normalization rules, then you have a worst-case limit of 30 combining code points in a row. You could even make an array of 128-byte single-character strings if you wanted.


Please define what you mean by code point and code unit.


I mean the same thing that Unicode uses these terms for.

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

From http://www.unicode.org/glossary/


UTF8 != unicode.

Unicode is a standard that maps a giant number of letters to unique numbers.

UTF8 (and UTF16, UTF32) are various encodings of that. Unicode itself is not an encoding; unicode just says "letter 'č' is 269"; UTF-8 says "We will represent it as two bytes, first is C4, second is 8D".

UTF8 is indeed variable width; UTF16 and UTF32 are, AFAIK, fixed width. Both use unicode standard.


Even if we're talking about code points and not characters, UTF-16 is variable width.

UTF-8 is variable width using 8, 16, 24, or 32 bits per code point.

UCS-2 is fixed width using 16 bits per code point.

UTF-16 is variable width using 16 or 32 bits per code point. It's an extension of the fixed width UCS-2 format: every valid 16-bit UCS-2 code point is also a valid UTF-16 code point, but UTF-16 also includes 32-bit code points.

UTF-32 is fixed width using 32 bits per code point.
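
A quick way to see those widths side by side, counting code units (Python; output noted in the comments):

  for ch in ["A", "é", "中", "😀"]:
      u8  = len(ch.encode("utf-8"))           # 8-bit code units
      u16 = len(ch.encode("utf-16-le")) // 2  # 16-bit code units
      u32 = len(ch.encode("utf-32-le")) // 4  # 32-bit code units
      print(f"U+{ord(ch):04X} utf-8:{u8} utf-16:{u16} utf-32:{u32}")
  # U+0041 -> 1/1/1, U+00E9 -> 2/1/1, U+4E2D -> 3/1/1, U+1F600 -> 4/2/1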


UTF-8 is variable width using 8, 16, 24, or 32 bits per code point.

I wondered why you didn't write that "UTF-8 is variable width using 8, 16, 24, 32, 40, or 48 bits per code point", because that's how Prosser/Thompson/Pike's UTF-8 idea was proposed, as shown in the "most elegant hack" article.

And here's the answer from Wikipedia: In November 2003 UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

Even though they eliminated the 5- and 6-byte sequences, UTF-8 can still represent all of Unicode's 1,114,112 code points.

I'm mentioning this because others probably are wondering the same thing.
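
For anyone checking the arithmetic behind those numbers (a small sketch):

  # Code points run U+0000 through U+10FFFF: 17 planes of 65,536 values each.
  print(0x10FFFF + 1)  # 1114112
  print(17 * 2**16)    # 1114112

  # A 4-byte UTF-8 sequence carries 3 + 6 + 6 + 6 = 21 payload bits,
  # i.e. up to U+1FFFFF, so it comfortably covers U+10FFFF even after RFC 3629.
  print(2**21)         # 2097152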


How sad. One defective format limits another great format.


We won't see the code point space run out anytime soon this time, even with questionable additions like Emoji. It took us 20 years to get to the current state, and the rate at which characters are encoded has slowed down significantly since the early days (mostly because most things are covered already).


UTF8 is indeed variable width; UTF16 and UTF32 are, AFAIK, fixed width. Both use unicode standard.

Your knowledge is wrong.

As of 2012, there are 110,182 assigned Unicode characters, and 137,468 more which have been reserved for private use.

With 16 bits there are only 65,536 possible values, and so UTF-16 has to be variable width. (Though it was originally intended to be fixed width. A misconception unfortunately embedded in the design of languages from the time such as Java.)
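
The variable width shows up as soon as a code point falls outside the 16-bit range; a quick look at the surrogate-pair mechanics (Python here, but the same bytes apply to any UTF-16 implementation):

  ch = "\U0001F600"       # U+1F600, beyond the 65,536 16-bit values
  units = ch.encode("utf-16-be")
  print(len(units) // 2)  # 2 code units: a surrogate pair
  print(units.hex())      # 'd83dde00' -> 0xD83D 0xDE00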


What the comment you're replying to is saying is that glyphs may be represented by more than one code point. So sure, UTF-32 has a fixed width for each code point, but it's not quite so clear how useful that is.


Oh, in that case I misunderstood parent. Sorry about that.


The Unicode standard nowadays defines UTF-8, UTF-16, and UTF-32. It speaks of "characters" which are assigned (i.e., mapped) to "code points" (i.e., U+0000 to U+10FFFF) and character encodings which encode scalar values (all code points except the surrogate ranges, therefore including all characters).


Characters are not fixed width, no, but codepoints are. The idea of "what exactly constitutes a character" gets pretty fuzzy in some cases anyway with utility codepoints.


Even Windows-format ASCII isn't fixed width - CR-LF is a two-byte 'codepoint'...


You can think of it as two separate instructions, CR to move horizontally to the beginning of the line, and LF to move down by one line.


You can think of it that way until you say, try to take the first n characters of a string and wind up with your string terminated with an orphaned CR without its accompanying LF.

Outside of ANSI terminals, teletypes and dot matrix printers, they really don't work like that any more. If they did you'd be able to substitute LFCR for CRLF, or use multiple LFs with a single CR to leave multiline gaps - and multiple CRs would be redundant. Try sending a CSV file with LFCR line-endings, or HTTP headers separated by CRCRCRCRLF... see if that works everywhere reliably. Also, CRLF would be an acceptable substitute everywhere for UNIX LF line-endings, but clearly it's not. Fact is, CRLF is a two-byte magic number in a lot of places.


Not really because a single CR or LF does not do a real CR or LF.


It used to, back in The Day, when you had hard-copy consoles or even dot matrix printers. I think sending CR (but not LF) and then printing accents over top of the original line was a way of printing some European languages with just ASCII.


  $ man terminfo | grep cursor_down | head -1
         cursor_down                   cud1       do        down one line
  $ tput cud1 | hexdump -C | cut -f 1-20 | head -1
  00000000  0a                                                |.|
  $ man terminfo | grep carriage_return | head -1
         carriage_return               cr         cr        carriage return (P*)
  $ tput cr | hexdump -C | cut -f 1-20 | head -1
  00000000  0d                                                |.|
If you ever write a bash script that does more advanced things with the terminal, you'll get familiar with tput, termcap and terminfo pretty quickly.


It does in ASCII, and (consequently) in any ANSI-conforming terminal [emulator].


> You have to go all the way to UTF-32 to get back to a fixed width encoding.

UTF-21, actually, as there are no code points above 0x10FFFF. Powers of two have other nice properties, though.


There is no UTF-21. The number after UTF signifies the size of the code unit, which in UTF-32 is 32 bits wide. That code points are only defined up to U+10FFFF is irrelevant for that; the code unit is still 32 bits.


Yeah, I was joking. UTF-32 is a waste of 11 perfectly good bits[1], so you could just as well write each code point as 21 bits (0o0000000–0o7777777), and call it UTF-21. 24 bits (3 × 0x00–0x7F) would be fine too, but 21 is the smallest possible fixed-width Unicode encoding.

[1]: https://github.com/evincarofautumn/protodata


> so persistently pushed by MS

People still flame how MS uses 16-bit characters, which IMO is a bit of an unfair criticism. (Full disclosure: I used to work for them. But I was never a koolaid drinker.) I don't think they have any religious aversion to UTF-8, it's just history. By adopting Unicode as early as they did, they made this decision before UTF-8 existed and they stuck with it.

Probably the bigger crime is not switching to UTF-8 as the default "ANSI"/"multi-byte" codepage (to use the Windows terms). This means C programmers who are not Windows experts often end up writing non-Unicode-safe software because they expect every string to be char*.

(Also as has been mentioned UTF-16 is not fixed. And even in UTF-32 there are instances where a single glyph takes multiple code points - the decomposed accent marks are the one example I know, possibly there are others?)


IMHO it's completely fair to criticize others for the externalities that they create. UTF-16 is a design flaw, and Microsoft repeatedly decides to release a new version of Windows that fails to meaningfully address that flaw. To date, MS hasn't communicated a vision for a future of Windows where nobody has to deal with UTF-16, so we can only conclude that their vision is that for the next 100 years, the rest of us will still be paying to support their use of UTF-16.

This isn't just about UTF-16 either. Think about how much IE6 has cost everyone who's made a website in the last 10 years. Microsoft's penchant for building and marketing high-friction platforms is a huge drain on innovation (and a big barrier-to-entry for people who want to learn programming), and the company deserves an enormous amount of criticism for it.


Can you explain how utf-16 is a design flaw? This sounds totally kooky to me. Are you thinking of the more limited UCS-2? Utf-16 represents the same set of chars as utf-8. It's true that lots of errors can occur when people assume 1 wchar = 1 codepoint = 1 glyph but utf-8 has similar complexity and I've seen plenty of people screw it up too. It sounds much more to me like you are saying that "not working how I am used to" is the same as "design flaw".

Edit: put another way, the NT kernel since its inception represented and continues to represent all strings as 16 bit integers, saying that they need to "meaningfully address this" is like saying Linux should migrate away from 8-bit strings; there is no reason to do it. A lossless conversion exists and I fail to see it as a big deal, it's just a historical thing because NT's initial development predates UTF-8.


Imagine a world where we don't have to convert between multiple charsets, or even think about them, beyond "text" vs "binary". That world is perfectly achievable, it's basically already happened outside the Microsoft ecosystem.

OSX and Linux default to using UTF-8 everywhere, Ruby on Rails is UTF-8 only, the assumed source encoding in Python 3 is UTF-8, URI percent-encoding is done exclusively using UTF-8, HTML5 defaults to UTF-8... The list goes on and on. "UTF-8" is becoming synonymous with "text".

In an all-UTF-8 world, if you want to build a URI parser, or a CSV parser, or an HTML parser, or really anything that does any kind of text processing (except rendering), you can just assume ASCII and everything will work as long as you're 8-bit clean. Even non-US codepages are all more-or-less supersets of ASCII. The only major exceptions are EBCDIC and UTF-16.
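
That works because in UTF-8 no byte of a multi-byte sequence ever falls in the ASCII range, so a byte-oriented parser can split on ASCII delimiters without decoding anything. A tiny sketch of the idea (Python):

  row = "名前,東京,42\n".encode("utf-8")
  fields = row.rstrip(b"\n").split(b",")      # naive byte-level CSV split
  print([f.decode("utf-8") for f in fields])  # ['名前', '東京', '42']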

Because of UTF-16, we have to build charset conversion/negotiation into every single interface boundary, where we could otherwise just be 8-bit clean except in places where text actually has to be rendered.

It's completely unnecessary friction that undermines one of the big motivations for creating Unicode in the first place: coming up with a single standard for text representation, so that we don't have to deal with the mess of handling multiple charsets everywhere.

None of these decisions is a big deal in isolation, but it's a death by 1000 cuts. When you have mountains of unnecessary complexity, it only serves to make programming inaccessible to ordinary people, which is how we get awful policies like SOPA, software patents, and the like.


So much hand-wavy text in there...

There's no inherent, fundamental reason in the universe other than momentum that says you've got to use an 8-bit encoding. You could serve to be a bit more honest about how arbitrary that call really is and where it comes from. In fact many higher level languages even outside the MSFT ecosystem are happily using UTF-16 everywhere natively without problems, they just do a small conversion step at a syscall to run on your beloved Unix. AFAIK the JVM is working this way for example. This "imagine a world" game can just as easily go the other way: "imagine a world in which everything is UTF-16 ...". I think it's disingenuous to say that it matters one way or the other, that your favorite is better and all others, even if they are a 1-to-1 mapping with your favorite, constitute a "design flaw". Because information theory does not care; it is just a different encoding for the same damn thing. In truth it makes little difference as long as you are consistent, and you are whining simply because not everybody picked the same thing as your favorite.

Your claim that UTF-16 is a "design flaw" has not been validated, just that you don't like encodings ever to change and UTF-16 isn't your favorite. It's very hard to consider that anything other than whining. I thought we software types are supposed to be big on abstractions and coming up with clever ways of managing complexity? It seems rather rigid to say you'll only ever deal with one text encoding.

Lastly even in 2013 this is a total lie:

> OSX and Linux default to using UTF-8 everywhere,

If that's true then how come on virtually every Unix-like system I've set up for the last ~15 years one of the first things I've had to do is edit ~/.profile to futz around with LC_CTYPE or whatever to ask for UTF-8? I am pretty sure every Unix-like system I have set up gave me Latin1 by default, even quite recently.

I think you are underestimating the extent to which UTF-8 is a crude hack designed to avoid rewriting ancient C programs that did very much the wrong approach to localization. There was a time before UTF-8 existed and became popular when it was a fairly common viewpoint that proper Unicode support involved making a clean break with the old char type. You are right to say that UTF-8 "won" in most places but the fact that the NT kernel or the JVM use 16-bit chars reflects that prior history. I think a more mature attitude would be to accept this, that it came from a time and a place and is a different way of working, rather than call it "wrong".


> There's no inherent, fundamental reason in the universe other than momentum that says you've got to use an 8-bit encoding.

"Other than momentum"? That's a double-standard: Momentum is the only reason why UTF-16 is still relevant today. Ignoring momentum, UTF-8 still has a bunch of advantages over UTF-16, namely that endianness isn't an issue, it's self-synchronizing over byte-oriented communication channels, and it's more likely to be implemented correctly (bugs related to variable-length encoding are much less likely to get shipped to users, because they start to occur as soon as you step outside the ASCII range, rather than only once you get outside the BMP). What advantages does UTF-16 have, ignoring momentum?

The comparison between UTF-8 and UTF-16 is adequately addressed here: http://www.utf8everywhere.org/

I don't want to debate abstract philosophy with you, anyway. Momentum may be the reason, but there's no plausible way that UTF-16 is ever going to replace octet-oriented text. The idea that UTF-8 and UTF-16 are equivalent in practice is a complete fantasy, and I'm arguing that we should pick one, rather than always having to manage multiple encodings.

> I thought we software types are supposed to be big on abstractions and coming up with clever ways of managing complexity?

The best way to manage complexity is usually to adopt practices that tend to eliminate it over time, rather than adding more complexity in an attempt to hide previous complexity. It doesn't matter how "clever" that sounds, but it's generally accepted that it takes more skill and effort to make things simpler than it does to make them more complex.

> It seems rather rigid to say you'll only ever deal with one text encoding.

It seems rather rigid to say you'll only ever deal with two's-complement signed integer encoding. It seems rather rigid to say you'll only ever deal with IEEE 754 floating-point arithmetic. It seems rather rigid to say you'll only ever deal with 8-bit bytes. It seems rather rigid to say you'll only ever deal with big-endian encoding on the network. It seems rather rigid to say you'll only ever deal with little-endian encoding in CPUs. It seems rather rigid to say you'll only ever deal with TCP/IP.

Why not eventually only ever deal with one text encoding? There's no inherent value in paying engineers to spend their time thinking about multiple text encodings, everywhere, forever.

Remember that we're talking about the primary interfaces for exchanging text between software components. Sure, there are occasions where someone needs to deal with other representations, but the smart thing to do is to pick a standard representation and move the conversion/negotiation stuff into libraries that only need to be used by the people who need them. This allows the rest of us to quit paying for the unnecessary complexity, and incentivizes people to move toward the standard representation if their need for backward compatibility doesn't outweigh the cost of actually maintaining it.

> I am pretty sure every Unix-like system I have set up gave me Latin1 by default, even quite recently.

Ubuntu, Debian, and Fedora all default to UTF-8, and have for several years now. You're going to have to name names, or I'm going to assume that you don't know what you're talking about.

> I think you are underestimating the extent to which UTF-8 is a crude hack designed to avoid rewriting ancient C programs that did very much the wrong approach to localization.

Really? How would you have done it so that UTF-16 wouldn't have broken your program? Encode all text strings as length-prefixed binary data, even inside text files? It's ironic that you say it's immature to call UTF-16 "wrong", but you've basically just claimed that structured text in general is "wrong".

Let's not forget that nearly every important pre-Unicode text representation was at least partly compatible with ASCII: ISO-8859-* & EUC-CN were explicitly ASCII supersets, Shift-JIS & Big5 aren't but still preserve 0x00-0x3F, and even in EBCDIC, NUL is still NUL. Absent an actual spec, it was no less reasonable to expect that an international text encoding would be ASCII-compatible than it was to expect that such a spec would break compatibility with everything. Trying to anticipate the latter would rightly have been called out as overengineering, anyway.

In that environment, writing a CSV or HTML parser that handles a minimal number of special characters and is otherwise 8-bit clean is exactly the right approach to localization.

Also, "ancient C programs"? Seriously? Are you really saying that C and its calling convention were/are irrelevant?

> There was a time before UTF-8 existed and became popular when it was a fairly common viewpoint that proper Unicode support involved making a clean break with the old char type.

Sure, and it was a fairly common viewpoint that OSI was the right approach, and that MD5 was collision-resistant. Then, we learned that all of these viewpoints turned out to be wrong, and they were supplanted by better ideas. UTF-8 became popular because it worked better than trying to "redefine ALL the chars".

> You are right to say that UTF-8 "won" in most places but the fact that the NT kernel or the JVM use 16-bit chars reflects that prior history.

The JVM isn't comparable, because its internal representation is invisible to Java developers. I can write Java code without ever thinking about UTF-16, and a new version of the JVM could come out that changed the internal representation, and it wouldn't affect me. Python used a similar internal representation, and recently did change its internal representation. Most Python developers won't even notice.

If NT used UTF-16 internally, but provided a UTF-8 system call interface, I wouldn't care. The problem is that, in 2013, people writing brand new code on Windows still have to concern themselves with UTF-16 vs ANSI vs UTF-8. This is a pattern of behavior at Microsoft, and that's what I'm criticizing.

> I think a more mature attitude would be to accept this, that it came from a time and a place and is a different way of working, rather than call it "wrong".

Look, I understand that mistakes will be made. My criticism isn't that the mistakes are made in the first place, but that Microsoft doesn't appear to have any plan to ever rectify them. The result is a consistent increase in friction over time, the cost of which is mostly paid for by entities other than Microsoft.

One could argue that Microsoft's failure to manage complexity in this way is one of the reasons why Linux is eating their lunch in the server market. Anecdotally, it's just way easier to build stuff on top of Linux, because there's a culture of eliminating old cruft---or at least moving it around so that only the people who want it end up paying for it.

As for your ad hominem arguments about "attitude" and "maturity", I could do without them, thanks. They contribute nothing to the conversation, and only serve to undermine your credibility. Knock it off.


> The idea that UTF-8 and UTF-16 are equivalent in practice is a complete fantasy,

Except for that most obscure of details: they represent the same characters.

> Ubuntu, Debian, and Fedora all default to UTF-8, and have for several years now. You're going to have to name names,

In the room where I'm sitting now I have Debian, OpenBSD and Arch systems. All of these defaulted to Latin-1 when I installed them, in the current decade.

> I can write Java code without ever thinking about UTF-16

Patently false. A char in java is 16 bits. String.length() and String.charAt() use utf-16 units, meaning surrogate pairs are double-counted.

> As for your ad hominem arguments about "attitude" and "maturity", I could do without them, thanks.

I am sorry, sometimes I am blunt and overstated about these characterizations, but it really did seem like the shoe fits. You seem to have an impulsive defense of UTF-8 and an inability to see that there might be merits or tradeoffs in an alternative. Throughout all of this, I am not saying that UTF-8 is a bad encoding, I am just saying it's goofy to "attack" UTF-16 for being different.


Standard ruby is not utf8-only and not likely to be any time soon, because converting text into unicode is lossy due to han unification. So like it or not, you're going to need to deal with non-unicode text for a while yet.


I disagree:

- Ruby isn't UTF8-only, but Rails is.

- Unless I'm building a site that caters to CJK languages where Han unification is unacceptable, I'm not going to need to deal with non-unicode text. In 4 years of Rails development, I never set $KCODE to anything except 'UTF8'.

- Any solution to the Han unification problem is almost certainly going to happen within Unicode, or at least in some Unicode private-use area that can be encoded using UTF-8.

- As a last resort, it's still easier to use something like "surrogateescape"/"UTF-8b" to pass arbitrary bytes through the system than it is to support multiple text encodings at every single place where text is handled.


Very interesting, I guess it's just a function of not knowing much about that part of the world but I'd never heard of this problem before. http://en.wikipedia.org/wiki/Han_unification

Reading that article it's a wonder they didn't come up with a mode-switching character, similar to what the article says about ISO/IEC 2022 or (cringe) Unicode bi-di chars.

Unicode is one of those things that naively sounds great and everyone talks about as solving every problem, but it ends up having lots of warts...


Ah OK, reading again I see that Unicode does provide a way to disambiguate with combining characters.


You need two versions of functions. You need two modes for lots of text programs to use. It doubles the space for things like storing identifiers for programming languages, which normally would use up to 7 bits (for a big percentage of languages).

UTF-8 can work with all 7-bit ASCII characters, and that's what it's great about it.


> You need two versions of functions.

For Windows this is entirely a compatibility thing. One could imagine a world in which that was not strictly necessary. A good best practice for a Win32 app is to always use 16-bit strings when calling system APIs and pretend the "ANSI" versions don't exist; I would not recommend anything else.

In NT the 8-bit versions generally do nothing but convert to 16-bit and call the "real" function. In recent versions (I think Win7 was the first to do this), AFAIK the 8-bit shims typically exist in another module, so if you don't use them they don't get loaded. In NT on ARM the 8-bit shims are not even there.

> It doubles the space for things like storing identifiers for programming languages

Which is why GetProcAddress() still takes an 8-bit string. Just because the kernel (and hence the syscall interface) uses 16-bit everywhere doesn't mean you can't use 8-bit strings in your own process, or that you can't selectively pick what makes sense for your use.

(By the way, none of what I'm saying is a criticism of UTF-8. I think it's a very clever encoding.)


If they were using UTF-8, wouldn't just one version of the function suffice? That is, instead of using a shim, just determine which version to call based on the encoding.


There would be no need for the 16-bit versions if they had used UTF-8. The same function signatures would work for ASCII, ISO-8859, or UTF-8.


NT development began before UTF-8 existed. Surely you're not suggesting that the authors should have used an encoding that did not yet exist?


Another example of UTF-16 brokenness: it is byte-order dependent; that is, the programmer must care about big-endian vs little-endian encoding of 16-bit numbers. This caused another hack: using non-text BOM (Byte Order Mark) symbols to denote which endianness is used.

I think it's very ugly and may be OK for the Windows-x86 world, but not acceptable for the variety of platforms connected to the Internet.

UTF-8 avoids endianness problems completely.
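
Concretely (Python; note that the plain 'utf-16' codec prepends a BOM in the machine's native order):

  s = "hi"
  print(s.encode("utf-16-be").hex())  # '00680069'      big-endian
  print(s.encode("utf-16-le").hex())  # '68006900'      little-endian
  print(s.encode("utf-16").hex())     # 'fffe68006900'  on a little-endian box: BOM + data
  print(s.encode("utf-8").hex())      # '6869'          same bytes on every machine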


A legit complaint. Though I'll say, in early Unicode before UTF-8 existed, this was probably seen as less of a problem than using the older non-Unicode charsets, and probably rightly so.

Also, if you keep a clean separation between serialization and in-memory representation, the BOM hack is not such a big deal. Serialize code can write a BOM, and de-serialize can do the char swapping and omit the BOM in RAM, then in-memory it's always host byte order and without BOM. Or you can just have the on-disk/over-the-wire format be UTF-8, which is common today on platforms where UTF-16 is used in RAM. (I'll repeat a point that seems to be lost on a lot of people in this thread, that Microsoft adopted 16-bit chars for NT before UTF-8 existed, so that was not an option.)


I still remember the day that the "View > Encoding" menu of browsers was absolutely critical in browsing the web if you browsed enough sites that were outside of your normal locale (your default codepage in Windows). The menu is still there, at least in Firefox & Chrome, but I haven't used it in a few years, as most things Just Work™ for modern websites, largely thanks to UTF-8.


The eventual browser default is still (sadly) browser locale dependent, so the menu exists for sites designed for a different locale. I and some others still hope one day we can get rid of that (likely replaced with per domain, primarily TLDs, defaults).


In IE, you mean? A standards-compliant browser shouldn't need to guess at the encoding. It can be found in the <meta> tag or (if that wasn't supplied or we're not dealing with HTML) in the Content-Type header. If it isn't found there, I believe HTML says, "it's ISO-8859-1 (latin1)".

That said, certain browsers have been known to guess.


IE, Firefox, Chrome, Opera, Safari all have locale-dependent defaults for HTML. The process used is essentially: user-override, BOM, higher-level metadata (e.g., Content-Type in HTTP), meta pre-parsing, and then a locale-dependent default (Windows-1252 in most locales). Anything labelled "ISO-8859-1" is actually treated as Windows-1252 in browsers (they differ only in ISO-8859-1's C1 range, so Windows-1252 is a graphical superset).

http://www.whatwg.org/specs/web-apps/current-work/multipage/... has the detail of what all browsers implement nowadays.


I love that UTF-8 kind of followed in the tradition already set by the guys who came up with ASCII. ASCII itself is brilliant in that they aligned characters to be easy to test. Take upper and lower case 'A' which are 0x41 and 0x61 respectively:

  0x41  0100 0001
  0x61  0110 0001

So if you want to test whether a character is uppercase or not, you can just check the 0x20 bit (it's clear for uppercase, set for lowercase).

Another one is how the number digits got lined up, which start at 0x30 for 0 to 0x39 for 9, which look like:

  0x30  0011 0000
  0x31  0011 0001
  0x32  0011 0010

etc.

You can just mask off the high nibble (c & 0x0F) to convert from ASCII to an int, which is pretty cool.
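
Both tricks in runnable form (a quick Python sketch):

  c = ord("G")
  print((c & 0x20) == 0)  # True -> uppercase (assuming we already know it's a letter)
  print(chr(c | 0x20))    # 'g'  -> force lowercase
  print(chr(c ^ 0x20))    # 'g'  -> toggle case

  d = ord("7")
  print(d & 0x0F)         # 7    -> digit value, high nibble masked off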


Anyway, 20 years later, hardware is still mostly 8-bit, and basically nobody cares about Unicode apart from font designers and the Unicode Consortium.

Err, failure to parse sarcasm?

Most people on most networks use it daily. On mobile networks it's used for tens of thousands of messages per hour, as SMS messages in almost every country with a non-Romanized script use UCS-2 encoding. I've had to write a raw SMS PDU generator for a whole bunch of human languages, including right-to-left ones like Farsi and Arabic, or Chinese, Japanese, Korean and Thai. Furthermore, Unicode is now used by everyone in China, having finally mostly migrated from GB encoding. That means "most of the internet uses it daily".

I guess your point may have been sarcastic, because the above is obvious to me. Either that, or you have a really bad case of a merry centrism.


I think he means "nobody cares about Unicode" in the sense that nobody cares about the technology itself. The guy in Japan probably loves that he can make a website in Japanese and have it look right on any system in the world. He probably loves even more that this just works without him having to know or care about encoding systems. Even most programmers using modern languages and frameworks can get away with not paying much attention to encodings and still probably have good support for international characters.


OK, thanks. So you've translated that "nobody cares" apparently means "everyone uses it and it works really well". And that's supposed to be a point? God, I must be getting old.

(Edit: I am not getting frustrated with you, rather my incapacity to grok the original post.)


I don't know why you are getting frustrated with him, he just (correctly, as far as I can tell) interpreted the above comment for you.


> The guy in Japan probably loves that he can make a website in Japanese and have it look right on any system in the world.

That guy is probably using Shift-JIS.


> Unicode is now used by everyone in China, having finally mostly migrated from GB encoding

I doubt it.

Actually Baidu and other major sites are still on GB2312/GB18030. There are practical reasons: most of the frequently used (3000+) Chinese characters take two bytes in GB2312, but three in UTF-8. Many URIs contain escaped GB2312 instead of escaped UTF-8. For mobile users (not only 3G, but more significantly GPRS/EDGE/WAP), every bit counts.
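
For example (Python; just the byte counts the parent comment is talking about):

  ch = "中"
  print(len(ch.encode("gb2312")))  # 2 bytes
  print(len(ch.encode("utf-8")))   # 3 bytes -> 50% more for the most common characters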


Uses it != cares about it


Nobody cares about unicode?

I couldn't write web software that deals with non-ascii, and especially non-Latin, alphabets without paying attention to Unicode.

When I start getting into weird edge and corner issues -- I am continually amazed that Unicode works as well as it does. It turns out modeling alphabets is way more complicated than you might think. And Unicode gets an amazing amount right, making it practical to do things robustly correct.

Including not just the character set and encodings, but things like Unicode collation algorithms. A bunch of really smart people have worked through a bunch of different non-trivial problems of dealing with diverse alphabets, and provided solutions you can use that pretty much just work, bug-free.


and basically nobody cares about Unicode apart from font designers and the Unicode Consortium.

If by nobody you mean, pretty much the entire world east of New York.


Just the other day I was having a good laugh with the carpenters building my house here in New Hampshire at the expense of the rubes in Maryland who still only use ASCII.


These days, it wouldn't make much difference in terms of bandwidth, since everyone gzips everything on the server side.


And thanks to Java for popularizing UTF-8. Java was the first major language to support Unicode natively. The lack of Unicode support in C++ was disappointing.

(Yes, the UTF-8 hack is nothing compared to Huffman coding. The Deflate compression algorithm, using Huffman coding to encode the LZ77 output and the Huffman encoding table itself, is another notch up on the hacking coolness scale.)


From http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Ch...: "The native character encoding of the Java programming language is UTF-16."

That said, at least they adopted a Unicode encoding, and helped it establish mindshare.


Is this right? Doesn't Java use UTF-16 internally?


And Java's first attempt to implement UTF-8, the one they called "UTF8", is not UTF-8. It's almost but not quite CESU-8. It will probably fuck up your astral characters (like emoji).

The real Java UTF-8 is, intuitively, named "AL32UTF8".


And then there's the unremitting horror that MySQL calls UTF-8, properly known as WTF-8.


I'd want to save the name "WTF-8" for that case where someone takes some text encoded as UTF-8, decodes it as if it's in a single-byte encoding, and then encodes the resulting mojibake as UTF-8 again.

Double UTF-8. WTF-8.


One really cool property of UTF-8 I never realized before is that a decoder can easily align to the start of a code word: every byte that has '10' in bits 7 and 6 is not the first byte of a code point's encoding. That's really useful if you want to randomly seek in a text without decoding everything in between.

Between that and backwards compatibility with ASCII, I'd say it's a pretty neat hack.
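
A minimal sketch of that resynchronization (Python; find_char_start is just an illustrative name):

  def find_char_start(buf: bytes, i: int) -> int:
      """Back up from an arbitrary byte offset to the start of its code point."""
      while i > 0 and (buf[i] & 0xC0) == 0x80:  # 10xxxxxx = continuation byte
          i -= 1
      return i

  data = "x中y".encode("utf-8")          # b'x\xe4\xb8\xady'
  j = find_char_start(data, 2)           # offset 2 lands mid-character...
  print(j, data[j:j+3].decode("utf-8"))  # ...so back up to offset 1: '中'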


This was actually one of the design goals of UTF-8. A decoder should be able to pick up a stream at any point and not lose a single valid codepoint.


This assumes you know the byte alignments. What if all you have is a stream of zeros and ones?


That's the physical and link layer's fault for not possessing alignment information. You shouldn't have a bit stream, you should already have a byte stream, otherwise you are dealing with data forensics.

The physical part of disks and networks usually have things like preambles, correlators, phase-locked loops, Gray codes, and self-clocking signals which can be used to deduce bit and byte boundaries, which is why in software, we almost always deal with information which we already know is correctly aligned.


In practice, you always know the byte alignments at the level of abstraction that ASCII and UTF-8 occupy.


There's no way to align a bit stream containing exclusively 8 bits per byte. (Whatever code you choose for the alignment mark could also be a valid code since there are no extras.)

You need to add extra bits so you can find the alignment, and once you do that you have the answer to your question.


While you may be correct, it's interesting to note the existence of Consistent Overhead Byte Stuffing (http://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffi...) which sort of kind of if you squint manages to add a 257th symbol to the data stream at a constant (worst-case) overhead per byte.

The original paper also discusses how to use the "Eliminate all zero bytes" behavior in order to set up an unambiguous sync pattern to indicate the beginning of transmissions in a bit-level wire protocol.


You remind me of a question my cryptography professor in college asked (and failed to answer...) -- what are the information-theoretical properties of a channel which sometimes deletes a character?

He attempted to answer it, but the answer assumed that the recipient could recognize deleted characters, which doesn't make any sense to me (it seems to be rather like assuming that if someone sends you a postcard and you don't receive it, you nevertheless sense, at the time it would have arrived, that you should have gotten it).


I wonder how computers would process text if societies with more complex alphabets had been at the foundation of the industry instead of English-speaking societies. What if Intel, Microsoft, IBM, and Apple were all Japanese companies and grew in a global market where English were not dominant? A big if, sure. Certainly, there must be glimpses at this in history of computing.


You can look at Sharp or Toshiba building word processors, which were dedicated electronic appliances, but slowly moving to generic computers as they became cheap and good enough to manage complex input methods and non-English character display (i.e. enough pixels per screen to have recognizable characters).

It's funny to think that the concept of a fully functional typewriter would be foreign to Japan until dedicated computers could be built.


Huh, that's interesting. Perhaps computing simply took the path of least resistance to express itself in society (for better or worse).


The video mentioned that Japan developed five different encodings that couldn't talk to each other. So in a sense, regardless of who started it, we would have ended up at something similar to UTF-8.


Mainland China and Taiwan did something similar. I actually went to one of the newly-post-Unicode meets of Academia Sinica in Taipei where hardcore ancient Chinese academics and computational linguists were discussing unsolved conversion issues for some of those creatures.


How often does UTF-8 update to account for these issues?


I think basically people who need to communicate beyond a certain age tend to avoid Unicode and just use images, and CJK Unicode is essentially fixed, even if still changing slowly now. More info here: http://www.unicode.org/reports/tr38/

My overall impression was that super ancient characters (of which there are tens of thousands more, probably with many academic arguments as to their individual distinctions or similarities) have been left out of Unicode proper and are under some documentation/standardization effort by a separate group as a 'special use region' mapping within Unicode for their own use by agreement. I can't find their site, though I could swear I had it a few years back.

Initially, "Han unification" was an effort by the authors of Unicode and the Universal Character Set to map multiple character sets of the so-called CJK languages (Chinese/Japanese/Korean) into a single set of unified characters, and was completed for the purposes of Unicode in 1991 (Unicode 1.0). Unfortunately, not only did they try to get modern scholars to agree on a normalized set of characters, but they also wanted them to agree on semantic equivalence (and a subset of pronunciations!)... in all cases... across all time: obviously not a good way to please hardcore academics.

(It should be noted that Vietnamese also used Chinese, that in ancient history (even ~3000+ years ago) numerous non-Chinese ideographic/logographic/alphabetic scripts existed in the south-western Chinese borderland, and that in modern Chinese, certain surviving characters today seen as 'Cantonese' (from the southern Chinese coast to east of Vietnam) are a surviving relic of this arguably greater and predominantly Southern Chinese culture of ideogrammatic innovation). Some of the scripts still survive archaeologically, others survive in literary reference, and some (though primarily alphabetic, save for the Naxi Dongba script, an understanding of which is critically endangered to lost now despite government efforts at preservation) are still alive today... often with government reforms or some 19th century debris of Jesuit or other religious meddling.


That's fascinating!

I can only imagine both the pressure and push-back to 'get it right' from academics. Not only for their own language, but in a competitive sense with other countries.

Great reply!


There's not much pressure. I don't think anyone reasonably expects the Unicode consortium to fix the problems with Han Unification at all - mainly because Unicode is already too pervasive, and the cost to update existing technology to make it compatible would be too big.

There's specialized software, encodings and fonts that can be downloaded for writing traditional characters such as Mojikyo (http://www.mojikyo.org/PWU8N/index.php), but any text is basically incompatible with other software, including the web, except via converting them to images.

It's unlikely this will change. Technological progress is more important to most people, and anything not in Unicode will eventually be lost in time, just like spoken languages disappear every year as state educations force a standardized language on people.


Mojikyo! That's the one. Their site has changed a lot since I last saw it (probably ~5 years back).

On the disappearance meme, I would posit that "writing almost never dies out anymore, it just gets progressively more obscure".

On the spoken languages meme, I also volunteer on occasion for the World Oral Literature Project (Cambridge/Yale) @ http://oralliterature.org/ .. I've also been contemplating heading up to Assam in India, maybe Bhutan and Nepal to do what little recording I can manage.


UTF-8 encodes Unicode code points, so it's Unicode, or some external entity that converts between character sets, that has to deal with those issues, not UTF-8.

UTF-8 would pretty much only need to be updated if the unicode standard redefines what a code point is (e.g. starts using floating point, decimals, imaginary numbers or something else that is also unlikely to happen)


> UTF-8 would pretty much only need to be updated if the unicode standard redefines what a code point is (e.g. starts using floating point, decimals, imaginary numbers or something else that is also unlikely to happen)

Or if they decide that they need more codepoints, so some invalid-but-possible UTF-8 byte sequences suddenly become valid.


There's no reason for that, UTF-8 is only there to encode Unicode codepoints, and the whole range of codepoints (including the 80% not yet attributed) can be expressed in UTF-8.


UTF-8 has only been updated once (to remove 5 and 6 byte sequences, to limit it to the same range of values that UTF-16 can express).

New versions of Unicode are standardized every year or two.


I think I've seen some people argue that it's not a coincidence that the development of computers happened in a culture with a small phonetic(-ish) alphabet, that this facilitated the development of computers.

To me, that sounds both rather too teleological, and plausible.


As a self-taught newbie programmer, one of my early question was, "How do letters get drawn on the screen?"

My limited understanding is that it works like:

bits (0's and 1's) -> encoding (e.g. UTF-8) -> glyphs (e.g. Unicode) -> Some insanely complicated black box. This black box knows how to do all sorts of things like kerning, combining cháracters, bizarre punctuation, and other magic.

I understand UTF-8 and Unicode, but I have no idea how all the other magic works. Why is AV nicely kerned, and 我. nicely spaced? Apparently this is a really hard problem, because my trusty old code editor TextMate didn't get it right. Unicode to screen is a terribly hard problem.


One way to think about text encoding is that it provides a nice way to represent indexes into a big array of glyphs: "unicode". Fonts map (some) of these indexes to graphic representations and provide "font hinting" -- clues as to how different letters should be rendered at different sizes -- and kerning information.

Usually your desktop environment is the thing that handles all of this, providing widgets for other programs to combine to create GUIs. I can't find it right now, but you can probably find e.g. GNOME/GTK's implementation.

(Re: Textmate: code is typically rendered in monospace font -- where every glyph has the same size -- so that it has the same alignment in every font with every renderer.)


GTK+ uses Pango [1], which uses the FreeType [2] backend on Linux. There's also a pure Go implementation of FreeType. Here's the entry point for those who are interested in reading the code: http://code.google.com/p/freetype-go/source/browse/freetype/...

[1] http://www.pango.org

[2] http://www.freetype.org


I think the terminology you use is not right. Unicode maps integers to graphemes (not glyphs) and vice-versa (i.e. it's a 1-1 mapping). Fonts map graphemes to glyphs.


This is even more wrong. Unicode maps integers to abstract characters. Fonts map characters to glyphs and layout engines map glyphs to graphemes. Keep in mind that graphemes can be composed of multiple parts, e.g. ligatures or combining characters.


individual glyphs: http://www.smashingmagazine.com/2012/04/24/a-closer-look-at-...

kerning: http://en.wikipedia.org/wiki/Kerning#Digital_typography

some stuff is embedded in the font, other things the OS will do on a "best guess" basis -- which is why things look radically different between operating systems and even between different apps on the same OS if they use their own font rendering.

Check out a very high-quality open source font rendering engine: https://developer.gnome.org/pango/unstable/


Learning about the FreeType library may be useful to you (http://freetype.org/). I used it many years ago to render TTFs to in-memory pixmaps, and then loaded those as OpenGL textures.


This is a Computerphile video. You can find the rest here: http://www.youtube.com/user/Computerphile/videos

Really fascinating interviews with luminaries.


Brady Haran's videos are in general very good. All of his videos are in this style, concentrated on a single topic and in general shorter than ten minutes. I especially like the SixtySymbols channel (Physics and Astronomy). He seems to have a talent for asking the right questions and/or finding the right people to answer his questions.


UTF-8 is a beautiful hack, but the way applications handle UTF-8 text and deal with corruption ranges from excellent to horrible. I've had a text editor balk at me after an hour of work when trying to save, with an 'xx character can not be encoded' error, and simply refuse to save without any hint as to where this character was. Playing manual divide and conquer without being able to save (just undo the changes) is pretty scary. Finally it turned out to be a quote that looked just like every other quote.


The UTF-8 hack is beautiful. All the problems with corruption ranges come from standardization committees. The only purpose of those ranges is to not admit that UCS-2 and UTF-16 were very bad ideas that should be killed.


Note that the UTF-8 encoding mechanism inspired the variable-length integer encoding (http://sqlite.org/src4/doc/trunk/www/varint.wiki) introduced by the SQLite author (DRH), which I hope will become popular among programmers.


Unfortunately many variable length integer encodings aren't well thought out. The integer encoding used in Protocol Buffers for instance allows for multiple valid encodings of any particular integer (there's no one true canonical form)... which can make the format difficult for use with things like hashing and digital signatures
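
For instance, with protobuf-style base-128 varints (low 7 bits per byte, least-significant group first, high bit meaning "more bytes follow"), a toy decoder shows two different byte sequences mapping to the same integer (an illustrative sketch, not the protobuf library itself):

  def decode_varint(data: bytes) -> int:
      value, shift = 0, 0
      for b in data:
          value |= (b & 0x7F) << shift  # low 7 bits carry payload
          shift += 7
          if not (b & 0x80):            # high bit clear: this was the last byte
              return value
      raise ValueError("truncated varint")

  print(decode_varint(b"\x01"))      # 1, the minimal encoding
  print(decode_varint(b"\x81\x00"))  # also 1, via a redundant zero continuation group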


FWIW UTF-8 allows multiple encodings of any particular integer. Decoders should reject non-minimal encodings (they are a security risk, as it becomes possible to smuggle an ASCII payload in a non-ASCII form, which blindsides some security systems) but don't always do so. And of course if you don't decode UTF-8 data at any point and don't validate it either, you're still fucked.
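
The classic example is the overlong two-byte encoding of '/' (0xC0 0xAF), which a strict decoder must reject (Python's does):

  overlong = b"\xc0\xaf"          # would mean U+002F '/' if overlong forms were allowed
  try:
      overlong.decode("utf-8")
  except UnicodeDecodeError as err:
      print("rejected:", err)     # CPython refuses non-minimal forms

  print(b"\x2f".decode("utf-8"))  # '/', the single valid (minimal) encoding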


UTF-8 wins for the internet because most of the payload of an HTTP request/response is 7-bit ascii characters, so it is the most efficient even in languages where UTF-16 may be more efficient because of the complexity of their character set.
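
A rough illustration with a markup-heavy toy string (Python; exact numbers depend on the text, of course):

  page = "<html><body><p>日本語のテキスト</p></body></html>"
  print(len(page.encode("utf-8")))      # 57 bytes
  print(len(page.encode("utf-16-le")))  # 82 bytes: every ASCII markup byte doubles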


I once wrote a programming language that used UTF-8. I was rather proud of this hack that succinctly allowed UTF-8 variable / function names. I went searching for it and, sure enough, I still like it.

  BAD_UTF8 = [\xC0\xC1\xF5-\xFF];
  UTF8_CB  = [\x80-\xBF];
  UTF8_2B  = [\xC2-\xDF];
  UTF8_3B  = [\xE0-\xEF];
  UTF82    = UTF8_2B UTF8_CB;
  UTF83    = UTF8_3B UTF8_CB UTF8_CB;
  UTF8     = UTF82 | UTF83 ;
  
  ATOM     = ([_a-zA-Z]|UTF8)([_a-zA-Z0-9]|UTF8)*;


This looks limited to the BMP in the same way the broken utf8 charset for MySQL is. You're simply ignoring the existence of the other 16 planes. While not necessarily a problem in practice, especially with identifiers in programs, it still hints at a greater problem that you nonetheless chose to name this UTF-8 which it isn't.


Yes, I chose to stop there for pragmatic reasons, however, the intention was always to add the other 4, 5 & 6 byte forms if/when needed.

This was as much as I needed to test my wide char code (and to parse written languages e.g. Chinese). At least left like this my code would error properly if it encountered a character it didn't know.


I thought 7 bits were used in ASCII because terminals needed the 8th as a parity bit, not because machines dealt much in 7bit entities.


Per the wiki page on ASCII:

"The committee considered an eight-bit code, since eight bits (octets) would allow two four-bit patterns to efficiently encode two digits with binary coded decimal. However, it would require all data transmission to send eight bits when seven could suffice. The committee voted to use a seven-bit code to minimize costs associated with data transmission. Since perforated tape at the time could record eight bits in one position, it also allowed for a parity bit for error checking if desired.[17] Eight-bit machines (with octets as the native data type) that did not use parity checking typically set the eighth bit to 0.[18]"


No. 5- and 6-bit codes existed. And according to Wikipedia (which cites the book "Coded Character Sets, History and Development"), making ASCII an 8-bit encoding was considered (not to encode more characters, but to embed BCD data efficiently) but rejected, because reducing the required bandwidth for all text by 12.5% represented a really significant cost saving at the time.


That is definitely true for early terminals; however, the parity bit got tossed once links became less lossy. There was an 8-bit ASCII definition called "Extended ASCII" which was (is?) used for things like curses, and it describes a lot of characters that have since been remapped in UTF-8.


I once wrote a potted history of the precursors to Unicode, starting from the telegraph codes:

http://randomtechnicalstuff.blogspot.com.au/2009/05/unicode-...


Nice article. Reading through the linked Wikipedia page[1] on ¤ (the international currency symbol), I was amazed by this bit of history about symbolic innovation for great socialist profit!:

In some versions of BASIC (notably in Soviet versions and ABC BASIC), the currency sign was used for string variables instead of the dollar sign. It was located on the keyboard and the character set table at the same position in many national keyboards (like Scandinavian) and equivalent versions of 7-bit ISO/IEC 646 ASCII, as the dollar sign is in US-ASCII.

PS. This made me think of ₪, the sign for Israeli New Shekel. I wonder what percentage of currency symbols are completely symmetrical, and whether that derives from some kind of innate psychological quality that we associate with symmetry? IIRC, symmetry is oft-linked to beauty in the subconscious. (Of course, centrally issued fractional-reserve currencies nationally systematize usury, an activity universally shunned by all the world's monotheistic traditions, including Judaism: There are three Biblical passages which forbid the taking of interest in the case of "brothers," but which permit, or seemingly enjoin, it when the borrower is a Gentile [...] Lending on usury or increase is classed by Ezekiel [...] among the worst of sins. [...] abominations [...] be afraid of thy God [...] "...surely suffer death; his blood is upon him"; hence the lender on interest is compared to the shedder of blood. ... more @ http://www.jewishencyclopedia.com/articles/14615-usury)

[1] https://en.wikipedia.org/wiki/%C2%A4


Tom Jennings did something similar which only goes up through ASCII: http://www.wps.com/J/codes/

You might also know Tom Jennings from having created Fidonet and having written the BIOS which became the basis for Phoenix Technologies BIOS.

http://en.wikipedia.org/wiki/Tom_Jennings


That's awesome! Thanks for sharing this, it's really detailed. I did get in a mention of the Gauss-Weber Telegraph Alphabet, though :-)

On everything else, he's far more detailed!


This always reminds me of the Microsoft long-filename support for Windows NT. They basically use an otherwise-unused combination of attribute bits in the old DOS 8.3 file entry (8 characters for the filename, 3 characters for the extension) to signal a long file name, then allocate enough extra directory entries to store the long name, depending on its length. The genius part is that older systems only see the short filenames and keep on working (a rough detection sketch follows below).

The specification of the format is at [1], although I would love to see a nice drawing of the hack.

[1]: http://www.cocoadev.com/index.pl?MSDOSFileSystem
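
Roughly, the trick boils down to one byte: long-name entries set an attribute combination that old DOS treats as a volume label and therefore skips. A minimal Python sketch of the detection (the constant and function names are mine, assuming a raw 32-byte FAT directory entry):

  ATTR_LONG_NAME = 0x0F   # READ_ONLY | HIDDEN | SYSTEM | VOLUME_ID

  def is_long_name_entry(entry: bytes) -> bool:
      """True if this 32-byte FAT directory entry is part of a long file name."""
      assert len(entry) == 32
      return (entry[11] & 0x3F) == ATTR_LONG_NAME   # attribute byte at offset 11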


Your URL is broken but it sure sounds like you're describing the FAT32 long filename hack which appeared in Windows 95. That has nothing to do with NTFS.


VFAT [1] existed in WinNT 3.5, which was released in 1994.

[1] http://en.wikipedia.org/wiki/File_Allocation_Table#Long_file...


You may also want to see the UTF-8 Everywhere manifesto. https://news.ycombinator.com/item?id=3906253


Describing a hack that uses a varying number of bytes for different characters as "elegant" makes me cringe. However, I have nothing better to propose.


Consider that it's

1. completely backwards-compatible with ASCII

2. you can always tell whether you're looking at part of a multibyte character or not, which means that if a message was cut at some random point, you can tell exactly how many bytes to drop before reaching the start of the next complete character (see the sketch after this comment)

3. endian-agnostic due to being specified as a byte stream

4. contains no null bytes, so it can fit in any normal C string

Point 1 is absolutely vital for backwards compatibility, and point 2 makes it better than a lot of other multibyte encodings. Consider, for instance, chopping one byte off the start of a UCS-2-encoded string: you'll get complete garbage. And point 3 means you don't get strange endian-related errors.

It's a robust, resilient encoding that's a drop-in replacement for ASCII and needs no special support from many utilities: tail and head, for instance, can work with UTF-8 text as if it were plain ASCII, just looking for a \n byte and chopping the input into lines.

So yes, I do think it's elegant. It may not be the simplest possible encoding for unicode, but it's extremely practical.
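
Here's a minimal Python sketch of point 2's self-synchronization (the function name is mine): after a stream is cut at an arbitrary byte, you only need to skip the continuation bytes (0b10xxxxxx) to get back to a character boundary.

  def resync(chunk: bytes) -> bytes:
      """Drop leading continuation bytes so decoding can restart cleanly."""
      i = 0
      while i < len(chunk) and (chunk[i] & 0xC0) == 0x80:
          i += 1
      return chunk[i:]

  data = "héllo wörld".encode("utf-8")
  cut = data[2:]                 # cut lands in the middle of 'é' (0xC3 0xA9)
  print(resync(cut).decode())    # "llo wörld" -- at most one character lost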


> 4. contains no null bytes, so it can fit in any normal C string

This is false; the encoding of codepoint 0 is 0x00 per the standard. Modified UTF-8[1] makes 0 a special case with an over-long encoding: 0xC0 0x80.

[1]: http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8


Have you ever seen codepoint 0 used for anything other than terminating a string?


Personally, no, but you have to handle 0 correctly (well, mainly consistently), otherwise you end up with an attack vector where one part of the application stops at the 0 (e.g. a string-equality check) and a later step assumes that check held but continues past the 0, reading bytes that should have been examined earlier (but weren't).


Exactly, and moving from ASCII to UTF-8 means you get to keep that consistency: 0x00 means 'End of string' in ASCII and it means 'End of string' in UTF-8. No change. Never a miscommunication. No possibility of old software getting confused on this issue. Any code which had its last buffer overrun flushed out in 1983 is still free of buffer overruns in 2013.

And, if you really need to represent codepoint 0 in strings, you can use Java's Modified UTF-8, where codepoint 0 is represented by the byte sequence 0xC0, 0x80. (This isn't valid UTF-8 because in straight UTF-8, every codepoint must be represented by its shortest possible representation.)

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
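
A tiny Python sketch of just the NUL part of that trick (function names are mine; real Modified UTF-8 also encodes supplementary characters as surrogate pairs, which this ignores): codepoint 0 becomes the two-byte sequence 0xC0 0x80, so the encoded bytes never contain 0x00.

  def encode_modified_nul(s: str) -> bytes:
      # 0xC0 never occurs in valid standard UTF-8, so this substitution is safe
      return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

  def decode_modified_nul(b: bytes) -> str:
      return b.replace(b"\xc0\x80", b"\x00").decode("utf-8")

  enc = encode_modified_nul("a\x00b")
  print(enc)                        # b'a\xc0\x80b' -- no NUL byte anywhere
  print(decode_modified_nul(enc))   # back to 'a\x00b'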


> it means 'End of string' in UTF-8

No, it doesn't, unless you are saying that one should treat it like that. But null termination is just as dangerous[1][2] with UTF-8 as it is with ASCII and should be avoided as much as possible anyway. Also, ASCII doesn't mandate that \0 is end-of-string; that's just a convention from C.

(Did you notice that my original comment actually included the exact modified UTF-8 link you provided?)

[1]: http://cwe.mitre.org/data/definitions/170.html [2]: http://projects.webappsec.org/w/page/13246949/Null%20Byte%20...


Linuxant [1] released drivers that bypass GPLONLY controls like so:

  MODULE_LICENSE("GPL\0for files in the \"GPL\" 
  directory; for others, only LICENSE file applies");
I'm not sure whether this counts as something other than terminating a string.

[1] https://en.wikipedia.org/wiki/Loadable_kernel_module#Linuxan...


> This is false; the encoding of codepoint 0 is 0x00 per the standard

Well, yes. That's the point: 0x00 is only ever used to encode codepoint 0. It never shows up anywhere else. That's precisely what the text you replied to means.


But using 0 to terminate strings means you can never have a string that actually contains codepoint 0, as the terminator isn't part of the contents of the string.


That's a problem for those using 0 to terminate a string; if you don't, there's no issue. You can read DMR's historical perspective on choosing 0 as a string terminator here: http://cm.bell-labs.com/who/dmr/chist.html (critique section)

Even the C guys have moved on, though, as Go (co-designed by Ken Thompson) illustrates.


> But using 0 to terminate strings means you can never have a string that actually contains codepoint 0, as the terminator isn't part of the contents of the string.

This is indeed a tradeoff. The only alternatives are to store the length of the string explicitly somewhere (the Pascal solution is to prepend the string with a fixed-size length field, which can be as awkward as it sounds once you get to really long strings) or to do something nasty with the string's actual contents, such as declaring that the last byte of a string has its high bit set.

I think the C solution is the most reasonable when storage is really tight, and more flexible than the Pascal method in general, but I agree that it's potentially dangerous and it's annoying to have a byte which you can never represent in a string.


> more flexible than the Pascal method in general

Why? C strings and Pascal strings can store exactly the same contents, except that Pascal strings can also store a literal \0, and they are faster to manipulate (in many ways), at the cost of (sizeof(word) - 1) extra bytes.


Because you can have extremely long strings without worrying about overflowing a fixed-size length prefix.


A computer has a finite address space (e.g. 2^32 bytes on x86, 2^48 on x86-64) and always has a fixed-size data type large enough to address all of it; an in-memory string cannot usefully be larger than this, so using that finite-size type as the length prefix is perfect and optimal.

C strings are useful in extremely constrained environments, where the extra few bytes of the length prefix vs the trailing \0 byte are too much to pay, but are essentially just a security risk in any other situation.
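
A quick Python sketch of a word-sized length prefix (the 8-byte "<Q" field stands in for a 64-bit size_t; function names are mine). Note that an embedded NUL is just data here:

  import struct

  def pack_pstring(payload: bytes) -> bytes:
      return struct.pack("<Q", len(payload)) + payload

  def unpack_pstring(buf: bytes) -> bytes:
      (n,) = struct.unpack_from("<Q", buf)
      return buf[8:8 + n]

  packed = pack_pstring(b"embedded \x00 byte is fine")
  print(unpack_pstring(packed))   # b'embedded \x00 byte is fine'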


Why? The concept of a "character" actually becomes a lot less meaningful once you add support for all of the world's writing systems (and even all of the features of the Latin alphabet, such as diacritics and ligatures). Between combining characters, ligatures, case-mapping rules that map one character to many, and so on, you realize that there are very few cases where you want to address a "character" as a single independent entity. In fact, for almost everything you do, acting on a string of bytes is more useful than trying to act on characters individually.


It's about as good as you're going to get with character encoding. Unicode + UTF-8 will likely be the standard for representation and encoding for decades to come.


It's not really a 'hack', but making it backward compatible with ASCII while maintaining a lot of other desirable properties is quite elegant.

A lot of people complain about not being able to easily or programmatically manipulate UTF-8 strings on a character-by-character basis, as if it were a typical programmer's daily business to go messing with natural language, but I've never seen the great loss.


Seems perfectly sensible to me. Why should characters be of fixed size?


Getting i-th char in O(1) is a nice thing to have.


It is, but it's not worth wasting 4x memory for the common case of mostly-Latin text.

It also turns out that you very rarely need arbitrary string indexing. Most of the time, when you're indexing into a string, it's a fixed (and relatively small) number of bytes from the start or end. UTF-8 can do this in a tight inner loop that just checks for bytes that don't start with 0b10. If you need .startswith or .endswith, you can just compare bytes with a byte length offset. If you need to do substring search, you can do Boyer-Moore on bytes. If you need to test for equality, do it on bytes. If you need to chop a string in half for divide & conquer algorithms and only need to be approximately right, you can chop in half by bytes and then use UTF-8's self-synchronizing property to find the nearest character boundary.
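
A small Python sketch of that last divide-and-conquer trick (the function name is mine): split roughly in half by bytes, then step back to the nearest byte that isn't a continuation byte so both halves decode cleanly.

  def split_near_half(data: bytes):
      i = len(data) // 2
      while i > 0 and (data[i] & 0xC0) == 0x80:   # continuation byte: step back
          i -= 1
      return data[:i], data[i:]

  left, right = split_near_half("αβγδε".encode("utf-8"))
  print(left.decode(), right.decode())   # αβ γδε -- both halves are valid UTF-8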


Except that, because in Unicode there is no one-to-one correspondence between code points and characters, getting the i-th character is not O(1) even in UTF-32. As Richard O'Keefe clearly explains: "Now as it happens I don't think random access is important. Here's why. Consider Ṏ. There are three ways to encode that: O + ̃ + ̈ | Õ + ̈ | Ṏ. This one "user character" may be one, two, or three code points ..." [1]

[1]: http://erlang.org/pipermail/erlang-questions/2011-October/06...
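
You can watch this happen with Python's unicodedata (a quick sketch of O'Keefe's example): the same user-perceived Ṏ is one code point in NFC and three in NFD.

  import unicodedata

  s = "\u1E4E"                            # Ṏ, precomposed
  nfd = unicodedata.normalize("NFD", s)   # O + combining tilde + combining diaeresis
  print(len(s), len(nfd))                 # 1 3
  print([hex(ord(c)) for c in nfd])       # ['0x4f', '0x303', '0x308']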


This is why normalization methods exist, but they are also the source of various security flaws, like Spotify's issue in June 2013 [1] with user names that were different at the Unicode level but normalized into forms that were not unique. (Their example: the account 'ᴮᴵᴳᴮᴵᴿᴰ' could seize control of the account 'bigbird'.)

[1] http://labs.spotify.com/2013/06/18/creative-usernames/
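
A hedged reconstruction of that collision in Python (the real Spotify canonicalization pipeline had more steps, but the core of the problem is compatibility normalization plus lowercasing):

  import unicodedata

  fancy = "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30"   # 'ᴮᴵᴳᴮᴵᴿᴰ'
  plain = unicodedata.normalize("NFKC", fancy)            # 'BIGBIRD'
  print(plain.lower())                                    # 'bigbird' -- collides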


> This is why normalization methods exist

Not every (base, combining+) sequence has a precomposed version, so no, that doesn't work.


When is that a nice thing to have? It's almost never a useful operation to perform when working with Unicode strings.


Surrogate pairs mean you can't do this with UTF-16 either. Combining characters make it much less useful for UTF-32, as your understanding of a character no longer matches the user's. So arbitrary strings can't be guaranteed to have this property no matter which standard encoding you use.
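
A couple of quick Python checks illustrate both halves of this (variable names are mine): a non-BMP code point needs two UTF-16 code units, and a plain accented letter can be two code units even in UTF-32.

  emoji = "\U0001F600"                           # outside the BMP
  print(len(emoji.encode("utf-16-le")) // 2)     # 2 -- a surrogate pair

  accented = "e\u0301"                           # 'é' as e + combining acute
  print(len(accented.encode("utf-32-le")) // 4)  # 2 code units, one visible character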


UTF-8 is a nice format for storage/transmission. If you're going to do some heavy processing with your text you're supposed to convert it to a fixed-width format in memory (typically UTF-32).


Most heavy text-processing applications I know of actually tokenize text into words (well, technically terms), keep a lexicon mapping each term ID to its textual representation, and then work in term space. Individual letters are usually not semantically meaningful in most languages (both human and machine), so your analysis becomes much easier if you operate in a space that is semantically meaningful.
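
A toy Python sketch of that term-space idea (names are mine): intern each distinct token once, then run the analysis over integer IDs rather than raw text.

  lexicon, ids = {}, []
  for term in "to be or not to be".split():
      ids.append(lexicon.setdefault(term, len(lexicon)))
  print(lexicon)   # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
  print(ids)       # [0, 1, 2, 3, 0, 1]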


If you do some heavy processing of your text, the triviality of decoding UTF-8 (or anything else) to codepoints is no issue compared to the complexity of actually processing the text. If you think UTF-32 makes text processing (significantly) easier, you're not processing text, you're destroying it.


Why? When is this a useful thing?


I once had to decode an ASCII based datastream that was encoded something like this:

* the first 14 characters were the length of the data stream in ASCII, padded with 0

* after this came the records - each record started with the size of the record, including the record header

* the record then had each field start with a letter that indicated what sort of field it was - integer, float, character, variable string - all were encoded in ASCII

* variable-string fields were the letter "V", then a 14-byte length

Yes, the format was awful. But you asked when that would be useful. There's your answer.


You could still do this even if strings were encoded in UTF-8. Just define the length of the record to be the length in bytes.

This is why most modern programming & serialization languages (Go, Python 3, Protocol Buffers, Cap'n Proto) define separate byte[] and string types. Some things are just binary data and should be treated as such. Other things are encodings of world languages and should also be treated as such.
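
Python 3 makes the separation concrete, as a quick sketch: the record payload is bytes (that's what you length-prefix and ship), and the decoded text is str.

  record = "héllo".encode("utf-8")   # bytes: what goes on the wire
  text = record.decode("utf-8")      # str: what a human reads
  print(len(record), len(text))      # 6 5 -- byte length vs. character count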


You don't show how O(1) random access is useful here. You show a standard length-prefixed stream encoding.


I see you love wasting resources.


UTF-32?


I never understood why a simpler solution wasn't adopted, with bytes like:

     0xxxxxxx
the backward-compatible ASCII range

     1xxxxxxx
means this byte doesn't stand alone, but is followed by one (or more) complementary byte(s).

This way, the ASCII range would be defined as the "first order", and the

     1xxxxxxx 0xxxxxxx
would be the "2nd order" character range and so on!

     1xxxxxxx 1xxxxxxx ... 0xxxxxxx
as an N-byte-long sequence, would form the "Nth order" character range (with an extensible N, of course).

This kind of encoding would have accommodated any range! If I am missing something, I'm all ears.
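
A quick Python sketch of the proposed scheme (the function name is mine): high-bit-set bytes mean "more follows", and the final 0xxxxxxx byte carries the low payload bits. One property it gives up, compared to UTF-8, is that the terminating byte reuses plain ASCII values, so an ASCII byte found in a stream may actually be the tail of a multi-byte character.

  def encode_proposed(cp: int) -> bytes:
      out = [cp & 0x7F]              # final byte: high bit clear
      cp >>= 7
      while cp:
          out.append(0x80 | (cp & 0x7F))
          cp >>= 7
      return bytes(reversed(out))

  print(encode_proposed(ord("A")))   # b'A'     -- ASCII passes through unchanged
  print(encode_proposed(0x12F))      # b'\x82/' -- ends in 0x2F, the ASCII '/'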


I think there is still a fundamental problem with string encoding.

The problem is that decoders cannot know what encoding a byte stream was encoded in without additional information, and such information is often lost or omitted, as you can see all over the web.

In that situation, all a decoder can do is guess. This is why we still suffer from Mojibake.

A possible solution would have been to attach the encoding information to the head of the bytes, as one or two bytes.

For example:

UTF-8 = 0b00000001

UTF-16 = 0b00000010

Shift_JIS = 0b00000011

EUC-JP = 0b00000100

and so on.

Of course this is not an actual, reasonable solution, because everyone would have to switch their decoders and encoders to this protocol at once.


I wouldn't really call it a "hack", but rather an instance of a standard way of producing variable-length codes, namely a prefix code [1]. That it was also made to be self-synchronizing is of course neat, but I would reserve "hack" for some one-off thing rather than for the application of well-known concepts to solve a problem they were designed to solve.

[1] https://en.wikipedia.org/wiki/Prefix_codes



