
How ASCII lost and unicode won - stevejalim
http://blog.goosoftware.co.uk/2012/12/05/how-ascii-lost-and-unicode-won/
======
joshuaellinger
ASCII is by far the most successful character encoding that computers have
used. It was invented in 1963, back in the era of punch cards and core memory.
Modern RAM did not exist until 1975 -- a decade later.

Unicode is the replacement, not the competitor, like 64-bit IP addresses are
the replacement for 32-bit IP addresses. It was developed in the early 1990s
when RAM got cheap enough that you could afford two-bytes per character.

Personally, I deal with data all the time and rarely encounter Unicode. Of
course, I'm in the US dealing with big files out of financial and marketing
databases. In fact, I've seen more EBCDIC than Unicode.

~~~
VLM
"Modern RAM did not exist until 1975 -- a decade later."

What does that even mean? It doesn't mean DIP-packaged DRAM, because my dad
was buying COTS Intel 1103s in 1971 or so, before I was even born. And the
first "I'm gonna store one bit of data in a capacitor" was done over the pond
in the .uk during WWII at their code-breaking plant.

"like 64-bit IP addresses are the replacement for 32-bit IP addresses."

Um...

~~~
joshuaellinger
I just looked it up on Wiki. I remember my dad showing me core memory (after
it was out of production) in the late 70s.

------
timthorn
I really hate to nitpick, but the article implies that ASCII was the first
character encoding. In fact, there was a rich history of different encodings
before that, with different word sizes and/or incompatible 8 bit encodings.
It's quite interesting to look back and see what trade-offs were made and why.

~~~
ygra
Well, it's the oldest character set and encoding that's still semi-relevant
today. I doubt many people nowadays encounter EBCDIC and the like (and if they
do, the article isn't aimed at them, I guess).

~~~
VLM
But it misses some relevant semi-dramatic story.

For all practical purposes EBCDIC is the "IBM Standard" as opposed to ASCII's
"American Standard". The mood in the 70s/80s outside the business world was
"In your face, IBM!"

And putting "Information Interchange" in the name itself is another bit of "In
Your Face" posturing aimed at the mainframe world. We're the future of data
transmission and you'd best get used to it, IBM...

ASCII really was a rebellion in the olden times. One that won.

Another story: before ASCII, teletype codes and such were usually modal, with
LTRS/FIGS shifts to switch between 5-bit letters and 5-bit figures. So there's
that dramatic great circular wheel of IT where we've oscillated, both before
and after ASCII, between simple encodings and modal encodings. This was an
early whine against Unicode too: who cares about code pages, just embed the
character set in, or alongside, what amounts to the MIME media type, and
glyph-like Asian languages should just be drawn in GIF files anyway. Or so the
complaints went at the time.

Another design statement story: the Uni- in Unicode, uniting all the extended
code pages into one really huge space.

There are other dramatic stories not in the article. For example, the Klingon
in Unicode movement. Basically, about 15 years ago they tried to get Klingon
script into Unicode; about a decade ago the Unicode people (who?) said no, so
the Klingon people squatted in the Unicode equivalent of what networking
people would call RFC1918 space, and it's simmered since then. Will Klingon
actually make it into Unicode officially or not? Who knows. You can add to the
fire by pointing out that Unicode is already stuffed with scripts that no
living human culture currently uses, and numerous glyph symbols (think of math
stuff like + or %). On the other hand, this would inevitably result in Tolkien
Elvish script being included in Unicode. And does this really matter one way
or the other? And this maps perfectly onto the Wikipedia battle between the
deletionist (expletive deleted) and the inclusionist saviors of humanity. Well,
maybe that last line was a little biased toward my opinions...
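
For the curious, the Unicode equivalent of RFC1918 space is the Private Use
Area, code points Unicode promises never to assign officially. A quick Python 3
check, using the ConScript registry's unofficial Klingon block at U+F8D0 as the
example:

    import unicodedata

    # U+F8D0 is inside the BMP Private Use Area (U+E000..U+F8FF), where
    # anyone can "squat" by private agreement -- the (unofficial) ConScript
    # registry parks Klingon pIqaD starting here.
    ch = "\uF8D0"
    print(unicodedata.category(ch))                    # 'Co' = Other, Private Use
    print(unicodedata.name(ch, "<no official name>"))  # PUA points have no name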

There's tons of extra fun drama to tell about Unicode, and cutting off the
story at the founding of ASCII misses some of it, especially in a post aimed
at non-techs who, we're told, really like drama. You could probably turn the
Unicode story into a trashy reality TV show somehow; Vampire Romance Fiction
is going to be a harder translation, although I'd love to see it.

~~~
kps
IBM was actually quite involved in the development of ASCII. See Charles E.
MacKenzie, _Coded Character Sets: History and Development_ [1] and Bob Bemer's
stuff [2].

[1]
[http://openlibrary.org/works/OL8019369W/Coded_Character_Sets](http://openlibrary.org/works/OL8019369W/Coded_Character_Sets)

[2] [http://www.bobbemer.com/](http://www.bobbemer.com/)

~~~
VLM
True. Much as they were deeply involved in PCs. But the mainframe people and
the desktop people identified pretty strongly with their respective character
sets in the old days.

There is no technological reason EBCDIC couldn't have been the encoding for
the whole desktop revolution, other than the dramatic central control vs local
control, mainframe vs desktop thing.

There is some truth to the claim that whenever IBM reached a fork in the road
they usually found a way to go both ways, at least for many years in the olden
days.

------
salmonellaeater
The fact that UTF-8 and UTF-16 are often exposed to programmers when dealing
with text is a major failure of separation of concerns. If you had a stream of
data that was gzipped, would it ever make sense to look at the bytes in the
data stream before decompressing it? Variable-length text encodings are the
same. Application code should only see Unicode code points.

In general it was a mistake to put variable-length encodings into the Unicode
standard. A much better design would have been to use UTF-32 for the
application-level interface to characters, and use a separate compression
standard that is optimized for fixed alphabets when transporting or storing
text. This has the advantage that the compression scheme can be dynamically
updated to match the letter frequencies in the real-world text, and it
logically separates the ideas of encoding and compression so that the
compression container is easier to swap out. And, of course, an entire class
of bugs would be eliminated from application code.

 _Edited first paragraph to clarify: "Variable-length text encodings are the
same."_
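
To make the bug class concrete, here's a minimal Python 3 sketch (my
illustration, not the article's): byte-level views of UTF-8 leak encoding
details that a code point view never shows.

    s = "héllo"
    b = s.encode("utf-8")        # 6 bytes: 'é' becomes 0xC3 0xA9
    print(len(s), len(b))        # 5 6 -- the two views already disagree
    print(s[1])                  # 'é' -- code point indexing, what apps want
    print(b[1])                  # 195 -- a raw byte, half of 'é'
    try:
        b[:2].decode("utf-8")    # byte-level truncation splits 'é'
    except UnicodeDecodeError as e:
        print("encoding leaked into application code:", e)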

~~~
Someone
I agree that that is the ideal end situation, but Unicode would have been dead
on arrival if they had chosen that approach. Memory was just too expensive at
the time to make a system that, in most of the computer-using world, wasted
75% of the space in every text string. And no, just-in-time decompression
wouldn't have worked either; CPU cycles were also too expensive at the time.

Unicode also would have been too incompatible with existing code that copied
8-bit character strings around. See
[http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt](http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)
for some rationale behind UTF-8.
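
The memory argument is easy to put numbers on; a rough Python 3 comparison for
a plain ASCII string (my figures, not from the linked history):

    s = "hello, world"                 # plain ASCII text
    print(len(s.encode("utf-8")))      # 12 bytes -- 1 byte per character
    print(len(s.encode("utf-16-le")))  # 24 bytes -- 2 bytes per character
    print(len(s.encode("utf-32-le")))  # 48 bytes -- 4 bytes each, 75% zeros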

------
ygra
I'm impressed. Easily readable and understandable, short, and as far as I can
tell free of the factual inaccuracies and misinformation that plague many
other Unicode introductions and tutorials.

~~~
qznc
You want some criticism, because we have too little of that here on HN? I'll
bite. ;)

"A byte is a set of 8 bits. Computers typically move data around a byte at a
time."

A byte being 8 bits is ok. Historically, a byte might have a different number
of bits, but all modern architectures use 8 bits. Since this is an
introductory article, this is fine. (more details:
[http://en.wikipedia.org/wiki/Byte](http://en.wikipedia.org/wiki/Byte))

Computers do not typically move data in byte chunks. You could say "a byte is
the smallest unit of data a CPU can load or store". If you talk about moving
data, the question is between what. Probably memory. However, there are caches
nowadays, since bandwidth is cheap and latency is expensive. Data is moved in
cache line chunks, which means 4-64 byte chunks depending on architecture and
cache level. Bigger chunks in upcoming architectures.

~~~
chiph
IIRC, the IBM System/360 is the machine that really set the 8 bits == 1 byte
convention in stone.

Before that you had CDC machines with 6-bit bytes (or nybbles, if you owned an
Apple ][ with a floppy disk).

~~~
npongratz
I knew bytes were ambiguously sized, but I had thought a nibble/nybble was (A)
always half an octet, and (B) a fun Apple coders magazine [0].

But sure enough, I was very wrong on (A): Apple II disk writes were done in
"sets of 5-bit or, later, 6-bit nibbles" [1].

Thank you, chiph!

[0] [http://www.nibblemagazine.com/](http://www.nibblemagazine.com/) [1]
[https://en.wikipedia.org/wiki/Nibble#History](https://en.wikipedia.org/wiki/Nibble#History)

~~~
chiph
There's probably some variation in the terminology, but I always used these
terms:

    bit
    nibble (4 bits)
    nybble (6 bits)
    byte (8 bits)
    word (16 bits)
    :

------
Digit-Al
> ASCII really should have been named ASCIIWOA: the American Standard Code for
Information Interchange With Other Americans.

So he thinks Americans are the only people to use the _English_ language, does
he?

~~~
kps
It's worse than that, actually, as ASCII from the start¹ included provisions
for variants for non-English Latin characters and alternate currency symbols,
and ASCII was essentially the same project as ECMA-6² (ECMA being the European
Computer Manufacturers' Association³, a standardization group founded in
1961).

ASCII as we know it (which is essentially the 1967 version⁴), like the
corresponding ECMA standard⁵, provided for overloading punctuation characters
as diacritics ("/¨, ^/ˆ, ~/˜, '/´, `/`, ,/¸) to be overstruck in typewriter
fashion; ECMA-35⁶ (1971⁷) defines further extension techniques using control
and/or escape sequences.

So, yes, it's just a failed attempt at an anti-American cheap shot from
someone who isn't familiar with the development of character set encodings.

¹ _American Standard Code for Information Interchange_,
[http://www.wps.com/projects/codes/X3.4-1963/index.html](http://www.wps.com/projects/codes/X3.4-1963/index.html)

² _7-bit Coded Character Set_,
[http://www.ecma-international.org/publications/standards/Ecma-006.htm](http://www.ecma-international.org/publications/standards/Ecma-006.htm)

³ [http://www.ecma-international.org/default.htm](http://www.ecma-international.org/default.htm)

⁴ [http://www.wps.com/J/codes/Revised-ASCII/index.html](http://www.wps.com/J/codes/Revised-ASCII/index.html)

⁵ _7-bit Input/Output Coded Character Set, 4th Edition_ is unfortunately the
oldest version available online;
[http://www.ecma-international.org/publications/files/ECMA-ST-WITHDRAWN/ECMA-6,%204th%20Edition,%20August%201973.pdf](http://www.ecma-international.org/publications/files/ECMA-ST-WITHDRAWN/ECMA-6,%204th%20Edition,%20August%201973.pdf)

⁶ _Character Code Structure and Extension Techniques_,
[http://www.ecma-international.org/publications/standards/Ecma-035.htm](http://www.ecma-international.org/publications/standards/Ecma-035.htm)

⁷ _Extension of the 7-bit Coded Character Set_,
[http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-35,%201st%20Edition,%20December%201971.pdf](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-35,%201st%20Edition,%20December%201971.pdf)

------
peterkelly
Another good article on this topic is the one by Joel Spolsky:

[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

------
gnosis
_" Designed as a single, global replacement for localised character sets, the
Unicode standard is beautiful in its simplicity. In essence: collect all the
characters in all the scripts known to humanity and number them in one single,
canonical list. If new characters are invented or discovered, no problem, just
add them to the list. The list isn’t an 8-bit list, or a 16-bit list, it’s
just a list, with no limit on its length."_

Is this really true? My impression was that UTF-32 is a fixed-length encoding
which uses 32 bits to encode all of Unicode. It seems that this means that
Unicode can never have more code points than could fit in 32 bits. Right?
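
A quick check in Python 3 at least suggests a hard cap, though an even lower
one than 32 bits would allow (the codespace stops at U+10FFFF so that UTF-16's
surrogate scheme can reach everything):

    print(chr(0x10FFFF))       # the highest valid code point -- fine
    try:
        chr(0x110000)          # one past the end of the Unicode codespace
    except ValueError as e:
        print(e)               # chr() arg not in range(0x110000)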

~~~
peteri
Think you're right. The article could do with a rewrite, as Unicode used to be
16-bit until 1996 (according to Wikipedia), which explains why Java/Windows
are really UTF-16 based.

The more interesting question is: if you were designing a new operating
system, would you pick UTF-8 or UTF-32 as the basis of your character system?
Bearing in mind that you need to normalise strings for comparison purposes
anyway, the general space efficiency of UTF-8 seems tempting for most systems.
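
For instance, with Python 3's unicodedata standing in for whatever the OS
would ship (my sketch, not a definitive design), normalisation is needed no
matter which encoding you pick, while UTF-8 keeps mostly-Latin text compact:

    import unicodedata

    a = "\u00e9"   # 'é' as one precomposed code point
    b = "e\u0301"  # 'é' as 'e' plus a combining acute accent
    print(a == b)                           # False
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))  # True
    print(len(a.encode("utf-8")),
          len(a.encode("utf-32-le")))       # 2 4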

~~~
lucian1900
Why would one _not_ use UTF-8? You can't do constant-time indexing by code
point anyway, so all processing has to be linear.
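
A sketch of that linear scan, assuming Python 3: UTF-8 continuation bytes all
look like 10xxxxxx, so finding the n-th code point means counting the bytes
that don't.

    def utf8_index(data: bytes, n: int) -> int:
        """Byte offset of the n-th code point in UTF-8 data (linear scan)."""
        count = 0
        for offset, byte in enumerate(data):
            if (byte & 0xC0) != 0x80:     # not a continuation byte
                if count == n:
                    return offset
                count += 1
        raise IndexError("code point index out of range")

    b = "naïve".encode("utf-8")           # 5 code points, 6 bytes
    print(utf8_index(b, 2))               # 2 -- offset of 'ï'
    print(utf8_index(b, 3))               # 4 -- 'v', shifted by the 2-byte 'ï'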

------
okwa
> These mappings of numbers to characters are just a convention that someone
> decided on when ASCII was developed in the 1960s. There’s nothing
> fundamental that dictates that a capital A has to be character number 65,
> that’s just the number they chose back in the day.

I don't think it's mere coincidence that the capital letters start at 65 and
the lower case at 97 and the decimal digits at 48.
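
The layout looks quite deliberate: case differs by a single bit, and a digit's
value sits in its low nibble. A quick Python 3 illustration:

    print(hex(ord("A")), hex(ord("a")))   # 0x41 0x61 -- differ only in 0x20
    print(chr(ord("A") | 0x20))           # 'a' -- setting the bit lowercases
    print(ord("7") & 0x0F)                # 7  -- digits start at 0x30 (48)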

------
stuartcw
It's not a matter of winning or losing. The pre-Unicode mix of character sets
was a mess when it came to internationalization. Try truncating a Japanese
Shift-JIS string in C. That will learn you...
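
You don't even need C to see the failure mode; a minimal Python 3
reproduction:

    s = "日本語"                   # three characters, 2 bytes each in Shift-JIS
    b = s.encode("shift_jis")      # 6 bytes total
    try:
        b[:3].decode("shift_jis")  # naive truncation cuts 本 in half
    except UnicodeDecodeError as e:
        print("mojibake:", e)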

~~~
ygra
Arguably Unicode (UTF-8 and -16) doesn't necessarily make this any easier. Nor
does any variable-length encoding, really. You see halved code points quite
frequently, and if not that, then halved entities like &quot;.

------
danso
OT and out of curiosity: how do non-native English speakers experience
typing/keyboard education? I can barely remember how to make any of the basic
accents over the `e` when trying to sound French. Are typing classes in non-
English schooling systems much more sophisticated than in English (i.e. ASCII-
centric) schools? I wonder if non-native English typists come away with a
better handle on the power of keyboard shortcuts (whether they use them to
create accents or not).

~~~
kalleboo
Most countries just have different keyboards so they don't need any dead keys.
Look to the right of P/L:
[http://www.danielschlaug.com/journal/wp-content/uploads/2011/03/Screen-shot-2011-03-03-at-20.44.46.png](http://www.danielschlaug.com/journal/wp-content/uploads/2011/03/Screen-shot-2011-03-03-at-20.44.46.png)

~~~
yebyen
What do you do without square and curly braces, I'm curious? Does everyone
write only Lisp and HTML? :)

------
lmm
Given the controversy over Han unification, I suspect that incompatible
character sets will be with us for a while yet, more's the pity.

------
lelf
Well, when 99% think Unicode = encoding = UCS-2 = UTF-16, don't believe
there's anything outside the BMP, and "wtf" is the only word that comes to
their mind when they hear about graphemes… Unicode won?
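
Each of those confusions is one Python 3 session away from being dispelled,
for what it's worth:

    clef = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, outside the BMP
    print(len(clef))                           # 1 code point...
    print(len(clef.encode("utf-16-le")) // 2)  # ...2 UTF-16 units (surrogates)

    g = "e\u0301"                  # one grapheme: 'e' + combining acute accent
    print(len(g))                  # 2 code points, one user-visible character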

------
rayiner
Unicode, meh. Nobody will ever need more than 128 characters.

