
Programming languages have to break free from the tyranny of ASCII - adambyrtek
http://queue.acm.org/detail.cfm?id=1871406
======
jerf
How do I type it?

How do I type it?

How do I type it?

That's the fundamental question, worth repeating three times. Numerous
languages support Unicode in their syntax and it doesn't matter, because
nobody can use them: who wants a language where, for a new user, typing the
'or' operator is a copy-and-paste operation?

You can incrementally stuff more symbols onto a keyboard but I'd want to see
some science that shows that people are able and willing to deal with a
keyboard that contains their local language, a full suite of mathematical
symbols including Greek, Hebrew, and the various mangled Latin characters that
mathematics uses, plus logic, set notation, plus the useful not-necessarily-
mathematical operators that language designers either want or in some cases
already use/permit such as arrows, boxes, and all the other things. _And_ we
still need our full editor keyboard shortcuts and if you're a power user (and
we're talking programmers...) our keyboard shortcuts for the window manager.

So, you want to predicate learning your language on me learning all that? I'm
the kind of fruitcake that learned Haskell but I still laugh at the idea of
learning something like that. No sale. (And I don't use the Unicode Haskell
permits, because while I'm actually very comfortable remapping my own
keyboard (I even have interrobang bound to a key because I found myself using
it so much), why would I do that to anybody _else_ who ever tries to read or
use my code‽)

~~~
lisper
Like this: (λ (ø) (sin (* π ø)))

I did that on a Mac using the option key and a few custom key bindings.

The real question is how do you READ it. The problem is that unless you are
very selective about your fonts and which characters you use, it can become
impossible to read because many unicode characters are visually
indistinguishable.

~~~
LInuxFedora
C'mon, they are distinguishable. Can't you differentiate between λ and ø?

~~~
nuclear_eclipse
But can you quickly and easily distinguish between Ͱ and ͱ, or even Ω and Ω?
The last two really _are_ different Unicode characters, the first is Greek
capital Omega, the second is the Ohm symbol...
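
The difference is invisible on screen but plain in the code points. A quick Go sketch (the helper name is mine, not from the parent comment):

```go
package main

import "fmt"

// codePoint names a rune by its Unicode scalar value.
func codePoint(r rune) string {
	return fmt.Sprintf("U+%04X", r)
}

func main() {
	// GREEK CAPITAL LETTER OMEGA and OHM SIGN render identically in
	// many fonts, but they are distinct characters:
	fmt.Printf("%c = %s\n", '\u03A9', codePoint('\u03A9'))
	fmt.Printf("%c = %s\n", '\u2126', codePoint('\u2126'))
}
```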

~~~
wmf
I see why that's a problem for domain names, but not for programming
languages. You just define that one character has meaning and all its
homoglyphs are syntax errors (even in identifier names).

~~~
akgerber
But what characters are homoglyphs are determined by your choice of typeface.

------
neilk
The pain/payoff slope is very steep once you try to go beyond plain text.

Let's say you want to express hints to the compiler about what paths are
common. And you like the idea of expressing that with tinted blocks. Here's
what you have to do to make that work:

\- rejigger your editor to input and save such tint blocks in the code;

\- convince everyone else using your language to use your editor, or make
similar modifications to their editor;

\- ignore the cases where you don't have a particular editor available, such
as SSH'ing into a minimally configured machine;

\- exclude color-blind and blind programmers;

\- abandon the convenience of being able to view your source code in pagers
like less, diff tools, revision control tools, web browsers, IRC, instant
messaging, and web services that allow you to paste snippets of code.

Or, you could:

\- add a new keyword to your language.

\- add a new syntax coloring rule / code folding rule to your editor, and get
almost exactly the same effect.

We have a basic information science problem here. Every feature of the
programming language is going to have some concrete representation as a stream
of bytes _anyway_. As long as we use multiple tools and environments to
manipulate and view that stream of bytes, it makes sense for the
representation and the serialization to be identical -- that is: plain text.

However, there's no reason why we can't have the odd unicode character in
source code, as long as there are some well-known shortcuts for typing them. A
lambda symbol would be wonderful for many languages. Similarly, a Chinese
programmer ought to be able to define variable names and strings directly in
Chinese.

~~~
LInuxFedora
What about those complex Indic scripts in India?

~~~
neilk
Could you clarify what your question is?

------
enneff
Calling out Go here is a strange thing to do; Go supports unicode from the
ground up. Strings are UTF-8, and any identifier can include unicode
characters - you can name your Pi constant with the greek letter.
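
For instance, something like this compiles with any standard Go toolchain (a minimal sketch; the constant and helper names are mine):

```go
package main

import "fmt"

// π is the Greek letter, not the word "pi": Go identifiers may use
// any Unicode characters classified as letters or digits.
const π = 3.14159

// circleArea uses the Unicode-named constant like any other.
func circleArea(r float64) float64 {
	return π * r * r
}

func main() {
	fmt.Println(circleArea(1.0))
}
```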

Not only that, but we actually debated using the unicode characters for !=,
>=, and so on instead of the standard multi-character ASCII equivalents. In
the end it was decided that it would be a gratuitous change for little
benefit.

So it's a shame that he levelled his criticism at Rob Pike, a guy who clearly
knows a lot about unicode (he co-authored UTF-8 with Ken), built the first
Operating System with unicode support in its foundations (Plan 9), and has
taken great care to support unicode in Go.

~~~
ek
This post should be at the top of the thread. When you go to golang.org, one
of the first things they highlight, on the homepage, is the fact that Go has
native unicode support. The article makes a good point, but what
Poul-Henning Kamp describes as needing to happen has already started to
happen, with Go. The
entire ethos of this piece is that it is a rant about something that needs to
happen, but unfortunately, that thing has already happened.

------
colomon
I'm wondering what the author would make of Perl 6, which both embraces
Unicode and allows the programmer to create new operators from arbitrary
collections of Unicode symbols. It seems like his fondest dream and his worst
nightmare at the exact same time -- which makes me think he may be less than
perfectly clear on what he wants.

I've never understood the arguments against operator overloading. Yeah, you
can create unreadable programs that way. But it's equally easy to create a
function named "sort" that erases your hard drive. If you apply the same
discipline to naming your operators that you do to naming your functions, it's
not hard to understand. And it's a huge boon if you happen to be (for
instance) defining a new numeric type.

~~~
rwmj
I don't know what people have against overloading either. Nevertheless,
"delimited overloading" is like overloading, but safer, since the overloading
is limited to some scope, and readers can easily see where overloading is
allowed. In OCaml you can now write something like:

Big_int.( 23_567 mod 45 + "123456789123456789123456789")

where only the operators in the parentheses are overloaded.

<http://pa-do.forge.ocamlcore.org/>

------
mellis
An even more radical argument is that programming languages have to break free
from the tyranny of text. That is, they should be represented by an internal
structure that can be edited through a variety of representations (graphical
or textual). Various visual programming languages take this approach, but most
don't scale to large, complex programs. See subtextual.org for some
interesting work in this direction (not mine).

~~~
Groxx
I'll have to play the ubiquitous "abstractions leak" card.

Moving to this sort of setup will result in a) conflicting representations, b)
significantly slower code, or c) _epic_ frameworks on the order of all-of-CPAN
with tens of thousands of things you must know to be even remotely efficient.

Barring a Sufficiently Advanced Compiler, of course, but until one exists (and
puts all of us out of a job) optimization is still frequently important.

------
gruseom
_Syntax-coloring editors are the default. Why not make color part of the
syntax?_

Chuck Moore, of course, has done this with Colorforth. I read him somewhere
saying that by encoding some of the program in color, a visual part of the
brain not normally involved in programming was engaged, freeing up other
cognitive resources to think about the problem at hand. He didn't put it that
dogmatically; it was more of a speculation about why he found this use of
color so helpful.

~~~
IgorPartola
And keep out the colorblind programmers?

~~~
neilk
And the blind programmers.

~~~
obneq
You don't need it to be displayed in color to work; you could use other text
attributes or tags. The blind programmers prefer black-on-black color schemes,
I hear...

------
brownegg
Most of the comments seem a bit critical. I don't think the author is naively
suggesting "this needs to be better@#!!!". Not all ideas have to be practical
or even feasible; it's beneficial to at least _consider_ everything. The
payoff on a truly new / useful abstraction could be huge. Think of C with /
without the preprocessor.

Or take a language like Scala, which I happen to like a lot, but has one
hellaciously complex type system: how much would it benefit from an additional
syntax abstraction? This question is worth asking!

I like the columns of code idea a lot, and it's something I've thought about
for a long time. But my conclusion so far is that this is actually better done
with IDE functionality. Just establish a common annotation for metadata which
IDEs interpret as a display configuration. Or have the IDE open a .src file
and a .doc file in the same directory, displayed side by side. The doc file
has the same line count as the .src file, and there are conventions / macros /
etc. that make use of the second column, etc. (Of course the .doc could be
compressed into some kind of metadata file when stored.)

To bring it back to the topic, aren't we all REALLY interested in something
that makes coding... just, better?

------
rwmj
Perl 5 has been able to use UTF-8 identifiers for over a decade, but I don't
know anyone who is actually using them.

    
    
      $ perl -e 'use utf8; $水=1; print $水,"\n"'
      1

~~~
d0mine
The rationale to add non-ascii identifiers to Python:

 _Python code is written by many people in the world who are not familiar with
the English language, or even well-acquainted with the Latin writing system.
Such developers often desire to define classes and functions with names in
their native languages, rather than having to come up with an (often
incorrect) English translation of the concept they want to name. By using
identifiers in their native language, code clarity and maintainability of the
code among speakers of that language improves._
<http://www.python.org/dev/peps/pep-3131/>

    
    
      $ python3 -c '水=1; print(水)'
      1
    

But I also don't know anyone who is actually using them.

~~~
nitrogen
Right now computer programs written in common languages are internationally
readable. It's bad enough having a language barrier in normal human
communication. Why do we want to extend the language barrier to code, too? I
don't care if it's not English, but there needs to be an internationally-
accepted base character set and language for selecting programming language
keywords, and that character set needs to be roughly the size of ASCII.

------
cavilling_elite
What is everyone's opinion of LabView? Clearly it was never designed as a
server-side language, but it takes a decidedly non-"linear" approach to designing.

I used to mock LabView (arguing speed and complexity), but now that I work
with a colleague who exclusively uses it--we are a medical research
engineering group--I have really started to respect its utility.

So the question becomes: Would an opensource cross platform scripting
language, modeled after LabView, fit this bill?

~~~
elblanco
Dataflow systems like LabView (<https://www.ni.com/labview/>), Starlight Data
Engineer (<http://www.futurepointsystems.com/>) and the Data Model subsystem
of ArcGIS can be massively useful for a very large class of problems.

However, they tend to fall down for many general programming paradigms.
LabView is probably the furthest along that I know of in attacking more
general problems, but ultimately they are designed around building a pipeline:
datain->do something->data out.

LabView also builds in interesting levels of instrumentation and interactive
components that can make it seem more like a general-purpose language.

In that sense, the old UNIX tools (simple, single function tools you can pipe
together to create complex systems) are the same thing.

But there are other graphical programming environments that aren't focused
around pure dataflow.

<http://scratch.mit.edu/>

Is actually a very interesting experiment in using graphical components to
build up a program. But ultimately it's just a nice wrapper around a regular
old programming language, and in playing with it a bit, I found that it was
simply faster to just type things in a regular old language.

One of the more interesting things about graphical programming systems is
that you can often coax people into learning to use them who otherwise would
have no interest or inclination towards writing code. I think there's probably
a huge untapped market for a fully graphical system that can compile and
perform like native code...

~~~
st0p
_datain->do something->data out_

Isn't that what computing is all about?

~~~
fhars
A surprisingly large part of computing is about

    
    
      while not (user clicked 'close') do
        something
      done

~~~
elblanco
Exactly. The assumption that computers are only about data in->do
something->data out is a bit of an old-fashioned way of thinking. These days,
the vast majority of computation involves execution loops:

1 - Display something (perhaps put some data out)

2 - Receive input

3 - Do something (perhaps combine input with data in)

4 - Goto 1

This is true of everything from video games to operating systems.

Dataflow models are simply a subset of the models of computation we deal with
these days...they are extraordinarily useful models, but still limited in what
they do.

------
dlsspy
One thing that sucks about Go in particular with Unicode symbol names is that
the case of the first letter of a symbol's name is used to specify the scope
of the symbol.

In the following program, what public methods exist?

    
    
        package main
    
        import "fmt"
    
        func ॐ(ॐ string) {
        	fmt.Printf("%s\n", ॐ)
        }
    
        func main() {
        	ॐ("x")
        }
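
None, as it happens. A quick way to see why (a hypothetical sketch, not part of the parent's program): Go exports an identifier only when its first rune is an uppercase letter, and Devanagari letters like ॐ have no case at all, so a top-level ॐ can never be public.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// exported mirrors Go's visibility rule: an identifier is exported
// iff its first rune is an uppercase letter.
func exported(name string) bool {
	r, _ := utf8.DecodeRuneInString(name)
	return unicode.IsUpper(r)
}

func main() {
	fmt.Println(exported("Printf")) // uppercase Latin: exported
	fmt.Println(exported("ॐ"))      // caseless Devanagari: unexported
	fmt.Println(exported("Ω"))      // uppercase Greek: exported
}
```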

------
chipsy
I don't care too much about the character set, but I do want additional
structure semantics - namely trees.

As it is, parsing is ugly because we write 1D text visualized as a 2D
projection; then we use parse rules to blow it out into a tree so that your
(parens and {braces}) can work properly; and then (if writing a compiler
rather than an AST-walking interpreter) we collapse it again into a 1D
executable format after one or more rounds of semantic analysis. It's a lot of
steps, and there's plenty of room for error. It offends my minimalistic
sensibilities.

What would the alternative be, though? It's nice to be able to just type out
code, and to copy+paste the data from other formats like HTML documents. So in
the end you still want to have an editor that can deal with plain text, even
if it is "natively" aware of the AST rules of the language.

My other alternative is to write more assembler or Forth-style languages that
don't have to be parsed as a tree. Not what most people would want...

~~~
pjscott
Part of the fun of using Lisp is that there are emacs keystrokes for tree
navigation and manipulation. It's very handy.

------
Deprecated
"Why keep trying to cram an expressive syntax into the straitjacket of the 95
glyphs of ASCII when Unicode has been the new black for most of the past
decade?"

Because it is simple and everyone can read it. I don't want to see code that
is "greek to me."

------
Teckla
I think it's a terrible idea to adopt more arcane glyphs merely to keep code
more terse. Just because code is short doesn't mean it's readable. If
anything, I think programming languages should use fewer arcane glyphs. Not to
mention it would be annoying and uncomfortable to memorize how to enter those
arcane glyphs with our current keyboards.

I also think it's a terrible idea to use color. Not only are a lot of people
color blind, but what happens if you need to print your source code on a black
and white laser printer? Those are still heavily used.

I don't think the author of the article thought these issues through very
carefully.

------
Umr-at-Tawil
I'm surprised nobody so far has mentioned Agda
(<http://wiki.portal.chalmers.se/agda>). Not only does it allow the use of
unicode characters in identifiers, but it uses them extensively throughout its standard
library. After using Agda for a while, one quickly realises that most of the
issues raised in this thread are rather unimportant. Having extra brackets,
more equality symbols and greek letters available really does make code nicer
to read when used judiciously.

------
samfoo
> Why do we still have to name variables OmegaZero when our computers now know
> how to render 0x03a9+0x2080 properly?

Isn't this obvious? What keyboard has 'Ω' or '₀' easily accessible? I'd be
pretty upset refactoring or reusing any code that forced me to copy-pasta or
type some crazed key combination just to render a character. It might not look
perfect, but the lowest common denominator (ASCII char set) for programming
works because every keyboard layout on the planet can enter those characters
in (mostly) one keystroke.

Am I missing the joke?

~~~
edsrzf
I couldn't agree more, and I really don't understand the author's argument. He
refers to APL, a language that's legendary for its use of wild symbols, as
"write only," but then seems to be complaining that newer languages aren't
doing the same thing.

I think Go actually gets this right: its syntax doesn't require any wacky
characters, but it allows identifiers to consist of any Unicode characters
classified as letters or numbers. The syntax caters to the lowest common
denominator, but if I really want to name my variables in Greek, I can.

------
mmphosis
_1995: More characters were made available on SAIL and later on the Lisp
machines. Alas, the world went back to inferior character sets again—though
not as far back as when this paper was written in early 1959._

Footnote from John McCarthy's Recursive Functions of Symbolic Expressions and
Their Computation by Machine, Part I

<http://www-formal.stanford.edu/jmc/recursive.pdf>

------
quicksilver03
Even if this idea catches on, it will be years before each and every
source code editor properly supports it.

For example, I work now on a Java code base which was originally developed on
French keyboards, with accents and other diacritical marks typical of the
French language in comments and sometimes in constant values.

This code base, developed on Eclipse on Windows boxes which defaulted to
CP-1252 encoding, was kept in CVS on Linux boxes, then migrated to Subversion
on Linux boxes, and now is edited on Eclipse on Linux boxes which default to
ISO-8859-1 encoding (when the developer has not fat-fingered the Eclipse
configuration). None of the original diacritical marks have been preserved,
and the comments in particular are a mess. You can, however, make an effort
to interpret the meaning of each comment and replace any funny-looking sign
with a similar-looking ASCII character.

None of this would have happened if the editor had accepted only ASCII
characters: it would have been simpler and because of that much more robust.

------
anotherperson
I kind of like Mathematica’s syntax of ESC-shorthand-ESC for inputting non-
ascii characters. Not quick, but definitely readable.

~~~
stevenbedrick
+1, +1, +1. Mathematica's fabulous in this regard: its InputForm -> FullForm
-> StandardForm/OutputForm pipeline is The Right Way to handle this problem,
IMHO. You can use its ESC-shorthand-ESC/etc. syntax to enter your code in
StandardForm, or you can type it longhand using InputForm, and it all gets
represented internally the same way.

------
dinkumthinkum
I didn't see a real argument that this is necessary or would be better. And
having non-textual aspects of the code, like the color or a green tint? That
would be interesting for some novel programming language, but, really, is
this why we need to break free of the "tyranny" of ASCII: to make everything
more complicated?

------
arohner
I find it amusing that the article doesn't contain examples of the unicode
characters he mentions. For example, he brings up "Dentistry symbol light down
and horizontal with wave" ⏇. I sorely wanted to see it printed next to the
description, but it isn't. An ACM portal limitation?

~~~
akgerber
"And, yes, me too: I wrote this in vi(1), which is why the article does not
have all the fancy Unicode glyphs in the first place."

------
bch
Comment example (which formatted poorly on ACM site) I gave: (in Tcl)

    
    
      #!/usr/pkg/bin/tclsh8.6
      
      proc გამარჯობა {} {
          # Georgian "hello", via Emacs (C-h h).
          puts "Oh hi!"
      }
      
      ;# call above proc
      გამარჯობა
      
      ;# "I can eat glass", Georgian, via <http://www.columbia.edu/kermit/utf8.html>
      set მ "მინას ვჭამ და არა მტკივა."
      puts ${მ}

~~~
bch
Sorry to reply to self, but as many others have pointed out, one may not want
to type a large subset of code in unicode. A real example from the tcllib
aycock parser where this ability is nice is mathematical symbols:

lset newrhs $position ${sym}\\{Ø}

which lets one's code read like the printed algorithms in a book.

------
mariuskempe
Why not just remap one Shift key to a "Greek" key, which inserts the
corresponding lowercase greek letter? (And then Greek+Shift+letter inserts the
uppercase, obviously). That would give you quite a lot of the useful
functionality without too much trouble...

------
zachbeane
I thought this would be a link to Steele writing about Fortress. He's said
something similar.

~~~
pjscott
Steele is taking a very pragmatic approach: his target audience with Fortress
is mainly scientists and people with scientific programming backgrounds, and
they would love to have their programs look more like what they write on a
blackboard. So Fortress comes along with a fancy Unicode syntax that trivially
maps to an ASCII syntax underneath.

------
cwp
I think the issue here is input methods. Ω isn't so bad (option-z on my Mac)
but there are really only a few dozen additional symbols available via such
relatively easy shortcuts: mostly accented characters, Greek letters and
punctuation. As a Smalltalk hacker I'd like to use ← for assignment, and ↑ for
return, but AFAIK the only way to enter them is via the CharacterViewer
utility. It might be easier on Windows or Linux, but I doubt it. Alt+8592 or
whatever is still going to be too difficult to type frequently.

Mr. Kamp slags syntax designers as putting compatibility with the ASR-33 above
expressiveness, but I don't see a way around it with today's standard
hardware.

------
zyb09
Cool idea, but the real question is how would you make something like this
happen? Someone would need to invent a new programming language with support
for all the Unicode characters and syntax coloring mechanisms, along with
making a totally innovative IDE for it.

Who's gonna do that? And how is it going to compete against C++, Java, ObjC
and the rest out there? When you come up with an idea that radically changes
the programming landscape, you need to have a vector to introduce it to the
world.

The only way I can see that happening is if someone like Apple declares this
its new iPhone language, or Microsoft its new .NET language. But yeah... the
odds of that happening... exactly.

------
michael_dorfman
I was so hoping this was going to be about APL.

~~~
narrator
I have actually programmed professionally in APL. You don't ever ever ever
want to go there. Reading that stuff is awful.

------
jberryman
This is something that is used and talked about quite often with Haskell,
because of the language's roots in category theory, predicate logic, etc. You
can use the more accurate mathy Unicode symbols in place of the ASCII ones,
and GHC can interpret them correctly:

<http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource>

------
mcmahanb
Based on the title I was expecting him to talk about escaping the tyranny of
text but he was very specific about ASCII. I think we can make the learning
curve for programming languages much easier to scale with some graphical
interfaces. Think command line prompt versus a GUI. Once learned, a command
line is awesome, but damn is it hard to learn.

------
petercooper
Despite its previous bad reputation for Unicode handling, Ruby 1.9 deals with
using Unicode in syntax pretty well. At one of the big conferences a couple of
years ago, Matz showed off using Unicode symbols in syntax during his keynote,
stuff like using the lambda symbol for lambdas, etc. It never caught on,
unsurprisingly.

------
seles
The root of the problem is that conventional programming languages are
designed to be written by people and read by computers. With ascii it is
arguably easier to do both... So for a solution to be found, priorities need
to shift even more towards making it easier for people to read.

------
noahlt
I think that in general there are more interesting problems to work on in
programming language design than additional syntactic sugar.

------
yarapavan
Dupe <http://news.ycombinator.com/item?id=1838234>

------
konad
So 1993

The first program to be converted to UTF was the C Compiler. There are two
levels of conversion. On the syntactic level, input to the C compiler is UTF;
on the semantic level, the C language needs to define how compiled programs
manipulate the UTF set.

<http://plan9.bell-labs.com/sys/doc/utf.html>

Plan9, what Unix did next.

------
confuzatron
_Why do we still have to name variables OmegaZero when our computers now know
how to render 0x03a9+0x2080 properly?_

Ω is a valid variable name in C#. (But Ω₀ isn't).

------
LInuxFedora
There are many issues, like needing a rendering engine for complex scripts.

~~~
robin_reala
What modern operating system doesn’t have one?

