Hacker News new | comments | show | ask | jobs | submit login
So you want to write your own language? (drdobbs.com)
199 points by ternaryoperator 1113 days ago | hide | past | web | 203 comments | favorite

My unsolicited, biased advice: use s-expressions. Don't trigger a parse simply to compute with the sums of products and integers [0]. If you can think of something better than s-expressions use that.

Also, read http://this-plt-life.tumblr.com/ and don't take yourself too seriously. Stand on the shoulders of giants and continue reaching for the sky. Languages are super fun.

[0] http://www.infoq.com/interviews/mccarthy-elephant-2000

But parsers are easy to write and unfriendly syntax will completely kill a nascent language's adoption. If you're aiming for users that like s-exprs, then by all means use them. Otherwise, try to stick to a syntax roughly similar to one familiar to the users you're targeting.

As a user of programming languages, I don't care about s-expr vs infix syntax, but I do care about the amount of semantic innovation that went into a language (as in I refuse to learn any language whose main contribution is syntactic), and sadly it seems that wasting time on syntax distracts many PL creators from the semantic poverty of their creations.

This was my primary frustration with the article. It notes that the "bulk" of the work in building a programming language is in defining the semantics but devotes absolutely no space to the process, tools, and tradition.

A programming language is the semantics and without a strong notion of those semantics and the interactions you end up with confusion and difficulty. For an example one need look no further than Ruby's "implementation as specification" crazyness.

I won't go as far as to say that all language creators should adopt operational semantics when building a language, but if you're serious about it then you should at least investigate the process.

I deliberately tried to avoid topics that are well-covered elsewhere.

Appropriate, but can you point us at some of those elsewheres?

If you want to figure out how to write a lexer or parser, what better way than to read the code for one?



There's a lot of detail there, but the operation of both is straightforward.

I'd thought this subthread was discussing semantics.

For anyone who comes across this thread and is genuinely interested in educating themselves about how programming languages can be constructed in a principled way (what I would call sane) you can consult Pierce's Types and Programming Languages.

With some work anyone can learn the material. It's not as simple as reading a blog post unfortunately, but you'll have a much deeper understanding of what it means to build a programming language.

I'm in the midst of reading Types and Programming Languages right now.

I'm also reading the more recent Practical Foundations for Programming Languages by Robert Harper. It's written for a similar audience.

I would recommend both.

If someone wants to build a language and "A programming language is the semantics" doesn't it really make sense to tell them what the semantics should look like?

Sorry I thought I was being clear when I mentioned operational semantics but in retrospect that may have been totally unclear. I recommend Pierce's Types and Programming Languages. It has everything you need to learn about operational semantics and types systems.

I dunno, I definitely find

  d = {}
  d[key] = value
easier to read/write than

  (set! d (make-hash-table))
  (hash-table-put! d key value)

On the other hand, I find (define d (make-hash-table)) (hash-table-put! d key1 value1 key2 value2 key3 value3)

easier to read than d = {} d[key1] = value1 d[key2] = value2 d[key3] = value3

So syntax matters, but it's not obvious which syntax is best.

Yes, syntax matters.

Although I largely agree, I think that line of thinking can be taken too far.

One reason I think that is because syntax helps develop conventions and make them obvious. If everything is an s-expr and you can do absolutely anything with macros anywhere, then two programmers might not really be able to communicate even if they are technically using the same language.

When several programmers develop on a mutual codebase, there is not one programming language where coding conventions are unnecessary. This is hardly an argument against Lisps. In fact the clojure code I read so far seems to adhere far more to conventions than code of almost any other programming language I have worked with so far.

Macros are pretty much like Classes, Gotos/Loops, Exceptions, Functions, etc in this respect, the communities have mostly agreed on conventions _when_ and _how_ to use them.

I think the argument that macros will make a codebase unreadable to other programmers is largely an exaggeration to turn a pro-Lisp argument against Lisp. Its a pattern I have heard a little too often:

    "C++ is powerful" -> "C++ is too powerful"

    "Macros let you express ..." -> "Macros can do just too much"

Now in my non-Lisp programming, I miss macros a lot. Especially those that are built into a lot of lisps and are quite "simple" additions. The Clojure threading macro for example, or `if-let` variable binding, or the awesome xml-templating languages that are realizable with macros. On the other hand, I have seen code that - for me as an outsider - came quite close to what you describe, just that it hardly was Lisp but PHP, Javascript, etc. Macros, like functions and classes are part of the vocabulary programmers build.

So next time when you criticize macros, talk about something that is worth debating, like the fact that they can be unsafe (hygiene), they are not intuitively writable like functions, they might make debugging harder, etc.

"So next time when you criticize macros"

I wasn't criticizing macros; I claimed that they don't give you much guidance to any particular convention because you can do anything.

A language without conventions is incomplete, so you need to do something active to bring the conventions about. In other words, the designers of lisp leave conventions as an exercise for the reader and really offer no guidance from the language itself.

Clojure and racket are trying to address that, which is good. I've played around with them a bit, generally found them pleasant, and I hope they succeed (though I did not find anything that would compel me to actually use them for the kinds of things I work on).

It's interesting that you bring up C++. I wonder what your feelings are about people who prefer C over C++?

Regarding C and C++, I am slightly leaning towards C. While C++ offers some abstractions over C that are really nice to have, it often feels like an awkward chimera. I see C as a even-higher-level assembly language. c++ was a logical step but is kind of obsolete for non-realtime systems

Guidance in a language is a hard thing to do right. I think the only reasonable way of guidance is to make good coding practice simple, limiting guidance is often quite bad [1], because while in a lot of cases it makes things easier, there are those cases where you then have to work around the limitations and this is when you get those "bug ridden incomplete implementations" of that "liberal" language.

Which good guidances from which programming languages are you fond of?

[1] now that I think about it, I do like immutability in functional programming languages - a not-so-liberal thing..

I'm not sure there is a point worth debating. Hygiene is left to the programmer and the facilities exist to write "safe" macros (gensym). Debugging isn't terribly difficult with macro-expansion facilities. The only "problem" with them, if you can call it that, is that newbies like to over-use them once they learn them. However that problem goes away with time.

Macros in the hands of an intermediate-to-experienced programmer are powerful tools. They remove patterns from your code. They allow you to modify the compiler to efficiently run your program. And all sorts of good things.

People arguing that X "is too powerful" are saying that nobody needs a combine harvester when a team of farmers with scythes will do.

"People arguing that X "is too powerful""

Who is arguing that?

One example in a recent HN post: https://news.ycombinator.com/item?id=7085682

I fully agree. Syntax is the least of my concerns when I'm using a programming language. I use many languages in many IDEs and editors, and when I'm coding, I don't really see syntax, I see the algorithms and patterns behind it.

It's a misconception parsers are easy to write. Plus, they do an unfathomable amount of damage.

Parsers are far from harmless pieces of logic you can just throw together. And not just because they take time to write - parsers are physically dangerous. Compiler writers develop incipient carpal tunnel syndrome trying to write parsers.

When you finish writing a parser as a language designer the problem of writing a parser for the language doesn't go away. The programming language users might want to apply transformations to source code written in the language -- which now means they need to write brand new parsers from scratch to do these transformations.

What s-exprs give you instead is the option to write a "parser" in one function call: using the read function. It doesn't get shorter than that.

Now, that's a parser that's easy to write.

edit: rewording

It's a misconception that parsers are hard to write and believing this does "unfathomable amount of damage"(whatever that means).

If you understand abstract languages, writing a recursive descent parser is a simple, paper and pencil exercise.

If you don't understand abstract languages, you should not be designing a language till you stop and learn them and you should STFU about what people designing languages should do until then.

Agreed. I wrote a recursive descent parser generator in python and it took about an hour. Takes a BNF grammar and spits out a parser. End of story. Recursive descent parsers are dead simple.

Yes, the complexity of the parser is related to the size of the language. A smaller language needs a smaller, easier to write parser.

Have you ever written a recursive descent parser for C?

I realize now what was inaccurate about what I wrote. It's that the things you have to do after you parse might be the more harmful parts. Processing the parse tree you get back.

edit: rewording

I would point out that a recursive descent parsing does not have to return an AST or any particular tree, and often it doesn't.

Instead, when you create a recursive descent parser, you create a series of functions called whenever a syntax element is discover. In these functions, you construct whatever your final data structures are going to be.

Of course, you still can create and return a full abstract syntax tree but one nice thing about recursive descent is that if you are only going to do a few things, you can just have those few operations in your parser and be done with it.

I've gotten really far with a precedence parser before, and as a bonus they are intrinsically incremental. However, beware of braces to match separately.

> It's a misconception parsers are easy to write.

Really, they are easy. They are literally insignificant when you factor in all the hours you'll work on a language.

I wrote another one recently for a small side project. It took more time to write the unit tests for it. The parser practically wrote itself.

but, to be fair, you've obviously had lots of experience, and understand a great deal more about formal languages and how they translate to something that is trivially parsed, than the majority of people who will read this article. should they attempt a language, they'll likely fail to produce a working parser long before they spendthe rest of the hours on it.

If you can not trivially implement a parser, you probably shouldn't be implementing your own language. The problems only get worse from there.

I'm egalitarian inasmuch as I believe every serious programmer ought to implement some sort of toy language at some point, but I'm not so stupid as to think that this is a good idea at all phases of a programmer's development. Beginners should concentrate on other basic tasks, even low-intermediate really should too. I wouldn't reserve this task for "experts" though, because this is one of the big steps in moving from intermediate to expert. (Anyone who has assembled the skill set to implement a toy-but-nontrivial language has assembled the skill set to accomplish a very wide variety of programming tasks. If I were interviewing someone and they could demonstrate this, I would almost entirely cease to care what actual languages or frameworks they may have worked in.)

I agree in principle, but that won't stop someone from fighting with shift reduce conflicts or dangling elses for hours on end before giving up. constructing an unambiguous CFG isn't trivial without experience, and it's tough to get that without bashing your head a few times.

> someone from fighting with shift reduce conflicts or dangling elses for hours on end before giving up.

Right, but if you just hand-write a recursive descent parser, you won't have to deal with shift-reduce conflicts. Dangling elses are trivial to solve. The nice thing about hand-writing a parser is that it lets you learn one new thing (how to implement a parser for a grammar) while taking advantage of what you already know (how to write, run, test, and debug code in some existing language).

Throwing a parser generator at someone means they end up learning the weird vagaries of that generator instead of focusing on their own language. Meanwhile, the resulting generated code is nigh-unreadable, so all of those debugging skills and nice IDE they have go to waste.

Are you including the time taken to make the parser produce helpful error messages? That always seems like the hard bit to me ...

The harder error messages to do well are the ones that come out of the semantic processing.

The ones coming out of the parser aren't that hard to do.

What is your approach for generating error messages during parsing?

Ad-hoc, with the proviso that error recovery should always consume tokens, so the parser can never get stuck in a loop.

Good point -- also, dealing with corner cases, writing tests, and maintaining the parser as the language evolves are a couple more things that might take a lot more work.

Do you know of any strategies for error reporting, or of tools that implement? I'm always on the lookout for cleaner approaches to this.

> dealing with corner cases, writing tests, and maintaining the parser as the language evolves are a couple more things that might take a lot more work.

Not really.

Seriously, if you're going to get bogged down trying to get the lexer/parser to work, you're not ready to work on a full blown language/compiler. Lexing/parsing is the EASY part, as in a minute, insignificant part of the time you'll invest in the project. I really do mean that.

Wait a second. Please don't misrepresent what I'm saying. I never said anything about "get[ting] bogged down trying to get the lexer/parser to work". I never mentioned a full-blown language/compiler.

I'm genuinely interested by your comment. If I'm interpreting it correctly, you seem to be claiming that ensuring corner cases are handled correctly, testing, and maintaining a parser require a trivially small amount of work. What do you use to build your parsers?

> What do you use to build your parsers?

I use a text editor.

Using a coverage analyzer is adequate for evaluating the thoroughness of the tests.

Yes, it all is a trivially small amount of work compared to the rest of a language project (and even compared with the rest of the compiler). You'll spend much more time just trying to figure out how to convert floating point values to strings.

> You'll spend much more time just trying to figure out how to convert floating point values to strings.

I should hope not, since the source code to printf() should have pretty much the complete answer to that!

It may not have a license that is compatible with your needs.

And yes, I did have to do my own float => string implementation.

Just to clarify: I'm asking what strategy/approach you use -- hand-written vs. generator, bottom-up vs. top-down, backtracking, ambiguous, context-sensitive, semantic actions, etc. Sorry for the confusion.

Walter - I'd really be interested in an article/blog about writing a lexer/parser, from your experience and perspective. Just a suggestion ;)

I might actually do that. I see that people are persistent in believing it's hard, maybe after making these claims I have an obligation to back them up.

I think there are a few things worth mentioning here.

One is that there is a big difference between writing a parser for a language you are inventing and writing a parser that is attempting to implement an existing language. Languages have lots of corner cases; if you are inventing the language then every quirk of your parser is correct by definition. You might not even be aware of some of the subtle choices that your hand-written parser is making.

As an example, of this, it was not discovered that ALGOL 60 had a "dangling else" ambiguity in its grammar until it after had been implemented, used extensively, and even published in a technical report. It was essentially an accident of the implementation that it resolved the ambiguity in the way that it did. So while it might not be too much work to "get the lexer/parser to work", it doesn't follow that all of the issues around parsing are trivial. There is still a lot of complexity and subtlety around parsing if you're trying to design something that could reasonably have multiple interoperating implementations.

Secondly, there is a very very seriously wide variation of lexical/syntactic complexity between languages. You can pretty easily write a 100% correct JSON parser in an hour or less (possibly much less, depending on what language you choose to write it in). On the other hand, it takes man-years to write a 100% correct C++ parser (not least because C++ tightly couples parsing and semantic analysis). Now I know this article is more talking about designing your own language, and no language will start out as syntactically complicated as C++, but empirically most of the languages we actually use have a fair bit of complexity to them, so delivering the lesson that lexers/parsers are easy in general is, I think, the wrong message to be sending.

Thirdly, there are a lot of practical considerations that can make parsing more complex. For example, take Steve Yegge's attempt to do some incremental parsing (from http://steve-yegge.blogspot.com/2008/03/js2-mode-new-javascr...):

  I had two options: incremental parsing, or asynchrous
  parsing. Clearly, since I'm a badass programmer who can't
  recognize my own incompetence, I chose to do incremental
  parsing. I mentioned this plan a few months ago to Brendan
  Eich, who said: "Let me know how the incremental parsing
  goes." Brendan is an amazingly polite guy, so at the time I
  didn't realize this was a code-phrase for: "Let me know when
  you give up on it, loser."

  The basic idea behind incremental parsing (at least, my
  version of it) was that I already have these little
  functions that know how to parse functions, statements,
  try-statements, for-statements, expressions,
  plus-expressions, and so on down the line. That's how a
  recursive-descent parser works. So I figured I'd use
  heuristics to back up to some coarse level of granularity —
  say, the current enclosing function – and parse exactly one
  function. Then I'd splice the generated syntax tree fragment
  into my main AST, and go through all the function's siblings
  and update their start-positions.

  Seems easy enough, right? Especially since I wasn't doing
  full-blown incremental parsing: I was just doing it at the
  function level. Well, it's not easy. It's "nontrivial", a
  word they use in academia whenever they're talking about the
  Halting Problem or problems of equivalent decidability.
  Actually it's quite doable, but it's a huge amount of work
  that I finally gave up on after a couple of weeks of effort.
  There are just too many edge-cases to worry about. And I had
  this nagging fear that even if I got it working, it would
  totally break down if you had a 5,000 line function, so I
  was kinda wasting my time anyway.
All of this is to say: I can't argue with your basic point that "getting lexing/parsing to work" for a language you are inventing isn't terribly difficult. But I disagree with your larger (somewhat implied) point that parsers as a whole are easy.

A couple points: 1. the dangling else problem is trivial to deal with. 2. I've written a C++ parser, I know the issues involved, and I know that a parser generator isn't going to fix that.

> delivering the lesson that lexers/parsers are easy in general is, I think, the wrong message to be sending.

I stand by the message :-) in the sense that if a person finds lexing/parsing to be hard, they're likely to find the semantic/optimization/codegen parts of the compiler to be insurmountable.

I've written compilers for numerous languages, including C++, including 2 languages I invented, and lexing & parsing is just not that hard relative to the rest of a compiler.

Sure but you argue that problems of the sort you mention come from ambiguous grammars.

So, hand-created parsers may not flag ambiguous grammars and automatically generated parsers might (I've only done hand created parsers so I don't know).

And Steve Yegge quote just shows how much abstract languages are something you need to learn rather than something you can power your way through. And plenty of good programmers can power their way through almost any other kind of programming challenge so someone who seems very smart doing a very dumb thing in parsing doesn't surprise me (I've tried that myself).

Top down operator parsing (http://javascript.crockford.com/tdop/tdop.html) is pretty easy to implement, is inherently efficient, and trivial to extend.

Even that might be overkill, though: a generic operator parser for unary, binary, ternary, n-ary, and so on will take about 50 lines of code. You can encode a surprisingly large number of control structures with a cleverly crafted precedence parser.

  The programming language users might want to apply transformations
  to source code written in the language -- which now means they need to
  write brand new parsers from scratch to do these transformations.
That would be Doing It Wrong. Tools like clang-format (source formatting) and clang-modernize (source transformation to use new language features) use exactly the same parser library — among other things — as the compiler proper.

may I add this article by Laurence Tratt about parsing :

http://tratt.net/laurie/blog/entries/parsing_the_solved_prob... ltu discussion http://lambda-the-ultimate.org/node/4489

No-one sane writes parsers any more. People use parser generators.

GCC uses a hand-written parser. So does the C# compiler for .NET. V8 and, as far as I know, the other browsers' JS implementations have hand-written parsers. Dart (both the VM and the self-hosted dart2js compiler) have hand-written parsers.

Offhand, I'm not aware of any real-world language with lots of users that has a generated parser.

If you start with s-exprs, and then gradually (and thoughtfully) and syntactic sugar, there shouldn't be too much harm.

Complexity aside, concrete syntax places a hard barrier on composability of constructs (I say that based on usage, not a PL researcher). This is something quite far from the applicative/functional world (lisp, stack based, ml, haskell and its predecessors... ) where the combinations are greater (since most things are expressions).

I don't care about S-expr vs. infix syntax.

I use Haskell and Clojure happily.

Sounds like petty nonsense to me.

I'm going to give a differing opinion here, depending on what you write your interpreter and/or compiler with, using s-expressions may not be the best choice.

While using s-expressions is obviously the right thing to do if you are writing your compiler in some dialect of Lisp, it may not be the best choice if using another language.

In particular, I have past experience from writing several toy programming language interpreter and compiler prototypes in Haskell. Representing the abstract syntax tree with a tree structure and sum types is easy and convenient and writing the actual parser (using e.g. Parsec) is trivial and doing a pretty printer isn't hard either (using Hughes-PJ pretty printers).

If you have not tried it, I recommend doing a small programming language prototype in Haskell. It's a very good exercise and really fun.

I agree that s-expressions are not necessarily always the best choice.

But your argument about this having to do with what compiler one is using doesn't make sense to me. S-expressions are easily parseable in virtually any language. S-expressions are always going to be one short-cut around the problem of deciding what syntax your language should have (my object is this short-cut may not help your language in the end - your S-expression language will be considered less-than-understandable by the same reasonably large group of people who consider LISP less-than-understandable).

For example, Google give this complete lisp interpreter in c:


Haskell may have facilities that make parsing any language reasonably easy, sure. But I have news, parsing language is easy once you learn a bit anyway.

Ok, let me clarify this for you. Warning: this is going to be rather Haskell centric. My argument isn't really about the syntax of the target language at all but the internal program representation in the compiler/interpreter.

Representing tree structures like abstract syntax trees is really easy and convenient in Haskell. The traditional example, curried λ-calculus with integers looks something like this:

    data Expr =
        Constant Int |
        Identifier String |
        Application Expr Expr |
        Lambda String Expr
Haskell makes it easy to handle such tree structures using pattern matching. Of course it is at least as easy to represent s-expressions with a similar tree structure. However, there are multiple benefits of using a tree structure like that, for example you will get a compiler warning if you forget to handle one or more of the possible cases in a non-exhaustive pattern match.

It is also easier to write a parser and pretty printer (using Parsec and Pretty combinators) to convert that tree structure to/from human readable/writable strings (ie. program source code) than it is to convert s-expressions to such a tree structure.

So while using s-expressions is certainly possible, in a nicely typed language like Haskell, it will be nicer to have a proper syntax tree data structure and the advantages of s-expressions are not really there unless you are writing an interpreter for a homoiconic language like Lisp or Prolog.

The way I usually start a programming language project (I do lots of those) is writing "programs" by specifying syntax trees for test cases in Haskell using literals and then running them through the interpreter/compiler/type checker. Only when I have something that actually works (and developing and debugging by writing the tree structures becomes unwieldy), I start thinking about the syntax and write the parser and pretty printer.

So if you're using a host language other than Lisp to implement your compiler/interpreter and your source language isn't homoiconic, using s-expressions as your internal program representation might end up being unwieldy. The syntax of your source language is pretty much irrelevant in this discussion, but writing a proper parser to produce tree structure might make sense rather than reading s-expressions.

Does my explanation make sense to you? If you have any questions, don't hesitate to ask.

Here's a few of my toy language projects for your enjoyment, you'll find examples of the things discussed above there:

https://github.com/rikusalminen/funfun LLVM compiler for a toy functional programming language (with type inference!)

https://github.com/rikusalminen/slolog Prolog-esque logic programming language

It is also easier to write a parser and pretty printer (using Parsec and Pretty combinators) to convert that tree structure to/from human readable/writable strings (ie. program source code) than it is to convert s-expressions to such a tree structure.

Well, parsing S-expressions is easy in most language. Unless somehow parsing an S-expression is hard in Haskell, it doesn't matter whether parsing easier. Normally, it just means both S-expressions and generic parsing is easy.

Your overall argument could better be written "S-expressions aren't part of the particular syntax transformation method I use." OK, sure.

An S-expression based interpreter is still pretty darned easy to write (unless Haskell has weird barriers that other language don't have, if so, it doesn't reflect well on Haskell).

> Well, parsing S-expressions is easy in most language. Unless somehow parsing an S-expression is hard in Haskell, it doesn't matter whether parsing easier. Normally, it just means both S-expressions and generic parsing is easy.

> An S-expression based interpreter is still pretty darned easy to write (unless Haskell has weird barriers that other language don't have, if so, it doesn't reflect well on Haskell).

Oh, I knew I was too Haskell-centric in my answer because you missed my point entirely.

You can write an S-expression parser in Haskell, it's just as easy as writing it in any other language. But this wasn't the point at all.

Using S-expressions as the program internal representation just doesn't make sense in Haskell. You can do it and you should do it if implementing a Lisp dialect or other homoiconic language, but in general, it's a lot better to build a proper tree structure. The major advantage of s-expressions in the Lisp world is not really the syntax, but it is an universal tree structure.

Same thing applies if your host language isn't Lisp and you have a proper abstract syntax tree structure (either enums + unions or a class hierarchy or whatever). You can parse s-expressions but you need an extra step to get into the tree structure you're going to be using internally.

Your impression that S-expressions are hard in Haskell is just incorrect, my point was that there are better facilities for tree structures and parsing than S-expressions are.

As I said first thing in the earlier post... my argument isn't really about syntax, it's about the internal program representation.

Well, infix math is very nice and easy on the other hand. Also, range and index syntax like Python is nice.

Infix math is only nice because you've been trained to believe it's nice.

That strikes me as being pretty much the same as saying that English is only nice because you've been trained to believe it's nice, and why not write your language in Klingon instead?

When you're dealing with human interfaces, existing convention counts enormously.

> That strikes me as being pretty much the same as saying that English is only nice because you've been trained to believe it's nice, and why not write your language in Klingon instead?

There is a reason we don't program in English, and keep inventing new programming languages. English is poor and ill suited for the problem, even though it has enormous existing convention.

Nor do we program in infix math. But we write languages that use both infix math and bits of English where appropriate, because taking advantage of existing conventions helps a great deal.

We don't just use English to write software, but we DO use convention from English. In some sense, keywords like "f" and "j" are better-suited for expressing conditionals and loops because they are shorter to read and type. But I'm sure glad language designers settled on "if" and "for". The reason they did that is convention.

While I am pretty sure this is entirely true, it's crazy not to recognize that you're producing a language where basically everyone has been thusly trained since kindergarten.

I question its truthiness. It's not as if there's some shadowy cabal conspiring to force people to put operators in the middle, we developed the infix notation over a long period of time and it seems likely there are good, though possibly not entirely logical, reasons for it. We humans have funny brains, after all.

And if you really want to dive down that rabbit hole I think you have to establish that prefix is better than suffix as well. I'm somewhat dubious of even this claim, since I find suffix easier to reason about. But maybe that's just me.

I think path dependence is plenty to force us into behavioral patterns where alternatives aren't more than marginally better. There may be ways in which infix is genuinely better (I'd conjecture perhaps flexibility of ordering permitting more communication, in terms of emphasis and structure, in the same formula), but I would stand strongly by the notion that the primary reason we respond that it's "nice" is familiarity. This is particularly recalling the trouble some of my peers had with order of operations, but I think if it was flipped and we'd spent a decade dealing with prefix notation before being confronted with infix, while we might find things with prefix notation to struggle with as children we would find prefix "nice" having struggled.

I think looking at pros and cons (in the abstract and in people's heads) of prefix vs. suffix could be interesting. What I like about suffix is that you can treat it as a stack. What I like about prefix is that I know what kind of node I'm building as I consider the arguments. I've not done enough of either to have much opinion on which matters more (some lisp, some rpn calculators, but not enough). I certainly wasn't agitating for prefix (in particular) above.

One argument for infix is that it minimizes the distance between the salient token (the verb) and its arguments. When you see "a := b", the ":=" serves as a convenient visual anchor to check the variable name (to its left) or its value (to its right).

Note that English itself is an infix language: "Bob likes Clara" (SVO, infix), rather than "Likes Bob Clara" (VSO, prefix) or "Bob Clara likes" (SOV, suffix). A cursory search tells me SVO and SOV cover 75% of all languages. It would be interesting to see if people speaking SOV languages would prefer suffix notation. I would expect common patterns in (unrelated) world languages to loosely mirror natural dispositions towards syntax. In practice, that's probably a hodge podge of prefix, suffix and infix depending on whether you're dealing with verbs, connectives, prepositions, etc.

Somewhat tangentially, I wonder if some of the appeal of object oriented languages is that they usually explicitly mimic SVO. myArray.find(2) is closer to real natural language than array_find(myArray, 2).

Well, really, Java has a tendency to take these things to an extreme. Like the author of that post I like that C++ gives you a choice.

Of course, usually there's an implicit subject (receiver): this or self. In a language like ruby, which is very very object oriented, and has a similar thing where you always have a receiver to any function call, every instance has a certain set of stuff (the Kernel module) mixed into it that allows for general tasks to be treated as an implicit receiver. It works pretty well for solving this problem on the other side.

I think that VSO confusion arises only when there is a single subject and verb.

Take for example a common prefix notation: (< a b c d) this stands for "are all increasing?"

Since none are really the subject, most Java-like languages (if they had this at all) would have to invent a subject Integer.areIncreasing(myList). No longer does this give you a valuable subject, just a made up subject.

Certainly some operations (mostly arithmetic) are easier on the eyes since we've had a lot of practice with it.

Whenever I do arithmetic, I use threading macros to make it easier to read. Suddenly, it reads like infix, but with more flexibility.

(+ 4 (- 1 (/ 4 2))) ;; what?!


(_> 4 (/ _ 2) (- 1 _) (+ 4)) ;; ah

In infix, it would be:

(1 - (4 / 2)) + 4

The threading macro isn't quite as nice as the default infix, but it allows for both notations.

I think the second reads very well all things considered, start with 4, divide by 2, subtract from 1, add 4. (_ is the placeholder).

That's really interesting. I can't say that I, personally, find it easier to read than the rather plain polish notation of the first example. I find I have to work really hard to connect the _s together and I find this process unpleasant. Maybe it's just an artefact of not finding bare lisp in general all that appealing.

Re your example with <, the 'natural' object oriented way to do this to me would be to treat the list itself as the subject. [a,b,c,d].isOrdered() say. To the fact that doing it this way in java would be incredibly ugly, I'll only say that I'm not even remotely a fan of java or, for that matter, C++/Java/C#-style static-typed object-orientation.

The threading macro idea is meant to just make it easier for humans to read, so if it's not easier for you, no sense bothering with it. In such, it's not a true prefix vs infix tool, just a tool possible in languages with macros.

As to the are increasing, yes, I suppose the list itself would be a more natural subject.

This is why I love macros, not really infix or prefix specifically, because a macro makes it trivial to just have this:

(. [1 2 3 4].isOrdered)

turn into this:

(isOrdered [1 2 3 4])

That way both the human and the compiler get their preferred view.

That's not really true. Infix math definitely has its place, particularly for sequences of operations, like monadic chains. It's kind of nice having the input and output on the extremes of the expression, with the parameters attached to the operators, as opposed to prefix notation, which puts the input inside layers of nesting.

I love this. Write an integral with infix notation. :)

(0, 1) $ (\x -> x*x)

I might actually put the limits after the integral sign. Then it makes an indefinite integral look like a partially-applied function just waiting for a tuple of limits to make it a definite integral. That might get hairy for iterated integrals, besides being highly unconventional.

Yeah, that would seem to be more practical with respect to currying (and perhaps dynamic dispatch). I was just trying to make it look as similar as possible to the Leibniz notation. :)

Why not use postfix notation, and make a stack based language like Forth? Even less parsing to do.

This is a nice compromise between pure s-expression syntax and something a little friendlier: http://shriram.github.io/p4p/ (basic idea, get rid of the implicit 'begin' form from everything)

Another syntax that tries to make S-expressions friendlier is “sweet-expressions”: http://readable.sourceforge.net/. Its main change is using Python-like indentation to allow you to remove a lot of parentheses. It is backwards compatible with S-expressions.

I find that much less readable than S-expressions. Well-formatted S-expressions actually make it pretty fast for me to find what I'm looking for in code, probably faster than in C-like languages - and I mostly use languages which are more-or-less C-like.

Because the world needs another version of LISP...?

Wait, if you say "just don't do it, learn LISP instead and just customize your stuff", well that is good advice, OK.

Still, learning real language construction tools is a cool thing. It's mainly cool in the long, miserable, thankless journey way the article describes yes. But beautiful still. Nice syntax the way human like to read has had a long, successful history. But LISP has had it's moments too...

I'd just like to point out that s-expressions are not the same as Lisp. Rather, they are orthogonal. Using s-expressions for concrete syntax does not say anything at all about the semantics of your language.

This also means that there can be multiple concrete syntaxes for a language: e.g. for Python, you can use the standard syntax, but you could also use s-expressions, without changing the semantics one bit.


s-Expressions are not the same as LISP sure but I don't think you can say they are orthogonal either. They are a key aspect.

Sure, you can have a non-LISP s-expression language but the boundary between this and a subdialect of LISP is going to be porous.

The strengths and the weaknesses of LISP, as far I can see, come because it is so easy to just create a mini-language, a sub-dialectic to do this and that.

I'll assume people know the strengths (and I'm the one say them most eloquently), I'd mention simplicity, flexibility, recursion, homoiconity, etc.

I'd say the weaknesses "easy wheel reinventing", which has resulted in many half-formed wheels and few canonical wheels.

So to get back to syntax, what's nice about a "strong clear" syntax like infix math or c-style function declarations is that they make purposes clear and make boundaries clear. The useful of designing a "real language" is doing that. You can do everything with s-Expression but if it just your new language for doing X, you'll putting forward your purpose in a loose, flabby way. If you expression-language isn't going to be just rolled into LISP, it will just have all the weaknesses and of the strengths of LISP.

But we have to jump through some hoops to convert expressions into statements/expressions (I work a bit on Hy[0] which does exactly this).

[0] https://github.com/hylang/hy

Yes, that's typically how it works -- at some point, the concrete syntax tree has to be converted to abstract syntax.

I wouldn't call it jumping through hoops though. Rather, for me, it's a natural consequence of separation of concerns. I try to build parsers that only know about concrete syntax. A later pass over the CST is responsible for building the AST, which prevents coupling of the parser to the abstract syntax.

YMMV. There's more than one way to skin this cat.

a million times this. Having written programming languages, this is the best advice to give. First, you focus on what makes you language novel, interesting, and special; once you know that have something worth while, you sit down and invent the syntax that makes it pretty and readable.

So what if the syntax is (part of) what makes your language novel, interesting and special?

From my experience, syntax tends to align with bike shedding. Semantics, on the other hand, tend to be nuclear reactors.

If you don't know how to start but want to give it a try, I highly recommend the How To Create Your Own Freaking Awesome Programming Language book http://createyourproglang.com/ . One of the best purchases of this year for sure.

I've been writing my own toy compile-to-js language after reading the book: https://github.com/cosmith/panda (I might have to change the name though, I recently discovered http://yesco.org/panda.html ...)

What all backends are considered in the book? LLVM? native code? Any mention of developing debuggers for the language? I suppose if one wants to write an interpreter, that would follow automatically from the learning about how to write compilers.

I am writing an OTP-ish dialect of Erlang that compiles to javascript http://luvv.ie Just bought that book on your recommendation - better be good or else ;-D

Suggestion: you really need "here's what the syntax looks like" example(s) on your home page with some non-trivial code (more than just hello world). Ideally, show the sample and the JS output side by side like on the CoffeeScript home page: http://coffeescript.org/

I suppose I am being a bit of a blushing bride about the syntax as it isn't mine, just being good old-fashioned Erlang.

The demo doesn't show much more than hello world, because that's all there is at the moment - the focus has been on building the end-to-end toolchain and I am now filling out the features, but good suggestion going forward.

I read "OTP-ish dialect of Erlang" as meaning it's not quite the same, and I was looking for details of those differences. If it's just plain Erlang, then I see your point.

Ha! I hope you like it ;)

It seems like you have a good head start already, I hope you still find something interesting in there. If not, you can always ask for a refund...

What the hell ;) Support your local technical book writer!

I really like that this article suggests against using compiler compilers. My experiences have always been much worse when using them. I think the biggest risk when not using them (especially if you're writing a recursive descent by hand) is avoiding the temptation to make things more context sensitive (not necessarily in the rigorous language theory sense) than absolutely necessary.

I have used lex/yacc a number of times, and written several parsers by hand, and I don't see anything really wrong with using compiler compilers. Sure, it's good to have the experience of writing parsers by hand, so that you know how it works. But a compiler compiler is probably more likely to be correct and efficient, especially when still in the process of developing the language and things are changing rapidly.

The best of both worlds is a library like Parsec in Haskell, which lets you write in the native language, but feel like you're writing in a DSL. Parsec is a breeze and easily my favorite thing with which to write parsers...

I definitely don't think there's anything wrong with using them, but my experience has been more pleasurable without in general. I'm also not entirely sure you can ever really understand the benefits of using a CC, or the reasons they impose the limitations they do, if you've never even attempted to go without one. And I think that when you're first learning about parsing you'll probably be learning more at a faster pace doing it yourself than trying to learn your way around the rather frustrating tooling built around CCs.

I guess if I'm trying to be more clear I'm glad to see this perspective expressed, since it's a relatively rare one to see expressed authoritatively. There are a few topics in parsing that I don't think get enough discussion.

Like, my pet peeve is people thinking a naive implementation of the packrat parser is guaranteed to perform better than a plain recursive descent parser. In reality you might just be trading explosive memory growth for algorithmic steps. In a lot of cases that's actually slower.

Parser generators are a different issue than lexer generators imo. With a lexer, usually it ends up being a lot easier to do it by hand, because otherwise you end up fighting the system or writing awful regular expressions. Parser generators at least let you use nice formats like BNF.

Edit: also your lexical spec is much less likely to change than your grammar is.

so i really just want to say something like, "yeah, but Parsec is still awesome," but then i saw someone got downvoted for a similar comment with Rob Pike in place of Parsec, so what can i add to the discussion?

perhaps just that it's important to keep in mind that computer systems are things we've worked out for ourselves. these notions of "lex" and "parse" are not things given to us by nature...

...although, they do kind of do fall out the circumstances of (1) needing to emit x86 instructions and (2) the preference for writing programs in text editors. the second point gets a lot of discussion what with ideas about editors understanding parse trees and all, but i wonder what happens to the whole lexer-parser dichotomy if we keep 2 and periscope our notions of hardware. does the need for tokenization appear as a result of something fundamental to von neumann architecture, or is it just a result of currently-vogue instruction sets?

ah well, back to making things happen with the tools at hand.

The argument for lexers has nothing to do with machine instructions: it has to do with algorithmic performance. Grammars tend to have time complexity that is greater than linear with a moderate constant (and a parser combinator library, which is really just a lightweight programming abstraction making it easier to develop by-hand recursive descent parsers, normally doesn't try to help with this problem), whereas a good lexer tends to have linear time complexity with a nearly zero constant. If you can separate your compilation phase into "first run a lexer, then run a grammar", you can parse much longer files (not just constantly longer, but asymptotically longer) in the same amount of time. There is no fundamental reason to separate these phases, however, and it has minimal effect on the resulting compiler: numerous compilers, even numerous compiler generators, have these phases combined into a single grammar step, with the tokens now being effectively characters. (There are also engines that try to slide between the two levels, using an algorithm more similar to a lexer at the base of your grammar, but still letting you define these psuedo-tokens in the unified grammar as rules.)

I've personally found that even when working with parser generators or handwritten parsers that don't need the separation it still helps to consider them separate. Having a stable and solid set of fundamental tokens makes the resulting language easier to understand. Whenever I see a language (usually made with a PEG generator) that blurs these lines everything feels very shifty. Yes I know these are very touchy-feely attributes I'm describing, and you can obviously avoid them without that separation as well, but these things are important as well.

It's an interesting accident, to me at least, that this separation turned out to be both optimal and useful.

Yes, as Rob Pike said, just write everything by hand. Ken Thompson also recommends writing your lexers by hand as well! I also agree about the temptation to make it too complex just because how powerful writing it by hand is, I've found myself succumbing a bit to the temptation by adding in the ability to change the behaviour of the lexer in the source itself.

This doesn't make any sense to me. The algorithm to construct optimally implemented lexers is not something any human would ever type by hand. Tools like flex were designed with one goal in mind: performance; it even has a mode that analyzes your set of tokens looking for "mistakes" that would cause backtracking, as it has a really keen interest in getting exactly linear performance. It also strives for an insanely low constant multiplier: the manual page even talks about the number of machine instructions used per character of input. Of course, some people might have needs that allow for lower throughput in exchange for "less memory", so it had various modes to "compress" its lookup tables, with varying performance tradeoffs, which you can experiment with without changing your lexer definition. This is one of the super powers you get from "the correct level of abstraction".

Like, seriously: what are you trying to achieve by writing your lexer by hand? Your result will be both more difficult to maintain and is pretty much guaranteed to be slower than the output of a tool like flex. At least when people throw away the advantages of parser generators and write recursive descent parsers they gain the ability to have "easy context sensitivity" (which makes implementing many languages much easier), but I don't see why anyone would ever hand write a lexer. "I know how it all works" is also fine, but in that case you write your own lexer generator tool, you don't skip directly to the lexer (unless you are doing your first one as a "homework assignment"). If you don't like the input syntax of your lexer generator, there are many to choose from, or maybe you write one yourself, but that's no reason to switch to something that is not only going to be more verbose and error-prone but will also be slower.

(afterthought: I guess an interesting analogy: you don't lay out a hashtable by hand, like in a C initializer list in your source code that has a large number of NULL entries with only a subset filled in, already in hashed order; you instead write a hash function and let the computer reorder your entries for you. Manually hashing the values feels "hard core", but offers no advantages, and means you have to throw away all of your work and start over when the size of your table changes. Doing it by hand is also strictly grunt work: you don't gain anything by having placed it by hand other than the possibility that you made a mistake somewhere and now your element will never be found. And if later you want to try different hash functions or different search algorithms--maybe you are willing to pay some costs to get range queries, and end up using a tree--you can later do so without changing your input files. You should think of your lexer like a really complex data structure tied to an algorithm that always has the abstract interface "get next token".)

Lexing is such a small part of compilation overhead that not being optimal isn't going to kill you. In my case, writing the elder by hand is necessary because I have to memoize token identities (and all the parsing/typing info attached).

Yes, I write my own hash tables also for the same reason (so they support incremental computations).

> Lexing is such a small part of compilation overhead that not being optimal isn't going to kill you.

You still then need to just the extra boilerplate per element. If you use a tool, you are literally looking at just "keyword <space> code when that keyword is pressed" without any surrounding "how to check if it is that keyword". This is easier to write and easier to maintain, in addition to the advantages I discussed earlier.

> In my case, writing the elder by hand is necessary because I have to memoize token identities (and all the parsing/typing info attached).

If I understand what you mean, then this is trivially done with most existing tools as part of your token rule (retired an interned string, which you can use an existing data structure for). If not, then you would first write a tool. Again: it may feel really "hard core" to write a lexer, but it is repetitive code that a good design factors out into a lexer generator.

> Yes, I write my own hash tables also for the same reason (so they support incremental computations).

Careful: I imagine you mean to say you write your own hash table library/compiler, which is different from laying out the hashtable by hand. I have also written my own hashtable for many reasons, but I would never sit down with an array literal and manually put the entries in there by hand: even if I don't make any mistakes, it is a pointless waste of my time that is trivially automatable using a computer.

Keywords are quite easy: just plug your identifiers into the hash table after their boundaries are detected. Again, it is much more expensive than generating a FSM, but the expense is really in the noise when everything else is considered.

As for incremental lexing, you need to tell if your tokens pre-existed as the same kind (not necessarily the same string!) before the edit or not. It would be trivial to add this to a generator, but how would it then feedback the signals needed to take advantage of the memoization (e.g. by providing a persistent token ID that can unlock pre-existing information about the token). There are simply no standards for that.

In most of the professionally written compilers (e.g. scalac) I've worked on, lexer and parser generators aren't even used, and it really isn't that big of a deal to write these in code; you also get the benefit that the same language is being used. This becomes especially true when IDE services are considered, whereas most generators are pretty much limited to batch applications.

> I have also written my own hashtable for many reasons, but I would never sit down with an array literal and manually put the entries in there by hand: even if I don't make any mistakes, it is a pointless waste of my time that is trivially automatable using a computer.

I see your point, but it really depends on the key space you are optimizing for. You might just put the elements in by hand if a generic algorithm isn't really called for.

In regards to syntax, I found that using a higher-level syntax like BNF to explore a language space is great. But then write it by hand to get performance.

I've only done it a couple of times with relatively simple syntaxes so YMMV.

+1 for mentioning Rob Pike, the guy's gifted!

Yikes! I guess that comment didn't go down well with a few folks!

I found his explanation on how the lexer/parser for the templating language used in go http://www.youtube.com/watch?v=HxaD_trXwRE was created much easier to understand than other media/literature available online on the same subject. Hence my appreciation of his work. :)

> Yikes! I guess that comment didn't go down well with a few folks!

I guess because '+1 I agree' or variations on it aren't appreciated here. That's my impression of this community, anyway. A post usually needs more meat, like the way you elaborated on your OP.

I used Flex/Bison for writing my first interpreter, and I definitely recommended for beginners. I think when you're starting out, it's more important to focus on writing a proper grammar and traversing the AST than to do all the lexing/parsing by hand. Plus, it makes it easier to rapidly prototype your language when you're still in the design phase, and the Flex/Bison files serve as a form of documentation.

That said, now that my language's grammar is stable and I have a better understanding of lexing and parsing, I'm thinking about ditching Flex/Bison, mainly to achieve more descriptive error messages.

I found the lemon parser generator distributed with the source to sqlite3 to be far easier to use than bison.


This looks really interesting, thanks.

I've playing with PLY* for fun, but I'd like to try something else in C without going back to flex/bison.

* http://www.dabeaz.com/ply/

The last bit I think is the most important - that you will have to take the show on the road and tell people about it to ever have it really take off is interesting and important. Promoting a piece of tech as critical as a language is no small chore. You can't just put something out on the internet and hope.

Having helped put something out on the internet and hoping (http://gosu-lang.org) I 100% agree, but I think there is something else in there beyond just advocating a language. Let me try to articulate it:

Programming languages don't win because they are better qua programming languages. They win because they solve one problem really well that others do not. Ruby (+ Rails) -> building structured web apps. Javascript -> Being there. C -> Being portable assembler. Java -> C with garbage collection.

Consider the greatest language ever, Lisp (I hate it, but I recognize it's power.) The reason it's never taken off is that there is no one big problem that it solves that other languages can't solve easily enough.

The bad news for a lot of the better, smaller languages out there is that the newer problems are often solved in libraries and momentum (tooling, deployment support, etc.) in the big languages is huge.

> They win because they solve one problem really well that others do not.

That assertion's got more holes than emmental. Where does Python fit? C#? Javascript and Objective-C don't really "solve one problem really well", they're the only fucking option in their space. Java was not and has never been "C with garbage collection", the syntax is the only thing it got from C.

The language being great at solving a specific use case is definitely a boon and increases its chances (witness PHP or Lua), but it's nowhere near sufficient (Tcl lives in the same niche as Lua — lightweight, extensible, multi-paradigm language for embedding in and scripting of native codebases, predates Lua, and is considered by many of its lovers better than Lua, yet it's dying its slow and protracted death)

> The reason it's never taken off is that there is no one big problem that it solves that other languages can't solve easily enough.

The reasons it never took of is a combination of little advertising/pushing (compare with java), and highly fragmentary communities (the community unit of Lisp, as those of mages, is 1) leading to no corpus of code.

> The bad news for a lot of the better, smaller languages

And yet there's never been more uptake in new languages.

That assertion's got more holes than emmental.

Maybe, but your counterexamples don't seem particularly convincing.

Where does Python fit?

Originally, a readable scripting language.

More recently, a language for back-end web development that doesn't ram OO down your throat.

Also carving out something of a niche as a general purpose scripting language embedded in other applications instead of a custom macro language, like Lua but many more people have prior experience programming in it.


Similar strengths to Java, but with a lot of useful extra features, without most of the limitations that should have gone away a decade ago but didn't, and with a runtime environment that is available by default on Windows.

The reasons [Lisp] never took of is a combination of little advertising/pushing ... and highly fragmentary communities ... leading to no corpus of code.

Also, a highly uniform and generic structure is not always an advantage in a programming language. Sometimes it's better to have things that are different obviously look like they're different.

In short, a language needs advocates that can brag about something, and the brags need to be objectively accurate. Do not claim the adjective "fast" if I can trivially outrun you in Python, or claim "simple" if I can trivially write shorter source code in Java, etc.

It also need to be an interesting thing to brag about, but, honestly, just getting the first bit is enough challenge....

>and the brags need to be objectively accurate

No, it just needs to be convincing. Plenty of convincing lies (or unproven claims) are used to sell programming languages.

The problem with inaccurate bragging is that the loop ends up unclosed; the advocate says "This language is so fast!", the earliest of the early adopters try it and it isn't fast, and so they slag your language on their blog. And in this case, there is a such thing as bad publicity.

Even when I disagree with the propaganda of a language community (lookin' right at you, Node), there is enough truth that it made it through that gauntlet. In that case I'd argue it isn't that they lie about Node, it's that they lie about all the other languages they putatively compare themselves to. Apparently that strategy does work....

cough, cough

I would say that Lisp never took off because when it had its moment to shine, it failed in the speed department. Since then, many languages have since filled the expressiveness niche that Lisp filled.

Of course, that discounts Clojure, which is showing Lisp's power to a new generation, in an environment they can relate to (the JVM).

I agree that languages need to solve a real problem. When I created Obvious Architecture, I actually realized that Ruby is basically the wrong language to do Obvious in, but I did it anyway because it had enough tools to make it kind of work anyway.

Since then, I've been looking for a language that really offers all the bits that would make for a truly compelling Obvious experience without having to write it as a "framework" and it's pretty difficult to find something that really hits all the points well.

I haven't ventured into building my own language yet, but it's awfully tempting.

I think it is worth considering that perhaps obvious architecture is the wrong thing to make, rather than existing languages being wrong for making it.

Have you thought about moving Gosu to using Truffle and Graal for high performance? http://blog.jruby.org/2014/01/truffle_graal_high_performance...

Your own examples contradict your hypothesis. Ruby is a generic scripting language, it is not better or worse than python or perl at anything. Rails made ruby popular due to evangelism, not solving a problem others do not (notice how rails was literally just a crappier django?).

"Being portable assembler" is not solving one problem at all, it is one of the most general things you could cite. That literally is being a better language.

Java took millions of dollars of marketing to make it popular, people didn't just suddenly flock to it because it was C with garbage collection (which was already a solved problem: see C).

Perhaps lisp "never took off" (seems like it took off pretty well to me) because it isn't the greatest language ever? Lisp barely is a language, it is more of a toolkit to build a language with.

> Java took millions of dollars of marketing to make it popular, people didn't just suddenly flock to it because it was C with garbage collection (which was already a solved problem: see C).

Could you please explain what you mean? C is the only language I know in any depth. Beyond a basic description, "garbage collection" is a fairly alien concept to me.

I was thinking along those lines, cheers.

If you actually expect your language to be used (and not only write it because it is fun), you should probably also consider this checklist: http://colinm.org/language_checklist.html It isn't quite serious, but it can prepare you for a lot of the justified and unjustified dismissals you will have to face.

It depends what your goal is. My own reason for writing my own programming language was to have a virtual machine for a multi-player 3d games framework that I wrote in 1996. This sat on the back-burner for a few years, and then I used the compiler and virtual machine in a commercial project.

Probably less than 100 people have developed for the language (mostly just tweaking the provided applet code), but the goal never was to unleash a new language onto the world. The aim was to provide a rock solid platform for our product that allowed people to customize the applets if they wished, and it achieved that.

I would recommend people build their own programming language simply for the hell of it - it's challening, fun, and could come in useful in future.

Also it won't take "years" to get a working prototype. You can get something working in less than a month. It will obviously take a lot longer to get it stable and bug-free.

> Also it won't take "years" to get a working prototype. You can get something working in less than a month.

It all depends on the language. Tiny domain specific languages might only take hours, under the right circumstances (for an experienced language implementer).

When someone says "years", I assume they mean an industrial strength general purpose language.

Yes, perhaps a month was optimistic. Looking back through my emails, I see it took me 5 months (in my spare time) to design my language and instruction set and develop the compiler and VM. After that I wrote the function libraries. This was the first (and only) compiler I've written, and I just bought a book on compiler construction to figure out how to do it. This was an object-oriented C-like language, which was general purpose. It took a few years of actually using the language to make it "instustrial strength".

Writing a compiler is a lot simpler than people might think. I used "Practice and Principles of Compiler Building With C", which explains things very well. It seems to be out of print now.

What do you think about the point of view of skipping concrete syntax and just design the language at the construct level using s-exps ? ( as mentioned here http://cs.brown.edu/courses/cs173/2012/book/Everything__We_W... )

This how could help in build a C-like or Python-like language? Or you say just make a LISP-like language?

meaning the syntax is not the semantic, I can try imperative algol things in lisp clothing, or prolog things or both .. but I won't waste time on syntax, more on semantics.

I think that's valid, even desirable. Others have said lexing/parsing are a fraction of the language implementation problem. So, when using s-expression like syntax (xml as some have done) you are basically writing code in the AST. Then you can concentrate on interpretating the AST, or translating it to bytecode for some vm (maybe of your own design) or performing passes on the AST to do things like type checking, up casting values in expressions, optimizations, etc. There is really a lot more to language implementation than lexing parsing.

That's the point made by the course instructors, focus on semantics/interpretation/compilation rather. Another point I'd like to see is that with a small syntactic surface it may be easier to compare linguistic traits. Instead of having 219 languages we'd have one core and libs (with their macro sugar) giving the desired features. But maybe that would only move the problem further.

I think exist some gap in how build languages. I wanna make one, but need to re-create a lot of basic stuff just to bring my own thing. I know .net/java/llvm is suposed to solve this, but exist something that give just a AST and I plug onto it? So I don't need to invent how declare a var or how make a function, but just control the syntax, the sugar, etc.. (ie: Like have a API to build languages).

Of course, where this dream get killed is when I'm interested in do it in C, not in .NET, or need a VM (or not). But when I see julia/GO/lua/nimrond and think "Hey, I could build on that" but is too highly coupled and then the option is do everything from the start...

I found http://www.hokstad.com/compiler/ really educational - implementing a language in the same way I'm used to writing code, by solving one problem at a time. It makes sense in a way that compiler theory lectures never did for me.

I have to speak up in defense of minimizing typing. Sure, programming is mostly long stretches of thinking with short stretches of typing, but when you do start typing, it's usually the result of some piece of inspiration. It's especially at those times that typing lots of useless characters is a distraction. Lots of noise in the code is also distracting while reading, which we all know is what we spend most of our time doing, though you don't want it to be too short either.

In my opinion it is not the light syntax that really minimizes typing, but rather the availability of features in the language that can afford higher level abstractions.

Light syntax is useful for readability. I like it when I don't have to end each line with a semicolon, because that's just noise meant to spoon-feed the compiler. However, I hate it when the grammar of the language leads to awful gotchas. And sometimes visual clues are useful, again for readability.

Since we are on the subject, I hate CoffeeScript.

As someone who had a bout with RSI, I agree. Minimizing keystrokes, if it can be done without impacting readability too much, is excellent.

I don't have RSI, so don't know from experience, but wouldn't a syntax aware IDE with autocomplete be very useful for minimizing keystrokes without necessarily minimizing the characters?

Probably. I tend to avoid IDEs because to me they feel like straight jackets and their editors tend to be inferior to standalone editors. I'll have to look around at autocomplete options for my usual languages (Erlang, Python) though -- I figured if I couldn't fit what I was using into my head, I needed to work on my head, not on my autocomplete options. But maybe I'm wrong. :-)

Hmm, so this is the author of D. I've never seen D actually used anywhere. Honestly, I think he missed the biggest thing - being employed by a large company that will push your platform. From the Wiki page, D sounds like a replacement systems programming language, but I've never heard of it while Google's Go system programming language replacement is all over the place. Similarly Sun, now Oracle, and Microsoft push languages that actually get used.

The first thing that jumped out at me in the (very interesting I thought) article was this; "If you need validation and encouragement from others, it isn't going to happen". In other words, you need a thick skin as well as technical ability. I remember Walter Bright achieved a certain amount of fame in the programming community back in the 80s by writing and productizing outstanding C and C++ compilers for MS-DOS when that was an important market with big competitors including MS and Borland. He seems to have been plugging away with D, conceived as a better successor for C than C++ for a long time now. I suspect that the reason that C++ is ubiquitous while D remains relatively obscure has a lot more to do with inertia and momentum than the objective merits of the two languages. It's a tough field to get traction in with so many brilliant and opinionated critics and competitors. My impression is that Walter Bright is an outstanding programmer who deserves more recognition and kudos.

MS and Borland produced C++ compilers long after Zortech C++ proved the market for them on MS-DOS.

I was lucky in that Zortech C++ came out just as the market for C++ exploded (and it might even be that ZTC++ was the case of that explosion).

Thanks for replying. I feel a Wayne's World moment coming on... "I'm not worthy, I'm not worthy...". Congratulations on a wonderful body of work. I am going to use this moment to motivate myself to push my own projects forward. Of course I'm not trying to write my own language (that would be crazy...).

D is being used on an "experimental" basis at Facebook:


That said, you're right. Rust and Go both have momentum, and they've both had big backers behind them.

Is anyone actually using Rust in production? There's a lot of excitement around it in the HN echo chamber, but I'm not sure how far that translates to real world use.

Since it's not very stable yet, production use is actively discouraged. I do know Tilde is using Rust for Skylight (http://skylight.io), but I don't know how far from production they are.

Skylight itself is still not quite out of beta, but iirc ever since reimplementing their monitoring and instrumentation in Rust they've been allowing their paying customers to opt into it (with the alternative being to stick with the older Ruby-based implementation).

For another company currently using Rust despite our frequent warnings, see OpenDNS: http://www.opendns.com/

And now you have heard of D! See, my eeevil plan is working!

Yep, actually Steve Yegge has a rant saying that his friend Walter Bright (creator of D) didn't care enough about marketing.

So? Java's success can be largely attributed to Sun's marketing push. On the other hand, no corporation stands behind a language like Lua which is enjoying success. Same for a language like Julia (though very young, it has clearly managed to attract attention).

Well one other difference is that D is a systems language, and Go is nothing even remotely close to a systems language but initially claimed to be one. There is not much demand for a new systems language, most people are happy to use higher level languages for what they do these days. Most go converts are coming from python and ruby, not C++.

Though is probably isn't as straightforward as other options, I really like how Lepl is put together:


A quick example:


Edit -- I guess not many people agree with me:


Well my pet language Ivy http://sourceforge.net/projects/ivy-lang/ allows you to define new statements via delayed evaluation tricks (but without actual quoting):

  fn myfor(&init, &test, &incr, &body)
    while *test

  var a

  myfor a=0, a!=10, ++a
    print a
& creates a zero-argument lambda function out of the argument. * is sugar to call a zero-argument function.


  fn set(&x y)
    *x = y

  set a 7  # Set variable a to 7
It works because when variables are evaluated, the result includes both the value and the address from which it originated.

Anyway, creating a language is one thing. Creating a fast compiler for it is quite another. It's impressive how well v8 works for the dynamic language javascript, but the amount of work embedded in it is daunting. On my list of future projects is to try to make a simple SSA-based optimizer.

That's really awesome!

I have ideas about a new language since I finished school in 1993. I had many ideas, some of them got implemented in java. Recently, I discovered scala. All my vague ideas are in scala. The only complaint I have against scala is the slowness of the compiler. I have stopped learning about compilers.

At coursera.org, there is an excellent resource for learning about compilers: "Stanford University - Compilers"

I'm building a toy-DSL and working through the ANTLR4 book right now mostly because it was the first thing that came up when searching. Since I don't have a formal CS background I'm lacking a bit in the "language development" department.

Is there a good overview comparison of strength/weaknesses of what's available (blog post, academic journal don't really care)? I mostly know stuff exists because I've heard it mentioned somewhere (Flex/Bison, yacc, using Prolog :P) and it's not the easiest thing in the world to make an educated decision.

My current plan is to stick with ANTR and muddle through and build a prototype then spend a day or two researching alternatives and seeing what comes up but if someone could cut down that research time I'd be thankful :)

Recursive descent parsers are easy, at least if your grammar is LL(1). I seem to recall there being a lot of resources on how to make your grammar be LL(1), but I don't recall which one I read.

Surprisingly shadow article for someone like Walter Bright. Just some basic recommendations, but nothing practical? No links to books, parsers, implementations?

The easiest start would be "Compiler Construction using Flex and Bison" by Anthony A. Aaby, available as free pdf, and for adding LLVM to the mix the link would be: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/

There needs much more to be said, e.g. links to such languages which were written in a day and are not necessarily lisp or scheme. That would be too easy.

This is a fabulous article.

This article seems like the perfect antidote to the usual "here's the dragon book" article.

Articles that point you to the dragon book, give a few ideas of what parser and lexer are, and then turn you loose are just terrible. They are look articles on "how to build a house" that focus on "here's where to buy a backhoe, here's where you can concrete, etc..."

Essentially, designing your own programming language is a difficult and in many ways, bad, idea. But what you to have a ghost of a chance is "what choice should you make, what will it feel like" narrative (there should also be "what are the milestones" but given programming, that will sadly be too variable to predict).

The author writes his own experiences, and they are directly opposite of the "Compiler Construction using Flex and Bison." And he explains why. I also remember Stroustrup's story where once he wanted to develop C++ he got well meant advices to use lex and yacc, and he claims that he much later realized that much better approach for him would have been to code both things by hand.

This matches Walter's experiences, and not only his.

I tried to avoid all the usual stuff that is well covered elsewhere, and touch on things I thought were poorly covered or not even mentioned.

After all, it's just an article, not a comprehensive tome.

Every so often, there's an article on HN promoting the concept that everyone should learn to code. I feel the same way about programming languages, that every developer should write a language. It doesn't need to be a great language, or overly functional. For me, it has provided incredible insight into what is sugar, what is required, and why languages are syntaxed the way they are. The years I've spent writing languages has made me a far better developer.

If you want to play with languages, you could do worse than taking a peek at OMeta:


I think this shows that Walter has lots of compiler experience, not so much lots of language design experience. You can't get good error messages without semicolons? How can anyone say something that absurd in a serious article? Have you tried a language without them? It works just fine.

Minimizing keywords is important not because of a word shortage, but because of an overlap with words the programmer wants to use. It is obvious that if you make "i" a keyword, someone will likely murder you. But there's tons of grey area where you might be stepping on toes and getting in the way, so minimizing the amount of toe stepping is good. Especially since there is absolutely no reason to have lots of keywords.

Full context: "Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement terminating ; are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: Any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages."

Given the context I think it isn't so much that Walter is saying you can't have good error messages without semicolons. He's saying you can't have good error messages without redundancy (in this example a statement terminator).

>Walter is saying you can't have good error messages without semicolons

That is what he said though. I understand it is one example for a larger point, but it is an example that doesn't support that point at all. It is pretty hard to judge the overall point when the example is nonsense.

I think you're really missing the point.

Your goal is to write a whole programming language.

Calculating where each statement starts and ends in order to serve a good error message is misdirected effort. Just use a terminating character, and use the standard one: a semi-colon.

> Just use a terminating character, and use the standard one: a semi-colon.

Significant whitespace is popular as well. You can argue all day about it, but at the moment I think making a language "look like python" is the safer bet for a new language.

Significant whitespace is _also_ redundant, just along different metrics. Badly indented Python gets caught before execution the majority of the time, and in presentation, it surfaces to the programmer quickly.

His argument is essentially against something like Lisp or Forth where you have such a paucity of syntax that many erroneously written statements are syntactically valid and make it to runtime.

Lisp and Forth are the extreme case, which is especially bad, because you get no error message at all.

But the middle ground are languages that can detect a syntax error, but can't figure out exactly what it is. You don't get an error message saying "You should have a semi-colon here, want me to insert one?" Instead you get some vague message like "didn't expect this keyword here" a few lines after the missing semi-colon.

It's debatable. I know two languages with significant whitespace, that's Python and Haskell. And Haskell's rules are a pain.

For the language I'm working on, I had started with significant whitespace. Then I realized that it would be a pain for things like anonymous callbacks (the kind of thing JS code is littered with) and that it was distracting. At least when you start, don't get hung up on the syntax, focus on basic things. What's important is what you are going to do with your AST, the lexer/parser part are the least important areas, unless you are doing a "syntactic skin" over a language (eg, coffeescript). You can always change the syntax later.

There's also F#, CoffeeScript, Yaml, Sass/Stylus, Haml/Jade and several others, just among the popular languages.

And if we mean "significant whitespace" as in "newlines can act as statement and expression terminators" (since we were discussing semicolons as the alternative) we can also add a ton of other languages, like Ruby, JavaScript, Go, Scala, Visual Basic and Lua.

I don't think it's fair to compare markup languages to programming languages, so we're down to F# and Coffeescript. That's not much of a trend toward significant whitespaces.

I agree with newlines vs semicolons (though I've always been told to write Javascript with semi-colons, and that's how I have encountered it in the wild).

> I don't think it's fair to compare markup languages to programming languages, so we're down to F# and Coffeescript. That's not much of a trend toward significant whitespaces.

I take your point about Haml, though Sass is actually Turing-complete, so I almost feel like it belongs in the list despite being really off-the-wall.

And I was only choosing from reasonably popular languages (so things like Boo are out even though they'd help my numbers). There just aren't that many mainstream programming languages out there. Four in a category seems like a pretty fair number to me. You could just as easily say functional programming isn't a thing if four mainstream languages is considered a paltry showing.

I'm curious why you felt anonymous callbacks were problematic in a significant whitespace language. They seem to work just fine in coffeescript, as an example.

I'm not familiar with Coffeescript. Maybe I should have a look. My thought was that getting the right level of indentation for something like:

    my_func_with_callbacks(arg1, def(x):
would be a pain in terms of defining sensible rules and ensuring they are parsed correctly. So right now, I'm going for something closer to Ruby syntax. But it may just be a symptom of a lack of imagination. I'll remember to investigate Coffeescript when I revisit the syntax.

Ah. I think the practice is to treat the closure as being ended by any dedent that is lower than its first line. This specific case does come up with javascript's setTimeout, which takes a callback as its first argument and the timeout as the second. It usually seems to look like this:

    setTimeout ->
    , 1000
I'm not terribly fond of this, personally, but it does solve the problem. I think in a language without a legacy you would just tend to avoid making callback arguments come before non-callback arguments and things would look a lot nicer.

> It's debatable. I know two languages with significant whitespace, that's Python and Haskell. And Haskell's rules are a pain.

But it's optional in Haskell's case.

>And Haskell's rules are a pain.


Somehow, I find myself caught doing an incorrect indent regularly. I appreciate Python's ':' marker, which unambiguously tells you "either write the next statement on the same line, or indent". I remember it being criticized, but I think it's actually really good.

> at the moment I think making a language "look like python" is the safer bet for a new language

Why Python? Clojure and Elixir are two new languages with growing communities and they look nothing like Python.

You could also say that newlines always terminate statements unless a continuation character is used. That's a trivial policy that works very well in practice.

I don't know how you think that was the point. He doesn't say anything like that. In fact he takes the exact opposite position when dealing with parsing, saying to just write more code. Why would something which literally requires no extra effort (producing useful error messages without semicolons) be misdirected effort, while writing a more complex parser is cool?

I re-read the article and I concluded that we are both right. I direct you to the following statement:

"The principles are rarely orthogonal and frequently conflict."

Producing genuinely useful error messages at all requires a complex parser, IMHO. It's not computationally obvious to look at a line of code and say 'actually you just missed this one character'. Requiring a terminating character gives you at least a sanity check starting point to work with - it alone will catch whole categories of errors.

Forth doesn't have 'keywords' per se, but i is predefined for you. It has the effect of placing the index of a 'do' loop on the stack.

Here's a more technical version of a very similar article by the same author that was previously posted on HN: http://www.drdobbs.com/tools/creating-your-own-domain-specif...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact