The naked truth about writing a programming language (2014) (digitalmars.com)
250 points by pcr910303 on May 2, 2020 | 204 comments



That's a good set of questions for 2014. Questions that have become important more recently include:

- Imperative? Functional? Some mixture of both? Mixtures of the two tend to have syntax problems.

- Concurrency primitives. The cool kids want "async" now. Mostly to handle a huge number of slow web clients from one server process. Alternatively, there are "green threads", which Go calls "goroutines". All this stuff implies some level of CPU dispatching in compiled code.

- Concurrency locking. Really hard problem. There's the JavaScript run-to-completion approach, which reappears in other languages as "async". Ownership-based locking, as in Rust, is promising, but may lock too much. And what about atomic operations and lockless operations? Source of hard-to-find bugs. Get this right.

- Ownership. Rust got serious about this, with the borrow checker. C++ now tries to do "move semantics", with modest success, but has trouble checking at compile time for ownership errors. Any new language that isn't garbage collected has to address this. Historically, language design ignored this problem, but that's in the past.

- Metaprogramming. Templates. Generics. Capable of creating an awful mess of unreadable code and confusing error messages. From LISP macros forward, a source of problems. Needs really good design, and there aren't many good examples to follow.

- Integration with other languages. Protocol buffers? SQL? HTML? Should the language know about those?

- GPUs. Biggest compute engines we have. How do we program them?


I would say possibly to use a different programming language for GPU than CPU. I like the idea of Checkout (see [0]) for GPU programming, although unfortunately it is not implemented and the preprocessor is not yet invented. (The trigonometric functions are also missing. They should probably have at least sine, cosine, and arctangent, since these functions seem useful to me when doing graphics.)

[0] http://esolangs.org/wiki/Checkout


Julia has done a fair job of mapping high level mathematical primitives to the GPU (though I disagree with their approach)


I think using trig and especially inverse trig is almost always an antipattern in graphics code. Careful use of dot and cross product will almost always be both more performant and more robust.


> Mixtures of the two tend to have syntax problems.

My gripe isn’t the syntax, it’s the lack of guarantees/constraints.

When I’m working in a language with immutable data structures, I know what to expect. A language like Python with some added functional sugar is much harder for me because I make stupid assumptions. Spent a half hour once chasing down a bug just because I was carelessly assuming list.pop() didn’t alter the list.

Now I keep a Python shell running all the time so I can see whether various functions have side effects.
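
E.g. the thirty seconds that would have saved me the half hour:

  >>> xs = [1, 2, 3]
  >>> xs.pop()
  3
  >>> xs        # the list was mutated in place
  [1, 2]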


You make a good point in general, but I'd be upset if a function called 'pop' didn't mutate anything. That is not the right name to use for returning an element without side effects.


> That is not the right name to use for returning an element without side effects

how do you feel about `str.replace`? i'd say that that name implies side-effects too, but everybody's used to the fact that [in python] strings are immutable, so it's no big deal. it's all a matter of expectations and language conventions – e.g. in Haskell, `replace`, `reverse`, `insert` etc. would all be pure functions and no one would think about it twice.

and sure, i'd prefer a "declarative" name if one's available (e.g. `plus` vs `add` on numbers). but if you overdo it you end up naming your functions stuff like `with_inserted` or `with_replaced` and that's just... not up my alley at all (though i'm sure someone likes it)
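
to be concrete (plain CPython, nothing special):

  >>> s = "hello"
  >>> s.replace("l", "L")
  'heLLo'
  >>> s          # the original is untouched
  'hello'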


> how do you feel about `str.replace`? i'd say that that name implies side-effects too, but everybody's used to the fact that [in python] strings are immutable, so it's no big deal.

In that case it's returning a modified string, at least.

'pop' and 'replace' both create modifications to the item you feed in. You can return the modified version, or you can mutate it in place, or you can do both. But it makes zero sense to do neither. Why would you make a modified version of an immutable variable, then immediately throw it into the void? So as long as 'pop' is returning the element from the list/map/set/whatever, and nothing else, it is the wrong verb for immutable structures.

Also, 'pop' works on unordered data structures too, which makes it fit even less since it's now "pick some arbitrary element".


> So as long as 'pop' is returning the element from the list/map/set/whatever, and nothing else, it is the wrong verb for immutable structures.

right, but would anyone really write a functional `pop` like that? that's just `last`. any sensible functional `pop` would return a tuple:

  (x, new_xs) = pop(xs)
so i'm not sure i understand what the problem is tbh. we might be in violent agreement...

and re: unordered containers, i'd say it "makes as much sense" for immutable and mutable ones. maybe it returns the most recently inserted element or whatever's convenient/efficient to implement; but it's arbitrary regardless of any (im)mutability.
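
in Python terms, a pure pop could be sketched as (toy code, not any particular library):

  def pop(xs):
      # return (top element, remaining sequence) without touching the input
      if not xs:
          raise IndexError("pop from empty sequence")
      return xs[-1], xs[:-1]

  x, new_xs = pop([1, 2, 3])   # x == 3, new_xs == [1, 2]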


> so i'm not sure i understand what the problem is tbh. we might be in violent agreement...

A two-return pop like that is a little weird, but still basically reasonable.

But a one-return pop is a misleading term for immutable data structures. Use list[-1] or last(list) or something.

When I say I'd be upset at pop not mutating a list, I'm talking about a one-return pop specifically.

> and re: unordered containers, i'd say it "makes as much sense" for immutable and mutable ones. maybe it returns the most recently inserted element or whatever's convenient/efficient to implement; but it's arbitrary regardless of any (im)mutability.

If the structure is immutable and unordered, a one-return pop is almost totally useless.

You use it, it gives you some element. Then what? You can't meaningfully use it again, because it might give you the same element, or it might not. You can't whittle the structure down one element at a time, processing each element as you go.

The only things you can dependably do with it are check if the data structure is empty or not, and get an arbitrary example element.


so we agree about single-return-pop – it just wouldn't be very useful. what i tried to say is that it's a bit of a strawman – single-return-pop is pretty obviously useless, so no one would design a pure API like that. so there's no point arguing against it :)

if you want `pop`-like semantics, but also want your operations to be pure, tuples are the only real way to do it. for a real example, see this Haskell implementation of stacks: https://hackage.haskell.org/package/Stack-0.4.0/docs/Data-St...

its `pop` has this signature:

  stackPop :: Stack a -> Maybe (Stack a, a)
(with a `Maybe` thrown in to account for popping an empty stack)

---

another alternative that works well in some cases is algebraic datatypes and pattern matching:

  result = case myList of
    [] -> "whoops, empty"
    (x : rest) -> (
      "the first element is " ++ (show x) ++
      ", and the rest of the list is: " ++ (show rest)
which is pretty similar to

  try:
    x = myList.pop(0)
    result = "the first element is " + str(x) + " and the rest is " + str(myList)
  except IndexError:
    result = "whoops, empty"
so that way you can take your structure apart bit by bit like you would with `pop()`. if your container is unordered, you could convert it to a list first.

---

and if wrangling tuples and passing the "state" around like that gets annoying, Haskell has the `State` monad, essentially a wrapper around functions `SomeState -> (X, SomeState)` that eliminates the state-passing boilerplate


> so we agree about single-return-pop – it just wouldn't be very useful. what i tried to say is that it's a bit of a strawman – single-return-pop is pretty obviously useless, so no one would design a pure API like that. so there's no point arguing against it :)

Remember where the conversation started :) It's not a strawman because it's what confused macintux in the first place!

I was saying that the thing they expected would be a problem, and we're just discussing how to replace it, not disagreeing with that original idea.


good point, i got a little bit caught up in the argument there :)


Curious; why `Maybe (Stack a, a)` rather than `(Stack a, Maybe a)`? The returned stack would be empty like the input.


i guess it's "making illegal states unrepresentable" – with `(Stack a, Maybe a)` you allow an implementation that does

  stackPop [1,2,3,4] == ([1,2,3], Nothing)
which doesn't make sense - `Nothing` is only supposed to happen when the stack is empty, i.e. `([1,2,3], Nothing)` should be impossible. and if `([], Nothing)` is the only "error state", why not just use a `Nothing` for the whole thing instead? and then you can sleep soundly, knowing your pattern-matches are exhaustive :)

ergonomically, it's nicer to test for failure; and you can do "railway-oriented programming", either doing something with the result or quickly bailing out.

all that said, there's probably cases where `(Stack a, Maybe a)` is more ergonomic... so you know, this isn't dogma


Ergonomics could be the reason; good point.

I don't think your first point is correct, though; nothing about `Maybe (Stack a, a)` stops me returning `Nothing` incorrectly.


i was going to mention this, but the comment was already long :) of course, this doesn't stop you from implementing it as

  stackPop _ = Nothing
or just doing the wrong thing. but it does prevent you from returning `([1,2,3], Nothing)`, and that's always one less thing to worry about - if not for the implementer, then for the users. and the more the signature constrains possible implementations¹, the better!

and this might be subjective, but for me `Maybe (Stack a, a)` results in less cognitive load. with `(Stack a, Maybe a)` if i do

  case stackPop xs of
    (_,  Nothing) -> ...
    (xs2, Just x) -> ...
in the back of my head i'm going to be thinking "hmm, am i 100% sure i can throw away that first value if i get a Nothing in there? if it's useless, why does the function return it?²"

whereas with

  case stackPop xs of
    Nothing -> ...
    Just (xs2, x) -> ...
i know for certain there are no new values introduced in the Nothing case – great, fewer things to keep in my head! it's a small thing to be sure, but i like when the types assist me like this :)

---

¹ parametricity is one of my favorite things in pure FP!

² but i hate throwing things away, so it might just be that...


In clojure, pop returns the input with the item removed (so popping a vector returns a vector with the last element removed). To return the actual element, you use a different function (peek, or last).


that's `init` in Haskell, and in this case i'd agree with the OP - i'm used to `pop` letting you actually access the last element, so i'd prefer a different name.

also, doing it that way will require traversing the structure twice, something that's often not free with functional data structures, whereas a tuple-pop does it in one go. similar to how you'd rather split a list at an index and get back two lists rather than do a `take n` and `drop n` separately


I don’t mind it, to me it reads as “pop the top of the stack off”, just like you would in a mutable version. The fact that the popped item is discarded doesn’t change the operation.

As for traversing twice, at least in Clojure, the reason you have peek and last as two separate functions is that peek never traverses, but what it does depends on the type of data structure passed in (and pop works to match that). E.g., for a vector, peek gives the same result as last and is O(1) because vectors are random access, but for lists, peek is the same as first and still O(1), while last is O(n). For lists, pop does the same as rest.

Why have peek/pop when there are other functions that do the same thing? Well, they are designed to be consistent with each other, so regardless of the data structures used, peek will always return the item that pop removes.


Ruby has a nice convention where impure functions like those have a "!" appended to the name. For example, String#gsub leaves the receiver untouched and returns a new string, but String#gsub! mutates the String instance in place.

I guess if you write a new programming language, it would be nice to establish that convention early on, while it's still possible.


The problem with Ruby’s approach, if I recall correctly (been 10 years), is that it’s inconsistent.


Actually, I'd been thinking for a good while about how to determine function purity via static analysis, spurred on by how the Chrome developer console sometimes gives "lazy previews" of typed Javascript statements, but only ones that have no side effects. As it turns out, V8 has a feature which can guess whether an expression is pure or impure, erring on the side of impure due to undecidability and other random issues. It does this by classifying opcodes/built-ins by whether they can potentially mutate state, and running the statement in a separate VM instance with a debug hook set to throw an exception if an impure opcode/built-in is called. The parent developer tools code then prints the expression if the VM instance returns successfully.

https://chromium.googlesource.com/v8/v8.git/+/ba5bac8cebe91c...

I wanted something similar in LuaJIT but it's pretty non-trivial to accomplish. I guess it's one of those mildly useful innovations you can only have with a lot of developer time.


Checking for function purity is easier in some languages than others. It's mostly a transitive closure thing. Pure functions can call only pure functions. It's tough in Javascript, because function declaration is dynamic and you can rebind. It's not that tough in a more static language, where you can tell who calls whom at compile time.
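
A toy sketch of that transitive-closure idea (hypothetical call graph and names, nothing to do with the Pascal checker linked below):

  # Call graph: function -> set of functions it calls (hypothetical example).
  calls = {"f": {"g", "h"}, "g": set(), "h": {"log"}, "log": set()}
  impure_primitives = {"log"}   # primitives known to have side effects

  def is_pure(fn, seen=None):
      # Pure iff not a known-impure primitive and everything it calls is pure.
      seen = seen if seen is not None else set()
      if fn in impure_primitives:
          return False
      if fn in seen:            # recursive call already under consideration
          return True
      seen.add(fn)
      return all(is_pure(callee, seen) for callee in calls.get(fn, ()))

  # is_pure("g") -> True, is_pure("h") -> False, is_pure("f") -> False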

Way too many years ago, I did that.[1], starting at line 204.

Knowing that a function is pure has many uses. You can optimize, using that x=y => f(x)=f(y). Pure functions can safely be used in assertions and conditional debug code.

[1] https://github.com/John-Nagle/pasv/blob/master/src/CPC2/p2tr...


This is still true. Exclamation mostly means "this is dangerous". Array.push doesn't have an exclamation, and libraries like rails use it to signify that a method will throw an exception instead of returning whether it succeeded.


That convention may have originated with Scheme, which has non-mutating 'let' vs mutating 'set!'.


It's reasonable to expect a `pop` function (in the strict sense) that returns a collection without the last element. The point of `pop` is manipulating the pointer of the top element. We can leave the elements and previous copies of that pointer alone in memory.


Like I said, it was stupid. I’d spent a few years working with Erlang and hadn’t yet made the mental switch.


Regarding your last point, as of now we have four main kinds of processor hardware, namely CPU, GPU, FPGA, and TPU (the new kid on the block). The hard problem is how to program these different animals seamlessly.

Chris Lattner of LLVM and Swift fame is currently working on MLIR to program heterogeneous processors [1].

At the same time ETH Zurich is working on LLHD, to unify hardware design and programming (e.g. Verilog, VHDL), including testbenches written in conventional software languages (e.g. Tcl, Perl, and SystemVerilog) [2].

It seems the key concept to solving this problem is Static Single Assignment (SSA) form. It is well known that SSA can seamlessly support imperative and functional programming [3].

For the past several years, both Chris (CPU, GPU & TPU) and ETH Zurich (CPU, FPGA, and circuit design) have independently been using SSA to attack this hard problem.

I personally believe the SSA concept will empower the compiler community the way the quaternion concept empowered physicists like Einstein to overcome Maxwell's equations, and 3D game programmers to overcome gimbal lock.

Only time will tell how this idea will pan out.

[1] https://mlir.llvm.org/

[2] http://llhd.io/

[3] https://www.cs.princeton.edu/~appel/papers/ssafun.pdf


Just wanted to say thank you for all the contributions you add.

It's always a pleasant surprise to scroll back up and see your username on a comment I just delved into.


Well, I blew it yesterday, not knowing that Python really does do a > b > c expressions as a special case.
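
For anyone else who missed it: a chained comparison desugars to pairwise tests joined with 'and', which is not the same as evaluating left to right:

  >>> 3 > 2 > 1          # means (3 > 2) and (2 > 1)
  True
  >>> (3 > 2) > 1        # True > 1, i.e. 1 > 1
  False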


I'm not so sure about ownership. Sure, it's nice to be able to brag about having no GC while being memory safe, but this has a huge complexity cost. IMHO adding a GC/ASan to the compiler for debug mode (to detect double free, use after free, leaks) is a valid option.


> GPUs. Biggest compute engines we have. How do we program them?

1. Carefully. They only give you - roughly and unless you're super-lucky - around one order of magnitude improvement in raw flops and raw GB/sec memory bandwidth over a perfectly-exploited CPU. (Not that it's easy to properly exploit a CPU of course; it's actually quite difficult.) That means that if you cut even a few corners - you're just going to lose that edge, and a (again, perfectly-programmed) CPU beats you.

2. With a programming language which has zero or very-low cost abstractions, and models well the computations a GPU "thread" can perform. Preferably with some JIT'ing capability to be dynamic enough. For the first two, that trusty warthog, C++, does fairly well - especially with some help from nice libraries. (Shameless self-plug time... this kind of help: https://github.com/eyalroz/cuda-kat ; I'm thinking of submitting that to "Show HN" soon.) JITing for GPUs is maaaaasively under-explored and under-developed IMHO, and if someone is interested in collaborating on that, feel free to write me.

Would other languages do? If there's not enough ability to abstract, you get a mountain of specialized code; if there's too much abstraction, you start paying through your nose. Could Rust work? I don't know. Perhaps one of those "better C"'s like Zig? The fact that CPU-side libraries can't really run on the GPU kind of levels the playing field against languages with a large base of already-written software.

> Ownership... C++ now tries to do "move semantics", with modest success, but has trouble checking at compile time for ownership errors

Actually, C++ has, over the past decade or so, introduced several measures to address the issue of ownership and resource leakage. If you combine move semantics and library facilities (mostly smart pointers), the problem is half-solved. Static analysis is improving too, especially when you "decorate" parameters, e.g. with `owner<T>` or `non_null<T>` and such. The "C++ Core Guidelines" (https://github.com/isocpp/CppCoreGuidelines) aim to be machine-checkable whenever possible.


> more recently

2014 was yesterday. If you look, you’ll see that nothing on your list is ”more recent.”


The importance of GPUs for performance is new this century. The other items, ok, no.


How would you try to describe the future of software development based on these (excellent) questions of yours, e.g. in 5 years from now?


I want to implement a toy programming language, but I have questions regarding the following in that article:

> Context free grammars. What this really means is the code should be parseable without having to look things up in a symbol table. C++ is famously not a context free grammar. A context free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating in most of a compiler front end, i.e. third party tools become much more likely to exist.

Many complex languages, like C++ and Rust, are not context-free. We therefore need semantic analysis.

What programming language features will be compromised if I stick to context free?

Will the end result be as expressive as C++/Rust?

Any programming language that is massively adopted is context free?

BTW, the toy language I want to build will be something similar to javascript.


Context-free has nothing to do with symbol tables, it just means that in the grammar, the left-hand side of a production rule can only have a single non-terminal symbol, which can always be replaced by the expression on the right-hand side, without ambiguity.

Language features are an orthogonal issue--you can implement any language feature with a CFG, but you just can't reuse the same keyword or operator to have different meanings in different grammatical contexts.

A classic example is that in C++ it is impossible to know whether >> closes nested template argument lists, or is the right bit shift operator, or the stream input operator, without arbitrary lookahead--you need to parse the rest of the statement in order to decide which one it is in the current statement's context.

Generally language designers don't choose to implement a context-sensitive grammar on purpose because they desire some feature that requires it; they strive for a context-free grammar, but end up with a handful of special cases that require context-sensitive parsing because of either convenience or legacy reasons.

Popular programming languages nearly all use custom parsers written in other (or the same) Turing-complete programming languages, so it's not "harder" to parse the context-sensitive rules, it only poses a problem if you want to use a parser generator to implement parsing for a CSG.


> A classic example is that in C++ it is impossible to know whether >> closes nested template argument lists, or is the right bit shift operator, or the stream input operator, without arbitrary lookahead--you need to parse the rest of the statement in order to decide which one it is in the current statement's context.

The classic example, I guess, is the most vexing parse: https://en.m.wikipedia.org/wiki/Most_vexing_parse

"""

The line

  TimeKeeper time_keeper(Timer());
is seemingly ambiguous, since it could be interpreted either as

    a variable definition [...]
    a function declaration [...]
"""

These things you can only resolve by having the symbol tables around and checking. The parser alone can't tell.

A language could avoid this, e.g. by requiring specific keywords or having more distinct syntax.

Humans (programmers), however, can often deal with ambiguity; we have it in human language as well, and context helps in most cases, so I'd argue those aren't practical problems. (Though that's only an opinion.)


Yes and programming languages are human interfaces to machine instructions, so context sensitivity can be manageable and even desirable to human users of the language, even if it makes the implementation of the language interpreter more complex or less elegant. Programming language designers make this trade-off all the time.


> without ambiguity

This is not a distinguishing feature. (I'm assuming you meant something else.) Example of a CFG with ambiguity:

    S → S S
    S → x


You're right, I meant that each valid RHS matches one and only one LHS--but that's also true of CSGs.


In BASIC the equal symbol "=" can mean assignment or equality depending on "context". Is this an example of a non-context-free grammar?

On the other hand, I've seen EBNF definition of BASIC (VB, actually). Is this a contradiction?

(just someone still learning)


It's still context-free: by the time you hit the '=' symbol you already know whether you're in a <Statement> or a <Compare Exp> production rule, based on the preceding symbols (namely the LET keyword). See these excerpts from the BASIC EBNF:

    <Statement>   ::= CLOSE '#' Integer
                    | DATA <Constant List> 
                    [...]    
                    | LET Id '=' <Expression> 
                    [...]
                    | Remark

    <Compare Exp> ::= <Add Exp> '='  <Compare Exp> 
                    | <Add Exp> '<>' <Compare Exp> 
                    | <Add Exp> '><' <Compare Exp> 
                    | <Add Exp> '>'  <Compare Exp> 
                    | <Add Exp> '>=' <Compare Exp> 
                    | <Add Exp> '<'  <Compare Exp> 
                    | <Add Exp> '<=' <Compare Exp> 
                    | <Add Exp>


A context free grammar can handle this fine - as you can see from the EBNF - so no, it's not an example.


Note that "context free grammars", in the technical sense, is completely the wrong thing. Eg, enforcing consistent indentation is not context free:

  foo()
      {
      return;
      }
  ↓↓↓↓↓↓↓↓↓↓↓
  aaaa{
  bbbbreturn;
  cccc}
  ↓↓↓↓↓↓↓↓↓↓↓
  aaaabbbbcccc
This is equivalent to aⁿbⁿcⁿ, which is pretty much the canonical example of a non-context-free language.

What you actually want is not context-free grammar, but, as TFA says:

> the code should be parseable without having to look things up in a symbol table


There is a trick for indentation: Use a context-sensitive lexer in front of a context-free parser. The lexer serves indent and dedent tokens to the parser.

With this trick, you can parse Python with a context-free grammar.
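
A minimal sketch of the trick (not how CPython's tokenizer is actually written, just the shape of it):

  # Track an indentation stack and emit INDENT/DEDENT tokens, so the
  # parser downstream only ever sees balanced block delimiters.
  def indent_tokens(lines):
      stack = [0]
      for line in lines:
          if not line.strip():              # ignore blank lines
              continue
          width = len(line) - len(line.lstrip(" "))
          if width > stack[-1]:
              stack.append(width)
              yield ("INDENT", width)
          while width < stack[-1]:
              stack.pop()
              yield ("DEDENT", width)
          yield ("LINE", line.strip())
      while len(stack) > 1:                 # close blocks left open at EOF
          stack.pop()
          yield ("DEDENT", 0)
The grammar then treats INDENT/DEDENT exactly like '{' and '}', and stays context-free.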


There are plenty, the majority even, of programming languages that do not enforce consistent indentation. You make a false generalisation when you claim that CFGs are completely the wrong thing.


I'd suggest that generally the grammar should be context-free, but that a few features can be non-context-free. Formatting like this should specifically be a separate check IMO.


> This is equivalent to aⁿbⁿcⁿ

Is it? You only indent with spaces, so a = b = c = ' ', and the problem you claim disappears.


I'm not sure that's a solution. The problem is verifying that the three 'n's are equal, not that 'a', 'b', and 'c' are equal. Solving this cannot be done in a context free grammar [0].

[0]: https://cs.stackexchange.com/questions/33153/is-an-bn-cn-con...


Yeah, you're right. I got confused by the original phrasing.


https://stackoverflow.com/questions/898489/what-programming-...

Most languages have context-free syntax, which is what the article refers to. There really is no reason to sacrifice that. Even modern PHP recognises the value of having a parse tree independent of an entire compiler.

Context-free semantics is an entirely different matter, and I'm not even sure what it'd mean...


The two top answers are in conflict. The second answer with 43 points is closer to right:

There are hardly any real-world programming languages that are context-free in any meaning of the word.

The first answer with 41 points is totally wrong: The set of programs that are syntactically correct is context-free for almost all languages

-----

A better source is this whole series by Trevor Jim, which has nice ways of relating theory to practice. He lists a bunch of reasons why you could consider nearly all programming languages not context-free.

http://trevorjim.com/parsing-not-solved/ -- hundreds of parser generators support context-free grammars, but there are almost no context-free languages in practice.

http://trevorjim.com/python-is-not-context-free/ -- A main point here is that you have to consider the lexer and parser separately. Grammars don't address this distinction, which arises in essentially all programming languages. Also there is some lexical feedback, similar in spirit to C's lexer hack.

http://trevorjim.com/haskell-is-not-context-free/

http://trevorjim.com/how-to-prove-that-a-programming-languag...

http://trevorjim.com/c-and-cplusplus-are-not-context-free/ -- the way LALR(1) conflicts are resolved in practice can make a language not context-free

(copying from a recent comment)


The implementation of D has a lexer that is independent of the parser, and a parser that is independent of the rest of the implementation. I have resisted enhancement proposals that would put holes in those walls.


could you give an example for

> Any programming language that is massively adopted is context free?

I want to have an intuition on what is context free and what is not.

Is D context free?

I mean I want to make a useful language, is being context free practical?


D is context free in that the parser does not require semantic analysis in order to complete its parse.


The best way would probably be to work on a language that isn't context-free. When you run into the problems context-sensitivity introduces, then you'll really start to understand it.

In the past, Pascal was widely used in teaching contexts, for both learning to program and later how to implement a compiler. I've seen Walter say in the past that he got into doing languages from a series of articles in Byte magazine that included the listings for a small Pascal compiler. For your task, you can use Oberon instead, which is smaller and simpler than Pascal. Oberon-07 is context free up to a point, which is another way of saying that it's not context-free. It'll work well for this exercise. Don't worry about writing a full-fledged compiler, because you'd end up spending forever on the code generator backend. Instead implement half of a compiler.

Here's one project that fits the description of being one half of a compiler—a documentation generator:

https://github.com/schierlm/OberonXref/

Implementing this requires doing the lexer and the parser, but its "backend" streams out documentation pages instead of object files or assembler. The truth is, you wouldn't even need to go that far. Just do the lexer and then start implementing the parser—let's say that your program should just read in source code and then say whether it's valid or not—whether it will parse. Since Oberon is mostly context-free and the natural way to do things is to use a recursive descent parser, you'll get into a rhythm implementing things, until you run into the part that's context-sensitive, at which point things will get uncomfortable. You will find true enlightenment then. This can be done over a weekend, but if you reach the end of the first weekend and realize it will take you two, don't sweat it.
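
To give a feel for the rhythm (a generic expression-grammar fragment in Python, nothing Oberon-specific):

  # One function per grammar rule, each consuming tokens and returning (node, next index).
  # Grammar: expr -> term (('+'|'-') term)*    term -> NUMBER | '(' expr ')'
  def parse_expr(toks, i=0):
      node, i = parse_term(toks, i)
      while i < len(toks) and toks[i] in ("+", "-"):
          op = toks[i]
          rhs, i = parse_term(toks, i + 1)
          node = (op, node, rhs)
      return node, i

  def parse_term(toks, i):
      if toks[i] == "(":
          node, i = parse_expr(toks, i + 1)
          assert toks[i] == ")", "expected ')'"
          return node, i + 1
      return ("num", toks[i]), i + 1

  # parse_expr(["1", "+", "(", "2", "-", "3", ")"])
  # -> (('+', ('num', '1'), ('-', ('num', '2'), ('num', '3'))), 7)
Each nonterminal becomes one function like that; the context-sensitive spots are exactly where one of these functions can't decide what to do from the upcoming tokens alone.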

> I mean I want to make a useful language, is being context free practical?

It is. If you do the exercise above, it should be straightforward to figure out several different ways to change the grammar to eliminate context-sensitivity in your parser. You'll realize that the transformations would be minor. They won't compromise the power or practicality of the language, and it's possible to end up with one that is almost exactly the same. The biggest casualty in this case would be reworking a dubious design decision made for aesthetic reasons.


Thank you very much for the references. They are very good reads!

In the article about Python, the author said:

> Most languages of interest are context sensitive, and yet we are trying to use inappropriate tools (context-free grammars and context-free parser generators) to specify and parse them. This leads to bad specifications and bad implementations.

Then what do you think is a better tool to specify and parse languages?


> what do you think is a better tool to specify and parse languages?

That's unfortunately an open problem. As I mentioned here, lexing is "solved" but parsing isn't:

https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-...

Here are what some recent languages do:

Go: Maintain a written spec [1], with multiple hand-written parsers. (I think the main one is in Go, while gccgo has one in C++?) I believe Go used to be parsed with yacc, but they moved to a hand-written parser for error messages. It looks like [2] was an early attempt to keep it generated.

[1] https://golang.org/ref/spec

[2] https://research.swtch.com/yyerror

-----

Rust: The parser is hand-written, like Go's. There's a grammar in the spec, but like Go's, it's not used to generate code, so it isn't tested. They also have an LALR(1) grammar which was supposed to be an executable specification, but as far as I can tell that effort has not made progress recently.

It used to be done with GNU yacc but it looks like they moved to a Rust-based tool [3]

[1] https://doc.rust-lang.org/grammar.html

[2] https://github.com/rust-lang/lang-team/tree/master/working-g...

[3] https://github.com/rust-lang/wg-grammar/tree/master/grammar

-----

So basically the state of the art is fairly laborious. It's basically write down the grammar, implement it by hand, and try not to make any mistakes. If you need to implement it again, which is extremely common (e.g. for gccgo, for Rust's language server), also try not to make any mistakes.

When you need to change the language, update the spec, and change all the parsers by hand.

I've written multiple recursive-descent parsers based on grammars, and it's not hard if you have examples to test with, but mistakes can easily creep in.

Rely on users to report bugs. There is a "long tail" of programs that will tickle various corner cases.

-----

Here's the approach I took with Oil:

How to Parse Shell Like a Programming Language http://www.oilshell.org/blog/2019/02/07.html

which is basically to do as much work as possible in the lexer, use a high-level programming language translated to C++ for the parser, and to use a DSL for specifying the AST. This approach keeps the mistakes and tedium down because shell is an extremely large language syntactically. (It could be the biggest of any language except for perhaps Perl.) Shell is relatively small semantically, but huge syntactically (which is why I'm preoccupied with parsing :) ).


> A main point here is that you have to consider the lexer and parser separately. Grammars don't address this distinction, which arises in essentially all programming languages.

So why does this happen? Couldn't context-free language parsers emulate a lexer by treating individual characters as symbols?


If there's no feedback, you can consider the lexer and parser separately as languages.

- The lexer recognizes a set of strings of characters.

- The parser recognizes a set of strings of tokens (as returned by the lexer)

But those two languages will have different power in general, so, like Jim points out, when you say something like "Python is context-free" or "Haskell is context-free", it's not clear what you're talking about. And it's really false under any reasonable interpretation.

So you can consider them separately, and make a precise statement. But if there's feedback between the two, then you can't do that anymore. The theory doesn't tell you any properties that the (lexer + parser + feedback mechanism) possesses.

That is, regular languages and context-free languages have all sorts of properties proven about them, including ones that let you write code generators. You don't get any of those properties when you have an ad hoc feedback mechanism between the lexer and parser.

----

Someone could come up with formalisms for specific types of feedback.

I think OCaml's Menhir has done some of this, but I don't remember the details off hand.

They have written parsers for C (CompCert) and POSIX shell and addressed some of the gaps between theory and practice. I don't use it but they're at least tackling the right problems.

But note that C uses a specific type of feedback (the lexer hack), which isn't identical to what other languages use. So you would have to come up with a formalism for each one, and probably nobody has done that.


Is Haskell context-free if you don't use the indentation layout mode? Haskell does support using braces and semicolons too. However, this is not true of Python (as far as I know).


In theory Haskell supports braces and semicolons but in practice nobody uses them, leading to bugs like this one:

https://github.com/haskell/haddock/issues/12


I do use braces and semicolons when programming in Haskell, and I have encountered that bug before, and I hope that they will fix it (although it still seems to be unfixed after six years, but a few people clearly do care, even if others don't care).


what is a context-free language anyway? context-free grammar has a clear definition in parsing, not sure i understand how that can be extended to languages.


The OP is abusing the term "context free". He's saying it avoids "the lexer hack" [1]:

> Context free grammars. What this really means is the code should be parseable without having to look things up in a symbol table

That's NOT what context free means. That's a narrow view from someone designing a C-like language and trying to avoid a very specific property of C.

Another example of being context-sensitive, which has nothing to do with symbol tables, is that resolving LALR(1) conflicts in a yacc-style grammar can make the language not context-free. To resolve ambiguity the parser is doing stuff "outside" the grammar.

----

"Context free" is a mathematical term with a precise definition. It's a class of languages that's part of the Chomsky hierarchy [2], which was discovered and described by linguists and mathematicians before any programming language existed, and has applications outside programming languages. Wikipedia does a good job:

https://en.wikipedia.org/wiki/Context-free_grammar

A formal grammar is considered "context free" when its production rules can be applied regardless of the context of a nonterminal. No matter which symbols surround it, the single nonterminal on the left hand side can always be replaced by the right hand side. This is what distinguishes it from a context-sensitive grammar.

Simple examples of languages that are context-sensitive (not context-free): Lua string literals, Rust raw strings (see other comments in this thread), and shell here docs:

http://lua-users.org/wiki/StringsTutorial

Yes:

    [[ mystring ]]
    [=[ mystring ]=]
    [==[ mystring ]==]
No:

    [=[ mystring ]]    # mismatched
Matching this language requires a context-sensitive grammar and can't be done with a context-free grammar. It's not all that easy to prove: See the posts from Trevor Jim regarding proofs.
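
To see why, note that the closing bracket must repeat exactly as many '='s as the opening one, for any count, so a recognizer has to carry that count across arbitrarily long content. A toy recognizer in Python (ignoring Lua's exact first-closing-bracket rule):

  import re

  def is_lua_long_string(s):
      # The backreference \1 forces the closing ]=...=] to use the same
      # number of '=' as the opening [=...=[ ; that count is the "context"
      # a context-free grammar can't carry for unbounded lengths.
      return re.fullmatch(r"\[(=*)\[(.*)\]\1\]", s, re.DOTALL) is not None

  is_lua_long_string("[==[ mystring ]==]")   # True
  is_lua_long_string("[=[ mystring ]]")      # False (mismatched)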

The C language is also not context-free.

[1] http://www.oilshell.org/blog/2017/12/15.html#appendix-lexing...

[2] https://en.wikipedia.org/wiki/Chomsky_hierarchy


> [CF] was discovered and described by linguists and mathematicians before any programming language existed

Nitpick: that was 1956, and by then a few PLs did already exist.


I read it to mean that the language's grammar is context free, but I guess I'm not sure if that's what was meant.


> Most languages have context-free syntax

Do they? I haven't done a survey of languages but I would guess most are context-sensitive.


I would guess so, too. In my experience, they do strive to be CF; the CS parts show up only mildly, are few and far between, and are often accidental.

That makes using a CFG parser practical and hacking around the problems caused by CS does not take much effort.


Rust is almost context free; there’s one relatively rarely used bit that requires context. And usually it’s only one or two bytes of it.


You're referring to raw string literals, right?


Yes. You need to include more #s than you have in the string, and it’s rare to have a ton of them, let alone a raw string literal in the first place.


Does that make the grammar context-sensitive, or just the lexer?


This is not an area of my expertise, so I can't say.


This point in the article is pretty weird (the rest is good). I don't think IDEs do a lot of parsing to do syntax highlighting; isn't it all just regex matching to identify the types of tokens? I'd be interested in real-world examples of IDEs doing something more complex to achieve syntax highlighting. And conversely, examples of imperfect syntax highlighting of C++ due to the undecidability of its input language. I mean, yes, an actual parser for C++ must be able to execute template computations, and that's a pain, but why would that be relevant to syntax highlighting? It isn't.

Now, there are valid reasons for running a proper language frontend from the IDE: Error reporting of all kinds, not just syntax but also type errors and whatever else the compiler likes to complain about. But there the IDE should not try to replicate the parser. Parsing only a context-free surface language will not catch type errors, exactly because enforcing a static type system requires one to "look things up in a symbol table". So the actual compiler should provide an IDE-friendly mode where it runs its frontend and reports errors back, and the IDE should not try to roll its own version of this.

In other words, the IDE is not relevant either way.

> Any programming language that is massively adopted is context free?

Statically typed languages require semantic analysis and symbol table lookups, so they are out.

And yet we have context-free grammars for all widely adopted statically typed programming languages. And we also have additional semantic checks for all of them. The notion of a "context free programming language" being one that has context free syntax and no semantic checks at all is not a useful notion. Don't worry about it while designing your language.


I'm pretty sure that the JetBrains products (Intellij, PyCharm, etc.) have a pretty deep understanding of the languages they handle. Java is strongly typed, and Intellij does a perfect job, (as far as I can tell), of highlighting, and more importantly, refactoring. Hard to see how they could do that with just regex matching.

By contrast, Python is more dynamic, and so the refactoring in PyCharm is pretty weak. It usually catches some of what it needs, but I often need to find the rest. It does appear to be the case that PyCharm is doing flow analysis to infer types. I.e., even PyCharm is doing more than regex matching.


> highlighting, and more importantly, refactoring. Hard to see how they could do that with just regex matching.

The former is easy, the latter is impossible with regexes only. Which tells you that they are very different things. For refactoring you need not only a proper parse, but also the ability to "look things up in symbol tables". Which is what I said: There are good reasons to run a proper language frontend from the IDE. But syntax highlighting isn't one of these reasons. And refactoring isn't possible with only a simple parse without using symbol tables.


>> Any programming language that is massively adopted is context free?

> Statically typed languages require semantic analysis and symbol table lookups, so they are out.

Isn't that the typechecker, not the parser?


Depends on what level you're thinking of: "pure" grammar or "well-formed programs acceptable to the compiler". C++ does have a context-free grammar (it's in the standard), and also a lot of additional constraints on what constitutes a well-formed program (this includes type checking). Since we're discussing a claim that "C++ is famously not a context free grammar", we must be thinking of the latter notion and include the type system in our considerations. In which case, we must consider the entire frontend as a whole.


Many syntax highlighters actually interpret the language to some extent. For example, Atom: https://github.blog/2018-10-31-atoms-new-parsing-system/

Anyway I'm just posting now as I first read this comment when you posted it and tried to recall this blog post by the infamous Steve Yegge, and just finally did: http://steve-yegge.blogspot.com/2007/06/rich-programmer-food...


Emacs definitely does parsing of C/C++ (or delegates to an LSP server, depending) in order to do indentation, users' preferences for which tend to depend heavily on syntactic context.


Indentation is a good point, you need a proper parse for that. But for that you don't need the ability to consult symbol tables, or resolve nasty C++ template metaprogramming stuff. A context-free grammar suffices, and C++ has that. C++ also has some very complicated (undecidable) rules for what programs are well-formed and should be accepted by the compiler, but the editor's indentation tool doesn't need to worry about those.


> I don't think IDEs do a lot of parsing to do syntax highlighting, isn't it all just regex matching to identify the types of tokens?

IntelliJ highlights variables etc. differently if they're unused and can highlight different identifiers in different colors. Not sure about other current IDEs.

That said I never felt like I needed such features, normal syntax highlighting does the job pretty well for me.


IntelliJ has context-aware autocomplete. Type this.field.get and it will show methods and fields starting with get for whatever type field is. There's a lot of context needed to make that happen.


Eclipse has refactoring tools for at least Java and c{,++}

Variable renaming respecting scope, function renaming that updates the name at the call sites but ignoring overloads, function extraction... and so on. You may not use it or indeed use eclipse at all.

I can't tell you about the details, but clearly something is going on beyond syntax highlighting via keyword regex matching. With a beefy enough box it all works well enough.


The purpose of context-free, or rather unambiguous grammars in general, isn't for the benefit of compilers, but for the benefit of humans. A computer can parse almost everything, even regexes (with backtracking etc.), it's the humans that trip over fancy grammars.


Notwithstanding the other comments on context free vs sensitive, the 2014 article predates the language server protocol, which neatly moves the support for programming languages from the IDE to a stand alone tool tied to the language: https://microsoft.github.io/language-server-protocol/

https://en.m.wikipedia.org/wiki/Language_Server_Protocol

So that IDE support point has become much weaker, if not moot.


In some programming languages, e.g. TeX, Forth, PostScript, etc., you cannot really syntax highlight the program without executing it. However, syntax highlighting is a feature that I can do without, and I don't think those programming languages are bad. In some programming languages, such as C and Free Hero Mesh, you can read the sequence of tokens even though there are macros, which you need not parse to find where the tokens are (although I think older versions of the C preprocessor did not work like this).

They mention tools. I do think valgrind is sometimes helpful; I do use that. However, Git is not the only version control system; some people prefer others, such as Mercurial, Fossil, etc. I use Fossil.

They also mention error messages. I like the first one; print one error message and quit. However, in some cases, it might be possible to easily just ignore that error and continue after displaying the error message (and then finally fail at the end). It might also be possible to skip a lot of stuff, and then continue, e.g. if there is an unknown variable or function then you might display an error message and then skip the entire statement that mentions it.

They mention a runtime library. This should be reduced as much as possible, I think, and if you can make it so that it has functions which will only be included in the compiled program when they are used, that can be a good idea, since you can include some functions that many people don't need and don't waste space and time with them.

Something they did not mention but I think it is very good to have and useful is metaprogramming and 'pataprogramming.

But I also think that different programming languages can be good for different purposes, and that some are domain specific, and some are easier to write than others. This can be done using existing syntax or new syntax, and I have done both, and so have some others.

Another thing that I can recommend to have is a discussion system with NNTP; even Digital Mars themselves have an NNTP-based discussion system. Perhaps better would be to have the same messages available over NNTP, a web forum, and mailing lists; again, this is what Digital Mars does.

However, I do think that minimizing keystrokes is helpful. Minimizing runtime dependencies and runtime memory usage is also helpful.


> In some programming languages, e.g. TeX, Forth, PostScript, etc you cannot really syntax highlight the program without executing it.

Technically true. In practice, however, I've run into way more issues with syntax highlighters in overly-complex fixed languages than I have with simpler-but-programmable languages.


What a great article. It's something I need to read, as I have been working on my own programming language for a while now. This creating-a-programming-language thing is so difficult and so much effort. (I'm stuck on designing how strings work right now.)

+1 on writing parser by hand. I've done it once with java (https://github.com/tanin47/javaparser.rs), and I switched my own programming language from a parser generator to a hand-written one. Once you know how to write a parser by hand, it's a much better approach (e.g. easier to test, modularizable).

My programming language is a bit strange in that I drop the performance requirement. Performance distorts other aspects. I focus on making the language featureful and beautiful instead. I hope I'll achieve this dream of making a usable programming language one day.


After learning s-expression based syntax, I am just baffled why we even bother with anything else.

When you play around with different Lisps, the syntax is always the same; the differences between languages become the semantics only.

Other languages put too much emphasis on the syntax, in my opinion. And while I understand the "popularity" appeal, I've almost never seen someone learn the s-expression syntax and afterwards not like it.

Basically I think it would be worth it to push people to learn the s-expression syntax just so we can stop wasting our time with syntax afterwards.

Also if I recall, without user macros, I think s-expression syntax is context-free, no?


> I've almost never seen someone learning the s-expression syntax and afterwards not liking it.

I don't particularly like s-expression syntax - I think s-expressions suit the computer at the expense of the programmer, which I think is backwards. I think they're verbose and noisy, and that obscures the meaning I want to see in the text as a person. Yes they're easier to parse, but I want to make my life easier, not the computer's. And yes they're convenient for metaprogramming, but I don't want to optimise for the meta case.

A concrete example - I'm very happy working with precedence. I've been doing it since I started school. My five-year-old can understand precedence. Using precedence I can reduce ceremony and noise in my code, and it allows me to more naturally look at an expression and take in its meaning. But s-expressions don't like using precedence.


> s-expressions don't like using precedence

I'd rather say that in s-expressions, precedence is regular and explicit. To me that's a strong argument in favor of it being more widely accepted as the standard.

With infix notation, operators have implicit precedences, which feel natural to us only because it's what we were taught from an early age.

I'm with the parent comment, that s-expressions ought to be advocated and evangelized (haha) as a foundational approach - even with arithmetic in basic school. If we had started thinking with them as children, or were at least later able to shed the legacy of infix notation, we'd find s-expressions objectively simpler to understand, with fewer things to keep in mind.

Well, that sounds stronger than I meant to. I mostly write in infix-style languages, and they flow fine. But I ideologically align with Lisps - there's a conceptual simplicity and beauty that really should have been the mainstream, and still may in the future.


You might argue that s-expression syntax is "worse is better" but I'd argue that more complex syntaxes are "accidental complexity." You need a parser, you need a more complex prettyprinter.

Without s-expressions you also lose programmer access to the parse tree.

For the usual arithmetic operators I've never had any problem, but I was once caught out when programming in C by getting the precedence of << and >> wrong (at least to me, it's counterintuitive), and it took a few hours to find the bug, because my code looked correct. Had I just parenthesized everything, that bug wouldn't have happened.


Do you mean precedence or infix?

The choice of infix vs prefix vs postfix is semantic. What s-expression syntax imposes is to have everything balanced between brackets. For example, nothing prevents an S-expression based language to allow:

(2 + 3)

The only thing is there are no precedence rules, you must have parenthesis always:

((3 * 2) + 5)


Why are you trying to explain s-expressions to me like I’ve never heard of them when I’m already in a thread debating their pros and cons for language design and implementation?


I'm sorry that you are getting defensive and feeling like I'm attacking your expertise. But there's all kind of people with different experience levels on HN, so I prefer asking for clarification when I can instead of assuming things and talking over each other.

If you truly meant precedence, well, I guess I disagree with you. Even in math, I prefer explicit parenthesised expressions for readability.

If you meant infix, that's not a concern of s-expression so it's irrelevant to the syntax question.

I know that personal preference is basically the reason for the various syntaxes, of course. But similar to how there's a movement arguing that a common standard code format is better than everyone having their own custom formats, you could argue that having a common syntax - and I think s-expressions would make a great common syntax - could also yield an overall better outcome. And that's my argument here.

Most languages have an s-expression-based version: Python has Hy, Lua has Fennel, etc. So you already see a trend in how s-expressions could easily cover all those languages, which is natural since they are closer to the parse tree anyway.


This reminds me of the debates in college (late 1970's) of RPN calculators vs Infix calculators.

RPN's big advantage was a reduced number of keystrokes, which is important with a calculator. But when dealing with formulas that you can actually type on a keyboard, the number of keystrokes is not important; readability is much more important, and infix wins.

> I've almost never seen someone learning the s-expression syntax and afterwards not liking it.

I'm a counterexample. Sorry :-)


It's whatever you're used to, but using sexprs means you don't need to write parsers because you have the read function. In Lisp, it's important to give the user access to the parse tree.

Since the S-expression syntax for Lisp was chosen, there have been at least three documented attempts to replace it: Lisp 2, CGOL, and Dylan. Lisp 2 syntax wasn't implemented and was only ever used in textbooks, and CGOL and Dylan were never popular and are no longer used. All Lisps today use S-expressions (but Clojure has [] and {} as well as ()), as do some Prologs, as well as a few other languages such as CLIPS. Readability is taken care of by the prettyprinter.

For arithmetic expressions, I agree infix is easier to read because that's what maths uses. For complex formulae, I wrote a Lisp macro which accepts infix notation and I'm sure other Lispers have done likewise.


> I'm a counterexample. Sorry :-)

Haha, fair enough, I did say "almost" ;)

That said, I should have specified: people who have spent enough time with it. There is a certain amount of time needed to get your brain to really adapt, and it's generally after that period that I've rarely seen anyone not like it. But I admit this might suffer from selection bias; the only people who stick with it long enough for their brain to adapt might be the people who like it.


It's simple, but it's fundamentally reductionist. You can ease the cognitive load by making certain programming structures first-class citizens in the language.


> You can ease the cognitive load by making certain programming structures first-class citizens in the language.

Which, in every s-expression based language I'm aware of, can be achieved using macros.


Isn't that the point - s-expressions aren't good enough so people work around them by writing more conventional languages and parsers (the reader macros) to avoid having to write in them.


I don't think that comment was about reader macros. I don't see reader macros used much, while normal macros are frequently used to create new control structures.


Yes but now you have to worry about parentheses and at what nesting level certain constructs should live.


An adequate editor with autoindent and paren matching solves those problems completely.


FWIW, I've learned s-expressions and never particularly liked them either. I like their elegance, sure, but I found working with them to be cumbersome in practice.


On regexes, he is generalizing from a small number of examples. Some lexers are straightforward to write by hand, but others are better done with a code generator.

edit: I should really say that you should consider using regular languages to lex your programming languages, not (Perl-style) regexes. Those are different things:

https://swtch.com/~rsc/regexp/

I used re2c for Oil and it saves around 5K-10K lines of "groveling through backslashes and braces one at a time" (e.g. what other shells do)

http://www.oilshell.org/blog/2019/12/22.html#appendix-a-oils...

And the parser is faster than bash overall:

http://www.oilshell.org/blog/2020/01/parser-benchmarks.html
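
FWIW, the "regular languages, not regexes" idea boils down to this: each token class is something a small state machine can recognize by scanning strictly forward, with no backtracking. Here's a minimal hand-written sketch in C of the shape of code a generator like re2c ultimately emits (toy token set, my own example, not Oil's actual lexer):

    #include <ctype.h>
    #include <stdio.h>

    /* Token kinds for a toy language: identifiers, integers, single-char operators. */
    enum tok { T_ID, T_INT, T_OP, T_EOF };

    /* Each token class is a regular language ([A-Za-z_][A-Za-z0-9_]*, [0-9]+, ...),
       so the scanner is a small state machine that only ever moves forward. */
    enum tok next_token(const char **p, const char **start, const char **end) {
        while (isspace((unsigned char)**p)) (*p)++;
        *start = *p;
        if (**p == '\0') { *end = *p; return T_EOF; }
        if (isalpha((unsigned char)**p) || **p == '_') {
            do (*p)++; while (isalnum((unsigned char)**p) || **p == '_');
            *end = *p; return T_ID;
        }
        if (isdigit((unsigned char)**p)) {
            do (*p)++; while (isdigit((unsigned char)**p));
            *end = *p; return T_INT;
        }
        (*p)++; *end = *p; return T_OP;
    }

    int main(void) {
        const char *src = "x1 = 42 + y", *p = src, *s, *e;
        for (enum tok t; (t = next_token(&p, &s, &e)) != T_EOF; )
            printf("%d '%.*s'\n", (int)t, (int)(e - s), s);
        return 0;
    }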


You might be doing this as a learning exercise, in which case it doesn't need to be particularly innovative, and writing another Lisp or Forth implementation is fine. In fact, I'd recommend beginning along those lines, followed by developing several different new languages, probably domain-specific ones. Your first attempts probably won't be worth keeping. (I haven't kept the object oriented Prolog I wrote about 25 years ago, but I do regularly use a small Prolog I recently wrote in Lisp.)

The languages at the roots of all programming languages are Fortran, Lisp, Cobol, Algol60, APL, Snobol, Forth, Prolog, OPS5, SASL, Smalltalk, Prograph. (If I've missed any out, or any of the above have important predecessors, please let me know.) A new language is unlikely to be radically different from any of those languages or their descendants - of those I listed, the most recent is from the mid 1980s. Within the descendants, which is your favourite? Can you make big improvements to it? If so, make them. Otherwise, can you improve another language so that it becomes your favourite?

One heuristic I've found useful: the right design choices at the start might be unknown, but when I need to make them, they're obvious to me, and often a particular choice leads to other equally obvious choices. If the choice isn't obvious, I've probably done something wrong.

My own preference is for pure rather than hybrid languages with simple regular syntaxes (i.e. one idea expressed well), and very strong static typing.

My favourite language is Common Lisp or rather a subset of it. I have trouble thinking of ways to make it significantly better. But I also like the idea of dataflow (a few decades ago I knew some of those working on experimental dataflow hardware), and graphical programming seems to be the right approach to dataflow. Prograph was the closest existing language to what I wanted. It isn't pure dataflow, it's dynamically typed, it's object oriented (so, unless it's Smalltalk, it's hybrid), and entering programs in it involves too many mouse gestures and keystrokes. So I'm implementing a new language which corrects those deficiencies. It's been particularly difficult because I have to write an IDE as well as the language, and there aren't any textbooks I can rely on for help.

I'll have succeeded if I use the new language more often than I do Lisp, and if along the way I get a few more papers (two so far) and some academic interest, so much the better. I don't expect a wide user base. If that was important to me, it would have C syntax and I'd try to get corporate backing. Your criterion for success might differ from mine.


> I have to write an IDE as well as the language

This part caught my attention. I've developed a few small languages over the years, with somewhat boring syntax in Lisp or Algol family. The one that has the most longevity (~7 years) and traction (estimated few thousand users) is an XML-based (!) domain-specific language. Frankly it's verbose and kind of ugly, but apparently easy to learn due to its regular, minimal syntax - a little like Lisp, if parentheses were brackets. It's also "declarative", a bit like SQL, in that the users describe the result, and magic happens internally to make it so.

More recently, I've been working on a tree-structured editor for users to visually build programs/templates. Developing it has been insightful - I'm realizing that this development environment should have been there from the beginning, as a fundamental part of the language's design and user experience.

All syntactic constructs, keywords, patterns, defined variables, functions, etc., can be made available to the user, with auto-complete, suggest, lists of choices. Rather than the user having to remember commands and type them in text, the environment can let them select, compose, and build. Ideally, it can ensure that a program never has a syntax error (conversely, that it makes it impossible to write/build such a program).

Another aspect I've come to appreciate is instant feedback. Any kind of live preview, or automatic build/test, makes a big improvement in the flow of programming and thought process. It becomes like shaping clay.

In an article about Smalltalk, I read how its integrated environment is indistinguishable from the language itself. In my daily work too, I spend most of my time in an editor with language servers for syntax highlight, autosuggest, linting, and other smart features, as well as integration with terminal, Git, remote edit with SSH.. Typing in text is still the fastest way for me to express myself, but the environment is constantly supporting it - it understands what I'm writing (often more than I'm aware), showing me what I need to know to build with minimal effort.

What I like is a programmable development environment, especially one which is built on itself.

Well, this has been a ramble without a point. It's an endlessly fascinating subject, and perhaps it's only natural that a programmer finds delight in creating one's own language. There's something pure and philosophical about it, like working with the very material of thought.

---

A little list of links related to building interactive software visually:

https://airtable.com/

https://bubble.io

https://www.memberstack.io/

https://webflow.com/

https://zapier.com/


A tree-structured editor with built-in language constructs sounds like an incarnation of the Cedar/Tioga desktop and programming environment from Xerox back in the 1980s. That environment replaced three distinct programming environments that were very popular at Xerox at the time, namely Smalltalk, Lisp, and Mesa, and according to Xerox staff it was a highly productive programming environment.

Perhaps someone should re-introduce a modern version of Cedar/Tioga, similar to how Apple and Microsoft re-introduced the pervasive mouse/windowing user interface originally created by Xerox in the 1970s.


Ah, what a delightful rabbit hole your comment led me down.. I've got about twenty tabs open now, articles on the history of (and the relationships between) programming languages and graphical user interfaces. Will enjoy this food for thought!

My recent reading has included:

Unix: A History and a Memoir (Brian Kernighan) - https://www.cs.princeton.edu/~bwk/memoir.html

The Development of the C Language (Dennis Ritchie) - https://www.bell-labs.com/usr/dmr/www/chist.html

JavaScript: The First 20 Years (Brendan Eich) - https://zenodo.org/record/3710954/files/jshopl-preprint-2020... (PDF)

This morning, this little phrase jumped out:

> C was created on a tiny machine as a tool to improve a meager programming environment.

What this made me realize is that a language is a user interface. It makes sense how C evolved with UNIX as an operating system, that a language is deeply fundamental to a computing environment (and its development).

There's a parallel with JavaScript and Netscape. The language was meant to provide users with an interface to operate the web browser, create dynamic documents, and (as it turned out) to let users develop the web environment.

..Which brings me back to the Cedar/Tioga programming environment you mentioned. While searching around, I happened across another comment of yours (I see it's a favorite topic! :) and found the good stuff:

Eric Bier Demonstrates Cedar - https://www.youtube.com/watch?v=z_dt7NG38V4 (video)

Active Tioga documents: an exploration of two paradigms - http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volum... (PDF)

The Mesa Programming Environment - http://www.digibarn.com/collections/papers/dick-sweet/xdepap... (PDF)

Fascinating. I saw a phrase, "active document applications", which interestingly is what the web has become, in its own monstrous way.

It's a good point about "re-introducing" old ideas. I was lucky to have grown up in the earlier days of computers, to have breathed (a little) of its different culture and worldview, that I've learned to appreciate its history and the wealth of ideas - what made possible our current networked computing environment, and visions yet to be fully realized.


>important predecessors

I don't know enough about it, but Simula feels like it's missing from your list.


It was a descendant of Algol 60. It was the first object oriented language but Smalltalk is different from any language which preceded it in the way it handles object orientation.


Author here. AMA!


Are you familiar with Jai language being developed by Jonathan Blow? What is your opinion on its design?


My opinion is that Jonathan Blow is an amazing developer and he should be joining us on D!

It's been a while since I reviewed the language, and I'd have to re-review it to say anything sane here. But one thing stuck out at me. In D, the support for Compile Time Function Execution (CTFE) is very extensive, and opens the door to a lot of incredible metaprogramming abilities. Jai has this too, but has extended it to allow system calls.

While his demos of programs being run in the compiler are impressive, D doesn't support CTFE system calls for a good reason:

It'll become a vector for malware. People download source code from the intertoobs all the time, and just compile it. Heaven help us if that means your system is now compromised. I'm not going to open that door and make people afraid to compile D code.

It's not worth it.


I mean, do those people who download and compile code not then generally... run it...?

Perhaps I'm missing something but the issue here seems to be whether you trust the source, not whether syscalls are made at compile time or run time?


Trusting the compiler, CI/CD pipelines and verifiable builds.


Why not instead alert the programmer that compiling this code would call such-and-such system calls?

That's IMO how all software should work. Apple did something similar, AFAIK, in the latest version of macOS, and failed - it's obviously something that is extremely hard to retrofit onto existing software. But allowing arbitrary execution for new software, and sensibly limiting it (e.g. "this program cannot access the internet") is, I predict, fairly feasible.


Other languages (such as python) have install-time code execution. When you pip install a library, the library can run arbitrary code and of course has access to anything and everything via the python interpreter. The only thing protecting you is faith in pypi. How is compile-time code execution any different?


> How is compile-time code execution any different?

I don't know the Python ecosystem and so can only speculate. PIP install is not compiling the code, it's an installation program. Installing anything that can be executed is potentially dangerous, but I imagine PIP install only installs from a trusted source.

But I'm not going to restrict D to compile only source code from a trusted source.


> Installing anything that can be executed is potentially dangerous, but I imagine PIP install only installs from a trusted source.

Nope, unfortunately not, anyone can publish anything to pypi as long as the name isn’t taken. Plenty of room for abuse with typo-squatting, etc.


I assume most people run the code they compile on the very machine they compile it on, so how does preventing compile-time system calls help stop malware?


It means that the D compiler itself is not the doorway for the malware.

Cross-compiling is a common practice, too.


Hi Walter, I've been thinking for a few years that it might be fun to fork the C toolchain and create a modernized version of C which is 80% backwards compatible with standard C but fixes the most egregious issues of C. Do you think this is a worthwhile project?


You'll have to make it better than DasBetterC to make it worthwhile!


For reference, "Better C" subset from D: https://dlang.org/spec/betterc.html


What is your opinion about self-hosting (i.e. writing the parser/compiler in its own language)? Is that really desirable, or even necessary, or just a gimmick (I know what Wirth says, wonder what you think)?


(Not Walter but) Self-hosting a compiler should be done wherever possible. Setting aside that a self-hosting compiler becomes its own test suite (writing tests is good, but it's quite difficult to find weird behavioural bugs using unit tests if the tests only test one thing at a time), it makes it so much easier to work on the compiler (for example, the main D compiler can build itself in a second or two on my machine - if I had to cart around some other huge toolchain it would be much slower). Another boon is that if people who are really good (let's say) Go programmers want to make the Go compiler better, they don't have to try to write Go in C++ because the main compiler is written in C++.

Also, it can be a plan to rush to an MVP, rewrite the MVP in your language, and go from there. This has the added benefit of giving you a decent sized program in your language at no extra cost - aside from testing, this should make you design a better language as you learn what works and what doesn't in the "real world".

It's not always possible (Please don't write a compiler in Javascript, or even C if possible - pain and lack of abstraction respectively).


Yes on all points. I'll add that I invented D because D was the language I wanted to write code in. The D compiler was originally written in C-with-Classes, and it became increasingly irritating for me to have to be working in that language rather than D. Thankfully, the DMD compiler is now 100% in D and I've been slowly refactoring the code into much more pleasing forms.


Contrary opinion to the others: In almost all cases, don't bother self-hosting the entire compiler. At least not if you want an end result that optimizes nicely and can generate code for a wide range of architectures. That's a lot of work, and you're better off targeting an existing compiler framework (ahead of time: GCC or LLVM; JIT: the JVM or the CLR; in both cases there are probably other good choices as well). Of course if you start your project with "I wonder what an X compiler in X would look like", you will self host your hobby project end to end. But if you are trying to make something that is maximally useful for others, use the already existing stuff that is out there, and that is much better than what you would pull off on your own.

As for the parser, it's probably true that for many languages, self-hosting the frontend may be attractive. For others it might not. Someone else wrote: "if people who are really good (let's say) Go programmers want to make the Go compiler better they then don't have to try and write go in C++ if the main compiler is written in C++", which is somewhat true. On the other hand I once had to do stuff in the gfortran frontend, and I was very happy that it was written in C and not Fortran, because I know C but don't know Fortran. So I guess as long as your language is niche among people likely to be compiler developers, the point about attracting that talent doesn't hold. If your language gets popular enough, self hosting the frontend should become more interesting. Before that it's more of a gimmick.


If it is a general purpose programming language, this should always be done.

Not necessary at the beginning, but eventually as the language gains mindshare.

It is very hard to understand the language, if all contributions to improving it are done in another language.

Somehow it is similar to being a library author without having ever written a full blown application using the library, thus being blind to what are the possible usability issues.


I think it depends on the language. If you're writing a language that is targeted toward systems programming, yeah, it's probably a good idea to write a compiler in it. But what about languages that are targeted differently? Should people be writing parsers/compilers in CSS or SQL? Probably not a top priority there.


Although the explanation of what you mean by "context-free grammars" is clear, it seems to me that you should be using "unambiguous grammars" instead. An ambiguous context-free grammar is not going to facilitate the job of an IDE. I also think you are mistaken about the C++ grammar not being context-free. The problem is that it is ambiguous. I could be wrong here since I have not kept up with the C++ standards. So can you provide an example of C++ grammar productions you think are not context-free?


Not OP but there were discussions last week regarding this matter in HN [1]. According to the original article C++ is neither context-free nor context-sensitive, it is actually undecidable [2].

[1] https://news.ycombinator.com/item?id=23008599 [2] https://medium.com/@mujjingun_23509/full-proof-that-c-gramma...


Thank you for the links, but I am unconvinced. Your medium link claims that to decide between a function and a variable declaration in the given program, the compiler has to solve an instance of the Post Correspondence Problem (which is undecidable). However, this does not mean that the grammar is not context-free! Context-free languages are closed under union [0] so a context-free grammar (CFG) is perfectly happy with an ambiguity like 'a variable declaration or a function declaration'.

Also, you can use an ambiguous CFG to parse a language: when the parser needs to choose between two productions, you cheat a bit and look at some external context for help. The grammar being used is still context-free though, even though the terminology becomes confusing.

For instance, the ISO C99 standard provides a grammar that is context-free but ambiguous. One of the conflicts involves the `typedef` keyword:

    typedef-name: identifier
    primary-expression: identifier
One way to solve the ambiguity is to look at the symbol table to determine whether the underlying identifier has been declared as a `typedef` previously.
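
A concrete instance of that conflict, as a toy example of my own (not one taken from the standard):

    typedef int a;

    void f(void) {
        a * b;   /* with the typedef above, this declares b as 'int *';          */
                 /* if 'a' were an int variable instead, it would mean a times b, */
                 /* so the parse depends on the symbol table, not the grammar     */
    }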

IMO, Walter Bright should talk about "unambiguous grammar" instead of "context-free grammar" when he is criticizing the above situation, because it seems to me to be the proper terminology.

As for C++, I have never implemented a C++ front-end (and I hope I'll never have to). However, it looks like the C++ standard specifies an ambiguous context-free grammar, whose ambiguities must be resolved outside the parser. That solving the ambiguities may require solving instances of the Post Correspondence Problem does not change the fact that the grammar is context-free if it is. On the other hand, I fully expect lots of hidden horrors in that standard, hence my question.

[0] https://en.wikipedia.org/wiki/Context-free_grammar#Closure_p...


The ones where the `template` keyword had to be added to make the template function bodies parse-able.


The article is from 2014. Is it still current? What would you change in 2020?


I'd write the same thing today. 6 years more experience just confirms it.


Good to know, thanks.


Hi Walter, just wondering about the seamless integration of the imperative and functional paradigm in D, do you implement Static Single Assignment (SSA) form for that at the compiler back-end [1]?

[1] https://www.cs.princeton.edu/~appel/papers/ssafun.pdf


Is it true that optimization follows the 80/20 rule? What are some common performance issues that new languages and their implementations face? Are there common optimization techniques that can be applied in order to make the new language competitive with existing ones?

For example, I know that it's generally better to compile programs into a linear code structure such as bytecode instead of interpreting a tree structure directly.


Optimizing code is a book-length topic just for an introduction. It's also true that knowing how optimizers work can feed back into improving the language design.

For example, `const` in C++ doesn't mean the data is immutable - it can change with any assignment through a pointer. No optimizations assuming immutability will work. That's why D has an `immutable` qualifier, giving the optimizer license to do optimizations assuming the data does not change.

For a famous example, Fortran assumes two arrays never overlap. In C/C++ they can. This is the source of a persistent gap in performance between Fortran and C/C++. C attempted to fix it by adding the `restrict` qualification, but this failed because it is just too arcane and brittle for most users.

In D, I'm working on adding an Ownership/Borrowing system, which will enable the compiler to figure out the pointer aliasing optimization opportunities.


An Ownership/Borrowing system like the one Rust uses? Rust is sadly not (yet) able to use its stricter rules about ownership and aliasing for optimization, because the LLVM backend has been shown to have a number of bugs that prevent such optimizations.

Sadly, I'm not that knowledgeable on optimization. Do you expect that such optimizations would make most programs significantly faster or is it more of a special case that would not improve the performance of most programs?

Keep up the good work with D. I especially love its template/metaprogramming capabilities!


> Do you expect that such optimizations would make most programs significantly faster or is it more of a special case that would not improve the performance of most programs?

I don't know because I have no personal experience with such. What I have to go on is the Fortran vs C/C++ experience I mentioned.


> That's why D has an `immutable` qualifier, giving the optimizer to do optimizations assuming it does not change.

Which optimizations are enabled by immutable data? I can think of constant folding. Is the data statically allocated in a read-only page?

> For a famous example, Fortran assumes two arrays never overlap. In C/C++ they can. This is the source of a persistent gap in performance between Fortran and C/C++.

What sort of optimizations does this assumption enable? Does it allow the compiler to freely reorder or parallelize the code?

C assumes that pointers to different types are never equal. This is incompatible with common systems programming concepts such as type punning. Even something simple like reinterpreting some data structure as an array of uint8_t can make the optimizer introduce bugs into the code. Notably, the Linux kernel is compiled with strict aliasing disabled:

https://lkml.org/lkml/2003/2/26/158

https://lkml.org/lkml/2009/1/12/369
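
For concreteness, here is the classic shape of the problem as a small sketch of my own (assuming 32-bit float and unsigned): the cast version is the punning pattern that strict aliasing makes undefined, and the memcpy version is the well-defined equivalent that compilers typically reduce to a single move anyway.

    #include <stdio.h>
    #include <string.h>

    /* Reading a float's bits through a cast pointer breaks the strict aliasing
       rule: the optimizer may assume an unsigned* cannot alias a float and
       reorder or drop the access. */
    unsigned bits_via_cast(float f) {
        unsigned *ip = (unsigned *)&f;   /* undefined behaviour under strict aliasing */
        return *ip;
    }

    /* memcpy is the well-defined way to pun the same bits. */
    unsigned bits_via_memcpy(float f) {
        unsigned bits;
        memcpy(&bits, &f, sizeof bits);
        return bits;
    }

    int main(void) {
        printf("%08x %08x\n", bits_via_cast(1.0f), bits_via_memcpy(1.0f));
        return 0;
    }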


> What sort of optimizations does this assumption enable? Does it allow the compiler to freely reorder or parallelize the code?

Consider this code:

  void foo(int* data, const int& count)
  {
    for (int i = 0; i < count; ++i)
      data[i]++;
  }
The compiler has to account for the possibility of `count` being inside of `data`. It has to reload `count` from memory at every iteration. This also prevents e.g. auto-vectorization:

Compare https://godbolt.org/z/sNRKv4 vs https://godbolt.org/z/iJq5yn.

Similar story if you manipulate data with multiple arrays in play: Unless the compiler knows (via `restrict`, type-based alias analysis or just seeing the definition) that two objects are non-overlapping, it has to correctly handle the worst case. See also this related discussion: https://news.ycombinator.com/item?id=20800076


> Which optimizations are enabled by immutable data?

If nothing else, deep immutability means you can pass around a reference to data instead of the data itself but still treat it as having value semantics. This can be a huge saving with large data structures, sharing data across threads, etc.

C++ programs are full of functions taking const references to things instead of the things themselves because C++ programmers have to do this optimisation (or some equivalent) manually to get the same performance boost, because the language doesn't prevent all possible cases of aliasing that would make automatically applying that transformation safe.
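
A rough C analogue of that manual optimisation (my own sketch, not from any particular codebase):

    struct big { double v[1024]; };

    /* By value: true value semantics, but ~8 KB is copied on every call. */
    double first_by_value(struct big b) { return b.v[0]; }

    /* By pointer-to-const: no copy, but it's only a promise; some other
       alias may still mutate the data, so the compiler cannot treat it as
       a genuinely immutable value the way a language guarantee would allow. */
    double first_by_ref(const struct big *b) { return b->v[0]; }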


> Which optimizations are enabled by immutable data?

    void test(immutable(int)* p, int* q) {
      foo(*p);
      *q = 4;
      foo(*p); // don't have to reload *p
    }


> For example, `const` in C++ doesn't mean the data is immutable - it can change with any assignment through a pointer.

In C++, an object declared as `const` cannot be changed by assignment through a pointer. Maybe you are thinking of const references?

For example, a `const int x = 10;` cannot be changed. Neither through `const_cast` nor through pointers nor anything else - it's UB in every case.


pointer to const int


Aside: this is the most amazing optimisation information I have seen: https://www.youtube.com/watch?v=r-TLSBdHe1A

Unfortunate that it is a video...

Summary: Modern CPUs have so many hidden causes of performance variation that you need special tools to actually measure small performance gains. He finds an amazing but unobvious speedup in SQLite using the causal analysis tool they developed, and a surprising difference between -O2 and -O3, implying overfitting.


Linear encoding is better than interpreting a tree because accessing adjacent elements in arrays is faster than chasing pointers.

Poor performance depends on what your language does. What's poor performance for SQL will be different than for JS. Features which don't have efficient implementations will be slow. If the language encourages use of inefficient features, then it will generally be slower.

The easiest optimization technique is to leverage an existing back end; for example, target LLVM or JVM or .NET.
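
To make the linear-encoding point above concrete, here's a toy sketch of the two shapes (addition-only, made-up opcodes); with the program {OP_PUSH, 1, OP_PUSH, 2, OP_ADD, OP_HALT} the bytecode version returns 3:

    /* Tree-walking interpreter: every node is a separate heap object
       reached through a pointer. */
    struct node { int op; struct node *lhs, *rhs; int value; };

    int eval_tree(const struct node *n) {
        if (n->op == 0)                                /* literal */
            return n->value;
        return eval_tree(n->lhs) + eval_tree(n->rhs);  /* '+' only, for brevity */
    }

    /* Linear bytecode: opcodes and operands are adjacent in one array,
       so the dispatch loop walks memory sequentially. */
    enum { OP_PUSH, OP_ADD, OP_HALT };

    int eval_bytecode(const int *code) {
        int stack[64], sp = 0;
        for (;;) {
            switch (*code++) {
            case OP_PUSH: stack[sp++] = *code++; break;
            case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case OP_HALT: return stack[sp - 1];
            }
        }
    }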


> Linear encoding is better than interpreting a tree because accessing adjacent elements in arrays is faster than chasing pointers.

Yes. From this we can derive some general optimization principles: reduce indirection, increase data locality.

Are there any others?

> Poor performance depends on what your language does. What's poor performance for SQL will be different than for JS.

Let's consider modern dynamic languages like Javascript and Python. What performance problems did these languages face? Which optimizations had the biggest impact on code execution performance? Which optimizations were the easiest to understand and implement?

For example, I know Javascript virtual machines will automatically create hidden classes for objects. This allows the data to be accessed by constant offsets rather than lookups.
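
Roughly, the hidden-class trick turns the first shape below into the second. This is a very loose sketch of the idea in C, nothing like a real VM's actual layout:

    #include <string.h>

    /* Naive dynamic object: every property access is a lookup by name. */
    struct prop { const char *name; double value; };
    struct dyn_obj { struct prop props[8]; int nprops; };

    double get_by_name(const struct dyn_obj *o, const char *name) {
        for (int i = 0; i < o->nprops; i++)
            if (strcmp(o->props[i].name, name) == 0)
                return o->props[i].value;
        return 0.0;                            /* "undefined" */
    }

    /* With a hidden class ("shape"), objects built the same way share one
       layout, so an access like o.x becomes a load at a fixed offset. */
    struct shape { int x_slot; };              /* stand-in for a real hidden class */

    double get_x(const double *slots, const struct shape *s) {
        return slots[s->x_slot];
    }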


You're sticking your toe in a very big lake, and this isn't the right forum to meaningfully talk about it.

(I don't mean this in any way negatively, but your two posts have a distinct whiff of Help Vampire - https://meta.stackexchange.com/questions/19665/the-help-vamp... - just a tiny chunk of information, and a naive question where you'd need years to properly survey the knowledge. It's a big topic and you need to be goal directed to get anywhere IMO.)


The author of the article said we could ask anything:

https://news.ycombinator.com/item?id=23054664

I thought it was a good opportunity to ask the questions that have been in my mind for a long time. They were open-ended questions but I don't think they were incoherent. I've read the source code of certain programming language virtual machines and have seen optimizations both simple and complex. For a long time I've been wondering if there was a set of basic optimizations every developer could apply to their language and get significant improvements with little difficulty.

I don't have a project I need help with, I was just curious. I certainly don't think I'm entitled to answers or attention. If I offended anybody, I apologize. My intent was to spawn a discussion.

> this isn't the right forum to meaningfully talk about it

What is the correct place?


"Wanna know the secret of making [X] fast? [...] Use a profiler."

Still really valuable advice. We know it, but do we do it?


> grammar should be redundant. You’ve all heard people say that statement terminating ; are not necessary because the compiler can figure it out.

I'll stay with my non-redundant Julia language, thank you very much. It has given me a lot of joy in programming, and I don't remember having big problems with error messages. Even when I do, syntactic errors are trivial to find.

The hard parts of efficient programming are memory management (garbage collection), controlling vectorized instructions, the balance between ease of GPU programming and efficiency, and parallelization - not semicolons at the end of a line.

All programming languages have to decide how much low-level control they give over computer resources, and how they abstract those resources, to make high-level programming possible without sacrificing too much efficiency.


Whilst you're not wrong about what you value about a programming "language", I'd say there is an important distinction to be drawn between two things people tend to conflate: the language and the runtime.

Perhaps a counterexample is in order to explain it: Java the language and Java the runtime (VM, stl) are two different things entirely. Kotlin, for example makes a lot of different choices in language design (e.g. semicolons) while reusing the same runtime (GC, threads, etc.)

Of course, you can often not tease these two concerns apart easily and neatly, but there is merit in giving thought to the ergonomics of the pure language side of things. At the end of the day, a programming language doesn't need an implementation to be useful (e.g. as a teaching tool) as it's an abstract, formal concept.


I am currently developing a language, though more as a research project than anything practical. Apologies for jumping on this post, but it is a good opportunity to organize and set out my thoughts (and possibly someone might find it interesting):

Essentially the language is a pure functional language that takes the untyped lambda calculus and adds decoration terms as first class citizens in the calculus. These terms can be used to wrap selected combinators (eg church numerals). The normal beta reduction rules are extended to handle these decorations in a useful way.

The decorations allow predicates to be formed that are total over the term space (eg isChurchNumeral?), which means that function call identifiers can be dynamically dispatched based on their arguments (using a certain amount of term assembly from the surface syntax). Adhoc polymorphism can thus be encoded in a transparent fashion.

This has been sufficient to build a numerical tower up to and including the Complex numbers, such that the usual operations add, mul, sub, div are defined within and between Natural, Integer, Rational and Complex numbers. It can also manipulate strings as if they were lists, whilst retaining their 'stringyness'. All without needing number or string specific features internal to the runtime (save parsing and formatting on the way in and out).

REPL examples:

  > (sum [1 -3 2.5 3/2 1+2i])
  > 3+2i

  > (reverse "hello")
  > "olleh"
Something resembling Haskell's typeclasses naturally arises, with the usual definitions of Functor, Monad, and Applicative.

It is, needless to say, astonishingly slow.


From your description it sounds like these decorators are an alternative to type annotations?

Mind showing how these decorators look and work? I'm also building a language, and always interested in seeing novel features like these :).


Possibly an alternative - but operating at the term level as predicates - if you consider 'type' to be a total function dividing your term space into 'of the type' and 'not of the type'.

Church numerals:

  0 = (^Nat (\f \x x))
  1 = (^Nat (\f \x (f x)))
The ^Nat is the decoration, and can be recovered with a special predicate function ?Nat. So:

  (?Nat (^Nat (\f \x x))) =>Beta (^Bool (\x \y x)) *i.e. true*
For this to work the inner part must be reduced first. It does seem to be consistent with Normal order reduction - the Y combinator etc work as expected.


It sounds like the trademarks in https://blog.acolyer.org/2016/10/19/protection-in-programmin... -- is that right?


Kind of - but more about dynamic dispatch / overloaded function name resolution than protection per se.

Furthermore, the decorations work at the lowest possible level, and are bound up with the extended beta reduction (there is also an unwrap operator with associated reduction rules).

Of course, any computation can be modelled in the standard lambda calculus, without needing to hack the reduction, but I believe that this extension is in some sense non-trivial as it would require the LC to first implement the LC terms and usual beta reduction within itself, before then implementing the additional terms and redux.


A reason for simple parsing is to make the toolchain for the language easier. And in particular, you don't want a preprocessor. Try to parse your program into an AST? You can't in C/C++ in general -- you can only parse it for a particular set of preprocessor defines, which will leave out large chunks that were deleted during preprocessing.


Zortech C was fabulous, wrote a disk editor in it many moons ago. Kudos!


Thanks for the kind words! Zortech was indeed a great compiler for its time.


Years ago IIRC you put out a request for original Zortech C++ disks and packaging, if anyone had some. I was sorry that I had just disposed of mine six months earlier. .. That was my one-and-only C++ in the early 90's, and I am still programming in D. Sorry I couldn't help!


No problem. By the kind work of many people, I managed to accrete a fairly complete collection.


I still have Zortech C 1.06 on my old P166. I should boot it up to see if it still lives !


As you are probably aware, the author of the article is the designer of the D programming language.

For background reading on the design of D, this is a very interesting and insightful paper on the history and origins of the D programming language up to 2018 [1].

[1] http://erdani.com/hopl2020-draft.pdf


I just started down this road. I’ve mucked with scanning text into object trees but want to learn how to make my own language. The difference is this would be a very small domain-specific language that controls an in-memory graph data store targeting interactive stories.

There are great existing tools for this sort of thing, but I have theories/ideas I want to explore.

I’m glad there are healthy discussions on compilers.


>"Context free grammars. What this really means is the code should be parseable without having to look things up in a symbol table. C++ is famously not a context free grammar."

Could someone elaborate on why C++ is "not a context free grammar"? Also how does C++ deal with this fact?


Here’s a good Medium post on why C++ is undecidable: https://medium.com/@mujjingun_23509/full-proof-that-c-gramma...


This is a great read! It summed things up quite nicely with some good graphics as well. Cheers.


One should consider creating a VM first - it simplifies life a bit, and makes the language more portable in the future. The implementation task is split in two - compiler and VM. First, the compiler targets the VM. Second, the VM targets the CPU. This way, porting to different architectures is just a matter of extending the VM to interface with the new arch. A compiler and a VM are definitely two separate projects with their own sets of issues.

Also, we should consider the possibility of going the eDSL (embedded domain-specific language) route: working at a very high level, re-using existing constructs of the host language. If one is interested, SICP [1] shows an example of such a project, using Scheme to build an interpreter; this can easily be done in several days. It's best to avoid redoing work that has been done many times before.

----------

[1] https://mitpress.mit.edu/sites/default/files/sicp/index.html


I used to be big into programming language design, but eventually decided that there just wasn't enough headroom to do something innovative enough to warrant the switching costs. The world doesn't need yet another syntax on top of basic C-style or Lisp-style semantic constructs.

Lately I've been wondering if I should reconsider, though, and have had a bunch of ideas that blur the lines of what should be considered in-scope for a programming language. Things like:

1. Universal constructs for serial & parallel execution, but with user-defined semantics for where the code should execute. So for example, at low-scale you'd execute in a loop on the CPU. At mid-scale it'd run on the GPU. At high-scale you could transparently distribute across thousands of GPU boxes.

2. Profiling and execution statistics built into the language. Write your code, run it on representative data, and instantly get statistics on how many times each function is called, what the average size of data structures is, how much memory is used, etc. Profilers do this, but often not at a granularity that's particularly useful (ever tried to find out how many Strings are created within a particular dynamic call-tree?), and most new languages overlook implementing a profiler, leaving it to third-party vendors. This is a shame, because a standardized programmatic API to a profiler could be a very effective compiler tool, leading to...

3. A way of defining transformations between data representations, benchmarking the cost of those transformations, and then having the compiler automatically decide whether it's worth bulk-converting data based on typical profiler results. The motivating example for this is the AoS-SoA transformation (see the sketch after this list), but they also feature in common system design questions like "Do you unpack all of the fields of this wire protocol, or lazily construct an object only when necessary?" or "Should you zip your data before sending it in an RPC?" or "Do you do this work at indexing or serving time?"

4. The ability to dump intermediate data from any function call site to disk, and then restart program execution from there without re-running the whole program. Handy when iterating on a complex algorithm.

5. Drill-down data visualizations for large collections. If you've ever tried to find the logic bug in a 500,000 element array using a standard IDE debugger, you'll know about this pain point.

6. Built-in support for machine-learning. In particular, many ML applications need extensive preprocessing, and you need to preprocess the data in exactly the same way for training and serving. Wouldn't it be neat if you could write your traditional algorithms, dump out your signals at the particular call-site of the classifier, and then have the development environment automatically send it to Mechanical Turk or Craigslist for labeling? Then have the system return the labeled data in a format easily suited for your ML framework of choice, do your data analysis & training, and import the finished model back into the call site of the language.

7. GPU-native, with language constructs that lead to efficient GPU code. This is becoming increasingly important not just for execution speed, but for developer velocity, because devs in data-intensive areas can waste a significant amount of time waiting for their program to run.
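
For point 3, a small sketch of the AoS-SoA transformation itself (my own example), since it's the motivating case:

    /* Array of structures (AoS): all fields of one element sit together. */
    struct particle_aos { float x, y, z, mass; };

    void shift_x_aos(struct particle_aos *p, int n, float dx) {
        for (int i = 0; i < n; i++)
            p[i].x += dx;       /* strided loads: x values are 16 bytes apart */
    }

    /* Structure of arrays (SoA): each field is its own contiguous array,
       which is what vector units and GPUs prefer for bulk operations. */
    struct particles_soa { float *x, *y, *z, *mass; int n; };

    void shift_x_soa(struct particles_soa *p, float dx) {
        for (int i = 0; i < p->n; i++)
            p->x[i] += dx;      /* contiguous loads: easy to vectorize */
    }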


1. OpenACC kind of goes in this direction, but the specification is a bad joke, and it's less ambitious than your goal. Still, might be interesting.

4. This sounds cool. I wonder how difficult it would be to implement a version of this using fork. The user would write:

    if (some_condition_I_want_do_debug) {
        suspend_a_copy();
    }
where suspend_a_copy would fork and continue. The forked version would log its PID somewhere and go to sleep until woken up by an external signal, then continue from there. Possibly you would attach a debugger first.
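
A rough POSIX sketch of that helper (suspend_a_copy is the name from the snippet above; everything else is my assumption, and it ignores threads, open sockets, and other shared state):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical suspend_a_copy(): fork, have the child log its PID and
       stop itself; later you can attach a debugger and `kill -CONT <pid>`
       to resume execution from this exact point. */
    void suspend_a_copy(void) {
        pid_t pid = fork();
        if (pid == 0) {                          /* child: the frozen snapshot */
            fprintf(stderr, "snapshot pid %d suspended\n", (int)getpid());
            raise(SIGSTOP);                      /* sleeps until SIGCONT arrives */
            /* execution resumes here after `kill -CONT <pid>` */
        }
        /* parent: ignores the child and keeps running */
    }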


Great list!

I specifically agree that the triple (transformation between data representations + ability to define where the code should execute + built-in profiling) can be a game-changer. It would help to solve the tension between clarity and performance by decoupling the what from the how.


This advice is quite good for certain types of programming language, if you look at the world the same way as he does. We're implementing Dark from a completely different worldview, and from that vantage point, a lot of these aren't exactly wrong, but different.

> Syntax matters

The approach we took with Dark is that there isn't a syntax, per se, in that there isn't a parser. There's certainly a textual view of Dark and so it's important that the code looks good (it currently looks only OK, in my opinion). But as a result, we have other options for minimizing keystrokes (autocomplete built-in), parsing (again, no parser), and minimizing keywords (you're allowed to have a variant with the same name as a keyword, which isn't allowed in languages with parsers (well, lexers, but same idea)).

> context free grammars

No parser, no need to have a grammar. His point about IDEs is great - we only support our own IDE (a controversial decision to be sure!)

> Redundancy

This is an esoteric parsing problem that only applies if you have a parser. No parser means no syntax errors. We are left with editor errors (how does the editor have good UX) and run-time errors.

> Implementation

He's right about how hard error messages are, so I feel good about our "not parsing" approach.

> Compiler speed

Our compilation is instant. The way it's instant is:

- very small compilation units: you're editing a single function at a time, and so nothing else needs to be parsed.

- no parser: The editor directly updates the AST, so you don't have to read the whole file (there isn't a "file" concept). Even in JS, that means an update takes a few milliseconds at the most.

- extremely incremental compilation: making a change in the editor only changes the exact AST construct that's changing.

> Lowering

This is really about compilation. One thing you can do, which is what we do, is have an interpreter. Now, interpreters are slow, but we simply have a different goal with the language, which is to run HTTP requests quickly. We do run into problems with the limits of the interpreter, but we plan to add a compiler later to deal with this.

Really what I'm suggesting here is that compiled languages look at having interpreters in addition to compilers.

> i/o performance

> memory allocation

I think he's approaching this with an implicit goal of "it must be as fast as possible", which isn't necessarily a goal for all languages.

> You’ve done it, you’ve got a great prototype of the new language. Now what? Next comes the hardest part. This is where most new languages fail. You’ll be doing what every nascent rock band does — play shopping malls, high school dances, dive bars, etc., slowly building up an audience. For languages, this means preparing presentations, articles, tutorials, and books on the language. Then, going to programmer meetings, conferences, companies, anywhere they’ll have you, and show it off. You’ll get used to public speaking, and even find you enjoy it (I enjoy it a lot).

Well this is certainly correct!


> The approach we took with Dark is that there isn't a syntax, per se, in that there isn't a parser.

Can you explain how that works? Unless Dark is a purely visual programming language (in which case I'd say that there is "syntax", it's just a bit more abstract, and in any case, it seems that visual languages haven't really caught on as an idea), I find that hard to believe.


Sure. You can see it in action at http://darklang.com/launch/demo-video (also, we've gone through our waiting list, so we're pretty much adding new folks who sign up immediately, if you want to try it out).

The main idea is that when you make a change (let's say, you type a key in the editor), that change happens directly on the AST (the internal set of objects that represent the program). So for example, if you're seeing:

    let x = 5
    x + 2

we represent that internally as `Let ("x", Int 5, BinOp ("+", Var "x", Int 2))`

In a normal programming environment, you type up text, and the parser converts it to the internal representation. With Dark, we always keep everything in the internal representation. That means, if it isn't a "syntactically" valid program, you can't create it in the first place.

In practice, this means that if you type, for example, `z` in the middle of `let`, in a traditional language you'd get an error like `lzet is not a valid token` and in Dark, the `z` just wouldn't appear (nothing would happen). So it's not possible to get a syntactically invalid program.

The overall idea with Dark is that programs are always live (and also safe). So to make that be true, we need to ensure that you don't spend a period of time in a state where we can't understand your program (such as when the syntax is invalid).


> That means, if it isn't a "syntactically" valid program, you can't create it in the first place.

It's an interesting idea. I don't know if it would work for me, though. I don't write code as a stepwise refinement of a valid program, I massage the text gradually into a valid program. This is especially true when refactoring.


Yeah, I think a bunch of people work like that. We've had the feedback that some people have had to change their workflow to work well with Dark.

People generally use what we call "trace-driven development", where they make a HTTP request to the code they're editing (typically, people are making APIs or backends to JS apps), and then they always have "live values" of what they're working on (similar to a REPL, or to running unit tests on a loop, but instantly).


So, it looks like it's a textual language; it clearly has a syntax. How you manage the transition from text to AST and back, that's a different story - it might be that your approach is much better! (Though on the other hand it looks similar to how the CLR/Roslyn C# compiler does it.)


The code always lives as AST, and the code that you see in the editor is a pretty-print of the AST. We don't go back and forth between AST and text - when you make a change in the editor, we look up the "token" where your cursor was, and use it to find the place in the AST that needs a change, and changes only that.


You have a grammar... you're just enforcing the rules when data is entered. The same as if my editor stopped accepting input if I mistyped a line until it was fixed, or if I was required to select one of its autocomplete suggestions.

let x = 5

"hi " ++ name

DB::set{}

Are you saying these lines from the video are not consistent throughout? Like in one part I'll type "x = 5" to assign 5 to x, but somewhere else I'll have to type "5 # x" to do the same? Like just random syntax everywhere?

If not, then you have a grammar. You should document it and treat it as such.


I had to look up whether you were technically accurate (I've a PhD in compilers but parsing's not really my thing), and I think the answer is "it depends, maybe?"

To get to the root of the issue though, there is not a parser.

Well, one could consider the function that takes each keystroke to be a parser - though I don't know if that's technically correct either. So perhaps the more accurate thing to say is there is not a step that takes a textual source file and parses it.


What would happen if you paste a complete program in all at once?


That's not really how Dark works. You're not manipulating text in your editor - you use our editor, and our editor only uses ASTs. So when cutting and pasting, you're cutting and pasting AST, not text. So to answer your question, you can paste an entire handler or function at once, but it will be one you copied from another place in Dark and it will be structured.

We've looked at allowing people to paste text (eg from Stackoverflow) into Dark. We think we can do it by just inserting each character one at a time (though there's a technical flaw with that at the moment, which could be addressed). In that sense, Dark is sort-of a parser (I don't know if it technically counts).


That sounds…potentially confusing? Correct me if I'm wrong, but it just continues to run the last version of the program without errors, but it won't tell you if the current version is syntactically invalid?


Think I made it more confusing, not less, sorry! Did you watch the video - it's clearer after watching it.


Off topic:

Annoying that this is totally unreadable even at 300% zoom on an iPhone 11 Pro.

I feel like a set of people refuse to learn proper HTML/CSS as some sort of statement, not realizing their laziness renders their work unavailable to the visually disabled.

HN behaves similarly poorly.


The irony is that if this site were actually simple (i.e. just text without a sidebar) it would be as responsive as you needed it to be. In the worst case scenario, you could fix it with client-side CSS.

The problem here is that the site is complicated beyond what HTML is meant to do: it has a sidebar. It's no longer just a document; it's a document and a navigation menu. Effectively it's a small software application, and once you're in the business of writing software applications you ought to be in the business of accessibility and UX design.


Probably a problem with the iOS browser. It works well with Firefox on Android.

Also does your apple phone browser not have anything like a "reader mode"?


It doesn’t work like that. I believe one should strive to make their content accessible to all possible viewers (it’s not like mobile Safari is niche), not simply throw their hands up in the air and blame it on the OS when some visually impaired are unable to access the content. iOS has the mechanisms to zoom, and websites by default will work with zoom. But this person has written just enough CSS to break native accessibility, but not enough to bring it back.

Reader mode is enabled for many websites, but not this one. Another case of someone writing just enough CSS to screw over their visually impaired viewers.


What's wrong with Hacker News?


it doesn't respect iOS font size, and you can't alter the zoom to make the text actually bigger. It zooms the entire page rather than just the text, so the text doesn't re-flow. It's pretty much equivalent to just pinching to zoom, when it should be adjusting the font size.


Why can firefox on the desktop resize the font on HN and reflow the content, but iOS cannot?


Beats me. But usually the onus is on the application developer to make their content accessible on all platforms. Especially one as common as iOS Safari


I cannot comment on mobile (Brave) - the text box expands, and the reply button disappears.



