Hacker News
Build me a Lisp (kirit.com)
127 points by KayEss 37 days ago | 68 comments

This article is chock-full of misinformation.

> an s-expression [is] a fancy term for a list of [cons] cells

No, it's not. An S-expression is a serialization of a (single) cons cell, whose elements might be other cells.

> We're using a vector instead [of cons cells], but the two are equivalent

No, they are not. When represented as cons cells, CDR is a non-consing O(1) operation. When represented as vectors, CDR is a copying O(n) operation (and because it necessarily makes a copy, the semantics are different if you introduce mutation).
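To make the asymmetry concrete, here is a quick sketch in Python (the `Pair` class and helper names are purely illustrative, not from the article):

```python
class Pair:
    """A cons cell: two slots, car and cdr."""
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr

def cons_cdr(cell):
    # O(1): hand back the existing tail; no elements are copied.
    return cell.cdr

def vector_cdr(vec):
    # O(n): a fresh vector must be allocated and n-1 elements copied.
    return vec[1:]

lst = Pair(1, Pair(2, Pair(3, None)))
assert cons_cdr(lst) is lst.cdr             # the very same object: structure is shared

vec = [1, 2, 3]
tail = vector_cdr(vec)
assert tail == [2, 3] and tail is not vec   # a copy, with its own identity
```

The `is` versus `is not` assertions are the whole story: one representation shares, the other copies.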

The fact that S-expressions represent cons cells and NOT vectors is crucially important. It is the feature from which Lisp derives much of its power.

It is possible to make a Lisp-like language where the surface syntax represents vectors instead of cons cells. In fact, this makes a useful exercise. If you undertake this exercise you will come to know the answer to the question: why has this idea (a vector-based Lisp) not gained more wide-spread adoption?

> There are three different sorts of data structure that we have so far: Strings, S-expressions, and Cells

As above, S-expressions are not a data type; they are a serialization of cons cells, i.e. a mapping between cons cells and strings. This mapping has a very particular feature from which much of Lisp's power is derived: it compresses a linked list of cons cells into this:

(1 2 3 4)

instead of this

(1 . (2 . (3 . (4 . nil))))

This language is missing that core feature. (It's also missing symbols, without which a language cannot rightfully call itself a Lisp. And first-class functions. And macros.)

While I don't want to discourage any attempt to make Lisp more accessible, it's important to actually get it right. What this article is describing is similar in spirit to Lisp, but it's not Lisp. It's a different (toy) language at a completely different point in the design space.

This is the big problem with a lot of people's attempts to implement lisps to show how simple lisp is: they don't implement even a reasonable subset of lisp. A real lisp implementation will eventually require a garbage collector! Which is hardly simple.

The misunderstanding, I think, comes from the fact that lisp code is typically quite terse, which is often a measure of simplicity. But part of the reason lisp code is so simple is that a lot of the heavy lifting is done behind the scenes. There's no getting around that software is complicated: the benefit of lisp is that it does a good job of managing the complicated parts while giving the programmer responsibility for a small subset of the functionality which is still powerful enough to enable the programmer to do what they need/want to do.

> The misunderstanding, I think, comes from the fact that lisp code is typically quite terse, which is often a measure of simplicity.

I have been programming in Lisp for a few years, and I think that Lisp programs are verbose, and not terse at all. This is because Lisp special forms and functions tend to be explicit and self-contained, which makes Lisp code more robust and easier to change and compose than more terse notations that leave a lot of details implicit.

Non-trivial programs are just shorter to write in Lisp than in many other languages, partly because the Lisp primitives are better thought out, and partly because that explicitness turns out to be more concise than working around the implicit special-case rules and poor error handling of other notations, which you end up having to do in larger programs. Things that work for one-liners may not work as well once you get past a thousand lines.

You also see this in Common Lisp with DSLs like LOOP and FORMAT, which seem convenient until you run up against their limits (part of the reason why many people dislike them). The thing is that it is much easier to go from fully delimited/explicit to terse and implicit than vice-versa, if you have good primitives. Because of dynamic scoping and read macros, you can do really terse DSLs in Common Lisp in a very straightforward way (for example, Andrew Sengul did an APL subset that can be wrapped in a read macro, and I think sed would be pretty easy to do).

isn't this verbosity CL-only and not Lisp per se?

clojure code is often super terse (mostly because of lots of funcallable and fp idioms)

> But part of the reason lisp code is so simple is that a lot of the heavy lifting is done behind the scenes.

Heck, HN is powered by Arc which is (last I checked) running on Racket (or any other Lisp engine)!

> If you undertake this exercise you will come to know the answer to the question: [...]

We could also know the answer if you just told us. It's bad form to make readers jump through hoops for insights you could easily share. This is not school, you are not supposed to assign homework to make us learn.

Sharing this insight is not so easy. It's not a pithy aphorism that fits in an HN comment. It would take at the very least a blog post that would probably take me a couple of hours to write. Doing the exercise would take less time (do it in Python or Ruby).

What is the observable effect then? I can really only think of two: a particular performance characteristic, or that modifications of shared tails in lists are observable across lists. Is it the latter that is the problem?
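The shared-tail effect can be demonstrated in a few lines of Python (with a hypothetical `Pair` class standing in for a cons cell):

```python
class Pair:
    def __init__(self, car, cdr):
        self.car, self.cdr = car, cdr

tail = Pair(2, Pair(3, None))
a = Pair(1, tail)   # a is (1 2 3)
b = Pair(0, tail)   # b is (0 2 3); both lists share the same tail cells

tail.car = 99       # mutate the shared structure once...
assert a.cdr.car == 99
assert b.cdr.car == 99   # ...and the change shows through both lists
```

With a vector representation, building `b` would copy the tail of `a`, so this cross-list visibility simply cannot happen.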

Vectors are a lot of work for no gain. They make everything more complicated, especially if you want an efficient CDR operation, and there's just no reason for it. Yes, your code might compile a little faster, but Lisp code is generally compiled incrementally anyway, so compile cycles tend to be measured in milliseconds already. There's just no point in speeding it up any further.


Don't write the blog post, only write the "conclusions" section. "Not using conses leads to more copying and more garbage being generated" or whatever.

Conclusions: Lisp is invariably implemented in terms of conses rather than vectors for a wide variety of reasons, which were discussed in detail in the preceding post. No one of these multiple reasons stands out as the "killer advantage". Vectors just turn out to be the death of a thousand complexities, whereas conses are simple, extremely flexible, and sufficiently efficient for the purpose. A great deal of additional insight can be gained by actually implementing Lisp both ways and experiencing the differences first hand.

All this and no pointers to Ciel?

I haven't worked on Ciel in such a long time I can't even remember if it's cons-based or vector-based. But sure...


> If you undertake this exercise you will come to know the answer to the question: why has this idea (a vector-based Lisp) not gained more wide-spread adoption?

It's gained incredibly widespread adoption. Clojure doesn't use cons cells at all; lists are trees of vectors internally, and vectors and maps enjoy first-class status, even as syntactic elements, right alongside lists. Clojure enjoys far more business interest than Common Lisp or any other Lisp dialect, meaning that non-cons-cell based Lisp has won.

Furthermore, the C++ programmers, being performance minded, have learned something the Lisp guys still don't seem to grok: on modern architectures where cache is fast and RAM is slow, the big-O inefficiency of using vectors vs. linked lists is usually more than made up for by the speed gains of cache locality, to the extent that for real-world data structures it almost never makes sense to use a linked list rather than a vector.

Anyway, if you represent a list as a slice of an underlying vector, cdr is still O(1).
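A minimal sketch of that idea (the `Slice` type is hypothetical): a list becomes a view onto a shared backing vector, and cdr just advances the start offset:

```python
class Slice:
    """A list represented as (backing vector, start offset)."""
    def __init__(self, storage, start=0):
        self.storage = storage
        self.start = start

    def car(self):
        return self.storage[self.start]

    def cdr(self):
        # O(1): allocate one small view object; copy no elements.
        return Slice(self.storage, self.start + 1)

s = Slice([1, 2, 3, 4])
assert s.cdr().car() == 2
assert s.cdr().storage is s.storage   # the backing vector is shared, not copied
```

Note that each cdr still allocates a small view object, and the whole backing vector stays alive for as long as any slice references it.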

> non-cons-cell based Lisp has won

In terms of usage, yes. Not in terms of adoption as an implementation technique. If you want a non-cons-cell-based Lisp (and no one actually wants this -- that code is vectors is not the reason people use Clojure) you have only one option.

> vectors and maps enjoy first-class status,

They do in Common Lisp as well.

> even as syntactic elements

Vectors have a standard surface syntax in Common Lisp, and it's trivial to define a reader macro for maps if you want one. But anything you would do with a static map is probably better done with a CASE form.

> the big-O inefficiency of using vectors vs. linked lists is usually more than made up for by the speed gains of cache locality

All of this only matters when processing code, which is usually a tiny fraction of the CPU time. When processing data, Lisp has vectors, and Lisp programmers can and do use them.

> if you represent a list as a slice of an underlying vector, cdr is still O(1).

Yes, but then you still have to allocate memory to store the result.

> Furthermore, the C++ programmers, being performance minded, have learned something the Lisp guys still don't seem to grok: on modern architectures where cache is fast and RAM is slow, the big-O inefficiency of using vectors vs. linked lists is usually more than made up for by the speed gains of cache locality, to the extent that for real-world data structures it almost never makes sense to use a linked list rather than a vector.

Contrary to your claim, that's nothing new. Memory hierarchies existed decades ago, too.

Lisp users have found that linked lists as a primitive data structure were fast enough for many applications. Other native data structures like vectors, multi-dimensional arrays, hash-tables, etc. have been supported in Lisp for decades.

There is no such thing as a non-cons-cell-based Lisp.

Just like there is no non-H2O water.

Well, there are isotopes: D2O is "heavy water". Similarly, not all Lisps have exactly the same cons representation down to the sub-atomic level; there are various isotopes.

So are you asserting that Clojure is not a Lisp, or that Clojure indeed uses cons cells, contrary to the comment mentioned above?

How would Clojure be a Lisp, given that it runs zero Lisp code?

To my mind the defining characteristic of a Lisp is that code is represented as a first-class data structure, one that can be produced by parsing text, but that has more structure than text. In the case of most Lisps that first-class structure is a cons cell, and I think that is by far the best choice, but I don't think that using a different structure should be disqualifying by definition. If you're really going to be that strict about it, then Scheme is not a Lisp, because its standard does not actually specify the data structure that is used. It only specifies that code can be represented "as data" but it doesn't say what kind of data that is. It could be a string and still conform to the standard.

The argument you made is cyclical -- it assumes what it's asked to prove.

To say "Clojure doesn't run Lisp code" is to already assume Clojure is not a Lisp (else you would have to admit that it does run Lisp code: its own).

> The argument you made is cyclical

It's not. Lisp code existed before Clojure.

If it can only run its own code, then it is no Lisp. Lisps share code.

I have a zillion lines of Lisp code. Then Clojure appears. It runs none of my zillion lines of Lisp and none will be directly translated to it (because it uses different data structures, has different operators, has different semantics, has different pragmatics). -> it's no Lisp

It's derived from or influenced by Lisp, that's it.

Like ML was derived from Lisp (it was originally even implemented in Lisp) and does not run any Lisp code. It's also no Lisp.

Languages that don't share code cannot be called dialects, QED. According to the most common linguistic definition of "dialect", they lack "mutual intelligibility".

> Like ML was derived from Lisp

Like English was derived from Latin, French and Scots (among other things), yet is not Latin or French or Scots.

Like Romantic music was derived from Baroque, yet is not Baroque; or rock was derived from blues but isn't blues.


To say "Swahili speakers don't understand English" is already to assume that Swahili is not a dialect of English, and therefore a circular argument. Or else you would have to admit that Swahili speakers do understand an English dialect: their own!

Take someone who is trained in Clojure programming and nothing else; which Lisp's code can they understand that is not derived from Clojure, like ClojureScript?

Basically, the criterion for a "dialect" is mutual intelligibility. A language which is an island isn't a dialect of anything. Two languages that are not mutually intelligible are not dialects.

Lisps are dialects which share a mutual intelligibility formed by the core that comes from the ancient Lisp. They are based on the same or very similar core concepts and use the same names for them.

> To say "Swahili speakers don't understand English" is already to assume that Swahili is not a dialect of English, and therefore a circular argument. Or else you would have to admit that Swahili speakers do understand an English dialect: their own!

That's not really a counter-argument by absurdity to my argument, as it's also valid.

To say "Swahili speakers don't understand English" indeed already assumes that Swahili is not a dialect of English. When one says "Swahili speakers don't understand English" they don't make an argument -- they make a statement of fact.

Now, one could argue that the fact that Swahili is different than English is well established, and needs no further arguments (as people will already be familiar with it and accept it).

But regarding whether Clojure is or is not a dialect of Lisp, I don't think that's the case.

> as it's also valid.

Possibly so, but it isn't sound.

What about it is not sound? I wrote an extended explanation of why it's not an argument, but a statement (that needs to be evaluated as such).

'(zero) ?

'(zero . nil)

Which Lisp code are you asserting it cannot run? Scheme? Common Lisp? Racket? Emacs Lisp?

Furthermore which of those Lisp dialects can or cannot run each other’s code? Is your definition of “Lisp code” a primordial core of which a set of languages descend? If so, what is that primordial core?

Typically, lisps will implement their lists internally in terms of a VList[1] or similar data structure, which has some of the benefits of both linked lists and vectors, and the downsides of each are minimized. VLists are [O(1) cons, car, cdr, index average] and [O(log N) length, index worst case], making them a pretty good all-rounder. There is also improved cache locality as compared to a linked-list.


Lisps do not "typically" use this relatively obscure data structure. Can you name one Lisp that uses it? If so, how about two more?

It's also fascinating that none of the 'benchmarks' in the paper practically matter (today) in a native Lisp implementation: creating, copying, appending, mapping, ...

The paper is annoyingly undated; the highest year that occurs among the citations is 1999, so I'm taking that to be its date.

The discussion on garbage collection of these vlists is noteworthy. The paper admits that their aggregation causes retention of garbage ("Large parts of the unused list structure may not be garbage collected."), and since they have heterogeneous sizes, fragmentation also. But, oh, that's totally okay, because all you have to do is make your collector copying to compact these things and you're good!

cons cells cause no fragmentation whatsoever (they can be placed in a dedicated cons heap in which everything is a cons of the same size), and an unreachable cons cell is always reclaimable.

That would be surprising. Typically Lisp implements lists in terms of cons cells and the empty list.

Really nice code, as someone who isn't caught up entirely with the recent work that's gone into C++, it taught me some useful features!

The taxonomy isn't quite right for Lisp. It's actually simpler than the post. It should be something like this (very minimalistically):

    <s-expression> ::= <atom> | ( <s-expression> . <s-expression> )
    <atom> ::= <number> | <symbol> | <string>
The difference between this and what the author presents is that everything is an s-expression. s-expressions are either an atomic value or a pair of s-expressions. Convenient notation `(a b c ... z)` called "list builder notation" is provided so that we don't have to write `(a . (b . (c . ... (z . NIL))))`.
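A toy reader in Python (purely illustrative, using nested tuples as pairs and the string `'NIL'` as the empty list) shows the sugar expanding into exactly those nested pairs:

```python
def tokenize(s):
    # Split "(a b c)" into ['(', 'a', 'b', 'c', ')'].
    return s.replace('(', ' ( ').replace(')', ' ) ').split()

def read(tokens):
    """Parse one s-expression from a token list into nested (car, cdr) tuples."""
    tok = tokens.pop(0)
    if tok != '(':
        return tok                    # an atom
    items = []
    while tokens[0] != ')':
        items.append(read(tokens))
    tokens.pop(0)                     # drop the ')'
    sexpr = 'NIL'                     # (a b c) is sugar for (a . (b . (c . NIL)))
    for item in reversed(items):
        sexpr = (item, sexpr)
    return sexpr

assert read(tokenize('(a b c)')) == ('a', ('b', ('c', 'NIL')))
```

The final assertion is the whole point: the flat surface notation and the chain of pairs are the same thing.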

Yeah, it's deliberately simplified in order to keep things short and sweet. Maybe I'll write some more posts that extend this into a real language and then these things will come out.

I really wanted to show that the core part of a programming language can be quite simple and nothing to be afraid of, so everything else was subservient to getting that across :-)

Instead of shoehorning everything into a generic "pair" structure, wouldn't it be much more powerful to build a separate type for everything?

The usefulness of having a common structure on which to build types is that you can break them down into smaller types, or combine them into larger ones, using just the functions `cons`, `car` and `cdr`. This is particularly useful for tooling which deals with the programming language code itself.

A pair is just the most basic kind of compound data type. A triplet is just an atom plus a pair, a quadruplet is an atom plus a triplet, and so forth: any data structure containing N items can be represented as a chain of N-1 cells.

It's possible to map this to custom data types. Say you wanted a type of `vec {x, y, z}`; you can represent it as the list `(x y z)`, i.e. `(x . (y . (z . ())))`, and define `get-x = car`, `get-y = cadr` and `get-z = caddr` to access the elements by name. Some lisps let you encapsulate the result such that only those functions can be used on the encapsulated list.
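That mapping can be sketched in a few lines of Python, using tuples as pairs (all names here are illustrative):

```python
# cons/car/cdr over Python tuples.
cons = lambda a, d: (a, d)
car  = lambda c: c[0]
cdr  = lambda c: c[1]

# A "vec {x, y, z}" built purely from pairs, with named accessors on top.
def make_vec(x, y, z):
    return cons(x, cons(y, cons(z, None)))

get_x = car                          # car
get_y = lambda v: car(cdr(v))        # cadr
get_z = lambda v: car(cdr(cdr(v)))   # caddr

v = make_vec(1.0, 2.0, 3.0)
assert (get_x(v), get_y(v), get_z(v)) == (1.0, 2.0, 3.0)
```

Nothing about the underlying representation leaks through as long as callers stick to the named accessors.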

Rather than making a language more powerful, building everything in actually makes it less powerful; it is better to provide the foundations on which other types can be built. See also: "Growing a Language" by Guy Steele.

In practice, that's done.

For instance, in Common Lisp there are procedures like `MAKE-ARRAY` and `MAKE-HASH-TABLE`. The interesting thing is that these hash tables and arrays are considered to be atomic.

Clojure generalizes the notion of pairs, which in some ways is more elegant (a collection of items being atomic isn't necessarily intuitive) and in some ways is less (now your language has to describe how this abstraction operates).

Well the point is that you do not need to; it would just be an optimization opportunity.

But types have more uses than just optimization.

That is completely correct. In Lisp, the fact that syntax is so flexible enables the addition of a lot of functionality on top of the base language. Types and related functionality (checks, enforcement, conversion, etc.) are one of those things; but if everything is executed on top of pairs, there is a performance penalty that you may avoid if you provide more efficient data structures from the outset with a not-so-minimal implementation (such as Clojure providing vectors, where you don't search a list in linear time). My "just" above meant that if you can do it slowly at the upper abstraction level, providing it in the lower one is basically an optimization.

In any case, on second thought, what I said is not completely true, as there are things you may need to do at compile time and the minimal implementation does not provide macros.

This is a really nice and illuminating post!

Especially cool how all evaluation is done by line 33. (Minor comment: Line 14 may benefit from using a different variable for the function instead of “c”.)

An “aside” comment, as the title mentions “literate C++” [Edit: This was submitted as something like “Build me a Lisp: in less than 50 lines of literate C++”]. This post exemplifies one aspect of literate programming (as Knuth envisioned it), which is “let us concentrate rather on explaining to human beings what we want a computer to do.”

But there are two further features that Knuth-style LP systems have, which are missing here:

• The ability to write your program / explanation in any order (ideally, an order that is best for presentation, rather than an order demanded by the compiler). In this case, the program is explained strictly in the order of lines in the program, which imposes some awkwardness. (Not only in the explanation, but also in the code: e.g. putting a #include on line 34 of 49.)

• Named chunks. So instead of writing

    int main() {
        try {
in one code block, and having to continue it later, you could write something like, say,

    int main() {
        try {
            [[say hello world]]
        } [[handle errors]]
and later fill in the corresponding sections by name.

The original examples by Knuth are somewhat hard to read because they were written in Pascal and the programs were quite low-level but there's a great example available today (a book which won an Academy Award!), now free online — http://www.pbr-book.org/ (random page: http://www.pbr-book.org/3ed-2018/Light_Transport_I_Surface_R... )

But then again, the “no-tools” approach here does have the advantage of not requiring any special build step and not requiring the reader to understand any such conventions, so I guess that's a tradeoff. (They say one of the reasons literate programming never took off is that there are as many literate-programming tools as the number of people trying literate programming; everyone ends up having to write the tool that works for them.)

So this (excellent) post would actually be semiliterate C++.

The Babel tool in emacs org-mode looks like it does the real thing: https://orgmode.org/worg/org-contrib/babel/intro.html

Honestly, I found the code a bit hard to follow, mostly because everything was so disjoint. It would be nice to have "folding" so I could move comments out of the way once I'm done with them, to allow me to look at the code as a whole. Once I know what a snippet does, I can "name" chunks myself mentally; any further labels, especially multi-line ones that show up in between code, are unnecessary.

Your comment illustrates another reason literate programming never took off :-) Until a reader has spent enough time reading and getting familiar with the conventions, how to navigate, etc (in my case it was several dozen hours), they experience a lot of resistance against simply reading the literate program, and keep wanting to read “the real code”.

Ultimately a program is a set of instructions to a computer, and when we as humans look at a program we're trying to conceptualize how these instructions are organized. (Essentially, "translate" them into a mental map.) We have so much more experience doing this with "real" code, knowing what we can ignore and when, and being able to do this at will — e.g. with things like function docstrings or public/private declarations, we're able to read them once and then have them basically fade away from our consciousness so we don't read them again. When faced with unfamiliar conventions, it takes a while to get that ability again.

Perhaps literate programs shine when they help us build that mental map. One way is positively describing parts of the map with metaphor or intent (where the OP shines). Another way is by discussing alternatives that weren't used. Yes, this can be a lot of content and sometimes crowds out the code, as in this example.

Follow the link to the Gist <https://gist.github.com/KayEss/45a2e88675832228f57e2d598afc0... and your code editor should give you the comment folding you want.

Fair point, although my editor (Sublime Text) is apparently too stupid to recognize those as block comments…

Here's a case where the previously submitted title, which included something like "in less than 50 lines of literate C++", was more indicative/compelling and I think more useful for news.yc.

I don't think of that as click-bait in the same way as "you won't believe #7", but rather as additional information which truthfully and transparently makes it more likely that I'll accurately gauge my interest in the linked article.

"Don't editorialize" in this case is a negative, but perhaps is globally optimal.

For whatever it's worth, here is the same in 24 lines of OCaml:

    type cell = Atom of string
              | Sexpr of cell list

    let library = Hashtbl.create 1

    let rec eval = function
      | Atom s -> s
      | Sexpr [] -> failwith "Empty sexpr"
      | Sexpr (head :: args) ->
        let fname = eval head in
        match Hashtbl.find_opt library fname with
        | Some builtin -> builtin args
        | None -> failwith (fname ^ " not found")

    let main () =
      let prog = Sexpr [Atom "cat"; Atom "Hello "; Atom "world!"] in
      let prog2 = Sexpr [Sexpr [Atom "cat"; Atom "c"; Atom "a"; Atom "t"];
                         Atom "Hello "; Atom "world!"] in
      Hashtbl.add library "cat"
        (fun args -> String.concat "" (List.map eval args));
      Printf.printf "%s\n" (eval prog);
      Printf.printf "%s\n" (eval prog2)

    let _ = main ()
Using linked lists instead of vectors makes separating an application's head and args (CAR and CDR, if you prefer) much easier than using clunky C++ iterators. This is because linked lists are composed of cons cells, which split naturally into head and tail.

If you enjoy reading code and want to know a bit more about lisp, maybe this is interesting:

"Scheme 9 from empty space" -- https://t3x.org/s9fes/

It's a scheme built in c using literate programming. You can even buy a printed version.

Also, there's "Lisp in small pieces" if you want to go deeper -- https://books.google.de/books/about/Lisp_in_Small_Pieces.htm...

Thanks, that is a very nice writeup. I have been experimenting a bit with creating new Lisp-like languages in Racket and the idea of growing a language to solve specific types of problems is powerful (and is the way I program in Lisp, from the bottom up). Sorry if this is a bit off topic, but you might enjoy the experiences of someone building their own Lisp Machine with a eZ80, complete with hardware hacking support: https://youtu.be/Ad9NtyBCx78

Part of the motivation for writing this up is to explain internally about a DSL used for testing web APIs, for example: <https://github.com/KayEss/fostgres/blob/master/Example/films...

I expect we'll extend use of this sort of mechanism much further the more people understand it.

I'll watch that video, sounds interesting. A company I worked for nearly ended up getting a Symbolics LISP machine for CGI in the early 90s. Would have been pretty sweet I think.

Nice post... even if before clicking through I expected a post about why asking "build me a lisp" could have been a good interview question :)

As a side note, I wrote a similar post in Swift: https://www.uraimo.com/2017/02/05/building-a-lisp-from-scrat...

nice post...you can implement a lisp parser in C++ relatively easily too due to lisp's regular syntax

any nice cpp peg library ?

Can someone familiar with modern C++ explain the type mechanics of std::visit on line 14? In particular, it seems the “auto” parameter of the lambda expression could stand in for either type in the variant, dependent on its runtime value, but I have always thought of “auto” as a compile-time mechanism.

Is the “auto” in this case being filled in by some compiler-generated proxy type that can dispatch any common operation available across all variants?

The auto there works like a templated function would (except it's anonymous as you'd expect from a lambda).

I’m still confused. Presumably you can’t pass an uninstantiated template to a function, so what’s the type of the lambda expression?

It’s effectively a callable object with multiple overloads but I didn’t know lambda syntax could create something like that.

Lambdas have an un-utterable type, so the addition of a template-like mechanism in their arguments is something the compiler can do behind the scenes, and it all just works. Look for "generic" in this page <https://en.cppreference.com/w/cpp/language/lambda>

For those who can explain the difference: is this a LISP 1 or a LISP 2?

In a Lisp 1, functions and variables share the same namespace.

    (defvar foo 5)   ; variable with the name foo
    (defun foo () 5) ; function with the name foo
In a Lisp 1, the defun would overwrite the previous variable binding. In a Lisp 2, though, both bindings coexist.
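The split can be sketched with an evaluator's environments modeled as Python dicts (purely illustrative, not how any real implementation stores bindings):

```python
# Lisp-1: one environment for everything.
lisp1_env = {}
lisp1_env['foo'] = 5                  # (defvar foo 5)
lisp1_env['foo'] = lambda: 5          # (defun foo () 5) clobbers the variable
assert callable(lisp1_env['foo'])     # the value binding is gone

# Lisp-2: separate value and function environments.
values, functions = {}, {}
values['foo'] = 5                     # (defvar foo 5)
functions['foo'] = lambda: 5          # (defun foo () 5)
assert values['foo'] == 5             # both bindings coexist
assert functions['foo']() == 5
```

The evaluator then decides which dict to consult from position: the head of a form looks up in `functions`, everything else in `values`.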

Last week I came across a Usenet post from Kent Pitman (who helped create Common Lisp, and coined the terms Lisp1 and Lisp2 in this paper: https://www.dreamsongs.com/Separation.html) which provides the best explanation I have seen on why Lisp2 is a good idea:


Also as pointed out by KMP later in that thread, block and go tags are separate namespaces, and so are type names. I like to say that it is really Lisp1 vs LispN. Any language with macros and first-class identifiers is going to be capable of having arbitrarily many namespaces.

Ah, that's really interesting. This one only has a single namespace, but the one I simplified it from actually has two. I guess that makes that one a Lisp 2 :-)

> Normally programming language blog posts get started with grammars and syntax and parsing and all sorts of stuff like that. I can't be bothered with all of that,

And I can’t be bothered with C++ for compiler construction.

