Hacker News new | comments | ask | show | jobs | submit login
Kaleidoscope: Implementing a Language with LLVM (llvm.org)
199 points by rspivak on Dec 25, 2015 | hide | past | web | favorite | 42 comments




Lots of goodies here. Shows how to write a sort of Pythonic little language with code generation for LLVM--and I love that it's from the LLVM krewe--but doesn't skimp on things like an AST. It does rely on globals, sadly, but for those of us who like compiler writing texts in plain language, it's a keeper.


Does it explain how to integrate a garbage collector? Because that is the most difficult part of implementing a compiler nowadays, especially if it is a concurrent one, which doesn't stop the world on each collection cycle.


I've implemented Common Lisp using llvm (github.com/drmeister/clasp). I've integrated the Boehm and Memory Pool system garbage collectors. How? It would take a long time to explain - but it's doable.


Nice!

Have you benchmarked against Julia? Would be interesting how Clasp compares in their micro perf tests: https://github.com/JuliaLang/julia/tree/master/test/perf/mic...

http://scilua.org/ was added in the tests recently and does pretty well.


LLVM has a decent amount of documentation on integrating garbage collection, not on how to write one though.

Some more advanced GC integration functionality was added more recently because of the work on Webkit's FTL JIT: http://llvm.org/docs/Statepoints.html



I don't know of any general purpose concurrent collectors which have no stop-the-world phase.


What do you mean by that? I have never known a compiler to include a custom garbage collector at all, much less to have such unusual allocation patterns that it would count as a difficult part of the job.


Looks good, but... what's the version I'm to read? The first one or the second one? They seem like different versions of the same document... Since the table of contents is close to being the same it looks confusing.

Looks like the first one is in C++ and the second one in OCaml . Second one it will be then :)


I'm pretty sure the second is C++... At least, I HOPE that is the case, as it is more reflective of the core llvm libraries, not my language preference


Anything specific to know in order to implement the language VBA ?


You have to start with the top-tier talent that delivers enterprise apps in Visual Basic 6. They need to know VBA language inside and out. They should implement templates for what every primitive does in terms of stacks, registers, and/or control flow (esp jumps). This includes common representations, type system rules, binary interfaces... you name it. Everything one needs to know to parse and implement VBA files along with reference implementations in VBA.

Then, you pay a LISP consultancy to implement that as a DSL in Racket or Common LISP that runs during development and outputs portable C for production w/ frameworks like WxWidgets and/or NPRS runtime. They should be able to throw the whole thing together in under a year. :)


I am not quite sure what you are making a parody of?


Domain experts, executable spec both parties understand, source-to-source compiler using DSL's, and targeting C to reuse backends? Is that a parody or a recipe for VBA compiler small team can handle?


It's Poe's law for software.


Lmao. That's perfect.


I really don't know anything about VBA or VB, but they're very similar to each other, right? FreeBASIC is AFAIK also pretty similar to VB, but draws inspiration mainly from C++, and has an experimental backend that emits LLVM assembler/IR.

http://www.freebasic.net


Man, I can't wait to see the comments THIS gets


You ask. I deliver. :)


Apart from intellectual entertainment, why would anyone create another general purpose language ? Isnt there more than enough choice already ?

I really want to, but I struggle to come up with enough features or syntactic tricks to justify creating a new language.

What would your motivation be?


Perhaps this isn't what you mean, but "isn't there more than enough choice already" always comes across to me as "isn't there at least one language with zero flaws for your use case?" I think the answer to that question is definitely "no"- this is why people write DSLs for niche use cases, but it also applies to general purpose languages. If we had just stopped where we were and were content with existing languages we'd be missing out on a lot of useful features today.


For practice. Understanding what LLVM is and how it works. Gaining more mastery over the language you are currently using. Understanding parsing and the inner workings of programming languages better.


Find a specific field your language would be useful to.

I tend to think that having it be a scripting language (eg. one that's ment to be used as, for example, creating quests / scripting events. Like old Gothic [I and II] used to have) justifies creating it (and potentially could be useful, since the main contenders as far as I am aware are AngelScript, LUA and Squirrel - so there's not a plethora of choices to choose from).

Though you could find any other field where the language could be used for.

Perhaps processing data, or maybe analyzing human language. Whatever you can come up with.

Then it kind of makes sense :D


We care about code quality. So http://nianiolang.org is as limited as possible, but enables for complex logic implementation. It is like a functional language with imperative syntax, no higher order functions, but has optional types, etc. We embed it in other languages (c, js, perl, Java) without the need for a vm.


how about a general purpose language targeted for a specific domain, like Elm.


There is no reason whatsoever for the so called "general purpose" languages to exist at all, no matter old or new ones. But, LLVM is a very useful tool for implementing DSLs.


I haven't read the tut yet but… could I write a language like python with that tutorial? If not, what would it be the best tutorial for indented syntax like python?


There is a trick for parsing "indented" syntax which is not covered by this tutorial: the lexer (not the parser) keeps track of indentation. Lexer inserts Indent and Descent tokens, which the parser considers like { and } tokens in a C syntax language.

Now your parser implements a context-free grammar although the language is not context-free.


There is even a better trick - ditch the lexer altogether.


Yes a lexer is primarily a performance optimisation. A good one though.


First, if you know Python, play with writing your own interpreters, parsers, etc. in Python itself. It'll be slow, sure, but you'll learn a lot.

Next, try using the python ast modules for taking 'real' python code, and turning it into an AST which you can then do stuff with. For instance, you could take a subset of python, and turn it into C, or Javascript, or whatever. This will be fun, but don't expect to be able to turn all python into those other languages, there's just WAAY too many edge cases, differences in scope, etc. But for a subset of the language, however much you choose to use, it's pretty fun.

In general, parsing syntax should be the LEAST complex issue in writing a programming language. Playing with the above ideas in Python will be a lot of fun and you'll learn a lot anyway.

Some resources to look for... writing a LISP in python:

http://norvig.com/lispy.html

which is EXCELLENT. as is the followup article.

Another fun look at how someone does parsing, etc. in Go is 'lexical parsing in Go': https://www.youtube.com/watch?v=HxaD_trXwRE which is also fascinating.

Good luck! Merry holiday hacking!


Another way to approach it is to drop markers using a preprocessor after indentation. A group of friends and I were hacking on something similar and were trying to allow Lisp syntax sans parentheses. We created a preprocessor that would wrap indented child lines with the appropriate parentheses. Here's the OCaml file below:

https://github.com/udaysinghcode/superscript/blob/master/src...


Will help you in a very limited way. Unfortunally, you will need to hunt from several places to go after any half-decent implementation of any decent-enough language.

But do anything first. So, take this tutorial and complete it. If after it you wanna move...

I'm working in my own language, and after read like hundreds of links/blogs/books/etc, you will find that a lot of that is heavily biased torwards parsing, LISPys, FORTHs, and ultra-heavy-monads-worship or academic complexting that will make your head spin. After all that read, I'm still confused. Because a lot of things are not simpler or obvious in the way how make a blog or a ecommerce site can be.

So, read a part of the list I have collected:

https://www.reddit.com/r/coding/comments/2ocw2r/how_to_creat...

----

Syntax is very important, but after read a lot, I think that do the parsing stuff is not the very best first step.

INSTEAD DO THE AST WALKER. Seriously. Get clear how the the AST can be processed and focus in that. Changes in the AST will cause heavy impacts in your parsing stuff, and a lot of question need to get clarified first. For example, from the link above:

    My first questions? How make a REPL, a debugger, how implement pattern matching, type checking, if a interpreter can be fast enough, if possible to avoid to do a whole VM for it, etc...

    Exist a LOT of practical questions that are left as "a exercise for the reader" that need clarifications.
So, my advice:

- Get a list of what do you want for your language. Then think harder and try to remove features and get to the core of it.

- From the above, start to understand the details. For example: I wanna to do GO-style concurrency, so... how implement GO-style concurrency? If that is not solved, your syntax is pointless. That is why try to fill the "core" of the language as much as possible before to commit to the GUI (aka:syntax) of it.

- Then do a interpreter. Is FAR easier. Even if you wanna do a compiler, start here is more easier.

- On the side, play with your syntax, and imagine how write in the language could be. Then start to think "to make this syntax do that, how I can do inside the guts of the compiler"? But you can delay the parser for a while.

When the time come for the parsing, probably is better to do as a top-down parser and/or pratt-parser. I don't waste time with parser generators. If your disagree, ok with me ;)

I have find that use F# is great for this stuff. However, OCalm have more code-samples (ie: ML languages are made for this!). Lisp will work great if your are insane to like that ;). Hardcode C-developer? Tons of code to look, but not very much in clarity to see.


Writing an interpreter is much, much harder than implementing a simple, straightforward compiler.

Interpreter is unavoidably convoluted, it is full of very complex things like environment handling, it cannot be split into smaller parts in any sane way.

Compiler, on the other hand, is nothing but a trivial sequence of very simple tranforms (as simple as you like), each is nothing but a couple of term rewriting rules. You do not need a Turing-complete language to implement a compiler.


Maybe if the compiler is very simple. But I have said:

""" How make a REPL, a debugger, how implement pattern matching, type checking, if a interpreter can be fast enough, if possible to avoid to do a whole VM for it, etc... """

REPL, debugger, no-VM and other stuff look easier with interpreters than compilers.

However, I will be happy to be wrong: I wanna the simplest way to get where I want to.

Currently: Working with F#, wanna REPL, debugger, AGDTs, pattern-matching, go-like concurrency, iterators, semi-functional...

Where I hit a big block is how do interop with a interpreter (ie: Call .NET methods) and that look easier with a compiler...


And with REPL and debugger requirements it is even easier to implement a compiler than an interpreter.

Pattern matching must be compiled anyway, even if the rest is interpreted, same for the static typing.


Any tutorial on that? Because I don't have find a resource that show me how do this.


Syntax like python, sure.

Execution like python, not as much.


Though it will run faster than a python script that wasn't compiled to binary, that's a given.


That's not a given.


I can't imagine python's BC run faster than a binary (C# sometimes can, but python?).

Care to elaborate how do you picture that happening?




Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: