Writing a C compiler in 500 lines of Python (vgel.me)
510 points by vgel on Sept 4, 2023 | 165 comments



> Instead, we'll be single-pass: code generation happens during parsing

IIRC, C was specifically designed to allow single-pass compilation, right? I.e. in many languages you don't know what needs to be output without parsing the full AST, but in C, syntax directly implies semantics. I think I remember hearing this was because early computers couldn't necessarily fit the AST for an entire code file in memory at once


Linked from another thread: http://cm.bell-labs.co/who/dmr/chist.html

It explains the memory limits and what happened :)

> After the TMG version of B was working, Thompson rewrote B in itself (a bootstrapping step). During development, he continually struggled against memory limitations: each language addition inflated the compiler so it could barely fit, but each rewrite taking advantage of the feature reduced its size. For example, B introduced generalized assignment operators, using x=+y to add y to x. The notation came from Algol 68 [Wijngaarden 75] via McIlroy, who had incorporated it into his version of TMG. (In B and early C, the operator was spelled =+ instead of += ; this mistake, repaired in 1976, was induced by a seductively easy way of handling the first form in B's lexical analyzer.)


Infix parsing chews up a remarkable amount of code and memory.

It's scary just how much easier it is to parse languages without infix operators.


Except it then takes up memory in the human brain, where we have a more limited working set than computers do. That is probably why postfix notation is mostly a thing of the past by now in languages intended to be used by humans.


(in (is "prefix notation" (and "alive" "kicking")) "all lisps")


    (-<> "alive"
         (and "kicking")
         (is "prefix notation" <>)
         (in "all lisps"))


But it creates a habit of having exactly two individual items for an operation.

Which led us to imperative programming and object oriented programming. Which everybody now recognizes as limited in various artificial ways.

Which we are now struggling to shake off given that the majority of chips have a whopping number of parallel cores.

Programming languages restrain your thinking about your solution space.


I wonder why =+ is so obviously a mistake. It does look vaguely wrong for some reason, but I’m prejudiced by current languages.


>I wonder why =+ is so obviously a mistake

Consider

  x = -5
now how about:

  x =- 5
One makes x negative 5, the other subtracts 5 from it. And under this design of the operator the only difference is a space.

It's the same with +, just that we're not used to seeing unary plus in the wild.
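Spelled out (just a sketch; modern compilers parse the second line as a plain assignment of -5, and some even warn about it, whereas under the old =- design it meant subtraction):

    int x = 10;
    x = -5;    /* x is now -5                                   */
    x =- 5;    /* old design: x = x - 5; modern C: just x = -5  */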


>> x =- 5

I think you mean: x -= 5, which indeed is different.


I'm showing how it would be under the original operator design, "=-", which was later reverted.


I think because it's ambiguous with unary plus (a = +b), since C isn't supposed to have significant whitespace in most circumstances.


You also run into problems with a=*p and a=-b, which are perhaps more likely.


But they could've fixed that by going a=(*p) and a=(-b);

Kind of how we use (*p)->next instead of *p->next where p is node_t**
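To spell that out (a sketch with a hypothetical node_t):

    typedef struct node { struct node *next; } node_t;

    node_t n = { 0 };
    node_t *q = &n;
    node_t **p = &q;

    node_t *first = (*p)->next;  /* dereference p first, then follow next  */
    /* *p->next would parse as *(p->next), which doesn't even type-check   */
    /* here, since p is node_t ** and so p->next has no 'next' member      */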


That seems a little backwards and barbaric, though, right?

Imagine if we had to watch out for this as a common pitfall:

    // BUG! Actually subtracts x from current val of neg_x.
    neg_x = -x;
Even moreso, how would these two lines behave? Would they differ in semantics?

    n = -5
    n =- 5
Overall, -= is just so much less ambiguous.

EDIT: To your point about ->, I personally think C would be better if:

    *p->next
parsed as:

    (*p)->next
instead of:

    *(p->next)
but maybe now I'm not thinking through all the parsing implications :)


I've found struct fields that are pointers far more common than pointers to pointers to structs, so if nothing else it feels like *(p->target) is a more widely useful interpretation of *p->target than (*p)->target would be.

Regarding "n = -5", it would presumably be interpreted as "n=(-5)", same as today. Operators don't have spaces in them. So "n- -5" is "n-(-5)", rather than "n--5" (not valid).


As you note in your edit, we already have to watch for that pitfall :)

So really the best way out is to be as verbose as possible, imo: a = a + c, or auto nodep = *nodepp; nodep->next;

Compilers and compute performance have grown to make the difference negligible for code output and compilation times but they definitely take a lot of mental complexity out of such scenarios (anything helps when grokking 10k+ lines of code).


But originally there wasn't a += or =+ lexical token! Those things were scanned as separate tokens, and could be written = +.

So that is problematic: if {a}{=}{-}{b} means a = a - b, then you have no way to write a = -b without parentheses, like a = (-b).


You're exactly right. This makes for a small, memory-efficient compiler. But this entails a lot of compromises that we're not willing to put up with anymore, because there's no longer a reason to.


I'm not sure, haven't looked at the codebases of old compilers in a long time. Definitely a lot of the language is pretty amenable to it, especially if you have unstructured jumps for e.g. the for advancement statement. I had a distinct feeling while writing the compiler every time I added a new feature that "wow, the semantics work exactly how I'd like them to for ease of implementation."

Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.


What you are saying is true for a naive C compiler.

Once you want to optimize or analyse, things become more complicated.

> Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.

Type inference also spans a whole function, so you can't do it in a single pass through the code. (But it's still tamer than in eg Haskell, where type inference considers your whole program.)


In C, nothing you have not parsed yet (what lies to the right) is necessary for what you've already parsed (what lies to the left). (Necessary for type checking it or translating it.)

E.g. to call a function later in the file, you need a prior declaration. Or else, an implicit one (possibly wrong) will be assumed from the call itself.
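For instance (a sketch):

    int helper(int x);                  /* prior declaration: type known here */

    int twice(int x) {
        return helper(x) + helper(x);   /* ok: already declared above         */
    }

    int helper(int x) {                 /* the definition can come later      */
        return x * 2;
    }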

This is not true in some C++ situations, like class declarations. In a class definition there can be functions with bodies. Those are inline functions. The functions can freely refer to each other in either direction. Type checking a class declaration therefore requires all of it to be parsed.

A one pass language is advantageous even if you're building a serious multi-pass compiler with optimization. This is because that exercise doesn't require an AST! Multi-pass doesn't mean AST.

Building an AST doesn't require just more memory, but more code and development work: more complexity in the code to build the abstraction and to traverse it. It's useful if you need to manipulate or analyze the program in ways that are closely related to the source language. If the language cannot be checked in one pass, you might need one; you wouldn't want to be doing the checking on an intermediate representation, where you've lost the relationship to the code. AST building can be reused for other purposes, like code formatting, refactoring, communicating with an IDE for code completion and whatnot.

If the only thing you're going to do with an AST is walk it up and down to do some checks, and then to generate code, and you do all that in an order that could have been done without the AST (like a bottom-up, left to right traversal), then it was kind of a waste to construct it; those checks and generation could have been done as the phrase structure rules were parsed.


I once read that this is why the MSVC compiler didn't support two-pass template instantiation until very recently: the original compiler implemented templates almost like a macro that re-emitted a stream of tokens with the template parameters replaced.


I can't say if that was a design goal, but it sure looks like it. That's also the way to avoid scaling compiler memory use to program size.

At first I thought it wasn't possible for C. But after thinking about it: as long as you disallow forward references and rely on a single source file as input, it's possible to compile a complete C program in one pass. Anything else requires a preprocessor (e.g. "#include") and/or a linker (e.g. "extern" and prototypes) to solve. The implementation in the article dodges all of these and focuses on a very pure subset of C.


I think this goal may also have shifted over time. I remember when I learned C, we used C89, which required declaring all local variables at the top of the block. This seemed like a weird/arbitrary requirement at the time (and is no longer required in later versions), but it makes a lot of sense in a single-pass context! It would allow the stack frame for the current function to be fully sized before any other logic is compiled.
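Something like this (a sketch of the C89 style):

    int sum(const int *xs, int n) {
        int total;              /* C89: all locals declared up front, so the */
        int i;                  /* frame size is known before any statement  */
                                /* of the body is compiled                   */
        total = 0;
        for (i = 0; i < n; i++)
            total += xs[i];
        return total;
    }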


I think one thing some early compilers did was read the source serially in one pass and write the output serially in one pass. If you were doing multiple passes you had to do that for each pass. That means your compiler speed is IO-bound. So one pass is faster.

My cousin's ex had a workflow in the late '70s that involved two floppy drives and a little dance to compile and link. Later he got a 5 MB hard drive, which improved things a lot.


using recursive descent, you don't need to build an ast


the call stack during recursive descent is an ephemeral ast, for the recursive descent parsers I've written.
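As a sketch (a hypothetical grammar of single digits, + and *; each routine emits its stack-machine op on the way back up, so the only "tree" is the nesting of the calls themselves):

    #include <stdio.h>

    static const char *src;                  /* cursor into the source        */

    static void factor(void) {               /* factor := digit               */
        printf("push %c\n", *src++);
    }

    static void term(void) {                 /* term := factor ('*' factor)*  */
        factor();
        while (*src == '*') { src++; factor(); printf("mul\n"); }
    }

    static void expr(void) {                 /* expr := term ('+' term)*      */
        term();
        while (*src == '+') { src++; term(); printf("add\n"); }
    }

    int main(void) {
        src = "1+2*3";
        expr();                              /* emits code with no AST built  */
        return 0;
    }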


That implies the memory usage is proportional to the depth of the AST, not the total size (which is the point, I think).


Only if the compiler doesn't do anything beyond basic peephole optimizations.


I made a similar project in TypeScript [1]. It's basically a multipass compiler that generates x86 assembly, compiles it to binary, and runs it. The worst parts were the register allocator, designing the IR, and the assembler.

[1] https://github.com/Mati365/ts-c-compiler


Ooh, this is cool! Using WASM let me avoid writing a register allocator (though I probably would have just used the stack if I had targeted x86/ARM since I wasn't going for speed).


Nice project!


I am pretty certain the following is a valid "for"-loop translation:

    block
        ;; code for "i = 0"
        loop
            ;; code for "i < 5"
            i32.eqz
            br_if 1
        
            i32.const 1
            loop
                if
                    ;; code for "i = i + 1"
                    br 2
                else
                end
        
                ;; code for "j = j * 2 + 1"

                i32.const 0
            end
        end
    end
It doesn't require cloning the lexer so probably would still fit in 500 lines? But yeah, in normal assembly it's way easier, even in one-pass:

        ;; code for "i = 0"
    .loop_test:
        ;; code for "i < 5"
        jz  .loop_end
        jmp .loop_body
    .loop_incr:
        ;; code for "i = i + 1"
        jmp .loop_test
    .loop_body:
        ;; code for "j = j * 2 + 1"
        jmp .loop_incr
    .loop_end:
Of course, normally you'd want to re-arrange things like so:

        ;; code for "i = 0"
        jmp .loop_test
    .loop_body:
        ;; code for "j = j * 2 + 1"
    .loop_incr:
        ;; code for "i = i + 1"
    .loop_test:
        ;; code for "i < 5"
        jnz .loop_body
    .loop_end:
I propose the better loop syntax for languages with one-pass implementations, then: "for (i = 0) { j = j * 2 + 1; } (i = i + 1; i < 5);" :)


Oh, interesting--I remember messing around with flags on the stack but was having issues with the WASM analyzer (it doesn't like possible inconsistencies with the number of parameters left on the stack between blocks). I think your solution might get around that, though!


A time-honored approach!

https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...

(minus directly emitting opcodes, and fitting into 500 lines, of course.)


Somewhat unrelated question, but I think one of the more difficult things about learning C for coders who are used to scripting languages is getting your head around how the various scalar data types like short, int, long, ... (and the unsigned/hex versions of each) are represented, how they relate to each other, and how they relate to the platform.

I am wondering if this complexity exists due to historical reasons, in other words if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?


>if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?

You'd lose something because those decisions would be impractical for 8-bit and 16-bit targets (which still exist in the world of embedded programming).


If you’re writing code for those why not just use the smaller data types if you don’t need bigger ones? That way it will work efficiently on both and the behaviour will be consistent


Fair point. I guess the issue is with library code that uses int. But you aren't typically going to use lots of general-purpose library code if you're targeting a microcontroller.


If that library code doesn’t use the smaller types then it probably isn’t designed for those platforms, which means it won’t be tested with the smaller int types and they will likely cause lots of bugs. The one advantage I do see is that it might avoid extra sign extension instructions on some architectures when working with less than the native size, but there’s int_fast16_t for that


Would it be possible to create two versions of the language: microC for embedded C programs where these decisions matter, and standard C for PCs/servers, which basically all use the same representation for these scalars?

My main point is that a C programmer today is forced to learn dozens of rules just to cater to many niche platforms that they will probably never target, so if you were to separate those two use cases you could get a much neater C that targets modern 64-bit architectures with all the power of traditional C but a bit less portability.


Yes, for sure. I don't disagree with your overall point. Just pointing out that people are still writing C code for 8 and 16 bit platforms today.


In fact, the default integer promotions, together with the requirement that unsigned int hold numbers up to at least 2^16 - 1, are the main reason C code for 8-bitters is somewhat... stylized, to put it mildly.
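For example (a sketch; whether the compiler can narrow this back down depends on the target and optimizer):

    #include <stdint.h>

    uint8_t scale(uint8_t a, uint8_t b) {
        /* a and b are promoted to int (>= 16 bits) before the multiply,  */
        /* so a straightforward 8-bit target has to do a wider multiply   */
        /* unless the optimizer can prove the upper bits are never used   */
        return (uint8_t)(a * b);
    }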


I think that's actually an interesting example of how C sees itself as a high level language (from the perspective of the early 70s), even though we now categorize it as low-level. You'd expect the default integer type in a high level language to be able to store values that are large enough for an interesting range of practical tasks; 16 bits is arguably the bare minimum to meet that requirement.


IIRC a retrospective by one of the designers listed the default promotions as a careless mistake, because the original—offhand—reasoning was that the registers are 16-bit anyway, so they might as well widen everything when it’s free (on the only machine the language was implemented on at that moment).


The int was supposed to be the native word size: 16-bit on the 286 and earlier, 32-bit on the 386 and later, and 64-bit on x64. Except, of course, int has been 32 bits on x86 for so long (and x86 has been the single most important ISA for just as long), and short has been 16 bits for even longer, that moving to a 64-bit int and a 32-bit short (which is what x64 naturally suits) was just impossible. It didn't happen, and we're stuck with the LP64 (Linux) and LLP64 (Windows) data models.
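You can see which model you're on with a quick check (a sketch; typically prints 4/8/8 on LP64 Linux and 4/4/8 on LLP64 Windows):

    #include <stdio.h>

    int main(void) {
        printf("int:   %zu\n", sizeof(int));
        printf("long:  %zu\n", sizeof(long));
        printf("void*: %zu\n", sizeof(void *));
        return 0;
    }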


The simple version is that there are two use cases - the world where you want the size of types to match the target (e.g. int) and the world where sizes are defined by the coder (uint32_t). You want to handle both of those.

That's a nice theory and is what we've got, but it falls down in a few places.

The first is that the "int" world has got a bit munged - some platforms make some slightly strange choices for long and short and so you can't always rely on it (although int is usually pretty sensible).

The other is that when doing unsigned maths, rollover is silent so generally you really need to know the exact size at coding time so that you can ensure that rollover doesn't happen silently.

Together, these mean that you're generally just better using uint32_t (etc.) all over the place and you get more predictable results.
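E.g. a sketch of the silent-rollover point, where the fixed-width type at least makes the wrap boundary explicit:

    #include <inttypes.h>
    #include <stdio.h>

    int main(void) {
        uint32_t a = 4000000000u;         /* near the top of uint32_t    */
        uint32_t b = a + 1000000000u;     /* wraps mod 2^32, silently    */
        printf("%" PRIu32 "\n", b);       /* prints 705032704            */
        return 0;
    }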


I learnt C about a decade ago (after using scripting languages for 10 years prior) and just stuck with using the uint types, no second thoughts about how big a uint32_t is.


Explicit variable size is useful for limited memory space envs like microcontrollers.


If we were to write C today, we would never have such cases.


Is there a C compiler written in Python that aims for maximum readability rather than trying to get as much done under X lines of code?


I think the code is fairly readable! It's formatted with Black (and therefore limited to reasonable line lengths) and well-commented.

IMO, being under X lines of code is part of the readability—10,000 lines of code is hard to approach no matter how readable it otherwise is.


Not quite a C compiler but arguably better:

http://cwerg.org


This looks a lot like the Tiny Pascal compiler that BYTE published a listing of back in 1978.

http://www.trs-80.org/tiny-pascal/

I figured out the basics of how a compiler works by going through it line by line.


Oh, that's neat (funny that they skipped out on similar things to me, like GOTO and structs :-)

I didn't see a link to the source in the article, but this seems to be it: https://sourceforge.net/p/tiny-pascal/code/HEAD/tree/NorthSt...


I think Borland’s Turbo Pascal was also a single pass compiler that emitted machine code as COM files.


Surely it is a feature of all Pascal compilers that they are single pass. I thought that it was part of the specification of the language that it be possible to compile in a single pass.


It's disturbing to me that I remembered this, but the IBM Pascal Compiler for DOS (1981) had two passes. https://winworldpc.com/product/ibm-pascal-compiler/100


Not all, although the language was designed that way.

Some dialects and optimising compilers had multiple passes.


There's a bunch of LLVM-based Pascal compilers these days. I doubt they are single pass, given how LLVM works. (And in general, any optimizing compiler is most likely doing multiple passes.)

You are right about Pascal's original design. Though I'm not sure if that's still true about modern versions of the language?


Not even old ones, if we take optimising compilers like VMS Pascal into account.


Microsoft Pascal was a two-pass compiler. It was slower to compile than Turbo Pascal, and that pissed off Bill Gates when he realized that Turbo was the most successful.


It makes development so much more fun when you see the results right away.

Pressing "build" in Turbo Pascal on my 386sx, it was already done before you could even perceive any delay. Instant.


I think Turbo Pascal had the capability of generating code directly to memory without generating disk files.


It did that on floppy-based systems too. The '87-equipped PCs in the physics building were much faster at a 4500-line project, which I worked on for a year. So... the first thing to test on a new 33 MHz 386 w/ Wolf3D (install only, no source) and a 33 MHz ITT 387 was my project. It loaded in less than 2 seconds. If you brought everything into the IDE and hit run... it screamed.


Thanks for sharing this, Walter. I'm always curious where language developers get their experience from.


Figuring out how recursive descent worked was just magical.


It is interesting to think that 500 lines of code is something one can write in one or two days. But writing a C compiler in 500 lines of comprehensible code (even in Python) is a challenge in itself that may take months, after a few years of solid learning.

I wonder if this is a good path to becoming an extremely productive developer. If someone spends time developing projects like this, but for different areas... a kernel, a compressor, a renderer, a multimedia/network stack, AI/ML... will that turn a good dev into a 0.1 Bellard?


> 0.1 Bellard

Off topic, but a log scale might be useful: 0.1 Bellard --> -10 deciBellards. That allows for: 0.001 Bellard --> -30 deciBellards.

Problem: Programmers with negative productivity cannot be represented on the same log scale.


Sure they can. Abs(). Or if you prefer to not have quite so much micro-nano, you could also use Cumulative Distribution Functions [1] which are basically just sums of Probability Density [2].

Are they a 4σ programmer, 1σ programmer, -0.5σ programmer, -2σ programmer?

Plus, most people are "average" not negative productivity, and CDFs let you use really fun stuff like Beta Distributions (variable, shapable distributions) and Gamma Distributions (exponential distributions). They're super sweet as far as probability statistics.

[1] https://en.wikipedia.org/wiki/Cumulative_distribution_functi...

[2] https://en.wikipedia.org/wiki/Probability_density_function

[3] https://en.wikipedia.org/wiki/Beta_distribution

[4] https://en.wikipedia.org/wiki/Gamma_distribution


The number of people that are average is infinitesimal, even if you define average as “mode” instead of median or mean.

About half are better than median though.


>>> Problem: Programmers with negative productivity cannot be represented on the same log scale.

This is similar to the problem of price-to-earnings ratio. The ratio goes asymptotic as earnings goes through zero. It would be better to quote earnings-to-price ratio. Another screwy reciprocal unit is miles per gallon for cars.


Now that I have an EV with 130 miles of range, I’m finding miles per kwh to be much more useful than efficiency.

I’d bet range anxiety was also a thing for early gasoline powered cars, so the early adopters of those probably preferred mpg over gpm.


at the very least it'll remove a lot of 'magic' from programming. Today a lot of people seem to be not so fond of university education but I'm personally very glad it made me go through implementing a shell, a compiler, a little toy kernel and so on.

The feeling that you write code somewhere in the skies and have no idea how something works underneath has always really bugged me when I've used something.


You don't need a university education to do those things, just some curiosity.

The function of the university in the near future will probably just be to have like-minded curious people to discuss ideas with, and to get a better grasp of what problems need to be solved (specifically scientific ideas, rather than just applying engineering).

The prestige element (specifically of certain universities over others, perhaps not university over high school) is dwindling, and hopefully will be abolished with this new generation.


The "uni degree value" discussion on here is very biased towards the US, where expensive courses exist, some excellent but others perhaps not providing sufficient value for money.

Note that in several countries in Europe, studying is free of charge or only costs a symbolic fee (scholarships aside).

The university system is thousands of years old, and it would be disastrous if it were abolished. I'm a self-taught developer, who also obtained two Masters and a Ph.D. later, and I can attest that the speed of learning at uni cannot be compared with the speed of learning via self-study, especially self-study in isolation.

Having said this, learning only in groups/herds is also not the best approach IMHO, there is something to sitting alone and figuring something out without external help, at least occasionally, as it trains your analytical skills, your ability to concentrate and it gives you grit/perseverance.


Lectures were made obsolete by Gutenberg. There is little of value (but not none, ok) in wasting the time of reputable professors on repeating the same stuff each semester to a hall of 200 students.

Seminars/labs and office hours are where the biggest value of university can come from, but that's the part that is often left to TAs…


I found that attending my home state university afforded me access to facilities and resources I would never have had on my own, and to not just a wealth of knowledge but also mentors and incredibly capable guides for how to navigate that knowledge while putting it into productive practice.

Motivation and internet access can get some of those things, but even among those things one would struggle to find them in one place and so readily available to them.


A university degree is much more than this, and I think most people who view its value as “dwindling” have not had the experience…


It would be great if there were more emphasis on some of the aspects that made academia so great many years ago, but unfortunately the scene has changed dramatically.

My perspective on the value dwindling comes from several graduate degrees and teaching at some of these 'prestigious' universities in the northeast.

It is of course just my singular experience, but the handful of research institutions I have worked at actually provided more of a typical 'academic' atmosphere than the universities, unfortunately. I still really want to believe that universities provide the most innovative research environment, but the incentive and operational structures don't really make it as optimal.

The reason I mention the new generation viewing its value as dwindling is that the cost isn't really 'worth it' and many of the best educational resources are actually free online now.

I'm an academic, so the fact that the investment doesn't pay off doesn't really bother me (academics typically don't care about money), but many people do want some financial advantage from education, and that isn't as present anymore.


What is implied by “much more than this” and why shouldn’t those things be available to people not enrolled into a university?


It massively depends on what degree and where I suppose. Most people I know with degrees view it as a complete waste of money.


Most people with a university degree don’t have a traditional university education.

I’m very much pro-education, and think that US high school should include the first two years of a current university education (grades should be roughly split between D, C, B and A, where D means “meets curriculum requirements”).

After that, I don’t think many people should go to a university. Instead, the state should provide up to 10 years of vocational education (splittable and redeemable at any point in life) or 8 years of university education (subject to a B or better average in high school).

Most vocational programs would be under 3 years, so you’d get three-four cracks at finding the right profession for you.

University track would be targeted at educators, researchers and entrepreneurs.

I think this would be better for every socioeconomic demographic in the US, and also help alleviate the housing shortage, enable modern manufacturing, fix many current issues with healthcare, and so on.


I'm largely taught outside the academic world, so I sympathize with your position.

However, the engineering culture which took the time to tell me about all these cool things and let me grow into being an expert in them seems to be largely gone.


That was already and always the function and value of a university... Notice the academics aren't sitting in lectures 6 hours a day or trying to pass exams to obtain degrees. They're the longterm users of universities and educating themselves their entire lives.


> You don't need a university education to do those things, just some curiosity.

The same is true of any field. Medicine, for example. Then again, I have several acres of swamp behind my place to help me when learning from my mistakes.


It does remind me of a project [1] Andrej Karpathy did, writing a neural network and training code in ~600 lines (although networks have easier logic to code than a compiler).

[1] https://github.com/karpathy/nanoGPT


This is an implementation of GPT using the pytorch library. It is not meant to be the shortest implementation of a trainable GPT, however it is very clean code. Pytorch does a lot of the heavy lifting, especially when it comes to training on multiple GPU. This implementation only works with data distributed parallel training, so one could not train models of the size of GPT-4 with it out of the box.


Perhaps they were thinking of https://github.com/karpathy/micrograd


> But, writing a C compiler in 500 of comprehensible code (even in python) is challenge in itself that may take months after a few years of solid learning.

The people behind this project avoided that caveat by simply not implementing C. Apparently they kept a bit of the syntax but then proceeded to cherry-pick features that suited them, without making an effort to comply with any version of the standard.


It’s a little bit bigger than Small C https://en.m.wikipedia.org/wiki/Small-C


As an experienced developer who did not do a compilers course at university I was able to write a SQL/JSONPath evaluator in TypeScript in a week or so. I don’t expect a minimal C compiler would be that much more complex.

Essentially all you need is a grammar, a parser library, and a couple of tree walkers over the AST: first to expand macros, then to convert to assembly.

A production compiler with all its optimisation steps is of course far more complex and more modern languages have many more features, but C is really pretty simple (the K&R book is concise and good!) as it was built to work on the computers of half a century ago.


You don't need a parser library; writing the tokenizer + parsing logic isn't a very time consuming endeavor and the code doesn't suffer from it either. The upside is also that you're not taking on some monstrosity of a library for something you can implement in a tenth of the lines yourself. You'll also end up with something you actually fully understand yourself as well, which should be a big plus.


While I suspect I would learn more writing a tokenizer and parsing logic myself I find grammars much easier to read and maintain.

ANTLR is pretty good and is supported across several languages and something I had previously used for some quick Elasticsearch query syntax munging in Python. It also means you can often start from an already existing grammar.

The JS version of ANTLR didn't seem to work for me, so for the SQL/JSONPath stuff I ended up using the Moo lexer and the Nearley parser, which was rather pleasant. https://nearley.js.org


Using a library is cheating. Anyone with enough knowledge can create their own programming language in two weekends using parser combinators.


You might like the book 500 Lines or Less, where experienced programmers solve interesting problems.

https://www.amazon.com/500-Lines-Less-Amy-Brown/dp/132987127...


Writing your own compiler

- demystifies compilers, interpreters, linkers/loaders and related systems software, which you now understand. This understanding will no doubt one day help in your debugging efforts;

- elevates you to become a higher level developer: you are now a tool smith who can make their own language if needed (e.g. to create domain specific languages embedded in larger systems you architect).

So congratulations, on top of other forms of abstraction, you have mastered meta-linguistic abstraction (see the latter part of Structure and Interpretation of Computer Programs, preferably the 1st or 2nd ed.).


I dunno. I did a compiler writing course once, writing a compiler for a subset of Pascal in Ada, generating a kind of quasi assembly. It was a team project. I did most of the codegen and static optimisation.

It was super fun and interesting. But I wouldn't say it was a terribly useful exercise that has greatly enriched me as a programmer.

And somehow I have ended up with a very strong bias against DSLs.


I wrote so many compilers and interpreters in my life, simply because I cannot stand the long compile times and waiting while developing. Only single pass to make it as fast as possible, foregoing many optimisations (I don’t need them during dev); I wrote them for parts (so basically dsls which are compatible with the original compiler of the respective language) of c++, Java, c#, typescript etc. It gave us a competitive advantage as others were waiting for builds. Then I went back to Lisp (sbcl) and found it to be perfect; instant feedback, high performance, trivial to make dsls.


> And somehow I have ended up with a very strong bias against DSLs.

Most DSLs are bad, partially because most people are bad at designing languages.

Embedded DSLs can be quite neat. Eg Haskell makes it easy to embed something like DSLs inside your Haskell code.

(If you squint a bit, the ability to define your own functions in any language goes in that direction of allowing you to define your own mini-sub-language that handles your problem better. Haskell (and the Lisps) just dial that kind of approach up to 11.)


I don't necessarily think the DSLs are badly designed. But personally I find the overhead in learning and remembering a new language to be enormous. I instantly drop from extremely high productivity in my language of choice to fumbling around like a newbie. And often DSLs are used for smallish tasks that you work on, write the code, then leave for a long time. Then you come back to it and have no idea how that language worked, its peculiarities, etc.

It had better have an extremely high payoff to justify that cost.


Embedded DSLs have much lower learning overhead, since they are just embedded in the language you are already using. Eg in Haskell, they are still valid Haskell programs. So they eg inherit the control structures from the host language, and most of the tooling in your editor, like 'jump to definition' also still works.

See http://wiki.haskell.org/Embedded_domain_specific_language

Because the overheads are lower, the payoff doesn't have to be as high to make it worthwhile.

(However, there's still some overhead, otherwise you wouldn't really label them as embedded-DSLs.)


I wasn't really aware of the embedded DSL distinction.

But even then. I was never really able to get into Observable because although it looks like JavaScript, there are kind of hidden objects and methods available to you that I found weirdly hard to discover.


What's 'Observable'?


https://observablehq.com

Kind of like Jupyter notebooks for JavaScript? Kind of?


Interesting. Why did you bring it up?


The issue is that a good DSL shouldn't be a "language", it should be a tool, a textual UI that is designed by and for power users for a particular task. When you are doing something 500 times a day you want a hyper efficient tool to accomplish that. Learning curve is irrelevant, only efficiency and expressibility. A DSL is a great fit. If you are doing some task once a month you just want something simple you can wrap your head around quickly. The problem is that the former often forget that the latter exists and so recommend their workflow to everyone, forgetting they have different needs. But both sets of needs are valid.


Hi, unrelated but I just wanted to say hello.

Stumbled across your post about spin2win a while ago and I'm glad you found it useful! I still use it every day and wish I could fix the jankiness but oh well hahaha.


> Most DSLs are bad, partially because most people are bad at designing languages.

A perfectly-designed DSL is still bad just because it's a whole 'nother language for you to build, and for others to learn, that probably isn't needed for whatever project.


I don't know. Eg people seem perfectly happy to use regular expressions for matching text, instead of accessing the same functionality via eg regular functions in their favourite language.


There are a few exceptions that have made the cut for a useful DSL, like SQL, Solidity, and Regex. Unless your project is totally groundbreaking, odds are low that you invent a new language that can be widely adopted.


I thought the same when I did the compiler course, but 8 years after that course I wrote tooling to mass-refactor a Java codebase by creating an AST, manipulating it, and converting it back to code.


Oh, yeah, true. I have done a couple of pretty complex things that involved parsing code or manipulating ASTs and it was probably useful for that.

I attempted to write a parser for Wikitext once in ANTLR which was an interesting nightmare.


Yeah, nowadays a DSL is almost never a good idea. A library for some pre-existing flexible language can do the job without reinventing a whole lot of wheels. The most specific languages I can think of that make sense are SQL and Solidity.

But DSLs are tempting, especially among the more passionate programmers, so at work we've ended up with a lot of them. All of them are tripping hazards.


DSLs are awesome for what they are. DSLs let you ingest/load untrusted user code directly and have a strong isolation barrier, allowing for very powerful user customizations with "native" implementations of certain features/functions.

Anything from programmable authorization to custom views becomes safely and relatively easily available. Say what you want about ESB pattern, but you can implement ESB transformers in a custom DSL.

Obviously not every piece of software needs these features, but when it does DSLs are very handy.


React and RN let users define views with JSX that are natively implemented. Not sure if that counts as a DSL, but it's JS extended to allow embedded HTML with custom tags, so it's not really a new thing to learn. Part of what made React popular was how vanilla it felt compared to other things that defined whole new languages or heavy-handed frameworks. And now that it exists, it doesn't need to be reinvented.

At work, we had two painful DSLs for monitoring queries. Our team changed monitoring systems entirely just to use SQL instead.


I'm wondering, is Bash/Shell a DSL?

You can do anything with it because it allows launching other programs, but I'd say it was designed specifically for the domain of launching other programs and building data pipelines with their in- and outputs.


While DSLs are primarily designed to easily express certain idioms (e.g. the Logo family primarily describes pencil movement), generally (!) a DSL is an embeddable, isolated, Turing-incomplete language. Although some DSLs are Turing complete, they are first and foremost designed to express domain-specific idioms.

IMO Bash and friends are too powerful and generic to be considered DSLs.


As far as languages go SQL is pretty bad, though. It's not even really a good representation of what it's trying to do. I think if SQL is a good domain specific language the bar seems very low.


Nothing has done SQL's job better than SQL (or its Postgres/Oracle/MySQL variants) so far. The closest would be the map-reduce idiom.


Took me seven years to do my own dynamic one that ended up being very similar to python2 with curly braces. Every programmer should try it


“Crafting Interpreters” is a phenomenal place to start. It’s very accessible. If you think you’re “not good enough” to write a programming language, it will show you just how wrong you are. Really boosted my confidence. (Hi Bob!)


Yeah that's where I started


Writing your own interpreter can be a lot of fun.

Not sure if most people will get much extra out of writing their own compiler, too?


How do later editions of SICP differ? I have no idea which I used.


> [Building parse trees] is really great, good engineering, best practices, recommended by experts, etc. But... it takes too much code, so we can't do it.

It takes too much code in Python. (Not a phrase one gets to say often, but it’s generally true for tree processing code.) In, say, SML this sort of thing is wonderfully concise.


Actually with SLY (https://sly.readthedocs.io) now dead, what is the recommended Lexer/Parser library in Python?


I’m partial to the Python port of parsec. (https://pythonhosted.org/parsec/)


Just for comparison, the LOC counts for some other small C or C-like compilers. It's not that far away from Ritchie's:

    C4x86                              |  0.6K (very close)
    Small-C (x86)                      |  3.1K
    Ritchie's earliest struct compiler |  2.3K
    v7 Unix C compiler                 | 10.2K
    chibicc                            |  8.4K
    Biederman's romcc                  | 25.0K


This one is certainly stretching the definition of "C like", but it's just under 512 bytes : https://news.ycombinator.com/item?id=36064971


Oh, C4 is neat—technically it has me beat since it also implements the VM to run the code—though their formatting definitely takes advantage of long lines :-)


These kinds of posts are one of the things that keeps me coming back to HN. Right when I start thinking I'm a professional badass for implementing several features with great well tested code in record time, I stumble along posts like this that set me in my place.


I have to wonder if there's a Scheme to WASM compiler out there someplace right now I haven't found yet.


Looks like Schism (https://github.com/schism-lang/schism) got part of the way there, but it unfortunately seems to be dead.



No, thanks! Had a look, doesn’t seem to be ready to support WASI, but it’s active.


WASI support is more a property of the host, no? If I compile some guile to WASM and it imports things from WASI, the compiler doesn't need to do anything to support it. The WASM host simply has to provide those imports according to the WASI spec. Unless I'm misunderstanding you.


This is crazy cool! Esolangs have been a hobby of mine, (more just an interest lately, since I haven't built any in a while,) so this is like a fun code golf game for compilation. Nice work, and even better, nice explanation article!


I wrote a C compiler back in the day as a learning exercise. It was the most fun and rewarding project.


I don't see him using match/case... while it's clearly a good use case for it.


I am really confused by what people call compilers nowadays. This is now a compiler that takes input text and generates output text, which then gets read by a compiler that takes input text and generates JIT code for execution.

This is more of a transpiler than an actual compiler.

Am I missing something?


To quote the great Bob Nystrom's Crafting Interpreters, "Compiling is an implementation technique that involves translating a source language to some other — usually lower-level — form. When you generate bytecode or machine code, you are compiling. When you transpile to another high-level language, you are compiling too."

Nowadays, people generally understand a compiler to be a program that reads, parses, and translates programs from one language to another. The fundamental structure of a machine code compiler and a WebAssembly compiler is virtually identical -- would this project somehow be more of a "real" compiler if instead of generating text it generated binary that encoded the exact same information? Would it become a "real" compiler if someone built a machine that runs on WebAssembly instead of running it virtually?

The popular opinion is that splitting hairs about this is useless, and the definition of a compiler has thus relaxed to include "transpilers" as well as machine code targeting compilers (at least in my dev circles).


For some value of “C”:

> Notably, it doesn't support:

> structs :-( would be possible with more code, the fundamentals were there, I just couldn't squeeze it in

> enums / unions

> preprocessor directives (this would probably be 500 lines by itself...)

> floating point. would also be possible, the wasm_type stuff is in, again just couldn't squeeze it in

> 8 byte types (long/long long or double)

> some other small things like pre/post increments, in-place initialization, etc., which just didn't quite fit

> any sort of standard library or i/o that isn't returning an integer from main()

> casting expressions


Well, I set the 500 line budget up front, and that was really as much as I could fit with reasonable formatting. I'll be excited to see your 500 line C compiler supporting all those features once it's done ;-)


C--23

(Respect to the author for doing this, I just couldn’t resist the obvious joke)


I actually almost made it a C-- (https://www.cs.tufts.edu/~nr/c--/download/ppdp.pdf) compiler, but IIRC the `goto`s made me go with the regular C subset instead.


Basically like many C compilers outside UNIX during the 1980's.

RatC did not need 500 lines for its preprocessor support, by the way.


I love the graphic - I would go see the world's largest Chomsky.


Inevitably we have to ask: and how many lines of C in library functions?


So, given that Python is an interpreter and very well understood, can we be sure this compiler does not include a Thompson virus?


No


Finally, one can have inefficient C.


There's always the CINT interpreter for C and C++.

https://root.cern.ch/root/html534/guides/users-guide/CINT.ht...


A PTSD trigger for me. Only half joking. Funny thing is, I never checked out Cling to see if it was at long last the real deal.


Why would language choice of compiler make any difference for efficiency of final output?


They didn't say the language was the issue. It doesn't support the full C spec. But if you want a reason why language might be an issue for a compiler, it could make compilation slower. I think the point of this project is not real-world use, though, but a fun demonstration of skill.


Maybe not the language choice, but the codegen of this compiler is terrible because of the single-pass shortcuts (for example, it unconditionally loads the result of all assignment operations back to the stack just in case you want to write `a = b = 1`, even though 99% of the time that load is immediately thrown away.)
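Concretely (a sketch; the comments just paraphrase the kind of code a naive single-pass compiler emits, not the article's exact output):

    int a, b;
    a = b = 1;   /* here the inner assignment's value really is consumed     */
    b = 2;       /* here a naive compiler still pushes the stored value back */
                 /* onto the stack and then has to drop it, unused           */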


Always remember _bashcc_.


The *point* of a compiler is to compile itself.


Is it?


[flagged]


The new Detect UB instruction is great.


Interesting stuff


Cool. Now try writing a Python compiler in 500 lines of C.


The fact this is hidden says something about the disparity here.



