Hacker News new | past | comments | ask | show | jobs | submit login
C4 – C in 4 functions (github.com)
517 points by petercooper on Nov 4, 2014 | hide | past | web | favorite | 138 comments

On a first skim, this looks really nice; complaints that it's unreadable are unfounded. The background that makes it readable are Wirth's Compiler Construction http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf plus precedence climbing http://en.wikipedia.org/wiki/Operator-precedence_parser#Prec...

> On a first skim, this looks really nice; complaints that it's unreadable are unfounded.

Man, I can't even tell what this is supposed to be. My confusion is entirely founded. My thought process with articles like this goes something like "C in four functions, huh? Sounds like it could be clever. I'll just click and read the explanation... Oh, there isn't an explanation. Well, maybe this file will explain things! ...Nope, it's 500 lines of mostly-uncommented if-else statements. Maybe it's a compiler? I dunno!"

I'm sure there's a subset of the programming community for whom this is crystal clear on first sight, and that's great; but there's a lot more of us who could probably get the joke with a few hints, so it would be nice if you'd help out instead of declaring that if you understand it, it must be easy.

In the sense that some people won't have an idea of what's going on, this community altogether isn't particularly inclusive at all. Personally, I really don't want the topics this site covers to cater to a lowest common denominator, and I'm sure that isn't what you had in mind either, but that's the effect of taking "more of us" to mean more than you personally.

Actually, a lot of us probably don't understand what this is doing. I sure don't, but I really don't consider myself "lowest common denominator" either. I come here to learn, to be honest!

My point isn't that not knowing what this does makes you a lowest common denominator. It's that most stuff that is shared on this site at least borders on being what I'd call esoteric, and if making each individual submission more approachable or catering to a larger general audience is a goal of this community, it isn't really going that way. If it was going that way, I doubt the community would be particularly interested in this site.

This submission in particular has dubious practical use, and the description, "an exercise in minimalism", is telling of a sort of artistic intent. If you don't understand what it does, how it does it, or if you don't like it, it won't lower my opinion of you in any way, but its inclusion on this site is part of why I like to come here every now and then. I get to discuss subjects that relate to my work and hobbies, but I also get to look at weird alien code and think hard in unfamiliar terms. From what you are saying, I think you can relate.

Personally, I could glance over it and get the idea that it is a C compiler, but if you were to show me some code in written with the latest JS MVC or FRP framework, don't hold your breath for me to tell you what it does. I can't say that I fully understand this, and that's why I enjoy the rich discussion the submission spawned here.

Fair enough :-) thanks for clarifying.

The logical leap from adding a "few hints" to everything becomes "lowest common denominator" is the size of the Grand Canyon.

Personally I don't see anything wrong with suggesting to add a few hints, but the basis of that suggestion in this case was that "there's a lot more of us who could probably get the joke" with the hints. If making it approachable to more people is inherently a good thing, the logical conclusion is to make it approachable to everyone.

With code like this, the readability obviously isn't a high priority consideration and sometimes the exact opposite of the goal, with the impenetrability sometimes being part of its charm. This is Hacker News after all, and if your reaction to of a piece of code that describes itself as "an exercise in minimalism" is to leave a snarky comment about the lack of documentation, you should probably check your news elsewhere.

If you have any interest in the subject, the initial comment "just enough features to allow self-compilation and a bit more" should give the purpose of the code away. If not, it ought to have been a clear sign of dragons.

Yes, I'm sorry: I meant to defend this code from the charge of being pointless code golf, and inadvertently disparaged people without the background to enjoy reading it. It really is hard to follow without that background, which lots of good programmers don't have.

I hope someone will write an explanation. I'm still working on one for my own (quite different) little compiler.

complaints that it's unreadable are unfounded

Not exactly. You have to remember that language and compiler design require a LOT of work and experience to understand, and that many programmers will only see this as, frankly, spaghetti.

I think it could have used some more block comments, but that's just me.

I had the same first instinct, but given that a) it's very very tidy code and b) if you want to understand the inner workings of a compiler then you really do need to figure this out, I decided on review that it's basically self-documenting.

Of course figuring out what it's doing is one thing - understanding why it is done in this particular way is another, and while I was able to find my way around fairly quickly I'd cry if I had to re-implement it. I do love how small it is though, that gives it great educational value.

> understanding why it is done in this particular way is another

Isn't that the reason for comments in the first place?

I look to comments to tell me what a block of code is doing rather than why, eg 'Performs a Discrete Cosine Transform on the contents of the buffer' or 'Bubble sort algorithm rearranges the records in at least as much time as required to enjoy a nice cup of tea.'

The 'why' of a very low-level tools like this is the sort of thing that needs to be explored at length in a paper or (in this case) a book, otherwise they'll swamp the actual code. Sometimes as a learning exercise I'll take something like this and comment the hell out of everything, but the value there is more in writing the comments than trying to read them again later. Of course this is very much a matter of personal taste.

I am a junior dev without a ton of experience so correct me if I'm wrong, but I strongly disagree. Comments should explain "why" something was written. Wouldn't the function name indicate what you are doing (and comments in the function)? This is especially true in business logic.

reverseNaturalSortOrder(listOfItems); // case sensitive sort of items by reverse alphabetical order


reverseNaturalSortOrder(listOfItems); // sort this way because the upper layer expects items in reverse order since user requested it

I think it is usually significantly easier to understand what something is doing rather than why it is doing that. To answer the former it usually requires a narrow scope of focus, but the latter requires a very broad scope.

I agree with you completely - The code explains what you're doing, comments explain why you did it that way. Ideally, any comments that explain what you're doing would end-up being redundant when looking at the code.

I think this code details a special case of the above though, in that it comments what the enums are instead of just naming the enums. I give that a pass strictly because this code needs to be able to compile itself, and I don't think it supports named enums, so the comment was necessary to make up for that.

Its not that simple though, error fixes and edge cases often obfuscate something that was understandable. A why comment is never bad, but a what comment is often as valuable as a test

Like I noted with my special case, it's not always that simple, but I routinely find the best commented code to be code which was written with the comments explaining why and the code explaining what. There are definitely time where a what comment is warranted, but it's just not the general case.

That should almost never be the case. If you find this to be a frequent occurrence, then the code base in which you are working is not designed for the problem domain to which it is being applied.

Sure, I'm not arguing against ever saying why you'd do something :) I just like descriptive comments because it speeds up the business of figuring the high level structure of the code - you're right that understanding exactly what's going on can be the most difficult part, but for me that answer usually falls out as I build a mental model of what it's doing. On the other hand some algorithms need in-depth explanation that's beyond the scope of comment (hence my example of the transform).

Mind, I mostly do hobby/experimental programming, I've just been doing it for a long time. So I'm not commending this as a work practice or anything - your point about business logic made good sense.

Agreed - code should explain what code does. Duh.

There are three levels to consider:

1. Readability for modification

2. Readability for the "what"

3. Readability for the "why"

All human-readable description in code is there to make the difference between having a piece of documentation pointing you in the right direction, and having to reverse engineer your understanding. It's an optimization problem with definite tradeoffs. Although CS professors and many tutorials will tend to encourage you towards heavy description, over-description creates space for inconsistent, misleading documentation, which is worse than "not knowing what it does."

When you see code that is dense and full of short variables, it's written favorably towards modification. It is relying on a summary comment for "what" it does, and perhaps on namespaces, scope blocks, or equivalent guards to keep names from colliding. Such code lets you reach flow state quickly once you've grokked it and are ready to begin work, because more of it fits onscreen, and you're assured the smallest amount of typing and thus can quickly try and revise ideas. The summary often gets to stay the same even if your implementation is completely different. And if lines of code are small, you enjoy a high comments-to-code ratio, at least visually.

Code that builds in the "what" through more descriptive variable names pays a price through being a little harder to actually work in, with big payoffs coming mostly where the idea cannot be captured and summarized so easily through a comment.

In your example, one might instead rework the whole layout of the code so:

    var urq; /* user request type */

    ... (we build list "l") 
    /* adjustments for user request */
      if (urq=="rns") { /* case sensitive sort by reverse alphabetical order */ ... }
If you aren't reusing the algorithm, inline and summarize. And if you're writing comments about "what the upper layer expects", then you(or your team) spread and sliced up the meaning of the code and created that problem; that kind of comment isn't answering a "why," it's excusing something obtuse and unintuitive in the design, and is a code smell - the premise for that comment is hiding something far worse. If the sequence of events is intentionally meaningful and there's no danger of cut-and-paste modifications getting out of sync, it doesn't need to be split into tiny pieces. Big functions can be well-organized and easy to navigate just with scope blocks, comments, and a code-folding editor.

"Why" is a complicated thing. It's not really explainable through any one comment, unless the comment is an excuse like the example you gave. The whole program has to be written towards the purpose of explaining itself(e.g. "literate programming"), yet most code is grown organically in a way that even the original programmers can only partially explain, starting with a simple problem and simple data and then gradually expanding in complexity. Experience(and unfamiliar new problem spaces) eventually turns most programmers towards the organic model, even if they're aware of top-down architecture strategies. Ultimately, a "why" has to ask Big Questions about the overall purpose of the software; software is a form of automation, and the rationale for automating things has to be continuously interrogated.

Compilers are also a very well understood topic. If you have seen (and understood) a recursive decent parser with precedence climbing, the code looks as expected. It is a pretty straightforward implementation.

I'm writing a chapter for AOSA on something like this: a self-hosting compiler of a subset of Python to Python bytecode. It'll present the full code (about the same length) and try to explain it well for people not into compilers yet, but in the meantime I recommend the Wirth book.

What's your favorite shorter intro? I'd especially like to reference other educational compilers that are self-hosting.

So, there is another version of Architecture of Open Source Applications in the works? I love the first two editions - they are unique books that really are gems. Are you able to divulge what projects are covered in the new version?

https://github.com/aosabook/500lines and I'm looking forward to reading it too. :)

I'm tempted to link to my draft chapter, but though the code is essentially done the text needs a lot of work.

Basically, what you're saying is that any toy compiler example should be accompanied with a copy of the Dragon Book.

You seem to suggest that a background in compiler theory is somehow table stakes for commenting on HN. Since many here are not developers, and many developers don't have a CS degree, a few contextual comments seem appropriate.

If you're commenting about a remarkably clever example of an obscure topic which requires prolonged study to understand, then yes I'd suggest that a background in _______ is somehow table stakes for commenting on a focused discussion of _______ on HN.

"Toy examples" are often the result of long & deep study and practice of a subject, creating something profound which casual observers are not entitled to instantly understand. In this case, it's a very clever compiler: everybody understands this summary, and if you want "a few contextual comments" beyond the source code itself then you know where to get enough information to learn what you need to understand this.

If you don't "get it", and don't want to "get it" on your own, it's not for you.

You seem to suggest that a niche, minimalistic toy example should always be accessible to non-developers.

It's false dichotomy that you either get it or you don't. People will access things according to their ability. Where to draw the line on being inclusive? If some has genuine curiosity and motivation to ask a question and learn, then providing a few lines of overview doesn't clutter the board much and can be a positive contribution.

Exactly, where to draw a line? Explaining a concept of abstract and virtual machines may take a few pages of a dense text, explaining how to parse expressions with precedence may require dozens of pages, explaining C types will add a few more.

So, yes, it's either you're curious enough to dig into a code and find the relevant explanations somewhere else (the said Dragon Book and alike), or you won't get it, regardless of how comprehensive comments are.

Well, the goal is for it to be minimal, and C doesn't actually do a lot of work for you. On reading it, the algorithm it uses to parse was of less interest than the various tricks it uses to initialize and manage the state of the parser in a compact way.

> complaints that it's unreadable are unfounded

    int *a, *b;
    int t, *d;
I can't really see how anyone can say that code that uses one-letter variable names (with the exception of the "standard ones", whose meaning is defined at the top) is readable.

OK, 'unfounded' was a little too strong. I'd change some things myself, including comments on those declarations; but if this looks like a code-golf game to you, it's not, it's a style you're not used to.

Most of it was readable, but the printf on lines 57--59 made me retch. I see what it's doing, but it's not what I'd call easily maintainable:

  printf("%8.4s", &"LEA ,IMM ,JMP ,JSR ,BZ ,BNZ ,ENT ,ADJ ,LEV ,LI ,LC ,SI ,SC ,PSH ,"
         "OR ,XOR ,AND ,EQ ,NE ,LT ,GT ,LE ,GE ,SHL ,SHR ,ADD ,SUB ,MUL ,DIV ,MOD ,"
         "OPEN,READ,CLOS,PRTF,MALC,MSET,MCMP,EXIT,"[*++le * 5]);

I'd like to know why this was downvoted, and if people who down voted it can explain what the code is doing please?

It's taking an integer (representing an operation) and printing out the name of that operation.

First thing to say is that "* ++le" is the integer representing the operation to perform. This basically walks through the array of instructions returning each integer in turn.

Starting at the beginning of the line, we have "printf" with a format string of "%8.4s". This means print out the first 4 characters of the string that I pass next (padded to 8 characters). There then follows a string containing all of the operation names, in numerical order, padded to 4 characters and separated by commas (so the start of each is 5 apart). Finally, we do a lookup into this string (treating it as an array) at offset "* ++le * 5", i.e. the integer representing the operation multipled by 5 (5 being the number of characters between the start of each operation name). Doing this lookup gives us a char, but actually we wanted the pointer to this char (as we want printf to print out this char and the following 3 chars), so we take the address of this char (the & at the beginning of the whole expression).

It's concise, but not exactly self-documenting.

Does that make sense?

(I didn't downvote.)

How is that not self-documenting if one knows C?

I think you and I might disagree on the meaning of self-documenting. ;)

I don't think we really do.

It is impenetrable black magick if one "knows" C -- but quite clear if one /actually/ knows C.

Ah, the No True Scotsman finally arrives to the party.

No. The printf() requires that one has read K&R. That's not a high barrier to clear. Pointers are chapter 5.

It is unreadable. Lumping code together into big functions just so you can say "Look I've created a compiler in 4 functions" is pointless, unless your goal is to post it to HN to show everyone how clever you are.

This is not how you code when you work in a team or when you know some other poor soul has to come along and maintain it.

I suggest you take a look at [1] then go and read this excellent book by Martin Fowler: "Refactoring: Improving the Design of Existing Code" [2]

[1] https://en.wikipedia.org/wiki/Code_smell

[2] http://www.goodreads.com/book/show/44936.Refactoring

This also is not code that would need to be written by a team.

The functions may be big, but they also don't have all that much duplicate code inside them. The choice of 4 functions isn't arbitrary either - they nicely divide the problem into:

- next() - splits the source code into a series of tokens

- expr() - parses expressions

- stmt() - parses statements

- main() - starts the processing of the source, and also contains the main interpreter VM's execution loop

Code generation is integrated into the parsing, since it's generating code for a stack-based machine and that also very nicely follows the sequence of actions performed when parsing.

In fact I'm of the opinion that the obsession with breaking up code into tiny pieces (usually accompanied by the overuse of OOP) is harmful to the understanding of the program as a whole since it encourages looking at each piece independently and misses "seeing the forest for the trees".

In contrast, this is code that is designed to be easily read and understood by a single person, showing how very simple an entire compiler and interpreter/VM can be. It doesn't attempt to hide anything with thick layers upon layers of abstraction and deep chains of function calls, but instead is the "naked essence" of the solution to the problem.

Someone used to e.g. enterprise Java may find this style of code quite jarring to their senses, but that's only because they've grown accustomed to an environment in which everything is highly-abstracted and indirect, hiding the true nature of the solution. Personally, I think the simplicity and "nakedness" of this code has a great beauty to it --- it's a functional work of art.

Breaking up code is more than just removing redundancy - it's about exactly what you have written: "Encouraging looking at each piece independently". That actually has advantages, the main being that it is easy to understand how each piece operates independently of the other pieces - simply cohesion & coupling which is applicable to any language, not just "Java". I don't believe in my experience that I miss "seeing the forest for the trees" - I encourage you to try it, perhaps by learning some functional programming.

There is one obsession that I am tired of is people posting "X awesome thing in 120 lines of JavaScript" or "Y in 4 functions". Just because the problem is reduced to small as possible metric, it doesn't make it good.

PS: Also you mention negatively connotated terms like "OOP", "Java", "thick layers of abstraction" and "deep chains of function calls" as if you've ascertained that I'm some enterprise developer that doesn't have any C experience and wouldn't know simplicity if it bit me in the ass.

I have a feeling that the project is a mix of for-fun and to-see-if-its-possible.

Not all programming is enterprise quality, some programming is intentionally not.

Being so dismissive of that, seems a little silly to me.

The demoscene doesn't exist because of a bunch of programmers trying to make the lives of every other programmer more difficult, it exists because people like the challenge.

Of course this occurred to me, my comment was more of an emotional response to "complaints that it's unreadable are unfounded".

If quite a few people are saying that it's unreadable, you don't just dismiss them as making "unfounded" statements.

> The demoscene doesn't exist because of a bunch of programmers trying to make the lives of every other programmer more difficult, it exists because people like the challenge.

I'm pretty sure that this doesn't need to be stated.

I suggest you consider how someone could know of, and understand, your suggestions very well, while still having the opinion he does. Don't underestimate other people.

I just wanted to credit Reddit's /r/tinycode sub-Reddit for this link: http://www.reddit.com/r/tinycode - it's a pretty cool place to discover minimalistic implementations of things.

Thank you for sharing that link to the tinycode Reddit. Did the original poster (for the C in four functions) participate in that sub-reddit ?

It would appear so. https://www.reddit.com/r/tinycode/comments/2la785/c4_a_c_com... is the post from /r/tinycode and its OP[0] is the same as the github user[1] for c4.



Update: We seem to have made /r/tinycode a trending sub-Reddit of the day :-) http://www.reddit.com/r/tinycode/comments/2lcq17/rtinycode_h...

Other than Minecraft I don't go to Reddit, this is a cool link though and one I'll be checking, thanks.

Now you mostly go to reddit like anyone else. Is not going to reddit the new "I don't watch TV"?

nope. Facebook/Kik/Snapchat/Dropbox is still the new "I Don't Watch TV".

Neat! Then I really Don't Watch TV!

From a cursory glance this appears to be a much-condensed recursive-descent parser with all of the usual parsing functions moved into one, so it's not all that difficult to understand. I think recursive-descent is one of the easiest to intuitively understand parsing algorithms, and it also makes for some concise code; it's also far easier to debug a recursive-descent parser than one of the traditional table-driven ones.

Edit: OTCC (http://www.bellard.org/otcc/ ) is another extremely tiny (to the point of being obfuscated) example of a compiler using a recursive-descent parser.

It's an operator precedence parser, which is actually quite a bit faster than recursive descent for expressions when operator precedence is defined using grammar productions that would otherwise get turned into methods.

For example, a simplified expression operator precedence can be expressed like this (where {} denotes repetition, implemented with e.g. a while loop in a recursive descent parser (RDP)):

    expr ::= term { '+' term | '-' term } ;
    term ::= factor { '*' factor | '/' factor } ;
    factor ::= '-' factor | <number> | '(' expr ')' ;
Typically this would be parsed in RDP using expr(), term() and factor() functions, where expr() calls term(), term() calls factor(), and factor() calls expr() if it sees an opening parenthesis.

The trouble is that this means that every single expression parse needs to call through this deep code flow before it can see a single interesting bit of the expression being parsed. Parsing "2 + 3" will result in 5 function calls: expr, term, factor, then term and factor again.

The operator precedence parser presented avoids this deeply nested set of calls before it starts parsing interesting things.

This problem is far more acute in a language like C, which has lots of precedence levels, than it is in a language like Pascal, which only has a handful.

I was contrasting it against table-driven parsers (e.g. typical of the Yacc/Bison type) which I believe don't have any recursive calls and use a separate stack, while this one does make recursive calls. In some ways operator precedence/precedence climbing is like an optimisation of RDP, replacing a set of recursive functions with a level counter and a loop.

I can view using the CPU stack instead of an explicit stack, and the CPU instruction pointer instead of an index into a state table, as optimizations too (direct encoding). There also exist recursive ascent parsers, which are the RDP analogue of table-driven LALR parsers you're talking about.

Point being, there are a lot of different parsing approaches, with pros and cons, and it can be useful to treat them separately rather than view them within only a couple of lenses.

This shit doesn't scale:

  $ time ./c4 c4.c c4.c hello.c 
  hello, world
  exit(0) cycle = 9
  exit(0) cycle = 22614
  exit(0) cycle = 9273075

  real	0m0.067s
  user	0m0.067s
  sys	0m0.000s
  $ time ./c4 c4.c c4.c c4.c hello.c 
  hello, world
  exit(0) cycle = 9
  exit(0) cycle = 22614
  exit(0) cycle = 9273075
  exit(0) cycle = 933197195

  real	0m5.834s
  user	0m5.827s
  sys	0m0.000s
  $ time ./c4 c4.c c4.c c4.c c4.c hello.c 
Just kidding. :) Amazingly cool! Does anybody have a smaller self-hosing compiler & bytecode vm?

For the record:

  $ time ./c4 c4.c c4.c c4.c c4.c hello.c 
  hello, world
  exit(0) cycle = 9
  exit(0) cycle = 22614
  exit(0) cycle = 9273075
  exit(0) cycle = 933197195
  exit(0) cycle = -1428163377

  real	9m23.409s
  user	9m22.673s
  sys	0m0.020s

Self-hosing, haha. Not that I can think of. My https://github.com/darius/ichbins is shorter but it's a self-hosting Lisp compiling to C.

Would you call this a compiler, an interpreter, a virtual machine, a scripting engine, or a combination of those?

It compiles a c subset to byte code, then executes in a virtual machine. I think generally an intepreter can refer to either a byte code interpreter (ie virtual machine) or an AST walking interpreter. I didn't see a way to embed c4 into a host language, so maybe not a scripting language?

IMO the real value of exhibits like this are boiling the problem (lexing, parsing, compiler, interpreting) down to their most basic parts. One could easily imagine this same language being implemented over 10's or 100's of class files in a more verbose language.

I think generally an intepreter can refer to either a byte code interpreter (ie virtual machine) or an AST walking interpreter.

This brings to mind how fuzzy "interpreter" is as terminology. What is a virtual machine that JIT compiles the byte code or the AST? Is it a JIT implementation of an interpreter? What of implementations that only JIT the most frequently used functions? Wouldn't those be half interpreter and half virtual machine?

When it comes down to it, they're all really virtual machines. The real distinction is how we've come to think of different implementations and the representations sub-culturally. For some reason, it makes us feel better when we call certain things interpreters, because of some meaningless (and sometimes factually challenged) competitive instincts concerning implementation speed. (Also, we arbitrarily feel that byte code is somehow more "machine-y" than an AST.)

So do I have a problem with "interpreter"? Only when people correct others, as if they're making a correction about something fundamental and factual. In reality, the distinction is between machines that are intended to have the same runtime semantics and really the distinction is only around what optimizations are present in their implementations. Furthermore, if you look at those optimizations in detail, the distinction gets even hazier.

Most interpreters (not all, but most) are actually a compiler and a VM, yes. The difference between a "compiler" and an "interpreter", in practise, seems to be that "compilers" lack a built-in interpreter.

Most static language compilers include an interpreter for constant expressions at a minimum, because otherwise statically allocating things like arrays is a bit tricky. Handily, this interpreter can be reused for constant folding.

C++ compilers nowadays necessarily include a Turing complete interpreter.

C has one in the preprocessor, to do platform-independent conditionals.

That's another very good point. One can just as well think of the VisualWorks Smalltalk VM as a compiler (which is actually implemented in Smalltalk) with an interpreter+JIT which also functions as a linker. This means, one can also think of Smalltalk as a compiled language with lots of late binding that makes for a more complicated linker. (JIT) Then the part that's a byte code interpreter can be thought of as an "optimization" on the compiler+linker combination. In fact, if you don't want to bother with flexible debugging, you could implement Smalltalk as a compiled language with "fancy linking" and no interpretation at all.

Any interpreter vs. VM distinction is mostly a social construct. Looked at technically, it's a mishmash.

Also, a linker is very similar to a simple garbage collector: it follows code references, marking blocks as it goes (physically allocated in the final output), and adding the references in those blocks to its current list of references to resolve, until it runs out of references to resolve (done) or fails to resolve a reference (link error). All the references then need to be patched up according to their final address, very similar to a GC fixing up memory references.

That is pretty stripped down. Could probably be ported to JavaScript without breaking a sweat.

That was my first thought, a micro-C to asm.js compiler wouldn't be hard.

Pretty damn impressive.

Though I must wonder: how complete is it? What does it and does it not support? It's at least complete enough to be self-hosting, but beyond that? The code doesn't use that much of C.

Judging from the comments in c4.cpp, it probably only supports enough of a subset to compile itself.

Granted, while building a parser that can parse (let alone compiling) the full C language is nontrivial, any undergrad should be able to build a parser and compiler for a sufficiently simple subset of it. (In my undergrad, we used this subset to build a "compiler" in second year: https://www.student.cs.uwaterloo.ca/~cs241/wlp4/WLP4.html)

You can build a C parser in an afternoon. It only has a few language constructs. Declarations are the hardest. Scanners are readily available for expressions and constants.

C is not a simple language as a CIL guy says [1]. I wrote my own C compiler [2] and I can say that writing a parser was harder than I thought. It would take more than half a day at least.

[1] http://www.cs.berkeley.edu/~necula/cil/cil016.html [2] https://github.com/rui314/8cc

That first link is almost all about language semantics, not parsing issues.

As for your example, I'll be solomonic and say that you and Joe are right, of sorts (though I do think it'd be more than an afternoon).

It's certainly far more than a days work if you handwrite a lexer and parser that does the amount of additional work that yours do (AST construction; a lot of error reporting and sanity checking). But you can get very far with C very quickly if you use parser generation tools and have prior experience writing compilers and your goal is "just" to get something to parse it as quickly as possible - it's a tiny language.

Of course, in practice most real compilers don't use these parser-generation tools exactly because things like proper error reporting etc. is far harder, and a simple recursive descent parser is so much easier to work with.

Yes, I also wrote my own C interpreter and the parser was a lot more than a day's work.

I would be worried about handling the not-context-free parts of the language: http://trevorjim.com/c-and-cplusplus-are-not-context-free/

How would you handle these?

Scanner has to have access to the symbol table, emit 'ident' or 'type' as appropriate. So not a pure context-free parser.

Yes. This is the reason it is way easier to hand-build a parser for C than try to use any kind of parser generator.

Bison and Flex work fine - do they count as 'parser generators'?

Flex does not work fine, that's why you need the lexer hack: https://en.wikipedia.org/wiki/The_lexer_hack

Non-working (no lexer hack) Flex and Bison specification for C99: https://gist.github.com/codebrainz/2933703 653 lines

Hand-written recursive-descent C parser and pre-processor (something not handled by Lex/Yacc) that produces executable Common Lisp code: https://github.com/vsedach/Vacietis/blob/master/compiler/rea... 935 lines

You can put whatever code you want into the Flex phrase rules. The 'lexer hack' can be implemented there. That's what I do whenever I have to parse something like C.

Exactly. You have to drop hooks into the lexer from the parser, and by the time you're done you end up with just as much code. Only it's slower than a recursive-descent parser would be, and a lot of the time parsing speed really matters because it shows up as user-visible latency.

While I have a strong dislike for lex/yacc and descendants, the "hooks" you need for C are trivial. As far as I remember the only thing you need is an ability for the lexer to check whether or not a given identifier is a variable or type.

Even that is only needed if you want to report more specific information up to the parser. E.g. Clang doesn't. In Clang it is instead the parser that looks up the information in order to figure out what type of identifier it has received.

Right but by then the rule matching has been done - now you're writing code to distinguish rules instead of code to handle rules.

The advantage of a parser tool is, it makes the language constructs explicit in the code - here be assignment, here be for() etc. That has some value.

And what are the alternatives? I've always wanted a parser object (in C++ anyway) that I can add constructions to, feed it scanner output and have it build a symbol table and semantic tree. Does such a thing exist?

Indeed. The lack of a case statement in the run loop is a bit of a giveaway. Excellent work nonetheless.

It's clear from the huge number of 'else if' statements that it doesn't support switch statements.

This is also a bit of a clue

    p = "char else enum if int return while "
    "open read close printf malloc memset memcmp exit main";

Ah, the lack of struct explains why the compiler doesn't use any structs.

This is really similar to Marc Feeley's tinyc.c[0], which implements a C subset parser, code generator, and virtual machine in about ~300 lines of C.

[0]: http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Comp...

That's a nice one. The big difference is that it doesn't compile itself -- not that there's anything magical about that, but it's a kind of threshold of seriousness.

One of the things I've wondered about is how small one could make an ISO C89-compliant compiler (that would be able to compile itself), and all these tiny compiler projects have inspired me to revisit that thought now... I've written pieces of compilers like expression parsers and tokenisers, and even then I felt like it wouldn't be so hard (if I had the time) to write a full compiler.

These are all great for dispelling the notion that all compilers somehow necessarily have to be greatly complex and impenetrable to anyone but highly-trained professionals and theoreticians. (Look at the Dragon Book, for example.)

You might enjoy http://canonical.org/~kragen/sw/urscheme/ -- it's by someone who'd just written his first compiler, with links to other sources he found inspiring (one of them mine).

C89 I think has too much complexity for the amount of power it offers. lcc is a nice example: Norman Ramsey said he asked one of the authors what he learned in writing it, and got an answer like "Not to write a compiler in C." But anyway the book about it https://sites.google.com/site/lccretargetablecompiler/ is very good. http://www.cs.princeton.edu/~appel/modern/ is my favorite general text.

Next challenge: a fully compliant C++11 compiler written in 4 functions (1)

Then, I'd be impressed :)

(1) At maximum 500 lines per function, where each line is at most 200 characters long.

Neat, but this is not a general purpose C interpreter. It seems to lack the preprocessor and is only able to execute the example program and itself because it has wrapper implementations for a static set of standard library functions including printf(), open(), read() and malloc().[1] Use a standard library function it does not readily support, and you're out of luck.

[1]: https://github.com/rswier/c4/blob/master/c4.c#L492-L498

I wonder if it's faster than http://bellard.org/tcc/ (in compiled line per sec)

For mac users: gcc -Wno-all -arch i386 -o c4 c4.c

Unfortunately, I cannot compile it on my OSX 10.10... here's what I get: http://pastebin.com/cVvaYFEH

EDIT: just make next() function void and it works.

EDIT2: still no fortune :(

  $ ./c4 hello.c
  [1]    33920 segmentation fault  ./c4 hello.c

For now, you need to add both -Wno-return-type and -m32 when compiling.

I think it doesn't work on x86-64, since the code assumes

    sizeof(int) == sizeof(void*)

Yep, you can compile it on 64 bit OS X with clang's -m32 option and it should work:

    ➜  c4 git:(master) ✗ clang -m32 c4.c
    ➜  c4 git:(master) ✗ ./a.out hello.c 
    hello, world
    exit(0) cycle = 9

Please try kbrock's fork at https://github.com/kbrock/c4 Check out his 'longs' branch.

Wow, nice. I once managed to write a brainfuck compiler/vm in ~100 lines of C [1], but this is impressive.

[1] https://github.com/kgabis/brainfuck-c

http://git.dzervas.gr/c4 An attempt to make this readable :) (at least with right syntax...)

that's really few points blowing up a plane!

I honestly think this is ridiculous. Sure, this is an incredible feat, and congrats. But serioulsy, I would be ashamed to publish such unreadable code under my name.

What about naming your variables with descriptive names?

What about extracting complex conditions into well named function to understand what is going on (thus defeating the purpose of the "4 functions") ?

This list could go on forever...

Writing software is not a contest for who can write the most amount of code in the most cryptic way.

I actually found the code surprisingly easy to read; "tk" stands for "token", "ty" for "type", and so forth.

It's worth noting that compilers don't pop out of thin air -- you have to start with something simple in order to compile a more complicated compiler. Bootstrapping your own self-hosting compiler is a useful academic exercise, and you should try it sometime if you haven't already: http://en.wikipedia.org/wiki/Bootstrapping_(compilers)

Well, if this particular compiler is defined in 4 functions, why couldn't it be made out of more functions, enhancing the readability and maintainability of the code?

Who says it's designed to be maintainable? Not everyone writes code for the reason that everyone else does.

There is no reason why couldn't it be made out of more functions. Absolutely none. In fact, you can fork the repository and do the refactoring yourself, right now.

If I had to guess, the functions are tied pretty closely to how the author is parsing the file; "next" (next token), "expr" (expression), "stmt" (statement), and "main".

As for the project being called "C in 4 functions"; at best, I'd argue that's just a linkbait-y title since it's not actually C (it's a subset). I don't have a problem with the code _per se_; just the title.

> Writing software is not a contest for who can write the most amount of code in the most cryptic way.

It can be: http://ioccc.org/

And indeed, that is how TCC http://bellard.org/tcc/ began its life http://bellard.org/otcc/

The IOCCC is more about writing the least amount of code in the most cryptic way.

For example, my entry http://www.ioccc.org/2012/tromp/hint.html is a 25 line "BLC in 7 functions".

Similar to C4, but completely unmaintainable, it compiles Binary Lambda Calculus to bytecode which is then interpreted by a virtual machine.

> This list could go on forever...

Yes it could. While you are adding to your list of coding rules, the OP will have written another fun, tiny compiler.

Who is having more fun?

Those are not rules, they are mostly common sense to help the next programmer who has to maintain your code.

And most common sense of all is that you choose your rules for the audience. This is obviously not production code.

In any case if someone were competent enough to work on it then the style is actually quite readable. It's even full of comments.

I honestly think this comment is ridiculous. Sure, this is an incredible feat, and congrats. But seriously, I would be ashamed to publish such unreadable English under my name. What about spelling all the words correctly? What about not being an asshole and criticizing someone just because? This list could go on forever... Writing comments is not a contest for who can write the most amount of bullshit in the most non-constructive way.

> I honestly think this is ridiculous. Sure, this is an incredible feat, and congrats.

It's not that bad (or difficult), really. It's a hand-written parser for a subset of C that emits assembly code right away. This is how compilers like Turbo Pascal used to work (see http://www.pcengines.ch/tp3.htm for an explanation of what's happening).

Sure, you could apply cosmetic changes like making "*++e = bla;" into "emit(bla);" and you could move the cases into independent methods (that are used once), but this isn't meant to be a state-of-the art compiler and it won't become one if one applies best practices to it.

I don't think it supports forward references either, which also makes it more like Pascal.

Look, one aspect of programming is producing production code that is maintainable, etc.

There are other aspects that are equally as important, and sometimes even more so. Exploring, learning and prototyping are among them, as are "back of the envelope" constructs.

This is a really cool example that shows how far you can go with tiny amounts of code.

Pithyness is actually one way to make code _more_ understandable, given that the reader is familiar with the subject area.

Don't be a snob. The world is a better place because someone wrote this and not worse off.

Since compiler construction is kinda a deep technical field, documenting this in pedagocially sane way would have been a huge task.

I'm happy the author took the effort of writing and publishing this piece, it would be sad if he hadn't just because it's missing an embedded tutorial.

He's not asking you to maintain it. There are references in this thread that explain what is going on...

I write and maintain C++ production code as a day job and some of that stuff is an order of magnitude harder to grok than this (no, it's not documented in any _helpfull_ way either).

I assume it's a minified version of the other C compiler in his Github account, which does have better variable names and some comments. https://github.com/rswier/swieros/blob/master/root/bin/c.c

The goal here is to write a compiler that can compile itself.

Yeah I loved this part:

  ./c4 c4.c c4.c hello.c
Granted, most compilers can do the equivalent, but it's rarely so simple to invoke.

Writing software is not a contest, period :) It may be for some, it may be for you, but you don't get to sign up other people for it without their consent.

But, "it is only as an aesthetic phenomenon that existence and the world are eternally justified." This piece of code can be viewed aesthetically. Artificial guidelines take the fun out of coding.

The variable names thing really annoyed me too - a habit of code golf, and people who were originally trained in an old FORTRAN edition that had a 6 or 7 char limit on names.

Unless you're talking about a large piece of software composed entirely of single character functions and variable names, I pretty much disagree. Verbose variable names do not magically teach those reading a piece of code how it works, simultaneously they tend to make it impossible to write many kinds of expressions concisely, and consequently they regularly damage the readability of more complex pieces of code (e.g. arithmetic expressions involving 3 or more terms).

The same goes for symbolic constants. Sometimes (but not exactly "often"), use of numeric literals can vastly improve the maintainability of some code, assuming the reader understands how to maintain it in the first instance.

As for increasing reader comprehension, carefully thought out comments are a better mechanism by far.

In this case, it is sufficient to know that the file is a compiler/interpreter for its entirety to make sense, assuming the reader has implemented (or at least understood the principles behind) a compiler/interpreter in their past. Expanding the function/variable names, splitting "complicated" expressions out into their own functions, etc., does not magically improve the uninitiated's chance of understanding what is going on

I mostly strongly disagree.

I see no value in naming a variable 'tk', 'pp', or 'bt'. It can only help to make the code more readable with less context. I do not need to understand compilers in detail to know what this program is doing, except, for the names being useless. And if I do understand them but have not spent half an hour or probably much more to digest the exact system by which it operates, I would be completely unable to identify bugs or talk about modifying or extending it in any useful way.

Well written comments do not make the code below them more intelligible as text even if they do give you good hints for how to load it into your brain.

And "Expanding the function/variable names, splitting "complicated" expressions out into their own functions, etc." all ABSOLUTELY improve the uninitiated's chance of understanding what is going on. Speaking as one of the uninitiated (though I can mostly make out how this works). Man, I wish I knew what 'tk' was. That whole section at line 204 would be so clear if I could follow the intent.

I agree with you on symbolic constants and arithmetic though.

As you wish. From the program, line 19:

    tk,       // current token

yeah, my point is that should be swapped out everywhere. There's no reason not to.

replace all: tk -> current_token

Writing expressions concisely is an interesting problem, but in order to do, for example, tensor algebra in C you basically must use macros to define a DSL. It is not a goal which can always be achieved, but structures higher than variable names are what allow one to achieve it (usually.)

I didn't say that "verbose" variable names were mandatory everywhere - i and j have their place - but names which are at least pronounceable words are essential, especially if they appear in more places than a five line function.

This project is a special case, certainly, but toy compilers are nothing if not to learn from.

I used to rather long names too... but now I think short variables have their place, and sometimes they even improve readability.

This code seems to follow a lot of conventions (if I see the var "i" I could bet a million dollars is an int that is being used as a counter, probably to go through the positions of an array). It uses plenty of enumerated constants, which is good too.

I've been doing some functional programming, where you find that often types are more important than names. See the "Names are overrated" section of this article [1] ( although this point may not apply to this piece of software... C being the language that it is :p).

1: http://techblog.realestate.com.au/the-abject-failure-of-weak...

I'm not against local variables for counting called i and j.

I'm against a big list of forward declarations, with tens of different variables, each with a very short name, and a comment explaining what this is an abbreviation for. Just replace the names with the comments, with underscores for spaces, using find and replace. two minute job for the whole code base, much more readable code almost everywhere.

I agree that functional languages may be a counter case; but codebases in C and python (in my experience), benefit greatly from well named variables.

> Writing software is not a contest for who can write the most amount of code in the most cryptic way.

Except when it is.

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact