ChocoPy: A Programming Language for Compilers Courses (chocopy.org)
189 points by matt_d on Sept 12, 2019 | 36 comments



Concise: A full compiler for ChocoPy can be implemented in about 12 weeks by undergraduate students of computer science. This can be a hugely rewarding exercise for students.

Written in itself, and thus self-compiling? IMHO that's one of the biggest advantages of using C/C-subset compilers, that it's relatively easy to do so, and I think writing a self-compiling compiler is especially important in a course because it really drives home the point that a compiler is just another piece of software.


> because it really drives home the point that a compiler is just another piece of software

This fact is already conveyed conveniently by writing the compiler itself in any language. The important thing is the transformation from structure to structure and the recursive nature of grammar rules.

Having written an assembler, a linker and, for another project, a sort of compiler with Flex & Bison (FsLexYacc actually), I think the emphasis should be on how close writing a parser and writing a compiler are (because one task includes the other). Thus, the main ideas in compilers can be reused for real tasks such as data parsers, where they can solve problems more eloquently than ad hoc coding.
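
To make the "one task includes the other" point concrete, here is a minimal sketch, not taken from any of the projects mentioned: a recursive-descent parser whose methods mirror three recursive grammar rules, followed by a few lines that turn the resulting tree into code for a toy stack machine. The grammar, the instruction names and all function names are assumptions made up for illustration.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    # Hypothetical grammar; one method per rule:
    #   expr   := term (('+' | '-') term)*
    #   term   := factor (('*' | '/') factor)*
    #   factor := NUMBER | '(' expr ')'
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.take(), node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.take(), node, self.factor())
        return node

    def factor(self):
        tok = self.take()
        if tok == "(":
            node = self.expr()  # the grammar rule recurses here
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return tree

def compile_expr(node, out=None):
    """Structure to structure: walk the tree, emit stack-machine code."""
    out = [] if out is None else out
    if isinstance(node, int):
        out.append(("push", node))
    else:
        op, left, right = node
        compile_expr(left, out)
        compile_expr(right, out)
        out.append((op,))
    return out

def run(code):
    """Tiny interpreter for the emitted code (integer division for '/')."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a // b}
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[instr[0]](a, b))
    return stack.pop()
```

For example, `run(compile_expr(parse("1 + 2 * 3")))` evaluates to 7. Stopping after `parse` gives you a data parser; adding `compile_expr` gives you a (very small) compiler.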


Agreed, however there are plenty of languages one can use for bootstrapping. I don't see it as an advantage of C/C++, but rather as a sign of how lazy many lecturers are in how they build their curriculum.

This one here is a great alternative approach, especially to dispel the idea C or C++ are required to be anywhere in a compiler stack.


> especially to dispel the idea C or C++ are required to be anywhere in a compiler stack.

According to the sibling comment, the compiler is written in Java. The Java compiler is written in Java and runs in a JVM which is itself written in C/C++, and we all know how C compilers were created...


Almost there, then a wrong turn.

Java has multiple implementations, some of which happen to be fully done in Java, like JikesRVM, Maxine VM and Graal.

Naturally, you might say that you only care about OpenJDK, which incidentally now has Project Metropolis, with the long-term roadmap to use Graal and SubstrateVM to replace those C/C++ parts.

As for C, many probably aren't aware that BCPL was originally designed to bootstrap CPL, a memory-safe systems programming language based on Algol, which never took off due to mismanagement, while BCPL, bare-bones as it was, took on a life of its own.

Also, well into the mid-90s there were quite a few OSes where C was still far from making its mark, so using it to bootstrap a compiler was not even considered.


>Java has multiple implementations, some of them happen to be fully done in Java, like JikesRVM, MaximeVM and Graal.

Yes, but most of them, including Graal atm, are insignificant curiosities (as far as developers working with them are concerned. If they were a different language, they wouldn't even be in the TIOBE top 100, whereas Java in C/C++ is in the top 5).

So pointing them out mainly scores "well, actually" pedantic points rather than illuminating any real importance they have.


Java the programming language is in the TIOBE top 5, not OpenJDK.

Just like we don't have gcc, clang, msvc, whatever-cc on the TIOBE top 5.

Apparently mixing languages with implementations keeps being a thing.

Maybe we should also start arguing that C requires C++ to be implemented, given that all major mainstream C compilers are now written in C++.


>Java the programming language is in the TIOBE top 5, not OpenJDK.

Yes, I'm obviously familiar with the distinction between language and implementation.

My point is that regarding Java, it doesn't matter, as there's a single dominant implementation (well, 2 if you count Oracle and OpenJDK as different) and an insignificant assortment of others.

The concrete implementation of "Java the programming language" actually used by enough people to put it in the TIOBE index is the Oracle/OpenJDK one.

If we took all the other Java implementations mentioned, called them by a different name (e.g. Jakarta), and added their users together, the new "language" wouldn't even make the Top 100 in terms of user base.

(Google's Java-copied Android APIs aside of course -- those do have tons of users and matter)

>Apparently mixing languages with implementations keeps being a thing.

Apparently the distinction between language and implementation is pedantic in most cases, where the alternative implementations are niche products and hobby projects, and as far as one is concerned Java is 99.9% a specific implementation.


It is a little more than two.

OpenJDK, IBM J9, PTC, Aicas, Ricoh, Xerox, Gemalto, SAPVM, GraalVM (delivering your tweets)

Niche products that have been in business for as long as Java has been around, which is a couple of years now.

So they do matter, in an age where many developers don't want to pay for software while feeling entitled to be paid themselves.

Yes it is pedantic, which is why Linux happens to be written in GCC C, not C. Good luck compiling it with another ISO C 100% standards compliant compiler.

Regarding TIOBE, neither Ctrl+F Oracle nor Ctrl+F OpenJDK finds anything, just plain old Java, zero references to implementations.

https://www.tiobe.com/tiobe-index/programming-languages-defi...


I usually don't pay attention to names on HN (I usually answer blind or don't remember whether I've talked to someone before), with a few exceptions who have a distinctive style or comment very frequently. I can almost always tell when I'm reading something from you based on two things:

1) It will mention some experimental/research languages from the 60s-90s (if Oberon is mentioned, there's pjmlp for example).

2) It will mention some otherwise niche commercial compilers/IDEs.

I enjoy that you dug into all this history, and remind people of the legacy and innovations of all those research/minor/forgotten languages.

And that you also remind people that paying for software can help sustain some high quality products we don't have as FOSS at all (though it's a bitter pill to swallow when you're a student for example, or a programmer making much less than the US/EU devs somewhere else in the world, not a consultant where the customer pays for your software).

But, personally, I find that you have a tendency to not account for the niche factor, and all it means, talking about some products/languages/implementations as if they're an equally valid and even common choice with the stuff people actually use.

E.g. "Who said Java doesn't do X? There's the commercial JVM/compiler "Foobar" that does X!", yeah, only that has like 1000 users and costs $10K, and 99.999% of the people when they want Java to do X, they want the Java implementation they actually use to do it.

So my comment above was in that spirit...

>Regarding TIOBE, neither Ctrl+F Oracle nor Ctrl+F OpenJDK turn out to find anything, just plain old Java, zero references to implementations.

You missed my point: the "plain old Java" in TIOBE depends on programmers using particular implementations, not the abstract idea of Java or the Java specs. And those implementations are Oracle/OpenJDK by a far, far, far margin.

If you take the companies and programmers that use Oracle/OpenJDK out of the picture, and remove their Google searches, Stack Overflow questions, blog posts, conventions, job postings, etc., there wouldn't be any mention of Java on the TIOBE index, even if there would still be 20+ additional implementations.

Same way that if the Python community was reduced to just the PyPy users, Python would be nowhere in TIOBE -- even though TIOBE says "Python" and not "CPython".


If the Python community were reduced to PyPy, the TIOBE index would be exactly the same as it is now, because they would still be asking questions about the programming language named Python.

As for the rest, I guess I can only say thanks for the remarks and feedback.


>If Python community would be reduced to PyPy, the TIOBE index would be exactly the same as it is now, because they would still be asking questions about the programming language named Python.

You're proposing an alternate universe, where CPython didn't exist, and PyPy built more or less the same Python community up.

Which is plausible, but not relevant to the mechanics of my point.

My argument wasn't "what if we didn't have CPython, but PyPy from the start, then Python wouldn't be successful" (which is obviously wrong).

It was rather "a feature that's only in PyPy is niche and not really that relevant as far as Python is concerned, even if PyPy is a Python, since PyPy has an insignificant number of users -- so few that, if we removed the CPython users' impact and only kept PyPy's impact (as it stands, not going back to 1990 and replacing one with the other), Python wouldn't be in the TIOBE Top 100".


No, they seem to write the compiler in Java.

The thing with actually self-compiling is that you need language features such as file management, which are not trivial and would take time that we don't have in a compiler course to cover. There is already so much to be said about compilation, and generally the course must also cover assembly language and hardware architecture at least.


> The thing with actually self-compiling is that you need language features such as file management

You should be able to "just" use the C library if your compiler uses a C-compatible calling convention (which is a good idea anyway) and allows a C-compatible representation of strings (for file names) and characters (for fgetc). File management is then mostly just passing opaque FILE * pointers around. IIRC the compiler course I took at university did this.
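
A rough sketch of what that looks like from the calling side, with Python's `ctypes` standing in for a compiled language that follows the C calling convention. Assumptions: a POSIX-like system where `ctypes.CDLL(None)` exposes the C library, and `EOF` being -1 (true on the common libcs); `read_all_bytes` is a made-up helper name. The `FILE *` is never inspected, only handed back to the library.

```python
import ctypes

# Assumption: POSIX-like system where the process's own symbols include libc.
libc = ctypes.CDLL(None)

# Declare the C signatures; FILE * is treated as an opaque void *.
libc.fopen.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libc.fopen.restype = ctypes.c_void_p
libc.fgetc.argtypes = [ctypes.c_void_p]
libc.fgetc.restype = ctypes.c_int
libc.fclose.argtypes = [ctypes.c_void_p]
libc.fclose.restype = ctypes.c_int

EOF = -1  # assumption: value of EOF on this platform's libc

def read_all_bytes(path: bytes) -> bytes:
    """Read a whole file one character at a time via fgetc."""
    fp = libc.fopen(path, b"rb")
    if not fp:  # fopen returned NULL
        raise OSError("fopen failed")
    data = bytearray()
    try:
        while True:
            c = libc.fgetc(fp)
            if c == EOF:
                break
            data.append(c)
    finally:
        libc.fclose(fp)
    return bytes(data)
```

All the language needs is a byte-string representation that can be passed as `char *` and an integer wide enough for the return value of `fgetc`.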

That said, your parent's point is bogus: The act of writing a compiler in not-C is sufficient to drive home the point that a compiler is just another piece of software. On the contrary, writing a compiler in C because of self-hosting reinforces the misconception that compilers and C are somehow special.


"because of self-hosting"

Although I follow bootstrapping projects, I think the self-hosting focus itself is a bad idea. PreScheme, a systems language, was easier to parse and more productive than C, and compiled to C. The latter was to leverage C's optimizing compilers, but it showed that it was not necessary to give up Scheme's benefits and do it all in C.

Likewise, many of these compilers would iterate faster with more correctness if written in a higher-level language instead of being self-hosting just because. The libraries will already test the language in a variety of situations. So, that excuse doesn't fly for me if trading away productivity and/or maintainability of the better language for the compiler/interpreter.


A bit disappointing that the course isn't openly available in any way. It sounds really interesting.


The course website from last semester (when I took this course!) has just about everything except for lecture recordings and starter code for the project, and it’s still up: http://www-inst.eecs.berkeley.edu/~cs164/sp19/



Interesting, are there any other free, modern compiler courses?


University Of Washington CSEP501: Compiler Construction http://courses.cs.washington.edu/courses/csep501/

Keith Schwarz's Stanford compiler course CS143 http://www.keithschwarz.com/cs143/

Stanford's Engineering Compiler course on Lagunita https://lagunita.stanford.edu/courses/Engineering/Compilers/...

Matt Might's courses are self educational blueprints http://matt.might.net/teaching/compilers/

Book Introduction to Compilers and Language Design https://www3.nd.edu/~dthain/compilerbook/

I'm not a fan of uncurated link dumps, however: Awesome compilers link aggregation on Github https://github.com/aalhour/awesome-compilers


I really wish we had a compiler analog of "Operating Systems: Three Easy Pieces".


Something like "Compiler Construction in Oberon", "Let's Build a Compiler", "Turbo Pascal Internals"?

https://inf.ethz.ch/personal/wirth/CompilerConstruction/inde...

https://compilers.iecc.com/crenshaw/

http://turbopascal.org/


Various flavors of a course based on "An Incremental Approach to Compiler Construction" [1] have most or all materials free online, some with excellent notes. Taught at UCSD, Northeastern, Swarthmore College:

https://ucsd-cse131-s18.github.io https://course.ccs.neu.edu/cs4410/lec_let-and-stack_notes.ht... https://ucsd-progsys.github.io/131-web/lectures/05-cobra.htm...

(I designed the original version, though it's improved a lot in the past few years)

[1] http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf


If you don't dislike videos, then these "Foundations of Programming Languages" videos seem nice: https://www.youtube.com/channel/UCpSoGwyH5yHHvQut3x6c_2g/pla...


This looks cool, and if you click the "Compile to RISC-V" button you get to a page where a lot of the assembly code has meaningful comments explaining what each instruction does. I wish we had that in every compiler...

As for GC, if I read the code correctly it doesn't do any and simply aborts on out of memory. Fair for a first compiler course.
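
For anyone wondering what "no GC, abort on out of memory" amounts to, here is a minimal sketch of one common way to do it, a bump allocator. This is an assumption for illustration, not a reading of ChocoPy's actual runtime; `BumpHeap` and `OutOfMemory` are made-up names, and the exception stands in for the abort.

```python
class OutOfMemory(Exception):
    """Stands in for the runtime aborting the program."""

class BumpHeap:
    """Fixed-size heap: allocation only moves a pointer, nothing is freed."""

    def __init__(self, size_words):
        self.size_words = size_words
        self.next_free = 0  # index of the first unused word

    def alloc(self, n_words):
        if n_words <= 0:
            raise ValueError("allocation size must be positive")
        if self.next_free + n_words > self.size_words:
            # No collector to reclaim anything, so give up.
            raise OutOfMemory("heap exhausted")
        addr = self.next_free
        self.next_free += n_words
        return addr
```

With `heap = BumpHeap(8)`, `heap.alloc(3)` returns 0 and `heap.alloc(4)` returns 3; a further `heap.alloc(2)` raises `OutOfMemory` because only one word is left.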


> where a lot of the assembly code has meaningful comments explaining what each instruction does

Cutter (which is based on Radare2, basically a GUI for r2) can do that under the "Disassembly" tab! It works with executables and source code. You have to configure it to show the additional information though. It is such a neat tool!


Sorry, I was probably unclear. I imagine Cutter just tells you what each opcode does, no? gcc.godbolt.org does that too, with links to the ISA docs.

But here I mean that in the ChocoPy code you get explanations of what instructions do in the context of the program's semantics, i.e., how they relate to a higher-level view of what's going on. An example:

    .globl $print
    $print:
    # Function print
      lw a0, 0(sp)                             # Load arg
      beq a0, zero, print_6                    # None is an illegal argument
      lw t0, 0(a0)                             # Get type tag of arg
      li t1, 1                                 # Load type tag of `int`
      beq t0, t1, print_7                      # Go to print(int)
      li t1, 3                                 # Load type tag of `str`
      beq t0, t1, print_8                      # Go to print(str)
      li t1, 2                                 # Load type tag of `bool`
      beq t0, t1, print_9                      # Go to print(bool)
    print_6:                                   # Invalid argument
      li a0, 1                                 # Exit code for: Invalid argument
      la a1, const_4                           # Load error message as str
      addi a1, a1, @.__str__                   # Load address of attribute __str__
      j abort                                  # Abort

Note that the comments on the different occurrences of lw and li explain the meanings of the magic constants in the code. This would be pretty hard to do from disassembly alone.
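
Roughly the same dispatch written out in Python, as a sketch of what the listing above does. The tag values (1 = int, 3 = str, 2 = bool) and the exit code 1 come from the comments in the assembly; the list-based object layout, the names, and returning a string instead of printing are assumptions for illustration.

```python
# Type tags as given in the assembly comments above.
INT_TAG, BOOL_TAG, STR_TAG = 1, 2, 3

class InvalidArgument(Exception):
    exit_code = 1  # "Exit code for: Invalid argument" in the listing

def box(tag, payload):
    """Hypothetical object layout: word 0 is the type tag, then the payload."""
    return [tag, payload]

def runtime_print(arg):
    """Return the text print() would emit, dispatching on the type tag."""
    if arg is None:                # beq a0, zero, print_6
        raise InvalidArgument("Invalid argument")
    tag = arg[0]                   # lw t0, 0(a0)
    if tag == INT_TAG:
        return str(arg[1])
    if tag == STR_TAG:
        return arg[1]
    if tag == BOOL_TAG:
        return "True" if arg[1] else "False"
    # Any other tag falls through to print_6, as in the assembly.
    raise InvalidArgument("Invalid argument")
```

So `runtime_print(box(INT_TAG, 42))` gives `"42"`, while `runtime_print(None)` or an object with any other tag raises `InvalidArgument`.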


Ah sorry, at first I thought you meant something like: https://i.imgur.com/RZeFKZQ.png. I did not try ChocoPy out myself, so I had no idea what output you were referring to, which is my fault. In any case, I hope I introduced Cutter/Radare2 to someone at least. :D


Please make the course content available to the public.


A ChocoNim could also be an interesting choice: it would be closer to the metal and would open up the potential for self-compilation.


yeah but is it COOL



Not knowing anything about this, I thought this would be about C# :)

https://en.wikipedia.org/wiki/C_Sharp_(programming_language)...


What! The best ever Python name has been taken for a compiler course. Both my sons love chochopy and they will be devastated to know that chochopy isn't a sprite-based programming language for kids.


What? I don't know what "chochopy" is, but this project is called ChocoPy.


Choco-pie is a delicious Korean snack made with chocolate and marshmallows: http://www.trifood.com/chocopie.asp

You can find it at your local Korean market, and in Costco if you are in the SF Bay Area



