Hacker News new | past | comments | ask | show | jobs | submit login
Porting the Go compiler from C to Go (sourcegraph.com)
212 points by sqs on April 25, 2014 | hide | past | favorite | 93 comments

Author of the post here. Happy to answer any questions I can, and FYI, we (Sourcegraph) are liveblogging all of GopherCon at http://gophercon.sourcegraph.com. Let us know if you have any questions or find it useful!

You might want to clarify that you aren't Russ Cox and that this post is a summary of a presentation he gave.

Thanks for mentioning this. I initially included "(unofficial liveblog post)" in the HN title, but it apparently was edited out by a moderator. So, I added a clarification to the top of the post itself.

Are you at all concerned about the possibility of the Trusting Trust[1] problem manifesting in any of your Go compilers?

[1] http://cm.bell-labs.com/who/ken/trust.html

Since this is an exercise divorced from reality, the usual vehicle was FORTRAN.

Dang it I cheated and looked up Rus's quine. For anyone not wanting to cheat a good hint would be "tail recursion"

It was suggested long ago that having multiple compilers would make it incredibly difficult to play Trusting Trust games.

It's great that they are aiming for an automated conversion from C to Go. It's clear they aspire to convert their code which was written in a certain way. But I think it would be a huge boost in Go usage if they could eventually aim to transpile any C code into Go code.

A little dream of mine would be if in the future when Rust is stable Mozilla developed a transpiler from C++ to Rust. That would be brilliant.

By the way, all the other talks at GopherCon seem pretty interesting, I hope someone uploads videos of them soon.

It is unclear how one would compile arbitrary C code into useful Go. The stereotyped conventions of well-written compiler code allows for a more idiomatic translation than a general translator could ever aspire to.

C++ to Rust would be even crazier.

(Transpile is a silly word. It's "compile". Compiling already trans-es.)

I know it's crazy difficult to do it, that's why I was talking about dreams.

But maybe a set of guidelines about translatable C code plus a translator which is good enough for a variety of cases would make refactoring the resulting code such a manageable task that many projects could consider switching to Go directly.

You are technically correct in that compile already implicates a translation, but we usually use that term refering to a translation into a lower level language and not another language at the same level. You can say transpiler or source-to-source compiler for those cases and I think it's clearer and more accurate for the reader.

> ... but we usually use that term refering to a translation into a lower level language and not another language at the same level.

Rust is not lower-level than C++ and neither is Go lower-level than C.

Exactly. Hence "transpiler".

What part of his definition didn't you understand? He said a transpiler trasnlates from a higher-level language to a lower-level language. We are talking about translating from C -> Go and C++ -> Rust both of which are in the opposite direction (i.e., low -> high).

> He said a transpiler trasnlates from a higher-level language to a lower-level language

No, he said a compiler translates from a higher-level language to a lower-level language:

"You are technically correct in that compile already implicates a translation, but we usually use that term [= compile] refering to a translation into a lower level language"

Ahh, my bad.

This should not have been down-voted. We are talking about translation from C -> Go and C++ -> Rust (which are both in the opposite direction of the definition given for 'transpiling').

I agree that your comment added value to the discussion, and gave a counter-vote.

However, arguably the term could make more sense there, if we ignore the earlier definition and instead assume "trans" is short for "transcendent", as in "climbing to a higher level".

If we then simplify that to "compile from one language to another of roughly equal or higher level", it becomes a useful word to indicate a specific subset of ways one can do compilation.

Of course, I'm completely pulling this out of my ass and I might piss off actual CompSci people who use very specific and strict definitions in their papers (kind of like how some get annoyed with the procedure/function mixup), so take this with a grain of salt.

> You are technically correct in that compile already implicates a translation, but we usually use that term [ie compile] refering to a translation into a lower level language and not another language at the same level.

The way I understood what he wrote, he was making the same point as you. Presumably the people downvoting you had a similar interpretation of what he wrote.

Good automated translations involve matching and translating the project idioms. This results in code that fits the target language's language-idioms. Automated translation only at the level of language constructs results in odd looking generated code.

The use case is to bootstrap the conversion from C to Go, if one has made a decision to do so. You pick a cutoff time, and say “now were going to do our work in Go”.

What comes out the other end, as Russ phrases it, is a C program in Go syntax. The next phase of work is to turn it into “Go” as a human might write it.

The translated program is not meant for deployment, really. It’s to give the humans a starting point.

"a transpiler from C++ to Rust"

I think the best approach for something like this is a disassembler for LLVM IR. I am not extremely familiar with the LLVM IR, but I figure there are common patterns or debug symbols clang produces that could let you get a decent idea of what the higher level gave you. If not targeted to clang output too much, you can then transpile any LLVM-able language to Rust.

I think, but am not sure, that the problem with Rust being a target for these other languages (my pipe dream is to write a JVM in it w/ a JIT using the bootstrapped rustc lib) is that you have to use unsafe code all the time (unless there's 0 performance penalty for ARC) because the original code is not written with ownership/borrowing semantics.

I think there are probably lots of chunks of code that would be perfectly valid under Rust's ownership semantics and could therefore be transpiled safely. Figuring out which chunks are and aren't would probably be quite difficult though.

One approach might be to just convert to Rust, and let the Rust compiler tell you what things don't type check.

Reading https://docs.google.com/document/d/1P3BLR31VA8cvLJLfMibSuTdw..., I don't see that they are aiming for an automated conversion; they are aiming for an automation-assisted conversion.

They want to go quite a step up from doing some global regex replaces, but in the end, step 3 has "cleaning up and documenting the code, and adding unit tests as appropriate.". I don't think they aim to automate anything there.

It has been stated that there will be videos available, but I'd not expect them until next week sometime (when the organizers get home and have recuperated).

Automated rewrite FTW! This can help you avoid freezing a project while it is being ported. Also, if you have a code base with its own idioms, then those idioms can be matched and translated, which can produce cleaner target code.

Go 1.0 spec is already frozen and will not change any time soon.

I believe he referred mostyl to freezing the efforts on impoving the implementation.

I was also talking about projects and porting in general. Incidentally, I've been paid to do exactly what we're discussing. (Porting a production program using automated translation.) It works.

Implementing the compiler in Go becomes the effort to improve the implementation. Obvious, no?

Russ' paper on the subject from Dec 2013


I'm reminded of the fortran to c compiler f2c that was used to produce pure c implementations of lots of libraries like LAPACK.

I'm curious how they will handle things like pointer arithmetic and memory safety in C vs Go. If they mange to do so in a performant way I could see translating lots of numerical or computationally intensive code to Go so that it could be run in a shared cloud environment without worries about memory safety and without having to resort to vms for separation.

If you took arbitrary fortran, especially with i/o, the results were an unreadable mess - but they compiled and ran.

For a project I had (in the early 90s iirc), I had to extensively modify the fortran and make multiple versions before the generated C was readable. Still, much easier than a rewrite.

> There are currently 1032 goto statements in the Go compiler jumping to 241 labels.

Wow, that's really striking. I know that goto statements have their uses but for something written in the last couple of years to have over a thousand of them is very surprising (it might not be surprising at all for those who write C code all the time). I guess they're mostly just for error handlers?

As others have pointed out, this is actually idiomatic C. You have a portion of your function below a label which is "clean up everything and return this variable (usually error status)." Then at any point where you have an error, you goto that label and everything is released and the error gets returned. It's sort of a manual 'defer' pattern.

"goto considered harmful" is taken out of context and bemoaned mostly by people unfamiliar with idiomatic C, I think. (Edit: sorry, I mean just for C. For other languages, e.g. Go, there may be more appropriate patterns like the 'defer' keyword.)

I'm not surprised to see them in a compiler. When you're parsing and want to treat it as a big state machine, gotos are your best friend.

Encoding state implicitly in control flow tends to give you nicer and more natural code. A big state machine implemented with gotos can be tough to read if it's also a parser (which may need to build a tree, recurse, etc.). A small hand-rolled lexer might be reasonable to implement with goto. Although I would prefer to enumerate the states and use a switch.

If you're writing a recursive descent parser for C while keeping minimal lookahead, I can think of a place or two where goto might really help. One is to deal with casts, compound literals and subexpressions. They all begin with a left-parenthesis, so you can't tell what you're dealing with until a bit further. Since these all appear in the parsing of simple expressions (identifiers, constants, unary operators), the code can go in one function and you can jump to the right place with goto as soon as you know what you are dealing with.

I did not look in the Go code but C programmers mostly use goto for error paths.

Except in languages with tail recursion such as Scheme! A state ends with a call to the next state instead of a goto next state.

For example: http://golang.org/src/cmd/6g/gsubr.c

Some are for single error return, but others are just to make a nice structure for the code.

I dunno, a few of them seem hard to justify. The one on line 475 could just be an "else" branch, frex, and the ones in the loop on 327 could as easily be continues.

Only one goto in the loop could be a "continue", the one in the nested loop cannot. The one that could be is most likely a goto just for consistency.

When the full video of the talk comes out, you will see that Russ agrees with you. He lists the types of uses of goto. Various are used to make loops or elses, and he says those should not have occurred.

I would be curious to hear the justification behind the implementation of angryrealloc, the "goto ok" is strictly equivalent to a "continue" as far as I know. Maybe it triggers an optimization with a certain compiler.

It's to continue the outer loop from the inner loop.

Ah the old C style of function types in a different line.

This smells of very old code (as the header testifies)

??? This is modern BSD (and probably plan9) style. Here's a line from the header:

> Portions Copyright © 2009 The Go Authors. All rights reserved

That's pretty recent.

And here's BSD style(9): http://www.freebsd.org/cgi/man.cgi?query=style&sektion=9

    The function type should be on a line by itself preceding the function.

I use that style all the time. I like having the function name start in column 1, it can make searching easier.

I do wish that in C1x (for x > 1), they could find a way to let us declare multiple formal arguments of the same type without repeating the type:

    my_graphics_hack(float x, y, z, r, g, b, u, v) { ...
instead of

    float my_graphics_hack(float x, float y, float z, float r, float g, float b, float u, float v) { ...
which gets really tedious and obfuscates that fact that they are all the same type, especially with the typename is more complex than just "float". When the new syntax was added to ANSI C, it wasn't possible to do multiple variables, for reasons I've forgotten but which I think had to do with forward type references. It would be awfully nice to find a fix for this.

That's a poor justification, there are plenty of tools to index and search C(++) code like ctags or cscope, and your example could use a vector or a struct, I have a hard time imagining when a function with such a prototype would be useful in real life.

Compound literals are not too pretty and not too well known. Plus, if you care about c89... So, do you wrap all your values in a struct because the prototype gets too long otherwise? What a chore.

Preferring that function names start at the first column to make searching easier is a perfectly good justification.

You, on the other hand, seemingly would impose your tools (which have their shortcomings) and workflow on people with no justification at all.

Just because a tool exists, doesn't mean it's good (or better than what people are accustomed to) or that everyone must use it.

Do you think everyone should use Vim on Linux too?

No, I think everyone should use Emacs on FreeBSD, but that's not my point.

Adding syntactic sugar to a language makes the language bigger and harder to completely understand, it makes it easier to misuse a feature (and C already has a lot of trickery with types, like when you declare a parameter as an array but it behaves like a pointer). I'm a strong believer that implicit is better than explicit and that while there are many ways to do the same thing some are better than other in practice. Of course, "in practice" can change from one project to the other, what matters in consistency. I applaud the choice of Go to standardize on a single coding style for instance.

For the particular example of the parent, while I write a lot of C code for work and for fun I have yet to encounter a situation where I saw a function taking 10 floats as arguments and thinking "yup, that's completely the right way to do that". If you have an example of such a code I'd me more that willing to reconsider my position, otherwise we're just talking about the best way to tame a unicorn.

structs are a good way to hide away all that state

Agreed, but there are still times when I end up having several arguments of the same type, and I see no virtue in having to repeat the type for each formal parameter.

I don't see how having the type makes it easier to grep, grep finds the function name regardless

You can also search for something like "\w+\s(.?)\s*{"

>I don't see how having the type makes it easier to grep, grep finds the function name regardless

It's not about finding the function name in general, it's about finding where the function is actually declared -- not in the call sites.

>You can also search for something like "\w+\s(.?)\s{"*

Yeah, or use ctags. That's why they said "simpler".

"^foo" beats that.

Yeah, ctags is a good choice

> "^foo" beats that

because it's so hard to use ctags...

So how do you use ctags with grep?

It needs setup. I'd rather just grep.

BSD/Plan 9 style is extremely useful because it allows grepping for functions. More specifically, it allows the creation of acme(1) addresses of the form /some/file.c:/^foo where foo is a function. In acme you can just B3-click on that string and it will open the file at the correct function...

It is also a pretty common style in the open source world.

You can do that with ctags

> It is also a pretty common style in the open source world.

Depends, one proeminent C project doesn't use that: the linux kernel

>You can do that with ctags

And I can do it without ctags. That's the point.

>Depends, one proeminent C project doesn't use that: the linux kernel

That doesn't make his statement "depends" at all. A single project not using that style does not contradict it being "pretty common".

this is a style i still use today (almost)

    (int argc, char **argv)
although most people think i'm weird for it..

Yep, that's pretty weird. It harks back to the early days of C. When I first saw the currently recommended form:

    int main(int argc, char **argv) {
I thought it was nice but not obviously advantageous, but over the years I've come around to thinking it makes code easier to follow. Even many bash shell scripting guides now recommend a similar form:

    for x in y; do
      # body

it certainly helps when your function isn't as simple as main..!

it is so you can type

    grep -n ^functionname *.c 
and get the line that starts the definition

which line?

Are you sure you aren't confusing it with K&R style with the argument names on a different line? That would indicate old code. Having the return type on a different line is incredibly common, virtually every C project with decent code uses it.

>but for something written in the last couple of years to have over a thousand of them is very surprising

Hardly surprising. Gotos will be found by the hundrends or thousands in C projects like compilers, kernels etc.

It's used for localized consolidated error handling, but also for stuff like parsing code (bison generates tons of gotos IIRC), state machines, etc.

If anything it's the old "Goto statement considered harmful" that's a little too naive.

>If anything it's the old "Goto statement considered harmful" that's a little too naive.

No, people who didn't read it and just repeat the title without knowing what he was talking about may be too naive, but the original point was not.

Yeah, there are a lot of them in e.g. Linux kernel code for jumping to end-of-function cleanup. I suspect that in the process of translation to Go most will be replaced by the defer keyword.

I saw one goto that could cleanly be replaced with an else (goto fp).

I don't think the authors were very goto-averse.

This might become a nice benchmark for the Go language; the same code base implemented in C and Go! It may not be fine-tuned for optimisation, neither in C nor in Go, but may still give a good ball-park estimate.

Indeed, but it would be much more useful and impressive if they rewrote the compiler in Go by hand. I'm disappointed that they are aiming for an automatic translation instead - some people are going to ask themselves whether Go actually isn't that much fun to program in. An independent reimplementation is better for correctness too, they can compare outputs and find bugs in both implementations instead of porting over old bugs and adding new bugs where the translation goes wrong.

I'm really thankful for the liveblogging, as I couldn't manage to get my body to the conference.

I understand the desire to promote the Sourcegraph app by doing the blogging, and I think its effective. However, the blog is real annoying to browse, as every (prominent?) link points to Sourcegraph the app instead of the blog.

Just to clarify, because I think my original comment is disbalanced.

I'm REALLY thankful for sourcegraph's liveblogging. The above comment was a suggestion as I thought they might want to know that (at least for me), the navigation of the blog was confusing.

Thanks! We are having a fun time liveblogging and are glad you're finding it useful. We got complaints when the blog image did NOT link to Sourcegraph, too. :) It's almost the end of this conference, but next time we liveblog, we'll have 2 separate header images, or try something else to make it less confusing.

"They’re deciding to automatically convert the Go compiler written in C to Go, because writing from scratch would be too much hassle."

When transcribing a talk, there isn't any need to write "They're". Just use the same pronoun the presenter used, otherwise it stands out like a sore thumb.

3) Go has turned out to be a nice general purpose language and the compiler won’t be an outsize influence on the language design.

In what sort of ways does self-hosting early influence a language design? Were they hoping to avoid something in particular by delaying self-hosting?

The general way that self hosting influences language design is that the compiler is often one of the first major projects to be built using a language. This does not give it more influence then other major projects, but if your goal is to have a language designed for use case X, it is generally best to have your early projects with it be for X. Additionally, self hosting may encourage a language design that makes bootstrapping easier (such as a stricter divide between the state-1 language and the general language).

I sort of had the general sense of that already. I guess I was hoping for something more specific about what language features are so useful when writing a compiler but get in the way for general problems.

The part I quoted above almost sounds like wiping sweat from one's brow after having dodged a bullet: "phew, the language is safe from influence by those gull-durn compiler-writers..."

To expand a bit on your last point, self-hosting helps answer the question "can I build Real Software (tm) with this language, or is it really difficult?" earlier in the lifetime of the language.

"Note: There’s a book written about converting goto code to code without goto in general, but this is a sledgehammer and not necessary here."

Anyone have any idea what the title of that book is?

  >A Union is like a struct, but you’re only supposed to use one value
  >(they all occupy the same space in memory). It’s up to the programmer to know which variable to use.
  >  There’s a joke in some of the original C code:
  >      #define struct union /* Great space saver */
  >  This inspired a solution:
  >      #define union struct /* keeps code correct, just wastes some space */
Somewhere in Scotland, a sum type sheds a single tear.

  >      #define union struct /* keeps code correct, just wastes some space */
Not always, though...

Yes, always. And if you don't believe me, it's not my trick. I learned it from Dennis Ritchie (he was thinking about a C to Limbo converter).

It works when unions are used correctly. If they are used to deliberately subvert the type system (e.g., put in an int, get an array of char out) it doesn't work. But C has so many other ways to subvert the type system that there's no need to do that.

Probably about the only place you'll see them used like that is in the code produced by web2c in compiling TeX. Knuth used variant records a lot to get around Pascal's type safety, and they get translated to unions.

That code is not valid according to the C standard, so there is no guarantee it will work anywhere. In particular, many modern compilers have optimizations that would break that code. I would be a little surprised if modern web2c still uses unions this way and gets away with it.

The only standard compliant way to, say, convert a float to an int is to use memmove:

  uint32 i;
  float32 f;
  i = 0x80000000;
  memmove(&f, &i, 4);

I finally managed to track down the definition of memoryword from texlive-20130530-source/texk/web2c/texmfmem.h. Here it is:

        typedef union
	#ifdef TeX
	  glueratio gr;
	  twohalves hh;
	  twohalves hhfield;
	#ifdef XeTeX
	  voidpointer ptr;
	  integer cint;
	  fourquarters qqqq;
	#else /* not WORDS_BIGENDIAN */
	#if defined (TeX) && !defined (SMALLTeX) || defined (MF) && !defined (SMALLMF) || defined (MP) && !defined (SMALLMP)
		halfword junk;
	#endif /* big {TeX,MF,MP} */
		integer CINT;
	  } u;

	#ifndef XeTeX
	#if defined (TeX) && !defined (SMALLTeX) || defined (MF) && !defined (SMALLMF) || defined (MP) && !defined (SMALLMP)
		halfword junk;
	#endif /* big {TeX,MF,MP} */
		fourquarters QQQQ;
	  } v;
	#endif /* not WORDS_BIGENDIAN */
	} memoryword;
Once you sort through all the ifdefs and typedefs, it boils down to a union of a float (glueratio), an int32 (integer), a pair of int16s (twohalves), and a struct of 4 bytes (fourquarters). When TeX creates its memory dump file, it does it all in terms of the bytes in the fourquarters struct. (And I think there are other times that it does similar things.)

This may very well be undefined behavior, but it seems to work when compiled with GCC anyway. Thankfully gc isn't such crazy code as this.

C99 seems to suggest that non-overlapping union members could break conforming code:

``One special guarantee is made in order to simplify the use of unions: if a union contains several structures that share a common initial sequence (see below), and if the union object currently contains one of these structures, it is permitted to inspect the common initial part of any of them anywhere that a declaration of the complete type of the union is visible. Two structures share a common initial sequence if corresponding members have compatible types (and, for bit-fields, the same widths) for a sequence of one or more initial members.''

Which part of the C standard would forbid type punning a uint32 to a float32? http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm suggests that C89 explicitly allowed this, C99 at first glance did not, and was errata'ed to make it clear that this is still allowed. C11 seems to have the same verbiage.

Devil's advocate example:

  union hack { int x; float y};

  int foo( int *i, float *f)
     *i = 3;
     *f = 42.0;
     return *i;
Aliasing rules say foo need not read from that int pointer, may switch the order of the write to f and the write to i, and can assume that foo returns 3, so

  struct hack h;

  foo( &h.i, &h.f);
might return 3 or something else. I think that last call introduces undefined behavior, but only becuase of the definition of foo that the writer of that call might not even have the source for.

But of course, that is an "you shouldn't do that" edge case. One could also claim that the corrigendum doesn't apply because foo doesn't "use a member to access the contents of a union".

Sure, but that's an aliasing issue, not a type punning issue. I agree that the code you cite there is a violation of the aliasing rules, and will not work "as expected" on modern compilers unless one does the equivalent of gcc's -fno-strict-aliasing.

And I agree that the corrigendum doesn't apply in this case. Once you hand different-typed pointers to the same memory to people, whether via union or just casting pointers, the aliasing rules will up and bite you.

That's why I said "not always". Here is an example which will print different result if you define

  #define union struct
Here you go:

  #include <stdio.h>
  union U { char a; char b; };

  int main() {
    U u;
    u.a = 13;
    u.b = 10;
    printf("%d\n", int(u.a));
    return 0;

I don't think a and b need necessarily overlap. I think the types need to be more like this:

    struct A {char a;};
    struct B {char a;};
    union U {struct A a; struct B b;};

It would break my current usage of unions.

    typedef union
      struct sockaddr     sa;
      struct sockaddr_in  sin;
      struct sockaddr_in6 sin6;
    } addr__u;
as a way to work around the BSD socket API without horrible casting all over the place. Change the "union" to a "struct" and the code no longer works. Is it subverting the type system? Yes. Is casting subverting the type system. Yes. Pick your poison.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact