Why are my Go executable files so large? (cockroachlabs.com)
468 points by caiobegotti on Dec 29, 2019 | 330 comments

> prior to 1.2, the Go linker was emitting a compressed line table, and the program would decompress it upon initialization at run-time. in Go 1.2, a decision was made to pre-expand the line table in the executable file into its final format suitable for direct use at run-time, without an additional decompression step.

This is a good choice I think and the author of the article missed the most important point - it uses less memory to have an uncompressed table.

This sounds paradoxical but if a table has to be expanded at runtime then it has to be loaded into memory.

However if a table is part of the executable, the OS won't even load it into memory unless it is used and will only page the bits into memory that are used.

You see the same effect when you compress a binary with UPX (for example) - the size on disk gets smaller, but because the entire executable is decompressed into RAM rather than demand paged in, then it uses more memory.
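A minimal Go sketch of the file-backed case described above, assuming a Unix-like system where `syscall.Mmap` is available. The helper name `readViaMapping` and the table contents are made up for illustration. The point is only that a mapped table is paged in on access rather than expanded into heap memory up front.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// readViaMapping writes an n-byte table to a temporary file, maps that file
// read-only, and returns the byte at the given offset. Byte i of the table
// holds i % 251. The name and the contents are illustrative only.
func readViaMapping(n, offset int) (byte, error) {
	if n <= 0 || offset < 0 || offset >= n {
		return 0, fmt.Errorf("offset %d out of range for table of %d bytes", offset, n)
	}
	f, err := os.CreateTemp("", "table-*.bin")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// Set-up only: produce a file that stands in for a table in an executable.
	table := make([]byte, n)
	for i := range table {
		table[i] = byte(i % 251)
	}
	if _, err := f.Write(table); err != nil {
		return 0, err
	}

	// Map the file instead of reading it into a heap buffer. The kernel
	// fills in pages lazily on first access and can drop clean pages again
	// without needing swap, because the file itself is the backing store.
	mapped, err := syscall.Mmap(int(f.Fd()), 0, n, syscall.PROT_READ, syscall.MAP_PRIVATE)
	if err != nil {
		return 0, err
	}
	defer syscall.Munmap(mapped)

	// Touching one byte faults in the page that contains it.
	return mapped[offset], nil
}

func main() {
	b, err := readViaMapping(1<<20, 123456)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("byte at offset 123456:", b)
}
```

A table that must be decompressed at start-up behaves like the `make([]byte, n)` buffer in the set-up step instead: it is anonymous memory that has to stay resident or go to swap.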

If you decompress it to an mmaped file it'll be one of the first things written to disk under memory pressure anyways and instantly available in normal situations.

With the ever-decreasing cost of flash and its ever-increasing speed relative to the CPU, compression is not really worth what it was for startup times 10 years ago, though.

Swapping is fine for workstations and home computers. But high performance machines running in production environments will absolutely have swap disabled.

The performance difference between RAM and disk is not an acceptable tradeoff. RAM will be tightly provisioned and jobs will be killed rather than letting the machine OOM.

Memory-mapped files don't go to swap under pressure; they go to the file they were mapped to. That is no different from the OS knowing your debug data is in the file but not yet loaded into memory, which is the only other scenario here.

Are you arguing you prefer your production jobs to be killed rather than slowed ?

Generally yes. Swapping changes the perf characteristics of that process (and often any other process on the same machine) in unpredictable ways. It's better to have predictable process termination -- with instrumentation describing what went wrong, so capacity planning and resource quotas can be updated. The process failure would generally be compensated-for at a higher level, anyway.

> “process failure would generally be compensated-for at a higher level”

I don’t quite get it. What does it mean? You re-run the process at another time? On another machine?

I personally do have swap on all my production machines. I’m not relying on it, but it acts as a safety net in case unexpected memory usage sneaks into some jobs. This goes with a sane periodic review of host metrics of course, to ensure safety nets do not become the norm.

I much prefer a slightly delayed job than a failed job.

I guess it depends on the complexity of your distributed system (assuming you’re operating one).

We prefer to have job OOM kill and get retried elsewhere (which could be a completely different machine) and we have plenty of infrastructure that makes this trivial. This infrastructure also deals with other types of partial failure, such as complete machine failure.

As mentioned above, paging introduces strange perf behaviour. That may not always be important, but if you're working under tight latency requirements then paging can push you over that boundary.

That sounds strange when we’re happy to see jobs die entirely (that screws latency). But the issue with paging is you have no idea when it’s gonna hit you, and it may impact a job that’s behaving perfectly fine, except something got paged out by a badly behaving job.

Ultimately disabling paging is a really good tool for limiting the blast radius of bad behaviour, and making cause-and-effect highly correlated (oh look thing X just OOMed, probably means thing X consumed too much memory. Rather than thing Y has strange tail latency because thing X keeps consuming too much memory). It’s failing fast, but for memory rather than exceptions.

I get your point, and I think we agree here. I mainly wanted to argue against the original statement, which was quite broad: “high performance machines running in production environments will absolutely have swap disabled”. Would you agree to rephrase both our arguments as:

- paging incurs seemingly random performance degradation of processes and should be avoided

- if you have a form of task queue/job distribution system which handles automatic re-run, and can afford at no business cost to restart a process from scratch, then disabling swapping allows fail fast behaviour

- otherwise swapping can be used as a safety net for programs that would be better off slightly late than restarted from scratch

- both scenarios require sane monitoring of process behaviours, to catch symptomatic failures/restart in case 1) and recurring swap usage in case 2)

> if you're working under tight latency requirements then paging can push you over that boundary

Even if you're not under tight requirements, swap can do strange things. I've actually seen situations where hitting swap, even trivially, can cause massive increases in latency.

I'm talking about jobs which took 10s of milliseconds to complete now taking multiple 10s of seconds.

I've even seen some absurdly bad memory management where Linux will make very very poor choices about what to page out.

> Ultimately disabling paging is a really good tool for limiting the blast radius of bad behaviour

1000% agreed. Fail fast rather than fail slowly.

>> “process failure would generally be compensated-for at a higher level”

> I don’t quite get it. What does it mean? You re-run the process at another time? On another machine?

I expect the commenter means whatever thing kicked off this job will detect it failed and do something about it (try it again elsewhere, log an alert triggering an admin to go provision more capacity, etc.)

Their view is that a failure is easier to troubleshoot and fix than success with intermittently anomalous characteristics.

Got it.

To be honest, what I really wanted here is to force the commenter to agree that whatever the architecture/framework/etc., there is no magical solution, and saying that it will "be compensated-for at a higher level" just hides the only 2 real possibilities:

- Kill the process and restart it somewhere else (no swap)

- Wait for the process to finish anyway (using swap)

Both choices have their own pros and cons, as discussed in other comments. But arguments such as "it's better to have predictable process termination", and "under tight latency requirements then paging can push you over that boundary" are exposed as much less meaningful once the problem is stated in simple terms.

If the process indeed uses more memory than planned, then you have no guarantee that restarting it will work this time. Worst case, you may even have to urgently allocate a new node to handle that 1MB of memory above what was available. Not sure that is really the best solution "under tight latency requirements".

My experience would indeed be the reverse: if you are faced with unexpected memory usage growth, then having some swap on the side to allow your process to finish makes for a much smoother production system than having to restart it all from scratch somewhere else/at another time.

In my experience, production is not the place where you want to enforce ideological concepts such as "a process shall not use more memory than what was planned". Production is the place where you make compromises and account for everything that slipped through these concepts, because you have no choice: it _has_ to run. Staging would be a good place for this though, as it's much less critical if something fails, and you have more time to fix and account for it.

I really don't want to debug/profile/release/deploy a program in a hurry just because somehow it ended up using more memory than planned. (or convince infrastructure to deploy a new node type ASAP to handle that new memory constraint).

Now one could argue that such unplanned memory consumption should have been caught earlier in the development process, to which I would reply:

- Not everyone is able to properly test programs in a production-like environment, especially when dealing with resource-hungry processes. In our case for instance a typical staging machine would be a downscaled version of the production one (fewer cores, less memory, less bandwidth).

- I tend to plan _also_ for what should _not_ happen (because it will happen). Even with a perfect environment to calibrate programs before a release, there will be releases in which the memory usage was not accounted for. Guaranteed.

I’m guessing this is a disagreement on the meaning of “reliable,” and there is room to disagree on that. From my perspective, if your failure model is “it has to run” then you don’t really have a failure model and are just hoping for the best. If you have the resources to improve on that approach and formalize expected failure, then OOM logs are another tool that can give you metrics from production while failing operationally. The real benefit is ease of debugging.

I have `vm.swappiness = 1` for that. Who would want a job to be OOM killed?

For those of us who build distributed systems, it’s all about the number of modes we need to design for, test, and monitor. If all I have to worry about is process death, I can design my service around that. Monitoring for process death is generally pretty straightforward as well.

Gray failures (like process slowdowns), on the other hand, are fairly difficult to design for and to detect. And they can wreak havoc on distributed systems if one of the nodes suffers a gray failure.

It depends on the memory profile of that particular job of course, and the overall latency/throughput requirements of the application. Swap memory being as slow as it is, I've experienced few jobs that can reasonably handle it.

When slowed is orders of magnitude slower -- then they are effectively being slowly choked to death anyway.

Virtual memory works even without swap enabled. Since the mapped file is the binary, and code is never changed after loading, the OS can simply reclaim the pages backing that memory. When there is a page fault, they will be brought back in from the file.

Yeah, but why go through all that trouble when you can just add more memory? It's not 1996, machines can have more than 1GB of RAM

You haven't added memory yet because you didn't know your memory requirements before running your prod workload.

I don't buy the "never swap" school of thought.

Sure, swap is bad on steady state loads. But for transient loads it works

Unless you like playing "why did my service get killed this time" every once in a while

> Unless you like playing "why did my service get killed this time" every once in a while

There is no playing: the error is logged and quite clear, by the application or by the kernel.

Being consistent, predictable and as deterministic as possible is a key feature of reliable software, and paging into swap kills all that.

Reliable software used swapping for years. Artificially having a low virtual memory ceiling by disabling swap is just austerity for almost no benefit.

Paging into swap often has almost no perceptible effect in regular usage unless you are running software with microsecond guarantees on infrequently accessed processes (those that get swapped out).

In 'modern' microservice architectures, this is not true. If you look at the k8s approach, reliable software is created through redundancy. A part of the app being killed by the OOM shouldn't matter, it should automatically be rescheduled on another node.

Kubernetes as a platform recommends disabling swap completely, and you have to explicitly allow nodes to have swap, otherwise it fails.

This is sane behaviour if you're dealing with a large cluster with a complex architecture that no single person could or should know all the ins and outs of. There is no "let's log on to the machine and see what's happening" when dealing with these types of architectures, even at smaller scale.

And a massive part of Go's target is exactly these modern architectures/workloads...

Yes, in k8s it's different and I agree swap should be avoided in that case

But still, you could end up with a case where your pods start using a little bit more memory (for a while) and getting killed by OOM

On Linux there is a best-of-both-worlds approach. You can use zram for swap and a userspace OOM killer, like earlyoom. If processes start using a bit more memory or leaking, they won't get killed and nothing will slow down much, but they will get killed if things go far enough to cause performance problems.

I think you're confusing memory mapped files with swap. They may be sibling technologies but they're not the same thing at all.

No, there's no need for a tradeoff at all in a properly implemented compiler. You can certainly compress a table in memory and still have that table be indexable at runtime. Entropy coding is not the only type of compression that exists. DWARF solved this a decade ago.
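A rough Go sketch of the idea that a table can be compressed and still indexable: program-counter and line values are stored as varint deltas, and a small checkpoint index allows a binary search followed by a short scan. This is not the DWARF line-program encoding and not Go's pclntab layout; the types, names and `blockSize` are assumptions made up for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// entry maps the start of a PC range to a source line (hypothetical layout).
type entry struct {
	pc   uint64
	line int64
}

// checkpoint stores one entry uncompressed plus the offset of the deltas
// that follow it, so a lookup never decodes more than one block.
type checkpoint struct {
	pc     uint64
	line   int64
	offset int
}

type table struct {
	data        []byte // varint-encoded (pc delta, line delta) pairs
	checkpoints []checkpoint
}

const blockSize = 16 // entries per checkpoint; an arbitrary choice

// build encodes entries, which must be sorted by pc.
func build(entries []entry) *table {
	t := &table{}
	var prev entry
	for i, e := range entries {
		if i%blockSize == 0 {
			t.checkpoints = append(t.checkpoints, checkpoint{e.pc, e.line, len(t.data)})
		} else {
			t.data = binary.AppendUvarint(t.data, e.pc-prev.pc)
			t.data = binary.AppendVarint(t.data, e.line-prev.line)
		}
		prev = e
	}
	return t
}

// lookup returns the line of the last entry whose pc is <= the query.
func (t *table) lookup(pc uint64) (int64, bool) {
	// Binary search for the last checkpoint at or before pc.
	i := sort.Search(len(t.checkpoints), func(i int) bool {
		return t.checkpoints[i].pc > pc
	}) - 1
	if i < 0 {
		return 0, false
	}
	cp := t.checkpoints[i]
	end := len(t.data)
	if i+1 < len(t.checkpoints) {
		end = t.checkpoints[i+1].offset
	}
	curPC, line := cp.pc, cp.line
	buf := t.data[cp.offset:end]
	for len(buf) > 0 {
		dpc, n := binary.Uvarint(buf)
		if n <= 0 {
			break
		}
		dline, m := binary.Varint(buf[n:])
		if m <= 0 {
			break
		}
		buf = buf[n+m:]
		if curPC+dpc > pc {
			break
		}
		curPC += dpc
		line += dline
	}
	return line, true
}

func main() {
	var entries []entry
	for i := 0; i < 1000; i++ {
		entries = append(entries, entry{pc: 0x1000 + uint64(i)*8, line: int64(10 + i%7)})
	}
	t := build(entries)
	line, ok := t.lookup(0x1000 + 8*500)
	fmt.Println("line:", line, "found:", ok)
	fmt.Println("encoded bytes:", len(t.data), "raw bytes:", 16*len(entries))
}
```

The trade-off is the usual one: a larger `blockSize` shrinks the index but lengthens the scan per lookup.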

I thought Go did use DWARF for debugging info. Your remark sounds as if some easy optimization was ignored. Can you please elaborate?

This "pclntab" format duplicates what DWARF does, but is not DWARF: https://golang.org/src/debug/gosym/pclntab.go

Is it just me, or should something like runtime.pclntab not be included in production builds at all?

I mean, it makes all sense while you're developing and testing, but it should be reasonably possible to strip it out from production build binaries, and instead put it in a separate file so that if you do get a crash with a stack trace, then some external script can transform the program counters to line numbers, not have it embedded in every deployed binary.

The Go language literally requires that pclntab be included in release builds. I'm with you—it seems kind of crazy that this was designed into the language—but there you have it.

The reason is that Go's standard library provides functions that allow retrieving a backtrace and symbolicating that backtrace at runtime:

* https://golang.org/pkg/runtime/?m=all#Callers

* https://golang.org/pkg/runtime/?m=all#CallersFrames

Unlike in C or C++, where you can link in something like libbacktrace [0] to get a best-effort backtrace, those Go functions are guaranteed to be correct, even when functions have been inlined. This is no small feat, and indeed programs compiled with gccgo will often be incorrect because libbacktrace doesn't always get things right when functions have been inlined.

[0]: https://github.com/ianlancetaylor/libbacktrace
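For readers who have not used the two functions linked above, here is a small sketch of how they fit together. `runtime.Callers` and `runtime.CallersFrames` are the real standard library calls; the helper name `stack` and the buffer size of 32 are arbitrary choices for the example.

```go
package main

import (
	"fmt"
	"runtime"
)

// stack returns the symbolized call stack of its caller.
func stack() []runtime.Frame {
	pcs := make([]uintptr, 32)
	// Skip two frames: runtime.Callers itself and stack.
	n := runtime.Callers(2, pcs)
	if n == 0 {
		return nil
	}
	// CallersFrames is the step that needs the embedded tables: it turns
	// raw PCs into function names, files and lines, expanding inlined calls.
	frames := runtime.CallersFrames(pcs[:n])
	var out []runtime.Frame
	for {
		f, more := frames.Next()
		out = append(out, f)
		if !more {
			break
		}
	}
	return out
}

func main() {
	for _, f := range stack() {
		fmt.Printf("%s\n\t%s:%d\n", f.Function, f.File, f.Line)
	}
}
```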

Is it that common for these functions to be used? Perhaps transitively through some popular libraries? Just being in the standard library doesn't even necessarily mean that these functions (and the data they need) should get included by the linker.

Any program that uses a logging framework, including the stdlib log package, will wind up depending on runtime.Callers at least transitively. That’s probably most Go programs; certainly most of the programs large enough to be worrying about binary size.

Unlike in C, there are no macros like __FILE__ and __LINE__, so there is no alternative to runtime.Callers (short of preprocessing your Go source code).
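As a sketch of what stands in for `__FILE__` and `__LINE__` in Go: `runtime.Caller` reports the file and line of a frame on the calling goroutine's stack. The helper name `here` is hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// here returns "file:line" for the place it was called from, resolved at
// run time from the tables embedded in the binary.
func here() string {
	// Skip 1 frame so the result describes the caller of here, not here itself.
	_, file, line, ok := runtime.Caller(1)
	if !ok {
		return "unknown:0"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

func main() {
	fmt.Println("called from", here())
}
```

The standard `log` package does the same kind of lookup when the `log.Lshortfile` or `log.Llongfile` flag is set.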

You can still get a backtrace without symbols though.

Why couldn't the Go team introduce a flag that strips symbols, while making clear to people that they should only use it if they are okay with backtraces looking like

  #1  0x00007ffff7ddb899 in __GI_abort () at abort.c:79
  #2  0x0000555555555156 in ?? ()
  #3  0x0000555555555168 in ?? ()
  #4  0x000055555555517d in ?? ()
  #5  0x0000555555555192 in ?? ()
or similar

Because one of the mantras of Go is not telling users to have a "debug" build and a "release" build. The development build is the one that goes into production, with no difference in optimizations, symbols and whatnot. This has pros and cons, like all tradeoffs.

Thanks; I didn’t know that, but it makes sense. Not a design decision I agree with, but it’s coherent, at least.

It seems like this can only be a con if other languages give you a choice. In other words, in C++ you can decide that your debug and release are the same, or you can make them different if you know what you're doing (like a good IDE does).

Having more options is inherently better in C++, a language designed to reward mastery. Having fewer choices is better in Go, a language designed to make onboarding a new developer easy.

For any given tool you're going to spend most of your career at the top of its learning curve, so it doesn't make sense to weaken the tool to optimize for the brief period when you weren't.

As a developer I agree with you, but I think if I were a FAANG I'd be thinking: developers are crazy expensive, how can I commodify them? Every C++ developer is unique. There's endless variety of language subsets and build situations. It takes enormous investment to bring someone up to speed on a project, and only mandating a heavily-restricted subset of the language can make teaching it to fresh college grads practical. These are exactly the things Go is designed to avoid.

In my experience, new developers are not normally charged with determining the details of build flags, deployment systems, etc.

Not to be too judgmental, because I recognize that every organization has its own way, but that sounds like a sign of organizational dysfunction.

I just mean, when you join an existing project, most of that would already be set up.

In every project I’ve been on, I was asked to ramp up in the first few weeks by fixing small bugs here and there in the core codebase, not by setting up CI/makefiles/build infra/etc...

I actually think I might be totally missing your point, since I don’t really understand what you mean or how you arrived at the conclusion that it’s a sign of dysfunction.

I hope that's how it would work for me- getting the task of setting up makefiles on the first day- before I know anything about the codebase- would feel pretty daunting, to say the least.

That's bullshit. C is simple and you have masterpieces written in it. And 6502 ASM is even simpler, and yet some C64 games are pure wizardry.

The "26M pclntab" issue sounds like an excellent argument against that design decision.

Counterpoint: our work infrastructure for symbolizing stripped binary stacks sucks. I'd love to get backtraces with symbols from the field.

But also this entire section could be compressed and seekable and only expanded on demand. No need to leave it as a bloated uncompressed blob.

> our work infrastructure for symbolizing stripped binary stacks sucks. I'd love to get backtraces with symbols from the field.

If stripped binaries don't work for you, nobody is saying you should be forced to use them. You would presumably just not avail yourself of the option.

(As an aside -- what's wrong with your symbolification infrastructure at work? This is something I spent significant time working on in a past job, so I'm always curious how other people are doing it).

My employer's internal infrastructure for this just sucks; it's not endemic to stripping binaries.

Specifics? The automation to associate cores with specific builds is bad or non-existent; the automation to load up GDB with the right files and corresponding sources or source branch is bad/non-existent; the fileserver storing symbol data is slow and sometimes remote to the developer across a very thin pipe; etc, etc. All to some extent foot-shooting by my organization. But we have TBs and TBs of build artifacts and hundreds of unique daemons producing cores and also have to debug kernel cores and multiple product branches.

Part of the pain is perhaps that we're cross-building for an embedded FreeBSD-derived system, but the majority of our developers use Linux or Mac, so we can't necessarily use host-native (or host-native-only) tools.

I think basically the situation is begging for someone familiar with the problem-space to sit down and bang out a FUSE filesystem or two (e.g. fetching sources on-demand as GDB reads them, instead of checking out a full multi-GB repo copy) along with a shell or Python script to load up a core. But no one has done it yet and management doesn't really prioritize developer tools.

Why? What problems is it causing you in practice?

Are you sure this is true? Doesn't Delve, for example, build Go source with special flags (-gcflags='all=-N -l') to generate debugging symbols? I also remember having to build Go code with those flags for Stackdriver to get the correct debugging information without any optimisations.

This shouldn't be necessary anymore.

Thanks for the reply. Would appreciate it if you could expand on this a little: since when? FWIW, Google Stackdriver documentation [1] still states that you have to build source with those flags for version 1.1>.


Backtraces can be amazing at run time if you're going to put that info into logs. It makes finding "fringe" errors a lot less painful and more easily reproducible.

Java and .NET can also get a backtrace at runtime. Can't remember if they are symbolized or not, but I think they are. How do they handle this?

They generate unwind information on the fly as they compile the code which is then stashed to one side, compressed and accessed only if a true language-level stack trace is required. Also they don't expose argument or return values in stack traces.

They’re interpreted/JIT’d and not compiled like golang.

They have release and debug build modes.

.NET has had AOT support from the start via NGEN, followed by Mono AOT, Xamarin, IL2CPP, Bartok, .NET Native and a couple of lesser-known projects.

On the Java side, AOT has always been supported by JVMs for embedded deployments like Aonix (now part of PTC), PTC, Aicas, ExcelsiorJET, IBM J9.

And nowadays there is SubstrateVM as well, renamed recently as Graal Native Image, part of OpenJDK since Java 11.

Not entirely true with regards to the question.

Yes, Java is interpreted and JITed, but on the bytecode level. When compiling sources to bytecode, line number information is written to class files.

libunwind[1][2] should unwind correctly if the compiler emits a correct DWARF CFI unwind table (required by C++ w/ exceptions; available for C in gcc/clang with -funwind-tables).

[1]: https://github.com/llvm-mirror/libunwind

[2]: https://www.nongnu.org/libunwind/

I don’t think libunwind gives you function names if you strip symbols and don’t have debugging information on hand.

Sure. libunwind is taking the place of pclntab here, not the symbol table. Go must also not strip symbols in order to print backtraces; if you want the same thing in C, don't strip the symbol table there either.

Not sure if it is just you but normally you DO want this information in the production build. It is quite a bad situation to have a run time exception in PROD and having no idea how it happened. Sure, there is defensive programming and checks and asserts but most of the time you cannot foresee everything.

I get the point about external symbols and location database, but oftentimes time is precious and having fully laid out stack trace in the log will allow you to get to the root much faster.

> I get the point about external symbols and location database, but oftentimes time is precious and having fully laid out stack trace in the log will allow you to get to the root much faster.

You can also set up a service that automatically symbolicates everything in a log file as soon as it is generated, before a human ever even looks at it.

Granted, yes, this is slightly more complicated, but the point is that the toolchain should let the developers choose which strategy they want to use.

> let the developers choose

Absolutely not. Developers making choices about stuff like that is asking for trouble. I'm an SRE, and this is how production issues start: some dev decides that, in production, something like this should not be there, and by the time problems pop up it's too late.

For us it's simple: the moment it hits the CI/CD pipeline, it's a production build, even if it just happens to end up on some test or staging environment. If you possibly need it there, you need it on production. Our way of working means that the exact same build artifact should be promotable to production.

There is a big difference between choose and do.

I don't understand what you mean; can you elaborate?

Not a parent commenter but I interpreted that as: given choice between alternatives A (fast and good enough) and B (more complex and much better) you would want to choose B but end up doing A for various reasons (lack of time, unclear ROI etc)

Sure, but the toolchain designers can (relatively easily) offer both, so that people who really do have the resources and/or need to do B can do so.

Are we still talking about a language that protects its users from the ability to write generics?

That’s a language feature; we’re talking about build flags. IMO, adding a build flag wouldn’t meaningfully change how simple the language is to learn and use, whereas there is at least a credible argument that generics would.

There are very different production scenarios. In many of them no one will ever look (or even be able to look) at a stack trace if it crashes after it's shipped (at best you'll record bug reports from customers and attempt to reproduce them on your test hardware), so the debug information is literally useless there. And these are the same scenarios where an extra 50 MB of disk and memory matters more than for some software running in a cloud environment.

I was pleasantly surprised how good Microsoft's tooling was around firing up a debugger to examine the final state in a crash dump using external symbols from that build. Everything seemed to work except you couldn't resume any thread. I agree symbols don't need to be embedded in every running binary, but having a warm copy somewhere can be pretty helpful.

External symbols have forever been the default on Windows, with the binary only containing the path to the pdb file.

Of course this is partially out of necessity; Windows is proprietary software, so they don't want to give you full debug info. But then in practice their tooling is just so vastly superior. You can fire up WinDbg, a rock stable debugger, attach to a random process and get a proper backtrace including full symbols for all the proprietary Windows stuff, because they run a public symbol server and all the symbol data you need for your Windows build is downloaded in a few seconds, fully transparently. And they ship the same symbol server with their tooling so you can run it for your own binaries, too. You can't do that on any Linux distro without starting to manually install random -dbg packages.

(And don't get me started on trying to debug crashed processes on Linux. A big reason Android has their custom libc is so that they can install default signal handlers for things like SIGSEGV, where on a normal Linux system that process goes immediately to core dump and is essentially fucked for a debugger wanting to look at its state.)

What’s wrong with core dumps when trying to introspect state at crash-time?

So what would a solution be? How do other languages solve that problem?

In my experience you would strip the symbols out of the prod binary, and save them separately somewhere.

Then your production binary will give you stack traces like [0x12345, 0xabcde, ...], but you can use the separately-stored files to symbolicate them and get the source file/line info.
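The two halves of that workflow can be sketched in Go, with a loud caveat: Go always embeds its tables, so here the "offline" step runs in the same process and uses `runtime.CallersFrames` as a stand-in for an external symbolizer reading separately stored symbol files. The function names `capturePCs` and `symbolicate` are hypothetical, and raw addresses like these are only meaningful for the exact same binary (and, for position-independent executables, the same load address).

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
)

// capturePCs returns the caller's stack as hex strings, which is roughly
// what a stripped binary could write to a log.
func capturePCs() []string {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and capturePCs
	out := make([]string, 0, n)
	for _, pc := range pcs[:n] {
		out = append(out, fmt.Sprintf("0x%x", pc))
	}
	return out
}

// symbolicate parses logged addresses and resolves them to source
// locations. In the workflow described above this would be a separate tool
// with access to the symbol files; here it reuses the running binary's tables.
func symbolicate(raw []string) ([]string, error) {
	pcs := make([]uintptr, 0, len(raw))
	for _, s := range raw {
		v, err := strconv.ParseUint(s, 0, 64) // base 0 accepts the 0x prefix
		if err != nil {
			return nil, err
		}
		pcs = append(pcs, uintptr(v))
	}
	if len(pcs) == 0 {
		return nil, nil
	}
	var out []string
	frames := runtime.CallersFrames(pcs)
	for {
		f, more := frames.Next()
		out = append(out, fmt.Sprintf("%s (%s:%d)", f.Function, f.File, f.Line))
		if !more {
			break
		}
	}
	return out, nil
}

func main() {
	raw := capturePCs()
	fmt.Println("logged:", raw)
	lines, err := symbolicate(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```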

Not sure if this is possible on all platforms but it at least is for all combinations of {C, C++, Objective-C, Rust} and { Linux, macOS, iOS } .

And if that added operational complexity is not worth the size savings, you can freely choose not to do it, and things will work like they do in Go.

Separable debuginfo which can be loaded at runtime. DWARF uses an efficient compression mechanism much smarter than a table for this sort of mapping. And of course things like coredumps and crash dumps being sent to automated processing where devtools have the full debug symbols, while production deployments do not.

Go's insistence on not just reinventing the wheel but actively ignoring core infrastructure improvements made in the last 20 years is bizarre.

DWARF will also just gzip sections, in addition to the compressed format.

That is debug information. Just have it stored elsewhere (not on the binary you ship everywhere) and use that in conjunction with your core dump to debug.

A lot of them have symbol files separate from the binary. Unixy tooling doesn't do this by default but for example objcopy(1) in binutils can copy symbols to another file before you run strip(1), and on Mac my memory is rusty but I think it may be dsymutil(1) that lets you copy to a .dSYM bundle. Microsoft has its .pdb files and never even keeps debug info inside the binary proper.

The debug info is in a separate file. You only need that file when you’re inspecting a crash report, so it doesn’t need to be pushed out to the host device(s).

Because it's not a problem, so everybody does the same. And it's not about the programming language, it's about the programmer's choice. If they want debug info inside a production program, the language lets it happen. In this day and age the size of your executable is a non-issue. The only issue should be your performance.

Here is an example from my past. As an embedded programmer I manually added a hundred lines of constants, which initially were just an array generated at start, and that increased the code size by about 5%. Why? Because I gained 5 ms in execution speed. And in the embedded world that's huge, especially when your code is executed on the lowest 10 ms cycle. So the department head approved such a huge change because the code size doesn't matter, you can always buy a bigger chip, but if your car doesn't start because the watchdog keeps resetting its on-board computer, then speed in code execution is everything.
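The trade described in that anecdote, a table generated at start versus the same table written out as constants, looks roughly like this in Go. The original code was presumably not Go, and the 16-entry bit-count table is a made-up stand-in for the real data.

```go
package main

import (
	"fmt"
	"math/bits"
)

// generated is filled in at program start: smaller source and binary, but
// the loop runs on every start-up.
var generated [16]uint8

func init() {
	for i := range generated {
		generated[i] = uint8(bits.OnesCount(uint(i)))
	}
}

// precomputed holds the same values as a literal: more bytes in the
// executable, no work at start-up.
var precomputed = [16]uint8{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4}

func main() {
	fmt.Println("tables match:", generated == precomputed)
}
```

This is the same size-for-start-up-time trade the Go 1.2 line table change made, just at a much smaller scale.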

> Because it's not a problem, so everybody does the same. And it's not about the programming language, it's about the programmer's choice. If they want debug info inside a production program, the language lets it happen. In this day and age the size of your executable is a non-issue. The only issue should be your performance.


Well, that settles that.

I figured that it would be fairly obvious why claiming that executable size means little today and that performance is the only metric that matters is a gross misrepresentation of the software industry.

I do not know much about go, but languages like C++ and Java give you the tools to make tradeoffs appropriate to your situation: externalizing or stripping symbols and/or debugging information.

Is it just me or something like Golang shouldn't be included in production builds at all?

This is where letting a large enterprise guide the development of a piece of widely-used software becomes questionable. At a FAANG the constraints are fundamentally different.

At work I routinely see CLI programs clocking in at a gigabyte, because it was simpler to statically link the entire shared dependency tree than to figure out the actual set of dependencies, and once your binary grows that big, running LTO adds too much time to one's builds. And disk space is basically free at FAANG scale...

> This is where letting a large enterprise guide the development of a piece of widely-used software becomes questionable. At a FAANG the constraints are fundamentally different.

I see where you're going with this, but the conclusion of the article is that Go isn't (what you call) FAANG-y enough for them.

From the article: "This design choice was intended to lower the start-up time of programs [...] This performance goal is not relevant to server software with long-running processes, like CockroachDB, and its incurred space cost is particularly inconvenient for large, feature-rich programs."

What they want is a way to pass 1000 parameters to the compiler so that nobody ever gets a working binary unless they have a team of release engineers to make one for them. The Go team and their users at Google easily have access to said team of release engineers, but took the opposite approach. They optimized for programs that start fast even though most of those teams at Google are writing programs that run for a long time. They optimized for a compiler and runtime that don't have a billion knobs to turn, even though they can afford to pay an army of engineers to tune those knobs.

So I actually think this has very little to do with optimizing for FAANG. This is more about every program getting the same compiler options, which is merely a philosophy that some people on the Go team have, not really a philosophy that every FAANG has. (They use Java at Google too, and Java is more than happy to give you knobs to turn. It also doesn't start up very fast.)

Maybe with bad C toolchains you can get combinations of switches that result in dead binaries, but I've never seen a JVM that wouldn't start because of bad tuning flags. At most you can hurt (or help) performance.

I'm not sure the Go 'philosophy' here has anything to do with it. If you click through some links you'll end up at a GitHub issue where they discuss a prototype for making Go's code generation not completely awful. The justification for not pursuing a 5-10% improvement (which is huge for compiler opts) is that it'd break existing inline assembly by changing the calling convention. This isn't a philosophical objection, it's just bad design painting them into a corner.

Java starts up pretty fast if you bother to AOT compile it, plenty of commercial options available since early 2000's.

For the free beer folks there is AOT support on OpenJDK, OpenJ9 and GraalVM.

> It also doesn't start up very fast

This has been changing rapidly with GraalVM and projects like Quarkus, which can build a static binary of a Java program that starts, as you'd expect, much faster (still not performance of a C/C++ program, but way faster than a regular Java program).

Did I miss something important about GraalVM? I thought it was another runtime on top of JVM, not a replacement for it. For a Java program, for example, GraalVM would presumably have no impact since it wouldn't be used for it.

GraalVM is several things.

In this context people are talking about the native-image tool which uses a small JVM written in Java (SubstrateVM) to compile to native code ahead of time, no HotSpot or other JVM dependencies.

There's also Truffle which is a way to put other languages onto the JVM (both HotSpot and SubstrateVM).

Disk space in general is pretty much free these days. 123mb for a whole database is really not that big of a deal, IMHO. For example, my local PostgreSQL docker image is 140mb plus the alpine OS (5mb). And the Ruby on Rails application using that image clocks in at a little over 1gb, also using the ruby alpine image as a base (50mb).

At my company, the cost only really became a burden with data transfer. Transferring images to and from the AWS container registry is so expensive that we actually build production images inside the Kubernetes cluster (plus the cluster has access to all the secrets and stuff), even though that was a bit harder to implement.

If you're FAANG and you can run your own stable and highly-available cloud, data transfer rates don't matter, so you can deploy your application the "right" way in a containerized world.

Yes but when running a long-lived application on a lot of data it's typically important to keep the executable small both so it can be "hot" and to leave more room for data. At scale this could be even more important, not less, than for a smaller operation.

Of course the real (i.e. explicitly stated by Pike) driver for go was the assertion that inexperienced new hires write poor code and so the harder it is for them to get into trouble the better, even at the cost of other issues.

Executable size on disk doesn't dictate effective size at higher levels of the memory hierarchy.

IOW, you're paying by the page/cache line. If the extra bloat (debug information) in your executable isn't interspersed within your actual code (it shouldn't be), you aren't paying for it in runtime efficiency.

> Yes but when running a long-lived application on a lot of data it's typically important to keep the executable small both so it can be "hot" and to leave more room for data.

Who really runs server applications these days where data/rss are not a large multiple of the actual code segments? People happily run JVM server processes these days, how much code/data do you think that pulls in just to start up?

Genuinely wondering how a CLI would be 1GB, even at Facebook. Never encountered anything _close_ at Amazon.

Not your usual use case, but the binary for running the Bloomberg terminal is well over 1GB (in fact, running into the 4GB executable boundary was a problem).

Oh, that's fascinating. I never checked the binary size but that makes sense.

I’m curious, what’s in it? Which sections take up the space?

Sorry, I should have clarified, this is the server side binary. I believe the part that's run on the user's computer is just a kind of dumb terminal.

The reason it's so big (or at least was several years ago) is that every single function is compiled into one binary monolith.

I’m sorry, that doesn’t sound possible. A clean Windows 10 install (kernel plus userland) can clock in at 6GB and you’re telling me the Bloomberg server CLI was hitting the 4GB wall just because it’s statically linked? No amount of pure code gets that big, it must be something else in the tooling or object format. Perhaps it includes binary assets.

That's quite believable to me for an old codebase. Command line utilities at Google routinely reached hundreds of megabytes in binary size due to static linking of everything. A Bloomberg server is surely far larger and more complex than those tools, so gigabytes isn't unbelievable.

Binary code can be quite bloated compared to source code. It depends a lot on the encoding. The same author as this blog post has another actually much more interesting one on how Go compiles common constructs. It's rather enlightening as the example generated assembly (implicitly) explains why Go compiles so fast. It has to be the dumbest, least optimising compiler I've ever seen. Very short and trivial source code constructs like function calls balloon into pages of instructions, far far more than a C compiler would generate.

C++ compilers also have a reputation for generating bloaty code, but that's mostly due to compile-time specialisation of containers and the way template expansion works. If you write C-as-C++ then you get pretty tight binaries.

I think you're underestimating just how much the bloomberg terminal backend does. It's 40 years of obscure functions packed in to one massive binary.

I never saw a single tool that big at Amazon either, but the general concept held. The AWS service I worked on had ~1,000 jars in its dependency closure. All of those had to be uploaded to S3 after each build, and then downloaded onto each EC2 instance during deployment.

We're talking on the order of a terabyte of data transfer each time we deployed to a thousand instance fleet (ideally deploying weekly)
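A back-of-the-envelope sketch of what those figures imply per instance. The inputs are the rough numbers from the comment (about a terabyte, a thousand instances, ~1,000 jars); decimal units and an even split across jars are my assumptions, and `perInstanceBytes` is a hypothetical helper.

```go
package main

import "fmt"

// perInstanceBytes returns the average payload each instance downloads,
// given the total transfer for one deployment.
func perInstanceBytes(totalBytes, instances int64) int64 {
	return totalBytes / instances
}

func main() {
	const (
		totalBytes int64 = 1e12 // "on the order of a terabyte" (decimal TB assumed)
		instances  int64 = 1000 // "a thousand instance fleet"
		jars       int64 = 1000 // "~1,000 jars"
	)
	perInstance := perInstanceBytes(totalBytes, instances)
	fmt.Printf("per instance: %d MB\n", perInstance/1e6)
	fmt.Printf("average per jar: %d KB\n", perInstance/jars/1e3)
}
```

So each instance pulls roughly a gigabyte per deploy, or about a megabyte per jar on average.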

Mmmm I don't know if I agree with that logic.

I'd argue the opposite -- as a startup, you can't afford to micro-optimize. Labor, time and opportunity cost dwarf all but the grossest resource waste. If you need to use 100GB/k8s node instead of 50GB, it will have 0 effect on the success of your venture.

At Google scale, it becomes worth it to optimize:

- You are delivering more product per engineer, and more product means more resources. Instead of a single customer instance which costs $1/month more, you have 100,000 customer instances, costing a significant amount more. It becomes worth trimming margins.

- You have economies of scale, and it might be worth it for an engineer to spend a month trimming 2% of the cost of a software deliverable.

The common refrain for startups is "do things that don't scale", and this is for good reason. Google has to actually worry about fixing things AFTER they are scaled.

I think you're both right.

A 125MB binary is relatively beefy, but still easily fits in RAM. The amount of disk that you're spending on a single executable (your database, in this instance) is tiny in comparison to the amount of data stored in that database.

It's definitely worth it for Google to trim 2% off of their storage requirements - but if your binary is 0.1% of your storage, it's barely even worth glancing at.

There are plenty of efforts and tooling that reduce the number of dependencies and the deploy size of binaries at Google. The notion that we don't care about size isn't true.

But it's true that it's not worth optimising first, it's done by first evaluating the impact across the fleet and then prioritising the most effective changes.

LTO is no magic bullet for binary size either. A binary that does nothing will still link in the whole C library. It doesn't end up decreasing large program sizes that much in my experience either.

Tree shaking?

I'm not sure why people are so worried about the size of the executable file here. If the runtime.pclntab table is never[1] used then it won't be paged into memory, and disk space is mostly free these days.

[1] Well, hardly _ever_! (Sorry not sorry for the obligatory Gilbert and Sullivan reference.)

If you're using the Go executable on a system without virtual memory support, yeah, that's going to suck, but it appears the Go runtime is horribly bloated and not really suited for super-tiny 16-bit processors in the micro-embedded space. But for something like Cockroachdb, why worry about the file size?
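For anyone wondering when that table does get touched: as far as I know, resolving a program counter to a file and line (stack traces, `runtime.Caller`, `runtime.FuncForPC`) is what reads it. A minimal sketch; `whereAmI` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
)

// whereAmI resolves its caller's program counter to a function name, file
// and line. The runtime answers this from the function/line table embedded
// in the binary (the pclntab discussed in this thread).
func whereAmI() (string, string, int, bool) {
	pc, file, line, ok := runtime.Caller(1) // 1 = the caller of whereAmI
	if !ok {
		return "", "", 0, false
	}
	fn := runtime.FuncForPC(pc)
	if fn == nil {
		return "", file, line, true
	}
	return fn.Name(), file, line, true
}

func main() {
	name, file, line, ok := whereAmI()
	fmt.Println(name, file, line, ok)
}
```

A program that never panics, logs a caller location or prints a stack trace rarely takes this path, though the runtime itself walks stacks during garbage collection, so "never used" is optimistic.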

> disk space is mostly free these days.

This is the only "argument" ever presented, and I don't think it is any good. I care about file sizes. I want to get the most out of my hardware. Not needing to buy another drive is always going to be cheaper for me and every other user.

> Not needing to buy another drive is always going to be cheaper for me and every other user.

128GB+ drives are standard on mid-range laptops. Even at 64GB are you really going to fill up disk space because of Go executables?

CockroachDB (a large software project) is only 123MB. I doubt most people even have 100 pieces of non-default software on their laptop or that executables are going to fill up storage and break anyone's bank these days.

If you're short on HD space, I'm typically targeting photos and videos, not software.

Well. Grow your entire system by 60%. Even with 128GB it will be non-negligible.

Then don't use Go, you aren't their target audience in that case. And I don't mean this in a harsh way, just that Google is clearly opinionated in how they are building Go.

I used to think that, but now with containers it's annoying to have to wait for a big binary image to get copied to the node and loaded up.

In my experience, the bare Go container images are the smallest of them all, averaging out at 35mb here. The nodejs stuff clocks in at 500mb, of which only 130 is the shared base node image, the rest is "application" and dependencies...

While it's true that disk space is virtually free, that is not true for bandwidth.

Bandwidth [to transfer big binaries around] is not free however.

True, but you’d need a lot of transfers before it starts to add up. If you do run into that edge case then, thankfully, you don’t actually need the file to be executable during transit, so you should probably compress the file for transit and decompress it at the other end (assuming your CI/CD pipeline allows for that), or compile it on the destination nodes (since Go’s compiler is fast).

Admittedly neither is a perfect solution, but software architecture is always about making smart compromises.
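A minimal sketch of the compress-for-transit idea using the standard library's gzip. The in-memory payload is a stand-in for a real binary read from disk, and the helper names (`compress`, `decompress`, `roundTrip`) are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress gzips data in memory.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes and writes the trailer
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reverses compress.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// roundTrip reports the compressed size and whether the data survived intact.
func roundTrip(data []byte) (int, bool, error) {
	packed, err := compress(data)
	if err != nil {
		return 0, false, err
	}
	unpacked, err := decompress(packed)
	if err != nil {
		return 0, false, err
	}
	return len(packed), bytes.Equal(data, unpacked), nil
}

func main() {
	// Stand-in payload; a real pipeline would stream the binary from disk.
	payload := bytes.Repeat([]byte("some repetitive table entry\n"), 1000)
	packedLen, same, err := roundTrip(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(payload), packedLen, same)
}
```

How much a real Go binary shrinks depends on its contents, so measure before counting on it.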

If you’re using something like GCP Cloud Run to execute containers on demand, cold start time (which affects both new invocations and scaling events) is directly impacted by container size. As you said, not as much of a concern for a database, but extremely relevant for an HTTP server.

Also if you have multiple instances, I guess it is better to not allocate N versions of the same thing in anonymous memory.

Since Go is statically typed, the runtime data should be constant. Couldn’t a copy on write cache mean that the logical RAM redundancy doesn’t actually affect real memory?

If you have to decompress it at startup, you will typically do it to anonymous memory. You can attempt to be fancy at user-level by trying some silly tricks like putting them to shared memory, although I don't know the API of common ones enough to know if it is even possible in reality because of all the details to handle (ref count of the users with auto destruction when the last one closes, and that atomic with the creation of just one when none exists, etc.)

Ideally, to get all the optims, you would want some compression support at FS level, or even a specialized mapper of data coming from executable files in the kernel (or in cooperation with the kernel), but this will bring added complexity.

(Thinking more about it a solution involving a microkernel would be really cool, but I digress...)

I guess this would specifically be a benefit of fork/exec. Though would it need to be decompressed after 1.2? My assumption was that it trades speed for memory on the first launch, and that memory in subsequent launches would be virtual only.

This is where go’s insistence on reinventing the wheel feels terribly misplaced. Every major debug format has a way to associate code locations with line numbers. Every major debug format also has a way to separate the debug data from the main executable (.dSYM, .dbg, .pdb). In other words, the problem that the massive pclntab table (over 25% of a stripped binary!) is trying to solve is already a well-trodden and solved problem. But go, being go, insists on doing things their own way. The same holds for their wacky calling convention (everything on the stack even when register calling convention is the platform default) and their zero-reliance on libc (to the point of rolling their own syscall code and inducing weird breakage).

Sure, the existing solutions might not be perfect, but reinventing the wheel gets tiresome after a while. Contrast this with Rust, which has made an overt effort to fit into existing tooling: symbols are mangled using the C++ mangler so that gdb and friends understand them, rust outputs nice normal DWARF stuff on Linux so gdb debugging just works, Rust uses platform calling convention as much as possible, etc. It means that a wealth of existing tooling just works.
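If you want to check the "over 25%" figure on a binary of your own, here is a rough sketch using `debug/elf`. It assumes an ELF binary (e.g. Linux) in which the table lives in a section named `.gopclntab`; other platforms and link modes may name or place it differently. `pclntabShare` and `selfShare` are hypothetical helpers.

```go
package main

import (
	"debug/elf"
	"errors"
	"fmt"
	"os"
)

// pclntabShare reports what fraction of the ELF file at path is taken up by
// the .gopclntab section. It only handles ELF binaries.
func pclntabShare(path string) (float64, error) {
	f, err := elf.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sec := f.Section(".gopclntab") // assumed section name, see above
	if sec == nil {
		return 0, errors.New("no .gopclntab section found")
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return float64(sec.Size) / float64(info.Size()), nil
}

// selfShare measures the currently running executable.
func selfShare() (float64, error) {
	exe, err := os.Executable()
	if err != nil {
		return 0, err
	}
	return pclntabShare(exe)
}

func main() {
	share, err := selfShare()
	if err != nil {
		fmt.Println("not measurable here:", err)
		return
	}
	fmt.Printf(".gopclntab is %.1f%% of this binary\n", 100*share)
}
```

Swap `selfShare` for `pclntabShare("/path/to/binary")` to measure something other than the tool itself; the share will vary with the program and Go version.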

I am not a fan of Go, and I also wish these things were true (and more[1], actually), but I find it hard to agree that its priorities are "terribly misplaced." Inside the context of Go's goals (e.g., "compile fast") and non-goals (e.g., "make it easy to attach debuggers to apps replicated a zillion times in Borg") these trade-offs make a lot of sense to me. Like: Go rewrote their linker, I think, 3 times, to increase the speed. If step 1 was to wade through the LLVM backend, I am not sure this would have happened. Am I missing something?

I love Rust, but Go is focused on a handful of very specific use cases. Rust is not. I don't know that I can fault Go for choosing implementation details that directly enable those use cases.

[1]: http://dtrace.org/blogs/wesolows/2014/12/29/golang-is-trash/

I'd check out the HN comments in response to the parent's [1]: https://news.ycombinator.com/item?id=8815778

Specifically the top reply there is by rsc (tech lead for Go)

> non-goals (e.g., "make it easy to attach debuggers to apps replicated a zillion times in Borg")

But wouldn't it still be nice to have a standardized way to analyze post-mortem dumps across languages?

Google's anointed production languages used to be five: C++, Java, JavaScript, Python, and Go. Not much to reasonably standardize across, especially if a standardized solution ends up with more compromises than a custom one.

But DWARF uses less space than Go's native format. So inventing a custom "linetab" format seems like the compromise, not using DWARF.

I suspect that the format is again copy-pasted from somewhere in Plan9, and existing Plan9 tools for it are ported, too.

Insert standardization XKCD. It's been tried. And even so, you can still use the "standard" coredump tool to analyze a Go program's coredump with decent success.

It’s not fun, usually.

This sentence got me:

> Instead of creating (or borrowing from Plan9) an “assembly language” with its own assembler, “C” compiler (but it’s not really C), and an entire “linker” (that’s not really a linker nor a link-editor but does a bunch of other stuff), it would have been much better to simply reuse what already exists.

"Simply reuse what already exists"...like the things that they reused, for example?

I'm usually really willing to forgive a lot of stuff when justified by genuinely different design goals or priorities.

Unfortunately with Go I become less convinced with every passing year that they can keep getting away with this. They keep spinning obvious weaknesses as philosophical strengths, rather than admitting it is due to limited resources and backwards compatibility constraints. Their use cases (servers, UNIX tools) aren't actually unusual or different to other teams. It seems like every time I read about Go they've made what is simply a bad design decision that they later regret and explore fixing, but their rules cause them to keep compounding self-inflicted wounds. Compared to other language and runtime teams they just don't seem to know what they're doing.

Here are just some of the examples we've learned about so far.

Stack unwinding and hyper-inefficient calling conventions. Despite being designed for servers at Google, where throughput really matters a lot, they generate extremely bloated code and hardly use registers rather than generating unwind metadata that's consulted on need and using a tuned calling convention. Tuned calling conventions are optimisations that date back literally decades in the C world and yet Go doesn't have them!

This significantly reduces icache utilisation (hurting throughput) and means they can't use any existing tools, and yet the only benefit is it made their compiler easier to write initially. They increased the server costs of Go shops permanently by taking this shortcut which benefited only the compiler authors. Now they struggle to fix it because they don't have any de-optimisation engine either, so changing calling conventions makes it harder to get useful stack traces and would break user-authored inline assembly (which is rare for Go's use cases).

Compare to how the Java guys did it: the compiler generates highly optimised code and tables of metadata that let the runtime map register/stack state back to the programmer's de-optimised view of the program. Methods can be inlined aggressively because the VM can always undo it, so it doesn't get in developers' way. That metadata is only consulted when a stack unwind is actually needed, which is rare. The rest of the time it sits cold in far away RAM or swapped to disk. Calling conventions aren't exposed to the user and can be changed as needed, but if you need custom assembly you go via JNI that uses the platform calling convention and accept a slower function call.

Go's approach isn't some principled matter of design, as evidenced by their explorations of fixing it. They just didn't plan ahead.

Garbage collectors. The Go team originally tried to claim their GC was some sort of massive advance, a GC for the ages. A few years later they gave a presentation where they admitted they had explored replacing it several times because it's extremely inefficient, but are hamstrung by a (self imposed) rule that they're only allowed one knob and don't want to make their compiler slower. Once again, the constraints of their compiler causes massive cost bloat for projects in production (where cost really matters).
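For reference, the one knob is presumably the GC percentage (the `GOGC` environment variable), which can also be set at run time through `runtime/debug`. A minimal sketch; `withGCPercent` is a hypothetical helper and 50 is an arbitrary value.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// withGCPercent runs fn with the collector's target percentage set to pct
// and restores the previous setting afterwards. It returns that previous
// setting.
func withGCPercent(pct int, fn func()) int {
	old := debug.SetGCPercent(pct)
	defer debug.SetGCPercent(old)
	fn()
	return old
}

func main() {
	prev := withGCPercent(50, func() {
		// Allocation-heavy work would go here. A lower percentage makes the
		// collector run more often, trading CPU time for a smaller heap.
		_ = make([]byte, 1<<20)
	})
	fmt.Println("previous GC percent:", prev)
}
```

That single number moves you along one heap-size versus CPU axis; there is no separate switch for picking a throughput-oriented or latency-oriented collector.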

Compare to Java: the default GC tries to strike a balance between throughput and latency, but if you need super low latency or super high throughput you can flip a switch to get that. The runtime can't know if your task is a batch job like a compiler or a latency sensitive HTTP server, so you can tell it, but if you don't it'll take a middle path. Given the huge costs of large server farms, this is sensible!

Compile time. Whenever you read about the Go team's choices it's apparent they are willing to mangle basically anything to get themselves an easier to write or faster compiler, hence the fact that it hardly optimises and generates massive binaries. But this isn't the only way to get fast compile times.

Compare to Java: compilation is done in parallel with program execution and only where it matters. During development where you frequently start up and shut down programs, you're only waiting for the compiler frontend (javac) which is very simple and doesn't optimise at all, so it's fast like Go's is. When deployed to production the program automatically ends up optimised and running at peak performance and you don't even need to flip an "in prod" switch like with a C compiler: the fact that the program is long running is itself evidence that you're in prod and worth optimising.

This heuristic used to hurt a lot for small command line tools, which usually don't need to be very fast. But you can produce binaries with the GraalVM native-image tool that start as fast (or even faster) than C programs do now, so that's not a big deal any longer.

Generics. Well, this one has been thrashed out so much I won't cover it again here. Suffice it to say that other languages have all concluded this is worth having and managed to introduce it in either their first versions, or in a backwards compatible way later.

Debugging. The ability to easily debug binaries and get reasonable stack traces is known to make program optimisation hard, because optimisation means re-arranging the program behind the developer's back. It's hard to put a breakpoint on a function that was deleted by the optimiser, or inlined. That's why C compilers have debug vs non-debug modes. Debug binaries can be significantly slower than release binaries, hence the difference. In fact in the past I've seen cases where debug-mode C binaries were so slow you couldn't use them because getting the program to the point where it'd experience issues took so long. And of course forget about debugging production binaries.

Golang faces the same problem but in effect just always runs every program in debug mode.

Compare to Java: See the above description of the de-optimisation engine and tables. If you request a stack trace or probe a method with a debugger, the program is selectively de-optimised so the part being inspected by the developer looks normal whilst the rest of the program continues running at full speed. This means you can attach debuggers to any program at any time, without flags, and you can even attach debuggers to production JVMs at any time. This feature doesn't impose any throughput hit (it does consume memory, but it's cold memory).

So we can see that repeatedly the Go guys have made choices that seem to have wildly wrong cost/benefit tradeoffs, tradeoffs that literally nobody else made, and almost always the root cause is their duplication of effort vs other open source runtimes. They use a variety of fairly condescending justifications for this like "our employees are too young to handle more than one flag", but when you dig in you find they've usually explored changing things anyway. They just didn't succeed.

I'm the author of the article linked in this entry and https://science.raphael.poss.name/go-calling-convention-x86-..., and this is IMHO the best comment in this thread.


Well go also uses its own assembler, on top of that a kind of modified garbage version of real ones. You can only justify so many reinventions of the wheel, yet they redid everything.

Did they actually redo everything, or does it just look that way from starting from the Plan9 toolchain? Which could also be said to be re-doing everything, but from a much earlier starting point.

IIRC Go started out shipping with a port of the Plan9 C compiler and toolchain - it was bootstrapped by building the C compiler with your system C compiler, then building the Go compiler. Which, until re-written in Go circa-2013, was in Plan-9 style C. It all looks deeply idiosyncratic but it was a toolchain the initial implementors were highly familiar with.

Perhaps the other assemblers would not provide desired compilation speed?

Perhaps their IP requirements would not suffice for Google's lawyers?

Perhaps Go devs would rather have more control on the development of assembler by writing it from scratch to understand every design decision instead of inheriting thousands of unknown design decisions?

I don't know. Neither do others outside of the project.

I find these baseless micro-aggressions against Go misplaced and unfruitful.

> I don't know. Neither do others outside of the project.

> I find these baseless micro-aggressions against Go misplaced and unfruitful.

Huh? OK, then Go is perfect because it is developed in secret.

We are having a discussion here; I'm not "micro-aggressing" anyone. If I don't like a design / re-implementation decision, and I'm in the mood to share that opinion with this cyber-assembly, I do it. And I expect developers not to be offended by me having a technical opinion; and I expect third parties to be even less offended. And yes, it might be a bad opinion in some cases. I'm not even 100% sure it is not the case here, because like you said they could have had some kind of justification to do that. But I suspect it is extremely rare to have a good justification to rewrite an assembler, with really big quirks on top of that, when they did it.

> Ok then Go is perfect because it is developed in secret.

I didn't imply that at all. No need for a straw man.

Yes, you absolutely did, by stating (not just implying) that any criticism of the project that does not take its internal decision making into account is a "baseless micro-aggression".

I was referring specifically to this:

> You can only justify so many reinventions of the wheel, yet they redid everything.

Don't extrapolate what I write.

> micro-aggressions

We're adults here, can we please not talk like tumblr blog posters?

I totally agree with the above - was never able to click with Go _but_ I totally understand how reinventing the wheel has worked well for them.

The days when the Go project fired up were different from the days when Rust started. Rust made different tradeoffs by relying on LLVM, and that has advantages (free optimizations!) and disadvantages of its own.

The first releases of each were only 8 months apart, far from “different days”. The projects simply have different goals.

A lot of the insularity and weirdness comes from the Plan 9 heritage. Go's authors (Rob Pike, Ken Thompson, and Russ Cox) cannibalized/ported a bunch of their own Plan 9 stuff during initial development. For example, I believe the original compiler was basically a rewrite of the Inferno C compiler.

This is a large part of why Go is not based on GCC or LLVM, why it has its own linker, its own assembly language, its own syscall interface, its own debug format, its own runtime (forgoing libc), and so on. Clearly Go's designers were more than a little contrarian in their way of doing things, but that's not the whole answer.

Being able to repurpose existing code is an efficiency multiplier during the bootstrapping phase. But when bootstrapping is done, you have to consider the ROI of going back and redoing some things or keep a design that works pretty well. The Go team is undoubtedly aware of some of these issues, but probably don't consider them to be a priority.

In some cases the tools are a benefit. Go's compiler and linker are extremely fast, which I appreciate as a developer. A possible compromise would be to offer a slower build pipeline for production builds, which made use of LLVM and its many man-years of code optimizations.

Personally I more wish Rust would take this approach. Rust desperately needs a fast, developer oriented compiler. Slow compile times are potentially Rust's biggest flaw, to the point where I find it keeps me off the language for anything non-trivial. Even better might be a Rust interpreter, so you'd get a REPL and fast development cycles.

This is why Cranelift is being worked on. There is also a Rust interpreter, miri.

I think starting with LLVM was the right decision (and one that I was primarily responsible for). Rust would lose most of its benefits if it didn't produce code with performance on par with C++. LLVM is not the fastest compiler in the world (though it's not like it's horribly slow either), but its optimization pipeline is unmatched. I don't see replicating LLVM's code quality as feasible without a large team and a decade of work. Middling code gen performance is an acceptable price to pay until we get Cranelift; the alternative, developing our own backend, would mean not being able to deploy Rust code at all in many scenarios.

Forgive me if this is ignorant, since I haven't done any benchmarks on this in a while, but doesn't GCC produce slightly faster code on average across a wide set of benchmarks compared to clang/LLVM?

Perhaps, but the advantages of a large third-party ecosystem around LLVM outweighed any performance differences between GCC and LLVM.

At least in the benchmarks that Phoronix runs from time to time (so they can at least be compared with their older selves), LLVM, in its Clang incarnation, is finally reaching some parity with GCC in execution times.


Of course, benchmarks, yada yada, but at least it is some sort of comparison axis where the improvement over the years is clear.

Thanks for the link. I was probably thinking about some older phoronix benchmarks when I made my post

Thanks for the pointer! I was unfamiliar with Cranelift and it seems like a promising tech. I'll keep an eye on it in hopes that once it is stable I'll be able to put together a development environment that allows for the fast turnaround I prefer.

I have not used Rust for anything very large, but using an editor that supports the Rust language server mitigates the compile time problem. In VSCode it shows you the compiler warnings and errors as you are editing a file. There is a little lag in updating but the workflow is faster than switching to a terminal to do a full compile.

Slow compiles are a developer problem. Big binaries are a problem for both developers and users.

Typically, developers can afford to throw more cores and more ram at their build machines.

Isn't Cranelift going to be usable for that?

If you need any other evidence for this, just look at GOPATH and similar. That was plan9 through and through; they wanted to delegate work to the filesystem. No need for a package manager or anything, just pull down URIs and they'll be where Go wants them to be.

What are you talking about? Plan 9 doesn't even use $path. At least not consistently -- binaries live in /bin.

It's derived from convention of /n/sources and /n/contrib and the like. Sources mounted from network fileservers from various places, etc.

The git support was added to make it a bit easier outside plan9.

Go has had to walk back on some of its choices recently; most notably on platforms without a stable syscall ABI and a very strong push for dynamic linking (…so macOS) they link against the system libraries.

The only popular platform with a stable syscall ABI is Linux. This is a product of the historical accident that Linux doesn't control a libc and ensuing drama.

Almost everyone else doesn't have a stable ABI below the (C) linker level.

I don't think Linux actually guarantees syscall-level compatibility, so no need to single it out, it's just like everyone else.

It does - the syscalls are part of the official userspace interface which the Linux kernel promises not to break. They can add new syscalls, options or flags, but can’t break existing ones.

It very much does, explicitly, in a way that every other operating system does not.

It's still not an explicit guarantee. Actually, Linux the kernel doesn't guarantee or promise anything, it's only distros that try, and those that do promise some compatibility, don't promise all that much. The best promise you can find is like a promise of ABI compatibility within a couple of future releases.

You are extremely wrong, so it's probably worth thinking for a moment about how you became so misinformed and why you feel so strongly that you're not misinformed.


  Most interfaces (like syscalls) are expected to never change and always be available.

  This interface matches much of the POSIX interface and is based
  on it and other Unix based interfaces.  It will only be added to
  over time, and not have things removed from it.

This is literally an explicit promise of the Linux kernel; distros have no influence over the Linux syscall ABI whatsoever.

I think you're perhaps extremely confused about the difference between the userspace syscall -> kernel interface, and kernelspace API/ABI such as out-of-tree kernel modules might use. About the latter, yes, there are no API/ABI guarantees in vanilla Linux.

Expecting something is not a promise, just an attitude they want to have towards it at the moment. I think you are also confusing who is even in a position to promise what. Kernel is not an OS people can use, but only something an OS (distro) itself can use and, given the license, in any way it wants. And so kernel cannot promise or force ABI compatibility or anything really on behalf of any OS that uses it. It's up to the OS, but OSes modify kernels, backport things, build with various ABIs and so on. Look for example at the mess around x32 ABI, some distros had it, some didn't, some had it and dropped it, some had and promised ABI compatibility for some time, but Linus wants to drop it from the kernel (don't know if he actually did it), so they are in a pickle. Read RedHat's application compatibility guide if you want an example on what the best a linux distro can promise wrt ABI compatibility.

> platforms without a stable syscall ABI and a very strong push for dynamic linking (…so macOS)

That's an even better description of Windows. The macOS system call table isn't officially stable, but it's at least slow to change. The Windows equivalent has been known to change from service pack to service pack.

Plan 9/9front uses u.h and libc.h, along with nc and nl as the compiler and linker, where n stands for the architecture. That allows cross-compiling for free, as Plan 9/9front has done from the base install, and now, of course, so does Go.

Small note, we don't use the C++ mangler (https://github.com/rust-lang/rfcs/pull/2603), and did the upstream work in GDB to get it to understand things. (There's also more work to do: https://github.com/rust-lang/rust/issues?q=is%3Aopen+is%3Ais... )

That being said, yes, we see integration into the parent platform as being an important design constraint for Rust. I think Go made reasonable choices for what they're trying to do, though. It's all tradeoffs.

Indeed, though it's worth mentioning that the Rust mangling scheme is based on that of the Itanium C++ ABI.

For those who don't know, it's also worth mentioning that while it's called the "Itanium" C++ ABI (because it was designed originally for the Itanium), it's nowadays used for every architecture on Linux.

Not just Linux, either :-).

> to the point of rolling their own syscall code

It makes the Go concurrency mechanism possible, this is not just a kind of whim.

Most importantly, this allows the scheduler to hook into syscalls in order to schedule another routine. But it also allows controlling what happens during a syscall, since libc tends to do more than just call the kernel in its syscall wrappers, and that extra work might not be thread safe or might not play well with stack manipulations.
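To make the point above concrete, here is a minimal sketch of a syscall made from Go without any libc wrapper. It assumes Linux, where the syscall numbers are stable; the helper name `directGetpid` is made up for illustration. As far as I understand it, `syscall.Syscall` tells the runtime when the call is entered and exited, which is the scheduler hook being described, while `syscall.RawSyscall` skips that bookkeeping.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// directGetpid (hypothetical helper) asks the kernel for the process ID
// without going through a libc wrapper. Assumes Linux: SYS_GETPID is the
// kernel's syscall number for getpid on the build architecture.
func directGetpid() (int, error) {
	// syscall.Syscall goes through the Go runtime, which is informed that
	// this thread is entering and leaving a system call.
	pid, _, errno := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return int(pid), nil
}

func main() {
	pid, err := directGetpid()
	if err != nil {
		fmt.Println("getpid failed:", err)
		return
	}
	// os.Getpid is the portable standard library call, shown for comparison.
	fmt.Println(pid == os.Getpid())
}
```

getpid never blocks, so the scheduler has nothing to do here; the hook matters for calls like read or accept that can park a thread.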

This has never been a problem in my experience.

The problem is that there is exactly one OS that maintains the system call ABI as a stable API: Linux. On other systems, trying to invoke the system calls manually and bypassing the C wrapper opens you up into undefined behavior, and this was particularly problematic on OS X, which occasionally made assumptions about the userspace calling code that weren't true for the Go wrapper shell, since it wasn't the expected system wrapper library.

> On other systems, trying to invoke the system calls manually and bypassing the C wrapper opens you up into undefined behavior

It opens you up to a bit more of behavior changing in the future, but just a tiny bit more. No need to make a big deal out of it. It's a very normal thing in software. Nobody is going to promise you a perfect stable interface to rely on forever, not even Linux. But syscalls are actually pretty easy to keep up with, they change slowly, and it's easy to detect kernel version and choose appropriate wrappers to use with very little extra code.

The OS X problem is its own thing. Apple making breaking changes is nothing new. I use an Apple laptop super rarely and still got fed up with breaking changes, to the point of not upgrading past 13.6 at the moment.

Can any of the downvoters explain disagreement? I actually maintain a small Go library of syscall wrappers for Linux and BSDs and don't get why people are spreading FUD about it, as if it's a minefield. It's not, it's a non issue at all even for someone working independently on it. I find it even easier than dealing with all the libc crap on those systems.

Windows syscall numbers change with every service pack release (see https://j00ru.vexillium.org/syscalls/nt/64/ for an incomplete table), and the interfaces themselves are not guaranteed to be stable in any way. ntdll is the only reasonable way to make syscalls on Windows. Trying to build your own syscall code on Windows is fragile and unmaintainable.

Linux maintains stable syscall interfaces. BSDs don't guarantee it, but generally don't change much. But macOS and Windows can and will change their interfaces.

The problem is that that one OS is right. Systems calls form an API and it needs to be stable and managed. We (developers) have been working on this issue for years and have at least attempted solutions (eg. semantic versioning) where most OS developers feel free to break them on a whim. It is a terrible practice that forces others to spend their time working around.

I would note that system calls only form an external API if the developer says they are an external API. Which it is for Linux but for other OSes the external API is a C library, kernel32, etc.

But right and wrong aside, there's the practical matter of reality. You can't simply pretend everything works how you want them to. At the end of the day you have to deal with how they actually work.

The Linux model is not the "right" one, it's a choice that they've made. Just like static linking isn't the "right" choice either, it's an option with its own drawbacks. Other OSes provide an approved, API stable layer to access the OS; it's just not the syscall layer.

The Linux approach reflects the social structure this project has been developed in.

The Linux project needs to be able to evolve independently of other projects, so they do just that.

This is a classic case of technical architecture following social structure, as it does most of the time.

> Systems calls form an API and it needs to be stable and managed.

In some cases (e.g., nearly all the other OSes...) system calls form an internal API, and they don't need to be stable, and they don't even need to be accessible except to intermediate layers provided in a coordinated way.

No one here is disagreeing on the need for the operating system to provide a stable interface to applications: the question is where that stable interface should lie. Linux takes the most restrictive approach, asserting that the actual hardware instruction effecting the user/kernel switch is the appropriate boundary. OS X and Windows instead take the approach that there are C functions you call that provide that system call layer (these are not necessarily the POSIX API). OpenBSD and FreeBSD have the most permissive approach, placing it at an API, not ABI level (so the function calls may become macros to allow extra arguments to be added).

My preference is that the Windows/OS X model is where the boundary should belong.

There's no "right" about it. You're arguing that _having_ a stable ABI is important, and nobody is denying that. There are other ways to get a stable ABI. All of the non-Linux OSes have one, they just guarantee it in a different place (generally in a userspace library that manages the syscall interface).

Linux needs stable syscalls because it basically offers no other interface apart from extensions to posix abstractions for calling into the kernel. Windows has a much larger, stable external C API that provides the same functionality Linux syscalls do such that no one uses anything else.

I agree with some of your points; but zero-reliance on libc is the reason why it's so easy to use Go in containers; and Docker is one of the primary reasons why Go is popular. It's what they have got right.

You could statically link in libc and get the same effect.

Statically linking libc is its own minefield. It can be and is done, but even if you statically link everything else you should almost always dynamically link against your platform's libc.

Statically linking libc is harder than dynamically linking it, but certainly easier than rewriting it.

Good thing we don't want to use much of libc functionality from Go, so nobody needs to reimplement all of it. It's not like people would be begging to call strcpy. All that's needed is the syscalls.

Statically linking a libc seems equivalent to statically linking Go's standard library, but with a whole lot less effort (on the part of the Go developers, that is).

Why? I see no reason you couldn’t on platforms with a stable syscall ABI, other than the standard reasons against static linking.

It depends on the libc.

glibc doesn't really "want" to be statically linked. It can be done sometimes, depending on how it's used, the phase of the moon and so on, but breaks from time to time until it's repaired.

Some of the issues are fundamental. For instance the dynamic linker is a part of the C libraries. If you statically link libc and then dynamically load another shared library, you can end up with two copies of the C libraries loaded at once.

The C library expects various external files to be found on disk, in a format that isn't totally forwards/backwards compatible. That's reasonable when things are dynamically linked because the linker and C library abstract the developer from format changes. If you statically link you should really statically link the data files too, but the C toolchain has no provision for that.

The ISO C standard library has no such expectations regarding external files.

Except glibc does not really support static linking if you want network support.

glibc is just one possible ISO C implementation with lots of extra stuff on top.

Says someone who has never actually tried to do that. You can statically link with Musl. But you can't really statically link with Glibc.

Technically speaking OP only suggested statically linking (a) libc, not glibc specifically, and musl is a libc.

Not sure what’s up with the combative tone? FWIW, I work as an operating systems developer. We sell several products incorporating a few in house libc variants. They are most certainly statically linked.

You can, it just might do some strange things when dealing with iconv.

You'd still have to get a static libc (not usually preinstalled), possibly compile it for a different os/architecture..

An issue here is that statically linking against glibc is generally regarded as a poor idea and I’ve seen a couple non trivial programs that refuse to run with other libc’s.

> Docker is one of the primary reasons why Go is popular.

Based on what data?

I've deployed all sorts of things which dynamically link libc in containers. This just isn't an issue in practice.

If you want static linking use musl..

You'd supposedly sacrifice performance though.

At least that's the common complaint for Alpine docker images... It's based on musl and half of the community always complains about serious performance degradation

Well if the performance difference matters, going to a GC language makes no sense.

Tell that to the go enthusiasts. They're all claiming it to be peak performance surpassing everything else.

though even java is faster in most benchmarks

Except that Docker was originally written in Java by the former team that actually started the project, and nowadays contains modules written in OCaml taken from the MirageOS project, for the macOS and Windows variants of Docker.

So how much they got right regarding Docker's success and Go is a bit debatable.

Docker was not written in Java. It was shell scripts, python, then go. Dotcloud was primarily a python shop.

Perhaps you are thinking of a different project?

Probably Kubernetes which was indeed Java in formative years.

Kubernetes was never publicly available (open source) in any other language than Go. Early internal prototypes may have been in Java, but those bear no more resemblance to current Kubernetes than Borg does.

You are right, I got that wrong, too late to edit now.

Kubernetes was never Java.

Unless she's talking about an internal prototype that never saw a public repository, Kris is wrong.

The original prototype that saw a public repo was written by Joe and Craig in Python (mostly Craig IIRC) and lasted for all of about a week before they switched to Go.

The original crew of contributors, all from Google, came from a Java background and definitely wrote Java-flavored Go, but no version of Kubernetes was ever written in Java.

Source: I know all the principals and was a contributing member when Kubernetes was initially released.

Yup, watching that talk at last year's FOSDEM is what informed me of that part of the history also. Great talk, too!

Indeed, I got that wrong.

I think the parent was referring to using Go in Docker containers, not using Go for implementing Docker itself.

That said, I agree that Docker was the first major project written in Go many people were exposed to and probably had some influence.

I'm just wondering, how is your line about the code taken from the MirageOS project relevant? Nobody uses the Windows and macOS variants of Docker in production.

Windows shops do use Docker in production, there are plenty of them.

It is relevant in the sense that Docker isn't 100% Go nowadays.

As an aside, do they use Windows containers in that context? Otherwise why?

Yes, for example in Azure deployments.

The calling convention is a serious wtf. They're relying on store-load forwarding to make the stack as free as a register, but that's iffy at best and changes heavily between microarchitectures.

I'd assert the calling convention is strange by design: there is the underlying reality that, to support actual closures and lambdas in the Lisp sense, as Go does, not the fake Java sense, one can't use the C calling conventions. In particular, it's not true that a called function can expect to find bindings for its variables on a call stack, because of the upward funargs issue: in the presence of true lambdas and thus closures, some bound variables for a called function will necessarily NOT be found on the C call stack, because lambda (anonymous functions) dissociates scope from liveness.

What you describe is a non-problem: you can trivially spill upvars to the stack on-demand, as most compilers do, while keeping formal parameters in registers. Java needs upvars to be final because it doesn't have the concept of "reference to local variable", but that's just a limitation of the JVM, and one easily solved in other runtimes that very much can pass arguments in registers (e.g. .NET).
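A small Go sketch of the upward funarg case being argued about, under my own naming (`newCounter` is not from any real code base). The closure outlives the call that created it, so the captured variable `count` cannot live only in that call's stack frame; the parameter `start` is never captured, so nothing about closures stops it from being passed however the calling convention likes.

```go
package main

import "fmt"

// newCounter returns a closure that keeps using count after newCounter has
// returned. count is the only captured variable; start is an ordinary
// parameter that is copied into count and then forgotten.
func newCounter(start int) func() int {
	count := start
	return func() int {
		count++ // refers to the same count on every call
		return count
	}
}

func main() {
	next := newCounter(10)
	next()
	next()
	fmt.Println(next()) // third increment of a counter that started at 10

	// A second counter gets its own, independent count.
	other := newCounter(0)
	fmt.Println(other())
}
```

Building with `go build -gcflags=-m` prints the compiler's escape analysis decisions, which is one way to see which variables it moved off the stack.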

The Go developers have considered changing to a register-based calling convention[0][1].

I found these tickets a few weeks ago and they explained why the Go developers haven't yet made this change.

[0] https://github.com/golang/go/issues/18597

[1] https://github.com/golang/go/issues/27539

Interestingly, one of the suggestions to deal with issues in panic backtraces due to this change is to use DWARF.

I'm not familiar with the issue: what makes Java's lambdas/closures fake? Is it that bound variables need to be effectively final?

I don’t know if they’ve done anything new, but as originally implemented, they were inner classes.

The inner class gets copies of the variables, so imperative code that wants to reassign them isn't allowed because it probably won't do what you expected.

The goal is not to GC stack frames. But I'm not sure why they didn't create an inner class to hold the closed-over variables in non-final fields (moving them from the stack to the heap) for both the function and all closures it creates.

(Obligatory "doctor, it hurts when I use mutable state!")
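The two strategies in this subthread can be mimicked in Go. This is only an illustration of the idea, not how javac or any JVM implements lambdas; `cell`, `byCopy` and `byCell` are names I made up. `byCopy` imitates capture by value (the closure sees a snapshot), while `byCell` imitates the suggested holder object whose non-final field is shared by the function and its closure.

```go
package main

import "fmt"

// cell plays the role of the hypothetical holder object: the closed-over
// variable lives in a heap-allocated struct rather than in the stack frame.
type cell struct{ n int }

// byCopy imitates capture by value: the closure reads a snapshot taken
// before the enclosing function reassigns n.
func byCopy() (int, func() int) {
	n := 1
	snapshot := n
	read := func() int { return snapshot }
	n = 2 // the closure does not observe this reassignment
	return n, read
}

// byCell imitates the shared holder: function and closure both go through
// the same cell, so the reassignment is visible to the closure.
func byCell() (int, func() int) {
	c := &cell{n: 1}
	read := func() int { return c.n }
	c.n = 2
	return c.n, read
}

func main() {
	outer, read := byCopy()
	fmt.Println(outer, read()) // the two values differ

	outer, read = byCell()
	fmt.Println(outer, read()) // the two values agree
}
```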

Ah, gotcha. Honestly, I always use this as an example of one of the subtle design points that I really appreciate Java for.

Nitpick, but saying copies in Java can get confusing. Both primitives and references are bound by value. I'm sure you know, but for others: no objects are copied.

I always found this limitation had a reassuring regularity; it's the same way arguments are bound to function parameters (minus being final). Local variables being isolated from "other scopes" means that any interthread communication must be mediated through objects.

They were never implemented that way, rather make use of invokedynamic bytecode.


Android Java is the one making use of anonymous inner classes instead.

I believe they still are, with the caveat that the bytecode is built at runtime for lambdas not compile time like regular inner classes.

Invokedynamic is not related at all to inner classes.

Maybe my memory is a little rusty or I glossed over a bit too much, but I was thinking of how HotSpot does lambdas from here[0]. It seems to use the invokedynamic bootstrap method to spin an InnerClass at runtime. To be fair, it's a HotSpot thing and not in the JVM spec.

[0]: https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/c...

Better check out from Brian's talk.

Not really, because the class file with invokedynamic bytecodes is supposed to work across all JVM implementations.

I think we agree? The bytecode is transferable because the classfile only contains an invokedynamic that calls the LambdaMetaFactory for bootstrapping. The LambdaMetaFactory is provided by the runtime JVM itself so that linkage doesn't introduce an implementation dependence.

Hotspot's just happens to spin an inner class at runtime.

Yes, we agree. I do concede that I wasn't fully correct.

> Is it that bound variables need to be effectively final?

I believe this is it.

Even with store-load fw, you get a penalty (~3 cycle latency) over register accesses, no?

yeah, but it's cheaper than full L1 hit, which is where it would go if not for that.

I was trying to cite a typical full L1 hit latency... I thought store-load forwarding simply avoids having to flush the complete write buffer before the access is even possible, which risks taking far more than ~3 cycles. Now maybe it can be faster in some cases than an L1 hit, I don't know.

Edit: it seems that store-load forwarding is actually slightly slower than L1: https://www.agner.org/optimize/blog/read.php?i=854#854

I'm guessing that the reason was simply ease of porting 32-bit x86 assembly code to 64-bit.

Let's not forget their attempt at inventing yet another Asm syntax for x86, when there is already the horrible GNU/AT&T as well as the official syntax of the CPU documentation.

Go's assembler syntax is inherited from the Plan 9 project, which started in the late 1980s and was first released in 1992.

For context, gcc was first released in 1987 i.e. about the same time that Plan 9 started.

Go authors didn't attempt to re-invent asm syntax. They re-used the work they did over 30 years ago.

And at the time Plan 9 happened it was hardly re-inventing anything either. It was still the time of invention.


* https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs

* https://en.wikipedia.org/wiki/GNU_Compiler_Collection

> And at the time Plan 9 happened it was hardly re-inventing anything either.

Intel's Asm syntax was defined in 1978 with the release of the 8086, and the 32-bit superset in 1985 with the 386. CP/M, DOS, and later Windows assemblers all used the official syntax.

Plan9 assembler syntax didn't start out on x86, and is kept the same across all platforms as much as possible.

The question remains, why not reuse the work that somebody else did even earlier, and that has a lot more adoption already?

Reinventing the wheel is sometimes a feature - using other people's stuff, you gain their features, but you inherit their bugs, their release timelines, whatever overhead they baked in which they thought was okay, etc. You lose the ability to customize and optimize because it's no longer your code...

It's all just tradeoffs in the end - I think golang is finding some success because they didn't make the same tradeoffs everyone else did.

I think by doing everything their own way, they are not shackled to all of these dependencies - especially to some rusty old C++ compiler. That way, among other benefits, they get some very nice compiler speeds.

I installed golang the other day to check it out for the first time. For whatever reason, I chose to input the 'Hello world' program from golang.org by typing it in manually. As with most C/C++ code I would typically write, I put the brackets on their own lines.

Welp, so much for Go.

Like all opinionated formatters, you adjust to it, or you don't. I don't hate gofmt, other than tabs. Sweet jesus.

It's not about the brace format. It's about the mentality that went into the underlying design decision. The more you look into why they did it that way [1], the more dysfunctional their decisionmaking process sounds.

Basically, while (e.g.) Python's mandated formatting style arose from Guido van Rossum's philosophy of best programming practice, Go's mandated style arose from the fact that it was easier to implement from the point of view of compiler authors who had evidently never used a lexer before, much less written one.

[1] https://stackoverflow.com/questions/17153838/why-does-golang...

> the more dysfunctional their decisionmaking process sounds.

can you expand, because I can't see their dysfunctional decision making from the link you provided.

The second answer?

"Go uses brace brackets for statement grouping, a syntax familiar to programmers who have worked with any language in the C family. Semicolons, however, are for parsers, not for people, and we wanted to eliminate them as much as possible. To achieve this goal, Go borrows a trick from BCPL: the semicolons that separate statements are in the formal grammar but are injected automatically, without lookahead, by the lexer at the end of any line that could be the end of a statement. This works very well in practice but has the effect that it forces a brace style."
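For anyone who wants to see the rule in the quote in action, here is a rough sketch using the standard library's own lexer and parser (`go/scanner`, `go/parser`). The helper names and the two sample sources are mine. `go/scanner` reports an automatically inserted semicolon with the literal "\n" instead of ";", so counting those shows the extra semicolon that lands after `func f()` when the brace is moved to its own line, and the parser then rejects the stray `{`.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/scanner"
	"go/token"
)

// Two spellings of the same empty function; only the brace position differs.
const sameLine = "package p\nfunc f() {\n}\n"
const nextLine = "package p\nfunc f()\n{\n}\n"

// countAutoSemicolons counts semicolons that the lexer inserted by itself.
// go/scanner marks those with the literal "\n" rather than ";".
func countAutoSemicolons(src string) int {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)
	n := 0
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			return n
		}
		if tok == token.SEMICOLON && lit == "\n" {
			n++
		}
	}
}

// parses reports whether src is accepted by the Go parser.
func parses(src string) bool {
	_, err := parser.ParseFile(token.NewFileSet(), "example.go", src, 0)
	return err == nil
}

func main() {
	fmt.Println("brace on same line:", countAutoSemicolons(sameLine), parses(sameLine))
	fmt.Println("brace on next line:", countAutoSemicolons(nextLine), parses(nextLine))
}
```

The lexer decides from the last token on the line alone, without lookahead: a line ending in `)` gets a semicolon, a line ending in `{` does not.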

Nothing about that is OK. Nothing about it makes sense.
