GERT: Run Go on Bare Metal ARMv7 (github.com)
267 points by chuckdries on June 19, 2017 | 97 comments

I'm really not sure about the drive for garbage-collected languages in memory-constrained systems with no user recoverability. Seems like a recipe for a device that just stops working from time to time.

Embedded systems, IMO, must be deterministic, reliable and consistent. Introducing garbage collection violates these three principles. Without them, how can you guarantee an interrupt can be reliably serviced in time? How can you guarantee that memory won't be exhausted by some unexpected condition that prevents a timely GC? Many embedded systems developers don't even use malloc(), preferring static allocations so they can actually understand their memory requirements.

It's either big enough for Linux, in which case have at it, or you need to reconsider why you're down in the kilobytes of total memory with a garbage collector.

This seems overly skeptical. The http://nerves-project.org/ project (which puts Elixir/Erlang on bare metal) seems to be gaining some traction, and the BEAM VM is considered "near-realtime" and notably has no "stop the world" GC, collecting on a per-process basis instead (processes are cheap to create/destroy and are entirely isolated).

IIRC, BEAM has fixed memory constraints for its processes and a deterministic scheduler that gives even time to all processes, both of which help.

The BEAM is, admittedly, a very interesting option for embedded systems.

http://www.rosepoint.com/ uses Nerves under the hood on its marine navigation products (so, real-world results). Garth (the CEO) did a presentation a year or so ago where he said his entire industry is Java-driven and he was skeptical about his decision to go with Nerves, but after testing it on some realtime data for a month and seeing absolutely zero memory leaks or faults (and relatively low CPU usage), he decided to go forward with it and has been pretty happy with it so far.

I assumed between that and the fault tolerant nature of the architecture that it would be ideal for a standalone device that would be harder to service. I remember doing specs for in vehicle systems in college and the "what if it crashes" question was a nightmare.

Unfortunately, BEAM is not really very easy to hack on. The community has gotten much better than it was a few years ago, but the code is pretty hard to understand.

I think the BEAM approach could be very attractive for embedded systems, given the right investment.

I'll grant you that there's not a lot of documentation for the internals, but I don't think the internal code is that hard to understand. To start with, unlike some languages I've worked with, a large amount of OTP is built in Erlang, not C. Another key thing Erlang does is avoid complexity when it can. For example, the GC algorithm is just about the simplest GC you can get (caveat: it is generational); because of the language's constraints, a very simple GC is effective.

You certainly need to spend some time figuring out all the data types and how to access them in C, but if you're willing to put in that time, and you're capable of mucking about inside a VM, I don't see how it's that hard to understand. It feels to me to be on about the same level as the FreeBSD kernel: after chasing down surprising behavior enough times, I've got a pretty good feel for how to read the code and whereabouts to start looking for code I'd like to read, but making changes can be a stretch, depending on where they need to happen. OTOH, I only have to dive into the depths when my team manages to break BEAM or the kernel, which isn't every day... If more things broke, I'd have more skill here. ;)

You work at WhatsApp then?

I've written a bunch of Erlang, and a bit of Linux Kernel bits.

You're right: I rarely break BEAM, because most of what I'm able to break is in Erlang. The only exception is performance. If you spend a lot of your day hacking on BEAM, I'd love to see more documentation like the BEAM handbook, if you're interested.

> I'm really not sure about the drive for garbage-collected languages in memory-constrained systems with no user recoverability. Seems like a recipe for a device that just stops working from time to time.

Remember all those home computers from the seventies and eighties, such as the C64, Apple II, MSX compatibles and the Spectrum? Almost all of them ran "garbage collected" BASIC interpreters.

I don't think blanket statements are justified. There are a lot of different types of embedded systems.

> Without them, how can you guarantee an interrupt can be reliably serviced in time?

You wouldn't allocate memory in an IRQ service routine in the first place, GC or not. GC-based, dynamic (malloc) and static systems would all take exactly as long to service an interrupt.

GC can also be confined to a subset of the system, where less time-critical functionality runs.

That's not to say embedded systems should do allocation at runtime. It's often reasonable to avoid it. But perhaps not all the time.

Are you sure those BASIC interpreters used garbage collection? I wouldn't have thought so. I don't think you even had dynamic memory allocation in most of these, you had to size your arrays in advance. The memory might not have gotten allocated until you got to that line of code but that's not the same thing...

Yes, for string and array management.

Many BASICs had REDIM for resizing arrays on the go.

> I'm really not sure about the drive for garbage-collected languages in memory-constrained systems with no user recoverability. Seems like a recipe for a device that just stops working from time to time.

J2ME would like to have a word with you...

J2ME is nowhere near being hard real-time. The minimal Java ME Embedded configuration is a 32-bit MCU with 130 KB RAM and 350 KB Flash/ROM. That is close to the border where systems have an MMU, and it is considered a very "big" configuration in the realm of embedded systems; 10-20 kB RAM and 64 kB flash is plenty for many, many applications. Rust seems to be on a good path to being the real C competitor in general bare-metal development. Right now, Cortex-M* family support is at the level where you can start writing some apps in it. I don't consider Ada to be mainstream.

J2ME is not the only game in town for embedded Java, check Aonix and Aicas offerings instead.

As another example, MicroEJ is targeted at Cortex-M CPUs.

And there is also Astrobe selling Oberon-07 compilers for M3 and M4 processors.

> 32-bit MCU, 130 KB RAM, 350 KB Flash/ROM

Anything less than this is almost certainly some kind of PIC; not everyone is using those for embedded real-time deployments.

Correct me if I'm wrong: Aonix no longer exists; it was bought by PTC. Their real-time Java now targets only x64/x86 systems.

You are wrong about "anything less": anything less is probably 90% of the market. You will find 130 kB RAM and 350 kB flash only in high-end products from NXP (LPC family), Kinetis (KL family) or ST (STM family). You only need that much RAM for Java :)

I believe his point was that they implemented hard-real-time, embedded Java with GC on some components, and that it was fielded successfully. That refutes any objection about whether it can be done. At that point, we're just discussing whether the new project can pull that off, on what kind of hardware, with what constraints.

EDIT to clear up two sub-conversations here: what Aonix pulled off for hard-real-time Java, and the fact that there are also hard-real-time GCs for embedded systems; Aonix is an example of the latter.


Yes, they were bought by PTC.

However, I don't know what they have done with the PicoJava offerings, since PTC isn't as friendly as Aonix about making documentation available.

As for the market, it depends on which market the products built with those CPUs are actually sold into.

My Cisco phone and the Ricoh printer around the corner both are running some form of embedded Java.

Maybe you think they are part of the remaining 10%; however, Cisco and Ricoh thought it was worthwhile for their sales.

I believe we have different definitions of embedded system. There are a lot of them; for me the simplest definition is a CPU/uC system without an MMU. That Cisco and Ricoh gear has Linux running inside (at least the phone I have runs on some 200MHz MIPS).

"I believe we have different definitions of embedded system."



The 32-bit market was at $6 billion by 2014, per Atmel's report. There's also a huge amount of sales for Windows Embedded and embedded Linuxes. That represents a significant chunk of a massive market. So, it's quite worthwhile to call even a 32-bit-targeted, hard-real-time GC useful for "embedded" systems. As he said, it's part of the standard definition used in the embedded sector. The 32-bit side is growing rapidly, too, due to decreasing cost and power usage.

EDIT: The specs on them are also starting to look like the desktops of old. Actually, they started to do that quite a while ago.


My definition is the market definition, it doesn't single out to one specific architecture design.

The market goes all the way from something like the PIC10 to something ARM v8 64 bit.

It is all a matter of what a company is targeting as customer market, and how much it is willing to spend.

Just because a given language is not able to target 100% of the market doesn't make it invalid for that market.

If that were the case, C wouldn't be a valid language for embedded development either, given that many CPUs/uCs can't cope with a straight ANSI-C-compliant compiler and require either a C dialect or pure Assembly.

> J2ME is nowhere near being hard real-time.


Believe it or not, not every embedded application has a hard real-time requirement.

You're just moving the OP's goalposts to make a point, here.

No, he's making the point that the latency introduced by a GC is most important in the case of a hard real-time system. If J2ME isn't one, then using it as an example is specious.

Hah, that's a fair point, actually. Maybe I'm using a different definition of 'embedded system'. To me, anything that's a general purpose application processor these days (i.e. capable of running Linux) barely fits the definition. I wouldn't really call the iPhone CPU an 'embedded system' although, I guess, it kind of is.

> Hah, that's a fair point, actually. Maybe I'm using a different definition of 'embedded system'. To me, anything that's a general purpose application processor these days (i.e. capable of running Linux) barely fits the definition.

I'd say that's a bit myopic.

There's a huge range of devices between "a few kilobytes of memory" and "smartphone" that would be well-served by something like this.

I agree with you that there are many devices in that range, but why are they not well served by an operating system? When is a device large enough to run the whole Go runtime but too small for (say) Linux?

Compared to a complete Linux kernel, the Go runtime is pretty tiny.

But to your point, there is no magic answer. The question is rather: when are the capabilities of a full OS kernel like Linux worth the resources needed to run it? And the answer is ultimately: it depends.

> J2ME would like to have a word with you.

It appears you are running on DalvikVM instead.

He just compiled it with gcj. It offers the best of both worlds: The performance of an interpreted language with the standard library of C.

And JavaCard 1.0 wants to know what a GC is. No need to throw away perfectly fine objects.

A few of its family members will even prove mathematically that they won't throw those objects away.



I've written a prototype bit of software in Rust to run on a fleet of Raspberry Pi units. So far it seems impeccably reliable, which is nice, but cross-compiling it was not the most fun I've ever had programming.

As the tooling improves I can definitely see it being a good language to use on embedded devices. And it's a rather fun language too; I certainly find it more pleasant than Go to write (but I seem to be in the minority, considering the popularity of Go lately).

Are you running it on linux, or without an operating system?

It actually runs inside Docker on a Linux system, as I've been deploying it with Resin.io for our test units. Annoyingly, I've had a couple of units crap out on me; that might be a bug in an older ResinOS image that has since been resolved, though.

When I get time (eventually) I'm going to be working on our own minimal Linux system for the devices. Really all I want is a device that can be accessed from behind firewalls (looking at Teleport for this with their new ARM support), and the rest can be compiled Rust binaries using upstart or somesuch :)

We do have some upcoming projects where I might get a chance to try writing stuff without an operating system. That'll be an interesting challenge!

OP is talking about more constrained embedded devices: devices on which there is no Docker or even an OS. I would not call the Pi an embedded device in this context.

Relevant to your question, there's work towards a bare metal ARM stack for Rust, called Zinc: https://github.com/hackndev/zinc

I'd trust a GC implementation a lot further than I'd trust a typical C programmer. Yes, there's a certain risk that your device will occasionally stop working, but absent formal verification that's a risk for a device whose software is written in any language. Rather than an absolute notion of risk/no risk, let's start talking about acceptable defect rates.

What about Rust, which has neither of those downsides? (Yes, it has the porting problem due to bootstrapping issues, but Go has the same issue.)

Also, the "acceptable defect rate" is not necessarily very large for a majority of cases that users will care about.

Have a look at RTFM: http://blog.japaric.io/fearless-concurrency/

There's nothing like it out there. This is zero-cost abstraction, and it works.

Rust is pretty cool. If I were writing code for a platform like this I might use it (but probably only if I was convinced that GC issues were going to be a real practical problem if I used OCaml/Haskell/Scala-native).

GERT can't deterministically service interrupts, but it's technically because of the armv7a architecture and its non-deterministic generic interrupt controller. Also, unless your Go program is constantly creating and destroying objects, the garbage collector won't really run. I wouldn't put GERT on my ABS brakes just yet, but I think you can engineer around a GC.

I wish there were an ARMv7-R or ARMv8-R dev board around (that doesn't cost thousands), because those are actually meant for realtime applications and I would really like to try GERT on one.

Maybe you can check out the TI Hercules series. I couldn't find a multicore one, but it's the cheapest Cortex-R board I could find. http://www.ti.com/lsds/ti/microcontrollers-16-bit-32-bit/c20...

Hard real-time GCs exist. So, you can definitely do it.

> Embedded systems, IMO, must be deterministic, reliable and consistent.

This is the definition of a hard real-time system. In most of the literature, 'embedded system' is a broader term that just means there is some compute embedded in a device that performs a larger task.

It's looking more and more like mainstream embedded SoCs will combine the general-use HMI processor core (like an A8/A9) with a smaller real-time core for control tasks.

TI Sitara (Beaglebone family) does this via the PRU, and Freescale added a Cortex-M4 to the i.MX 6SoloX for a similar purpose.

I think it's worth seeing how this experiment works out in practice. Go's concurrent GC has predictable pauses of less than 100 microseconds (not milli-) even on large heaps (>50 GB) with heavy object-allocation patterns. I believe for embedded software the heap will be much smaller :) and the allocation pattern lighter as well, so real pauses will be almost negligible, with a 99.9999% guarantee of staying under 100 microseconds (and, I believe, under 10 microseconds). That may be just enough for many cases.

There is a price you pay for this: you can even get a 1 microsecond pause, but how much work gets done in that 1 microsecond? You should measure the total time spent in GC over x seconds instead of measuring one pause. If your task takes a lot of time, then all those GC pause times add up into the task's execution time. In practice, the numbers you gave (provided by the Go developers) are not always true. I know because I run apps written in Go in production.

Many modern embedded systems are more powerful than the Xerox PARC Star with Mesa/Cedar, the ETHZ Ceres with Oberon, DEC's Topaz with Modula-2+, or Washington's SPIN OS with Modula-3 ever were.

Also, embedded real-time JVMs fit in a few hundred KB and are being used by the likes of the military, e.g. the Aonix picoJVM, to control real-time stuff like battleship missile-tracking systems, which I assume is quite real time.

It's not about size, it's about predictable response time.

Embedded systems range from the tiniest microcontroller up to multi-core xeons, DSPs and FPGAs.

Embedded != small.

The typical issues associated with embedded development are 1) cost and 2) response time (for real-time embedded systems).

The big one here is cost. If you're wasting a single byte in your code, you have to pay that cost in every single unit you make (e.g. millions).

Given that there are real time JVMs controlling ballistic systems and aiming turrets on battleships, I guess they have a pretty good predictable response time.



Also I have a JVM running on the Cisco phone on my desk and the Ricoh laser printer down the hall.

Just because there is a portion of the market where a certain concept doesn't apply, it doesn't mean it isn't viable in other segments of the same market.

For Go to be successful on embedded systems, it doesn't need to run everywhere.

Heck, there are even embedded CPUs that cannot cope with ANSI C, and that hasn't prevented people from making use of it in other market segments of the embedded space.

"It's not about size, it's about predictable response time."

As far as GCs go, it's also about size in a constrained embedded system. I've seen a number of GC papers discussing tradeoffs between size (i.e. RAM use) and speed/latency. This factored in a bit even on the large ones like Vega, where they were still balancing those factors to get an optimal, "pauseless" GC for accelerating enterprise apps.

I agree on the other points.

You can write Go programs in a certain style that doesn't use GC. It's the equivalent of not using malloc in a C program.

There's a fair amount of implicit allocation in Go. When you reference a slice, you create a slice descriptor, which takes up space. Maps, channels, and goroutines all allocate. It's not impossible to avoid allocation in a block of code, but it's not just avoiding explicit allocation.

How do interrupts work in this system? Do they map to goroutines, or what?

The thing is, this allocation is usually on the stack. With a bit of experience, it is relatively easy to write your program such that all these allocations are on the stack. The compiler also gives you a clear breakdown on which objects are allocated on the heap. It is one big advantage of Go vs. Java for example, that you do get much better control about the allocation. All Go types are "value types", so this creates much more control and less memory pressure. (Except for interfaces, they are great when you need dynamic typing, but they behave more like reference types and do some unavoidable heap allocation, so avoid them in code parts which must not allocate from the heap.)

No, it's really not. How do you allocate a data structure on the stack and then call a function passing a pointer to it at a polymorphic call site in Go? (A polymorphic call site is one in which the compiler cannot statically tell which function is being called.)

Yes, it is possible. The Go GC itself is written in a Go subset that can be compiled without heap allocation. The compiler even has a switch for this, which turns any heap allocation into a compile error. That's a subset of all possible Go programs, but still a very useful one.

What is the compiler switch?

Sorry, I don't know and haven't been able to locate it in the help for the build tool, but it was mentioned in a talk about the GC, especially about it being written in Go itself.

Interfaces allow polymorphic coding, and they support stack allocated data structures last I checked.

I didn't say you couldn't make polymorphic calls in Go. I said that they don't mix well with relying on escape analysis to avoid tracing GC.

Oh I see what you mean. Yeah forcing vars to actually be stack allocated can be challenging when using interfaces everywhere. What are the Go authors' opinions on improving escape analysis?

I think it's a design issue that can't truly be solved, period. You could try stuff like k-CFA, I guess.

What is k-CFA?

Ah neat, thanks.

Nim can run on bare metal natively, even on microcontrollers like the Arduino. Its GC is deterministic, the GC algorithm is pluggable, and it can also be turned off.

> Embedded systems, IMO, must be deterministic, reliable and consistent. Introducing garbage collection violates these three principles

There are a number of commercial real-time JVMs out there.

Sure, though they generally achieve that by running on big iron, with lots of spare CPU and memory, running a lot of big, exotic software. That is, environments with the opposite of where you'd generally want to deploy any code bare metal.

Not necessarily. These real-time JVMs are aimed more at embedded systems.

Which ones?

Aonix for example.

Heavily used by the military, including weapons control systems.

Other well-known ones are IBM WebSphere Real Time and JamaicaVM.

Also companies like Gemalto, Ricoh, Cisco have JVMs on their devices, but not real time.

Is this what is now PTC Perc? I don't see many references to Aonix after 2010.

The Aonix data on their website was very useful; it proved most of our points, which is why we reference it. It's still available through archive.org. Although, just checking, a lot of the good links were dead the times I tried. Uh oh.

I did find an announcement:


In the Aonix design, there is no GC on the hard-real-time threads or their portions of the heap; those are usually allocated statically anyway for maximum predictability. The GC can apply to everything else, and it is preempted by the real-time threads when necessary. It was usually tied into an RTOS like LynuxWorks' (now Lynx).

Yes they were bought by PTC.

I still refer to Aonix because they were more developer-friendly, had more information on their website than the few whitepapers from PTC, and their web site is still partially up.

I thought their main customers were high-speed trading firms, not so much embedded systems.

The military makes heavy use of them as well; see Aonix.

You can pre-allocate in garbage collected languages as well.

>deterministic, reliable and consistent

Is there a formal definition for all three terms?

* Deterministic - the system is intrinsically incapable of undefined behavior, provably so. (Though extrinsic factors like hardware or network failure could still result in undefined behavior.)

* Consistent - Every read receives the most recent write or an error (from CAP Theorem https://en.wikipedia.org/wiki/CAP_theorem)

* Reliable - ?

It seems an easy way to avoid GC in Go is to statically allocate everything: only use global references, etc.

Are memory leaks more inevitable in garbage collected programs than manually collected ones?

No, but in manually managed ones you usually have complete control over when an allocation happens, and often you can keep everything entirely on the stack. That's one advantage C has over Go, for example.

Of course, in the end, it doesn't matter as much as people make it out to, because you can easily blow the stack in C. In reality, one of the worst disadvantages of garbage collectors is latency, and Go's GC is best-in-class in that respect.

And, obviously, while Go is pretty competitive in memory usage to many higher level languages, in my experience you can still be much, much more frugal on memory when coding in C.

Wow, it's a great piece of work. I always wondered how to get such a big Go runtime onto bare metal, and someone finally did it. I'm glad there's a paper as well. So how big is the compiled binary of the Go runtime? I hope it's small enough to port to Cortex-M series MCUs.

Hi, I'm the author of GERT. The size of the ELF for the laser projector program is 2.1M, so it probably will not fit on a Cortex-M :(. Additionally, I don't think GERT will be as useful on a single-core SoC as it is on a multicore chip, because blocking operations (like reading a UART) may literally take forever. The memory safety can certainly be useful, though! I'd say just start tinkering and try it out. You can probably gut enough of the ELF to get it down to a few hundred kilobytes.

Impressive work! One question, though: what is the boot-up time? There's the GIF on the GitHub repository, but I can't accurately tell it from that.

Would it be possible for you to make a program that just exits, and then time the whole boot-up process? Thank you.

I have a specific use-case and would be willing to buy the board if it is fast enough.

>"The minimal set of OS primitives that Go relies on have been re-implemented entirely in Go and Plan 9 assembly inside the modified runtime."

Is the minimal set of OS primitives that Go relies on documented anywhere?

It looks like https://github.com/ycoroneos/golang_embedded gives details on what changes were necessary to the Go runtime package.

That's correct. My thesis (https://github.com/ycoroneos/G.E.R.T/blob/master/thesis/main...) also has detailed info in chapter 3. The github documentation is still a work in progress :P

It's awesome that you've released your master's project under a free software license. You see a lot of research that took years of labour wilt away because it remained proprietary. Well done!

Thanks, this PDF is great.

Great. When will this target ARMv8 64-bit 'A' profile chips?

It's planned. If ARMv8 had had market penetration ~1.5 years ago, then I probably would have started that way. One big issue with most SoCs is the lack of publicly available data sheets for writing good drivers. That's also why I picked the i.MX6Q; its data sheet is very detailed.

I'd say that even two years ago most mobile phones being sold were already ARMv8. That doesn't help with the SoC documentation, though; you're right that this has been a consistent weak spot. Usually documentation is offered up, or washes up, only when the market relevancy of any one SoC approaches zero. Before then, it's passworded up and jealously guarded. That makes no sense to me, especially when you consider that most of any one SoC consists of reused generic IP blocks. I mean, I can deal with an NDA for something tricky like your new GPU, but that doesn't explain why I can't figure out the interrupt routing or your clock and GPIO blocks.

If you're referring to ARM servers, then things are still solidifying (it takes a while to line up an entire hardware and software ecosystem, even in a world where you're all set if it runs Linux). There are specs like SBSA and SBBR that ensure servers from any SoC vendor look roughly the same, but I would wonder why you would target bare metal in that case anyway. Have you considered targeting ARMv8 VMs, like the one modeled by KVM/qemu? Extra bonus in that it looks like an ARM server.

I've been developing with i.MX for over 5 years and I'll heartily recommend that part over any other Linux-class SoC on the market right now.

Freescale's support is probably the best available out there in this class of chips. Documentation is mature and plentiful (excepting the GPU of course but that's being worked around), and there is plenty of code sitting on their Github servers including Yocto recipes that are pretty close to mainline.

Congratulations on the work achieved.

It looks quite interesting.

I just updated Debian on my cubox-i (also iMX6) last week, which also runs mostly Go code. Never thought that Go would be low-level enough to run on bare hardware without replicating a lot of OS-level code. Interesting project, thanks for sharing... I hope I get to check out how you did it when/if I have some more time.

haha that's my name!
