
Ask HN: Resources for getting started with bare-metal coding? - DATACOMMANDER
Several years ago I bought a used copy of the original x86 manual and wrote a proof-of-concept OS. I'm interested in getting back into it, but with more of a focus on HPC and utilizing the more advanced features of modern architectures. Where should I start? Even back when I wrote my toy OS, the contemporary Intel manual was 10x the size of the original that I worked with. Does anyone even work with assembly anymore? (If not, how is software keeping up with hardware advances? Do newer low-level languages like Rust and Go really utilize the massive advances that have taken place?)

My history: I'm a devops guy with about four years of experience in IT and about a year of experience writing Python at a professional level. My degree is in general mathematics, though I did best in the prob/stat courses (and enjoyed them more than the others).

Side note: I wonder if I "33 bits"'d myself above...
======
nabla9
If you want to be fluent in bare metal, you must know the bare metal. Writing
functions in bare metal provides a performance boost over compilers only if
you know how to match the computation and data to the underlying architecture
better than the compiler does. Assembly is just a way to write it down.

There are two recently updated great books I recommend:

- Computer Architecture: A Quantitative Approach (2017) by Hennessy &
Patterson

- Computer Organization and Design - RISC-V Edition (2017) by Patterson &
Hennessy (I have the older MIPS edition)

You also need a book and documentation for the specific architecture (x86 or
ARM), but the two books above teach generic stuff that is useful everywhere.

If you do numerical HPC programming, you usually write very little assembly.
You might add some inline assembly tweaks inside (C/C++/Fortran) functions
when needed. You must know how to program in C, C++, or Fortran, depending on
what the code base you are working on uses, and how to embed assembly code
inside them.

EDIT: CUDA programming might be important to learn if you want to do low-level
numerical programming.

~~~
na85
In the case of x86, where the microcode is fluid and can be patched, and
frankly does not represent what the processor is actually doing, can assembly
really be considered "bare metal" these days?

~~~
p1esk
These days even programming mainstream FPGAs with Verilog is not really “bare
metal”. Lots of complexity and control is hidden.

------
CoolGuySteve
The problem with x86 in particular is that there is tons of cruft. You can get
lost for days reading about obsolete functionality.

Here's my general workflow for optimizing functions in HFT:

Write a function in C++ and compile it. Look at the annotated disassembly and
try to improve it using intrinsics, particularly vector intrinsics, measuring
with rdtsc.

Then compile with "-ftree-vectorize -march=native" and compare the compiler's
output to yours. Look up the instructions it used, and check for redundancies,
bad ordering, and register misuse/underuse in the compiler output.

Then see if you can improve that.
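To make that concrete, here's a minimal sketch of the kind of before/after pair this loop produces (the function, sizes, and timing harness are illustrative, not from any real codebase):

    // Build: g++ -O2 -msse2 sum.cpp
    // Then diff the asm against: g++ -O2 -ftree-vectorize -march=native -S sum.cpp
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <emmintrin.h>   // SSE2 intrinsics
    #include <x86intrin.h>   // __rdtsc (GCC/Clang)

    // Step 1: plain C++; read the compiler's annotated disassembly of this.
    float sum_scalar(const float* p, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }

    // Step 2: try to beat it with vector intrinsics, 4 lanes per iteration.
    float sum_sse(const float* p, size_t n) {
        __m128 acc = _mm_setzero_ps();
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
        float lanes[4];
        _mm_storeu_ps(lanes, acc);           // horizontal reduction
        float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; ++i) s += p[i];        // scalar tail
        return s;
    }

    int main() {
        static float data[4096];
        for (int i = 0; i < 4096; ++i) data[i] = 1.0f;
        uint64_t t0 = __rdtsc();
        float a = sum_scalar(data, 4096);
        uint64_t t1 = __rdtsc();
        float b = sum_sse(data, 4096);
        uint64_t t2 = __rdtsc();
        // Single rdtsc deltas are noisy; in practice loop and take the minimum.
        printf("scalar %f: %llu cycles\n", a, (unsigned long long)(t1 - t0));
        printf("sse    %f: %llu cycles\n", b, (unsigned long long)(t2 - t1));
    }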

But all that being said, note that in general this kind of cycle-counting
micro-optimization is often overshadowed by instruction/data cacheline loads.
It's rare that you have a few kilobytes of data that you will constantly
iterate over with the same function. Most learning resources and optimizing
compilers seem to ignore this fact.
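To see why layout dominates, compare these two ways of storing the same data (a hedged sketch; the struct and sizes are made up): summing one field of a padded array-of-structs drags a full cacheline through the cache per element, while the packed layout gets eight values per line.

    #include <cstdio>
    #include <vector>

    // Array-of-structs: each element pads out to a whole 64-byte cacheline,
    // so summing just 'price' still loads every line.
    struct OrderAoS { double price; char pad[56]; };

    double sum_aos(const std::vector<OrderAoS>& v) {
        double s = 0;
        for (const auto& o : v) s += o.price;   // one cacheline per element
        return s;
    }

    // Struct-of-arrays: the same field packed 8 doubles per cacheline.
    double sum_soa(const std::vector<double>& price) {
        double s = 0;
        for (double p : price) s += p;          // one cacheline per 8 elements
        return s;
    }

    int main() {
        std::vector<OrderAoS> aos(1 << 20);
        std::vector<double>   soa(1 << 20, 1.0);
        for (auto& o : aos) o.price = 1.0;
        // Same answer, very different memory traffic and timings.
        printf("%f %f\n", sum_aos(aos), sum_soa(soa));
    }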

~~~
elcritch
I’ve wondered why there aren’t more tools for predicting how a program fits
into cache lines and data-caching effects. For given CPU parameters it seems a
reasonable task to estimate cache-line behaviour based on a sample dataset. Am
I just missing the tools that are used out there?

~~~
CoolGuySteve
The best tool for this in my experience is callgrind with assembly annotation.
You can configure it to more or less mimic the cache layout of whatever
particular chip you're running, and then execute your code on it.

You can use the start and stop macros in callgrind.h to show cache behaviour
of a specific chain of function calls, like when a network event happens; then
in the view menu of kcachegrind, select IL Fetch Misses and show the
hierarchical function view.

It doesn't mimic the exact branch prediction or whatever of your architecture
but when you compare it to actual timings it's damn close.
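For reference, the pattern looks roughly like this (the event handler is a made-up stand-in; the macros and flags are callgrind's own):

    #include <valgrind/callgrind.h>
    #include <cstdio>

    void on_network_event() {              // stand-in for the real hot path
        std::puts("handling event");
    }

    int main() {
        CALLGRIND_START_INSTRUMENTATION;   // begin measuring here...
        on_network_event();
        CALLGRIND_STOP_INSTRUMENTATION;    // ...and stop here
    }

    // Run with a simulated cache sized like the target chip, e.g.:
    //   valgrind --tool=callgrind --instr-atstart=no --cache-sim=yes \
    //            --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 ./app
    // then open the output file in kcachegrind.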

~~~
elcritch
Wow, that's cool!

------
Const-me
> Does anyone even work with assembly anymore?

I do, but I very rarely code in assembly. Usually I just read the output of
the C++ compiler.

It’s hard to beat well-written, manually vectorized C++ code, even when the
compiler is obviously doing something wrong:
[https://developercommunity.visualstudio.com/content/problem/350355/the-compiler-optimized-a-single-very-fast-instruct.html](https://developercommunity.visualstudio.com/content/problem/350355/the-compiler-optimized-a-single-very-fast-instruct.html)
The compiler appears to know the typical latency figures for these
instructions, i.e. it reorders them when it can if that is going to help.

> how is software keeping up with hardware advances?

Barely. The only way to approach the advertised compute performance of CPUs is
SIMD instructions. Despite these being available in mainstream CPUs for a
couple of decades now (the Pentium III launched in 1999), modern compilers can
only auto-vectorize very simple code.

Fortunately, these instructions are available as C and C++ compiler
intrinsics, supported by all modern compilers and quite portable in practice
(across different compilers building for the same architecture).
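For a taste of what they look like, here is a minimal sketch (illustrative only); the same source should build unchanged with MSVC, GCC, and Clang targeting x86:

    #include <cstddef>
    #include <emmintrin.h>  // SSE2: the same header on all major x86 compilers

    // y[i] += a * x[i], four floats per iteration. Assumes n is a multiple
    // of 4; a real version adds a scalar tail loop.
    void saxpy4(float a, const float* x, float* y, size_t n) {
        const __m128 va = _mm_set1_ps(a);
        for (size_t i = 0; i < n; i += 4) {
            __m128 prod = _mm_mul_ps(va, _mm_loadu_ps(x + i));
            _mm_storeu_ps(y + i, _mm_add_ps(_mm_loadu_ps(y + i), prod));
        }
    }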

> Do newer low-level languages like Rust and Go really utilize the massive
> advances that have taken place?

Both are worse than C++.

About what to start with… I would start by picking either CPU or GPGPU. GPUs
have way more raw computing power, and the programming models are entirely
different between them.

If you pick GPU, I’d recommend the “CUDA by Example” book; it helped me when I
started GPGPU programming. BTW, for GPUs the assemblies and instruction sets
are proprietary, i.e. no one works with assembly. There are low-level
assembly-like things, Nvidia PTX, MS shader assembly, but these instructions
are not executed by hardware; the GPU driver compiles them once again into
proprietary stuff.

I don’t know good books for CPU SIMD. I started organically with some random
articles and the Intel reference, but I had many years of C and C++
programming behind me when I did. Not sure that will work for you.

~~~
bitcoinmoney
Can you share what you do for a living?

~~~
Const-me
I've been developing software for a living since 2000.

Lately I've been working on CAD/CAM/CAE Windows desktop software, plus some
embedded Linux. Before that I worked in game development, HPC (not
supercomputers, just commodity servers, i.e. Xeon + nVidia), and realtime
multimedia (video processing, encoding, broadcasting).

------
geofft
Agner Fog's page of optimization resources might have some things that
interest you:
[https://www.agner.org/optimize/](https://www.agner.org/optimize/)

One notable use of raw assembly is that Intel themselves contribute optimized
strcpy etc. implementations to glibc. You might find it interesting to go see
how the most recent ones work.

------
jandrewrogers
For high-performance bare-metal computing in 2019 on typical hardware, your
elementary toolset will be an up-to-date C++ compiler and programming with
vector intrinsics. Note: most performance these days comes from highly
optimizing memory locality (i.e. data structures); clever instruction
sequences will gain little if locality is poor.

While you need to be adequately fluent in reading the assembly code generated
by your compiler, if you are programming in C++ you virtually never need to
write it outside of some very rare cases, because almost everything is exposed
either as intrinsics or implicitly as part of the language/library. Writing
code that uses vector instructions is still a manual process via intrinsics;
compilers are still poor at doing that automatically. The set of vector
instructions provided seems to have many weird gaps (e.g. only '>' and '=='
comparison operators for integers in SSE) because you are expected to figure
out how to logically compose most of the operators programmers are accustomed
to in the vector domain.
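For instance, the missing integer comparisons are composed from the two that exist (an SSE2 sketch):

    #include <emmintrin.h>

    // SSE2 provides only _mm_cmpgt_epi32 ('>') and _mm_cmpeq_epi32 ('==')
    // for packed signed 32-bit integers; the rest are built from them.
    static inline __m128i cmplt_epi32(__m128i a, __m128i b) {
        return _mm_cmpgt_epi32(b, a);                  // a < b  ==  b > a
    }

    static inline __m128i cmpge_epi32(__m128i a, __m128i b) {
        // a >= b  ==  NOT(b > a): invert the mask by XOR with all-ones.
        return _mm_xor_si128(_mm_cmpgt_epi32(b, a), _mm_set1_epi32(-1));
    }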

The details will vary significantly with the type of high-performance code you
are writing e.g. is it floating point numerics, integer domain, memory-hard,
trivially parallelizable, etc. Different architectures are optimized for
different kinds of codes, but a modern CPU with strong vector support likely
gives the best performance across the broadest range of code types. Contrary
to popular impression, most HPC is not doing linear algebra and quite a bit of
it is integer code.

Becoming intimately familiar with the details of microarchitectures is hugely
important to understanding how to optimally structure codes for them. Agner
Fog's resources are a good starting point for understanding some of these
issues.

------
timClicks
Compiler engineers need to read assembly. I don't think too many people write
lots of it by hand, unless they're extending packages like LAPACK or BLAS.

Software has probably been free-riding on hardware, but hardware has also not
been keeping up with itself: the big difficulty is the latency between memory
and the processor.

Rust is closer than Go. Go isn't designed to make optimal code; it's there to
make good-enough code. And it does so very quickly.

In terms of resources, I really like Crafting Interpreters. Would love for you
to take a glance at my book (Rust in Action, Manning), if you want to learn
about other systems topics outside of compiler design.

~~~
viraptor
Go also schedules goroutines N:M, which means it's not trivial to ensure that
your code keeps running on the same CPU... which may be important when you're
talking to hardware.

~~~
majewsky
Why do people keep repeating this myth, when thread pinning is literally a
one-liner?

> LockOSThread wires the calling goroutine to its current operating system
> thread. The calling goroutine will always execute in that thread, and no
> other goroutine will execute in it, until the calling goroutine has made as
> many calls to UnlockOSThread as to LockOSThread. [...] A goroutine should
> call LockOSThread before calling OS services or non-Go library functions
> that depend on per-thread state.

Source:
[https://golang.org/pkg/runtime/#LockOSThread](https://golang.org/pkg/runtime/#LockOSThread)

~~~
viraptor
It's usually enough, but not always. For example, namespaces and new threads
don't handle this well, even if you do LockOSThread:
[https://www.weave.works/blog/linux-namespaces-and-go-don-t-mix](https://www.weave.works/blog/linux-namespaces-and-go-don-t-mix)

~~~
rantanplan
That was fixed quite a while ago, in Go 1.10.

------
viraptor
If you haven't seen it yet, have a look at
[https://wiki.osdev.org/Expanded_Main_Page](https://wiki.osdev.org/Expanded_Main_Page)
- it's probably the best first destination when looking for OS implementation
info. It may not go too deep in many areas, but it has references to other
important places.

Assembly is still used where needed: kernels of media encoders (for
performance), some interrupt handlers (for control), things that shouldn't
comply with ABIs for whatever reason, and things that can't be accessed from
higher-level languages (architecture-specific registers).
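As a small example of that last case (GCC/Clang inline-assembly syntax; the TSC happens to have an intrinsic too, but the mechanism is the same for registers that don't):

    #include <cstdint>

    // Read the x86 time-stamp counter. RDTSC returns the 64-bit counter
    // split across EDX:EAX, requested via the "=a"/"=d" constraints.
    static inline uint64_t read_tsc() {
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return (static_cast<uint64_t>(hi) << 32) | lo;
    }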

------
orbifold
You should check out
[https://software.intel.com/sites/landingpage/IntrinsicsGuide/](https://software.intel.com/sites/landingpage/IntrinsicsGuide/).
Besides that, it is probably a good idea to pick a concrete thing to
implement, do an implementation in C/C++, and figure out, using something like
Godbolt, how to improve the generated assembler code. HPC is not about
assembler implementations at all; it's more about problem-specific algorithms
and MPI.

------
z3phyr
[https://www.coranac.com/tonc/text/toc.htm](https://www.coranac.com/tonc/text/toc.htm)
[http://ianfinlayson.net/class/cpsc305/notes/01-intro](http://ianfinlayson.net/class/cpsc305/notes/01-intro)

By programming for the Game Boy Advance, of course.

------
matthewmacleod
It’s worth reading through Philipp Oppermann’s blog series
([https://os.phil-opp.com/](https://os.phil-opp.com/)), where he writes about
creating an OS in Rust. Not necessarily for the Rust part, but it has lots of
nice info on how a modern system is bootstrapped.

------
rramadass
For bare-metal coding it is best to get started with Embedded Programming
using MCUs (ARM, AVR, 8051 etc.).

For x86 you may find the following useful;

1) Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron.

2) Low-Level Programming: C, Assembly, and Program Execution on Intel® 64
Architecture by Igor Zhirkov.

3) Modern X86 Assembly Language Programming by Daniel Kusswurm.

4) Agner Fog's writings.

5) Ciro Santilli's x86 examples -
[https://github.com/cirosantilli/x86-bare-metal-examples](https://github.com/cirosantilli/x86-bare-metal-examples)

------
janci5243
I find this course from Cambridge to be a wonderful resource for bare-metal
programming on ARM (Raspberry Pi):
[https://www.cl.cam.ac.uk/projects/raspberrypi/tutorials/os/index.html](https://www.cl.cam.ac.uk/projects/raspberrypi/tutorials/os/index.html)

It starts off with booting a simple OS written in assembly from an SD card and
blinking an LED, and then continues to more advanced topics such as
controlling the GPU/screen.

------
raxxorrax
> Does anyone even work with assembly anymore?

Sometimes, on simple 8-bit architectures, for some interrupt routines where it
can, under certain conditions, increase performance of I/O operations. And 99%
of that is read -> write.

On more complex architectures like x86? Not really. I wouldn't even come close
to modern compilers. Maybe for fun.

If you want to create a simple system, I would recommend just ignoring
optimization and focusing on system design. That is probably the only
architecture-agnostic part of working in assembler.

You could also use a µC with an external programmer to get a basic system
running. The mnemonics are often very similar to x86, and while optimizations
are hard and require in-depth knowledge of the architecture, the first step is
probably the system design itself; the efficiency of the assembly is
secondary. All that should come later.

For the very first steps, I would recommend Linux or Windows assembler and
learning how to interface with the underlying OS. That could help harden the
requirements for the design.

------
justaaron
This is going to end up somewhat specific, as most low-level initialization
routines are: ARM vs x86 vs RISC-V vs MIPS vs a BASIC Stamp or Propeller,
etc...

I will suggest studying some general assembly language concepts: memory
locations, how to set up RAM timings and bring off-chip RAM into an address
space, etc.

This stuff is used daily in the world of microcontrollers. I'm a huge fan of
the Parallax Propeller, in which the Spin interpreter in ROM launches assembly
routines on individual cores in about the simplest fashion possible...

Some things you may find interesting:
[https://github.com/dwelch67/raspberrypi](https://github.com/dwelch67/raspberrypi)

[https://github.com/rsta2/circle](https://github.com/rsta2/circle)

[https://ultibo.org/](https://ultibo.org/)

Think of the latter two as "HAL plus some primitives" rather than an RTOS...

~~~
justaaron
Go is garbage collected, and thus not remotely suitable for bare-metal work.

C, C++, Rust, possibly OCaml, FORTH, FreePascal, etc... or else you are
actually talking about a virtual machine, not bare metal.

Learn how to create your own heap: write your own malloc/free on a bare-metal
device, and that will be a huge boost for you, confidence-wise...
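A first cut at that exercise usually looks something like this minimal bump allocator (the region size and alignment are arbitrary choices here):

    #include <cstddef>
    #include <cstdint>

    // Fixed backing store. On a real bare-metal target this would be a
    // region of unused RAM named in the linker script, not a static array.
    static uint8_t heap[64 * 1024];
    static size_t  heap_top = 0;

    void* my_malloc(size_t n) {
        size_t aligned = (heap_top + 7) & ~static_cast<size_t>(7); // 8-byte align
        if (aligned + n > sizeof(heap)) return nullptr;            // out of memory
        heap_top = aligned + n;
        return &heap[aligned];
    }

    void my_free(void*) {
        // Intentionally empty: a bump allocator never reclaims.
        // The natural next step is block headers plus a free list.
    }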

~~~
pjmlp
> Go is garbage collected, and thus not remotely suitable for bare-metal work.

Others beg to differ, selling GC-enabled development environments from small
PICs to grown-up ARM deployments.

[http://www.astrobe.com/default.htm](http://www.astrobe.com/default.htm)

[http://www.microej.com/](http://www.microej.com/)

[https://www.aicas.com/cms/](https://www.aicas.com/cms/)

[https://www.ptc.com/en/products/developer-tools/perc](https://www.ptc.com/en/products/developer-tools/perc)

[https://www.microdoc.com/ibm-websphere-everyplace-custom-environment-wece_](https://www.microdoc.com/ibm-websphere-everyplace-custom-environment-wece_)

Some solutions are even considered suitable for bare-metal work by the US and
French militaries.

And speaking of Go on bare metal, [https://tinygo.org/](https://tinygo.org/)

~~~
justaaron
Well, are any of these actual bare-metal environments for coding? It seems
more like a HAL meets an RTOS, and it's certainly not self-bootstrapping in
Go, which is impossible on real hardware, right?

~~~
pjmlp
Astrobe is certainly bare metal, just like some of the Java deployment
options, with AOT compilation to target boards, like Aicas and PTC are capable
of.

~~~
justaaron
Thanks, that's interesting and very new to me. I appreciate the links

~~~
pjmlp
If you want an example of GC on bare metal, all the way from building your own
CPU on an FPGA up to the graphics display, check the 2013 update of Project
Oberon.

[http://www.projectoberon.com/](http://www.projectoberon.com/)

[https://inf.ethz.ch/personal/wirth/ProjectOberon/index.html](https://inf.ethz.ch/personal/wirth/ProjectOberon/index.html)

Sadly, the ready-made OberonStation boards are no longer on sale.

------
l00sed
I have a friend at NIU who's in the CS program, and they're still teaching
assembly there. I think that with web dev and all the new flashy stuff that's
readily available on top of these foundational languages, people are much more
drawn to the immediacy of JavaScript, Python, and other high-level languages.

------
yitchelle
Many paths to follow here. However, one thing that may help is to decide which
platform you want to start with: something like the ever-popular Pi and
friends, or a lower-powered platform such as the Arduino. Each platform will
need a different approach due to its available resources.

------
readme
This:

[https://github.com/cirosantilli/x86-bare-metal-examples](https://github.com/cirosantilli/x86-bare-metal-examples)

He has a lot of them, and you can also check out his Stack Overflow answers.

> Does anyone even work with assembly anymore?

In infosec, definitely.

------
shakna
> Does anyone even work with assembly anymore?

As a little bit of a different fun one:

I have a MicroPython project I'm building, a limited-run toy, which
live-assembles some assembler that it deploys to a co-processor that handles
low-power mode.

A real mix of very high-level and very low-level code.

------
akkartik
I don't have any answers to your questions (you probably know more than me),
but this thread seems like a good beginning of a support group for people
programming on the bare metal.

I do so not for performance but for reducing dependencies, for building a more
parsimonious stack at all levels:
[https://github.com/akkartik/mu/blob/master/subx/Readme.md](https://github.com/akkartik/mu/blob/master/subx/Readme.md).
It's surprisingly pleasant. Lately I find myself thinking heretical thoughts,
like whether high-level languages are worth all the trouble.

------
Taniwha
Probably the best resource you'll find these days is a virtualised
environment: do most of your development in a VM or an emulator, get it all up
and running, then port it to real metal.

~~~
lallysingh
Yes. Start with this, osdev.org, and gcc -S.

------
stevekemp
I mostly dabble in assembly language for fun, but it can be very useful to
know how it works when you're patching binaries or reverse-engineering
protection/encryption systems. Some of the binary reversing challenges are
great for that.

Otherwise, the only things I've done recently have been writing a simple
"compiler" for maths, converting "3 4 + 5 *" into floating-point assembly, and
generating code for a toy language.
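In that spirit, here's a sketch of such a toy (my own illustration, not the original code): it reads RPN and prints AT&T-syntax scalar-float SSE2 assembly, using %xmm0..%xmm7 as the operand stack.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Emits assembly for "3 4 + 5 *"; the result is left in %xmm0.
    // No error handling or stack-depth checks -- it's a toy.
    int main() {
        std::istringstream in("3 4 + 5 *");
        std::string tok;
        int top = -1;  // index of the xmm register holding the stack top

        std::cout << ".globl expr\nexpr:\n";
        while (in >> tok) {
            if (tok == "+" || tok == "*") {
                const char* op = (tok == "+") ? "addss" : "mulss";
                // Pop: fold the top register into the one below it.
                std::cout << "    " << op << " %xmm" << top
                          << ", %xmm" << (top - 1) << "\n";
                --top;
            } else {
                // Push: load the integer literal and convert it to float.
                ++top;
                std::cout << "    mov $" << tok << ", %eax\n"
                          << "    cvtsi2ss %eax, %xmm" << top << "\n";
            }
        }
        std::cout << "    ret\n";
    }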

------
nohope
The best book on x86 assembly for beginners ever (Programming from the Ground
Up):
[https://savannah.nongnu.org/projects/pgubook/](https://savannah.nongnu.org/projects/pgubook/)

