Ask HN: Resources for getting started with bare-metal coding?
179 points by DATACOMMANDER 84 days ago | 68 comments
Several years ago I bought a used copy of the original x86 manual and wrote a proof-of-concept OS. I’m interested in getting back into it, but with more of a focus on HPC and utilizing the more advanced features of modern architectures. Where should I start? Even back when I wrote my toy OS, the contemporary Intel manual was 10x the size of the original that I worked with. Does anyone even work with assembly anymore? (If not, how is software keeping up with hardware advances? Do newer low-level languages like Rust and Go really utilize the massive advances that have taken place?)

My history: I’m a devops guy with about four years of experience in IT and about a year of experience writing Python at a professional level. My degree is in general mathematics, though I did best in the prob/stat courses (and enjoyed them more than the others).

Side note: I wonder if I “33 bits”’d myself above...




If you want to be fluent in bare-metal, you must know the bare metal. Writing functions in bare metal provides a performance boost over compilers only if you know how to match the computation and data to the underlying architecture better than the compiler does. Assembly is just a way to write it down.

There are two recently updated great books I recommend:

- Computer Architecture: A Quantitative Approach (2017) by Hennessy & Patterson

- Computer Organization and Design: RISC-V Edition (2017) by Patterson & Hennessy (I have the older MIPS edition)

You also need a book and documentation for the specific architecture (x86 or ARM), but the two books above teach generic stuff that is useful everywhere.

If you do numerical HPC programming, you usually write very little assembly. You might add some inline assembly tweaks inside C/C++/Fortran functions when needed. You must know how to program in C, C++, or Fortran, depending on what the code base you are working on uses, and how to embed assembly code inside them.

EDIT: CUDA programming might be important to learn if you want to do low level numerical programming.


In the case of x86, where the microcode is fluid, can be patched, and frankly does not represent what the processor is actually doing, can assembly really be considered "bare metal" these days?


These days even programming mainstream FPGAs with Verilog is not really “bare metal”. Lots of complexity and control is hidden.


The problem with x86 in particular is that there is tons of cruft. You can get lost for days reading about obsolete functionality.

Here's my general workflow for optimizing functions in HFT:

Write a function in C++ and compile it. Look at the annotated disassembly and try to improve it using intrinsics (particularly vector intrinsics), measuring with rdtsc.

Then compare your output against the compiler's with "-ftree-vectorize -march=native": look up the instructions it used, compare them with what you did, and check for redundancies, bad ordering, and register misuse/underuse in the compiler output.

Then see if you can improve that.
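The workflow above can be sketched as follows (a hedged illustration — the function names and the toy reduction kernel are mine, not the commenter's): a plain C++ reference loop, a hand-vectorized SSE attempt, and an rdtsc-based timer so the two can be compared against the compiler's own output.

```cpp
#include <cstddef>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc and SSE intrinsics (GCC/Clang, x86-64)

// Plain C++ reference implementation -- this is the version you compile
// first and whose annotated disassembly you study.
float sum_scalar(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Hand-vectorized attempt with SSE intrinsics, four floats per step.
// Assumes n is a multiple of 4 to keep the sketch short.
float sum_sse(const float* p, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    // Horizontal reduction of the four partial sums.
    acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
    return _mm_cvtss_f32(acc);
}

// rdtsc-based timing of one call, as in the workflow described above.
std::uint64_t cycles(float (*fn)(const float*, std::size_t),
                     const float* p, std::size_t n, float* result) {
    std::uint64_t t0 = __rdtsc();
    *result = fn(p, n);
    return __rdtsc() - t0;
}
```

From here you would diff the disassembly of `sum_scalar` (built with the auto-vectorizer flags) against `sum_sse` and the cycle counts.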

But all that being said, note that in general this kind of cycle-counting micro-optimization is often overshadowed by instruction/data cacheline loads. It's rare that you have a few kilobytes of data that you will constantly iterate over with the same function. Most learning resources and optimizing compilers seem to ignore this fact.


I’ve wondered why there aren’t more tools for predicting how a program fits into cache lines and for modeling data-caching effects. For given CPU parameters it seems a reasonable task to estimate cache-line behaviour from a sample dataset. Am I just missing what tools are used out there?


The best tool for this in my experience is callgrind with assembly annotation. You can configure it to more or less mimic the cache layout of whatever particular chip you're running, and then execute your code under it.

You can use the start and stop macros in valgrind.h to show cache behaviour of a specific chain of function calls, like when a network event happens, then in the view menu of kcachegrind select IL Fetch Misses, and show the hierarchical function view.

It doesn't mimic the exact branch prediction or whatever of your architecture but when you compare it to actual timings it's damn close.


Wow, that's cool!


Why not just write the function in ASM in the first place?


1) Because the compiler gives you a clear reference implementation to test against for correctness and performance.

2) Because after you do this enough times, you will learn when to write your own, when not to, and when to spot inefficiencies in the compiler output. The point is to learn, both about how the instructions work and how the compiler works.

3) The C/C++ implementation serves as documentation of intent and is portable across architectures (including future x86-64 architectures). It's fucking atrocious when devs write pure assembly without a C/C++ reference that can replace it. To me, finding random assembly without a code implementation in the project is the ultimate indictment of a hot rod programmer not thinking about the future or future maintainers.


Can you talk about your day job?


> Does anyone even work with assembly anymore?

I do, but I very rarely write assembly. Usually I just read the output of the C++ compiler.

It’s hard to beat well-written, manually vectorized C++ code, even when the compiler is obviously doing something wrong: https://developercommunity.visualstudio.com/content/problem/... The compiler appears to know about typical latency figures for these instructions, i.e. it reorders them when it can, if that’s going to help.

> how is software keeping up with hardware advances?

Barely. The only way to approach the advertised compute performance of CPUs is SIMD instructions. Despite these being available in mainstream CPUs for a couple of decades now (the Pentium III launched in 1999), modern compilers can only auto-vectorize very simple code.

Fortunately, these instructions are available as C and C++ compiler intrinsics. They’re supported by all modern compilers and quite portable in practice (across different compilers building for the same architecture).
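To illustrate that portability (a sketch, assuming an x86 target): the same intrinsics source builds unchanged under GCC, Clang, and MSVC, needing only the <immintrin.h> umbrella header on all three.

```cpp
#include <cstddef>
#include <immintrin.h>  // umbrella header accepted by GCC, Clang, and MSVC

// 4-wide SSE dot product; n is assumed to be a multiple of 4 for brevity.
// The same source compiles on all three major compilers without changes.
float dot4(const float* a, const float* b, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    // Horizontal sum of the four lanes.
    acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
    return _mm_cvtss_f32(acc);
}
```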

> Do newer low-level languages like Rust and Go really utilize the massive advances that have taken place?

Both are worse than C++.

About what to start with… I would start by picking either CPU or GPGPU. GPUs have way more raw computing power. The programming models are entirely different between them.

If you pick GPU, I’d recommend the “CUDA by Example” book; it helped me when I started GPGPU programming. BTW, for GPUs, the assemblies and instruction sets are proprietary, i.e. no one works with assembly. There are low-level assembly-like things (Nvidia PTX, MS shader assembly), but these instructions are not executed by the hardware; the GPU driver compiles them once again into proprietary stuff.

I don’t know of good books for CPU SIMD; I started organically with some random articles and the Intel reference, but I had many years of C and C++ programming behind me when I did. Not sure it’ll work for you.


Can you share what you do for a living?


I've been developing software for a living since 2000.

Lately I've been working on CAD/CAM/CAE Windows desktop software, and also some embedded Linux. Before that I worked in game development, HPC (I haven't coded for supercomputers, just commodity servers, i.e. Xeon + Nvidia), and realtime multimedia (video processing, encoding, broadcasting).


Of course someone works with it. Someone has to write and maintain the tooling that eventually ends up as x86, right?


> Both are worse than C++.

How so?


If you need performance, you must embrace vector nature of the hardware.

Intel makes CPUs and invented many of these instruction sets; they also make a C++ compiler, and their implementation of intrinsics is what's been adopted by other compilers. They also provide decent documentation. Intel doesn't support any Go or Rust packages or language extensions.


My sibling already talked about the intrinsics, but I’d also like to point out that Intel does use Rust. https://github.com/intel/cloud-hypervisor for example. They also sponsor Rust conferences.


Does Intel support Rust versions of their intrinsics? If no, how’s that relevant?


I’m not even sure what that would mean. A compiler is a compiler, regardless of whether it’s Intel making it or not.

They’re the exact same thing as if you used clang.

(It’s only relevant because you seem to imply that Intel doesn’t care about Rust at all. That’s not true.)


> They’re the exact same thing as if you used clang.

Are you sure the function calls that crate adds on top of every single instruction (transmute(), as_i32x4(), etc.) compile into nothing? And do so reliably, i.e. every single time, regardless of the surrounding code?

BTW, functions built on top of intrinsics aren’t reliable in clang. I sometimes have to use compiler-specific trickery to force compilers to inline stuff, keep data in registers instead of loads/stores, and otherwise not screw up the performance.


In my understanding, if it does not, that’s a bug. Compiler bugs do happen. Sounds like you’ve hit a few with clang.


They don't need to support it. It's the exact same linking process as for a C binary.


You can use intrinsics from rust. You can use intrinsics from a lot of languages.

https://github.com/AdamNiederer/faster

In fact, in Rust, they are easier to use.


> You can use intrinsics from rust

Right, I know there’s some support.

> You can use intrinsics from a lot of languages.

Yes. However, Intel (the guys making the CPUs that actually implement these instructions) only supports them for C/C++. Just because you can use them from other languages (e.g. modern .NET has them as well, System.Numerics.Vectors) doesn’t necessarily mean it’s a good idea to do so.

> In fact, in Rust, they are easier to use.

That’s not “in fact”, that’s your opinion. Personally, I don’t think simple is good.

When I code at that level of abstraction, I want to get whatever instructions are implemented by CPU. No more, no less.

I’ve looked at the example on the front page. There are two methods to compute rsqrt in SSE/AVX: the fast approximate one (rsqrtps) and the precise one (sqrtps, divps). There are several methods to compute ceil/floor, again with different tradeoffs. Do you know which instructions their example compiles into? Neither do I.

Also, one tricky part of CPU SIMD is cross-lane operations (shuffle, movelh/movehl, unpack, etc.). Another is integers: the instruction set is not comparable to any programming language; there are saturated versions of + and -, relatively high-level operations like psadbw, pmaddubsw, palignr, and a lack of some simple things (e.g. you can’t compare unsigned bytes for greater/less, only signed ones).
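For instance, the missing unsigned-byte greater-than is commonly composed from what SSE2 does provide; a sketch of one well-known idiom (my helper name, not from the comment itself):

```cpp
#include <emmintrin.h>  // SSE2

// a > b for unsigned bytes: the saturating difference subs_epu8(a, b)
// is nonzero exactly when a > b, so test it against zero and invert.
__m128i cmpgt_epu8(__m128i a, __m128i b) {
    __m128i zero = _mm_setzero_si128();
    __m128i le   = _mm_cmpeq_epi8(_mm_subs_epu8(a, b), zero); // 0xFF where a <= b
    return _mm_xor_si128(le, _mm_cmpeq_epi8(zero, zero));     // invert the mask
}
```

The signed compare _mm_cmpgt_epi8 gets this wrong for bytes above 127, since it treats them as negative.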

For trivially simple algorithms that compute the same math on wide vectors of float values, you're better off using OpenCL and running on the GPU. It will likely be faster.


You can link the exact same intrinsics in a rust binary to get the same intrinsics.

> That’s not “in fact”, that’s your opinion. Personally, I don’t think simple is good.

When I grow up I want to be as smart as you.

And re: the faster project, you can use the exact same instructions. This is nothing about rust or not.

Also, Rust is simply a better language for low level things even ignoring intrinsics.


> You can link the exact same intrinsics in a rust binary to get the same intrinsics.

Intrinsics are not library functions. You don’t link them from anywhere. They’re processed by the compiler, not the linker, and for SIMD math each one usually becomes a single instruction. Linked function calls are too slow for that.

> This is nothing about rust or not.

When I code C and write y=_mm_rsqrt_ps(x), I know I’ll get my rsqrtps instruction. When I write y=_mm_div_ps(_mm_set1_ps(1), _mm_sqrt_ps(x)), I know I’ll get the slower, more precise version. I don’t want the compiler choosing one for me while converting a formula into machine code.
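The difference between those two variants is easy to check directly; a small sketch (my scalar wrappers, not the commenter's code):

```cpp
#include <xmmintrin.h>  // SSE

// Fast approximate reciprocal square root: a single rsqrtss,
// about 12 bits of precision (max relative error ~3.7e-4).
float rsqrt_fast(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

// Precise version: a full sqrt followed by a divide -- correctly
// rounded, but considerably slower than the approximation.
float rsqrt_precise(float x) {
    return _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(1.0f),
                                    _mm_sqrt_ss(_mm_set_ss(x))));
}
```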


Ok not link but include header.

You can do the same in rust. See the explicit section of the faster project.


> Ok not link but include header.

Sorry to disappoint but Rust can’t include C++ headers. Even if it could, they wouldn’t work, because intrinsics are not library functions.

> See the explicit section of the faster project.

These aren’t C intrinsics; they are library functions exported from the stdsimd crate, which in turn forwards them to LLVM. Requires Rust nightly. Also, I’m not sure that many levels of indirection are good for performance. You usually want these m128/m256 values to stay in registers. In C++, I sometimes have to write __forceinline to achieve that, or the compiler breaks performance by making function calls or referencing RAM.


> Sorry to disappoint but Rust can’t include C++ headers.

You are pedantic.

https://doc.rust-lang.org/1.29.0/std/arch/#static-cpu-featur...


That’s library functions from stdsimd crate.

Looks like significant overhead over C intrinsics. Two calls to transmute() for every instruction. And other calls for every instruction, stuff like as_i32x4.

It’s technically possible that every last one of them compiles into nothing at all and emits just the single desired instruction. I don’t believe these optimizations are 100% reliable, however. They aren’t reliable in clang or VC++; I sometimes have to use trickery to force compilers to inline stuff, keep data in registers instead of doing loads/stores, and otherwise not screw up the performance.


There's no transmute. Look I don't have time for this.


> There's no transmute.

That crate calls transmute twice, for every single instruction.

https://github.com/rust-lang-nursery/stdsimd/blob/master/cra...

https://github.com/rust-lang-nursery/stdsimd/blob/master/cra...


What he’s trying to say is that transmute is also an intrinsic.

They correspond to the machine instructions, that’s their entire purpose. That’s also why they’re intrinsics.


Check the actual compiler output.


(This is exactly how Rust’s intrinsics work.)


Agner Fog's page of optimization resources might have some things that interest you: https://www.agner.org/optimize/

One notable use of raw assembly is that Intel themselves contribute optimized strcpy etc. implementations to glibc. You might find it interesting to go see how the most recent ones work.


For high-performance bare-metal computing in 2019 on typical hardware, your elementary toolset will be an up-to-date C++ compiler and programming with vector intrinsics. Note: most performance these days comes from heavily optimizing memory locality (i.e. data structures); clever instruction sequences will gain little if locality is poor.

While you need to be adequately fluent in reading the assembly code generated by your compiler, you virtually never need to write it outside of some very rare cases if you are programming in C++, because almost everything is exposed either as intrinsics or implicitly as part of the language/library. Writing code that uses vector instructions is still a manual process via intrinsics; compilers are still poor at doing that automatically.

The set of vector instructions provided seems to have many weird gaps in it (e.g. only '>' and '==' operators for comparison in SSE) because you are expected to figure out how to logically compose most of the operators programmers are accustomed to in the vector domain.
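A sketch of that composition for SSE2 32-bit integers (helper names are mine): <, <=, and != built from the provided cmpgt/cmpeq.

```cpp
#include <emmintrin.h>  // SSE2

// SSE2 exposes signed 32-bit integer compares only as cmpgt and cmpeq;
// the remaining operators are composed from those two.
__m128i cmplt_epi32(__m128i a, __m128i b) {
    return _mm_cmpgt_epi32(b, a);                    // a < b  ==  b > a
}
__m128i cmple_epi32(__m128i a, __m128i b) {
    __m128i gt = _mm_cmpgt_epi32(a, b);              // a <= b  ==  !(a > b)
    return _mm_xor_si128(gt, _mm_set1_epi32(-1));
}
__m128i cmpneq_epi32(__m128i a, __m128i b) {
    __m128i eq = _mm_cmpeq_epi32(a, b);              // a != b  ==  !(a == b)
    return _mm_xor_si128(eq, _mm_set1_epi32(-1));
}
```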

The details will vary significantly with the type of high-performance code you are writing e.g. is it floating point numerics, integer domain, memory-hard, trivially parallelizable, etc. Different architectures are optimized for different kinds of codes, but a modern CPU with strong vector support likely gives the best performance across the broadest range of code types. Contrary to popular impression, most HPC is not doing linear algebra and quite a bit of it is integer code.

Becoming intimately familiar with the details of microarchitectures is hugely important to understanding how to optimally structure codes for them. Agner Fog's resources are a good starting point for understanding some of these issues.


Compiler engineers need to read assembly. I don't think too many people write lots of it by hand, unless they're extending packages like LAPACK or BLAS.

Software has probably been free-riding on hardware, but parts of the hardware have also not been keeping up with the rest: the big difficulty is the latency between memory and the processor.

Rust is closer than Go. Go isn't designed to make optimal code. It's there to make good enough code. And it does so very quickly.

In terms of resources, I really like Crafting Interpreters. Would love for you to take a glance at my book (Rust in Action, Manning), if you want to learn about other systems topics outside of compiler design.


Go also schedules goroutines N:M, which means it's not trivial to ensure that your code keeps running on the same CPU... which may be important when talking to hardware.


Why do people keep repeating this myth, when thread pinning is literally a one-liner?

> LockOSThread wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread. [...] A goroutine should call LockOSThread before calling OS services or non-Go library functions that depend on per-thread state.

Source: https://golang.org/pkg/runtime/#LockOSThread


It's usually enough, but not always. For example namespaces and new threads don't handle this well, even if you do LockOSThread https://www.weave.works/blog/linux-namespaces-and-go-don-t-m...


That has been fixed quite a while ago, in Go 1.10


If you haven't seen it yet, have a look at https://wiki.osdev.org/Expanded_Main_Page - it's probably the best first destination when looking for OS implementation info. It may not go too deep in many areas, but it has reference to other important places.

Assembly is still used where needed: kernels of media encoders (for performance), some interrupt handlers (for control), other things that shouldn't comply with ABIs for whatever reason, and things that can't be accessed from higher-level languages (architecture-specific registers).


You should check out https://software.intel.com/sites/landingpage/IntrinsicsGuide.... Besides that, it's probably a good idea to pick a concrete thing to implement, do an implementation in C/C++, and figure out, using something like godbolt, how to improve the generated assembly. HPC is not about assembly implementations at all; it's more about problem-specific algorithms and MPI.



It’s worth reading through Philipp Oppermann’s blog series (https://os.phil-opp.com/), where he writes about creating an OS in Rust. Not necessarily for the Rust part, but it has lots of nice info on how a modern system is bootstrapped.


For bare-metal coding it is best to get started with Embedded Programming using MCUs (ARM, AVR, 8051 etc.).

For x86 you may find the following useful;

1) Computer systems: A programmer's perspective by Bryant and O'Hallaron.

2) Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture by Igor Zhirkov.

3) Modern X86 Assembly Language Programming by Daniel Kusswurm.

4) Agner Fog's writings.

5) Ciro Santilli's x86 examples - https://github.com/cirosantilli/x86-bare-metal-examples


I find this course at Cambridge to be a wonderful resource for bare metal programming on ARM (Raspberry Pi): https://www.cl.cam.ac.uk/projects/raspberrypi/tutorials/os/i...

It starts off with booting a simple OS written in assembly from SD card, blinking LED and then continues to more advanced topics such as controlling the GPU/screen.


> Does anyone even work with assembly anymore?

Sometimes, on simple 8-bit architectures, for some interrupt routines where it can, under certain conditions, increase performance of I/O operations. And 99% of that is read -> write.

On more complex architectures like x86? Not really. I wouldn't even come close to modern compilers. Maybe for fun.

If you want to create a simple system, I would recommend just ignoring optimization and focusing on system design. That is probably the only thing that's agnostic about assembler.

You could also use a µC with an external programmer to get a basic system running. The mnemonics are often very similar to x86, and while optimizations are hard and in-depth knowledge of the architecture is required, the first step is probably the system design itself; the efficiency of the assembly is secondary. All that should come later.

For the very first steps, I would recommend Linux or Windows assembler and learning how to interface with the underlying OS. That could help harden the requirements for the design.


This is going to end up somewhat specific, as most low-level initialization routines are. ARM vs x86 vs RISC-V vs MIPS vs a BASIC stamp or propeller etc...

I will suggest studying some general assembly language concepts, memory locations, how to set up RAM timings and bring off-chip RAM into an address space, etc.

This stuff is used daily in the world of microcontrollers, I'm a huge fan of the Parallax propeller, in which the Spin interpreter in ROM launches Assembly routines on individual cores in about the simplest fashion possible...

Some things you may find interesting: https://github.com/dwelch67/raspberrypi

https://github.com/rsta2/circle

https://ultibo.org/

Think of the latter 2 as "HAL plus some primitives" rather than RTOS...


Go is garbage collected, and thus not remotely suitable for bare-metal work.

C, C++, Rust, possibly Ocaml, FORTH, FreePascal, etc... or else you are actually on about a virtual machine, not bare metal.

Learn how to create your own heap. Make your own malloc/free on a bare-metal device and that will be a huge boost for you, confidence-wise...
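As a first step in that exercise, here's a minimal bump allocator over a static arena (a toy sketch, not a full free-list malloc; on real bare metal the arena would typically come from the linker script, and all names here are mine):

```cpp
#include <cstddef>
#include <cstdint>

// A toy bump allocator: the simplest possible "heap". Allocation just
// advances an offset; individual free() is not supported -- only a
// whole-arena reset, which is often enough on small bare-metal devices.
namespace bump {

alignas(16) static std::uint8_t arena[4096];
static std::size_t offset = 0;

void* alloc(std::size_t size, std::size_t align = 8) {
    std::size_t p = (offset + align - 1) & ~(align - 1);  // round up
    if (p + size > sizeof(arena)) return nullptr;          // out of memory
    offset = p + size;
    return arena + p;
}

void reset() { offset = 0; }  // "frees" everything at once

}  // namespace bump
```

Graduating from this to a free list with coalescing is where most of the learning happens.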


> Go is garbage collected, and thus not remotely suitable for bare-metal work.

Others beg to differ, selling GC enabled development environments from small PICs to grown up ARM deployments.

http://www.astrobe.com/default.htm

http://www.microej.com/

https://www.aicas.com/cms/

https://www.ptc.com/en/products/developer-tools/perc

https://www.microdoc.com/ibm-websphere-everyplace-custom-env...

Some solutions even considered suitable for bare metal work by the US and French military.

And speaking of Go on bare metal, https://tinygo.org/


Well, are any of these actual bare-metal environments for coding? They're more of a HAL meets RTOS, and it's certainly not self-bootstrapping in Go, which is impossible on real hardware... or is it?


Astrobe is certainly bare metal, just like some of the Java deployment options, with AOT compilation to target boards, like Aicas and PTC are capable of.


Thanks, that's interesting and very new to me. I appreciate the links


If you want an example of GC on bare metal, all the way from building your own FPGA up to the graphics display, check the 2013 update from Project Oberon.

http://www.projectoberon.com/

https://inf.ethz.ch/personal/wirth/ProjectOberon/index.html

Sadly the ready made OberonStation boards are no longer on sale.


"Steel Bank Common Lisp: because sometimes C abstracts away too much"

https://www.pvk.ca/Blog/2014/03/15/sbcl-the-ultimate-assembl...


To be fair, you will learn a lot more getting a garbage-collected language to run bare metal, runtime and all.


Implementing it in a non-garbage-collected lower-level language, of course. Even FORTH's stack requires some ASM words to set it up... or am I missing something here?


I have a friend at NIU who's in the CS program and they're still teaching assembly there. I think with web dev and all the new flashy stuff that's readily available on top of all these foundational languages people are much more drawn to the immediacy of JavaScript and Python and other high-level languages.


Many paths to follow here. However, one thing that may help is to decide which platform you want to start with: something like the ever-popular Pi and friends, or a lower-powered platform such as the Arduino. Each platform will need a different approach due to its available resources.


This:

https://github.com/cirosantilli/x86-bare-metal-examples

He has a lot of them, and you can also check out his stack overflow answers.

> Does anyone even work with assembly anymore?

in infosec, definitely


> Does anyone even work with assembly anymore?

As a little bit of a different fun one:

I have a MicroPython project I'm building, a limited run toy, which live-assembles some assembler that it deploys to a co-processor which handles low power mode.

A real mix of very high level and very low level code.


I don't have any answers to your questions (you probably know more than me), but this thread seems like a good beginning of a support group for people programming on the bare metal.

I do so not for performance but for reducing dependencies, for building a more parsimonious stack at all levels: https://github.com/akkartik/mu/blob/master/subx/Readme.md. It's surprisingly pleasant. Lately I find myself thinking heretical thoughts, like whether high-level languages are worth all the trouble.


Probably the best resource you'll find these days is a virtualised environment: do most of your development in a VM or an emulator, get it all up and running, then port it to real metal.


Yes. Start with this, osdev.org, and gcc -S.


I mostly dabble in assembly language for fun, but it can be very useful to know how it works when you're patching binaries, or reverse engineering protection/encryption systems. Some of the binary reverser-challenges are great for that.

Otherwise the only things I've done recently have been writing a simple "compiler" for maths, converting "3 4 + 5 *" into floating-point assembly, or generating code for a toy-language.
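A postfix expression like that can be handled with a single operand stack; a minimal evaluator sketch (my own illustration — a real toy compiler would emit instructions at each operator instead of computing):

```cpp
#include <sstream>
#include <stack>
#include <string>

// Evaluate a postfix (RPN) expression such as "3 4 + 5 *".
// Each operator pops two operands and pushes the result; a code
// generator would emit e.g. SSE or x87 instructions here instead.
double eval_rpn(const std::string& expr) {
    std::istringstream in(expr);
    std::stack<double> st;
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            double b = st.top(); st.pop();
            double a = st.top(); st.pop();
            if      (tok == "+") st.push(a + b);
            else if (tok == "-") st.push(a - b);
            else if (tok == "*") st.push(a * b);
            else                 st.push(a / b);
        } else {
            st.push(std::stod(tok));  // operand: push it
        }
    }
    return st.top();
}
```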


The best book on x86 Assembly for beginners ever: https://savannah.nongnu.org/projects/pgubook/



