C Runtime Overhead

justincormack · on Jan 28, 2015

Static linking to Musl libc is pretty minimal, not sure you really need assembly

strace -tt /tmp/test

15:38:33.245549 execve("/tmp/test", ["/tmp/test"], [/* 16 vars */]) = 0

15:38:33.246436 arch_prctl(ARCH_SET_FS, 0x601070) = 0

15:38:33.246617 set_tid_address(0x6010a0) = 17227

15:38:33.246875 exit_group(0) = ?

15:38:33.247060 +++ exited with 0 +++

lunixbochs · on Jan 28, 2015

My solution in the final trace was a minimal (300 sloc) libc with a total of two lines of assembly (to bootstrap syscalls) and didn't do any syscalls beside read, write, and exit. I've also toyed with the idea of testing exit via error (SIGSEGV, SIGILL, etc) because cleanup might be faster on a forced exit.

falcolas · on Jan 28, 2015

Now imagine the speed if you could get rid of the syscall overhead. You would have to write a lot more code (and it probably wouldn't run on Linux due to trying to access protected memory locations or IO ports), but it would be significantly faster.

Aah, the promise of the Exokernel[1].

[1] http://u.cs.biu.ac.il/~wiseman/2os/microkernels/exokernel.pd...

SixSigma · on Jan 28, 2015

The HPC guys use OS-Bypass to avoid system calls. [0]

Ron Minnich presented a paper at the Plan9 conference

Using Currying and process-private system calls to break the one-microsecond system call barrier [1]

Which is a technique he's been using on Blue-Gene. There are some more papers linked from the IBM Research website. [2]

[0] http://www.networkworld.com/article/2317965/tech-primers/os-...

Sorry the conference repository server is awol, all I could find is this :

[1] https://www.yumpu.com/en/document/view/7777923/using-curryin...

[2] http://researcher.watson.ibm.com/researcher/view_group_pubs....

pjmlp · on Jan 28, 2015

Well, when I started programming, C and Pascal execution speed were looked down like managed runtimes are still nowadays.

Serious application development for consumer computers was done in straight Assembly. Anything higher than that was "scripting". :)

scott_s · on Jan 28, 2015

Back when digital, electronic computers were new, assembly was called "automatic programming".

smhenderson · on Jan 28, 2015

Good old Autocode. Finally you didn't need to enter all instructions by number, it was considered a great leap forward at the time...

The first autocode and its compiler were developed by Alick Glennie in 1952 for the Mark 1 computer at the University of Manchester and is considered by some to be the first compiled programming language. His main goal was to make the programming of Mark 1 machine, known for its partcularly abstruse machine code, comprehensible. Although the resulting language was much clearer than the machine code, it was still very machine dependent.

From: https://en.wikipedia.org/wiki/Autocode

agumonkey · on Jan 29, 2015

Funny that I felt the scripting side of while reading some projects like wmii, or some plan9 utils. I'm under the impression that in UNIX early days, C was almost a rapid prototyping / scripting language.

blt · on Jan 28, 2015

The promise of TempleOS!

lunixbochs · on Jan 28, 2015

I had to implement buffered IO due to syscall overhead (the CTF solution called fgetc() and relied on libc's buffering for speed). My first working non-buffered test took on the order of seconds and made several thousand syscalls. When I mentioned optimistically buffered IO, that was "just flush on EOF, don't even look for newlines". Reducing the syscall count definitely helped.

malkia · on Jan 28, 2015

And The Interrupt List!!! (Ralph Brown's)

slashdev · on Jan 28, 2015

It's interesting to note that Go doesn't depend on libc. However, I rather doubt the Go runtime bootstraps any faster than libc, so it probably wouldn't help here. One consequence is that Linux and libc are tightly coupled and some of the "syscall" behavior you depend on is actually implemented as a wrapper in libc. This has caused some behavioral differences between Go and C that have surprised the language developers.

zokier · on Jan 28, 2015

It also interesting to note that in contrast Rust stdlib does depend on glibc. It does not even work on musl, that is on the wishlist: https://github.com/rust-lang/rfcs/issues/625

Of course you can use Rust without stdlib, but that leaves you with fairly barebones language.

lunixbochs · on Jan 28, 2015

My pre-C solution was Go. It bootstrapped pretty fast, but I actually ran into a problem where my large embedded data structures (hash table, bloom filter, etc) took a very long time (hundreds of milliseconds?) to initialize. I guess it has a startup cost for embedded data. Go also has a slightly larger binary size, which may or may not have affected startup time on the CTF worker boxes.

ufo · on Jan 28, 2015

can you give an example?

justincormack · on Jan 28, 2015

Linux has very non Posixy ideas about threads, which are basically processes with shared memory. libc is supposed to present a Posix compatible interface. See [1] for some examples. Go simply exposes raw syscalls.

[1] http://ewontfix.com/17/

neonscribe · on Jan 28, 2015

Micro-benchmarks lead to optimizing at the wrong level. In the real world, if process startup time is a bottleneck in your application then you are creating too many processes too quickly.

lunixbochs · on Jan 28, 2015

You're absolutely right about the majority of the time in the real world. A ctf level is a contrived case, but that doesn't make this any less interesting for me. This was mostly a challenge of "how fast can I run my existing program?" and I think I nailed it pretty well with lib43 [0] (which is a working extremely minimal libc and crt0 replacement in about 300 lines of code).

My favorite part is my x86_64 linux syscall interface, in which I define a short assembly stub (syscall.c) and simply list syscall names (syscall.list) to wire them up. The calling convention between user-space functions and the kernel was close enough I was able to reduce the inline asm to two instructions (`mov rax, syscall num; syscall;`).

[0] https://github.com/lunixbochs/lib43

101914 · on Jan 28, 2015

"...huck the whole thing and rewrite it in assembly."

What are the functions in libc that you truly need and use?

How many of those routines are you incapable or unwilling to rewrite yourself?

I have asked myself these questions on many occasions.

The cruft laden operating system I use (which I imagine by many people's standards is "bare bones") is hopelessly dependent on "libc". But I find that the userland programs I truly need can be written using only a minimal subset of the many functions that are compiled into it.

Could I write any of these in assembly and create my own "libasm" to do the syscalls? This is a project I have contemplated many times.

To motivate myself to keep writing more assembly in the age of scripting language du jour madness, I write a small program in assembly language and then write the same program in C and then step through each compiled program (e.g., with ald). Seeing some of the overhead is an effective motivator.

blt · on Jan 28, 2015

It's sometimes startling how major foundational tools and systems ignore bloat.

For example, I wanted to use C++11's std::array to return fixed-lengths arrays by value on the stack. This was in a minimal API that originally included NO other headers. In MSVC, std::array included a tree of 100 other headers including gems such as istream, exception, new, malloc, and float.

Yes, I'm aware that none of this stuff gets compiled into the binary, but it shows a disregard for lean dependencies that is emblematic of a bigger problem.

StephanTLavavej · on Jan 28, 2015

From our (STL maintainer) perspective, <array> includes <algorithm> because it needs equal(), <iterator> because it needs reverse_iterator, and <tuple> because it needs to provide get(). Minimal!

But yeah, it drags in a whole tree of headers. Minimizing this is possible, and we've taken steps to do so in the past, but it's a lot of work for possibly minimal benefit - user translation units tend to drag in many STL headers, especially if they're using precompiled headers like they should. We think our time is better spent fixing correctness bugs and implementing features.

blt · on Jan 28, 2015

I understand. We definitely want C++14 features first :) But it seems like some of the problems would be easy to fix. Like - istream_iterator is defined in <iterator>, so <iterator> needs to include <istream>, which has a huge subtree. Iterators are a much more "leaf" idea than input streams. Why doesn't <istream> include <iterator> instead?

Or, <algorithm> gets most of its subtree via <memory>, which it needs for a few algorithms that use temporary buffers. That one is beyond your control... The standard should add an <algorithm_lightweight> header containing only algorithms that don't need extra memory. Or you could add your own such header to use internally.

StephanTLavavej · on Jan 29, 2015

> Why doesn't <istream> include <iterator> instead?

Users must be able to include <iterator> by itself, and get istream_iterator.

We do break things up into internal headers, we just haven't done that as much as possible. Come to think of it, equal() is defined in one of our central headers, so we should probably take advantage of that in <array>.

tptacek · on Jan 28, 2015

You for the most part don't need to rewrite anything. You can take the source code for most libc functions and compile them independently.

justincormack · on Jan 28, 2015

Musl libc has every function in a separate file, so if you statically link you just get what you use, modulo a number of functions that use each other. So building it yourself is not going to be much better.

cnvogel · on Jan 28, 2015

Just a remark: If you find it confusing to scatter your code over too many small files with only one function each, look up "-ffunction-sections" on the gcc manual.

101914 · on Jan 29, 2015

Both you and tptacek understand what it is I want. Apparently the musl author is on the same page too. Very good.

But note my goal is not to achieve "better" than what someone else has written.

My goal is having more control over the result and to practice assembly language programming. And... to eliminate some of the "overhead" I see.

Big difference between my handwritten assembly and what gcc generates. Regardless of which is "better" one is more succinct.

Perhaps I drifted a bit from the original blog post; I think the author may have just been trying to find a speed increase, rather than chasing after some sort of minimalist aesthetic.

oso2k · on Jan 29, 2015

This is cool. I've been working on something similar[0]. It only takes a few more lines to add some of the other useful libc/posix/sysv things. My eventual goal is to rewrite asmutils'[1] httpd [2] in C and get a binary about 2-3x in size (2-3K).

[0] https://github.com/lpsantil/rt0

[1] http://asm.sourceforge.net/asmutils.html

[2] https://github.com/leto/asmutils/blob/master/src/httpd.asm

tedunangst · on Jan 28, 2015

Something funny about the numbers here. The assembly stub took 8ms, but the whole program only takes 1ms?

darshan · on Jan 28, 2015

That was a typo -- the assembly stub took about 0.8 ms as shown in the trace.

lunixbochs · on Jan 28, 2015

Thanks, fixed.