My solution in the final trace was a minimal (300 sloc) libc with a total of two lines of assembly (to bootstrap syscalls) and didn't do any syscalls beside read, write, and exit. I've also toyed with the idea of testing exit via error (SIGSEGV, SIGILL, etc) because cleanup might be faster on a forced exit.
Now imagine the speed if you could get rid of the syscall overhead. You would have to write a lot more code (and it probably wouldn't run on Linux due to trying to access protected memory locations or IO ports), but it would be significantly faster.
Good old Autocode. Finally you didn't need to enter all instructions by number, it was considered a great leap forward at the time...
The first autocode and its compiler were developed by Alick Glennie in 1952 for the Mark 1 computer at the University of Manchester and is considered by some to be the first compiled programming language. His main goal was to make the programming of Mark 1 machine, known for its partcularly abstruse machine code, comprehensible. Although the resulting language was much clearer than the machine code, it was still very machine dependent.
Funny that I felt the scripting side of while reading some projects like wmii, or some plan9 utils. I'm under the impression that in UNIX early days, C was almost a rapid prototyping / scripting language.
I had to implement buffered IO due to syscall overhead (the CTF solution called fgetc() and relied on libc's buffering for speed). My first working non-buffered test took on the order of seconds and made several thousand syscalls. When I mentioned optimistically buffered IO, that was "just flush on EOF, don't even look for newlines". Reducing the syscall count definitely helped.
It's interesting to note that Go doesn't depend on libc. However, I rather doubt the Go runtime bootstraps any faster than libc, so it probably wouldn't help here. One consequence is that Linux and libc are tightly coupled and some of the "syscall" behavior you depend on is actually implemented as a wrapper in libc. This has caused some behavioral differences between Go and C that have surprised the language developers.
It also interesting to note that in contrast Rust stdlib does depend on glibc. It does not even work on musl, that is on the wishlist: https://github.com/rust-lang/rfcs/issues/625
Of course you can use Rust without stdlib, but that leaves you with fairly barebones language.
My pre-C solution was Go. It bootstrapped pretty fast, but I actually ran into a problem where my large embedded data structures (hash table, bloom filter, etc) took a very long time (hundreds of milliseconds?) to initialize. I guess it has a startup cost for embedded data. Go also has a slightly larger binary size, which may or may not have affected startup time on the CTF worker boxes.
Linux has very non Posixy ideas about threads, which are basically processes with shared memory. libc is supposed to present a Posix compatible interface. See [1] for some examples. Go simply exposes raw syscalls.
Micro-benchmarks lead to optimizing at the wrong level. In the real world, if process startup time is a bottleneck in your application then you are creating too many processes too quickly.
You're absolutely right about the majority of the time in the real world. A ctf level is a contrived case, but that doesn't make this any less interesting for me. This was mostly a challenge of "how fast can I run my existing program?" and I think I nailed it pretty well with lib43 [0] (which is a working extremely minimal libc and crt0 replacement in about 300 lines of code).
My favorite part is my x86_64 linux syscall interface, in which I define a short assembly stub (syscall.c) and simply list syscall names (syscall.list) to wire them up. The calling convention between user-space functions and the kernel was close enough I was able to reduce the inline asm to two instructions (`mov rax, syscall num; syscall;`).
"...huck the whole thing and rewrite it in assembly."
What are the functions in libc that you truly need and use?
How many of those routines are you incapable or unwilling to rewrite yourself?
I have asked myself these questions on many occasions.
The cruft laden operating system I use (which I imagine by many people's standards is "bare bones") is hopelessly dependent on "libc". But I find that the userland programs I truly need can be written using only a minimal subset of the many functions that are compiled into it.
Could I write any of these in assembly and create my own "libasm" to do the syscalls? This is a project I have contemplated many times.
To motivate myself to keep writing more assembly in the age of scripting language du jour madness, I write a small program in assembly language and then write the same program in C and then step through each compiled program (e.g., with ald). Seeing some of the overhead is an effective motivator.
It's sometimes startling how major foundational tools and systems ignore bloat.
For example, I wanted to use C++11's std::array to return fixed-lengths arrays by value on the stack. This was in a minimal API that originally included NO other headers. In MSVC, std::array included a tree of 100 other headers including gems such as istream, exception, new, malloc, and float.
Yes, I'm aware that none of this stuff gets compiled into the binary, but it shows a disregard for lean dependencies that is emblematic of a bigger problem.
From our (STL maintainer) perspective, <array> includes <algorithm> because it needs equal(), <iterator> because it needs reverse_iterator, and <tuple> because it needs to provide get(). Minimal!
But yeah, it drags in a whole tree of headers. Minimizing this is possible, and we've taken steps to do so in the past, but it's a lot of work for possibly minimal benefit - user translation units tend to drag in many STL headers, especially if they're using precompiled headers like they should. We think our time is better spent fixing correctness bugs and implementing features.
I understand. We definitely want C++14 features first :) But it seems like some of the problems would be easy to fix. Like - istream_iterator is defined in <iterator>, so <iterator> needs to include <istream>, which has a huge subtree. Iterators are a much more "leaf" idea than input streams. Why doesn't <istream> include <iterator> instead?
Or, <algorithm> gets most of its subtree via <memory>, which it needs for a few algorithms that use temporary buffers. That one is beyond your control... The standard should add an <algorithm_lightweight> header containing only algorithms that don't need extra memory. Or you could add your own such header to use internally.
> Why doesn't <istream> include <iterator> instead?
Users must be able to include <iterator> by itself, and get istream_iterator.
We do break things up into internal headers, we just haven't done that as much as possible. Come to think of it, equal() is defined in one of our central headers, so we should probably take advantage of that in <array>.
Musl libc has every function in a separate file, so if you statically link you just get what you use, modulo a number of functions that use each other. So building it yourself is not going to be much better.
Just a remark: If you find it confusing to scatter your code over too many small files with only one function each, look up "-ffunction-sections" on the gcc manual.
Both you and tptacek understand what it is I want. Apparently the musl author is on the same page too. Very good.
But note my goal is not to achieve "better" than what someone else has written.
My goal is having more control over the result and to practice assembly language programming. And... to eliminate some of the "overhead" I see.
Big difference between my handwritten assembly and what gcc generates. Regardless of which is "better" one is more succinct.
Perhaps I drifted a bit from the original blog post; I think the author may have just been trying to find a speed increase, rather than chasing after some sort of minimalist aesthetic.
This is cool. I've been working on something similar[0]. It only takes a few more lines to add some of the other useful libc/posix/sysv things. My eventual goal is to rewrite asmutils'[1] httpd [2] in C and get a binary about 2-3x in size (2-3K).
strace -tt /tmp/test
15:38:33.245549 execve("/tmp/test", ["/tmp/test"], [/* 16 vars */]) = 0
15:38:33.246436 arch_prctl(ARCH_SET_FS, 0x601070) = 0
15:38:33.246617 set_tid_address(0x6010a0) = 17227
15:38:33.246875 exit_group(0) = ?
15:38:33.247060 +++ exited with 0 +++