
Embedding Binary Objects in C - ingve
https://flak.tedunangst.com/post/embedding-binary-objects-in-c
======
haberman
If we're in the realm of "non-standard linker tricks"...

Linkers will concatenate sections of the same name. You can use this trick
to produce a concatenation of arrays across several files:

    
    
        $ cat t1.c
        __attribute__((section("some_array"))) int a[] = {1, 2, 3};
        $ cat t2.c
        __attribute__((section("some_array"))) int b[] = {4, 5, 6};
        $ cat t.c
        #include <stdio.h>
        
        extern const int __start_some_array;
        extern const int __stop_some_array;
        
        int main() {
          const int* ptr = &__start_some_array;
          const int n = &__stop_some_array - ptr;
          for (int i = 0; i < n; i++) {
            printf("some_array[%d] = %d\n", i, ptr[i]);
          }
          return 0;
        }
        $ gcc -std=c99 -o t t.c t1.c t2.c && ./t
        some_array[0] = 1
        some_array[1] = 2
        some_array[2] = 3
        some_array[3] = 4
        some_array[4] = 5
        some_array[5] = 6
    

This is the mechanism that linkers use "under the hood" to get a list of C++
object initializers that need to run pre-main().

It's unfortunate that there is no standard way of getting at this
functionality in portable C, or to get it in C++ without actually running code
pre-main(). Sometimes you really want a linker-initialized list of things (of
some sort) that have been linked in, _without_ actually running code
pre-main() (which has all kinds of issues).

I would love to see a thing like this standardized in both C and C++. C++
compilers need it under the hood anyway.

~~~
qppo
Can you give an example of a use case where this is needed?

~~~
codys
One place it's used is to avoid central "registration" for multiple "things".
For example, consider a program with many fairly separate "subcommands". Using
this mechanism allows adding a new subcommand simply by linking in a new
source file (without changing a shared source file to list the new command).
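
The subcommand case can be sketched with the same section trick. This is a minimal sketch, assuming GCC or Clang targeting ELF with a GNU-style linker that defines the `__start_`/`__stop_` symbols; `struct subcommand`, `REGISTER_SUBCOMMAND`, and `find_subcommand` are hypothetical names made up for illustration. Pointers, rather than the structs themselves, go into the section so that every entry has the same size and alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command descriptor; the names here are illustrative only. */
struct subcommand {
    const char *name;
    int (*run)(int argc, char **argv);
};

/*
 * Define a descriptor and place a pointer to it in the "subcommands"
 * section. `used` keeps the otherwise unreferenced pointer from being
 * dropped by the compiler.
 */
#define REGISTER_SUBCOMMAND(ident, fn)                                \
    static const struct subcommand ident##_cmd = { #ident, fn };     \
    static const struct subcommand *const ident##_entry              \
        __attribute__((used, section("subcommands"))) = &ident##_cmd

/* Defined by the linker because the section name is a valid C identifier. */
extern const struct subcommand *const __start_subcommands[];
extern const struct subcommand *const __stop_subcommands[];

/* Look up a subcommand by name; returns NULL when none was linked in. */
const struct subcommand *find_subcommand(const char *name)
{
    const struct subcommand *const *p;

    assert(name != NULL);
    for (p = __start_subcommands; p != __stop_subcommands; p++) {
        if (strcmp((*p)->name, name) == 0)
            return *p;
    }
    return NULL;
}

/* Each of these could live in its own source file. */
static int run_hello(int argc, char **argv) { (void)argc; (void)argv; return 0; }
static int run_version(int argc, char **argv) { (void)argc; (void)argv; return 1; }
REGISTER_SUBCOMMAND(hello, run_hello);
REGISTER_SUBCOMMAND(version, run_version);
```

In a real program each `REGISTER_SUBCOMMAND` line would sit next to its handler in a separate file, and `main` would only call something like `find_subcommand(argv[1])`.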

~~~
berti
I use this idea in embedded work, where the code is split into "modules". The
same firmware source is shared amongst various differing pieces of hardware,
each of which utilise a different subset of modules.

------
tyingq
There's also GNU's objcopy.

This post covers a lot of different approaches, and has lots of tips:
[https://www.devever.net/~hl/incbin](https://www.devever.net/~hl/incbin)

~~~
regularfry
I got about 90% of the way to having a working ruby single-file executable
builder which used objcopy to embed a sqlite database of the source files into
an MRI build. Then YARV happened and the ruby build chain changed _just
enough_ that I needed to throw it out and start again.

Every now and again I ponder having another go with mruby...

~~~
rkeene2
FWIW, this is roughly how Tclkits work in the Tcl world. Although by default
they use a Metakit database instead of SQLite.

Currently they append the database to the end of the executable, which has
some problems and I'm working to make including it in the image more
standardized as part of XVFS [0].

[0]
[https://chiselapp.com/user/rkeene/repository/xvfs/](https://chiselapp.com/user/rkeene/repository/xvfs/)

------
JoeAltmaier
I needed this last week! Building an embedded firmware image for a dashboard
display, with lots of PNG files for icons etc. 61 of them.

The original developer wrote a tool to expand the PNGs to BMPs (arrays of
32-bit pixel values) and generate a C array definition as a text file. Which
is lots bigger than the original PNG (13K => 100K sort of thing). Then
included that C source in the build. Used up 700K of my firmware image space
which was only 1MB to begin with.

So I wrote a tool to represent the raw PNG file as a C declaration, then added
parts of libpng to the boot code to decompress the images during
initialization. Even with libpng, I saved over 400K. Now the images use RAM
instead of ROM, but that's ok I had buttloads of RAM.

Anyway, this is a much slicker way of including binary files in an image. I
may go back and change my build.

~~~
hrydgard
There are even smaller png libs than libpng, try a stripped down stb_image for
example. Wouldn't use that for user-supplied images, but since you control
them all it should be fine.

~~~
teunispeters
I do this with embedded graphics in boards for a lot of reasons. Works great -
and STB is pretty easy to modify if you need a funny one (eg : monochrome pbm
for small displays). Animated gif, png, bunch of others all work pretty
smoothly. Just include the bits you need so it can be even smaller.

------
ddevault
I wrote this tool a long time ago, which allows you to embed files and then
access them as a "filesystem" with stdio. May be of interest:

[https://git.sr.ht/~sircmpwn/koio](https://git.sr.ht/~sircmpwn/koio)

If you just want the contents of the file as a symbol, the approach described
in the article is 100% the way to go.

------
saagarjha
Or, as any CTFer worth their salt will tell you,

    
    
      __asm__(".incbin \"file\"");

~~~
gpvos
CTF = ?

~~~
cosarara
A capture the flag challenge: [https://ctftime.org/ctf-wtf/](https://ctftime.org/ctf-wtf/)

------
rgovostes
One of the Capture the Flag challenges at CCC last year was to figure out a
way to leak the contents of a file on a remote compiler server. The server
would accept some C code and just give you a boolean true/false value whether
the code compiled or not---without ever executing it.

Others came up with a solution abusing the C preprocessor by defining macros
that would make the known structure of the file valid C and therefore they
could just #include it. But my solution works with arbitrary files, without
knowing the structure beforehand.

As others have pointed out here, you can use inline assembly and the .incbin
directive to include a file. But how could that influence whether the
compilation succeeds or fails? I figured out how to guess a byte of the file
and create metadata sections only accepted by the linker if the guess was
correct.

[https://ryan.govost.es/2019/12/18/compilerbot.html](https://ryan.govost.es/2019/12/18/compilerbot.html)

~~~
saagarjha
We didn't actually use .incbin on that challenge, interestingly; however, we
did use it (along with some nested static constructor function trickery) for
Online Calc from TokyoWesterns CTF. For that we had some straightforward
tricks to get around the forbidden characters, then we abused the flag format
to get the #include to work. After that we could leak the flag byte-by-byte
using Linux's BUILD_BUG_ON_ZERO, which is basically an upgraded static_assert.
I think this is the code we ran:

    
    
      """
      _Pragma("clang diagnostic push")
      _Pragma("clang diagnostic ignored \\"-Wtrigraphs\\"")
      ??= define STRINGIZE(...) ??=__VA_ARGS__
      ??= define EXPAND_AND_STRINGIZE(...) STRINGIZE(__VA_ARGS__)
      ??= define hxp EXPAND_AND_STRINGIZE(
      ??= define BUILD_BUG_ON_ZERO(e) (sizeof(struct <% int:-!!(e); %>))
      ??= define BUILD_BUG_ON(condition) ((void)BUILD_BUG_ON_ZERO(condition))
      const char flag =
      ??= include "flag"
      )[{}];
      BUILD_BUG_ON(flag == {});
      _Pragma("clang diagnostic pop")
      """
    

(The {}, of course, being Python format parameters from our script.)
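
The compile-or-fail signal can be shown in isolation. The sketch below is modeled on the Linux kernel macro and relies on a GNU C extension (a struct with no named members), so it assumes GCC or Clang; `GUESS`, `ACTUAL`, and `guess_is_wrong` are hypothetical names, and both bytes are plain macros here rather than data pulled in with `#include`.

```c
#include <assert.h>

/*
 * Kernel-style compile-time check: a nonzero `e` gives a bit-field of
 * width -1, which is a hard error; a zero `e` gives a zero-width
 * bit-field and the whole expression evaluates to 0.
 */
#define BUILD_BUG_ON_ZERO(e) ((int)sizeof(struct { int : -!!(e); }))

#define GUESS  'x'   /* hypothetical guessed byte */
#define ACTUAL 'h'   /* hypothetical real byte, known at compile time */

/*
 * This translation unit builds only while the guess is wrong, which is
 * the one bit of information a "did it compile?" service leaks.
 */
int guess_is_wrong(void)
{
    int r = 1 + BUILD_BUG_ON_ZERO(GUESS == ACTUAL);

    assert(r == 1); /* runtime counterpart, for contrast; always holds */
    return r;
}
```

Unlike `_Static_assert`, the macro is an expression, so it can be placed inside other expressions; changing `GUESS` to `'h'` should make the build fail.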

------
hikarudo
My favorite is simply using a tool to create a C file with your binary data:

    
    
      static uint8_t mydata[] = {0xDE, 0xAD, 0xBE, 0xEF, ... };
    

The advantage is that it works anywhere, with any compiler.

The disadvantage is that it can increase compile time. I would limit each .c
file to 10MB; it seems there's a quadratic increase in build time with file
size, at least with gcc.

Also, instead of "0x%02x" I use decimal notation and no spaces in order to
decrease the .c file size:

    
    
      static uint8_t mydata[] = {
      42,7,105,0,0,...,
      8,0,1,20,...,
      ...,
      };
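
A generator for the compact decimal form might look roughly like this. It is only a sketch: `format_c_array` is a hypothetical helper, it writes into a caller-supplied buffer instead of a file to keep the example short, and it returns 0 when the buffer is too small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format `n` bytes as a compact C array definition into `dst`.
 * Returns the number of characters written (excluding the NUL),
 * or 0 if `cap` is too small. Note that n == 0 would produce an
 * empty initializer, which is not valid before C23.
 */
size_t format_c_array(char *dst, size_t cap, const char *name,
                      const unsigned char *data, size_t n)
{
    size_t len;
    int w;

    assert(dst != NULL && name != NULL && (data != NULL || n == 0));

    w = snprintf(dst, cap, "static const unsigned char %s[]={", name);
    if (w < 0 || (size_t)w >= cap)
        return 0;
    len = (size_t)w;

    for (size_t i = 0; i < n; i++) {
        /* Decimal with no spaces: at most four characters per byte. */
        w = snprintf(dst + len, cap - len, "%u,", (unsigned)data[i]);
        if (w < 0 || (size_t)w >= cap - len)
            return 0;
        len += (size_t)w;
    }

    /* The trailing comma before the closing brace is legal C. */
    w = snprintf(dst + len, cap - len, "};\n");
    if (w < 0 || (size_t)w >= cap - len)
        return 0;
    return len + (size_t)w;
}
```

For the bytes 42, 7, 105, 0 and the name `blob`, this yields `static const unsigned char blob[]={42,7,105,0,};` followed by a newline.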

~~~
KMnO4
I've never noticed the "quadratic" build time increase you mention, so I did a
test[0]. Files ranging in size from 1 MB to 50 MB, 3 trials each. These are the
results, and they look _absolutely_ linear[1].

[0]: [https://pastebin.com/Z9329xkc](https://pastebin.com/Z9329xkc)

[1]: [https://i.imgur.com/I3XSBDg.png](https://i.imgur.com/I3XSBDg.png)

~~~
RealityVoid
Nice, I love it when people actually try it out. In the end, I much prefer the
method outlined in the article, less processing overall, no need to transform
it into an array nor compile the array into an object afterwards. You just
skip 2 steps and bring it to an object directly!

------
quietbritishjim
Where did they define the symbols `_binary_quine_c_start` and
`_binary_quine_c_end`? I would expect that the symbol names would need to be
passed to the command that produced the object file from the binary file:

    
    
        ld -r -b binary quine.c -o myself.o -m elf_amd64

~~~
CraneWorm
> I would expect that the symbol names would need to be passed to the command
> that produced the object file from the binary file:

It's the other way around. The command produces these symbols based on the
file name:

$ readelf --symbols myself.o

    
    
       Symbol table '.symtab' contains 5 entries:
          Num:    Value          Size Type    Bind   Vis      Ndx Name
            0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
            1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
            2: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT    1 _binary_quine_c_start
            3: 0000000000000113     0 NOTYPE  GLOBAL DEFAULT  ABS _binary_quine_c_size
            4: 0000000000000113     0 NOTYPE  GLOBAL DEFAULT    1 _binary_quine_c_end

------
zevv
Totally not relevant for this post, but this is how we do this in Nim:

    
    
        const a = readFile("mydata")
    

The `const` makes the expression evaluate at compile time, conveniently
slurping the file into `a`, where it will be available at run time.

~~~
mikepurvis
That is pretty cool, but also not the behaviour I'd expect coming from other
languages where this would be "get the contents of this file from the current
directory of my runtime environment, and assign them to immutable variable
`a`."

~~~
zevv
Understandable; there's `let` for run time and `const` for compile time
evaluation.

------
pdw
Unfortunately, this technique is a bit problematic with modern C compilers.
Because the `start` and `end` symbols are unrelated objects, as far as the
compiler is concerned, the subtraction `&end - &start` to get the length of
the data invokes undefined behavior. Just for that reason, I feel the include
file with a hex dump is the better method.

~~~
ghostpepper
Can you elaborate on this? Does "unrelated" have any special meaning in this
context? Why is this operation undefined?

~~~
saagarjha
In C, accessing data outside the bounds of an object is undefined (like, you
can’t legally “go past the end” of an array and end up in some other object).
The “start” and “end” pointers, as far as the compiler is concerned, point to
totally different objects, so it may optimize out a loop from one pointer to
the other, since incrementing an address can never legally take it from
pointing at one object to pointing at another.

~~~
ghostpepper
So is the idea that just because the two objects happen to appear sequentially
in memory under a certain implementation, the compiler (or linker in this
case?) has no obligation to ensure that assumption holds?

~~~
saagarjha
Correct. (Although, the pointers being defined that way seems to make it quite
unlikely that the compiler could optimize this incorrectly…)
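
One way people sidestep the pointer subtraction is to convert both addresses to integers first, which moves the question from undefined to implementation-defined behavior. A minimal sketch, assuming a flat address space where the integer value is the address; `span_bytes` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Byte distance between two addresses, computed on integers rather
 * than by subtracting pointers to (formally) unrelated objects.
 */
size_t span_bytes(const void *start, const void *end)
{
    uintptr_t lo = (uintptr_t)start;
    uintptr_t hi = (uintptr_t)end;

    assert(hi >= lo);
    return (size_t)(hi - lo);
}
```

With the article's symbols declared as `extern const char _binary_quine_c_start[]` and `_binary_quine_c_end[]`, the length would be `span_bytes(_binary_quine_c_start, _binary_quine_c_end)`.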

------
dirtydroog
C++ has a proposal for std::embed

[http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p104...](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1040r0.html)

~~~
beefhash
C has a proposal for #embed.

[http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2499.pdf](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2499.pdf)

~~~
jws
In the section on search paths…

 _It follows the same implementation experience guidelines as #include by
leaving the search paths implementation defined, with the understand that
implementations are not monsters and will generally provide…_

Really? We are going to "understand that implementations are not monsters"
after what they've done with "undefined"? I think maybe these standards should
be written from the perspective that implementations are sociopathic demons
summoned against their will.

(But I really want this feature. I regularly use xxd and some Makefile rules
to embed assets in my executables. For instance, in a web service I might have
all my default configuration and templates in the executable with command line
options to send them to standard output and override them from external files.
Then on the chance someone needs to make a change they can just make their own
file and use it.)

~~~
rwj
May I ask what's wrong with using xxd? I don't have an opinion about C, but
C++ is already pretty complicated, and adding features to the language when
there are already well-known solutions doesn't seem wise.

(For the record, I'm aware that there are size limitations when using xxd, but
there are also other solutions).

~~~
beefhash
Windows doesn't ship it, so now your build system got even more complicated on
Windows.

~~~
rwj
That's true, but I put that in the bucket of "C++ needs a package manager so
that we can use dependencies more complicated than a single header file".

------
kazinator
I wouldn't do it this way, within the toolchain. Better to just append the
data to the finished executable, with a header that contains an identifying
marker. The program can scan its own image and look for that marker to get at
the data. That will work on any OS with any executable and linker format and
can be done post-production (users or downstream distributors can receive the
binary and add customized data to it without requiring a dev environment with
a toolchain, just a tiny utility you can bundle with the program).

Scanning for the marker can be avoided, if we do the following:

    
    
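       /* inside the program, at file scope; needs <stdint.h> */
    
       struct {
         char marker[16];  /* exactly 16 bytes, no terminating NUL */
         uint32_t offset;  /* zero until the utility patches it */
       } binary_stuff = { "dW5pcXVlbWFyawo=", 0 };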
    

Then your tiny utility program opens the executable and looks for the
sequence "dW5pcXVlbWFyawo=" inside it. Having found it, it puts the offset of
the data into the 32-bit offset field which follows, and writes the data at
that offset.

When the program runs, if it sees a nonzero value in binary_stuff.offset, it
can open its image and directly proceed to that offset to read the stuff or
map it: no searching.
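
The patching step of that utility can be sketched on an in-memory copy of the executable. This is an illustration under assumptions, not a finished tool: `find_marker` and `patch_offset` are hypothetical names, the offset field is assumed to follow the marker with no padding, and host byte order is used, so the patcher and the program must share endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MARKER     "dW5pcXVlbWFyawo="  /* the 16-byte marker from above */
#define MARKER_LEN 16

/* Return the position of the marker inside `img`, or -1 if absent. */
long find_marker(const unsigned char *img, size_t len)
{
    assert(img != NULL);
    /* Require room for the 32-bit field that follows the marker. */
    for (size_t i = 0; i + MARKER_LEN + sizeof(uint32_t) <= len; i++) {
        if (memcmp(img + i, MARKER, MARKER_LEN) == 0)
            return (long)i;
    }
    return -1;
}

/*
 * Store `payload_offset` in the field right after the marker.
 * Returns 0 on success, -1 if the marker was not found.
 */
int patch_offset(unsigned char *img, size_t len, uint32_t payload_offset)
{
    long pos = find_marker(img, len);

    if (pos < 0)
        return -1;
    memcpy(img + pos + MARKER_LEN, &payload_offset, sizeof payload_offset);
    return 0;
}
```

A real utility would read the file into `img`, call `patch_offset` with the original file length as the offset, write the image back, and then append the payload.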

~~~
marcan_42
This isn't portable. Executable files aren't binary blobs, they're a
structured format. If you append data to an executable there is no guarantee
it's actually going to end up mapped in memory for you. You'd have to put it
into an actual ELF segment, and at that point you're back to using the linker.

It's a lot more sensible to ask the linker to do this as in OP than to hack
together something like you've described.

~~~
kazinator
> _If you append data to an executable there is no guarantee it's actually
> going to end up mapped in memory for you._

I didn't state it clearly enough, but I didn't say anything about it being
mapped. It almost certainly isn't mapped. Loaders do not blindly map the whole
thing to memory; then you would end up with debug info unconditionally mapped.

Even if it were mapped, the program wouldn't easily find it with the latter
approach I described: the offset given in the structure is measured from the
start of the file, not from some address in memory.

> _ask the linker_

It is not portable either. For instance, it doesn't work with Microsoft's
linker which is called link.exe.

------
nemanjaboric
According to
[https://news.ycombinator.com/item?id=22865842](https://news.ycombinator.com/item?id=22865842),
the #embed proposal is in the review process:
[http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2499.pdf](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2499.pdf).
A similar (same?) proposal has also been presented to WG21 and is in review for
inclusion in the C++ standard.

The author puts a lot of work towards making this proposal a reality (as far
as I as a casual twitter/slack observer can see) and I'm looking forward to
it.

~~~
Jasper_
Unfortunately, the proposal has been stalled by the C++ committee and the
author is uninterested in continuing it. See their post here:
[https://thephd.github.io/full-circle-embed](https://thephd.github.io/full-circle-embed)

~~~
saagarjha
Is Circle something that is actually coming to C++?

------
abaines
You may want to add -z noexecstack or linking the object will give your
program an executable stack.

e.g: ld -r -b binary -z noexecstack input.bin -o output.o

------
damnyou
This is simply the include_bytes! macro in Rust. There's also the include_str!
macro which does a compile-time check for UTF-8 validity.

~~~
mhh__
I'm pretty sure this is trivial in D too although you have to tell the
compiler the file exists (you can't do arbitrary FS reads at compile time)

------
nicoburns
In Rust: `let foo = include_bytes!("path/to/file")`. And people wonder why I
think C is difficult.

~~~
jfkebwjsbx
C does not have the feature (yet), it is not about being difficult or not.

~~~
nicoburns
The fact that C doesn't have the feature makes it difficult if I want to
perform that task!

~~~
jfkebwjsbx
Thing is, embedding binary files is not that useful, and there are easy enough
ways to do it for those few that really needed it.

The C standard has been very cautious about adding features that would make
the standard and the compilers more complex since it is the lingua franca of
computing.

Nowadays there is really just a handful of archs in use and their compilers
are all backends for GCC/LLVM, so they are relaxing a bit the gating of
features.

------
tyingq
Found Chromium is using a shell script that calls od and sed.
[https://chromium.googlesource.com/chromiumos/platform/ec/+/m...](https://chromium.googlesource.com/chromiumos/platform/ec/+/master/util/bin2h.sh)

------
mmm_grayons
There's also Drew DeVault's Koio:
[https://git.sr.ht/~sircmpwn/koio](https://git.sr.ht/~sircmpwn/koio)

It works pretty well, though I don't believe it works on Windows.

------
samrat
I am in India and I cannot access the site. The DNS resolves to
`23.227.131.12` but the server doesn't seem to respond to pings either.

I can access the site through a VPN.

Anyone else facing the same issue? Anyone know what's up with that?

~~~
balnaphone
Same, from Canada. My guess is that it will come back up later.

------
bigfoot
Related Twitter discussion:
[https://twitter.com/taviso/status/1250518818197245953](https://twitter.com/taviso/status/1250518818197245953)

------
felixguendling
The linked page is currently not available. Archive link:
[https://web.archive.org/web/20200416183057/https://flak.tedu...](https://web.archive.org/web/20200416183057/https://flak.tedunangst.com/post/embedding-binary-objects-in-c)

There are also ways to do this on Windows and Mac OS X.

This library provides CMake support to do this:
[https://github.com/motis-project/res](https://github.com/motis-project/res)

~~~
MaxBarraclough
Do you know how it compares with CMakeRC? See
[https://news.ycombinator.com/item?id=22888879](https://news.ycombinator.com/item?id=22888879)

------
shaklee3
Ted has one of the best, and longest running tech blogs I've seen, and it
covers such a wide array of topics. If you're reading Ted, thanks.

------
pieterk
If you do this with a dynamically loaded library that you _write_ to during
execution, can C become a dynamic language?

~~~
simias
Any Turing-complete language can be as dynamic as you want if you're motivated
enough. After all, many interpreted languages are implemented in C...

Many dynamic recompiler/JIT implementations effectively do something like what
you describe: they map some executable portion of memory and output native
code that is then executed.

------
andrewshadura
Back in the day Borland shipped a tool called binobj which did exactly this.

------
megous
Using GNU assembly .incbin is also quite flexible. And you can use .dc.* and
.asciz and .byte to add arbitrary data inline.

Generating assembly files is my favorite way of including outside data into my
C programs.
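
For a feel of the `.incbin` route without a separate assembly file, the directive can also be wrapped in file-scope inline assembly. A rough sketch, assuming GCC or Clang on an ELF target with a GNU-compatible assembler; the symbol names `embedded_start`/`embedded_end` and the helper `embedded_size` are made up here, and the file embedded is simply this source file via `__FILE__` so that the example needs no extra input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Emit the file's bytes into .rodata between two global labels.
 * __FILE__ must be a path the assembler can open from the current
 * working directory.
 */
__asm__(
    ".pushsection .rodata\n"
    ".globl embedded_start\n"
    "embedded_start:\n"
    ".incbin \"" __FILE__ "\"\n"
    ".globl embedded_end\n"
    "embedded_end:\n"
    ".popsection\n");

/* Declared as arrays so the names evaluate to the label addresses. */
extern const unsigned char embedded_start[];
extern const unsigned char embedded_end[];

/* Size of the embedded data, computed on integer addresses. */
size_t embedded_size(void)
{
    uintptr_t lo = (uintptr_t)embedded_start;
    uintptr_t hi = (uintptr_t)embedded_end;

    assert(hi >= lo);
    return (size_t)(hi - lo);
}
```

On Mach-O or PE targets the section directive and symbol prefixes differ, so this would need adjusting there.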

------
RealityVoid
Heh, I used this trick 2 years ago when I went to Hackaday Belgrade and did a
port of a NES emulator on the badge. The ROM that was being played would be
compiled in using the same basic method. Neat!

------
ryanmjacobs
Damn I want to read this, but looks like it's down? 4/16/20 ~2 PM PST

------
soylentgraham
Used to do this for embedding stuff into homebrew GBA roms! :)

------
endless90
Is it possible to automate this with cmake?

~~~
tambre
You may want to check out the CMRC library (CMake Resource Compiler) [0].

[0]: [https://github.com/vector-of-bool/cmrc](https://github.com/vector-of-bool/cmrc)

~~~
MaxBarraclough
Neat. Looks a little more complex than it needs to be though, and I'm
surprised it targets C++ (and even relies on exceptions) with no support for
C.

