
The Lost Art of C Structure Packing - tadasv
http://www.catb.org/esr/structure-packing/?src=yc
======
herf
In some cases you can benefit from a "struct of arrays" instead of an "array
of structs". This improves packing, enables SIMD, and usually improves
persistence speed too. Database people call this a "column store". If you
abstract it a little, you can do per-column compression that is really hard to
do with vanilla structs.

One thing I learned with Picasa (which is basically an in-memory database): if
you're storing a lot of data in RAM, sometimes you should think of the data
structures more as a database (with copy in/out, rather than reference). In
multithreaded cases, this kind of approach unexpectedly gets you more
parallelism and speed.

~~~
TrainedMonkey
A column store really just means making sure the values for a column are
contiguous in memory, instead of row by row as is usually done. The basic
premise is that modern CPUs can chew through contiguous memory locations at
light speed, with a single core capable of scanning a gigabyte of RAM in a few
milliseconds[0].

This is extremely beneficial when your tables have a large number of columns
but you only want to do computations on a few of them. Columnar stores are
usually implemented using dictionaries, which allow for efficient duplicate
handling and blazing fast lookups. Often those dictionaries also contain run-
length optimizations, so values 1 through 100 are stored using only two
values: start-range 1 and end-range 100.

All in all, column stores are efficient at analytic workloads, but struggle
when a lot of inserts and updates need to happen.

[0] Using prefetch, properly aligned data, and likely other optimizations I am
not thinking of.

~~~
seiji
_but struggle when a lot of inserts and updates need to happen._

I ran across this with Redis recently. Redis has an internal data structure
called ziplist that's a doubly linked list with no pointers: it just uses byte
offsets to the next element, so all elements sit in one contiguous block of
memory. It's great because you can store a lot of values compactly without
needing 1 to 5 pointers (8 to 40 bytes) of overhead per element in your data
structure.

But a problem shows up when you want to insert or delete items. Every insert
or delete requires expanding or shrinking the entire solid-chunk-of-memory
allocation, which isn't the fastest thing in the world when your allocation is
large. Also, inserting at the HEAD requires copying the _entire_ ziplist up by
exactly one element position, because the start of your memory layout just
changed.

So, obviously these things are useful, but their usefulness degrades as your
solid blocks of memory hit cache limits. Everything is super fast when your
entire ziplist fits in L1, still about the same in L2, worse in L3, and
horrible if your ziplist grows any larger.

But what can we do? We can fix the entire problem by creating a traditional
linked list of solid-memory-block ziplists. Now each ziplist can remain
small (8 KB to 16 KB), and when we grow bigger, we just cut a new ziplist,
pointer-attach it to the previous ziplist, and we have a minimal-pointer,
high-locality data structure with unlimited growth potential and no
performance defects. It doesn't buckle under inserting or updating arbitrary
elements, since each memory block is isolated to a maximum ~8 KB size
(reminder: your L1 cache is 32 KB to 64 KB depending on architecture), and
your memory usage due to pointer overhead is also _greatly_ reduced
(assuming your data was small and dominated or matched by pointer overhead
size in the first place).

As a double bonus, since your data structure is now made of solid blocks of
memory, you can compress the individual contiguous blocks (usually with great
compression ratios) because the contents tend to be homogeneous in shape
inside the same container. Result: huge reduction in pointer overhead +
sequential access + compression = best of all possible worlds.

Other details/comparisons at
[https://matt.sh/redis-quicklist-visions](https://matt.sh/redis-quicklist-visions)

~~~
hyc_symas
This isn't news. Arrays are fast until you need to insert/delete in them, then
they get horrible. This is why we have B-trees. Every page in a B-tree is
essentially a sorted array - you can find items in it quickly using binary
search, and you can insert/delete in it relatively quickly because it has a
small bounded size. When you need to grow beyond a single page you do a page-
split and add a parent above it, etc. B-trees are the optimal implementation
of dynamic arrays, end of story.

~~~
crucini
>B-trees are the optimal implementation of dynamic arrays

Then why don't the popular scripting languages use B-trees for their sequence
containers?

Both Perl and Python use arrays for this. Both pay the price of O(n) inserts
mid-container.

~~~
TrainedMonkey
I think there are several reasons:

1\. The name "array" strongly implies an underlying contiguous memory
location associated with it. All modern scripting languages have some form of
dictionary, map, set, or hashtable which is implemented using an
insert/update-friendly approach.

1a. The typical workload for an array is to iterate over it doing something
with each element, and a B-tree is strictly worse at iterating over all
elements than a flat array.

2\. Most instantiated dynamic arrays rarely perform deletes. The most common
operation that could change the size of an array is push_back, and the
amortized cost for such inserts is probably way less than chasing pointers in
a B-tree.

3\. B-Trees were created and optimized for HDD performance. With current SSD
and memory trends, there are simply better data structures to use.

4\. Pointer chasing got a lot more expensive relative to contiguous memory
scanning. This is mostly because the majority of memory performance increases
have come from higher throughput rather than lower latency.

This does not mean B-trees are bad; in fact they are still heavily utilized
in databases because they scale extraordinarily well with data size. They are
commonly used for indexes because membership testing (single-value lookup) and
range lookups are O(log n), and updates/deletes are generally very fast as
well.

~~~
crucini
I think this is a really good answer. However I slightly differ on #1.

Python calls the default sequence container a list. If you took that
literally, it implies O(1) inserts.

------
ghshephard
This was the cause of my very first "hard" C Bug, in 1991 or so. I had written
a program in our Novell Netware labs to read in the printer control code
definitions for our printer-release consoles - being "clever", and wanting to
save a few lines of code, I read them into a memory structure, and overlaid a
C-Struct on top of them that mapped precisely to the fields that I wanted.
Everything worked fine, code compiled, and we were able to read all the job
definitions until I handed it off to the team responsible for the rest of the
release console - at which point the code just started breaking. No longer
worked. For the life of me I couldn't figure out what was going on, until our
team leader took a glance and made it clear that I needed to tell the
compiler to byte-align the C structs (which was likely the default behavior in
Turbo C, but not Watcom C).

Really opened my eyes to the many, many things that I didn't know.

------
geronimogarcia
This was discussed like a year ago in HN
[https://news.ycombinator.com/item?id=6995568](https://news.ycombinator.com/item?id=6995568)

~~~
brudgers
Per its revision history, the guide was first published on 1/1/14. The
current revision is only a few months old, and besides, articles on important
C programming techniques by big toads like Raymond are more or less evergreen
content.

~~~
adwn
I'm not a native English speaker, so "big toad" might be a term whose sarcasm
escapes me, but there are probably only about two dozen people in the world
who actually believe that Eric S. Raymond is in any way an authority when it
comes to programming (one of them being Eric S. Raymond). The following comic
strip from "Everyone loves Eric S. Raymond" sums it up pretty well:
[http://geekz.co.uk/lovesraymond/wp-content/images/ep013.jpg](http://geekz.co.uk/lovesraymond/wp-content/images/ep013.jpg)

~~~
GFK_of_xmaspast
Raymond's primary talent is self-promotion.

~~~
CyberDildonics
The John Romero of oss

~~~
ANTSANTS
John Romero was making shareware games for years before he met John Carmack.
He wrote the level editors and much of the gameplay code, created many levels
for, and significantly contributed to the game design of, all the id games
from Commander Keen to Quake. His absence is arguably a big part of why most
of the later id games just aren't as _fun_ as Doom. Daikatana bombed for a lot
of reasons (zero experience as a manager, dotcom-era ridiculousness in hype
and project scope, an infamously horrible marketing campaign that he had no
part in, etc.), but not because John Romero was an incompetent programmer or
game designer.

If you have the time, watch this series in which John Romero and (Bioshock
level designer and apparent Doom fanboy) Jean-Paul LeBreton play through the
first episode of Doom and analyze its level design in depth:

[https://www.youtube.com/watch?v=ygp4-kmjpzI](https://www.youtube.com/watch?v=ygp4-kmjpzI)

~~~
alxmdev
Thanks for linking to that video, it was fun to watch. I've looked at the 90s'
id games differently ever since I read Masters of Doom 10 years ago (due for a
re-read soon) - learning their history made me appreciate those games on a
whole new level. Their story was a very big inspiration and motivation boost
for my own modest projects.

------
Dav3xor
Decent article full of good information. One exception -- the C standard is
vague on how bitfields are implemented and the compiler can rearrange them in
any way it pleases. You are not guaranteed efficient packing, placement order,
or size.

~~~
kps
It's not quite that loose. Order of elements is guaranteed, same as any other
structure members, and packing is guaranteed _if_ the members fit. Whether bit
fields can cross word boundaries, and endianness, are implementation defined,
so bit fields are still not usable where there's any interoperability
constraint.

------
jgrahamc
Sometimes this sort of stuff can really bite you:
[http://blog.jgc.org/2007/04/debugging-solaris-bus-error-
caus...](http://blog.jgc.org/2007/04/debugging-solaris-bus-error-caused-
by.html)

------
drv
The article goes into more detail on the "why", but the "art" is not really
that complicated or lost: sort structure elements by size/alignment, biggest
first.

~~~
_pmf_
> sort structure elements by size/alignment, biggest first

This does not work if I'm using structure packing to align my structs with an
existing protocol (i.e. a protocol that does not have the kind of "holes" that
an unpacked struct might have).

~~~
maxlybbert
I find this a very strange response: "this technique doesn't work if I have to
comply with an external standard." Fine, then don't use the technique.

Although, I believe the more common approach is to define two structs: one
using the packed standardized layout, and one using a layout more suited for
whatever you're actually doing. In that case, you may wish to consider sorting
the elements by size and alignment for the internal use-only layout.

------
lzybkr
Most commonly used compilers have options to help find padding that is
possibly not needed.

VC++ has a couple of undocumented (but well-known and discussed) options,
/d1reportAllClassLayout and /d1reportSingleClassLayout.

GCC has -fdump-class-hierarchy.

Clang has -cc1 -fdump-record-layouts.

~~~
vonmoltke
Those are all C++ compiler options.

~~~
lzybkr
I implemented the VC++ options - they work in C.

I have no experience with the others, I've just briefly read about them and
assume they are similar. If they don't work in C and you would find the dumps
useful, try compiling as C++.

~~~
vonmoltke
VC++ is fundamentally a C++ compiler that has spotty support for C features
introduced after ISO C90. GCC and clang are probably better.

That said, many non-trivial C codebases cannot simply be compiled with a C++
compiler.

~~~
lzybkr
I take it back, the VC++ options are C++ only (faulty memory, it's been too
long.)

There is a warning I added which does work in C
([https://msdn.microsoft.com/en-us/library/t7khkyth.aspx](https://msdn.microsoft.com/en-us/library/t7khkyth.aspx)).
As I recall, this warning is sometimes useless (unless it's been improved
upon), so it's best to turn it on only when you're investigating packing.

You make a good point about C code not compiling as C++ as much as it used to,
but that's probably less true for VC code.

And I should point out - often, for the purposes of tuning, you can extract
just what you need from some code, get that small chunk of code compiling as
C++, and leverage your compiler's dumps.

Last thing - debuggers also know the layout of your objects. windbg's 'dt'
command can show you the object layout - I'm sure other debuggers have a
similar command.

------
pja
If anyone wants to take a look at the struct offsets inside their own code,
the article mentions pahole as a useful tool, but I found that it chokes on
recent C++; the DWARF libraries it relies on can't cope with lambdas and
various other things, IIRC.

However, helpful people have written an extension to gdb which does something
roughly equivalent & lets gdb do all the heavy lifting of parsing the struct
data from the binary debugging information.

pahole.py ships with Fedora, but since Debian doesn’t include it for some
reason I’ve kept my own version around, based on something I grabbed from the
gdb mailing lists some time ago:

[https://github.com/PhilArmstrong/pahole-gdb](https://github.com/PhilArmstrong/pahole-gdb)

~~~
throwabob412
pahole is fantastic. If it worked on modern clang-generated binaries, it
would still be the best; this entire article could be replaced with "use
pahole."

Anyway, the problem with pahole is that it is heavily dependent on libdwarves,
a custom wrapper around libdwarf by the pahole author. libdwarves has
bitrotted since 2010-2011, and newer DWARF extensions produce aborts. Hurray.

I started a similar tool in C just using libdwarf directly. It's not complete
and/or perfect, but it works on some binaries that pahole does not. See
[https://github.com/cemeyer/structhole](https://github.com/cemeyer/structhole)
. (And it's < 500 LoC and BSD-licensed. Please use as you will and contribute
patches if you are interested in improving it.) Cheers.

~~~
comex
The following is not very useful since I never bothered to polish it up or
write any documentation, but in case anyone is curious, it's a similar tool I
wrote that focuses on performance, outputs JSON, and uses no external
libraries. I needed to dump struct info from a binary I was trying to exploit,
which had debug info available in DWARF format but was huge (gigabytes).
pahole not only took about half an hour to get through it, it leaked memory
such that it took up most of my 16 GB of RAM by the end, which was immensely
frustrating; my tool does it in a few seconds. (There's also a Windows PDB
version, which is older and probably broken.)

[https://github.com/comex/fastdbg/blob/master/fastdwarf.c](https://github.com/comex/fastdbg/blob/master/fastdwarf.c)

------
elros
Interesting to discover that this is a "lost art".

I didn't graduate, but I did attend PUC-Rio for a while and I remember
professor L. F. Bessa Seibel's lecture about that, as part of the INF1008
class (Introduction to Computers' Architecture), which is considered a basic
first or second semester course for undergraduate CS students.

That was just 5 years ago.

Great class, great lecturer, btw

------
t1m
I am going to suggest that 'Structure Packing' isn't the lost art. I think
the lost art is 'Variable-Sized Structs', with emphasis on the last 's' of
'Structs'.

This is the situation where your solution calls for an array of structs that
are contiguous in memory but aren't the same size. This happens oftenish in
database, OS, and compression tech. C99 and gcc have legalized the often-used
convention:

    
    
      struct varb {
          char *foo;
          unsigned char age;
          int number_of_bars;
          char var[];  // it's something like var[0] in GCC, or var[1] in versions < c99
      };
       

where the last element is actually the start of the variable length piece. In
all the C supported implementations of var length structs, we have
limitations:

\- the variable part must be declared last

\- the struct may not appear in other structs or arrays

which totally makes sense, but doesn't help us implement a contiguous array of
variable length structs. Also, the restriction of having the variable part of
the struct come last may waste a lot of space if we have a structure where we
would rather optimize the order of the struct ourselves, but that's another
post!

Assuming we are OK with the limitations of our _struct varb_, we now need to
declare and _malloc_ (or _mmap_) a _char *_ (to hold all of our data) and lay
out our _varb structs_ fairly manually. Once we have found the start of a
particular _varb struct_, we can cast its _char *_ address to a _struct
varb *_ and the compiler can then help us populate or interrogate our data.
Note that when populating we will need to keep track of detailed memory usage
'so far', as well as the padding and alignment of adjacent _structs_, and any
reallocation when our _struct_ array needs to grow.

~~~
blt
I'm working on this type of problem currently. C++ gives you some nice tools
to deal with it but I can imagine it would be a real pain in C.

------
devbug
Rust does structure packing automatically.

This is great for the general programmer (someone who does not know or care
about details like this).

Unfortunately, it blows for people who do (someone like me).

~~~
Ded7xSEoPKYNsDd
Why does it blow for you? If the compiler can do it automatically, that seems
very useful.

Or is it doing it suboptimally? (Say you want two fields in the same cache
line, but the compiler sorts them away.)

~~~
devbug
I know better than the compiler how my data is used. It is really as simple
as that. I can optimize with the totality of the program in mind, while the
compiler optimizes on a heuristic like `order largest to smallest`. So yes,
the compiler is doing it sub-optimally. Not to mention the repetitious hell it
creates when writing C bindings, memory-mapping structures, and so on and so
forth.

In the end, however, my reservations are mostly due to a "get off my lawn"
mentality.

~~~
spiritplumber
Are you Mel Kaye?

~~~
devbug
Hah!

Only when the difference in performance is end-user discernible. I do make
sure the code is well crafted within those bounds, though. There's no _valid_
excuse for crafting ugly code.

Coincidentally, the stuffed toy I kept throughout my childhood was named "Mel
the Pal." I just noticed the irony of that. :)

------
wila
The article doesn't appear to mention it, but in the past I had a need for
this type of information when interfacing C data structures from other
programming languages.

If the structure you are connecting with is packed you have to be aware of
that or else you'll end up with data soup :)

------
Animats
In Pascal, you could just declare a structure as "packed", and this packing
happened automatically. "packed array [0..n] of Boolean" is an array of bits.
That feature has not been copied in more recent languages.

~~~
adrusi
I imagine it can cause problems with ABI compatibility.

I think Jonathan Blow mentioned some kind of language-level support for
structure packing in the language he's designing.

~~~
tomyws
He gives a fantastic walkthrough on his approach to structure packing in the
data-oriented demo[1].

[1]
[https://www.youtube.com/watch?v=ZHqFrNyLlpA](https://www.youtube.com/watch?v=ZHqFrNyLlpA)

------
noelwelsh
One of the many reasons Rust is interesting is that you can play these games
if you are interested in performance or size:
[http://doc.rust-lang.org/book/ffi.html#interoperability-with-foreign-code](http://doc.rust-lang.org/book/ffi.html#interoperability-with-foreign-code)

------
girvo
Is it really a lost art? I've recently been learning C and C++, and all three
of the textbooks I've been studying mention it, and two go very in-depth on it
(one of those is a games programming textbook). This is a great resource
though; I'm definitely going to add it to my studies.

------
unwind
This was rather readable, more so than I've come to expect from that
particular source.

Still, I must point out that using a signed 1-bit bitfield (there are _lots_
of those in that article), is generally a bad idea. Consider for instance
something like:

    
    
        struct foo6 {
          int bigfield:31;      /* 32-bit word 1 begins */
          int littlefield:1;
        };
    

That "littlefield" member is signed but has a size of just one bit, which
means it is limited to representing the two values 0 and -1 (in two's
complement). This is very seldom useful. The general rule of thumb is to
always make "boolean"-type bitfields have an unsigned type, for this reason.

------
source99
I'm guessing it's a lost art because available system memory is much larger
and many programs are written in other languages these days.

~~~
munificent
> I'm guessing it's a lost art because available system memory is much larger

This is true, but locality also matters a lot more these days. Minimizing
padding keeps more of your structs in a single cache line. That means doing
this not only makes your program use less memory, but it can also be faster,
in some cases significantly.

> many programs are written in other languages these days.

True, but most of those languages or their VMs are themselves written in C, so
it still matters at some point. :)

------
TazeTSchnitzel
A good portion of PHP 7's performance improvements came from better structure
packing:

[https://wiki.php.net/phpng-int](https://wiki.php.net/phpng-int)

[http://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html](http://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html)

~~~
plug
nikic is an incredible asset to the PHP community. I feel slightly less bad
about my own programming abilities now that he is no longer a teenager ;)

Everyone likes to rag on PHP, which is fair enough in some regards, but it
has become a far better language recently, with more useful features, faster
development cycles, etc. People like nikic seem to be at the forefront of
this.

Since PHP is a major part of my day job this all makes me very happy!

------
sprw121
Or we could use __attribute__((packed))?

~~~
wvenable
That directly trades size for less efficient code. The structure packing
described in the article keeps the same code efficiency while decreasing the
size.

There are good reasons (as mentioned in the article) for this feature but it's
not equivalent to manual structure packing.

------
jheriko
i definitely agree with the 'lost art' part. i've been stunned to find
programmers who don't understand this and make arguments like "the compiler
optimises that for you"... so much so that i've written my own pieces on this,
more than once (and in more or less detail in various places - i.e. covering
bitfields):

[http://jheriko-rtw.blogspot.co.uk/2011/02/know-your-cc-struct-layout-rules.html](http://jheriko-rtw.blogspot.co.uk/2011/02/know-your-cc-struct-layout-rules.html)

[http://jheriko-rtw.blogspot.co.uk/2012/01/c-struct-layout-optimisation-revisited.html](http://jheriko-rtw.blogspot.co.uk/2012/01/c-struct-layout-optimisation-revisited.html)

------
CountHackulus
How about the lost art of COBOL record packing, where COBOL records are like
a combination of C's unions and structs? It was great: you could sometimes
squeeze a few extra bits out of one case to use better in another.

------
jobu
Years ago I worked on a few projects that used complicated structure ordering,
but even then it was my understanding that (with the right options) compilers
could optimize these things better than most people could by hand.

~~~
pcmonk
This is something the compiler can't optimize very well. For one thing, the
spec requires that they appear in the given order.

If this wasn't the case, a significant amount of code would break. I've seen a
good bit of code where there's a struct A, and then struct B begins with the
same fields as in struct A. This allows you to treat either of them as a
struct A, a kind of polymorphism. The compiler doesn't know whether you're
going to do that, so it can't arbitrarily rearrange the members of the
structs.

------
shultays
Aren't the elements of structures sorted before packing even starts? For
example: char c; int i; char c2;

I thought that in reality c2 starts right after c; instead of adding padding
between each element, you would only need it at the boundaries.

~~~
wila
The compiler doesn't care much how your structure element is named. The naming
scheme is for the benefit of the developer. Once the program is compiled the
variable is referenced by memory address, not variable name. Sorting them by
name would not help there.

~~~
dlp211
I don't think (s)he meant sorted alphabetically, but rather, sorted in such a
way as to minimize memory usage, ie: packed.

This is obviously not the default in C, but can be enabled via compiler
extensions. It's also important that the compiler doesn't automatically do
this since structs of the same type can be of different sizes. In order for
this to work, the memory layout needs to be as originally defined in order to
be correct. For an example of this see the simple dynamic string library used
in redis.

~~~
Genmutant
I don't think gcc or clang can reorder structs, or do they?

~~~
dlp211
Yep, you're right, I was mistaken. It's been a while since I looked at all
the extensions and I thought reordering was one of them; I should have
rechecked before I made my comment.

------
gambiting
I actually got a question about this in my programming interview for a games
development company. And yes, I have had to use it several times since. So
no, the art is definitely not lost :-)

------
pfortuny
Very interesting and very detailed, and very informative and also, useful. So,
a great read.

------
PSeitz
This is no lost art; this is pretty well known among proficient C developers.
There is even a Wikipedia article:
[http://en.wikipedia.org/wiki/Data_structure_alignment](http://en.wikipedia.org/wiki/Data_structure_alignment)

------
jongraehl
I guess we're probably ok with the Y2118 cvs-export bug.

------
user_rob
I have always found that typedef does an admirable job of packing the data.

------
frozenport
This is a standard warning message in PVS-Studio. Not sure why the compiler
doesn't warn about it.

~~~
Ded7xSEoPKYNsDd
Gcc and clang both have -Wpadded. It's not in -Wall or -Wextra though.

