How to zero a buffer

jhallenworld · on Sept 4, 2014

Slightly OT since it has little to do with security, but fighting the optimizer is something FPGA Verilog and VHDL designers must also master.

If you don't use an the result of some logic it will be optimized out. One way to prevent this is to route it to a pin.

If logic is fed by a constant, it will be optimized out right up to the point where the result of the logic is mixed with some external input. (early tools could not use the dedicated reset net due to this- reset for each flip-flop had to be routed to a pin or the reset net was optimized out which means the initial state of your flip-flop is lost).

If you have identical logic, one copy is optimized out due to aggressive CSE. This is often bad for performance (routing in an FPGA is as slow as logic, so it's better to regenerate identical results in multiple places), so you add "syn_maxfan" constraints to prevent the "optimization".

On the other hand, an input flip flop will be duplicated if the fanout limit is exceeded- but this prevents the use of the dedicated I/O cell flip flop which then causes external timing to be messed up. So you use syn_maxfan=infinite for this case.

legulere · on Sept 5, 2014

Why would you want your FPGA to have circuits that aren't used? And why don't you want constant expressions to be pre-calculated by the optimizer?

exDM69 · on Sept 5, 2014

I have the exact same question.

I presume this is for some incremental development work. Like testing and seeing the number of gates/pins used for a design but you need to feed the logic with constant placeholders.

Someone1234 · on Sept 4, 2014

Can someone explain this line:

      static void * (* const volatile memset_ptr)(void *, int, size_t) = memset;

I've written some C but that is utter gibberish to me.

akavel · on Sept 4, 2014

As usually in C, you start from the name, then try to go to the right until you hit parenthesis, then go to the left until parenthesis, rinse and repeat.

So, first, the name:

    memset_ptr

Then, try going to the right, but aha! before we can really gain speed, a parenthesis is immediately blocking us. We shrug off the bruises and turn to the left:

    (* const volatile memset_ptr)

"Hey guys, memset_ptr is a volatile const pointer..." Now, we hit left parenthesis, so we're again allowed to go right, yay!...

    (* const volatile memset_ptr)(void *, int, size_t)

"...to a function taking such-and-such arguments..." Uh, oh, the equal sign, so no more to read to the right; disappointed, we turn back to the left for the final run:

    static void * (* const volatile memset_ptr)(void *, int, size_t)

"...returning a pointer to void! Hah, got you! Simple, really. No arrays, no pointers to pointers, not even a function pointer returning a function pointer, meh. Uh, oh, aaaaand, yes, by the way, the variable is static, so, like, file-local, um. Yeah, yeah, I saw it from the beginning, oh, go away, you're just picky. And, and, you wouldn't recognize a function returning a pointer to an array of pointers to functions returning anonymous struct even if it hit you in the face, pfff!"

neilxdsouza · on Sept 5, 2014

This technique is described in Peter van der Linden's book "Expert C Programming : Deep C Secrets" (and also explained really well by you).

stavros · on Sept 5, 2014

Is "void *" really something you can pass?

joelangeway · on Sept 5, 2014

Yes, it refers to a memory location, without implying anything about the semantics of the bits located at that location. You can't dereference or assign to the location because you don't know the type at that location. You can however assign that pointer to a typed pointer variable to actually read or write to that memory location. This is useful when you really care about the bits of memory but you're variable pointing to that memory could just as well be an (int64_t ) as a (char ) and those types are not interchangeable with each other, only with (void ). So library functions that just care about memory locations, not the semantics of the bits there, take (void ).

Some of this may be technically incorrect. This is my own mental model of the C language which is sometimes incomplete.

akavel · on Sept 5, 2014

I'd say it may be somewhat helpful to realize, that both "void" and " void * " are kinda wild cards in C's type system; they are there, but they're "breaking the rules". And " void * " is not exactly the same to "void", as " char * " is to "char".

stavros · on Sept 5, 2014

Hmm, I'm a bit confused. Isn't that a function call, rather than a function declaration? If it's a function call, it's passing a bunch of types in, which I thought was not valid C?

akavel · on Sept 5, 2014

Ah, now I get your question. So, it is a function declaration, not a function call. Um, sorry: a declaration of a pointer to a function, where this function would take as arguments: some (unnamed) void pointer, some (unnamed) int value, and some (unnamed) size_t value; and would return a void pointer.

Um; then there's the equal sign, so this is not only a declaration, but a definition too; but definitely not a call.

A call is further down in the original blogpost, in the below line:

    (memset_ptr)(p, 0, len);

stavros · on Sept 5, 2014

Oh! A pointer to a function! Aha, I didn't know that was valid, thanks!

Gracana · on Sept 4, 2014

cdecl is really handy for stuff like this. Try it out:

http://cdecl.ridiculousfish.com/?q=static+void+*+%28*+const+...

Hello71 · on Sept 4, 2014

http://cdecl.org/

cperciva · on Sept 4, 2014

memset_ptr is a const (we can't change its value) volatile (its value might change on its own) pointer to a function which taking three arguments (void * , int, size_t) and returning void *; and the memset_ptr symbol does not have external linkage and is initialized to "memset".

adultSwim · on Sept 4, 2014

http://ieng9.ucsd.edu/~cs30x/rt_lt.rule.html

Learning the right-left rule helps here. You'll still need to know what the keywords mean.

pavlov · on Sept 4, 2014

It declares a function pointer variable named "memset_ptr", and assigns the value of "memset" to it.

The "void *" before the first parenthesis is the return type of the function. The stuff in the first parenthesis applies to the function pointer variable. The second parenthesis lists the types of the arguments to memset.

blinkingled · on Sept 4, 2014

It's declaring a function pointer which is just an alias for memset and additionally qualifying it as volatile so the compiler doesn't optimize the call out.

denim_chicken · on Sept 4, 2014

In GNU C one can add the statement

    asm ("" : : "m" (&key));

just before or after the memset, effectively telling the compiler that the address of "key" escapes the scope of the function.

fafner · on Sept 5, 2014

GCC also has an `optimize' function attribute which might make sense to use here. This can set optimizations for the function to -O0. But I haven't tried it.

cesarb · on Sept 5, 2014

That's not enough, since even -O0 applies a few optimizations, and which optimizations it applies could change in the future.

The best answer is really GCC's __asm__("" : : "m" (&key)), or perhaps something like __asm__("" : : "r" (key) : "memory"), after the memset. It generates no extra code, just ensures that the memset won't be removed.

For other compilers (in practice only MSVC, since clang is gcc-compatible), you could pass the pointer to a dummy assembly function instead of using inline assembly; even link-time optimization can't know what happens within a function written in assembly. Or, for better performance, create in assembly a "safer_memset" which is a single instruction: a jump to the real memset function.

syncsynchalt · on Sept 5, 2014

That's not portable, whereas the "volatile" solution is.

tedunangst · on Sept 4, 2014

When this still doesn't work: JIT compiled C. The compiler can check for memset and elide it. (Or hell, one can envision the hypothetical Antagonizer9000 compiler including a version of memset which peeks up the stack to see what it's clearing and stops short.)

davidtgoldblatt · on Sept 4, 2014

To clarify: the strategy in the post doesn't actually work (or at least, is not guaranteed to work in every conforming implementation): the "volatile" only applies to the read of the function pointer, not to the execution of the function in question.

You don't even need to assume some sort of crazy evil compiler to have to worry about this - speculative inlining of function pointers guarded by a safety check is something that FDO builds will actually do.

BrandonM · on Sept 4, 2014

The first comment there (by Anonymous) claims that the final technique can also be optimized:

    (memset_ptr)(p, 0, len);

> can be replaced by:

    if (memset_ptr == memset) {
        memset(p, 0, len);
    } else {
        memset_ptr(p, 0, len);
    }

> Which in turn can be optimized using the other tricks noticed above into:

    if (memset_ptr != memset) {
        memset_ptr(p, 0, len);
    }

I'm no expert, but this seems like a believable defeat of the technique in the post.

cperciva · on Sept 5, 2014

That's not quite right since it's now reading memset_ptr twice, but the concept does seem to be right -- the volatile pointer must be read but the standard doesn't require that the function is invoked.

spott · on Sept 5, 2014

What about a data race? Theoretically, the function that memset_ptr points to could be changed between when it is checked and when it would be run.

Karellen · on Sept 5, 2014

If you have multiple threads accessing a shared (mutable) variable in your program, even a shared volatile variable, then you need to guard every access to that variable (which, in this case, includes every function call through memset_ptr) with proper thread synchronisation primitives. Marking a variable "volatile" is not enough to prevent data races in a multi-threaded environment.

If you've put a semaphore, or mutex lock, or whatever around your calls through memset_ptr(), the transformations will all take place inside the lock, and data races should not be an issue.

spott · on Sept 5, 2014

I think you missed my point.

memset_ptr is a const (not changed by this program... theoretically) volatile (allowed to be changed by the system, theoretically) pointer to memset. In THIS PARTICULAR CASE, memset_ptr points to memset. The compiler however doesn't know that it won't change due to another processes, but we do. So the compiler shouldn't be able to optimize out the call directly to the function pointer because it introduces a possible race condition: the program reads that memset_ptr points to memset, then the pointer changes (due to some other process changing it), but the program still calls memset, and not memset_ptr. The optimization allows for a possible race condition to occur.

tedunangst · on Sept 5, 2014

That race exists regardless. Many systems will execute this as loading the pointer into a register, them jumping to it. The value could change between those two instructions.

kelnos · on Sept 5, 2014

How about instead of assigning to your function pointer directly using `memset`, instead use `dlsym()` to look it up? You could even declare the fnptr to take a `volatile` argument as well as the ptr itself being `volatile` (not sure that is useful here).

Of course, using `dlsym()` isn't exactly portable...

joelangeway · on Sept 5, 2014

Even if a JIT compiler can prove that all the code in your app doesn't change that function pointer, because the variable is volatile, the compiler must assume that you intend to read from actual metal every time you refer to it and it can not predict what the value will be. Even a JIT compiler is not allowed to optimize away that read, or else you'd never be able to write a driver.

TheLoneWolfling · on Sept 5, 2014

Incorrect.

Because a JIT may have enough knowledge of the underlying system to know that the pointer is not pointing to a memory-mapped / DMA'd / etc area, and as such can be assumed to remain constant.

rchowe · on Sept 4, 2014

Would it be possible to read a value from the array and do something with it (e.g. send it to /dev/null)? Even a JIT shouldn't be able to optimize the value out if you're actually using the value, and as long as you're not zeroing secure memory all the time, the performance hit shouldn't be that large.

This might cause side effects if /dev/null does not exist or is not the null device.

pcwalton · on Sept 4, 2014

A compiler could just forward the zero from the memset directly to your write syscall and delete the memset (and I would certainly implement this optimization in a compiler if I found it helped real code).

userbinator · on Sept 4, 2014

While this completely subverts our intention, it is perfectly legal: The observable behaviour of the program is unchanged by the optimization.

This begs the question of what is "observable behaviour" - execution time, which is definitely "observable" and the basis of timing-based attacks, can certainly change depending on what the optimiser decides to do.

I think this and similar cases of "fighting the optimiser" should really be solved with per-function (or even per-statement) optimisation settings; both GCC and MSVC support #pragma's to do this, although it's nonstandard.

pcwalton · on Sept 4, 2014

The trouble with that is that you then have to define what a "legal optimization" is in order for a "don't optimize me" pragma to have any meaning. That can be notoriously difficult and annoying. For example, a common trick in languages with no irreducible control flow is to generate SSA from the AST to avoid having to compute dominance, which in some compiler backends can make things like simple dead code elimination hard to not perform.

cperciva · on Sept 4, 2014

The term "observable behaviour" is defined in the standard: Essentially, I/O to files and interactive devices, plus accesses to volatile objects.

schoen · on Sept 4, 2014

Perhaps in retrospect this was an inappropriate choice of definition, at least for cryptographic operations.

bunderbunder · on Sept 4, 2014

I'm gonna go ahead and say it: Perhaps in retrospect C is an inappropriate choice of language for these kinds of applications.

This "Performance at all costs, including safety and predictability" thing may be appropriate in video games, but for security-critical applications that philosophy is downright negligent.

tedunangst · on Sept 5, 2014

I'm not aware of any language that would be better. Most languages don't even let you touch memory to try to zero it.

userbinator · on Sept 5, 2014

There's basically only one (or a family of) languages that have no optimisation at all, and enable complete control over what the machine does - Asm. The code you get is exactly the code you write, no matter how efficient or inefficient it is. This also enables much better the prevention of other attacks like timing/power analysis, since you can effectively insert dummy instructions as needed to keep the timing and power well-behaved.

The biggest downside I see is that it's non-portable, but the reality is that there's not all that many architectures out there to port to anyway (x86, ARM, MIPS probably covers 90%+) and for truly security-critical code having that level of control could be worth it. (This also avoids the "trusting the compiler" problem - an assembler is far easier to verify correctness of than even the simplest C compiler...)

dllthomas · on Sept 5, 2014

"There's basically only one (or a family of) languages that have no optimisation at all"

... sort of. Chips themselves perform some optimizations.

schoen · on Sept 5, 2014

I think I heard of a timing attack that was introduced by CPU optimizations, not in the underlying code at all! But I can't remember what research that was and maybe I'm confusing two different issues.

dllthomas · on Sept 5, 2014

Certainly plausible. If the CPU optimizes one path but not another, in principle that's some information. In practice, gathering enough data to get that above the noise floor and turning it into something useful besides would certainly be difficult but maybe not too difficult.

schoen · on Sept 5, 2014

I might be thinking of some of the cache timing issues about sharing physical hardware with other guest VMs. There were papers in the last two years showing circumstances in which another VM on the same device can learn about secrets in caches via timing experiments. But that's a cache issue rather than a general pipeline issue. So I still can't remember if there's something else that I'm thinking of or if I'm just confusing it with the cache stuff.

http://blog.cryptographyengineering.com/2012/10/attack-of-we...

More recent stuff in this vein:

https://eprint.iacr.org/2014/435.pdf

(Yikes!)

TheLoneWolfling · on Sept 5, 2014

Well, the most obvious form of that is cache.

There's a entire class of timing attacks that rely on the CPU cache - they would not exist if CPUs didn't do the optimization of storing local copies of limited sections of RAM.

bunderbunder · on Sept 5, 2014

A language that wouldn't give you a way to zero out data wouldn't be appropriate for security-critical tasks either.

Just like how a language that doesn't do array bounds checking and permits pointer math (read: a language where buffer overflows are a damn feature) isn't appropriate for security critical tasks.

Just like how a language that must be actively fought using clever hacks in order to prevent it from undermining your attempts to guard against common exploit vectors isn't appropriate for security critical tasks.

ufo · on Sept 5, 2014

Zeroing out an array or struct field is not a C-specific feature.

tedunangst · on Sept 5, 2014

Dead store elision is not a C specific optimization.

jacksingleton · on Sept 5, 2014

This is what Mozilla is trying to create with Rust

http://www.rust-lang.org/

swift · on Sept 5, 2014

Many languages zero every allocation unless they can prove that you immediately write over that memory without reading it.

marvy · on Sept 5, 2014

Yes, but that's going the wrong direction. The situation here is that you've already written over it, and now wish to erase what you wrote.

tedunangst · on Sept 5, 2014

Which does nothing for memory which has been freed but not reallocated.

swift · on Sept 7, 2014

That's a great point; should've thought through that post a little better. Thanks.

pcwalton · on Sept 5, 2014

Dead store elimination is a really important optimization, and it comes out of bog-standard compiler optimizations like SROA on SSA form IR. You really want your compiler to perform it for acceptable performance.

bunderbunder · on Sept 5, 2014

Yes. But dead store elimination combined with being allowed to dereference arbitrary pointers and leaving newly allocated blocks of memory uninitialized is problematic. Contrary to what the C standard would like us to believe, those other features do mean that dead store elimination alters the semantics of a program. It doesn't impact the semantics of the procedure whose dead stores are being eliminated, but it alters the semantics of arbitrary operations elsewhere in the program because it could influence the result I get when I dereference a pointer dereference or examine the contents of a newly-allocated block of memory.

In most cases that distinction is nit-picky. It can be perfectly reasonable for the language to throw up its hands, shout "undefined behavior", and just assume that whatever random uncontrolled thing happens won't be too terrible, assuming whatever your program does isn't too important. But for security-critical applications it's a really stinking important distinction, because the range of possible behaviors found in the "undefined" category includes things like Heartbleed.

pcwalton · on Sept 5, 2014

If your compiler couldn't perform dead store elimination on memory (remember, memory includes allocas), then you'd lose most of the benefits of performing standard SSA form optimizations after SROA has happened. Essentially you'd kill scalar replacement of aggregates, which is a critical optimization. It's very important for performance that it allowed to happen.

bunderbunder · on Sept 5, 2014

Note that I'm not saying that dead store elimination is the problem. I'm saying that dead store elimination in combination with other language features is a problem. Dead store elimination is not unique to C. But those other language features are. Since I'm complaining about C in particular, I submit that the bit that I'm most worried about isn't the stuff that every language does.

That said, I realize that allowing data that's hypothetically disappeared forever into the free() black hole to come back into the universe through white holes such as malloc() and buf[buf_length] aren't the only reasons why you'd want to make for sure that you can clear out unused memory. Which is why it would be nice if the C spec also included some way to securely clear up memory that the compiler isn't allowed to defeat. No, an optional feature in a spec that's only 3 years old and mostly not supported isn't good enough. If it isn't ubiquitous it's not a whole lot more useful than any of the platform- and architecture-specific fixes that already exist.

ibisum · on Sept 5, 2014

You want your compiler, and execution environment, to support it. In C's case, the compiler can do a lot of work in this department - but so can most modern OS's in the real-world, provide adequate protection:

    sbuf = mmap(..,..,MAP_PRIVATE|MAP_ANON); // &etc.

ygra · on Sept 5, 2014

I'm not sure how this depends on the language, per se. Basically you want very special behaviour here: Allocate a piece of memory, preferrably so that it doesn't end up accidentally in permanent storage and later clear it and deallocate it again. C gives you no such facilities to do so, but so do very few other languages (if any). The correct course here would rather be to ask the OS to do that for you. You can tell the memory allocator to never page out the block you get. Every OS has a syscall to zero memory that isn't subject to compiler optimisations.

Honestly, in my eyes, this is a place where the language is specified in a way that it cannot ever guarantee what you're trying to achieve and in that case you're best off not relying on the language, but on other things that can make such guarantees.

maxlybbert · on Sept 5, 2014

C may have pitfalls for "these kinds of applications," but it has some strengths that other languages don't. Since C gives the programmer significant control over memory allocation, it's possible to avoid various kinds of timing attacks related to cache misses (possible is not the same as easy). Many languages don't give the programmer the necessary tools to do that.

dllthomas · on Sept 5, 2014

There are two different things you can mean when you say "C is not suitable to these kinds of applications". One is the more extreme, "You should not be using C, you should be using <other existing language> because it is more suitable." That's a bit of a hard sell; though specific alternatives should be evaluated on their merits. There is also, "there are design choices that have been made in C that make it worse for these applications than C would have been were it not for those choices", which seems an easy case to make.

maxlybbert · on Sept 5, 2014

The funny thing to me is that the standard crypto packages for other languages nearly always end up calling C code.

dllthomas · on Sept 5, 2014

That's not at all incompatible with "C is the best option of existing languages, but still bad in some obvious ways where it could be better (for this purpose)."

maxlybbert · on Sept 5, 2014

True.

PhasmaFelis · on Sept 5, 2014

C/C++ are like Formula 1 racing cars: indispensable if you need to go really, really fast; wildly impractical in all other situations.

exDM69 · on Sept 6, 2014

> C/C++ are like Formula 1 racing cars ...

I can't agree with this comparison at all. It may be true if speed is what you want but that's not the only reason.

C (and to a lesser extent, C++) are indispensable in lots of situations where you're working close to the metal. There are very few viable alternatives when working with kernel space code, micro controllers or embedded applications as well as crypto primitives.

Rust is perhaps the only language that can be used instead of C and C++ in these applications.

I'd liken C more to a heavy duty vehicle, something that most people never need but there's no replacement for the tasks it is intended for.

annnnd · on Sept 5, 2014

Is the proposed solution really the best approach? It seems complicated to me and relies on obscure parts of the language. Maybe the problem (compiler optimizes away function call because the result is no longer needed) could be solved like this:

  memset(key, 0, sizeof(key)); 
  if (key[0])  // we are using key, so you can't skip memset()
    dropDead();

Unless the compilers "understand" memset and still optimize away the last two lines? I would hope not... Does anyone know how aggressive the C optimizers are these days?

anon4 · on Sept 5, 2014

> Unless the compilers "understand" memset and still optimize away the last two lines

They do. That's the whole "problem" - the compiler knows what memset is and what it does, since it's specified in the standard.

DSMan195276 · on Sept 5, 2014

It's not even that it 'understands' memset, more likely that the memset call is almost immediately inlined into your code making it obvious what's going on.

Genmutant · on Sept 5, 2014

The compiler understands memset, so that would probably not work.

Genmutant · on Sept 4, 2014

Why wouldn't you make key volatile? Shouldn't that solve all the problems? Or is it because it would be to slow because the compiler can't do that many optimizations in the rest of the function any more?

cperciva · on Sept 4, 2014

Yes, making key volatile would force the zeroing to happen; and yes, you don't want to do that because it would absolutely kill your code performance.

jbert · on Sept 5, 2014

Can you play the game the other way and "fail safe"?

i.e. declare the storage volatile but running your crypto code on a non-volatile ptr to it (obtained via cast) to get your performance back?

If the compiler then generates enough smarts to work out that the non-volatile ptr you've passed into your crypto code is referring to volatile storage, then you keep security but get a (noticeable in testing?) performance hit.

I guess that's not as good as your solution though.

theseoafs · on Sept 5, 2014

Can't you just cast it to a `volatile uint8_t *` at some later point when you need to ensure that we've zeroed the memory?

mikeash · on Sept 5, 2014

That's discussed in the article. Volatile ultimately applies to the storage, so a sufficiently smart compiler may be able to deduce that you're lying to it with the cast and elide the write.

xroche · on Sept 5, 2014

Why would you want to zero a buffer ? Because it may contain sensitive information, I presume. If you don't have additional properties w.r.t allocated memory, what prevent a system with high load to temporarily put the given memory block on swap, leaking the information on disk ? Security is hard...

bostik · on Sept 5, 2014

There is mlock(2), which is supposed to prevent the memory from being swapped. The problem with that is that the call requires either root or CAP_IPC_LOCK.

If your user lacks the capability, you have to run the program setuid root. Allocate the sensitive buffers at the start, call mlock() on them and only then drop the privileges.

There's also the 10kg fine-tuning hammer, mlockall(2). That makes ALL the memory for the calling process to become unswappable. As it can lock either the "currently held" memory, or "all the memory to be allocated during process lifetime", it can provide for some additional amusement under memory pressure.

leni536 · on Sept 5, 2014

I assume this privilege problem can be solved now with UID namespaces in linux. However it's really ugly, depends on running multiple child processes and linux specific.

nitrogen · on Sept 5, 2014

what prevent a system with high load to temporarily put the given memory block on swap

mlock() or mlockall() is useful in this case, but those are POSIX functions, not C.

clarry · on Sept 5, 2014

Nothing's perfect, but we do what we can. Swap encryption is cool.

ibisum · on Sept 5, 2014

>Why would you want to zero a buffer ? Because it may contain sensitive information, I presume.

You don't zero sensitive buffers. You randomize them, then free() them.

ygra · on Sept 5, 2014

A dead store is a dead store. It doesn't matter whether you write something random into it or zero. If the compiler notices that you cannot read it again anyway, it will elide the write.

clarry · on Sept 5, 2014

Why do you randomize them?

syncsynchalt · on Sept 5, 2014

Because just free()ing them means anyone calling malloc() can get your password.

clarry · on Sept 5, 2014

So why don't you zero them?

ibisum · on Sept 5, 2014

Because then your attacker knows that your buffer had something in it of value.

haberman · on Sept 5, 2014

Interesting. This appears to solve a more general problem, which is: how to create a barrier against inter-procedural optimization and dead code elimination.

I wonder if this trick could also be used to solve the double-checked locking problem.

From the quintessential DCLP paper (http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf):

    Consider again the line that initializes pInstance:

    pInstance = new Singleton;

    This statement causes three things to happen:
    Step 1: Allocate memory to hold a Singleton object.
    Step 2: Construct a Singleton object in the allocated memory.
    Step 3: Make pInstance point to the allocated memory.

    [...]

    DCLP will work only if steps 1 and 2 are completed before
    step 3 is performed, but *there is no way to express this
    constraint in C or C++*.

But Colin's pattern here seems to be a way of indeed guaranteeing this. The volatile function pointer is a barrier against inter-procedural optimization: if the function must be called, then step 3 cannot possibly be performed before steps 1 and 2.

(There might still be necessary hardware barriers that are missing, and the lack of a memory model for pre-C11/C++11 probably makes it all technically undefined behavior anyway. But the key sequential ordering constraint that was claimed inexpressible in C and C++ appears to indeed be expressible with this trick, if indeed the trick works for guaranteeing a call to memset).

nitrogen · on Sept 5, 2014

if the function must be called, then step 3 cannot possibly be performed before steps 1 and 2.

So just to clarify, there's no way the compiler could do "1, 3, 2" instead of "1, 2, 3"? It seems a naive implementation of a compiler could store the pointer to the allocated memory in the pInstance variable before calling the constructor, rather than using a temporary location for the pointer (e.g. a register). Does C++11 and later specify otherwise?

haberman · on Sept 5, 2014

I should have been more specific. To use Colin's trick with this pattern, you would need to write a separate function (like InitializeSingleton()) that calls the constructor and returns the pointer. If InitializeSingleton() is impossible to inline/optimize, which is the goal of Colin's trick, then 1 and 2 must happen before 3, because 3 cannot happen until the function has been called and returns, and the function does steps 1 and 2.

foobarqux · on Sept 5, 2014

You should have test cases to verify the zeroing behavior in the object code. Even if the standard says a compiler must do something does not mean that it does.

maxlybbert · on Sept 5, 2014

The difficult thing is that any way to verify the zeroing behavior would change the compiler's decision about whether it could elide the call to memset. So it's possible (well, almost guaranteed) that the test would succeed even though the memory wouldn't actually be zeroed in production.

paulasmuth · on Sept 5, 2014

If you set up your tests up correctly they should test the exact binary or shared library that is deployed to production and not some test-specific build.

pepve · on Sept 5, 2014

Remember that we're trying to mitigate exploits. The test code could just be an exploit. Either that or just dump and analyze the memory.

jbert · on Sept 5, 2014

I was thinking about that. I think the way to do the test would be to put in a known key value, then call the platform-specific equivalent of "abort and write a core file". The test would then grovel through the core looking for the known key sequence.

apaprocki · on Sept 5, 2014

At least in LLVM 3.4, this seems to do the trick too:

  static void secure_memset(void *, int, size_t) __attribute__((weakref("memset")));

defen · on Sept 4, 2014

Nice teaser at the end there. Does it have something to do with the fact that the OS may have paged the memory containing the sensitive data to disk?

dsjoerg · on Sept 4, 2014

I figured the article would be about how you have to write random data to the buffer to truly "zero" it, otherwise the ghost of the data can still be read using some trick.

Dylan16807 · on Sept 5, 2014

Main memory decays in milliseconds.

cperciva · on Sept 4, 2014

Part 2 should be up tomorrow. ;-)

But no, I'm not talking about VM paging.

jacquesm · on Sept 4, 2014

Coredumps?

schoen · on Sept 4, 2014

OS/language runtine should provide cryptographic key management routines that are correct in the OS/device context?

jacquesm · on Sept 4, 2014

Sure, but programs can crash (or you can attempt to make them crash) and then your keys are sitting there unprotected in a coredump. Anyway, I'm curious if that's the situation Colin was thinking about or not. That would seem to be quite hard to protect against. The example Colin gave was one that is definitely not at the OS / Library level. Though maybe you're right and that's where the solution to all this lies.

alexforster · on Sept 5, 2014

I'm hoping it'll be an allusion to Rust and an article about how none of this nonsense is necessary, but I don't think so based on his past writing.

sgentle · on Sept 5, 2014

This all seems kind of silly. Why doesn't C have a type qualifier like called "secure" to inform the compiler that it should avoid security-compromising optimisations and maybe even automatically zero the memory when it falls out of scope?

ovi256 · on Sept 5, 2014

That sounds a lot like automatic memory management!

To a C dev, that's the same as communism to a US Republican.

dllthomas · on Sept 5, 2014

As a C dev, no. First, stack allocation is a type of "automatic memory management", and in most situations we C devs are perfectly comfortable with it. Second, in terms of how the memory is allocated/deallocated, the above doesn't sound any different than stack allocation. The difference is more like "volatile", telling the compiler "this memory is special, treat it carefully", and it mostly doesn't seem unreasonable. Note that C compilers frequently have extensions providing a way of naming destructors for particular variables. It probably would still be possible to skip it with a non-local jump (longjmp or computed goto) but avoiding those in security conscious code is probably already standard - it basically is in most code I've encountered.

ygra · on Sept 5, 2014

The problem is that C doesn't concern itself with security at all. It's a language with its semantics being defined by an abstract machine and observable behaviour on that machine. A compiler is only obliged to emit an executable that has the same behaviour as the original program would have on the abstract machine, again, only regarding observable behaviour.

You can still side-step the problem by calling a function that's not defined in the standard (so it cannot be inlined by the compiler), usually something like SecureZeroMemory on Windows and its equivalent on other OSes.

scott_s · on Sept 4, 2014

Colin, you're missing end-parenthesis in your memset calls.

cperciva · on Sept 4, 2014

Fixed, thanks.

e12e · on Sept 5, 2014

It is a little mind boggling that support for proper handling of this didn't arrive until c11. For a symmetric cipher without a demanding setup/init phase - would it make sense to just do a few rounds on a buffer using the zeroed key? Obviously quite a few more cycles, but should at least be a predictable (constant) overhead?

maxlybbert · on Sept 5, 2014

What do you mean? The solution that Percival presents compiles fine on my C89 compiler.

e12e · on Sept 5, 2014

I didn't mean to imply that the solution as presented didn't work, I was just wondering if it would also work simply running the cipher with the zeroed key in order to avoid zeroing the key being optimized away. Obviously that'd be a lot more cycles; I'm just curious if it would be a viable solution ;-)

maxlybbert · on Sept 5, 2014

I was referring to the first sentence ("It is a little mind boggling that support for proper handling of this didn't arrive until c11."). I don't see anything C11-specific in the code Percival posted. I don't know enough to say anything useful about the rest of your comment.

e12e · on Sept 5, 2014

My point was that this dance around "observed behaviour" isn't needed in C11, as per: "(...) on C11 (are there any fully C11-compliant platforms yet?) you can use the memset_s function. (...) [which is] guaranteed (or at least specified) to write the provided buffer and to not be optimized away."

On another note, searching for memset_s and openbsd yielded this hit from 2012:

https://mail-index.netbsd.org/tech-security/2012/07/22/msg00...

Which seems to point back to:

https://mail-index.netbsd.org/tech-userlevel/2012/02/25/msg0...

So I guess the "trick" outlined in the (very lucid) post has been known for a while.

maxlybbert · on Sept 5, 2014

Thanks. I had missed the part in the article about the C11 changes.

pjungwir · on Sept 5, 2014

Does anyone have any advice on articles about C compiler optimizations in general (especially gcc)? I'm doing my first serious C work in ten years, and I keep wondering if I should fuss with things like this or let the compiler handle it all:

    foo->bar->baz[i].oof = foo->bar->baz[i].durb + meep;

vs

   what *tmp = foo->bar->baz[i];
   tmp->oof = tmp->durb + meep;

EDIT: I'm not asking for a link to this:

https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

I'm asking if there is advice about it. Any overviews with common pitfalls, advice on when to use -O1 vs -O2, specific optimizations to turn on/off, etc.

exDM69 · on Sept 5, 2014

    foo->bar->baz[i].oof = foo->bar->baz[i].durb + meep;

This is fine, no need to "optimize" anything. This kind of common subexpression elimination should be done by any modern compiler (for any language!) and the algorithm behind it is taught in university classes too.

Most of the time it's safe to use -O3. If you're doing numerical code with floating points -ffast-math is also pretty safe if your code is correct (ie. no NaN/Inf bugs). Almost the only reason to turn off optimization (-O0) is when higher optimizations make using a debugger harder.

Here's a pretty nice article with some specific optimizations that GCC can (or can't) do. It's pretty old, though, the examples were done with GCC 4.2.1, current version is around 4.9.

http://ridiculousfish.com/blog/posts/will-it-optimize.html

These days Clang can be as good or better than GCC most of the time. The exceptions are in more exotic code like kernel space stuff or micro controller programming.

There's no room for guesswork if you actually want to optimize code, so spend some time reading the assembler output from your compiler as well as benchmarking the results. I usually use objdump -d objfile.o to look at assembly output.

sjolsen · on Sept 5, 2014

>I usually use objdump -d objfile.o to look at assembly output

You can also compile to assembly with -S. I think it's clearer that way.

exDM69 · on Sept 5, 2014

Yes, I know about the -S flag. But CFLAGS, etc comes from Makefiles so inspecting the object files (which I already have) is easier than re-invoking the compiler with -S added to the command line.

pjungwir · on Sept 5, 2014

Really appreciate everyone's replies! A related question about my example: what if I want to assign the pointer dereference to `tmp` to improve readability (rather than avoid multiple traversals). Is there any reason not to use a tmp variable (presumably with a better name)?

haberman · on Sept 5, 2014

> Is there any reason not to use a tmp variable (presumably with a better name)?

Nope -- shouldn't hurt at all.

It's interesting to me that LuaJIT recommends not using temp variables like this because they can hurt optimization for LuaJIT. That's obviously very different than C in almost every way, I just mention it because it was so surprising to me that there is a situation (in any optimized language) where a temp variable could hurt optimization.

sjolsen · on Sept 5, 2014

>what if I want to assign the pointer dereference to `tmp` to improve readability

I personally find code like that harder to follow. The first version is clearer than the second (and you forgot to take the address of foo->bar->baz[i]).

pjungwir · on Sept 5, 2014

> and you forgot to take the address of foo->bar->baz[i]

Ha, I was afraid of that. :-) Still re-learning when I need that with arrays and when not.

exDM69 · on Sept 5, 2014

No, it doesn't hurt to have an extra local variable if you have compiler optimization enabled. If it would be a global or member of struct/class, that's a different deal.

It's useful to add some variables to be inspected in the debugger.

TheLoneWolfling · on Sept 5, 2014

This shouldn't need to be said, but don't use -ffast-math if you want reproducibility. Unfortunately, the times when you most want reproducibility tend to be the same sorts of number-crunching where -ffast-math would be most useful.

exDM69 · on Sept 6, 2014

> Unfortunately, the times when you most want reproducibility tend to be the same sorts of number-crunching where -ffast-math would be most useful.

Thanks for the clarification. The obvious caveats of -ffast-math are well documented and most applications shouldn't use that flag.

There are exceptions to this however, I tend to work on such problems. For example, game physics, 3d graphics and some scientific algorithms that have built-in numerical inaccuracy (so -ffast-math doesn't help but doesn't hurt either) but high perf requirements. I also tend to have extensive testing for the most crucial parts of my programs that should catch any problems with this (but many people don't do this with game physics, etc code).

Thankfully, -ffast-math is easy to disable if you start suspecting problems that are caused by that flag.

TheLoneWolfling · on Sept 6, 2014

I'd diagree, Game physics and scientific algorithms are often precisely where you want reproducible results.

Game physics because of lockstep networking and replays, scientific algorithms because... well, you want your results to be reproducible. Scientific method and all that.

Often times you don't mind if it isn't accurate, but that isn't the same thing as precision. You want it to be precise, i.e. reproducible.

Yes, there are cases

haberman · on Sept 5, 2014

Here's my advice. Generally speaking, the compiler is really smart. I would characterize the optimization you put as a third-grade optimization: GCC is in college (Clang too, for that matter). It's many steps ahead of that level.

However, if you're ever in doubt, I recommend compiling very short functions and viewing their output.

    typedef struct {
      int oof;
      int durb;
    } baz_t;

    typedef struct {
      baz_t *baz;
    } bar_t;

    typedef struct {
      bar_t *bar;
    } foo_t;

    void f(foo_t *foo, int i, int meep) {
      foo->bar->baz[i].oof = foo->bar->baz[i].durb + meep;
    }

    $ gcc -O2 -c -o test.o test.c
    $ objdump -d -r -M intel test.o

    test.o:     file format elf64-x86-64

    
    Disassembly of section .text:   
       
    0000000000000000 <f>:           
       0:   48 8b 07                mov    rax,QWORD PTR [rdi]
       3:   48 63 f6                movsxd rsi,esi
       6:   48 8b 08                mov    rcx,QWORD PTR [rax]
       9:   48 8d 34 f1             lea    rsi,[rcx+rsi*8]
       d:   03 56 04                add    edx,DWORD PTR [rsi+0x4]
      10:   89 16                   mov    DWORD PTR [rsi],edx
      12:   c3

You can see here that it followed the chain of pointers only once.

The one thing to watch out for though is things that gcc isn't allowed to optimize because of C. For example, if a pointer escapes the function (to another function that the optimizer can't see), gcc cannot assume that the pointed-to memory remains unchanged, even if the called function takes a const pointer! Because the function could always cast away const. For example, this variant will have to follow the chain twice:

    int g(const foo_t *foo, int i);

    void f(foo_t *foo, int i) {
      int x = g(foo, foo->bar->baz[i].durb);
      foo->bar->baz[i].oof = x;
    }

Generally people always use at least -O2. The main difference between -O2 and -O3 is that -O3 is more aggressive with unrolling and other optimizations that increase code size, so sometimes -O2 is faster because of icache pressure. I generally use -O3 on my tightest loops and -O2 (or even -Os) on everything else.

sjolsen · on Sept 5, 2014

That's basic dataflow analysis. Unless one of those pointers is to a volatile object, I'd be very surprised if any halfway serious compiler produced more than one access to

  foo->bar->baz[i]

unless "meep" has side effects.

To address the question, these some of the guidelines I try to follow:

- Compile with "-Wall -Wextra" (and "-pedantic" if feasible)

- Modularize your code. You can always mark functions "static inline."

- Don't try to be clever; "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?" In fact, Kernighan has a lot of good advice: https://en.wikipedia.org/wiki/The_Elements_of_Programming_St....

- Be careful with signed integers. Overflow can do weird things to your program. You can make signed integers act like unsigned integers on overflow using -fwrapv, but if that behaviour is correct, you probably should have used an unsigned integer outright.

- Be careful with pointers; specifically, the requirements of any pointer passed to a function should be explicitly documented: whether it is allowed to be null, whether it's an "in" parameter or an "out" parameter, whether it points to one object or an array, etc. If a pointer points to an array, carry a length parameter with it; null-termination is really easy to foul up.

- Don't optimize until the program needs to be faster. When it does, profile and target the low-hanging fruit. Personally, I usually use either -O0 or -Ofast, depending on whether or not I'm debugging something (-Og is a good one if you need speed while debugging).

- Speaking of optimization, don't underestimate the power of inlining. It's easy to go overboard with it, but it can make a big difference in the right situations.

- Your compiler probably has a peephole optimizer. Replacing "i / 16" with "i >> 4" is probably not an improvement to the quality of either the source code or the object code.

- If you find yourself reimplementing something that C++ knows how to do, consider using C++ to do that. It's not always politically feasible, but remember that you can link C and C++ code.

drivingmenuts · on Sept 4, 2014

Why would the compiler be allowed optimize away a call to a perfectly valid function? This seems like it's allowing to compiler to make judgement calls on whether or not your code is worthy of being executed.

DSMan195276 · on Sept 4, 2014

It's because memset is a standard function and has a defined standard way of acting, with the most important part being that it doesn't produce any side-effects. It's also worth noting that accessing memory that's no longer in the current scope is undefined-behavior, so the compiler can assume it doesn't happen. Thus the memset has absolutely zero effect on the actual program and isn't necessary.

In general this isn't a big deal. It's only a big deal here because we're working on the assumption that you might have an issue in your code that invokes undefined-behavior and access contents of memory that you're not supposed to be looking at anymore, and the compiler's just assuming that your program won't ever allow that.

pwg · on Sept 4, 2014

What if one made some trivial use of the block of memory after having performed the memset, say something like this:

   void
   dosomethingsensitive(void)
   {
        uint8_t key[32];
        ...
        /* Zero sensitive information. */
        memset((volatile void *)key, 0, sizeof(key));
        key[0] = key[1] + 1;
   }

Would that thwart the optimizer, or would it also see through that usage and eliminate it as well?

DSMan195276 · on Sept 5, 2014

You should keep in mind that the memset call is almost guarenteed to be inlined. So your code actually looks like this:

    void
    dosomethingsensitive(void)
    {
        uint8_t key[32];
        /* ... */
        int i;
        for (i = 0; i < 32; i++)
            key[i] = 0;
        key[0] = key[1] + 1;
    }

Assuming the optimizer is sufficiently smart, then it'll remove that 'for' and it'll remove the addition after it in the same fashion.

cperciva · on Sept 5, 2014

That just creates another dead store which will get eliminated.

dietrichepp · on Sept 4, 2014

Because the C standard permits such optimizations. It is really just a case of inlining followed by dead store optimization, two fairly common optimization passes which are normally desired.

The subtlety is that while the compiler considers it a dead store, you don't, because you're going behind the compilers back to examine memory afterwards.

rohit89 · on Sept 4, 2014

How about compilers having a way for us to tell it that a particular function must not be removed? It seems silly that we have to come up with hacks to work around the compiler.

ygra · on Sept 5, 2014

They do. Every function they don't know about is considered important, potentially having side effects and the call will remain. Just call a function that exists in a library somewhere else (but use a static library and link-time optimization and that goes wrong again, because the optimizer knows the function then).

TheLoneWolfling · on Sept 5, 2014

Or, for that matter, use a dynamic library and it can still go wrong! (hypothetical future JITter or somesuch)

JoshTriplett · on Sept 4, 2014

The compiler constantly makes judgement calls on whether or not your code needs to run.

For instance:

    if (something)
        do_something();

Elsewhere:

    #ifdef CONFIG_RUNTIME_ENABLE_SOMETHING
    bool something = false;
    /* and some means of changing something at runtime */
    #else
    const bool something = false;
    #endif

If you don't define CONFIG_RUNTIME_ENABLE_SOMETHING, and thus "something" cannot change at runtime, then the compiler should recognize the if as always false and throw away the call to do_something(). If that was the only call to do_something(), it should throw away the code of do_something().

wumpus · on Sept 4, 2014

There's a lot of dead code in real programs, and this kind of optimization gets rid of a lot of it.

bodyfour · on Sept 4, 2014

...especially for code that's macro heavy. It's easy to have code where a macro is including a memset to protect some invariant. The compiler can see the bigger picture and realizes that the memset target isn't actually read from again in this case.

_qc3o · on Sept 4, 2014

Why would it not. That is the whole point of optimizing compilers. You write the code with as many variables and functions calls as it makes sense and the compiler figures out all the redundant stuff and throws it away, inlines some other things, packs structs, etc.

ben0x539 · on Sept 5, 2014

The "result" of compiling a C program is defined in terms of observable behavior. It's mostly stuff like IO and writes to `volatile` variables and that sort of thing. There's no presumption of any sort of correspondence between the source code and the generated code and really a compiler can emit whatever, except that the observable behavior must be the same as that of a translation that actually follows the language definition.

Since there's no accounting for shenanigans like using a debugger to look at variables that are never accessed anymore or dumping the entire stack to a file at arbitrary points etc, the compiler is free to consider code unworthy of being executed if it provably doesn't contribute to further observable behavior of your program.

jstanek · on Sept 5, 2014

Does GCC include any flags to prevent this sort of detrimental optimization?

prof_hobart · on Sept 5, 2014

There's a bunch listed (https://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Optimize-Options....), including

-O0 // Do Not Optimise

This sounds like it should do the trick (but I've not done C coding for quite some time, so I don't know if there's a nuance as to why it wouldn't), but would also presumably kill any other optimisation.

A better option may be to combine that with pragmas (https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-...) to switch optimisation levels within the code.

Anyone who's played with GCC recently know whether this would work or not?

zokier · on Sept 5, 2014

You mean something like friendly C: http://blog.regehr.org/archives/1180

(discussed on HN 8 days ago: https://news.ycombinator.com/item?id=8233484)

Perseids · on Sept 5, 2014

> on C11 […] you can use the memset_s function

How is the case for modern C++? Are there `vector` or smart pointer alternatives that reliably zero the memory in the destructor?

dllthomas · on Sept 5, 2014

It seems like ideally what we need is a language designed for as-fast-as-secure computation, JIT for the specific architecture it is going to run on to ensure no differences in timing, energy use, or anything (within whatever bounds are achievable) even in the face of different cache layouts, CPU optimizations and similar and which makes it a point to clean up everything that is not meant to be returned.

api · on Sept 5, 2014

If your goal is just to "burn" the memory, why not write your own loop that copies some arbitrary piece of data that the compiler can't optimize out over the memory's contents? Do something like fill the buffer with its own pointer address.

clarry · on Sept 5, 2014

It's been stated here already. You can write whatever you want into it. It doesn't matter. What matters is that the compiler realizes there's a dead store going into it, i.e. that the data is thrown away after it's written. So it can optimize out any write, since no conforming program can read that data after it's thrown away.

im3w1l · on Sept 5, 2014

A sufficiently malicious compiler could keep around a copy of the key in non-volatile memory.

kazinator · on Sept 5, 2014

The article the completely obvious:

    /* implemented in another translation unit */
    void zero_for_sure(void *data, size_t size);

    void func(void)
    {
      char securedata[42];
      /* ... */
      zero_for_sure(securedata, sizeof securedata);
    }

The key here is that our zero_for_sure is an external function in a separately translated file. In the absence of a stunningly advanced global optimization that peeks into other previously compiled units, the compiler has no idea what zero_for_sure does, and so it has to earnestly pass it the given piece of memory.

In turn, zero_for_sure is just this:

   void zero_for_sure(void *ptr, size_t size)
   {
      memset(ptr, 0, size);
   }

The compiler has no idea where ptr might come from since this is an external function, and so it cannot optimize away the memset.

Only if the compiler could consider the whole program together could it still optimize this.

In fact, you don't even need this function, just a dummy external function:

   void zero_for_sure(void *ptr, size_t size)
   {
      char securedata[42];
      /* ... */
      memset(securedata, 0, sizeof securedata);
      commit(securedata);
   }

Of course, commit is a noop which just returns. But the compiler doesn't know that because commit is in another translation unit.

The only optimization card that the compiler could pull here is since securedata is going away (so that it is illegal for commit to stash a pointer to it), it's okay to call commit with a pointer to some other block which contains zeros, and not actually securedata.

With any trick like this, you should inspect the object code to make sure it's doing what you think it's doing.

Oh, and sizeof doesn't require parentheses when the operand is an expression; they are required when a type name is used as an operand.

dllthomas · on Sept 5, 2014

From the article, "Some people will try this with secure_memzero in a separate C file. This will trick yet more compilers, but no guarantees — with link-time optimization the compiler may still discover your treachery."

https://gcc.gnu.org/wiki/LinkTimeOptimization

http://llvm.org/docs/LinkTimeOptimization.html

kazinator · on Sept 5, 2014

1. It is the compiler that is committing treachery here. This stuff stretches, if not outright breaks, the translation model given in the C standard, where it is clear that a program is separated into translation units, and that linkage resolves external names.

2. You bring this on yourself; it's not enabled by default by ordinary optimization options like -O2 or -O3. You have to ask for it, and so you must know what you're doing.

3. Under gcc, it looks like only those object files compiled with -flto are prepared for this optimization. You can arrange through your makefile or whatever not to apply -flto to sensitive modules that cannot be inlined or optimized away into other translation units. Those object files won't then contain the GIMPLE bytecode and whatnot needed to be able to peer into their internals at link time.

4. I don't think the dynamic linker in libc (ld.so) does this optimization, so putting code into shared libs may be another good way to hide it.

So, basically, the external function approach is still a very good tool for defeating unwanted inlining and dead code elimination, provided you don't stupidly use some advanced features that bend the standard translation model of the C language. In security-critical code, to boot. External functions are expressed using the standard language; the approach will work under pretty much any compiler.

dllthomas · on Sept 5, 2014

People do stupid things sometimes. Yes, I agree you shouldn't apply optimizations to security critical code without fully understanding the ramifications. It is also the case that you shouldn't write security critical code that might break under optimizations, where you can avoid it.

sharpneli · on Sept 5, 2014

Move the func into a dynamically linked library.

Thanks to the performance requirements of dynamic linking it's going to be a really long time until we have dynamic linker peeking into .so files and checking what a func does.

dllthomas · on Sept 5, 2014

That certainly does it, yes. Though that's more complicated than the solution offered here, and of course now the right thing to do is just a memset_s.

TheLoneWolfling · on Sept 5, 2014

...but still not guaranteed, which is what this is striving for.

DenisM · on Sept 5, 2014

>In the absence of a stunningly advanced global optimization that peeks into other previously compiled units

You'll be surprised, but this stuff exists since late last century, known as Link-Time Code Generation (LTCG):

http://msdn.microsoft.com/en-us/magazine/cc301698.aspx

kazinator · on Sept 5, 2014

I've just been digging through the C99 standard (official) and the N1570 (final C11 draft) and have come to the conclusion that these optimizations break the language. This is probably not something you should be compiling your OpenSSL shared library or SSH with.

A C program consists of translation units which may be preserved in translation form. That happens in translation phases 1 through 7. Multiple translation units may be linked, which is translation phase 8. Phase 8 only consists of resolving references; the last semantic analysis takes place in phase 7.

An example under 5.1.2.3 gives the range of adherence between actual semantics and abstract semantics. Though it is just an example, and not normative, it is very clear from the wording that the locus of valid optimizations is the translation unit.

Selected citations:

5.1.1.1 Program Structure

A C program need not all be translated at the same time. [...] After preprocessing, a preprocessing translation unit is called a translation unit. Previously translated translation units may be preserved individually or in libraries. [...] Translation units may be separately translated and then later linked to produce an executable program.

5.1.1.2 Translation Phases

[...]

7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

5.1.2.3 Program Execution

8. EXAMPLE 1 An implementation might define a one-to-one correspondence between abstract and actual semantics: at every sequence point, the values of the actual objects would agree with those specified by the abstract semantics. The keyword volatile would then be redundant.

9. Alternatively, an implementation might perform various optimizations within each translation unit, such that the actual semantics would agree with the abstract semantics only when making function calls across translation unit boundaries. In such an implementation, at the time of each function entry and function return where the calling function and the called function are in different translation units, the values of all externally linked objects and of all objects accessible via pointers therein would agree with the abstract semantics. Furthermore, at the time of each such function entry the values of the parameters of the called function and of all objects accessible via pointers therein would agree with the abstract semantics. In this type of implementation, objects referred to by interrupt service routines activated by the signal function would require explicit specification of volatile storage, as well as other implementation-defined restrictions.

__david__ · on Sept 5, 2014

> Though it is just an example, and not normative, it is very clear from the wording that the locus of valid optimizations is the translation unit.

I don't think that's clear at all. As you said, it's just an example.

> Phase 8 only consists of resolving references; the last semantic analysis takes place in phase 7.

Why do you think that the only place you can optimize is during the "semantic analysis" phase?

To me the phrase "All such translator output is collected into a program image" (from step 8) is vague enough to not rule out optimization during "collection".

kazinator · on Sept 5, 2014

The word "all" in "all such translator output" seems to rule out removing any code, such as the zero memset.

"Translator output" suggests that translation is complete and we just have its output to link together. Optimization is "semantic analysis"; you cannot optimize without reasoning about meaning, and optimization also implies that translation is still going on: the output of earlier translation is still being tweaked, with regard to the meaning of the original source.

DenisM · on Sept 6, 2014

When it's a choice between standards and a 10% performance gain, the latter will win.

MichaelMoser123 · on Sept 6, 2014

You can have a function that wrapps memset with zero argument this wrapper should be in a different shared library, this way the compiler will not follow it; wait, that's exactly what memset_s is,

dllthomas · on Sept 6, 2014

Not exactly. There's no reason memset_s can't be understood by the compiler, inlined, and optimized, so long as those bits still get zeroed. That's not the case for any of the other approaches.

MichaelMoser123 · on Sept 6, 2014

the compiler should not make any assumption about what memset_s is doing (same goes for user defined memset wrapper function in different shared library - the implementation of that function is not known at compile or link time); if it can't make such assumption then it can't optimize the call out.

dllthomas · on Sept 6, 2014

I think you misunderstand me. There is no guarantee that there will be no optimization of or around memset_s (at least, not provided by the standard), and we don't want one. What the standard has done is assured us that it will not be optimized away - that the effect of zeroing that memory will be treated as visible even if the memory is otherwise dead. Allowing the compiler to optimize without changing semantics is desirable, and is permitted by the standard but prevented by the attempts to erect artificial walls through linking (and also by the volatile function pointer in the article), so memset_s - in addition to being clearer - is a technically superior solution.

bakhy · on Sept 5, 2014

I had no idea that such things are possible in C. The things about I've read recently (the "friendly" C suggestion) and this seem like violations of the spirit of the language. And for what, really? The language looses its signature predictability, which to me seemed like a great feature of C.

If you write crappy code and expect the compiler to fix it for you, you should maybe consider another language. I can only imagine how hard it is to write reliable system software in a language that does these things.

pjc50 · on Sept 5, 2014

Other way round: the C language after optimisation is unpredictable, which makes it hard to write secure system software.

You don't expect to write sound-looking code and have it broken for you by the optimiser.

pjmlp · on Sept 5, 2014

Actually back in the old K&R days it was much much worse, as no real standard was in place, compilers were just kind of compatible with endless little surprises.

Specially fun when trying at home your UNIX homework and vice-versa.

PhasmaFelis · on Sept 5, 2014

> If you write crappy code and expect the compiler to fix it for you, you should maybe consider another language.

This seems rather to be a case of writing good code and having the compiler break it for you.

bakhy · on Sept 5, 2014

Yes it is. You misunderstood my point, which was about the purpose of existence for such optimizations - they are meant to improve code, which presumably needs improvement. But if your code needs improvement, then why not go for a higher level language?

BTW, I gave you an upvote by accident :D

clarry · on Sept 5, 2014

They are meant to make the code run faster. If your code needs optimization, going for higher level languages is seldom a good idea.

You could try do low level optimizations in your C or assembly, but for most programs this will eventually backfire. So letting the compiler do its job is actually a good thing.

jimmaswell · on Sept 5, 2014

Couldn't you just compile these functions with optimization turned off, in a separate binary or something?

amalcon · on Sept 5, 2014

You would also need to link with optimizations turned off (or link dynamically, making the behavior undetermined at link time) to be sure. Linker optimizations have been a thing for a few years now.

oso2k · on Sept 5, 2014

Wouldn't returning the passed memory block through the return fix the optimization issue?

tiffanyh · on Sept 4, 2014

For those of you unaware, Colin Percival (author of the blog) was for many years the FreeBSD Security Officer and he's highly recognized in the field for his expertise.

He also runs http://www.tarsnap.com/ which is arguably the most secure (and cost effective) back up solution in the market.

(I'm in no way affiliated with Colin and/or Tarsnap. Just a fan of his work and humble attitude.)

cperciva · on Sept 4, 2014

humble attitude

I'm guessing you haven't seen the "comeback of all time" thread...

tiffanyh · on Sept 4, 2014

I haven't.

But I think it's super funny and cool that you (yourself) are pointing it out.

All the best with you.

Edit: just read the "comeback of all time". That was really funny. Nice nod from PG as well. For those of you unaware like me: https://news.ycombinator.com/item?id=35083 Colin is our resident mathematical genius :)

bentcorner · on Sept 4, 2014

Thanks for linking that. It's always interesting to see posts from several years ago and reflect on how things turned out (if you go up a few parents you can see discussion about tarsnap and dropbox makes a brief appearance).

tiffanyh · on Sept 4, 2014

Colin, if you don't me asking (and this is extremely off topic), do you know of any resources to read up on the differences between FreeBSD and DragonflyBSD now 11 years after the split.

I'm not looking for you to take sides on the matter. I just enjoy reading history and would like to read a recent review of the two BSD now 11 years later and how they compare.

I've looked and looked for the past few months and can't seen to find a good solid length article on it ... which is why I ask.

cperciva · on Sept 4, 2014

DragonflyBSD is less mature. This allows them to play around with things in ways which we can't do in FreeBSD because we don't want to break production systems. In many ways DragonflyBSD acts as a skunkworks for FreeBSD -- we import a lot of cool stuff from there, once it has been proven to work.

On the other hand, I wouldn't want to run DragonflyBSD in production... because skunkworks projects often don't work.

pedrocr · on Sept 4, 2014

Do you happen to have anything written on comparing modern FreeBSD with Linux (either kernel vs kernel or FreeBSD vs say Debian)?

cperciva · on Sept 4, 2014

No, but the major differences I find are:

1. The FreeBSD base system is developed intact, so there are far fewer kernel/library versioning issues.

2. FreeBSD, possibly because of its academic heritage, tends to take a more careful "let's study this problem and make sure we come up with the right solution" approach. This means that FreeBSD development is often slower, but once a feature is added it is more likely to actually work. (This is somewhat self-reinforcing: FreeBSD's stability attracts companies building servers and appliances, and when those companies contribute back they care a lot about having things continue to not break.)

3. Linux is far more popular, especially on desktops, so it tends to get drivers for new hardware faster (especially for consumer hardware).

4. FreeBSD is BSD licensed, which makes it available for a lot of companies which wouldn't want to get anywhere near Linux.

pedrocr · on Sept 4, 2014

Thanks. That fits my general idea of FreeBSD. I've mostly wondered about how they technically compare though. Facebook seems to think Linux is generally faster for example[1]. I'd love to follow this more closely but I haven't found a LWN equivalent for the BSDs.

[1] https://lwn.net/Articles/608954/

cperciva · on Sept 4, 2014

I think the best answer there is "it depends". Facebook has an open job posting which states that "Our goal over the next few years is for the Linux kernel network stack to rival or exceed that of FreeBSD" [1], so clearly there's at least one place in Facebook where Linux does not provide the best performance...

[1] https://www.facebook.com/careers/department?req=a0IA000000Cz...

jfindley · on Sept 4, 2014

I wonder when that was posted. Linux networking has improved massively since the 2.6 days, and continues to get better.

It's still not FreeBSD, but the difference is pretty small these days.

cperciva · on Sept 5, 2014

Within the past couple of months, I believe. Speculation at the time was that this was fallout from Facebook devops trying to port Whatsapp to run on Facebook's systems.

tiffanyh · on Sept 4, 2014

Thanks!

smegel · on Sept 4, 2014

When I allocate the key on the heap, the memset is carried (heavily optimized and inlined). When I allocate key on the stack, it disappears. Using gcc -03:

    #include <string.h>

    void doSecure(void)  
    {  
        /*char key[32];*/  
        char *key = (char*) malloc(sizeof(char)*32);

        memset(key,sizeof(char),32);  
    }

    int main(void)  
    {  
        doSecure();

        return 0;  
    }

    -- key on stack

    main:  
    .LFB13:  
        .cfi_startproc  
        xorl	%eax, %eax  
        ret  
        .cfi_endproc

    -- key on heap

    main:  
    .LFB13:  
        .cfi_startproc  
        subq	$8, %rsp  
        .cfi_def_cfa_offset 16  
        movl	$32, %edi  
        call	malloc  
        movabsq	$72340172838076673, %rdx  
        movq	%rdx, (%rax)  
        movq	%rdx, 8(%rax)  
        movq	%rdx, 16(%rax)  
        movq	%rdx, 24(%rax)  
        xorl	%eax, %eax  
        addq	$8, %rsp  
        .cfi_def_cfa_offset 8  
        ret  
        .cfi_endproc

rwg · on Sept 4, 2014

That's not a behavior you can count on — clang 3.4 will optimize doSecure() down to "ret" and main() down to "xorl %eax, %eax" + "ret" in both cases, for example. Also, gcc not optimizing out the malloc + memset in the heap case seems like a missed optimization that the gcc devs might fix in the future.

pjmlp · on Sept 5, 2014

That is compiler specific behaviour.

You have zero guarantees it will work in another compiler or even between releases of the same one.

Never take a compiler behaviour for the standard. That's the beauty of standards.

wolf550e · on Sept 5, 2014

    memset(key,sizeof(char),32);

Note that you are not clearing 32 bytes with NUL (byte value 0). You are clearing 32 bytes with byte value 1 == (int)(sizeof(char).

That's why memset that works 8 bytes at a time fills memory with 72340172838076673 == 0x0101010101010101

http://linux.die.net/man/3/memset

tedunangst · on Sept 4, 2014

A more realistic idiom is memset followed by free. The free provides a solid hint to the compiler that the object is dead without relying on escape analysis.

smegel · on Sept 5, 2014

Naturally, however adding free made no difference in this case. I guess if your dealing with a raw pointer rather than an array type, gcc cant be sure what memory you intend to erase.

angersock · on Sept 4, 2014

This sort of thing would be exactly what should go into the "Friendly C" dialect being chatted about the other day--for things like zeroing memory, it's very unexpected that a compiler would be like "nah, not feeling it...nobody will notice anyways".

ars · on Sept 4, 2014

No, you are looking at it backward. The compiler tries to optimize the program so that it will run fast.

If you force it to execute this code it only benefits the very rare security program, yet every single program will run slower.

That's a bad tradeoff. Better to make the security program jump through hoops and let everyone else run fast.

CJefferson · on Sept 4, 2014

If your compiler wrote every variable into memory, every time you changed it, your code would run at least 3 times slower. Compilers generally do not worry about making sure the value of unallocated blocks of memory are what you expect, as there is no way to observe them within your program.