Zeroing buffers is insufficient (daemonology.net)
299 points by MartinodF on Sept 6, 2014 | 130 comments



Part 2 is correct in that trying to zero memory to "cover your tracks" is an indication that You're Doing It Wrong, but I disagree that this is a language issue.

Even if you hand-wrote some assembly, carefully managing where data is stored and wiping registers after use, you still end up with information leakage. Typically the CPU cache hierarchy is going to end up with some copies of keys and plaintext. You knew that? OK, then did you know that typically a "cache invalidate" operation doesn't actually zero its data SRAMs, and just resets the tag SRAMs? There are instructions on most platforms to read these back (if you're at the right privilege level). Timing attacks are also possible unless you hand-wrote that assembly knowing exactly which platform it's going to run on. Intel et al have a habit of making things like multiply-add have a "fast path" depending on the input values, so you end up leaking the magnitude of inputs.

Leaving aside timing attacks (which are just an algorithm and instruction selection problem), the right solution is isolation. Often people go for physical isolation: hardware security modules (HSMs). A much less expensive solution is sandboxing: stick these functions in their own process, with a thin channel of communication. If you want to blow away all its state, then wipe every page that was allocated to it.

Trying to tackle this without platform support is futile. Even if you have language support. I've always frowned at attempts to make userland crypto libraries "cover their tracks" because it's an attempt to protect a process from itself. That engineering effort would have been better spent making some actual, hardware supported separation, such as process isolation.
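
To make the sandboxing idea concrete, here is a minimal sketch (mine, not the poster's design, assuming Linux/POSIX): fork a worker that holds the secret and only answers fixed-size requests over a socketpair, so the key never enters the main process and dies with the worker. sign_with_key(), MSG_LEN and the dummy XOR "crypto" are placeholders for illustration only.

  #include <sys/socket.h>
  #include <sys/types.h>
  #include <unistd.h>

  #define MSG_LEN 64

  static unsigned char secret_key[MSG_LEN];   /* stand-in; load from somewhere safe */

  static void sign_with_key(const unsigned char *msg, unsigned char *out)
  {
      for (int i = 0; i < MSG_LEN; i++)       /* dummy "crypto" for illustration */
          out[i] = msg[i] ^ secret_key[i];
  }

  /* Returns the worker pid; *chan is the parent's endpoint of the channel. */
  pid_t spawn_key_worker(int *chan)
  {
      int sv[2];
      if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
          return -1;
      pid_t pid = fork();
      if (pid == 0) {                         /* child: the only holder of the key */
          close(sv[0]);
          unsigned char msg[MSG_LEN], out[MSG_LEN];
          while (read(sv[1], msg, sizeof msg) == (ssize_t)sizeof msg) {
              sign_with_key(msg, out);        /* the key itself never crosses the socket */
              if (write(sv[1], out, sizeof out) != (ssize_t)sizeof out)
                  break;
          }
          _exit(0);
      }
      close(sv[1]);
      *chan = sv[0];
      return pid;
  }

The parent just writes a MSG_LEN request to *chan and reads MSG_LEN back; killing the worker is the "wipe every page" step, relying on the kernel to zero those pages before handing them to anyone else.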


The "right privilege level" allows you to see anything that happens during the execution of the lower privilege levels. I can even single-step your application with the right privilege level. So the crypto services have to run at the high privilege level and ideally your applications should leave even the key management to the "higher privilege levels." That way attacking the application can leak the data, but not the key, that is, you can still have the "perfect forward secrecy" from the point of the view of the application. So you have to trust the OS and the hardware and implement all the tricky things on that level. Trying to solve anything like that on the language level doesn't seem to be the right direction of the attacking the problem.


So is it correct to say that if a process does not want to leak information to other processes with different user IDs running under the same kernel, then a necessary (but not necessarily sufficient, due to things like timing attacks) condition is for it to ensure that any allocated memory is zero'd before being free'd?

I wonder if current VM implementations are doing this systematically.

It seems like a kernel API to request "secure" memory and then have the kernel ensure zeroing would be useful. Without this I'm wondering if it's even possible for a process to ensure that physical memory is zero'd, since it can only work with virtual memory.


All kernels I know of zero all memory they hand over to user processes. It's been part of basic security for quite some time - exactly for this kind of thing. It's usually done on allocation, not free - it doesn't really matter which way around, but doing it "lazily" can often give better performance.
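
A quick way to observe this (a sketch, assuming Linux or another POSIX system with MAP_ANONYMOUS): fresh anonymous pages arrive already zeroed.

  #include <assert.h>
  #include <stddef.h>
  #include <sys/mman.h>

  int main(void)
  {
      size_t len = 1 << 20;                           /* 1 MiB of fresh pages */
      unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      assert(p != MAP_FAILED);
      for (size_t i = 0; i < len; i++)
          assert(p[i] == 0);                          /* zeroed by the kernel, not by us */
      munmap(p, len);
      return 0;
  }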


In that case your original comment looks like the way to go and should make pretty much everything else in this thread moot.

It seems like the key though is ensuring that your environment uses distinct non-root users for all security relevant processes so that a security bug in one process doesn't allow the attacker to gain access to others.

EDIT: On second thought there may be some advantage to effectively zeroing memory for security critical data within a process but the likely value add seems low to me. Once a process has been hacked it seems pretty unlikely that you can hope to control what information it leaks.
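
For the "distinct non-root users" part, the usual pattern is to start privileged and drop to a dedicated account early. A minimal sketch, assuming Linux/BSD and a hypothetical service account name:

  #include <grp.h>
  #include <pwd.h>
  #include <stdlib.h>
  #include <unistd.h>

  static void drop_privileges(const char *user)   /* e.g. "svc-crypto" (hypothetical) */
  {
      struct passwd *pw = getpwnam(user);
      if (pw == NULL) exit(1);
      if (setgroups(0, NULL) != 0) exit(1);       /* drop supplementary groups first */
      if (setgid(pw->pw_gid) != 0) exit(1);       /* group before user, or it will fail */
      if (setuid(pw->pw_uid) != 0) exit(1);
      if (setuid(0) == 0) exit(1);                /* paranoia: must NOT be able to regain root */
  }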


Actually use of uninitialized memory is a reasonably common flaw and doesn't imply the process has been or can be hacked to execute arbitrary code.

So wiping that sort of information as soon as it becomes unneeded is good hygiene. And I still think it is reasonable to do the least you can to avoid ending up with sensitive data on the disk after a core dump.


Use of uninitialized memory is certainly a common bug but I'm not seeing what that has to do with zeroing free'd memory. It might be easier to detect such a bug if the uninitialized memory is zero'd but it seems like the work devoted to zeroing memory would be better spent fixing the uninitialized memory accesses.

As for the second point, production software isn't typically configured to produce core dumps (i.e. ulimit -c 0).
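
For anyone who wants the "ulimit -c 0" behaviour enforced from inside the program itself, a small sketch (POSIX):

  #include <sys/resource.h>

  static int disable_core_dumps(void)
  {
      struct rlimit rl = { 0, 0 };            /* soft and hard limit: no core file */
      return setrlimit(RLIMIT_CORE, &rl);     /* 0 on success, -1 on failure */
  }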


It's not so hard to zero memory when it becomes unused. So libraries like LibreSSL do that. Increasingly, other applications are also starting to use this pattern. It is easier to add a few safeguards into the library than it is to fix every past, present and future application that uses it.

It's a start. Adding the safeguard doesn't mean effort won't be put into fixing the actual bugs. But you just don't fix all the world's bugs overnight. That's why things like virtual memory, permissions, chroots, ASLR, NX, SSP and such exist.

How many systems enable core dumps by default? I don't actually know, but I think quite a few do. Every application you use to get stuff done is a production application. Every application that handles sensitive information handles sensitive information whether it is in production or not. Leaking passwords and keys can be as simple as working on some client software, having it crash once, then passing through airport security and getting your HD snooped on...


I understand your point, but it seems like this approach ends up making security dependent on a very deep stack of technology solutions, each rather fragile (as this post and thread demonstrate).

I wonder if it wouldn't make more sense to do a first principles analysis of what needs to be protected and then design mechanisms at the appropriate level of abstraction to ensure that these requirements are met. It seems to me that this is the approach that has been traditionally taken in OS level design and I agree that it hasn't worked very well. But I wonder if that isn't more because applications and environments are not being carefully designed to take advantage of the OS level security mechanisms that already exist.

Personally I would feel more confident depending on a robust kernel level security mechanism than a hodgepodge of application level fixes that depend on everything from compiler optimizations to CPU caching mechanisms.


I, for one, welcome new research. But it also needs to be demonstrated in practice before it can be used in practice. Until then, it's all hypotheticals and wishful thinking. Unicorns, basically.

In the meanwhile, I do what I can to review, audit, and apply best practices at application level. These are the things we can do here and now. These are things that are already in use to make your system more secure.

You're right that it is a hodgepodge of tricks and never quite perfect or capable of blocking all attacks. In an ideal world someone would design and give us a system that provides perfect security right out of the box, in a small & elegant & easy to understand manner.

I'm not smart enough to do that so I'll only dream of the unicorns. :-)


Excellent point! I really hope such a sensible suggestion is added to mainstream compilers asap and blessed in future standards.

Apologies to everyone suffering Mill fatigue, but we've tried to address this not at a language level but a machine level.

As mitigation, we have a stack whose rubble you cannot browse, and no ... No registers!

But the real strong security comes from the Mill's strong memory protection.

It is cheap and easy to create isolated protection silos - we call them "turfs" - so you can tightly control the access between components. E.g. you can cheaply handle encryption in a turf that has the secrets it needs, whilst handling each client in a dedicated sandbox turf of its own that can only ask the encryption turf to encrypt/decrypt buffers, not access any of that turf's secrets.

More in this talk http://millcomputing.com/docs/security/ and others on same site.


> we have a stack whose rubble you cannot browse, and no ... No registers!

Wow. There's the Wheel of Reincarnation [1] in action. The Intel iAPX 432 microprocessor had similar ideas.[2] E.g. no programmer visible general purpose registers, "capability-based addressing" to control access to memory.

That was a mere 30+ years ago. Let's hope you're more successful than they were.

[1] http://www.catb.org/jargon/html/W/wheel-of-reincarnation.htm... [2] http://en.wikipedia.org/wiki/IAPX432#Object-oriented_memory_...


If I'm understanding the idea, this reminds me of the processor in my first computer, the TMS9900 used in the TI-99/4[a] computers.

This processor didn't have general purpose registers, it had "workspaces" in RAM that served as register sets. The processor had a workspace pointer register that pointed to the workspace currently in use. This was cool because it meant that a context switch could be achieved by just changing the workspace pointer. However the downside is that RAM access is slower than register access.


They were cool. That processor is fairly maligned in my experience but it had some solid ideas. I built a CPU in an FPGA that used a construction similar to a cache line to hold registers as an experiment once. The one line (64 bytes) was 'reserved' for 16 registers (8 GP, 4 index, and 4 process specific (SP, PC, STATUS, MODE)), a context switch reloaded the registers as it reloaded the cache. That made context switching a bit faster than a full on push/pop stream but not as fast as having dedicated register banks ala SPARC.

These last two posts by Colin though really got me thinking about the whole push for a 'trusted computing base' that folks had back in the 90s. Basically the same argument was used then to justify specific hardware as part of the system for implementing the crypto bits. At the time I thought it was overkill, but I can see now how such a system can confine implementation faults to more detectable domains.


Just wondering, have you talked to the CHERI people? It sounds like there is a lot of commonality of interests there.


I've been following them but we haven't talked. Yet :)


What's the current status of the Mill project? Is there a proof of concept compiler / emulator? What's the bootstrap strategy to get things rolling?


Last I heard they're still limited to sims and are concentrating on patent filings. The bootstrap strategy is LLVM (once they get around LLVM assuming addresses are integers as opposed to the Mill's compound things) and to get Linux running on top of L4 which seems doable[1]. They say they're looking for a niche to start in before going after PCs.

[1]http://l4linux.org/


Why on L4? Is Mill somehow tied to it, architecture-wise? Or is it just that L4 has a smaller footprint and is easier to port?


It's about footprint. We certainly will run Linux on the Mill, but it's work we don't have to put on the critical path. L4 is just a familiar lightweight OS, and we're keen to play with Mill-specific security features which are particularly applicable to microkernels too.

When we do port Linux, I expect it to become much more microkernel-like, as in why would you want your disk drivers to be able to read/write video memory etc?


> why would you want your disk drivers to be able to read/write video memory etc?

That is a much bigger issue than the CPU architecture, as it has more to do with how the peripheral hardware works (firmware, DMA, etc.), but I appreciate the effort.


Well, it's how the CPU architecture exposes the hardware to software, i.e. drivers. The Mill does MMIO but doesn't have rings.


Because the Mill has a single address space with memory protection, which works a lot better with L4 than it does with Linux. Porting Linux directly would probably be possible, but a huge effort the team wouldn't be able to pull off without a lot of extra resources.


We are hard at work :)

There is no public SDK yet, and hardware is also under development.

We've had a simulator for a long time, and we show it off a bit in the Specification talk:

http://millcomputing.com/docs/specification/


It's becoming gradually more tempting to write a crypto library in assembly language, because at least then, it says exactly what it's doing.

Alas, microcode, and unreadability, and the difficulty of going from a provably correct kind of implementation all the way down to bare metal by hand.

The proposed compiler extension, however, makes sense to me. Let's get it added to LLVM & GCC?


That works for well-defined ISAs (like ARM), but not for those with undocumented pipelines, or instructions defined by practice (like x86 and amd64).

In other words, if you write a crypto library in x86 assembler, Intel don't guarantee that they won't introduce a side channel in their next chip model or stepping.


Sadly, I know that only too well: hence my "alas, microcode" comment! A prefix or mode or something which allows code to handle secure data and it gets constant-time multipliers, for example, or true µop-level register zeroisation, would be handy, but also close to unverifiable - we just have to sort of trust it, which sucks.

Until then, we do the best we can with turtles all the way down. Software running under that same undocumented pipeline is going to find it very hard to access or leak (accidentally or otherwise) internal registers, at least.

For the other avenue of attack (cold-boot attacks), it's also notable that registers, at least, have extremely fast remanence compared to cache, or DRAM - bit-fade is a very complex process, but broadly speaking, faster memory usually fades faster.

Digression along that vein: I basically pulled off a cold-boot attack on my Atari 520 STe in the early 1990s (due to my wanting to, ahem, 'debug' a pesky piece of software that played shenanigans with the reset vector and debug interrupts), with Servisol freezer spray pointed directly at the SIMMs in my Xtra-RAM Deluxe RAM expansion (and no, cold-boot attacks are not new, GCHQ's known about them for at least 3 decades and change under the peculiarly-descriptive cover name NONSTOP, I believe?). It just seemed sensible to me: cold things move slower, and they had a particularly long (and very pretty) remanence - I was able to get plenty of data intact, including finding where I needed to jump to avoid the offending routine and continue my analysis with a saner technique (i.e. one that didn't make me worry about blowing up the power supply or electrocuting myself)! It's harder these days - faster memory - but the technique incredibly still works and was independently rediscovered as such more recently: very much a "wait, this still works on modern RAM?" moment for me. (By the way, when I accidentally pulled out the SIMMs with the internal RAM disabled - whoops - and rebooted the Atari on my first try, it actually powered up with an effect that I can only describe as "pretty rainbow noise with double-height scrolling bombs" that would not have looked out of place in a demoscreen! I don't know if that was just mine, but... the ROM probably never expected to find RAM not working, and I guess the error-plotting routine had a very pretty and unusual error in that event?)

I've never seen or heard of anyone pulling off a NONSTOP on a register in a CPU, or actually even on an L1, L2 or L3 cache (maybe an L2 or L3 might be possible, depending on design?). They're fast - ns->µs remanence? - and cooling doesn't help much. I don't know if it's possible at all, but I'd tentatively suggest that it might be beyond practical attack - unless the attacker has decapped the processor and it's already in their lab (in which case you're fucked, no matter what!). That's what suggests that approaches like TRESOR (abuse spare CPU debug registers to store encryption key; use that key to encrypt keys in RAM), despite being diabolical hacks, actually work.

If you fancy giving it a try in the wild by the way, I think a Raspberry Pi might be a good modern test subject - the RAM's exposed on top of the SoC, there are no access problems, and it's cheap so if it dies for science, it's not such a problem. (Of course, you'd probably want to change bootcode.bin so that it dumps the RAM after it enables it but before it clears it.) The VideoCore IV is kind of a beast - and is frustratingly close to being able to do ChaCha20 extraordinarily efficiently, if I can just figure out how to access the diagonal vectors... or, if I can't, whether I can fake it.


If I wanted to read CPU registers from the outside, there's an easy way: JTAG. You should be able to halt the CPU, read (and modify!) the registers, and resume the CPU.

That should be possible even on x86, though on x86 the relevant documentation is probably hard to find. For some ARM processors, it should be as easy as installing openocd.

Of course, JTAG requires physical access to plug the debugging cable, which puts it in a different category of attack.


I can't believe I'd forgotten about JTAG! Yes, that's definitely more viable than decapping! <g> Same completely-doomed threat model though ("attacker has physical access, can do anything they want and take as long as they need").

Sorry, I've been dealing with a few things more recently which, uh, haven't been quite so accommodating to analysis.


There's a fundamental difference between your main memory and your L3 in that the former is DRAM and the later is SRAM. In DRAM you have a charge hidden in a well behind a single transistor and it's designed to be stable for a while (the refresh interval) without anybody doing anything to it. SRAM doesn't have that static component at all, it's a set of 6 or 8 transistors which have two stable configurations when powered and which lose their state as fast as all the other logic in your chip as soon as the power is cut.

You can play with the temperature if you want, but the mechanisms that prevent unauthorized access in normal conditions will have their lifetimes extended or decreased as much as you change the lifetime of the data you're trying to access. And liquid nitrogen temperatures at least tend to make everything happen faster in CMOS circuitry. That's governed by a complex interaction between the effect of temperature on carrier density and carrier mobility, so I'm not sure that you couldn't slow things down with, say, liquid helium, but even then I'm not sure you're buying anything.


Yes, SRAM has close-to-zero remanence. But if I were running, say, a Haswell, my L3 might indeed be DRAM (depending on the model).

Saves die space, even with the controller overhead. They love it on SoCs in general, particularly where anything with embedded graphics is concerned.

I'm curious what mechanisms you're talking about. Did you mean physical access? General-purpose processors don't have those (why would they?). Specialised cryptographic SoCs which try to prevent physical attack… well, let's just say for now that results may vary and that if a determined adversary has unlimited physical access to a device, you cannot prevent its compromise.


Another good reason to write crypto in assembly is to ensure that the implementation is not susceptible to timing attacks. If your code has different code paths that take different amounts of clock time attackers can use that. This can be difficult to achieve in a high level language.


Using assembly won't preclude timing-attack vulnerabilities, especially on x64. Nowadays, beating even the C compiler performance-wise with hand-written assembly is exceedingly difficult.


The point isn't to be faster, it is to be consistent.


That's what I mean: actually getting it consistent is hard, as the performance is really hard to predict and may change even with a CPU stepping. Even then it requires very solid planning as well.


I think you misunderstand timing sidechannels. The idea is that (for example) if you compare two strings with length 15 you compare all 15 chars regardless of whether you find that the 3rd char is already different.

You only need to be consistent with yourself. Stepping is completely irrelevant here.
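
For reference, the "compare all 15 chars regardless" idea usually looks something like this in C (a sketch, not from the thread; ct_compare is just an illustrative name):

  #include <stddef.h>
  #include <stdint.h>

  /* Touches every byte and never branches on the data, so the running time
     does not depend on where (or whether) the inputs differ. */
  static int ct_compare(const uint8_t *a, const uint8_t *b, size_t len)
  {
      uint8_t diff = 0;
      for (size_t i = 0; i < len; i++)
          diff |= a[i] ^ b[i];                /* accumulate differences */
      return diff != 0;                       /* 0 iff equal */
  }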


An instruction that is constant time in one CPU may vary its time based on input in the next version of the CPU. That could still provide a timing channel in your example of a comparison if the comparison instruction finishes faster if, say, the left most bit is a mismatch.


>I think you misunderstand timing sidechannels.

I don't :) Basically you want all the code branches to result in similar (same) timings. Based on the CPU and the data inputs those timings would vary, hence assembly alone won't do.


David Beazley after analyzing 1.5 Tbytes of C++ code shows in "Some Lessons Learned": C++ -- SUCKS, Assembly code -- ROCKS http://www.youtube.com/watch?v=RZ4Sn-Y7AP8#t=2049


This is what djb is doing using his "qhasm" assembly like language. He seems to be doing it for performance though, not to work around too aggressive compilers.

As an alternative, maybe write crypto algorithms in LLVM IR?


Adding an annotation for qhasm where stack variables/registers would be zero'd at the end of the function if they still contained sensitive data would be great.

What I'd really like to see is qhasm put on github along with the syntax files he or others create. q files aren't really useful without the syntax files they were made for, and without a central repo, custom-made syntaxes will be a mish-mash of random decisions and instructions.


For AESNI, you probably are already using some sort of assembly to call the instructions. In the same assembly, you could wipe the key and plaintext as the last step.

For the stack, if you can guess how large the function's stack allocation can be (shouldn't be too hard for most functions), you could after returning from it call a separate assembly function which allocates a larger stack frame and wipes it (don't forget about the redzone too!). IIRC, openssl tries to do that, using a horrible-looking piece of voodoo code.

For the registers, the same stack-wiping function could also zero all the ones the ABI says a called function can overwrite. The others, if used at all by the cryptographic function, have already been restored before returning to the caller.

Yes, it's not completely portable due to the tiny amount of assembly; but the usefulness of portable code comes not from it being 100% portable, but from reducing the amount of machine- and compiler-specific code to a minimum. Write one stack- and register-wipe function in assembly, one "memset and I mean it" function using either inline assembly or a separate assembly file, and the rest of your code doesn't have to change at all when porting to a new system.
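
One common shape for the "memset and I mean it" helper (a sketch, assuming GCC/Clang extended inline asm; a separate assembly file works just as well):

  #include <stddef.h>

  static void secure_memzero(void *p, size_t len)
  {
      volatile unsigned char *v = (volatile unsigned char *)p;
      for (size_t i = 0; i < len; i++)
          v[i] = 0;                           /* volatile stores can't be dropped as dead */
      __asm__ __volatile__("" : : "r"(p) : "memory");   /* barrier: buffer may be read later */
  }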


I don't think this can be a language feature. It's more a platform thing: Why is keeping key material around on a stack or in extra CPU registers a security risk? It's because someone has access to the hardware you're running on. (Note that the plain-text is just as leaky as the key material. Yike!)

So stop doing that. Have a low-level system service (e.g., a hypervisor with well-defined isolation) do your crypto operations. Physically isolate the machines that need to do this, and carefully control their communication to other machines (PCI requires this for credit card processing, btw). Do end-to-end encryption of things like card numbers, at the point of entry by the user, and use short lifetime keys in environments you don't control very well.

The problem is much, much wider than a compiler extension.


Sensitive information doesn't exist in a vacuum. What we need to protect is more than some keys that can be carefully loaded onto a crypto processor hiding in a secure bunker. Yes, users should have security too. The point of entry matters too.

So how do you get that isolated box onto everyone's computer and phone? How do you move these users' sensitive information onto that isolated box without leaving a trace on their non-isolated computer? How do you move their keys around?

When you use two systems to process sensitive information, you have at least two problems to solve...


This is also why dedicated cryptoprocessors exist, with special features for attack resistance; I'm not completely certain about this, but I'd think the software running on those does not have to zero memory containing keys, because the whole environment that said software runs in has been secured from the outside already, and if it's possible to read any memory or run untrusted code from outside on those without being detected, then there are far bigger problems to worry about...


Having seen a few existing designs of those, up-close and personal - actually they do have to worry about zeroisation, quite an awful lot.

And sometimes they don't worry enough either. They ought to fail a FIPS-style audit for that. But, well... they ought not to contain proprietary LFSR "crypto" algorithms, either. They are not as well audited, or as publicly designed, as they ought to be: many are as black-box closed-source as they could possibly be.

They tend to be based on extraordinarily old architectures with new bits glued on - think Intel 8051, that kind of era. If you're really lucky you might get an ARM, or at least a Thumb. People making them are notoriously hyper-conservative (most don't support ECC yet, and many don't even go above RSA-2048 or SHA-1 without going to firmware), and minimise any changes, perhaps for cost reasons, the effects of which are not always positive (actually, CFRG are discussing that general area right now in the context of side-channel defences for elliptic-curve crypto).

So, how would you think that environment translates to writing secure firmware, or designing secure, state-of-the-art hardware? ;-)


Are current GPUs suitable subsystems for running properly isolated cryptographic algorithms? If not, why not? If yes, perhaps a well-audited open source library would be possible.


In the presence of closed-source drivers that manage them and compile the shaders? I'd say probably not. Something open-source with actual direct access to the opcodes (would Mantle work? Intel's embedded GPUs?)… maybe. I don't know if I'd consider them secure. Some running hot and loud and by the seat of their pants? The presence of DMA? Hm. I have my doubts they'd be better than the CPU safety-wise.

What I do know is they can usually run them very fast if asynchronous low-communication parallelisation is not undesirable - GPUs overtook PlayStation 3 Cell processors for the "throw watts at it, but we don't have enough money to burn FPGAs or ASICs" class of crypto attacks quite a few years ago now. (As anyone who runs a Bitcoin mine knows!)

Might be effective on, say, the Raspberry Pi, where the "GPU"'s the ringmaster and the ARM's the clown. That vector processor looks tempting, and if I could figure out how to get it to do diagonals, it's a poster child for ChaCha.


Remember this the next time someone says "C is basically portable assembler." It's not, precisely because you can do many things in assembly that you can't directly do in C, such as directly manipulating the stack and absolutely controlling storage locations.


> For encryption operations these aren't catastrophic things to leak — the final block of output is ciphertext, and the final AES round key, while theoretically dangerous, is not enough on its own to permit an attack on AES

This is incorrect. The AES key schedule is bijective, which makes recovering the last round key as dangerous as recovering the first.


Oops, quite right. I was looking at the "mix and xor" and my brain jumped to "oh, this is the standard hash idiom" and I completely missed the fact that the word being xored is not the word being mixed...


How hard is that attack to code? I have a hard time imagining a case where a target leaks just a subkey, so this is one of those things I knew "about" but not "how".


Dead simple. 2nd year undergraduate programming assignment.


Is it perhaps so simple that... Colin Percival could effectively describe how to do it in an HN comment, perhaps even challenging someone like Thomas Ptacek to code it up and publish it instead of just yakking on HN like he always does I hate him so much?


Each word in the 4-word AES round keys is computed as w[i] = Mangle(w[i - 1]) xor w[i - 4], where Mangle(x) = Subword(Rotword(x)) xor Rcon for i%4=0 and Mangle(x) = x otherwise.

Just turn that around and you get w[i - 4] = w[i] xor Mangle(w[i - 1]). Now start with i = 43 (i.e., w[i] is the last word of the last round key) and count backwards, filling in words of the round keys until you get to w[0]. Then w[0..3] is the AES key.
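
Spelled out as code, that recipe looks roughly like this for AES-128 (a sketch following the parent's description; sbox is the standard AES S-box and words are packed most-significant byte first, as in FIPS-197):

  #include <stdint.h>

  static const uint8_t sbox[256] = {
    0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
    0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
    0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
    0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
    0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
    0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
    0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
    0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
    0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
    0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
    0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
    0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
    0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
    0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
    0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
    0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16
  };

  static uint32_t rot_word(uint32_t w) { return (w << 8) | (w >> 24); }

  static uint32_t sub_word(uint32_t w)
  {
      return ((uint32_t)sbox[(w >> 24) & 0xff] << 24) |
             ((uint32_t)sbox[(w >> 16) & 0xff] << 16) |
             ((uint32_t)sbox[(w >>  8) & 0xff] <<  8) |
              (uint32_t)sbox[ w        & 0xff];
  }

  /* Given the last round key (w[40..43]), recover the original key (w[0..3]). */
  void aes128_unwind_key_schedule(const uint32_t last[4], uint32_t key[4])
  {
      static const uint32_t rcon[11] = { 0, 0x01000000, 0x02000000, 0x04000000,
          0x08000000, 0x10000000, 0x20000000, 0x40000000, 0x80000000,
          0x1b000000, 0x36000000 };
      uint32_t w[44];
      for (int i = 0; i < 4; i++)
          w[40 + i] = last[i];
      for (int i = 43; i >= 4; i--) {         /* count backwards, filling in w[i-4] */
          if (i % 4 == 0)
              w[i - 4] = w[i] ^ sub_word(rot_word(w[i - 1])) ^ rcon[i / 4];
          else
              w[i - 4] = w[i] ^ w[i - 1];
      }
      for (int i = 0; i < 4; i++)
          key[i] = w[i];
  }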


It's pretty straightforward to just iterate the key schedule backwards using the inverse S-box and a few xors; no need for any fancy stuff.


cperciva already answered, so I'll just add that most side-channel attacks (at least those using power analysis) on AES typically focus on the last round.


A-ha. That makes a lot of sense. Thank you!


Anything sent over HTTP(S), such as your credit card numbers and passwords, likely already passes through generic HTTP processing code which doesn't securely erase anything (for sure if you're using separate SSL termination). Anything processed in an interpreted or memory safe language puts secure erasure outside of your reach entirely.

Afaict there's no generic solution to these problems. 99.9% of what these code paths handle is just non-sensitive, so applying some kind of "secure tag" to them is just unworkable, and they're easily used without knowing it... it only takes one ancillary library to touch your data.


Some of this can be addressed by never giving sensitive data to remote servers. This wouldn't work for credit cards, but with Bitcoin you never need to let a non-bitcoin library touch your private key, because that's not going over https.

Similarly, if you encrypt all of your information from within a safe library before handing it out to unsafe libraries, they can't leak anything. This can add overhead and redundant encryption (and you still need to trust that the remote server processing your data is safe), but there are steps you can take to be more safe.


"As with "anonymous" temporary space allocated on the stack, there is no way to sanitize the complete CPU register set from within portable C code"

I don't know enough of modern hardware, but on CPUs with register renaming, is that even possible from assembly?

I am thinking of the case where the CPU, instead of clearing register X in process P, renames another register to X and clears it.

After that, program Q might get back the old value of register X in program P by XOR-ing another register with some value (or just by reading it, but that might be a different case (I know little of hardware specifics)), if the CPU decides to reuse the bits used to store the value of register X in P.

Even if that isn't the case, clearing registers still is fairly difficult in multi-core systems. A thread might move between CPUs between the time it writes X and the time it clears it. That is less risky, as the context switch will overwrite most state, but, for example, floating point register state may not be restored if a process hasn't used floating point instructions yet.


Register renaming doesn't work like that. How could register contents of a process changing randomly even be usable for anything? Register renaming is about dynamically mapping a small number of ISA register names to a larger number of hardware registers to increase parallelism, but the whole reason for the exercise is that those additional registers don't have ISA names, so you obviously can't read them explicitly, at least not as part of the normal instruction set, who knows what backdoors some CPUs might have ...


Once a rename register is garbage collected, it's flagged as "not ready" which is a state in which any instruction attempting to read it will block. They can only be scheduled once it's been written to.


Register renaming is transparent (aside from performance) even to assembly. Multi-core systems are irrelevant, as each core has the same set of registers and registers are not (visibly) shared amongst cores.


This article makes a good point, but I think the problem is even worse than he describes.

Computer programs of all kinds are being executed on top of increasingly complicated abstractions. E.g., once upon a time, memory was memory; today it is an abstraction. The proposed attribute seems workable if you compile and execute a C program in the "normal" way. But what if, say, you compile C into asm.js?

Saying, "So don't do that" doesn't cut it. In not too many years I might compile my OS and run the result on some cloud instance sitting on top of who-knows-what abstraction written in who-knows-what language. Then someone downloads a carefully constructed security-related program and runs it on that OS. And this proposed ironclad security attribute becomes meaningless.

So I'm thinking we need to do better. But I don't know how that might happen.


You remind me why it's so hard to do secure deletion: there are a bunch of abstractions built on old assumptions that no one cares about secure deletion. If you forget your pointer to that memory, it can be reused, so it's effectively deleted, we're all good, right? Meanwhile, the file you "sync"ed to disk might be synced to a network drive or flash memory or a zillion cache layers.

I think we need, right at the base metal, a way of saying "this data needs to not be copied" and/or "if you do copy it you must remember all copy locations so we can sanitize them all." And then we require every abstraction on up to have a way of maintaining this, the same way all the abstractions are required to, say, let us read data.

Or I guess this is part of what HSMs are supposed to do -- do all your "secure" work in something that is very strictly controlled.


And if I run my C program in an emulator that allows me to freeze it and dump memory I can do this stuff too...

The point is, if you want security you need to look at the whole system and in the situation you describe you can't guarantee it, no.

I'm not going to say "So don't do that", but I am going to say "If you're going to do things like that, please realise that the assumptions the system security was built on no longer hold true".

I think to do it better we just need to pay a bit more attention. And try not to let ourselves get into situations (cough heartbleed cough) where memory zeroing is actually an important feature. IE - by the time the attacker is able to read your process memory you're probably already screwed.


You could try for an abstraction / language that provides deterministic execution.


If I have enough control to the point where I can read your memory in some way, I can just use ptrace. Heck, I could attach a debugger. It seems ludicrous to want that level of protection out of a normal program running on Mac/Win/Linux.

Now, if your decryption hardware was an actual separate box, where the user inserts their keys via some mechanism and you can't run any software on it, but simply say "please decrypt this data with key X", then we'd be on to something. It could be just a small SoC which plugs into your USB port.

Or you could have a special crypto machine kept completely unconnected to anything, in a Faraday cage. You take the encrypted data, you enter your key in the machine, you enter the data and you copy the decrypted data back. No chance of keys leaking in any way.


One of the other things you're sorta-describing is an HSM.

These are dedicated boxes that just do crypto. You keep them on the network or attached via a serial port or... whatever. Accessible to your machines but not the outside world. Then you send them messages to ask them to encrypt and decrypt data for you. That way the keys never leave the box. The HSM doesn't accept new software, nor does it ever expose the keys to anyone.

They are, however, quite expensive.


What you're describing is called a smartcard, and readily available on the market. I keep my PGP key on one.


Does your PGP key stay on the smartcard or is a copy of it transferred to your computer on occasion?


The key can be generated on the smartcard, and it's not possible to transfer it out of the smartcard by design. (anything that calls itself a smartcard but allows this isn't a smartcard)


If it's a properly designed smartcard system then the key never leaves the card.


Please, assembly is OK. It's not even magic or special wizardry. My dad programmed and maintained insurance industry applications in assembly side by side with many other normal office workers for decades. Assembly is OK.


Assembly is bad for auditability, which is important in crypto to prevent subtle errors.


As a seasoned and experienced reverse-engineer myself, I'm (genuinely) curious where you got that impression. Do you find it unapproachable?

Assembly is the simplest language you can write a computer program in, for a certain very textbook definition of "simple" - it's just that you actually have to do everything by hand that you normally wouldn't. And yes, that can be a pain in the ass, and yes, you do have to watch out for not seeing the wood for the trees - but one thing it most definitely is, is auditable.

Bearing in mind, say, the utter briar-patch that is OpenSSL: a crufty intractably complex library written in a high-level language with myriad compiler bug workarounds, compatibility kludges and where - despite it being open source, and "many eyes making bugs shallow" - few eyes ever actually looked, or saw, or wanted to see, and when attention was finally paid to it, it was found wanting... might not assembly be perhaps better for a compact, high-assurance crypto library? Radical, I know, but perhaps an approach that's worthy of consideration.

I understand you may well be more familiar with high-level languages, and I don't know if you're confident about your ability to audit that - but I must point out, if you're auditing it from source, you're trusting the compiler to faithfully translate it. So to actually audit the code, you need to include the compiler in that audit. Compilers have (lots of) bugs and oversights too (lots of OpenSSL cruft is compiler bug workarounds, it seems?): as the article points out, existing compilers just weren't really designed to accommodate writing secure code.

Meanwhile an assembler makes a direct translation from source assembly to object machine code - that is deterministic (a perniciously-hard process with compilers) and much more easily, and automatically, auditable and indeed directly reversible.

To be clear, I'm not suggesting we replace, say, libsodium with something written in assembly language tomorrow! There are good high-level language implementations. And inline assembly is already used in some places for certain functions, including this exact one (zeroing memory), to try to minimise the compiler second-guessing us. But as the article points out, that approach only takes us so far, and it's something we need to be guarded against when trying to write secure code.


The briar patch of OpenSSL is more in the high level protocol code, and not the asm crypto (the perl obfuscation layer makes it fun, but isn't a major source of bugs). I would not want to write a robust asn.1 parser in assembly. Lots of other cruft works around the presence or absence of various #define values in header files. Rewriting in assembly is not going to solve the problem of deciding how big socklen_t is.


Mm, true: the long grass of the libcrypto part pales in comparison to the thorny nightmare that is the rest of it. Thank you to OpenBSD's LibReSSL for beginning to clear away the worst of the bramble (and uncovering the occasional juicy blackberry in the Valhalla rampage process).

There's a crypto library, and then there's a protocol library. And TLS, to put it politely, has lots of hairy bits, and I hope and pray TLS 1.3 makes a positive impact on that, but I'm not yet sure if it will.

If one were developing from scratch (and I am not) I'd wonder if a reasonable approach would be a ridiculously low-level approach for the primitives, but a ridiculously high-level approach for the protocol. I might consider writing the first in assembly, but the second? If it involved an ASN.1 parser? I would prefer to do something else, anything but that! :-) At the very least, if someone did do it, we'd be able to see exactly how. We just might not want to! I would suggest perhaps instead, maybe something involving formal correctness proofs and then converting those to assembly, because miTLS indicates that this can be an enlightening approach, and seL4 proves that it's possible to actually make use of? (For someone else. I think it is outwith my expertise!)


In light of the current discussion, it's hard to make such a clean distinction. Your private key is going to be stored in a file that goes through the PEM and ASN.1 parsers. It's going to hang around for a bit while you sign stuff (using some sweet asm code), but now you need to dispose of it. The object lifetime is often much longer than we'd like even with perfect zeroing, and there are some ways to address that, but it casts a long "shadow" on the call graph, not all of which can be made minimal.

In short: imperfect buffer zeroing probably reduces risk enough that it drops below several other concerns.


If you restrict yourself to super simple things like zeroing chunks of memory and the unused stack then it still might be acceptable.


Assembly is not really portable and is error-prone. I don't consider it wizardry (or even especially hard), but corner cases are hard in C and even harder in assembly.


In the cases where the alternative is fighting the compiler every step of the way, assembly may very well be less error prone as long as you limit it to exactly the small areas where it will help.


The suggestion has the right idea, but the wrong implementation. The developer should be able to mark certain data as "secure" so the security of the data travels along the type system.

Botan, for example, has something called a "SecureVector" which I have never actually verified as being secure, but it's the same idea.


This was my initial idea, but talking to compiler developers convinced me that the dataflow analysis needed for this would be tricky. They were much happier with the idea of a block-scope annotation.


Similar data-flow analysis techniques to those used for volatile.


Why are there no suggestions to change processors accordingly? Intel should be considering changing the behavior of its encryption instructions to clear state when an operation is complete or at the request of software. Come to think of it, every CPU designer should be considering an instruction to clear the specified state (register set A, register set B) when requested by software. Then, the compiler can effectively support SECURE attributed variables, functions, or parameters without needing to stuff the pipeline with some kind of sanitizing code.


You can clear the CPU state. But how is the CPU to know when it's safe to clear unless the software tells it?


Try:

  #include <string.h>

  void bar(void *s, size_t count)
  {
        memset(s, 0, count);
        __asm__ ("" : "=r" (s) : "0" (s));
  }

  int main(void)
  {
        char foo[128];
        bar(foo, sizeof(foo));
        return 0;
  }

  gcc -O2 -o foo foo.c -g
  gdb ./foo
  ...
  (gdb) disassemble main
  Dump of assembler code for function main:
   0x00000000004003d0 <+0>:	sub    $0x88,%rsp
   0x00000000004003d7 <+7>:	mov    $0x80,%esi
   0x00000000004003dc <+12>:	mov    %rsp,%rdi
   0x00000000004003df <+15>:	callq  0x400500 <bar>
   0x00000000004003e4 <+20>:	xor    %eax,%eax
   0x00000000004003e6 <+22>:	add    $0x88,%rsp
   0x00000000004003ed <+29>:	retq   
  End of assembler dump.
  (gdb) disassemble bar
  Dump of assembler code for function bar:
   0x0000000000400500 <+0>:	sub    $0x8,%rsp
   0x0000000000400504 <+4>:	mov    %rsi,%rdx
   0x0000000000400507 <+7>:	xor    %esi,%esi
   0x0000000000400509 <+9>:	callq  0x4003b0 <memset@plt>
   0x000000000040050e <+14>:	add    $0x8,%rsp
   0x0000000000400512 <+18>:	retq   
  End of assembler dump.


That should be __asm__ __volatile__ but extended inline asm, even with no actual opcode (even if "nop" would work pretty much everywhere) is not portable. So at this point, you might just use clang/gcc/icc pragmas instead.


It very much looks like a situation in which the system has already been compromised and is running malicious programs that it shouldn't. These malicious programs could still face the hurdle of being held at bay by the permission system that prevents them from reading your key file.

However, they could indeed be able to circumvent the permission system by figuring out what sensitive data your program left behind in uninitialized memory and in CPU registers.

Not leaving traces behind then becomes a serious issue. Could the kernel be tasked with clearing registers and clearing re-assigned memory before giving these resources to another program? The kernel knows exactly when it is doing that, no?

It would be a better solution than trying to fix all possible compilers and scripting engines in use. Fixing these tools smells like picking the wrong level to solve this problem ...


I'm not sure this is the scenario we're fighting. The problem is when your program (which handles sensitive data) has a flaw in it: for example, it might be possible to trick it into leaking uninitialized data (possibly out of bounds) over the wire. Another potential issue is core dumps (and maybe swapping, but that's a little different). You don't want sensitive data to be written on the disk.

Malicious programs running with your program's privileges are a different scenario altogether, and usually they can do a lot of damage. Want sensitive information out of another process? Try gdb.

But yes, it is trivial for the kernel to zero a page before handing it out.


What about malicious programs without privileged access ? Is it possible for them to just keep requesting new memory pages from the kernel and see leaked data that was free'd by another process they shouldn't have access to or is this something kernels are already preventing ?


WRT the AESNI leaking information in the XMM registers, wouldn't starting a fake AES decryption solve the problem?

Also, wouldn't a wrapper function that performs the AES decryption and then manually zeroes the registers be a good enough work around?


If you're using AES-NI, you're already using an intrinsic. I haven't yet met a compiler "smart" enough to recognise an AES implementation and replace it with AES-NI, and god, I hope I never do.

Yes, you probably ought to be clearing xmm* registers touched by it, and that would I hope be good enough.

The point in the article about compiled code very seldom touching xmm*, so that if you don't wipe it (is doing so currently common practice? I haven't checked, but I feel like that is something that needs checking!) it hangs around and you might leak it, is completely valid, however.
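
For what it's worth, "manually zeroes the registers" can be done with a handful of pxor instructions; a sketch, assuming x86-64 and GCC/Clang inline asm (on 32-bit x86 only xmm0-xmm7 exist):

  static inline void clear_xmm_registers(void)
  {
      __asm__ __volatile__(
          "pxor %%xmm0, %%xmm0\n\t"   "pxor %%xmm1, %%xmm1\n\t"
          "pxor %%xmm2, %%xmm2\n\t"   "pxor %%xmm3, %%xmm3\n\t"
          "pxor %%xmm4, %%xmm4\n\t"   "pxor %%xmm5, %%xmm5\n\t"
          "pxor %%xmm6, %%xmm6\n\t"   "pxor %%xmm7, %%xmm7\n\t"
          "pxor %%xmm8, %%xmm8\n\t"   "pxor %%xmm9, %%xmm9\n\t"
          "pxor %%xmm10, %%xmm10\n\t" "pxor %%xmm11, %%xmm11\n\t"
          "pxor %%xmm12, %%xmm12\n\t" "pxor %%xmm13, %%xmm13\n\t"
          "pxor %%xmm14, %%xmm14\n\t" "pxor %%xmm15, %%xmm15"
          : : : "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7",
                "xmm8","xmm9","xmm10","xmm11","xmm12","xmm13","xmm14","xmm15");
  }

Note this only clears the architectural xmm state; it says nothing about whatever the microarchitecture keeps underneath, which is the article's wider point.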


Every time I read one of these posts about a clever "attack vector", how something can be gleaned from this special register, or a timing attack, or somesuch, I remember a theory that the sound of a dinosaur's scream could be extracted from the impact its sound waves made on a rock's crystal structure.

I googled pretty hard for real-life examples of a timing attack being used, and now of stale data in registers being used, but couldn't find anything. Does anyone know of examples of this actually being done?


These types of attacks though only require one person to create a system that can reliably exploit them, and then the vulnerability will be in the wild and a more significant problem. Pulling off this type of attack is difficult, but you only need one piece of malware that has a reliable way to exploit this in a general case and then it becomes available to every script kiddie who finds some motivation for stealing private keys.

These type of attacks also might become more of a problem as more sensitive computation is done on shared machines (IE cloud compute).

So, while there's no reason to panic just because these security features are hardly implemented anywhere, you can't let the issues sit unaddressed for long periods of time.


But there is a whole range of potential issues, or things compiler developers can do. As with any task, they should be sorted, weighted by ease of exploitation and ease of solving. What I suspect, and I'm just curious to see if I am wrong, is that developers postulate vulnerabilities that real hackers would never bother with, and miss what they really go for: trivial mistakes such as forgetting bounds checking.

So, I've seen a lot of (conceptually) trivial exploits and combinations of trivial exploits, but I would love to see a real-world example of someone collecting enough information from a 'bad RNG', registers, or timing, to do anything with it.


For examples of real implementations of timing attacks, try this: http://www.contextis.com/documents/2/Browser_Timing_Attacks....

Some of those are fixed now, but the history stealing link redrawing one is still an issue as far as I know (or at least, this bug is still open https://bugzilla.mozilla.org/show_bug.cgi?id=884270 ).


Thanks, that's pretty awesome. However, I was talking about attacks relying on non-constant time memory copy or math function that is used to somehow defeat a server or cryptography.


A few examples:

"Remote Timing Attacks are Practical" https://crypto.stanford.edu/~dabo/papers/ssl-timing.pdf

"RSA Key Extraction via Low-Bandwidth Acoustic Cryptanalysis" http://www.tau.ac.il/~tromer/acoustic/


Doesn't actually seem true. OK, running the decrypt leaves the key and data in SSE registers that are rarely used where it might be looked up later by attackers. There isn't any portable way to explicitly clear the registers. Then why not just run the decrypt again with nonsense inputs when you are done to leave junk in there instead? Yes, inefficient, but a clear counter example. You could then work on just doing enough of the nonsense step to overwrite the registers.


> Then why not just run the decrypt again with nonsense inputs when you are done to leave junk in there instead?

Because the compiler is perfectly within its rights to optimize that out!


Not if you write the junk output to a volatile variable, right?


Let me clarify.

If you use a deterministic nonsense value, the compiler can turn the result of decrypt(nonsense) into a constant at compile time, and just directly output the constant to the volatile variable, without actually ever calling decrypt again at runtime. So it can turn this:

    decrypt(real);
    nonsense = <whatever>;
    volatile junk = hash(decrypt(nonsense));
Into this:

    decrypt(real);
    volatile junk = <the appropriate constant value>;
But even if your nonsense is non-deterministic (although I question where you are getting the entropy - if you're using a syscall / random / etc your performance has potentially just gone out the window), the compiler is well within its rights to optimize the second junk decrypt of the nonsense input differently than the first (real) decrypt - in such a way that it does not overwrite everything left behind by the first decryption.

(Same with encryption)


I think the "slowness" aspect of "do it twice" is a given, making the best of a bad situation. Clearly, the people who want languages or hardware to do this "right" have a better solution, but if you have to get by with C on existing hardware, it would seem that an option in a library to select further security at less speed is reasonable. Of course assuming said option runs code that won't be optimized away.


That's just a workaround though. It still does not invalidate the basic thrust "I can't write code to handle keys in C and be sure I have not left copies anywhere"

The proposal seems good.


Even if the proposed feature is added to C and implemented, there is still the (practical) problem of OS-level task switching: when your process is interrupted by the scheduler, its registers are dumped into memory, from where they might even go into swap space.

It would be consequential (but utterly impractical) to add another C-level primitive to prevent OS-level task suspension during critical code paths. Good luck getting that into a kernel without opening a huge DoS surface :)


The obvious fix is to address "might go into swap space". However, the real problem is that the process can be interrupted at any time and examined, not that the registers might go to swap.

If someone has the root privs to peek at your memory, they can also stop your process at any time and examine all the registers, whether they were swapped out to disk or not.

Moving the crypto code into the kernel and running with disabled interrupts doesn't help because the attacker is already assumed to have super-user privileges (they can peek at arbitrary RAM, after all). There are also non-maskable interrupts.

You basically cannot hide the machine state from someone who controls the machine: not without splitting the machine itself into additional privilege levels, such that there is a security level that is not accessible even to the OS kernel. The sensitive crypto routines run in that level. The manufacturer of the SoC provides these as firmware, and the regular kernel has no visibility to the internals.

ARM has a security model that supports this.

There is also something even more paranoid called TrustZone: http://en.wikipedia.org/wiki/ARM_architecture#Security_exten...


Posts like this make me just more convinced that C combines the worst of "portability" and "assembly" into "portable assembly".


I don't completely understand the C spec. Would the following approach work for zeroing a buffer?

1) Zero the buffer.

2) Check that the buffer is completely zeroed.

3) If you found any non-zeros in the buffer, return an error.

Is the compiler still allowed to optimize away the zeroing in this case?


You are mixing up the C-level buffer abstraction and some potentially underlying RAM. C doesn't deal with RAM, only with the abstraction, so you can't look at the RAM in C; you can only look at the abstract buffer. The only thing the compiler has to guarantee is that the abstraction holds - namely, that after you write zeroes into the abstract buffer, a subsequent conditional that checks whether the buffer is zeroed will branch accordingly. That is trivial to evaluate at compile time, and as soon as the compiler has determined that the conditional is statically decided, it can eliminate any alternative branches as dead code and translate the abstract buffer write into a no-op at the machine-code level.


> Is the compiler still allowed to optimize away the zeroing in this case?

Yes, completely. In the snippet below, the compiler is allowed to eliminate all code after “leave secrets in array c”.

  {
    char c[2];
    ... /* leave secrets in array c */
    memset(c, 0, 2);
    c[0] = 0;
    c[1] = 0;
    memset(c, 0, 2);
    if (c[0] || c[1]) exit(1);
  }
The compiler is also allowed to compile the last three instructions below as if they were “return 0;”

  {
    char c[2];
    ... /* leave secrets in array c */
    c[0] = 0;
    c[1] = 0;
    return c[0] + c[1];
  }


> In the snippet below, the compiler is allowed to eliminate all code after “leave secrets in array c”

gcc 4.4.5 doesn't though (-O3), it still clears the stack once and performs the comparison.

I believe these optimizations can be defeated by declaring a global

  volatile char fill = 0;
and using that instead of 0 in memset().


It's not guaranteed to defeat the optimization. For instance, it could just read fill into two registers and do the comparison there.


> Is the compiler still allowed to optimize away the zeroing in this case?

With 'volatile', generally not, modulo bugs. Without volatile, it would never return an error.


I was wondering that too. I would think that simply accessing any byte in the buffer afterwards would prevent the compiler optimizing it out.


That depends how much the compiler can optimize (away). If the next call is free(), it's quite trivial to skip the zeroing and just take the correct branch.

I am still uncertain why people want to just 'zero' it. Filling it with random data (just one random() call), then using an inline PRNG, then summing the result and storing it globally in a volatile would reliably 'zero' the data, but it's quite CPU-intensive.


You still are not guaranteed to clear the buffer like that.

  for (int i=0; i<len; i++){
    sensitiveBuffer[i]=random();
  }
  int sum=0;
  for (int i=0; i<len; i++){
    sum+=sensitiveBuffer[i];
  }
  volatileVar=sum;
Using loop fusion, the compiler can optimize this to:

  int sum=0;
  for (int i=0; i<len; i++){
    sensitiveBuffer[i]=random();
    sum+=sensitiveBuffer[i];
  }
  volatileVar=sum;

Which it can then optimize to:

  int sum=0;
  for (int i=0; i<len; i++){
    sum+=random();
  }
  volatileVar=sum;

In fact, as the article points out, the compiler can legally transform:

  reallyZeroBuffer(sensitiveBuffer);
into

  pointlesslyCopy(sensitiveBuffer);
  reallyZeroBuffer(sensitiveBuffer);


Not exactly like that: I didn't mean it simply that way, more like a hash. The 1st example is not optimal but shows the idea.

I disagree about "pointlessCopy". Of course it's permitted by the standard but it's not an optimization. Using such a broken compiler is beyond help.

-------

  volatileVar *= random();
  for (int i=0; i<len; i++){
    volatileVar += buf[random()%len] + 0x61c88647;
    buf[i] = volatileVar;
  }

=====

  static volatile uint volatileVar;

  for (int i=0; i<len; i++){
    sensitiveBuffer[i]=random();
  }

  for (int i=0; i<len; i++){
    volatileVar += sensitiveBuffer[volatileVar%len];
  }


Why shouldn't the compiler be able to figure out that you sum a series of numbers that aren't used anywhere else, thus don't need to be spilled to RAM?


You write them to a global volatile variable.


How does that require writing the random data to RAM, apart from the volatile variable itself?


There are some chips that provide zeroizing of a small region of device memory when needed; it's specially designed to hold encryption keys etc., and the zeroing is done by hardware.


Would running your file system read-only and optimizing the system for fast bootup be a workaround? If so you could zero successfully by rebooting...


After what? Every https request? Simply exiting the process is sufficient to prevent most info leaks, but even that's much too slow and not even a solution. The class of bugs here is that sensitive data is in memory and then the same program inadvertently leaks it while performing some other operation. If you reboot before the leak, you won't make it to that other operation, sure, but your program won't be much use either.

User logs in by sending password. System transitions to authorized state. System wants to wipe password to avoid later leak. If you reboot at this point, the user will no longer be authorized.


> It is impossible to safely implement any cryptosystem providing forward secrecy in C

What about Rust?


Rust has the same problem described in the article.


I'm guessing that mitigating this at the Rust level isn't doable, because its memory model has the same properties with regards to zeroing. To change that, LLVM support would be needed. This does make me wonder — how do you integrate this into a type system? Rust has already done a pretty awesome job at integrating memory-safety into the type system, but memory-secure type systems seem fairly unexplored.


Memory-security as defined in this article isn't exactly safe. It's just a mitigation feature. The comments above provide plenty of examples of how, once an attacker is on the system, he or she can easily get past any language-level construct you care for. If the system were completely memory-safe (which would mean no memory safety bugs in the hardware, kernel, SELinux (or some other kernel extension that lets you do things like deny ptrace), LLVM, the Rust compiler, any libraries you're using, or your program itself), and you weren't doing anything that completely negated all those benefits like JIT compiling code, then you don't need to zero the memory at all--it will do nothing for you as a mitigation technique, and you'd be "memory safe" by your definition. But you're not out of the woods yet, because even with zeroing you are STILL vulnerable to ordinary, non-memory-safety bugs in your code allowing that data to be read. Multiple threads, forgotten heap allocations, and so on. A user with sufficient privileges could glance at the available cache lines. Etc. Anyway, this entire scenario is a fantasy because your system isn't fully memory safe :)

The only way I can think of to actually guarantee real memory security in any meaningful way is to completely verify a much smaller system (not just memory safety, but that it's actually bug-free), isolate it at the hardware level, and do all of your computation using that hardware isolation feature. It has to be hardware because, for example, there's no reasonable way to deterministically erase data swapped to SSD. You'd still be susceptible to hardware bugs, but you can't ever protect against those completely. So basically, get an HSM :)





