Hacker News new | past | comments | ask | show | jobs | submit login
Heartbleed and Static Analysis (regehr.org)
134 points by pascal_cuoq on April 11, 2014 | hide | past | favorite | 36 comments

It's interesting that these sorts of problems with memcpy, etc., always boil down to the same thing - C arrays are lowered to simple pointers when they are passed as function arguments. This means they lose their array bounds, frustrating attempts at overflow detection. This has caused endless bugs and endless workaround schemes of varying complexity. I wrote about it as "C's Biggest Mistake" a while back:


and offered a simple adjustment to C to fix it. I quote from the article:

All it needs is a little new syntax:

    void foo(char a[..])
meaning an array is passed as a so-called "fat pointer", i.e. a pair consisting of a pointer to the start of the array, and a size_t of the array dimension. Of course, this won't fix any existing code, but it will enable new code to be written correctly and robustly.

This is basically what Bounds Checking GCC[1] did. Except instead of changing the syntax or using fat pointers, it kept a map of (pointer -> bounds), allowing memcpy to check the bounds of any arbitrary runtime pointer, and to interwork with code that isn't compiled with the -fbounds-checking flag.

[1] https://www.doc.ic.ac.uk/~phjk/BoundsChecking.html

C's second biggest mistake, IMO, is NUL-terminated strings. The combination of these two mistakes is much worse than either of them alone. Since you don't know where a string ends without scanning it, and nothing prevents you from writing off the end of the allocated space, the familiar 'strcpy' etc. are loaded guns with hair triggers.

I'd argue that the nul-terminated strings is a direct consequence of arrays decaying to pointers when passed to functions.

They're clearly closely related, but it would technically have been possible to represent strings using a header with a length field followed by the characters in the string. That would have required people to pass the location of an arbitrary character as two explicit values, the string itself and an offset, which would certainly not have been in the spirit of the way C takes advantage of pointers, but nonetheless would have been possible.

BTW I must take issue slightly with your characterization of the problem. Arrays decay to pointers immediately when referenced, not just when passed to functions. I suppose a C implementation could do bounds checking on non-pointer-mediated accesses to local arrays, but I've never seen an implementation do that (except those that do bounds checking more generally). Did yours?

In fact, the way I put it once was, "C does not have arrays. It has pointers and notations for allocating and initializing memory; it does not have arrays." (Meaning, of course, "array" in the sense of a firstclass object, with bounds checking.)

Length-prefixed strings have another disastrous problem - strings cannot be sliced. A substring would always have to be copied.

Arrays decay to pointers when dereferenced, but this wouldn't impede array bounds checking. I didn't bother putting in static array bounds checking in Digital Mars C because pretty much all the interesting cases were buffers passed to a function.

I know C doesn't have first class arrays. That's what my enhancement suggestion was for.

> Length-prefixed strings have another disastrous problem - strings cannot be sliced. A substring would always have to be copied.

Slicing is one common operation that's O(n) with a simple vector representation. I agree it's nice if it can be made O(1), but concatenation is still O(n) with that representation. This is why I'm a fan of ropes or other more exotic sequence data structures that can do all the common operations in log time or better.

> I know C doesn't have first class arrays. That's what my enhancement suggestion was for.

Fair enough. I think it's a good suggestion, but I'm skeptical that it would see wide use. Here's why. It has been straightforward for a long time to declare a struct type that holds a base and bounds, pass that struct by value, and then do one's own bounds checking. (The K&R-era compilers didn't have struct passing by value, but it was introduced in Unix v7, as I recall. That was a long time ago.) This isn't quite as convenient as what you're proposing, but it's not hard, and I don't think I've ever seen anyone do it. (Not that I spend my time reading other people's C code.) It certainly hasn't become standard practice.

Still and all, if you can persuade the committee to put it in the language, and also define new standard library functions that make use of it, I agree it would be a good thing.

Aren't strings immutable in most languages? Like in javascript, String.slice() returns a copy, it does not mutate the original string.

I don't disagree with you, but I don't see how it's a disastrous problem, either.

In D, which has ptr/length arrays, strings are immutable and sliceable without copying. This can make them very fast.

C's NUL-terminated strings were intended for mutating strings in char buffers (where the buffer's capacity was known at compile-time), not for passing around. More detail here: https://news.ycombinator.com/item?id=7475866

That's completely bogus. C doesn't have first-class arrays so that means the entire C standard library provides no way to pass strings to functions.

Did you read the link? The point was that in idiomatic C, you pass a string to a function by passing a raw pointer to a char-array buffer together with an external length.

Yes I read the link. It makes the unfounded claim that heap allocated strings were never intended to use the NUL character as the means of determining length. Get a 1st edition of K&R and see what fraction of functions in the standard library pass a length with a string.

Get an SVR4 API manual and do the same. After doing this I don't find this argument to be compelling.

In the end, it's about efficiency and redundancy. If a function is going to have to do something with every character of a string (e.g. copy it into a new buffer), and you don't have very many registers reserved in your calling convention (especially true for e.g. system calls), then it's silly to pass an explicit length if you know you've already NUL-terminated the string in its buffer.

All Unix system-call wrapper functions were specified this way, because it is guaranteed that they will need to move the string data from a user-space data structure to a kernel-space data structure (copying it wholesale in the process) before manipulating it. The NUL-terminated string can, in this case, be thought of as an encoded wire format for passing a string between the userland process and the kernel.

And because the Unix system-call wrappers--the most visible and useful parts of the C stdlib--followed this convention, C programmers generalized it into a convention for passing strings in general, even between functions in the same memory-space that can afford whatever registers they like.

If you're not going to iterate over everything in the passed buffer, though, C intended you to do things more like sprintf: take a destination buffer, and explicit size; write to the destination buffer, without scanning for a NUL first as a place to begin, and without appending a NUL afterward; and return an explicit size in bytes written.


I call it my billion-dollar mistake. It was the invention of the null reference in 1965

I think people would prefer that openssl seg-faulted than silently allow theft of private data.

I.e. the array-decay-to-pointer is a more expensive problem.

It would help, but it wouldn't get rid of the errors.

It would convert corruption/security bugs into other, much safer runtime errors.

To get rid of the runtime error altogether, we need a stronger type system or a different form of static analysis, as mentioned here.

Also, the extra bounds checking might be somewhat expensive in some scenarios, especially as the fat pointers would consume many more cache lines than thin pointers.

Cyclone [1] did this, among many other things. They actually got C provably memory-safe, though they sacrificed a fair amount of performance doing so.

[1]: http://cyclone.thelanguage.org/

Having grown with safe languages back when C was a UNIX only thing I always consider that FUD from C people.

Back then, C optimizers were not better than Object Pascal/Modula-2 ones and similar languages. It was all down to profiling and switching off bounds checking in specific code sections.

Of course, with time compilers got better, while at the same time safer languages lost to UNIX widespread into the industry creating this idea in younger generations only C can have fast code.

Your analysis of the issue of not being able to pass arrays as arguments is compelling, but I'm not so sure about "All it needs is a little new syntax" as a solution: you say how to keep backwards compatibility with existing C standards you need a #if cpp-directive: that's already a code health and maintenance problem that isn't going to go away.

Regehr's suggestion of using assertions hidden in code avoids that: isn't it better? You could have just a comment directive like

     @ arraycast char* a \represents char a[..]
and then you wouldn't need any cpp-directives.

I've always been curious why C was designed this way, was it thought to offer any advantages or just done to make the compiler writer's job easier?

Unix and C were written in an agile fashion. C was a step up from a macro assembler, to be used for writing an operating system. String manipulation is rare in operating systems, certainly in those days (even if you felt a need, you don't have much room to create new strings in if your system and its applications must run in 64 kilobytes of RAM (144 kB for the PDP 11/45; see http://cva.stanford.edu/classes/cs99s/papers/ritchie-thompso...)

The filesystem was about the only place where strings got manipulated and there, filenames were stored in fixed-size (14 bytes) buffers. Since all code could know strings were at most 14 bytes, having a complex type (pointer,length) was deemed a waste of memory and performance there. Everything else followed from that.

It would have made C a profoundly different language. C is what it is; a portable assembler. I believe K&R had a strong sense of minimalism; fancier string s would be seen as something that belonged in a library not the language.

This has its benefits; _anything_ fancier would have embedded a whole set of assumptions about typical string handling usage for which the design had optimised.

That no safer string handling library became popular can be seen as a indictment of C or a tribute to their tasteful restraint.

Static analysis (and other "expensive" techniques) should really be standard on something as critical as OpenSSL.

If I compare the effort we've had to go through building things that are allowed into a banking environment the OpenSSL project is very underfunded. Somehow we have to solve this "tragedy of the commons" problem in open source. OpenSSL must be worth millions of dollars to so many companies, but almost no-one cares about it. It seems very strange that the project is not getting BIG corporate sponsorship, given the value of the data it helps to protect.

The problem is that it's worth a relatively small amount of money to each of a huge number of people. We don't have a good way to pool funds for maintaining open source software in such a situation.

Crowdfunding seems to be the obvious solution. But I asked in a previous Heartbleed discussion whether people would be willing to try it for OpenSSL, and got no response.

Someone else commented that maybe Google will take over OpenSSL, or at least become a major contributor to it. It probably is in their interests to make sure this doesn't happen again -- maybe it's worth a large enough amount of money to them to justify dedicating a couple of engineers to it for a couple of years, which is what it seems to need.

Presumably, the static analysis won't find this bug because the 'out of bounds' memory that heartbleed accesses isn't necessarily out of bounds?

As has been explained elsewhere, OpenSSL has a wrapper around malloc() and free() so that it re-uses allocations. This means that the 64k of buffer used to send the heartbeat response is data that has never been free()d and has previously been written to by the process.

To make the static analysis spot this kind of thing, I'd guess you'd have to mark/label the OpenSSL allocators as being 'like' malloc. Likewise, valgrind would also not spot the faults without extra knowledge.

I presume, from their silence, that competing products also fail to find this bug?

I can confirm that HP Fortify SCA does not find this bug.

(I am the architect for the SCA team.)

There are numerous reasons for this, as Regehr points out. None of these products are "sound". In other words, they aren't guaranteed to find everything. Most of the time, these are tradeoffs to reduce the number of false positives and to speed up the analysis, which come with the risk of false negatives, as is the case here.

Indeed, trading off sensitivity for specificity is always a win when trying to sell static analysis to people.

Considering that any non-trivial piece of software has a fairly large number of bugs, you can dial the sensitivity way down and still find some bugs, and the high specificity looks really impressive "Most of what it labelled as bugs really were!"

Consider the hypothetical situation where your piece of software has 40 bugs in it (but you don't know that).

If you see a report of 10 bugs and 6 of them are real, that looks more impressive than a report of 100 bugs where 20 of them are real, despite the fact that the latter actually found more bugs. In fact there is a non-trivial chance that the first 5 or 10 bugs you look at in the latter will be false-positives.

Obviously if your specificity goes too low, the payoff of inspecting each report starts to become questionable, but I think that to successfully market an analysis took, you need to sacrifice more sensitivity for specificity than merited by a rational analysis of the cost/benefit of detecting the bugs early.

Very interesting article!

A small typo, there's a sentence after the Frama-C code block that ends "[...] and that a subsequent memcpy() of the two regions will return zero" which should have a s/memcpy/memcmp/. Quite obvious, but in an article such as this it's best to be precise of course.

I think this is a good demonstration of the limitations of static analysis - for a language like C, there will always be bugs that can't be found statically.

Unless you're willing to move the burden to the programmer, to prove correctness.

Bugs can always be found statically, but there may always be bugs that evade all the static analysis you happened to do.

Also of note was Leslie Lamport's recent talk on how Amazon uses TLA+ to verify algorithms.

Googled to no avail... link please?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact