C++ at Google: Here Be Dragons (llvm.org)
271 points by ryannielsen on May 23, 2011 | 88 comments



I haven't worked in C (or C++) heavily in about 6 years, since I shut down my prior company and stopped working on Squid or having to look at kernel code. But, these errors are simply beautiful, and make me have vague longings to work on C projects again (I'm sure I'll get over those longings soon).

These are the kinds of mistakes I made all the time when working in C, and the kind of thing that made coding extremely tedious...it feels like magic when the compiler catches them with such clear and concise warnings. For whatever reason, I didn't use lint very much back then, as I guess I always assumed I knew what I was doing and that the compiler would catch mistakes. Having this capability in the compiler is pretty cool and brings C/C++ a small step closer to working in higher level languages, is what I think I'm trying to say here.


> I'm sure I'll get over those longings soon

Your getting over those longings will probably coincide with when you start working with C++ again.


"Having this capability in the compiler is pretty cool and brings C/C++ a small step closer to working in higher level languages"

C++0x is another (much bigger, in my opinion) step that makes C++ a lot easier to program. It's not quite as easy as higher level languages, but far closer than before, and the performance gains over most other languages make it worth using.


I agree. Lately I've been writing substantial amounts of 'higher level' code again, and I find myself writing many checks for the types of variables (as members or arguments), return values, and contents of containers, and for situations where one might make inadvertent conversions. I'm thinking that much of the time I spend on that would cost me less if I could just specify the types in the code and have the compiler/runtime check them for me, like in C++.

Of course writing unit tests helps too, but in C++ the compiler catches these things easily. And I haven't had the desire to use a variant in years, so the advantage of having a 'variable' that can be of any type is quite minimal, imo. The auto keyword in C++0x will make a large portion of the tedious parts of strong typing in C++ go away, too.

IMO things like type hinting in PHP are absolutely steps in the right direction. There is room for both 'scripting' and 'compiled' languages, but good support for indicating and checking the expected or required types in scripting languages helps tremendously in proactively validating programs.


To me, the most remarkable thing about this post is that when the rest of the world is falling in love with the "power" of weak typing systems, Google is going the other way.


I think this is a very web-development-centric view on things. In fact, I dare suggest that the world does not revolve around JS, and the bulk of development, especially in-house, is done in fairly pedestrian languages (C++, Java, C#). Web developers tend to have more web presence, naturally, so the majority of public discussions and news may make it seem to be representative of the world.


the rest of the world is falling in love with the "power" of weak typing systems

Really? I'd argue the exact opposite -- with F# and Scala (and some others) on the rise, I think there are plenty of people who are fed up with weak/dynamic typing and want to take advantage of building programs with strong type systems.

At the very least, it seems that the programming world is becoming much more polarized. Anecdotally, it seems that every developer I know is either a proponent of strong/static typing or weak/dynamic typing -- but I don't know anyone that's just sitting on the fence.


I'm certainly going the other way. After working almost exclusively in dynamic languages for almost 10 years I'm finding it enormously refreshing to have the help of the compiler again when I'm writing Scala code. Even the primitive type system of Objective-C is a welcome change. Of course, occasionally you need an escape hatch. But, in my experiences, the good static languages give you enough wiggle room to do this when you really need to.


The term "strong" appears overloaded here. A language can be dynamically typed, but still have strong typing (in the sense that the language, whether at compile time or runtime, enforces what operations are allowed on a particular value).


This is true. Lisp compiles beautifully if you put the work into making it do so, and this work can be automated with macros.


Those in favor of only static typing and those in favor of only dynamic typing are living in the past. Any interesting evolution in programming languages will allow developers to program with and without types / contracts from the same language.


In Mascara (my own project) you can start out with dynamically typed JavaScript, but then gradually add type annotations as appropriate to provide stronger verification. This can be done using structural types without changing the runtime semantics, and taken further by rewriting to use classes and nominal typing.

I think this provides a useful upgrade path, because type verification is a hassle with small projects, but becomes increasingly valuable (IMHO) as the program grows.

The problem with current languages is that you have to decide upfront if you want a language optimized for quick development or strong verification. But often in the real world, programs start out as quick prototypes, and then grow into large applications.


C#'s new "dynamic" type and Scala's new Dynamic trait are a step in that direction.


Qi is a lisp-like language with Haskell-style type-checking which can be turned on and off at will:

http://en.wikipedia.org/wiki/Qi_(programming_language)#Type_...


I'm a fence-sitter, but as you note, it's unlikely you know me. I do find that I'm starting to lean more on the strong/static side lately, but that might be because professionally I write financial institution software which tends to be big, and I want as much checked as early and as thoroughly as possible.

For fun hobby stuff, I love dynamic languages where I can produce quite a bit of functionality in a very short (if dangerous) time and rely on a set of tests to keep it all managed.


After a few years of fence-sitting doing a lot of ruby, I find myself... still sitting on the fence. My next project will probably be in erlang and haskell, for what it's worth.


I hear you, but I think the rest of the world is actually polarizing to some degree. Perhaps I'm just fickle or a language whore, but I love both the dynamism and freedom of Ruby (et al.) and the "protect me from doing stupid $#@!" of Scala (et al.). Although honestly, I don't know that I'd want a language that supported both, even if such a thing were possible.

I do see a lot of people getting all hung up on one side or the other though; and it could be that the rest of the world you are seeing is just that group "over there".

Perhaps predictably, as I get older (in my 40's now), I am leaning more towards the strong type system languages, but I do enjoy my dalliances with a dynamic language, at least for small things.


Why can't we have a language that supports both? I would love to have a language with a scalable type system where I could hack out a quick small prototype with dynamic typing first, and then perhaps layer on static types later where necessary to improve reliability and performance as the program evolves. For example, I would like to be able to declare a variable in all of the following different ways.

a value

a number

a rational number

a rational number between [0.0 - 1.0)

a rational number between [0.0 - 1.0), and I don't need any more than 30 bits of precision

a rational number between [0.0 - 1.0), and I don't need any more than 30 bits of precision, and please try to optimize for minimum memory usage instead of performance
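
(Purely as an illustration of the last couple of items, here is a minimal C++ sketch, with made-up names like UnitInterval, of a runtime-checked "rational number in [0.0 - 1.0) in about 30 bits, optimized for memory". The point of the wish, of course, is that the compiler would derive and enforce all of this for you.)

  // Purely illustrative: a value constrained to [0.0, 1.0), stored in 30 bits
  // of fixed point to keep memory use minimal. The range check happens at
  // runtime; the wished-for language would let the compiler enforce it.
  #include <cstdint>
  #include <stdexcept>

  class UnitInterval {
   public:
    explicit UnitInterval(double v) {
      if (v < 0.0 || v >= 1.0)
        throw std::out_of_range("UnitInterval: value not in [0.0, 1.0)");
      bits_ = static_cast<std::uint32_t>(v * (1u << 30));  // ~30 bits of precision
    }
    double value() const { return bits_ / static_cast<double>(1u << 30); }

   private:
    std::uint32_t bits_;  // 4 bytes; memory-minimal representation
  };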


You can support both. You should take look at Typed Racket, Racket w/ Contracts, Erlang Dialyzer, and Qi/Shen. Frankly the Lisp contingent is way ahead on this kind of thinking (would of course love to hear about others).

I'm interested in developing such systems for Clojure and have been actively researching something considerably more powerful than Chambers/Chen predicate dispatch.


Qi/Shen is a fusion of Haskell and Lisp, heavily inspired by Haskell. The Lisp contingent is not 'way ahead': they are simply assimilating useful bits of their strongly typed functional cousin languages.


> a rational number

> a rational number between [0.0 - 1.0)

Those are representation errors.

The hard problems that I run into rarely have much to do with representation. The hard problems are knowing when it's okay to add apples and oranges and when it isn't.


I'm not experienced enough to say that you absolutely can, or cannot. I only have some experience to guide me, and I haven't seen it done yet.

That said..., I have an uneasy feeling that trying to do it would end up with a PL/1 type of situation where, depending on your background, you write it with whatever baggage you bring with you, and the language would get large to the point of different types of programmers writing in their own familiar subset of the language.

Or, I'm completely wrong. =)


You might be interested in the type system of Perl 6:

    subset Filename of Str where { $_ ~~ :f };
Here you define the Filename type as a string that represents an existing path. This particular example is a horrendous hack of course, but the tool looks very interesting. Not sure if this can be used for performance tuning in existing P6 implementations.


That's a horrible hack indeed :)

Perl 6 does have 'gradual typing' though. You can define a variant:

  my $foo;
or a typed variable:

  my Int $foo;
or a subtype with arbitrary restrictions

  subset OneToTen of Int where { 1 <= $^n <= 10 };
  my OneToTen $foo;
Or... (http://rosettacode.org/wiki/Define_a_primitive_data_type#Per...)

  subset Prime of Int where { $^n > 1 and $^n %% none 2 .. sqrt $^n };
Perl 6 scares me.


It's been a long time, but I believe Ada provides the ability for the first 4 and I would bet you could define the last two in it.


Hrm? That blog post is all about Google using Clang to dig themselves out of a hole they fell into by using C++ with its weak type system, and it doesn't suggest they have plans to change languages.


I think the Go* language is proof they are trying to solve some of C++'s problems without giving up speed.

It would be an amazing research project to take a couple of different, large Google C++ programs and port them to Haskell and Erlang to see how they compare.

* Why does it have to be such a common word for its name? It just means its name, for all practical purposes, is "Go Language".


C++'s typing is fairly weak. Most of the errors shown in the article were weaknesses of the type system. Now Haskell, or even Java...


When will they enhance it to flag the other error in this line:

    long kMaxDiskSpace = 10 << 30; // Thirty gigs ought to be enough for anybody.

10 << 30 is ten gigs, not thirty gigs.


Doh! Good catch, comment updated. =[ Maybe we do need Clang-for-comments as well as Clang-for-C++ code.... ;]


And the error says it's an int, but it's declared long. Am I missing something about long in C++ not being 64 bits?


That's the whole point. =] This is a surprising aspect of C++: the shift expression doesn't have the type of the declared variable.

The integer literals we are shifting are of 'int' type, and the shift occurs at that type (based on the usual arithmetic conversions). There is a Stack Overflow question with explanations and a good blog post about it here:

http://stackoverflow.com/questions/836544/usual-arithmetic-c...

http://blogs.msdn.com/b/oldnewthing/archive/2004/03/10/87247...

Also, you can look through the C++98 standard to understand all the details. Relevant sections are [expr]p9 and [expr.shift].
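
Given that rule, a minimal sketch of one way to avoid the truncation is to make the left operand of the shift 64-bit, so the shift itself happens at a 64-bit type (this is just one possible spelling):

  #include <cstdint>

  // The literals are of type 'int', so 10 << 30 is computed (and overflows) as
  // an int before it is ever converted to the declared type. Shifting a 64-bit
  // value instead sidesteps that.
  const std::int64_t kMaxDiskSpace = static_cast<std::int64_t>(10) << 30;  // ten gigs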


The left operand is an int, so the result of the shift expression is an int. The fact that the shift expression is used to initialize a long is not relevant.


long is 32-bit on x86_32 Linux, 64-bit on x86_64 Linux, and 32-bit on x86_64 Windows.


In that case you should check out this paper: "iComment: bugs or bad comments" (http://portal.acm.org/citation.cfm?id=1294276; sorry, couldn't find an open PDF)


I wonder if this is an indication that google is moving to clang for compiling (and not just diagnostic tools). If that's true, maybe this is another nail in the coffin for gcc? I see apple and google behind llvm/clang, who's behind gcc? Nobody?


Google has said before that they didn't use Clang and LLVM because of performance issues. GCC is far from dead, and probably never will be. Clang-generated code is still typically 10-20% slower than GCC's. Lots of companies work on GCC, including Google, Intel, AMD, IBM, Red Hat and others.


I've seen this "10-20% slower" meme several times, but I've also read several accounts of 10-20% faster runtime performance.

All I know for sure is I compiled my codebase with Clang for the first time yesterday and the compilation time was absurdly short. I thought the compiler was broken. And it enables excellent tools like clang_complete for Vim code completion.


It depends on the code - while GCC, being more mature, is typically better at most optimisations, there are a few cases where Clang produces code that is a fair bit faster.

Over time, as Clang matures, it will become more and more on par with (or better than) GCC.


Woah, let's not conflate compilation-speed with runtime-speed...


Read carefully and you'll see I was not.


Maybe, but your post was ambiguous with respect to its implications. Defensive writing would have had you state more carefully what you meant.


The parent was not intimating what you inferred.


Apple's been pushing developers toward LLVM for several years, and is bound to make it an app store requirement at some point.

I've observed no such 10-20% slowdown on iOS, which is more CPU-constrained than most realms. Typically, LLVM-generated code is equivalent to or faster than gcc. Sometimes it's much faster.


Just a guess: couldn't it be an x86(-64) vs ARM thing? Like GCC being better on x86, and LLVM on par with it on ARM? Or maybe they tuned the compiler for the iDevices?


People forget that AOT can only get you so far. A JIT not only has all the info the AOT mechanism has, but also has real runtime data to base decisions on, and can do wonders with it.


LLVM isn't really a JIT, and I've never heard of people using it as one for C++. (Also, those that tried to use it as a JIT had lots of problems, like Unladen Swallow).


Rubinius is using it successfully though (I think anyways, does the Ruby community have benchmarking infrastructure?)


Yes. My impression is that it's the same kinda-sorta successfully as Unladen Swallow - pretty good, but nothing like LuaJIT2, V8 or SpiderMonkey.


Apparently PyPy still has work to do, we must insert ourselves into this list!


For the sake of argument:

What is an example of an optimization that a JIT compiler can make that an AOT compiler cannot?

If the developer is able to profile the application on typical end-user workloads, don't profile-guided optimizations provide the same benefit as JIT runtime profiling?

Why can't an AOT compiler just consider every path a "hot" path?

Last but not least: Got any benchmarks?


For one: a JIT can do polymorphic inline caching (you can read more about it from Google's senior vice president of operations, Urs Hölzle[1]), while AOT can't.

Wikipedia gives a few more[2]: runtime profile-guided optimizations and pseudo-constant propagation

[1] http://research.google.com/pubs/author79.html

[2] http://en.wikipedia.org/wiki/AOT_compiler


The Polymorphic Inline Caching paper refers to AOT compiling with runtime hints.

In the case of non-dynamic languages like C and C++ that clang generally targets, are there other examples of where JIT would make things possible that are not possible in AOT?


Profile guided optimizations that are relevant for the specific invocation of the program. Loop optimization based on invocation parameters for that specific run of the program. Hard-coding in the jump target address for calling functions from dynamically loaded libraries (can't do that AOT, because if the library is replaced, the symbol offsets change).

Optimizing for the specific processor you're running on, as opposed to being forced to compile for a lowest common denominator.

A whole bunch of other small things like that.


One nice thing JITs can do that AOT compilers can't is on-stack replacement. That's where you recompile a particular function at run-time based on new information. This allows you to do speculative optimizations.

For example, you might see that branch X is always taken. So you assume that X will always be true, and add a guard, just in case, which triggers a recompilation. You reoptimize the function on the basis of your new (speculative) information about X. This could improve register allocation, allow you to remove lots of code (other branches, maybe), inline functions, etc.

Java JITs have been known to inline hundreds of functions deep with this.
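
A rough hand-written analogy in C++ (function names made up; a real JIT does this automatically and triggers an actual recompilation rather than a fallback call):

  int general_version(int x) {       // the original, fully general code
    if (x > 0) return x * 2;         // branch X, observed to always be taken
    return -x * 3;                   // cold branch
  }

  int speculatively_optimized(int x) {
    if (x <= 0)                      // guard: the assumption about X failed,
      return general_version(x);     // so "deoptimize" back to the general code
    return x * 2;                    // hot path with the cold branch removed
  }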


A simple example: let program P do a zillion <something> * <command line argument> multiplications, and call the program every hour with argument value zero or one, depending on a coin flip. An AOT compiler would not even know that the program will never be called with other arguments. A JIT compiler could remove all multiplications.

Profile-guided optimizations only work on the next run, and, when used by the developer, do not work for cases where there are widely different usage profiles for a single program. For example, most users would have data sets that fit in memory, but others will have ones that do not.
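
In code, that example might look roughly like this (a sketch with invented names): an AOT compiler has to keep the multiplication, while a JIT that observes 'factor' is always 0 or 1 on this particular run could specialize it away entirely.

  // 'factor' comes from the command line but is, in practice, always 0 or 1.
  // An AOT compiler cannot know that; a JIT watching the actual run could emit
  // a version with the multiplication removed (a constant 0, or a plain sum).
  long scaled_sum(const int* data, int n, int factor) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
      sum += static_cast<long>(data[i]) * factor;
    return sum;
  }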


Wouldn't you get a code explosion and difficulties dealing with cache coherency if every path was a hot path (serious question, I don't know much about this stuff)?


GCC has enabled Apple and Google far more than the other way around. Clang and LLVM sound like great projects, but I hardly think GCC is heading for the grave anytime soon.

http://gcc.gnu.org/releases.html#timeline http://www.gnu.org/philosophy/pragmatic.html (see part about Objective C)


who's behind gcc? Nobody?

I hear there's this kernel called Linux that depends heavily on GCC.


Clang has been able to build Linux since October of last year.

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2010-October/0117...


Clang has support for a lot of gcc extensions: http://clang.llvm.org/docs/LanguageExtensions.html


And Linus's feelings about that relationship are...?


Pragmatic, I would guess.

"Quite frankly, I'd like there to be more competition in the open source compiler game, and that might cause some upheavals, but on the whole, gcc actually does a pretty damn good job."

http://kerneltrap.org/Linux/Volatile_Performance


But why would the kernel depend on gcc? Are there so many gcc-isms in there that would be hard to replicate on other compilers?


The biggest one is (was?) the use of GNU variable-length arrays. The GNU extension has different syntax from C99's variable-length arrays. There are also instances of __attribute__ on platform-specific code.


CLint, meet CLang; we're all the better for it. Although if I read correctly between the lines, there might be a little trouble getting the engineers to buy in :) Ever met anyone who could actually make it all the way through CLint with all warnings on? Enough to drive you crazy!


Well written! The simple bugs that it seems to detect are fairly high-frequency, so it should improve overall code quality. Looking forward to playing around with this in my spare time.


I haven't even done any C++ for a while, but reading those other articles on HN about undefined behavior in C made the example bugs in this article really jump out at me.


The article implies to me that the third bug (passing 0.5 to sleep() ) is not caught by gcc. Does anyone know if this is the case? It doesn't seem excessively hard to produce a warning about shortening like that - the first two seem more subtle, but that one less so. I don't have gcc on this machine to check it, but VC++ certainly does emit a warning for that kind of thing.
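
For reference, the bug in question is an implicit floating-point-to-integer truncation. A minimal reproduction (using POSIX sleep() for illustration; the article's exact code may differ):

  #include <unistd.h>

  void pause_briefly() {
    // sleep() takes an unsigned int number of seconds; 0.5 is silently
    // truncated to 0, so this call does not sleep at all.
    sleep(0.5);
  }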


GCC definitely has a warning for 0.5 -> int (likely -Wconversion, but I've not checked). It also has a warning for setting a pointer to "false" (-Wconversion-null). However, turning that warning on in a codebase where every warning breaks the build was challenging because of false positives. We're able to remove false positives and narrow the scope of the warning to just the buggy code in many cases with Clang, and that allows us to turn these warnings on much more aggressively.


Isn't the real bug in the example where the bool was assigned to the pointer that a pointer was used where it should have been a reference?


Google's C++ style guide discourages non-const reference parameters; output parameters have to be pointers.


When is assigning a pointer to be boolean false intentional and correct?


Most of the cases we ran into were metaprogramming techniques which test whether an expression is a valid null pointer constant. These got innocently applied to 'false' and triggered the warning needlessly.
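
A simplified, hypothetical sketch of the kind of trait being described. Under C++98/03 rules, 'false' is a valid null pointer constant, so applying the test to it performs exactly the bool-to-pointer conversion the warning looks for, even though the code is perfectly intentional:

  typedef char yes_type;
  struct no_type { char pad[2]; };

  yes_type is_null_pointer_constant(int*);  // chosen for 0, NULL, and (in C++98) 'false'
  no_type  is_null_pointer_constant(...);   // chosen for everything else

  // Evaluated at compile time via sizeof; nothing is actually called.
  enum { false_is_npc = sizeof(is_null_pointer_constant(false)) == sizeof(yes_type) };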


Done that before...had a variable that was formerly an int:

  int foo = 0;
But got changed into a non-integral type:

  Thing foo = 0;
Turns out Thing had a converting constructor roughly like:

  Thing::Thing(Thing* old)      // 'Thing foo = 0;' resolves to this constructor,
      : field(old->field)       // since 0 is a null pointer constant, so 'old'
  {                             // is NULL here and this dereferences null
  }
Needless to say, my program was not very happy.


I think the -Wconversion-null warnings are the other way around:

    int i = NULL;
Because of C++'s conflation of NULL and 0 (and now C++0x's nullptr).
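
A small illustration, compiled as C++98/03 (where both initializations are accepted; the first is the one -Wconversion-null complains about):

  #include <cstddef>

  int  i = NULL;    // accepted: NULL is typically just 0, but this is almost surely a bug
  int* p = false;   // also accepted pre-C++11: 'false' is a null pointer constant there
  // C++0x's nullptr avoids the conflation: it converts only to pointer types,
  // so 'int i = nullptr;' is rejected outright.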


Eclipse highlights these kinds of bugs in java code. It saves me a lot of time.


I wonder how their checks compare to Coverity's and QAC++'s.

I have a passing acquaintance with both, and I'm almost certain both would have caught the three bugs listed on that page.


I would expect many of these tools to catch these types of bugs. The challenging thing for us has been to catch only bugs, and to catch them very fast during normal compilation.

A lot of the static analyses we've looked into (and I'm hoping for more detailed blog posts about that in the future) find plenty of bugs, but also find lots of non-bugs. Combine that with being too slow to run during the normal build, and you can't break the build when such a bug is found.

I think one of the most interesting aspects of this is how we catch the bugs early, and force developers to fix them immediately by breaking the build.


These sorts of articles (and the attendant comments about false positives) always scream out for Ada to me. It's a language designed by a calm, careful thinker back in the '80s for life-critical programs. It has everything Java and C++ have, minus the vast number of undefined states, and it's designed for static analysis. By 'designed' I mean there are formal verifiers, and the NSA has used it in a test security system.

Plus the compiled code is pretty fast. So if you're feeling the need to reduce your workload take a look at it, you might be surprised.


What kinds of libraries are available for Ada? Half the reason I use C++ is that half the code I need to write is already available in mature libraries.


Library support is aimed squarely at realtime life critical systems. You're more likely to find a library with some sort of safety certification than not. If you're expecting to use the latest web libraries or hadoop you'll be disappointed.

However, there is a small collection of oss Ada libraries out there.


I seem to remember there was a pretty interesting web framework written in Ada, Adaweb I think?

Anyway, Ada has some other amazingly cool features. The concurrency primitives it offers are very cool, letting you make much stronger guarantees about the interactions between threads than any other language I've seen. For example, you can define rendezvous sections, which, if memory serves, are pieces of code that are guaranteed to run only once both threads participating in the rendezvous have arrived, and neither thread can leave the section until both are ready.


There's Ada Web Server (AWS), which is neat. Similar idea to the Java web kits like Jetty. You can even hotplug code during runtime.

You're right about the concurrency. Ada has a bunch of stuff like that built into the language since 1983.

The particularly cool toys I like are SPARK (a formal verifier tool) and stackcheck, which tells you exactly how deep in the stack your code can possibly go. (Yes, you have to annotate cycles.)


Admittedly I only programmed in Ada for a few months, but I found it to be an unbearably tedious language. The verbosity is monstrous; the type system is inexpressive; it feels primitive. Nothing is inferred, everything is repeated. Ugh, I would rather code exclusively in C++ templates than touch Ada again.


I will freely admit that Ada is not appropriate for all uses. But for the sort of thing you'd want to use Ada for (life-critical systems, critical infrastructure, control systems, etc.), the explicit type system, explicit declarations, and strict static typing are ideal. I particularly like the ability to declare, down to the bit level, how my data is stored. Very useful for working with low-level hardware.

Take a look again, with an eye towards large scale long lived critical systems. You'll find that explicitness a feature, as is the very strict static typing.

Oh, and they got pointers right the first time.

EDIT: Forgot to mention the coolness of the type system - you can declare ranges and other type information, and the compiler will hold that requirement strictly. For example:

  type direction is range 0..359;

declares the obvious, but now you can have the compiler error out if you assign a direction a value that might be outside that range. Even if you don't set that option, you can always use the foo'valid attribute to check that foo is within bounds. This is used to check for stray cosmic radiation (seriously!).

Yes it seems like overkill. But when you come back to 20 year old code that's still running you'll smile.


I would also add that Ada compilers do generate some surprisingly efficient executables. This helps quite a lot in the embedded space.


Static analysis is a great tool - Clang's isn't too bad - but it isn't the right one to drop into a normal workflow.

We still develop using GCC - for our use case Clang performance isn't there yet - but have our continuous integration system perform Clang builds (as well as other platform/compiler variants, with unit+regression tests, etc.) so we miss out on the immediate build breakage that you mention, but do find out within ~30 mins if someone slipped up.



