Hacker News new | past | comments | ask | show | jobs | submit login
Accidentally quadratic: When Python is faster than C++ (arxiv.org)
218 points by mehrdadn 6 months ago | hide | past | favorite | 213 comments

If you're wondering whether this is a theoretical or practical problem: I actually observed some of this effect in practice first, and only after thinking about it for a while did the larger issue (and the complexity implications) dawn on me.

I had something like a set<string> or a set<set<string>> (or map... I can't remember which) somewhere in my program a few years ago, and I was trying to improve the program's performance. I tried breaking into it several times and found it quite perplexing that the bottleneck appeared to be the set<> container. I mean, I get that cache locality and all has an effect, but it seemed to be having a little too much of an effect. Why did I find this odd? Because other languages (like C#) have tree-based sets and maps too, but I'd never felt they were quite as slow as I was seeing them in C++. So I felt something weird must be going on.

I tried to step through the program for a while and observe what's going on, and at some point, I realized (perhaps I hit the same breakpoint twice? I don't recall) that these functions were being called more often than I intuitively thought would be necessary. Which was obvious in hindsight, as less() needs to be called multiple times on the same objects (4 times at level 2). Now, this hadn't resulted in quadratic behavior, but that was only because my data structures weren't arbitrarily deep—the slowdown was nevertheless a significant constant factor at the core of my program, only linear because the data structure's depth was bounded.

So once I had realized the implications of this, including that constant-factor differences can actually turn into polynomial ones, I eventually decided to post an article about it on arXiv... hence this article. Aside from the complexity issue illustrated in the article, one of my (unexpected) higher-level takeaways was that you can't just take any utility function in a program and blindly put it into a library: even if it's correct, you probably have hidden assumptions about its use cases that won't hold true in general. You really need to think about how it could be used in ways you didn't initially expect, and this may well require a different kind of code review with an entirely different mindset than before. It really drove home the point for me that writing a library is quite a bit different (and in some ways more difficult) than writing an application.

It's possible there is more theory lying underneath this that I haven't touched on—it would be interesting if someone can take this and find more interesting consequences of it. For example, maybe when analyzing algorithms we need to consider something akin to a "recursive time complexity", too? Is there another useful notion of complexity that we can generalize here?

Anyway, hope people enjoy reading it and take something useful away from it. :)

By the way for the hashing example in count() - Rust has the Entry API for HashMaps to avoid exactly this problem: https://doc.rust-lang.org/std/collections/struct.HashMap.htm...

The PartialOrd trait also uses 3-way comparisons so I think the other issue is mitigated too, but it'd be interesting to check: https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html#tyme...

I'm failing to understand some things, maybe because I glanced through the paper with morning coffee. Apology if my tone ends up a little bit harsh, but these are just constructive criticism :)

The first one is lt2 and lt3 implementations. You are implementing cmp2 through lt2, but lt3 through cmp3 (is this omission?). Both of them are stack-sensitive. Without being too harsh, I'm getting the impression that the intention was to write the most horrible comparison possible, which is different than worst-case time complexity.

In the paper, lt2 (actually cmp2 in the paper) will always be at least two passes, and lt3 is at least one pass. I would not say they are two/single pass algorithms because complexity increases depending on list depth when lists are involved.

Maybe I'm wrong, but both Python and C++ comparison operators are designed to be general-purpose comparison functions (and I'm more sure about C++, because this was touted through hundreds of books). As such, they should be good enough for most average cases. If you want speed, you go with balanced trees or something funkier.

Also, for C++ Tree implementation, you are again using probably the worst approach - appending to vector recursively. Use list for this. Python list implements all sorts of tricks, comparing to C++ vector.

And the last thing, but not the least: C++ containers depend on implementation (gcc libstdc++, stlport, msvc whatever), and I've seen substantial speed differences in standard operations. Hell, my old (almost conforming) list implementation was much faster than libstdc++ implementation because it wasn't trying to be too clever with slices and other magic.

I'm sad you haven't used a more scientific approach with much more rigor here: what C++ compiler was used, what version, what assembly output was produced, on what processor, after how many runs, etc... Claiming "Python is faster than C++" sounds like a clickbait title.

I'm not trying to write the most horrible comparison at all. Perhaps the most important thing to keep in mind here is that this is a general computer science paper, and my comparison of C++ and Python is just intended to serve as a familiar (and vivid) illustrative example of the general phenomenon I'm trying to describe. The paper is emphatically not intended to be a "Python vs. C++" paper. Everything you see there that is "concrete" (the language, the running times, etc.) is intended to be a mere illustration of the overarching concept (design decisions & their consequences) being discussed, and it could manifest itself in any language.

The context to keep in mind when reading the paper is: When designing a programming language & its standard library (or any API), we need to define an interface we can use as a building block, and we're analyzing the consequences of our choice of building blocks. In particular, we first examine the case of comparison-based data structures, which requires defining ordering primitives. In C++, the primitive is the < operator. In Python 2, it's cmp(). (In Python 3, it's a mix of < and ==, whose implications I discuss as well.) We assume user-provided types implement that basic interface, and we implement everything else we need on top of that.

So the question I'm analyzing in that example is: What happens if my primitive comparison operation is a 2-way comparison (lt(), like in C++) and then I implement 3-way comparison in terms of that (such as when I need it for for a binary search tree)? Now, what if we do the opposite: what happens if instead my primitive comparison operation is a 3-way comparison (cmp(), like in Python 2) and I only need to perform a 2-way comparison later? What are the trade-offs?

To do this, I take both approaches, implementing each in terms of the other, and compare how they behave complexity-wise. The conclusion I come to is that the choice of the primitive (which is often made by the language designers) isn't merely a question of aesthetics, but rather, it actually affects the time complexity of common operations (like traversing search trees). Similarly, the decision to cache a hash code doesn't just result in a constant-factor improvement, but it can actually change the time complexity of a program. And so on.

I think if you re-read the paper with these in mind, it should hopefully make more sense. The rest of what you said doesn't enter the picture at all... these are already balanced binary trees, the decision to use less<> is fundamentally independent of what C++ stdlib implementation you use, and the time of the vector concatenation isn't even being measured. Those things are unrelated to the point of the paper entirely. I was just trying to minimize the extraneous lines of code so we can focus on the heart of the problem instead of getting distracted by boilerplate.

Thanks for the interesting case. I guess many readers have read too much into lt2 and lt3 but overlooked the code under 2.4.2: C++ could actually be that bad. In that code, you only showed the timing in comments. The result may be worth a table. Perhaps another way to structure the manuscript is to give the surprising result of 2.4.2 first and then to explain what makes that happen.

Yeah—I think I laid out the paper more like a story, but I might indeed need to change that as it appears it leaves people confused before they get to the punchline. Thanks for the feedback! Glad you found it interesting.

1. People often use set instead of unordered_set (and same for map) despite not needing order. This slows things down.

2. The C++ standard library's maps and sets are known to be rather slow. See, for example:


when you have string values, it's even worse, as you describe. But it's not clear that an overly-clever implementation, which caches numeric ranks of strings etc., is a good idea to have.

Ordering is not the only concern here. std::set actually provides a logarithmic worst-case guarantee, whereas std::unordered_set does not. This is a factor to consider depending on the application, regardless of whether ordering is necessary. Whichever one prefers in any case, though, is beside my point—I'm merely trying to use trees and hashtables to illustrate a far more general CS phenomenon that can occur in lots of data structures and languages.

If performance is a concern then you should still avoid std::set though by default. Logarithmic worst case when it's just always slow isn't really useful.

There may be a benchmark out there where std::set can beat std::unordered_set, but you'll be really hard pressed to find it

Sure, but this isn't a benchmarking paper.

Imho "but technically..." is not a valid opinion while in practice the access is O(1) on average. Yea sure, it becomes linear if your hash is "return 42;", it can't grantee that you supplied a good hasher.

It's not technically. Worst-case guarantee is required for some applications. Hash Maps have better amortized complexity but some implementations have bad worst case time.

Java uses balanced trees instead of linked list in their chained-hashtable implementation, if I recall correctly.

There are lots of reasons to prefer trees (and, correspondingly, lots of reasons to prefer hashtables); I just pointed out ordering isn't the only one, and I merely gave another (worst-case performance guarantee). For example, yet another one is iterator stability; insertion can invalidate iterators for unordered containers, making them useless for some applications that need to analyze data efficiently as they insert.

I could go on, but it's very much detracting from the point of the paper to argue about the data structure choice when the paper isn't on this topic at all or trying to analyze any particular application to begin with.

Technically is a very valid opinion if you're writing software that needs to be robust and secure in the face of user input. std::set is guaranteed to be logarithmic strictly in the number of comparisons in the worst case without exception.

std::unordered_set has some very difficult to reason about worst case scenarios that must factor in both the size of the collection and the hash function. Unless you are very careful about the type of hashing you use, you can easily be vulnerable to DDOS attacks that scale on the order of O(k * n) for key size k and container size n.

Note that this isn't just some user-supplied bad hashers. Hash for std::string in GNU library is not much harder to hack for collisions than your example. And also it uses the whole string, so is slow on long strings. I didn't investigate if other implementations of STL are any better.

> People often use set instead of unordered_set (and same for map) despite not needing order. This slows things down.

Aren't unordered_set and unordered_map quite new (IIRC, they came only with C++0x)? For most of C++'s history, if you preferred to use the standard library, what you had was only ordered sets and maps.

Depends what’s your definition of “quite new”. That’s over 10 years go now :)

They've been in the language for about a decade now, and before that we had the tr1 hash_map classes that were available in most environments.

C++: the language that's been around for 40 years and is completely new!

Actually, C++ has barely been around at all. It is only with C++20, IIANM, that they key features Bjarne Stroustrup wanted to imbue it with are in place (IIRC; see his "Design & Evolution of C++" book from 1994). Some might even argue that the language C++ _should_ be is not even all here; it's still gradually arriving.

By now, this is no longer "quite new". A recent C++ community survey suggests that under 10% of developers currently use C++98/C++03 the most. Naturally this is not a valid sample of the whole userbase, but it's a good indication.

Also, in 2005 IIANM, Google published their dense and sparse hash map implementations, which were fast-ish and quite usable.

They were in TR1 in 2005, if I am not mistaken.

Why is the default set implementation ordered in the first place? The formal data structure is unordered, which probably informs people's assumptions about its performance characteristics. Should it not be "set" and "ordered_set"?

One reason would be because std::set guarantees O(log n) complexity for each operation on the worst case. std::unordered_set has average complexity O(1) and O(size) for the worst case (it being a hash table) which can be unexpected to debug in those rare cases.

Probably because the underlying implementation is a binary search tree which requires a comparison operator. Haskell likewise has Data.Set which requires an Ord instance. There's HashSet which requires a Hashable instance which can be automatically derived for any data type, so in principle you don't need ordering, but I would imagine in C++ you would have to provide your own hashing function from whatever data type you're trying to put in a hash set.

> Why is the default set implementation ordered in the first place?

"Sorted" rather than ordered (an order can be arbitrary and I'd personally associate the word with insertion order).

> The formal data structure is unordered

The formal data structure doesn't have complexity bounds, the STL does.

It's a mis-feature. And since C++ is very committed to backwards compatibility, they didn't change it later on.

If/when `std2::` happens, this is one of the things I assume would change.

cpython is faster than c++ under the following circumstances:

1) when you're using native libraries via python that were written by better c/c++ programmers than you are and you're spending most of your time within them

2) when you're using native libraries in python that are better implementations than the c/c++ libraries you're comparing against

3) when you don't know the libraries you're using in c/c++ (what they're talking about here)

...otherwise, if you're just using doing basic control flow optimizing compiler c/c++ will almost always be faster. (unless you're using numba or pypy or something).

point stands about the constants though. yes, asymptotic analysis will tell you about how an algorithm's performance scales, but if you're just looking for raw performance, especially if you have "little data", you really have to measure as implementation details of the software and hardware start to have a greater effect. (often times the growth of execution time isn't even smooth with the growth of n for small n)

I think a key part here is also being realistic about the time available to write and optimize your program. I’ve seen Python completely crush C++ a fair number of times (order of magnitude or better) and it basically came down to the C++ programmer having bitten off more than they could chew, spending so much time on the basics that they never got around to optimizing algorithms or the data layout. (One variation: Python hashlib > whatever C textbook example you copy-and-pasted because you thought calling OpenSSL was too much work)

This is frustrating for programmers because everyone wants to focus on the cool part of a program and forgets how much the rest takes to write, debug, etc. There are many reasons why I prefer Rust but one of the biggest is simply that having a standard package manager means that you can get high-quality code as easily as in languages like Python or JavaScript and are more likely to avoid that pitfall of reinventing wheels because it initially seems easier than finding a library.

yeah, the c/c++ ecosystem never really had the benefit of a internet connected curated library community. afaik the first big example of that was perl in the late '90s. CPAN was awesome: here's this big library of awesome libraries that have been curated with full tests and documentation that you can search and add to your system with a few easy cli invocations. (for the uninitiated, this was npm or pip for perl in the 90s- complete with dedicate wikipedian level pedants gatekeeping/curating)

moreover, batteries included scripting languages like perl, python, matlab, etc all tend to have the benefit of having their core bits be best of breed. perl has/had one of the best re engines out there, matlab has a great blas all compiled with the best optimizing compiler and ready to go, python was more generic i suppose, with fairly strong libraries for most things (strings, io, network io, etc).

other than the microsoft nuget stuff, the c/c++ ecosystem never really had the benefit of anything like that other than boost which was pretty tough to pull into a given project and didn't really have the community of people writing high level libraries like the scripting languages did. that said, i often used to think it would have been interesting to build a language agnostic platform for language centered library communities. (a sort of generic cpan/pip/npm in a box for pulling down libraries and running tests for any language- a combination of build system, online community platform and search engine)

but the real moral of the story: use the libraries, luke/lucia! also, know them!

CPAN definitely deserves more attention, especially for the emphasis on testing which many successors still don't have. That was even more necessary back when OS consistency was worse but it really should have been seen as a first-class feature.

I think C/C++ also had this problem with the whole cultural evolution around shared libraries. Because installs were non-trivial I think there was an almost subconscious blinder effect where people restricted themselves to what shipped with the OS / distribution even if that meant keeping a workaround for a bug which had been fixed upstream years before because that was seen as better than static linking or bundling a newer version.

the pedantry around design and testing in CPAN is where i learned how to be a serious software engineer. perl was special in that it bridged the divide between software engineering and system administration. it was software engineering with the pedantic nature of high strung sysadmins who insisted the garden was in perfect shape at all times.

Historical trivia: CTAN (Comprehensive TeX Archive Network) pre-dated CPAN (.. Perl ..) by about 1..3 years.

[1] https://en.wikipedia.org/wiki/CPAN

We should add to note to 1, "and you aren't making lots if short calls into said library". If you are, the ffi ends up costing more than the savings.

4) string concatenation

Probably (1) above, no?

NULL terminated strings are quite bad for performance, because you need to transverse them to find the terminator.

Now try to concatenate a bunch of them in C.

C++ code usually use std:: string, std::string_view, or other string classes that stores the string size and has some capacity pre-allocated to avoid such issues.

Profiling is essential. I found a performance bug in calling some C++ functions a while ago, because they accepted a const std::string& and were being called in loops with a C const char*. Every single call had to construct a std::string involving a strlen, allocation and copy.

std::string_view is a nice fix for this but few programmers seem to use it yet.

The (1) also mentions C.

I would partially agree with you, if it only mentioned C++.

Only partially, because too many C++ devs still use plain old C strings.

Like this ?

    const char* strings[]={"ab","cd", "ef", "gh", ....};
    char result[1024*1024*1024*1024];
    size_t sz=0;
    for(int i=0; i<sizeof(strings); i++)
    { size_t len=strlen(strings[i]); memcpy(result+sz, strings[i], len); sz+=len; }

>you need to traverse them to find the terminator.

Why not just do a binary search through the entire 256TB virtual memory space, finding the largest memory address that doesn't segfault? :-)

I’ve been thinking about this lately. Can you actually be faster than C? Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C (such as goto, no bounds checking, pointer arithmetic, etc.), any algorithm for say parsing JSON can be expressed equally as efficiently in C, while a clever and hacky C-specific algorithm cannot necessarily be expressed in those higher level languages.

In other words, is “faster than C” even a good metric if all it means that “if you implement something inefficiently in C it will be faster than if it is implemented efficiently in not-C”?

> Can you actually be faster than C?

Sure, in any language that provides more semantic information than C does. For example, D enables a "pointer to immutable data" type, while C does not. This can improve optimization.

On a pragmatic note, C makes it easy to use 0-terminated strings, and clumsy to use length-terminated strings. The natural result is people use 0-terminated strings.

0-terminated strings are inefficient because of the constant need to strlen them. When I look to optimize C, I find plenty of paydirt in all the strlen instances. D makes it easy to use length-terminated strings, and so people naturally prefer them.

> For example, D enables a "pointer to immutable data" type, while C does not.

Wait, isn’t that what

  const int *x;
does? I.e. a pointer to a constant.

In C, the value pointed to by `x` cannot be changed via a write through `x`, but if there is another pointer to that value, it can be changed through that.

In D, `immutable(int)* x` cannot be changed by any reference to the value.

Isn't that what 'restrict' is for? It's a new type of footgun of course because the compiler doesn't detect if the value is actually accessed through another pointer, but by simply ignoring that possibility via 'restrict', the compiler should have the same optimization opportunities, no?

> but by simply ignoring that possibility via 'restrict',

If you deceive the compiler like that, it is entitled to hand you a broken executable :-/

Besides, I forgot to mention that D immutable data is inherently thread safe, no synchronization required. And you can have as many live references to it as you like.

Immutable data is part of D's support for functional programming.

Well yeah, sure. I wasn't picking on D, just pointing out that C offers an escape hatch, no matter how dangerous that might be ;)

restrict is so rarely used that clang has had open code generation bugs for years. Rust would like to use its non-aliasing guarantees to generate better code, but clang is unable to reliably generate correct code.

So even if you manage to reason correctly around `restrict` in C, you can't count on the compiler to translate your code correctly.

GCC also had bugs around restrict, but I don't know about their current status.

I haven't found the compiler ever optimizing something different because of constness. The various escape hatches means that it is really hard for the optimizing compiler to make certain assumptions, especially if it's connected to something with external linkage. A function with a const struct arg that you never take the address of? compiler emits memcpy!

You might be right but if it's just a technicality and in practice nobody uses restrict the way you said, would anyone care about the theoretical superiority of C? I'm sure some would, but they wouldn't be the majority.

restrict is used so rarely in C code that the Rust team hit a lot of bugs when they tried to give LLVM all the aliasing info the Rust compiler has.

You’d think that C could do any appropriate optimizations as well as D, since the C compiler (in theory) should know whether there can exist (in any given program) any writeable pointers to the same value or if there only can exist pointers to const.

Only inside the same compilation unit.

Doing it across the whole compilation requires LTCG, but that does exist as well. It just makes incremental builds a pain with large binaries.

Pointer arithmetic would surely make all value mutable. Or even simple array access.

Yes, but the compiler should know if there is any pointer arithmetic present in the code, and whether any such pointer arithmetic could, possibly, result in a pointer pointing to the const data.

If you point to writable memory (Not all programs reside in ram.)

    const int world = 42;
    const int const * const hello = &world ;
Apparently, you can be very const and `gcc -Wall -Wextra -std=c99` won't raise any complaints.

Firstly, isn’t that a syntax error? There’s a stray “const” in there. You probably meant

  const int * const hello = &world;
Secondly, what should the compiler complain about? You have a const int, and then a const pointer to const int, pointing to that first const int. What’s the problem?

Thirdly, the latest C version supported by GCC is “-std=c17”.

Nah const is fun

  int main()
    const int world = 42;
    const const const int const const * const const hello = &world;

    return 0;
Is a valid program


Sure, but those extra “const” do not mean anything, and are apparently ignored by the compiler.

> In D, `immutable(int)* x` cannot be changed by any reference to the value.

even if I use a magnetic needle to change the bits in my RAM ? :)

There's an easy answer to that: using a magnetic needle to change the bits in your RAM violated the assumptions of the compiler, leading to undefined behaviour. What exactly will happen in that circumstance is just that: undefined.

To put that another way: hardware failures are beyond the scope of the compiler.

Maybe your compiler.

I think the point of the parent is that in principle you can always implement those optimizations by hand in C although it might of course be impractical.

If you go down this road then you can always drop down to assembler to be even faster than C.

I don't think this is a reasonable argument. Every Turing compatible language that gives you direct access to the metal, so to speak, provides you with the opportunity to implement these optimization by hand.

I think it's much better to look at average C code there. And then C has a tremendous advantage with their compiler support. C compilers have decades of optimization put into them. This will take a while for other languages to catch up to that.

The three D compilers use, the GCC backend, the LLVM backend, and the Digital Mars backend. All the decades of optimization in them are taken advantage of by D. If you write C-style code in D, you'll get the same code generated as if your wrote it in C.

But isn't that a consequence of D "mimicking" C to some extent?

Disclaimer, my compiler background was the Java compiler and its bytecode generation but I would expect both gcc and clang/llvm to have lots of optimization cruft hardcoded for how C programs are written. If you deviate from that, you get potentially less well optimized code.

> But isn't that a consequence of D "mimicking" C to some extent?

It's true that multi-language backends will inevitably cater to the semantics of the dominant language.

For a related example, I discovered that gdb doesn't pay much attention to the DWARF symbolic debug spec. It pays attention to the particular way gcc generates symbolic debug info. Other patterns, while compatible with the spec, don't work with gdb. Similar problems have happened with linkers.

It's a matter of "do as I do, not as I say".

Since I am the author of DMD's backend, I've been putting in optimizations that do cater to D, rather than just C and C++. For example, in the last few days I started putting in optimizations to take advantage of D's new "bottom" type:


> in principle you can always implement those optimizations by hand in C

Some things, like integral promotion rules, you can't get around. You may know that the promotion can be skipped in certain cases, but the compiler cannot know it.

Leading to eye-rolling problems like these: https://github.com/biojppm/rapidyaml/issues/40

Because C is not machine code or even assembly language it must be compiled. Not only might the compiler not be as opportunistic about improving the human written code, but it might not fully utilize the machine's full capabilities if the compiler does not understand the complete instruction set that is available or other things like memory and different levels of cache.

The compiler might not even provide access to these capabilities to a programmer who knows about them and wants to explicitly use them. In such cases, the programmer might have to use assembly, or Fortran, C++ or even something much higher level than C with a compiler that provides access or knows how to "intuit" when those capabilities are useful.

Compiler engineer here. In practice, compilers for higher-level languages often have a lot of difficulty getting anywhere close to the efficiency of comparable C code. If you take Python, for example, you have to do a lot of inlining to eliminate various abstractions. Inlining is actually non-trivial. Yes, inlining, by itself, is an easy program transformation, but knowing where to inline to get the best performance is very hard. If you inline too much, you increase code size, and you lose performance. You have to know precisely which inlining decisions will pay for themselves, and your large codebase might have tens of thousands of call sites and a call hierarchy 40 functions deep. Python also adds the fun little problem that you can redefine any function at run time, which makes it hard for the compiler to know which function you're actually going to be calling. To complicate things further, inlining decisions affect other optimizations. For example, if you inline foo into bar, do you then also inline into bar the functions that foo is calling? Do you unroll the loops from foo into bar? Etc.

Also, there's an aspect that I feel is constantly overlooked, and this is the way that objects are laid out in memory. In Python, JavaScript, Ruby, you have a lot of pointers to objects. You get a lot of pointer-chasing as a result. This is BAD for caches. Each object you touch, each pointer you dereference means pulling in a new cache line. In C, you can design very compact and flat data structures. You can use 32-bit floats, 8-bit, or even one-bit integers if you want. You can have a struct within a struct, with an array inside of it, all without any pointer-chasing.

Modern CPUs are constrained by memory bandwidth, and it's very hard for any programming language to beat C on achievable memory efficiency. What's worse is that we have little to no academic compiler literature on automatically optimizing for memory efficient data layouts. It's an overlooked and understudied problem. You would have to prove that integer values lie within certain ranges (hard) and also do object inlining (collapsing objects into parents) which AFAIK is also hard and not done by any mainstream compiler.

So, yeah, keep thinking that a sufficiently-smart compiler will do everything for you. You will assuredly be very disappointed. Until we have strong AI, the sufficiently smart compiler is basically unobtainium. If you want efficiency, the best route is generally to have less layers of abstraction, or to only rely on compiler optimizations you know for certain will happen.

The sufficiently smart compiler is effectively unobtanium, but that presents a challenge for C as well. C has its own challenges with its memory model & language semantics.

C never was the lowest level of abstraction; there are other abstraction models out there and more still to be invented no doubt. C's model did well (though struggled mightily against Fortran for the longest time) at aligning with processor models of the 80's and 90's. It helps that C's ubiquity meant that to a degree the trail was washing the dog: processor designs were often measured against how they executed code compiled with a C compiler. But past preformance is a poor predictor of future success; who is to say that as processor designs continue to evolve, C's abstraction model won't become increasingly leaky against another? Absent a sufficiently smart compiler, it's entirely possible that C compiler writers will find themselves at a disadvantage.

And that assumes they're competing with a traditional compiler. It's possible, though unlikely, there will be competition from other execution models. As you said memory is often the core area of performance bottlenecks these days, and as terribly inefficient as bytecode interpreters are, they tend to have smaller code sizes. Efficient interprets are hand tuned specifically to make the overhead of bytecode interpretation as efficient as possible. Now, intrinsically they are performing a translation step at runtime that a compiler did before runtime, but one can at least theorize of a model where the interpreter is effectively a specialized decompression algorithm that feeds machine code to the CPU (that's really not that far afield from what happens in hardware in modern cpus and in mobile runtimes). Higher levels of abstraction might allow for more efficient decompression... It's crazy, but not inconceivable.



Damn, I should have proof read this on my desktop rather than just blind posting from my phone.

Well said.

I've had a lot of conversations with javascript engineers over the years who've argued to me that well tuned JS will be nearly as fast as the equivalent C code. I've written plenty of little toy benchmarks over the years, and in my experience they're partly right. Well written JS code in V8 can certainly run fast - sometimes around half the speed of C code. But a massive performance gap opens up when you use nontrivial data structures. Nested fields, non-uniform arrays, trees, and so on will all cripple javascript's performance when compared to C's equivalent of simple embedded nested structs. If you couple clean C data structures with allocation-free hot paths from arenas, the performance of C will easily hit 10-20x the performance of the equivalent JS code.

From memory my plain text based operational transform code does ~800k transforms / second in javascript. The equivalent C code does 20M/second. The C implementation is about twice as many lines of code as the JS version though.

(The code in question: https://github.com/ottypes/text-unicode/blob/master/lib/type... / https://github.com/ottypes/libot )

More like "well tuned JS will be nearly as fast as poorly tuned C code"...

True. But if I’m given the choice, I’d usually rather well written javascript than badly written C. And there’s a lot more decent JS programmers out there than decent C programmers.

Fair. I'd rather neither, and take Python, or even (God forbid) PHP. LOL

C doesn't tell you anything about your cache efficiency, which is to first order the only thing that affects your program's performance.

You're right that flat datastructures are important, but C is far from the only language that can offer those.

I don't think C can ever be the answer; "comparable C code", i.e. code that was produced with the same amount of developer effort, is almost always undefined behaviour that happens to work most of the time on the developer's compiler. C doesn't have the kind of polymorphism that you need to write code that's compositional, testable, but also has good cache locality, at least not for code that also has to interact back and forth with the outside world. There is at least some work in the literature on making efficient use of cache in a Haskell/ML-like evaluation-by-reduction model, and I think that's where the long-term way forward lies (and FWIW in my experience compilers for ML-family languages already more than smart enough).

I'm not arguing that C is the ultimate programming language for maximum performance or anything like that. Simply that it gives you a lot more freedom to tune your implementation where it matters for performance.

For HPC, you'd likely be better with something designed to run on GPUs, with built-in primitives for matrix operations, and a compiler that can tune the way those operations get implemented for you.

However, when it comes to cache locality, if the C programmer knows what they are doing, I think you'll have a very hard time beating them with Haskell or ML. AFAIK those languages have the same pointer-chasing issues I've described earlier. If you need to design a complex data structure, for example a scene graph in a 3D game engine, it will be much easier to make it compact in C.

> However, when it comes to cache locality, if the C programmer knows what they are doing, I think you'll have a very hard time beating them with Haskell or ML.

A programmer who knows what they're doing is a rare thing indeed. On actual business problems with real-world programmers Haskell consistently beats C, IME.

> AFAIK those languages have the same pointer-chasing issues I've described earlier.

There's a pragma to flatten strict record fields, or a compiler flag to do that by default (at least in GHC).

Thank you for the good explanation! I'm in embedded and my experience is that this stuff is almost totally ignored in universities. People have to go through a steep and (for us) expensive learning curve until they get the feeling for the whole system and what makes it fast. And then I think, is it asking for too much? Can we expect knowledge about a gazillion layers smeared onto each other from everybody who just wants to deliver value from the topmost layer?

> Inlining is actually non-trivial.

OTOH, JIT runtimes have more input data than a C compiler. They can implement some runtime equivalent of C++ profile-guided optimization: measure what actually happens in runtime, assume the input data is going to stay roughly the same, and re-generate machine code with this new information into something more efficient.

Pretty sure modern Java does that sometimes.

> In Python, JavaScript, Ruby, you have a lot of pointers to objects.

In C# you can make data structures which don’t use pointers much, or at all. The language is strongly typed, has value types, it’s easy to place structures inside other structures. There’s a native stack just like C, and even stackalloc keyword which is an equivalent of alloca() in C.

> In C# you can make data structures which [...]

Yeah but that's completely validating my point. C# is not Python or JS. It's a (remote) cousin of C which tries to take some of the valuable performance tools from C and bring those to a managed runtime. Because it's strongly typed, it's a lot easier for the compiler to optimize, and because you have all these tools to design compact objects without pointer, you can do that job so the compiler doesn't have to.

And again, an experienced C# programmer can probably write code that runs circles performance-wise around code written by an experienced JS/Python developer in most cases.

> It's a (remote) cousin of C which tries to take some of the valuable performance tools from C and bring those to a managed runtime.

That’s correct. But at the same time, the language is way higher level than C or C++.

> experienced C# programmer can probably write code that runs circles performance-wise around code written by an experienced JS/Python developer in most cases.

In some cases, an experienced C# programmer can even write code which approaches or outperforms C. My Linux video player library https://github.com/Const-me/Vrmac/tree/master/VrmacVideo#per... uses CPU on par with VLC, and 20% less RAM.

> JIT runtimes have more input data than a C compiler

They have more data, but they are at a disadvantage by being time-pressured. They can't apply costly analysis to these data because they need to compile fast. Therefore typically they limit themselves to local analysis which may miss a lot of opportunity for inlining.

> They can't apply costly analysis to these data because they need to compile fast.

True in general, but they use quite a few tricks to minimize the consequences.

They use interpreter, or very unoptimal but fast version of JIT compiler, first time a function is called. They replace it with faster version once it’s clear the function is called a lot.

Unlike C compilers, they don’t need to do that analysis globally for the whole program. They only need to do that for the hot paths, that’s often a small portion of the code.

They can offload that analysis and code generation to another CPU core, and replace the implementation once that background thread has produced a faster version.

> They only need to do that for the hot paths, that’s often a small portion of the code.

That's often correct, however unfortunately codebases today can be very, very huge. It can take a really lot of effort to optimize even just 10% of the hottest code if the product is several hundreds of MB of compressed byte-code. There are also applications with no obvious hot-spots, but flat profiles - e.g. database systems, where most of the time is being spent transferring data between various layers of the system. If a request gets passed through most of the layers, and can be routed into different areas depending on the query type set at runtime, and the clients are allowed to send queries of various types, so they target various different submodules of the server, there will be no hotspots. In these cases warmup can take enormous amount of time.

Even for a server this can be a problem, because after restarting you get an immediate performance hit.

Also keep in mind many software products are not long-living backend processes that can warmup for hours or even minutes. Client apps need good responsiveness, and before JIT even realizes which code to compile, it is already too late.

I think what you wrote largely applies to Java and especially JavaScript, much less to C#. Value types, real generics, and native stack allow even the faster version of the .NET JIT to produce native code that’s not too horrible performance wise.

Good enough for desktop or embedded use cases, even on slow CPUs. I have 3 such devices on my desk, Raspberry Pi 4, a dev.board with Rockchip RK3288, and a tablet with Atom Z3735G, .NET is reasonably fast on all of them, without noticeable warmup issues at startup.

I was talking about JITs in general. Sure, you can AOT-compile .NET code quite efficiently, but then this is a different story.

Here is a nice analysis of how various JITs warmup in practice:


TL;DR; often they don't!

> unfortunately codebases today can be very, very huge

why is that?

Actually, action for action managed languages tend to be "faster" than C in the sense that writing the equivalent program would be slower in C. However, the additional freedom leads to increased program complexity and that complexity eats into the performance budget enough to end up slower than "idiomatically" written C where the developer distilled the solution to its bare essence.

Javascript has the fastest general purpose hashmap/dictionary implementation of all programming languages but at the same time you are forced to use them all the time which overall slows the language down. Writing exactly the same code in C would be even slower since C hashmaps aren't as optimized. However, C code is rarely written like that so it's usually faster.

I don't know about LLVM but GCC has had something like this for years as a standard optimization feature. You create a binary with special profiling code which writes to a file. After running the program a few times, you recompile with optimization that uses this file. I forgot the flags though.

Personally, I'm more interested in executable optimization. It decompiles an executable, performs whole-program optimization, and re-compiles it. I'd love to tinker around with optimizing other people's binaries (e.g. games) for my machine. There is something like that for LLVM but it's very experimental.

It's called profile-guided optimization.

And it fails for C when the source data changes significantly, compared to the inputs used by developers when they were building the binary.

Yes indeed.

> You would have to prove that integer values lie within certain ranges (hard)

Don't JavaScript JITs rely heavily on this to reduce JavaScript's floating-point arithmetic to integer arithmetic?

Not to say it's easy, but hasn't a lot of work been done on this?

Sometimes you can prove that values will remain integer when iterating over array indices for example. However, in the general case, when doing integer additions or multiplications, they need to insert dynamic integer overflow check instructions everywhere. When the integers would overflow they have to be converted to floating-point values.

If you think about this for a second. Suppose I have a loop where I'm multiplying values:

function foo(x, y, n) { for (var i = 0; i < n; ++i) { x = x * y; }

    return x;

In order to know that the multiplication won't overflow, you have to be able to prove that both x and y, coming into that function, will be integers. You also have to have information about the range of x, y and i. If you know that x>=0 and y>= but you don't know how many times the loop will execute at compile time, you are basically screwed. You could unroll the loop to reduce the number of dynamic checks you need, but then that increases your code size. So you mostly have to check that the multiplication doesn't overflow on every iteration. You may also have to do dynamic type checks if you can't prove that x,y,i will always be integers.

FORTRAN is faster for many tasks, and is probably more popular in high performance computing.

Also tasks that can be moved to the GPU go a lot faster. You can interact with those programs in C, but not natively. But some languages, like Julia, can easily move calculations to/from the GPU. And also can transparently take advantage of parallelism.

Julia is in the process of growing rapidly for high performance computing. I don't know if it has officially passed C there. But if not yet, it will.

To add onto your point: from what I've heard the split at HPC conferences is about 40% C to 60% Fortran.

In my limited experience, almost every language specific presentation at an HPC conference is about C++. New projects and initiatives like DPC++, SYCL/oneAPI, Ginkgo, etc are C++ based. And if you look at the open source libraries of the national laboratories, e.g. LLNL, C++ libraries are more common than C libraries, which in turn are more common than the Fortran libraries.

If anything I'd say that C++ is the default for HPC conferences.

That sounds right.

I wonder what the current split is. I'm not in HPC computing, but I see a lot of posts about how quickly Julia is being adopted.

Python users who use numpy and scipy are actually using a lot of Fortran code.

C - and especially idiomatic C actually withholds a __lot__ of information from the compiler. People like to think C is just a portable assembler, but by god it ain't.

The reason why C programs often are fast or run faster is because the language forces you to approach almost all abstraction head on - this goes both ways however, in that linked lists are much faster to write in C than a safe array type, so programs can end up with bad data structures for too long.

Since the lingua franca of the operating system is still C or C-like, there are actual optimization opportunities that go missing because of the C interface: It's hard to devise many alternatives, but if you are calling into libc for mathematics in a hot loop it may be worth avoiding it and rolling your own since the compiler can't really inline libc calls.

>C - and especially idiomatic C actually withholds a __lot__ of information from the compiler. People like to think C is just a portable assembler, but by god it ain't.

Well it might not be, but the fact that it "withholds a lot of information from the compiler" is an argument in favor of it being (a portable assembler), not the opposite.

> Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C (such as goto, no bounds checking, pointer arithmetic, etc.), any algorithm for say parsing JSON can be expressed equally as efficiently in C, while a clever and hacky C-specific algorithm cannot necessarily be expressed in those higher level languages.

This isn't true though. Just off the top of my head: C doesn't expose the carry flag directly, C doesn't let you write signed arithmetic using twos-complement overflow, C doesn't let you share a value between separate compilation units without allowing it to be mutated...

"faster" or "slower" is a pretty terrible metric anyway, given that most projects do not have infinite developer time available. But it's very hard to benchmark "equivalent developer effort" between two different languages.

C loses some optimization potential because of aliasing; it can't be guaranteed that some pointer operation isn't going to change a variable anywhere in memory so that limits certain reordering and loop optimizations.

Languages without arbitrary pointers don't have this issue and safely assume they knows all the assignments done to values in memory allowing for optimizations.

The restrict keyword helps with this.

That, or switching to FORTRAN. For numeric / HPC it is much easier to get a decently fast implementation there.

Provided their user knows what they are doing and never do the mistake to lie to the compiler.

The C language has provided the restrict keyword to solve these aliasing issues since the C99 standard. That is, 20 years ago.

I think this generally doesn't have much of an impact on performance.

Just a personal anecdote, but it regularly makes a substantial (like 2x+) difference for me in tight linear algebra loops. It seems to be required if you want to coax clang to emit vfmad instructions for common C linear algebra idioms. Narrow use case, I know, but it can definitely make a big difference in some domains.

I'll take your word for it. I guess I live in a world where the code I see is rarely tight algebra loops, but I can see that in certain domains you would definitely care.

It certainly does... it's the main reason fortran was often faster than C, isn't it? Aliasing prevents automatic vectorization, among other things.

Whether or not the compiler will be smart enough to autovectorize is a different question.

The developer must also be smart enough not to do the mistake to lie to the compiler, by actually doing aliasing with restricted variables as it is UB and nasal dragons beware.

C compilers don't validate correct use of restrict.

Not all C compilers handle even correct uses of `restrict` properly. For about three years, Rust has been unable to use the LLVM equivalent of `restrict` when downleveling its references, because it keeps finding LLVM miscompilation bugs around it.

These LLVM bugs don't get noticed with C because C programs rarely use `restrict` as pervasively as a downleveled Rust program ends up doing.

I wonder if these same LLVM issues affect LLVM fortran attempts?

I feel like they would have to, if they were to produce a remotely competitive compiler. That's why I'm hoping NVidia's Flang[1] efforts will lead to this aspect of LLVM being cleaned up.

[1]: https://github.com/flang-compiler/flang

Be careful when you say that everything can be converted to C. While that's true in a naive fashion, it gets rid of a lot of additional information.

For example, Rust can sometimes beat C because it's (1) often more friendly to auto-vectorization and (2) it has additional aliasing information.

C code that makes heavy use of callbacks in inner loops is at a disadvantage compared to languages with more powerful inlining facilities. Compare qsort() to std::sort().

Easily. C++ has ways of being faster that C can't really match - namely, templates. They can be hellish to write & debug, but generating type-specific functions is fantastic for optimization & performance. It's also a lot easier to be faster in C++ on key things than it is in C, specifically small-size optimizations. Yes you can do an SSO string or function pointer in C, but it's hard & painful to do so, so it's rarely if ever done. But it's trivial to do in C++, and since the standard library does it for both strings & functions, it's also commonly done.

Similarly in languages like Java or C#, having first-class exceptions means fewer branches & error checking on the hot path over something like C. They are on the whole slower than C for other reasons, but it's not because C is "the best" or "the fastest" at everything. And of course you can't really do de-virtualization optimizations in C.

This seems oriented around just code bases you have seen personally rather than fundamental C limitations and so misleads. To clarify: C can do template-like macros for type specialization (and this is common in some code bases) and easier to debug ways [1] { I've personally used a variant of [1] techniques since the mid 90s }. C can do setjmp & longjmp exceptions and you can even do a TRY/CATCH like exception macros. Function pointers are in stdlib interfaces (e.g. qsort) and generally pretty common in C code. I believe gcc has been able to inline through C function pointers ("de-virtualize", though you qualify with "really" there..) in the same translation unit for like 2 decades now with the right compiler flags/annotations.

It is true that C cannot do full template meta-programming or certain state save/restore optimizations faster than setjmp/longjmp. Hard/painful/trivial are all a bit more subjective and relative to exposure/practice/training/etc.

Personally, I think Nim [2] is a much friendlier way to go than either C++ or Python without the pitfalls of either, but the ecosystem is admittedly much smaller. Also, I've yet to hit something where re-writing the C++/Rust in Nim did not speed things up from "at least a little" to "quite a bit". { This is more an artifact of "performance incautious programming" in the C++/Rust. Too many people fall into the trap that, since a language has a reputation for speed, they don't need to think. This is probably largely why @mehrdadn's original article had the title it did. ;-) }

[1] https://github.com/glouw/ctl/

[2] https://nim-lang.org

> C can do template-like macros for type specialization

Of course, C++ has those some macro capabilities. But macros are quite limited, and typically the "template-like" ones rely on non-standard preprocessor support like typeof for swap. But then you still lack the ability to specialize swap for different types (eg, you can't replicate std::swap's behavior on std::vector in C)

> C can do setjmp & longjmp exceptions and you can even do a TRY/CATCH like exception macros.

You can, but that's now another parameter to pass along down the call stack, and as LLVM notes https://llvm.org/docs/ExceptionHandling.html#setjmp-longjmp-... it still negatively impacts the non-exception performance path.

While C macros are limited, you underestimate their range. You can absolutely create specialized swaps without typeof (not part of the C preprocessor, incidentally), and the CTL which I linked to even has one. You just tell the macro the name of the type. And you can just invoke swap_ref or something if you want a more indirect one. There's no overloading. So, yeah, you need to know what you are operating upon, or abstract qualities thereof.

Sure, the use of these things in C is (usually) a bit more verbose/burdensome than C++. The misleading statements were performance-oriented, not syntactic sugar-oriented, and I already granted exception optimization. (Some folks, like Go/Rust authors, would tell you exceptions are bad anyway.)

> [..] Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C [..]

How would you write this (admittedly contrived) Rust function in C without invoking UB:

    pub fn foo(a: &mut i32, b: &mut i32) {
        let (new_a, new_b) = a.checked_mul(*b)
            .map(|new_a| (new_a, b.saturating_sub(*a)))
            .unwrap_or((10, 20));

        *a = new_a;
        *b = new_b;
For those unfamiliar with Rust, it multiplies a by b, and if it didn't overflow:

* a = a * b

* b = b - a, saturating at the minimum value (as in, it won't wrap it just stops there)

If it did overflow:

* a = 10

* b = 20

And finally, does it compiler better: https://godbolt.org/z/66n8W9

Something like this? https://godbolt.org/z/6nod5e. It even produces almost identical assembly.

The equivalent to checked_mul is __builtin_mul_overflow, which is a compiler builtin: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.... Similarly, saturating_sub seems like it can be implemented with __builtin_sub_overflow.

Would those work on any compiler, or are they compiler-specific? If those are non-standard compiler-specific extensions, not even a library, is it truly a part of C++?

While I'll grant that Rust only has one fully functional compiler at this time, those functions have been part of Rust's corelib since 1.0. Any Rust compiler would have to support them.

They're not part of the standard, but gcc, clang, and icc (Intel) support them. Alternatively, you could emit the correct instructions in an asm block on supported architectures (x86: 'jo'/'jno' for jumps, 'cmovo'/'cmovno' for conditional moves [1]) or reimplement the operations in software [2]

[1] https://www.felixcloutier.com/x86/jcc and https://www.felixcloutier.com/x86/cmovcc

[2] https://wiki.sei.cmu.edu/confluence/display/c/INT32-C.+Ensur....

Rust already does better than C for some benchmarks. See: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

As the paper showed, even Python can be better than C at carefully selected tasks.

No, the paper showed that use of a particular ecosystem default in c++ led to algorithmically worse performance then use of an ecosystem default in python.

You can always beat c if c happens to implement a bad enough algorithm and your n is large enough. That's no guarantee that the performance ∆ will continue to hold once the two language implement the same algorithm.

I've thought about this very question for years. My answer has been, usually, "no" for all cases that do not allow the developer to write straight assembly.

That is, until things like C++'s stackless coroutines came about, which are a construct intrinsic to the compiler and not functionality directly exposed by C.

Further, any machine code language is going to allow you specific instruction access that a compiler might not otherwise utilize (rare, but it happens). In such cases you can gain 'manual' speedups over what C could allow you to do. I would hope that is the obvious exception, however.

But you are asking a very good question not a lot of developers are willing to think much about.

IME "faster than C" without going into a lot of details what manual optimizations have gone into a specific piece of the C code doesn't mean much. Writing C code doesn't automatically give you high performance, but it gives you quite a bit of wiggle room to manually optimize the code.

In my mind, different languages have different "optimization sound barriers", where further manual optimization becomes impossible, or at least runs into diminishing returns. For instance Javascript can be made nearly as fast as "vanilla" natively compiled C code, but that means writing asm.js by hand, which is entirely different from "idiomatic Javascript". It's not worth it writing asm.js by hand, when C compiled to asm.js (or WASM nowadays) is much more maintainable.

Same with garbage-collected languages in general. You can write fast code with very little overhead from the garbage collector, but this means you need to know exactly how the garbage collector of that language works, and the resulting code will usually look very different and will be harder to maintain than the idiomatic style of that language.

In standard C the optimization sound barrier is a bit further away than in many other high-level languages but it's not all that special, on the other hand C with compiler- and CPU-architecture-specific extensions is pretty good (e.g. extensions can push the sound barrier quite a bit further away).

Well, generally optimized numerical libraries like BLAS are faster than what you write yourself. But a prolbem with calling them from C is that you don't get loop fusion, each call has to go through the data before the next one starts. That isn't true with Haskell's lazy evaluation so it can be faster in some cases.

Of course, C++'s eigen does loop fusion for you with template magic so to go really fast you probably want that.

> Can you actually be faster than C?

Sometimes. I know 2 reasons.

1. In some higher-level languages, some problems can be efficiently solved with code generation. For instance, you can take some of the input data, generate code from that, then run the generated code processing the rest of the input data. Examples of the mechanism include Lisp macros, or .NET shenanigans like System.Reflection.Emit or System.Linq.Expressions.

It’s hard to generate good machine code. Possible to do in C, but you gonna need to embed C compiler for that. They are extremely complex, and were not designed for the use case e.g. relatively slow (designed to run offline on fast developer’s computers, or on even faster build servers).

If your problem is a good fit for that approach, C# can actually be faster than C.

Parsing JSON is a good example. When your goal is not just parsing the syntax, but also populating structures with the data from the JSON, good C# JSON parsers gonna runtime-generate serialization code by reflecting the structure types being populated. For an isolated program, technically you can generate equivalent C code offline. But if you’re making a general-purpose JSON serialization library that’s borderline impossible to achieve in C, need reflection and code generation.

2. Modern computers are heterogenous, they have 2 chips computing stuff, CPU and GPU. Even when these pieces are on the same chip and using the same RAM like Intel’s UHD graphics or AMD’s APUs, GPU can be still faster for some problems. Not only because more GFlops (that’s not necessarily true for integrated graphics), also because GPUs have better strategy for dealing with RAM latency. They switch threads instead of waiting for the data to arrive. CPU cores only run 1-2 hardware threads each i.e. limited in that regard.

That’s how for some problems HLSL or OpenCL can actually be faster than C.

> They are extremely complex, and were not designed for the use case e.g. relatively slow (designed to run offline on fast developer’s computers, or on even faster build servers).

in 2021 bundling clang along with your program is actually reasonable - if you are compiling small functions without two tons of headers it's measured in milliseconds.

I never tried to, but I think integration of runtime-generated native code gonna cause overhead.

Where do you place these functions once compiled? Into a separate DLL/each?

In .NET it’s quite easy to generate code that calls manually written functions, or access data provided by manually-written stuff. JIT runtime doesn’t treat generated code as something special, e.g. may inline calls across runtime-generated and manually written pieces of the program. With clang you gonna need a layer of indirection to integrate, with function pointers and such.

LLVM can just put the compiled code at some place in your ram and then you can just execute it

Processors need addresses of things. Look at the following code https://godbolt.org/z/PYEqKn note the function uses another symbol, “func.counter”.

Shared libraries include relocation tables https://en.wikipedia.org/wiki/Relocation_%28computing%29 with all code locations which needs patching. That’s how the OSes can load them into arbitrary locations in memory and the code will still work.

Interesting, I didn’t know llvm is that flexible.

Still, LLVM is a huge dependency to redistribute. And probably has many points of failure. For instance, I expect you gonna need to watch for INCLUDE and PATH environment variables when using that thing.

In .NET everything is in the standard library. The generated code is even platform-independent, for no additional overhead. Here’s couple non-trivial examples: https://github.com/Const-me/ComLightInterop/blob/master/ComL... https://github.com/Const-me/ComLightInterop/blob/master/ComL...

> Still, LLVM is a huge dependency to redistribute.

But .NET isn't?

> But .NET isn't?


.NET runtime takes about 30 MB, or 50 MB if you're on Windows and need desktop components. Links there: https://dotnet.microsoft.com/download/dotnet/5.0

clang+llvm package takes 300-400 MB, that's almost an order of magnitude difference: https://github.com/llvm/llvm-project/releases/tag/llvmorg-11...

> clang+llvm package takes 300-400 MB, that's almost an order of magnitude difference

I ship a statically compiled llvm + clang with my software and it does not add 300-400 mb to the binary at all, only something like 20-30 mb.

If you’re willing to compile, ship and support custom builds of .NET runtime, gonna be less than 20-30 MB.

coreclr.dll (the runtime) is 4.9 MB, clrjit.dll (JIT compiler) is 1.3 MB. The rest is mostly standard libraries, the largest piece of that is System.Private.CoreLib.dll at 9 MB. The numbers are for Win64 version of .NET 5.0.2.

Another thing, for .NET apps the runtime is required anyway, the overhead for runtime code generation is zero.

For C++ apps, llvm + clang (or an equivalent) is only needed on developer’s computers, not something you’d normally ship.

> For C++ apps, llvm + clang (or an equivalent) is only needed on developer’s computers, not something you’d normally ship.

unless you want to JIT-compile C++ code at runtime which is the original point.

I'll give Terra[0] as an example for something relatively high-level that uses LLVM as a JIT. It can also be used as an AoT compiler with fancy macro stuff in place of the C preprocessor.

[0]: http://terralang.org/

> Can you actually be faster than C? Like, in the sense that you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C (such as goto, no bounds checking, pointer arithmetic, etc.),

I'm not sure why you think that, especially with Rust one of the reasons it can be theoretically faster is because it allows for even more undefined behavior and has a variety of datatypes that simply exist to inform the compiler of potential optimizations.

Sure, theoretically.

For instance, C function calls always do some things (like preserving the stack) that may not always be necessary.

I sure wouldn't want to to TRY to beat a C compiler, but it seems obvious to me that it is possible.

> you can transpile any bit of Python or Lisp or Haskell or Rust or JS into C but the opposite isn’t necessarily true because not all those language support all the features exposed in C (such as goto

Turing-completeness tells you that it is in fact necessarily true that you can convert any bit of C into a corresponding bit of Python, Lisp, or Haskell. One obvious approach would be to emit code that implements a C runtime.

For goto in specific, you don't even need to do that. You don't need a goto keyword to implement goto functionality.

Let’s be careful here. All these languages are Turing complete. Heck, isn’t CSS Turing complete now? But insofar as goto in C produces a certain small amount of CPU instructions (1? 2?), can all those other higher level languages do the same? Or will something like JavaScript need a callback-based solution that will do several pointer lookups, memory allocation, etc?

> One obvious approach would be to emit code that implements a C runtime.

how could you negate the python interpreter startup time though ?

Are we talking about compiling or performance? You negate the Python interpreter startup time by changing your implementation of the Python binary, but that isn't relevant to the problem of producing compiler output in the Python language.

C also does lack few features which can provide better performance or provide hints to the compiler.

computed goto, non-aliasing guarantees, actual "const" - just a few on top of my mind.

Generic code is generally faster in C++ unless you manually reimplement it in C.

Rust can be faster than C in few cases because stronger aliasing guarantees, but few compiler bugs prevent it.

Doing some good-performance stuff is also more difficult in C. Strings are null terminated etc..

Sure, managed languages can beat C performance. IIRC, the JVM can reorder branches on the fly to avoid jumps.

Reordering or hinting branches has virtually zero impact on branch prediction performance on modern architectures with advanced hardware branch prediction.

What's much more important for good performance is memory layout and good use of CPU caches. And in this area managed languages struggle a lot. For instance, every object in Java has 16 bytes of overhead for an object header (on 64-bit openjdk jvm). Or you can't organize objects in a flat array. Or there are some guarantees about memory zeroing which often lead to needless memory writes. Or you have to live with GC, which often wastes a lot additional memory and regularly brings unused but reachable memory to caches. Project Valhalla is going to improve some of these limitations hopefully some day, but don't expect the level of C, C++ or Rust performance.

Per Jim Keller, current source of speed in CPU is better branch prediction and data fetching prediction. It is possible C++ (or some other lang) may have more met information to drive these better, in which case it will be faster than C/assembler.


If you use the normal toolchains for C vs another language, the other language still can be faster because it can dynamically generate bytecode based on runtime information. In the normal C toolchain, this is not possible.

However, if you used an alterative toolchain that could also generate bytecode from C at runtime, then I would bet that C would stay on top or be equal.

You can literally emit runtime assembly in C in any C toolchain, what are you on about?

How do you dynamically generate machine code without linking something like LLVM?

I believe that's actually what folk are doing these days, literally linking in LLVM and using it to compile C into machine code then executing it during runtime.

That's what I mean by "alternative toolchain," though; some of the code that you are executing at runtime isn't being generated by the normal GCC/Clang toolchain. It's being generated by your custom setup which links in LLVM.

You'd still need to compile your code with a regular toolchain, but you also need additional tools to compile, optimize, and debug code at runtime. That's what I meant by toolchain, even if it is not the conventional (static) definition.

same way you would with llvm, you can map executable memory without llvm and execute it.

I think about programming languages like I do cars. While rust may be a Ferrari, some kid who doesn't know how to drive a stick and has only had his license for about a month is going to have a rough time beating you driving from New York to Texas even if you're driving a Corolla. I consider myself to be an extremely good software engineer, but lower level programming languages scare me. Recently I've been riding a lot of code in C++ for my Arduino project, but aside from that I can't stand it. You have to be such a good C programmer to make things faster than you probably could knock out using Python + numpy or something like that, but to be fair under the hood numpy uses tons of optimized C code.

I mainly write embedded software in C and C++, but I tend to use python for non-embedded stuff. The funny thing is that I've attempted to speed up multiple python scripts by replacing lists of floats with numpy arrays, and each time it's ended up slightly slower. I suspect numpy only really starts to pay off when you're doing the same operations to 1000+ elements. For modest data sets, or algorithms that don't map particularly well to numpy's operations, the built in data types do better.

I also recently gave numba a go, and it was significantly slower than vanilla python. I was surprised because the decorated function seemed exactly like the type of code that numba would be good at (iterating through a list of coordinates doing square roots and some quirky floating-point modulo math.)

Just checking, even though you likely know. The numpy arrays really only pay off when you vectorise your calculations. Don't expect any speed up if you're still using list comprehensions.

A vector operation is just where instead of having two lists, U and V which add to make W = [x+y for x,y in zip(U,V)] you directly operate on the arrays, W = U + V.

This allows the inner loop to be completely in native code. Sometimes some curious things happen where it's easier to do more vector operations resulting in more looping just because each of those operations is a vector operation and so is faster than the loop. For example, increment a number every time you see a particular element (time series breaking out subsequences) might look like np.cumsum(U == value) which is two loops in practice, but much faster than the iterative approach.

Interesting for me to see someone being scared of lowlevel langs. For me it's the opposite, highlevel langs scare me. They always make me feel that I don't know, what is actually going on

I don't know how cars work , I still drive.

Hang on, this isn't accidentally quadratic, this is accidentally exponential in the worst case. If you set the tree branching factor to one, then the running time of C++‘s std::less becomes 2^h, but there’s only h nodes.

Demo: https://ideone.com/CBIEAE

This creates ~60 objects, but takes something like 2^30 operations to resolve (it times out on this online runner, and takes around 5s on my laptop with -O3).

That's much worse than claimed in this paper! An accidentally-exponential algorithm is the kind of thing that makes DoS attacks trivial...

You're right!! Thanks for pointing this out! I indeed tried to hint at the DAG case in the footnote, but didn't try to explore what happens when the DAG degenerates to a linked list. The biggest obstacle here is, honestly, motivating that an exponential slowdown is a real-world issue that concerns the average programmer, because I know for sure that many would immediately blame the data structure and tell you it's your fault for designing it that way. :-) Whereas if I can illustrate a problem that's fundamentally ingrained into language's built-in data structures by design, and that people would actually encounter in the real world, that's far more compelling.

I know this because I already received such criticism for my examples, with people telling me that it's unclear how wide-reaching the ramifications are in real-world applications (which I suppose is fair enough). People really want to see examples of real-world software being improved with such small tweaks in the algorithms, whereas I didn't have the bandwidth to try to investigate that, and just tried to settle for plausible examples. That criticism would be magnified many times further for DAGs and (especially) degenerate linked lists. (I'm not claiming these are the only relevant scenarios by the way—just saying this is how it is likely to be seen by many readers.) I moved on and didn't spend more time on this (it was kind of a random paper idea I had and not related to anything else), but I think it would be awesome to flesh this out further into something more interesting and compelling and properly publish it.

If you find this interesting and have the time to help joint-author & actually publish a paper on this, please grab my email from arXiv and email me! It would be great to flesh out more consequences and find more interesting examples together. I feel like there's a lot more underneath the surface that I might not have touched (both theoretically and practically), but I hadn't managed to gather enough interest from others in the topic (let alone the examples) to motivate me to look further until now!

This is a pretty clickbaity title in my opinion. Bubble sort in lower level language X is going to be slower than quick sort in high level language Y. And bubblesort in high level language is going to be faster than merge sort for small data sets on low level language X. If you aren't comparing the same underlying routine, or data application, I don't think any conclusion should be made. Comparisons between languages is exactly why asymptomatic analysis was devised. Extract the process from the low level and hardware characteristics, and get the overall complexity growth. But this doesn't work the other way around. You can't compare different routines in different languages and expect big oh to be comparable.

The problem is that a language isn't just its speed, a language is a UX. If that UX makes it easier to accidentally make performance mistakes, then practically speaking, things written in that language are slower than they would be otherwise.

Edit: The original title "Why Python is faster than C++" is much more clickbaity than the editorialized ("When Python is faster than C++")

Here’s an article from 2001 discussing a very similar issue: https://www.joelonsoftware.com/2001/12/11/back-to-basics/

Am I missing something? In the paper the definition of cmp3 on page 2 seems to have a bug - as defined, wouldn't cmp3([1,2,3], [1]) return 0?

    # Uses 3-way cmp() for primitives
    def cmp3(a, b):
        if not isinstance(a, list):
            global c; c += 1
            return cmp(a, b) 
        for x, y in zip(a, b):
            r = cmp3(x, y)
            if r != 0: 
                return r 
        return 0
This isn't generally the expected behaviour for comparing lists, surely?

Oh yeah I think you did find a bug, thanks for pointing it out! I need to check the lengths as well. It shouldn't affect the conclusion (in fact I think I made all comparisons equal-length in the paper?) but I should revise it when I get the chance.

Seems like you can fix this by changing return 0 to return cmp(len(a), len(b)).

If I read the paper correctly, then it compares three-way comparison with two two-way comparisons, for a recursively defined (tree) data type.

The paper points out that the convenience of just defining "less than", and heaving "equals" derived from that can be costly. Specifically, for the recursively defined data structure (tree), a three-way comparison which is derived canonically from the two way compare seems to entail not a linear but an quadratic number of comparisons.

What I don't understand is what is happening in 'lt2'.

this is what I'd expect for __lt__

  lt(a, b) also known as (a __lt__ b) is returning
    True, iff a < b, elementwise for lists
    False otherwise
  for same length lists.
I also do understand cmp2.

  (a __eq__ b) iff not (a __lt__ b)  and not (b __lt__ a)
so looking at

  cmp(a, b) = lt(a, b) - lt(b, a)
I get

  a < b: 1 - 0 ==> 1
  b < a: 0 - 1 ==> -1
  a == b: 0 - 0 ==> 0
which makes sense.

Now two questions arise with respect to the presented hypothesis and the paper:

1. why does the paper call lt2 twice, recursively?

2. why does the paper compare the performance of their lt2 and lt3 instead of the performance of cmp2 and cmp3?

I intuit, when taking the double recursion out of lt2, which imnsho is erroneous, and when comparing cmp2 and cmp3, we'll see a performance penalty of a factor 2, between cmp2 and cmp3, and identical run times for lt2 and lt3, as it should be.

What am I missing?

edit: updated for clarity

Initially I thought this preprint is a click bait (sorry, author), but when I read into details, I realize it is an interesting one. The key observation is the code under section 2.4.2. There, the author triggers the worse case (everything being equal in two trees) and shows that C++'s lt2 behavior leads to its horrible performance: 4.1s in C++ vs 0.018s in Python. Note that the difference is much more than two folds as lt2 and lt3 have different time complexity. PS: after a quick thought, I am not sure if we can avoid the quadratic behavior of lt2 in this particular example.

Haha, thank you! It was pretty demotivating to see so many people immediately dismiss it as clickbait without any attempt to discuss the topic at all, so it actually means a lot to hear that you think differently now. I hope it was fun & worth the read. I know I had a lot of fun writing it. :)

I think the sibling comment may have already answered your question, but if not, I think an earlier response I had might help clarify what exactly I'm comparing & why: https://news.ycombinator.com/item?id=26340952

Note that you do need to read the entire paper to see the overall picture and its consequences; if you stop halfway then it might appear like I'm comparing things unfairly. Feel free to let me know if anything is still confusing after reading the other comments and taking another look at the paper and I'll try to clarify!

let me rephrase, as I indeed have read the paper, before posting.

1. the lt2 definition in the paper is wrong. A lexicographical compare is linear in the size. the derived cmp2 is correct and has a run time twice that of cmp3. which matches the stl definitions of lexicographical_compare, see below.

2. the c++ behaviour in 2.4.2 is puzzling and most likely bug, worth reporting to and discussing with STL implementors.



> 1. the lt2 definition in the paper is wrong.

Would you mind providing a counterexample to illustrate what incorrect output it's producing?

> A lexicographical compare is linear in the size.

Indeed, lt2() also has a loop that iterates a linear number of times as you mention. It is consistent with this.

> the derived cmp2 is correct and has a run time twice that of cmp3. which matches the stl definitions of lexicographical_compare, see below.

Perhaps you might be confused about what lexicographical_compare does? It does not "compare" in the 3-way sense. It only performs a "less-than" comparison. The name is rather misleading.

2. the c++ behaviour in 2.4.2 is puzzling and most likely bug, worth reporting to and discussing with STL implementors.

I'm not sure what to report to anyone, as I don't find it puzzling; it is entirely consistent with everything I'm aware of and have tried to explain in the paper. It is also not specific to any particular implementation; I believe you will see this behavior on any correct implementation you try it on. It might be helpful if you try to produce a counterexample using what you believe would be a correct & efficient implementation to validate or invalidate this.

Reminds me of the sscanf thing that popped up a few days ago (in fact I assumed this was about that at first): https://news.ycombinator.com/item?id=26302744

I wonder (genuinely asking, not being snarky) what it is about C/C++ that seems to make these issues more common? It's also possible my perception of "more common" has just been inflated by seeing multiple examples in a single week

> I wonder (genuinely asking, not being snarky) what it is about C/C++ that seems to make these issues more common? It's also possible my perception of "more common" has just been inflated by seeing multiple examples in a single week

It's an interesting question. There isn't a whole lot in common between the std::set and scanf issues except that they're both multi-pass algorithms (which I posted a comment about there), so I guess as far as the language is concerned, the question might reduce to "(Why) are multi-pass algorithms more common in C++?" I suppose one possible response, to the extent that this might be the case, might be that in C and C++ operations are so cheap (due to inlining and fewer indirections and such) that people don't blink twice before performing them multiple times, without thinking about the implications. Whereas in Python, everything is already so slow that you would try to factor out common operations if you were optimizing code. However, I'm not sure this is a broad pattern in general; e.g., the hashing example is just as bad in Python.

I think the bigger explanation might be more mundane as far as the language is concerned. Some of it is likely just accidental (there's no particular reason Python couldn't have made the same decision as C++ to implement == in terms of <, for example), and some of it is just a consequence of C and C++ programmers being more likely to look into performance issues, since that's probably why they chose said languages to begin with. Even the C++ example only dawned on me after years of experience optimizing things for performance, so given that Python is already incredibly slow to begin with, if I did see this in Python, chances are pretty good I would just assume it's just the interpreted nature of Python (or a poor set implementation) that's slow and not look into it further.

See this comment and children: https://news.ycombinator.com/item?id=26340233

Very interesting. I think it makes C++ come across looking quite good. The committee has considered a case where C++ comparisons are lacking, and C++20 rectifies the situation by having more expressive comparisons with partial and total orderings. C++ may be slower than other languages in doing innovations, but they move relentlessly forward with trying to adopt a good way of accomplishing new features & rectifying mistakes from the past.

C++20 doesn't quite rectify this unfortunately! The data structures still use std::less even in C++20, meaning the 2-way comparisons would happen twice. Except now each 2-way comparison is potentially implemented on top of 3-way comparisons. Perhaps I or someone else should get in touch with the committee to try to change that, otherwise things are going to be slower rather than faster if we still use 2-way comparisons but now they have to do 3-way comparisons underneath!

The code doesn't compile, i.e. is in the broken state. The author is missing the comparison operators.


Why do you assume -std=c++20? The code is written for C++17, and compiles just fine for that.

How can this paper be taken seriously when the paper doesn't even show the compilation flags?

Once you go accidentally quadratic, any clever combination of optimization flags or compiler magic becomes quite irrelevant though.

To be fair, certain compilation flags can change the time complexity of some algorithms if the mistake is trivial enough to figure out.

Really? Can you give some examples? I know compilers are amazing but this seems too much.

At least GCC was (and by now Clang and probably others are) known for occasionally optimizing out certain uses of functions like strlen(), which can change the time complexity in some trivial cases.

For instance, if you consider strcmp(x, y, strlen(y)) where x is of fixed length, then the whole function would be constant-time if strlen(y) is optimized out, whereas it would take linear time (linear in strlen(y)) otherwise. GCC et al. can optimize out strlen() in some such cases.

In practice you wouldn't want to rely on this, both because compilers aren't anywhere near smart enough to figure out complicated cases and also because these wouldn't happen in debug mode either, but it's not impossible for time complexity to change through optimization.

Summation from 1 to n: https://godbolt.org/z/673hTr

clang seems to optimize it to (n-1)(n-2)/2+2n-1, which is O(1).

Even more trivial: sum from 1 to n, then never use the result. It should get optimized out entirely!

Perhaps because the paper is not about the specifics of the incident (which is menial work to figure out, left as an exercize to the header) but the fact that it can happen and a general explanation of the circumstances?

Have academic CS articles always had click-bait titles?

Goto considered harmful is from 1968, so yes.

A minor historical note on this: the original title of Dijkstra's text was "A Case Against the Goto Statement" and it was the Communications of the ACM editor (Niklaus Wirth) that changed it[0].

[0]: https://en.wikipedia.org/wiki/Considered_harmful

In 2021 the note would probably have been headlined "Academics hate programmers who use this one cool trick!!"

"goto considered harmful" was at least not wrong. It didn't say who considered it harmful, but at least it was still presented as an opinion.

This title states "Python is Faster Than C++", which neither implies that this is just an opinion, nor that it isn't an absolute statement. You have to figure out yourself that it's probably hyperbolic and just referring to special cases.

Not "Python is Faster Than C++" but "When Python is Faster Than C++"

I have noticed this a lot too. I'd guess is has to do with the importance of conferences over journals in CS (as opposed to nearby fields where journals dominate). Conferences tend to be more informal and forgiving of titles like that.

Of course this is just a preprint. If they ultimately publish it somewhere the editors/reviewers may make them give a more conservative title.

Why are we back to learning basic computer science? This isn't news to anyone here is it?

Most people never learned computer science. Few programmers I've met have.

It's also not like having a CS degree makes you immune to this — people are busy, real software has indirection which can make the implications non-obvious, and for many working programmers an awful lot of what they learned was a long time in the past and not something they use on a daily basis.

The C++ code shown is a great example: when you see very simple code in the middle of a paper talking about how a particular patterns fails badly, yes, you're primed to look for a problem but if that showed up for you in code review are you really confident that you'd say something other than “followed standard practice, maybe add braces around the if statement”?

The real lesson here is that nothing beats actually measuring your code to make sure you didn't miss something like this.

I once implemented a regex matcher in a SQL dialect. I forget exactly how large the "pathological" expression was that it could beat perl's C implementation for matching, but I'm pretty sure n was less than 30.

See also https://swtch.com/~rsc/regexp/regexp1.html

We lost that battle many years ago. Is that news to you?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact