Compiler optimizers are designed from a “win in the average” mindset - so whether a particular optimization succeeds or not is really not something the user ought to rely on. If you’re relying on it then you’re playing with fire. (I say that as someone who has played with this particular kind of fire. Sometimes I’m ok, other times I get burned.)
A great example of when winning in the average works is register allocation. It’s fine there because the cost of any particular variable getting spilled is so low. So, all that matters is that most variables are in registers most of the time. If spill heuristics change for the better, it usually means some of your variables that previously got spilled now are in registers while others that were in registers are now spilled - and the compiler writer declares victory if this is a speedup in some overall average of large benchmarks. Similar thinking plays out in stuff like common subexpression elimination or basically any strength reduction. (In fact, most of those optimizations have the peculiar property that you’ll always be able to craft a program that shows the optimization to be a bad idea; we do them anyway because on average they are a speedup.)
In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind. Also, the way value types work in C#, Rust, C++, etc - you’re guaranteed that passing them around won’t call into the allocator. Basically, relying on the compiler to deliver an optimization whose speedup from hitting is enormous (like order of magnitude, as in the escape analysis to remove GC allocations case) and whose probability of hitting is not 100% is sort of a language design bug.
This is sort of what the article is saying, I guess. But for example on the issue of the optimizer definitely removing a GC allocation: the best design there is for the GC’d language to have a notion of value types that don’t involve allocation at all. C# has that, Java doesn’t.
One thing I learned about optimizers is some optimizations undo earlier optimizations. The state of the optimized program will sometimes flip-flop back and forth and never converge on a solution. Then you have to rely on a heuristic to guess at the better form, and live with it. The optimizer has a counter in it so it will "pull the plug" if it fails to converge, and will just go with the last state.
My optimizer first appeared in Datalight C around 1984 or so. It was the first DFA optimizer for any C compiler on the PC. C compiler benchmark roundups were popular articles in programming magazines at the time. We breathlessly waited for the next roundup article.
When it came, Datalight C was omitted from it! The journalist said they excluded Datalight C because it was buggy, as it deleted the benchmark code and just printed the success message. The benchmarks at the time consisted of things like:
for (i = 0; i < 1000; ++i) a = 3;
so of course it deleted the useless code.
I was really angry about that, as the journalist never bothered to call us and ask about DLC's behavior. Our sales tanked after that article.
But it wasn't long until the industry realized that optimizers were the future, the benchmarks were revised, and the survivors in the compiler business all did optimizers. DLC recovered, but as you can tell, I am still annoyed at the whole incident.
Recently, a D user remarked that a non-optimized build took a few seconds, but turning on the optimizer caused it to take 35 minutes. I was wondering how this was possible. It turns out his code had a function with 30,000 lines of code in it!
I’m honestly a bit surprised if that’s the first time you’ve seen something like that.
I’ve been working on compilers 30 years, primarily on optimizers (although some frontend work as well). Learned C from Borland C and C++ from…Zortech C++ 1.0, so thank you for that!
Within a few years of working on compilers I came across my first examples of 50k+ line functions. These were often (but not always) the result of source code generators that were translating some kind of problem description to code. It taught me very early on that you really need to focus on scalability and compile time in compilers, whether it’s the amount of code within a function, or across functions (for IPO / LTO).
And yes, working on compilers is never dull. 25 years ago I thought we’d end up in a monolithic world with x86, C++, and Java being the focus of all work. Instead, there’s been an absolute explosion of programming models, languages, and architectures, as well as entirely new problem spaces like graph compilers for ML.
> Within a few years of working on compilers I came across my first examples of 50k+ line functions. These were often (but not always) the result of source code generators that were translating some kind of problem description to code.
Like the issue in the Linux kernel recently where some simple-looking min/max macros generated megabytes of source code.
It's not even just code. I remember once writing a script to generate a large table, only to discover that GCC took forever to compile it, apparently taking O(n^2) time in the size of the file. That can't have been the optimiser because there were no functions. I think I ended up compiling that file with TCC.
> One thing I learned about optimizers is some optimizations undo earlier optimizations. The state of the optimized program will sometimes flip-flop back and forth and never converge on a solution. Then you have to rely on a heuristic to guess at the better form, and live with it. The optimizer has a counter in it so it will "pull the plug" if it fails to converge, and will just go with the last state.
The way every optimizer I've worked on (and written) deals with this is canonical forms. Like, you decree that the canonical form of "multiply integer by 2" is "x << 1", and then you make sure that no optimization ever turns "x << 1" into anything else (though the instruction selector may then turn "x << 1" into "x + x" since that's the best thing on most CPUs).
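To make that concrete, here is a minimal sketch in C++ of what such a canonicalization rule might look like; the IR node type and helpers are invented for illustration, not taken from any real compiler:

    #include <cstdint>

    // Toy IR node. Canonicalization rewrites "x * 2^k" into "x << k" in place,
    // and no other rule is allowed to go the other way, so later passes can
    // assume they will never see the multiply-by-power-of-two form.
    struct Node {
        enum Kind { Mul, Shl, Const, Var } kind;
        Node* lhs;
        Node* rhs;
        int64_t value; // meaningful only when kind == Const
    };

    static bool isPowerOfTwo(int64_t v) { return v > 0 && (v & (v - 1)) == 0; }

    bool canonicalizeMul(Node* n) {
        if (n->kind != Node::Mul || n->rhs->kind != Node::Const) return false;
        if (!isPowerOfTwo(n->rhs->value)) return false;
        int64_t k = 0;
        for (int64_t v = n->rhs->value; v > 1; v >>= 1) ++k;
        n->kind = Node::Shl;          // x * 2^k  ==>  x << k
        n->rhs->value = k;
        return true;                  // caller keeps re-matching until a fixpoint
    }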
But that doesn't necessarily make this problem any easier. Just gives you a principled story for how to fix the problem if you find it. I think that usually the canonical forms aren't even that well documented, and if you get it wrong, then you'll still have valid IR so it's not like the IR verifier will tell you that you made a mistake - you'll just find out because of some infinite loop.
And yeah, lots of compiler optimization fixpoints have a counter to kill them after some limit. The LLVM inliner fixpoint is one example of such a thing.
They show code for pattern matching IR in their Rust code, and it's awful. I think it's because they can't just have an IR with pointers, because that would violate Rust's rules. So, they need to call goofy helpers to deconstruct the IR. Consequently, a super simple rewrite rule ends up being a full page of gnarly code. That code would be less than half as long, and much easier to parse, if it was in LLVM IR or B3 IR or any other sensible C++ IR.
Then they show that the egraph rule is "simple". It's certainly shorter than C++ code. But while it is conceptually simpler than their Rust code, it is not conceptually simpler than the same rewrite written in C++. The C++ code for rewrites in either LLVM or B3 is not that much more verbose than the egraph rule, and the extra verbosity actually makes it easier to understand. Plus, it's code, so it can call any API in the compiler and do any logic it likes - making it inherently more flexible than an egraph.
So, if they had used a sensible programming language to write their compiler then they wouldn't have needed egraphs as an escape hatch from how bad Rust is.
And the phase ordering argument for e-graphs is baloney because:
- Their example of alias analysis, GVN, and RLE assumes the strawman that you put those things in separate phases. Most compilers don't. LLVM does alias analysis on demand and combines GVN and RLE. Ditto in B3.
- The difficulty of phase ordering in LLVM's pass builders (and in any pipeline I've ever worked on) is that you've got passes that are not expressible as egraphs. Until someone can demonstrate an egraph based inliner, SCCP, SROA, coroutine lowerings, and heck an egraph-based FilPizlonator (the pass I use in the Fil-C version of clang to make it memory safe), then you'll be left with those passes having to run the classic way and then you'll still have phase ordering problems.
Set e-graphs apart from e-matchers. An e-graph is a vector-of-sets, where each set contains equal representations of the same thing. What makes it space efficient is that your representation refers to things by index number in the vector, so if you have "3 * (x + 1)" as an expression, you store "x + 1" as entry #1 and "3 * #1" as entry #0. That way if "x + 1" is also known to be equivalent to "load ptr" then you add that to #1 and don't need to add anything to #0. In e-graphs, #0 and #1 are e-classes, the sets holding equivalence classes. There's no limit to what transforms you can use, and you let the e-graph hold the results efficiently.
E-matchers are an algorithm over e-graphs to efficiently search for a pattern in an e-graph, for instance if you want to find "(load ptr) * (load ptr)" then you have to find "#x * #x" for all x in all e-classes, then check whether "load ptr" is a member of e-class #x. This is where you get limits on what you can match. "Pattern matching" style transforms are easy and fast using e-matchers, things like "x * 2" -> "x << 1", but beyond that they don't help.
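A toy sketch of that vector-of-sets layout in C++ (hand-rolled types just to show the shape, not any real e-graph library):

    #include <cstddef>
    #include <string>
    #include <vector>

    using EClassId = std::size_t;

    // An e-node's operands are e-class indices, not other e-nodes, which is
    // what makes the sharing cheap: when an e-class gains a new equal form,
    // nothing that refers to it has to change.
    struct ENode {
        std::string op;                  // "add", "mul", "load", "const", "var", ...
        std::vector<EClassId> children;  // operand e-classes
    };

    struct EGraph {
        std::vector<std::vector<ENode>> classes;  // classes[i] = set of equal e-nodes

        EClassId add(ENode n) {
            classes.push_back({std::move(n)});
            return classes.size() - 1;
        }
        // Record that e-class `id` also has the representation `n`.
        void addEquivalent(EClassId id, ENode n) { classes[id].push_back(std::move(n)); }
    };

    // For "3 * (x + 1)": one e-class holds {x + 1}, another holds {3 * <that class>}.
    // Learning later that "x + 1" == "load ptr" only touches the first e-class.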
There's an optimizer problem where you have "load ptr", you resolve ptr and figure out it's pointing to a constant value, and you replace the load instruction with the constant value. Later, you get to code emission and you realize that you can't encode your constant in the CPU instruction; there aren't enough bits. You now need to take the constant and stuff it into a constant pool and emit a load for it. If you had stored both forms in an e-graph, you could have chosen to use the load you already had.
Suppose you wanted to do sparse conditional constant/constant-range propagation but your IR uses an e-graph. You could analyze each expression in the e-class, intersect them, and annotate the resulting constant-range on the whole e-class. Then do SCCP as normal looking up the e-class for each instruction as you go.
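A rough sketch of just that intersection step, with made-up types (Range, ENode) standing in for whatever the real IR uses:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Range { int64_t lo, hi; };     // inclusive [lo, hi]
    struct ENode { Range knownRange; };   // stand-in: per-expression analysis result

    static Range intersect(Range a, Range b) {
        return { std::max(a.lo, b.lo), std::min(a.hi, b.hi) };
    }

    // Every member of an e-class denotes the same value, so the ranges computed
    // for each member all hold at once; the class gets their intersection.
    Range rangeOfClass(const std::vector<ENode>& eclass) {
        Range r = { INT64_MIN, INT64_MAX };   // "no information yet"
        for (const ENode& n : eclass)
            r = intersect(r, n.knownRange);
        return r;
    }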
> In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind. Also, the way value types work in C#, Rust, C++, etc - you’re guaranteed that passing them around won’t call into the allocator. Basically, relying on the compiler to deliver an optimization whose speedup from hitting is enormous (like order of magnitude, as in the escape analysis to remove GC allocations case) and whose probability of hitting is not 100% is sort of a language design bug.
>
> This is sort of what the article is saying, I guess.
I agree with this to some extent but not fully. I think there are shades of grey to this -- adding language features is a fairly complex and time-consuming process, especially for mainstream languages. Even for properties which many people would like to have, such as "no GC", there are complex tradeoffs (e.g. https://em-tg.github.io/csborrow/)
My position is that language users need to be empowered in different ways depending on the requirements. If you look at the Haskell example involving inspection testing/fusion, there are certain guarantees around some type conversions (A -> B, B -> A) being eliminated -- these are somewhat specific to the library at hand. Trying to formalize each and every performance-sensitive library's needs using language features is likely not practical.
Rather, I think it makes sense to instead focus on a more bottom-up approach, where you give somewhat general tools to the language users (this doesn't need to expose a full IR), and see what common patterns emerge before deciding whether to "bless" some of them as first-class language features.
My point is that if in the course of discovering common patterns you find that the optimizer must do a heroic optimization with a 10x upside when it hits and weird flakiness about when it hits, then that’s a good indication that maybe a language feature that lets you skip the optimization and let the programmer sort of dictate the outcome is a good idea.
By the way, avoiding GC is not the same thing as having value types. Avoiding GC altogether is super hard. But value types aren’t that hard and aren’t about entirely avoiding GC - just avoiding it in specific cases.
Yes, it was a bummer that Java didn't take up on the ideas of Cedar, the Oberon lineage, Modula-3, Eiffel,... even though some are quoted as its influences.
Still I am confident that it might be getting value types, before C++ reflection, networking, senders/receivers, or safety gets sorted out. Or even that we can finally write portable C++ code using C++20 modules.
Unfortunately, it doesn't look like there's much of a guarantee that value objects wouldn't result in heap allocations. https://openjdk.org/jeps/401 even contains:
> So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
There's a further comment about the potential of opting out of atomicity guarantees to avoid that problem, but then there are more problems - it looks like pre-JIT code would still allocate, and who knows how consistent the JIT would be about scalarization. IIRC there was also some mention somewhere of just forcing large enough value objects to always be heap allocations.
That’s some very selective quoting there. Let’s do the full thing:
> Heap flattening must maintain the integrity of objects. For example, the flattened data must be small enough to read and write atomically, or else it may become corrupted. On common platforms, "small enough" may mean as few as 64 bits, including the null flag. So while many small value classes can be flattened, classes that declare, say, 2 int fields or a double field, might have to be encoded as ordinary heap objects.
And maybe the end of the next paragraph is even more relevant:
> In the future, 128-bit flattened encodings should be possible on platforms that support atomic reads and writes of that size. And the Null-Restricted Value Types JEP will enable heap flattening for even larger value classes in use cases that are willing to opt out of atomicity guarantees.
Value types still require allocation for types larger than 128 bits if the value is either nullable or atomic — that seems like a reasonable trade-off to me.
Yeah, had edited my comment to expand a bit on that. Nevertheless, this is just one place where allocation of value objects may appear; and it looks like eliding allocations is still generally in the "optimization" category, rather than "guarantee".
When are they going to finish making progress? I've been hearing about this for years now and it doesn't seem to have happened. Which means Java continues with just about the worst design for memory efficiency you could have.
When they are sure the design is sound enough not to do a Python 2/Python 3.
If it was easy it would be done by now.
There are plenty of long time efforts on other language ecosystem that have also taken decades and still not fully done, e.g. C++ modules, contracts, reflection,...
If you really need value-type-like objects today, it is possible with Panama, even without language syntax for them.
> In my view, if a compiler optimization is so critical that users rely on it reliably “hitting” then what you really want is for that optimization to be something guaranteed by the language using syntax or types. The way tail calls work in functional languages comes to mind.
Automatic vectorisation is another big one. It feels to me like vectorisation is less reliable / more complex than TCO? But on the other hand the downside is a linear slowdown, not "your program blows the stack and crashes".
With clang you can add "#pragma clang loop vectorize(assume_safety)" to a loop to reduce the burden of proof of vectorizability and get it vectorized if at all possible, with a warning when that fails. gcc has "#pragma GCC ivdep", which tells the vectorizer to ignore assumed loop-carried dependencies, but it's not as powerful as clang's pragma.
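For instance, on a loop that clang would otherwise refuse to vectorize because dst and src might alias (made-up function, just to show where the pragma goes):

    // With assume_safety, clang is told it may ignore the possible aliasing
    // between dst and src; per the pragma's contract it diagnoses the loop if
    // it still can't vectorize, rather than silently staying scalar.
    void scale(float* dst, const float* src, int n, float k) {
    #pragma clang loop vectorize(assume_safety)
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * k;
    }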
If you’re talking about autovectorization in C++ then you have the option of using intrinsics to get real vector code. So I think that’s fine because you have a way to tell the compiler, “I really want simd”.
A better fundamental design would be a SPMD language (like ispc, or GPU languages) and then autoscalarization, which is a lot easier to do reliably than autovectorization.
Intrinsics work poorly in some compilers, and Intel's intrinsics are so hard to read because of inscrutable Hungarian notation that you should just write in asm instead.
Ispc is great but I think you’re also saying that:
- it would be better if the intrinsics had sensible names. I couldn’t agree more.
- it would be better if compilers consistently did a good job of implementing them. I wonder which compilers do a bad job? Does clang do a good job or not so much?
I think intrinsics make sense for the case where the language being used is not otherwise simd and that language already has value types (so it’s easy to add a vector type). It would be great if they at least worked consistently well and had decent names in that case.
Intrinsics have somewhat saner names in C#, and unify under the same types for both portable and platform-specific variants (Vector64/128/256/512<T>, numeric operators on them and methods on VectorXXX.* or Sse42/Avx/AdvSimd/etc.)
> the way tail calls work in functional languages comes to mind
tail call optimizations are useful as optimizations.
But in a language like Scheme, tail calls are required to be implemented so they don't grow the stack - effectively as loops. If you don't do that, then people can't write code that relies on it. Scheme treats tail call handling as a semantic guarantee, not as an optimization.
I am not sure in what cases that's a major win; in generational/copying GCs an allocation is a pointer bump, and then the object dies trivially (no references from tenured objects) in the young-gen collection.
When you run out of the local arena, there will be a CAS involved, of course.
Removing the GC allocation is a win because it allows you to promote the value’s fields to registers.
I agree just the difference between GC allocation and stack allocation is small. But it’s not really about turning the values into stack allocations; it’s about letting the compiler do downstream optimizations based on the knowledge that it’s dealing with an unaliased private copy of the data and then that unlocks a ton of downstream opts.
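An illustration (a C++ stand-in, but the same idea applies to a GC'd language): once the compiler proves the object never escapes, it can act as if the second version had been written, and every later pass gets to work on plain scalars.

    struct Point { double x, y; };

    // What the source says: an allocation, field stores, field loads.
    double lengthSquaredAsWritten(double a, double b) {
        Point* p = new Point{a, b};
        double len = p->x * p->x + p->y * p->y;
        delete p;
        return len;
    }

    // What the optimizer can pretend was written once escape analysis succeeds:
    // the fields become locals, then registers, then fodder for further folding.
    double lengthSquaredAfterElision(double a, double b) {
        double x = a, y = b;
        return x * x + y * y;
    }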
Allocation elision (escape analysis) with enough inlining has been tried, with no noticeable benefits (I don't have the source/quote, but it was over 10 years ago already). Escape analysis already provides the same semantics by proving the object allocation etc. can be optimized away.
Java does lack value types but they are only useful in (large) arrays. A lot of such code has 'degraded' to direct byte buffers, and in some cases straight unsafe, similar to void* in C.
I implemented the original allocation elision in JavaScriptCore and then oversaw the development of the pass that superseded mine.
Huge speedup, but very hit or miss. Out of a large suite of benchmarks, 90% of the tests saw no change and 10% saw improvements of 3x or so. Note each of these tests was itself a large program; these weren’t some BS microbenchmarks.
Maybe someone from V8 can comment but my understanding is they had a similar experience.
So it’s possible that someone measured this being perf neutral, if they used a too small benchmark suite.
Isn't that "winning on average" mindset the motivation for JIT compilers and optimization? Compile once, instrument the code, then recompile with more targeted optimization? Granted, that's only something available to Java/C# and not C++.
Playing with fire is when you require an optimization to hit in a specific place in your code.
Having software that only works if optimized means you're effectively relying on optimizations hitting with a high enough rate overall that your code runs fast enough.
It's like the difference between an amateur gambler placing one huge-ass bet and praying that it hits, and a professional gambler (like a blackjack card counter, or a poker pro) placing thousands or even millions of bets and relying on positive EV to make a profit.
The thing where people quote optimisers as only eking out tiny percentages of performance over decades really annoys me. It's true for the numerical kernels written in C and Fortran where the code is essentially optimal already, and all the compiler has to do is not screw it up.
It's grossly false for the vast majority of code, where the sludge written by thousands of engineers gets completely rewritten by the compiler before it executes. Compilers are the reason unskilled devs manage to get tolerable performance out of frameworks.
The "CPython is slow" complaints? That's what life without a good compiler looks like.
“CPython is slow” is kind of a complaint about the Python object model. I’m not sure that I would pin this one on the compiler, per se. Everything you do in Python involves chasing down pointers and method tables, looking up strings in dictionaries, and that sort of thing.
Maybe it’s splitting hairs to talk about whether this is down to the compiler or the language.
> It's grossly false for the vast majority of code, where the sludge written by thousands of engineers gets completely rewritten by the compiler before it executes.
It used to be true, for sure. What has changed since then is not the sludge or bloat or many cooks. Instead, the major change is that we have written a bunch of code on top of optimizing compilers with the assumption that these optimizations are happening. For example, nowadays, you might write C++ code with a deep call stack and lots of small functions (e.g. through templates), and it’s fast because the compiler inlines and then optimizes the result. Back in the 1990s, you would not have written code that way because the compiler would have choked on it. You see a lot more macro use in code from the 1990s rather than functions, because programmers didn’t trust that the functions will get inlined.
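A small example of the style in question; each helper is trivially small, and the whole thing is only fast because the compiler is trusted to flatten it (1990s code would have spelled this with macros):

    #include <cstddef>

    template <typename T> inline T square(T v) { return v * v; }
    template <typename T> inline T clampLow(T v, T lo) { return v < lo ? lo : v; }

    double sumOfClampedSquares(const double* xs, std::size_t n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum += square(clampLow(xs[i], 0.0));  // two calls per element in the
        return sum;                               // source, zero after inlining
    }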
The point is that the Python object model prevents most optimizations, and thus forces the interpreter to execute code exactly as written. All inefficiencies become more obvious this way, while an optimizing compiler could have detected and eliminated many of them in a different language.
Yes, that’s a good point. But I think you can already see the difference in a lot of C++ codebases when you compile at -O0. Until recently, std::move() was a function call, and at -O0, this meant that calls which should be basically a memcpy() ended up containing thousands of little function calls to std::move().
> Everything you do in Python involves chasing down pointers and method tables, looking up strings in dictionaries, and that sort of thing.
One of the things that a C++ compiler can do is devirtualize function calls, that is, resolve those function pointer tables at compile time. Why couldn't a good compiler for Python do the same?
It’s a lot easier in C++ or Java. In C++, you make the virtual calls to some base class, and all you need to demonstrate is that only one concrete class is being used at that place. For example, you might see a List<T> in Java, and then the JIT compiler figures out that the field is always assigned = new ArrayList<T>(), which allows it to devirtualize.
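For example, something like this (made-up classes) is enough for a C++ compiler to devirtualize, because the dynamic type at the call site is provably Circle:

    struct Shape {
        virtual double area() const = 0;
        virtual ~Shape() = default;
    };

    struct Circle : Shape {
        double r;
        explicit Circle(double radius) : r(radius) {}
        double area() const override { return 3.141592653589793 * r * r; }
    };

    double circleArea(double radius) {
        Circle c(radius);
        const Shape& s = c;   // static type Shape, but dynamic type is known here,
        return s.area();      // so the call becomes a direct (and inlinable) call
    }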
In Python, the types are way more general. Basically, every method call is being made to “object”. Every field has a value of type “object”. This makes it much more difficult for the compiler to devirtualize anything. It might be much harder to track which code assigns values to a particular field, because objects can easily escape to obscure parts of your codebase and be modified without knowing their type at all.
This happens even if you write very boring, Java-like code in Python.
C++ has an explicit compilation phase after which things like classes remain stable.
Python only has a byte-compilation phase that turns the parse tree into executable format. Everything past that, including the creation of classes, imports, etc., is runtime. You can pick a class and patch it. You can create classes and functions at runtime; in fact, this happens all the time, and not only for lambdas, but this is how decorators work.
A JIT compiler could detect actual usage patterns and replace the code with more efficient versions, until a counter-example is found, and the original "de-optimized" code is run. This is how JavaScript JITs generally work.
The standard optimization user story is absurd.
1. You’re not gonna get any guarantees that the optimization will happen. That makes it High Level. Just write code. We won’t force you to pollute your code with ugly annotations or pragmas.
2. In turn: check the assembly, or whatever concrete thing reveals whether the optimization you wished for in your head actually went through
There’s some kind of abstraction violation in the above somewhere.
It is a leaky abstraction, all abstractions are. I didn't find it to be much of a problem in practice.
Usually, if you don't get the optimization you wished for, it means that there is something you didn't account for. In C++, it may be exception processing, aliasing rules, etc... Had the compiler made the optimization you wished for, it wouldn't have been correct with regard to the specifications of the language, it may even hide a bug. The solution is then to write it in a way that is more explicit, to make the compiler understand that the edge case can never happen, which will then enable the optimization. It is not really an abstraction violation, more like a form of debugging.
If you really need to get low level, there is some point where you need to write assembly language, which is obviously not portable, but getting every last bit of performance is simply incompatible with portability.
> It is a leaky abstraction, all abstractions are.
It’s not leaky (and that term is kind of meh). It’s just an abstraction! Such optimizations are supposed to be abstracted away (point 1).[1] The problem comes when that is inconvenient; when the client does not want it to be abstracted away.
There’s a mismatch there.
[1] EDIT: The point here is that the API is too abstracted compared to what the client wants. Of course the API could choose to not abstract certain things. For example the Vec[2] type in Rust has specified, as part of the documentation, how it is implemented (to a large degree). They could call it something like “List” and say that whatever the concrete implementation is, is an implementation detail. But they chose not to.
I have the same experience with database query planners. The promise is that you just write your business logic in SQL and the planner takes care of the rest. In practice you spend weeks staring at execution plans.
A key difference between many compilers and DB query planners, is that a compiler can spend more time over its optimisations because it is run dev-side and the benefits (and none of the time taken) are felt by the users. A query planner needs to be more conservative with its resource use.
This is not true of JIT compilers, of course, which have similar constraints to DB query planners. In these cases the goal is to do a good job pretty quickly, rather than an excellent job in a reasonable time.
> A query planner needs to be more conservative with its resource use.
The number of possible distinct query plans grows very rapidly as the complexity increases (exponentially or factorially... I can't remember). So even if you have 10x as much time available for optimisation, it makes a surprisingly small difference.
One approach I've seen with systems like Microsoft Exchange and its underlying Jet database is that queries are expressed in a lower-level syntax tree DOM structure. The specific query plan is "baked in" by developers right from the beginning, which provides stable and consistent performance in production. It's also lower latency because the time spent by the optimiser at runtime is zero.
An even bigger issue is that a regular optimizer sees a whole static world (the code). It can know the size of most things very well.
For DBs, it would be 'trivial' if we could know the exact (or very, very close) size of the tables and indexes. But the db is a mutating environment that starts with 1 row, has 1 million a second later, and is back to 1 row a second after that.
And then you mutate it from thousands of different connections. Also, the users will mutate the shape, structure, runtime parameters, indexes, kind of indexes, data types, etc.
Query planners can (and will) use statistics gathered on the contents of the tables to more effectively execute queries, which can be gathered in the background/batch processes.
For large database vendors there is a good amount of complexity put into up-front information gathering so that query execution is fast.
As the optimal plan will change as data sizes and patterns do, it is almost all JIT with fairly aggressive caching. They keep some stats on each object and plan to help guess when a cached plan should be used or the full planner be engaged again, though this process often needs a little help.
Agreed. You put your finger directly on something that's always bugged me about optimistic code optimization strategies. The code I write is supposed to have deterministic behaviour. If the determinism changes, it should be because of a change I made, not a flag in the compiler. The behaviour is completely opaque and uncorrelatable, makes it very hard to figure out if a given change will lead to better or worse performance.
The excuse for this is that performance is not considered part of the behavior of the program. So it doesn't matter to the question of whether your program is deterministic.
These days you don't know what CPU you run on, so you can't make performance guarantees anyway. Even in embedded we have been burned by the only CPU going out of production often enough that in most cases we can't depend on it anymore.
That is only possible if the optimization is wrong (rare), or the program already had undefined behavior (very common, because undefined behavior is very easy to trigger in C).
Yes, but program bugs are obviously very, very common. Developers already do a great job of adding bugs to their programs, they don't need compilers compounding that problem in non-deterministic ways.
Not all programming languages have this amount of expressiveness. How do you, for example, tell how a database what the intended execution plan is for some complicated SQL query?
You can normally only send SQL queries to a database and not execution plans.
clang does have __builtin_rotateleft32 & co. gcc doesn't though. (but of course for both of those just doing the manual rotate should be fine like always)
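The manual rotate in question is the usual idiom, which both compilers are generally able to pattern-match into a single rotate instruction:

    #include <cstdint>

    // Standard rotate-left idiom; the masking keeps the shift amounts in range
    // so there is no undefined behavior when n is 0, and GCC/clang typically
    // recognize the whole thing as a single rotate on targets that have one.
    inline std::uint32_t rotl32(std::uint32_t x, unsigned n) {
        n &= 31;
        return (x << n) | (x >> ((32 - n) & 31));
    }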
Let’s assume that you can rewrite it, e.g. write some inline assembly.
The thinking here seems to be that you want multiple things:
1. You want the high-level code since that is easier to reason about
2. You also want some specific optimizations
Maybe the missing link here is some annotation that asserts that some optimization is applied. Then if that assertion fails at some point you might have to bite the bullet and inline that assembly. Because (2) might trump (1).
I don't hate query planning; some people are better than me at writing algorithms, and the database knows my data better than me.
However, what I hate is the lack of transparency (and I feel like this article tries to pinpoint just this). When I execute a query locally I get a different plan vs staging vs prod. A plan that can also change depending on some parameters or load or size.
I don't care about understanding all the underlying optimizations, I just care that the query plan I saw is the same and is still the same in prod, and that I can be warned when it changes. PG does not return the hash of the query plan or metrics along with the data, which is imo a mistake. With this you could track it in your favorite metrics store and be able to point at when and why stuff is executing differently.
Yes! The thing with SQL is that the DB “recompiles” the query plan from source every time you do the query. That results in changes at runtime, which are often good and healthy changes, but are sometimes a very bad surprise.
I like the metrics idea, but by the time you see the change in the metric, it’s too late.
For critical queries it might be helpful to be able to “freeze” a query plan just as one “freezes” a binary executable by compiling. In other words, let the query planner do its job, but only at a time of your choosing, so the performance of a production system doesn’t change suddenly.
So no hints in the source, just an opaque token representing a compiled query plan that can be deployed alongside a binary. With tooling you could be notified if the planner wants to do it differently and decide whether to deploy the new plan, after testing.
(And again, you’d only do this for a critical subset of your choice.)
> D has first-class support for marking functions as “no GC”.
It never occurred to me that this would be considered a hint to the optimizer. It doesn't affect code generation. What it does do is flag any use of the gc in the function and any functions it transitively may call.
Optimizers have been likened to turning a cow into a hamburger. If you're symbolically debugging optimized code, you're looking at the hamburger. Nobody has been able to solve that problem.
It's true that optimizers themselves are hard to show being correct. The one in the D compiler is a conventional DFA optimizer that uses data flow equations I learned from Hennessy and Ullman in a 1982 seminar they taught. So it has been battle tested for 42 years now(!) and it's pretty rare to find a problem with it, unless it's a new pass I added like SROA. The idea is that anytime a problem is identified and corrected, it goes into the test suite. This has the effect of always ratcheting it forward and never regressing.
The GC dates from around 2000, when I wrote it for a Javascript engine. It was brutally tested for that, and has been pretty solid ever since. People complain about the GC, but not about it being buggy. A buggy GC is a real horror show as it is painfully difficult to debug.
The preceding paragraph had "and occasionally language features" so I thought it would be understood that I didn't mean it as an optimizer-specific thing, but on re-reading the post, I totally see how the other wording "The knobs to steer the optimizer are limited. Usually, these [...]" implies the wrong thing.
I've changed the wording to be clearer and put the D example into a different bucket.
> In some cases, languages have features which enforce performance-related properties at the semantic checking layer, hence, granting more control that integrates with semantic checks instead of relying on the optimizer:
>
> - D has first-class support for marking functions as “no GC”.
But they only allow you to look at it at certain points in time (unless you request more precise points ahead of time), and/or lose the ability to do a bunch of meaningful optimizations. Whereas people expect to be able to pause a compiled program at any assembly instruction and make sense of it, and there'll always be some people wanting to trade perfect debuggability for perf.
Not really. If you look at HotSpot for example it will deoptimize a function for you all the way if you debug it and then reoptimize it when you stop single-stepping.
But how do you get into the "debug it" mode? If you mark code as to-be-debugged ahead of time (e.g. by adding a breakpoint), then it can indeed set up exactly that spot to be debuggable. If you just pause the VM at a random time, it can freely choose to quietly run maybe a couple dozen to a thousand instructions to reach a safepoint. Compiled languages don't have that luxury, as they don't get to control when a coredump happens or ptrace attaches, but people nevertheless expect such scenarios to provide a debugging environment.
Even if HotSpot had perfect assembly-level debug information (which it cannot, as it does do (a tiny bit of) autovectorization, which by necessity can reorder operations, potentially leading to intermediate states that cannot be mapped to any program state), that just means it'd come at a performance cost (e.g. no autovectorization).
Perhaps there's an argument for having two debugging modes for compiled programs - the "basic" best-effort one that is what we already have, and a feature-complete one that requires tight runtime integration and cannot relate to any of the compiled assembly. But that's adding a decent amount of complication & confusion, doesn't really help coredumps, and still has to pay the cost of having things be deoptimizable even if at limited locations.
Also a general problem is that HotSpot does not tolerate memory corruption, but for memory-unsafe languages you'd quite like to have an as-reasonable-as-possible debugging environment even if 60% of the heap was zeroed by bad code, which throws a pretty large wrench in being able to complete a "just deoptimize everything" as the first step to debugging.
I think this is mostly a philosophical question, rather than one of optimizer quality.
Once an optimization becomes part of the interface and it is guaranteed, is it really an optimization? Or did it just become part of the language/library/database/whatever?
One example is return value optimization in C++. In C++17 the "optimization" became mandatory in some contexts. What really happened though is that the rules of temporary materialization changed, and in those contexts it just never happens prematurely by the language rules. This ceased to be an optimization and became a mechanism in the language.
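Concretely, since C++17 this is a language rule rather than an optimization the compiler may or may not perform:

    #include <vector>

    struct Big { std::vector<int> data; };

    Big makeBig() {
        // Returning a prvalue: since C++17 the object is constructed directly
        // in the caller's storage. No copy or move happens, by rule, not by
        // the optimizer's grace (NRVO of a named local is still optional).
        return Big{std::vector<int>(1000000)};
    }

    Big b = makeBig();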
What I'm getting at is that unreliability is a defining quality of optimizations.
Sure, there are certain optimizations that become load-bearing, in which case it would be better if they became part of the language's semantics and guarantees, therefore they ceased to be optimizations.
It is still a useful distinction. Programming languages are already hard enough to understand. It is useful to have a base of "this is the simplest description of how programs will work" that is as simple as possible. Then, a separate set of "and here is a separate description of its performance characteristics; nothing in this portion should be understood to change the defined behavior of any program."
Even if that second description is stable and part of the guarantees you make, keeping it separate is still incredibly useful from a user perspective.
From an implementation perspective, there is also a useful distinction. Optimizations take a valid representation, and turn it into a different valid representation of the same type that shares all defined behavior. This is a fairly different operation than compilation, which converts between representations. In particular, for the compilation step, you typically have only one compilation function for a given pair of representations; and if you have multiple, you select one ahead of time. For optimizations, each representation has a set of optimization functions, and you need to decide what order to apply them in and how many times to do so. Compilation functions, for their part, need to deal with every difference between the two representations, whereas optimization functions get to ignore everything except the part they care about.
C++ return value optimization has always been a language feature rather than a true optimization because it is/was allowed to change the observable behavior of the program.
One area that I have been exploring is building equivalence proofs between high-level specifications, an implementation in C, and the machine code output from the compiler. I'm still very early in that work, but one of my hopes is to at least demonstrate that the output still meets the specifications, and that we can control things like timing (e.g. no branching on secret data) and cache in this output.
I think that the compilation and optimization step, as a black box, is a disservice for highly reliable software development. Compiler and optimizer bugs are definitely a thing. I was bitten by one that injected timing attacks into certain integer operations by branching on the integer data in order to optimize 32-bit multiplications on 8-bit microcontrollers. Yeah, this makes perfect sense when trying to optimize fixed point multiplication, but it completely destroys the security of DLP or ecDLP based cryptography by introducing timing attacks that can recover the private key. Thankfully, I was fastidious about examining the optimized machine code output of this compiler, and was able to substitute hand coded assembler in its place.
> One area that I have been exploring is building equivalence proofs between high-level specifications, an implementation in C, and the machine code output from the compiler.
"[...] Specifically, the ARM, ARM_HYP (ARM with virtualisation extensions), X64, and RISCV64 versions of seL4 comprise the first (and still only) general-purpose OS kernel with a full code-level functional correctness proof, meaning a mathematical proof that the implementation (written in C) adheres to its specification. [...] On the ARM and RISCV64 platforms, there is a further proof that the binary code which executes on the hardware is a correct translation of the C code. This means that the compiler does not have to be trusted, and extends the functional correctness property to the binary. [...] Combined with the proofs mentioned above, these properties are guaranteed to be enforced not only by a model of the kernel (the specification) but the actual binary that executes on the hardware."
Indeed it is. What I'm working toward is a more efficient way to do the same, that doesn't take the touted 30 man years of effort to accomplish.
I'm working on a hybrid approach between SMT solving and constructive proofs. Model checking done with an SMT solver is pretty sound. I'm actually planning a book on a scalable technique to do this with CBMC. But, the last leg of this really is understanding the compiler output.
> I was bitten by one that injected timing attacks into certain integer operations by branching on the integer data in order to optimize 32-bit multiplications on 8-bit microcontrollers.
FWIW, I think this should be considered a language design problem rather than an optimizer design problem. Black box optimizer behaviour is good for enabling language designs that have little connection to hardware behaviour, and good for portability including to different extensions within an ISA.
C doesn't offer a way to express any timing guarantees. The compiler, OS, CPU designer, etc. can't even do the right thing if they wanted to because the necessary information isn't being received from the programmer.
To a large degree, constant-time programming is hampered by the fact that even hardware is often unwilling to provide constant-time guarantees, let alone any guarantees that the compiler would care to preserve. (Although, to be honest, constant-time guarantees are the sort of things that most compiler writers prefer to explicitly not guarantee in any circumstances whatsoever).
Your comment makes me wonder about the idea of building a superscalar, out of order, speculative implementation of the 6502 or 8080 instruction sets. Might make a good educational project.
Few languages provide such guarantees. But, there really was no way with this particular compiler to pass a hint to generate constant time code.
Black box designs work until the knob or dial you need to control it isn't there. I would have taken a pragma, a command-line option to the compiler, or even a language extension.
This is one example of many as to why I think that user-guided code generation should be an option of a modern tool suite. If I build formal specifications indicating the sort of behavior I expect, I should be able to link these specifications to the output. Ultimately, this will come down to engineering, and possibly, overriding or modifying the optimizer itself. An extensible design that makes it possible to do this would significantly improve my work. Barring that, I have to write assembler by hand to work around bad assumptions made by the optimizer.
I'll refer you to my reply to a sibling comment. I'm hoping that I can build a more efficient means of doing similar work as with seL4, but without the 30 man year effort.
They are on the right track. But, I think there have been some improvements since their effort that can lead to more streamlined equivalence proofs.
Jasmin has some great ideas. My goal in particular is to improve the tooling around C so that verification of C programs is easier. Because there are trillions of lines of C out there, I want to ensure that semi-automated processes can be developed for verifying and fixing this code. For cryptography, this involves similar features as Jasmin. The main difference is that I'm looking at ways to guide this while largely preserving the C language. That's not because I think C is better, but because I want to inject this into existing code bases and development processes, with existing software developers.
It sounds like a very promising approach! Do you think there are really trillions of lines of C? At ten delivered and debugged lines of code per programmer-day we'd have on the order of a single trillion lines ever written, but most of that has been discarded, and probably less than half of the remainder is C.
I suspect something like wasm may be a better way to preserve backward-compatibility with C, although of course it won't help with constant time or confused-deputy vulnerabilities. CHERI might help with the latter.
That's based on current estimates. It's impossible to know for certain, but there is a lot of C software out there.
I'm a big fan of runtime mitigations. I use a few in my own work.
WASM can help in some areas. But, bear in mind that a lot of this C source code is firmware or lower level operating system details (e.g. kernels, device drivers, system services, low-level APIs, portable cryptography routines). In this case, WASM wouldn't be a good fit.
CHERI is also a possibility in some contexts for runtime mitigation. But, since that does involve a hardware component, unless or until such capabilities are available in mainstream devices and microcontrollers, this would only be of limited use.
There are other runtime mitigations that are more mainstream, such as pointer authentication, that are in various states of evaluation, deployment, or regression due to hardware vulnerabilities. I think that each of these runtime mitigations are important for defense in depth, but I think that defense in depth works best when these mitigations are adding an additional layer of protection instead of the only layer of protection.
So, this verification work should be seen as an added layer of protection on top of runtime mitigations, which I hope will become more widely available as time goes on.
We are nearing the death of Proebsting's Law. AMD CEO Lisa Su's HotChips’19 keynote said that compilers had accounted for 8% performance increase over the decade. That means compilers are now only doubling performance every 90 years.
the optimizer should be a magic black box. As soon as you start demanding a particular optimization, it shouldn't be an optimization anymore. In C you have inline assembly and intrinsics and pragmas and macros and so on. If you want the compiler to compile your code a particular way you should be using these, not trying to wrangle the optimizer to invoke a particular optimization.
Except intrinsics aren't even that much of a guarantee - clang converts them to its internal operations and applies the same optimization passes over them as it does on its own autovectorized code; and there are no intrinsics for x86's inline memory operands, so issues can arise around those (or the inverse - I've seen clang do a memory load of the same constant twice (the second one being behind some branches), despite there being only one such load).
And there also are no intrinsics for most scalar operations, e.g. if you wanted to force "x>>48 == 0x1234" to be actually done via the shift and not "x & 0xffff000000000000 == 0x1234000000000000" (or vice versa).
And of course assembly means writing platform-specific code (potentially undesirable even if you want to only do the optimization for a single architecture, as it means having to learn to write assembly of said architecture).
There is some potential middle-ground of doing black-boxing, but as-is in C/C++ the way to do this is with a no-op asm block, but that can make register allocation worse, and still requires some platform-specific logic for deriving the register kind from value type.
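The no-op asm block trick looks roughly like this with GCC/clang extended asm (hedged: exact effects on codegen vary, and as noted it can pessimize register allocation):

    #include <cstdint>

    // An empty asm with a "+r" constraint tells the compiler the value may have
    // been changed in ways it can't see, so it can't fold it back into the
    // surrounding expression; the cost is that the value gets pinned to a GPR.
    inline void blackBox(std::uint64_t& v) {
        asm volatile("" : "+r"(v));
    }

    bool shiftedCompare(std::uint64_t x) {
        std::uint64_t top = x >> 48;
        blackBox(top);            // keeps the shift from being rewritten into a
        return top == 0x1234;     // mask-and-compare against a wider constant
    }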
If you have any language where it is "semantically correct" to execute it with a simple interpreter, then all optimizations in that language are not semantically important by definition, right? I've even seen interpreters for C... so while I 100% have felt the pain in this article, I don't know where that leaves us? These are more like compilation assertions/strategies than language features. (Not that these shouldn't be written inline with the code, e.g. it can be nice to write unit tests in the same files as the code in some languages.)
In the case of SQL, I'd love access to a flavor where I do the joins on indices explicitly, the query is executed as written, and each join (or filter) can be annotated with a strategy (btree lookup, etc). (The most difficult part about using indices by hand is correctly writing all the synchronous triggers on updates, not the queries, IMO).
> If you have any language where it is "semantically correct" to execute it with a simple interpreter, then all optimizations in that language are not semantically important by definition, right?
Technically, yes. :)
But I think this should perhaps be treated as a bug in how we define/design languages, rather than as an immutable truth.
- We already have time-based versioning for languages.
- We also have "tiers" of support for different platforms in language implementations (e.g. rarer architectures might have Tier 2 or Tier 3 support where the debugging tooling might not quite work)
One idea would be to introduce "tiers" into a language's definition. A smaller implementation could implement the language at Tier 1 (perhaps this would even be within reach for a university course project). An industrial-strength implementation could implement the language at Tier 3.
(Yes, this would also introduce more complications, such as making sure that the dynamic semantics at different tiers are equivalent. At that point, it becomes a matter of tradeoffs -- does introducing tiers help reduce complexity overall?)
> If you have any language where it is "semantically correct" to execute it with a simple interpreter, then all optimizations in that language are not semantically important by definition, right?
Not when "correct" needs optimizations to meet real-time guarantees. It's hard to argue that a program which doesn't run is "correct".
Optimizations in compilers like LLVM are done by many individual code transformation passes, one applied to the result of the previous.
This layering makes the order of the passes important and very sensitive. The passes usually don't have a grand plan, they just keep shuffling code around in different ways. A pass may only be applicable to code in a specific form created by a previous simplification pass. One pass may undo optimizations of a previous pass, or optimize-out a detail required by a later pass.
Separation into passes makes it easier to reason about correctness of each transformation in isolation, but the combined result is kinda slow and complicated.
Explain doesn't give you that information in many (most?) DBMSes. It's a bit like seeing the compiler IR code of your program. It lets you understand some things, while others remain a mystery.
I'm guessing you've tried these flags mentioned in the blog post but haven't had luck with them?
> LLVM supports an interesting feature called Optimization Remarks – these remarks track whether an optimization was performed or missed. Clang support recording remarks using -fsave-optimization-record and Rustc supports -Zremark-dir=<blah>. There are also some tools (opt-viewer.py, optview2) to help view and understand the output.
Daniel Bernstein has given an interesting talk about "the death of optimizing compilers" (PDF slides at https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf) claiming that less and less of our code is in that middle ground where it's performance-critical enough to risk running it through an optimizer but not so performance-critical that it needs to be rewritten in assembly.
If we think of the high-level source code (whether in SQL, JS, or Python) as being a functional spec that the low-level implementation should provably follow, maybe we should check them both in, and at build time, run a proof assistant to verify that the implementation still complies with the spec? Rather than running an optimizing compiler every time. I think this is what nanolith is suggesting in https://news.ycombinator.com/item?id=41965701.
Maybe you run the optimizing compiler once when you first write the code, and again when the proof assistant tells you an optimization is no longer valid, but you check the output before you check it in. Or maybe you have the compiler running all the time on a GPU cluster looking for new optimizations.
(our own convention; the fox has Hox, and the lion too, but they'd probably count in octal. For those keeping score, compare The House of Asterion: "The original says fourteen, but there is ample reason to infer that, as used by Asterion, this numeral stands for infinite")
> The optimizer’s behavior is in some contexts load-bearing and has little margin for error (e.g. missing an optimization).
Well, sure, sometimes, to an extent. But if it's load-bearing, maybe that's a bug? You might have written non-portable code that won't last, because it depends on an implementation detail that isn't standardized.
There are widespread applications where performance isn't critical. For example, any time you do a network request. Web pages shouldn't break because the network is slow today, if you can possibly avoid it.
The web provides no performance guarantees, but it tries pretty hard to provide compatibility guarantees. Your code should, usually, work on new browser versions and on devices that haven't been released yet. New browsers will have different JIT compilers with different performance cliffs. And yet, there are websites written many years ago that still work.
When standardizing things, we need to be precise about what's standardized and what isn't, or protocols "rust shut" and can't be changed without breaking lots of stuff. Not standardizing performance is often a win. (With some exceptions like video game consoles where the hardware is standardized.)
Hyrum's law suggests that all compiler optimizations will eventually be load-bearing for someone, but we should usually try to avoid it. To make sure that the code is robust, perhaps performance should vary. Maybe it would be useful to have something like a chaos monkey for programs, where optimizations vary based on a random seed?
If your "debug optimization" code is so slow as to be unusable (see: Rust), then your optimizations qualify as load-bearing.
The problem is that "optimization level" needs a mindset change. The optimization levels should be "release", "small" and "experimental".
"Release level" needs to be perfectly usable for debugging as well as in production--"debug level" should be "release level". Compilation time should be reasonable and run time should be functional.
After that, for embedded, you should have "small level"--checks should get turned off and any optimizations that make code significantly bigger should get turned off (loop unrolling, for example). You might enable some optimizations that make compile time brutally slow.
Finally, there should be an "experimental level" which tests out optimizations before they go into release.
And there should be no "fast" optimization level. If your optimization is that situation specific, it should stay stuck in "experimental".
And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the user asks, and present a coherent version of what is going on. This is actually where compilers really break down nowadays: the compiler needs to eject enough information and context that a debugger can unwind what is going on, rather than debug info being an afterthought.
We need the equivalent of an LSP (Language Server Protocol) for debuggers.
DWARF does contain a Turing-complete VM for doing unwinding, so theoretically it should already be possible to make a compiler that gives precise debug info everywhere; no new protocol necessary.
But requiring optimizations to be completely reversible anywhere is rather limiting: it rules out, among other things, dead code elimination (more generally, forgetting some fraction of program state), identical code merging, and instruction reordering (especially vectorization, which essentially reorders across multiple loop iterations). That severely limits what your optimizer can even do.
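As a minimal illustration of the first of those (my own example, with made-up names), dead code elimination simply erases state that debug info would otherwise describe, which is why optimized builds so often show values as "optimized out":

    // At -O2 a compiler is typically free to delete `checksum` entirely,
    // since nothing observes it. A debugger stepping through the optimized
    // build then has no state left to "unwind" back to and will usually
    // report the variable as <optimized out>.
    int process(const int* data, int n) {
        int checksum = 0;
        for (int i = 0; i < n; ++i) checksum += data[i];  // dead: result unused
        return n * 2;
    }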
> And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the user asks, and present a coherent version of what is going on.
You can't do this because some optimizations can't be seen through; the most obvious one is identical code merging. If you're in a merged function then you don't know which of the original source lines you're looking at.
I'd like to see a system which only had optimized compilation and used an interpreter for debugging.
You can do that by having several different trampolines for the merged code that stash away a unique cookie on the stack to distinguish which function the code is currently executing. This is not zero-overhead, of course, but one can argue that the slight overhead is worth it given better error reporting and increased debuggability.
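A rough sketch of that trampoline idea, not taken from any real toolchain (all names are invented, and a thread_local stands in for the stack slot just to keep it short):

    #include <cstdio>

    namespace detail {
    // The "cookie": which logical function the merged body is running for.
    // A real implementation would stash this on the stack so an unwinder
    // could find it; a thread_local keeps the sketch short.
    thread_local const char* current_symbol = nullptr;

    inline int shared_getter_body(const int* self) {
        return *self;  // the single merged implementation
    }
    }  // namespace detail

    // One thin trampoline per original function, each recording its
    // identity before forwarding to the shared body.
    int get_width(const int* self) {
        detail::current_symbol = "get_width";
        return detail::shared_getter_body(self);
    }

    int get_height(const int* self) {
        detail::current_symbol = "get_height";
        return detail::shared_getter_body(self);
    }

    int main() {
        int w = 640, h = 480;
        get_width(&w);
        std::printf("last getter entered: %s\n", detail::current_symbol);
        get_height(&h);
        std::printf("last getter entered: %s\n", detail::current_symbol);
    }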
> If you're in a merged function then you don't know which of the original source lines you're looking at.
Then don't do that optimization.
How much is gained versus the grief that is caused?
Even when I did a lot of embedded device coding, I don't think that optimization ever saved me more than a couple dozen bytes of code.
And, as you point out, the optimization is stupidly difficult for debuggers to unwind.
This is one of the disconnects of modern optimization. There is no good backpressure from downstream tools to say "Oy. That optimization is going to make our lives difficult. If you can't demonstrate a significant improvement, don't do it."
Indeed, you can just not do it for important things - usually the error paths leading to a crash or exit.
It's important for embedded software though. But that was a bad example; I should've said tail calls, which are much more common and more important, because not performing them blows out the stack for recursive code.
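A small, hypothetical illustration of why that one ends up load-bearing:

    #include <cstdint>

    // Tail-recursive sum. Semantically fine, but each call needs a stack
    // frame unless the compiler turns the tail call into a jump.
    std::uint64_t sum_to(std::uint64_t n, std::uint64_t acc = 0) {
        if (n == 0) return acc;
        return sum_to(n - 1, acc + n);  // tail call
    }

    int main() {
        // At -O0 this will likely crash with a stack overflow for large n;
        // at -O2, GCC and Clang usually eliminate the tail call and it runs
        // in constant stack space. Nothing in the C++ standard guarantees it.
        return static_cast<int>(sum_to(10'000'000) & 0xff);
    }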
How would this model handle code that needs to be as fast as possible, at the cost of debugging? E.g. all hot-loop code, all high performance numerical code, all latency sensitive and instruction sensitive code?
> If your optimization is that situation specific, it should stay stuck in "experimental".
Yeah, I have a feeling that there are low-key lots of applications for which many of the compiler optimisations you're branding "experimental" are in fact run-of-the-mill; they'd enable "experimental" mode so frequently that it'd just be synonymous with "actually release mode".
> And through all of this, the compiler should also eject files that carry enough information to allow debuggers to make sense of the compiled code, unwind the optimizations when the user asks, and present a coherent version of what is going on.
This is a pretty cool idea. Semi-related: you should check out some of the work around E-Graphs, which attempt to enable the same level of optimisation while preserving more of the original context.
It’d be neat to have the equivalent of source-maps for highly optimised code, but I’m not sure how we’d manage them without them becoming enormous. After things like inlining, constant-folding, desugaring, rearrangement operations, etc all take place I suspect you’d be left with code that bears only the merest passing resemblance to its original form.
In the "closing thoughts" [1] of the OP article there's a pointer to "Proebsting's Law" [2], which got me to a benchmark [3] of how much LLVM improved over the ten years from 2.7 to 11,
which leaves me with the strong intuition that indeed what might benefit more from a rethink is not optimizers but a focus on programmer productivity.
> Perhaps this means Programming Language Research should be concentrating on something other than optimizations. Perhaps programmer productivity is a more fruitful arena.
This reminds me of Halide (https://halide-lang.org/), which is a really interesting design in this direction.
In Halide, you specify the algorithm at a high level, then specify the "schedule" (execution order, vectorization, storage) separately. Then, it verifies that the two match. For tight loops like image kernels, it's often counter-intuitive which execution plan will give the best performance, so being able to experiment with the guarantee that you are still implementing the algorithm correctly is valuable.
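For a flavor of that split, here is roughly the 3x3 blur example from Halide's own documentation (reproduced from memory, so treat the exact API as approximate). The first two definitions are the algorithm; everything under the "schedule" comment only changes how it runs:

    #include "Halide.h"
    using namespace Halide;

    Func blur_3x3(Func input) {
        Func blur_x, blur_y;
        Var x, y, xi, yi;

        // The algorithm: what gets computed.
        blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
        blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

        // The schedule: how it gets computed (tiling, vectorization,
        // parallelism). Changing these lines can change performance
        // drastically but cannot change the result.
        blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
        blur_x.compute_at(blur_y, x).vectorize(x, 8);

        return blur_y;
    }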
For a database, it would be as if you could send a high level SQL query and an execution plan together.
Halide is very specific, but I could imagine a general-purpose language with a Python-like high level representation for code behavior, with the option to specify a lower-level implementation with a lot of performance annotations.
Great article. One real case I encountered that I find thought-provoking is where a bunch of test failures got lumped into the same bucket because link-time code generation had noticed that a bunch of C++ getter functions had the same output code and combined them all. So stack traces became confusing, because the address-to-symbol mapping was more complicated than the logic we had in place was prepared for.
i.e. optimization had violated a rule we were implicitly relying on (that each non-inlined function should start at a distinct address, so that address-to-symbol mapping could be done easily). But that's not an explicit guarantee, and optimizers don't seem to think about it much. (Well, inlining seems to have had some thought put into it - it still sucks, but anyway this case doesn't fit the inlining pattern.)
I find it hard to say anyone is dead wrong in this case… but I would turn off that LTCG optimization any time I could, except where proven necessary.
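A toy reproduction of the mechanism behind that incident (my example, not the original code):

    #include <cstdio>

    // Two functions with deliberately identical bodies.
    int first_of(const int* p)  { return p[0]; }
    int second_of(const int* p) { return p[0]; }

    int main() {
        // With identical-code-folding at link time (e.g. MSVC's /OPT:ICF or
        // lld's --icf=all), the linker may keep a single copy of the body,
        // so the two pointers can compare equal even though the C++ standard
        // says distinct functions have distinct addresses. That is exactly
        // the implicit rule the folding broke for address-to-symbol mapping.
        std::printf("%s\n", first_of == second_of ? "folded together"
                                                  : "distinct addresses");
    }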
I think this is misplacing the need for a rethink. Compilers are amazing at optimizing small programs. What they are less amazing at is optimizing large programs.
The query planner for SQL databases is a good example of this. For small queries, the odds of the planner making a bad choice are rather low. Large queries, on the other hand, are comparatively easy to lead down a bad path. This is why the "no sql" family of databases see more predictable results for a lot of uses. If you can't do complicated queries, then you don't have that peril to worry about.
Worth trying if you have the programs to play with. Run it with and without -O3 and see how the compiler does for you.
> This is why the "no sql" family of databases see more predictable results for a lot of uses. If you can't do complicated queries, then you don't have that peril to worry about.
Can't you also just, well, not use complicated SQL queries?
Certainly, and that is my point? Throwing out query planners feels like the wrong thing to consider. Instead, focus on keeping the things being planned smaller.
I'm not sure if this is the right way to look at it? Do you really care whether some optimization was performed or not, or do you only care about the performance characteristics of the resulting program? Maybe a more sensible solution would be to have better tests for performance regressions.
Still not an easy task (performance is often highly dependent on the input, and regressions can hide in very specific cases), but it makes more sense to me than writing high-level code and then worrying about the details of how the compiler translated that high-level code.
One of the reasons SQL is declarative is that queries run concurrently yet give the illusion of working in isolation. And unlike the concurrency of independent processes, which is mostly mechanical, this one involves detailed coordination, because database queries work on the same data. With these constraints it may not be that easy to come up with an imperative model that is as simple as the current declarative one.
No. Language designers need to think about what their language promises, and when some users have differing needs, they may need to fulfill those needs.
> Have a good mental model of what the optimizer can and cannot do.
Most DB query planner designers and implementers have little imagination, and their mental model of what optimizers can and cannot do is, well, extremely narrow. There is a huge unexplored space of what query planning could be (at least for analytic queries, when we think in columnar terms) - if we just stop insisting on treating DBMS operations as black boxes.
Arguing against query planning by pointing at a quote about databases is wild. Automatic query planning is ubiquitous and hugely successful in databases.
I'm surprised that the "query planner" doesn't have a way to eject an opaque object - the "assembly language of the query" - that you can run later and that the planner is not allowed to change.
Sure. It's definitely a tradeoff which definitely hurts on rare occasion. I agree that the lack of a fallback in most databases is a bit strange. Altogether, though, the productivity benefits of defaulting to a query planner have proven larger than the drawbacks.
I've added a clarification in the post to make my position explicit:
> This is not to imply that we should get rid of SQL or get rid of query planning entirely. Rather, more explicit planning would be an additional tool in database user’s toolbelt.
I'm not sure if there was some specific part of the blog post that made you think I'm against automatic query planning altogether; if there was, please share that so that I can tweak the wording to remove that implication.
> I'm not sure if there was some specific part of the blog post that made you think I'm against automatic query planning altogether; if there was, please share that so that I can tweak the wording to remove that implication.
The quote from another article (which I didn't read) starting with "I dislike query planners".
"Against ... altogether" is mildly stronger than I took away from this, more like "generally of the opinion that the tradeoff nearly everyone is making with sql isn't worth it".
Judging by the lack of upvotes other people didn't react as strongly to this quote as I did, so take it as you will.
Frankly, the problem is that (generally, across languages and their various compiler hints) @inline sometimes fails to inline. At this point I've given up on ever having an @inline that reliably inlines, and I would very happily settle for an @assert_inline that doesn't change the generated assembly at all but reliably crashes out if the function isn't inlined.
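For what it's worth, GCC gets close to that @assert_inline in C and C++: its always_inline attribute turns a failure to inline into a hard compile error rather than a silent deoptimization. A small sketch (not the macro being asked for, and it does force inlining rather than merely asserting it):

    // GCC documents that failing to inline an always_inline function is
    // diagnosed as an error, so a broken "must be inlined" assumption shows
    // up at compile time instead of in a profile.
    __attribute__((always_inline)) inline int fast_path(int x) {
        return x * 2 + 1;
    }

    int caller(int x) {
        return fast_path(x);  // if GCC cannot inline this call, the build fails
    }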
Julia is by far the worst language about this. It would be vastly more usable with the addition of @assert_type_stable, @assert_doesn’t_allocate, and @assert_doesn’t_error macros.
On the Julia front, I believe the evolving effect system may help a little bit with obtaining/surfacing that information. JET.jl should be able to solve the type stability (inference) one from a whole-of-program perspective. We have `@inferred` for unit tests. The macros would be a cool addition - I wonder if they would get overused though?
I agree that Julia takes the idea of optimization to the extreme - it's semantically a very dynamic language and is only fast due to optimizations that aren't semantically guaranteed. On the other hand, getting access to the generated IR, LLVM IR and assembly and iteratively improving it is far easier than in any other language I've seen.
Just in case you're interested, Julia does (now) have tools similar to what you're describing, although of course there are still lots of possible improvements to be made w.r.t. their UX to make these tools more reliable and easier to use.
In particular, @assert_type_stable and @assert_doesn't_error are largely provided by JET.jl and the effects system, and @assert_doesn't_allocate is (at least partially) provided by AllocCheck.jl.