Premature optimization has become an umbrella term for people defending poor choices during implementation. Not making simple mistakes that kill performance for no benefit, or choosing option B over A where both take equal effort but B is 20x more performant, should be common sense.
Unfortunately, 9 times out of 10, when I hear these oft-quoted phrases it's engineers defending their status quo bias during code review.
I like Casey Muratori's "non-pessimization" approach in Philosophies of Optimization [1] [2].
He breaks "making a program go faster" into 3 modes: (1) "optimization", which includes counting expected cycles or memory bandwidth, careful benchmarking, designing custom algorithms, etc., which can be fun but is expensive and only rarely applied; (2) "non-pessimization", which is simply avoiding adding extra unnecessary work for the CPU, which should be done 100% of the time but isn't in the industry today; and (3) "fake optimization", which is shallowly repeating 'optimization tips' and conclusions from unrelated cases of (1) without actually measuring anything.
Thank you for the links; this looks like something definitely worth checking out, especially given the name of the whole series, "Simple Code, High Performance", which sounds like the core of my approach to writing code.
"non-pessimization" is my most loved technique because a design reduced to the minimum amount of necessary work and abstractions is more often than not has better performance and is easier to reason about.
The whole series is pretty great. One of the interesting points he makes is that non-pessimization (2) is a necessary precondition to optimization (1): any attempt to do hardcore optimization of a pessimized codebase is futile and will continuously faceplant into endless expanses of stupid code filled with unnecessary work for the CPU that instantly becomes the bottleneck.
“If an implementation could be written with the same level of complexity and in the same time, with 20x more performance, then it's this implementation that should be chosen. I'm talking about implementation choice, not optimization.”
I sometimes try this approach and write the better implementation in front of the dev when possible.
> Not making simple mistakes that kill performance for no benefit, or choosing option B over A where both take equal effort but B is 20x more performant, should be common sense.
The full quote from Knuth is this:
“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”
In your example, clearly, if there are two approaches and one is 20x faster than the other, it's optimization in the right place.
But... it's the "at the wrong times" bit that's actually the most important part. If you have a working solution and you're spending time trying to make it faster without knowing it's too slow then you're trying to optimize at the wrong time. If you know both approaches then it's fine, and good, to pick the faster one. If you don't then you should use the one you know and come back later if you need to.
Go's built-in performance analysis tooling is so excellent.
The profiler can do CPU, memory, goroutine, blocking, and other profiles, and can display all of that as a graph or a flame graph; there's also `go tool trace`, which gives you a full execution trace, including lots of detail about the work the GC does. All of that comes with interactive local web-based viewers.
Performance optimization is always so fun with it.
Yep, but you know, I can't stand that web-based viewer; it's got this hair-trigger zoom, and if I breathe in the direction of the mouse the graph goes hurtling in or out. I used to look at the PDFs, but now I just stay in GoLand, which gives you everything you need.
> If you’re writing Ruby code to turn your blog drafts into HTML, it doesn’t much matter if republishing takes a few extra seconds
Unless your software is intended to reload blog posts live, like Hugo/Jekyll/other static site generators do, and it takes ~5,000 ms on an i9 machine when it could take 100 ms if different language and implementation choices were made. This is the story of modern software: "I don't care how much time and computing power I waste. It's not the core app of some FAANG giant, so I'm not going to bother, ever."
I agree to the extent that people fail to realize how they could be missing out on opportunities to innovate or corner a market when they leave performance on the table. Quite often, new products become possible when a basic task like rendering HTML goes from taking 10 seconds to 10ms. I think you can paint the problem in broad strokes from either side of the "how important is performance?" argument.
From a "get the bills paid" point-of-view, any good project manager also has to know when to tell an engineer to focus on getting the product shipped instead of chasing that next 5% in throughput/latency reduction/overhead. I've seen my fair share of programmers (including myself) refuse to ship to a project because the pursuit of some "undesirable" latency and not finish more important features.
For tasks like video streaming, automation software (CI pipelines to robotics), video games, and professional tools for content creation (DAWs, video editing, Blender, etc.), performance is the feature, but then your product is helping them get the bills paid faster. Medical apparatuses and guidance software on autonomous vehicles are examples where latency is a life-or-death situation.
I think everyone would benefit from playing with OpenRTOS, or writing some code that deals with video/audio where there are hard deadlines on latency. But I'm never gonna hold some weekend-project static site generator in Ruby to the same standard as macOS.
Agreed with what you said, but even for Blender, performance is important, yet the features are: Free software, a good modeler, correctness, being feature-packed, a good-looking renderer, AND performance.
I wonder how heavily other fields sacrifice in the name of "ease of implementation".
Could we have houses that are 100x stronger and longer lasting if we allowed a few extra weeks of construction time? Could we 10x battery capacity with a slightly more sophisticated manufacturing process?
I don't think many developers nowadays understand how fast computers are, nor how much needless bloat is in their software.
I work in III/V semiconductors. The product development goes like: identify need, design, attempt a proof of concept, DoEs to determine manufacturing costs/viability, repeat until you have a statistically controlled viable process. There is essentially no room for technical creativity like there is in software. If we spent an extra year on our worst performing product (we've done this), we would get at best 5% improvement with iffy reproducibility.
> I don't think many developers nowadays understand how fast computers are, nor how much needless bloat is in their software.
Most, as in > 50%, of developers aren’t in a position to decide which languages, libraries and tools to use, or how much time they can spend on performance improvement.
For the economics of that comparison to make sense, we should also include material inputs to the construction. It doesn't matter to the cost of the final product if it was materials that were saved or labor cost. Software is the special case where cost almost entirely comes down to labor. So given that, we are of course constantly compromising quality. We realize that less though because buildings are more standardized.
The big issue in other fields mostly isn't manufacturing design or construction process. The main bottleneck is that with physical goods and long distances travelled, you're subject to high shipping costs, so people cut all sorts of corners to make things less bulky and lighter. Cheap plastics are far lighter than wood or metals, for instance, so we see more plastics get used.
You could have a lot more good enough and much cheaper houses in the US, but we banned manufactured/prefab houses due to lobbying from the “build everything individually out of wood” lobby, require huge setbacks due to the front lawn lobby, and various other things like overly wide roads because of out of date fire codes.
Anyone with APM set up can see what’s going on pretty well. For almost all web apps, you can see that ruby/python are extremely fast and almost all of the time spent on a request is in the database.
This is true of every rails app I have worked on in the last 10 years. You’d have to be doing something remarkably bad or have some crazy situation to end up with ruby using any noticeable amount of compute time.
Well that depends on what your application is doing. But yes, if it is a database heavy workload, then what you say applies.
What I was trying to get at was that in these cases the ruby/python/whatever parts are not actually fast, they just do much less stuff. Add that the database might be on another machine, so there's plenty of overhead as well.
Aside: You can sometimes get orders of magnitude of speed gains just by using an embedded database and a fast language on top if your workload allows it. Essentially you don't really do much to get this, just cut overhead.
It was the "bad" untyped languages like PHP, Python and Ruby (Perl was too complicated for us, mere mortals) that saved the web from becoming a Microsoft monopoly, or, more likely, an oligopoly between the same MS, probably Sun, probably IBM or some such. I'm talking about most of the 2000s decade. True, the web has become sort of an oligopoly right now, but at least that was not caused by the programming languages and web frameworks that power it.
What I'm trying to say is that those languages that everyone is quick to judge right now have given us 10, maybe 15 years of extra "life", a period when most of us have "made" our careers and, well, most of our money (those who have managed to make that money, that is). We wouldn't have had (what basically are) web companies worth tens, if not hundreds of billions of dollars, if the web had still meant relying on Struts or on whatever it is Microsoft was putting forward as a web framework in the mid-2000s. We wouldn't have had engineers taking home TC worth 500-600k and then complaining that Python or Ruby are not what the world needs.
It would be nice to have some poor souls at a university study that period from a sociological POV (and not only), but, alas, AFAIK nobody is doing that, so scattered comments like mine, from people who lived through that period, are the only thing we have.
Come to think of it, maybe some data historians at Google can look through the Googlebot logs (and not only those) and compare the percentage of websites that they think were .php-run (let’s say) over all this period, in both nominal and relative terms. It is my hunch that those websites represented a big chunk of Googlebot’s hit targets for most of Google’s (the company’s) existence, hence why I think my comment is valid (among other things).
I am mostly asking for evidence because I despise PHP, Ruby and Python -- I've worked extensively with the first two and dipped my toes in the third several times, so it's not an uninformed opinion -- but I'd be open to recognizing their historically positive value.
My beef with such technologies... is not with the technologies at all. It's with the people who can't let go and can't recognize that things do change (or who are simply afraid that they'll have to learn new skills; many programmers are very risk-averse, very average Joes and Janes just wanting to make a living, and they hate the idea of having to keep learning throughout their entire careers -- not sure if that's "fine", but it obviously works well for millions of them out there).
Same as with the old non-GNU UNIX tools zealots; some of them are still fiercely against the GNU tools... meanwhile a lot of the DevOps people I know (and no small number of programmers, yours truly included) have already moved about 75% of their scripting workloads either to the newer Rust variants (rg, exa, fd-find, tuc, and many others) or to lightweight, quick, script-friendly languages like Nim or OCaml.
So my broader point was: "we should utilize nostalgia as a driving force much less than we do right now". Hence my question.
I'll agree with you that the alternative bleak early-dystopian future that PHP et al. prevented would have been horrid, and -- again -- I'd recognize the role of the early free / open-source tech as a positive driving force in IT history.
> But for code that’s on the critical path of a service back-end running on a big sharded fleet with lots of zeroes in the “per second” numbers, it makes sense to think hard about performance at design time.
> Unless your software is intended to reload blog posts live like Hugo/Jekyll/other static site generators
Doesn't this fall into this?
> But for code that’s on the critical path of a service back-end running on a big sharded fleet with lots of zeroes in the “per second” numbers
A static site generator is a simple utility that should take 100ms at most on a modern crappy laptop. It's not some backend service that's in a critical path, but it's also not a difficult engineering task. Parse input, produce output. This isn't something that should take seconds, which I think is what the OP was getting at. But because of the language choices made, and the millions of lines of bloated code, it does take seconds.
Yes. Maybe not directly, but more like it versus a one-off "I just made a new blog post and can wait 10 seconds for it to render and deploy." If you're reloading the blog posts live then you're on the critical path, not a one-off anymore. So you need to think about performance.
In the former case, the performance really doesn't matter. If you've got a personal blog, does it matter if your update takes 10 seconds or 1 second? Probably not; if it does, it's the most important blog in the world.
If you've got customers with many blogs and need to take their updates and render them, then the performance matters because it's shifted from 1 to many (hundreds? thousands?). And now that 10 second delay is a big issue, you're either using a fleet of servers to handle the load or some customers don't see updates for days (oops).
> In the former case, the performance really doesn't matter. If you've got a personal blog, does it matter if your update takes 10 seconds or 1 second?
It shouldn't matter, but the fact that you framed the question as 1 second in the fast case shows that this does matter haha. The fast case in this scenario should be under 10ms, and the slow case should be half a second. It shouldn't take anywhere near 1 second to generate a static site!
A modern low-end CPU can process upwards of 40 GB/s of data. The last time I checked, the entire Encyclopedia Britannica is only around 1 GB. I don't know many blogs that approach that much information. At most, a blog probably has only a few tens of kilobytes of actual text. It should never take more than a second to process that, but it often takes upwards of 30 seconds to several minutes.
This is what I was trying to emphasize before haha. Even if code isn't in the critical path, there's no reason it should take several seconds to do almost anything, unless it involves communicating over a network. We can stream audio/video data over a network in real time on low end computers through Zoom. That's gotta be processing several MB of information at least every second in real time. If we can do that, surely we should be able to build a static site generator that can generate any blog in under a second, right?
This sounds like a "using a screwdriver to drive a nail" problem. Jekyll is designed to be run by the end-user, so it's not optimized for fast build times. Its design goals don't prioritize performant builds, because most people don't care if it takes 30s to update their blog.
If you decided to deploy a tool that's designed to be run infrequently, and in a non-time-sensitive way, on a centralized service - it's up to you to make it fast or parallelize it (e.g. github pages).
Trying to say that some piece of software is badly written or designed because it can't be used in a way it's not designed to be used, just seems silly. If I use `grep` to do realtime audio processing, I'm probably going to have a bad time.
I don't understand this at all; this frustrates me to no end with Jekyll. I would be much more productive if I didn't have to wait 20 seconds between each edit to check that the figure is in the right place.
Right, I also misconstrued the original commenter as suggesting using it for many people when I realized (after my previous comment) they meant for one user. However, they still are describing moving rendering some kind of markup to HTML to the critical path with live reloading. In that case, performance is absolutely critical (as stated in the initial article too) because, well, it's on the critical path. A 10 second delay in a "live reloading" situation (suppose a two pane view with markup on the left and rendered HTML on the right) would become intolerable if every character change resulted in that kind of delay. So yes, in that case something that can render it faster would be better.
But if it's just a personal blog? Who cares if it takes 10 seconds or 1 second unless it's rendering many people's blogs? In which case it would, again, be moved to the system's critical path as a blog hosting service.
The part where he talks about the quality of benchmarking tools reminded me: a library worth mentioning for the .NET crowd is BenchmarkDotNet. Not only can it tell you the time for a given test to run (after doing warmups to try to mitigate various problems with running tests cold), it also has options to see how the GC performed at every generation level. Is this code getting to later generations, or is it staying ephemeral, which is cheaper to use and destroy? Or can you avoid heap allocations entirely?
Edit: Oh, and I should mention, if you want the method-level tracking on a run similar to some of his screenshots: if you pay for a ReSharper license, that comes with dotTrace, I believe, which gives you that sort of tracking.
dotTrace is fantastic, really essential for performance profiling. Likewise, dotMemory is really good when trying to reduce or understand memory usage (though dotTrace does have some memory tooling too). I've been happily paying for a JetBrains Ultimate personal license for a few years now.
There are very few companies that I'm really rooting for, but JetBrains is absolutely one.
I have a ReSharper Ultimate license through work and a full JetBrains Ultimate at home (I switched to Rider for C#/F#/Unity dev in the past 6 months and am really liking it, along with CLion for the times I'm writing Rust).
One time at work I dug up something that removed 75% of the runtime of an application: it turned out that taking the length of an image was actually a method even though it looked like a simple property, so I cached it at the start of processing each image instead of calling it on every loop iteration. It was insane how much faster the code became. I tracked that down with dotTrace.
And yeah, dotMemory is also fantastic; I've dug up some GNARLY memory usage with it. I probably should have mentioned it since I was bringing up the memory portion of BenchmarkDotNet.
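The same pattern shows up in any language; here is a hypothetical C++ sketch of that kind of fix (the `Image` type and its accessor are invented for illustration):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the image type in the anecdote above: an
// accessor that looks like a cheap property but does real work per call.
struct Image {
    std::vector<std::uint8_t> pixels;
    int length() const {
        // Imagine this walks metadata or crosses an API boundary every time.
        return static_cast<int>(pixels.size());
    }
};

void brighten_slow(Image& img) {
    for (int i = 0; i < img.length(); ++i)   // accessor re-evaluated every iteration
        img.pixels[i] += 1;
}

void brighten_fast(Image& img) {
    const int n = img.length();              // call it once, cache the result
    for (int i = 0; i < n; ++i)
        img.pixels[i] += 1;
}
```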
> The other two languages I’ve used mostly in recent decades are Java and Ruby and the profiler situation for both those languages is kind of shitty. I had to pay real money to get the Java profiler I used at AWS and while it worked, it was klunky, not fun to use.
These days, async profiler (https://github.com/jvm-profiling-tools/async-profiler) is much better than the Go tooling for performance. It is a joy to use and features a top-like view for the hottest methods. It works for locks, allocations and CPU time. It also integrates with JMH.
> CPU time is always cheaper than an engineer’s time.
But my time is sometimes cheaper than 10 other developer's time. Or 50. Or all of our customers now and in the future: https://xkcd.com/1205/
Additionally, a fatal flaw of most project management theory is that it measures time as the constrained resource, which is rarely if ever the case. The real constraint is not time. It's either attention or energy, depending on the context (and motivation). We are not machines. Taking 10 minutes off of a process doesn't necessarily allow us to get 10 more minutes of work done, while taking 5 minutes off of another may actually save us 15.
Absolutely. As an Engineer, I've found the Pomodoro technique to be invaluable. I work with high focus for a 20-minute "cycle", get up and walk around, then repeat. No faffing about while working, and the mild exercise stimulates the brain so my "on time" is very effective.
There were several JSON problems mentioned in the article. One was parsing large arrays and floats. Another was DOM-style parsing vs streaming. The point of using a streaming JSON parser is to avoid building an "output" per se, and instead handle the elements of the input as though they're events, concurrently with reading the input. In other words, turn parsing into an event handling system with many small events, rather than producing a single large data structure. One reason to do this is to avoid string allocation for every key and string value in the input, because that's where a huge chunk of the time goes when you use a non-streaming parser. Some people call this parsing style SAX parsing (https://en.wikipedia.org/wiki/Simple_API_for_XML). Here's an example: https://rapidjson.org/md_doc_sax.html
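For anyone who hasn't seen the SAX style, here is a condensed sketch in the spirit of the rapidjson example linked above: the handler only reacts to the events it cares about, and no DOM is ever built. The input and handler here are invented for illustration.

```cpp
#include <cstdio>
#include "rapidjson/reader.h"

using namespace rapidjson;

// Handler methods are callbacks fired as the parser walks the input;
// events we don't override fall through to BaseReaderHandler's defaults.
struct SumHandler : BaseReaderHandler<UTF8<>, SumHandler> {
    double sum = 0.0;
    bool Double(double d) { sum += d; return true; }
    bool Uint(unsigned u) { sum += u; return true; }
};

int main() {
    const char json[] = "{\"values\": [1.5, 2.5, 40]}";
    SumHandler handler;
    Reader reader;
    StringStream ss(json);
    // Parse drives the handler directly from the input stream; no tree of
    // allocated keys and string values is ever materialized.
    reader.Parse(ss, handler);
    std::printf("sum = %f\n", handler.sum);
}
```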
Having written a few of them, the input is well-defined but the output is not. It generally depends on what you want to do with the JSON parser. If you are trying to get a generic key-value output, you can do that with a hash table, but you have to still access the keys by name. If you want to populate a struct, you should do something different for maximum speed (parse the fields directly).
Another way to state the question is, "If the input was in a format that required no JSON processing, nor other processing, what would that format look like?" For example, maybe it is a simple key,value table. I do not know. What makes it easier for me to understand is a sample of what that format looks like.
> Make it work, then make it right, then make it fast.
This applies to core language choice. I used to hear a lot about rewriting interpreted langs in something faster... but the reality is that a team that's just spent a year making an app in Python isn't going to pivot to writing Java, Go, or Rust one day. The new team isn't going to be building something new and exciting; they are going to be on a tight timeline to deliver a port of something which already exists.
In my case, the choice is made for me (I write Apple native, so Swift, it is...).
I have found that his advice on using profilers was very important. I thought I knew where it would be slow, but the profiler usually stuck its tongue out at me and laughed.
When I ran a C++ shop, optimization was a black art. Things like keeping code and pipelines in low-level caches could have 100X impact on performance, so we would do things like optimize to avoid breaking cache. This often resulted in things like copying and pasting lines of code to run sequentially, instead of in a loop, or as a subroutine/method.
It's difficult. There's usually some "low-hanging fruit" that will give awesome speedups, then, it gets difficult.
I know people don't like macros and pragmas, but unfortunately I have found that in cycle-sensitive code, they are really the way to go. I have found that with judicious use of those tools, you can kind of write the code in a way that feels natural (and looks normal to a normal programmer), and still get what you need. It is still slightly scary, but a lot less scary than copy-pasting the same code over and over again.
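As one hedged illustration, a loop-unrolling pragma keeps the source looking like a normal loop instead of a hand-copied body (spellings shown for Clang and GCC; whether it actually buys cycles still has to be measured):

```cpp
#include <cstddef>

void scale(float* data, std::size_t n, float k) {
    // Ask the compiler to unroll the hot loop rather than pasting the
    // body eight times by hand; the code still reads like a plain loop.
#if defined(__clang__)
#pragma clang loop unroll_count(8)
#elif defined(__GNUC__)
#pragma GCC unroll 8
#endif
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= k;
}
```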
Modern C++ compilers are a mixed bag when it comes to optimization: brilliant in some areas and inexplicably obtuse in others, requiring the programmer to be quite explicit. This has improved significantly with time, but I am still sometimes surprised by code the compiler has difficulty optimizing, or by optimizations it only recognizes when the code is written in very narrow ways. As a practical matter, compilers have a limited ability to reason over large areas of code the way a programmer can, but even very local optimizations are sometimes missed.
This is why it is frequently helpful to look at the code generated by the C++ compiler. It gives you insight into the kinds of optimizations the compiler can see and which ones it can't, so you can focus on the ones it struggles with. This knowledge becomes out-of-date on the scale of years, so I periodically re-check what I think I know about what the compiler can optimize.
For some things, like vectorization, the compiler's optimizer almost never produces a good result for non-trivial code and you'll have to do it yourself.
I should add that some types of optimizations (cache efficiency being a big one) are outside the scope of the compiler because they are implicitly part of the code specification that the compiler needs to faithfully reproduce; e.g., data structure layout is required to be a certain way for interoperability reasons.
There was some work done a while ago on automatic layout optimisation. See Chris Lattner's dissertation. It seems to be mostly dead now, but the principles justifying its existence are not.
Compilers can and do allow themselves license to ignore ABI in some cases; for c and c++, this includes link-time-optimised programs and -fwhole-program &c, as well as straight-up localised transformations on localised data structures (for whatever value of 'local' you prefer). Many other programming languages, in particular including JIT-compiled ones, also do not specify any ABI at all.
For C++, the principle backing this is the as-if rule [0]. While compilers do optimize internal ABIs to some extent, there is definitely still lots of room for improvement. I think internal function boundaries and data structures should/will eventually be only hints to the optimizer, just like the register and inline keywords have lost their original strict meaning, but we are not anywhere close to that yet.
gdb's disas command (or asm view in general) is also very helpful, especially if you want to try control-c profiling [1] and then looking at the asm instructions in the neighborhood of where you halted your program.
The tool is called Compiler Explorer; I think Matt just hosted it in his own otherwise unused existing personal domain, but the name famously stuck as it is short and memorable (at $WORK, our private compiler explorer installation is under go/godbolt!).
I’m curious how this compares in practice to other languages such as Rust, Haskell, Go, or Java. Cache efficiency at the level of arrays vs linked lists, and specific code for vectorization, all happen in programmer land - but are C/C++ engineers really getting that much gain over alternative languages whose compilers don’t allow keywords for unlikely or inline?
Attributes like inline or likely rarely do that much, so that doesn't explain it. They are merely suggestions, frequently ignored, and the compiler is good at optimizing these things anyway. Languages that allow explicit, deterministic, fine-grained resource management like C, C++, and Rust (systems languages) have significant advantages over ones that do not like Java and Go; certain types of optimizations cannot be effectively implemented without this property. In principle, systems languages can all produce similarly performant code but the amount of work required to get that performance varies greatly.
Modern C++ in particular encourages the extensive use of expressive compile-time metaprogramming to automate optimization that the compiler cannot. In theory, the heavily optimized data structure and algorithm code that is often generated in C++ could be manually written in C or similar. However, in practice, few software engineers have the patience or time to systematically apply those optimizations manually to everything in their code base. I know for a fact that I didn't when I was writing C.
That modern C++ tends to produce the most highly optimized code is more a matter of economy than theoretical capability.
> Attributes like inline or likely rarely do that much, so that doesn't explain it. They are merely suggestions, frequently ignored
In Rust you can annotate a function with `#[inline(always)]`, which is not a suggestion; it'll always inline it AFAIK (unless it calls itself recursively, where, by definition, inlining is not possible).
All major C++ compilers have similar attributes, but they are heavily discouraged by the compiler authors.
Clang also has a flatten attribute, which can be interesting: it does the inverse of marking a function inline; the function it is applied to will have every call in its body inlined into it, e.g. `[[clang::flatten]] void f() { a(); b(); }`
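For reference, the GCC/Clang spellings in standard attribute syntax look roughly like this (a sketch, not a recommendation, given the caveat above that compiler authors discourage forcing these):

```cpp
// Force this function to be inlined into its callers (the compiler complains
// if it can't comply). GCC and Clang both accept the [[gnu::...]] spelling.
[[gnu::always_inline]] inline int add(int a, int b) { return a + b; }

// flatten is the other direction: inline every call made *inside* this
// function's body into it.
[[gnu::flatten]] int sum3(int a, int b, int c) {
    return add(add(a, b), c);
}
```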
Got it, I'll have to dig through the language benchmark game to see if it reflects the best C++ tricks. From what I've seen previously, the biggest perf difference is in native memory management along with easy access to machine specific instructions.
Having recently worked on a large C/C++ code base which had a heavy focus on performance observation, I found myself wondering if tricks like dynamic code generation to avoid method overhead and unnecessary branch checks actually had a significant benefit on performance relative to their developer-economy impact.
Per language comparisons vary, but one undiscussed reason why C++ often has the most win is because it allows you to control memory layout for data structures. The ability to store objects in data structures as values (instead of with references to separately-allocated objects), pack structures to certain sizes, and align to cache are as important as optimizations from the compiler.
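A quick sketch of what that control looks like in practice: values stored contiguously versus a pointer chase per element, plus explicit cache-line alignment.

```cpp
#include <memory>
#include <vector>

struct Point { float x, y, z; };

// Contiguous storage: one allocation, linear memory traversal when iterating.
std::vector<Point> by_value(1'000'000);

// Reference storage: an allocation per element and a pointer chase per access,
// roughly what you get by default in many managed languages.
std::vector<std::unique_ptr<Point>> by_reference;

// Explicit alignment: keep each counter on its own cache line so two threads
// updating adjacent counters don't false-share.
struct alignas(64) PaddedCounter { long value = 0; };
```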
I cannot speak for C++, but I recently rewrote an optimised C routine in assembly and obtained a 2-4x speedup. I might have gotten partway there with careful tuning of the source, the way the parent suggests, but I could not have gotten all the way.
Just did this, rewrote a deployed cli from Python[1] to Go[2]. It delivered on all the expectations. Less CPU & RAM usage, less dependence on system stuff like openssl, more diverse platform support.
This definitely doesn't apply to developer tools in general (and sometimes also to infrastructure).
Anyone who had the "pleasure" of using Amazon's internal tools can talk about how "Make it work ASAP" attitude has worked out for their internal tooling :) LPT anyone?
I don't know, I've been places where select pieces of infrastructure were being rewritten from language X to Y for performance reasons, and it seemed to be going just fine. It wasn't "we're rewriting everything now" but finding performance bottlenecks and fixing them by rewriting pieces in the new language.
It works if you have lots of things communicating over APIs.
You have to be thinking about speed from the beginning.
You don't actually have to make it fast, but you have to be designing w/ an eye towards eventual speed. Otherwise you end up w/ a mess that can be made faster, but never fast.
I'd like to see static analysis that shows the call graph with number of nested loops at each level. Perhaps a way to show the loop lines themselves to make it easy to estimate O(nmk) by looking at the output. Obviously more detailed complexity analysis would involve a person going through the code manually.
We have a number of algorithms in Solvespace that are O(n^2), some of them accidentally, because certain list operations are O(n). One of the worst offenders was accidentally O(n^4) because an n^2-sized input was handed to an O(n^2) algorithm. I wrote a smarter algorithm to avoid that inner call entirely.
TLDR it shouldn't be hard for tooling to produce useful time complexity estimates. Can we get this please?
There's a pretty simple workaround for this that I discovered fairly early on: Create a deterministic test for your code, write down the expected call counts for each function, and then look at the actual call counts.
You can spend all week shaving 15% off of an inner loop function or you can notice that it's being called 3x as much as you expected, fix that problem, and get a 60% performance improvement without changing a single line of that function.
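A minimal sketch of that workaround, with a hand-rolled counter standing in for whatever a profiler or test harness would report (the names and workload are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count how often the hot function actually runs under a deterministic test.
static std::size_t g_find_calls = 0;

int find(const std::vector<int>& xs, int key) {
    ++g_find_calls;
    for (int x : xs)
        if (x == key) return x;
    return -1;
}

void process(const std::vector<int>& xs) {
    for (int key : xs)
        find(xs, key);   // one call per element, linear scan inside: O(n^2) overall
}

int main() {
    std::vector<int> xs(1000, 7);
    process(xs);
    // Write down the expected call count; if someone later adds a nested call,
    // this fails loudly long before anyone profiles the inner loop.
    assert(g_find_calls == xs.size());
    return 0;
}
```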
"I approve too! But… Sometimes it just has to be fast, and sometimes that means the performance has to be designed in. Nelson agrees and has smart things to say on the subject."
- so... you're saying premature optimization isn't the root of all evil. Maybe it's time to retire this tired 'conventional wisdom'
> you're saying premature optimization isn't the root of all evil. Maybe it's time to retire this tired 'conventional wisdom'
Funny you mention it. I have a bit of a habit now of reminding people what the remainder of Knuth’s quote actually was.
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
The irony of walking away with the impression that Knuth was saying to not do optimization is that his point was the exact opposite of that, he was emphasizing the word premature, and then saying we absolutely should optimize, after profiling.
This all agrees completely with the article’s takeaways: build it right first, then measure, and then optimize the stuff that you can see is the slowest.
The 97% does have its own performance opportunities; rather than making it faster you want to stop it from getting in the way of the important stuff, by reducing code size or tendency to stomp on all the caches or things like that.
Anyone optimizing via microbenchmarks or wall time exclusively isn’t going to see this.
No, they're saying that not all optimization is premature. Or that upfront performance considerations in the design are not necessarily a case of premature optimization.
Not to mention the paragraph that immediately follows the more famous one, in Knuth's essay:
>> Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
The only type of optimization I understand that sentiment for is very isolated but convoluted fixes that buy performance at the cost of readability, etc. Intelligent architecture that is fast, so long as it does not completely obfuscate the intent of the code, is not premature.
I would say, in most cases it pays to make the overall design of the code in a way that enables fast execution. This includes selection of languages and libraries.
Loops or recursion with many iterations (high N) and a high Big-O order may also be identified and avoided during the design stage, if you KNOW the N and the O-order at design time. (And don't worry about the O-order for low-N problems; that is premature optimization most of the time.)
On the other hand, tweaking every single statement or test in ways that save 1-2 cycles is rarely worth it. Often you end up with obfuscated code that may even be harder for the compiler to optimize. This is the 97% of the code where premature optimization makes things harder.
Some programmers may turn this on its head. They don't understand the implications of the design choices, but try to make up for that by employing all sorts of tricks (that may or may not provide some small benefit) in the main body of the code.
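As a small example of that design-level kind of choice (as opposed to statement-level tweaks), picking the right container up front changes the O-order without obscuring intent; a sketch:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Quadratic by construction: a nested scan for every element.
bool has_duplicate_n2(const std::vector<int>& xs) {
    for (std::size_t i = 0; i < xs.size(); ++i)
        for (std::size_t j = i + 1; j < xs.size(); ++j)
            if (xs[i] == xs[j]) return true;
    return false;
}

// Same result, roughly linear: the design choice (a hash set) does the work.
bool has_duplicate_n(const std::vector<int>& xs) {
    std::unordered_set<int> seen;
    for (int x : xs)
        if (!seen.insert(x).second) return true;
    return false;
}
```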
I don't think this follows. If premature optimization is the root of all evil, essentially, we're saying "for all evil, there exists a premature optimization which leads to it." If evil, then sourced in a premature optimization.
Even if we parse his "But..." as saying, "there exists some premature optimization which is not the root of an evil" (ignoring probably valid quibbles about whether the optimizations he's talking about truly are premature), this doesn't contradict the original statement.
In fact, "the root of all evil" seems to be an expression which invites us to commit the fallacy of denying the antecedent -- if premature optimization, then evil, in this case -- because it is almost always used to indicate that the first thing is bad.
Moreover, when some operation gets sped up by orders of magnitude, it can be used for new things that you’d never consider when it was relatively more expensive.
Something that used to be precomputed offline can be done in real time. Something that used to work on small samples can be applied to the whole data set. Something that used to be coarsely approximated can be computed precisely. Something that used to require large clusters of machines can be handled on customers’ client devices. Something that used to be only an end in itself can be used as a building block for a higher-level computation. Etc.
You don't need thousands of servers to make it worthwhile. If you have code that runs constantly on 11 servers (possibly in k8s) that each cost $1k/month in AWS, and you spend a few weeks optimizing it down to needing 1 server, those weeks just generated the equivalent of a YEARLY $120k revenue stream.
If you require an ROI of 5 years, no interest included, you just created $600k in value over a few weeks.
Yeah for sure. One thing I think about often is if you're writing a compiler and it's slow (say it takes 10s per build), if you have 1000's of engineers running it 100x per day, that starts to add up quick. If you could get it down to 1 second, then you'd save a lot of actual engineer time, just not your own.
Scale is key here, but fast software is always a much better user experience than slow stuff as well. With the compiler example, if it takes 1s as opposed to 1h, then users can iterate much more quickly and get a lot of flexibility.
That's not the only factor. Sometimes it's weight or power or a fixed system like embedded, console or a particular product that's not easily upgradable.
This means working at companies with massive scale can be far better because you get to make code performant and optimize things rather than just focus on getting features out the door.
Indeed. At scale quotes like these are generally put on the backburner and the performance team will deliver gains that justify staffing the team, and then some. At work (though I’m not directly involved in this) there’s even a little table that gives you a rough idea of what kind of win you need to save the company the equivalent of an engineer. If you do it in the right spot it’s not really even that much (though, of course, finding a win there is probably going to be very difficult).
Math & logic are rarely the bottleneck over memory & allocation bottlenecks, right? Does Godbolt assume x86? Does the answer change depending on whether you’re using an AMD or NVIDIA GPU, or an Apple, ARM or Intel processor? Does it depend on which instruction pipelines are full or stalled from the surrounding code, e.g., logic vs math? Hard to say if one of these will always be better. There are also other alternatives, e.g. bitmasking, that might generate fewer instructions… maybe “abs(num) > x” will beat both of those examples?
Ah, yes, thank you, it looks awesome! It doesn't have GPU options (which is fair, since this is standard C++). But I see now I could have figured out the answer to my question in just a few seconds. :)
Yes. Suppose that both numbers are positive, that x>num, and that x+num is bigger than INT_MAX. In that case we hit signed integer overflow, which is undefined behavior. If signed integer overflow happens to wrap around, which it might, then the result could be negative and the function would return the wrong result. Or anything else could happen; undefined behavior is undefined.
In practice, just writing "abs(num) > x" gives quite good machine code, and it does so without introducing hard-to-see bugs.
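A minimal sketch of the two variants under discussion (assuming x is non-negative and num != INT_MIN), plus a widened version that keeps the branch-free idea without the overflow trap:

```cpp
#include <cstdlib>

// "Is |num| greater than x?" -- straightforward and well-defined
// (except for num == INT_MIN, where std::abs itself is UB).
bool abs_version(int num, int x) {
    return std::abs(num) > x;
}

// The squared comparison: if num*num or x*x exceeds INT_MAX, that's signed
// overflow, i.e. undefined behavior, and the result can be anything.
bool squared_version(int num, int x) {
    return num * num > x * x;
}

// Branch-free without the trap: widen before multiplying.
bool squared_version_wide(int num, int x) {
    return static_cast<long long>(num) * num >
           static_cast<long long>(x) * x;
}
```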
Depends. In the first case it will depend on the branch predictor, which will depend on the relative expected magnitudes of num and x.
In the second, which I assume should be `{ return num*num > x*x; }`, it depends on the micro-arch: it's one basic block, so no branches, and assuming a deep pipeline on x64 with one (pipelined) multiplier, this is probably faster for 'random-ish' num and x.
That's either wrong, or right in a way that requires playing with definitions until it's meaningless. Take the traditional example, sorting algorithms; they all do the same thing, but at wildly varying speeds. The only way to claim that some do less is to claim that they use fewer instructions (loosely speaking; pretend that CPU operations are constant time and nothing else matters), at which point you're right but claiming a meaningless tautology. Removing work is a strategy for optimization (sometimes a good one!), but it's not the end-all be-all.
You can. You can make it do different things, you can make it do the same things in a better order.
Most real software has a complicated relationship with Amdahl's Law, and there are often ways to reduce latency without really changing the ops/sec. That's true for IPC, but also true for the processor itself. Sometimes using the 'wrong' instruction for an operation increases the instructions per cycle.